1.6: Knowledge
From Data to New Knowledge
Most of the world’s information is now “born digital,” and legacy texts, images, sounds, videos, and films are also being
digitized around the clock. Although statistical estimates vary, they agree that the amount of digital data generated annually
is many orders of magnitude greater than the total amount of information in all the books ever written, and the total is expected
to continue growing exponentially. In the advanced sciences alone, the proliferation of ultra-powerful and distributed data-collection
instruments and experimental facilities has turned the conduct of leading-edge research into a global-scale, data-intensive
enterprise. The Federal agencies in the NITRD Program together generate exabytes of research data annually. Financial, commercial,
communications, and Web-based enterprises likewise generate vast amounts of new digital information on a moment-by-moment
basis.

Where we are now: Today, our capacity to create electronic data is outpacing advances in the technologies needed to
manage and make effective use of society’s data resources. Ultra-large-scale data sets – what scientists refer to as “big data”
– are troves of potential new knowledge, but as noted above, the current networking infrastructure does not provide levels
of end-to-end performance that would enable individuals and groups to access and work with big data on their desktops. While
the plummeting cost of mass storage eases the stress of archiving massive data resources, we also do not yet know how to design
scalable technologies for rapidly identifying, integrating, refining, analyzing, and visualizing heterogeneous and ultra-scale
information in ways that help people learn, think, and decide. Nor do we yet have a rationalized, robust information infrastructure
for the long-term preservation, curation, federation, sustainability, accessibility, and survivability of vital Federal electronic
records and data collections, such as those overseen by NARA. Harnessing the Power of Digital Data for Science and Society,
the 2009 report of the Interagency Working Group on Digital Data (which includes many NITRD agencies), has proposed an initial
framework for developing such an infrastructure.

Research needs: We need far more powerful and nuanced tools than exist today
to mine data troves deeply, and to combine diverse forms of data, in order to find significant items, patterns, and relationships
that could lead to new insights. To support complex human, societal, and organizational ideas and analysis, as well as timely action
and decision-making, multisource forms of large-scale, raw digital information (e.g., sensor data) must be managed, assimilated,
and made accessible in formats responsive to the user’s needs and expertise. At the extreme scale represented by 21st-century scientific
and other data, significant R&D challenges in applying information to enhance discovery and decision-making remain to be addressed,
including:

* Information standards: Data interoperability and integration of distributed data; generalizable ontologies; data format description language (DFDL) for electronic records and data; data structure research for complex digital objects; interoperability standards for semantically understood ubiquitous health information records; and information services for cloud-based systems
* Decision support: Next-generation machine learning and data mining algorithms; portals and frameworks for data and processes; tools for large-scale collaboration; user-oriented and collaborative techniques and tools for thematic discovery, synthesis, data provenance, analysis, and visualization for decision making; mobile, distributed information for emergency personnel; management of human responses to data; collaborative information triage; portfolio analysis; development of data corpora for impact assessment and other metrics of scientific R&D; and multidisciplinary R&D in ways to convert data into knowledge and discovery
* Information management: Intelligent rule-based data management; increasing access to and cost-effective integration and maintenance of complex collections of heterogeneous data; innovative architectures for data-intensive and power-aware computing; scalable technologies; integration of policies (differential sensitivity, security, user authentication) with data; integrated data repositories and computing grids; testbeds; sustainability and validation of complex models; and grid-enabled visualization for petascale collections