1.6: Knowledge

From Data to New Knowledge

Other Information:

From Data to New Knowledge -- Most of the world's information is now "born digital," and legacy texts, images, sounds, videos, and films are also being digitized around the clock. Although statistical estimates vary, they agree that the amount of digital data generated annually is many orders of magnitude greater than the total amount of information in all the books ever written, and the total is expected to continue growing exponentially. In the advanced sciences alone, the proliferation of ultra-powerful and distributed data-collection instruments and experimental facilities has turned the conduct of leading-edge research into a global-scale, data-intensive enterprise. The Federal agencies in the NITRD Program together generate exabytes of research data annually. Financial, commercial, communications, and Web-based enterprises likewise continually generate vast amounts of new digital information.

Where we are now -- Today, our capacity to create electronic data is outpacing advances in the technologies needed to manage and make effective use of society's data resources. Ultra-large-scale data sets (what scientists refer to as "big data") are troves of potential new knowledge, but as noted above, the current networking infrastructure does not provide levels of end-to-end performance that would enable individuals and groups to access and work with big data on their desktops. While the plummeting cost of mass storage eases the stress of archiving massive data resources, we also do not yet know how to design scalable technologies, such as semantic frameworks and open ontologies, that would substantially advance capabilities for rapidly identifying, integrating, refining, analyzing, and visualizing heterogeneous and ultra-scale information in ways that would help people learn, think, and decide.
Nor do we yet have a rationalized, robust information infrastructure for the long-term preservation, curation, federation, sustainability, accessibility, and survivability of vital Federal electronic records and data collections, such as those overseen by the National Archives and Records Administration (NARA). "Harnessing the Power of Digital Data for Science and Society," the 2009 report of the Interagency Working Group on Digital Data (which includes many NITRD agencies), has proposed an initial framework for developing such an infrastructure.

Research needs -- We need far more powerful and nuanced tools than exist today to mine data troves deeply, and to combine and visualize diverse forms of data, in order to "see" the significant items, patterns, and relationships that could lead to new insights. To support complex human, societal, and organizational ideas, analysis, and timely action and decision-making, multisource forms of large-scale, raw digital information (e.g., sensor data) must be managed, assimilated, and accessible in formats responsive to the user's needs and expertise.
At the extreme scale represented by 21st-century scientific and other data, significant R&D challenges in applying information to enhance discovery and decision-making remain to be addressed, including:

* Information standards -- data interoperability and integration of distributed data; generalizable ontologies; data format description language (DFDL) for electronic records and data; data structure research for complex digital objects; interoperability standards for semantically understood ubiquitous health information records; and information services for cloud-based systems
* Decision support -- next-generation machine learning, semantic logic, and data mining algorithms; portals and frameworks for data and processes; tools for large-scale collaboration; user-oriented and collaborative techniques and tools for thematic discovery, synthesis, data provenance, analysis, and visualization for decision-making; mobile, distributed information for emergency personnel; management of human responses to data; collaborative information triage; portfolio analysis; development of data corpora for impact assessment and other metrics of scientific R&D; and multidisciplinary R&D in ways to convert data into knowledge and discovery
* Information management -- intelligent rule-based data management; increasing access to and cost-effective integration and maintenance of complex collections of heterogeneous data; innovative architectures for data-intensive and power-aware computing; scalable technologies; integration of policies (differential sensitivity, security, user authentication) with data; integrated data repositories and computing grids; testbeds; sustainability and validation of complex models; and grid-enabled visualization for petascale collections

Indicator(s):