
3.6: Semantic Web

Take an evolutionary approach to implementing semantic web techniques.

Other Information:

OMB Memorandum M-06-02, released on December 16, 2005, stated that "when interchanging data among specific identifiable groups or disseminating significant information dissemination products, advance preparation, such as using formal information models, may be necessary to ensure effective interchange or dissemination". The memorandum further noted that "formal information models" would "unambiguously describe information or data for the purpose of enabling precise exchange between systems". A good example of this is the development, support, and use of formal statistical policy standards by OMB's Office of Information and Regulatory Affairs, such as the standards for data on Race and Ethnicity, Metropolitan Statistical Areas (MSA), and the North American Industry Classification System (NAICS). Agencies can enable cross-domain correlation between datasets by tagging datasets, or fields within datasets, as belonging to the standard categories of such data standards. For example, suppose a web-savvy developer wants to create a mashup that visualizes and ranks various industries on revenue per employee. If one agency has published data on a designated industry's revenue and another agency has published data on its employment, then these records can be correlated, provided both datasets are categorized via the standard NAICS codes, to produce revenue per employee for the given industry. Through reuse of these semantically harmonized and uniquely identified categories across domains, data from multiple sources can be appropriately merged and new insights achieved.

The government has also produced several cross-domain data models that can be leveraged to improve both semantic understanding and discoverability of government datasets. The National Information Exchange Model (NIEM) and the Universal Core (UCore) are two robust data models that are gaining traction, incorporating new domains, and increasing information sharing across federal agencies, the Department of Defense, and the Intelligence Community. The NIEM data model is designed in accordance with Resource Description Framework (RDF) principles and can generate an OWL representation. NIEM is used extensively across levels and domains of government; in particular, it has been endorsed by the National Association of State Chief Information Officers. The US Army has created the UCore-Semantic Layer (SL), an OWL representation of the basic interrogative concepts (who, what, when, and where). These efforts are prime examples of the government's ability and commitment to provide robust tagging and modeling mechanisms that improve discovery of, sharing of, and eventually reasoning about federal data.

Today's industry best practices are increasingly grounded in semantic techniques that enable the semantic web, along with query points that the public can access directly (such as Amazon Web Services). Under this model, it is the (formally coded) data concepts themselves that are cross-linked, as opposed to just cross-linked web pages. There is also a push among search engine companies to create standards for embedding certain kinds of metadata directly within web pages. Rich Snippets from Google and SearchMonkey from Yahoo are competing attempts, with similar goals, to allow content developers to associate structured data with the information shown on their websites. They currently support a variety of formats, including microformats and RDF.
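The revenue-per-employee mashup described above amounts to a simple join on a shared category code. The following sketch makes that concrete; the datasets, field names, and figures are all invented for illustration.

```python
# Sketch: correlating two hypothetical agency datasets via shared NAICS codes.
# All dataset contents, field names, and figures below are invented examples.

# Dataset published by agency A: annual revenue by industry (keyed by NAICS code).
revenue_by_naics = {
    "3361": {"industry": "Motor Vehicle Manufacturing", "revenue_usd": 350_000_000_000},
    "5112": {"industry": "Software Publishers", "revenue_usd": 160_000_000_000},
}

# Dataset published by agency B: employment by industry (same NAICS keys).
employment_by_naics = {
    "3361": {"employees": 900_000},
    "5112": {"employees": 400_000},
}

# Because both datasets are tagged with the same standard category codes,
# the records can be merged without manual reconciliation of labels.
for code in sorted(revenue_by_naics.keys() & employment_by_naics.keys()):
    revenue = revenue_by_naics[code]["revenue_usd"]
    employees = employment_by_naics[code]["employees"]
    print(f"NAICS {code} ({revenue_by_naics[code]['industry']}): "
          f"${revenue / employees:,.0f} revenue per employee")
```

The point is not the arithmetic but the join key: the shared, semantically harmonized NAICS code is what allows two independently published datasets to be combined mechanically.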
In accordance with the philosophy of OMB Memorandum M-06-02, and leveraging today's mainstream "formal information model" capabilities, the evolution of Data.gov will include the incorporation of semantically enabled techniques within the sites and within the datasets themselves.

Semantic Web Techniques

The semantic web has a simple value proposition: create a web of data instead of a web of documents. The "web of data" is designed to be both human- and machine-readable. The core insight is that data has distinct or overlapping meanings in different contexts. This is a core information technology problem, manifest in applications such as cross-boundary and cross-domain information sharing, natural language processing, and enterprise data integration and business intelligence (i.e., mashups and dashboards). The ambiguity is highlighted via a WordNet example depicted in Figure 17, which shows how the word "tank" can have quite a few different meanings as both a verb and a noun. In some applications the context is implicitly understood and this is not an issue. But as soon as two distinct datasets use the same label with distinct meanings, or the meanings overlap only partially, or the meanings are the same but that fact is hidden by differences in coding or syntax, we introduce ambiguity and most likely defeat the purpose of combining the datasets in the first place.

To create this web of data, the W3C and other standards groups have designed specific data modeling techniques that provide machine-readable precision via identification, relationships, advanced modeling, and rules. Let's briefly describe each technique and then demonstrate examples of this "curated" data approach. Unique and persistent identification of a concept is important to ensure unambiguous linking and the accrual of facts about a specific topic. For example, Sir Tim Berners-Lee uses the identifier http://www.w3.org/People/Berners-Lee/ to identify himself and the people he knows, using an RDF data model called FOAF ("Friend of a Friend"), as depicted in Figure 18. Unambiguously identifying all things in a domain is the key first step to enabling machine-readable correlation and reasoning about those things. Additionally, by identifying something with a unique Uniform Resource Locator (a URL is a form of URI), one can retrieve a document that provides additional information about the topic and possibly equate it with previously identified things that are the "same as" this one. Once things are identified, formal relationships between things (and unique identifiers for those relationships) can be asserted. For example, also shown in Figure 18 is the FOAF relationship labeled "knows", which is uniquely identified with the URI http://xmlns.com/foaf/0.1/knows.
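The identification and relationship techniques just described can be made concrete with a small RDF graph. The sketch below uses Python's rdflib library (one toolkit among many; the second person and his URI are invented for illustration) to assert that two uniquely identified people are linked by the FOAF "knows" relationship.

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import FOAF, RDF

g = Graph()

# Unique, persistent identifiers for each person. The first is the URI
# Sir Tim Berners-Lee actually uses; the second is a hypothetical example.
timbl = URIRef("http://www.w3.org/People/Berners-Lee/")
alice = URIRef("http://example.org/people/alice")  # invented for illustration

# Type and name each resource.
g.add((timbl, RDF.type, FOAF.Person))
g.add((timbl, FOAF.name, Literal("Tim Berners-Lee")))
g.add((alice, RDF.type, FOAF.Person))
g.add((alice, FOAF.name, Literal("Alice Example")))

# The uniquely identified relationship http://xmlns.com/foaf/0.1/knows.
g.add((timbl, FOAF.knows, alice))

# Serialize the graph so it is both human- and machine-readable.
print(g.serialize(format="turtle"))
```

Because both the people and the "knows" relationship carry globally unique URIs, facts asserted about them in other datasets accrue to the same resources automatically.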
Semantic web modeling expands the traditional modeling techniques of Entity-Relationship Diagrams (ERDs) and class modeling (as in the Unified Modeling Language, or UML) by adding powerful logical primitives such as relationship characteristics and set theory. Two particularly useful relationship characteristics are transitivity and symmetry. A transitive relationship is something like the genealogical relationship "has ancestor", which is very important in deductive reasoning, as depicted in Figure 19. As the figure shows, since Matthew "has ancestor" Peter and Peter "has ancestor" William, it holds that Matthew "has ancestor" William. A geographic example of a transitive relationship is "encompasses", as in "Virginia encompasses Prince William County, and Prince William County encompasses Manassas". A symmetric relationship is one that holds in both directions: if Mary is "married to" Bill, then Bill is "married to" Mary. One final advanced modeling technique is the ability to model types or classes of things using set-theory primitives such as disjointness, intersection, and union. This is a very powerful technique for mathematically determining when a logical anomaly has occurred. For example, if a user has an alerting application scanning message traffic for the location of a violent criminal on the loose, the application needs a precise model of a violent criminal as distinct from a non-violent criminal (as depicted in Figure 20); since a person cannot be both, an assertion to the contrary signals an anomaly. To create these advanced domain models, there are free tools, such as Protégé (http://protege.stanford.edu), and many tutorials on the web to educate agencies on these topics.
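To make the relationship characteristics and disjointness modeling concrete, here is a minimal sketch, again using Python's rdflib; the vocabulary namespace, property names, and individuals are all invented for illustration. A production system would delegate inference to a real OWL reasoner; the small rule loop below merely demonstrates the deductions the model licenses.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF

EX = Namespace("http://example.org/vocab/")  # hypothetical vocabulary
g = Graph()

# Declare relationship characteristics and class disjointness with OWL primitives.
g.add((EX.hasAncestor, RDF.type, OWL.TransitiveProperty))
g.add((EX.marriedTo, RDF.type, OWL.SymmetricProperty))
g.add((EX.ViolentCriminal, OWL.disjointWith, EX.NonViolentCriminal))

# Asserted facts (all individuals are invented examples).
g.add((EX.Matthew, EX.hasAncestor, EX.Peter))
g.add((EX.Peter, EX.hasAncestor, EX.William))
g.add((EX.Mary, EX.marriedTo, EX.Bill))

# A tiny forward-chaining loop standing in for a full OWL reasoner:
# apply the transitive and symmetric rules until no new facts appear.
changed = True
while changed:
    changed = False
    for p in list(g.subjects(RDF.type, OWL.TransitiveProperty)):
        for a, b in list(g.subject_objects(p)):
            for c in list(g.objects(b, p)):
                if (a, p, c) not in g:
                    g.add((a, p, c))
                    changed = True
    for p in list(g.subjects(RDF.type, OWL.SymmetricProperty)):
        for a, b in list(g.subject_objects(p)):
            if (b, p, a) not in g:
                g.add((b, p, a))
                changed = True

print((EX.Matthew, EX.hasAncestor, EX.William) in g)  # True: deduced, never asserted
print((EX.Bill, EX.marriedTo, EX.Mary) in g)          # True: deduced, never asserted

# Disjoint classes: a resource asserted to be in both signals an anomaly.
g.add((EX.Smith, RDF.type, EX.ViolentCriminal))
g.add((EX.Smith, RDF.type, EX.NonViolentCriminal))  # contradictory assertion
for c1, c2 in g.subject_objects(OWL.disjointWith):
    for person in set(g.subjects(RDF.type, c1)) & set(g.subjects(RDF.type, c2)):
        print(f"Anomaly: {person} is asserted to be both {c1} and {c2}")
```

The deduced triples (Matthew "has ancestor" William; Bill "married to" Mary) follow purely from the declared property characteristics, and the disjointness axiom turns a contradictory pair of type assertions into a mechanically detectable anomaly.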
In conclusion, curation is the process of selecting, organizing, and presenting the items in a collection that best deliver a desired outcome. Curation of data means preparing data so that it is more usable and more exploitable by more applications. In that light, the semantic web techniques discussed above are the next logical step in the widespread curation of data; in particular, they are a leading-edge, potential best practice in Federal data management. A good example of the benefits of such curation is the Wolfram Alpha website (http://www.wolframalpha.com), which exclusively uses curated data to calculate meaningful results to queries. For example, returning to our crime scenario, a user could enter "violent crime in Virginia/violent crime in the US" into Wolfram Alpha, and it computes the information shown in Figure 21. Other benefits of using semantic web techniques include cross-domain correlation, rule-based alerting, and robust anomaly detection. While a full treatment is out of scope for this document, increasing the fidelity of data increases its applicability to solving problems and its value to the Data.gov developer and end user.

The Semantic Web Roadmap

Semantic web techniques are not yet widespread in the Federal government. Given our principle of program control, Data.gov takes an evolutionary approach to implementing these techniques. Such an evolution involves pilots, a piecemeal transition, and a great deal of education. The result will be to demonstrate the value proposition, establish end-user demand, and empower data stewards to adopt semantic web techniques. To accelerate this evolution, an experimental semantic-web-driven site will be established, as depicted in Figure 22. In addition to agency pilots, the semantic.Data.gov site will leverage lessons learned from the United Kingdom's version of Data.gov (soon to be released), which will be built entirely on semantic web technologies. An ancillary benefit of piloting techniques like unique identification and explicit relationships is that the lessons learned will assist the more traditional implementations of these techniques on Data.gov.

It is envisioned that, as the benefits of and applications based on semantic Data.gov datasets increase, a migration and transition plan will be developed to merge the efforts.

Indicator(s):