3.6: Semantic Web
Take an evolutionary approach to implementing semantic web techniques.
Other Information:
OMB Memorandum M‐06‐02, released on December 16, 2005, stated that “when interchanging data among specific identifiable groups or
disseminating significant information dissemination products, advance preparation, such as using formal information models,
may be necessary to ensure effective interchange or dissemination”. The memorandum further noted that “formal information
models” would “unambiguously describe information or data for the purpose of enabling precise exchange between systems”. A
good example of this is the development, support, and use by OMB’s Office of Information and Regulatory Affairs of formal statistical
policy standards, such as the standards for data on Race and Ethnicity, Metropolitan Statistical Areas (MSAs), and the North
American Industry Classification System (NAICS). Agencies can enable cross‐domain correlation between datasets by tagging
datasets, or fields within datasets, as belonging to the standard categories of such data standards. For example, suppose a web‐savvy
developer wants to create a mashup that visualizes and ranks various industries on revenue per employee. If one agency has
published data on a designated industry’s revenue and another agency has published data on its employment, those records
can be correlated to produce revenue per employee for the given industry, provided both datasets are categorized via the
standard NAICS codes (a minimal sketch of such a join appears below). Through reuse of these semantically harmonized and
uniquely identified categories across domains, data from multiple sources can be appropriately merged and new insights achieved.
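
This example can be made concrete with a short sketch. The following Python fragment assumes two hypothetical agency extracts keyed by NAICS code; the codes are real NAICS categories, but the figures and field names are illustrative placeholders, not published data.

    # Sketch: correlating two hypothetical agency datasets via shared NAICS codes.
    # All figures below are illustrative placeholders, not real published data.
    revenue_by_naics = {
        "111110": 1_200_000_000,   # Soybean Farming (hypothetical revenue, USD)
        "336411": 98_000_000_000,  # Aircraft Manufacturing (hypothetical)
    }
    employment_by_naics = {
        "111110": 14_000,          # hypothetical employment counts
        "336411": 210_000,
    }

    # Because both datasets use the same NAICS identifiers, the join is
    # unambiguous; matching on free-text industry names would be error-prone.
    revenue_per_employee = {
        code: revenue_by_naics[code] / employment_by_naics[code]
        for code in revenue_by_naics.keys() & employment_by_naics.keys()
    }

    for code, rpe in sorted(revenue_per_employee.items(),
                            key=lambda item: item[1], reverse=True):
        print(f"NAICS {code}: ${rpe:,.0f} revenue per employee")

The shared, standardized key is what makes the merge trivial; without it, each mashup author would have to hand‐reconcile industry labels.
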
The government has also produced several cross‐domain data models that can be leveraged to improve both the semantic
understanding and the discoverability of government datasets. The National Information Exchange Model (NIEM) and the
Universal Core (UCore) are two robust data models that are gaining traction, incorporating new domains and increasing
information sharing across federal agencies, the Department of Defense, and the Intelligence Community. The NIEM data model
is designed in accordance with Resource Description Framework (RDF) principles and can generate an OWL representation.
NIEM is used extensively across levels and domains of government; in particular, it has been endorsed by the National
Association of State Chief Information Officers. The US Army has created the UCore Semantic Layer (UCore‐SL), an OWL
representation of the basic interrogative concepts (who, what, when, and where). These efforts are prime examples of the
government’s ability and commitment to provide robust tagging and modeling mechanisms that improve the discovery of, sharing
of, and eventually reasoning about federal data. Today’s “industry best practices” are increasingly grounded in semantic
techniques that enable the semantic web, including query points that the public can directly access (like Amazon Web
Services); one such publicly accessible query point is sketched below.
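
As one concrete illustration of such a query point (the endpoint and query are this sketch’s choices, not named by this document), the following fragment queries DBpedia, a well‐known public SPARQL endpoint:

    # Sketch: querying a publicly accessible semantic-web query point via SPARQL.
    # DBpedia is used purely as an illustration; it is unrelated to Data.gov.
    from SPARQLWrapper import SPARQLWrapper, JSON  # pip install sparqlwrapper

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        SELECT ?label WHERE {
            <http://dbpedia.org/resource/Semantic_Web> rdfs:label ?label .
            FILTER (lang(?label) = "en")
        }
    """)
    sparql.setReturnFormat(JSON)

    # The endpoint returns structured bindings rather than documents,
    # which is the essence of a "web of data" query point.
    results = sparql.query().convert()
    for binding in results["results"]["bindings"]:
        print(binding["label"]["value"])
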
Under this model, it is the (formally coded) data concepts themselves that are cross‐linked, as opposed to just cross‐linked
web pages. There is also a push among some search engine companies to create standards for indicating certain kinds of metadata
directly within web pages. Rich Snippets from Google and SearchMonkey from Yahoo are competing attempts (with similar
goals) to allow content developers to associate structured data with the information shown on their websites. They currently
support a variety of formats, including microformats and the Resource Description Framework (RDF). In accordance with the
philosophy of OMB Memorandum M‐06‐02, and leveraging today’s mainstream “formal information model” capabilities, the evolution
of Data.gov will include the incorporation of semantically enabled techniques within the sites and within the datasets themselves.

Semantic Web Techniques -- The semantic web has a simple value proposition: create a web of data instead of a web of documents. The
“web of data” will be designed to be both human‐readable and machine‐readable. The core insight is that the same data can
have distinct or overlapping meanings in different contexts. This is a core information technology problem, and it is manifest
in applications such as cross‐boundary and cross‐domain information sharing, natural language processing, and enterprise data
integration and business intelligence (i.e., mash‐ups and dashboards). An example of this ambiguity, drawn from WordNet, is
depicted in Figure 17: the word “tank” can have quite a few different meanings, as both a verb and a noun (a brief WordNet
lookup is sketched below).
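
That ambiguity is easy to reproduce programmatically. The following sketch uses the NLTK interface to WordNet (the choice of library is this sketch’s assumption; the document only cites WordNet itself):

    # Sketch: listing WordNet senses of "tank" to show lexical ambiguity.
    import nltk
    nltk.download("wordnet", quiet=True)  # fetch the WordNet corpus on first run
    from nltk.corpus import wordnet as wn

    for synset in wn.synsets("tank"):
        # synset.pos() is 'n' for noun senses and 'v' for verb senses
        print(f"{synset.name():15s} ({synset.pos()}): {synset.definition()}")
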
In some applications the context is implicitly understood and this is not an issue. But as soon as two distinct data sets
use the same label to have distinct meanings, or the meanings overlap but only partially, or the meanings are the same but
that is hidden due to distinct coding or syntactical issues, we introduce ambiguity and most likely defeat the purpose of
combining the data sets in the first place. In order to create this web of data, the W3C and other standards groups have designed
specific data modeling techniques to provide such machine readable precision via identification, relationships, advanced modeling
and rules. Let’s briefly describe each technique and then demonstrate examples of this “curated” data approach. Unique and
Unique and persistent identification of a concept is important to ensure unambiguous linking and the accrual of facts on a
specific topic. For example, Sir Tim Berners‐Lee uses the identifier http://www.w3.org/People/Berners‐Lee/ to identify himself
and the people he knows, using a Resource Description Framework (RDF) data model called FOAF, for “Friend of a Friend”, as
depicted in Figure 18. Unambiguously identifying all things in a domain is the key first step to enabling machine‐readable
correlation and reasoning about those things. Additionally, by identifying something with a unique Uniform Resource Locator
(a URL is a form of URI), one can retrieve a document that provides additional information about the topic and possibly
equate it with other things that have been previously identified and declared the “same as” this one. Once things are
identified, formal relationships between things (with unique identifiers for those relationships as well) can be asserted.
For example, also shown in Figure 18 is the FOAF relationship labeled “knows”, which is uniquely identified with the URI
http://xmlns.com/foaf/0.1/knows (a small FOAF graph is sketched below).
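
A minimal sketch of such a graph, built with the rdflib library: the Berners‐Lee identifier and the foaf:knows URI come from the text above, while the second person is an illustrative placeholder.

    # Sketch: unique identification plus an explicit "knows" relationship in RDF.
    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import FOAF, RDF  # FOAF.knows == http://xmlns.com/foaf/0.1/knows

    g = Graph()

    # Unique, persistent identifiers (URIs). The first is the identifier cited
    # above; the second is an illustrative placeholder, not a real person.
    timbl = URIRef("http://www.w3.org/People/Berners-Lee/")
    colleague = URIRef("http://example.org/people/colleague")

    g.add((timbl, RDF.type, FOAF.Person))
    g.add((timbl, FOAF.name, Literal("Tim Berners-Lee")))
    g.add((colleague, RDF.type, FOAF.Person))

    # The relationship itself has a unique URI (FOAF.knows), so any consumer
    # interprets the assertion the same way.
    g.add((timbl, FOAF.knows, colleague))

    print(g.serialize(format="turtle"))
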
Semantic web modeling expands the traditional modeling techniques of Entity‐Relationship Diagrams (ERDs) and class modeling
(as in the Unified Modeling Language, or UML) by adding powerful logical primitives such as relationship characteristics and
set theory. Two powerful relationship characteristics are “transitive” and “symmetric” relationships. A transitive relationship
is something like the genealogical relationship “has ancestor”, which is very important in deductive reasoning, as depicted
in Figure 19. As the figure shows, since Matthew “has ancestor” Peter and Peter “has ancestor” William, it follows that
Matthew “has ancestor” William. A geographic example of a transitive relationship would be “encompasses”, as in “Virginia
encompasses Prince William County, and Prince William County encompasses Manassas” (the deduction is sketched below).
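
A reasoner derives the implied facts mechanically; the following sketch shows the same deduction in plain Python (the helper function and fact lists are this sketch’s constructions):

    # Sketch: deducing implied facts from a transitive relationship.
    def transitive_closure(pairs):
        """Return all (a, c) pairs implied by transitivity of the input pairs."""
        closure = set(pairs)
        changed = True
        while changed:
            changed = False
            for a, b in list(closure):
                for b2, c in list(closure):
                    if b == b2 and (a, c) not in closure:
                        closure.add((a, c))
                        changed = True
        return closure

    # Asserted facts from the genealogical example in Figure 19.
    has_ancestor = {("Matthew", "Peter"), ("Peter", "William")}
    print(transitive_closure(has_ancestor))   # deduces ("Matthew", "William")

    # The geographic example works the same way.
    encompasses = {("Virginia", "Prince William County"),
                   ("Prince William County", "Manassas")}
    print(transitive_closure(encompasses))    # deduces ("Virginia", "Manassas")
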
A symmetric relationship is one that holds in both directions: for example, if Mary is “married to” Bill, then Bill is
“married to” Mary. One final advanced modeling technique is the ability to model types, or classes, of things using set‐theory
primitives such as disjointness, intersection, and union. This is a very powerful technique for mathematically determining
when a logical anomaly has occurred. For example, if a user has an alerting application that is scanning message traffic for
the location of a violent criminal on the loose, he or she needs a precise model of violent criminals as distinct from
non‐violent criminals (as depicted in Figure 20): a person cannot be both, or there is an anomaly (both checks are sketched below).
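
Both ideas reduce to simple set operations; in the sketch below, the class names and membership data are illustrative:

    # Sketch: a symmetric relationship and a disjoint-class anomaly check.

    # Symmetric: asserting one direction implies the other.
    married_to = {("Mary", "Bill")}
    married_to |= {(b, a) for (a, b) in married_to}   # symmetric closure
    assert ("Bill", "Mary") in married_to

    # Disjoint classes: ViolentCriminal and NonViolentCriminal are modeled
    # as disjoint sets, so any common member signals a logical anomaly.
    violent_criminals = {"person-17", "person-42"}
    non_violent_criminals = {"person-08", "person-42"}   # person-42 mis-tagged

    anomalies = violent_criminals & non_violent_criminals
    if anomalies:
        print(f"Anomaly: {sorted(anomalies)} asserted in two disjoint classes")
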
Additionally, free tools for creating these advanced domain models are available, such as Protégé (http://protege.stanford.edu),
along with many tutorials on the web to educate agencies on these topics. In conclusion, curation is the process of selecting,
organizing, and presenting the right items in a collection to best deliver a desired outcome. Curation of data is preparing
data so that it is more usable and more exploitable by more applications. In that light, the semantic web techniques discussed
above are the next logical step in the widespread curation of data; in particular, they are a leading‐edge, potential best
practice in federal data management. A good example of the benefits of such curation is the Wolfram Alpha website
(http://www.wolframalpha.com), which exclusively uses curated data in order to calculate meaningful results to queries. For
example, returning to our crime scenario, a user could input “violent crime in Virginia/violent crime in the US” to Wolfram
Alpha, and it computes the information shown in Figure 21.
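
Wolfram Alpha also exposes a developer API through which the same query can be issued programmatically. The sketch below assumes its v2 query endpoint and a developer AppID; these details are this sketch’s assumptions, not part of the source document.

    # Sketch: sending the same curated-data query to Wolfram Alpha's query API.
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    APP_ID = "YOUR_APP_ID"  # placeholder; requires a Wolfram Alpha developer AppID
    query = "violent crime in Virginia/violent crime in the US"

    url = ("https://api.wolframalpha.com/v2/query?"
           + urllib.parse.urlencode({"input": query, "appid": APP_ID}))

    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)

    # Each <pod> in the XML reply is one block of computed, curated results.
    for pod in tree.iter("pod"):
        for plaintext in pod.iter("plaintext"):
            if plaintext.text:
                print(f"{pod.get('title')}: {plaintext.text}")
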
Other benefits of using semantic web techniques include cross‐domain correlation, rule‐based alerting, and robust anomaly
detection. While out of scope for this document, it should be clear that increasing the fidelity of data increases its
applicability to solving problems and increases its value to the Data.gov developer and end‐user.

The Semantic Web Roadmap -- Semantic web
techniques are not yet widespread in the Federal government. Given our principle of program control, Data.gov takes an
evolutionary approach to implementing these techniques. Such an evolution involves pilots, a piecemeal transition, and a great
deal of education. The result will be to demonstrate the value proposition, establish end‐user demand, and empower data
stewards to adopt semantic web techniques. To accelerate this evolution, an experimental semantic‐web‐driven site will be
established, as depicted in Figure 22. In addition to agency pilots, the semantic.Data.gov site will leverage lessons learned
from the United Kingdom’s version of Data.gov (soon to be released), which will be built entirely on semantic web technologies.
An ancillary benefit of piloting techniques such as unique identification and explicit relationships is that the lessons
learned will assist the more traditional implementations of these techniques on Data.gov. It is envisioned that, as the
benefits of and applications based on semantic Data.gov datasets increase, a migration and transition plan will be developed
to merge the efforts.
Indicator(s):