3.2: Core Modules
As depicted in the following visual, Data.gov's future architecture will include six core modules: (1) the website, (2) the Dataset Management System (DMS), (3) the metadata catalog, (4) a performance tracking and analysis engine, (5) an audit tool, and (6) a hosting service.
The architecture will also utilize at least four data infrastructure tools: collaboration, feedback, agency and site performance dashboards, and search-related tools. The modules and tools will be made more accessible through a collection of application programming interfaces (APIs) that expose metadata and data. Together, these modules, tools, and APIs will allow Data.gov to adapt to its customer base as needed. Note that many of the capabilities outlined in section 2, such as the Dataset Management System, are already in use; where this is the case, they will be enhanced and extended. In other cases, for example the data infrastructure tools, the Data.gov team will partner with others to deliver the capability.
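As an illustration of the kind of access these APIs could enable, the following sketch queries a hypothetical catalog metadata endpoint for records matching a keyword. The endpoint URL, parameter names, and response fields are illustrative assumptions, not a specification of the actual Data.gov API.

```python
# Minimal sketch of a client querying a hypothetical Data.gov metadata API.
# The endpoint, parameters, and response fields below are assumptions for
# illustration only; they are not the actual Data.gov API contract.
import json
import urllib.parse
import urllib.request

CATALOG_API = "https://api.data.gov/catalog/records"  # hypothetical endpoint

def search_catalog(keyword: str, limit: int = 10) -> list[dict]:
    """Return catalog records whose metadata matches the keyword."""
    query = urllib.parse.urlencode({"q": keyword, "limit": limit})
    with urllib.request.urlopen(f"{CATALOG_API}?{query}") as response:
        payload = json.load(response)
    return payload.get("records", [])

if __name__ == "__main__":
    for record in search_catalog("air quality"):
        # Each record points at one published dataset and its metadata.
        print(record.get("title"), "-", record.get("agency"))
```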
Module 1 – The Site

All citizens, technically inclined or not, can access the Data.gov website to discover structured data, otherwise known as datasets, published by the federal government and download them to their local computers. To serve up these datasets, the Data.gov website accesses a catalog of records, with one record representing each dataset published to it. Data.gov visualization services could be delivered through the site and could include analytics, graphics, charting, and other ways of using the data. In many cases enhanced visualizations will be delivered by the Data.gov team or others as data infrastructure tools built on top of published APIs. These enhanced visualizations and other uses will in some cases be accessed via the Data.gov site and in others via external websites. Another enhanced feature of Data.gov could allow customers to receive alerts when new datasets become available in a subject area that interests them. A variation of this would be alerts to developers about changes or updates to datasets they use to power their applications. Alerting and notification could be implemented via a data infrastructure tool, via specific features added to core modules, or both. This seems to be an area where Data.gov should implement a basic capability and invite experimentation and innovation to identify opportunities for greater added value, whether domain-specific, general-purpose, or in some unforeseen manner.
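As a sketch of what such a basic alerting capability might look like, the following polls the catalog for records published since a subscriber's last check. All function names and the "published" field are hypothetical placeholders.

```python
# Sketch of a basic new-dataset alert for a subject area. Polls the catalog
# (fetch_records is a placeholder for a metadata API call) and reports
# records published since the subscriber's last check. All names and the
# "published" field are hypothetical assumptions.
from datetime import datetime

def fetch_records(topic: str) -> list[dict]:
    # Placeholder: in practice this would call the catalog metadata API.
    return []

def new_datasets_since(topic: str, last_checked: datetime) -> list[dict]:
    """Return catalog records in a topic published after last_checked."""
    return [
        record
        for record in fetch_records(topic)
        if datetime.fromisoformat(record["published"]) > last_checked
    ]

def notify(subscriber: str, records: list[dict]) -> None:
    # Placeholder: a real tool would send email or push notifications.
    for record in records:
        print(f"alert for {subscriber}: new dataset '{record['title']}'")
```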
Module 2 – The Dataset Management System

The Dataset Management System (DMS) was recently unveiled to facilitate agencies' efforts to organize and maintain their Data.gov submissions via a web-based user interface. The DMS provides agencies a self-service process for publishing datasets into the Data.gov catalog. It is the approach of choice if an agency does not have its own metadata repository and lacks the resources to leverage the Data.gov metadata API or harvesting approaches. The DMS allows originators to submit new datasets and review the status of previously submitted datasets; new datasets can be submitted either one at a time or in bulk. Once a dataset suggestion has been added to the DMS, its status can be tracked through the submission lifecycle. Agency points of contact (POCs) can access the DMS to view the entire published catalog, all published datasets and tools submitted by their agency, and a dashboard of all pending submissions. In the future, the DMS could also disclose to POCs compliance requirements that the agency and its data stewards are not meeting.
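The submission lifecycle could be modeled as a simple state machine; a minimal sketch follows. The status names and transitions are assumptions, as this document does not enumerate the DMS's actual states.

```python
# Illustrative sketch of tracking a dataset submission through a lifecycle.
# The status names and allowed transitions are assumptions; the document
# does not specify the DMS's actual states.
from enum import Enum

class Status(Enum):
    DRAFT = "draft"
    SUBMITTED = "submitted"
    UNDER_REVIEW = "under_review"
    PUBLISHED = "published"
    REJECTED = "rejected"

# Statuses a submission may legally move to from each state.
TRANSITIONS = {
    Status.DRAFT: {Status.SUBMITTED},
    Status.SUBMITTED: {Status.UNDER_REVIEW},
    Status.UNDER_REVIEW: {Status.PUBLISHED, Status.REJECTED},
    Status.PUBLISHED: set(),
    Status.REJECTED: {Status.DRAFT},  # agency may revise and resubmit
}

def advance(current: Status, new: Status) -> Status:
    """Move a submission to a new status, enforcing the lifecycle."""
    if new not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {new.value}")
    return new
```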
Module 3 – The Metadata Catalog

The Data.gov metadata catalog will evolve into a shared metadata storage service: a metadata repository centralized on a Data.gov-controlled host that agencies can also use for their own needs. Agencies that do not have metadata repositories of their own will be able to leverage Data.gov's shared metadata repository as a service. So that the shared repository can serve as an enterprise service, agencies will be able to flag which of their metadata records they choose to share with the public via Data.gov and which are stored in the service but not exposed. Additionally, agencies will be able to designate whether their data contains personally identifiable information and whether the data adheres to information quality requirements.

Figure 10 depicts the key components of a catalog record. It is important to understand that while these components are drawn in separate boxes, they are all part of a single catalog record. The four parts of a robust catalog record, sketched in code after this list, are:
* Catalog record header – holds both the administrative bookkeeping elements of the overall record and all data needed to manage the target data resource, such as ratings, comments, and metrics about the resource.
* Data resource part – the data resource is the target data referred to by the catalog record. A data resource could be a dataset, a result set, or any new type of structured data pointed to by a catalog record.
* Data resource domain part – a data resource belongs to a domain, or area of knowledge. The domain of a data resource has two basic parts: resource coverage, a description of what the resource "covers", and resource context, metadata about the environment that produced the data, including the production process.
* Related resources part – a structured data resource may have one or more resources related to it. For example, structured data may have images, web pages, or other unstructured data (such as policy documents) related to it. Additionally, as evidenced on the current site, a dataset may have related tools, or tools that help visualize or manipulate the data.
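A minimal sketch of how such a four-part record might be represented follows; the field names, including the public-visibility and personally-identifiable-information flags described above, are illustrative assumptions rather than the actual Data.gov metadata template.

```python
# Illustrative sketch of a four-part catalog record. Field names, including
# the visibility and PII flags, are assumptions and not the actual template.
from dataclasses import dataclass, field

@dataclass
class RecordHeader:
    """Administrative bookkeeping plus management data for the resource."""
    record_id: str
    agency: str
    public: bool = True          # share with the public via Data.gov?
    contains_pii: bool = False   # personally identifiable information flag
    ratings: list[int] = field(default_factory=list)
    comments: list[str] = field(default_factory=list)

@dataclass
class DataResource:
    """The target data the record points to (dataset, result set, ...)."""
    title: str
    url: str
    resource_type: str  # e.g. "dataset" or "result set"

@dataclass
class ResourceDomain:
    """What the resource covers and the context that produced it."""
    coverage: str  # e.g. subject area, geography, time period
    context: str   # e.g. production process, collection methodology

@dataclass
class CatalogRecord:
    """One record per published data resource, with four parts."""
    header: RecordHeader
    resource: DataResource
    domain: ResourceDomain
    related_resources: list[DataResource] = field(default_factory=list)
```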
Module 4 – Performance Tracking and Analysis Engine

Data.gov will include a performance tracking and analysis engine that stores both Data.gov and wider Federal information on data dissemination performance. Data.gov-related measures will be combined with Federal-wide data dissemination measures to gain a better understanding of overall Federal data dissemination. Agencies will supply measures to Data.gov, and the total set of performance and measurement data will be made available to the public. Performance measures are discussed in the sections "Measuring Success" and "Appendix A – Detailed Metrics for Measuring Success".
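A minimal sketch of the kind of measure an agency might supply, and how measures could be combined for a Federal-wide view, follows; the measure names and fields are hypothetical, and the actual metrics are those defined in the sections cited above.

```python
# Sketch of a performance measure an agency might supply to the engine.
# The measure names and fields are hypothetical assumptions; the actual
# metrics are defined in "Measuring Success" and Appendix A.
from dataclasses import dataclass
from datetime import date

@dataclass
class PerformanceMeasure:
    agency: str
    measure: str      # e.g. "dataset_downloads" or "api_calls"
    period_start: date
    period_end: date
    value: float

def combine(measures: list[PerformanceMeasure], name: str) -> float:
    """Aggregate one measure across agencies for a Federal-wide view."""
    return sum(m.value for m in measures if m.measure == name)
```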
Module 5 – Audit Tool

Over time, any organization can find that data have been published and exist in the public domain without active management or visibility inside the organization. The Data.gov team may provide the expertise to help agencies identify previously published data, supporting those agencies' own processes for data management and potential publication to Data.gov. The Data.gov team is considering deploying a search agent to scan Federal government domains, providing data that will help agencies evaluate their data management practices and accelerate integration of already public data resources into Data.gov. Delivery of the audit tool will prioritize a basic capability focused on identifying and characterizing already public data assets in a manner useful to agency POCs. It would scan Federal domains, formulate an index of potential datasets, and build reports to deliver to agencies. The associated reporting would provide some basis for estimating the total population of data, give agencies intelligence on their potential data assets, and help the data steward community assess what is currently exposed to the public. This is not intended to automatically populate Data.gov, but rather to assist agencies with their own data inventory, management, and publication processes. The result should be better, more granular agency plans to integrate their already public datasets into Data.gov; more efficient, lower-cost data management and dissemination activities, by leveraging the reported data to jump-start and validate data inventories; and an enhanced ability to develop a proactive understanding of agency compliance with information dissemination and related policy. Most importantly, through continuous measurement the audit tool provides timely and actionable management data to agencies and makes their progress with integration into Data.gov transparent.
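A highly simplified sketch of such a search agent follows: it scans one page on a Federal domain and indexes links that look like public data files. The file extensions and single-page scope are assumptions; a real audit tool would also need crawl scheduling, robots.txt handling, and deeper characterization of the assets it finds.

```python
# Simplified sketch of a search agent that scans a Federal domain for links
# that look like public data assets (.csv, .xml, .json, .zip) and builds an
# index for agency reports. A real audit tool would also need crawl-depth
# limits, robots.txt handling, and scheduling; this only shows the idea.
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

DATA_EXTENSIONS = (".csv", ".xml", ".json", ".zip")

class LinkCollector(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def index_potential_datasets(page_url: str) -> list[str]:
    """Return absolute URLs on one page that look like dataset files."""
    with urllib.request.urlopen(page_url) as response:
        parser = LinkCollector()
        parser.feed(response.read().decode("utf-8", errors="replace"))
    return [
        urljoin(page_url, link)
        for link in parser.links
        if link.lower().endswith(DATA_EXTENSIONS)
    ]
```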
Module 6 – Shared Hosting Services

Data.gov will implement a shared data storage service for use by agencies. This service will be accessible via APIs and will provide agencies with a cost-effective mechanism for storing data that will be made available to the public. The data stored within the service will be made available via feeds and APIs so that the application development community can build directly on data hosted by Data.gov. Providing data in the right format is as critical as providing the data themselves. For instance, the shared hosting service could provide data through query points such as RESTful web services, web queries, application programming interfaces, or bulk downloads. Data can be made more useful through these services and by extending the metadata template to include data-type-specific or domain-specific elements in addition to the core "fitness for use" metadata currently in the Data.gov metadata template. Agency use of query points drives value in some instances. For example, agencies using query points would be able to directly measure "run-time" use of their data, as opposed to just recording instances of data downloads. Also, given agency control over the query point, agencies would be able to better support access to the most current and correct versions of data resources, as well as more clearly understand downstream use and value creation resulting from their data resources.
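A sketch of the contrast between a query point and a bulk download follows, using a hypothetical RESTful endpoint in the shared service; the URL, parameters, and response shape are illustrative assumptions. Because each request reaches the agency-controlled endpoint, run-time use can be measured directly and consumers always see the current version of the data.

```python
# Sketch contrasting a query point with a bulk download, using a hypothetical
# RESTful endpoint hosted by the shared service. The URL, parameters, and
# response fields are illustrative assumptions. Each query hits the
# agency-controlled endpoint, so "run-time" use can be measured directly,
# unlike a one-time bulk download.
import json
import urllib.parse
import urllib.request

# Hypothetical query point for one dataset in the shared hosting service.
QUERY_POINT = "https://hosting.data.gov/agency/ACME/dataset-123/query"

def query_rows(filters: dict[str, str], limit: int = 100) -> list[dict]:
    """Fetch only the rows needed, always from the current dataset version."""
    params = urllib.parse.urlencode({**filters, "limit": limit})
    with urllib.request.urlopen(f"{QUERY_POINT}?{params}") as response:
        return json.load(response).get("rows", [])

if __name__ == "__main__":
    # A consumer asks for just what it needs; the agency sees the traffic.
    rows = query_rows({"state": "CA", "year": "2010"})
    print(len(rows), "rows returned")
```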
Data storage and publishing (end-user access) would be subject to metering of some sort, to be determined. Given the operational aspect of this module and the need to scale based on data volume and end-user usage, the Data.gov team will look to align fully with the Federal Cloud Computing Initiative and leverage its managed-service focus for this module. The core value proposition to agencies for using the shared hosting service is integration with the other modules, as well as alignment with the cloud initiative, which should reduce total costs and enable more efficient and effective realization of the full Data.gov value proposition.