FRS Linked Open Data Concept v1.3 20101130
FRS and Linked Open Data Potential – Conceptual Discussion v 1.3
November 30, 2010
Dave Smith USEPA/OEI/OIC/IESD/ISSB
Document Change History
Revision  Date        Author          Description
1.0       11/12/2010  David G. Smith  Initial Version
1.1       11/24/2010  David G. Smith  Minor updates/revisions as follow-on to 11/23 discussion
1.2       11/29/2010  David G. Smith  Collaborations, potential pilots, FOAF and other models
1.3       11/30/2010  David G. Smith  Additional collaborations and detail on facility granularity concept
FRS Data Model Initial Conceptual Discussion November 11, 2010 November 30, 2010
Contents
Document Change History
Introduction:
Concept:
Current Situation:
Linked Open Data Issues:
Data Model Issues:
Linked Open Data Development:
Existing Resources
Short-Term data needs:
Longer-Range, Emergent data needs:
Other Ongoing, Related Activities
Anticipated Next Steps:
Introduction:
The intent of this concept paper is to explore some initial conceptual, blue-sky, no-constraints ideas for potential improvements to the FRS Linked Open Data approach being published via Data.gov, and to stimulate additional ideas and brainstorming. Follow-on work will examine alternatives, set priorities and finalize thoughts toward implementation.
Concept:
Provide enhancements to the FRS Linked Open Data approach to improve analysis, enhance facility representation, improve robustness of LOD querying and analytics, integrate other existing metadata capabilities and better support Semantic Web approaches, such as more-informed RDF serialization.
Current Situation:
FRS data is currently published via Data.gov, through the RDF button on the Data.gov catalog pages for FRS data (e.g. http://www.data.gov/raw/1030 ).
Figure 1: Example of Current FRS RDF Offering (highlighted in red box)
The data returned is tied to a data.gov URL, e.g. http://www.data.gov/semantic/data/alpha/1030/dataset-1030.rdf.gz
Linked Open Data Issues:
Currently, FRS and other datasets published via Data.gov are being serialized as RDF to support the Semantic Web and Linked Open Data. The basic problems described here are not specific to the FRS RDF data; they likely apply to the Data.gov RDF across the board.
First, in terms of access, the data is a gzipped download: it must be downloaded and unzipped before it can be used. Ideally, Data.gov would serve the data through a SPARQL endpoint, a Sesame repository or some other means of exposing a triple store. The download/unzip paradigm does not lend itself to dynamic mashups.
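To illustrate the difference a query endpoint would make, the following is a minimal sketch of how a mashup might request only the facilities it needs instead of downloading the whole file. The endpoint URL and the frs: vocabulary URI are hypothetical placeholders (no such endpoint is described in this paper); only the use of a "query" parameter on a GET request follows the standard SPARQL Protocol.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and vocabulary, used for illustration only.
ENDPOINT = "http://example.org/frs/sparql"
FRS = "http://example.org/frs/def#"


def facilities_in_state_query(state_code: str, limit: int = 10) -> str:
    """Build a SPARQL SELECT for facility names in one state."""
    if not (len(state_code) == 2 and state_code.isalpha()):
        raise ValueError("state_code must be a two-letter code")
    return (
        f"PREFIX frs: <{FRS}>\n"
        "SELECT ?facility ?name WHERE {\n"
        f'  ?facility frs:state_code "{state_code.upper()}" ;\n'
        "            frs:primary_name ?name .\n"
        f"}} LIMIT {int(limit)}"
    )


def request_url(query: str) -> str:
    """Encode the query as a SPARQL Protocol GET request ('query' parameter)."""
    return ENDPOINT + "?" + urlencode({"query": query})


# The URL is only built here; nothing is sent over the network.
url = request_url(facilities_in_state_query("NE"))
```

A client would issue this single request and receive just the matching rows, which is the kind of interaction the download/unzip approach cannot offer.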
The Data.gov RDF appears to be a brute-force serialization of data tables into RDF. It lacks the semantic depth needed to support the analysis it could otherwise enable (See Fig. 1-3).
<rdf:Description rdf:about="#entry9985">
<hdatum_desc>NAD83</hdatum_desc>
<state_name>NEBRASKA</state_name>
<latitude83>40.944623</latitude83>
<interest_types>STATE MASTER</interest_types>
<city_name>GARLAND</city_name>
<create_date>01-MAR-00</create_date>
<frs_facility_detail_report_url rdf:resource="http://iaspub.epa.gov/enviro/fii_query_detail.disp_program_facility?p_registry_id=110006555085"/>
<congressional_dist_num>01</congressional_dist_num>
<pgm_sys_acrnms>NE-IIS</pgm_sys_acrnms>
<epa_region_code>07</epa_region_code>
<country_name>USA</country_name>
<fips_code>31159</fips_code>
<huc_code>10200203</huc_code>
<collect_desc>ADDRESS MATCHING-HOUSE NUMBER</collect_desc>
<primary_name>TERRI KELLER RESIDENCE</primary_name>
<rdf:type rdf:resource="http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#DataEntry"/>
<ref_point_desc>ENTRANCE POINT OF A FACILITY OR STATION</ref_point_desc>
<postal_code>683609338</postal_code>
<registry_id>110006555085</registry_id>
<location_address>1976 OLD MILL RD</location_address>
<accuracy_value>30</accuracy_value>
<update_date>06-AUG-01</update_date>
<county_name>SEWARD</county_name>
<conveyor>FRS</conveyor>
<longitude83>-96.990306</longitude83>
<state_code>NE</state_code>
<site_type_name>STATIONARY</site_type_name>
</rdf:Description>
Figure 1: Sample of current Data.gov FRS RDF/XML Representation
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#hdatum_desc> "NAD83" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#state_name> "NEBRASKA" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#latitude83> "40.944623" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#interest_types> "STATE MASTER" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#city_name> "GARLAND" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#create_date> "01-MAR-00" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#frs_facility_detail_report_url> <http://iaspub.epa.gov/enviro/fii_query_detail.disp_program_facility?p_registry_id=110006555085> .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#congressional_dist_num> "01" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#pgm_sys_acrnms> "NE-IIS" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#epa_region_code> "07" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#country_name> "USA" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#fips_code> "31159" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#huc_code> "10200203" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#collect_desc> "ADDRESS MATCHING-HOUSE NUMBER" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#primary_name> "TERRI KELLER RESIDENCE" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#DataEntry> .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#ref_point_desc> "ENTRANCE POINT OF A FACILITY OR STATION" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#postal_code> "683609338" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#registry_id> "110006555085" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#location_address> "1976 OLD MILL RD" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#accuracy_value> "30" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#update_date> "06-AUG-01" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#county_name> "SEWARD" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#conveyor> "FRS" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#longitude83> "-96.990306" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#state_code> "NE" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#site_type_name> "STATIONARY" .
Figure 2: Sample of current Data.gov FRS Representation as Triples
The current RDF serialization is essentially a brute-force conversion, so there is plenty of opportunity to enhance and improve it.
The properties are names that some EPA users might easily understand, but would others? For example, are huc_code and pgm_sys_acrnms uniquely identifiable and understood within this dataset? A reference to the EPA data dictionary, perhaps an EPA namespace, or some other means of defining them more positively is needed. We have a lot of metadata that we can bring into the mix to enhance the identifiability, understandability and usability of the RDF data.
There is not much structure or model; the data is essentially a flat table. Everything is treated as an alphanumeric data type, so dates carry no temporal intelligence, and the registry ID is not identified as something unique or indexable. There are many things that can and should be defined better. We can probably develop a semantic analogue to our data model (in RDF, OWL, etc.) and then map to it.
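As a rough illustration of what typed values could look like, the sketch below converts a few of the flat string values from the sample record into N-Triples literals carrying XML Schema datatypes. The choice of which properties receive which datatype, and the two-digit-year pivot, are assumptions made for this example and not part of any existing FRS or Data.gov mapping.

```python
import re
from datetime import date

XSD = "http://www.w3.org/2001/XMLSchema#"

MONTHS = {m: i for i, m in enumerate(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"], start=1)}


def parse_frs_date(text: str, pivot: int = 50) -> date:
    """Parse 'DD-MON-YY'. Years below the pivot are read as 20YY (assumption)."""
    day, mon, yy = text.split("-")
    year = int(yy) + (2000 if int(yy) < pivot else 1900)
    return date(year, MONTHS[mon.upper()], int(day))


def typed_literal(prop: str, value: str) -> str:
    """Return an N-Triples literal, typed where a datatype clearly applies."""
    if prop in ("create_date", "update_date"):
        return f'"{parse_frs_date(value).isoformat()}"^^<{XSD}date>'
    if prop in ("latitude83", "longitude83"):
        if not re.fullmatch(r"[+-]?\d+(\.\d+)?", value):
            raise ValueError(f"not a decimal: {value!r}")
        return f'"{value}"^^<{XSD}decimal>'
    # Codes such as fips_code stay plain strings so leading zeros survive.
    # Escaping of quotes inside values is omitted in this sketch.
    return f'"{value}"'
```

With typed dates, a query could filter on "updated since 2001" directly rather than comparing strings such as "06-AUG-01".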
One approach that may make more sense is to go back to the relational database model, which can support more richness. Individual tables and their relationships would be generated as Linked Open Data, and SPARQL queries would then have flexibility comparable to that of current SQL queries.
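A minimal sketch of this table-by-table idea follows. Each row becomes a resource, ordinary columns become literals, and foreign-key columns become links to rows of another table. The URI base, the table names and the second table's columns are hypothetical and do not reflect the actual FRS schema.

```python
BASE = "http://example.org/frs/"  # hypothetical URI base

# Two toy tables. Values come from the sample record; the structure is illustrative.
facilities = [{"registry_id": "110006555085",
               "primary_name": "TERRI KELLER RESIDENCE"}]
interests = [{"interest_id": "1",
              "registry_id": "110006555085",
              "pgm_sys_acrnm": "NE-IIS"}]


def table_to_triples(table, rows, key, foreign_keys=None):
    """Yield one N-Triples line per non-key column of each row.

    foreign_keys maps a column name to the table it references; those
    columns are emitted as URIs, all other columns as plain literals.
    """
    foreign_keys = foreign_keys or {}
    for row in rows:
        subject = f"<{BASE}{table}/{row[key]}>"
        for column, value in row.items():
            if column == key:
                continue
            predicate = f"<{BASE}def/{table}#{column}>"
            if column in foreign_keys:
                obj = f"<{BASE}{foreign_keys[column]}/{value}>"
            else:
                obj = f'"{value}"'
            yield f"{subject} {predicate} {obj} ."


triples = list(table_to_triples("facility", facilities, "registry_id"))
triples += table_to_triples("interest", interests, "interest_id",
                            foreign_keys={"registry_id": "facility"})
```

Because the interest row now points at the facility resource rather than repeating its ID as text, a SPARQL query can follow that link much as a SQL query would join the two tables.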
Regarding the properties, are there other namespaces that we could or should be leveraging in some cases? geo: is one example. However, our data is NAD83 and geo: assumes WGS84; one possibility is to reproject to WGS84 and provide geo: values to supplement what we have. Similarly, foaf: or other namespaces deal with addresses and points of contact. The RDF currently carries only locations, but FRS also has contacts, which we could incorporate at some point.
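The sketch below shows one way such a mapping to shared vocabularies might be expressed. The particular target terms (rdfs:label and the W3C vCard address properties) are an assumption chosen for illustration, not a recommendation from this paper. Because geo: assumes WGS84, the geo:lat and geo:long values are only produced when the caller supplies a reprojection function; that function is hypothetical here and would in practice come from a geodesy library.

```python
GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"
VCARD = "http://www.w3.org/2006/vcard/ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

# Illustrative mapping only; the choice of target terms is an assumption.
PROPERTY_MAP = {
    "primary_name": RDFS + "label",
    "location_address": VCARD + "street-address",
    "city_name": VCARD + "locality",
    "state_code": VCARD + "region",
    "postal_code": VCARD + "postal-code",
}


def map_record(record, reproject=None):
    """Return {predicate URI: value} using shared vocabularies.

    geo:lat / geo:long are emitted only when reproject(lat, lon) is given,
    so NAD83 coordinates are never passed off as WGS84 by accident.
    """
    out = {PROPERTY_MAP[k]: v for k, v in record.items() if k in PROPERTY_MAP}
    if reproject and "latitude83" in record and "longitude83" in record:
        lat, lon = reproject(float(record["latitude83"]),
                             float(record["longitude83"]))
        out[GEO + "lat"] = lat
        out[GEO + "long"] = lon
    return out
```

Properties with no obvious shared equivalent (huc_code, pgm_sys_acrnms and so on) are simply left out of this mapping and would still need an EPA-defined vocabulary.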
In summary, I think the RDF could be improved in two ways. The first is accessibility (SPARQL, et cetera); I think Data.gov needs to look at that from a services infrastructure standpoint. The second is usability: following more of a data model approach rather than this flat mapping, mapping to existing namespaces, and following existing models where appropriate. We should be able to leverage some of our metadata elements, data models and other artifacts toward a better representation and mapping.
Data Model Issues:
Long range, some additional tweaks to the FRS data model may be needed in order to enhance data representation and better support Linked Open Data. Some of these are described in brief below.
Linked Open Data Development:
Potential collaboration with:
Joshua Lieberman (OGC Geospatial Semantics SWG)
Spatial Ontology Community of Practice
Jim Hendler (RPI), George Thomas (HHS): CIO Council and Data.gov Geospatial Semantics threads
John Harman / Michael Pendleton (LOD, SRS)
Steve Young / Zach Scott / Open Gov Team (LOD)
Talis, pending contract (LOD)
TRI Program (Potential Pilot)
Kevin Kirby (Data Model)
Tom Giffen (Data Model, Business Rules)
Ken Blumberg (Business Rules)
Cindy Dickinson (Standards, Business Rules)
Others (program offices, regions, GISWG)
Existing Resources
Leverage the data modeling work that Kevin Kirby has been doing
Drill into gist.owl and other potential resources
Short-Term data needs:
Semantic Enhancements / Linked Open Data
Improvement of capabilities for supporting Linked Open Data applications:
Analysis of data structure toward supporting faceted, dimensional analyses (Figure 3)
Development of URI schemes, potentially namespaces, and means and approaches for allowing unique identification and linkage
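As a small illustration of a URI scheme for unique identification, the sketch below mints a stable URI from a registry ID and recovers the ID from it. The base URI is a hypothetical placeholder, since no EPA URI scheme is defined in this paper, and the digits-only check is an assumption based on the sample record.

```python
import re

# Hypothetical base; an actual scheme would need to be agreed on and hosted.
BASE_URI = "http://example.org/id/facility/"


def facility_uri(registry_id: str) -> str:
    """Mint a URI from a registry ID (assumed digits only, as in the sample)."""
    rid = registry_id.strip()
    if not re.fullmatch(r"\d+", rid):
        raise ValueError(f"unexpected registry_id: {registry_id!r}")
    return BASE_URI + rid


def registry_id_from_uri(uri: str) -> str:
    """Recover the registry ID from a URI minted by facility_uri."""
    if not uri.startswith(BASE_URI):
        raise ValueError(f"not a facility URI: {uri!r}")
    return uri[len(BASE_URI):]
```

Unlike the current "#entry9985" identifiers, which depend on a row's position in one dataset file, a URI built from the registry ID would stay the same across dataset versions and could be linked to from other datasets.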
Figure 3: Potential Facets / Dimensions for Analysis and Semantic Enhancement
Semantic Dimensions:
Explore various dimensions of facility:
Spatial
  o GML representation of absolute location (lat/long, etc.)
  o Spatial representation framework for facility (building footprints, parcel boundary, others for future)
  o Facility data modeling granularity and relationships - get a better handle on what the facility "thing" represents, and its relation to other things. For example, a parcel boundary containing an industrial complex with manufacturing and storage buildings (differing NAICS, possibly even different companies operating and licensed/permitted), plus associated air stacks, SPCC measures, water outfalls, et cetera. When we pull up "facility" it should ultimately reflect that bigger picture for context, with the component of interest highlighted.
Temporal
  o Data currency
  o Temporal aspects to regulation, enforcement, permitting, et cetera – future
Corporate Dimension
  o Corporate ownership – at facility level and at ultimate corporate parent level
Function - Activity and Use
  o NAICS/SIC Codes
  o EPA Regulatory program
  o EPA Interest Type
  o Linkages / translation between interest type and other ontologies/vocabularies
  o Linkages to regulatory programs and other components
Interrelationships of facilities (future)
Individuals
  o Friend-of-a-friend (FOAF) and other existing RDF constructs
Many other potential enhancements
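The facility granularity idea under the Spatial dimension above can be sketched as a simple containment tree, where any component can report the larger context it sits in. The class, the component kinds and the example names are all hypothetical and are not drawn from the FRS data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Component:
    """One piece of a facility: a parcel, building, air stack, outfall, etc."""
    name: str
    kind: str
    naics: Optional[str] = None  # components of one site may differ in NAICS
    parent: Optional["Component"] = field(default=None, repr=False, compare=False)
    children: List["Component"] = field(default_factory=list, repr=False, compare=False)

    def add(self, child: "Component") -> "Component":
        """Attach a child component and return it for chaining."""
        child.parent = self
        self.children.append(child)
        return child

    def context(self) -> List[str]:
        """Names from the outermost container down to this component."""
        node, path = self, []
        while node is not None:
            path.append(node.name)
            node = node.parent
        return path[::-1]


# Hypothetical site: a parcel holding a complex with a building and its stack.
parcel = Component("Parcel A", "parcel")
complex_ = parcel.add(Component("Industrial Complex", "complex"))
building = complex_.add(Component("Storage Building", "building"))
stack = building.add(Component("Stack 1", "air stack"))
```

Calling stack.context() walks back up to the parcel, which is the "bigger picture for context, with the component of interest highlighted" described in the list.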
Potential Pilots
A number of potential pilots for mashups can be considered. The “low hanging fruit” for OEI may be pilots that build upon known internal assets, i.e.
FRS
TRI (Toxic Release Quantities for Given Location)
SRS (Substance)
Potentially, as one scenario, one could tie TRI discharges to reaches via OW web services and TRI reported receiving waters, and then tie this to observed impacts downstream.
One caveat of using EPA data is that it is familiar to EPA users, but ideally needs to be more fully fleshed out to make it discoverable and uniquely identifiable for external users, perhaps via embedded EPA identifiers (an epa: namespace or a similar means of identifying our assets).
Other potential scenarios TBD… OECA targeted enforcement vs. OSHA, or OPP vs. USDA pesticides application data.
Longer-Range, Emergent data needs:
These are not specific to LOD, but are instead emergent attributes of interest for FRS. LOD approaches may help inform how to structure these.
HUC Codes
Completing the prepopulation of HUC Codes can support identification of facilities impacting major watersheds, e.g. Chesapeake Bay (OECA need). Other potential needs: Airsheds
Municipality
Toward improving data quality: a physical street address may include a ZIP Code for a city that differs from the actual municipality where the site resides. For example, Suburban Drive, State College PA is actually in Ferguson Township, PA, so the local planning and building code officials and emergency responders who either have or need information on the facility of interest would be different from those of the municipality listed.
Relationship
Ability to relate facilities: relating individual components of a larger system of infrastructure, such as relating a gas terminal to a compressor station, since changes to one may impact others. Ability to organize information in appropriate fashions, such as relating multiple individual oil platforms with discrete permits to a lease boundary with another level of permitting.
Indian Country
More robust identification/validation of facilities which may lie within tribal boundaries – refinement of IND-3 boundaries with other source data, analysis of flows containing either tribal flag (Y/N) and/or tribal identifier (tribe/reservation name) - (collaboration with Elizabeth Jackson / Ed Liu)
Facility Definition
Potential broadening of scope and use of FRS to accommodate grant award locations and other types of locations – 2005 NAPA Report recommendations for consistent agencywide site identification. May be predicated on buildout of other capabilities, such as being able to relate sites.
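For the HUC Codes item above, a populated huc_code makes watershed selection a simple prefix test, because the 8-digit code nests region (2 digits), subregion (4), basin (6) and subbasin (8). The sketch below uses the huc_code from the sample record; the second facility and its code are made up for illustration.

```python
# Digit counts for the nested levels of an 8-digit hydrologic unit code.
HUC_LEVELS = {2: "region", 4: "subregion", 6: "basin", 8: "subbasin"}


def huc_ancestors(huc_code: str) -> dict:
    """Split an 8-digit HUC into the code of each enclosing level."""
    if len(huc_code) != 8 or not huc_code.isdigit():
        raise ValueError(f"expected an 8-digit HUC, got {huc_code!r}")
    return {name: huc_code[:n] for n, name in HUC_LEVELS.items()}


def facilities_in_watershed(facilities, huc_prefix: str):
    """Return facilities whose huc_code falls under the given HUC prefix."""
    return [f for f in facilities
            if f.get("huc_code", "").startswith(huc_prefix)]


sample = [
    {"registry_id": "110006555085", "huc_code": "10200203"},  # from the sample record
    {"registry_id": "000000000000", "huc_code": "02070010"},  # made-up facility
    {"registry_id": "000000000001"},                          # HUC not yet populated
]
in_subregion = facilities_in_watershed(sample, "1020")
```

The third record shows why completing the prepopulation matters: a facility without a huc_code cannot be found by any watershed query.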
Other Ongoing, Related Activities
A number of activities, internal and external, can help to inform the direction and data model for FRS data collection and publishing activities. Some of these are listed below:
Potential EPA Corporate ID Workgroup
Collaborate with TRI, TSCA, FRP, RMP and others who collect corporate parent information, as well as OECA and others who need corporate parent information to support analysis.
White House Corporate ID Workgroup
Collaborate with emergent White House Corporate ID workgroup – Beth Noveck / Steve Croley, SEC, Labor and other agencies to align, coordinate and collaborate on corporate identifiers
OpenGov
Collaboration with EPA Open Gov initiatives to inform on how best to publish data for external reuse.
National Academy of Public Administration
Follow-through on 2005 NAPA Report recommendations
Spatial Ontology Community of Practice (SOCOP)
Collaboration on vocabularies, standards and data modeling approaches
Data.Gov Data Architecture Subgroup
Collaboration on vocabularies, standards and data modeling approaches
EPA OEI/OIC/IESD Data Standards Branch
Collaboration on vocabularies, standards and data modeling approaches
Others…
Anticipated Next Steps:
TBD: develop ideas for potential pilots, engage on “LOD Cookbook” and approaches for representing and rendering our data as RDF.