Publishing Census as Linked Open Data.
A Case Study
Irene Petrou
George Papastefanatos
Institute for the Management of Information Systems
RC “Athena”
2nd International Workshop on Open Data (WOD 2013) @ Paris
Theodore Dalamagas
Outline
− Introduction to LOD and statistical data
− Greek census data overview
− LOD Technology adopted
− Case Study: Publishing Population Data
− Conclusions and Future Work
Linked Open Data (LOD)
• Principles of Linked Data by Tim Berners-Lee:
1. Use URIs as names for things
2. Use HTTP URIs so that people can look up those
names
3. When someone looks up a URI, provide useful
information, using the standards (RDF, SPARQL)
4. Include links to other URIs so that they can discover
more things
Statistical data and related vocabularies
• Statistical data are everywhere!
  – consumed by governmental institutions, private organizations, journalists, scientists, etc.
• Statistical vocabularies and standards:
  – SDMX (Statistical Data and Metadata eXchange) standard
    • Sponsored by EUROSTAT, UN, World Bank, BIS, ECB, IMF, OECD
    • SDMX Content-Oriented Guidelines (2009) (COGs)
      – Cross-Domain Concepts
      – Cross-Domain Code Lists
      – Statistical Subject-Matter Domains
      – Metadata Common Vocabulary
  – SCOVO (Statistical Core Vocabulary)
    • Simple and minimal
    • Concepts: dataset, data item and dimension
  – Data Cube Vocabulary
Greek Census Data Overview
Inhabitants (Π-1.2)
• sex, marital status, nationality, educational level, address, etc.
Households and dwellings (Π-1.1)
• type of residence, total area, facilities, number of residents, etc.
Motivation - Why publish these data as LOD?
─ Easier-to-process format
─ Crawling and querying via SPARQL
─ Identifiable and linkable
─ Comparable against other datasets
─ Consumption by third parties
─ Data exploration and development of novel applications
─ Consistency and uniformity between datasets
Outline of the Data Cube Vocabulary
• Multidimensional model (Cube)
• Components: dimensions, attributes, measures
  – Dimensions: what the observation applies to
  – Measures: the phenomenon being observed
  – Attributes: how it was measured
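A minimal Python sketch of this model, assuming a single observation with one dimension, one measure and one attribute. The class name, the field names and the population figure are illustrative assumptions and are not taken from the census data.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """One cell of the cube (all names here are illustrative assumptions)."""
    dimensions: dict  # what the observation applies to, e.g. a geocode
    measures: dict    # the phenomenon being observed, e.g. population
    attributes: dict  # how it was measured, e.g. the unit


def describe(obs: Observation) -> str:
    """Render an observation as one readable line, component by component."""
    parts = []
    for role in ("dimensions", "measures", "attributes"):
        for name, value in getattr(obs, role).items():
            # role[:-1] turns "dimensions" into "dimension", and so on.
            parts.append(f"{role[:-1]} {name}={value}")
    return "; ".join(parts)


# Hypothetical values: the geocode appears in the slides, the count does not.
example = Observation(
    dimensions={"geocode": "0102"},
    measures={"population": 1000},
    attributes={"unit": "inhabitants"},
)
print(describe(example))
```

Keeping the three roles in separate fields mirrors the way the Data Cube Vocabulary separates component properties by role.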
Google Refine
• Simple and powerful tool
• Clean up messy data
• Process data with its own expression language (GREL)
• Transform data between different formats, such as TSV, CSV, *SV, Excel, JSON,
XML, and Google Data documents
• RDF Refine plugin to convert files to RDF
URI Scheme
• Base URI: http://linked-statistics.gr
• The URI scheme distinguishes between:
− Schema – Structural components (DSD, Components Specifications)
http://{BASE_URI}/schema#{ComponentName}
@prefix schema:<http://linked-statistics.gr/schema/>
− Dataset and observations
http://{BASE_URI}/data/{DatasetName}
http://{BASE_URI}/data/{DatasetName}#{DatasetKey}
@prefix data:<http://linked-statistics.gr/data/>
− Concepts and their values
http://{BASE_URI}/dic/{ConceptName}
http://{BASE_URI}/dic/{ConceptName}#{value}
@prefix dic:<http://linked-statistics.gr/dic/>
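The scheme above can be sketched as a handful of Python helpers that mint URIs from names. The function names are hypothetical. For schema components the slide shows both a hash pattern and a slash-terminated prefix; this sketch assumes the prefix form.

```python
BASE_URI = "http://linked-statistics.gr"


def schema_uri(component_name: str) -> str:
    # Assumption: follows the schema: prefix namespace (slash form).
    return f"{BASE_URI}/schema/{component_name}"


def dataset_uri(dataset_name: str) -> str:
    return f"{BASE_URI}/data/{dataset_name}"


def observation_uri(dataset_name: str, dataset_key: str) -> str:
    # An observation is a fragment of its dataset URI.
    return f"{dataset_uri(dataset_name)}#{dataset_key}"


def concept_uri(concept_name: str) -> str:
    return f"{BASE_URI}/dic/{concept_name}"


def concept_value_uri(concept_name: str, value: str) -> str:
    # A concept value is a fragment of its concept URI.
    return f"{concept_uri(concept_name)}#{value}"


print(observation_uri("PopulationPerGeocodeCensus2011", "0102"))
print(concept_value_uri("geocode", "0102"))
```

Geocodes are passed as strings so that leading zeros such as the one in 0102 survive.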
Converting to RDF
• Step 1 – Download the census results in .xls format from the Hellenic Statistical Authority website and import the data into Google Refine
  • Census results: permanent population in Greece for 2011 based on the place of residence
  • Columns: geographical code (Dimension), Permanent population (Measure)
• Step 2 – Clean up the data (remove unwanted empty rows and cells)
• Step 3 – Build the skeleton with the Data Cube Vocabulary:
  a. Define the components
  b. Define the DSD, Dataset and Component Specifications
  c. Define Observations
• Step 4 – Export the data to an RDF file as RDF/XML
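The slides carry out these steps in Google Refine with the RDF Refine plugin. As a rough analogue only, the Python sketch below covers steps 2 to 4 with the standard library: it reads tabular rows (CSV text standing in for the .xls file), drops empty rows and writes each observation as N-Triples rather than RDF/XML. The column names, sample rows and population figures are invented for illustration, and the schema namespace is assumed to be the slash-terminated prefix form.

```python
import csv
import io

QB = "http://purl.org/linked-data/cube#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INT = "http://www.w3.org/2001/XMLSchema#integer"
SCHEMA = "http://linked-statistics.gr/schema/"  # assumed prefix form
DATASET = "http://linked-statistics.gr/data/PopulationPerGeocodeCensus2011"
GEOCODE = "http://linked-statistics.gr/dic/geocode#"

# Invented sample standing in for the downloaded spreadsheet.
SAMPLE = "geocode,population\n0101,1200\n,\n0102,3400\n"


def clean_rows(text):
    """Step 2: skip rows where either cell is empty."""
    for row in csv.DictReader(io.StringIO(text)):
        code = (row.get("geocode") or "").strip()
        pop = (row.get("population") or "").strip()
        if code and pop:
            yield code, int(pop)


def observation_triples(code, population):
    """Step 3c: the triples describing one qb:Observation."""
    obs = f"<{DATASET}#{code}>"
    return [
        f"{obs} <{RDF_TYPE}> <{QB}Observation> .",
        f"{obs} <{QB}dataSet> <{DATASET}> .",
        f"{obs} <{SCHEMA}geocodeDim> <{GEOCODE}{code}> .",
        f'{obs} <{SCHEMA}population> "{population}"^^<{XSD_INT}> .',
    ]


def convert(text):
    """Steps 2-4: clean the rows, then serialize every observation."""
    lines = []
    for code, population in clean_rows(text):
        lines.extend(observation_triples(code, population))
    return "\n".join(lines)


print(convert(SAMPLE))
```

N-Triples is used here only because it can be written line by line without a library; a dedicated RDF toolkit would be the natural choice for producing RDF/XML.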
Modelling with Data Cube Vocabulary
Data Cube Vocabulary | Custom Vocabulary | Instances
qb:DimensionProperty, qb:CodedProperty | schema:geocodeDim | URI of an administrative division (geographical code, “geocode”)
qb:MeasureProperty | schema:population | number
qb:AttributeProperty | schema:UnitOfMeasure | URI representing that the population is measured by number of inhabitants
qb:DataStructureDefinition | schema:PopulationPerGeocodeCensus2011
qb:DataSet | data:PopulationPerGeocodeCensus2011
qb:Observation | data:PopulationPerGeocodeCensus2011#{geocode} *
* Each observation is connected to the appropriate administrative division through the schema:geocodeDim property, and schema:population denotes the population for the corresponding geocode.
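To make the mapping concrete, the Python sketch below assembles a Turtle description of the components, the data structure definition and the dataset named on this slide. It is an illustration under assumptions, not the authors' actual output: the one-blank-node-per-component layout and the function name are choices made here, while the qb: terms come from the Data Cube Vocabulary.

```python
PREFIXES = """@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix schema: <http://linked-statistics.gr/schema/> .
@prefix data: <http://linked-statistics.gr/data/> .
"""

# (component property, qb classes, qb linking property), following the slide.
COMPONENTS = [
    ("schema:geocodeDim", "qb:DimensionProperty, qb:CodedProperty", "qb:dimension"),
    ("schema:population", "qb:MeasureProperty", "qb:measure"),
    ("schema:UnitOfMeasure", "qb:AttributeProperty", "qb:attribute"),
]


def build_structure(name="PopulationPerGeocodeCensus2011"):
    """Return Turtle for the components, the DSD and the dataset."""
    lines = [PREFIXES]
    # Declare each component property with its Data Cube class(es).
    for prop, qb_classes, _ in COMPONENTS:
        lines.append(f"{prop} a {qb_classes} .")
    # One component specification (blank node) per component property.
    specs = " ,\n        ".join(
        f"[ {link} {prop} ]" for prop, _, link in COMPONENTS
    )
    lines.append(
        f"schema:{name} a qb:DataStructureDefinition ;\n"
        f"    qb:component {specs} ."
    )
    # The dataset points at its structure definition.
    lines.append(f"data:{name} a qb:DataSet ;\n    qb:structure schema:{name} .")
    return "\n".join(lines)


print(build_structure())
```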
Modelling Administrative divisions of residence
• Characteristics of each division:
  – hierarchical geocode value: dic:geocode#{geocodeValue}
  – description: skos:prefLabel property
  – administrative level: dic:haslevel property
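The characteristics above can be sketched in Python as two N-Triples lines per division. The label passed in below is a placeholder, not the real name of any division. The level URI pattern geolevel#{level} is assumed from the geolevel#5 label on the example slide, and the function name is hypothetical.

```python
DIC = "http://linked-statistics.gr/dic/"
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"


def division_triples(geocode, label, level):
    """Describe one administrative division as N-Triples lines."""
    subject = f"<{DIC}geocode#{geocode}>"
    # Escape backslashes and double quotes so the literal stays valid.
    safe_label = label.replace("\\", "\\\\").replace('"', '\\"')
    return [
        f'{subject} <{SKOS_PREF_LABEL}> "{safe_label}" .',
        # Assumption: levels are minted as dic/geolevel#{level}.
        f"{subject} <{DIC}haslevel> <{DIC}geolevel#{level}> .",
    ]


# "Example division" is a placeholder label, not a real division name.
for triple in division_triples("0102", "Example division", 5):
    print(triple)
```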
Example – Population of the division with the geocode 0102
[Figure: graph linking the observation, geocode#0102 and geolevel#5]
RDF Output Example
[Figure: RDF output for the observation and for the geocode]
Conclusions and Future Work
• Case study on publishing census data as LOD
• The data come from Greece’s 2011 resident population census
• Tabular data (.xls file) converted to RDF
• Further work aims at extending the proposed data model to represent more statistical indexes and more complex census datasets
• New release of Data Cube Vocabulary (13/03/13)
• Configuration of SPARQL endpoint service
Thank you very much!
Questions?
All published data are available at:
http://linked-statistics.gr