Post on 01-Jul-2015
CEDAR & PRELIDA: Preservation of Linked Socio-Historical Data
Albert Meroño-Peñuela (@albertmeronyo)
PRELIDA consolidation workshop @ ISWC, 17-10-2014
CEDAR: Harmonizing Historical Census Data in the Semantic Web
CEDAR: Source Historical Data
Dutch Historical Censuses (1795-1971)
[Public Historical Statistical Data]
From scans to spreadsheets
CEDAR goal: cross queries
[Figure: cross queries over the census years 1795, 1830, 1889, 1930 and 1971 (through ~3K tables)]
Towards 5-star Census Data
>1 year ago
1 year ago
• Web publishable
• Machine processable
• Dynamic schema
• Easily link with other datasets
Why with semantic technology?
• Web publishable, human & machine readable
• Finer granularity level (cell level)
• Statistical comparability by leveraging semantic descriptions
• Provenance
• Harmonization through linkage to other datasets (the 5th star)
RDF Data Cube
“There are many situations where it would be useful to be able to publish multi-dimensional data, such as
statistics, on the web in such a way that they can be linked to related data sets and concepts.”
RDF Data Cube vocabulary (QB)
• SDMX compatible
• Defines cubes as a set of observations that consist of dimensions, measures and attributes
• Dimensions: time period, region, sex (qb:DimensionProperty)
• Measure: population life expectancy (qb:MeasureProperty)
• Attribute: unit of measure = years, metadata status = measured (qb:AttributeProperty)
Observation: “the measured life expectancy of males in Newport in the period 2004-2006 is 76.7 years”
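The observation above can be sketched as a plain data structure to make the split between dimensions, measure and attributes concrete. This is only an illustration: the `Observation` class and the `describe` helper are hypothetical names, not part of the QB vocabulary or of any CEDAR tool.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    """One cube cell: dimensions locate it, the measure holds the value."""
    dimensions: dict                                # cf. qb:DimensionProperty
    measure: tuple                                  # (name, value), cf. qb:MeasureProperty
    attributes: dict = field(default_factory=dict)  # cf. qb:AttributeProperty


def describe(obs):
    """Render an observation as one readable line."""
    name, value = obs.measure
    dims = ", ".join(f"{k}={v}" for k, v in obs.dimensions.items())
    unit = obs.attributes.get("unit of measure", "")
    return f"{name} [{dims}] = {value} {unit}".strip()


# The life-expectancy example from the slide.
obs = Observation(
    dimensions={"time period": "2004-2006", "region": "Newport", "sex": "male"},
    measure=("life expectancy", 76.7),
    attributes={"unit of measure": "years", "metadata status": "measured"},
)
print(describe(obs))
```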
CEDAR Integrator
https://github.com/CEDAR-project/Integrator
Raw data
cedar:BRT_1889_08_T1-S0-K17 a tablink:DataCell ;
    rdfs:label "K17" ;
    tablink:value "12.0" ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-A8 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-K6 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-J3 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-K4 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-K5 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-B8 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-C12 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-E17 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-F17 ;
    tablink:sheet cedar:BRT_1889_08_T1-S0 .
Harmonization Rules as Open Annotations
cedar:BRT_1889_08_T1-S0-K4-mapping a oa:Annotation ;
    oa:hasBody cedar:BRT_1889_08_T1-S0-K4-mapping-body ;
    oa:hasTarget cedar:BRT_1889_08_T1-S0-K4 ;
    oa:serializedAt "2014-09-24"^^xsd:date ;
    oa:serializedBy <https://github.com/CEDAR-project/Integrator> ;
    prov:wasGeneratedBy cedar:BRT_1889_08_T1-S0-mapping-activity .

cedar:BRT_1889_08_T1-S0-K4-mapping-body a rdfs:Resource ;
    sdmx-dimension:sex sdmx-code:sex-F .
Harmonized RDF Data Cube
cedar:BRT_1889_02_T1-S0-K17-h a qb:Observation ;
    cedar:population "12"^^xsd:decimal ;
    maritalstatus:maritalStatus maritalstatus:single ;
    cedarterms:occupationPosition cedarterms:job-D ;
    sdmx-dimension:sex sdmx-code:sex-F ;
    cedarterms:occupation hisco:88030 ;
    sdmx-dimension:refArea gg:11150 ;
    prov:wasDerivedFrom cedar:BRT_1889_08_T1-S0-K17 ;
    prov:wasGeneratedBy cedar:BRT_1889_08_T1-S0-K17-activity .
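The step from a raw data cell plus harmonization rules to a harmonized observation can be sketched as follows. This is a minimal illustration, not the Integrator's actual implementation: the dictionaries and the `harmonize` function are hypothetical, the raw cell is shortened to two of its header cells, and only the K4 rule is taken from the slides.

```python
# Raw cell, shortened from the tablink:DataCell example (two header cells kept).
raw_cell = {
    "id": "BRT_1889_08_T1-S0-K17",
    "value": "12.0",
    "dimensions": ["BRT_1889_08_T1-S0-K4", "BRT_1889_08_T1-S0-A8"],
}

# Harmonization rules: header cell -> (dimension, code), mirroring the body
# of an annotation. Only the K4 rule appears in the slides.
rules = {
    "BRT_1889_08_T1-S0-K4": ("sdmx-dimension:sex", "sdmx-code:sex-F"),
}


def harmonize(cell, rules):
    """Build an observation from a raw cell; report header cells with no rule."""
    obs = {
        "id": cell["id"] + "-h",
        "cedar:population": float(cell["value"]),
        "prov:wasDerivedFrom": cell["id"],
    }
    unmapped = []
    for header in cell["dimensions"]:
        if header in rules:
            dimension, code = rules[header]
            obs[dimension] = code
        else:
            unmapped.append(header)
    return obs, unmapped


observation, unmapped = harmonize(raw_cell, rules)
print(observation)
print("header cells without a rule:", unmapped)
```

Header cells without a rule are returned separately, so that dimensions still missing a harmonization rule stay visible instead of being dropped silently.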
Classification Systems and Concept Schemes
• Some missing harmonized dimensions!
• Encode all variables and their values using concept schemes
• Some already exist
– Which ones? How many of them?
– Where?
– By whom?
– Are they used at all? Can I reuse them?
• Some need to be created
– Manual and expert knowledge based
– Can we do it automatically? Or assist the process?
[Figure: linked datasets: Dutch Historical Censuses (CEDAR), Dutch Ships and Sailors, Gemeentegeschiedenis.nl, HISCO, ICONCLASS, Dutch Historical Religions, Dutch Historical House Types]
Existing dimensions
• Gemeentegeschiedenis.nl
Existing LSD (Linked Statistical Data) dimensions
• P1: Discoverability? How to discover dimensions created by others?
• P2: Reusability? How often are dimensions reused? Can we reuse dimensions created by others?
• P3: Relevance? What’s the size of LSD?
LSD Dimensions
http://lsd-dimensions.org/
https://github.com/albertmeronyo/LSD-Dimensions
Hourly JSON-LD dumps
Existing LSD dimensions
• P1: Discoverability? How to discover dimensions created by others? → LSD Dimensions
• P2: Reusability? How often are dimensions reused? Can we reuse dimensions created by others? → Logarithmic law / probably yes
• P3: Relevance? What’s the size of LSD? → ~7.9% of the LOD cloud
Creating new LSD Dimensions
• CEDAR needs concept schemes for
– Historical religious denominations (i.e. religions in the Netherlands in the 18th-20th centuries)
– Historical occupations (same period)
– Historical building types (same period)
https://github.com/CEDAR-project/TabCluster
TabCluster
Leverages
● Lexical properties
○ Hierarchical clustering in Python scipy
○ String distances
● Semantic properties (LOD tagging)
○ skos:Concept of most frequent cluster-term
○ Closest common skos:broader skos:Concept of all cluster-terms
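The lexical half of this approach can be sketched with the standard library alone. This is an assumption-laden illustration rather than TabCluster's code: the slides name scipy for the clustering, whereas this sketch swaps in a naive single-linkage loop and `difflib` similarity so that it runs without dependencies. The labels and the 0.2 threshold are made up.

```python
from difflib import SequenceMatcher


def distance(a, b):
    """String distance in [0, 1]; 0 means identical (case-insensitive)."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def single_linkage(terms, threshold):
    """Merge the two closest clusters until the closest pair exceeds threshold."""
    clusters = [[t] for t in terms]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: cluster distance is the closest pair of members.
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical occupation labels, not taken from the census tables.
labels = ["bakker", "smid", "landbouwers", "bakkers", "smeden", "landbouwer"]
clusters = single_linkage(labels, threshold=0.2)
for cluster in clusters:
    print(cluster)
```

With these labels the singular and plural spellings of "bakker" and "landbouwer" merge, while "smid" and "smeden" stay apart, a pair that string distance alone does not link at this threshold.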
Compatibility? Remixability? Reusability?
Sarven Capadisli, Albert Meroño-Peñuela, Sören Auer, Reinhard Riedl. “Semantic Similarity and Correlation of Linked Statistical Data Analysis”. 2nd Int. Workshop on Semantic Statistics (SemStats) ISWC 2014.
Concept Drift
Census classification of occupations as of 1859
• Root node is void
• Depth 1: occupation groups
• Leaves: actual occupations
Concept Drift
Census classification of occupations as of 1889
• Root node is void
• Depth 1: occupation groups
• Leaves: actual occupations
Concept Drift
Census classification of occupations as of 1899
• Root node is void
• Depth 1: occupation groups
• Leaves: actual occupations
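One way to make drift between such classifications tangible is to compare two trees of the shape described above (void root, occupation groups at depth 1, occupations as leaves). The sketch below is purely illustrative: the two trees are invented, the years are only labels, and `drift` is a hypothetical helper, not a method from the CEDAR project.

```python
# Hypothetical classifications: occupation group -> set of occupations.
# The void root is implicit; the entries are invented for illustration.
tree_a = {
    "Nijverheid": {"bakker", "smid"},
    "Landbouw": {"landbouwer"},
}
tree_b = {
    "Nijverheid": {"smid"},
    "Voeding": {"bakker"},
    "Landbouw": {"landbouwer", "tuinder"},
}


def parent_of(tree):
    """Invert the tree: occupation -> occupation group."""
    return {leaf: group for group, leaves in tree.items() for leaf in leaves}


def drift(old, new):
    """List occupations that were added, removed, or moved to another group."""
    old_p, new_p = parent_of(old), parent_of(new)
    shared = old_p.keys() & new_p.keys()
    return {
        "added": sorted(new_p.keys() - old_p.keys()),
        "removed": sorted(old_p.keys() - new_p.keys()),
        "moved": sorted(leaf for leaf in shared if old_p[leaf] != new_p[leaf]),
    }


changes = drift(tree_a, tree_b)
print(changes)
```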
Concept Drift
[Figure: upper ontologies (HISCO, AC) above year-dependent ontologies for 1859, 1869 and 1879]
Concept Drift
[Figure: upper ontologies (HISCO, AC) above year-dependent ontologies, with open links between them marked "?"]
Preserving CEDAR
• DANS-EASY as backend (http://easy.dans.knaw.nl/)
• Archived objects: Turtle snapshots
– 20 GB uncompressed, 200 MB compressed (per snapshot)
– Versioning (stats on current release)
• Users still need to
– SPARQL the data => bring up the endpoint on demand
– Run analytics on the data => outsource statistical analysis
Thank you
Questions, suggestions, comments most welcome
@albertmeronyo
http://www.cedar-project.nl
http://krr.cs.vu.nl/
http://easy.dans.knaw.nl/
http://lsd-dimensions.org/
Me in 6 tweets
http://www.albertmeronyo.org
• Background: Computer Science, Web hacker, AI & Law
• PhD candidate at the VU University Amsterdam, DANS, and eHumanities group (KNAW)
• Topic: Semantic Web for the Humanities
• CEDAR project (2012-2015): harmonized historical Dutch censuses in the Semantic Web
• Problem: statistical data publishing, concept drift and dynamics of meaning
• Last paper: What is Linked Historical Data? (EKAW 2014)