Post on 15-Jul-2020
PHDD – An RDF Vocabulary for the Physical Data Description
Work in Progress
Joachim Wackerow and Thomas Bosch (both GESIS – Leibniz Institute for the Social Sciences)
What is it?
• Description of the physical properties of a data file.
• Focus on most common format types
  – Rectangular format
  – Character-separated values (CSV) or fixed record length
[Figure: Character-separated values, with header row]
[Figure: Fixed record-length format]
Motivation
• Data.gov and similar initiatives provide data in CSV format or similar
• W3C Government Linked Data Working Group Charter
  – “The mission … is to provide standards and other information which help governments around the world publish their data as effective and usable Linked Data using Semantic Web technologies.”
• Machine-actionability - intended for program use
Example at data.gov
Existing Approaches
• CSV on the Web Working Group Charter (W3C)
• Linkage
  – Linked CSV (Jeni Tennison)
  – CSV linked data (Quoderat)
• Description
  – Common Format and MIME Type for Comma-Separated Values (CSV) Files, RFC 4180
  – csv: a vocabulary for describing CSV files (Rurik Thomas Greenall, Norwegian University at Trondheim)
• Representations
  – URI design for RDF conversion of CSV-based data (Tim Lebo, Gregory Todd Williams)
  – CSV2RDF Application (Ivan Ermilov, Sören Auer, Claus Stadler)
Research results:
  – Intended for different purposes
  – Description approaches not sufficient
PHDD – First Ideas
PHDD – UML Model
class phdd
TableDescription
  - caseQuantity : xsd:nonNegativeInteger [0..1]
  - fileName : xsd:string, xsd:uri
  - recordsPerCase : xsd:positiveInteger = 1
  - overallRecordCount : xsd:nonNegativeInteger [0..1]
TableStructure
  - characterSet : xsd:string
  - defaultDecimalSeparator : xsd:string [0..1]
  - defaultDigitGroupSeparator : xsd:string [0..1]
  - defaultLanguage : xsd:string [0..1]
  - defaultLocale : xsd:string [0..1]
  - defaultDecimalPositions : xsd:positiveInteger [0..1]
  - newLine : xsd:string = CRLF
ColumnDescription
  - recommendedDataType : xsd:string [0..1]
  - storageFormat : xsd:string [0..1]
  - recommendedDisplayDataFormat : xsd:string [0..1]
  - decimalPositions : xsd:positiveInteger [0..1]
  - recordNumber : xsd:positiveInteger [0..1] = 1
skos:Concept
FixedRecordLength
  - recordLength : xsd:positiveInteger [0..1]
Delimited
  - delimiter : xsd:string
  - textQualifier : xsd:string [0..1]
  - consecutiveDelimitersAsOne : xsd:boolean [0..1] = false
  - namesOnFirstRow : xsd:boolean [0..1] = true
  - firstDataLine : xsd:positiveInteger [0..1] = 2
DistributionTable
Column
InputProgram
  - programFileName : xsd:string
  - softwareType : xsd:string
  - programVersion : xsd:string
FixedColumnDescription
  - startPosition : xsd:positiveInteger
  - endPosition : xsd:positiveInteger [0..1]
  - width : xsd:positiveInteger [0..1]
DelimitedColumnDescription
  - columnPosition : xsd:positiveInteger
Associations (label: multiplicities at the two ends, as drawn)
  isDescribedBy: 0..* to 1
  isStructuredBy: 0..* to 1
  isDescribedBy: 0..* to 1
  column: 0..* to 1..*
  storageFormat: 0..* to 0..1
  inputProgram: 0..1 to 0..*
  defaultLocale: 0..* to 0..1
  defaultLanguage: 0..* to 0..1
  characterSet: 0..* to 1
  recommendedDisplayDataFormat: 0..* to 0..1
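To illustrate how a program could act on the Delimited properties in the model above, here is a minimal Python sketch. It is not part of PHDD: the dataclass simply mirrors the attribute names and defaults from the diagram, and `read_delimited` and the sample data are hypothetical. The sketch ignores `consecutiveDelimitersAsOne` and assumes that every record occupies exactly one line, so that `firstDataLine` can be used as a record index.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Delimited:
    """Mirrors the attributes and defaults of the Delimited class in the UML model."""
    delimiter: str
    textQualifier: Optional[str] = None
    consecutiveDelimitersAsOne: bool = False  # not handled in this sketch
    namesOnFirstRow: bool = True
    firstDataLine: int = 2  # 1-based line number of the first data record


def read_delimited(text: str, desc: Delimited) -> Tuple[Optional[List[str]], List[List[str]]]:
    """Hypothetical helper: return (column_names, data_rows) as described by `desc`."""
    if desc.textQualifier:
        reader = csv.reader(io.StringIO(text), delimiter=desc.delimiter,
                            quotechar=desc.textQualifier)
    else:
        # No text qualifier declared: do not interpret any quote character.
        reader = csv.reader(io.StringIO(text), delimiter=desc.delimiter,
                            quoting=csv.QUOTE_NONE)
    rows = list(reader)
    names = rows[0] if desc.namesOnFirstRow and rows else None
    # Assumes one record per line, so line numbers equal record numbers.
    data = rows[desc.firstDataLine - 1:]
    return names, data


# Made-up sample: semicolon-delimited, double quote as text qualifier, CRLF line ends.
sample = 'id;name;income\r\n1;"Smith; J.";1200,50\r\n2;Lee;980,00\r\n'
description = Delimited(delimiter=";", textQualifier='"')
names, data = read_delimited(sample, description)
```

The point of the sketch is that the reading program contains no file-specific constants; everything it needs comes from the description object.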
PHDD – Overview
General approach is not really new, just a complete set of properties for the most common cases.
Structure
• Table – the rectangular data file [disco::DataFile, dcat::Distribution]
  – TableStructure - common properties plus specific ones for delimited and fixed columns
• Column - common properties plus specific ones for delimited and fixed columns [disco::Variable]
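As an illustration of the column properties that are specific to fixed columns, the following Python sketch cuts one fixed-length record into values using `startPosition`, `endPosition`, and `width`. It rests on assumptions not stated in the slides: positions are taken to be 1-based, `endPosition` is taken to be inclusive, and the `name` field, `parse_record`, and the sample record are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class FixedColumnDescription:
    """Mirrors startPosition, endPosition and width from the UML model."""
    name: str  # hypothetical label for the column, not an attribute in the model
    startPosition: int  # assumed 1-based
    endPosition: Optional[int] = None  # assumed inclusive
    width: Optional[int] = None

    def bounds(self) -> Tuple[int, int]:
        """Return the 0-based, end-exclusive slice (start, stop) for this column."""
        start = self.startPosition - 1
        if self.endPosition is not None:
            return start, self.endPosition
        if self.width is not None:
            return start, start + self.width
        raise ValueError("either endPosition or width is required")


def parse_record(line: str, columns: List[FixedColumnDescription]) -> Dict[str, str]:
    """Hypothetical helper: slice one fixed-length record into named values."""
    values = {}
    for col in columns:
        start, stop = col.bounds()
        values[col.name] = line[start:stop].strip()
    return values


columns = [
    FixedColumnDescription("id", startPosition=1, endPosition=4),
    FixedColumnDescription("name", startPosition=5, width=10),
    FixedColumnDescription("income", startPosition=15, endPosition=18),
]
# Made-up 18-character record: 4 + 10 + 4 characters.
record = "0001" + "Smith".ljust(10) + "1200"
parsed = parse_record(record, columns)
```

Either `endPosition` or `width` is enough to locate a column, which is consistent with both being optional ([0..1]) in the model.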
Table Structure
Column Description
Relationship to other RDF Vocabularies
class externalVocabularies
Classes
  PHDD: Table, TableStructure, Column
  DCAT: dcat::Catalog, dcat::Dataset, dcat::Distribution
  Disco: disco::LogicalDataSet, disco::DataFile, disco::Variable
Association labels (as drawn)
  dcat:dataset
  dcat:distribution
  isStructuredBy: 0..* to 1
  column: 0..* to 1..*
  «owl:equivalentClass» (appears twice)
  disco:dataFile
  disco:containsVariable
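To make the link between DCAT and PHDD concrete, here is a rough Python sketch that emits N-Triples for one data file typed both as a dcat:Distribution and as a PHDD Table, with its structure and columns. Several things here are assumptions rather than statements from the slides: the PHDD namespace IRI, the example.org resource IRIs, the helper names, and the reading of the diagram that `isStructuredBy` points from Table to TableStructure and `column` from TableStructure to Column.

```python
from typing import List, Tuple

PHDD = "http://rdf-vocabulary.ddialliance.org/phdd#"  # assumed namespace IRI
DCAT = "http://www.w3.org/ns/dcat#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/data/"  # hypothetical base IRI for the example

Triple = Tuple[str, str, str]


def describe_table(file_id: str, column_ids: List[str]) -> List[Triple]:
    """Hypothetical helper: build IRI-only triples for a table, its structure and columns."""
    table = BASE + file_id
    structure = table + "/structure"
    triples = [
        (table, RDF_TYPE, DCAT + "Distribution"),
        (table, RDF_TYPE, PHDD + "Table"),
        # Direction assumed from the diagram: Table isStructuredBy TableStructure.
        (table, PHDD + "isStructuredBy", structure),
        (structure, RDF_TYPE, PHDD + "TableStructure"),
    ]
    for column_id in column_ids:
        column = f"{table}/column/{column_id}"
        # Direction assumed from the diagram: TableStructure column Column.
        triples.append((structure, PHDD + "column", column))
        triples.append((column, RDF_TYPE, PHDD + "Column"))
    return triples


def to_ntriples(triples: List[Triple]) -> str:
    """Serialize IRI-only triples as N-Triples, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


triples = describe_table("survey.csv", ["id", "income"])
ntriples = to_ntriples(triples)
```

Because every subject, predicate, and object in this sketch is an IRI, plain string formatting is enough; a description with literal values such as `delimiter` would need proper literal escaping or an RDF library.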
Usage Scenarios
[Figure: usage scenarios diagram. Labels: Data Provider, Data User, Data, Program, PHDD, Discovery, DCAT, DDI XML, Description, Search, Transformation, Analysis]
Relationship to DDI XML
• Mapping to DDI XML Specifications
  – DDI Codebook 2.*
    • approx. half of the properties of PHDD
  – DDI Lifecycle 3.*
    • almost all properties of PHDD
Relationship of DDI Specifications
[Figure: relationship of DDI specifications. Labels: DDI Codebook 2.*, DDI Lifecycle 3.*, DDI 4 Model, XML Schema Representation, OWL/RDF Representation, PHDD, Discovery, XKOS, XML Schema, OWL/RDF]
Future
Acknowledgements
• Contributions by
  – Larry Hoyle (Institute for Policy & Social Research, University of Kansas)
  – Richard Cyganiak (DERI - Digital Enterprise Research Institute)
Further Information
• Development repository of PHDD
  – https://github.com/linked-statistics/physical-data-description
• DDI Alliance RDF Vocabularies
  – http://www.ddialliance.org/Specification/RDF