Assessment and Visualization of Metadata Quality for Open Government Data
Konrad Johannes Reiche*, Edzard Höfig, Ina Schieferdecker**, presented by Nikolay Tcholtchev**
[email protected]*, {firstname.lastname}@fokus.fraunhofer.de**
“A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike.”
O·pen Da·ta /ˈəʊp(ə)n ˈdeɪtə/
License
Government
Data Citizens
DOMAIN
DESIGN
Repositories
XML
JSON
RDF
Metadata
PDF XLS CSV DOC
Resources
Quality. What could possibly go wrong?
Metadata Record
Name regional-household-income
ID 98899446-0a1a-43bc-874c-2d54dc700670
Maintainer Margaret Jarmon
Maintainer Email [email protected]
Author Office for National Statistics
Author Email [email protected]
License ID uk-ogl
Resources
URL http://www.ons.gov.uk/ons/rhi13
Description Spring 2013
Format CSV
URL http://www.ons.gov.uk/ons/rhi14
Description Spring 2014
Format CSV
Quality. What could possibly go wrong?
Metadata Record
Name regional-household-income
ID 98899446-0a1a-43bc-874c-2d54dc700670
Maintainer
Maintainer Email
Author Office for National Statistics
Author Email
License ID uk-ogl
Resources
URL http://www.ons.gov.uk/ons/rhi13
Description Spring 2013
Format CSV
URL http://www.ons.gov.uk/ons/rhi14
Description
Format CSV
CSV
HTML
Quality. What could possibly go wrong?
Metadata Record
Name
ID 98899446-0a1a-43bc-874c-2d54dc700670
Maintainer
Maintainer Email
Author
Author Email
License ID uk-ogl
Resources
URL http://www.ons.gov.uk/ons/rhi13
Description Spring 2013
Format CSV
URL http://www.ons.gov.uk/ons/rhi14
Description
Format CSV
QUALITY LOSS
Information Loss
Reputation Loss
- Missing Fields
- Dead Links
- Inaccurate Information
- False Information
- Outdated Values
- Missing Information
- Bad Spelling
- Non-Schema Compliant
Bad Searchability
Unreliable
Untrustworthy
Meta·da·ta Qual·i·ty /ˈmɛtədeɪtə kwɒlɪti/
The fitness to describe the data (resources), supporting the task dimensions of finding, identifying, selecting and eventually obtaining the resources. The quality is inversely proportional to the uncertainty of the user about the actual data.
Assessing Metadata Quality is HARD
Highly Subjective
Metadata
Resource
?
Wrong
1. Manual
Qualified Process
Principles + Guidelines
Postulated as being not feasible anymore due to the large number of metadata records.
2. Automated
- Algorithms?
- Procedures?
- Oracle?
- Machine Learning?
Automated Quality Assessment
Empirical Analysis + Visual Aid
- Field Usage
- Field Values
Framework
- Based on Information Quality
- Three Dimensions:
  - Intrinsic
  - Relational / Contextual
  - Reputational
- Evaluation Criteria
  - Completeness
  - Accuracy
  - Provenance
  - Logical Consistency
  - Timeliness …
QUALITY METRICS
$q_m : record_t \longrightarrow V \in [0, 1]$
Measurement. Assigning a symbolic value to an object to enable the characterization of a certain attribute of that object.
Process P
Quality. Complex Attribute. No single measure. Highly Subjective. Use of Proxies.
Completeness. How many fields have been completed?
Record contains all the information required to have an ideal representation of the described resource.
Metadata Record
Name uk-civil-service-high-earners
ID 68addaac-59ae-4230-bb67-c5a8f6a76285
Maintainer
Maintainer Email
Author Civil Service Capability Group
Author Email [email protected]
License ID uk-ogl
Resources
Size 40959
Description Civil Servants Salaries 2010
Format CSV
Size
Description Civil Servants Salaries 2011
Format CSV
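A minimal sketch of how such a score could be computed, assuming completeness is the fraction of non-empty fields among all fields of the record (the slide's own formula is not reproduced here). The dictionary layout, the key names, and the helper names `leaf_values`, `is_filled`, and `completeness` are illustrative assumptions, not the Metadata Census code.

```python
def leaf_values(node):
    """Yield every leaf value of a nested record (dicts and lists are descended into)."""
    if isinstance(node, dict):
        for value in node.values():
            yield from leaf_values(value)
    elif isinstance(node, list):
        for item in node:
            yield from leaf_values(item)
    else:
        yield node


def is_filled(value):
    """A field counts as completed unless it is None or a blank string."""
    if value is None:
        return False
    if isinstance(value, str):
        return value.strip() != ""
    return True


def completeness(record):
    """Fraction of completed fields, a value in [0, 1] (assumed definition)."""
    values = list(leaf_values(record))
    if not values:
        return 0.0
    return sum(1 for v in values if is_filled(v)) / len(values)


# The example record from the slide, with assumed key names.
record = {
    "name": "uk-civil-service-high-earners",
    "id": "68addaac-59ae-4230-bb67-c5a8f6a76285",
    "maintainer": "",
    "maintainer_email": "",
    "author": "Civil Service Capability Group",
    "author_email": "[email protected]",
    "license_id": "uk-ogl",
    "resources": [
        {"size": 40959, "description": "Civil Servants Salaries 2010", "format": "CSV"},
        {"size": None, "description": "Civil Servants Salaries 2011", "format": "CSV"},
    ],
}

score = completeness(record)  # 10 of 13 fields are filled
```

Under this assumed definition the two empty maintainer fields and the missing size lower the score to 10/13.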
Weighted Completeness. Not all fields are equally relevant.
The weight value of a field expresses the relative importance of that field.
Metadata Record
Name uk-civil-service-high-earners
ID 68addaac-59ae-4230-bb67-c5a8f6a76285
Maintainer
Maintainer Email
Author Civil Service Capability Group
Author Email [email protected]
License ID uk-ogl
Resources
Size 40959
Description Civil Servants Salaries 2010
Format CSV
Size
Description Civil Servants Salaries 2011
Format CSV
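As a sketch of the weighting idea, assume the score is the sum of the weights of all completed fields divided by the sum of all weights. The weights below are invented for illustration, only top-level fields are considered, and `weighted_completeness` is a hypothetical helper name.

```python
def weighted_completeness(record, weights):
    """Weight of completed fields divided by total weight, a value in [0, 1]."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    achieved = 0.0
    for field, weight in weights.items():
        value = record.get(field)
        # A field is completed if it is present and not a blank string.
        if value is not None and str(value).strip() != "":
            achieved += weight
    return achieved / total


# Hypothetical weights: name and license matter most, maintainer details least.
weights = {
    "name": 3,
    "license_id": 3,
    "author": 2,
    "maintainer": 1,
    "maintainer_email": 1,
}

record = {
    "name": "uk-civil-service-high-earners",
    "license_id": "uk-ogl",
    "author": "Civil Service Capability Group",
    "maintainer": "",
    "maintainer_email": "",
}

score = weighted_completeness(record, weights)  # (3 + 3 + 2) / 10
```

With these assumed weights the missing maintainer fields cost only 2 of 10 weight points, whereas an unweighted count would penalize them as heavily as a missing name.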
Accuracy. How accurately is the resource represented?
Semantic distance. The difference between the information a user can extract from the record and from the resource.
Metadata Record
Name regional-household-income
ID 98899446-0a1a-43bc-874c-2d54dc700670
Maintainer
Maintainer Email
Author Office for National Statistics
Author Email
License ID uk-ogl
Resources
URL http://www.ons.gov.uk/ons/rhi13
Description Spring 2013
Format CSV
URL http://www.ons.gov.uk/ons/rhi14
Description
Format CSV
CSV
HTML
Richness of Information. How much value is added?
$q_i(record) = \frac{\sum_{i=1}^{n} I(field_i)}{n}$
Vocabulary terms and descriptions should be meaningful. Information should be unique and not redundant.
$m$: Number of Documents; $n$: Number of Words
Readability. How readable are the descriptions? Readable in terms of cognitive accessibility.
Flesch-Kincaid Reading Ease
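A rough sketch of a readability score based on the standard Flesch Reading Ease formula, 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words). The syllable counter is a crude vowel-group heuristic for English, and clamping the result to [0, 100] and scaling it to [0, 1] is an assumption made here so that the value fits the metric range; the function names are illustrative.

```python
import re


def count_syllables(word):
    """Crude heuristic: one syllable per group of consecutive vowels, at least one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text):
    """Flesch Reading Ease of an English text, or None if the text has no words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return None
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


def readability(text):
    """Reading ease clamped to [0, 100] and scaled to [0, 1] (assumed normalization)."""
    score = flesch_reading_ease(text)
    if score is None:
        return 0.0
    return min(100.0, max(0.0, score)) / 100.0


easy = readability("The cat sat on the mat.")
empty = readability("")
```

Short descriptions such as "Spring 2013" contain too little text for the formula to be meaningful, so a real metric would probably need a minimum-length rule.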
Availability. Are the links working?
Metadata only links to the resources. Without working links the actual data is not available.
The indicator for the i-th resource is true if that resource is reachable through its URL.
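One way to sketch this check, assuming availability is the fraction of resource URLs that answer an HTTP HEAD request without an error. The record layout and the names `is_reachable` and `availability` are illustrative assumptions; the reachability check is passed in as a parameter so it can be replaced by a stub.

```python
import urllib.error
import urllib.request


def is_reachable(url, timeout=5.0):
    """Return True if a HEAD request to the URL succeeds with a status below 400."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, ValueError, OSError):
        # Covers HTTP errors, DNS failures, timeouts, and malformed URLs.
        return False


def availability(record, check=is_reachable):
    """Fraction of resources whose URL is reachable, a value in [0, 1]."""
    resources = record.get("resources", [])
    if not resources:
        return 0.0
    reachable = 0
    for resource in resources:
        url = resource.get("url")
        if url and check(url):
            reachable += 1
    return reachable / len(resources)


record = {
    "resources": [
        {"url": "http://www.ons.gov.uk/ons/rhi13"},
        {"url": "http://www.ons.gov.uk/ons/rhi14"},
    ]
}

# Offline demonstration with a stub that pretends only the first link works.
score = availability(record, check=lambda url: url.endswith("rhi13"))
```

Some servers reject HEAD requests, so a production check might fall back to GET; that refinement is left out of this sketch.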
Implementation.
Metadata Census
REQUIREMENTS
Functional
- Metadata Harvester
- Schemaless Data Store
- Quality Metrics
- Visualization
- Leaderboard
Non-functional
- Scalability
- Extensibility
Repository
+ url : String
+ name : String
+ type : Symbol
Snapshot
+ date : Date
MetaMetadata
+ metadata_record : Hash
+ score : Float
+ statistics : Hash
+ completeness : Hash
+ weighted_completeness : Hash
+ richness_of_information : Hash
...
+ latitude : String
+ longitude : String
+ best_record() : MetaMetadata
+ worst_record() : MetaMetadata
+ score() : Float
0..* 1..*
DESIGN.
CompletenessMetric
WeightedCompleteness
<<Interface>>
Metric
+ compute(record)
MetricWorker
+ perform(snapshot, metric)
GenericMetricWorker
CompletenessMetricWorker
OpennessMetric
<<use>>
<<use>>
<<use>>
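The class names above suggest a small plug-in design: every metric implements `compute(record)`, and a worker applies a metric to all records of a snapshot. The following is a sketch of that idea in Python, not the original implementation; treating a snapshot as a plain list of record dictionaries and the body of `CompletenessMetric` are assumptions.

```python
from abc import ABC, abstractmethod


class Metric(ABC):
    """Interface: a metric maps one metadata record to a score in [0, 1]."""

    @abstractmethod
    def compute(self, record):
        ...


class CompletenessMetric(Metric):
    """Assumed definition: fraction of top-level fields that are not empty."""

    def compute(self, record):
        values = list(record.values())
        if not values:
            return 0.0
        filled = sum(1 for v in values if v not in (None, ""))
        return filled / len(values)


class GenericMetricWorker:
    """Applies any metric to every record of a snapshot."""

    def perform(self, snapshot, metric):
        return [metric.compute(record) for record in snapshot]


# A snapshot is modeled here as a list of record dictionaries (an assumption).
snapshot = [
    {"name": "regional-household-income", "maintainer": ""},
    {"name": "uk-civil-service-high-earners", "maintainer": "Margaret Jarmon"},
]

scores = GenericMetricWorker().perform(snapshot, CompletenessMetric())
```

Adding a further metric then only requires another `Metric` subclass; the worker stays unchanged, which is one way to meet the extensibility requirement.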
[Architecture diagram of Metadata Census, built up over four slides. Components: Metadata Harvester, JSON Archives, Dump Importer, Preliminary Analyzer, Database, Metric Processor, Scheduler, Analyzer, View, User. Labels on the connections: API, Requests, Records, Imports, Persist, Query, Generates, Investigates.]
Open Government Data.
Evaluation
Implementation focused exclusively on CKAN repositories.
Rank  Repository                Score  Misspelling  Richness of Information  Openness  Completeness  Availability  Weighted Completeness  Readability  Accuracy
1     data.gc.ca                74     97           86                       80        79            79            81                     71           20
2     data.sa.gov.au            71     98           63                       94        77            86            82                     72           0
3     GovData.de                67     99           4                        38        55            81            87                     79           56
4     data.qld.gov.au           66     99           67                       96        73            60            78                     59           0
4     PublicData.eu             66     98           84                       69        64            70            67                     42           32
4     data.gov.uk               66     97           85                       69        62            74            67                     44           28
4     africaopendata.org        66     100          20                       78        70            87            68                     55           53
5     datos.codeandomexico.org  65     100          55                       84        65            100           75                     37           0
6     catalogodatos.gub.uy      63     100          64                       1         70            74            78                     65           52
6     data.openpolice.ru        63     100          0                        0         58            100           81                     100          64
7     dados.gov.br              61     100          87                       36        53            57            72                     44           39
8     opendata.admin.ch         59     100          12                       0         58            100           68                     35           100
9     data.gv.at                57     100          21                       99        51            68            65                     59           0
10    data.gov.sk               49     100          51                       0         48            92            58                     37           7
Conclusion
What is good about this approach?
Metadata quality is quantified, with every quality aspect measured on its own. Metric scores are aggregated to make repositories comparable.
Every additional quality metric is supposed to complete the quality puzzle.
Automated — Generic — Quantifiable — Repeatable
Platform has the advantage that it acts as a beacon...
If your metadata breaks bad everyone will see it.
What is not so good about this approach?
- Lacks a sufficient number of quality metrics
- No empirical analysis beforehand
- Overvalues problems with the metadata
More quality metrics are necessary. Current metrics need to consider more special cases in the metadata records.
Final Thought. Do not aim for excellence, aim for low-quality metadata.
Quality Feed. Monitor metadata changes live and record changes in a timeline.
Repository Support. There are other repository software platforms with public APIs, Socrata being the most prominent.
More Quality Metrics
- Duplicate Detection
- Discoverability
- Coherence
- Advancement
- Reputation
Metadata Revision System. Store only the change set instead of whole snapshots.
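As an illustration of what such a change set could look like, the sketch below compares two versions of a flat metadata record and keeps only the fields that differ. The function name `change_set` and the (before, after) tuple representation are assumptions, not part of the presented system.

```python
def change_set(old, new):
    """Return {field: (before, after)} for every field that differs between versions."""
    changes = {}
    for key in old.keys() | new.keys():
        before, after = old.get(key), new.get(key)
        if before != after:
            changes[key] = (before, after)
    return changes


old = {
    "name": "regional-household-income",
    "maintainer": "Margaret Jarmon",
    "format": "CSV",
}
new = {
    "name": "regional-household-income",
    "maintainer": "",
    "format": "CSV",
    "license_id": "uk-ogl",
}

diff = change_set(old, new)
```

Here only the emptied maintainer and the newly added license are recorded; unchanged fields take no storage. A field absent from one version is represented as None, so this sketch cannot distinguish a missing field from one explicitly set to None.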
Domain-Specific Language. Make it even easier to add individual quality metrics.
DEMO
metadata-census.com