Data Quality: Towards a Common Validator
Christian Gendreau, Anne Bruneau, David Shorthouse
Université de Montréal, Biodiversity Centre
What is data quality?
● Relative
● Fitness for use
  o Coordinate precision for distribution
  o Hierarchy not provided
● What – When – Where
Examples
● 2008 VI 13, 2008-06-13, 13-06-2008, June 13 2008, 13 junho 2008
● Canada, Québec, Montréal, -73.55399 45.508669
● Narwalus microcephalus => Monodon monoceros
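The date variants listed above all describe the same day. A minimal sketch of how such values could be normalised to ISO 8601 follows; the function name `normalize_date` and the two lookup tables are hypothetical and cover only the formats shown on this slide, not the rules used by any existing tool.

```python
import re
from datetime import date

# Hypothetical lookup tables; a real dictionary would be much larger.
ROMAN_MONTHS = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
                "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12}
MONTH_NAMES = {"june": 6, "junho": 6}  # English and Portuguese, June only


def _named_month(token):
    """Return a month number for a Roman numeral or known month name."""
    if token.upper() in ROMAN_MONTHS:
        return ROMAN_MONTHS[token.upper()]
    return MONTH_NAMES.get(token.lower())


def _is_number(token):
    return re.fullmatch(r"\d{1,2}", token) is not None


def normalize_date(text):
    """Return an ISO 8601 date string, or None if the pattern is unknown."""
    tokens = [t for t in re.split(r"[\s\-/,]+", text.strip()) if t]
    if len(tokens) != 3:
        return None
    year_positions = [i for i, t in enumerate(tokens)
                      if re.fullmatch(r"\d{4}", t)]
    if len(year_positions) != 1:
        return None
    year_index = year_positions[0]
    year = int(tokens[year_index])
    first, second = tokens[:year_index] + tokens[year_index + 1:]

    if _named_month(first) is not None and _is_number(second):
        month, day = _named_month(first), int(second)
    elif _named_month(second) is not None and _is_number(first):
        month, day = _named_month(second), int(first)
    elif _is_number(first) and _is_number(second):
        if year_index == 0:      # year first: assume year-month-day
            month, day = int(first), int(second)
        elif year_index == 2:    # year last: assume day-month-year
            day, month = int(first), int(second)
        else:
            return None
    else:
        return None

    try:
        return date(year, month, day).isoformat()
    except ValueError:           # e.g. month 13 or day 32
        return None
```

The day-month-year assumption for year-last numeric dates is itself a judgement call: a value such as 06-13-2008 is rejected here rather than guessed, which is the kind of case a validator would need to flag for review.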
Public Domain: Freshwater and Marine Image Bank, University of Washington Libraries Digital Collections
Data Quality Information Chain
Courtesy of Arthur D. Chapman
Brief History
● First Canadensys Explorer/Harvester (2012)
● narwhal-processor (2013)
● TDWG 2013
  o DQ Interest Group
  o Presentation about our plan
  o Discussion with GBIF
Why do we need a validator?
● Identify and quantify potential issues
● DarwinCore is permissive
  o DarwinCore itself can change
● Records and technologies will always evolve
What should a validator allow?
● Define a validation scope
  o at the source (e.g. collection)
  o national node
  o aggregator
  o GBIF
Validator - Expected design
● Modular, scalable, reusable
● Customizable
  o per configuration/extension
  o use user defined dictionary
● Validation Chain
Current Options
● GBIF validator
● CRIA tools
● ALA tools
Probably all organisations have their own tools.
dwca-validator
● Starting from previous GBIF validator
● Building a community project
● Provide framework for Biodiversity Data Quality Interest Group (TDWG)
https://github.com/gbif/dwca-validator
Vision
• Library
  o Core module, reusable (e.g. IPT)
• Web
  o Send archive, view report
• narwhal-processor
  o Suggest interpreted value
• Extensions
  o Domain knowledge / Quality index
Validation chain
● Chain element
  o Self contained (never relies on another chain element)
  o Ordering independent
● Composed chain element (narwhal and extensions)
  o Wrap chain elements under a new element
  o Ordering possible between wrapped elements
Chain element example
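The example shown on the original slide is not part of this transcript. As a stand-in, here is a minimal sketch of the two ideas from the previous slide: self-contained, order-independent chain elements, and a composed element that wraps others and may impose an order. All class and function names (`ChainElement`, `RequiredField`, `LatitudeRange`, `ComposedElement`, `run_chain`) are hypothetical and are not taken from dwca-validator.

```python
from dataclasses import dataclass
from typing import Dict, List

Record = Dict[str, str]


@dataclass(frozen=True)
class Result:
    element: str
    passed: bool
    message: str = ""


class ChainElement:
    """A self-contained check: it never relies on another element."""
    name = "element"

    def evaluate(self, record: Record) -> List[Result]:
        raise NotImplementedError


class RequiredField(ChainElement):
    def __init__(self, field: str):
        self.field = field
        self.name = f"required:{field}"

    def evaluate(self, record: Record) -> List[Result]:
        ok = bool(record.get(self.field, "").strip())
        return [Result(self.name, ok, "" if ok else f"{self.field} is missing")]


class LatitudeRange(ChainElement):
    name = "range:decimalLatitude"

    def evaluate(self, record: Record) -> List[Result]:
        # Handles missing or non-numeric input itself instead of assuming
        # that RequiredField has already run.
        try:
            value = float(record.get("decimalLatitude", ""))
        except ValueError:
            return [Result(self.name, False, "decimalLatitude is not a number")]
        ok = -90.0 <= value <= 90.0
        return [Result(self.name, ok, "" if ok else "latitude out of range")]


class ComposedElement(ChainElement):
    """Wraps elements under a new element; wrapped elements run in order
    and evaluation stops at the first failure."""

    def __init__(self, name: str, elements: List[ChainElement]):
        self.name = name
        self.elements = elements

    def evaluate(self, record: Record) -> List[Result]:
        results: List[Result] = []
        for element in self.elements:
            step = element.evaluate(record)
            results.extend(step)
            if any(not r.passed for r in step):
                break
        return results


def run_chain(elements: List[ChainElement], record: Record) -> List[Result]:
    """Run every element; the order of `elements` does not affect the
    set of results produced."""
    results: List[Result] = []
    for element in elements:
        results.extend(element.evaluate(record))
    return results
```

Because each element handles its own missing-value case, running `[RequiredField("decimalLatitude"), LatitudeRange()]` forwards or backwards yields the same results, while `ComposedElement` is the only place where order matters.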
Validation types
● Structure
  o metadata
  o organization of data
● Rows
  o dates, coordinates, ...
● Columns
  o ID uniqueness
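The three validation types differ in how much of the data they need to see. A minimal sketch, assuming rows are plain dictionaries keyed by Darwin Core terms such as `taxonID` and `scientificName`; the function names are hypothetical.

```python
from collections import Counter


def missing_columns(header, required=("taxonID", "scientificName")):
    """Structure check: needs only the header, not the data."""
    return [column for column in required if column not in header]


def row_issues(row):
    """Row check: looks at one record at a time."""
    issues = []
    if not row.get("scientificName", "").strip():
        issues.append("scientificName is empty")
    return issues


def duplicate_ids(rows, id_field="taxonID"):
    """Column check: needs every value of one column to test uniqueness."""
    counts = Counter(row.get(id_field, "") for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)
```

Row checks can be streamed record by record, whereas the column check has to keep state across the whole file, which matters when scaling to large archives.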
Result Accumulator
● Records validation results as they occur
  o ID/Validator/Context/ValidationType/Result/Message
● Allows different views of results
  o Web view
  o Feed another application
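A minimal sketch of such an accumulator, using the six fields listed above. The class names, the `summary` view and the JSON Lines export are assumptions made for illustration, not the dwca-validator API.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ValidationResult:
    record_id: str
    validator: str
    context: str
    validation_type: str
    result: str
    message: str


class ResultAccumulator:
    """Collects results as they occur and exposes several views of them."""

    def __init__(self):
        self._results = []

    def add(self, result: ValidationResult) -> None:
        self._results.append(result)

    def summary(self) -> Counter:
        """Counts per (validation type, result), e.g. for a web report."""
        return Counter((r.validation_type, r.result) for r in self._results)

    def to_json_lines(self) -> str:
        """One JSON object per line, e.g. to feed another application."""
        return "\n".join(json.dumps(asdict(r)) for r in self._results)
```

Keeping the accumulator separate from the checks means that a new view can be added without touching any validation code.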
Current Status
● Library with CLI (command line interface)
● Basic evaluators and rules
● Ready for contributions
Demo
• Darwin Core Archive, Taxon Checklist
  o Invalid characters
  o Broken link between synonym and accepted taxon
Lynx canadensis, http://www.animalgalleries.org/
Future validations
● Use semantic web (e.g. GeoNames)
● Use external resolver (e.g. CoL)
● Use more complex validation (e.g. climate layer)
Future validations
Accommodate localisation vs misspellings
● Brésil (fr)
● Brazil (en)
● Brasil (pt)
● Brasilien (se)
● Brézil (??)
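One way to separate the two cases is to treat an exact hit in a dictionary of localised names as valid and to use fuzzy matching only for values that miss the dictionary. The sketch below uses Python's standard `difflib`; the dictionary contents, the function name `classify_country` and the 0.8 cutoff are illustrative assumptions.

```python
import difflib

# Hypothetical user-defined dictionary: localised name -> ISO country code.
COUNTRY_NAMES = {
    "brésil": "BR",
    "brazil": "BR",
    "brasil": "BR",
    "brasilien": "BR",
}


def classify_country(value, cutoff=0.8):
    """Return (status, detail) for a country string.

    status is "known" (detail is the country code), "possible misspelling"
    (detail is the closest dictionary entry) or "unknown" (detail is None).
    """
    key = value.strip().lower()
    if key in COUNTRY_NAMES:
        return ("known", COUNTRY_NAMES[key])
    close = difflib.get_close_matches(key, list(COUNTRY_NAMES), n=1, cutoff=cutoff)
    if close:
        return ("possible misspelling", close[0])
    return ("unknown", None)
```

With this dictionary, "Brasil" is accepted as a known localised form, while "Brézil" is reported as a possible misspelling together with a suggested entry rather than being silently corrected.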
Questions?
Public Domain: robynm
Acknowledgements
Contact
http://www.canadensys.net
http://github.com/Canadensys
@Canadensys