Post on 11-May-2015
Data quality challenges in the Canadensys network of
occurrence records: examples, tools, and solutions
Christian Gendreau, David Shorthouse & Peter Desmet
Game plan
• Introduction to Canadensys
• Data quality @ Canadensys
• Canadensys processing solutions
• Numbers from Canadensys
• Hopes and expectations
A network of people and collections
Canadensys Headquarters Université de Montréal Biodiversity Centre
data.canadensys.net/vascan
data.canadensys.net/ipt
data.canadensys.net/explorer
Data quality related activities From an aggregator perspective
During data entry
• Help to avoid typographical errors
• Help to convert verbatim data
Actor : data entry person
Before publication
Actor : data publisher
• Detect file character encoding issues
• Detect duplicate or missing IDs
Previous Activity: Data entry
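The two pre-publication checks can be sketched in a few lines of Python. This is a minimal illustration, not the Canadensys tooling: the function names check_ids and is_valid_utf8 are hypothetical, and UTF-8 is assumed as the target encoding.

```python
from collections import Counter


def check_ids(ids):
    """Return (missing_rows, duplicated_ids) for a list of record IDs.

    Rows are numbered from 1. An ID counts as missing when it is None
    or blank after stripping whitespace.
    """
    missing = [row for row, value in enumerate(ids, start=1)
               if value is None or not value.strip()]
    counts = Counter(value.strip() for value in ids
                     if value is not None and value.strip())
    duplicated = sorted(value for value, n in counts.items() if n > 1)
    return missing, duplicated


def is_valid_utf8(raw: bytes) -> bool:
    """Return True when the bytes decode cleanly as UTF-8 (assumed encoding)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print(check_ids(["a1", "a2", "", "a1", None]))
    print(is_valid_utf8("Québec".encode("latin-1")))
```

A file saved as Latin-1 but declared as UTF-8 typically fails the second check as soon as it contains an accented character, which is how many encoding problems surface in practice.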
During aggregation
• Process data: validation, cleaning
• Produce structured reports: quality control
Actor : data aggregator
Previous Activity: Before publication
After aggregation
• Allow and facilitate community feedback
• Help data publishers to integrate corrections
Actor : users and community
Previous Activity: Aggregation
Canadensys tools during data entry
data.canadensys.net/tools
Why do we process data?
• Enrich our Explorer, http://data.canadensys.net
• Provide structured reports to data providers
• Help identify records that need re-examination
• Help to improve data entry procedures
Data processing
Processing solutions Narwhals to the rescue
Narwhal image Public Domain
The narwhal-processor approach
• Single-field processing to allow complex processing (combined fields)
• Processors with a common interface ease integration and usage
• Collaboration
https://github.com/Canadensys/narwhal-processor
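The idea of single-field processors behind a common interface can be sketched as follows. narwhal-processor itself is a Java library, and nothing here reproduces its API: the names Result, Processor, LookupProcessor and process_record, and the tiny lookup tables, are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


@dataclass
class Result:
    raw: str
    value: Optional[str]  # None when the raw text could not be interpreted
    note: str = ""


class Processor(Protocol):
    """Common interface: every processor turns one raw field into a Result."""

    def process(self, raw: str) -> Result: ...


class LookupProcessor:
    """Hypothetical single-field processor backed by a lookup table."""

    def __init__(self, table: Dict[str, str]):
        self.table = {key.casefold(): code for key, code in table.items()}

    def process(self, raw: str) -> Result:
        key = raw.strip().rstrip(".").casefold()
        value = self.table.get(key)
        note = "" if value is not None else "unrecognised value"
        return Result(raw, value, note)


# Illustrative tables only; a real deployment would use full ISO 3166 lists.
COUNTRY = LookupProcessor({"USA": "US", "United States": "US", "Canada": "CA"})
PROVINCE = LookupProcessor({"Qué": "CA-QC", "QC": "CA-QC", "Quebec": "CA-QC"})


def process_record(record: Dict[str, str],
                   processors: Dict[str, Processor]) -> Dict[str, Result]:
    """Apply one processor per field; the shared interface keeps this generic."""
    return {field: proc.process(record.get(field) or "")
            for field, proc in processors.items()}


if __name__ == "__main__":
    out = process_record({"country": "USA", "stateProvince": "Qué."},
                         {"country": COUNTRY, "stateProvince": PROVINCE})
    for field, result in out.items():
        print(field, result)
```

Because each processor only sees one field, combined-field checks can be built on top by comparing the Result objects rather than the raw text.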
Data usability before processing (chart: % of non-null clean verbatim data)
country text 92%, state/province text 60%, coordinates 96%, dates 44%
Data usability after processing
• 7% of provided country text: USA → ISO 3166-2:US, United States
• 16% of provided state/province text: Qué → ISO 3166-2 CA-QC, Quebec
• 4% of provided coordinates: 45° 32' 25" N, 129° 40' 31" W → 45.5402778, -129.6752778
• 42% of provided dates: 2008 VI 13 → 2008-06-13
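The date example (2008 VI 13 to 2008-06-13) can be reproduced with a short sketch. It assumes a single verbatim pattern, year, Roman-numeral month, day, which is only one of many formats a real processor has to handle; the function name normalize_date is hypothetical.

```python
import datetime
import re
from typing import Optional

ROMAN_MONTHS = {
    "I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
    "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12,
}

# Assumed layout: four-digit year, Roman-numeral month, one- or two-digit day.
DATE_PATTERN = re.compile(r"\s*(\d{4})\s+([IVX]+)\s+(\d{1,2})\s*")


def normalize_date(verbatim: str) -> Optional[str]:
    """Return an ISO 8601 date string, or None if the text is not understood."""
    match = DATE_PATTERN.fullmatch(verbatim.upper())
    if match is None:
        return None
    month = ROMAN_MONTHS.get(match.group(2))
    if month is None:
        return None
    try:
        # datetime.date rejects impossible dates such as 31 June.
        return datetime.date(int(match.group(1)), month,
                             int(match.group(3))).isoformat()
    except ValueError:
        return None


if __name__ == "__main__":
    print(normalize_date("2008 VI 13"))  # 2008-06-13
```

Returning None rather than guessing keeps unparsed values available for a report back to the data publisher.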
Data usability including processed data (chart: % of non-null provided data, clean verbatim / processed)
country text 92% / 7%, state/province text 60% / 16%, coordinates 96% / 4%, dates 44% / 42%
Projects with data quality tools
• Atlas of Living Australia
• GBIF Norway, GBIF Spain, National Biodiversity Network, BioVeL …
• GBIF libraries
• Most nodes have their own data quality routine
Hopes and expectations
We do not want to
• Maintain taxonomic authority files
• Maintain country, province and city lists
We prefer to
• Efficiently use specialized resources/services
• Provide reports and quality indices
Help from the Semantic Web
• Data in other languages (French, Spanish, …) should not be flagged as errors
• Misspellings should be shared as a common resource (e.g. SKOS)
• Understand historical data (e.g. collected in the USSR in 1980)
Reporting and logs
• Darwin Core annotations for processed data
• Shared vocabulary for structured reports and quality indices
Summary
• Tools available for sharing
• Use, review, contribute
• Opportunity for broad coordination and increased efficiencies
Thanks
Anne Bruneau, Institut de recherche en biologie végétale and Département de Sciences Biologiques, Université de Montréal
Contact http://www.canadensys.net http://github.com/Canadensys @Canadensys
Gulo gulo, Larry Master (www.masterimages.org)
Multi-field processing
DwC field | Raw data | Processed data
verbatimLatitude | 45°30′N | 45.5
verbatimLongitude | 73°34′W | -73.5666667
country | Canada | Canada
stateProvince | QC | Quebec
municipality | Montreal City | Montreal
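The coordinate rows above amount to a degrees/minutes/seconds to decimal-degrees conversion. The sketch below is an assumption-laden illustration rather than the narwhal-processor implementation: dms_to_decimal is a hypothetical name, only one verbatim layout (degrees, optional minutes, optional seconds, hemisphere letter) is accepted, and results are rounded to seven decimals to match the slide.

```python
import re
from typing import Optional

# Accepted layout (an assumption): 45°30′N, 45° 32' 25" N, 73°34′W, ...
DMS_PATTERN = re.compile(
    r"\s*(\d{1,3})\s*°"                     # degrees
    r"\s*(?:(\d{1,2})\s*['′])?"             # optional minutes
    r'\s*(?:(\d{1,2}(?:\.\d+)?)\s*["″])?'   # optional seconds
    r"\s*([NSEW])\s*"                       # hemisphere letter
)


def dms_to_decimal(text: str) -> Optional[float]:
    """Convert a verbatim DMS coordinate to decimal degrees, or return None."""
    match = DMS_PATTERN.fullmatch(text)
    if match is None:
        return None
    degrees, minutes, seconds, hemisphere = match.groups()
    minutes_value = int(minutes) if minutes is not None else 0
    seconds_value = float(seconds) if seconds is not None else 0.0
    if minutes_value >= 60 or seconds_value >= 60:
        return None
    value = int(degrees) + minutes_value / 60 + seconds_value / 3600
    limit = 90 if hemisphere in "NS" else 180
    if value > limit:
        return None
    if hemisphere in "SW":
        value = -value  # south and west are negative by convention
    return round(value, 7)


if __name__ == "__main__":
    print(dms_to_decimal("45°30′N"))   # 45.5
    print(dms_to_decimal("73°34′W"))
```

Range checks on minutes, seconds and the final value are what separate a usable coordinate from one that should be reported back for re-examination.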
Multi-field processing
1. Get information on coordinates: 45.5, -73.5666667
2. Compare with processed data
3. Assert that these coordinates are in Montréal
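The three steps can be sketched as a consistency check between the processed coordinates and the processed municipality. Everything named here is hypothetical: the GAZETTEER dictionary stands in for whatever geographic service is actually used, and the bounding box for Montreal is a rough illustrative rectangle, not an authoritative boundary.

```python
from typing import Dict, NamedTuple, Optional


class BoundingBox(NamedTuple):
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float

    def contains(self, lat: float, lon: float) -> bool:
        return (self.min_lat <= lat <= self.max_lat
                and self.min_lon <= lon <= self.max_lon)


# Rough, illustrative extent only (assumption); a real check would query a
# gazetteer or polygon service instead of a hard-coded rectangle.
GAZETTEER: Dict[str, BoundingBox] = {
    "Montreal": BoundingBox(45.40, 45.71, -73.98, -73.47),
}


def coordinates_match_municipality(record: dict,
                                   gazetteer: Dict[str, BoundingBox]
                                   ) -> Optional[bool]:
    """Return True/False for a known municipality, None when it is unknown."""
    box = gazetteer.get(record["municipality"])
    if box is None:
        return None
    return box.contains(record["decimalLatitude"], record["decimalLongitude"])


if __name__ == "__main__":
    processed = {
        "decimalLatitude": 45.5,
        "decimalLongitude": -73.5666667,
        "municipality": "Montreal",
    }
    print(coordinates_match_municipality(processed, GAZETTEER))  # True
```

The check runs on processed values (Montreal, not Montreal City), which is why the single-field cleaning has to happen first.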