Christian Gendreau, David Shorthouse & Peter Desmet
![Page 1: Christian Gendreau , David Shorthouse & Peter Desmet](https://reader035.fdocuments.in/reader035/viewer/2022062501/5681667a550346895dda1cca/html5/thumbnails/1.jpg)
Data quality challenges in the Canadensys network of
occurrence records: examples, tools, and solutions
Christian Gendreau, David Shorthouse & Peter Desmet
![Page 2: Christian Gendreau , David Shorthouse & Peter Desmet](https://reader035.fdocuments.in/reader035/viewer/2022062501/5681667a550346895dda1cca/html5/thumbnails/2.jpg)
Game plan
• Introduction to Canadensys
• Data quality @ Canadensys
• Canadensys processing solutions
• Numbers from Canadensys
• Hopes and expectations
![Page 3: Christian Gendreau , David Shorthouse & Peter Desmet](https://reader035.fdocuments.in/reader035/viewer/2022062501/5681667a550346895dda1cca/html5/thumbnails/3.jpg)
A Network
Of people and collections
![Page 4: Christian Gendreau , David Shorthouse & Peter Desmet](https://reader035.fdocuments.in/reader035/viewer/2022062501/5681667a550346895dda1cca/html5/thumbnails/4.jpg)
Canadensys Headquarters Université de Montréal Biodiversity Centre
![Page 5: Christian Gendreau , David Shorthouse & Peter Desmet](https://reader035.fdocuments.in/reader035/viewer/2022062501/5681667a550346895dda1cca/html5/thumbnails/5.jpg)
data.canadensys.net/vascan
![Page 6: Christian Gendreau , David Shorthouse & Peter Desmet](https://reader035.fdocuments.in/reader035/viewer/2022062501/5681667a550346895dda1cca/html5/thumbnails/6.jpg)
data.canadensys.net/ipt
![Page 7: Christian Gendreau , David Shorthouse & Peter Desmet](https://reader035.fdocuments.in/reader035/viewer/2022062501/5681667a550346895dda1cca/html5/thumbnails/7.jpg)
data.canadensys.net/explorer
![Page 8: Christian Gendreau , David Shorthouse & Peter Desmet](https://reader035.fdocuments.in/reader035/viewer/2022062501/5681667a550346895dda1cca/html5/thumbnails/8.jpg)
Data quality related activities
From an aggregator perspective
![Page 9: Christian Gendreau , David Shorthouse & Peter Desmet](https://reader035.fdocuments.in/reader035/viewer/2022062501/5681667a550346895dda1cca/html5/thumbnails/9.jpg)
During data entry
• Help to avoid typographical errors
• Help to convert verbatim data
Actor: data entry person
![Page 10: Christian Gendreau , David Shorthouse & Peter Desmet](https://reader035.fdocuments.in/reader035/viewer/2022062501/5681667a550346895dda1cca/html5/thumbnails/10.jpg)
Before publication
• Detect file character encoding issues
• Detect duplicate or missing IDs
Actor: data publisher
Previous activity: Data entry
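The two pre-publication checks can be sketched in a few lines of Python. This is a minimal illustration, not the Canadensys implementation: the function names, the assumption that the file is a UTF-8 CSV, and the choice of `occurrenceID` as the identifier column are all assumptions made for the example.

```python
import csv
import io
from collections import Counter


def check_encoding(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (assumed target encoding)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def check_ids(raw: bytes, id_column: str = "occurrenceID"):
    """Return (duplicate_ids, line_numbers_with_missing_id) for a UTF-8 CSV file."""
    reader = csv.DictReader(io.StringIO(raw.decode("utf-8")))
    counts = Counter()
    missing_lines = []
    for line_number, row in enumerate(reader, start=2):  # the header is line 1
        value = (row.get(id_column) or "").strip()
        if value:
            counts[value] += 1
        else:
            missing_lines.append(line_number)
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    return duplicates, missing_lines


# Hypothetical sample: one duplicated ID (A1) and one row without an ID.
sample = "occurrenceID,country\nA1,Canada\nA2,Canada\nA1,USA\n,Canada\n".encode("utf-8")
print(check_encoding(sample))
print(check_ids(sample))
```

A file saved as Latin-1 with accented characters fails `check_encoding`, which is the kind of issue a publisher would want to catch before the data reach an aggregator.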
![Page 11: Christian Gendreau , David Shorthouse & Peter Desmet](https://reader035.fdocuments.in/reader035/viewer/2022062501/5681667a550346895dda1cca/html5/thumbnails/11.jpg)
During aggregation
• Process data: validation, cleaning
• Produce structured reports: quality control
Actor: data aggregator
Previous activity: Before publication
![Page 12: Christian Gendreau , David Shorthouse & Peter Desmet](https://reader035.fdocuments.in/reader035/viewer/2022062501/5681667a550346895dda1cca/html5/thumbnails/12.jpg)
After aggregation
• Allow and facilitate community feedback
• Help data publishers to integrate corrections
Actor: users and community
Previous activity: Aggregation
![Page 13: Christian Gendreau , David Shorthouse & Peter Desmet](https://reader035.fdocuments.in/reader035/viewer/2022062501/5681667a550346895dda1cca/html5/thumbnails/13.jpg)
Canadensys tools
during data entry
data.canadensys.net/tools
![Page 14: Christian Gendreau , David Shorthouse & Peter Desmet](https://reader035.fdocuments.in/reader035/viewer/2022062501/5681667a550346895dda1cca/html5/thumbnails/14.jpg)
Why do we process data?
• Enrich our Explorer, http://data.canadensys.net
• Provide structured reports to data providers
• Help identify records that need re-examination
• Help to improve data entry procedures
![Page 15: Christian Gendreau , David Shorthouse & Peter Desmet](https://reader035.fdocuments.in/reader035/viewer/2022062501/5681667a550346895dda1cca/html5/thumbnails/15.jpg)
Data processing
![Page 16: Christian Gendreau , David Shorthouse & Peter Desmet](https://reader035.fdocuments.in/reader035/viewer/2022062501/5681667a550346895dda1cca/html5/thumbnails/16.jpg)
Processing solutions
Narwhals to the rescue
Narwhal image: public domain
![Page 17: Christian Gendreau , David Shorthouse & Peter Desmet](https://reader035.fdocuments.in/reader035/viewer/2022062501/5681667a550346895dda1cca/html5/thumbnails/17.jpg)
The narwhal-processor approach
● Single field processing to allow complex processing (combined fields)
● Processors with a common interface ease integration and usage
● Collaboration
https://github.com/Canadensys/narwhal-processor
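The idea of single-field processors behind a common interface can be illustrated as follows. This is a Python sketch of the design idea only: the class and function names (`FieldProcessor`, `Result`, `process_record`) are hypothetical and are not the narwhal-processor API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Result:
    value: Optional[str]          # cleaned value, or None if processing failed
    error: Optional[str] = None   # human-readable reason for the failure


class FieldProcessor(ABC):
    """Hypothetical common interface: one processor cleans one field."""

    @abstractmethod
    def process(self, raw: str) -> Result:
        """Clean one verbatim value."""


class WhitespaceProcessor(FieldProcessor):
    def process(self, raw: str) -> Result:
        cleaned = " ".join(raw.split())
        if not cleaned:
            return Result(None, "empty value")
        return Result(cleaned)


class YearProcessor(FieldProcessor):
    def process(self, raw: str) -> Result:
        text = raw.strip()
        if text.isdigit() and len(text) == 4:
            return Result(text)
        return Result(None, f"not a four-digit year: {raw!r}")


def process_record(
    record: Dict[str, str], processors: Dict[str, FieldProcessor]
) -> Tuple[Dict[str, Optional[str]], Dict[str, str]]:
    """Apply one processor per field; fields without a processor pass through."""
    cleaned: Dict[str, Optional[str]] = {}
    errors: Dict[str, str] = {}
    for field, raw in record.items():
        processor = processors.get(field)
        if processor is None:
            cleaned[field] = raw
            continue
        result = processor.process(raw)
        cleaned[field] = result.value
        if result.error:
            errors[field] = result.error
    return cleaned, errors
```

Because every processor exposes the same `process` method, a record-level routine can combine them without knowing what each one does, and the collected errors can feed a structured report.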
![Page 18: Christian Gendreau , David Shorthouse & Peter Desmet](https://reader035.fdocuments.in/reader035/viewer/2022062501/5681667a550346895dda1cca/html5/thumbnails/18.jpg)
Data usability
before processing
[Bar chart: % of non-null clean verbatim data, by field]
country text: 92%
state/province text: 60%
coordinates: 96%
dates: 44%
![Page 19: Christian Gendreau , David Shorthouse & Peter Desmet](https://reader035.fdocuments.in/reader035/viewer/2022062501/5681667a550346895dda1cca/html5/thumbnails/19.jpg)
Data usability
after processing
• 7% of provided country text
USA → ISO 3166-2:US, United States
![Page 20: Christian Gendreau , David Shorthouse & Peter Desmet](https://reader035.fdocuments.in/reader035/viewer/2022062501/5681667a550346895dda1cca/html5/thumbnails/20.jpg)
Data usability
after processing
• 7% of provided country text
• 16% of provided state/province text
Qué → ISO 3166-2: CA-QC, Quebec
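A lookup of this kind can be sketched as a normalized-key dictionary. The tables below are tiny illustrative samples, not the Canadensys lists, and the function names are hypothetical. In this sketch country values map to ISO 3166-1 alpha-2 codes and state/province values to ISO 3166-2 codes.

```python
import unicodedata


def normalize_key(text: str) -> str:
    """Strip accents, case, and surrounding spaces or periods before lookup."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_accents.casefold().strip(" .")


# Illustrative samples only: (code, standard English name).
COUNTRIES = {
    "usa": ("US", "United States"),
    "united states": ("US", "United States"),
    "canada": ("CA", "Canada"),
}
PROVINCES = {
    "que": ("CA-QC", "Quebec"),
    "qc": ("CA-QC", "Quebec"),
    "quebec": ("CA-QC", "Quebec"),
}


def lookup(text: str, table: dict):
    """Return (code, name) for a verbatim value, or None if it is not recognized."""
    return table.get(normalize_key(text))


print(lookup("USA", COUNTRIES))
print(lookup("Qué", PROVINCES))
```

Values that return `None` stay unprocessed and can be listed in a report for the data provider rather than guessed at.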
![Page 21: Christian Gendreau , David Shorthouse & Peter Desmet](https://reader035.fdocuments.in/reader035/viewer/2022062501/5681667a550346895dda1cca/html5/thumbnails/21.jpg)
Data usability
after processing
• 7% of provided country text
• 16% of provided state/province text
• 4% of provided coordinates
45° 32' 25" N, 129° 40' 31" W → 45.5402778, -129.6752778
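The coordinate example follows the usual conversion: decimal degrees = degrees + minutes/60 + seconds/3600, negated for the southern and western hemispheres. A minimal Python sketch, assuming the exact degrees/minutes/seconds layout shown on the slide (the function names and the regular expression are illustrative, not the narwhal-processor code):

```python
import re

# Matches values such as: 45° 32' 25" N
DMS = re.compile(r'''(\d+)\s*°\s*(\d+)\s*'\s*(\d+(?:\.\d+)?)\s*"\s*([NSEW])''')


def dms_to_decimal(text: str) -> float:
    """Convert one degrees/minutes/seconds value to decimal degrees."""
    match = DMS.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized coordinate: {text!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    if hemisphere in "SW":  # south and west are negative
        value = -value
    return round(value, 7)


def parse_pair(text: str):
    """Split 'latitude, longitude' and convert both parts."""
    latitude, longitude = text.split(",")
    return dms_to_decimal(latitude), dms_to_decimal(longitude)


print(parse_pair('45° 32\' 25" N, 129° 40\' 31" W'))
```

Real verbatim coordinates come in many more layouts (decimal minutes, missing symbols, hemisphere first), so a production processor needs a broader set of patterns than this single one.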
![Page 22: Christian Gendreau , David Shorthouse & Peter Desmet](https://reader035.fdocuments.in/reader035/viewer/2022062501/5681667a550346895dda1cca/html5/thumbnails/22.jpg)
Data usability
after processing
• 7% of provided country text
• 16% of provided state/province text
• 4% of provided coordinates
• 42% of provided dates
2008 VI 13 → 2008-06-13
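The date example uses a Roman numeral for the month, a common convention on specimen labels. A minimal Python sketch that handles only the "year month day" layout from the slide (the function name is hypothetical, and other layouts would need their own rules):

```python
import datetime

ROMAN_MONTHS = {
    "I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
    "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12,
}


def parse_roman_date(text: str):
    """Return an ISO 8601 date for 'YYYY <roman month> DD', or None if invalid."""
    parts = text.split()
    if len(parts) != 3:
        return None
    year, month, day = parts
    month_number = ROMAN_MONTHS.get(month.upper())
    if month_number is None:
        return None
    try:
        return datetime.date(int(year), month_number, int(day)).isoformat()
    except ValueError:  # non-numeric parts or an impossible date such as 30 February
        return None


print(parse_roman_date("2008 VI 13"))
```

Returning `None` instead of guessing keeps unparseable dates visible, so they can be flagged for re-examination.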
![Page 23: Christian Gendreau , David Shorthouse & Peter Desmet](https://reader035.fdocuments.in/reader035/viewer/2022062501/5681667a550346895dda1cca/html5/thumbnails/23.jpg)
Data usability
including processed data
[Stacked bar chart: % of non-null provided, by field (clean verbatim + processed)]
country text: 92% + 7%
state/province text: 60% + 16%
coordinates: 96% + 4%
dates: 44% + 42%
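Assuming the chart stacks the processed share on top of the clean verbatim share, and that the values pair with the field labels in the order listed, the combined usable share per field is a simple sum:

```python
# Percentages as read from the two charts (pairing with fields is an assumption).
clean_verbatim = {"country text": 92, "state/province text": 60,
                  "coordinates": 96, "dates": 44}
processed = {"country text": 7, "state/province text": 16,
             "coordinates": 4, "dates": 42}

# Combined share of non-null provided values that are usable after processing.
usable = {field: clean_verbatim[field] + processed[field] for field in clean_verbatim}

for field, share in usable.items():
    print(f"{field}: {share}%")
```

Under those assumptions, dates benefit most from processing (44% to 86%), while state/province text remains the field with the largest unusable share.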
![Page 24: Christian Gendreau , David Shorthouse & Peter Desmet](https://reader035.fdocuments.in/reader035/viewer/2022062501/5681667a550346895dda1cca/html5/thumbnails/24.jpg)
Projects With Data Quality Tools
• Atlas of Living Australia
• GBIF Norway, GBIF Spain, National Biodiversity Network, BioVeL …
• GBIF libraries
• Most nodes have their own data quality routine
![Page 25: Christian Gendreau , David Shorthouse & Peter Desmet](https://reader035.fdocuments.in/reader035/viewer/2022062501/5681667a550346895dda1cca/html5/thumbnails/25.jpg)
Hopes and expectations
![Page 26: Christian Gendreau , David Shorthouse & Peter Desmet](https://reader035.fdocuments.in/reader035/viewer/2022062501/5681667a550346895dda1cca/html5/thumbnails/26.jpg)
We do not want to
• Maintain taxonomic authority files
• Maintain country, province and city lists
![Page 27: Christian Gendreau , David Shorthouse & Peter Desmet](https://reader035.fdocuments.in/reader035/viewer/2022062501/5681667a550346895dda1cca/html5/thumbnails/27.jpg)
We prefer to
• Efficiently use specialized resources/services
• Provide reports, quality indices
![Page 28: Christian Gendreau , David Shorthouse & Peter Desmet](https://reader035.fdocuments.in/reader035/viewer/2022062501/5681667a550346895dda1cca/html5/thumbnails/28.jpg)
Help from Semantic Web
• Data in other languages (French, Spanish, …) should not be flagged as errors
• Misspellings should be shared as a common resource (e.g. SKOS)
• Understand historical data (e.g. collected in USSR in 1980)
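How a shared vocabulary could serve all three points can be sketched with SKOS-style labels: `prefLabel` for the standard name, `altLabel` for names in other languages, and `hiddenLabel` for known misspellings. The concept identifiers, the sample labels, and the `valid` year range below are assumptions made for illustration. The year range in particular is not part of SKOS.

```python
# Hypothetical mini-vocabulary modelled on SKOS label types.
CONCEPTS = [
    {
        "id": "united-states",
        "prefLabel": {"en": "United States"},
        "altLabel": {"fr": ["États-Unis"], "es": ["Estados Unidos"]},
        "hiddenLabel": ["Untied States"],   # known misspelling
        "valid": (1776, None),              # (first year, last year or None)
    },
    {
        "id": "ussr",
        "prefLabel": {"en": "USSR"},
        "altLabel": {"fr": ["URSS"], "es": ["URSS"]},
        "hiddenLabel": [],
        "valid": (1922, 1991),
    },
]


def resolve(text, year=None):
    """Return (concept id, label type, valid for the given year), or None."""
    key = text.casefold()
    for concept in CONCEPTS:
        labels = {name.casefold(): "preferred" for name in concept["prefLabel"].values()}
        for names in concept["altLabel"].values():
            labels.update({name.casefold(): "alternative" for name in names})
        labels.update({name.casefold(): "misspelling" for name in concept["hiddenLabel"]})
        if key in labels:
            start, end = concept["valid"]
            in_period = year is None or (start <= year and (end is None or year <= end))
            return concept["id"], labels[key], in_period
    return None


print(resolve("États-Unis"))
print(resolve("Untied States"))
print(resolve("USSR", 1980))
```

With such a resource, a French or Spanish name resolves as a valid alternative rather than an error, a misspelling resolves to the intended concept, and a record collected in the USSR in 1980 is recognized as consistent with its date.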
![Page 29: Christian Gendreau , David Shorthouse & Peter Desmet](https://reader035.fdocuments.in/reader035/viewer/2022062501/5681667a550346895dda1cca/html5/thumbnails/29.jpg)
Reporting and log
• Darwin Core annotations for processed data
• Shared vocabulary for structured reports and quality indices
![Page 30: Christian Gendreau , David Shorthouse & Peter Desmet](https://reader035.fdocuments.in/reader035/viewer/2022062501/5681667a550346895dda1cca/html5/thumbnails/30.jpg)
Summary
• Tools available for sharing
• Use, review, contribute
• Opportunity for broad coordination and increased efficiencies
![Page 31: Christian Gendreau , David Shorthouse & Peter Desmet](https://reader035.fdocuments.in/reader035/viewer/2022062501/5681667a550346895dda1cca/html5/thumbnails/31.jpg)
Thanks
Anne Bruneau, Institut de recherche en biologie végétale and Département de Sciences Biologiques, Université de Montréal
![Page 32: Christian Gendreau , David Shorthouse & Peter Desmet](https://reader035.fdocuments.in/reader035/viewer/2022062501/5681667a550346895dda1cca/html5/thumbnails/32.jpg)
Contact
http://www.canadensys.net
http://github.com/Canadensys
@Canadensys
Gulo gulo, Larry Master (www.masterimages.org)