Untangling Names
Lessons learned (so far) from the linking of IPNI and TROPICOS
Julius Welby
RBG Kew
TROPICOS + IPNI
Why match?
Why is this difficult?
Variation
Calophyllum kiong K.Schum. & Lauterb.
Fl. Deutsch. Sudsee, 450.
Calophyllum kiong Lauterb. & K.Schum.
Die Flora der Deutschen Schutzgebiete in der Sudsee 1900
Duplication
• Poa annua L. -- Sp. Pl. 68. 1753 (GCI)
• Poa annua L. -- Species Plantarum 2 1753 (APNI)
• Poa annua L. -- Sp. Pl. 68. (IK)
Duplication
• Calophyllum microphyllum Scheff in Tijdschr. Nederl. Ind. xxxii. (1871) 406. (IK)
• Calophyllum microphyllum Planch. & Triana in Ann. Sc. Nat. Ser. IV. xv. (1861) 282. (IK)
• Calophyllum microphyllum T.Anders. Fl. Brit. Ind. (J. D. Hooker). i. 272. (IK)
Matching
Fields
1 | Calophyllum | Calophyllum
2 | kiong | kiong
3 | K.Schum. & Lauterb. | Lauterb. & K.Schum.
4 | Fl. Deutsch. Sudsee | Die Flora der Deutschen…
5 | 450. | 1900
Lesson 1
Speed matters
2,500 by 2,000 by 4 fields
20,000,000 comparisons
~5.5 hours at 1ms per comparison
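The estimate above can be checked with a few lines of arithmetic. This is only a sketch that reuses the slide's own figures; the variable names are illustrative.

```python
# Figures from the slide: two datasets, four fields compared per pair of names,
# and an estimated cost of 1 ms per field comparison.
names_a = 2_500
names_b = 2_000
fields = 4
ms_per_comparison = 1

comparisons = names_a * names_b * fields
hours = comparisons * ms_per_comparison / 1000 / 3600

print(f"{comparisons:,} comparisons, roughly {hours:.1f} hours")
```

The cost grows with the product of the two dataset sizes, which is why cutting the work done per pair matters so much.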
Be lazy
• Do as little as possible
• Do easy things if possible
• Do hard things only if necessary
• Only expend effort when it’s worth it
Be lazy
• Do as little as possible
– Specify fields as ‘must match’
– If a ‘must match’ field fails
• Mark the match as failed
• Stop comparing fields
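A minimal sketch of the ‘must match’ idea follows. It is not the framework's actual code: `FieldRule`, `match_records` and the field names are hypothetical, and exact string equality stands in for whatever comparison a field really needs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class FieldRule:
    """Hypothetical description of how one field is compared."""
    name: str
    compare: Callable[[str, str], bool]
    must_match: bool = False


def exact(a: str, b: str) -> bool:
    return a == b


def match_records(rec_a: Dict[str, str], rec_b: Dict[str, str],
                  rules: List[FieldRule]) -> Tuple[bool, Dict[str, bool]]:
    """Compare fields in order; stop as soon as a 'must match' field fails."""
    results: Dict[str, bool] = {}
    for rule in rules:
        ok = rule.compare(rec_a[rule.name], rec_b[rule.name])
        results[rule.name] = ok
        if not ok and rule.must_match:
            # Mark the match as failed and skip the remaining fields.
            return False, results
    return all(results.values()), results


rules = [
    FieldRule("genus", exact, must_match=True),
    FieldRule("species", exact, must_match=True),
    FieldRule("authors", exact),
]
rec_a = {"genus": "Calophyllum", "species": "kiong", "authors": "K.Schum. & Lauterb."}
rec_b = {"genus": "Poa", "species": "annua", "authors": "L."}

matched, results = match_records(rec_a, rec_b, rules)
print(matched, results)
```

Only the genus is compared for this pair; the species and author fields are never touched.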
Parameterised matching
species
infragenus
infraspecies
authors
rank …
How lazy?
Optimising
• The order of field matching is important
– Choose suitable fields to match first
– Aim to fail matches early
• Significant speed-up
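The effect of field order can be seen by counting comparisons on a toy dataset. This is an illustrative sketch only: `count_comparisons` is a hypothetical helper, every field is treated as ‘must match’, and the three records are made up from names used earlier in the slides.

```python
from itertools import product


def count_comparisons(records_a, records_b, field_order):
    """Count field comparisons when matching stops at the first failing field."""
    count = 0
    for a, b in product(records_a, records_b):
        for field in field_order:
            count += 1
            if a[field] != b[field]:
                break  # fail early: remaining fields are not compared
    return count


records_a = [
    {"genus": "Calophyllum", "species": "kiong"},
    {"genus": "Calophyllum", "species": "microphyllum"},
    {"genus": "Poa", "species": "annua"},
]
records_b = list(records_a)

genus_first = count_comparisons(records_a, records_b, ["genus", "species"])
species_first = count_comparisons(records_a, records_b, ["species", "genus"])
print(genus_first, species_first)
```

In this toy data the species epithet separates records better than the genus, so testing it first fails more pairs on the first comparison. Which field is the best one to test first depends on the real data.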
Also, for speed
• Do as little as possible
– Do escaping or standardisation once
– Done on import for each dataset
– Keep field matching functions clean
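One way to read the points above: clean each value a single time when a dataset is loaded, so the comparison functions can stay trivial. The sketch below assumes a very simple clean-up (trim, collapse whitespace, lower-case); the real standardisation rules are not given in the slides, and `standardise`, `import_dataset` and `authors_match` are hypothetical names.

```python
def standardise(value: str) -> str:
    """Illustrative clean-up only: trim, collapse runs of whitespace, lower-case."""
    return " ".join(value.split()).lower()


def import_dataset(rows):
    """Standardise every field once, at import time."""
    return [{field: standardise(value) for field, value in row.items()}
            for row in rows]


def authors_match(a, b) -> bool:
    """The matching function stays clean: no escaping or cleaning here."""
    return a["authors"] == b["authors"]


dataset_a = import_dataset([{"authors": "  K.Schum.  &  Lauterb. "}])
dataset_b = import_dataset([{"authors": "K.Schum. & Lauterb."}])
print(authors_match(dataset_a[0], dataset_b[0]))
```

With cleaning done per record rather than per comparison, its cost grows with the size of each dataset instead of with the product of the two.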
More speed optimisation
• Do easy things if possible
– Define cascading tests
– Do easy tests first, if practical
– Length comparisons
– Composition comparisons
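A cascade of this kind might look like the sketch below. The slides do not say which expensive test sits at the end of the cascade; plain Levenshtein edit distance is used here as an assumption, and `close_enough` and `max_edits` are hypothetical names. The length and composition checks are cheap lower bounds on the edit distance, so they can reject a pair without running the expensive step.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, the 'hard' test (assumed; not named in the slides)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def close_enough(a: str, b: str, max_edits: int = 1) -> bool:
    """Cascading tests: cheapest first, expensive only if still undecided."""
    if a == b:
        return True
    # Length comparison: the distance is at least the difference in length.
    if abs(len(a) - len(b)) > max_edits:
        return False
    # Composition comparison: the distance is at least the number of
    # characters one string has in surplus over the other.
    ca, cb = Counter(a), Counter(b)
    if max(sum((ca - cb).values()), sum((cb - ca).values())) > max_edits:
        return False
    return edit_distance(a, b) <= max_edits


print(close_enough("smithii", "smithi"))
print(close_enough("kiong", "microphyllum"))
```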
Speed Lessons
• Speed matters
• Minimise comparisons made
– ‘Must match’ parameters
– Match fields in an efficient order
• Do data cleaning once, up front
• Look for ways to fail matches cheaply
Accuracy
False +
False -
OK
Strict match: F-, OK
Fuzzy match: F+, OK
Doughnut of uncertainty
Lesson 2: Look at near misses
Near misses are checkable
One approach
• Currently, best results come from:
– Tending towards strictness
– Handling false negatives
• Failures on ‘rightmost’ fields can be written to a report
• Checked and fed back in as escapes
• Rerun
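The report-and-rerun loop could be sketched as follows. This is an interpretation, not the framework's code: the field order, the `NEAR_MISS_FROM` cut-off that decides which fields count as ‘rightmost’, and the representation of an escape as a `(field, value_a, value_b)` tuple are all assumptions.

```python
FIELD_ORDER = ["genus", "species", "authors", "publication"]
NEAR_MISS_FROM = 2  # assumed: failures on fields from this index on are reported


def compare(a, b, escapes):
    """Return (status, field): 'match', 'near miss' or 'fail'."""
    for i, field in enumerate(FIELD_ORDER):
        va, vb = a[field], b[field]
        if va == vb or (field, va, vb) in escapes:
            continue
        status = "near miss" if i >= NEAR_MISS_FROM else "fail"
        return status, field
    return "match", None


a = {"genus": "Calophyllum", "species": "kiong",
     "authors": "K.Schum. & Lauterb.", "publication": "Fl. Deutsch. Sudsee"}
b = {"genus": "Calophyllum", "species": "kiong",
     "authors": "Lauterb. & K.Schum.",
     "publication": "Die Flora der Deutschen Schutzgebiete in der Sudsee"}

escapes = set()
first_run = compare(a, b, escapes)

# A person checks the report and confirms the two author strings are equivalent.
escapes.add(("authors", a["authors"], b["authors"]))
second_run = compare(a, b, escapes)

# The publication titles are checked and confirmed in the same way.
escapes.add(("publication", a["publication"], b["publication"]))
third_run = compare(a, b, escapes)

print(first_run, second_run, third_run)
```

Matching itself stays strict; each confirmed escape converts one reported false negative into a match on the next run.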
Lesson 3: Remove predictable variation
Predictable variation
• Gendered endings
• Common alternatives
– Endings:
• ii, i
• iae, ae
• Dataset specific quirks:
– &, &
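Ending variation of this kind can be removed by reducing each epithet to a comparison key before matching. The sketch below is illustrative: the ii/i and iae/ae rules come from the slide, while the choice of -us, -a and -um as the gendered endings, the `*` placeholder and the name `normalise_epithet` are assumptions.

```python
import re

# Rules are tried in order and the first one that applies wins.
ENDING_RULES = [
    (re.compile(r"iae$"), "ae"),        # iae, ae (from the slide)
    (re.compile(r"ii$"), "i"),          # ii, i (from the slide)
    (re.compile(r"(us|um|a)$"), "*"),   # gendered endings (assumed set)
]


def normalise_epithet(epithet: str) -> str:
    """Return a comparison key with predictable ending variation removed."""
    key = epithet.lower()
    for pattern, replacement in ENDING_RULES:
        new_key = pattern.sub(replacement, key)
        if new_key != key:
            return new_key
    return key


for word in ["smithii", "smithi", "albus", "alba", "album", "kiong"]:
    print(word, "->", normalise_epithet(word))
```

The key is only used for comparison; the original spelling of each name should be kept alongside it.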
The framework
• Python
• Psyco
• Modular
• Extensible
• In progress
• More details will be available on the TDWG website
• Source code availability
The framework
• Some results (HTML)
Thanks to
• Bob Magill
• Sally Hinchcliffe
• The Moore Foundation
• Contact:
• [email protected]
• or after Jan 2007 :