Untangling Names
Lessons learned (so far) from the linking of IPNI and TROPICOS
Julius Welby, RBG Kew
TROPICOS + IPNI
Why match?
Why is this difficult?
Variation
Calophyllum kiong K.Schum. & Lauterb.
Fl. Deutsch. Sudsee, 450.
Calophyllum kiong Lauterb. & K.Schum.
Die Flora der Deutschen Schutzgebiete in der Sudsee 1900
Duplication
• Poa annua L. -- Sp. Pl. 68. 1753 (GCI)
• Poa annua L. -- Species Plantarum 2 1753 (APNI)
• Poa annua L. -- Sp. Pl. 68. (IK)
Duplication
• Calophyllum microphyllum Scheff in Tijdschr. Nederl. Ind. xxxii. (1871) 406. (IK)
• Calophyllum microphyllum Planch. & Triana in Ann. Sc. Nat. Ser. IV. xv. (1861) 282. (IK)
• Calophyllum microphyllum T.Anders. Fl. Brit. Ind. (J. D. Hooker). i. 272. (IK)
Matching
Fields
1 | Calophyllum | Calophyllum
2 | kiong | kiong
3 | K.Schum. & Lauterb. | Lauterb. & K.Schum.
4 | Fl. Deutsch. Sudsee | Die Flora der Deutschen…
5 | 450. | 1900
Lesson 1
Speed matters
2,500 by 2,000 by 4 fields
20,000,000 comparisons
~5.5 hours at 1ms per comparison
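The arithmetic on this slide can be checked in a few lines of Python. The figures are the ones quoted above; the variable names are illustrative only.

```python
# Figures from the slide; variable names are illustrative, not from the framework.
records_a = 2_500
records_b = 2_000
fields = 4
ms_per_comparison = 1

# Every record in one set is compared with every record in the other, field by field.
comparisons = records_a * records_b * fields
hours = comparisons * ms_per_comparison / 1000 / 3600

print(comparisons)      # 20000000
print(round(hours, 2))  # 5.56
```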
Be lazy
• Do as little as possible
• Do easy things if possible
• Do hard things only if necessary
• Only expend effort when it’s worth it
Be lazy
• Do as little as possible
  – Specify fields as ‘must match’
  – If a ‘must match’ field fails
    • Mark the match as failed
    • Stop comparing fields
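A minimal sketch of the ‘must match’ idea, assuming records are plain dictionaries. The function name `compare`, its return shape and the field names are hypothetical, not the framework's actual API.

```python
def compare(a, b, fields, must_match):
    """Compare two records field by field (hypothetical helper).

    Returns (matched, failed_fields, comparisons_made). As soon as a
    'must match' field differs, the match is marked as failed and no
    further fields are compared.
    """
    failed = []
    for n, field in enumerate(fields, start=1):
        if a[field] != b[field]:
            failed.append(field)
            if field in must_match:
                return False, failed, n  # fail early, skip remaining fields
    return not failed, failed, len(fields)


a = {"genus": "Calophyllum", "species": "kiong", "authors": "K.Schum. & Lauterb."}
b = {"genus": "Poa", "species": "annua", "authors": "L."}

result = compare(a, b, ["genus", "species", "authors"], {"genus", "species"})
print(result)  # (False, ['genus'], 1)
```

Only one of the three fields is compared here, because the first ‘must match’ field already fails.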
Parameterised matching
species, infragenus, infraspecies, authors, rank …
How lazy?
Optimising
• The order of field matching is important
  – Choose suitable fields to match first
  – Aim to fail matches early
• Significant speed-up
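To illustrate why field order matters, the sketch below counts field comparisons under two orderings, assuming every field is ‘must match’ and comparison stops at the first failure. The helper `count_comparisons` is hypothetical; the names are the ones used earlier in the talk.

```python
def count_comparisons(left, right, field_order):
    """Count field comparisons made when matching stops at the first failure."""
    total = 0
    for a in left:
        for b in right:
            for field in field_order:
                total += 1
                if a[field] != b[field]:
                    break  # fail early: later fields are never looked at
    return total


names = [
    {"genus": "Calophyllum", "species": "kiong"},
    {"genus": "Calophyllum", "species": "microphyllum"},
    {"genus": "Poa", "species": "annua"},
]

genus_first = count_comparisons(names, names, ["genus", "species"])
species_first = count_comparisons(names, names, ["species", "genus"])
print(genus_first, species_first)  # 14 12
```

In this tiny sample the species epithet is the more discriminating field, so testing it first fails more pairs on the first comparison. Which field is best to test first depends on the data.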
Also, for speed
• Do as little as possible
  – Do escaping or standardisation once
  – Done on import for each dataset
  – Keep field matching functions clean
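A sketch of cleaning once on import, so that the matching function stays a plain comparison. The `standardise` rules (HTML unescaping, whitespace collapsing, lower-casing) and the escaped `&amp;` in the sample data are assumptions made for illustration, not the framework's actual cleaning steps.

```python
import html


def standardise(value):
    """Hypothetical cleaning step: unescape HTML entities, collapse whitespace, lower-case."""
    return " ".join(html.unescape(value).split()).lower()


def import_dataset(rows):
    """Clean every field once, at import time."""
    return [{field: standardise(value) for field, value in row.items()} for row in rows]


def authors_match(a, b):
    # No cleaning happens here: the matching function stays a simple comparison.
    return a["authors"] == b["authors"]


left = import_dataset([{"authors": "K.Schum. &amp; Lauterb."}])
right = import_dataset([{"authors": "K.Schum.  & Lauterb."}])
print(authors_match(left[0], right[0]))  # True
```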
More speed optimisation
• Do easy things if possible
  – Define cascading tests
  – Do easy tests first, if practical
  – Length comparisons
  – Composition comparisons
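One way to read the cascade is sketched below: cheap length and composition checks run first, and the expensive similarity score is computed only if both pass. The thresholds are arbitrary, and `difflib.SequenceMatcher` is a stand-in for whatever fuzzy comparison the framework actually uses.

```python
from collections import Counter
from difflib import SequenceMatcher


def fuzzy_equal(a, b, max_length_gap=2, max_letter_gap=3, min_ratio=0.85):
    """Cascading fuzzy comparison (illustrative thresholds)."""
    if a == b:
        return True
    # Cheap test 1: strings of very different length cannot be close.
    if abs(len(a) - len(b)) > max_length_gap:
        return False
    # Cheap test 2: compare composition, counting letters present in one
    # string but not in the other.
    count_a, count_b = Counter(a), Counter(b)
    letter_gap = sum(((count_a - count_b) + (count_b - count_a)).values())
    if letter_gap > max_letter_gap:
        return False
    # Expensive test, reached only when the cheap tests pass.
    return SequenceMatcher(None, a, b).ratio() >= min_ratio


print(fuzzy_equal("microphyllum", "microphylum"))  # True
print(fuzzy_equal("kiong", "microphyllum"))        # False (fails on length)
print(fuzzy_equal("kiong", "annua"))               # False (fails on composition)
```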
Speed Lessons
• Speed matters
• Minimise comparisons made
  – ‘Must match’ parameters
  – Match fields in an efficient order
• Do data cleaning once, up front
• Look for ways to fail matches cheaply
Accuracy
[Diagram: match outcomes are false positive (F+), false negative (F-) or OK. Strict match: F- and OK. Fuzzy match: F+ and OK.]
Doughnut of uncertainty
Lesson 2: Look at near misses
Near misses are checkable
One approach
• Currently, to get best results:
  – Tend towards strictness
  – Handle false negatives
• Failures on ‘rightmost’ fields can be written to a report
• Checked and fed back in as escapes
• Rerun
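The report-and-rerun loop might look like the sketch below. It assumes that ‘rightmost’ means the last field in the matching order, and that an ‘escape’ is a pair of values a person has confirmed as equivalent. Both readings, and the function name `classify`, are assumptions.

```python
def classify(a, b, fields, escapes):
    """Return 'match', 'near miss' or 'fail' for a pair of records.

    A pair that agrees on every field except the last (rightmost) one
    is reported as a near miss so that a person can check it.
    """
    for position, field in enumerate(fields):
        pair = (a[field], b[field])
        if pair[0] == pair[1] or pair in escapes:
            continue
        is_rightmost = position == len(fields) - 1
        return "near miss" if is_rightmost else "fail"
    return "match"


fields = ["genus", "species", "authors"]
a = {"genus": "Calophyllum", "species": "kiong", "authors": "K.Schum. & Lauterb."}
b = {"genus": "Calophyllum", "species": "kiong", "authors": "Lauterb. & K.Schum."}

escapes = set()
first = classify(a, b, fields, escapes)
print(first)  # near miss

# After a manual check, the confirmed pair is fed back in as an escape and the run repeated.
escapes.add(("K.Schum. & Lauterb.", "Lauterb. & K.Schum."))
second = classify(a, b, fields, escapes)
print(second)  # match
```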
Lesson 3: Remove predictable variation
Predictable variation
• Gendered endings
• Common alternatives
  – Endings:
    • ii, i
    • iae, ae
• Dataset specific quirks:
  – &, &
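A sketch of removing the two ending alternatives listed above before comparison. The function name and the sample epithets are illustrative; gendered endings and dataset-specific quirks are not covered here.

```python
import re


def normalise_epithet(epithet):
    """Collapse the 'ii'/'i' and 'iae'/'ae' ending alternatives (hypothetical helper)."""
    epithet = epithet.lower()
    epithet = re.sub(r"ii$", "i", epithet)
    epithet = re.sub(r"iae$", "ae", epithet)
    return epithet


# Illustrative epithets, not taken from either dataset.
print(normalise_epithet("smithii") == normalise_epithet("smithi"))    # True
print(normalise_epithet("smithiae") == normalise_epithet("smithae"))  # True
```

Applied once on import, this keeps the later field comparisons strict while still treating the variants as equal.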
The framework
• Python
• Psyco
• Modular
• Extensible
• In progress
• More details will be available on the TDWG website
• Source code availability
The framework
• Some results (HTML)
Thanks to
• Bob Magill
• Sally Hinchcliffe
• The Moore Foundation
• Contact:
  • [email protected]
  • or after Jan 2007 :