Page 1: Untangling Names Lessons learned (so far) from the linking of IPNI and TROPICOS Julius Welby RBG Kew j.welby@kew.org.

Untangling Names

Lessons learned (so far) from the linking of IPNI and TROPICOS

Julius Welby, RBG Kew

j.welby@kew.org

TROPICOS + IPNI

Why match?

Why is this difficult?

Variation

Calophyllum kiong K.Schum. & Lauterb.

Fl. Deutsch. Sudsee, 450.

Calophyllum kiong Lauterb. & K.Schum.

Die Flora der Deutschen Schutzgebiete in der Sudsee 1900

Duplication

• Poa annua L. -- Sp. Pl. 68. 1753 (GCI)
• Poa annua L. -- Species Plantarum 2 1753 (APNI)
• Poa annua L. -- Sp. Pl. 68. (IK)

Duplication

• Calophyllum microphyllum Scheff in Tijdschr. Nederl. Ind. xxxii. (1871) 406. (IK)
• Calophyllum microphyllum Planch. & Triana in Ann. Sc. Nat. Ser. IV. xv. (1861) 282. (IK)
• Calophyllum microphyllum T.Anders. Fl. Brit. Ind. (J. D. Hooker). i. 272. (IK)

Matching

Fields

1  Calophyllum            Calophyllum
2  kiong                  kiong
3  K.Schum. & Lauterb.    Lauterb. & K.Schum.
4  Fl. Deutsch. Sudsee    Die Flora der Deutschen…
5  450.                   1900

Lesson 1

Speed matters

2,500 by 2,000 by 4 fields

20,000,000 comparisons

~5.5 hours at 1 ms per comparison
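The arithmetic behind these figures can be checked with a short calculation. This is only a sketch: the variable names are illustrative, and reading 2,500 and 2,000 as the sizes of the two name lists is an assumption.

```python
# Worked version of the slide's estimate.
# Assumption: 2,500 and 2,000 are the sizes of the two lists being compared.
names_a = 2_500
names_b = 2_000
fields_per_pair = 4
seconds_per_comparison = 0.001  # 1 ms

# Every record in one list is compared with every record in the other,
# field by field, with no early exit.
comparisons = names_a * names_b * fields_per_pair
hours = comparisons * seconds_per_comparison / 3600

print(f"{comparisons:,} comparisons")
print(f"{hours:.2f} hours")
```

Even for these small lists, a brute-force run takes most of a working day, which is why the following slides concentrate on avoiding comparisons.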

Be lazy

• Do as little as possible
• Do easy things if possible
• Do hard things only if necessary
• Only expend effort when it’s worth it

Be lazy

• Do as little as possible
  – Specify fields as ‘must match’
  – If a ‘must match’ field fails
    • Mark the match as failed
    • Stop comparing fields
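The ‘must match’ rule can be sketched in Python as follows. This is a minimal illustration, not the framework's actual code: `FieldRule`, `MatchResult`, `compare_records` and the field names are hypothetical, and treating a failure on a non-‘must match’ field as "record it and carry on" is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FieldRule:
    name: str                            # key of the field in each record
    must_match: bool                     # a failure here ends the comparison
    compare: Callable[[str, str], bool]  # field-level test


@dataclass
class MatchResult:
    failed: bool = False
    fields_compared: int = 0
    mismatches: List[str] = field(default_factory=list)


def exact(a: str, b: str) -> bool:
    return a == b


def compare_records(rec_a: Dict[str, str], rec_b: Dict[str, str],
                    rules: List[FieldRule]) -> MatchResult:
    """Compare two records field by field, stopping at the first
    failed 'must match' field."""
    result = MatchResult()
    for rule in rules:
        result.fields_compared += 1
        if rule.compare(rec_a[rule.name], rec_b[rule.name]):
            continue
        result.mismatches.append(rule.name)
        if rule.must_match:
            result.failed = True   # mark the match as failed
            break                  # stop comparing fields
    return result


rules = [
    FieldRule("genus", True, exact),
    FieldRule("species", True, exact),
    FieldRule("authors", False, exact),  # assumption: authors may differ
]
rec_1 = {"genus": "Calophyllum", "species": "kiong", "authors": "K.Schum. & Lauterb."}
rec_2 = {"genus": "Calophyllum", "species": "kiong", "authors": "Lauterb. & K.Schum."}
rec_3 = {"genus": "Poa", "species": "annua", "authors": "L."}

same_name = compare_records(rec_1, rec_2, rules)
other_name = compare_records(rec_1, rec_3, rules)
print(same_name)
print(other_name)
```

In this sketch the second comparison stops after a single field, because the genus is a ‘must match’ field and it fails immediately.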

Parameterised matching

• species
• infragenus
• infraspecies
• authors
• rank
• …

How lazy?

Optimising

• The order of field matching is important
  – Choose suitable fields to match first
  – Aim to fail matches early
• Significant speed-up
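The effect of field order can be illustrated by counting field comparisons for two orderings over the same toy data. This is a sketch under stated assumptions: every field is treated as ‘must match’, the records and the `rank` values are made up for the example, and `count_field_comparisons` is a hypothetical helper.

```python
from itertools import product


def count_field_comparisons(list_a, list_b, field_order):
    """Count field comparisons when every field must match and
    comparison of a pair stops at the first failing field."""
    total = 0
    for rec_a, rec_b in product(list_a, list_b):
        for name in field_order:
            total += 1
            if rec_a[name] != rec_b[name]:
                break  # fail early: skip the remaining fields
    return total


# Toy records: the rank never discriminates, the species almost always does.
records = [
    {"genus": "Calophyllum", "species": "kiong", "rank": "sp."},
    {"genus": "Calophyllum", "species": "microphyllum", "rank": "sp."},
    {"genus": "Poa", "species": "annua", "rank": "sp."},
]

rank_first = count_field_comparisons(records, records, ["rank", "genus", "species"])
species_first = count_field_comparisons(records, records, ["species", "genus", "rank"])

print("rank first:   ", rank_first)
print("species first:", species_first)
```

Putting the most discriminating field first lets most non-matching pairs fail on their first comparison, so the total drops even though the fields and data are unchanged.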

Also, for speed

• Do as little as possible
  – Do escaping or standardisation once
  – Done on import for each dataset
  – Keep field matching functions clean
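One way to picture "standardise once, on import" is shown below. The clean-up steps chosen here (Unicode normalisation, whitespace collapsing, lowercasing) are assumptions for illustration, not the framework's actual rules, and the function names are hypothetical.

```python
import unicodedata


def standardise(value: str) -> str:
    """Illustrative clean-up applied once per value at import time."""
    value = unicodedata.normalize("NFKC", value)
    return " ".join(value.split()).lower()


def import_dataset(rows):
    """Clean every field of every row once, when the dataset is loaded."""
    return [{key: standardise(val) for key, val in row.items()} for row in rows]


def fields_equal(a: str, b: str) -> bool:
    """The matching function stays trivial because its inputs are already clean."""
    return a == b


raw_rows = [{"genus": "  Calophyllum ", "species": "Kiong"}]
clean_rows = import_dataset(raw_rows)
print(clean_rows)
```

The cost of cleaning is then paid once per record rather than once per comparison, which matters when each record takes part in thousands of comparisons.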

More speed optimisation

• Do easy things if possible
  – Define cascading tests
  – Do easy tests first, if practical
    • Length comparisons
    • Composition comparisons
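A cascade of this kind might look like the sketch below, where cheap length and composition checks run before an expensive edit-distance test. The slides do not say which tests the framework uses, so the edit-distance measure, the `max_edits` threshold and the reading of "composition" as a comparison of letter counts are all assumptions.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance (the 'hard' test)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similar(a: str, b: str, max_edits: int = 1) -> bool:
    """Cascading test: cheapest checks first, edit distance last."""
    if a == b:
        return True
    # Length test: a length gap larger than max_edits cannot be bridged.
    if abs(len(a) - len(b)) > max_edits:
        return False
    # Composition test: one edit changes the letter counts by at most 2,
    # so a larger difference rules the pair out without computing distance.
    count_a, count_b = Counter(a), Counter(b)
    difference = sum(((count_a - count_b) + (count_b - count_a)).values())
    if difference > 2 * max_edits:
        return False
    # Only pairs that survive the cheap tests reach the expensive one.
    return levenshtein(a, b) <= max_edits


print(similar("hookeri", "hookerii"))
print(similar("kiong", "annua"))
print(similar("annua", "microphyllum"))
```

Both cheap tests are safe filters in this sketch: they only reject pairs that the edit-distance test would also reject, so they change the running time but not the result.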

Speed Lessons

• Speed matters
• Minimise comparisons made
  – ‘Must match’ parameters
  – Match fields in an efficient order
• Do data cleaning once, up front
• Look for ways to fail matches cheaply

Accuracy

[Diagram: match outcomes labelled False +, False - and OK]

Strict match

[Diagram: regions labelled F- and OK]

Fuzzy match

[Diagram: regions labelled F+ and OK]

Doughnut of uncertainty

Lesson 2: Look at near misses

Near misses are checkable

One approach

• Currently, to get the best results:
  – Tend towards strictness
  – Handle false negatives
• Failures on ‘rightmost’ fields can be written to a report
• Checked and fed back in as escapes
• Rerun
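The report-check-rerun loop can be sketched as below. This is one possible reading of the slide, not the framework's implementation: treating an "escape" as a manually confirmed pair of record IDs, the `near_from` cut-off that defines a ‘rightmost’ field, and the record IDs themselves are all assumptions.

```python
def run_matching(list_a, list_b, field_order, escapes=frozenset(), near_from=None):
    """Strict field-by-field matching that reports near misses.

    A pair whose first failure falls at or after position `near_from`
    in `field_order` is reported as a near miss. Pairs listed in
    `escapes` (already checked by hand) are accepted directly.
    """
    if near_from is None:
        near_from = len(field_order) - 1  # only the last field counts as 'rightmost'
    matches, near_misses = [], []
    for id_a, rec_a in list_a.items():
        for id_b, rec_b in list_b.items():
            if (id_a, id_b) in escapes:
                matches.append((id_a, id_b))
                continue
            failed_at = None
            for position, name in enumerate(field_order):
                if rec_a[name] != rec_b[name]:
                    failed_at = position
                    break
            if failed_at is None:
                matches.append((id_a, id_b))
            elif failed_at >= near_from:
                near_misses.append((id_a, id_b, field_order[failed_at]))
    return matches, near_misses


order = ["genus", "species", "authors"]
list_a = {"a1": {"genus": "Calophyllum", "species": "kiong",
                 "authors": "K.Schum. & Lauterb."}}
list_b = {"b1": {"genus": "Calophyllum", "species": "kiong",
                 "authors": "Lauterb. & K.Schum."},
          "b2": {"genus": "Poa", "species": "annua", "authors": "L."}}

# First run: strict matching finds nothing, but reports one near miss.
first_matches, first_report = run_matching(list_a, list_b, order)

# After a person checks the report, confirmed pairs are fed back in and the run repeated.
confirmed = frozenset({("a1", "b1")})
second_matches, second_report = run_matching(list_a, list_b, order, escapes=confirmed)

print(first_report)
print(second_matches)
```

Pairs that fail on an early field are never reported, which keeps the report short enough to check by hand.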

Lesson 3: Remove predictable variation

Predictable variation

• Gendered endings
• Common alternatives
  – Endings:
    • ii, i
    • iae, ae
• Dataset specific quirks:
  – &, &
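Removing this kind of variation before matching could be sketched as follows. The rules here are illustrative assumptions rather than the framework's own: the gendered-ending rule is a deliberately crude stand-in covering only -us/-a/-um, the example epithets are made up, and because the ampersand example on the slide did not survive extraction, the HTML-escaped form `&amp;` is a guess at what the quirk was.

```python
import html
import re


def normalise_epithet(epithet: str) -> str:
    """Collapse predictable spelling variants of an epithet to one form."""
    epithet = epithet.lower()
    epithet = re.sub(r"ii$", "i", epithet)       # ii / i
    epithet = re.sub(r"iae$", "ae", epithet)     # iae / ae
    # Crude assumption: treat -us, -a and -um as variants of one stem.
    epithet = re.sub(r"(us|um|a)$", "", epithet)
    return epithet


def normalise_authors(authors: str) -> str:
    """Undo HTML escaping such as '&amp;' (an assumed dataset quirk)."""
    return html.unescape(authors)


print(normalise_epithet("hookerii"), normalise_epithet("hookeri"))
print(normalise_epithet("albus"), normalise_epithet("alba"), normalise_epithet("album"))
print(normalise_authors("K.Schum. &amp; Lauterb."))
```

Applied once at import, rules like these let a strict comparison treat the variants as equal without loosening the match itself.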

The framework

• Python
• Psyco
• Modular
• Extensible
• In progress
• More details will be available on the TDWG website
• Source code availability

The framework

• Some results (HTML)

Thanks to

• Bob Magill
• Sally Hinchcliffe
• The Moore Foundation

Contact: j.welby@kew.org, or after Jan 2007: [email protected]