Untangling Names


Page 1: Untangling Names

Untangling Names

Lessons learned (so far) from the linking of IPNI and TROPICOS

Julius Welby
RBG Kew

[email protected]

Page 2: Untangling Names

TROPICOS + IPNI

Page 3: Untangling Names

Why match?

Page 4: Untangling Names

Why is this difficult?

Page 5: Untangling Names

Variation

Calophyllum kiong K.Schum. & Lauterb.

Fl. Deutsch. Sudsee, 450.

Calophyllum kiong Lauterb. & K.Schum.

Die Flora der Deutschen Schutzgebiete in der Sudsee 1900

Page 6: Untangling Names

Duplication

• Poa annua L. -- Sp. Pl. 68. 1753 (GCI)
• Poa annua L. -- Species Plantarum 2 1753 (APNI)
• Poa annua L. -- Sp. Pl. 68. (IK)

Page 7: Untangling Names

Duplication

• Calophyllum microphyllum Scheff in Tijdschr. Nederl. Ind. xxxii. (1871) 406. (IK)
• Calophyllum microphyllum Planch. & Triana in Ann. Sc. Nat. Ser. IV. xv. (1861) 282. (IK)
• Calophyllum microphyllum T.Anders. Fl. Brit. Ind. (J. D. Hooker). i. 272. (IK)

Page 8: Untangling Names

Matching

Page 9: Untangling Names

Fields

1 | Calophyllum         | Calophyllum
2 | kiong               | kiong
3 | K.Schum. & Lauterb. | Lauterb. & K.Schum.
4 | Fl. Deutsch. Sudsee | Die Flora der Deutschen…
5 | 450.                | 1900

Page 10: Untangling Names

Lesson 1

Speed matters

Page 11: Untangling Names

Speed matters

2,500 by 2,000 by 4 fields

20,000,000 comparisons

~5.5 hours at 1ms per comparison
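The arithmetic on this slide can be checked in a few lines. This is only a restatement of the slide's own figures; the variable names are illustrative.

```python
# Figures taken from the slide: two name lists, four fields per pair,
# and an assumed cost of 1 ms per field comparison.
names_a = 2_500
names_b = 2_000
fields = 4
ms_per_comparison = 1

# Every name in one list against every name in the other, field by field.
comparisons = names_a * names_b * fields
hours = comparisons * ms_per_comparison / 1000 / 3600

print(comparisons)      # total field comparisons
print(round(hours, 2))  # running time in hours
```

The cost grows with the product of the two list sizes, which is why the later slides concentrate on avoiding comparisons rather than on making each one faster.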

Page 12: Untangling Names

Be lazy

• Do as little as possible
• Do easy things if possible
• Do hard things only if necessary
• Only expend effort when it’s worth it

Page 13: Untangling Names

Be lazy

• Do as little as possible
  – Specify fields as ‘must match’
  – If a ‘must match’ field fails
    • Mark the match as failed
    • Stop comparing fields
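A minimal sketch of the 'must match' idea, assuming records are plain dictionaries and fields are compared by simple equality. The function name `compare`, the field names and the treatment of non-'must match' mismatches (recorded, but not fatal) are assumptions made for illustration, not the framework's actual design.

```python
FIELDS = ["genus", "species", "authors"]
MUST_MATCH = {"genus", "species"}  # hypothetical choice of 'must match' fields


def compare(a, b, fields, must_match):
    """Compare two records field by field.

    Returns (ok, mismatches). As soon as a 'must match' field differs,
    the match is marked as failed and no further fields are compared.
    """
    mismatches = []
    for field in fields:
        if a[field] == b[field]:
            continue
        mismatches.append(field)
        if field in must_match:
            return False, mismatches  # failed: stop comparing fields
    return True, mismatches  # no 'must match' failure


rec_a = {"genus": "Calophyllum", "species": "kiong",
         "authors": "K.Schum. & Lauterb."}
rec_b = {"genus": "Calophyllum", "species": "kiong",
         "authors": "Lauterb. & K.Schum."}
rec_c = dict(rec_b, species="microphyllum")

print(compare(rec_a, rec_b, FIELDS, MUST_MATCH))
print(compare(rec_a, rec_c, FIELDS, MUST_MATCH))
```

In the second call the authors field is never examined, because the species field has already failed.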

Page 14: Untangling Names

Parameterised matching

species
infragenus
infraspecies
authors
rank
…

Page 15: Untangling Names

How lazy?

Page 16: Untangling Names

Optimising

• The order of field matching is important
  – Choose suitable fields to match first
  – Aim to fail matches early

• Significant speed-up
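The effect of field order can be seen by counting comparisons on a toy example. The data below is made up for illustration (four species in one genus), and `count_comparisons` is a hypothetical helper that treats every field as 'must match'.

```python
def count_comparisons(list_a, list_b, field_order):
    """Count field comparisons when matching stops at the first failed field."""
    count = 0
    for a in list_a:
        for b in list_b:
            for field in field_order:
                count += 1
                if a[field] != b[field]:
                    break  # fail early: skip the remaining fields
    return count


# Illustrative data only: every record shares the genus, so comparing
# the genus first never rejects a pair.
species = ["kiong", "microphyllum", "inophyllum", "brasiliense"]
list_a = [{"genus": "Calophyllum", "species": s} for s in species]
list_b = [{"genus": "Calophyllum", "species": s} for s in species]

genus_first = count_comparisons(list_a, list_b, ["genus", "species"])
species_first = count_comparisons(list_a, list_b, ["species", "genus"])
print(genus_first, species_first)
```

Matching the most discriminating field first rejects most pairs after a single comparison. Which field that is depends on the dataset.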

Page 17: Untangling Names

Also, for speed

• Do as little as possible
  – Do escaping or standardisation once
  – Done on import for each dataset
  – Keep field matching functions clean
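One way to read this slide: cleaning happens once when a dataset is loaded, so the comparison function in the inner loop does nothing but compare. The clean-up rules in `standardise` (trim, collapse whitespace, lower-case) are assumptions chosen for the sketch; the slide does not say which rules the framework applies.

```python
def standardise(value):
    """Hypothetical clean-up: trim, collapse runs of whitespace, lower-case."""
    return " ".join(value.split()).lower()


def import_dataset(rows, fields):
    """Return a cleaned copy of the rows. Run once per dataset, on import."""
    return [{f: standardise(row[f]) for f in fields} for row in rows]


def fields_equal(a, b, field):
    """Kept clean: no escaping or case-folding inside the matching loop."""
    return a[field] == b[field]


raw = [{"genus": " Calophyllum ", "species": "Kiong"}]
cleaned = import_dataset(raw, ["genus", "species"])
print(cleaned)
```

With 20,000,000 comparisons, any clean-up left inside `fields_equal` would be repeated millions of times on the same values.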

Page 18: Untangling Names

More speed optimisation

• Do easy things if possible
  – Define cascading tests
  – Do easy tests first, if practical
    – Length comparisons
    – Composition comparisons
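A sketch of a cascading test, assuming the expensive final step is a similarity ratio from Python's standard `difflib`. The thresholds and the choice of `SequenceMatcher` are assumptions; the slide names only the cheap tests (length and composition), not the expensive one.

```python
from collections import Counter
from difflib import SequenceMatcher


def fuzzy_equal(a, b, max_len_diff=2, max_char_diff=4, min_ratio=0.85):
    """Cheap tests first; the costly similarity ratio runs only if they pass."""
    if a == b:
        return True
    # Length comparison: very different lengths cannot be a near match.
    if abs(len(a) - len(b)) > max_len_diff:
        return False
    # Composition comparison: count characters present in one string
    # but not the other, ignoring their order.
    ca, cb = Counter(a), Counter(b)
    char_diff = sum(((ca - cb) + (cb - ca)).values())
    if char_diff > max_char_diff:
        return False
    # Expensive test, reached only by plausible candidates.
    return SequenceMatcher(None, a, b).ratio() >= min_ratio


print(fuzzy_equal("microphyllum", "microphylum"))  # passes every stage
print(fuzzy_equal("kiong", "microphyllum"))        # rejected on length
print(fuzzy_equal("abcde", "vwxyz"))               # rejected on composition
```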

Page 19: Untangling Names

Speed Lessons

• Speed matters

• Minimise comparisons made
  – ‘Must match’ parameters
  – Match fields in an efficient order

• Do data cleaning once, up front

• Look for ways to fail matches cheaply

Page 20: Untangling Names

Accuracy

Page 21: Untangling Names

Accuracy

False +

False -

OK

Page 22: Untangling Names

Strict match

F-

OK

Page 23: Untangling Names

Fuzzy match

F+

OK

Page 24: Untangling Names

Doughnut of uncertainty

Page 25: Untangling Names

Lesson 2: Look at near misses

Page 26: Untangling Names

Near misses are checkable

Page 27: Untangling Names

One approach

• Currently, to get best results:
  – Tend towards strictness
  – Handle false negatives

Page 28: Untangling Names

One approach

• Currently, best results from:
  – Tend towards strictness
  – Handle false negatives

• Failures on ‘rightmost’ fields can be written to a report

• Checked and fed back in as escapes

• Rerun
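The report-and-rerun loop might look like the sketch below. It assumes 'rightmost' means the last field in the matching order, and that an escape is a checked (field, value, value) triple to be treated as equal on later runs. Both are interpretations of the slide, and `match_with_report` is a hypothetical name.

```python
def match_with_report(pairs, field_order, escapes):
    """Strict matching that reports pairs failing only on the last field."""
    matches, report = [], []
    for a, b in pairs:
        failed_at = None
        for i, field in enumerate(field_order):
            differs = a[field] != b[field]
            if differs and (field, a[field], b[field]) not in escapes:
                failed_at = i
                break
        if failed_at is None:
            matches.append((a, b))
        elif failed_at == len(field_order) - 1:
            # Near miss: every earlier field matched, so it is worth checking.
            field = field_order[failed_at]
            report.append((field, a[field], b[field]))
    return matches, report


order = ["genus", "species", "authors"]
a = {"genus": "Calophyllum", "species": "kiong",
     "authors": "K.Schum. & Lauterb."}
b = {"genus": "Calophyllum", "species": "kiong",
     "authors": "Lauterb. & K.Schum."}

# First run: strict, so the pair is reported rather than matched.
matches, report = match_with_report([(a, b)], order, set())

# A person checks the report; confirmed entries are fed back in as escapes.
escapes = set(report)
matches2, report2 = match_with_report([(a, b)], order, escapes)
print(len(matches), len(matches2))
```

The matcher stays strict, and the false negatives it produces are handled by human checking rather than by loosening the comparison.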

Page 29: Untangling Names

Lesson 3: Remove predictable variation

Page 30: Untangling Names

Predictable variation

• Gendered endings
• Common alternatives
  – Endings:
    • ii, i
    • iae, ae

• Dataset specific quirks:
  – &amp;, &
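A sketch of removing these kinds of variation before matching. The exact rules are assumptions: gendered endings are taken to be Latin -us / -a / -um, the ending pairs are reduced to the shorter form, and the dataset quirk is assumed to be an HTML-escaped ampersand in one dataset. The slide lists the categories, not the substitutions the framework uses.

```python
import html

# Assumed rules, applied in order; the longer ending is tested first.
ENDING_RULES = [("iae", "ae"), ("ii", "i")]
GENDERED_ENDINGS = ("us", "um", "a")  # assumed Latin gendered endings


def normalise_epithet(epithet):
    """Reduce an epithet to a comparison key (not a valid name)."""
    e = epithet.strip().lower()
    for old, new in ENDING_RULES:
        if e.endswith(old):
            return e[: -len(old)] + new
    for ending in GENDERED_ENDINGS:
        # Keep at least a three-letter stem.
        if e.endswith(ending) and len(e) > len(ending) + 2:
            return e[: -len(ending)]
    return e


def normalise_authors(authors):
    """Undo HTML escaping such as '&amp;' (an assumed dataset quirk)."""
    return html.unescape(authors).strip()


print(normalise_epithet("smithii"), normalise_epithet("smithi"))
print(normalise_epithet("albus"), normalise_epithet("alba"))
print(normalise_authors("K.Schum. &amp; Lauterb."))
```

The keys are meant only for comparison; the original spellings would still be stored and displayed unchanged.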

Page 31: Untangling Names

The framework

• Python
• Psyco
• Modular
• Extensible
• In progress
• More details will be available on the TDWG website
• Source code availability

Page 32: Untangling Names

The framework

• Some results (HTML)

Page 33: Untangling Names

Thanks to

• Bob Magill
• Sally Hinchcliffe
• The Moore Foundation

• Contact:
• [email protected]
• or after Jan 2007: [email protected]