Untangling Names

Post on 12-Jan-2016


Lessons learned (so far) from the linking of IPNI and TROPICOS

Julius Welby, RBG Kew

j.welby@kew.org

TROPICOS + IPNI

Why match?

Why is this difficult?

Variation

Calophyllum kiong K.Schum. & Lauterb.

Fl. Deutsch. Sudsee, 450.

Calophyllum kiong Lauterb. & K.Schum.

Die Flora der Deutschen Schutzgebiete in der Sudsee 1900

Duplication

• Poa annua L. -- Sp. Pl. 68. 1753 (GCI)
• Poa annua L. -- Species Plantarum 2 1753 (APNI)
• Poa annua L. -- Sp. Pl. 68. (IK)

Duplication

• Calophyllum microphyllum Scheff in Tijdschr. Nederl. Ind. xxxii. (1871) 406. (IK)
• Calophyllum microphyllum Planch. & Triana in Ann. Sc. Nat. Ser. IV. xv. (1861) 282. (IK)
• Calophyllum microphyllum T.Anders. Fl. Brit. Ind. (J. D. Hooker). i. 272. (IK)

Matching

Fields

1 Calophyllum Calophyllum

2 kiong kiong

3 K.Schum. & Lauterb. Lauterb. & K.Schum.

4 Fl. Deutsch. Sudsee Die Flora der Deutschen…

5 450. 1900
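As a minimal sketch of why the two records above are hard to link, the five rows can be compared with plain string equality. The field names (`genus`, `species`, `authors`, `publication`, `detail`) are hypothetical labels chosen for this illustration, not names taken from the framework.

```python
# Hypothetical field labels for the five numbered rows on the slide.
FIELDS = ("genus", "species", "authors", "publication", "detail")

record_a = {
    "genus": "Calophyllum",
    "species": "kiong",
    "authors": "K.Schum. & Lauterb.",
    "publication": "Fl. Deutsch. Sudsee",
    "detail": "450.",
}
record_b = {
    "genus": "Calophyllum",
    "species": "kiong",
    "authors": "Lauterb. & K.Schum.",
    "publication": "Die Flora der Deutschen…",  # truncated on the slide
    "detail": "1900",
}

def strict_field_results(a, b, fields=FIELDS):
    """Return a dict mapping each field to True if the two values are identical."""
    return {field: a[field] == b[field] for field in fields}

results = strict_field_results(record_a, record_b)
for field in FIELDS:
    print(field, "matches" if results[field] else "differs")
```

Under strict equality only the genus and species agree, even though both records describe the same name.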

Lesson 1

Speed matters


2,500 by 2,000 by 4 fields

20,000,000 comparisons

~5.5 hours at 1ms per comparison
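The arithmetic behind these figures can be checked in a few lines. The variable names are illustrative only.

```python
records_a = 2_500          # names in the first dataset
records_b = 2_000          # names in the second dataset
fields = 4                 # fields compared per pair of names
ms_per_comparison = 1      # assumed cost of one field comparison

comparisons = records_a * records_b * fields
hours = comparisons * ms_per_comparison / 1000 / 3600

print(f"{comparisons:,} comparisons, {hours:.2f} hours")
# prints "20,000,000 comparisons, 5.56 hours"
```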

Be lazy

• Do as little as possible
• Do easy things if possible
• Do hard things only if necessary
• Only expend effort when it’s worth it

Be lazy

• Do as little as possible
  – Specify fields as ‘must match’
  – If a ‘must match’ field fails
    • Mark the match as failed
    • Stop comparing fields
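The 'must match' idea can be sketched as an early exit from the field loop. This is an illustration under assumptions, not the framework's own code: `compare_records`, its parameters and the sample records are all hypothetical.

```python
def compare_records(a, b, fields, must_match):
    """Compare two records field by field.

    Stops at the first failed 'must match' field. Returns (matched, results),
    where results holds only the fields that were actually compared.
    """
    results = {}
    for field in fields:
        results[field] = a[field] == b[field]
        if not results[field] and field in must_match:
            # Mark the match as failed and stop comparing fields.
            return False, results
    return all(results.values()), results

a = {"genus": "Calophyllum", "species": "kiong", "authors": "K.Schum. & Lauterb."}
b = {"genus": "Poa", "species": "annua", "authors": "L."}

matched, compared = compare_records(
    a, b,
    fields=("genus", "species", "authors"),
    must_match={"genus", "species"},
)
print(matched, compared)
```

Here the genus fails first, so the species and authors are never compared.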

Parameterised matching

species, infragenus, infraspecies, authors, rank …

How lazy?

Optimising

• The order of field matching is important
  – Choose suitable fields to match first
  – Aim to fail matches early
• Significant speed-up
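The effect of field order can be shown on a toy dataset by counting field comparisons. The sketch assumes every field is 'must match'; the function name and the three sample records are hypothetical.

```python
from itertools import product

def count_field_comparisons(left, right, field_order):
    """Count field comparisons made when matching stops at the first failure."""
    count = 0
    for a, b in product(left, right):
        for field in field_order:
            count += 1
            if a[field] != b[field]:
                break  # fail early: skip the remaining fields for this pair
    return count

names = [
    {"genus": "Calophyllum", "species": "kiong"},
    {"genus": "Calophyllum", "species": "microphyllum"},
    {"genus": "Poa", "species": "annua"},
]

genus_first = count_field_comparisons(names, names, ("genus", "species"))
species_first = count_field_comparisons(names, names, ("species", "genus"))
print(genus_first, species_first)
```

In this toy data the species epithet is the more discriminating field, so testing it first needs 12 comparisons rather than 14. On real datasets the best order depends on the data.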

Also, for speed

• Do as little as possible
  – Do escaping or standardisation once
  – Done on import for each dataset
  – Keep field matching functions clean
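A rough sketch of cleaning once at import time, so that the matching functions stay trivial. The specific clean-up steps (decoding HTML entities, collapsing whitespace, lower-casing) are assumptions chosen for illustration; the slides do not say which standardisations the framework applies.

```python
import html
import re

def standardise(value):
    """Hypothetical clean-up: decode HTML entities, collapse whitespace, lower-case."""
    value = html.unescape(value)
    value = re.sub(r"\s+", " ", value).strip()
    return value.lower()

def import_dataset(rows, fields):
    """Standardise every field once, when the dataset is loaded."""
    return [{field: standardise(row[field]) for field in fields} for row in rows]

def fields_equal(a, b, field):
    """The matcher can stay a plain comparison because the data is already clean."""
    return a[field] == b[field]

raw = [{"authors": "K.Schum.  &amp; Lauterb. "}]
imported = import_dataset(raw, ("authors",))
print(imported[0]["authors"])
```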

More speed optimisation

• Do easy things if possible
  – Define cascading tests
  – Do easy tests first, if practical
  – Length comparisons
  – Composition comparisons
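One possible reading of cascading tests is sketched below: a length check, then a character-composition check, and only then an expensive similarity measure. The thresholds are arbitrary, and `difflib.SequenceMatcher` is a stand-in for whatever fuzzy comparison the framework actually uses.

```python
from collections import Counter
from difflib import SequenceMatcher

def similar(a, b, max_length_gap=2, max_char_gap=3, min_ratio=0.85):
    """Cascade of tests, cheapest first; each one can fail the match early."""
    # 1. Length comparison: very cheap.
    if abs(len(a) - len(b)) > max_length_gap:
        return False
    # 2. Composition comparison: count characters not shared by both strings.
    count_a, count_b = Counter(a), Counter(b)
    unshared = sum(((count_a - count_b) + (count_b - count_a)).values())
    if unshared > max_char_gap:
        return False
    # 3. Expensive comparison, reached only by plausible candidates.
    return SequenceMatcher(None, a, b).ratio() >= min_ratio

print(similar("microphyllum", "microphylla"))  # passes all three tests
print(similar("kiong", "microphyllum"))        # fails on length
print(similar("kiong", "annua"))               # fails on composition
```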

Speed Lessons

• Speed matters

• Minimise comparisons made
  – ‘Must match’ parameters
  – Match fields in an efficient order

• Do data cleaning once, up front

• Look for ways to fail matches cheaply

Accuracy


[Diagram: match outcomes labelled False +, False - and OK. Strict match: F- and OK. Fuzzy match: F+ and OK.]

Doughnut of uncertainty

Lesson 2: Look at near misses

Near misses are checkable
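One way to picture near misses, assuming each comparison yields a similarity score, is a three-way classification with an uncertain band between acceptance and rejection. The thresholds, the `classify` function and the use of `difflib.SequenceMatcher` are all assumptions made for this sketch; the slides do not say how near misses are scored.

```python
from difflib import SequenceMatcher

def classify(a, b, accept=1.0, reject=0.8):
    """Label a pair as 'match', 'near miss' or 'fail' from a similarity score."""
    score = SequenceMatcher(None, a, b).ratio()
    if score >= accept:
        return "match"       # strict agreement
    if score >= reject:
        return "near miss"   # worth a human check
    return "fail"

print(classify("kiong", "kiong"))
print(classify("microphyllum", "microphylla"))
print(classify("kiong", "annua"))
```

Only the middle band needs to be reviewed by hand, which keeps the checking workload small.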


One approach

• Currently, best results come from:
  – Tending towards strictness
  – Handling false negatives
• Failures on ‘rightmost’ fields can be written to a report
• Checked and fed back in as escapes
• Rerun
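The report-check-rerun loop might look roughly like this. It is a sketch under simplifying assumptions: only a failure on the last field counts as a near miss, and an escape is a hypothetical `(field, value_a, value_b)` tuple that a person has confirmed as equivalent.

```python
def match_with_escapes(a, b, fields, escapes):
    """Strict left-to-right match. Returns (status, failed_field)."""
    for position, field in enumerate(fields):
        if a[field] == b[field] or (field, a[field], b[field]) in escapes:
            continue
        # A failure on the last ('rightmost') field is reported as a near miss.
        is_rightmost = position == len(fields) - 1
        return ("near miss" if is_rightmost else "fail"), field
    return "match", None

fields = ("genus", "species", "authors")
a = {"genus": "Calophyllum", "species": "kiong", "authors": "K.Schum. & Lauterb."}
b = {"genus": "Calophyllum", "species": "kiong", "authors": "Lauterb. & K.Schum."}

# First run: the pair goes into the near-miss report.
first_run = match_with_escapes(a, b, fields, escapes=set())

# After a manual check, the pair is fed back in as an escape and the match is rerun.
escapes = {("authors", a["authors"], b["authors"])}
second_run = match_with_escapes(a, b, fields, escapes)

print(first_run, second_run)
```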

Lesson 3: Remove predictable variation

Predictable variation

• Gendered endings
• Common alternatives
  – Endings:
    • ii, i
    • iae, ae
• Dataset specific quirks:
  – &amp;, &
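Removing predictable variation before comparison could be sketched as below. The rules are illustrative assumptions: the ending pairs come from the slide, while the set of gendered endings (`us`, `um`, `a`) and the sample epithets `hookerii` and `hookeri` are chosen for the example and are not taken from the framework.

```python
import re

# Ending alternatives from the slide: ii -> i, iae -> ae.
ENDING_RULES = [
    (re.compile(r"iae$"), "ae"),
    (re.compile(r"ii$"), "i"),
]
# Assumed set of gendered endings; a real rule set would be larger.
GENDERED_ENDINGS = re.compile(r"(us|um|a)$")

def normalise_epithet(epithet):
    """Lower-case the epithet and collapse the common ending alternatives."""
    epithet = epithet.lower()
    for pattern, replacement in ENDING_RULES:
        epithet = pattern.sub(replacement, epithet)
    return epithet

def epithet_stem(epithet):
    """Additionally strip an assumed gendered ending, for comparison only."""
    return GENDERED_ENDINGS.sub("", normalise_epithet(epithet))

print(normalise_epithet("hookerii"), normalise_epithet("hookeri"))
print(epithet_stem("microphyllum"), epithet_stem("microphylla"))
```

The stems are meant for comparison only; the original spelling should be kept for display and reporting.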

The framework

• Python
• Psyco
• Modular
• Extensible
• In progress
• More details will be available on the TDWG website
• Source code availability

The framework

• Some results (HTML)

Thanks to

• Bob Magill
• Sally Hinchcliffe
• The Moore Foundation

Contact:

• j.welby@kew.org
• or after Jan 2007: julius.welby@gmail.com