Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and...

ACS National Meeting, Indianapolis, USA, 8th September 2013. Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities. Daniel Lowe and Roger Sayle, NextMove Software, Cambridge, UK

description

Extracting the structures of small molecules from unstructured text is now a mature field; however, there remain areas that present considerable difficulty or have until now remained unexplored. One such area is the identification of chemical names containing misspellings or errors introduced by optical character recognition. Our approach employs a formal grammar describing the syntax of a systematic name. To cover the vast majority of organic nomenclature, including carbohydrates, amino acids and natural products, we have developed a new way of representing the grammar that allows an order of magnitude more states than previous efforts (1) whilst simultaneously reducing memory consumption. We describe a heuristic algorithm that efficiently performs spelling correction against this grammar. Another underexplored area is the identification and resolution of chemical line formulae, in which we include domain-specific line formulae such as those used to describe oligosaccharides and peptides. We describe the recognition and resolution of these often overlooked chemical entities. We also show how one can identify entities such as journal and patent references, which can aid navigation of semantically enhanced documents.

(1) Sayle, R.; Xie, P. H.; Muresan, S. Improved Chemical Text Mining of Patents with Infinite Dictionaries and Automatic Spelling Correction. J. Chem. Inf. Model. 2011, 52, 51–62.

Transcript of Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and...

Page 1: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

ACS National Meeting, Indianapolis, USA 8th September 2013

Tackling the difficult areas of chemical entity extraction:

Misspelt chemical names and unconventional entities

Daniel Lowe and Roger Sayle

NextMove Software

Cambridge, UK

Page 2: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

Text mining is big business

2013 Bio-IT World Best Practices winner

Page 3: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

Approaches to Entity recognition

• Dictionary based

• Grammar based

• Machine Learning

LeadMine LeadMine

Page 4: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

Approaches to Entity recognition

• Dictionary based approaches are ideal for relating entities to concepts but only recognise a finite number of terms

– Will not recognise novel compound names

• Hence for chemistry, dictionary approaches need to be used in conjunction with another method

Page 5: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

Advantages of grammars

• Don’t require annotated corpora

• Encode knowledge about the domain

• Very fast recognition

• Allow spelling correction if an entity is a near match to one recognised by the grammar

Page 6: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

Simple grammar Example

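Digit1to9 : '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

Digit : Digit1to9 | '0'

Cid : 'CID:' Digit1to9 Digit*

[Figure: state machine for Cid, accepting C, I, D, ':', then one digit 1..9, then any number of digits 0..9]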
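The Cid rule above can be sketched in Python in two equivalent ways: as a regular expression and as a hand-written state machine mirroring the slide's diagram. The function names `is_cid` and `is_cid_state_machine` are illustrative and not part of LeadMine.

```python
import re

# Cid : 'CID:' Digit1to9 Digit*
CID_PATTERN = re.compile(r"CID:[1-9][0-9]*")


def is_cid(text):
    """Return True if the whole string matches the Cid rule."""
    return CID_PATTERN.fullmatch(text) is not None


def is_cid_state_machine(text):
    """Same rule as an explicit state machine.

    States 0..3 consume the literal 'CID:', state 4 expects a digit 1..9,
    and state 5 (the only accepting state) loops on digits 0..9.
    """
    prefix = "CID:"
    state = 0
    for ch in text:
        if state < 4:
            if ch != prefix[state]:
                return False
            state += 1
        elif state == 4:
            if ch not in "123456789":
                return False
            state = 5
        else:
            if ch not in "0123456789":
                return False
    return state == 5


for candidate in ["CID:2244", "CID:0123", "CID:", "CID:90"]:
    print(candidate, is_cid(candidate), is_cid_state_machine(candidate))
```

Both functions reject a leading zero, which is the reason the grammar separates Digit1to9 from Digit.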

Page 7: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

Grammar for IUPAC names

• Grammar for complete molecules: 485 rules

– trivialRing : 'aceanthren'|'aceanthrylen'|'acenaphthen'...

– ringGroup : trivialRing | hantzschWidmanRing | vonBaeyerSystem ...

• Generally aims to match a superset of the nomenclature covered by IUPAC

• Specifically, this is the superset that can theoretically be converted to structures

Page 8: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

State machine size

[Figure: number of states required (y-axis, 0 to 14,000,000) plotted against recall on names from the MayBridge catalogue (x-axis, 0.82 to 1)]

Page 9: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

Two Level State Machines

• Breaks problems into a state machine that keeps track of when concepts have to be matched and a state machine that matches each concept e.g. an acyclic group

– Avoids duplication of states to match the same concept in slightly different contexts

– Slower as multiple concepts may be possible that are allowed to start with the same characters
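A minimal sketch of the two-level idea, under heavy simplification: the top-level machine only tracks which concept may come next, and each concept has its own matcher, here just a set of literal strings. The concept names, the tiny vocabularies and the transitions are invented for illustration and are not the real grammar.

```python
# Lower level: one matcher per concept (a set of literals stands in for a
# real per-concept state machine).
CONCEPTS = {
    "multiplier": {"di", "tri"},
    "substituent": {"methyl", "ethyl", "chloro"},
    "acyclicGroup": {"methane", "ethane", "propane"},
}

# Upper level: state -> list of (concept to match, next state).
TOP_LEVEL = {
    "start": [
        ("multiplier", "afterMultiplier"),
        ("substituent", "start"),
        ("acyclicGroup", "end"),
    ],
    "afterMultiplier": [("substituent", "start")],
    "end": [],
}
ACCEPTING = {"end"}


def matches(name, state="start", pos=0):
    """Return True if name[pos:] can be consumed starting from `state`."""
    if pos == len(name):
        return state in ACCEPTING
    for concept, next_state in TOP_LEVEL[state]:
        # Several concepts may share leading characters ("ethyl"/"ethane"),
        # so every alternative is tried; this is the speed cost noted above.
        for literal in CONCEPTS[concept]:
            if name.startswith(literal, pos) and matches(
                name, next_state, pos + len(literal)
            ):
                return True
    return False


print(matches("dichloromethane"))  # multiplier, substituent, acyclicGroup
print(matches("methyl"))           # substituent only, never reaches "end"
```

The "substituent" matcher is defined once but is reachable from two top-level states, which is the duplication a single flat state machine would have to pay for.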

Page 10: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

State machine revisited

[Figure: number of states required (y-axis, 0 to 14,000,000) plotted against recall on names from the MayBridge catalogue (x-axis, 0.82 to 1)]

Page 11: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

Grammar inheritance

• Molecule grammar serves as a good starting point for a substituent grammar or generic chemical grammar

– Inherit rules rather than duplicate them

– Allow overriding of rules

pluralisedChemical : chemical 's'

elementaryMetalAtom : 'lanthanide'|'lanthanoid'|'transition metal'|'transuranic element' | _elementaryMetalAtom
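One way to picture rule inheritance and overriding is with grammars held as plain dictionaries. In this sketch, which is an assumption about the mechanism rather than the actual implementation, a child grammar copies its parent's rules, and inside an overriding rule the symbol `_name` refers to the parent's version of `name`, as in the elementaryMetalAtom example above. The parent rule contents ('lithium', 'sodium') are placeholders.

```python
def inherit(parent, overrides):
    """Build a child grammar from a parent grammar plus new or overriding rules."""
    child = dict(parent)
    for name, alternatives in overrides.items():
        if name in parent:
            # Keep the parent's definition reachable as '_<name>'.
            child["_" + name] = parent[name]
        child[name] = alternatives
    return child


def expand(grammar, symbol):
    """List every string a symbol can produce (finite, non-recursive toy grammars only).

    A symbol that is not a rule name is treated as a literal.
    """
    if symbol not in grammar:
        return [symbol]
    results = []
    for alternative in grammar[symbol]:
        partials = [""]
        for part in alternative:
            partials = [p + s for p in partials for s in expand(grammar, part)]
        results.extend(partials)
    return results


# Placeholder parent grammar; each rule is a list of alternatives,
# each alternative a sequence of symbols.
molecule_grammar = {
    "elementaryMetalAtom": [["lithium"], ["sodium"]],
    "chemical": [["elementaryMetalAtom"]],
}

generic_grammar = inherit(
    molecule_grammar,
    {
        "elementaryMetalAtom": [
            ["lanthanide"],
            ["transition metal"],
            ["_elementaryMetalAtom"],
        ],
        "pluralisedChemical": [["chemical", "s"]],
    },
)

print(expand(generic_grammar, "pluralisedChemical"))
```

The "chemical" rule is inherited untouched, yet it picks up the overridden elementaryMetalAtom, so the generic grammar accepts both the new class terms and everything the molecule grammar already accepted.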

Page 12: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

Unconventional entities #1

• Formulae:

– Sum formulae

• C20H25NO6

– Line formulae

• CH3CH2CH2Cl (complete molecule)

• CH2CH2 (linker)

• CH3CH2 (substituent)

– Salts

• MgSO4
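As a rough illustration of formula recognition, the sketch below tokenises a formula into element symbols and counts. It only recovers composition; resolving a line formula to a connected structure needs far more than this. The element set is a small illustrative subset and `parse_formula` is a hypothetical helper name.

```python
import re
from collections import Counter

# Small illustrative subset of element symbols, not the full periodic table.
ELEMENTS = {"H", "C", "N", "O", "S", "P", "F", "Cl", "Br", "I", "Na", "Mg", "K"}

# One element symbol followed by an optional count.
TOKEN = re.compile(r"([A-Z][a-z]?)([0-9]*)")


def parse_formula(text):
    """Return {element: count} for a formula, or None if it does not parse."""
    counts = Counter()
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None or match.group(1) not in ELEMENTS:
            return None
        counts[match.group(1)] += int(match.group(2) or 1)
        pos = match.end()
    return dict(counts) if counts else None


print(parse_formula("C20H25NO6"))    # sum formula
print(parse_formula("CH3CH2CH2Cl"))  # line formula, composition only
print(parse_formula("MgSO4"))        # salt
print(parse_formula("Hello"))        # ordinary word, rejected
```

Checking symbols against a known element list is what stops capitalised English words from being read as formulae.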

Page 13: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

Unconventional entities #2

• Peptide formulae

– Cys-Tyr-Phe-Gln-Asn-Cys-Pro-Arg-Gly-NH2

• Oligosaccharides

– α-L-Fucp-(1→4)-[β-D-Galp-(1→3)]-β-D-GlcpNAc-(1→3)-β-D-Galp-(1→4)-D-Glc-ol

• Oligonucleotides

– 3'-AATG-5'

Page 14: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

Unconventional entities #3

• Patent numbers

– U.S. Pat. No. 6,677,355

• Journal references

– (1974) J. Biol. Chem. 249, 4250-4256

• CAS numbers

– 90-13-1

• InChI and SMILES
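CAS numbers are a good example of an entity that a grammar can recognise and then validate: the final digit is a checksum over the preceding digits. The sketch below applies the standard CAS check-digit rule (weight the digits 1, 2, 3, ... from the right, sum, take the result modulo 10); the function name `is_valid_cas` is illustrative.

```python
import re

# Two to seven digits, two digits, one check digit.
CAS_PATTERN = re.compile(r"([0-9]{2,7})-([0-9]{2})-([0-9])")


def is_valid_cas(text):
    """Return True if text has CAS number syntax and a correct check digit."""
    match = CAS_PATTERN.fullmatch(text)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Rightmost digit has weight 1, the next weight 2, and so on.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == int(match.group(3))


print(is_valid_cas("90-13-1"))  # the example from the slide
print(is_valid_cas("90-13-2"))  # same digits, wrong check digit
```

For 90-13-1 the weighted sum is 3×1 + 1×2 + 0×3 + 9×4 = 41, and 41 mod 10 = 1, which matches the final digit.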

Page 15: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

navigating

Page 16: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

Fast spelling correction

• Historically we have used Levenshtein-like distance measures (all possible corrections)

• Only use spelling correction when recognition fails

• Allow a certain level of “look behind”

– 13 characters empirically found to yield identical results

– Speeds up spelling correction by ~80%

• Dictionary of common English words can be used to prevent attempting spelling correction
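The control flow on this slide can be sketched as follows. This is a simplified stand-in: a small set of known names replaces the grammar, and plain Levenshtein distance replaces the backtracking search with bounded look-behind, so it shows only the "correct on failure, but never correct English words" logic. All names and word lists are illustrative.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,                        # deletion
                    current[j - 1] + 1,                     # insertion
                    previous[j - 1] + (char_a != char_b),   # substitution
                )
            )
        previous = current
    return previous[-1]


KNOWN_NAMES = {"benzene", "propane", "pyridine"}  # stand-in for the grammar
ENGLISH_WORDS = {"profane", "the", "routine"}     # stand-in English dictionary


def recognise(word, max_distance=1):
    """Return the recognised (possibly corrected) name, or None."""
    if word in KNOWN_NAMES:
        return word          # recognition succeeded, no correction attempted
    if word in ENGLISH_WORDS:
        return None          # common English word: do not try to correct it
    distance, best = min((levenshtein(word, name), name) for name in KNOWN_NAMES)
    return best if distance <= max_distance else None


print(recognise("pyridlne"))  # OCR-style error, one substitution away
print(recognise("profane"))   # one edit from "propane" but blocked as English
```

Without the English-word check, "profane" would be silently "corrected" to "propane", which is the kind of false positive the dictionary is there to prevent.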

Page 17: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

Words Ignored for spelling correction (gray)

Page 18: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

Exceptions to local errors

• Whether a space is allowed may only be decidable once the suffix of a chemical name is encountered

propyl bromochloromethanol

propylbromochloromethanol

propyl bromochloromethanoate

19 character look behind required!

Page 19: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

BioCreative IV

• CHEMDNER (Chemical compound and drug name recognition task)

• 10000 annotated PubMed abstracts (3500 for training, 3500 for development and 3000 for testing)

• Deadline for submission: This Thursday

Page 20: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

Typical annotated Abstract

Page 21: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

Dictionaries… bigger is better

• For high recall of trivial names, dictionaries with high coverage are required.

• The largest publicly available dictionary is PubChem with over 94 million terms

• However most of these terms are either not useful or actually detrimental to text mining

Page 22: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

Aggressive filtering

• “what you don't see won't hurt you”

• Hence remove terms that are also English words or that start with an English word

– Accomplished using a large English dictionary with chemistry terms removed

• Remove internal identifiers used by depositors

• Remove terms that are matched by our grammars

• Ultimate result: 94 million → fewer than 3 million
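The filtering rules above could be strung together roughly as below. Everything concrete here is an assumption made for illustration: the English word list, the depositor-identifier pattern and the `grammar_matches` stand-in are not the real resources.

```python
import re

ENGLISH = {"lead", "white", "red", "gold"}  # stand-in English dictionary

# Hypothetical shape of a depositor identifier: letters, optional separator, digits.
DEPOSITOR_ID = re.compile(r"[A-Z]{2,10}[- ]?[0-9]{3,}")


def grammar_matches(term):
    """Stand-in for 'already recognised by our grammars'."""
    return term.endswith("ane")


def keep(term):
    """Return True if a dictionary term survives all three filters."""
    words = term.lower().split()
    if not words or words[0] in ENGLISH:
        return False  # is, or starts with, an English word
    if DEPOSITOR_ID.fullmatch(term):
        return False  # internal identifier used by a depositor
    if grammar_matches(term):
        return False  # redundant: the grammar already finds it
    return True


terms = ["aspirin", "lead", "white phosphorus", "ABC-0012345", "hexane", "santalol"]
print([term for term in terms if keep(term)])
```

The filter is deliberately lossy: "white phosphorus" is a real chemical but is dropped because it starts with an English word, trading some recall for fewer false positives.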

Page 23: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

Structure Aware filtering

• “Do not tag proteins, polypeptides (> 15aa), nucleic acid polymers, polysaccharides, oligosaccharides [tetrasaccharide or longer] and other biochemicals.”

• About 40,000 polypeptides and oligosaccharides excluded from PubChem using these criteria

Page 24: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

Entity Extension

• Even PubChem is far from comprehensive, hence it can be useful to extend the start and/or end of entities to avoid partial hits

– α-santalol can be recognised from santalol in the dictionary

• Extension is bracketing aware and blocked by English words

• Entity trimming also performed to comply with the annotation guidelines

– ‘Allura Red AC dye’ → ‘Allura Red AC’

Page 25: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

Entity Merging

• Adjacent entities may actually form a single entity

– Ethyl ester → one entity

– (+)-limonene epoxide → one entity

BUT

– Hexane-benzene → two entities

Page 26: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

Using an ontology to determine when terms add information

• Genistein isoflavone → two entities

• Glycine ester → one entity

[Figure: genistein, showing the isoflavone core structure]

Page 27: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

Abbreviation detection

• Based on the Schwartz and Hearst algorithm

• Detects abbreviations of the following forms:

– Tetrahydrofuran (THF)

– THF (tetrahydrofuran)

– Tetrahydrofuran (THF;

– (tetrahydrofuran, THF)

– THF = tetrahydrofuran

Schwartz, A.; Hearst, M. Proceedings of the Pacific Symposium on Biocomputing 2003.
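The core of the Schwartz and Hearst approach is a right-to-left character match between the short form and the text before it, with the first character of the short form required to start a word. The sketch below follows that matching step and handles only the "long form (SHORT)" and "long form (SHORT;" patterns; the original algorithm's limits on candidate length are omitted, and the wrapper `find_definitions` and its regular expression are illustrative assumptions.

```python
import re


def best_long_form(short, text):
    """Return the shortest tail of `text` that can expand `short`, or None."""
    s = len(short) - 1
    t = len(text) - 1
    while s >= 0:
        char = short[s].lower()
        if not char.isalnum():
            s -= 1
            continue
        # Walk left until the character matches; the first character of the
        # short form must additionally sit at the start of a word.
        while (t >= 0 and text[t].lower() != char) or (
            s == 0 and t > 0 and text[t - 1].isalnum()
        ):
            t -= 1
        if t < 0:
            return None
        t -= 1
        s -= 1
    start = text.rfind(" ", 0, t + 1) + 1
    return text[start:]


def find_definitions(sentence):
    """Find (short form, long form) pairs of the form 'long form (SHORT)'."""
    pairs = []
    for match in re.finditer(r"\(([^()\s;,]{2,10})[;)]", sentence):
        short = match.group(1)
        long_form = best_long_form(short, sentence[: match.start()].rstrip())
        if long_form:
            pairs.append((short, long_form))
    return pairs


print(find_definitions("The residue was dissolved in tetrahydrofuran (THF)."))
print(find_definitions("We follow current good manufacturing practice (cGMP)."))
```

The second example shows why the pairs matter downstream: once cGMP is bound to a non-chemical long form in a document, later mentions of cGMP should not be tagged as a chemical.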

Page 28: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

Anti-abbreviation detection

• Finds entities detected as abbreviations of unrecognised entities

– Can mean a common chemical abbreviation has been redefined in the scope of the document

current good manufacturing practice (cGMP)

cGMP = Cyclic guanosine monophosphate [structure diagram]

Page 29: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

Grammars used

• Systematic molecule

• Systematic prefix

• Systematic generic name

• Registry number

• CAS number

• Chemical formulae

• Systematic polymer

• Semi systematic chemical name

– Systematic prefix + common trivial name/name from PubChem

Page 30: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

Dictionaries used

• Noise words e.g. lead

• Trivial polymer

• Generic chemical terms (some from ChEBI)

• Common abbreviations

• Common trivial names

• Filtered PubChem

• Alloys

• Allotropes

• Minerals

Page 31: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

Making the most of the knowledge provided

• Use training data to identify terms that are not currently recognised (a whitelist)

• Identify terms that are often false positives (a blacklist)

• Each false positive and false negative is placed into such a list if its inclusion increases the F-score (the harmonic mean of precision and recall)
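The list-building rule can be sketched as a greedy loop: try each candidate term, recompute the F-score on the training data, and keep the term only if the score goes up. The counts and candidate terms below are invented purely to make the example run; they are not taken from the CHEMDNER data.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f_from_counts(tp, fp, fn):
    """F-score from true positive, false positive and false negative counts."""
    if tp == 0:
        return 0.0
    return f_score(tp / (tp + fp), tp / (tp + fn))


def build_lists(tp, fp, fn, whitelist_candidates, blacklist_candidates):
    """Greedily accept candidates that raise the F-score.

    Each candidate maps to (hits, misses): how often the term is a correct
    annotation and how often it is not, in the training data.
    """
    whitelist, blacklist = [], []
    best = f_from_counts(tp, fp, fn)
    # Whitelisting a missed term: its hits become TPs, its misses become FPs.
    for term, (hits, misses) in whitelist_candidates.items():
        trial = f_from_counts(tp + hits, fp + misses, fn - hits)
        if trial > best:
            whitelist.append(term)
            tp, fp, fn, best = tp + hits, fp + misses, fn - hits, trial
    # Blacklisting a recognised term: its misses stop being FPs, its hits become FNs.
    for term, (hits, misses) in blacklist_candidates.items():
        trial = f_from_counts(tp - hits, fp - misses, fn + hits)
        if trial > best:
            blacklist.append(term)
            tp, fp, fn, best = tp - hits, fp - misses, fn + hits, trial
    return whitelist, blacklist, best


# Invented counts for illustration only.
white, black, final = build_lists(
    tp=80, fp=20, fn=20,
    whitelist_candidates={"NO": (5, 1), "lead": (1, 10)},
    blacklist_candidates={"gold": (1, 8), "water": (10, 1)},
)
print(white, black, round(final, 3))
```

A term that is right only occasionally ("lead" as a whitelist candidate here) is rejected because the extra false positives cost more precision than the extra hits gain in recall.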

Page 32: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

Results (on development set)

Configuration          Precision  Recall  F-score
Baseline               0.87       0.82    0.84
WhiteList              0.86       0.85    0.85
BlackList              0.88       0.80    0.84
WhiteList + BlackList  0.87       0.83    0.85

Page 33: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

Future work

• Typically we are focused on generating structures from the entities we recognise

– Line formula parsing

– Generic chemical name parsing (difficult to do in a way that the results are not tied to a particular toolkit)

• Grammars serve as an excellent starting point for writing parsers

Page 34: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

Conclusions

• Two level state machines allow many complicated grammars to be represented by far fewer states

• Backtracking spelling correction can provide significant speed improvements without affecting recall

• Check out our blog (nextmovesoftware.co.uk/blog) in a couple of weeks to find out how we did in BioCreative!

Page 35: Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities

Tackling the difficult areas of chemical entity extraction:

Misspelt chemical names and unconventional entities

Thank you for your attention