Analysing and Classifying Names of Chemical Compounds with CHEMorph

Stefanie Anstein, Gerhard Kremer

IMS, University of Stuttgart

April 11, 2006

Introduction System Details Conclusion

Example Analysis

[Figure: structural formula CH3-C(=O)-CH2-CH2-CH2-CH2-CH2-OH]

name: 7-hydroxyheptan-2-one

semantic representation: compd(ane(7*C),pref([1*[7]-hydroxy]),suff([1*[2]-one]))

SMILES string: CC(=O)CCCCCO

classes: ALCOHOL, KETONE, ...

Stefanie Anstein, Gerhard Kremer CHEMorph 2 / 13

Motivation & Background

life sciences ... and the amount of biomedical data

terminology ... and biochemical nomenclature

Challenges

term reference

coreferences

R-0.1.7.3 (IUPAC nomenclature of organic compounds):

Addition of the vowel “o”. For euphonic reasons, the vowel “o” is sometimes inserted between consonants.

Modules Overview

name → parser → semantic representation

semantic representation → SMILES string generator → SMILES string

semantic representation → classifier → classes

Name Types

                  fully specified                underspecified
systematic        7-hydroxyheptan-2-one          heptanone
trivial           benzene                        ∅
semi-systematic   benzene-1,3,5-triacetic acid   dihydrobenzene
class             ∅                              alcohol
semi-systematic   ∅                              2-deoxysugar

Parser

input: 7 - hydroxy hept an - 2 - one

[Parse tree; nodes listed as category: semantic value]

organic compound: compd( ane(7*C) , pref( [??*[7]-hydroxy] ) , suff( [??*[2]-one] ) )

prefix: [??*[7]-hydroxy]
locant: ??*[7]
loc: [7]
hyphen: ∅
pref: hydroxy

parent nonsugar: ane(7*'C')
mult: 7
parent suffix: λ(X,ane(X*'C'))

suffix: [??*[2]-one]
hyphen: ∅
locant: ??*[2]
loc: [2]
hyphen: ∅
suff: one
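The mapping from a name to a compd(...) term can be illustrated with a small Python sketch. This is not the CHEMorph grammar, which the slides do not show; it is a regular-expression approximation that only covers unbranched alkane parents with an optional hydroxy prefix and an optional -one or -ol suffix. The lexicon MULTIPLIERS, the pattern NAME and the function parse are hypothetical names introduced for this illustration.

```python
import re

# Hypothetical mini-lexicon (assumption): chain-length morphemes only.
MULTIPLIERS = {"meth": 1, "eth": 2, "prop": 3, "but": 4,
               "pent": 5, "hex": 6, "hept": 7, "oct": 8}

# Assumed name shape: [locant-]hydroxy + multiplier + "an" + ("e" | [-locant-]suffix)
NAME = re.compile(
    r"^(?:(?:(?P<ploc>\d+)-)?(?P<pref>hydroxy))?"
    r"(?P<mult>meth|eth|prop|but|pent|hex|hept|oct)an"
    r"(?:e|(?:-(?P<sloc>\d+)-)?(?P<suff>one|ol))$"
)


def parse(name):
    """Return a compd(...) term in the notation of the slides.

    Multiplicities are left as ?? (as on the parser slide), and a
    missing locant is also written as ??.
    """
    m = NAME.match(name)
    if m is None:
        raise ValueError(f"name not covered by this sketch: {name!r}")
    n = MULTIPLIERS[m["mult"]]
    pref = []
    if m["pref"]:
        pref.append(f"??*[{m['ploc'] or '??'}]-{m['pref']}")
    suff = []
    if m["suff"]:
        suff.append(f"??*[{m['sloc'] or '??'}]-{m['suff']}")
    return f"compd(ane({n}*C),pref([{','.join(pref)}]),suff([{','.join(suff)}]))"


print(parse("7-hydroxyheptan-2-one"))
print(parse("heptanone"))
```

For 7-hydroxyheptan-2-one the sketch reproduces the term shown on the slide (without the spaces); for the underspecified name heptanone the locant of the suffix stays open as ??.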

SMILES String Generator

representation of single chain elements

consistency check

underspecification:

underspecified( CC(=O)CCCCC , [{1,3,4,5,6,7}-hydroxy] )
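The three points above can be sketched in Python for unbranched chains. The slides do not say how CHEMorph implements the consistency check, so the sketch assumes it amounts to checking locant range and free carbon valences. The tables BOND_COST and FRAGMENT and the functions free_valences and generate are hypothetical names for this illustration.

```python
# Assumed morpheme tables: bonds consumed on the carbon, and SMILES fragment.
BOND_COST = {"hydroxy": 1, "ol": 1, "one": 2}
FRAGMENT = {"hydroxy": "O", "ol": "O", "one": "=O"}


def free_valences(n):
    """Free valences of each carbon in an unbranched chain of n carbons."""
    if n == 1:
        return [4]
    return [3] + [2] * (n - 2) + [3]


def generate(n, substituents):
    """Build a SMILES string for an n-carbon chain.

    substituents is a list of (locant, group); locant None means the
    position is not given in the name (underspecification).
    """
    free = free_valences(n)
    attached = [[] for _ in range(n)]
    open_groups = []
    for locant, group in substituents:
        if locant is None:
            open_groups.append(group)
            continue
        # Consistency check (assumed meaning): locant range and valence.
        if not 1 <= locant <= n:
            raise ValueError(f"locant {locant} outside chain of length {n}")
        if free[locant - 1] < BOND_COST[group]:
            raise ValueError(f"no free valence for {group} at position {locant}")
        free[locant - 1] -= BOND_COST[group]
        attached[locant - 1].append(FRAGMENT[group])
    parts = []
    for i, frags in enumerate(attached):
        parts.append("C")  # each chain element is represented on its own
        for j, frag in enumerate(frags):
            is_last = i == n - 1 and j == len(frags) - 1
            parts.append(frag if is_last else f"({frag})")
    smiles = "".join(parts)
    if not open_groups:
        return smiles
    options = []
    for group in open_groups:
        candidates = [str(i + 1) for i in range(n) if free[i] >= BOND_COST[group]]
        options.append("{" + ",".join(candidates) + "}-" + group)
    return f"underspecified( {smiles} , [{','.join(options)}] )"


print(generate(7, [(7, "hydroxy"), (2, "one")]))
print(generate(7, [(None, "hydroxy"), (2, "one")]))
```

With a hydroxy group of unknown position on heptan-2-one, position 2 is excluded because the ketone carbon has no free valence left, which yields the candidate set {1,3,4,5,6,7} shown on the slide.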

Classifier

morpheme          class
hydroxy- | -ol    ALCOHOL
cyclo- & -ane     CYCLOALKANE

compd( ane(7*C) , pref([1*[7]-hydroxy]) , suff([1*[2]-one]) )

→ ALKANE, ALCOHOL, KETONE

compd( ene(??*[??],ane(4*'C')) , pref([]) , suff([]) )

→ ALKENE
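The morpheme-to-class table can be read as a small rule set, sketched here in Python over sets of morphemes. Only the ALCOHOL and CYCLOALKANE rules come from the table; the ALKANE, ALKENE and KETONE rules are assumptions read off the two examples, and excluding cyclo- from ALKANE is a further assumption. RULES and classify are hypothetical names, not CHEMorph's own.

```python
# Each rule pairs a class label with a test on the set of morphemes.
RULES = [
    # Assumption: ALKANE requires -ane and no cyclo- (acyclic chains only).
    ("ALKANE", lambda m: "ane" in m and "cyclo" not in m),
    # Assumption, read off the second example on the slide.
    ("ALKENE", lambda m: "ene" in m),
    # From the table: cyclo- & -ane.
    ("CYCLOALKANE", lambda m: "cyclo" in m and "ane" in m),
    # From the table: hydroxy- | -ol.
    ("ALCOHOL", lambda m: "hydroxy" in m or "ol" in m),
    # Assumption, read off the first example on the slide.
    ("KETONE", lambda m: "one" in m),
]


def classify(morphemes):
    """Return every class whose rule matches the given morphemes."""
    m = set(morphemes)
    return [label for label, test in RULES if test(m)]


print(classify({"hydroxy", "hept", "ane", "one"}))
print(classify({"but", "ene"}))
print(classify({"cyclo", "hex", "ane"}))
```

Because every matching rule fires, a compound can belong to several classes at once, as in the first example on the slide.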

Results & Applications

SMILES string and classification

underspecification

term reference

coreference resolution

database curation and ontology acquisition

Conclusion & Outlook

feasible, extendable and transferable approach

extend grammar and lexicon

elaborate SMILES and classification

sophisticated linguistic analysis → database curation

term identification → text processing applications

Acknowledgements

Stefanie Anstein

Uwe Reyle

Jasmin Saric

EML Research gGmbH

Thank you very much.

IUPAC Nomenclatures

Amino Acids and Peptides     EC 5 Isomerases            Phosphorus containing compounds
Biochemical thermodynamics   EC 6 Ligases               Polymerized amino acids
Branched nucleic acids       Folic acid                 Polypeptide conformation
Carbohydrates                Glycolipids                Polynucleotide conformation
Carotenoids                  Glycoproteins              Polysaccharide conformation
Corrinoids (vitamin B12)     myo-Inositol numbering     Prenol nomenclature
Cyclitols                    Lignan Nomenclature        Pyridoxal (vitamin B6)
Electron transport proteins  Lipid Nomenclature         Quinones w. an Isoprenoid Chain
Enzyme kinetics              Multienzymes               Retinoids
Enzyme nomenclature          Multiple forms of enzymes  Steroids
EC 1 Oxidoreductases         Nucleic acid constituents  Tetrapyrroles
EC 2 Transferases            Nucleic acid sequence      Tocopherols (vitamin E)
EC 3 Hydrolases              Organic Chemistry          Translation Factors
EC 4 Lyases                  Peptide hormones           Vitamin D

KEGG: Kyoto Encyclopedia of Genes and Genomes

7-HYDROXYHEPTAN-2-ONE

PRIMARY ALCOHOL

ALCOHOL

7-HYDROXYHEPTANE

HYDROXYHEPTANE

HEPTANE

7-HYDROXYALKANE

HYDROXYALKANE

7-HYDROXYKETONE

HYDROXYKETONE

HYDROXYHEPTAN-2-ONE

HEPTAN-2-ONE

KETONE

ALKANE