Evaluating Patent Full Text Documents with Chemical Ontologies

OntoChem IT Solutions GmbH, Blücherstr. 24, 06120 Halle (Saale), Germany
Tel. +49 345 4780472, Fax: +49 345 4780471, mail: info(at)ontochem.com

• spin-out from OntoChem GmbH

• started 1 July 2015

• 15 chemists, bioinformaticians, biologists, linguists, pharmacists

• extracting knowledge from documents, selling software & services

3

What are Ontologies ?

Computer readable, formal representation of knowledge...

describe relationships between knowledge concepts:

aspirin „is a“ benzoic acid „is a“ carboxylic acid

acetyl salicylic acids

can be used to infer, extract, search, sort and analyse knowledge
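
The "is a" chain on this slide can be sketched in a few lines of Python. This is only an illustration: the `IS_A` dictionary, the `DOCS` collection and both function names are hypothetical and not part of any OntoChem product. It shows how a search for a class can also find documents that mention only a member of that class.

```python
# Toy "is a" ontology: concept -> list of parent concepts (hypothetical data).
IS_A = {
    "aspirin": ["benzoic acid"],
    "benzoic acid": ["carboxylic acid"],
    "acetic acid": ["carboxylic acid"],
}


def is_a(concept, target):
    """Return True if `concept` equals `target` or reaches it via "is a" links."""
    if concept == target:
        return True
    return any(is_a(parent, target) for parent in IS_A.get(concept, []))


def search(documents, query_concept):
    """Return ids of documents mentioning the query concept or one of its descendants."""
    return [
        doc_id
        for doc_id, concepts in documents.items()
        if any(is_a(concept, query_concept) for concept in concepts)
    ]


# Hypothetical documents, each reduced to the concepts annotated in it.
DOCS = {
    "doc1": ["aspirin"],
    "doc2": ["acetic acid"],
    "doc3": ["glucose"],
}

print(search(DOCS, "carboxylic acid"))  # ['doc1', 'doc2']
print(search(DOCS, "benzoic acid"))     # ['doc1']
```

A plain keyword search for "carboxylic acid" would miss both documents; the inferred "is a" relation is what makes them retrievable.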

4

Chemical Ontologies...

ChEBI (Chemical Entities of Biological Interest, https://www.ebi.ac.uk/chebi/) has about 40,000 manually classified compounds

MeSH (Medical Subject Headings) ... PubChem

5

Classifying Chemistry: SODIAC

SODIAC (Structure based Ontology Development and Individual Assignment Center): automated compound classification software

• ontology editor, OBO specification conformity

• definition of compound classes via SMARTS

• chemical structure editor

• sub-structure AND, OR and NOT logic for compound to class assignment

• chemistry error detection

• chemical hierarchy construction
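
The AND, OR and NOT logic for class assignment could look roughly like the sketch below. It is not SODIAC code. SODIAC defines classes via SMARTS patterns, whereas this sketch assumes the sub-structure matching has already happened and each compound is reduced to a set of matched pattern names. The rule grammar, `CLASS_RULES` and all pattern names are hypothetical.

```python
# Rule grammar (hypothetical): a string names a sub-structure pattern;
# a tuple combines rules with "and", "or" or "not".
def matches(rule, found):
    """Evaluate a rule against the set of sub-structures found in a compound."""
    if isinstance(rule, str):
        return rule in found
    op, *args = rule
    if op == "and":
        return all(matches(arg, found) for arg in args)
    if op == "or":
        return any(matches(arg, found) for arg in args)
    if op == "not":
        return not matches(args[0], found)
    raise ValueError(f"unknown operator: {op}")


# Hypothetical class definitions.
CLASS_RULES = {
    "carboxylic acids": "carboxyl group",
    "aromatic carboxylic acids": ("and", "carboxyl group", "benzene ring"),
    "aliphatic carboxylic acids": ("and", "carboxyl group", ("not", "benzene ring")),
    "halogen compounds": ("or", "C-Cl bond", "C-Br bond"),
}


def classify(found):
    """Return the sorted names of all classes whose rule the compound satisfies."""
    return sorted(name for name, rule in CLASS_RULES.items() if matches(rule, found))


# Sub-structures assumed to have been matched in aspirin.
aspirin = {"carboxyl group", "benzene ring", "ester group"}
print(classify(aspirin))  # ['aromatic carboxylic acids', 'carboxylic acids']
```

In a real system the membership test `rule in found` would be replaced by a SMARTS match from a cheminformatics toolkit.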

6

Classifying Chemistry: SODIAC

AND/OR logic to assign Vitamin C derivatives:

• described in different tautomeric forms in databases

• logic needed for classifying correct stereochemistry in substituted compounds

[Figure: sub-structure patterns combined with AND and OR operators to define the concept "Vitamin C derivatives"]

7

Classifying Chemistry: not straightforward...

structural chemical ontologies are often not based on sub-structures !

Progesterone | 19-Norprogesterone (4-8× more active)

class: Gestagens | class: Gestagens>Progestins

Pregnane (female hormones) | Androstane (male hormones)

class: Gonans>Pregnans | class: Gonans>Estrans

DrugBank & ChEBI: Progestin, a synthetic progestogen

ChEBI: corticosteroid hormone

[Figure annotations: parent & SSS; not parent but SSS (twice); same family; different family]

8

Chemistry Ontologies

Organic chemistry

7,586 class concepts, 29,709 class terms

3,185 concepts linked to ChEBI concepts

2,465 concepts linked to MeSH concepts

68 million concepts linked to PubChem

Inorganic materials

52.4209 concepts, 56,332 terms

Groups-substituents-fragments

4,428 concepts, 12,754 terms

Substances

989 concepts, 3,522 terms

Polymers

2,361 concepts, 7,176 terms

9

Classifying Chemistry: Example

Acetylsalicylic acid (SODIAC v2.5.2)

Direct Parents:

aromatic compounds, benzenes, carbon compounds, carboxylic acids, ethanoic acid esters, methyl esters, monocyclic compounds, oxygen compounds, salicylic acid derivatives

bioavailable molecules, hydrophilic molecules, lead like molecules, Lipinski molecules, small molecules

CHEBI:15365; MeSH:D001241

Ancestors:

6-membered carbocycles, 6-membered cyclic compounds, acetic acid derivatives, acids, carbocycles, carbon group compounds, carbonyl compounds, carboxylic acid derivatives, carboxylic acid esters, chalcogen compounds, cyclic compounds, esters, fatty acyls, fatty esters, lipids, monocarboxylic acid derivatives, monocyclic carbocycles, organic acids, organic compounds, organic esters, salicylic acid derivatives, short chain fatty acid esters
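
The split into direct parents and ancestors can be illustrated with a small graph walk. This is a minimal sketch, not SODIAC's algorithm: only the direct parents are stored, and everything further up is derived by following parent links. The hierarchy excerpt in `PARENTS` is hypothetical and much smaller than the real classification of acetylsalicylic acid.

```python
# Hypothetical excerpt of a class hierarchy: class -> direct parent classes.
PARENTS = {
    "acetylsalicylic acid": ["carboxylic acids", "benzenes"],
    "carboxylic acids": ["organic acids", "carbonyl compounds"],
    "organic acids": ["acids", "organic compounds"],
    "carbonyl compounds": ["organic compounds"],
    "benzenes": ["monocyclic carbocycles"],
    "monocyclic carbocycles": ["carbocycles"],
    "carbocycles": ["cyclic compounds"],
}


def ancestors(concept):
    """Return all classes reachable above the direct parents of `concept`."""
    seen = set()
    stack = list(PARENTS.get(concept, []))  # start from the direct parents
    while stack:
        current = stack.pop()
        for parent in PARENTS.get(current, []):
            if parent not in seen:  # visit each class once, even if reached twice
                seen.add(parent)
                stack.append(parent)
    return sorted(seen)


print(PARENTS["acetylsalicylic acid"])    # direct parents
print(ancestors("acetylsalicylic acid"))  # inferred ancestors
```

Storing only direct parents keeps the ontology compact; the ancestor list is recomputed whenever the hierarchy changes.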

10

Basic Biology Ontologies

Genes, Proteins & Peptides

annotation version: 708,141 concepts, 2,627,612 terms

classification version: 832,902 concepts, 3,177,057 terms

with linkouts to GO, InterPro, HomoloGene, HUGO, KEGG, UniProt ...

Diseases

SNOMED-CT, MedDRA, ICD-9, ICD-10, HDO, UMLS, LOINC, MeSH

annotation version: 105,824 concepts, 360,077 terms

Species

based on NCBI, GRIN, IPNI, Cornucopia, World Economic Plants ...

annotation version: 1,012,634 concepts, 1,664,042 terms

Anatomy

different species and stage dependent ontologies available

general anatomy: 4,773 concepts, 19,450 terms

11

Other Biology Ontologies

Cell lines

5,566 concepts, 13,083 terms

Cosmetology

1,187 concepts, 2,017 terms

Effects

35,477 concepts, 111,012 terms

Nutrition

19,193 concepts, 115,699 terms

Physiology

533 concepts, 619 terms

Toxicology

1,019 concepts, 2,150 terms

12

Other Ontologies

Countries

annotation version: 245 concepts, 85,069 terms

Companies

annotation version: 26,388 concepts, 5,757 terms

Material properties

annotation version: 1,081 concepts, 2,428 terms

Methods

annotation version: 2,502 concepts, 10,053 terms

Regions & Geopolitics

annotation version: 3,774 concepts, 13,356 terms

Relations

annotation version: 603 concepts, 2,290 syntaxes

13

General Ontologies

Wikipedia

annotation version: 5,200,842 concepts, 11,490,831 terms

Magnitudes & Units

annotation version: 228 concepts, 510 terms

Persons

annotation version: >1,000,000 persons

Relations

annotation version: 603 concepts, 2,290 syntaxes

14

Understanding Patents with Ontologies

NLP for patents poses some unique challenges:

• multilingual

• poor OCR (optical character recognition)

• multi-disciplinary

• many: >90 million full text documents from >110 patent offices

• large: up to 500 pages, with sentences spanning >20 pages

• obscure: hand drawings, unclear language

15

Understanding Patents

Collaboration with infoapps GmbH (Munich)

Standard full text data

US, EP, DE, WO, AT, CH, BE, CA, ES, FR, GB, MA.

Standard full text data

AR, BR, CN, DK, FI, ID, EI, EN, JP, KR, MX, MY, NL, NO, RU, SE, TH, TW, VN.

Original full text data

Machine/human translation (EN)

AR, AT, BE, BR, CA, CH, CN, DE, DK, EP, ES, FI, FR, ID, JP, KR, MX, NL, NO, RU, SE, TH, TW, VN, WO.

16

OCMiner® UIMA Pipeline

[Figure: pipeline diagram with the following components]

Input and readers: picture PDF (OCR), text PDF (PDF reader), XML doc (XML reader), Office doc (Office reader)

Document preparation: identify document type, document classifier, XML detagger, language detector, normalize text, tokenize text, document structure

Annotators: acronym/abbreviation detector, person annotator, domain annotators 1…n

Chemistry annotator: dictionary, name-2-structure, formula & molpuzzler, class/group resolution, cleanup & rule combiner

Resolution: coordinated entity resolution, context handler, NE confidence, relationship extraction

Consumers: BRAT, index, XML
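
A staged pipeline of this kind can be sketched as a list of functions that each enrich a shared document object. The sketch below is not OCMiner and does not use UIMA. The stage names follow the slide, but `DICTIONARY`, the concept ids and the longest-match strategy in `annotate` are assumptions made for illustration.

```python
import re
import unicodedata

# Hypothetical mini dictionary: surface term -> (domain, concept id).
DICTIONARY = {
    "glucose": ("chemistry", "CHEM:glucose"),
    "glucose oxidase": ("protein", "PROT:glucose_oxidase"),
}
MAX_TERM_TOKENS = max(len(term.split()) for term in DICTIONARY)


def normalize(doc):
    """Normalize Unicode forms and case."""
    doc["text"] = unicodedata.normalize("NFKC", doc["text"]).lower()
    return doc


def tokenize(doc):
    """Split the text into word tokens."""
    doc["tokens"] = re.findall(r"\w+", doc["text"])
    return doc


def annotate(doc):
    """Dictionary annotator that prefers the longest matching term."""
    tokens, annotations, i = doc["tokens"], [], 0
    while i < len(tokens):
        for n in range(min(MAX_TERM_TOKENS, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in DICTIONARY:
                annotations.append((candidate, *DICTIONARY[candidate]))
                i += n  # skip the tokens consumed by this match
                break
        else:
            i += 1  # no term starts at this token
    doc["annotations"] = annotations
    return doc


PIPELINE = [normalize, tokenize, annotate]


def run(text):
    """Pass a document through every stage in order."""
    doc = {"text": text}
    for stage in PIPELINE:
        doc = stage(doc)
    return doc


result = run("An inhibitor of glucose oxidase activity.")
print(result["annotations"])
# [('glucose oxidase', 'protein', 'PROT:glucose_oxidase')]
```

Because the longest term wins, "glucose" inside "glucose oxidase" is not annotated as a chemical compound.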

17

Regular Names in Patents

BRAT (Goran Topić) file example:

Akhondi SA, Klenner AG, Tyrchan C, Manchala AK, Boppana K, Lowe D, Zimmermann M, Jagarlapudi SA, Sayle R, Kors JA, Muresan S. Annotated chemical patent corpus: a gold standard for text mining. PLoS One. 2014 Sep 30;9(9):e107477. doi: 10.1371/journal.pone.0107477.
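
BRAT stores annotations in a standoff `.ann` file next to the text. A text-bound annotation line holds an id, then a type with start and end character offsets, then the covered text, separated by tabs. The sketch below parses such lines. The sample sentence, the `Chemical` type name and the `TextBound` tuple are made up for illustration and are not taken from the cited corpus. Discontinuous spans are not handled.

```python
from typing import NamedTuple


class TextBound(NamedTuple):
    ann_id: str
    label: str
    start: int
    end: int
    text: str


def parse_ann(ann_content):
    """Parse text-bound ("T") lines of BRAT standoff; other line types are skipped.

    Assumes contiguous spans, i.e. exactly one start and one end offset per line.
    """
    result = []
    for line in ann_content.splitlines():
        if not line.startswith("T"):
            continue
        ann_id, type_and_span, text = line.split("\t", 2)
        label, start, end = type_and_span.split(" ")
        result.append(TextBound(ann_id, label, int(start), int(end), text))
    return result


# Hypothetical document text and its standoff annotations.
TEXT = "Aspirin and glucose were dissolved in tetrahydrofuran."
ANN = (
    "T1\tChemical 0 7\tAspirin\n"
    "T2\tChemical 12 19\tglucose\n"
    "T3\tChemical 38 53\ttetrahydrofuran\n"
)

for ann in parse_ann(ANN):
    # The offsets must reproduce the annotated string from the text.
    print(ann.ann_id, ann.label, TEXT[ann.start:ann.end] == ann.text)
```

Checking that the offsets reproduce the annotated string is a cheap way to detect text and annotation files that have drifted apart.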

18

Chemical Compound

5,7-bis(trifluoromethyl)-pyrazolo[1,5-a]pyrimidine-2-carbonitrile :

Chemical Class

pyrazolo[1,5-a]pyrimidines :

Chemical substituent + class

2-Bromo-, 2-fluoro-, and 2-chloro pyrazolo[1,5-a]pyrimidines:

Other Name Types in Patents
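
The "substituent + class" case is a coordinated expression: several substituent prefixes share one class name. A minimal sketch of expanding such an expression into separate class names is shown below. The function name and the splitting rules are assumptions; a production system would need real chemical name grammar rather than a regular expression.

```python
import re


def expand_coordinated_name(text):
    """Expand 'prefix-, prefix-, and prefix head' into one full name per prefix.

    Assumes the shared head is the last whitespace-separated token.
    """
    prefix_part, head = text.rsplit(" ", 1)
    # Split on commas (optionally followed by and/or) or on a bare and/or.
    parts = re.split(r",\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+", prefix_part)
    prefixes = [part.strip().rstrip("-") for part in parts if part.strip()]
    return [f"{prefix.lower()}{head}" for prefix in prefixes]


names = expand_coordinated_name(
    "2-Bromo-, 2-fluoro-, and 2-chloro pyrazolo[1,5-a]pyrimidines"
)
for name in names:
    print(name)
# 2-bromopyrazolo[1,5-a]pyrimidines
# 2-fluoropyrazolo[1,5-a]pyrimidines
# 2-chloropyrazolo[1,5-a]pyrimidines
```

Each expanded name can then be classified on its own, for example as a sub-class of pyrazolo[1,5-a]pyrimidines.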

19

Named Entities in Patents

extracting named entities (NE) from infoapps patents

from 19 million patents with chemistry, selected 4.7 million patents from 2001-2010 (publication year)

Ontology                term annotation count   unique concepts per doc   unique concepts
Chemistry                       1,465,510,682               294,771,572                 ?
Proteins                          204,902,329                30,167,344            67,993
Anatomy non-plants                126,856,048                21,192,154             2,378
Methods                           112,230,880                21,725,977             1,959
Species                           105,618,715                25,901,359            81,036
Diseases                           82,857,385                24,592,233            21,367
Physiology                         68,504,035                12,703,542               497
Nutrition                          59,367,731                12,839,777             3,861
Cosmetology                        23,465,151                 4,883,741               920
Anatomy plants, fungi              22,326,124                 4,212,548               802
Cell lines                          9,857,621                 2,325,743             2,079
Toxicity                            7,986,832                 2,858,977               423
Species plants, fungi               7,444,143                 2,345,605             7,347
Regions                             6,974,421                 2,781,913             1,040
Herbal drugs                          162,729                    46,830               131

20

Understanding Patents with Ontologies

21

3 reasons:

patent claims are „ontological“

background knowledge helps to extract the meaning of named entities

end user, using knowledge classifications

which natural product compound class is useful to treat inflammation of the skin?

Ontologies – Why ?

22

Patent claims are “ontological”

Patent classes & ad hoc classes:

e.g. chemical

„compounds according to claim 1“

„acyl-pyrrolopyridines“

any Markush structure, Patent classes etc

e.g. uses: „anti-infectives“ (e.g. antibacterial, antiviral, antiparasitic ... )

Chemical Ontologies – Why ?

23

ontology based NLP to extract the meaning of named entities

• ontology based context sensitive Named Entity resolution

...glucose... ...glucose oxidase... ...glucose oxidase activity...

finally: ...inhibitor of glucose oxidase activity...

• ontology based anaphora & cataphora resolution

Tetrahydrofuran is a commonly used solvent in organic ...

This cyclic ether has a melting point of -108.4 °C

• ontology based fingerprints

classifying documents, e.g. into patent classes

Chemical Ontologies – Why ?
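
The anaphora example on this slide ("This cyclic ether ...") can be resolved with an ontology lookup: the class term points back to the most recent mentioned compound that belongs to that class. The sketch below illustrates the idea under simple assumptions. The `IS_A` entries, the function names and the most-recent-mention heuristic are hypothetical.

```python
# Hypothetical "is a" links: concept -> parent classes.
IS_A = {
    "tetrahydrofuran": ["cyclic ether", "solvent"],
    "cyclic ether": ["ether"],
    "ethanol": ["alcohol", "solvent"],
}


def is_a(concept, target):
    """Return True if `concept` equals `target` or reaches it via "is a" links."""
    if concept == target:
        return True
    return any(is_a(parent, target) for parent in IS_A.get(concept, []))


def resolve_anaphor(class_term, previous_mentions):
    """Return the most recent earlier mention that belongs to `class_term`, else None."""
    for mention in reversed(previous_mentions):
        if is_a(mention, class_term):
            return mention
    return None


# Entities mentioned before the anaphor, in text order.
mentions = ["ethanol", "tetrahydrofuran"]
print(resolve_anaphor("cyclic ether", mentions))  # tetrahydrofuran
print(resolve_anaphor("alcohol", mentions))       # ethanol
print(resolve_anaphor("steroid", mentions))       # None
```

Without the ontology, "this cyclic ether" has no string overlap with "tetrahydrofuran", and the melting point could not be attached to the compound.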

24

3 BRAT parts of one document:

Ontology Based Property Extraction

25

Understanding Patent Claims Logic

high quality patent annotations need:

• annotated text corpus “Gold Set”

• background ontologies

Annotated between <chemistry> & <disease>: p=is_Active_Part_Of, i=is_Instance_Of.

Antje Schlaf, Claudia Bobach, Matthias Irmer: Creating a Gold Standard Corpus for the Extraction of Chemistry-Disease Relations from Patents. LREC 2014.

26

Enduser Application Examples

27

End User: Understanding Patents

Collaboration with infoapps GmbH (Munich): ChemAnalyser

28

End User: Understanding Patents

ChemAnalyser – causative relationship mining

31

End User: Patent Big Data Analytics

Hot Compounds, hot targets ?

L. Weber, T. Böhme, M. Irmer: Ontology-based content analysis of US patent applications from 2001–2010. Pharm. Pat. Analyst 2013, 2.

32

End User: Patent Big Data Analytics

enrichment factors for chemistry related diseases...

Columns: (1) cardiovascular system, (2) disease of mental health, (3) disease of metabolism, (4) respiratory system, (5) nervous system, (6) musculo-skeletal system, (7) reproductive system, (8) gastro-intestinal system, (9) immune system, (10) endocrine system

Chemistry Concept                                      (1)   (2)   (3)   (4)   (5)   (6)   (7)   (8)   (9)  (10)
prostaglandin F2β derivatives                          557     0     0     0   607   427     0     0   375     0
hallucinogens                                          494  1922   332   449   538   364  3146   622   199  1901
cichoric acid                                          821  1662   432  1625   509   652 11623  1480   604  7239
alpha 1-adrenoceptor agonist                           821     0   267  1736   501   611  8684  1014   543  5636
pregn-4,9(11)-enes                                     398   256   231   450   491   386     0   467   317  1296
canrenoic acids                                        771  1343   425  1180   473   534  8474  1260   459  4960
aconitane derivatives                                    0  1785   205     0   458   257     0     0     0     0
pseudoalkaloid derivatives                               0  1778   204     0   456   256     0     0     0     0
diterpene alkaloid derivatives                           0  1778   204     0   456   256     0     0     0     0
13,14-dihydro-15-keto-prostaglandin D2 derivatives     651     0   213  1831   447   482     0  1188   521  3956
ripisartan derivatives                                 953     0   351     0   436   411     0     0   409     0
potassium-sparing diuretics                            896  1387   399  1156   425   496  6456  1218   501  3863
steroid acids                                          692  1193   379  1046   423   485  7578  1132   412  4418
Milfasartan                                            926     0   304     0   407   414     0   917   404     0
pyrrolizidine alkaloids                                453  1041   293  1264   407   464     0  1081   498     0
milfasartan derivatives                                930     0   303     0   406   416     0   913   402     0
Pratosartan                                            695   929   450   523   394   240  2747   794   246  2800
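
The slides do not state how the enrichment factors were computed or scaled. One common definition, used here purely as an assumption, compares how often a disease area occurs in documents mentioning a chemistry concept with how often it occurs in the whole collection. The function name and all counts below are made up and do not reproduce the table values.

```python
def enrichment_factor(n_both, n_concept, n_disease, n_total):
    """Observed co-occurrence divided by the co-occurrence expected by chance.

    n_both:    documents mentioning both the chemistry concept and the disease area
    n_concept: documents mentioning the chemistry concept
    n_disease: documents mentioning the disease area
    n_total:   all documents in the collection

    Equivalent to (n_both / n_concept) / (n_disease / n_total).
    """
    if min(n_concept, n_disease, n_total) <= 0:
        raise ValueError("n_concept, n_disease and n_total must be positive")
    return (n_both * n_total) / (n_concept * n_disease)


# Made-up counts: 200 documents mention the concept, 50 of them also the disease area;
# 10,000 of 1,000,000 documents mention the disease area overall.
print(enrichment_factor(n_both=50, n_concept=200, n_disease=10_000, n_total=1_000_000))
# 25.0
```

A value above 1 means the disease area appears more often together with the concept than chance would predict; a value of 0 means no co-occurrence at all.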

33

End User: Online Database ChemAnalyser

ChemAnalyser – Structure

ChemAnalyser – Full text & ontology based semantic searching

ChemAnalyser – Organic chemistry & drug discovery

ChemAnalyser – Alloys & Inorganic Materials

ChemAnalyser – Cosmetics & Nutrition

ChemAnalyser – Polymers

ChemAnalyser – REACH Report Support

34

Thanks!

Please register at

www.chemanalyser.com

for more information and a free trial.
