NLP for Biomedicine - Ontology building and Text Mining -

NLP for Biomedicine - Ontology building and Text Mining -

Junichi Tsujii

GENIA Project (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/)

Computer Science, Graduate School of Information Science and Technology

University of Tokyo, JAPAN

My Talk

1. Background: Why NLP in Biomedicine

2. Examples of NLP in Biomedicine

3. Text Mining and NLP

4. Our Current Work
   4.1 Terms and NE
   4.2 Resource Building
   4.3 Event Recognition

5. Concluding Remarks

Why NLP in Biomedicine ?

From Biology and Medical Sciences

From Natural Language Processing

Genome sequencing: sequence, structure and function (figure by D. Devos)

Information Exploitation

Scientists in areas such as molecular biology and biochemistry aim to discover new biological entities and their functions. Typical cases could be discoveries of the implications of new proteins and genes in an already known process, or implication of proteins with previously characterized functions in a separate process.

The use of available information (published papers, etc.) is a key step for the discovery process, since in many cases weak or indirect evidences about possible relations hidden in the literature are used to substantiate working hypothesis that are experimentally explored.

[C. Blaschke, A. Valencia: 2001]

Why NLP in Biomedicine?

From Natural Language Processing

Revolution in LT in the last decade

Information

Knowledge, Language, Texts

Grammar: Syntax-Semantic Mapping

Interpretation based on Knowledge

Machine Learning

Knowledge Acquisition

Statistical Biases

Huge Ontology: Next Revolution?

Bio-Medical Application: UMLS, Gene Ontology, etc.

2. Examples of NLP in Biomedicine

What can we do in Biomedical domains by NLP ?

Examples

Protein-Protein Interaction extracted from texts

by C. Blaschke

Organized Knowledge through terms

by C. Blaschke

From Data to Understanding: Interpretation by Language

Oliveros, Blaschke et al., GIW 2000

Information Extraction from Texts

Question Answering (QA) Systems

Characteristics of Signal Pathway (1)

• Granularity of Knowledge Units: different types of entities which are interrelated with each other
– Cells, sub-locations of cells
– Proteins, substructures of proteins, subclasses of proteins
– Ions, other chemical substances
– Genes, RNA, DNA

G-protein coupled receptor pathway model (figure from TRANSPATH)

CSNDB (National Institute of Health Sciences)

• A data- and knowledge-base for signaling pathways of human cells.
– It compiles information on biological molecules, sequences, structures, functions, and the biological reactions which transfer cellular signals.
– Signaling pathways are compiled as binary relationships of biomolecules and represented by graphs drawn automatically.
– CSNDB is built on ACEDB and the inference engine CLIPS, and is linked to TRANSFAC.
– The final goal is to make a computerized model of various biological phenomena.

Example 1

• A Standard Reaction, excerpted from [Takai98]

Signal_Reaction: “EGF receptor Grb2”
From_molecule “EGF receptor”
To_molecule “Grb2”
Tissue “liver”
Effect “activation”
Interaction “SH2+phosphorylated Tyr”
Reference [Yamauchi_1997]

Example 3

• A Polymerization Reaction, excerpted from [Takai98]

Signal_Reaction: “Ah receptor + HSP90”
Component “Ah receptor” “HSP90”
Effect “activation dissociation”
Interaction “PAS domain of Ah receptor”
Activity “inactivation of Ah receptor”
Reference [Powell-Coffman_1998]
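The CSNDB slide says pathways are compiled as binary relationships and drawn as graphs. A minimal Python sketch of that idea follows, using the fields of the standard reaction in Example 1. The dictionary layout and the function name `to_edges` are assumptions made for illustration and are not CSNDB's actual storage format.

```python
from collections import defaultdict

# Record transcribed from Example 1; the dict layout is an assumption.
standard_reaction = {
    "name": "EGF receptor Grb2",
    "from_molecule": "EGF receptor",
    "to_molecule": "Grb2",
    "tissue": "liver",
    "effect": "activation",
    "interaction": "SH2+phosphorylated Tyr",
    "reference": "Yamauchi_1997",
}

def to_edges(reactions):
    """Build an adjacency list: source molecule -> list of (target, effect)."""
    graph = defaultdict(list)
    for reaction in reactions:
        graph[reaction["from_molecule"]].append(
            (reaction["to_molecule"], reaction["effect"])
        )
    return dict(graph)

graph = to_edges([standard_reaction])
print(graph)
```

A polymerization record such as Example 3 has a Component list instead of From/To fields, so it would need a different mapping; that case is left out of this sketch.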

3. Text Mining and NLP

Diagram: Data Mining

Theories in Science; Observed Data (observable)

Objects of Science

Knowledge in Mind (non-observable)

Descriptions of Knowledge (observable)

Forms shown: Quantitative Data; Mathematical Formula; Qualitative, Structures, Classification; Ontology; Texts

Objects Of Science

Knowledge In Mind

Non-Observable

Descriptions Of Knowledge

Observable

Natural Language

Incomplete System    Diversity    Ambiguity

Diagram: Data Mining + Text Mining

Knowledge in Mind (non-observable)

Descriptions of Knowledge (observable)

Labels shown: Characteristics of Language; Text Mining; Objects of Science; Data Mining; Characteristics of Knowledge

4. Our Current Work: 4.1 Terms and NE

Terms are the basic units of knowledge

Classification, Features

NE Recognition; Event Recognition

Semantic Disambiguation

Task difficulties in molecular biology

• Inconsistent naming conventions
e.g. IL-2, IL2, Interleukin 2, Interleukin-2, Il-2
NF kappa B, NF-kappa B, (NF)-kappa B, NF-Kappa B, …

• Widespread synonymy
Many synonyms in wide usage, e.g. PKB and Akt
cyclin-dependent kinase inhibitor p27, p27kip1
<cdc25, cdc25a>, <p52shc, p52(Shc)>

• Open, growing vocabulary for many classes

• Cross-over of names between classes depending on context
– Protein vs DNA

• Frequent use of coordination inside term formations
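To make the naming problem concrete, here is a small Python sketch of a crude normalization key that folds case, hyphens, brackets and spaces. The function `normalize_term` is a hypothetical helper written for this illustration, not the method used in the talk. It merges the purely orthographic variants listed above, but it cannot link true synonyms such as "Interleukin 2" and "IL-2", which still need a lexicon.

```python
import re

def normalize_term(term: str) -> str:
    """Collapse case, hyphens, brackets and whitespace into a lookup key."""
    key = term.lower()
    key = re.sub(r"[\s\-()]+", "", key)
    return key

il2_variants = ["IL-2", "IL2", "Il-2"]
nfkb_variants = ["NF kappa B", "NF-kappa B", "(NF)-kappa B", "NF-Kappa B"]

print({normalize_term(v) for v in il2_variants})   # one shared key
print({normalize_term(v) for v in nfkb_variants})  # one shared key
print(normalize_term("Interleukin 2") == normalize_term("IL-2"))  # synonymy is not resolved
```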

Diversity: Linking Problem; Lexicon; Static Processing

Ambiguity: Term Recognition; Context Dependent; Dynamic Processing

Ambiguity

• Abbreviation Extraction (Schwartz 2003)
– Extracts short and long form pairs

Short form   Long form
AA           Alcoholic Anonymous
             American
             Americans
             Arachidonic acid
             arachidonic acid
             amino acid
             amino acids
             anaemia
             anemia
             :
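The sketch below shows how a short form can be matched against a candidate long form. It assumes the cited method is the Schwartz and Hearst short-form/long-form matcher and reproduces only its core right-to-left character matching step; candidate extraction from parentheses and the validity filters are omitted. The function name `find_best_long_form` is chosen for this illustration.

```python
def find_best_long_form(short_form, long_form):
    """Match short-form characters right to left inside the long form.

    The first short-form character must start a word of the long form.
    Returns the shortest matching suffix of long_form, or None.
    """
    s_index = len(short_form) - 1
    l_index = len(long_form) - 1
    while s_index >= 0:
        c = short_form[s_index].lower()
        if not c.isalnum():          # skip punctuation in the short form
            s_index -= 1
            continue
        # Move left until the character matches; for the first short-form
        # character, also require that the match begins a word.
        while (l_index >= 0 and long_form[l_index].lower() != c) or (
            s_index == 0 and l_index > 0 and long_form[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
        s_index -= 1
    # Cut the long form at the start of the word where matching ended.
    start = long_form.rfind(" ", 0, l_index + 1) + 1
    return long_form[start:]

print(find_best_long_form("AA", "the role of arachidonic acid"))
print(find_best_long_form("AA", "amino acids"))
print(find_best_long_form("AA", "tumor growth"))
```

Because the matcher accepts any long form whose letters are compatible, one short form such as AA maps to many long forms, which is the ambiguity the table illustrates.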

Experiment [Tsuruoka et al. 03 SIGIR]

• Corpus
– MEDLINE: the largest collection of abstracts in the biomedical domain

• Rule learning
– 83,142 abstracts
– Obtained rules: 14,158

• Evaluation
– 18,930 abstracts
– Count the occurrences of each generated variant.

Results: “NF-kappa B”

Generation Probability   Generated Variants   Frequency
1.0 (input)              NF-kappa B           857
0.417                    NF-kappaB            692
0.417                    nF-kappa B           0
0.337                    Nf-kappa B           0
0.275                    NF kappa B           25
0.226                    NF-kappa b           0
:                        :                    :

Results: “antiinflammatory effect”

Generation Probability   Generated Variants          Frequency
1.0 (input)              antiinflammatory effect     7
0.462                    anti-inflammatory effect    33
0.393                    antiinflammatory effects    6
0.356                    Antiinflammatory effect     0
0.286                    antiinflammatory-effect     0
0.181                    anti-inflammatory effects   23
:                        :                           :

Results: “tumour necrosis factor alpha”

Generation Probability   Generated Variants             Frequency
1.0 (input)              tumour necrosis factor alpha   15
0.492                    tumor necrosis factor alpha    126
0.356                    tumour necrosis factor-alpha   30
0.235                    Tumour necrosis factor alpha   2
0.175                    tumor necrosis factor alpha    182
0.115                    Tumor necrosis factor alpha    8
:                        :                              :

GENIA Ontology: Substance

+-substance-+-compound-+-organic-+-nucleic_acid-+-poly_nucleotides
|           |          |         |              +-nucleotide
|           |          |         |              +-DNA
|           |          |         |              +-RNA
|           |          |         +-amino_acid-+-peptide
|           |          |         |            +-amino_acid_monomer
|           |          |         |            +-protein
|           |          |         +-lipid
|           |          |         +-carbohydrate
|           |          |         +-other_organic_compounds
|           |          +-inorganic
|           +-atom

GENIA Ontology: Source

+-source-+-natural-+-organism-+-multi_cell
|        |         |          +-mono_cell
|        |         |          +-virus
|        |         +-body_part
|        |         +-tissue
|        |         +-cell_type
|        +-artificial-+-cell_line
|                     +-other_artificial_sources
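As an illustration of how such a taxonomy can be used programmatically, the following Python sketch encodes the Source branch above as a nested dictionary and looks up the path from the root to a class. The nested-dict encoding and the helper `path_to` are assumptions for this example, not part of the GENIA distribution.

```python
# The Source branch of the ontology, transcribed as a nested dict.
GENIA_SOURCE = {
    "source": {
        "natural": {
            "organism": {"multi_cell": {}, "mono_cell": {}, "virus": {}},
            "body_part": {},
            "tissue": {},
            "cell_type": {},
        },
        "artificial": {"cell_line": {}, "other_artificial_sources": {}},
    }
}

def path_to(tree, target, prefix=()):
    """Depth-first search returning the root-to-target path, or None."""
    for name, children in tree.items():
        path = prefix + (name,)
        if name == target:
            return path
        found = path_to(children, target, path)
        if found:
            return found
    return None

print(path_to(GENIA_SOURCE, "virus"))
print(path_to(GENIA_SOURCE, "cell_line"))
```

The returned path gives every superclass of a tag, which is what a tagger would need in order to back off from a fine class such as cell_line to a coarser one such as source.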

Number of Tagged Objects

• Texts: 2,500 MEDLINE abstracts
– Papers on transcription factors in human blood cells
– 550,000 words, 20,000 sentences

• Tagged objects: 147,000
– Protein: ~77,000
– DNA: ~24,000
– RNA: ~2,400
– Source: ~27,000
– Other: ~37,000

Distributions of Semantic Classes

Chart categories: cell line, artificial source, protein, peptide, amino acid monomer, DNA, RNA, polynucleotides, nucleotides, lipid, carbohydrate, other organic compound, atom, inorganic compound, cell component, cell type, tissue, organism, others

Extension of GENIA Ontology

• Small classes (to be embedded in UMLS)
– 5242 terms labelled with ‘other_names’ class

• Events, Biological reactions 3800

• Disease 636
– Names of Diseases 501
– Treatments 61
– Diagnoses 52
– Pathology 3
– Others 39

• Experiments 578
– Methods 493
– Materials 25
– Others 60

• Others 228

Classification of "other_names": Event or Reaction, Disease, Experiment, Other

Sub-classification of "Disease": Disease name, Treatment method, Diagnosis, Pathology, Other

Sub-classification of "Experiment": Method, Material, Other

Biomedical NE Task (Collier Coling00, Kazama ACL02, Kim ISMB02)

• Recognize “names” in the text
– Technical terms expressing proteins, genes, cells, etc.

Identify and classify

Thus, CIITA not only activates the expression of class II genes but recruits another B cell-specific coactivator to increase transcriptional activity of class II promoters in B cells.

Classes marked in the example: PROTEIN, DNA, CELLTYPE

NE Task as Classification

• Classify each word to a class (tag) representing the semantic class and the position in the term
– The task is reduced to a tagging task
• We can use methods developed for tagging
– The structure is encoded in a tag
• BIO (Begin, Inside, and Other) tagging

BIO tags for a term of class X: B-X I-X I-X
BIO tags for a term of class Y: B-Y
All other words: O (OTHER)
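The encoding of term boundaries into per-word tags can be sketched in a few lines of Python. The function `spans_to_bio` and the span format are hypothetical, and the class assignments in the example (class II promoters as DNA, B cells as CELLTYPE) are an assumption based on the labels shown on the NE task slide.

```python
def spans_to_bio(tokens, spans):
    """Convert term spans into BIO tags.

    spans: list of (start, end_exclusive, class_name) over token indices.
    """
    tags = ["O"] * len(tokens)
    for start, end, cls in spans:
        tags[start] = f"B-{cls}"          # first word of the term
        for i in range(start + 1, end):
            tags[i] = f"I-{cls}"          # remaining words of the term
    return tags

tokens = "transcriptional activity of class II promoters in B cells .".split()
spans = [(3, 6, "DNA"), (7, 9, "CELLTYPE")]  # assumed annotation

for token, tag in zip(tokens, spans_to_bio(tokens, spans)):
    print(f"{token}\t{tag}")
```

Once terms are encoded this way, any sequence tagger that assigns one label per word can be trained for NE recognition.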

NE Tagging Illustrated

• Classify a word depending on the context

Words:      activity   of   class   II      promoters   in
POS tags:   N          P    N       Sym     Ns          P
BIO tags:   O          O    B-DNA   I-DNA

context → conversion to features → classifier

Deterministic tagging:
- Only the most probable tag at each word (SVM)

Viterbi tagging:
- The most probable sequence among all (probabilistic models)
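The difference between the two decoding strategies can be shown with a toy example in Python. All probabilities below are hypothetical numbers chosen for illustration, and the greedy decoder is a simplification that ignores previous tags entirely, which an actual SVM tagger need not do.

```python
import math

TAGS = ["O", "B-DNA", "I-DNA"]
lp = math.log

# Hypothetical toy scores, for illustration only.
START = {"O": lp(0.5), "B-DNA": lp(0.5), "I-DNA": lp(1e-9)}
TRANS = {
    "O":     {"O": lp(0.6), "B-DNA": lp(0.4), "I-DNA": lp(1e-9)},
    "B-DNA": {"O": lp(0.3), "B-DNA": lp(1e-9), "I-DNA": lp(0.7)},
    "I-DNA": {"O": lp(0.5), "B-DNA": lp(1e-9), "I-DNA": lp(0.5)},
}
# Local per-word scores for the two words "class II".
LOCAL = [
    {"O": lp(0.50), "B-DNA": lp(0.45), "I-DNA": lp(0.05)},
    {"O": lp(0.10), "B-DNA": lp(0.10), "I-DNA": lp(0.80)},
]

def greedy(local):
    """Deterministic decoding: the best tag at each word, independently."""
    return [max(TAGS, key=lambda tag: scores[tag]) for scores in local]

def viterbi(local, trans, start):
    """Return the highest-scoring tag sequence under local + transition scores."""
    best = [{tag: start[tag] + local[0][tag] for tag in TAGS}]
    back = []
    for t in range(1, len(local)):
        scores, pointers = {}, {}
        for tag in TAGS:
            prev = max(TAGS, key=lambda p: best[-1][p] + trans[p][tag])
            scores[tag] = best[-1][prev] + trans[prev][tag] + local[t][tag]
            pointers[tag] = prev
        best.append(scores)
        back.append(pointers)
    path = [max(TAGS, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

print(greedy(LOCAL))
print(viterbi(LOCAL, TRANS, START))
```

With these numbers the greedy decoder produces an I-DNA tag directly after O, which is not a well-formed BIO sequence, while Viterbi recovers B-DNA followed by I-DNA because it scores the sequence as a whole.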

The GENIA Corpus [Tateishi HLT02, Ohta PSB00, ISMB02]

Annotated MEDLINE abstracts

A gold standard for biomedical NLP tasks

# of abstracts: 670
# of sentences: 5,109
# of tokens (words): 152,216
# of named entities: 23,793
# of semantic classes: 24

- 2,000-abstract version soon

http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/

Big enough to make SVM usage nontrivial

Small enough to make sparseness serious

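The ME Method

• Maximum Entropy model

$$P(y \mid h) = \frac{1}{Z(h)} \prod_i \alpha_i^{F_i(h, y)}$$

where $h$ is the context, $y$ is the tag, $F_i$ is a feature function and $\alpha_i$ is the weight for $F_i$.

Feature function:

$$F(h, y) = \begin{cases} 1 & \text{if } y = T \text{ and } f(h) = 1 \\ 0 & \text{otherwise} \end{cases}$$

where $T$ is the target and $f(h)$ is the same as the feature in SVMs.

The Viterbi algorithm is used for tagging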
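A minimal Python sketch of how such a model assigns tag probabilities is given below. It writes each weight as an exponent, so a product of weights raised to binary features becomes the exponential of a weighted sum; that rewriting is standard but is stated here as an assumption about the slide's notation. The feature functions, weights and the helper names `make_feature` and `maxent_distribution` are hypothetical.

```python
import math

def make_feature(target_tag, predicate):
    """Binary feature: fires when y is the target tag and the context test holds."""
    def feature(history, y):
        return 1 if y == target_tag and predicate(history) else 0
    return feature

def maxent_distribution(history, tags, features, weights):
    """P(y | h) = exp(sum_i w_i * F_i(h, y)) / Z(h) for every candidate tag y."""
    scores = {}
    for y in tags:
        total = sum(w * f(history, y) for f, w in zip(features, weights))
        scores[y] = math.exp(total)
    z = sum(scores.values())  # normalizer Z(h)
    return {y: s / z for y, s in scores.items()}

# Hypothetical features and weights, for illustration only.
features = [
    make_feature("B-DNA", lambda h: h["word"] == "class"),
    make_feature("O", lambda h: h["pos"] == "P"),
]
weights = [1.5, 2.0]

history = {"word": "class", "pos": "N"}
probs = maxent_distribution(history, ["O", "B-DNA", "I-DNA"], features, weights)
print(probs)
```

In a full tagger these per-word distributions would supply the local scores that the Viterbi algorithm combines into the best tag sequence.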

SOHMM modeling (J. Kim et al., ACL03)

• SOHMM modeling
– No assumption is made arbitrarily.
– Instead, a context classification function is induced from a corpus.

• SOHMM learning
– Inducing the context classification function
– Estimating parameters

$$T(W_{1,l}) = \arg\max \prod_{i=1}^{l} P(t_i \mid c_i)\, P(w_i \mid t_i)$$

– $c_i$: a set of contextual feature values which are visible at the moment of predicting $t_i$.
– A classification function from sets of contextual feature values to context patterns grouped appropriately.

Experimental Results

• Biological source recognition

Matching method        precision  recall  F-score
hard matching          59.72      68.92   63.99
soft matching left     63.23      72.97   67.75
soft matching right    61.36      70.81   65.75
soft matching either   64.87      74.86   69.51

• Biological substance recognition

Matching method        precision  recall  F-score
hard matching          73.76      66.92   70.17
soft matching left     77.64      70.67   73.99
soft matching right    75.19      68.22   71.54
soft matching either   79.07      71.98   75.36
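The F-score column is consistent with the balanced harmonic mean of precision and recall, F = 2PR / (P + R). The short Python check below recomputes a few rows from the tables; the function name `f_score` is chosen for this illustration, and the balanced weighting is an assumption that the reported figures happen to confirm.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

rows = [
    ("hard matching, first table", 59.72, 68.92),
    ("soft matching either, first table", 64.87, 74.86),
    ("hard matching, second table", 73.76, 66.92),
]
for label, p, r in rows:
    print(f"{label}: {f_score(p, r):.2f}")
```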

Event Recognition

Identity of events in our mind

Disambiguation of different events by context

Problem: Syntactic Variations

RAF6 activates NF-kappaB.

Lck is activated by autophosphorylation at Tyr 394.

Anandamide induces vasodilation by activating vanilloid receptors.

the activation of Rap1 by C3G

the GTPase-activating protein rhoGAP

the stress-activated group of MAP kinases

ACTIVATOR activate ACTIVATEE
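To show what normalizing these variations into a single ACTIVATOR/ACTIVATEE frame means, here is a toy Python sketch based on surface patterns. The three regular expressions and the function `extract_activation` are hypothetical and cover only the simplest phrasings; they are not the parser-based approach of the talk.

```python
import re

# Hypothetical surface patterns, one per syntactic variation.
PATTERNS = [
    re.compile(r"^(?P<activator>.+?) activates (?P<activatee>.+?)\.?$"),
    re.compile(r"^(?P<activatee>.+?) is activated by (?P<activator>.+?)\.?$"),
    re.compile(r"^the activation of (?P<activatee>.+?) by (?P<activator>.+?)\.?$"),
]

def extract_activation(text):
    """Return (activator, activatee) if a pattern matches, else None."""
    for pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            return match.group("activator"), match.group("activatee")
    return None

examples = [
    "RAF6 activates NF-kappaB.",
    "Lck is activated by autophosphorylation at Tyr 394.",
    "the activation of Rap1 by C3G",
    "Anandamide induces vasodilation by activating vanilloid receptors.",
]
for sentence in examples:
    print(sentence, "->", extract_activation(sentence))
```

The last example is missed because the activation is expressed inside a gerund clause, which illustrates why surface patterns do not scale and syntactic analysis is needed.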

Verbs Related to Biological Events

Frequent Verbs in 100 MEDLINE Abstracts

Verb         Count  Verb           Count  Verb       Count  Verb          Count
be           255    involve        16     determine  9      explain       6
induce       56     identify       16     construct  9      exert         6
bind         50     act            15     associate  9      enhance       6
show         49     stimulate      14     reduce     8      display       6
suggest      42     provide        14     prevent    8      characterize  6
activate     42     express        13     locate     8      participate   5
factor       36     affect         13     line       8      localize      5
demonstrate  35     type           12     differ     8      investigate   5
inhibit      26     report         12     trigger    7      imply         5
have         25     form           12     synergize  7      establish     5
reveal       21     contribute     12     examine    7      conclude      5
require      21     study          11     block      7      compare       5
regulate     21     observe        11     become     7      use           4
indicate     21     lead           11     analyze    7      transform     4
find         21     function       11     target     6      transfect     4
result       20     assay          11     signal     6      test          4
play         19     appear         11     remain     6      suppress      4
interact     18     occur          10     produce    6      support       4
mediate      17     increase       10     present    6      substitute    4
contain      17     phosphorylate  9      possess    6      share         4

Argument Frame Extractor

133 argument structures, marked by a domain specialist in 97 sentences among the 180 sentences

Extracted uniquely: 31
Extracted with ambiguity: 32
Parsing failures, extractable from pp’s: 26
Not extractable: 27
Memory limitation, etc.: 17

68%

5. Concluding Remarks

Actual demands in the real world, with more homogeneous user groups and more concrete criteria for evaluating results

http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/

Resources available

• Medline Abstracts (4000, about 1 million words)
• GENIA ontology
• POS tags
• Semantic tags
• Structural tags
• Co-reference annotations (with a Singaporean team)
• Lexical resources mapped to existing ontology