Natural Language Processing in the Biomedical Domain
SBI Course WS 2005/2006
Thomas Karopka
19.01.2006

Outline
• Motivation
• Introduction to Natural Language Processing
• Named Entity Recognition (NER)
• Information Extraction (IE)
• GATE: General Architecture for Text Engineering
• Some tools, some applications (short introduction to GATE)

Motivation
• Huge amount of biomedical knowledge
• MEDLINE currently contains over 16 million biomedical abstracts
• 50,000 new abstracts per month
• Problem: unstructured text is difficult to analyze automatically
• Reading 40,000 abstracts at 5 min each takes approx. 400 days (8 h a day)
• Solution: NLP, specifically Information Extraction

What is NLP?
Definition 1: Natural Language Processing (NLP) is a subfield of artificial intelligence and linguistics. It studies the problems inherent in the processing and manipulation of natural language, but not, generally, natural language understanding.
Definition 2: A study of how to use computers to do things with human languages.
Synonyms: Language Engineering, Human Language Technology

Publications in MEDLINE
[Figure: publications in MEDLINE by year, in millions; series: annual publications, cumulative number]

Main fields of NLP
• Text to speech
• Speech recognition
• Natural language generation
• Machine translation
• Question answering
• Information retrieval
• Information extraction
• Named entity recognition
• Text classification
• Translation technology
• Text summaries

Why is NLP so hard?
• Ambiguity
• Context
• Acronyms
• Semantics

Ambiguity
"Time flies like an arrow; fruit flies like a banana." (attributed to Groucho Marx)

Global vs. Local ambiguity
• Local ambiguity means that part of a sentence can have more than one interpretation, but not the whole sentence.
• Global ambiguity means that the whole sentence can have more than one interpretation.

Global vs. Local ambiguity cont.
Local ambiguity
• "The old train..." can continue as "...the young." (sentence 1) or as "...left the station." (sentence 2)
• Here syntax can tell us that "train" must be a verb in sentence 1.
Global ambiguity
• "I saw the Grand Canyon flying to New York"
• "I saw a Boeing 747 flying to New York"
• Here we know the meaning of the two sentences because we know what can and cannot fly.

Types of Ambiguity
Categorical ambiguity
• Noun: "Time is money"
• Verb: "Time me on the last lap"
• Adjective: "Time travel is not likely in my lifetime"
Word sense ambiguity
• Electrical: "The battery was charged with jump leads"
• Legal: "Thief was charged by PC Smith"
• Responsibility: "The lecturer was charged with student recruitment"

Types of Ambiguity cont.
Structural ambiguity
• "You can have peas and beans or carrots with the set meal"
Referential ambiguity
• What can THEY refer to in "After THEY finished the exam the students and lecturers left."? Lecturers only? Students only? Both?

Problems in NLP
• Polysemy: one word carrying different meanings in different contexts (Glück 1993, 474). Example: beam ("ray of light" and "wooden beam")
• Synonymy: the semantic relation that holds between two words that can (in a given context) express the same meaning. Examples: ship / vessel, buy / purchase
• Semantics: the meaning of a word, phrase, clause, or sentence, as opposed to its syntactic construction. Example: "Baby swallows fly"

Basic NLP Tasks
• Tokenization: split text into units called tokens (words and punctuation such as ". , -")
• Sentence splitting: detect sentence boundaries
• Part-of-speech (POS) tagging: assign parts of speech (verb, noun, adjective, ...)
• Parsing: work out parse trees
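
The first two tasks above can be illustrated with a minimal Python sketch. The function names `split_sentences` and `tokenize` and both regular expressions are assumptions made for this example, not part of any tool mentioned in the course. Real sentence splitters need extra rules for abbreviations such as "Fig." or "et al.".

```python
import re

def split_sentences(text):
    # Naive rule (an assumption): a sentence ends at ".", "!" or "?"
    # followed by whitespace and an uppercase letter.
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())

def tokenize(sentence):
    # Words may contain internal hyphens (e.g. "IL-2");
    # every other non-space character becomes its own token.
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", sentence)

text = "IL-2 activates T cells. The effect is dose-dependent."
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

Keeping hyphenated names such as "IL-2" as a single token is a design choice that suits biomedical text, where hyphens are often part of the entity name.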

Basic NLP Tasks cont.
• Verb phrase chunking: find verb phrases
• Noun phrase chunking: find noun phrases
• Acronym resolution: find long forms for acronyms
• Coreference resolution: e.g. "New York" ... "the Big Apple"
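
Acronym resolution can be sketched with a very simple heuristic: look for a "long form (SHORT)" pattern and check that the initials of the preceding words spell the short form. This is an illustrative assumption, not the method used in the course; the function name `find_acronym_definitions` is hypothetical, and the heuristic misses many real cases (digits, hyphens, skipped words).

```python
import re

def find_acronym_definitions(text):
    """Return {short_form: long_form} for 'long form (SF)' patterns.

    Heuristic (an assumption): the short form is all uppercase letters and
    the initials of the N words before the parenthesis spell it out,
    where N is the number of letters in the short form.
    """
    found = {}
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        short = match.group(1)
        words = re.findall(r"[A-Za-z]+", text[:match.start()])
        candidate = words[-len(short):]
        if len(candidate) == len(short) and all(
            word[0].upper() == letter for word, letter in zip(candidate, short)
        ):
            found[short] = " ".join(candidate)
    return found

print(find_acronym_definitions(
    "The epidermal growth factor receptor (EGFR) binds EGF."
))
```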

Basic NLP Tasks cont.
• Named entity recognition: find named entities
• ...

What is NER?
NER (Named Entity Recognition) includes two tasks:
• Identification of proper names in text
• Classification of proper names in text
Newswire domain: Person, Location, Organization
Biomedical domain: Protein, DNA, RNA, Body Part, Cell Type, Lipid, etc.

NER in biomedical domain
BioNER aims to recognize the following names:
• First priority: protein name, DNA name, RNA name
• Second priority: cell type, other organic compound, cell line, lipid, multi-cell, virus, cell component, body part, tissue, amino acid monomer, polynucleotide, mono-cell, inorganic, peptide, nucleotide, atom, other artificial source, carbohydrate, organic

Example of NER - Biomedical
[Figure: NER example; legend: Protein/gene, Cell type]

Problems in BioNER
• Unknown words
• Long compound words
• Variations of expressions
• Nested NEs

Unknown Words
Words containing hyphens, digits, letters, Greek letters, Roman numerals:
• Alpha B1
• Adenylyl cyclase 76E
• Latent membrane protein 1
• 4'-mycarosyl isovaleryl-CoA transferase
• oligodeoxyribonucleotide
• 18-deoxyaldosterone
Abbreviations and acronyms:
• IL, TECd, IFN, TPA

Long Compound words
• interleukin 1 (IL-1)-responsive kinase
• interleukin 1-responsive kinase
• epidermal growth factor receptor
• SH2 domain containing tyrosine kinase Syk SH2 domain (GENIA example)

Various expressions of the same NE
• Spelling variation: N-acetylcysteine, N-acetyl-cysteine, NAcetylCysteine
• Word permutation: beta-1 integrin, integrin beta-1
• Ambiguous expressions: epidermal growth factor receptor, EGF receptor, EGFR; c-jun, c-Jun, c jun
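
One common way to cope with spelling variation and word permutation is to map each name to a normalized key before comparing. The sketch below is only an illustration; the function names `spelling_key` and `permutation_key` and the normalization rules are assumptions, and aggressive normalization can wrongly merge names that are in fact different.

```python
import re

def spelling_key(name):
    # Ignore case, hyphens and whitespace (assumed normalization rules).
    return re.sub(r"[-\s]+", "", name).lower()

def permutation_key(name):
    # Ignore case and word order.
    return tuple(sorted(name.lower().split()))

variants = ["N-acetylcysteine", "N-acetyl-cysteine", "NAcetylCysteine"]
print({spelling_key(v) for v in variants})

print(permutation_key("beta-1 integrin") == permutation_key("integrin beta-1"))
```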

Various expressions: the name explains its function
• the Ras guanine nucleotide exchange factor Sos
• the Ras guanine nucleotide releasing protein Sos
• the Ras exchanger Sos
• the GDP-GTP exchange factor Sos
• Sos (mSos), a GDP/GTP exchange protein for Ras

Various expressions: the name includes preposition and/or conjunction (ambiguity of dependencies)
• p85 alpha subunit of PI 3-kinase
• SH2 and SH3 domains of Src
• NF-AT1, AP-1, and NF-kB sites
• E2F1 and -3
• Residues 432, 435, 437, 438, and 440

Nested Named Entity
An NE embedded in another NE.
• IL-2: protein
• IL-2 gene: gene
• CBP/p300 associated factor: protein
• CBP/p300 associated factor binding promoter: DNA

Gene Naming Conventions
"Biologists would rather share their toothbrush than share a gene name" (Michael Ashburner [1])
[1] Pearson H. Biology's name game. Nature. 2001;411:631–632.

Protein/Gene name recognition
For comic relief don't miss the 'worst gene names' page:
http://tinman.vetmed.helsinki.fi/eng/drosophila.html
My favourite ones:
• drop dead FBgn0000494
• lost in space FBgn0016996
• ken and barbie FBgn0011236
Source: FlyBase http://flybase.bio.indiana.edu/

State-of-the-art Systems on NER: Two evaluation contests
BioCreative 2004 (March)
• Critical Assessment of Information Extraction Systems in Biology
• Task 1: Entity extraction
• Target: genes (or proteins, where there is ambiguity)
• 10,000 sentences from Medline as training data, and 5,000 sentences as testing data
BioNLP 2004 (August)
• GENIA Corpus as training data and 404 abstracts as testing data
• Target: 5 classes: protein, DNA, RNA, cell line and cell type
Both use exact match scoring.
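
Exact match scoring means that a predicted entity counts as correct only if both its boundaries and its class agree with the gold annotation. The following is a minimal sketch of that idea, not the official evaluation script of either contest; the function name `exact_match_scores` and the `(start, end, type)` tuple representation are assumptions made for this example.

```python
def exact_match_scores(gold, predicted):
    """Score entity sets given as (start, end, entity_type) tuples.

    An entity is correct only if all three fields match exactly.
    Returns (precision, recall, f_score).
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Made-up annotations for illustration only.
gold = [(0, 2, "protein"), (5, 6, "DNA"), (9, 12, "cell_type")]
predicted = [(0, 2, "protein"), (5, 7, "DNA")]  # second boundary is off by one
print(exact_match_scores(gold, predicted))
```

In the example, the second prediction overlaps the gold DNA entity but has a different right boundary, so it earns no credit at all.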

BioNLP 2004 Datasets

| Set | # of abstracts | # of sentences | # of tokens |
|---|---|---|---|
| Training Set | 2,000 | 20,546 (10.27/abs) | 472,006 (236.00/abs) (22.97/sen) |
| Test Set: Total | 404 | 4,260 (10.54/abs) | 96,780 (239.55/abs) (22.72/sen) |
| Test Set: 1978-1989 | 104 | 991 (9.53/abs) | 22,320 (214.62/abs) (22.52/sen) |
| Test Set: 1990-1999 | 106 | 1,115 (10.52/abs) | 25,080 (236.60/abs) (22.49/sen) |
| Test Set: 2000-2001 | 130 | 1,452 (11.17/abs) | 33,380 (256.77/abs) (22.99/sen) |
| Test Set: S/1998-2001 | 204 | 2,254 (11.05/abs) | 51,628 (253.08/abs) (22.91/sen) |

R / P / F (recall / precision / F-score) per test set

| System | 1978-1989 set | 1990-1999 set | 2000-2001 set | S/1998-2001 set | Total |
|---|---|---|---|---|---|
| [Zho04] | 75.3 / 69.5 / 72.3 | 77.1 / 69.2 / 72.9 | 75.6 / 71.3 / 73.8 | 75.8 / 69.5 / 72.5 | 76.0 / 69.4 / 72.6 |
| [Fin04] | 66.9 / 70.4 / 68.6 | 73.8 / 69.4 / 71.5 | 72.6 / 69.3 / 70.9 | 71.8 / 67.5 / 69.6 | 71.6 / 68.6 / 70.1 |
| [Set04] | 63.6 / 71.4 / 67.3 | 72.2 / 68.7 / 70.4 | 71.3 / 69.6 / 70.5 | 71.3 / 68.8 / 70.1 | 70.3 / 69.3 / 69.8 |
| [Son04] | 60.3 / 66.2 / 63.1 | 71.2 / 65.6 / 68.2 | 69.5 / 65.8 / 67.6 | 68.3 / 64.0 / 66.1 | 67.8 / 64.8 / 66.3 |
| [Zha04] | 63.2 / 60.4 / 61.8 | 72.5 / 62.6 / 67.2 | 69.1 / 60.2 / 64.7 | 69.2 / 60.3 / 64.4 | 69.1 / 61.0 / 64.8 |
| [Rös04] | 59.2 / 60.3 / 59.8 | 70.3 / 61.8 / 65.8 | 68.4 / 61.5 / 64.8 | 68.3 / 60.4 / 64.1 | 67.4 / 61.0 / 64.0 |
| [Par04] | 62.8 / 55.9 / 59.2 | 70.3 / 61.4 / 65.6 | 65.1 / 60.4 / 62.7 | 65.9 / 59.7 / 62.7 | 66.5 / 59.8 / 63.0 |
| [Lee04] | 42.5 / 42.0 / 42.2 | 52.5 / 49.1 / 50.8 | 53.8 / 50.9 / 52.3 | 52.3 / 48.1 / 50.1 | 50.8 / 47.6 / 49.1 |
| BL | 47.1 / 33.9 / 39.4 | 56.8 / 45.5 / 50.5 | 51.7 / 46.3 / 48.8 | 52.6 / 46.0 / 49.1 | 52.6 / 43.6 / 47.7 |

Current Methods
• Machine learning: HMM, SVM, ME (Maximum Entropy), CRF (Conditional Random Field)
• Hybrid methods
• Dictionary based: approximate string matching algorithm
• Naming rules
• Dynamic programming
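
Approximate string matching against a dictionary is often implemented with edit distance, which is computed by dynamic programming. The sketch below uses the standard Levenshtein distance as one possible choice; the slides do not say which algorithm the cited systems use. The function names and the small dictionary are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between a and b, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]

def dictionary_matches(term, dictionary, max_distance=1):
    """Return dictionary entries within max_distance edits of term."""
    term = term.lower()
    return [entry for entry in dictionary
            if levenshtein(term, entry.lower()) <= max_distance]

# Hypothetical mini-dictionary; "intergrin" is a misspelling of "integrin".
print(dictionary_matches("intergrin", ["integrin", "interleukin", "insulin"]))
```

Scanning the whole dictionary like this is only practical for small lists; larger dictionaries usually need an index or a finite state representation.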

Features for Machine Learning Methods
• Morphological features
• Orthographical features
• POS features (Genia POS tagger)
• Semantic trigger features: head-noun features (e.g. NF-kappaB consensus site, IL-2 gene)

Morphological Features

| Prefix/Suffix | Example |
|---|---|
| ~cin | actinomycin |
| ~mide | cycloheximide |
| ~zole | sulphamethoxazole |
| ~lipid | phospholipids |
| ~rogen | estrogen |
| ~vitamin | dihydroxyvitamin |
| ~blast | erythroblast |
| ~cyte | thymocyte |
| ~phil | eosinophil |
| phosph~ | phosphorylation |
| methyl~ | methyltransferase |
| immuno~ | immunomodulator |
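
Affix features like those in the table can be turned into classifier inputs with simple string tests. The sketch below is an illustration under assumptions: the function name `affix_features` and the `suffix=...` / `prefix=...` feature format are hypothetical, and only the affixes listed on the slide are used.

```python
SUFFIXES = ("cin", "mide", "zole", "lipid", "rogen", "vitamin",
            "blast", "cyte", "phil")
PREFIXES = ("phosph", "methyl", "immuno")

def affix_features(token):
    """Return the affix features from the slide that fire for a token."""
    word = token.lower()
    features = [f"suffix={s}" for s in SUFFIXES if word.endswith(s)]
    features += [f"prefix={p}" for p in PREFIXES if word.startswith(p)]
    return features

for token in ["thymocyte", "Cycloheximide", "phosphorylation", "aspirin"]:
    print(token, affix_features(token))
```

Note that a plain suffix test does not fire for inflected forms such as "phospholipids"; a real system would need stemming or a looser match.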

Orthographical Features

| Orthographical feature | Example |
|---|---|
| AllCaps | EBNA, NFAT |
| AlphaDigit | p50, p65 |
| AlphaDigitAlpha | IL23R, E1A |
| ATGCSequence | CCGCCC |
| CapLowAlpha | Src, Ras, Epo |
| CapMixAlpha | NFkappaB |
| CapsAndDigits | IL2, STAT4, SH2 |
| DigitAlpha | 2xNFkappaB |
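
The orthographical classes can be approximated with regular expressions tested in a fixed order. The slide only names the classes and gives examples, so the exact patterns and their priority below are assumptions chosen to reproduce those examples; the function name `orthographic_class` is hypothetical.

```python
import re

# Ordered from most to least specific; the first matching pattern wins.
PATTERNS = [
    ("ATGCSequence", r"[ATGC]+"),
    ("AllCaps", r"[A-Z]+"),
    ("CapsAndDigits", r"[A-Z]+[0-9]+"),
    ("AlphaDigitAlpha", r"[A-Za-z]+[0-9]+[A-Za-z]+"),
    ("AlphaDigit", r"[A-Za-z]+[0-9]+"),
    ("DigitAlpha", r"[0-9]+[A-Za-z]+"),
    ("CapLowAlpha", r"[A-Z][a-z]+"),
    ("CapMixAlpha", r"[A-Z][A-Za-z]+"),
]

def orthographic_class(token):
    """Return the first orthographical class whose pattern matches the whole token."""
    for name, pattern in PATTERNS:
        if re.fullmatch(pattern, token):
            return name
    return "Other"

for token in ["EBNA", "p50", "IL23R", "CCGCCC", "Src", "NFkappaB", "IL2", "2xNFkappaB"]:
    print(token, orthographic_class(token))
```

The order matters: "CCGCCC" would also match AllCaps, and "IL2" would also match AlphaDigit, so the more specific classes are tested first.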

Head Nouns
• Unigram: factor, protein, receptor, alpha, NF-kappaB, IL-2, cytokine, kinase, transcription, domain, complex, TNF-alpha, Nuclear, p50, CD28, TNF, molecule, subunit, cell, STAT3, family, tumor, factor-alpha, expression, interleukin
• Bigram: NF-kappa B, transcription factor, I kappa, nuclear factor, protein kinase, B alpha, kinase C, tumor necrosis, T cell, glucocorticoid receptor, binding protein, factor alpha, adhesion molecule, monoclonal antibody, gene product, binding domain
Excursus: Head Noun, Noun phrase
A noun is usually embedded in a noun phrase (NP), a syntactic unit of the sentence in which information about the noun is gathered.
The noun is the head of the noun phrase, the central constituent that determines the syntactic character of the phrase.

Excursus: Head Noun, Noun phrase cont.
A noun phrase normally consists of:
• An optional determiner
• Zero or more adjective phrases
• A head noun
• Optional post-modifier (prepositional phrase or clausal modifier)
Examples:
• The homeless old man in the park that I tried to help yesterday
• human umbilical vein endothelial cells
• lipopolysaccharide-stimulated human saphenous vein endothelial cells

Zhou et al. approach
• HMM + SVM
• Post-processing: rule-based, used to resolve nested named entities
• Top 1 in the NLPBA task, F = 72.5%

Manning et al. method
Machine learning:
• ME Markov model
• Local features
• External resources and larger context
Post-processing:
• To correct gene boundaries (mainly for the BioCreative task)
Results:
• Top 1 in BioCreative, F = 83.2%
• Top 2 in NLPBA task, F = 70.1%

What is Information Extraction (IE)?
IE systems analyse unstructured text, extract predefined named entities and store these entities in a structured form.
Text mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. A key element is the linking together of the extracted information to form new facts or new hypotheses to be explored further by more conventional means of experimentation.
Source: Marti Hearst, What is text mining? http://www.sims.berkeley.edu/~hearst/text-mining.html

Targets of Information Extraction
• Protein-protein interaction/binding/inhibition
• Protein-small molecules
• Gene-gene regulation
• Gene-gene product interaction
• Gene-drug relation
• Protein-subcellular location
• Amino acid-protein relation
Example relationships between genes and drugs:
• The gene is the drug target
• The gene confers resistance to the drug
• The gene metabolizes the drug

Information Extraction Tasks
• Identify target named entities
• Identify relations among named entities
• Identify relations among events and named entities
• Associate results with existing database records

IE-Systems
• Rule-based systems: use rules for the extraction
• Machine learning: Support Vector Machine (SVM), Maximum Entropy (ME), Memory Based Learning (MBL), Inductive Logic Programming (ILP), Artificial Neural Networks (ANNs)
• Hybrid systems: combine the two approaches

GATE: General Architecture for Text Engineering
"GATE is an architecture, a framework and a development environment for LE (Language Engineering)" (Cunningham, 2002)
• Integrated development environment for LE applications
• Reusable components
• Extensive set of APIs
• Integration of different NLP platforms
• WEKA (machine learning), Protégé (ontology)
• Open source, Java

Extractor
[Figure: processing pipeline with the components Tokenizer, Sentence Splitter, Gene Gazetteer, Gene-relation transducer, POS Tagger, Acronym Resolution, NP-Chunking, and XML docs; legend: GATE standard components, external modules, newly developed modules]
Gene-relation transducer:
• Finite state transducer
• Uses JAPE grammar
• JAPE rules are compiled to Java objects that are used by the GATE API
Gene gazetteer:
• Consists of an index file which is used to access lists with keywords
• Lists are compiled to finite state machines
• Every keyword is annotated with a type (e.g. Gene, Relation, Organism, ...)
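
The idea of a gazetteer (keyword lists where each keyword carries a type) can be sketched in a few lines of Python. This is not GATE's implementation: GATE compiles the lists to finite state machines, while the sketch below uses a plain dictionary with greedy longest-match lookup. The function name `gazetteer_lookup`, the `max_len` parameter and the tiny keyword list are assumptions for illustration.

```python
def gazetteer_lookup(tokens, gazetteer, max_len=4):
    """Annotate token spans found in a keyword -> type dictionary.

    Tries the longest phrase first at each position (greedy longest match).
    Returns a list of (start, end, type) with end exclusive.
    """
    annotations = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in gazetteer:
                annotations.append((i, i + n, gazetteer[phrase]))
                i += n
                break
        else:
            i += 1  # no keyword starts at this token
    return annotations

# Hypothetical keyword lists merged into one dictionary.
gazetteer = {"il-1beta": "Gene", "tnf-alpha": "Gene",
             "gm-csf": "Gene", "enhanced": "relverb"}
tokens = "IL-1beta and TNF-alpha significantly enhanced the production of GM-CSF".split()
print(gazetteer_lookup(tokens, gazetteer))
```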

JAPE
Example: IL-1beta and TNF-alpha significantly enhanced the production of GM-CSF

    Rule: G1ofG2
    (
      (GENE):gene1
      (mRNA)?
      (ADVS)?
      ({Lookup.majorType == relverb}):rel
      ({Token.category == DT})?
      {Token.category == NN}
      {Token.string == "of"}
      (GENE):gene2
      (mRNA)?
    ):cgr --> { Java code }

Annotations in the figure:
• Macro: GENE, mRNA, ADVS
• Label: :gene1, :rel, :gene2, :cgr
• Gazetteer lookup: Lookup.majorType
• POS tag: Token.category
• RHS: { Java code }

Examples

| Pattern | Example |
|---|---|
| G1(-)relverb G2 | TNF-alpha(-)mediated GM-CSF |
| G1 adverb? modal? rel G2 | IFNG (mRNA) significantly downregulates IL8 (mRNA) |
| G2 (mRNA) relation by G1 (mRNA) | CDC2 activation by cyclin B1 |
| G1 relverb rel of G2 | TNF-alpha(-)mediated upregulation of GM-CSF |
| G1 relverb and/but G2 relverb G3 rel | IL1 upregulates but IFNG downregulates IL8 expression |
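
A pattern such as "G1 adverb? rel G2" can be mimicked outside GATE by tagging each token with a one-letter class and running a regular expression over the tag string. This is only a rough analogue of a JAPE rule, written under assumptions: the word lists, the tag letters and the function names are hypothetical, and whitespace tokenization is used for brevity.

```python
import re

# Hypothetical mini-lexicons standing in for gazetteer lists.
GENES = {"IFNG", "IL8", "IL1", "CDC2"}
REL_VERBS = {"downregulates", "upregulates", "activates", "inhibits"}
ADVERBS = {"significantly", "strongly"}

def tag(token):
    """Map a token to a one-letter class: Gene, Verb, Adverb or Other."""
    if token in GENES:
        return "G"
    if token.lower() in REL_VERBS:
        return "V"
    if token.lower() in ADVERBS:
        return "A"
    return "O"

def extract_relations(sentence):
    """Find 'G1 adverb? relverb G2' and return (gene1, verb, gene2) triples."""
    tokens = sentence.split()
    tags = "".join(tag(t) for t in tokens)  # one character per token
    relations = []
    for match in re.finditer(r"GA?VG", tags):
        gene1 = tokens[match.start()]
        verb = tokens[match.end() - 2]
        gene2 = tokens[match.end() - 1]
        relations.append((gene1, verb, gene2))
    return relations

print(extract_relations("IFNG significantly downregulates IL8"))
```

Because each token maps to exactly one character, match offsets in the tag string are also token indices, which keeps the extraction step simple.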

Gene-gene relation

Evaluation
$$\mathrm{REC} = \frac{\mathrm{COR} + 0.5 \cdot \mathrm{PAR}}{\mathrm{POS}} \qquad \mathrm{PRE} = \frac{\mathrm{COR} + 0.5 \cdot \mathrm{PAR}}{\mathrm{ACT}}$$
COR = correct relations, POS = possible relations, ACT = actually extracted relations, PAR = partially correct extracted relations
Estimation based on 100 manually checked abstracts:
PRE = 83%, REC = ?
Standard for evaluation necessary: BioCreAtIvE? GENIA?
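
The two formulas give half credit to partially correct relations. A minimal sketch of the computation follows; the function name `partial_credit_scores` is hypothetical, and the counts in the example are made up for illustration, not taken from the evaluation reported above.

```python
def partial_credit_scores(cor, par, act, pos):
    """Precision and recall with half credit for partially correct relations.

    cor: correct relations, par: partially correct relations,
    act: relations actually extracted, pos: relations possible (gold).
    """
    credit = cor + 0.5 * par
    precision = credit / act if act else 0.0
    recall = credit / pos if pos else 0.0
    return precision, recall

# Made-up counts: 40 correct, 10 partial, 50 extracted, 60 possible.
precision, recall = partial_credit_scores(cor=40, par=10, act=50, pos=60)
print(f"PRE = {precision:.2f}, REC = {recall:.2f}")
```

Recall needs POS, the number of relations actually present in the text, which requires a fully annotated gold standard.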

[Figure: a query against a collection of documents, showing the sets N, M and C]
• N: correct relations
• M: retrieved relations
• C: correct relations that are actually retrieved
Precision (P), recall (R) and F-value:
$$P = \frac{C}{M} \qquad R = \frac{C}{N} \qquad F = \frac{2 \cdot P \cdot R}{P + R}$$
More complicated due to partially filled templates

Recall vs. Precision
• High recall: you get all the right answers, but garbage too. Good when incorrect results are not problematic. More common from automatic systems.
• High precision: all returned answers must be correct. Good when missing results are not problematic. More common from hand-built systems.
• In general, one can trade one for the other, but it is harder to score well on both.
[Figure: plot of precision against recall]

Penn Treebank Tagset
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NP Proper noun, singular
15. NPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PP Personal pronoun
19. PP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
Tools for NLP in the biomedical domain