BioNLP for NLPeople
CS5832/HLT-NAACL/RANLP
The weirdest job in the world
How I got here
• Voice Input Technologies
• Linguistix
• Nationwide Insurance
• MapQuest
• Berdy Medical Systems
• OneRealm [sic]
How I got here
• Perl hacker, SLM data preprocessing
• Linguist, corpus construction
• Senior Programmer/Analyst, Interactive Voice Response (yuck)
• Software test dept. manager; senior software engineer
• Consultant/Perl hacker
• Senior software engineer
What is BioNLP?
• Natural language processing applied to biomedical language
– Publications
– Medical records
– Ontologies
Part 0
Why a field called BioNLP?
There is little reason for the data on which a linguist works to have the right to name that work.
Shuy 2002:8
(One lab’s) funding for NLP in computational biology
• INIA (Neuroinformatics of Alcoholism) $5M, 5 years
• Wyeth Genomics Institute ($200K, 2 years)
• National Library of Medicine ($4.2M, 3 years)
• National Library of Medicine ($XM, 3 years)
Why biologists care
• High-throughput data interpretation
• Literature search
• Annotation
• Database construction
But, I’m an NLPerson (computer scientist, mathematician, engineer…)
• Hard, but might be possible
• Might be harder in biomedical domain than in newswire text
• Might be more possible in biomedical domain than in newswire text
Resources
The big drawing point for NLPeople
• Data
– Lexical resources
– 500 * 16M words of text
– Labelled training data
• Tools
– NER, POS taggers, parsers, semantic normalizers....
$$$
Job market
• Academia: great
– US, Europe
• Industry: not bad, but genomics-specific right now
Surely Shuy jests...
There is little reason for the data on which a linguist works to have the right to name that work.
It really is different on every level
• Tokenization
• Named entity recognition
• Corpus construction
• Semantic representation
NLP actually could make the world a better place....
An embarrassing truth about BioNLP...
www.chilibot.net
Part 1: Just enough biology
Cells and proteins
<illustration: cell, structures, proteins>
How biologists see the world
Wattarujeekrit et al. (2004)
The Central Dogma: from genes to proteins
http://www.swbic.org/products/clipart/images/dogmag.jpg
The Central Dogma: from genes to proteins
http://www.ncbi.nlm.nih.gov/Class/MLACourse/Modules/MolBioReview/images/central_dogma.gif
Higher-level structures
• Genotype, phenotype
• Tissue, organ, organism
Biological structures are complex
SNAP Receptor
Vesicle SNARE
V-SNARE
N-Ethylmaleimide-Sensitive Fusion Protein
Soluble NSF Attachment Protein
Maleic acid N-ethylimide
Vesicle Soluble Maleic acid N-ethylimide Sensitive Fusion Protein Attachment Protein Receptor
(Alex Morgan, MITRE)
Part 2: Why bioscientists fund and publish research in BioNLP
Two basic markets, multiple user types
• Medical
– Clinicians
– Consumers
– “Informationists”
– Administrators (billing, quality assurance, ...)
• “MolBio” (genomic)
– High-throughput experimentalists
– “Bench scientists”
– Model organism database curators
Structured vocabulary
Free text (phenotypes)
122 references...
Medical
1997
<scanned picture of business card>
<happy-face photo>
One year later…
A sad story: physicians don’t buy a lot of NLP software
Another sad story: trying to sell “gisting” to physicians
Sold for $400K: 14.5 or 2.9¢ on the dollar…
Salesperson’s thought process
Physician’s thought process
Genomics
Why biologists care
• High-throughput data interpretation
• Literature search
• Annotation
• Database construction
Why biologists care
10 years ago...
Why biologists care
Today....
Double exponential growth in the literature
New entries in Medline with publication date in Jan-Aug 2005: 431,478 (avg. 1775/day)
Biological Nomenclature: “V-SNARE”
SNAP Receptor
Vesicle SNARE
V-SNARE
N-Ethylmaleimide-Sensitive Fusion Protein
Soluble NSF Attachment Protein
Maleic acid N-ethylimide
Vesicle Soluble Maleic acid N-ethylimide Sensitive Fusion Protein Attachment Protein Receptor
(Alex Morgan, MITRE)
Part 3
Some things that make BioNLP different
Named Entity Recognition
Genes have names??
Suzanna Lewis
• Fruitfly geneticist
• 5 kids
• Latte + 3 shots
Suzanna Lewis
It is the middle of the night (2:38 to be precise), I am away from friends and family, It has been this way for over 2 years, I can't sleep because of all the work there is yet to do, and there is no end in sight. So when do the magic little elves appear out of nowhere and get everything done?
p.s. I am serious.
Suzanna Lewis
pray for elves
D. melanogaster gene Pray For Elves, abbreviated as PFE, is reported here. It has also been known in FlyBase as CG15151. Similar sequences have been identified in Caenorhabditis elegans, Homo sapiens, Mus musculus, Rattus norvegicus and Saccharomyces cerevisiae.
(FlyBase report FBal0138651)
Named entity recognition
• Molecular biology entity identification problem:
– large list of classes
– some of them much harder
• Usual case-related cues don't help
• More variability of content
• Huge lexical ambiguity problem
• Common English
– as posed, not useful
white
white
"wild-type" (not mutated)
white
"mutant"
white
white
Case is meaningful
white
White
Case is meaningful
white
Symbol: w
White
Symbol: W
Yes, there are genes with the symbols I, a, R, p....
Case is meaningful
Misshapen (Msn) has been proposed to shut down Drosophila photoreceptor (R cell) growth cone motility in response to targeting signals linked by the SH2/SH3 adaptor protein Dock. (Ruan et al. 2002)
…even sentence-initially.
sunday driver (syd) was identified in a screen for novel axonal transport mutants in Drosophila. Syd is a ~137kDa protein that is broadly conserved in evolution with homologous proteins identified in C. elegans, mouse and human. (Bowman 2000)
Case is meaningful
Misshapen (Msn) has been proposed to shut down Drosophila photoreceptor (R cell) growth cone motility in response to targeting signals linked by the SH2/SH3 adaptor protein Dock. Here, we show that Bifocal (Bif), a putative cytoskeletal regulator, is a component of the Msn pathway for regulating R cell growth targeting. bif displays strong genetic interaction with msn.
Surely you could determine on a document-by-document basis…
Misshapen (Msn) has been proposed to shut down Drosophila photoreceptor (R cell) growth cone motility in response to targeting signals linked by the SH2/SH3 adaptor protein Dock. Here, we show that Bifocal (Bif), a putative cytoskeletal regulator, is a component of the Msn pathway for regulating R cell growth targeting. bif displays strong genetic interaction with msn.
Surely you could determine on a document-by-document basis…
Axonal traffic jams with a sunday driver: Identification of a broadly conserved transmembrane protein required for axonal transport in Drosophila. (Bowman 2000)
Evolution
• What it looks like
• What it acts like
• Metaphor
• …
Looks like…
• white
• swiss cheese
• clown
• dachshund
• dreadlocks
Acts like…
• ether a go-go
• lush
• agnostic
• amontillado
Metaphor/metonymy
• lot
• maggie
• scott of the antarctic
• always early -> british rail
• asp -> cleopatra
• tudor -> vasa -> gustavus
• nanos -> smaug
whimsy
• chablis, merlot, zinfandel, retsina, moonshine (16 zebrafish genes)
• milkah, murashka, zolotistyuy, zloday (32 Drosophila genes)
But, that’s not the only way of naming genes....
• Breast cancer 1 (BRCA1)
• p53
• Ribosomal protein S27
• Heat shock protein 110
• Mitogen activated protein kinase 15
• Mitogen activated protein kinase kinase kinase 5
• fuculokinase
• GABA
• Heat shock protein 60
• calmodulin
• dHAND
• suppressor of p53
• cheap date
• lush
• ken and barbie
• ring
• to
• the
• there
• a
Worst gene names
• sema domain, seven thrombospondin repeats (type 1 and type 1-like), transmembrane domain (TM) and short cytoplasmic domain, (semaphorin) 5A
• SEMA5A
• Tyrosine kinase with immunoglobulin and epidermal growth factor homology domains
• tie
• What doesn’t work
• What does (as of 2004)
“Gene mention” (NER)
Yeh et al. (2005)
Gene mention (NER)
Yeh et al. (2005)
Good systems?
• Handle multi-word names (heat shock protein 60) (base NP chunking, abbreviation definitions, post-processing)
• Use some form of machine learning (MaxEnt, HMM, CRF, SVM) (or a clever hack)
• Do some rule-based post-processing
• Don’t rely on dictionaries
The Jim Martin technique really works
Kinoshita et al. (2005)
...which isn’t to say that external knowledge is bad
• Markert/Nissim’s extensions of Poesio’s use of Google
Most feature sets include...
• Typo/orthographic features
– Patterns like \w+-?\d+
– Contains Greek letters
• Local/distant context
– Next word is “protein”
– Followed by “protein” somewhere else in document
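A minimal sketch of how the features listed on this slide might be computed for one token. The function names, the dictionary keys, and the short list of spelled-out Greek letter names are assumptions made for illustration; only the feature inventory itself comes from the slide.

```python
import re

# Assumed names throughout; only the feature inventory comes from the slide.
SYMBOL_PATTERN = re.compile(r"\w+-?\d+")  # the slide's pattern, e.g. "p53", "Hsp-60"
GREEK_NAMES = {"alpha", "beta", "gamma", "delta", "epsilon", "kappa"}  # illustrative subset


def has_greek(token):
    """True for a Greek character or a spelled-out Greek letter name."""
    # U+0370..U+03FF is the Unicode "Greek and Coptic" block.
    if any("\u0370" <= ch <= "\u03ff" for ch in token):
        return True
    return any(part in GREEK_NAMES for part in re.split(r"[^a-z]+", token.lower()))


def token_features(tokens, i):
    """Features for tokens[i], where tokens is the whole tokenized document."""
    token = tokens[i]
    next_word = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
    # Distant context: some other mention of the same token is followed by "protein".
    protein_elsewhere = any(
        tokens[j] == token and tokens[j + 1].lower() == "protein"
        for j in range(len(tokens) - 1)
        if j != i
    )
    return {
        "symbol_like": bool(SYMBOL_PATTERN.fullmatch(token)),
        "has_greek": has_greek(token),
        "next_is_protein": next_word == "protein",
        "protein_elsewhere": protein_elsewhere,
    }
```

A feature dictionary like this would typically be fed to one of the learners named on the "Good systems?" slide (MaxEnt, HMM, CRF, SVM); the choice of learner is independent of the sketch.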
Why not better?
• Length
• Case
• Tokenization
• Annotation issues
– Inconsistency
– Multiple correct answers
– Inter-corpus differences in definition
Yeh et al. (2005)
Length effect (and why the Jim Martin technique works so well for this)
Kinoshita et al. (2005)
A great research project
• Build an NER system for...
– Species
– Laboratory techniques
– Cell types
– Cell lines
– Tissues
– ....
...and, NER isn’t what you need anyways
• GN task and results
Tokenization
• How to build a cheap base noun phrase chunker
– Start from right, move left
• If next token is not conjunction, preposition, comma, period, or right parenthesis, add it
• Else start a new chunk
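The chunker recipe above can be sketched in a few lines. This is an illustration under stated assumptions: the function name and the small closed-class word lists are made up here, and a real system would need fuller lists (or POS tags) to decide what counts as a conjunction or preposition.

```python
# Assumed, deliberately tiny word lists; the slide does not enumerate them.
CONJUNCTIONS = {"and", "or", "but", "nor"}
PREPOSITIONS = {"of", "in", "on", "by", "with", "for", "to", "from", "at"}
PUNCTUATION = {",", ".", ")"}
BOUNDARIES = CONJUNCTIONS | PREPOSITIONS | PUNCTUATION


def cheap_chunks(tokens):
    """Scan right to left; boundary tokens close the current chunk."""
    chunks = []
    current = []
    for token in reversed(tokens):
        if token.lower() in BOUNDARIES:
            # Boundary: finish the chunk being built, then start a new one.
            if current:
                chunks.append(current)
            current = []
        else:
            # Not a boundary: add the token to the front of the current chunk.
            current.insert(0, token)
    if current:
        chunks.append(current)
    chunks.reverse()  # restore left-to-right order
    return chunks
```

For example, `cheap_chunks("heat shock protein 60 and calmodulin in rat liver .".split())` groups the tokens into `heat shock protein 60`, `calmodulin`, and `rat liver`. Verbs are not boundaries in this recipe, so the output is only a rough approximation of base noun phrases.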
Tokenization
• Commas
– 2,6-diaminohexanoic acid
– tricyclo(3.3.1.13,7)decanone
Four kinds of hyphens
• “Syntactic:”
– Calcium-dependent
– Hsp-60
• Knocked-out gene: lush-- flies
• Negation: -fever
• Electric charge: Cl-
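To make the four hyphen types concrete, here is a rough position-based heuristic. It is only a sketch: the function name and the labels are assumptions, and hyphen position alone cannot really separate these cases (a line-final "Cl-" and a truncated word look identical), which is part of why tokenization is hard here.

```python
def hyphen_kind(token):
    """Guess which of the four hyphen uses a token shows, by hyphen position only."""
    if "-" not in token or not token.strip("-"):
        return None  # no hyphen, or nothing but hyphens
    if token.endswith("--"):
        return "knockout"   # e.g. lush--
    if token.startswith("-"):
        return "negation"   # e.g. -fever
    if token.endswith("-"):
        return "charge"     # e.g. Cl-
    return "syntactic"      # e.g. Calcium-dependent, Hsp-60


for example in ["Calcium-dependent", "Hsp-60", "lush--", "-fever", "Cl-"]:
    print(example, hyphen_kind(example))
```

A tokenizer could use a guess like this to decide whether to split at the hyphen (syntactic) or keep it attached to the token (the other three).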
B-cell-CD4(+)-T-cell interactions
• PMID: 10516078
Special challenges in biomedical corpus construction
• How do you parse rat epithelial growth factor receptor 2?
• Don’t. Pretag all named entities
• How do you tokenize tricyclo(3.3.1.13,7)decanone?
• Don’t. Pretag all named entities
• How do you hire a linguistics graduate student to tag rat epithelial growth factor receptor 2?
• You can’t...
• How do you do PAS tagging when you don’t have syntactically tagged text?
• Sigh...
Some specific cases of word sense disambiguation
Abbreviation disambiguation
• Incidence of ambiguous abbreviations (Jeff Chang’s paper)
• Statistical approaches
– Chang
• Rule-based
– Schwartz and Hearst
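For the rule-based side, a minimal sketch of right-to-left character matching between a short form and the text that precedes it, in the spirit of Schwartz and Hearst's approach. It is a simplified illustration, not their published implementation: the function name is an assumption, and candidate detection (finding the parenthesized short form and bounding the search window) is left out.

```python
def best_long_form(short_form, candidate):
    """Return the shortest suffix of candidate that matches short_form, or None.

    Characters of the short form are matched right to left against the
    candidate; the first character of the short form must start a word.
    """
    s = len(short_form) - 1
    l = len(candidate) - 1
    while s >= 0:
        ch = short_form[s].lower()
        if not ch.isalnum():
            s -= 1  # skip punctuation inside the short form
            continue
        # Move left until the character matches; for the first short-form
        # character, also require that the match sits at the start of a word.
        while (l >= 0 and candidate[l].lower() != ch) or (
            s == 0 and l > 0 and candidate[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    # Cut the candidate at the start of the word where the match began.
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]
```

Called as `best_long_form("HSP", "the heat shock protein")`, the sketch trims the leading "the" and returns "heat shock protein"; it returns `None` when some character of the short form cannot be found.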
Part 4: getting up to speed
(about) 10 papers and resources that will let you read most other papers in BioNLP
Named entity recognition 1: rule-based
• Fukuda et al. (1998): first NER paper
– Find something that looks like a symbol for a yeast gene (ABC1)
– Extend name to the left (yeast ABC1)
– Extend name to the right (ABC1 protein)
• Results in 90s
– Never replicated
– Yeast is easy
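The three steps on this slide (find a symbol-like core, extend left, extend right) can be sketched as below. This is a toy illustration of the idea, not a reimplementation of Fukuda et al.: the symbol pattern, the word lists, and the function name are all assumptions.

```python
import re

# Assumed shape of a yeast gene symbol: three capital letters plus digits (ABC1).
SYMBOL = re.compile(r"[A-Z]{3}\d+")
LEFT_WORDS = {"yeast"}              # hypothetical words that may precede a name
RIGHT_WORDS = {"protein", "gene"}   # hypothetical words that may follow a name


def find_gene_names(tokens):
    """Find symbol-like core tokens, then grow each name leftward and rightward."""
    names = []
    for i, token in enumerate(tokens):
        if not SYMBOL.fullmatch(token):
            continue
        start, end = i, i + 1
        # Extend the name to the left (yeast ABC1).
        while start > 0 and tokens[start - 1].lower() in LEFT_WORDS:
            start -= 1
        # Extend the name to the right (ABC1 protein).
        while end < len(tokens) and tokens[end].lower() in RIGHT_WORDS:
            end += 1
        names.append(" ".join(tokens[start:end]))
    return names
```

On the token list for "The yeast ABC1 protein binds XYZ2 ." the sketch yields "yeast ABC1 protein" and "XYZ2". The approach leans entirely on the regular shape of the symbol, which is consistent with the slide's remark that yeast is easy.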
Named entity recognition 2: machine learning
• Collier et al. (XXX)
NER 3: state of the art
Information extraction 1: rule-based
• Blaschke 1998
Information extraction 2: machine learning
• Craven and Kumlien 199X
• Identify entity pairs
– Protein/protein
– Protein/disease
– Protein/?
• Use naïve Bayes to classify sentences as +/- positing a relation
– Features: bag-of-words
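A small, self-contained sketch of the classification step: multinomial naïve Bayes over bag-of-words features with add-one smoothing, labelling a sentence as positing a relation or not. The function names and the whitespace tokenization are assumptions for illustration, and the sketch expects at least one training sentence per class.

```python
import math
from collections import Counter


def train(examples):
    """examples: (sentence, label) pairs; label is True if a relation is posited."""
    class_counts = Counter()
    word_counts = {True: Counter(), False: Counter()}
    vocab = set()
    for sentence, label in examples:
        words = sentence.lower().split()  # bag of words, whitespace-tokenized
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def predict(model, sentence):
    """Return the label with the higher log posterior under the model."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in (True, False):
        score = math.log(class_counts[label] / total)  # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in sentence.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

In use, the training pairs would be sentences that mention an entity pair, labelled by whether the sentence asserts a relation between the two; `predict(train(pairs), new_sentence)` then gives the +/- decision.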
Information extraction 3: rules, linguistics, knowledge
• Friedman: MedLEE, BioMedLEE
• NER
• Syntax
Corpora: 1
• PubMed/MEDLINE
– MEDLINE: database of 16M+ abstracts
– PubMed: interface for searching MEDLINE
– ASCII and free
NOT a corpus—not really even a “text collection”
Corpora: 2
• GENIA
– Fully annotated corpus
– 2,000 abstracts
– X00,000 words
– Now: POS, named entities, 25% treebanked
– Coming: anaphora; events?; PAS?; dependency parses?
Lexical resources: 1
• Gene Ontology
– Molecular functions
– Biological processes
– Cell components
• Building blocks
– Terms + definitions
– Is-a, part-of
Lexical resources: 2
• Entrez Gene (formerly LocusLink)
– Names
– Symbols
– Synonyms
– Protein products
– “Summary”
– Gene References Into Function
Lexical resources: 3
• UMLS (Unified Medical Language System)
– MetaThesaurus
– Semantic Network
Tools overview
• Probably something available
• Might work decently
• Definitely improvable for your specific task
Tools: 1
• POS tagging:
– GENIA
– MEDPOST
– LingPipe?
Tools: 2
• Named entity recognition
– ABNER (Settles 200x)
– KeX
– AbGene
• LESSON: distribute a .jar file and the world will beat a path to your door
Part 6: Current hot topics
What’s the right model for semantic representation?
• So far: binary relations
• Arguments that that’s not good enough
– Rzhetsky/GeneWays paper
– Penn folks/IE paper
– Native speaker intuitions (Juliane, etc.)
What’s the right model for semantic representation?
• Two ways forward
– Differentiating binary relations
• Marti HLT/EMNLP; Tsujii
– PAS
• PASBio/Wattarujeekrit et al.
• Kogan et al.
Karin: how do these representational choices affect what a biologist would get out of the text?
The ontology wars
• Point:
– Hunter; PASBio; Barry Smith; L&C....
– GOA; MGI; EBI; ...
• Counterpoint:
– Tsujii/Ananiadou; Pedersen/Pakhomov; Markert/Nissim...
True integration of NLP into laboratory data interpretation
• <Last chapter of Sophia and John’s book>
The embarrassing truth about BioNLP (take 2)...
References
• Shuy, Roger (2002) Linguistic battles in trademark disputes. Palgrave.
• Yeh, Alexander; Alexander Morgan; Marc Colosimo; and Lynette Hirschman (2005) BioCreative Task 1A: gene mention finding evaluation. BMC Bioinformatics 6(Suppl. 1):S2.