NLP Techniques (Machine Learning)
NER in Biomedical Domain
Tsujii Laboratory
Hong-Woo CHUN (D1)
February 10th, 2005
Univ. of Tokyo
Introduction
As research in the biomedical domain has grown rapidly in recent years, a huge amount of natural language resources has been developed and has become a rich knowledge base.
NER (Named Entity Recognition) is in strong demand in the biomedical domain.
In this project, NER identifies names of genes, gene products and diseases in biomedical text. From now on, genes and gene products are both referred to as 'gene'.
NER in the biomedical domain has not yet reached high performance compared with NER in the newswire domain.
Introduction::Problems in NER
Modifiers often precede basic NEs: activated B cell lines
Biomedical NEs are sometimes very long: 47 kDa sterol regulatory element binding factor
Two or more NEs may share one head noun through a conjunction or disjunction construction: 91 and 84 kDa proteins
An entity may be found with various spelling forms
NEs may be cascaded
One NE may be embedded in another NE
Abbreviations are frequently used
Therefore, it is necessary to explore more evidential features and more effective methods to cope with such difficulties.
NER without NLP tech.
Dictionary-based longest matching
Number of words in the dictionaries
Gene: 44,463
Disease: 159,477
Corpus: 1,000 biomedical sentences tagged by biologists
Gene and Disease names and their Association

            Gene               Disease
            Hishiki  Nagata    Hishiki  Nagata
Precision   57.7%    65.0%     78.0%    82.1%
Recall      100%     100%      100%     100%
F-score     73.2%    78.8%     87.6%    90.2%
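
The dictionary-based longest matching named above can be sketched roughly as follows. This is a minimal illustration, not the procedure used in the experiments: the function name `longest_match`, the `max_len` limit, and the toy dictionary entries are all assumptions made for the example.

```python
def longest_match(tokens, dictionary, max_len=8):
    """Greedy left-to-right longest matching against a term dictionary.

    `dictionary` maps a tuple of lowercased tokens to a label.
    Returns (start, end, label) spans, with `end` exclusive.
    """
    spans = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first, then shrink one token at a time.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in dictionary:
                match = (i, i + n, dictionary[key])
                break
        if match:
            spans.append(match)
            i = match[1]  # continue after the matched span
        else:
            i += 1
    return spans


# Toy dictionary, invented for illustration only.
toy_dict = {
    ("brca1",): "GENE",
    ("cancer",): "DISEASE",
    ("breast", "cancer"): "DISEASE",
}

tokens = "BRCA1 is linked to breast cancer".split()
spans = longest_match(tokens, toy_dict)
```

Here "breast cancer" is preferred over the shorter entry "cancer" because longer candidates are tried first.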
Experimental results(1)
Maximum Entropy based model
Features
Local context (Name itself, Unigrams and Bigrams)
POS (Name itself, Unigrams and Bigrams)
Capitalization (All capital, Mixed capital, No capital)
Digitalization (All digit, Mixed digit, No digit)
24 Greek letters (alpha, beta, gamma, …)
12 suffixes
Corpus: 1,000 biomedical sentences tagged by biologists
Gene and Disease names and their Association
Evaluation: 10-fold cross validation
Context window: L2 L1 NE R1 R2
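
As a rough sketch of how such features might be read off the L2 L1 NE R1 R2 window, the snippet below builds a feature dictionary for one candidate name. The feature names, the padding symbols, and the shortened Greek letter list are assumptions for illustration; POS and suffix features are left out for brevity, and the original feature templates may differ.

```python
# Subset of Greek letter names; the slides mention 24 letters.
GREEK = {"alpha", "beta", "gamma", "delta", "kappa"}


def shape(text, test):
    """Return 'all', 'mixed' or 'no' depending on how many characters pass `test`."""
    flags = [test(ch) for ch in text]
    if flags and all(flags):
        return "all"
    if any(flags):
        return "mixed"
    return "no"


def extract_features(tokens, start, end):
    """Features for the candidate name tokens[start:end] (end exclusive)."""
    name = tokens[start:end]
    # Pad so that L2, L1, R1, R2 always exist at sentence boundaries.
    padded = ["<s>", "<s>"] + [t.lower() for t in tokens] + ["</s>", "</s>"]
    l2, l1 = padded[start], padded[start + 1]
    r1, r2 = padded[end + 2], padded[end + 3]
    surface = "".join(name)
    feats = {
        "name=" + " ".join(name).lower(): 1,
        "L1=" + l1: 1,                      # unigram context
        "R1=" + r1: 1,
        "L2_L1=" + l2 + "_" + l1: 1,        # bigram context
        "R1_R2=" + r1 + "_" + r2: 1,
        "cap=" + shape(surface, str.isupper): 1,
        "digit=" + shape(surface, str.isdigit): 1,
    }
    if any(t.lower() in GREEK for t in name):
        feats["greek"] = 1
    return feats


tokens = "the activated IL-2 gene was expressed".split()
feats = extract_features(tokens, 2, 3)
```

A dictionary of this kind can be passed to any maximum entropy (multinomial logistic regression) trainer that accepts sparse named features.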
Experimental results(2)
Example of Corpus
Experimental results(3)::Useful features
Table (per-cell marks not preserved): feature groups compared for Gene and Disease
Local context
Capitalization
Digitalization
Greek letters
Affix
POS: NE only; NE and unigrams; NE, unigrams and bigrams
Experimental results(4)
Agreement for annotations between Hishiki-san and Nagata-san
Comparison
Features
Gene: Local context, Capitalization, POS of NE
Disease: Local context, Capitalization, POS of NE and Unigram
Evaluation: 10-fold cross validation
Gene 90.3%
Disease 89.3%

Test data                  Training data   Gene                Disease
                                           P     R     F       P     R     F
Nagata                     Hishiki         88.6  81.4  84.8    90.4  92.8  91.6
(Gene: 650, Disease: 821)  Nagata          86.8  90.9  88.8    89.6  95.7  92.6
                           Intersection    90.6  80.0  85.0    91.1  89.9  90.5
                           Union           85.4  91.7  88.4    88.8  97.4  92.9
Hishiki                    Hishiki         80.2  83.0  81.6    88.7  95.9  92.2
(Gene: 577, Disease: 780)  Nagata          77.5  91.5  83.9    86.8  97.6  91.9
                           Intersection    81.7  81.3  81.5    89.9  93.3  91.6
                           Union           76.5  92.5  83.8    85.5  98.7  91.6
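
The Intersection and Union training sets and the P, R, F columns can be illustrated with a small sketch. The annotation tuples below are invented toy data, and the `(sentence, start, end, label)` representation is an assumption; only the F-score formula (harmonic mean of precision and recall) is standard.

```python
def f_measure(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(gold, predicted):
    """Precision, recall and F-score of predicted annotations against gold ones."""
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)


# Toy annotations as (sentence id, start, end, label); invented for illustration.
hishiki = {(0, 0, 1, "GENE"), (0, 4, 6, "DISEASE"), (1, 2, 3, "GENE")}
nagata = {(0, 0, 1, "GENE"), (0, 4, 6, "DISEASE"), (2, 1, 2, "DISEASE")}

intersection = hishiki & nagata   # names both annotators marked
union = hishiki | nagata          # names either annotator marked

p, r, f = evaluate(nagata, hishiki)

# Recomputing F from the P and R of the first Gene row of the table.
table_f = round(f_measure(88.6, 81.4), 1)  # 84.8
```

Training on the intersection keeps only names both annotators agree on, which is consistent with the higher precision and lower recall of the Intersection rows; the union does the opposite.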
Experimental results(5)::Gene
Experimental results(6)::Disease
Conclusions
Through the experiments, we found that NLP techniques (the ML approach) play an important role in improving performance.
We can expect performance to increase further by considering more evidential features.
It is necessary to explore more evidential features and more effective methods to cope with NER difficulties.
We found that performance improved as the size of the training corpus increased.
Thank you!!!
Gaussian Prior (Hishiki)
Gaussian Prior   Gene                Disease
                 P     R     F       P     R     F
20               73.8  78.5  76.0    85.6  96.5  90.7
50               75.2  79.0  77.1    87.0  95.5  91.0
80               75.4  79.0  77.1    87.3  95.4  91.2
100              75.4  79.9  77.2    87.5  95.4  91.2
200              75.6  79.0  77.2    87.6  95.3  91.3
300              75.6  79.0  77.2    87.7  95.2  91.3
400              75.4  79.0  77.1    87.8  95.2  91.3
500              75.4  78.9  77.1    87.8  95.1  91.3
800              75.3  78.2  76.7    87.8  94.9  91.2
1000             75.3  78.1  76.7    87.8  94.7  91.1
1500             75.5  77.4  76.4    87.7  94.7  91.1
2000             75.5  76.7  76.1    87.7  94.7  91.1
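
For reference, a Gaussian prior on maximum entropy weights is commonly written as a penalized log-likelihood, as below. The slides do not say how the values 20 to 2000 parameterize the prior, so reading them as the variance $\sigma^2$ is an assumption; $\lambda_i$ and $f_i$ denote feature weights and feature functions.

```latex
p_\lambda(y \mid x) = \frac{\exp\left(\sum_i \lambda_i f_i(x, y)\right)}
                           {\sum_{y'} \exp\left(\sum_i \lambda_i f_i(x, y')\right)},
\qquad
L(\lambda) = \sum_{(x, y)} \log p_\lambda(y \mid x) \;-\; \sum_i \frac{\lambda_i^2}{2\sigma^2}
```

Under this reading, a small $\sigma^2$ pulls the weights strongly toward zero and a large $\sigma^2$ approaches unregularized training.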
Experimental results (Hishiki)
Features Gene Disease
P R F P R F
Name, context (W) 76.6 83.4 79.8 89.1 95.6 92.3
Caps Info 73.5 68.1 70.7 78.0 99.4 87.4
Digit Info. 63.7 86.8 73.5 77.9 99.5 87.4
Greek 63.2 84.4 72.3 77.9 99.5 87.4
Affix 62.9 83.7 71.8 78.0 99.5 87.4
POS 64.4 78.9 70.9 78.1 99.2 87.4
W+Caps Info. 80.7 84.6 82.6 87.8 98.2 92.7
W+Digit Info. 79.0 83.9 81.3 87.7 98.2 92.7
W+Greek 75.2 84.1 79.4 87.6 98.3 92.6
W+Affix 75.0 84.9 79.7 87.7 98.2 92.7
W+D+G 79.7 84.2 81.9 87.7 98.2 92.7
W+C+D 80.7 84.6 82.6 87.7 98.2 92.7
W+C+G 80.4 84.1 82.2 87.7 98.2 92.7
W+A+C 80.6 84.2 82.4 87.8 98.2 92.7
W+A+D 78.9 84.1 81.4 87.8 98.2 92.7
W+A+G 75.0 83.9 79.2 87.6 98.2 92.6
W+C+D+G 80.5 83.7 82.1 87.8 98.2 92.7
W+A+C+D 80.5 84.2 82.3 88.0 98.3 92.9
W+A+C+G 80.3 84.1 82.1 87.8 98.2 92.7
W+A+D+G 79.5 83.9 81.6 87.9 98.3 92.8
W+A+C+D+G 80.5 83.9 82.2 87.9 98.2 92.8
Experimental results (Hishiki)
Features Gene Disease
P R F P R F
Name, context (W) 76.6 83.4 79.8 89.1 95.6 92.3
W+POS of NE 76.3 84.2 80.1 87.7 97.6 92.0
W+POS(NE,uni) 75.9 82.3 79.0 88.6 95.8 92.1
W+POS(NE,uni,bi) 76.0 79.4 77.6 87.8 94.9 91.2
W+Caps Info. 80.7 84.6 82.6 87.8 98.2 92.7
W+C+POS 81.0 83.5 82.3 87.8 97.6 92.4
W+C+POS1 80.0 82.5 81.2 88.6 95.6 92.0
W+C+POS2 77.2 78.9 78.0 88.4 95.1 91.7
W+C+D 80.7 84.6 82.6 87.7 98.2 92.7
W+C+D+POS 80.8 83.0 81.9 87.6 97.6 92.3
W+C+D+POS1 79.9 82.5 81.2 88.7 95.9 92.2
W+C+D+POS2 77.2 79.2 78.2 88.4 95.1 91.7
W+A+C+D 80.5 84.2 82.3 88.0 98.3 92.9
W+A+C+D+POS 81.0 83.4 82.2 87.8 97.6 92.4
W+A+C+D+POS1 79.8 82.3 81.1 88.8 95.8 92.2
W+A+C+D+POS2 77.0 79.0 78.0 88.1 94.5 91.2
Experimental results (Nagata)
Features Gene Disease
P R F P R F
Name, context (W) 82.7 88.3 85.4 89.7 95.0 92.3
Caps Info 73.4 88.8 80.4 82.1 99.4 89.9
Digit Info. 72.2 89.7 80.0 82.1 99.5 90.0
Greek 71.7 86.2 78.3 82.1 99.5 90.0
Affix 71.6 85.1 77.8 82.1 99.5 90.0
POS 72.8 86.3 79.0 82.2 99.3 89.9
W+Caps Info. 86.4 90.2 88.3 88.5 97.8 92.9
W+Digit Info. 82.2 91.2 86.5 88.5 97.9 93.0
W+Greek 80.9 92.0 86.1 88.6 98.1 93.1
W+Affix 80.4 92.0 85.8 88.6 98.1 93.1
W+D+G 82.7 91.4 86.8 88.6 98.2 93.1
W+C+D 85.9 90.2 88.0 88.5 97.7 92.9
W+C+G 86.2 90.6 88.4 88.5 97.7 92.9
W+A+C 86.0 90.2 88.1 88.5 97.8 92.9
W+A+D 82.3 91.4 86.6 88.5 98.1 93.0
W+A+G 80.7 91.5 85.8 88.6 98.2 93.1
W+C+D+G 86.1 90.8 88.4 88.6 98.1 93.1
W+A+C+D 85.9 90.2 88.0 88.5 97.8 92.9
W+A+C+G 86.2 90.5 88.3 88.7 98.1 93.1
W+A+D+G 82.6 91.4 86.8 88.7 98.1 93.1
W+A+C+D+G 85.7 90.6 88.1 88.6 97.8 93.0
Experimental results (Nagata)
Features Gene Disease
P R F P R F
Name, context (W) 82.7 88.3 85.4 89.7 95.0 92.3
W+POS 81.5 90.6 85.8 88.5 96.0 92.1
W+POS1 81.7 90.6 85.9 89.8 95.5 92.6
W+POS2 81.8 86.3 84.0 89.3 95.4 92.2
W+Caps Info. 86.4 90.2 88.3 88.5 97.8 92.9
W+C+POS 86.3 89.4 87.8 88.6 97.0 92.6
W+C+POS1 85.9 90.2 88.0 90.0 96.1 92.9
W+C+POS2 85.7 87.5 86.6 89.4 95.2 92.2
W+C+D+G 86.1 90.8 88.4 88.6 98.1 93.1
W+C+D+G+POS 86.5 89.1 87.8 88.6 97.1 92.6
W+C+D+G+POS1 85.5 89.8 87.6 89.9 96.1 92.9
W+C+D+G+POS2 85.3 87.5 86.4 89.5 95.1 92.2
W+C+G+POS 86.7 89.1 87.9 88.5 97.0 92.6
W+C+G+POS1 85.6 89.7 87.6 89.8 96.3 92.9
W+C+G+POS2 85.2 87.5 86.3 89.2 95.0 92.0
Prefix and suffix
Important cue for terminology identification
~cin: actinomycin
~mide: cycloheximide
~zole: sulphamethoxazole
~lipid: phospholipids
~rogen: estrogen
~vitamin: dihydroxyvitamin
etc.
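
A suffix cue of this kind could be turned into features with a sketch like the one below. The six suffixes are the ones listed above; the function name `affix_features` and the handling of a trailing plural "s" (needed for "phospholipids" to match ~lipid) are assumptions for illustration.

```python
SUFFIXES = ("cin", "mide", "zole", "lipid", "rogen", "vitamin")


def affix_features(word, suffixes=SUFFIXES):
    """Return the listed suffixes that the word ends with.

    A trailing plural 's' is ignored so that 'phospholipids' matches 'lipid'.
    """
    w = word.lower()
    forms = {w}
    if w.endswith("s"):
        forms.add(w[:-1])
    return [s for s in suffixes if any(f.endswith(s) for f in forms)]


examples = [
    "actinomycin", "cycloheximide", "sulphamethoxazole",
    "phospholipids", "estrogen", "dihydroxyvitamin",
]
matched = {word: affix_features(word) for word in examples}
```

Each matched suffix can then be added as a binary feature of the candidate name.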