A New Approach for HMM Based Chunking for Hindi
Ashish Tiwari, Arnab Sinha
Under the guidance of Dr. Sudeshna Sarkar
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur
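TnT
1. A script converts the .utf file (IIIT corpus) into a .tt file (tagged training data).
2. tnt_para builds the model files (.123 file and .lex file) from the .tt file.
3. tnt uses the model files to tag the .t file (untagged data), producing a .tts file (tagged by TnT).
4. tnt_diff compares the .tts file with the .tt file (tagged) to give the accuracy.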
Tool Flow
1. Parse the corpus
2. Apply 4 types of token schemes
3. Apply 3 different tag schemes
4. Add POS context to chunk-tags
5. Do chunk labeling
6. Compare the accuracies of the results
7. Recommendations
Stages: Chunk Boundary, Chunk Labeling
Token schemes
Example sentence (chunks in double parentheses): (( ashish )) (( arnab ke pIche )) (( bajar meM )) (( gaya . ))
Gloss: ashish arnab of behind market in went .
POS tags: NN NN PREP PREP NN PREP VB SYM
1. (word-token, Chunk-tag): ashish arnab of behind market in went .
2. (POS-tag, Chunk-tag): NN NN PREP PREP NN PREP VB SYM
3. (word_POS-tag, Chunk-tag): ashish_NN arnab_NN of_PREP behind_PREP market_NN in_PREP went_VB ._SYM
4. (POS-tag_word, Chunk-tag): NN_ashish NN_arnab PREP_of PREP_behind NN_market PREP_in VB_went SYM_.
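The four token schemes can be illustrated with a small Python sketch. The function name `make_tokens` and the scheme names `"word"`, `"pos"`, `"word_pos"`, and `"pos_word"` are hypothetical, chosen here for illustration; the slides do not show the authors' actual script.

```python
def make_tokens(words, pos_tags, scheme):
    """Build HMM input tokens for one of the four token schemes.

    The scheme names are illustrative labels, not taken from the slides.
    """
    if len(words) != len(pos_tags):
        raise ValueError("words and pos_tags must have the same length")
    if scheme == "word":        # scheme 1: the word itself
        return list(words)
    if scheme == "pos":         # scheme 2: the POS tag only
        return list(pos_tags)
    if scheme == "word_pos":    # scheme 3: word followed by POS tag
        return [f"{w}_{p}" for w, p in zip(words, pos_tags)]
    if scheme == "pos_word":    # scheme 4: POS tag followed by word
        return [f"{p}_{w}" for w, p in zip(words, pos_tags)]
    raise ValueError(f"unknown scheme: {scheme}")


# Example sentence from the slide (transliterated Hindi words).
words = ["ashish", "arnab", "ke", "pIche", "bajar", "meM", "gaya", "."]
pos_tags = ["NN", "NN", "PREP", "PREP", "NN", "PREP", "VB", "SYM"]

for scheme in ("word", "pos", "word_pos", "pos_word"):
    print(scheme, make_tokens(words, pos_tags, scheme))
```

The sketch uses the transliterated words rather than the English gloss shown on the slide; either works, since only the pairing of word and POS tag matters.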
Chunk Tag schemes
2-Tag Scheme: {STRT, CNT}
3-Tag Scheme: {STRT, CNT, END}
4-Tag Scheme: {STRT, CNT, END, STRT_END}
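As a rough sketch of how the three tag sets might be assigned, the Python below tags each token by its position in its chunk. It assumes STRT marks the first token of a chunk, CNT a continuing token, END the last token, and STRT_END a single-token chunk; the slides list only the tag names, so these readings (and the choice to tag a single-token chunk as STRT in the 2-tag and 3-tag schemes) are assumptions.

```python
def chunk_tags(chunks, scheme):
    """Assign chunk-boundary tags to every token.

    chunks: list of chunks, each a list of tokens.
    scheme: 2, 3, or 4 (size of the tag set).
    The meaning of each tag is an assumption; see the lead-in.
    """
    if scheme not in (2, 3, 4):
        raise ValueError("scheme must be 2, 3, or 4")
    tags = []
    for chunk in chunks:
        last = len(chunk) - 1
        for i in range(len(chunk)):
            if scheme == 4 and i == 0 and i == last:
                tags.append("STRT_END")   # single-token chunk
            elif i == 0:
                tags.append("STRT")       # first token of the chunk
            elif i == last and scheme >= 3:
                tags.append("END")        # last token (3- and 4-tag only)
            else:
                tags.append("CNT")        # continuation token
    return tags


chunks = [["ashish"], ["arnab", "ke", "pIche"], ["bajar", "meM"], ["gaya", "."]]
for scheme in (2, 3, 4):
    print(scheme, chunk_tags(chunks, scheme))
```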
Adding POS-tag to Chunk-tag
Ex: Word as token and POS:2-tag chunking
(( ashish )) (( arnab ke pIche )) (( bajar meM )) (( gaya . ))
Gloss: ashish arnab of behind market in went .
POS tags: NN NN PREP PREP NN PREP VB SYM
Chunk-tags with POS: NN:STRT NN:STRT PREP:CNT PREP:CNT NN:STRT PREP:CNT VB:STRT SYM:CNT
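The combined labels in this example can be produced by joining each POS tag to its chunk tag with a colon. The sketch below is a minimal illustration; the function names are hypothetical, and the reverse step (stripping the POS part to recover the plain chunk tag) is an assumed convenience rather than something the slides describe.

```python
def add_pos_to_chunk_tags(pos_tags, chunk_tags):
    """Combine each POS tag with its chunk tag as 'POS:CHUNK'."""
    if len(pos_tags) != len(chunk_tags):
        raise ValueError("pos_tags and chunk_tags must have the same length")
    return [f"{p}:{c}" for p, c in zip(pos_tags, chunk_tags)]


def strip_pos(labels):
    """Recover the plain chunk tags from 'POS:CHUNK' labels."""
    return [label.rsplit(":", 1)[1] for label in labels]


pos = ["NN", "NN", "PREP", "PREP", "NN", "PREP", "VB", "SYM"]
two_tag = ["STRT", "STRT", "CNT", "CNT", "STRT", "CNT", "STRT", "CNT"]

labels = add_pos_to_chunk_tags(pos, two_tag)
print(labels)
print(strip_pos(labels))
```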
Colon vs Non-Colon
• Corpus size = 20,000 words
• On a larger data set, the <word_POS-tag> token might perform better
Marginal improvement
Chunk Boundary identification
Results are improved.
4tag2tag gives the highest precision and recall.
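If 4tag2tag is read as tagging with the 4-tag scheme and then collapsing the output to the 2-tag set, the conversion could look like the sketch below. The mapping (STRT_END to STRT, END to CNT) is an assumption consistent with the tag names; the slides do not spell it out.

```python
# Assumed mapping from the 4-tag set onto the 2-tag set.
FOUR_TO_TWO = {
    "STRT": "STRT",
    "CNT": "CNT",
    "END": "CNT",        # a chunk-final token is still inside the chunk
    "STRT_END": "STRT",  # a single-token chunk starts a new chunk
}


def four_tag_to_two_tag(tags):
    """Collapse a 4-tag sequence to the 2-tag set {STRT, CNT}."""
    return [FOUR_TO_TWO[tag] for tag in tags]


four_tag = ["STRT_END", "STRT", "CNT", "END", "STRT", "END", "STRT", "END"]
print(four_tag_to_two_tag(four_tag))
```

Under this mapping the chunk boundaries are unchanged, since a new chunk begins exactly where a STRT appears in the collapsed sequence.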
Addition of POS-tag Information to Chunk-tags
A significant increase in precision and recall is observed.
The 4-to-2-tag (4tag2tag) scheme for <word_POS, chunk:POS> scores highest.
Labeling the Chunks
First Scheme
token: <word>_<POS-tag>
label: <2-tag chunk boundary>:POS-tag:<chunk label> (if this is the first token of the chunk)
       <2-tag chunk boundary>:POS-tag (otherwise)
Second Scheme
token: <word>_<POS-tag>
label: <2-tag chunk boundary>:POS-tag:<chunk label> (for all tokens)
Third Scheme
token: <word>_<POS-tag>
label: <2-tag chunk boundary>:POS-tag:<chunk label> (if this is the last token of the chunk)
       <2-tag chunk boundary>:POS-tag (otherwise)
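The three labeling schemes differ only in which tokens carry the chunk label. A minimal Python sketch follows; the function name `label_chunks` and the chunk label `NP` in the example are hypothetical, since the slides do not list the chunk label inventory.

```python
def label_chunks(chunks, scheme):
    """Return (token, label) pairs for one of the three labeling schemes.

    chunks: list of (chunk_label, [(word, pos), ...]) pairs.
    scheme: 1 = chunk label on the first token only,
            2 = chunk label on all tokens,
            3 = chunk label on the last token only.
    """
    if scheme not in (1, 2, 3):
        raise ValueError("scheme must be 1, 2, or 3")
    pairs = []
    for chunk_label, tokens in chunks:
        last = len(tokens) - 1
        for i, (word, pos) in enumerate(tokens):
            boundary = "STRT" if i == 0 else "CNT"   # 2-tag chunk boundary
            label = f"{boundary}:{pos}"
            carries_label = (
                scheme == 2
                or (scheme == 1 and i == 0)
                or (scheme == 3 and i == last)
            )
            if carries_label:
                label += f":{chunk_label}"
            pairs.append((f"{word}_{pos}", label))
    return pairs


# "NP" is an illustrative chunk label, not taken from the slides.
example = [("NP", [("arnab", "NN"), ("ke", "PREP"), ("pIche", "PREP")])]
for scheme in (1, 2, 3):
    print(scheme, label_chunks(example, scheme))
```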
Results – Labelling of Chunks
• The first scheme gives the highest precision, 89.02%. Note, however, that the word_POS-tag approach is not far behind, with 85.58% precision and the highest recall, 98.48%.
• The recall of the word_POS and POS_word approaches is the same in all schemes, because the ordering seems to add no new knowledge to the existing model.
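The slides do not state how precision and recall were computed. One common convention, assumed here, counts a predicted chunk as correct only when both of its boundaries match a gold chunk. The sketch below applies that convention to 2-tag sequences; the function names and the example prediction are hypothetical.

```python
def spans(tags):
    """Return the set of (start, end) chunk spans in a 2-tag sequence.

    A chunk starts at each STRT and runs up to the next STRT.
    Tokens before the first STRT are ignored.
    """
    result, start = set(), None
    for i, tag in enumerate(tags):
        if tag == "STRT":
            if start is not None:
                result.add((start, i))
            start = i
    if start is not None:
        result.add((start, len(tags)))
    return result


def precision_recall(gold_tags, predicted_tags):
    """Chunk-level precision and recall with exact span matching."""
    gold, predicted = spans(gold_tags), spans(predicted_tags)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


gold = ["STRT", "STRT", "CNT", "CNT", "STRT", "CNT", "STRT", "CNT"]
# Hypothetical prediction that wrongly splits the second chunk.
predicted = ["STRT", "STRT", "CNT", "STRT", "STRT", "CNT", "STRT", "CNT"]
print(precision_recall(gold, predicted))
```

In this example three of the five predicted chunks match the four gold chunks, so one wrong boundary lowers both precision and recall.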
Recommendations
For identification of chunk boundary
• Best option: <POS word_POS> <chunk_tag>:<POS_tag>
• Subsequent conversion to the 2-tag set gives better results.
For chunk labeling
• Scheme 1 is best.
• Adding POS-tag information improves the precision and recall of chunk labeling.
This approach can be used for other Indian languages as well.
References
• Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition.
• Miles Osborne. 2000. Shallow Parsing as Part-of-Speech Tagging. Proceedings of CoNLL-2000.
• Lance A. Ramshaw and Mitchell P. Marcus. 1995. Text Chunking Using Transformation-Based Learning. Proceedings of the 3rd Workshop on Very Large Corpora, 88-94.
• W. Skut and T. Brants. 1998. Chunk Tagger: Statistical Recognition of Noun Phrases. ESSLLI-1998.
• Thorsten Brants. 2000. TnT - A Statistical Part-of-Speech Tagger. Proceedings of the Sixth Conference on Applied Natural Language Processing, 224-231.