POS tagging and Chunking for Indian Languages
Rajeev Sangal and V. Sriram, International Institute of Information Technology, Hyderabad
Contents
NLP : Introduction
Language Analysis - Representation
Part-of-speech tags in Indian Languages (Ex. Hindi)
Corpus based methods: An introduction
POS tagging using HMMs
Introduction to TnT
Chunking for Indian languages – Few experiments
Shared task - Introduction
Language
• A unique ability of humans
• Animals have signs – sign for danger
  But cannot combine the signs
• Higher animals – apes
  Can combine symbols (noun & verb)
  But can talk only about the here and now
Language : Means of Communication
CONCEPT → coding → Language → decoding → CONCEPT
* The concept gets transferred through language
Language : Means of thinking
What should I wear today?
* Can we think without language ?
What is NLP ?
The process of computer analysis of input provided in a human language is known as Natural Language Processing.
Concept
Language
Intermediate representation: used for processing by computer
Applications
Machine translation
Document Clustering
Information Extraction / Retrieval
Text classification
MT system : Shakti
• Machine translation system being developed at IIIT – Hyderabad.
• A hybrid translation system which uses the combined strengths of linguistic, statistical and machine learning techniques.
• Integrates the best available NLP technologies.
Shakti architecture
English sentence
→ English sentence analysis (Morphology, POS tagging, Chunking, Parsing)
→ Transfer from English to Hindi (Word reordering, Hindi word subs.)
→ Hindi sentence generation (Agreement, Word-generation)
→ Hindi sentence
Levels of Language Analysis
• Morphological analysis
• Lexical Analysis ( POS tagging )
• Syntactic Analysis ( Chunking, Parsing )
• Semantic Analysis ( Word sense disambiguation )
• Discourse processing ( Anaphora resolution )
Let’s take an example sentence
“Children are watching some programmes on television in the house”
Chunking
What are chunks ?
[[ Children ]] (( are watching )) [[ some programmes ]] [[ on television ]] [[ in the house ]]
Chunks
• Noun chunks (NP, PP) in square brackets
• Verb chunks (VG) in parentheses
Chunks represent objects
• Noun chunks represent objects/concepts
• Verb chunks represent actions
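The bracket notation above can be read mechanically. The following is a minimal sketch, assuming the double square brackets and double parentheses are used exactly as in the example sentence; the function name `parse_chunks` and the labels `NOUN` and `VERB` are illustrative and not part of the original slides.

```python
import re

# [[ ... ]] marks a noun chunk and (( ... )) marks a verb chunk, as on the slide.
# The labels "NOUN" and "VERB" are illustrative names chosen for this sketch.
CHUNK_RE = re.compile(r"\[\[\s*(.*?)\s*\]\]|\(\(\s*(.*?)\s*\)\)")

def parse_chunks(text):
    """Return a list of (chunk_type, words) pairs in sentence order."""
    chunks = []
    for match in CHUNK_RE.finditer(text):
        noun_body, verb_body = match.group(1), match.group(2)
        if noun_body is not None:
            chunks.append(("NOUN", noun_body.split()))
        else:
            chunks.append(("VERB", verb_body.split()))
    return chunks

sentence = ("[[ Children ]] (( are watching )) [[ some programmes ]] "
            "[[ on television ]] [[ in the house ]]")
chunks = parse_chunks(sentence)
for chunk_type, words in chunks:
    print(chunk_type, words)
```

For the example sentence this yields five chunks: four noun chunks and one verb chunk.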
Chunking
Representation in SSF
Part-of-Speech tagging
Morphological analysis
Deals with the word form and its analysis.
Analysis consists of characteristic properties like
• Root/Stem
• Lexical category
• Gender, number, person …
• Etc …
Ex: watching
• Root = watch
• Lexical category = verb
• Etc …
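As a rough illustration of what such an analysis record might look like, here is a toy sketch. The lexicon `ROOTS`, the suffix list and the function `analyse` are all hypothetical; a real morphological analyser needs far richer rules than single suffix stripping.

```python
from dataclasses import dataclass

@dataclass
class MorphAnalysis:
    word: str
    root: str
    category: str

# Hypothetical toy lexicon of known roots with their lexical categories.
ROOTS = {"watch": "verb", "child": "noun"}
# Hypothetical suffix list, tried in order; "" means the word is already a root.
SUFFIXES = ["ing", "es", "ed", "s", ""]

def analyse(word):
    """Strip one known suffix and look the remainder up in ROOTS."""
    for suffix in SUFFIXES:
        if not word.endswith(suffix):
            continue
        root = word[: len(word) - len(suffix)]
        if root in ROOTS:
            return MorphAnalysis(word, root, ROOTS[root])
    return None  # no analysis found with this toy rule set

result = analyse("watching")
print(result)
```

With this toy lexicon, `watching` is analysed as root `watch` with lexical category `verb`, mirroring the slide's example.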
POS Tags in Hindi
The broad categories are noun, verb, adjective and adverb.
Words are classified depending on their role, both individually and in the sentence.
Example:
vaha  aama  khaa  rahaa  hei
Pron  noun  verb  verb   verb
POS Tagging
Simplest method of POS tagging: looking in the dictionary
khaanaa → Dictionary lookup → verb
Problems with POS Tagging
• Size of the dictionary limits the scope of the POS tagger.
• Ambiguity: the same word can be used both as a noun and as a verb.
khaanaa → noun, verb
Problems with POS Tagging
Ambiguity: sentences in which the word “khaanaa” occurs
• tum bahuta achhaa khaanaa banatii ho. (“khaanaa” used as a noun)
• mein jilebii khaanaa chaahataa hun. (“khaanaa” used as a verb)
Hence, the complete sentence has to be looked at before determining the role of the word and thus its POS tag.
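Both problems with dictionary lookup, limited coverage and ambiguity, show up in even a tiny sketch. The mini-dictionary below and the labels `UNKNOWN` and `AMBIGUOUS` are invented for illustration only.

```python
# Hypothetical mini-dictionary: each word maps to the set of tags it can take.
DICTIONARY = {
    "khaanaa": {"noun", "verb"},
    "aama": {"noun"},
    "vaha": {"pron"},
}

def lookup_tags(word):
    """Return the possible tags for a word, or an empty set if it is unknown."""
    return DICTIONARY.get(word, set())

def tag_by_lookup(words):
    """Tag each word as 'UNKNOWN', its single tag, or 'AMBIGUOUS'."""
    result = []
    for word in words:
        tags = lookup_tags(word)
        if not tags:
            result.append((word, "UNKNOWN"))       # dictionary too small
        elif len(tags) == 1:
            result.append((word, next(iter(tags))))
        else:
            result.append((word, "AMBIGUOUS"))     # lookup alone cannot decide
    return result

tagged = tag_by_lookup("mein jilebii khaanaa chaahataa hun".split())
print(tagged)
```

Lookup alone leaves `khaanaa` undecided and has nothing to say about words missing from the dictionary, which is why the sentence context is needed.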
Problems with POS Tagging
Many applications need more specific POS tags.
For example,
… seba khaa rahaa …        Verb Finite Main
… khaate huE …             Verb Non-Finite Adjective
… khaakara …               Verb Non-Finite Adverb
sharaaba piinaa sehata …   Verb Non-Finite Nominal
Hence, the need for defining a tagset.
Defining the tagset for Hindi (IIIT Tagset)
Issues
1. Fineness vs. coarseness in linguistic analysis
2. Syntactic function vs. lexical category
3. New tags vs. tags from a standard tagset
Fineness vs. Coarseness
A decision has to be taken on whether the tags will account for finer distinctions of various features of the parts of speech.
Need to strike a balance
• Not so fine as to hamper machine learning
• Not so coarse as to lose information
Fineness vs. Coarseness
Nouns
• Plurality information not taken into account (noun singular and noun plural are marked with the same tag).
• Case information not marked (noun direct and noun oblique are marked with the same tag).
Adjectives and Adverbs
• No distinction between comparative and superlative forms
Verbs
• Finer distinctions are made (e.g., VJJ, VRB, VNN)
• Helps us understand the arguments that a verb form can take.
Fineness in Verb tags
• Useful for tasks like dependency parsing, as we have better information about the arguments of a verb form.
• Non-finite forms of verbs which are used as nouns, adjectives or adverbs still retain their verbal property.
(VNN -> Noun formed from a verb)
Example:
aasamaana/NN mein/PREP udhane/VNN vaalaa/PREP ghodhaa/NN
“sky” “in” “flying” “horse”
niiche/NLOC utara/VFM aayaa/VAUX
“down” “climb” “came”
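The `word/TAG` notation used in the example is easy to read programmatically. The sketch below is one possible way to do so; the helper name `parse_tagged` is hypothetical, and picking out tags that start with `V` is only a convenience for this example, not a rule of the tagset.

```python
def parse_tagged(text):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the last "/", so a "/" inside a word is tolerated.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

phrase = "aasamaana/NN mein/PREP udhane/VNN vaalaa/PREP ghodhaa/NN"
pairs = parse_tagged(phrase)
# Words whose tag begins with "V" keep their verbal property in this tagset.
verb_derived = [word for word, tag in pairs if tag.startswith("V")]
print(pairs)
print(verb_derived)
```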
Syntactic vs. Lexical
Whether to tag the word based on lexical or syntactic category.
Should “uttar” in “uttar bhaarata” be tagged as noun or adjective ?
Lexical category is given more importance than syntactic category while marking text manually.
This leads to consistency in tagging.
New tags vs. tags from a standard tagset
An entirely new tagset for Indian languages is not desirable, as people are familiar with standard tagsets like the Penn tagset.
The Penn tagset has been used as a benchmark while deciding tags for Hindi.
Wherever the Penn tagset has been found inadequate, new tags have been introduced.
• NVB: new tag for kriyamuls or light verbs
• QW: modified tag for question words
IIIT Tagset
Tags are grouped into three types.
1. Group1 : Adopted from the Penn tagset with minor changes.
2. Group2 : Modification over Penn tagset.
3. Group3 : Tags not present in Penn tagset.
Examples of tags in Group3
1. INTF ( Intensifier ) : Words like ‘baHuta’, ‘kama’ etc.
2. NVB, JVB, RBVB : Light verbs.
Detailed guidelines would be put online.
Corpus – based approach
POS tagged corpus → Learn → POS tagger
Untagged new corpus → POS tagger → Tagged new corpus
POS tagging : A simple method
• Pick the most likely tag for each word
• Probabilities can be estimated from a tagged corpus.
• Assumes independence between tags.
• Accuracy < 90%
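A minimal sketch of this baseline follows. The toy corpus is invented for illustration, and falling back to `NN` for unseen words is an assumption of this sketch, not something the slides specify.

```python
from collections import Counter, defaultdict

def train_most_frequent(tagged_sentences):
    """Count (word, tag) pairs and keep the most frequent tag per word."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}

def tag_most_frequent(words, model, default_tag="NN"):
    """Tag each word independently; unseen words get default_tag (an assumption)."""
    return [(word, model.get(word, default_tag)) for word in words]

# Toy training data, invented for illustration only.
corpus = [
    [("vaha", "PRP"), ("aama", "NN"), ("khaa", "VFM"), ("rahaa", "VAUX"), ("hei", "VAUX")],
    [("vaha", "PRP"), ("ladakaa", "NN"), ("gayaa", "VFM")],
    [("vaha", "DET"), ("aama", "NN")],
]
model = train_most_frequent(corpus)
tagged = tag_most_frequent(["vaha", "aama", "gayaa", "kitaaba"], model)
print(tagged)
```

Because each word is tagged on its own, the tagger always gives `vaha` the same tag regardless of context, which is the independence assumption noted above.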
POS tagging : A simple method
Example
• Brown corpus, 182159 tagged words (training section), 26 tags
• mujhe xo kitabein xijiye
• Word xo occurs 267 times,
  • 227 times tagged as QFN
  • 29 times as VAUX
• P(QFN | W=xo) = 227/267 = 0.8502
• P(VAUX | W=xo) = 29/267 = 0.1086
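The arithmetic can be checked with a few lines. This sketch only uses the counts quoted on the slide (267 occurrences of `xo`, 227 tagged QFN, 29 tagged VAUX); the variable names are illustrative.

```python
# Counts quoted on the slide for the word "xo".
total_occurrences = 267
tag_counts = {"QFN": 227, "VAUX": 29}

# Relative-frequency estimate: P(tag | word) = count(word, tag) / count(word).
probabilities = {tag: count / total_occurrences
                 for tag, count in tag_counts.items()}
for tag, p in probabilities.items():
    print(f"P({tag} | xo) = {p:.4f}")

# The simple method picks the tag with the highest estimate.
best_tag = max(probabilities, key=probabilities.get)
print("most likely tag:", best_tag)
```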
Corpus-based approaches
Learning Rules
• Transformation-based error-driven learning (Brill, 1995)
• Inductive logic programming (Cussens, 1997)
Statistical
• Hidden Markov models (TnT; Brants, 2000)
• Maximum entropy (Ratnaparkhi, 1996)
POS tagging using HMMs
Let W be a sequence of words
$$W = w_1, w_2, \ldots, w_n$$
Let T be the corresponding tag sequence
$$T = t_1, t_2, \ldots, t_n$$
Task : Find T which maximizes P ( T | W )
$$T' = \arg\max_T P(T \mid W)$$
POS tagging using HMM
By Bayes Rule,
$$P(T \mid W) = \frac{P(W \mid T)\, P(T)}{P(W)}$$
Since P ( W ) does not depend on T,
$$T' = \arg\max_T P(W \mid T)\, P(T)$$
By the chain rule,
$$P(T) = P(t_1)\, P(t_2 \mid t_1)\, P(t_3 \mid t_1 t_2) \cdots P(t_n \mid t_1 \ldots t_{n-1})$$
Applying Bi-gram approximation,
$$P(T) = P(t_1)\, P(t_2 \mid t_1)\, P(t_3 \mid t_2) \cdots P(t_n \mid t_{n-1})$$
POS tagging using HMM
$$P(W \mid T) = P(w_1 \mid T)\, P(w_2 \mid w_1, T)\, P(w_3 \mid w_1 w_2, T) \cdots P(w_n \mid w_1 \ldots w_{n-1}, T) = \prod_{i=1}^{n} P(w_i \mid w_1 \ldots w_{i-1}, T)$$
Assume
$$P(w_i \mid w_1 \ldots w_{i-1}, T) = P(w_i \mid t_i)$$
Now, T’ is the one which maximizes
$$P(t_1)\, P(t_2 \mid t_1) \cdots P(t_n \mid t_{n-1}) \cdot P(w_1 \mid t_1)\, P(w_2 \mid t_2) \cdots P(w_n \mid t_n)$$
POS tagging using HMM
If we use Tri-gram model instead for the tag sequence,
$$P(T) = P(t_1)\, P(t_2 \mid t_1)\, P(t_3 \mid t_1 t_2) \cdots P(t_n \mid t_{n-2} t_{n-1})$$
Which model to choose ?
• Depends on the amount of data available !
• Richer models ( Tri-grams, 4-grams ) require lots of data.
Chain rule with approximations
• P( W = “vaha ladakaa gayaa” , T = “det noun verb” )
= P(det) * P(vaha|det) * P(noun|det) * P(ladakaa|noun) * P(verb|noun) * P(gayaa|verb)
Tags:   det    noun      verb
Words:  vaha   ladakaa   gayaa
Chain rule with approximations: Example
$$P(\text{vaha} \mid \text{det}) = \frac{\text{number of times ‘vaha’ appeared as ‘det’ in the corpus}}{\text{total number of occurrences of ‘det’ in the corpus}}$$
$$P(\text{verb} \mid \text{noun}) = \frac{\text{number of times ‘verb’ followed ‘noun’ in the corpus}}{\text{total number of occurrences of ‘noun’ in the corpus}}$$
If we obtained the following estimates from the corpus
P(det) = 0.5
P(vaha | det) = 0.4
P(noun | det) = 0.99
P(ladakaa | noun) = 0.5
P(verb | noun) = 0.4
P(gayaa | verb) = 0.02
then
P ( W , T ) = 0.5 * 0.4 * 0.99 * 0.5 * 0.4 * 0.02 = 0.000792
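The product above can be reproduced with a short sketch. It assumes the six estimates are listed in the same order as the factors of the chain-rule product, that is P(det), P(vaha|det), P(noun|det), P(ladakaa|noun), P(verb|noun), P(gayaa|verb); the function name and the dictionary layout are illustrative.

```python
def joint_probability(words, tags, start, transition, emission):
    """P(W, T) under the bigram tag model with word-given-tag emissions."""
    prob = start[tags[0]] * emission[(words[0], tags[0])]
    for i in range(1, len(words)):
        prob *= transition[(tags[i - 1], tags[i])]  # P(t_i | t_{i-1})
        prob *= emission[(words[i], tags[i])]       # P(w_i | t_i)
    return prob

# Estimates quoted on the slide, keyed in an illustrative dictionary layout.
start = {"det": 0.5}
transition = {("det", "noun"): 0.99, ("noun", "verb"): 0.4}
emission = {("vaha", "det"): 0.4, ("ladakaa", "noun"): 0.5, ("gayaa", "verb"): 0.02}

p = joint_probability(["vaha", "ladakaa", "gayaa"], ["det", "noun", "verb"],
                      start, transition, emission)
print(f"{p:.6f}")
```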
POS tagging using HMM
We need to estimate three types of parameters from the corpus
$$P_{\text{start}}(t_i) = \frac{\text{no. of sentences which begin with } t_i}{\text{no. of sentences}}$$
$$P(t_i \mid t_{i-1}) = \frac{\text{count}(t_{i-1}\, t_i)}{\text{count}(t_{i-1})}$$
$$P(w_i \mid t_i) = \frac{\text{count}(w_i \text{ with } t_i)}{\text{count}(t_i)}$$
These parameters can be directly represented using Hidden Markov Models (HMMs), and the best tag sequence can be computed by applying the Viterbi algorithm on the HMM.
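The three estimates can be computed by plain counting over a tagged corpus. The sketch below follows the count ratios given above without any smoothing; the function name `estimate_hmm` and the toy corpus are invented for illustration, and every sentence is assumed to be non-empty.

```python
from collections import Counter

def estimate_hmm(tagged_sentences):
    """Relative-frequency estimates of start, transition and emission probabilities."""
    start_counts, tag_counts = Counter(), Counter()
    bigram_counts, emit_counts = Counter(), Counter()
    for sentence in tagged_sentences:
        tags = [tag for _, tag in sentence]
        start_counts[tags[0]] += 1
        for word, tag in sentence:
            tag_counts[tag] += 1
            emit_counts[(word, tag)] += 1
        for prev, cur in zip(tags, tags[1:]):
            bigram_counts[(prev, cur)] += 1
    n_sentences = len(tagged_sentences)
    # Pstart(t) = sentences beginning with t / number of sentences
    start = {t: c / n_sentences for t, c in start_counts.items()}
    # P(t_i | t_{i-1}) = count(t_{i-1} t_i) / count(t_{i-1})
    transition = {(p, c): n / tag_counts[p] for (p, c), n in bigram_counts.items()}
    # P(w_i | t_i) = count(w_i with t_i) / count(t_i)
    emission = {(w, t): n / tag_counts[t] for (w, t), n in emit_counts.items()}
    return start, transition, emission

# Toy corpus, invented for illustration only.
corpus = [
    [("vaha", "det"), ("ladakaa", "noun"), ("gayaa", "verb")],
    [("vaha", "det"), ("ladakaa", "noun"), ("hansaa", "verb")],
    [("ladakaa", "noun"), ("gayaa", "verb")],
]
start, transition, emission = estimate_hmm(corpus)
print(start, transition, emission, sep="\n")
```

Unsmoothed estimates like these assign zero probability to unseen words and tag pairs, so a practical tagger would add some form of smoothing.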
Markov models
Markov Chain
• An event is dependent on the previous events.
Consider the word sequence
usane kahaa ki
Here, each word depends only on the one word before it. Hence, the sequence is said to form a Markov chain of order 1.
Hidden Markov models
The hidden states follow the Markov property. Hence, this model is known as a Hidden Markov Model.
Observation sequence O:     o1   o2   o3   o4
Hidden states sequence X:   x1   x2   x3   x4
Index of sequence t:        1    2    3    4
Hidden Markov models
• Representation of parameters in HMMs
• Define O(t) = t-th observation
• Define X(t) = hidden state value at the t-th position
Transition matrix:
$$A = [a_{ab}], \quad a_{ab} = P(X(t+1) = X_b \mid X(t) = X_a)$$
Emission matrix:
$$B = [b_{ak}], \quad b_{ak} = P(O(t) = O_k \mid X(t) = X_a)$$
PI matrix:
$$PI = [pi_a], \quad pi_a = \text{probability of starting with hidden state } X_a$$
The model is μ = { A , PI , B }
HMM for POS tagging
Observation sequence === Word sequence
Hidden state sequence === Tag sequence
Model
A = P ( current tag | previous tag )
B = P ( current word | current tag )
PI = Pstart ( tag )
Tag sequences are mapped to Hidden state sequences because they are not observable in the natural language text.
Example
A (rows: previous tag, columns: current tag) =
         det    noun   verb
det      .01    .99    .00
noun     .30    .30    .40
verb     .40    .40    .20

B (rows: tag, columns: word) =
         vaha   ladakaa   gayaa
det      .40    .00       .00
noun     .00    .015      .0031
verb     .00    .0004     .020

PI =
det      0.5
noun     0.4
verb     .01
POS tagging using HMM
The problem can be formulated as,
Given the observation sequence O and the model μ = (A, B, PI), how to choose the best state sequence X which explains the observations ?
• Consider all the possible tag sequences and choose the tag sequence having the maximum joint probability with the observation sequence.
• $X_{\max} = \arg\max_X P(O, X)$
• The complexity of the above is high: of order $N^T$ for N tags and a sequence of T words.
• Viterbi algorithm is used for computational efficiency.
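To make the cost of exhaustive search concrete, the sketch below enumerates every tag sequence and scores each one. It uses the A, B and PI values from the example slide, stored as nested dictionaries (an illustrative layout), and the word sequence “vaha ladakaa gayaa” because those are the words the example B matrix covers.

```python
from itertools import product

TAGS = ["det", "noun", "verb"]
# Values from the example slide; A[prev][cur], B[tag][word], PI[tag].
A = {"det": {"det": 0.01, "noun": 0.99, "verb": 0.00},
     "noun": {"det": 0.30, "noun": 0.30, "verb": 0.40},
     "verb": {"det": 0.40, "noun": 0.40, "verb": 0.20}}
B = {"det": {"vaha": 0.40, "ladakaa": 0.00, "gayaa": 0.00},
     "noun": {"vaha": 0.00, "ladakaa": 0.015, "gayaa": 0.0031},
     "verb": {"vaha": 0.00, "ladakaa": 0.0004, "gayaa": 0.020}}
PI = {"det": 0.5, "noun": 0.4, "verb": 0.01}

def joint(words, tags):
    """Joint probability P(O, X) of a word sequence and one tag sequence."""
    p = PI[tags[0]] * B[tags[0]][words[0]]
    for i in range(1, len(words)):
        p *= A[tags[i - 1]][tags[i]] * B[tags[i]][words[i]]
    return p

def brute_force(words):
    """Score all N**T tag sequences and return the best one."""
    candidates = list(product(TAGS, repeat=len(words)))
    best = max(candidates, key=lambda tags: joint(words, tags))
    return best, joint(words, best), len(candidates)

best, prob, n_paths = brute_force(["vaha", "ladakaa", "gayaa"])
print(best, prob, n_paths)
```

With 3 tags and 3 words there are only 27 candidates, but the count grows exponentially with sentence length, which motivates a dynamic-programming alternative.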
POS tagging using HMM
[Figure: trellis with the hidden states det, noun and verb (X’s) at each position t = 1, 2, 3, above the observations O = “vaha ladakaa hansaa”.]
27 tag sequences possible ! = 27 paths
Viterbi algorithm
Let $\alpha_{\text{noun}}(\text{ladakaa})$ represent the probability of reaching the state ‘noun’ taking the best possible path and generating the observation ‘ladakaa’.
Best probability of reaching a state associated with the first word:
$$\alpha_{\text{det}}(\text{vaha}) = PI(\text{det}) \cdot B[\text{det}, \text{vaha}]$$
Probability of reaching a state elsewhere in the sequence in the best possible way:
$$\alpha_{\text{noun}}(\text{ladakaa}) = \max \{\, \alpha_{\text{det}}(\text{vaha}) \cdot A[\text{det}, \text{noun}] \cdot B[\text{noun}, \text{ladakaa}],\ \alpha_{\text{noun}}(\text{vaha}) \cdot A[\text{noun}, \text{noun}] \cdot B[\text{noun}, \text{ladakaa}],\ \alpha_{\text{verb}}(\text{vaha}) \cdot A[\text{verb}, \text{noun}] \cdot B[\text{noun}, \text{ladakaa}] \,\}$$
What is the best way to come to a particular state ?
$$\phi_{\text{noun}}(\text{ladakaa}) = \operatorname{argmax} \{\, \alpha_{\text{det}}(\text{vaha}) \cdot A[\text{det}, \text{noun}] \cdot B[\text{noun}, \text{ladakaa}],\ \alpha_{\text{noun}}(\text{vaha}) \cdot A[\text{noun}, \text{noun}] \cdot B[\text{noun}, \text{ladakaa}],\ \alpha_{\text{verb}}(\text{vaha}) \cdot A[\text{verb}, \text{noun}] \cdot B[\text{noun}, \text{ladakaa}] \,\}$$
The last tag of the most likely sequence:
$$\phi(T+1) = \operatorname{argmax} \{\, \alpha_{\text{det}}(\text{hansaa}),\ \alpha_{\text{noun}}(\text{hansaa}),\ \alpha_{\text{verb}}(\text{hansaa}) \,\}$$
Most likely sequence is obtained by backtracking.
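The steps of the Viterbi walk-through (initialise with PI and B, take the best predecessor at each later position, remember it as a back-pointer, pick the best final state, then backtrack) can be sketched as follows. The A, B and PI values are those of the example slide, stored as nested dictionaries for illustration. Since that example B matrix lists “gayaa” rather than “hansaa”, this sketch tags “vaha ladakaa gayaa”.

```python
TAGS = ["det", "noun", "verb"]
# Values from the example slide; A[prev][cur], B[tag][word], PI[tag].
A = {"det": {"det": 0.01, "noun": 0.99, "verb": 0.00},
     "noun": {"det": 0.30, "noun": 0.30, "verb": 0.40},
     "verb": {"det": 0.40, "noun": 0.40, "verb": 0.20}}
B = {"det": {"vaha": 0.40, "ladakaa": 0.00, "gayaa": 0.00},
     "noun": {"vaha": 0.00, "ladakaa": 0.015, "gayaa": 0.0031},
     "verb": {"vaha": 0.00, "ladakaa": 0.0004, "gayaa": 0.020}}
PI = {"det": 0.5, "noun": 0.4, "verb": 0.01}

def viterbi(words):
    """Return (best tag sequence, its joint probability) for the given words."""
    # alpha[t][tag]: probability of the best path ending in `tag` at position t.
    alpha = [{tag: PI[tag] * B[tag][words[0]] for tag in TAGS}]
    # phi[t][tag]: previous tag on that best path (back-pointer).
    phi = [{}]
    for t in range(1, len(words)):
        alpha.append({})
        phi.append({})
        for tag in TAGS:
            best_prev = max(TAGS, key=lambda prev: alpha[t - 1][prev] * A[prev][tag])
            alpha[t][tag] = alpha[t - 1][best_prev] * A[best_prev][tag] * B[tag][words[t]]
            phi[t][tag] = best_prev
    # Last tag of the most likely sequence, then backtrack through phi.
    last = max(TAGS, key=lambda tag: alpha[-1][tag])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(phi[t][path[-1]])
    path.reverse()
    return path, alpha[-1][last]

path, prob = viterbi(["vaha", "ladakaa", "gayaa"])
print(path, prob)
```

Each position considers N predecessors for each of N states, so the work grows as N² per word rather than exponentially in the sentence length.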
Preliminary Results
POS tagging for Indian languages
Training set = 182159 tokens, Testing set = 14277 tokens
Tags = 26.
Most frequent tag labelling = 78.85 %
Hidden Markov Models = 86.75 %
Needs improvement! This may come from experimenting with a variety of tags and tokens (some experiments on the chunking task are shown in the following slides).
Preliminary Results
Most common error seen: confusion between NNP, NNC and NN.
<* see the output of the system >
Opportunity to carry out experiments to eliminate such errors as part of the NLPAI shared task, 2006 (will be introduced at the end).
Introduction to TnT
Efficient implementation of Viterbi’s algorithm for 2nd order Markov Chains ( Trigram approximation ).
Language independent – Can be trained on any corpus.
Easy to use.
Introduction to TnT
4 main programs:
• tnt-para – trains the model (parameter generation)
  tnt-para [options] <corpus_file>
• tnt – tagging
  tnt [options] <model> <corpus>
• tnt-diff – compares two files to get precision/recall figures
  tnt-diff [options] <original file 1> <new output file>
• tnt-wc – counts tokens (words) and types (pos-tag/chunk-tag) in different files
  tnt-wc [options] <corpusfile>
![Page 64: POS tagging and Chunking for Indian Languages Rajeev Sangal and V. Sriram, International Institute of Information Technology, Hyderabad.](https://reader035.fdocuments.in/reader035/viewer/2022070411/56649f455503460f94c661c8/html5/thumbnails/64.jpg)
Introduction to TnT
Training file format
Each line holds one token and its tag, separated by white space.
Example,
%% <comment>
nirAlA NNP
kI PREP
sAhiwya NN
%% blank line – new sentence

yahAz PRP
yaha PRP
aXikAMRa JJ
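The format described above can be read with a few lines of Python. This is a minimal sketch, assuming one token and tag per line, `%%` comment lines and blank lines between sentences. The function name `parse_tagged_corpus` is hypothetical and not part of TnT.

```python
def parse_tagged_corpus(text):
    """Parse token/tag lines into a list of sentences.

    Each sentence is a list of (token, tag) pairs. Lines starting
    with '%%' are comments; a blank line ends the current sentence.
    """
    sentences, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("%%"):
            continue  # comment line
        if not stripped:
            if current:  # blank line closes a sentence
                sentences.append(current)
                current = []
            continue
        fields = stripped.split()
        if len(fields) < 2:
            raise ValueError(f"expected 'token tag', got: {line!r}")
        current.append((fields[0], fields[1]))
    if current:  # last sentence may lack a trailing blank line
        sentences.append(current)
    return sentences
```

Run on the example from the slide, it returns two sentences of three (token, tag) pairs each.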
![Page 65: POS tagging and Chunking for Indian Languages Rajeev Sangal and V. Sriram, International Institute of Information Technology, Hyderabad.](https://reader035.fdocuments.in/reader035/viewer/2022070411/56649f455503460f94c661c8/html5/thumbnails/65.jpg)
Introduction to TnT
Testing file – consists of only the first column.
Other files – used to store the model: .lex file, .123 file, .map file
Demo1.
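Since the testing file is just the first column of the tagged format, one can be derived from a tagged file as sketched below. The helper name `to_untagged` is hypothetical, and keeping comment and blank lines unchanged is an assumption of this example.

```python
def to_untagged(text):
    """Keep only the first column (the token) of each token/tag line.

    Blank lines (sentence breaks) and '%%' comment lines are kept as-is
    so that the sentence structure is preserved.
    """
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("%%"):
            out.append(stripped)
        else:
            out.append(stripped.split()[0])  # drop the tag column
    return "\n".join(out) + "\n"
```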
![Page 66: POS tagging and Chunking for Indian Languages Rajeev Sangal and V. Sriram, International Institute of Information Technology, Hyderabad.](https://reader035.fdocuments.in/reader035/viewer/2022070411/56649f455503460f94c661c8/html5/thumbnails/66.jpg)
![Page 67: POS tagging and Chunking for Indian Languages Rajeev Sangal and V. Sriram, International Institute of Information Technology, Hyderabad.](https://reader035.fdocuments.in/reader035/viewer/2022070411/56649f455503460f94c661c8/html5/thumbnails/67.jpg)
An Example (Chunk boundary identification)
![Page 68: POS tagging and Chunking for Indian Languages Rajeev Sangal and V. Sriram, International Institute of Information Technology, Hyderabad.](https://reader035.fdocuments.in/reader035/viewer/2022070411/56649f455503460f94c661c8/html5/thumbnails/68.jpg)
Chunking with TnT
Chunk Tags
STRT: A chunk starts at this token
CNT: This token lies in the middle of a chunk
STP: This token lies at the end of a chunk
STRT_STP: This token lies in a chunk of its own
Chunk Tag Schemes
2-tag Scheme: {STRT, CNT}
3-tag Scheme: {STRT, CNT, STP}
4-tag Scheme: {STRT, CNT, STP, STRT_STP}
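To make the tag definitions concrete, the sketch below labels already-chunked text with the 4-tag scheme. The function name `encode_chunks` and the list-of-lists input are assumptions made for the illustration, not the format used in the experiments.

```python
def encode_chunks(chunks):
    """Label every token of every chunk with a 4-tag-scheme chunk tag.

    `chunks` is a list of chunks, each a non-empty list of tokens.
    Returns a flat list of (token, chunk_tag) pairs.
    """
    labelled = []
    for chunk in chunks:
        if not chunk:
            raise ValueError("a chunk must contain at least one token")
        if len(chunk) == 1:
            labelled.append((chunk[0], "STRT_STP"))  # chunk of its own
            continue
        last = len(chunk) - 1
        for i, token in enumerate(chunk):
            if i == 0:
                tag = "STRT"   # first token of the chunk
            elif i == last:
                tag = "STP"    # last token of the chunk
            else:
                tag = "CNT"    # token in the middle of the chunk
            labelled.append((token, tag))
    return labelled
```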
![Page 69: POS tagging and Chunking for Indian Languages Rajeev Sangal and V. Sriram, International Institute of Information Technology, Hyderabad.](https://reader035.fdocuments.in/reader035/viewer/2022070411/56649f455503460f94c661c8/html5/thumbnails/69.jpg)
Input Tokens
What kinds of input tokens can we use?
Word only – simplest
POS tag only – use only the part of speech tag of the word
Combinations of the above –
Word_POStag: word followed by POS tag
POStag_Word: POS tag followed by word
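The input-token options above could be built as in the following sketch. The names `make_input_token` and `chunker_training_lines` are hypothetical, and joining the parts with an underscore is an assumption based on the names Word_POStag and POStag_Word. The chunk tags in the usage note are illustrative only.

```python
def make_input_token(word, pos, form="word_pos"):
    """Build the token shown to the chunker from a word and its POS tag."""
    if form == "word":
        return word
    if form == "pos":
        return pos
    if form == "word_pos":
        return f"{word}_{pos}"
    if form == "pos_word":
        return f"{pos}_{word}"
    raise ValueError(f"unknown input token form: {form!r}")

def chunker_training_lines(tagged_tokens, form="word_pos"):
    """Turn (word, pos, chunk_tag) triples into 'token chunk_tag' lines."""
    return [
        f"{make_input_token(word, pos, form)} {chunk_tag}"
        for word, pos, chunk_tag in tagged_tokens
    ]
```

For example, `chunker_training_lines([("nirAlA", "NNP", "STRT")])` yields a single line pairing the combined token `nirAlA_NNP` with its chunk tag, in the same two-column layout as the POS training file.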
![Page 70: POS tagging and Chunking for Indian Languages Rajeev Sangal and V. Sriram, International Institute of Information Technology, Hyderabad.](https://reader035.fdocuments.in/reader035/viewer/2022070411/56649f455503460f94c661c8/html5/thumbnails/70.jpg)
Chunking with TnT: Experiments
Training corpus = 150,000 tokens
Testing corpus = 20,000 tokens
A trick to improve learning: train on a larger tagset, then reduce the output to a smaller tagset. No information is lost, as all the tagsets convey the same information.
Best results (Precision = 85.6%) were obtained for input tokens of the form ‘Word_POS’, with the learning trick of reducing 4 tags to 2.
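The reduction from 4 tags to 2 can be sketched as a simple mapping. The mapping below (STP becomes CNT, STRT_STP becomes STRT) is an assumption that follows from the tag definitions rather than something stated on the slide, and the helper names are hypothetical. The second function shows why nothing is lost: the chunk boundaries can still be recovered from the 2-tag output.

```python
# Assumed mapping from the 4-tag scheme to the 2-tag scheme {STRT, CNT}.
FOUR_TO_TWO = {
    "STRT": "STRT",
    "CNT": "CNT",
    "STP": "CNT",        # the end of a chunk is implied by the next STRT
    "STRT_STP": "STRT",  # a one-token chunk is a start with no continuation
}

def reduce_tags(tags):
    """Map a sequence of 4-tag-scheme tags onto the 2-tag scheme."""
    return [FOUR_TO_TWO[tag] for tag in tags]

def chunk_lengths(two_tags):
    """Recover chunk lengths from 2-tag output: each STRT opens a chunk."""
    lengths = []
    for tag in two_tags:
        if tag == "STRT" or not lengths:
            lengths.append(1)
        else:
            lengths[-1] += 1
    return lengths
```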
![Page 71: POS tagging and Chunking for Indian Languages Rajeev Sangal and V. Sriram, International Institute of Information Technology, Hyderabad.](https://reader035.fdocuments.in/reader035/viewer/2022070411/56649f455503460f94c661c8/html5/thumbnails/71.jpg)
Chunking with TnT: Improvement
85.6% is not good enough.
The model was improved (Precision = 88.63%) by adding contextual information (POS tags). Example,
![Page 72: POS tagging and Chunking for Indian Languages Rajeev Sangal and V. Sriram, International Institute of Information Technology, Hyderabad.](https://reader035.fdocuments.in/reader035/viewer/2022070411/56649f455503460f94c661c8/html5/thumbnails/72.jpg)
Chunking with TnT: Improvements
For experiments which led to further improvements in chunk boundary identification, see
Akshay Singh, Sushama Bendre, Rajeev Sangal. HMM Based Chunker for Hindi. In Second International Joint Conference on Natural Language Processing: Companion Volume including Posters/Demos and Tutorial Abstracts.
![Page 73: POS tagging and Chunking for Indian Languages Rajeev Sangal and V. Sriram, International Institute of Information Technology, Hyderabad.](https://reader035.fdocuments.in/reader035/viewer/2022070411/56649f455503460f94c661c8/html5/thumbnails/73.jpg)
Chunk labelling & Results
Chunk labelling: Chunks that have been identified have to be labelled as noun chunks, verb chunks, etc.
Rule based chunk labelling performed best.
RESULTS:
Final Chunk Boundary Identification accuracy = 92.6%
Chunk boundary identification + Chunk labelling = 91.5%
![Page 74: POS tagging and Chunking for Indian Languages Rajeev Sangal and V. Sriram, International Institute of Information Technology, Hyderabad.](https://reader035.fdocuments.in/reader035/viewer/2022070411/56649f455503460f94c661c8/html5/thumbnails/74.jpg)
![Page 75: POS tagging and Chunking for Indian Languages Rajeev Sangal and V. Sriram, International Institute of Information Technology, Hyderabad.](https://reader035.fdocuments.in/reader035/viewer/2022070411/56649f455503460f94c661c8/html5/thumbnails/75.jpg)
Shared task.
For information on the shared task, refer to the flyer on NLPAI shared task 2006.
![Page 76: POS tagging and Chunking for Indian Languages Rajeev Sangal and V. Sriram, International Institute of Information Technology, Hyderabad.](https://reader035.fdocuments.in/reader035/viewer/2022070411/56649f455503460f94c661c8/html5/thumbnails/76.jpg)
Thank you