Segmentation in Sanskrit texts
Uploaded by: amrith-krishna
Category: Engineering
देहिनोऽस्मिन्यथा देहे कौमारं यौवनं जरा । तथा देहान्तरप्राप्तिर्धीरस्तत्र न मुह्यति ॥
देहिनः अस्मिन् यथा देहे कौमारं यौवनं जरा तथा देहान्तर प्राप्तिः धीरः तत्र न मुह्यति
तथा देहान्तरप्राप्तिर्धीरस्तत्र न मुह्यति
तथा देहान्तर प्राप्तिः धीरः तत्र न मुह्यति
रामरामेभ्यः राममय
witi
PMI Matrix of the un-segmentable token lemmas
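As a rough illustration of how such a PMI matrix could be computed, the sketch below estimates pointwise mutual information from sentence-level co-occurrence of lemmas. The function name `pmi_matrix`, the choice of the sentence as the co-occurrence window, and the transliterated toy lemmas are all assumptions made for this example; the slides do not specify how the matrix was built.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_matrix(sentences):
    """Return {(a, b): PMI} for lemma pairs that co-occur in a sentence.

    Assumption: probabilities are estimated as the fraction of sentences
    containing a lemma (or a pair of lemmas).
    """
    n = len(sentences)
    lemma_count = Counter()
    pair_count = Counter()
    for sent in sentences:
        lemmas = sorted(set(sent))          # count each lemma once per sentence
        lemma_count.update(lemmas)
        pair_count.update(combinations(lemmas, 2))
    pmi = {}
    for (a, b), c in pair_count.items():
        p_ab = c / n
        p_a = lemma_count[a] / n
        p_b = lemma_count[b] / n
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi

# Hypothetical toy corpus of lemmatised sentences.
toy = [["deha", "kaumara", "jara"], ["deha", "jara"], ["dhira", "muh"], ["deha", "dhira"]]
scores = pmi_matrix(toy)
```

Pairs that co-occur more often than chance get a positive score, and pairs that co-occur less often than chance get a negative one. Pairs that never co-occur are simply absent from the dictionary.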
P(w1,w2,w3,w4) = P(w1 | <s>)P(w2|w1)P(w3|w2)P(w4|w3)P(</s>|w4)
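The bigram factorisation above can be sketched with maximum-likelihood counts. This is a minimal, unsmoothed version for illustration only; the function names `train_bigram` and `sentence_prob` and the toy corpus are hypothetical, and a practical model would need smoothing for unseen bigrams.

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams, padding each sentence with <s> and </s>."""
    contexts, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        contexts.update(tokens[:-1])               # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams

def sentence_prob(sent, contexts, bigrams):
    """P(w1..wn) = P(w1|<s>) * P(w2|w1) * ... * P(</s>|wn), unsmoothed."""
    tokens = ["<s>"] + sent + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if contexts[prev] == 0:
            return 0.0                             # unseen context
        p *= bigrams[(prev, cur)] / contexts[prev]
    return p

# Hypothetical toy corpus using the symbols from the formula.
corpus = [["w1", "w2", "w3", "w4"], ["w1", "w2", "w4"]]
contexts, bigrams = train_bigram(corpus)
p_seen = sentence_prob(["w1", "w2", "w3", "w4"], contexts, bigrams)
p_unseen = sentence_prob(["w4", "w1"], contexts, bigrams)
```

In this toy corpus the only uncertain step is P(w3 | w2) = 1/2, so `p_seen` is 0.5, while the unseen ordering gets probability 0.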
Set (size in sentences)    Micro accuracy    Macro accuracy
Training set (1700)        87.76 %           92.56 %
Testing set (150)          87.82 %           93.56 %
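The slides do not define the two accuracy measures. Under one common convention, assumed here, micro accuracy pools all words across sentences, while macro accuracy averages the per-sentence accuracies. The sketch below shows how the two can differ on the same predictions; the function name and the toy counts are hypothetical and are not the reported results.

```python
def micro_macro_accuracy(results):
    """results: list of (correct_words, total_words), one pair per sentence.

    Assumed definitions: micro = pooled word accuracy,
    macro = mean of per-sentence accuracies.
    """
    total_correct = sum(c for c, _ in results)
    total_words = sum(t for _, t in results)
    micro = total_correct / total_words
    macro = sum(c / t for c, t in results) / len(results)
    return micro, macro

# Hypothetical example: one long sentence with errors, one short perfect sentence.
micro, macro = micro_macro_accuracy([(8, 10), (2, 2)])
```

Short, fully correct sentences pull the macro figure up more than the micro figure, which is one way macro accuracy can end up higher than micro accuracy.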
• Treat the problem as a query expansion problem.
• Start with unsegmented tokens.
• At each step, a new candidate word is selected and added to the query.
• The query expansion iterates until a complete sentence is output.
S = c1 + c2 + c3 + c4 (the sentence S is the sequence of its chunks c1 to c4).
Each chunk has a list of candidate words w1, w2, ..., wk.
For each chunk cj, Cj = set of wi that are candidates for a semantically correct segmentation.
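The query expansion loop can be sketched as a greedy procedure. This is a simplification under stated assumptions: each chunk contributes exactly one word, the scoring function is supplied by the caller, and the names `expand_query`, `toy_score` and the transliterated toy candidates are hypothetical. The slides do not describe the actual selection rule in this detail.

```python
def expand_query(seed, chunks, candidates, score):
    """Greedily pick one candidate word per chunk.

    seed:       words that need no segmentation (the initial query)
    chunks:     ordered chunk ids, e.g. ["c1", "c2"]
    candidates: {chunk id: [candidate words]}
    score:      function (query_words, word) -> float, higher is better
    """
    query = list(seed)
    chosen = {}
    remaining = list(chunks)
    while remaining:
        # Select the best (chunk, word) pair given the current query.
        chunk, word = max(
            ((c, w) for c in remaining for w in candidates[c]),
            key=lambda cw: score(query, cw[1]),
        )
        chosen[chunk] = word
        query.append(word)          # the query grows by one word per step
        remaining.remove(chunk)
    return [chosen[c] for c in chunks]

# Hypothetical co-occurrence strengths used as a stand-in scoring signal.
cooc = {("tatha", "dhirah"): 2.0, ("dhirah", "tatra"): 3.0, ("tatha", "tatra"): 1.0}

def toy_score(query, word):
    return sum(cooc.get((q, word), 0.0) + cooc.get((word, q), 0.0) for q in query)

result = expand_query(
    seed=["tatha"],
    chunks=["c1", "c2"],
    candidates={"c1": ["dhirah", "dhiram"], "c2": ["tatra", "atra"]},
    score=toy_score,
)
```

Because each selected word joins the query, later choices are conditioned on earlier ones, which is the point of framing segmentation as query expansion.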
• From the query nodes, reach the most promising candidate word nodes.
• Perform multiple personalised random walks.
• Edge weights accommodate heterogeneous information.
• Learn a weight for each random walk approach (path) by supervised methods.
• The weighted sum of all the random walk methods gives the most suitable candidate.
• Note: we use 4 lakh (400,000) tagged sentences from the Digital Corpus of Sanskrit.
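The bullets above can be illustrated with a small personalised random walk (random walk with restart at the query nodes) on two toy graphs, followed by a weighted sum of the resulting scores. This is a sketch under assumptions: the restart probability, the graph contents, and the weights 0.7 and 0.3 are made up, and in the described approach the weights would be learned by a supervised method rather than fixed by hand.

```python
def personalised_walk(adj, query_nodes, restart=0.15, steps=50):
    """adj: {node: {neighbour: weight}}; every neighbour must also be a key of adj.

    Returns visit probabilities of a walk that restarts at the query nodes.
    """
    nodes = list(adj)
    restart_dist = {n: (1.0 / len(query_nodes) if n in query_nodes else 0.0) for n in nodes}
    prob = dict(restart_dist)
    for _ in range(steps):
        nxt = {n: restart * restart_dist[n] for n in nodes}
        for u in nodes:
            total = sum(adj[u].values())
            if total == 0:
                # Dangling node: return its mass to the query nodes.
                for n in nodes:
                    nxt[n] += (1 - restart) * prob[u] * restart_dist[n]
                continue
            for v, w in adj[u].items():
                nxt[v] += (1 - restart) * prob[u] * w / total
        prob = nxt
    return prob

def combine(walk_scores, weights):
    """Weighted sum of the scores produced by several walks."""
    return {n: sum(w * s[n] for w, s in zip(weights, walk_scores)) for n in walk_scores[0]}

# Two hypothetical edge-weight schemes over the same nodes (q = query node).
graph_lm = {"q": {"a": 0.9, "b": 0.1}, "a": {"q": 1.0}, "b": {"q": 1.0}}
graph_morph = {"q": {"a": 0.2, "b": 0.8}, "a": {"q": 1.0}, "b": {"q": 1.0}}

walks = [personalised_walk(g, {"q"}) for g in (graph_lm, graph_morph)]
combined = combine(walks, weights=[0.7, 0.3])   # weights are placeholders, not learned
best = max(["a", "b"], key=combined.get)
```

Each walk yields a probability distribution over nodes, and the combined score ranks the candidate word nodes reachable from the query.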
Language Model (LM) with word lemmas - LMw
LM with morphological types - LMt
Verb-specific expectancy - ViE
Compound word formation patterns
PCRW (Path Constrained Random Walks) - Unifying Framework
• Handle free word order.
• Incorporate heterogeneous types of information.
• Bonus: form different relational paths (up to length l) by combining individual edge weights.
• For l = 3, some sample paths that can be formed by combination:
• LMw -> LMt -> LMw
• LMt -> V1E -> LMt
• LMt -> VkE -> LMt
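To make the idea of relational paths concrete, the sketch below enumerates all sequences of edge types up to length l and computes the distribution reached by a walk that must follow a given sequence. It assumes path length counts the number of edge types in the sequence, as in the sample paths above; the function names and the toy graphs are hypothetical.

```python
from itertools import product

def relation_paths(edge_types, max_len):
    """All sequences of edge types with length 1..max_len."""
    return [p for n in range(1, max_len + 1) for p in product(edge_types, repeat=n)]

def path_walk(graphs, path, start):
    """Distribution over end nodes after following `path` from `start`.

    graphs: {edge type: {node: {neighbour: weight}}}
    At step k the walker may only use edges of type path[k].
    """
    dist = {start: 1.0}
    for edge_type in path:
        nxt = {}
        for u, p in dist.items():
            out = graphs[edge_type].get(u, {})
            total = sum(out.values())
            if total == 0:
                continue                      # no edge of this type: the walk dies here
            for v, w in out.items():
                nxt[v] = nxt.get(v, 0.0) + p * w / total
        dist = nxt
    return dist

paths = relation_paths(["LMw", "LMt", "V1E"], max_len=3)

# Hypothetical edge weights for two edge types over nodes q, a, b.
graphs = {
    "LMw": {"q": {"a": 1.0, "b": 3.0}, "a": {"q": 1.0}},
    "LMt": {"a": {"b": 1.0}, "b": {"a": 1.0, "b": 1.0}},
}
end_dist = path_walk(graphs, ("LMw", "LMt"), start="q")
```

With three edge types and l = 3 there are 3 + 9 + 27 = 39 candidate paths, and each path gives one score per candidate node that a supervised model could then weight.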