Investigation and Modeling of the Structure of Texting Languages
Monojit Choudhury, Rahul Saraf*, Vijit Jain, Sudeshna Sarkar, Anupam Basu
Department of Computer Science & Engineering, IIT Kharagpur
*Department of Computer Engineering, NIT Jaipur
Monojit Choudhury, CSE, IIT Kharagpur
Investigation and Modeling of the Structure of Texting Languages
AND 2007, Hyderabad
Texting Language
• A new genre of English (and of other languages) used in chats, SMS, emails, blogs, etc.
• Ungrammatical, with unconventional spellings
Texting: dis is n eg 4 txtin lang
Standard: This is an example for Texting language
• 24 characters in the texting form versus 39 in the standard form
• The shorter, the faster
• Constraint: understandability
Objectives
• Modeling the structure of Texting language
• Decoder from Texting language to standard English
• Domain: SMS texts
• Applications
– Search engines
– Noisy text correction
– Correction of ASR-transcribed data
The Noisy Channel Model
Standard Language S: s_1 s_2 … s_n → NOISY CHANNEL → Texting Language T: t_1 t_2 … t_m

$$\hat{S}(T) = \arg\max_{S} \Pr(T \mid S)\,\Pr(S) = \arg\max_{S} \left[ \prod_{i=1}^{n} \Pr(t_i \mid s_i) \right] \Pr(S)$$
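As a rough illustration of the decoding rule above, the sketch below applies it one word at a time, under the simplifying assumption that Pr(S) factorizes into unigram probabilities. The tables `channel` and `prior`, and every number in them, are hypothetical values made up for the example; they are not estimates from the SMS corpus.

```python
# Hypothetical toy tables (illustrative numbers, not corpus estimates).
# channel[s][t] stands in for Pr(t | s); prior[s] for a unigram Pr(s).
channel = {
    "this": {"dis": 0.3, "this": 0.7},
    "is": {"is": 1.0},
    "today": {"2day": 0.6, "today": 0.4},
    "stay": {"stay": 0.9, "2day": 0.001},
}
prior = {"this": 0.4, "is": 0.4, "today": 0.15, "stay": 0.05}


def decode_word(t):
    """Return argmax over s of Pr(t | s) * Pr(s); fall back to t itself."""
    best, best_p = t, 0.0
    for s, emissions in channel.items():
        p = emissions.get(t, 0.0) * prior[s]
        if p > best_p:
            best, best_p = s, p
    return best


def decode(tokens):
    """Decode a token sequence word by word (unigram prior assumed)."""
    return [decode_word(t) for t in tokens]


print(decode(["dis", "is", "2day"]))
```

A fuller decoder would replace the unigram prior with a language model over the whole sentence, which is why the product over words is kept separate from Pr(S) in the formula.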
Refined Objective
• Given a Texting language word t
• Find the set of possible Standard language words {s1, s2, s3, …} such that Pr(si|t) > p
Example: t = “tns”
s1 = “teens”, s2 = “tins”, s3 = “tons”, s4 = “tens”, s5 = “tense”, s6 = “turns”
Texting Language Data
• 1000 SMS texts collected from the web [http://www.treasuremytext.com]
• Manually translated to standard English
• Automatic word alignment through heuristics
• Word–variation pairs extracted from the corpus and manually corrected
Available at:
http://www.mla.iitkgp.ernet.in/~monojit/sms.html
No. of tokens: ~20000
No. of types: ~2000 (standard English)
No. of frequent types: 234 (freq > 10)
Compression rate: 0.83
Tomorrow never dies!!!
• 2moro (9)
• tomoz (25)
• tomoro (12)
• tomrw (5)
• tom (2)
• tomra (2)
• tomorrow (24)
• tomora (4)
• tomm (1)
• tomo (3)
• tomorow (3)
• 2mro (2)
• morrow (1)
• tomor (2)
• tmorro (1)
• moro (1)
Patterns or Compression Operators
• Phonetic substitution (phoneme)
– psycho → syco, then → den
• Phonetic substitution (syllable)
– today → 2day, see → c
• Deletion of vowels
– message → mssg, about → abt
• Deletion of repeated characters
– tomorrow → tomorow
• Truncation (deletion of tails)
– introduction → intro, evaluation → eval
• Common abbreviations
– Kharagpur → kgp, text back → tb
• Informal pronunciation
– going to → gonna, better → betta
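Three of the operators listed above are purely string-level and can be sketched as small functions. This is a deterministic simplification for illustration only: real texters apply the operators optionally and partially, and the function names are hypothetical. The rule that a word-initial vowel is kept is an assumption inferred from the example about → abt.

```python
from itertools import groupby

VOWELS = set("aeiou")


def delete_vowels(word):
    """Drop every vowel except a word-initial one (about -> abt)."""
    return word[:1] + "".join(c for c in word[1:] if c not in VOWELS)


def collapse_repeats(word):
    """Reduce each run of identical characters to one (tomorrow -> tomorow)."""
    return "".join(ch for ch, _ in groupby(word))


def truncate(word, keep):
    """Keep only the first `keep` characters (introduction -> intro)."""
    return word[:keep]


for source, variant in [
    ("message", delete_vowels("message")),
    ("about", delete_vowels("about")),
    ("tomorrow", collapse_repeats("tomorrow")),
    ("introduction", truncate("introduction", 5)),
]:
    print(source, "->", variant)
```

Phonetic substitution, abbreviations and informal pronunciation need pronunciation or lexical knowledge and are not covered by this sketch.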
Successive Application of Operators
• because → cause (informal usage)
• cause → cauz (phonetic substitution)
• cauz → cuz (vowel deletion)
Approach
• Supervised machine learning using Hidden Markov Models
• Training instances: only positive examples, given as (t, s, freq)
(“tns”, “teens”, 52)
(“tns”, “tins”, 34)
(“tns”, “tens”, 27)
(“tns”, “tense”, 2)
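The (t, s, freq) triples above already allow a simple relative-frequency estimate of Pr(s|t), which gives a feel for the thresholded candidate set described earlier. This is a counting baseline, not the HMM approach of the talk, and the threshold p = 0.05 is an arbitrary value chosen for the example.

```python
from collections import defaultdict

# (t, s, freq) triples as listed on the slide.
triples = [
    ("tns", "teens", 52),
    ("tns", "tins", 34),
    ("tns", "tens", 27),
    ("tns", "tense", 2),
]


def candidates(t, triples, p):
    """Return {s: Pr(s | t)} for every s whose relative frequency exceeds p."""
    counts = defaultdict(int)
    for t_i, s, freq in triples:
        if t_i == t:
            counts[s] += freq
    total = sum(counts.values())
    if total == 0:
        return {}
    return {s: c / total for s, c in counts.items() if c / total > p}


for s, prob in candidates("tns", triples, p=0.05).items():
    print(s, round(prob, 3))
```

With these counts (115 in total), “tense” falls below the threshold because 2/115 is under 0.05. A counting estimate like this cannot say anything about unseen words, which is the gap the HMM is meant to fill.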
HMM Construction: Graphemic path
[Figure: graphemic path for “today”: S0 → G1 ‘T’ → G2 ‘O’ → G3 ‘D’ → G4 ‘A’ → G5 ‘Y’ → S6. The emissions shown for each G-state are ε, the state's own letter, and @.]
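The graphemic path can be written down as a plain data structure. The sketch below is one possible encoding, assuming a strictly left-to-right chain from S0 to a final state numbered one past the last letter (S6 for “today”). The function name and the tuple layout are hypothetical, and no probabilities are attached yet.

```python
EPS = "ε"  # null emission: the letter is deleted
ANY = "@"  # third emission symbol on the slide; its meaning is not stated there


def build_graphemic_path(word):
    """Return (states, transitions, emissions) for a left-to-right G-path."""
    letters = word.upper()
    states = ["S0"] + [f"G{i}" for i in range(1, len(letters) + 1)]
    states.append(f"S{len(letters) + 1}")
    # Each state is linked only to its successor in the chain.
    transitions = list(zip(states, states[1:]))
    emissions = {
        f"G{i}": [EPS, ch, ANY] for i, ch in enumerate(letters, start=1)
    }
    return states, transitions, emissions


states, transitions, emissions = build_graphemic_path("today")
print(states)
print(emissions["G2"])
```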
HMM Construction: Phonemic path
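[Figure: phonemic path for “today”: S0 → P1 /T/ → P2 /AH/ → P3 /D/ → P4 /AY/ → S6, plus a syllable state S1 “2”. Emissions shown: P1: T; P2: A, O, U; P3: D; P4: Y, E, I; S1: 2.]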
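The phonemic path can be encoded in the same spirit. The phoneme-to-grapheme table below is copied from the slide for “today”; the way the syllable state S1 is wired in (S0 → S1 → P3, standing in for /T/ + /AH/) is an assumption, since the transcript does not preserve the edges. Function and variable names are hypothetical.

```python
# Phoneme -> graphemes it may emit, as shown on the slide for "today".
PHONEME_GRAPHEMES = {
    "T": ["T"],
    "AH": ["A", "O", "U"],
    "D": ["D"],
    "AY": ["Y", "E", "I"],
}


def build_phonemic_path(phonemes, end_state):
    """Return (states, transitions, emissions) for a left-to-right P-path."""
    states = ["S0"] + [f"P{i}" for i in range(1, len(phonemes) + 1)]
    states.append(end_state)
    transitions = list(zip(states, states[1:]))
    emissions = {
        f"P{i}": PHONEME_GRAPHEMES[ph] for i, ph in enumerate(phonemes, start=1)
    }
    return states, transitions, emissions


states, transitions, emissions = build_phonemic_path(["T", "AH", "D", "AY"], "S6")

# Assumption: S1 emits "2" in place of the first two phonemes.
emissions["S1"] = ["2"]
transitions += [("S0", "S1"), ("S1", "P3")]

print(states)
print(emissions["P4"], emissions["S1"])
```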
HMM Construction: Cross-linking
[Figure: the graphemic path (G1 ‘T’, G2 ‘O’, G3 ‘D’, G4 ‘A’, G5 ‘Y’) and the phonemic path (P1 /T/, P2 /AH/, P3 /D/, P4 /AY/, S1 “2”) cross-linked between S0 and S6.]
HMM Construction: State Minimization
[Figure: the minimized HMM with states S0, G1 ‘T’, G2 ‘O’, G3 ‘D’, G4 ‘A’, G5 ‘Y’, P2 /AH/, P4 /AY/, S1 “2” and S6. P1 /T/ and P3 /D/ no longer appear.]
HMM Construction: Modification
[Figure: the minimized HMM (S0, G1 ‘T’, G2 ‘O’, G3 ‘D’, G4 ‘A’, G5 ‘Y’, P2 /AH/, P4 /AY/, S1 “2”, S6) with an added extended state EXT ‘E/S’.]
Learning
• Supervised estimation of the HMM parameters for the 234 known words
• Generalization of the parameters across the HMMs: learning operator probabilities
• Construction of HMMs for unknown words
Supervised Estimation
Step 1: HMM for “Today”
[Figure: the HMM for “today” with states S0, G1 ‘T’, G2 ‘O’, G3 ‘D’, G4 ‘A’, G5 ‘Y’, P2 /AH/, P4 /AY/, S1 “2”, EXT ‘E/S’ and S6.]
Step 2: Initialization
[Figure: the HMM for “today” with initial probabilities on its edges; the values shown are 0.7, 0.3 and 1.]
Step 3: Training using Viterbi
[Figure: the initialized HMM for “today” (edge values 0.7, 0.3 and 1) with the training example “2day” (10).]
Step 3 (continued): Training using Viterbi
[Figure: the initialized HMM for “today” (edge values 0.7, 0.3 and 1) with the training example “tday” (5).]
Step 4: Update the parameters
[Figure: the HMM for “today” with updated probabilities on its edges; the values shown are 0.33, 0.66, 1 and 0.]
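Steps 3 and 4 amount to counting how often each edge is used on the best path of every training variant, weighting by frequency, and normalizing. The sketch below shows only that counting step. The two state paths are assumed for illustration (the transcript does not preserve the actual Viterbi paths); the frequencies 10 and 5 are the ones shown for “2day” and “tday”.

```python
from collections import defaultdict

# Assumed best state paths for the two training variants, with the
# frequencies from the slides. The paths themselves are a guess.
observed_paths = [
    (["S0", "S1", "G3", "G4", "G5", "S6"], 10),       # "2day"
    (["S0", "G1", "G2", "G3", "G4", "G5", "S6"], 5),  # "tday"
]


def reestimate_transitions(paths):
    """Frequency-weighted transition probabilities from observed paths."""
    counts = defaultdict(lambda: defaultdict(float))
    for states, freq in paths:
        for src, dst in zip(states, states[1:]):
            counts[src][dst] += freq
    probs = {}
    for src, outgoing in counts.items():
        total = sum(outgoing.values())
        probs[src] = {dst: c / total for dst, c in outgoing.items()}
    return probs


probs = reestimate_transitions(observed_paths)
print(probs["S0"])
```

Under these assumed paths the edges leaving S0 receive 10/15 and 5/15, which is consistent with the 0.66 and 0.33 in the updated figure; edges never used by any variant would receive 0.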
Generalization
• Weighted estimation of 20 parameters from the 234 word HMMs
– Probability of character deletion (null emission) from the first, last and intermediate G-states
– Probability of transition from a G-state to a P-state/S-state and vice versa
– Probability of transition to the extended state
Construction of HMM for unseen words
• 12000 frequently used English words
• Their pronunciations (CMU pronunciation dictionary)
• Construct the structure of the word HMMs
• Assign the probability values based on the estimated parameters
Experiments
• ~1200 distinct unseen tokens obtained from the SMS corpus (their translations are known from the aligned data)
• Given t, for each word s in the standard lexicon, estimate Pr(s|t) ~ Pr(t|s)
• Rank the words according to Pr(s|t)
• Generate the suggestion list
Results: Suggestion lists
2day (today): today (3.02), stay (11.46), away (13.13), play (13.14), clay (13.14)
fne (phone): fine (3.52), phone (5.13), funny (6.26), fined (6.51), fines (6.72)
cin (seeing): coin (3.52), chin (3.79), clean (5.95), coins (6.61), china (6.75)
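Suggestion lists like these are typically scored by accuracy at rank k. The sketch below assumes that this means the percentage of tokens whose correct word appears among the top k suggestions; that definition is an assumption, as the slides do not spell it out. The three lists are the ones shown above, so the numbers it prints describe only these three examples, not the full experiment.

```python
# token -> (correct word, ranked suggestions), copied from the lists above.
results = {
    "2day": ("today", ["today", "stay", "away", "play", "clay"]),
    "fne": ("phone", ["fine", "phone", "funny", "fined", "fines"]),
    "cin": ("seeing", ["coin", "chin", "clean", "coins", "china"]),
}


def accuracy_at_rank(results, k):
    """Percentage of tokens whose correct word is within the top k suggestions."""
    hits = sum(1 for gold, ranked in results.values() if gold in ranked[:k])
    return 100.0 * hits / len(results)


for k in (1, 2, 5):
    print(k, round(accuracy_at_rank(results, k), 1))
```

Here “today” is found at rank 1, “phone” at rank 2, and “seeing” is not in the top five at all, so the curve for these three tokens flattens after rank 2.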
Results: Graphs
[Figure: accuracy (%) against rank. The rank axis runs from 0 to 20 and the accuracy axis from 50 to 100. Curves: all tokens; only distorted tokens.]
Comparison with Aspell
[Figure: accuracy (%) against rank. The rank axis is labeled 1 to 16 and the accuracy axis runs from 0 to 100. Curves: Aspell; Our model on Unseen; Model testing.]
Ongoing Work
• Detailed evaluation
• Incorporation of language models
• Extension for other languages, namely Hindi and Bangla
• Algorithms for fast argmax searching
Future Work
• Improvement of the structure of the HMM
– Introduction of self loops and backward edges
– Learning the structure from data
• Case-based or analogical learning
– late → l8, test → tst ⇒ gr8st → greatest