Investigation and Modeling of the Structure of Texting Languages

Monojit Choudhury, Rahul Saraf*, Vijit Jain, Sudeshna Sarkar, Anupam Basu

Department of Computer Science & Engineering, IIT Kharagpur

*Department of Computer Engineering, NIT Jaipur

Monojit Choudhury, CSE, IIT Kharagpur

Investigation and Modeling of the Structure of Texting Languages

AND 2007, Hyderabad

Texting Language

• A new genre of English (and of other languages) used in chats, SMS, emails, blogs, etc.

• Ungrammatical, unconventional spellings

dis is n eg 4 txtin lang

This is an example for Texting language

24 characters vs. 39 characters

The shorter the faster

Constraint: understandability

Objectives

• Modeling the structure of Texting language

• Decoder from Texting language to standard English

• Domain: SMS texts

• Applications

– Search Engines

– Noisy text Correction

– Correction of ASR transcribed data

The Noisy Channel Model

Standard Language → NOISY CHANNEL → Texting Language

S: s1 s2 … sn    T: t1 t2 … tm

$$\hat{S}(T) = \arg\max_{S} \Pr(T \mid S)\,\Pr(S) = \arg\max_{S} \Big[\prod_{i=1}^{n} \Pr(t_i \mid s_i)\Big] \Pr(S)$$
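The decoding rule on this slide follows from Bayes' rule. A short derivation, assuming (as the product over i suggests) that the two sequences are aligned word by word, so that m = n and each t_i depends only on s_i:

```latex
\begin{align*}
\hat{S}(T) &= \arg\max_{S} \Pr(S \mid T) \\
           &= \arg\max_{S} \frac{\Pr(T \mid S)\,\Pr(S)}{\Pr(T)} \\
           &= \arg\max_{S} \Pr(T \mid S)\,\Pr(S)
              && \text{($\Pr(T)$ does not depend on $S$)} \\
           &= \arg\max_{S} \Big[\prod_{i=1}^{n} \Pr(t_i \mid s_i)\Big] \Pr(S)
              && \text{(word-by-word independence)}
\end{align*}
```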

Refined Objective

• Given a Texting Language word: t

• Find the set of possible Standard language words {s1, s2, s3, …} such that Pr(si|t) > p

t = “tns”

s1 = “teens”, s2 = “tins”, s3 = “tons”, s4 = “tens”, s5 = “tense”, s6 = “turns”

Texting Language Data

• 1000 SMS texts collected from web [http://www.treasuremytext.com]

• Manually translated to standard English

• Automatic word alignment through heuristics

• Word – Variation pairs extracted from the corpus and manually corrected

Available at:

http://www.mla.iitkgp.ernet.in/~monojit/sms.html

No. of Tokens: ~ 20000

No. of Types: ~ 2000 (Std English)

No. of Frequent Types: 234 (freq > 10)

Compression Rate: 0.83

Tomorrow never dies!!!

• 2moro (9)
• tomoz (25)
• tomoro (12)
• tomrw (5)
• tom (2)
• tomra (2)
• tomorrow (24)
• tomora (4)
• tomm (1)
• tomo (3)
• tomorow (3)
• 2mro (2)
• morrow (1)
• tomor (2)
• tmorro (1)
• moro (1)

Patterns or Compression Operators

• Phonetic substitution (phoneme)

– psycho → syco, then → den

• Phonetic substitution (syllable)

– today → 2day, see → c

• Deletion of vowels

– message → mssg, about → abt

• Deletion of repeated characters

– tomorrow → tomorow

Patterns or Compression Operators

• Truncation (deletion of tails)

– introduction → intro, evaluation → eval

• Common Abbreviations

– Kharagpur → kgp, text back → tb

• Informal pronunciation

– going to → gonna, better → betta
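Some of the operators listed on these slides are mechanical enough to sketch as string functions. The following is only an illustration, not the authors' implementation: the function names are made up, the rule "keep a word-initial vowel" is inferred from the single example about → abt, and the syllable table contains just the two substitutions shown on the slide.

```python
import re

VOWELS = set("aeiou")


def delete_vowels(word: str) -> str:
    """Drop every vowel except a word-initial one (message -> mssg, about -> abt)."""
    return word[:1] + "".join(ch for ch in word[1:] if ch.lower() not in VOWELS)


def delete_repeated_chars(word: str) -> str:
    """Collapse runs of the same character (tomorrow -> tomorow)."""
    return re.sub(r"(.)\1+", r"\1", word)


def truncate(word: str, keep: int) -> str:
    """Keep only the first `keep` characters (introduction -> intro)."""
    return word[:keep]


# Syllable-level phonetic substitutions; this tiny table is an assumption
# built from the two slide examples (today -> 2day, see -> c).
SYLLABLE_SUBSTITUTIONS = {"to": "2", "see": "c"}


def substitute_syllables(word: str) -> str:
    """Replace each listed spelling with its short form, in table order."""
    for spelled, short in SYLLABLE_SUBSTITUTIONS.items():
        word = word.replace(spelled, short)
    return word
```

Abbreviations (Kharagpur → kgp) and informal pronunciations (going to → gonna) are lexical rather than rule-based, so they would need a lookup table rather than a function like these.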

Successive Application of Operators

• because → cause (informal usage)

• cause → cauz (phonetic substitution)

• cauz → cuz (vowel deletion)

Approach

• Supervised Machine Learning using Hidden Markov Models

• Training Instance – Only positive examples

(t, s, freq)

(“tns”, “teens”, 52)

(“tns”, “tins”, 34)

(“tns”, “tens”, 27)

(“tns”, “tense”, 2)
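To make the (t, s, freq) format concrete, here is a minimal sketch that turns the triples on this slide into relative-frequency estimates of Pr(s|t) and keeps the candidates above a threshold p. This is a plain counting baseline, not the HMM the slides go on to build; the function names and the threshold value in the usage note are assumptions.

```python
from collections import defaultdict

# (texting form t, standard word s, frequency) triples as listed on the slide.
TRAINING = [
    ("tns", "teens", 52),
    ("tns", "tins", 34),
    ("tns", "tens", 27),
    ("tns", "tense", 2),
]


def relative_frequencies(triples):
    """Estimate Pr(s | t) as freq(t, s) divided by the total frequency of t."""
    totals = defaultdict(int)
    for t, _, freq in triples:
        totals[t] += freq
    table = defaultdict(dict)
    for t, s, freq in triples:
        table[t][s] = freq / totals[t]
    return table


def candidates(table, t, p):
    """Standard words whose estimated Pr(s | t) exceeds p, most probable first."""
    return sorted(
        (s for s, prob in table[t].items() if prob > p),
        key=lambda s: -table[t][s],
    )
```

With an illustrative threshold of p = 0.1, `candidates(relative_frequencies(TRAINING), "tns", 0.1)` keeps “teens”, “tins” and “tens” and drops “tense” (2 out of 115 occurrences).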

HMM Construction: Graphemic path

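[Figure: graphemic path for “today”: a chain of states G1 ‘T’, G2 ‘O’, G3 ‘D’, G4 ‘A’, G5 ‘Y’ between S0 and S6. Each G-state is shown with three emissions: ε, its own letter, and @.]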
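A deletion-only reading of the graphemic path can be sketched as a small forward computation. This is a simplification under stated assumptions, not the model from the slides: every G-state either emits its own letter or ε, the @ emission is ignored, and a single deletion probability `p_delete` (an arbitrary value here) is shared by all states.

```python
def graphemic_likelihood(t: str, s: str, p_delete: float = 0.3) -> float:
    """Pr(t | s) under a deletion-only, left-to-right graphemic path.

    Each letter of the standard word s is one state that emits either
    its own letter (probability 1 - p_delete) or nothing (p_delete).
    """
    # forward[j] = probability of having emitted t[:j] after the states seen so far
    forward = [1.0] + [0.0] * len(t)
    for letter in s:
        nxt = [0.0] * (len(t) + 1)
        for j, prob in enumerate(forward):
            if prob == 0.0:
                continue
            nxt[j] += prob * p_delete  # this state emits ε
            if j < len(t) and t[j] == letter:
                nxt[j + 1] += prob * (1 - p_delete)  # this state emits its letter
        forward = nxt
    return forward[len(t)]
```

Under this sketch “tday” gets a non-zero probability from “today” (only the ‘O’ state emits ε), while “2day” gets zero, since no graphemic state can emit “2”. That gap is what a phonemic path has to fill.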

HMM Construction: Phonemic path

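[Figure: phonemic path for “today” between S0 and S6: states P1 /T/ (emits T), P2 /AH/ (emits A, O, U), P3 /D/ (emits D), P4 /AY/ (emits Y, E, I), and a state S1 “2” (emits 2).]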

HMM Construction: Cross-linking

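[Figure: the graphemic states (G1 ‘T’, G2 ‘O’, G3 ‘D’, G4 ‘A’, G5 ‘Y’) and the phonemic states (P1 /T/, P2 /AH/, P3 /D/, P4 /AY/, S1 “2”) cross-linked between S0 and S6.]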

HMM Construction: State Minimization

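[Figure: the minimized HMM for “today”: states S0, G1 ‘T’, G2 ‘O’, G3 ‘D’, G4 ‘A’, G5 ‘Y’, P2 /AH/, P4 /AY/, S1 “2” and S6. The states P1 /T/ and P3 /D/ no longer appear.]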

HMM Construction: Modification

[Figure: the modified HMM for “today”: states S0, G1 ‘T’, G2 ‘O’, G3 ‘D’, G4 ‘A’, G5 ‘Y’, P2 /AH/, P4 /AY/, S1 “2” and S6, plus an added state EXT ‘E/S’.]

Learning

• Supervised estimation of the HMM parameters for the 234 known words

• Generalization of the parameters over HMMs: learning operator probabilities

• Construction of HMMs for unknown words

Supervised Estimation

Step 1: HMM for “Today”

[Figure: word HMM for “today” with states S0, G1 ‘T’, G2 ‘O’, G3 ‘D’, G4 ‘A’, G5 ‘Y’, P2 /AH/, P4 /AY/, S1 “2”, EXT ‘E/S’ and S6.]

Supervised Estimation

Step 2: Initialization

[Figure: the HMM for “today” with initial probabilities of 0.7, 0.3 or 1 marked on its edges.]

Supervised Estimation

Step 3: Training using Viterbi

Training form: “2day” (10)

[Figure: the HMM for “today” with the initial probabilities (0.7, 0.3 or 1) on its edges, shown for the training form “2day”.]

Supervised Estimation

Step 3: Training using Viterbi

Training form: “tday” (5)

[Figure: the HMM for “today” with the initial probabilities (0.7, 0.3 or 1) on its edges, shown for the training form “tday”.]

Supervised Estimation

Step 4: Update the parameters

[Figure: the HMM for “today” with updated probabilities on its edges: 0.33 and 0.66 on one pair of competing edges, and 1 or 0 on the others.]
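The update step can be sketched as frequency-weighted counting over the best (Viterbi) path of each training form. The two paths below are hypothetical readings of how “2day” and “tday” pass through the HMM for “today”; the slides do not spell them out, and the function name is made up for illustration.

```python
from collections import Counter, defaultdict


def update_transitions(weighted_paths):
    """Re-estimate transition probabilities from best paths weighted by frequency."""
    counts = defaultdict(Counter)
    for path, freq in weighted_paths:
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += freq
    return {
        src: {dst: c / sum(dsts.values()) for dst, c in dsts.items()}
        for src, dsts in counts.items()
    }


# Hypothetical best paths for the two training forms shown in Step 3.
PATHS = [
    (["S0", "S1", "G3", "G4", "G5", "S6"], 10),       # "2day" (10): S1 emits "2"
    (["S0", "G1", "G2", "G3", "G4", "G5", "S6"], 5),  # "tday" (5): G2 emits ε
]
```

Under this reading, the edge S0 → S1 receives 10/15 of the mass and S0 → G1 receives 5/15, which is consistent with the 0.66 and 0.33 shown in Step 4. Edges that no training path uses get no count, which corresponds to a probability of 0.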

Generalization

• Weighted estimation of 20 parameters from the 234 word HMMs

– Probability of character deletion (null emission) from the first, last and intermediate G-states

– Probability of transition from G-state to P-state/S-state and vice versa

– Probability of transition to the extended state

Construction of HMM for unseen words

• 12000 frequently used English words

• Their pronunciations (CMU pronunciation dictionary)

• Construct the structure of the word HMMs

• Assign the probability values based on the estimated parameters
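As a rough illustration of the structural step, the sketch below lays out the state inventory for a word from its spelling and a phoneme list. It is an assumption-heavy skeleton: the state naming (S0, G1…Gn, P1…Pm, final state S(n+1)) is inferred from the single “today” example, the phonemes are passed in by hand rather than read from the CMU dictionary, and cross-links, “2”-style states, the extended state and all probabilities are left out.

```python
def build_word_hmm_states(word, phonemes):
    """List the start, graphemic, phonemic and end states for one word."""
    graphemic = [(f"G{i}", ch.upper()) for i, ch in enumerate(word, start=1)]
    phonemic = [(f"P{i}", f"/{ph}/") for i, ph in enumerate(phonemes, start=1)]
    return {
        "start": "S0",
        "graphemic": graphemic,          # one state per letter
        "phonemic": phonemic,            # one state per phoneme
        "end": f"S{len(word) + 1}",      # S6 for the five-letter "today"
    }


# Phoneme labels copied from the slides for "today".
today = build_word_hmm_states("today", ["T", "AH", "D", "AY"])
```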

Experiments

• ~1200 distinct tokens obtained from the SMS corpus which are unseen (translations are known from the aligned data)

• Given t, for each word s in the standard lexicon, estimate Pr(s|t) ~ Pr(t|s)

• Rank the words according to Pr(s|t)

• Generate the suggestion list
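The ranking procedure above can be sketched as follows. The likelihood function here is a toy stand-in (a deletion-only model with an arbitrary deletion probability), not the trained word HMMs, and scoring by negative log-likelihood, where a lower cost is better, is an assumption made for this sketch.

```python
import math


def toy_likelihood(t, s, p_delete=0.3):
    """Stand-in for Pr(t | s): non-zero only if t is a subsequence of s."""
    remaining = iter(s)
    if not all(ch in remaining for ch in t):
        return 0.0
    return (1 - p_delete) ** len(t) * p_delete ** (len(s) - len(t))


def suggestion_list(t, lexicon, likelihood, top_k=5):
    """Rank lexicon words s by cost = -log Pr(t | s) and keep the best top_k."""
    scored = []
    for s in lexicon:
        prob = likelihood(t, s)
        if prob > 0.0:
            scored.append((-math.log(prob), s))
    scored.sort()  # lowest cost first; ties broken alphabetically
    return [(s, cost) for cost, s in scored[:top_k]]


LEXICON = ["teens", "tins", "tons", "tens", "tense", "turns", "today"]
suggestions = suggestion_list("tns", LEXICON, toy_likelihood)
```

With this toy likelihood the four-letter words outrank the five-letter ones for “tns”, and “today” is excluded because “tns” cannot be derived from it by deletion alone.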

Results: Suggestion lists

2day (today): today (3.02), stay (11.46), away (13.13), play (13.14), clay (13.14)

fne (phone): fine (3.52), phone (5.13), funny (6.26), fined (6.51), fines (6.72)

cin (seeing): coin (3.52), chin (3.79), clean (5.95), coins (6.61), china (6.75)

Results: Graphs

[Figure: accuracy (%) plotted against rank (0 to 20), with the accuracy axis running from 50 to 100. Two curves: all tokens, and only distorted tokens.]

Comparison with Aspell

[Figure: accuracy (%) plotted against rank (1 to 16), with the accuracy axis running from 0 to 100. Three curves: Aspell, our model on unseen, and model testing.]

Ongoing Work

• Detailed evaluation

• Incorporation of language models

• Extension for other languages, namely Hindi and Bangla

• Algorithms for fast argmax searching

Future Work

• Improvement of the structure of HMM

– Introduction of self loop, backward edges

– Learning the structure from data

• Case-based or analogical learning

– late → l8, test → tst, gr8st → greatest

Thank you for listening