Web Scale NLP: A Case Study on URL Word Breaking

Web Scale NLP: A Case Study on URL Word Breaking Kuansan Wang, Chris Thrasher, Bo-June (Paul) Hsu Microsoft Research, Redmond, USA WWW 2011 March 31, 2011


Transcript of Web Scale NLP: A Case Study on URL Word Breaking

Page 1: Web Scale NLP: A Case Study on URL Word Breaking

Web Scale NLP: A Case Study on URL Word Breaking

Kuansan Wang, Chris Thrasher, Bo-June (Paul) Hsu
Microsoft Research, Redmond, USA

WWW 2011
March 31, 2011

Page 2: Web Scale NLP: A Case Study on URL Word Breaking


More Data > Complex Model

Banko and Brill. Mitigating the Paucity-of-Data Problems. HLT 01

Page 3: Web Scale NLP: A Case Study on URL Word Breaking


More Data > Complex Model

• CIKM 08

There is no data like more data?

Page 4: Web Scale NLP: A Case Study on URL Word Breaking

NLP for the Web
• Scale of the Web
– Avoid manual intervention
– Efficient implementations
• Dynamic Nature of the Web
– Fast adaptation
• Global Reach of the Web
– Need rudimentary multi-lingual capabilities
• Diverse Language Styles of Web Contents
– Multi-style language models

Simple models with matched data!

Page 5: Web Scale NLP: A Case Study on URL Word Breaking

Outline
• Web-Scale NLP
• Word Breaking
• Models
• Evaluation
• Conclusion

Page 6: Web Scale NLP: A Case Study on URL Word Breaking

Word Breaking
• Large Data + Simple Model (Norvig, CIKM 2008)
– Use unigram model to rank all possible segmentations
– Pretty good, but with occasional embarrassing outcomes
• More data does not help!
• Extension to trigram alleviates the problem
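The unigram approach on this slide can be sketched as a small dynamic program. This is a minimal illustration, not the presenters' implementation: the counts in `COUNTS`, the `MAX_WORD_LEN` cap, and the length-based penalty for unseen words are all hypothetical choices made for the example.

```python
import math
from functools import lru_cache

# Toy unigram counts; purely illustrative, not taken from any real corpus.
COUNTS = {"wait": 50, "for": 200, "you": 180, "mate": 20,
          "moms": 15, "a": 300, "it": 250}
TOTAL = sum(COUNTS.values())
MAX_WORD_LEN = 12  # hypothetical cap on candidate word length


def log_prob(word):
    """Log unigram probability; unseen words get a length-based penalty."""
    count = COUNTS.get(word)
    if count is None:
        # Assumed smoothing: probability decays by 10x per character.
        return math.log(1.0 / (TOTAL * 10 ** len(word)))
    return math.log(count / TOTAL)


def segment(text):
    """Return (log-probability, words) for the best unigram segmentation."""

    @lru_cache(maxsize=None)
    def best(start):
        # Best segmentation of text[start:], memoized on the start index.
        if start == len(text):
            return 0.0, ()
        candidates = []
        last = min(len(text), start + MAX_WORD_LEN)
        for end in range(start + 1, last + 1):
            word = text[start:end]
            rest_score, rest_words = best(end)
            candidates.append((log_prob(word) + rest_score,
                               (word,) + rest_words))
        return max(candidates)

    score, words = best(0)
    return score, list(words)
```

Calling `segment("waitforyou")` with these toy counts returns the words `wait`, `for`, `you`. Because each word is scored independently, the ranking ignores context, which is one plausible reading of why the slide reports occasional bad outcomes and a gain from trigrams.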

Page 7: Web Scale NLP: A Case Study on URL Word Breaking

Word Breaking for the Web
Web URLs exhibit a variety of language styles…
…and in different languages

Matched data is crucial to accuracy!

Page 8: Web Scale NLP: A Case Study on URL Word Breaking

Outline
• Web-Scale NLP
• Word Breaking
• Models
• Evaluation
• Conclusion

Page 9: Web Scale NLP: A Case Study on URL Word Breaking

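MAP Decision Rule
• Special case of Bayesian Minimum Risk
– Speech, MT, Parsing, Tagging, Information Retrieval, …
• Problem: Given an observation $o$, find the signal $\hat{s} = \arg\max_{s} P(s \mid o) = \arg\max_{s} P(o \mid s)\, P(s)$
– $P(o \mid s)$: transformation model
– $P(s)$: prior

[Diagram: Signal → Channel (distortion) → Observation]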

Page 10: Web Scale NLP: A Case Study on URL Word Breaking

MAP for Word Breaker

• $o$: Twitter hashtag or URL domain name
– Ex. 247moms, w84um8
• $s$: what the user meant to say
– Ex. 24_7_moms, w8_4_u_m8 (wait for you mate)

[Diagram: Signal → Channel (transformation) → Output]
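To make the search space concrete, write o for the observed string and s for the intended word sequence (symbols assumed here). If the channel only deletes word breaks, which is an assumption of this sketch rather than something the slide states, then every candidate s is a way of inserting breaks into o. The helper names below are hypothetical.

```python
from itertools import product


def segmentations(observed):
    """Yield every way to insert breaks between adjacent characters."""
    if not observed:
        return
    gaps = len(observed) - 1
    # Each gap between two characters either holds a break or does not.
    for breaks in product((False, True), repeat=gaps):
        words, start = [], 0
        for i, is_break in enumerate(breaks, start=1):
            if is_break:
                words.append(observed[start:i])
                start = i
        words.append(observed[start:])
        yield words


def channel(words):
    """Assumed deterministic transformation: concatenate, dropping breaks."""
    return "".join(words)
```

A string of n characters has 2^(n-1) candidates, so `247moms` has 64, one of which is `["24", "7", "moms"]`. Under this assumed channel, every candidate maps back to the observation exactly, so the MAP choice depends only on the prior over word sequences.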

Page 11: Web Scale NLP: A Case Study on URL Word Breaking

Plug-in MAP Problem
• MAP decision rule is optimal only if $P(o \mid s)$ and $P(s)$ are the “correct” underlying distributions
• Adjustments needed when estimated models $\hat{P}(o \mid s)$ and $\hat{P}(s)$ have unknown errors
– Simple logarithmic interpolation: [formula not preserved in transcript]
– “Random Field”/Machine Learning: [formula not preserved in transcript]
– Bayesian
  • Point estimation is outdated
  • Assume parameters are drawn from “some” distribution
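The formulas for the first two adjustments are missing from this transcript. As a hedged illustration only, logarithmic interpolation is commonly written with a scalar weight $\lambda$ on the log prior, and a log-linear ("random field") rule replaces the two log-probabilities with weighted feature functions $f_i$. The symbols $o$, $s$, $\lambda$, $\lambda_i$ and $f_i$ are assumptions of this sketch and may not match the authors' notation:

```latex
% Logarithmic interpolation (assumed form): weight the log prior by \lambda
\hat{s} = \arg\max_{s} \left[ \log \hat{P}(o \mid s) + \lambda \log \hat{P}(s) \right]

% Log-linear / random-field rule (assumed form): weighted feature functions
\hat{s} = \arg\max_{s} \sum_{i} \lambda_i \, f_i(s, o)
```

Setting $\lambda = 1$ in the first form recovers the plain MAP rule, and choosing $f_1 = \log \hat{P}(o \mid s)$, $f_2 = \log \hat{P}(s)$ with $\lambda_1 = 1$, $\lambda_2 = \lambda$ in the second form recovers the first.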

Page 12: Web Scale NLP: A Case Study on URL Word Breaking

Baseline Methods
• GM: Geometric Mean (Koehn and Knight, 2003)
– Widely used, especially in MT systems
• BI: Binomial Model (Venkataraman, 2001)
• WL: Word Length Normalization (Khaitan et al., 2009)

All special cases/variations of MAP
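As a rough illustration of the GM baseline, the sketch below scores a candidate split by the geometric mean of its word counts, the scoring idea Koehn and Knight describe for compound splitting. The counts are invented for the example, and how the presenters applied GM to n-gram models is not shown on the slide, so treat this as an assumption-laden sketch.

```python
import math

# Illustrative word counts; hypothetical values, not Web N-gram data.
COUNTS = {"home": 900, "page": 700, "homepage": 400, "ho": 5, "me": 800}


def geometric_mean_score(words, counts=COUNTS):
    """Geometric mean of the word counts; 0.0 if any word is unseen."""
    if not words or any(counts.get(w, 0) == 0 for w in words):
        return 0.0
    # Average in log space to avoid overflow on long products.
    log_sum = sum(math.log(counts[w]) for w in words)
    return math.exp(log_sum / len(words))


def best_split(candidates, counts=COUNTS):
    """Pick the candidate word list with the highest geometric mean."""
    return max(candidates,
               key=lambda words: geometric_mean_score(words, counts))
```

With these toy counts, `best_split([["homepage"], ["home", "page"], ["ho", "me", "page"]])` prefers `["home", "page"]`, since the geometric mean of 900 and 700 exceeds 400. Taking the n-th root is what keeps the score from automatically favoring splits with fewer words.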

Page 13: Web Scale NLP: A Case Study on URL Word Breaking

Proposed Method
ME: Maximum Entropy Principle Model
• [formula not preserved in transcript]
– Special case of BI () and WL (uniform)
• [term not preserved in transcript] using Microsoft Web N-gram

Microsoft Web N-gram (http://web-ngram.research.microsoft.com)
• Web documents/Bing queries (EN-US market)
• Rudimentary multilingual (NAACL 10)
• Frequent updates (ICASSP 09)
• Multi-style language model (WWW 10, SIGIR 10)

          Body     Title    Anchor   Query
1-gram    1.2 B    60 M     150 M    252 M
5-gram    237 B    3.8 B    8.9 B    -

Page 14: Web Scale NLP: A Case Study on URL Word Breaking

Outline
• Web-Scale NLP
• Word Breaking
• Models
• Evaluation
• Conclusion

Page 15: Web Scale NLP: A Case Study on URL Word Breaking

Data Set
• 100K randomly sampled URLs indexed by Bing
– Simple tokenization
– 266K unique tokens
– Mostly ASCII characters
• Metric: Precision@3
– Manually labeled word breaks
– Multiple answers are allowed
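One way to compute such a metric is sketched below. The slide does not define Precision@3 precisely, so this assumes a token counts as correct when any of its top three ranked hypotheses matches any of the accepted manual labels. The function name and data layout are hypothetical.

```python
def precision_at_k(hypotheses, references, k=3):
    """Fraction of tokens whose top-k hypotheses hit an accepted label.

    hypotheses: dict mapping token -> ranked list of segmentation strings
    references: dict mapping token -> set of accepted segmentation strings
    """
    if not references:
        return 0.0
    hits = 0
    for token, accepted in references.items():
        # Tokens with no hypotheses count as misses.
        top_k = hypotheses.get(token, [])[:k]
        if any(h in accepted for h in top_k):
            hits += 1
    return hits / len(references)
```

For example, if `247moms` has `24 7 moms` ranked second and `w84um8` has `w8 4 u m8` ranked fourth, the score is 0.5 at k=3 and 1.0 at k=4. Storing the references as sets is what allows multiple acceptable answers per token.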

Page 16: Web Scale NLP: A Case Study on URL Word Breaking

Language Model Style

[Figure: precision of the ME model by language model style (Body, Title, Query, Anchor) for 1-gram, 2-gram and 3-gram models; y-axis from 93% to 99%]

• Title is best although Body is 100x larger
• Nav queries often word-split URLs, but Query worse than Title

Matched style is crucial to precision!

Page 17: Web Scale NLP: A Case Study on URL Word Breaking

Model Complexity

• With mismatched data, model choice is crucial
• With matched data, complex models do not help

[Figure: precision by language model style (Body, Title, Query, Anchor) for BI (2), WL (1) and ME (0) with 3-gram models; y-axis from 95% to 99%]

Simple model is sufficient with matched data!

Page 18: Web Scale NLP: A Case Study on URL Word Breaking

Outline
• Web-Scale NLP
• Word Breaking
• Models
• Evaluation
• Conclusion

Page 19: Web Scale NLP: A Case Study on URL Word Breaking

Best = Right Data + Smart Model
• Style of language trumps size of data
– There is no data like more data… provided it’s matched data!
• Right data alleviates Plug-in MAP problem
– Complicated machine learning artillery not required; simple methods suffice
• Smart model gives us:
– Rudimentary multi-lingual capability
– Fast inclusion of new words/phrases
– No need for human labor in data labeling

http://research.microsoft.com/en-us/um/people/kuansanw/wordbreaker/

Page 20: Web Scale NLP: A Case Study on URL Word Breaking


BACKUP SLIDES

Page 21: Web Scale NLP: A Case Study on URL Word Breaking

[Figure: results by language model style (Body, Title, Query, Anchor) for 1-gram and 3-gram models; y-axis from 95.00% to 99.00%]

Note: BI, WL are oracle results

GM        1-gram    2-gram    3-gram
Body      59.01%    44.68%    44.78%
Title     61.55%    60.31%    58.70%
Anchor    60.46%    55.25%    54.84%
Query     54.83%    54.27%    54.83%