Natural language processing


Page 1: Natural language processing

NATURAL LANGUAGE PROCESSING

Page 2: Natural language processing

Applications

- Classification (spam)
- Clustering (news stories, Twitter)
- Input correction (spell checking)
- Sentiment analysis (product reviews)
- Information retrieval (web search)
- Question answering (web search, IBM's Watson)
- Machine translation (English to Spanish)
- Speech recognition (Siri)

Page 3: Natural language processing

Language Models

Two ways to think about modeling language:

1. Sequences of letters/words: probabilistic, word-based, learned
2. Tree-based grammar models: logical, boolean, often hand-coded

Page 4: Natural language processing

Bag-of-words Model

Transform documents into sparse numeric vectors, then deal with them using linear algebra operations

Page 5: Natural language processing

Bag-of-words Model

One of the most common ways to deal with documents

- Forgets everything about the linguistic structure within the text
- Useful for classification, clustering, visualization, etc.
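The transformation above can be sketched in a few lines. This is a minimal illustration, not a production vectorizer; the whitespace tokenization and lowercasing are simplifying assumptions.

```python
from collections import Counter

def bag_of_words(docs):
    """Map each document to a sparse {word: count} vector.

    Word order within each document is discarded -- only counts remain.
    """
    return [Counter(doc.lower().split()) for doc in docs]

docs = ["the cat sat", "the dog sat on the mat"]
vectors = bag_of_words(docs)
print(vectors[1]["the"])  # 2 -- "the" occurs twice in the second document
```

Representing each vector as a dictionary rather than a dense array keeps the representation sparse: only words that actually occur in the document are stored.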

Page 6: Natural language processing

Similarity between document vectors

Each document is represented as a vector of weights.

Cosine similarity (a normalized dot product) is the most widely used similarity measure between two document vectors:
- calculates the cosine of the angle between the document vectors
- efficient to calculate (sum of products over intersecting words)
- similarity value between 0 (different) and 1 (the same)

Sim(D1, D2) = ( Σ_i x_{1i} x_{2i} ) / ( sqrt(Σ_j x_{1j}^2) · sqrt(Σ_k x_{2k}^2) )
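A direct sketch of this formula over sparse word-count vectors; the efficiency point on the slide shows up as the dot product running only over the intersection of the two vocabularies. The example documents are made up for illustration.

```python
import math
from collections import Counter

def cosine_similarity(d1, d2):
    """Cosine of the angle between two sparse word-weight vectors.

    Only words present in both vectors contribute to the dot product,
    so the sum runs over the (usually small) intersection.
    """
    dot = sum(d1[w] * d2[w] for w in d1.keys() & d2.keys())
    norm1 = math.sqrt(sum(x * x for x in d1.values()))
    norm2 = math.sqrt(sum(x * x for x in d2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

a = Counter("trump makes bid for resorts".split())
b = Counter("resorts accepts bid".split())
print(cosine_similarity(a, b))  # shared words "bid" and "resorts" -> ~0.52
```

Identical documents score 1.0, documents with no words in common score 0.0, matching the range stated on the slide.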

Page 7: Natural language processing

Bag of Words with Word Weighting

Each word is represented as a separate variable having a numeric weight (importance). The most popular weighting scheme is normalized word frequency, TF-IDF:

tfidf(w) = tf(w) · log( N / df(w) )

- tf(w): term frequency (number of occurrences of the word in a document)
- df(w): document frequency (number of documents containing the word)
- N: number of all documents
- tfidf(w): relative importance of the word in the document

The word is more important if it appears several times in the target document, and more important if it appears in fewer documents.
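The weighting formula above can be sketched directly. The three-document toy corpus is invented for illustration; real corpora would also normalize tf and handle tokenization more carefully.

```python
import math
from collections import Counter

def tfidf(word, doc_counts, corpus):
    """tfidf(w) = tf(w) * log(N / df(w)), as defined on the slide.

    doc_counts: word counts for the target document
    corpus: list of word-count Counters, one per document
    """
    tf = doc_counts[word]                      # occurrences in this document
    df = sum(1 for d in corpus if word in d)   # documents containing the word
    n = len(corpus)                            # number of all documents
    return tf * math.log(n / df) if df else 0.0

corpus = [Counter(t.split()) for t in
          ["trump makes bid", "resorts bid accepted", "weather report"]]
print(tfidf("trump", corpus[0], corpus))  # rare word: 1 * log(3/1) ~ 1.10
print(tfidf("bid", corpus[0], corpus))    # in 2 of 3 documents: 1 * log(3/2) ~ 0.41
```

Note how "trump" (appearing in one document) outweighs "bid" (appearing in two): this is exactly the "appears in fewer documents" effect visible in the RESORTS example vector, where distinctive words get the largest weights.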

Page 8: Natural language processing

Example document and its vector representation

TRUMP MAKES BID FOR CONTROL OF RESORTS Casino owner and real estate Donald Trump has offered to acquire all Class B common shares of Resorts International Inc, a spokesman for Trump said. The estate of late Resorts chairman James M. Crosby owns 340,783 of the 752,297 Class B shares. Resorts also has about 6,432,000 Class A common shares outstanding. Each Class B share has 100 times the voting power of a Class A share, giving the Class B stock about 93 pct of Resorts' voting power.

[RESORTS:0.624] [CLASS:0.487] [TRUMP:0.367] [VOTING:0.171] [ESTATE:0.166] [POWER:0.134] [CROSBY:0.134] [CASINO:0.119] [DEVELOPER:0.118] [SHARES:0.117] [OWNER:0.102] [DONALD:0.097] [COMMON:0.093] [GIVING:0.081] [OWNS:0.080] [MAKES:0.078] [TIMES:0.075] [SHARE:0.072] [JAMES:0.070] [REAL:0.068] [CONTROL:0.065] [ACQUIRE:0.064] [OFFERED:0.063] [BID:0.063] [LATE:0.062] [OUTSTANDING:0.056] [SPOKESMAN:0.049] [CHAIRMAN:0.049] [INTERNATIONAL:0.041] [STOCK:0.035] [YORK:0.035] [PCT:0.022] [MARCH:0.011]

Original text

Bag-of-words representation (high-dimensional sparse vector)

Page 9: Natural language processing

What happens if some words do not appear in the training corpus?

Smoothing: assigning very low, but greater than 0, probabilities to previously unseen words
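One common way to do this is add-one (Laplace) smoothing, sketched below; the slide does not name a specific smoothing method, so this is one illustrative choice, and the vocabulary size is an assumed parameter.

```python
from collections import Counter

def laplace_prob(word, counts, vocab_size):
    """Add-one (Laplace) smoothed unigram probability.

    Every word's count is incremented by 1, so previously unseen
    words get a small but nonzero probability instead of 0.
    """
    total = sum(counts.values())
    return (counts[word] + 1) / (total + vocab_size)

counts = Counter("to be or not to be".split())
vocab_size = 10_000  # assumed size of the full vocabulary, seen and unseen

p_seen = laplace_prob("be", counts, vocab_size)       # (2+1)/(6+10000)
p_unseen = laplace_prob("xyzzy", counts, vocab_size)  # (0+1)/(6+10000) > 0
```

Seen words still get higher probability than unseen ones, but no word is assigned exactly zero, which keeps products of probabilities over a sentence from collapsing to zero.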

Page 10: Natural language processing

Wouldn’t it be helpful to reason about word order?

Page 11: Natural language processing

n-gram models

A probabilistic language model based on contiguous sequences of n items

- n=1 is a unigram model ("bag of words")
- n=2 is a bigram model
- n=3 is a trigram model
- etc.

Page 12: Natural language processing

n-gram example

Source text: to be or not to be

Unigrams: to, be, or, not, to, be
Bigrams: to be, be or, or not, not to, to be
Trigrams: to be or, be or not, or not to, not to be

Page 13: Natural language processing

Google n-gram corpus

In September 2006 Google announced the availability of an n-gram corpus:
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html#links

Some statistics of the corpus:

- File sizes: approx. 24 GB compressed (gzip'ed) text files
- Number of tokens: 1,024,908,267,229
- Number of sentences: 95,119,665,584
- Number of unigrams: 13,588,391
- Number of bigrams: 314,843,401
- Number of trigrams: 977,069,902
- Number of fourgrams: 1,313,818,354
- Number of fivegrams: 1,176,470,663

Page 14: Natural language processing

Example: Google n-grams (trigram counts)

ceramics collectables collectibles 55
ceramics collectables fine 130
ceramics collected by 52
ceramics collectible pottery 50
ceramics collectibles cooking 45
ceramics collection , 144
ceramics collection . 247
ceramics collection </S> 120
ceramics collection and 43
ceramics collection at 52
ceramics collection is 68
ceramics collection of 76
ceramics collection | 59
ceramics collections , 66
ceramics collections . 60
ceramics combined with 46
ceramics come from 69
ceramics comes from 660
ceramics community , 109
ceramics community . 212
ceramics community for 61
ceramics companies . 53
ceramics companies consultants 173
ceramics company ! 4432
ceramics company , 133
ceramics company . 92
ceramics company </S> 41
ceramics company facing 145
ceramics company in 181
ceramics company started 137
ceramics company that 87
ceramics component ( 76
ceramics composed of 85

serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
serve as the individual 234
serve as the industrial 52
serve as the industry 607
serve as the info 42
serve as the informal 102
serve as the information 838
serve as the informational 41
serve as the infrastructure 500
serve as the initial 5331
serve as the initiating 125
serve as the initiation 63
serve as the initiator 81
serve as the injector 56
serve as the inlet 41
serve as the inner 87
serve as the input 1323
serve as the inputs 189
serve as the insertion 49
serve as the insourced 67
serve as the inspection 43
serve as the inspector 66
serve as the inspiration 1390
serve as the installation 136
serve as the institute 187

Page 15: Natural language processing

Using n-grams to generate text (Shakespeare)

Unigrams: Every enter now severally so, let Hill he late speaks; or! a more to leg less first you enter

Bigrams: What means, sir. I confess she? then all sorts, he is trim, captain. Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry.

Trigrams: Sweet prince, Falstaff shall die. This shall forbid it should be branded, if renown made it empty.

Page 16: Natural language processing

Using n-grams to generate text (Shakespeare)

Quadrigrams: What! I will go seek the traitor Gloucester. Will you not tell me who I am?

Note: As we increase the value of n, the accuracy of an n-gram model increases, since the choice of the next word becomes increasingly constrained.
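The generation procedure behind these samples can be sketched as a random walk over a bigram model: record which words follow each word in the training text, then repeatedly sample a follower of the last word emitted. The toy corpus and fixed seed here are illustrative assumptions.

```python
import random
from collections import defaultdict

def train_bigrams(tokens):
    """Map each word to the list of words that follow it in the corpus.

    Repeated followers stay in the list, so sampling from it is
    proportional to the bigram counts.
    """
    model = defaultdict(list)
    for w1, w2 in zip(tokens, tokens[1:]):
        model[w1].append(w2)
    return model

def generate(model, start, length, seed=0):
    """Emit up to `length` words by repeatedly sampling a follower."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = model.get(out[-1])
        if not followers:
            break  # dead end: no word ever followed this one
        out.append(rng.choice(followers))
    return " ".join(out)

tokens = "to be or not to be that is the question".split()
model = train_bigrams(tokens)
print(generate(model, "to", 8))
```

Every adjacent pair in the output is a bigram seen in training, which is why the samples above read locally fluent; raising n to trigrams or quadrigrams constrains each choice further, as the note describes.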

Page 17: Natural language processing

Using n-grams to generate text (Wall Street Journal)