Text Analysis - Duke Computer Science

Text Analysis Everything Data CompSci 290.01 Spring 2014

Page 1:

Text Analysis

Everything Data
CompSci 290.01 Spring 2014

Page 2:

Announcements (Thu. Jan. 16)

• Sample solutions for HW 1 / Lab 1 are online.
• HW 2 will be posted by tomorrow (Fri.) morning at the latest. Please work individually (not in teams).
• Updated team assignments will be posted before Lab 2
  – (to account for slight changes in enrollment).

Page 3:

Outline

• Basic Text Processing
  – Word Counts
  – Tokenization
    • Pointwise Mutual Information
  – Normalization
    • Stemming

Page 4:

Outline

• Basic Text Processing
• Finding Salient Tokens
  – TFIDF scoring

Page 5:

Outline

• Basic Text Processing
• Finding Salient Tokens
• Document Similarity & Keyword Search
  – Vector Space Model & Cosine Similarity
  – Inverted Indexes

Page 6:

Outline

• Basic Text Processing
  – Word Counts
  – Tokenization
    • Pointwise Mutual Information
  – Normalization
    • Stemming
• Finding Salient Tokens
• Document Similarity & Keyword Search

Page 7:

Basic Query: Word Counts

• Google Flu
• Emotional Trends
  – http://www.ccs.neu.edu/home/amislove/twittermood/

Page 8:

Basic Query: Word Counts

• How many times does each word appear in the document?

Page 9:

Problem 1

• What is a word?
  – I’m … 1 word or 2 words: {I’m} or {I, am}
  – State-of-the-art … 1 word or 4 words
  – San Francisco … 1 or 2 words
  – Other languages
    • French: l’ensemble
    • German: Freundschaftsbezeigungen
    • Chinese: 這是一個簡單的句子 (no spaces), segmented as: This, one, easy, sentence

Page 10:

Solution: Tokenization

• For English:
  – Splitting the text on non-alphanumeric characters is a reasonable way to find individual words.
  – But it will also split terms like “San Francisco” and “state-of-the-art”.
• For other languages:
  – Need more sophisticated algorithms for identifying words.
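The English rule above can be sketched in a few lines of Python. This is only an illustration: the helper name `tokenize` and the restriction to ASCII letters and digits are assumptions, not something the slides specify.

```python
import re

def tokenize(text):
    """Split text on runs of non-alphanumeric characters (hypothetical helper)."""
    # [^A-Za-z0-9]+ matches one or more characters that are not ASCII letters or digits.
    pieces = re.split(r"[^A-Za-z0-9]+", text)
    # re.split leaves empty strings when the text starts or ends with a separator.
    return [p for p in pieces if p]

print(tokenize("I'm in San Francisco."))
print(tokenize("state-of-the-art"))
```

Note how the sketch reproduces the over-splitting problem: “I'm” becomes two tokens, and “San Francisco” and “state-of-the-art” are broken apart.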

Page 11:

Finding simple phrases

• Want to find pairs of tokens that always occur together (co-occurrence)
• Intuition: Two tokens are co-occurrent if they appear together more often than “random”
  – More often than monkeys typing English words

http://tmblr.co/ZiEAFywGEckV

Page 12:

Formalizing “more often than random”

• Language Model
  – Assigns a probability P(x1, x2, …, xk) to any sequence of tokens.
  – More common sequences have a higher probability.
  – Sequences of length 1: unigrams
  – Sequences of length 2: bigrams
  – Sequences of length 3: trigrams
  – Sequences of length n: n-grams

Page 13:

Formalizing “more often than random”

• Suppose we have a language model
  – P(“San Francisco”) is the probability that “San” and “Francisco” occur together (and in that order) in the language.

Page 14:

Formalizing “random”: Bag of words

• Suppose we only have access to the unigram language model
  – Think: all unigrams in the language thrown into a bag with counts proportional to their P(x)
  – Monkeys drawing words at random from the bag
  – P(“San”) × P(“Francisco”) is the probability that “San Francisco” occurs together in the (random) unigram model

Page 15:

Formalizing “more often than random”

• Pointwise Mutual Information:

$$\mathrm{PMI}(x_1, x_2) = \log \frac{P(x_1, x_2)}{P(x_1)\,P(x_2)}$$

  – Positive PMI suggests word co-occurrence
  – Negative PMI suggests the words don’t appear together
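A minimal sketch of PMI, taking it as the log of the ratio between the observed bigram probability P(x1, x2) and the probability P(x1) × P(x2) expected under the random unigram model. The log base (2 here) and the probability values are assumptions made up for illustration; the sign of the result does not depend on the base.

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information: log of observed vs. independent co-occurrence."""
    # Base 2 is an assumption; any base gives the same sign.
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical probabilities, not taken from any real corpus.
print(pmi(0.001, 0.002, 0.001))  # pair seen far more often than chance: positive
print(pmi(1e-7, 0.01, 0.01))     # pair seen far less often than chance: negative
print(pmi(0.25, 0.5, 0.5))       # exactly what independence predicts: 0.0
```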

Page 16:

What is P(x)?

• “Suppose we have a language model …”
• Idea: Use counts from a large corpus of text to compute the probabilities

Page 17:

What is P(x)?

• Unigram: P(x) = count(x) / N
  • count(x) is the number of times token x appears
  • N is the total number of tokens
• Bigram: P(x1, x2) = count(x1, x2) / N
  • count(x1, x2) = number of times the sequence (x1, x2) appears
• Trigram: P(x1, x2, x3) = count(x1, x2, x3) / N
  • count(x1, x2, x3) = number of times the sequence (x1, x2, x3) appears
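The count-based estimates above can be sketched as follows. The function name `ngram_probs` and the toy token list are assumptions for illustration; as on the slide, every n-gram count is divided by N, the total number of tokens.

```python
from collections import Counter

def ngram_probs(tokens, n):
    """Estimate P(x1, ..., xn) = count(x1, ..., xn) / N from a list of tokens."""
    total = len(tokens)  # N: total number of tokens
    # zip over n shifted copies of the list yields every window of length n.
    grams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(grams)
    return {gram: c / total for gram, c in counts.items()}

# Toy corpus, made up for this example.
tokens = "san francisco is in california and san francisco is foggy".split()
unigrams = ngram_probs(tokens, 1)
bigrams = ngram_probs(tokens, 2)

print(unigrams[("san",)])
print(bigrams[("san", "francisco")])
```

A real language model would use a much larger corpus; with ten tokens the estimates are only good for checking the arithmetic.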

Page 18:

What is P(x)?

• “Suppose we have a language model …”
• Idea: Use counts from a large corpus of text to compute the probabilities

Page 19:

Large text corpora

• Corpus of Contemporary American English
  – http://corpus.byu.edu/coca/
• Google N-gram viewer
  – https://books.google.com/ngrams

Page 20:

Summary of Problem 1

• Word tokenization can be hard
• Splitting on spaces/non-alphanumeric characters may oversplit the text
• We can find co-occurrent tokens:
  – Build a language model from a large corpus
  – Check whether the pair of tokens appears more often than random using pointwise mutual information.

Page 21:

Language models …

• … have many, many applications
  – Tokenizing long strings
  – Word/query completion/suggestion
  – Spell checking
  – Machine translation
  – …

Page 22:

Problem 2

• A word may be represented in many forms
  – car, cars, car’s, cars’ → car
  – automation, automatic → automate
• Lemmatization: Problem of finding the correct dictionary headword form

Page 23:

Solution: Stemming

• Words are made up of
  – Stems: core word
  – Affixes: modifiers added (often with grammatical function)
• Stemming: reduce terms to their stems by crudely chopping off affixes
  – automation, automatic → automat

Page 24:

Porter’s algorithm for English

• Sequences of rules applied to words

Example rule sets:

/(.*)sses$/ → \1ss
/(.*)ies$/ → \1i
/(.*)ss$/ → \1ss
/(.*[^s])s$/ → \1

/(.*[aeiou]+.*)ing$/ → \1
/(.*[aeiou]+.*)ed$/ → \1
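The two example rule sets can be tried out with Python's `re` module. This is a sketch of only these six rules, not the full Porter stemmer; the names `RULES_PLURALS`, `RULES_ING_ED`, `apply_rule_set` and `stem` are made up for illustration. Within a rule set only the first matching rule fires, and the third rule is written to leave a trailing "ss" unchanged, as in Porter's published step 1a.

```python
import re

# Each rule is (pattern, replacement); \1 refers to the captured prefix.
RULES_PLURALS = [
    (r"(.*)sses$", r"\1ss"),
    (r"(.*)ies$", r"\1i"),
    (r"(.*)ss$", r"\1ss"),
    (r"(.*[^s])s$", r"\1"),
]
RULES_ING_ED = [
    (r"(.*[aeiou]+.*)ing$", r"\1"),  # strip -ing only if the remaining stem has a vowel
    (r"(.*[aeiou]+.*)ed$", r"\1"),   # same condition for -ed
]

def apply_rule_set(word, rules):
    """Apply the first rule whose pattern matches; leave the word alone otherwise."""
    for pattern, replacement in rules:
        match = re.match(pattern, word)
        if match:
            return match.expand(replacement)
    return word

def stem(word):
    """Run the two rule sets in sequence (a small subset of Porter's algorithm)."""
    word = apply_rule_set(word, RULES_PLURALS)
    return apply_rule_set(word, RULES_ING_ED)

for w in ["caresses", "ponies", "caress", "cats", "walking", "sing", "plastered"]:
    print(w, "->", stem(w))
```

The vowel condition is what keeps "sing" from being reduced to "s" while "walking" still becomes "walk".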

Page 25:

Any other problems?

• Same words that mean different things
  – Florence the person vs. Florence the city
  – Paris Hilton (person or hotel)
• Abbreviations
  – I.B.M. vs. International Business Machines
• Different words meaning the same thing
  – Big Apple vs. New York
• …

Page 26:

Any other problems?

• Same words that mean different things
  – Word Sense Disambiguation
• Abbreviations
  – Translations
• Different words meaning the same thing
  – Named Entity Recognition & Entity Resolution
• …

Page 27:

Outline

• Basic Text Processing
• Finding Salient Tokens
  – TFIDF scoring
• Document Similarity & Keyword Search

Page 28:

Summarizing text

Page 29:

Finding salient tokens (words)

• Most frequent tokens?

Page 30:

Document Frequency

• Intuition: Uninformative words appear in many documents (not just the one we are concerned about)
• Salient word:
  – High count within the document
  – Low count across documents

Page 31:

TF-IDF score

• Term Frequency (TF): $\mathrm{TF}(x) = \log_{10}(1 + c(x))$
  – c(x) is the number of times x appears in the document
• Inverse Document Frequency (IDF):
  – DF(x) is the number of documents x appears in.
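A small sketch of TF-IDF scoring. The TF part follows the slide, log10(1 + c(x)). The IDF formula is not legible in this transcript, so the code assumes the common form log10(N / DF(x)), where N is the number of documents; treat that choice, the function names and the toy corpus as assumptions rather than the course's definition.

```python
import math
from collections import Counter

def tf(token, doc_tokens):
    """TF(x) = log10(1 + c(x)), where c(x) counts x in the document."""
    return math.log10(1 + Counter(doc_tokens)[token])

def idf(token, corpus):
    """ASSUMED form: log10(N / DF(x)), with N the number of documents in the corpus."""
    df = sum(1 for doc in corpus if token in doc)
    return math.log10(len(corpus) / df) if df else 0.0

# Toy corpus of three tokenized documents, made up for this example.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the stock market fell".split(),
]
doc = corpus[0]
scores = {t: tf(t, doc) * idf(t, corpus) for t in set(doc)}

for token, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{token}: {score:.3f}")
```

Under this assumed IDF, "the" appears in every document and scores 0 even though it is the most frequent token in the document, which matches the intuition that salient words are frequent within a document but rare across documents.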

Page 32:

Back to summarization

• Simple heuristic:
  – Pick sentences S = {x1, x2, …, xk} with the highest:

$$\mathrm{Salience}(S) = \frac{1}{|S|} \sum_{x \in S} \mathrm{TF}(x) \cdot \mathrm{IDF}(x)$$
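The heuristic can be sketched as the average TF × IDF score over the tokens of a sentence. The function name `salience`, the precomputed score table and the two example sentences are hypothetical; the scores would normally come from a TF-IDF computation over a real corpus.

```python
def salience(sentence_tokens, tfidf):
    """Average TF-IDF score over the tokens of one sentence."""
    if not sentence_tokens:
        return 0.0
    # Tokens missing from the table contribute a score of 0.
    total = sum(tfidf.get(t, 0.0) for t in sentence_tokens)
    return total / len(sentence_tokens)

# Hypothetical precomputed TF(x) * IDF(x) values.
tfidf = {"the": 0.0, "market": 0.3, "fell": 0.4, "cat": 0.1, "sat": 0.2}
sentences = [["the", "cat", "sat"], ["the", "market", "fell"]]

# Pick the sentence with the highest salience as a one-sentence summary.
best = max(sentences, key=lambda s: salience(s, tfidf))
print(best)
```

Dividing by the sentence length keeps long sentences from winning simply because they contain more tokens.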

Page 33:

Outline

• Basic Text Processing
• Finding Salient Tokens
• Document Similarity & Keyword Search
  – Vector Space Model & Cosine Similarity
  – Inverted Indexes

Page 34:

Document Similarity

Page 35:

Vector Space Model

• Let V = {x1, x2, …, xn} be the set of all tokens (across all documents)
• A document is an n-dimensional vector
  D = [w1, w2, …, wn], where wi = TFIDF(xi, D)

Page 36:

Distance between documents

• Euclidean distance
  D1 = [w1, w2, …, wn]
  D2 = [y1, y2, …, yn]

$$d(D_1, D_2) = \sqrt{\sum_{i} (w_i - y_i)^2}$$

• Why?

[Figure: D1 and D2 as points, with the distance between them]

Page 37:

Cosine Similarity

[Figure: vectors D1 and D2 separated by the angle θ]

Page 38:

Cosine Similarity

• D1 = [w1, w2, …, wn]
• D2 = [y1, y2, …, yn]

$$\cos(D_1, D_2) = \frac{\sum_{i} w_i y_i}{\sqrt{\sum_{i} w_i^2}\,\sqrt{\sum_{i} y_i^2}}$$

  – Numerator: dot product of D1 and D2
  – Denominator: product of the L2 norms of D1 and D2
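A short sketch comparing the two measures on dense vectors. Cosine similarity is the dot product divided by the product of the L2 norms. The vectors are made up; `d2` is `d1` scaled by 2, as would roughly happen if a document were concatenated with a copy of itself. One common reason to prefer the angle over the distance is that it ignores this kind of length difference.

```python
import math

def euclidean(d1, d2):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((w - y) ** 2 for w, y in zip(d1, d2)))

def cosine(d1, d2):
    """Cosine similarity: dot product divided by the product of L2 norms."""
    dot = sum(w * y for w, y in zip(d1, d2))
    norm1 = math.sqrt(sum(w * w for w in d1))
    norm2 = math.sqrt(sum(y * y for y in d2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm1 * norm2)

d1 = [1.0, 2.0, 0.0]
d2 = [2.0, 4.0, 0.0]  # same direction as d1, twice the length
d3 = [0.0, 0.0, 3.0]  # shares no tokens with d1

print(euclidean(d1, d2), cosine(d1, d2))
print(euclidean(d1, d3), cosine(d1, d3))
```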

Page 39:

Keyword Search

• How to find documents that are similar to a keyword query?
• Intuition: Think of the query as another (very short) document

Page 40:

Keyword Search

• Simple Algorithm

  For every document D in the corpus
    Compute the similarity score d(q, D) between the query q and D
  Return the top-k highest scoring documents
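The simple algorithm can be sketched as a linear scan. This version assumes documents and the query are stored as sparse `{token: weight}` dictionaries and uses cosine similarity as the score; the names `top_k` and `cosine` and the weights are made up for illustration.

```python
import heapq
import math

def cosine(q, d):
    """Cosine similarity between two sparse vectors stored as {token: weight}."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

def top_k(query, corpus, k):
    """Score every document against the query and keep the k best."""
    scored = ((cosine(query, vec), doc_id) for doc_id, vec in corpus.items())
    return heapq.nlargest(k, scored)

# Hypothetical TF-IDF weights for three documents.
corpus = {
    1: {"caesar": 0.8, "rome": 0.5},
    2: {"cleopatra": 0.9, "egypt": 0.4},
    3: {"caesar": 0.3, "cleopatra": 0.3},
}
query = {"caesar": 1.0}
results = top_k(query, corpus, 2)
print(results)
```

Every query touches every document, so the cost grows linearly with the size of the corpus.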

Page 41:

Does it scale?

http://www.worldwidewebsize.com/

Page 42:

Inverted Indexes

• Let V = {x1, x2, …, xn} be the set of all tokens
• For each token, store <token, sorted list of documents the token appears in>
  – <“caesar”, [1,3,4,6,7,10,…]>
• How does this help?
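A minimal sketch of building such an index in memory. The function name `build_inverted_index`, the document ids and the toy documents are assumptions for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the sorted list of ids of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            postings[token].add(doc_id)  # a set avoids duplicate ids per token
    return {token: sorted(ids) for token, ids in postings.items()}

# Toy tokenized documents keyed by document id.
docs = {
    1: ["caesar", "came", "to", "rome"],
    3: ["caesar", "met", "cleopatra"],
    4: ["cleopatra", "ruled", "egypt"],
}
index = build_inverted_index(docs)
print(index["caesar"])
print(index["cleopatra"])
```

With the index in place, a query only needs to look at the lists for its own tokens instead of scanning every document.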

Page 43:

Using Inverted Lists

• Documents containing “caesar”
  – Use the inverted list to find documents containing “caesar”
  – What additional information should we keep to compute similarity between the query and documents?

Page 44:

Using Inverted Lists

• Documents containing “caesar” AND “cleopatra”
  – Return documents in the intersection of the two inverted lists.
  – Why is the inverted list sorted on document id?
• OR? NOT?
  – Union and difference, respectively.
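One reason to keep the lists sorted by document id is that an AND query can then be answered with a single merge-style pass over both lists. The sketch below assumes two posting lists of integer ids; the function name `intersect` and the example lists are made up.

```python
def intersect(a, b):
    """Intersect two posting lists sorted by document id in O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])  # document contains both tokens
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1            # advance the list with the smaller id
        else:
            j += 1
    return out

caesar = [1, 3, 4, 6, 7, 10]
cleopatra = [2, 3, 6, 8, 10, 12]
print(intersect(caesar, cleopatra))
```

Union and difference can be computed with the same two-pointer walk by changing what is appended in each branch.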

Page 45:

Many other things in a modern search engine …

• Maintain positional information to answer phrase queries
• Scoring is not only based on token similarity
  – Importance of Webpages: PageRank (in later classes)
• User Feedback
  – Clicks and query reformulations

Page 46:

Summary

• Word counts are very useful when analyzing text
  – Need good algorithms for tokenization, stemming and other normalizations
• Algorithm for finding word co-occurrence
  – Language Models
  – Pointwise Mutual Information

Page 47:

Summary (contd.)

• Raw counts are not sufficient to find salient tokens in a document
  – Term Frequency × Inverse Document Frequency (TFIDF) scoring
• Keyword Search
  – Use Cosine Similarity over TFIDF scores to compute similarity
  – Use Inverted Indexes to speed up processing.