What can text statistics reveal? {week 10}

Rensselaer Polytechnic Institute, CSCI-4220 – Network Programming, David Goldschmidt, Ph.D. From Search Engines: Information Retrieval in Practice, 1st edition, by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0


Page 1: What can text  statistics reveal? {week  10}

What can text statistics reveal? {week 10}

Rensselaer Polytechnic Institute
CSCI-4220 – Network Programming
David Goldschmidt, Ph.D.

from Search Engines: Information Retrieval in Practice, 1st edition by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0

Page 2: What can text  statistics reveal? {week  10}

Text transformation

How do we best convert documents to their index terms?

How do we make acquired documents searchable?

Page 3: What can text  statistics reveal? {week  10}

Find/Replace

Simplest approach is find, which requires no text transformation

Useful in user applications, but not in search (why?)

Optional transformation handled during the find operation: case sensitivity

Page 4: What can text  statistics reveal? {week  10}

Text statistics (i)

English documents are predictable:

Top two most frequently occurring words are “the” and “of” (10% of word occurrences)

Top six most frequently occurring words account for 20% of word occurrences

Top fifty most frequently occurring words account for 50% of word occurrences

Given all unique words in a (large) document, approximately 50% occur only once
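The statistics above can be measured on any text. The following is a minimal Python sketch, not taken from the slides or the book: the function name `frequency_summary`, the simple regular-expression tokenizer, and the toy sentence are all assumptions made for illustration.

```python
import re
from collections import Counter


def frequency_summary(text, top_k=2):
    """Summarize how skewed the word distribution of `text` is.

    Returns (top_share, once_share):
      top_share  - fraction of all word occurrences covered by the top_k words
      once_share - fraction of unique words that occur exactly once
    """
    # Assumption: a "word" is a run of letters/apostrophes, case-folded.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        raise ValueError("text contains no words")
    counts = Counter(words)
    total = len(words)
    top_share = sum(freq for _, freq in counts.most_common(top_k)) / total
    once_share = sum(1 for freq in counts.values() if freq == 1) / len(counts)
    return top_share, once_share


# Toy input, far too small to show the percentages quoted on the slide.
sample = "the cat sat on the mat and the dog sat by the door of the house"
top_share, once_share = frequency_summary(sample, top_k=2)
print(f"top-2 share: {top_share:.2%}, words occurring once: {once_share:.2%}")
```

On a large English collection, calling the same function with `top_k` set to 2, 6, and 50 should give shares in the neighborhood of the figures listed above; a single sentence will not.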

Page 5: What can text  statistics reveal? {week  10}

Text statistics (ii)

Zipf’s law:

Rank words in order of decreasing frequency

The rank (r) of a word times its frequency (f) is approximately equal to a constant (k)

r × f = k

In other words, the frequency of the rth most common word is inversely proportional to r

George Kingsley Zipf (1902-1950)

Page 6: What can text  statistics reveal? {week  10}

Text statistics (iii)

The probability of occurrence (P_r) of a word is the word frequency divided by the total number of words in the document

Revise Zipf’s law as: r × P_r = c

For English, c ≈ 0.1
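One way to check the relationship on a given text is to rank the words and tabulate rank times probability. The sketch below is illustrative only: `zipf_products` is a hypothetical helper, and the toy corpus is synthetic, built so that rank × frequency is exactly constant. It is not English text, so its constant is not 0.1.

```python
from collections import Counter


def zipf_products(words, top_n=10):
    """Return (rank, word, frequency, rank * probability) for the top_n words."""
    counts = Counter(words)
    total = sum(counts.values())
    rows = []
    # most_common() yields words in order of decreasing frequency.
    for rank, (word, freq) in enumerate(counts.most_common(top_n), start=1):
        probability = freq / total
        rows.append((rank, word, freq, rank * probability))
    return rows


# Synthetic corpus (assumption): frequencies 60, 30, 20, 15, 12, so r * f = 60.
toy_corpus = ["a"] * 60 + ["b"] * 30 + ["c"] * 20 + ["d"] * 15 + ["e"] * 12
rows = zipf_products(toy_corpus)
for rank, word, freq, product in rows:
    print(rank, word, freq, round(product, 3))
```

For real English text the last column would be expected to hover around a constant for mid-ranked words rather than match it exactly.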

Page 7: What can text  statistics reveal? {week  10}

Text statistics (iv)

Verify Zipf’s law using the AP89 dataset:

Collection of Associated Press (AP) news stories from 1989 (available at http://trec.nist.gov):

Total documents                 84,678
Total word occurrences          39,749,179
Vocabulary size                 198,763
Words occurring > 1000 times    4,169
Words occurring once            70,064
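A few ratios follow directly from these figures. The short Python sketch below only does arithmetic on the numbers listed for AP89; the variable names are chosen here for illustration.

```python
# AP89 figures as listed on the slide.
total_documents = 84_678
total_word_occurrences = 39_749_179
vocabulary_size = 198_763
words_over_1000 = 4_169
words_once = 70_064

# Average document length in word occurrences.
avg_doc_length = total_word_occurrences / total_documents

# Share of the vocabulary that occurs exactly once, and more than 1000 times.
once_share = words_once / vocabulary_size
frequent_share = words_over_1000 / vocabulary_size

print(f"average words per document: {avg_doc_length:.1f}")
print(f"vocabulary occurring once: {once_share:.1%}")
print(f"vocabulary occurring > 1000 times: {frequent_share:.1%}")
```

In this collection about 35% of the vocabulary occurs once, somewhat below the rough 50% rule of thumb for a large document, while only about 2% of the vocabulary occurs more than 1000 times.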

Page 8: What can text  statistics reveal? {week  10}

Text statistics (v)

Top 50 words of AP89

Page 9: What can text  statistics reveal? {week  10}

Vocabulary growth (i)

As the corpus grows, so does vocabulary size

Fewer new words when corpus is already large

The relationship between corpus size (n) and vocabulary size (v) was defined empirically by Heaps (1978) and is called Heaps’ law:

v = k × n^β

Constants k and β vary

Typically 10 ≤ k ≤ 100 and β ≈ 0.5
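The formula can be evaluated directly. The sketch below assumes k = 50 and β = 0.5, values picked from the typical ranges above purely for illustration; they are not fitted to any dataset, and `heaps_vocabulary` is a hypothetical helper name.

```python
def heaps_vocabulary(n, k=50.0, beta=0.5):
    """Predicted vocabulary size v = k * n**beta for a corpus of n word occurrences."""
    if n < 0:
        raise ValueError("corpus size must be non-negative")
    return k * n ** beta


# Illustrative constants only (assumption): k = 50, beta = 0.5.
small = heaps_vocabulary(1_000_000)
large = heaps_vocabulary(4_000_000)
print(f"n = 1,000,000 -> v ~ {small:,.0f}")
print(f"n = 4,000,000 -> v ~ {large:,.0f}")
```

With β = 0.5, quadrupling the corpus only doubles the predicted vocabulary, which matches the observation that fewer new words appear once the corpus is already large.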

Page 10: What can text  statistics reveal? {week  10}

Vocabulary growth (ii)

note values of k and β

Page 11: What can text  statistics reveal? {week  10}

Vocabulary growth (iii)

Web pages crawled from .gov in early 2004

Page 12: What can text  statistics reveal? {week  10}

Estimating result set size (i)

Word occurrence statistics can be used to estimate result set size of a user query

Aside from stop words, how many pages contain all of the query terms?
▪ To figure this out, first assume that words occur independently of one another
▪ Also assume that the search engine knows N, the number of documents it indexes

Page 13: What can text  statistics reveal? {week  10}

Estimating result set size (ii)

Given three query terms a, b, and c

Probability of a document containing all three is the product of individual probabilities for each query term:

P(a ∩ b ∩ c) = P(a) × P(b) × P(c)

P(a ∩ b ∩ c) is the joint probability of events a, b, and c occurring

Page 14: What can text  statistics reveal? {week  10}

Estimating result set size (iii)

We assume the search engine knows the number of documents that a word occurs in

Call these n_a, n_b, and n_c
▪ Note that the book uses f_a, f_b, and f_c

Estimate individual query term probabilities:

P(a) = n_a / N    P(b) = n_b / N    P(c) = n_c / N

Page 15: What can text  statistics reveal? {week  10}

Estimating result set size (iv)

Given P(a), P(b), and P(c), we estimate the result set size as:

n_abc = N × (n_a / N) × (n_b / N) × (n_c / N)

n_abc = (n_a × n_b × n_c) / N^2

This estimation sounds good, but is lacking due to our query term independence assumption
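The independence-based estimate is a one-line computation. In the sketch below, `estimate_result_size` is a hypothetical helper, and the document counts in the example are made-up round numbers chosen so the arithmetic is easy to follow; they do not come from any real collection.

```python
def estimate_result_size(doc_counts, total_docs):
    """Estimate how many documents contain every query term.

    doc_counts - number of documents each query term occurs in
    total_docs - N, the number of documents indexed
    Assumes the terms occur independently of one another.
    """
    if total_docs <= 0:
        raise ValueError("total_docs must be positive")
    estimate = float(total_docs)
    for count in doc_counts:
        estimate *= count / total_docs  # multiply by P(term) = n_term / N
    return estimate


# Hypothetical counts: N = 1000, terms occur in 100, 200, and 50 documents.
estimate = estimate_result_size([100, 200, 50], total_docs=1000)
print(estimate)  # same value as (100 * 200 * 50) / 1000**2
```

Because terms in a real query tend to be related, they co-occur more often than independence predicts, so this estimate is usually too low.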

Page 16: What can text  statistics reveal? {week  10}

Estimating result set size (v)

Using the GOV2 dataset with N = 25,205,179

Poor results, because of the query term independence assumption

Could use word co-occurrence data...

Page 17: What can text  statistics reveal? {week  10}

Estimating result set size (vi)

Extrapolate based on the size of the current result set:

The current result set is the subset of documents that have been ranked thus far

Let C be the number of documents found thus far containing all the query words

Let s be the proportion of the total documents ranked (use least frequently occurring term)

Estimate result set size via n_abc = C / s

Page 18: What can text  statistics reveal? {week  10}

Estimating result set size (vii)

Given example query: tropical fish aquarium

Least frequently occurring term is aquarium (which occurs in 26,480 documents)

After ranking 3,000 documents, 258 documents contain all three query terms

Thus, n_abc = C / s = 258 / (3,000 ÷ 26,480) = 2,277

After processing 20% of the documents, the estimate is 1,778
▪ Which overshoots actual value of 1,529
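The extrapolation in this example can be reproduced in a few lines. The sketch below uses the figures from the slide; the function and parameter names are assumptions made for illustration.

```python
def extrapolate_result_size(matches_so_far, docs_ranked, docs_with_rarest_term):
    """Extrapolate the result set size as C / s.

    matches_so_far        - C, documents ranked so far that contain all query terms
    docs_ranked           - documents ranked so far
    docs_with_rarest_term - documents containing the least frequent query term
    """
    if docs_ranked <= 0 or docs_with_rarest_term <= 0:
        raise ValueError("document counts must be positive")
    s = docs_ranked / docs_with_rarest_term  # proportion ranked so far
    return matches_so_far / s


# Figures from the "tropical fish aquarium" example.
estimate = extrapolate_result_size(258, 3_000, 26_480)
print(round(estimate))
```

The estimate depends on how representative the documents ranked so far are, which is why it changes as more of the collection is processed.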