Exploring Text: Zipf’s Law and Heaps’ Law

Page 1: Exploring Text: Zipf’s Law and Heaps’ Law. (a) Distribution of sorted word frequencies (Zipf’s law) (b) Distribution of size of the vocabulary.

Exploring Text: Zipf’s Law and Heaps’ Law

Page 2

[Figure: (a) distribution of sorted word frequencies (Zipf’s law); (b) growth of the size of the vocabulary]

Zipf’s and Heaps’ distributions

Page 3

Sample Word Frequency Data (from B. Croft, UMass)

Page 4

Predicting Occurrence Frequencies

• By Zipf’s law, a word appearing n times has rank r_n = AN/n.

• If several words occur n times, assume rank r_n applies to the last of these.

• Therefore, r_n words occur n or more times, and r_(n+1) words occur n+1 or more times.

• So the number of words appearing exactly n times is:

  I_n = r_n − r_(n+1) = AN/n − AN/(n+1) = AN / (n(n+1))

• Fraction of words with frequency n is:

  I_n / D = 1 / (n(n+1))

  where D = AN is the total number of distinct words (the rank of the last word, which occurs exactly once).

• Fraction of words appearing only once is therefore 1/(1·2) = ½.
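The result above is easy to check numerically. A minimal sketch in plain Python (no corpus needed; the function name is ours):

```python
# Under Zipf's law, a word occurring n times has rank r_n = A*N/n, so
# I_n = r_n - r_(n+1) = A*N / (n*(n+1)) words occur exactly n times.
# Dividing by the total number of distinct words D = A*N gives a
# fraction 1/(n*(n+1)) that is independent of the constants A and N.

def fraction_with_frequency(n):
    """Fraction of distinct words occurring exactly n times (Zipf model)."""
    return 1.0 / (n * (n + 1))

print(fraction_with_frequency(1))  # 0.5 -> half the words occur only once
print(fraction_with_frequency(2))  # ~0.1667

# The fractions telescope, so they sum to 1 over all n:
print(sum(fraction_with_frequency(n) for n in range(1, 10_000)))  # ~0.9999
```

The telescoping sum is a quick sanity check that these fractions really form a distribution over frequencies.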

Page 5

Zipf’s Law Impact on Language Analysis

• Good News: Stopwords account for a large fraction of all word occurrences, so eliminating them greatly reduces the size of the text to be indexed.

• Bad News: For most words, gathering sufficient data for meaningful statistical analysis (e.g., correlation analysis for query expansion) is difficult, since they are extremely rare.

Page 6

Vocabulary Growth

• How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus?

• This determines how the size of the inverted index will scale with the size of the corpus.

• Vocabulary not really upper-bounded due to proper names, typos, etc.

Page 7

Heaps’ Law

• If V is the size of the vocabulary and n is the length of the corpus in words, then:

  V = K · n^β, with constants K and 0 < β < 1

• Typical constants:
  – K ≈ 10–100
  – β ≈ 0.4–0.6 (approx. square-root)
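A quick way to get a feel for these constants is to evaluate the formula directly. A small sketch, assuming illustrative mid-range values K = 44 and β = 0.5 (picked from the quoted ranges, not fitted to any real corpus):

```python
def heaps(n, K=44.0, beta=0.5):
    """Heaps' law: predicted vocabulary size V = K * n**beta."""
    return K * n ** beta

for n in (10_000, 1_000_000, 100_000_000):
    print(f"{n:>12,} tokens -> ~{heaps(n):>9,.0f} distinct words")

# With beta = 0.5, doubling the corpus grows the vocabulary only by a
# factor of sqrt(2) ~ 1.41: growth without bound, but ever slower.
```

The sublinear exponent is what makes vocabulary (and hence dictionary) size manageable even for very large collections.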

Page 8

Heaps’ Law Data

Page 9

Occurrence Frequency Data (from B. Croft, UMass)

Page 10

Text properties (formalized)

Sample word frequency data

Page 11

Zipf’s Law

• We use a few words very frequently and rarely use most other words

• The product of the frequency of a word and its rank is approximately the same as the product of the frequency and rank of any other word.

• Deviations usually occur at the beginning and at the end of the table/graph

Page 12

Zipf’s Law

• Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table.

• Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.

• For example, in the Brown Corpus of American English text, the word "the" is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million).

• The second-place word "of" accounts for slightly over 3.5% of words (36,411 occurrences), followed by "and" (28,852 occurrences).

• Only 135 vocabulary items are needed to account for half the Brown Corpus.[3]
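That "135 words cover half the corpus" figure is roughly what Zipf's law itself predicts. A small check, assuming p_r ≈ A/r with A ≈ 0.1 (a commonly quoted empirical constant for English, not a number taken from this slide):

```python
# Under Zipf's law, the rank-r word has probability p_r ~ A / r.
A = 0.1  # assumed empirical constant for English text

def coverage(top_k):
    """Fraction of all word occurrences covered by the top_k ranked words."""
    return sum(A / r for r in range(1, top_k + 1))

print(round(coverage(135), 3))  # ~0.549 -> the top 135 words cover about half
```

The partial harmonic sum grows like 0.1·(ln k + 0.577), so it takes exponentially more vocabulary items to cover each additional slice of the text.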

Page 13

Zipf’s Law

• The most common 20 words in English are listed in the following table.

• The table is based on the Brown Corpus, a careful study of a million words from a wide variety of sources including newspapers, books, magazines, fiction, government documents, comedy and academic publications.

Page 14

Table of Top 20 frequently occurring words in English

Rank  Word   Frequency   % Frequency   Theoretical Zipf
  1   the      69970        6.8872          69970
  2   of       36410        3.5839          36470
  3   and      28854        2.8401          24912
  4   to       26154        2.5744          19009
  5   a        23363        2.2996          15412
  6   in       21345        2.1010          12985
  7   that     10594        1.0428          11233
  8   is       10102        0.9943           9908
  9   was       9815        0.9661           8870
 10   he        9542        0.9392           8033
 11   for       9489        0.9340           7345
 12   it        8760        0.8623           6768
 13   with      7290        0.7176           6277
 14   as        7251        0.7137           5855
 15   his       6996        0.6886           5487
 16   on        6742        0.6636           5164
 17   be        6376        0.6276           4878
 18   at        5377        0.5293           4623
 19   by        5307        0.5224           4394
 20   I         5180        0.5099           4187

Page 15

Plot of Top 20 frequently occurring words in English

Page 16

Zipf’s Law

• Rank (r): The numerical position of a word in a list sorted by decreasing frequency (f ).

• Zipf (1949) “discovered” that:

  f ∝ 1/r,  i.e.,  f · r = k (for constant k)

• If the probability of the word of rank r is p_r and N is the total number of word occurrences:

  p_r = f / N = A / r (for constant A)

Page 17

Does Real Data Fit Zipf’s Law?

• A law of the form y = k·x^c is called a power law.

• Zipf’s law is a power law with c = −1.

• On a log-log plot, power laws give a straight line with slope c:

  log(y) = log(k·x^c) = log(k) + c·log(x)

• Zipf is quite accurate except for very high and low rank.
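The straight-line claim is easy to verify numerically. A short sketch, using k = 100,000 (the Brown Corpus fit shown a couple of slides later) and c = −1 for Zipf's case:

```python
import math

# A power law y = k * x**c appears as a straight line with slope c
# in log-log coordinates: log y = log k + c * log x.
k, c = 100_000.0, -1.0  # c = -1 is Zipf's law

xs = [1, 10, 100, 1000]
ys = [k * x ** c for x in xs]

# Slope between consecutive points of the log-log plot:
slopes = [
    (math.log(ys[i + 1]) - math.log(ys[i])) / (math.log(xs[i + 1]) - math.log(xs[i]))
    for i in range(len(xs) - 1)
]
print(slopes)  # every slope equals c = -1.0
```

On real rank-frequency data the slope is only approximately −1, with the deviations at very high and very low rank mentioned above.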

Page 18

Top 2000 English words using a log-log scale

Page 19

Fit to Zipf for Brown Corpus

k = 100,000

Page 20

Plot of word frequency in Wikipedia-dump 2006-11-27

• The plot is made in log-log coordinates.

• x is rank of a word in the frequency table;

• y is the total number of the word’s occurrences.

• The most popular words are “the”, “of”, and “and”, as expected.

Page 21

Zipf’s Law

• The same relationship occurs in many other rankings unrelated to language, such as:
  • Corporation sizes
  • Calls to computer operating systems
  • Colors in images
    – as the basis of most approaches to image compression
  • City populations (a small number of large cities, a larger number of smaller cities)
  • Wealth distribution (a small number of people have large amounts of money; large numbers of people have small amounts of money)
  • Popularity of web pages in websites

Page 22

Zipf’s Law

• Authorship tests

• Textual analysis can be used to demonstrate the authenticity of disputed works.

• Each author has their own preference for using certain words, so one technique compares the occurrence of different words in the uncertain text with that of an author’s known works.

• The counted words are ranked (the most common is number one and the rarest is last) and then plotted on a graph with their frequency of occurrence up the side.

• Comparing the Zipf graphs of two different pieces of writing, paying attention to the position of selected words, reveals whether they were both composed by the same author.

Page 23

Heaps’ Law

• Vocabulary (the number of word types) increases as the number of word tokens increases.

• An empirical law which describes the number of distinct words in a document (or set of documents) as a function of the document length (the type-token relation). It can be formulated as

  V_R = K · n^β

  – where V_R is the number of distinct words in an instance text of size n
  – K and β are free parameters determined empirically

• With English text corpora, typically K is between 10 and 100, and β is between 0.4 and 0.6.

• As the corpus grows, so does the vocabulary size
  – Fewer new words when the corpus is already large
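The "fewer new words as the corpus grows" behaviour shows up even on simulated data. A sketch that samples a synthetic corpus from a Zipf-like distribution (pool size, corpus length, and segment size are all made up for illustration):

```python
import random

random.seed(0)

# Sample a synthetic "corpus" from a Zipf-like distribution over a
# large word pool (simulated data, not a real corpus).
POOL = 50_000
weights = [1.0 / r for r in range(1, POOL + 1)]
corpus = random.choices(range(POOL), weights=weights, k=200_000)

seen, growth = set(), []
for i, token in enumerate(corpus, 1):
    seen.add(token)
    if i % 50_000 == 0:
        growth.append((i, len(seen)))

for n, v in growth:
    print(f"{n:>7,} tokens: {v:>6,} distinct")
# The vocabulary keeps growing, but more slowly, as Heaps' law predicts.
```

Fitting log V against log n for such measurements is how K and β are estimated for a real collection.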

Page 24

A typical Heaps-law plot

• The x-axis represents the text size

• The y-axis represents the number of distinct vocabulary elements present in the text.

• Compare the values on the two axes: the vocabulary grows far more slowly than the text size.

Page 25

AP89 Example

Page 26

Heaps’ Law Predictions

• Predictions for TREC collections are accurate for large numbers of words
  – e.g., first 10,879,522 words of the AP89 collection scanned
  – prediction is 100,151 unique words
  – actual number is 100,024

• Predictions for small numbers of words (i.e., < 1000) are much worse
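These numbers can be reproduced from the formula. A sketch using the AP89 parameter fit (K = 62.95, β = 0.455) reported in Croft, Metzler, and Strohman, Search Engines: Information Retrieval in Practice, which the figures on this slide appear to come from:

```python
# Heaps' law prediction for AP89. K and beta below are the AP89 fit
# reported by Croft, Metzler, and Strohman (an external source, not
# stated on this slide).
K, beta = 62.95, 0.455
n = 10_879_522  # words scanned from AP89

predicted = K * n ** beta
print(f"predicted vocabulary: {predicted:,.0f}")  # close to the 100,151 quoted above
print("actual vocabulary:    100,024")
```

The fit is within about 0.1% of the actual count at this corpus size, consistent with the slide's claim that the law is accurate for large n.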

Page 27

GOV2 (Web) Example

Page 28

Web Example

• Heaps’ Law works with very large corpora
  – new words keep occurring even after seeing 30 million words!
  – parameter values differ from those of the typical newswire corpora used in competitions

• New words come from a variety of sources
  – spelling errors, invented words (e.g., product and company names), code, other languages, email addresses, etc.

• Search engines must deal with these large and growing vocabularies