9/13/2001 Information Organization and Retrieval
Vector Representation, Term Weights and Clustering
Ray Larson & Warren Sack
University of California, Berkeley
School of Information Management and Systems
SIMS 202: Information Organization and Retrieval
Lecture authors: Marti Hearst, Ray Larson & Warren Sack
Last Time
• Content Analysis:
  – Transformation of raw text into more computationally useful forms
• Words in text collections exhibit interesting statistical properties
  – Zipf distribution
  – Word co-occurrences non-independent
Document Processing Steps
[Figure: document processing pipeline]
Zipf Distribution

[Figure: histogram of word frequencies by rank bin, frequencies ranging from 0 to 350]

Rank = order of words’ frequency of occurrence. The product of the frequency of words (f) and their rank (r) is approximately constant:

  f = C × (1/r),  where C ≈ N/10
Zipf Distribution
• The Important Points:
  – a few elements occur very frequently
  – a medium number of elements have medium frequency
  – many elements occur very infrequently
Zipf Distribution
[Figure: the same Zipf curve plotted on linear and log scales]
Statistical Independence
Two events x and y are statistically independent if the product of the probabilities of each happening individually equals the probability of their happening together:

  P(x) × P(y) = P(x, y)
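The independence test can be checked directly from document counts. A minimal sketch with hypothetical counts (the words and numbers below are invented for illustration):

```python
# Hypothetical counts over a collection of 1000 documents:
# event x = "nova" appears in a doc, event y = "galaxy" appears in a doc.
n_docs = 1000
n_x = 200   # docs containing "nova"
n_y = 100   # docs containing "galaxy"
n_xy = 80   # docs containing both

p_x = n_x / n_docs
p_y = n_y / n_docs
p_xy = n_xy / n_docs

# Independence would require P(x) * P(y) == P(x, y).
print(round(p_x * p_y, 6))  # 0.02 expected under independence
print(p_xy)                 # 0.08 observed: far more co-occurrence than chance
```

Since the observed joint probability exceeds the product of the marginals, the two word events are dependent, which is the typical case in real text.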
Today
• Document Vectors
• Inverted Files
• Vector Space Model
• Term Weighting
• Clustering
Document Vectors
• Documents are represented as “bags of words”
• Represented as vectors when used computationally
  – A vector is like an array of (floating point) numbers
  – Has direction and magnitude
  – Each vector holds a place for every term in the collection
  – Therefore, most vectors are sparse
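As a minimal sketch of the idea (the vocabulary and text below are invented for illustration), a bag-of-words vector can be stored sparsely, keeping only the nonzero counts:

```python
from collections import Counter

def doc_vector(text, vocabulary):
    """Sparse bag-of-words vector: only nonzero term counts are stored."""
    counts = Counter(text.lower().split())
    return {term: counts[term] for term in vocabulary if counts[term] > 0}

vocab = ["nova", "galaxy", "heat", "hollywood", "film", "role", "diet", "fur"]
vec = doc_vector("nova nova galaxy heat nova", vocab)
print(vec)  # {'nova': 3, 'galaxy': 1, 'heat': 1}
```

Storing only nonzero entries is what makes very high-dimensional term spaces practical.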
Document Vectors: One location for each word.

     nova  galaxy  heat  h’wood  film  role  diet  fur
A     10     5      3
B      5    10
C                   10      8     7
D                    9     10     5
E                                10    10
F                           9    10
G                                       5     7     9
H                                 6    10     2     8
I                           7     5           1     3

“Nova” occurs 10 times in text A; “Galaxy” occurs 5 times in text A; “Heat” occurs 3 times in text A. (Blank means 0 occurrences.)
Document Vectors: One location for each word.

(Same term-frequency matrix as above.)

“Hollywood” occurs 7 times in text I; “Film” occurs 5 times in text I; “Diet” occurs 1 time in text I; “Fur” occurs 3 times in text I.
Document Vectors

(Same term-frequency matrix as above; A–I are document ids.)
We Can Plot the Vectors
[Figure: 2-D plot with axes “Star” and “Diet”: a doc about astronomy and a doc about movie stars lie toward the Star axis; a doc about mammal behavior lies toward the Diet axis.]
Documents in 3D Space
[Figure: documents plotted as vectors in a 3-D term space]
Content Analysis Summary
• Content Analysis: transforming raw text into more computationally useful forms
• Words in text collections exhibit interesting statistical properties
  – Word frequencies have a Zipf distribution
  – Word co-occurrences exhibit dependencies
• Text documents are transformed into vectors
  – Pre-processing includes tokenization, stemming, collocations/phrases
  – Documents occupy multi-dimensional space
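A toy version of that pre-processing pipeline. The tiny suffix-stripping stemmer below is a crude stand-in for a real stemmer such as Porter’s, not the course’s actual tooling:

```python
import re

def tokenize(text):
    """Lowercase and split into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(token):
    # Toy stand-in for a real stemmer (e.g., Porter): strip a few suffixes.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = [crude_stem(t) for t in tokenize("Clustering finds groups in documents.")]
print(tokens)  # ['cluster', 'find', 'group', 'in', 'document']
```

A real pipeline would also detect collocations/phrases and drop stopwords before the vectors are built.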
[Figure: retrieval pipeline — an information need is parsed into a query; collections of text input are pre-processed into an index; documents are ranked against the query. How is the index constructed?]
Inverted Index
• This is the primary data structure for text indexes
• Main Idea: invert documents into a big index
• Basic steps:
  – Make a “dictionary” of all the tokens in the collection
  – For each token, list all the docs it occurs in
  – Do a few things to reduce redundancy in the data structure
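The basic steps above can be sketched as follows (a simplified illustration: real systems also compress postings and handle punctuation, stopwords, and so on):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: {doc_id: within-doc frequency}}."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for token in text.lower().split():   # tokens form the "dictionary"
            index[token][doc_id] += 1        # postings with term frequency
    return index

docs = {1: "now is the time", 2: "the time was past midnight"}
index = build_inverted_index(docs)
print(dict(index["time"]))  # {1: 1, 2: 1}
print(dict(index["dark"]))  # {} -- term absent from both docs
```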
Inverted Indexes
We have seen “vector files” conceptually. An inverted file is a vector file “inverted” so that rows become columns and columns become rows:

docs  t1  t2  t3
D1     1   0   1
D2     1   0   0
D3     0   1   1
D4     1   0   0
D5     1   1   1
D6     1   1   0
D7     0   1   0
D8     0   1   0
D9     0   0   1
D10    0   1   1

Terms  D1  D2  D3  D4  D5  D6  D7  …
t1      1   1   0   1   1   1   0
t2      0   0   1   0   1   1   1
t3      1   0   1   0   1   0   0
How Are Inverted Files Created?

• Documents are parsed to extract tokens. These are saved with the document ID.

Doc 1: Now is the time for all good men to come to the aid of their country

Doc 2: It was a dark and stormy night in the country manor. The time was past midnight.

Term      Doc #
now         1
is          1
the         1
time        1
for         1
all         1
good        1
men         1
to          1
come        1
to          1
the         1
aid         1
of          1
their       1
country     1
it          2
was         2
a           2
dark        2
and         2
stormy      2
night       2
in          2
the         2
country     2
manor       2
the         2
time        2
was         2
past        2
midnight    2
How Inverted Files Are Created

• After all documents have been parsed, the inverted file is sorted alphabetically.

Term      Doc #
a           2
aid         1
all         1
and         2
come        1
country     1
country     2
dark        2
for         1
good        1
in          2
is          1
it          2
manor       2
men         1
midnight    2
night       2
now         1
of          1
past        2
stormy      2
the         1
the         1
the         2
the         2
their       1
time        1
time        2
to          1
to          1
was         2
was         2
How Inverted Files Are Created

• Multiple term entries for a single document are merged.
• Within-document term frequency information is compiled.

Term      Doc #  Freq
a           2     1
aid         1     1
all         1     1
and         2     1
come        1     1
country     1     1
country     2     1
dark        2     1
for         1     1
good        1     1
in          2     1
is          1     1
it          2     1
manor       2     1
men         1     1
midnight    2     1
night       2     1
now         1     1
of          1     1
past        2     1
stormy      2     1
the         1     2
the         2     2
their       1     1
time        1     1
time        2     1
to          1     2
was         2     2
How Inverted Files Are Created

• Then the file can be split into:
  – a Dictionary file, and
  – a Postings file
How Inverted Files Are Created

Dictionary                         Postings
Term      N docs  Tot Freq         Doc #  Freq
a           1       1                2     1
aid         1       1                1     1
all         1       1                1     1
and         1       1                2     1
come        1       1                1     1
country     2       2                1     1
                                     2     1
dark        1       1                2     1
for         1       1                1     1
good        1       1                1     1
in          1       1                2     1
is          1       1                1     1
it          1       1                2     1
manor       1       1                2     1
men         1       1                1     1
midnight    1       1                2     1
night       1       1                2     1
now         1       1                1     1
of          1       1                1     1
past        1       1                2     1
stormy      1       1                2     1
the         2       4                1     2
                                     2     2
their       1       1                1     1
time        2       2                1     1
                                     2     1
to          1       2                1     2
was         1       2                2     2
Inverted Indexes
• Permit fast search for individual terms
• For each term, you get a list consisting of:
  – document ID
  – frequency of term in doc (optional)
  – position of term in doc (optional)
• These lists can be used to solve Boolean queries:
  – country -> d1, d2
  – manor -> d2
  – country AND manor -> d2
• Also used for statistical ranking algorithms
How Inverted Files Are Used

(Using the Dictionary and Postings files shown above.)

Boolean query: “time” AND “dark”
• 2 docs with “time” in the dictionary -> IDs 1 and 2 from the postings file
• 1 doc with “dark” in the dictionary -> ID 2 from the postings file
• Therefore, only doc 2 satisfies the query.
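The Boolean AND above is just an intersection of posting lists. A sketch using a few terms from the two example documents, with the posting sets written out by hand:

```python
# Posting lists for a few terms from Doc 1 and Doc 2 above.
postings = {
    "time": {1, 2},
    "dark": {2},
    "country": {1, 2},
    "manor": {2},
}

def boolean_and(term1, term2, postings):
    """AND = set intersection of the two terms' posting lists."""
    return postings.get(term1, set()) & postings.get(term2, set())

print(boolean_and("time", "dark", postings))      # {2}
print(boolean_and("country", "manor", postings))  # {2}
```

OR and NOT map to set union and difference in the same way.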
Vector Space Model
• Documents are represented as vectors in term space
  – Terms are usually stems
  – Documents represented by binary vectors of terms
• Queries are represented the same as documents
• Query and document weights are based on the length and direction of their vectors
• A vector distance measure between the query and documents is used to rank retrieved documents
• This makes partial matching possible
Documents in 3D Space
[Figure: documents as points in a 3-D term space]
Assumption: Documents that are “close together” in space are similar in meaning.
Vector Space Documents and Queries

docs  t1  t2  t3   RSV = Q·Di
D1     1   0   1       4
D2     1   0   0       1
D3     0   1   1       5
D4     1   0   0       1
D5     1   1   1       6
D6     1   1   0       3
D7     0   1   0       2
D8     0   1   0       2
D9     0   0   1       3
D10    0   1   1       5
D11    1   0   1       4
Q      1   2   3

[Figure: documents D1–D11 and the query Q plotted in the t1–t2–t3 term space, showing Boolean term combinations]

Q is a query, also represented as a vector
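The RSV column is just the inner product of Q with each document vector. A sketch using three rows of the table:

```python
def rsv(query, doc):
    """Retrieval status value: inner product of query and document vectors."""
    return sum(q * d for q, d in zip(query, doc))

Q = (1, 2, 3)
docs = {"D1": (1, 0, 1), "D3": (0, 1, 1), "D5": (1, 1, 1)}
ranked = sorted(docs, key=lambda name: rsv(Q, docs[name]), reverse=True)
print([(name, rsv(Q, docs[name])) for name in ranked])
# [('D5', 6), ('D3', 5), ('D1', 4)]
```

Sorting by RSV gives the ranked retrieval order, with partial matches (documents sharing only some query terms) still receiving nonzero scores.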
Documents in Vector Space
[Figure: documents D1–D11 plotted as vectors in the t1–t2–t3 term space]
Assigning Weights to Terms
• Binary Weights
• Raw term frequency
• tf x idf
  – Recall the Zipf distribution
  – Want to weight terms highly if they are:
    • frequent in relevant documents … BUT
    • infrequent in the collection as a whole
Binary Weights

• Only the presence (1) or absence (0) of a term is included in the vector

docs  t1  t2  t3
D1     1   0   1
D2     1   0   0
D3     0   1   1
D4     1   0   0
D5     1   1   1
D6     1   1   0
D7     0   1   0
D8     0   1   0
D9     0   0   1
D10    0   1   1
D11    1   0   1
Raw Term Weights

• The frequency of occurrence of the term in each document is included in the vector

docs  t1  t2  t3
D1     2   0   3
D2     1   0   0
D3     0   4   7
D4     3   0   0
D5     1   6   3
D6     3   5   0
D7     0   8   0
D8     0  10   0
D9     0   0   1
D10    0   3   5
D11    4   0   1
Assigning Weights
• tf x idf measure:
  – term frequency (tf)
  – inverse document frequency (idf): a way to deal with the problems of the Zipf distribution
• Goal: assign a tf x idf weight to each term in each document
tf x idf

  w_ik = tf_ik × log(N / n_k)

where:
  T_k    = term k
  tf_ik  = frequency of term T_k in document D_i
  idf_k  = inverse document frequency of term T_k in collection C
  N      = total number of documents in collection C
  n_k    = number of documents in C that contain T_k

  idf_k = log(N / n_k)
Inverse Document Frequency

• IDF provides high values for rare words and low values for common words

For a collection of 10000 documents:

  log(10000 / 10000) = 0
  log(10000 / 5000)  = 0.301
  log(10000 / 20)    = 2.699
  log(10000 / 1)     = 4
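These values are base-10 logarithms; a quick sketch reproducing them and applying the tf x idf weight:

```python
import math

N = 10_000  # total documents in the collection

def idf(n_k):
    """Inverse document frequency, log base 10 as in the slide's examples."""
    return math.log10(N / n_k)

def tfidf(tf_ik, n_k):
    # w_ik = tf_ik * log(N / n_k)
    return tf_ik * idf(n_k)

for n_k in (10_000, 5_000, 20, 1):
    print(n_k, round(idf(n_k), 3))
# 10000 0.0
# 5000 0.301
# 20 2.699
# 1 4.0
```

A term appearing in every document gets weight 0 no matter how often it occurs, which is exactly the behavior wanted for stopword-like terms.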
Similarity Measures

  Simple matching (coordination level match):  |Q ∩ D|

  Dice’s Coefficient:     2|Q ∩ D| / (|Q| + |D|)

  Jaccard’s Coefficient:  |Q ∩ D| / |Q ∪ D|

  Cosine Coefficient:     |Q ∩ D| / (|Q|^½ × |D|^½)

  Overlap Coefficient:    |Q ∩ D| / min(|Q|, |D|)
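For binary (set-valued) representations these coefficients are easy to compute directly; a sketch with two invented term sets:

```python
import math

def simple_match(Q, D): return len(Q & D)
def dice(Q, D):         return 2 * len(Q & D) / (len(Q) + len(D))
def jaccard(Q, D):      return len(Q & D) / len(Q | D)
def cosine(Q, D):       return len(Q & D) / math.sqrt(len(Q) * len(D))
def overlap(Q, D):      return len(Q & D) / min(len(Q), len(D))

Q = {"star", "diet"}
D = {"star", "film", "role"}
print(simple_match(Q, D), dice(Q, D), jaccard(Q, D), overlap(Q, D))
# 1 0.4 0.25 0.5
print(round(cosine(Q, D), 3))  # 0.408
```

All but simple matching are normalized to lie between 0 and 1, so documents of very different lengths remain comparable.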
Computing Similarity Scores: Preview

[Figure: two-dimensional example.
  Q  = (0.4, 0.8)
  D1 = (0.8, 0.3)
  D2 = (0.2, 0.7)
  cos θ1 ≈ 0.74 (angle between Q and D1)
  cos θ2 ≈ 0.98 (angle between Q and D2)]
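The figure's numbers can be checked in a few lines (the slide's ≈0.74 is a rounding of ≈0.733):

```python
import math

def cosine(v, w):
    """Cosine of the angle between two dense vectors."""
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.hypot(*v) * math.hypot(*w))

Q, D1, D2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)
print(round(cosine(Q, D1), 2))  # 0.73 (the slide rounds this to 0.74)
print(round(cosine(Q, D2), 2))  # 0.98
```

D2 points in nearly the same direction as Q, so it gets the higher score even though D1 is the longer vector.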
Document Space Has High Dimensionality
• What happens beyond 2 or 3 dimensions?
• Similarity still has to do with how many tokens are shared in common.
• More terms -> harder to understand which subsets of words are shared among similar documents.
• Next time we will look in detail at ranking methods
• One approach to handling high dimensionality: Clustering
Vector Space Visualization
[Figure: visualization of documents in vector space]
Text Clustering
• Finds overall similarities among groups of documents
• Finds overall similarities among groups of tokens
• Picks out some themes, ignores others
Text Clustering
Clustering is “The art of finding groups in data.” (Kaufmann and Rousseeuw, Finding Groups in Data, 1990)
[Figure: documents scattered over Term 1 / Term 2 axes, ungrouped]
Text Clustering
[Figure: the same Term 1 / Term 2 scatter, now with cluster groupings drawn]
Types of Clustering
• Hierarchical vs. Flat
• Hard vs. Soft vs. Disjunctive
  (set vs. uncertain vs. multiple assignment)
Pair-wise Document Similarity

     nova  galaxy  heat  h’wood  film  role  diet  fur
A      1     3      1
B      5     2
C                                        2     1    5
D                                        4     1

How to compute document similarity?
Pair-wise Document Similarity (no normalization, for simplicity)

  D_1 = w_11, w_12, …, w_1t
  D_2 = w_21, w_22, …, w_2t

  sim(D_1, D_2) = Σ (i = 1..t) w_1i × w_2i

Using the table above:
  sim(A, B) = (1×5) + (3×2) = 11
  sim(A, C) = 0
  sim(A, D) = 0
  sim(B, C) = 0
  sim(B, D) = 0
  sim(C, D) = (2×4) + (1×1) = 9
Pair-wise Document Similarity (cosine normalization)

  D_1 = w_11, w_12, …, w_1t
  D_2 = w_21, w_22, …, w_2t

  unnormalized:       sim(D_1, D_2) = Σ (i = 1..t) w_1i × w_2i

  cosine-normalized:  sim(D_1, D_2) = [ Σ (i = 1..t) w_1i × w_2i ] / [ sqrt(Σ w_1i²) × sqrt(Σ w_2i²) ]
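Both versions applied to the slide's A–D vectors, stored as sparse dicts so only shared terms contribute to the sum:

```python
import math

# Term weights for documents A-D from the pair-wise similarity example.
A = {"nova": 1, "galaxy": 3, "heat": 1}
B = {"nova": 5, "galaxy": 2}
C = {"role": 2, "diet": 1, "fur": 5}
D = {"role": 4, "diet": 1}

def dot(u, v):
    """Unnormalized similarity: sum of products over shared terms."""
    return sum(u[t] * v[t] for t in u if t in v)

def cosine(u, v):
    """Cosine-normalized similarity."""
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return dot(u, v) / (norm(u) * norm(v))

print(dot(A, B), dot(C, D))  # 11 9  (unnormalized, matching the slide)
print(dot(A, C), dot(B, D))  # 0 0   (no shared terms)
print(round(cosine(A, B), 2), round(cosine(C, D), 2))
```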
Document/Document Matrix

        D_1   D_2   …   D_n
  D_1   d_11  d_12  …   d_1n
  D_2   d_21  d_22  …   d_2n
  …
  D_n   d_n1  d_n2  …   d_nn

  d_ij = similarity of D_i to D_j
Hierarchical Clustering (Agglomerative)

[Figure: dendrogram over items A B C D E F G H I, built bottom-up over three successive merge steps]
Types of Hierarchical Clustering
• Top-down vs. Bottom-up
• O(n²) vs. O(n³)
• Single-link vs. complete-link
  (local coherence vs. global coherence)
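A naive bottom-up single-link sketch over 1-D points with a made-up distance function (real use would start from the document/document similarity matrix). This brute-force version is the O(n³) variant:

```python
def single_link(points, dist, n_clusters):
    """Naive bottom-up single-link clustering: repeatedly merge the two
    clusters whose closest members are nearest each other."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: min(
            dist(a, b) for a in clusters[ij[0]] for b in clusters[ij[1]]))
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid
    return clusters

points = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
result = single_link(points, lambda a, b: abs(a - b), 3)
print(sorted(sorted(c) for c in result))  # [[0.0, 0.1, 0.2], [5.0, 5.1], [9.0]]
```

Swapping the inner `min` for `max` over member pairs would give complete-link, trading local coherence for global coherence.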
Flat Clustering
• K-Means
  – Hard
  – O(n)
• EM (a soft version of K-Means)
K-Means Clustering
1. Create a pair-wise similarity measure
2. Find K centers
3. Assign each document to the nearest center, forming new clusters
4. Repeat step 3 as necessary
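The four steps above in a toy 1-D form (the points and K are invented; absolute distance stands in for step 1's similarity measure, and a real run would use the document vectors from the earlier slides):

```python
import random

def kmeans(points, k, iters=10, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # step 2: find K centers
    for _ in range(iters):                       # step 4: repeat as necessary
        clusters = [[] for _ in range(k)]
        for p in points:                         # step 3: assign to nearest center
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
print(sorted(sorted(c) for c in kmeans(points, 2)))
# [[0.8, 1.0, 1.2], [7.9, 8.0, 8.3]]
```

Each pass over the points is linear in n, which is the O(n) behavior the slide contrasts with hierarchical methods.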
Scatter/Gather
Cutting, Pedersen, Tukey & Karger 1992, 1993; Hearst & Pedersen 1995
• Cluster sets of documents into general “themes”, like a table of contents
• Display the contents of the clusters by showing topical terms and typical titles
• User chooses subsets of the clusters and re-clusters the documents within
• Resulting new groups have different “themes”
S/G Example: query on “star” (encyclopedia text)

  14  sports
   8  symbols
  47  film, tv
  68  film, tv (p)
   7  music
  97  astrophysics
  67  astronomy (p)
  12  stellar phenomena
  10  flora/fauna
  49  galaxies, stars
  29  constellations
   7  miscellaneous

Clustering and re-clustering is entirely automated.
Another use of clustering
• Use clustering to map the entire huge multidimensional document space into a huge number of small clusters.
• “Project” these onto a 2D graphical representation:
Clustering Multi-Dimensional Document Space
Wise, Thomas, Pennock, Lantrip, Pottier, Schur, Crow“Visualizing the Non-Visual: Spatial analysis and interaction with Information from text documents,” 1995
Clustering Multi-Dimensional Document Space
[Figure: document-space visualization from Wise et al., 1995]
Concept “Landscapes”: Browsing without Search

[Figure: concept landscape with regions labeled Pharmacology, Anatomy, Legal, Disease, and Hospitals]

(e.g., Xia Lin, “Visualization for the Document Space,” 1992)
Based on Kohonen feature maps; see http://websom.hut.fi/websom/
More Examples of Information Visualization
• Stuart Card, Jock Mackinlay, Ben Shneiderman (eds.), Readings in Information Visualization (San Francisco: Morgan Kaufmann, 1999)
• Martin Dodge, www.cybergeography.org
Clustering
• Advantages:
  – See some main themes
• Disadvantage:
  – Many ways documents could group together are hidden
• Thinking point: what is the relationship to classification systems and faceted queries?
  e.g., f1: (osteoporosis OR ‘bone loss’), f2: (drugs OR pharmaceuticals), f3: (prevention OR cure)
More Information on Content Analysis and Clustering
• Christopher Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing (Cambridge, MA: MIT Press, 1999)
• Daniel Jurafsky and James Martin, Speech and Language Processing (Upper Saddle River, NJ: Prentice Hall, 2000)
Next Time
• Vector Space Ranking
• Probabilistic Models and Ranking