9/13/2001 Information Organization and Retrieval Vector Representation, Term Weights and Clustering...

9/13/2001 Information Organization and Retrieval Vector Representation, Term Weights and Clustering Ray Larson & Warren Sack University of California, Berkeley School of Information Management and Systems SIMS 202: Information Organization and Retrieval Lecture authors: Marti Hearst & Ray Larson & Warren Sack

Page 1: 9/13/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering Ray Larson & Warren Sack University of California, Berkeley.

9/13/2001 Information Organization and Retrieval

Vector Representation, Term Weights and Clustering

Ray Larson & Warren Sack

University of California, Berkeley

School of Information Management and Systems

SIMS 202: Information Organization and Retrieval

Lecture authors: Marti Hearst & Ray Larson & Warren Sack

Page 2:

Last Time

• Content Analysis:
– Transformation of raw text into more computationally useful forms
• Words in text collections exhibit interesting statistical properties
– Zipf distribution
– Word co-occurrences are non-independent

Page 3:

Document Processing Steps

Page 4:

Zipf Distribution

[Figure: histogram of word frequencies; x-axis: Bin, y-axis: Frequency (0 to 350)]

$f = C \cdot \frac{1}{r}, \qquad C \approx N/10$

Rank = order of words’ frequency of occurrence

The product of a word's frequency (f) and its rank (r) is approximately constant.
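The rank-frequency relationship above can be checked on any token list. The following is a minimal sketch, assuming a toy token stream that was constructed by hand to follow f = C / r exactly; the function name `rank_frequency` and the data are illustrative and not part of the lecture. Real text only follows the relationship approximately.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]

# Hypothetical token stream built so that frequency * rank is constant.
tokens = ["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["to"] * 3
rows = rank_frequency(tokens)
for rank, word, freq, product in rows:
    print(rank, word, freq, product)
```

On a real collection the last column would drift around a constant rather than match it exactly.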

Page 5:

Zipf Distribution

• The Important Points:
– a few elements occur very frequently
– a medium number of elements have medium frequency
– many elements occur very infrequently

Page 6:

Zipf Distribution (same curve on linear and log scale)

Page 7:

Statistical Independence

Two events x and y are statistically independent if the product of the probabilities of their happening individually equals the probability of their happening together.

$P(x)\,P(y) = P(x, y)$
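As a rough illustration of the definition, the sketch below compares P(x)P(y) with P(x, y) for two words. The document counts are hypothetical and the helper `is_independent` is an illustrative name, not something from the lecture.

```python
def is_independent(p_x, p_y, p_xy, tol=1e-9):
    """True when P(x) * P(y) equals P(x, y) within a small tolerance."""
    return abs(p_x * p_y - p_xy) <= tol

# Hypothetical probabilities estimated from a collection of 1000 documents.
n_docs = 1000
p_new = 100 / n_docs    # documents containing the first word
p_york = 50 / n_docs    # documents containing the second word
p_both = 40 / n_docs    # documents containing both words

print(p_new * p_york)   # what independence would predict
print(p_both)           # what was (hypothetically) observed
print(is_independent(p_new, p_york, p_both))
```

When the observed joint probability is far above the product, as in this made-up case, the two words co-occur more often than independence would allow.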

Page 8:

Today

• Document Vectors

• Inverted Files

• Vector Space Model

• Term Weighting

• Clustering

Page 9:

Document Vectors

• Documents are represented as “bags of words”
• Represented as vectors when used computationally
– A vector is like an array of (floating point) numbers
– Has direction and magnitude
– Each vector holds a place for every term in the collection
– Therefore, most vectors are sparse
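The bullets above can be sketched in a few lines. This is a minimal illustration, assuming documents that are already tokenized; the function names and the two toy documents are hypothetical. It shows both the dense form (one slot per term in the collection) and a sparse form that stores only non-zero entries.

```python
from collections import Counter

def build_vocabulary(docs):
    """Sorted list of every distinct term in the collection."""
    return sorted({term for tokens in docs.values() for term in tokens})

def to_dense_vector(tokens, vocabulary):
    """One slot per vocabulary term, holding that term's count in the document."""
    counts = Counter(tokens)
    return [float(counts[term]) for term in vocabulary]

def to_sparse_vector(tokens):
    """Only the non-zero entries, as a term -> count mapping."""
    return dict(Counter(tokens))

# Hypothetical pre-tokenized documents (word order is discarded: "bag of words").
docs = {
    "A": ["nova", "galaxy", "nova", "heat"],
    "B": ["film", "role", "film"],
}
vocabulary = build_vocabulary(docs)
dense = {doc_id: to_dense_vector(tokens, vocabulary) for doc_id, tokens in docs.items()}
sparse = {doc_id: to_sparse_vector(tokens) for doc_id, tokens in docs.items()}
print(vocabulary)
print(dense["A"], sparse["A"])
```

With a realistic vocabulary of many thousands of terms, nearly every dense slot is zero, which is why the sparse form is usually preferred in practice.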

Page 10:

Document Vectors

One location for each word.

Terms (columns, in order): nova, galaxy, heat, h’wood, film, role, diet, fur

Doc  Non-zero counts (left to right)
A    10 5 3
B    5 10
C    10 8 7
D    9 10 5
E    10 10
F    9 10
G    5 7 9
H    6 10 2 8
I    7 5 1 3

“Nova” occurs 10 times in text A
“Galaxy” occurs 5 times in text A
“Heat” occurs 3 times in text A
(Blank means 0 occurrences.)

Page 11:

Document Vectors

One location for each word. (Same table as on Page 10.)

“Hollywood” occurs 7 times in text I
“Film” occurs 5 times in text I
“Diet” occurs 1 time in text I
“Fur” occurs 3 times in text I

Page 12:

Document Vectors

(Same table as on Page 10. The row labels A through I are the document ids.)

Page 13:

We Can Plot the Vectors

[Figure: documents plotted on two term axes, "Star" and "Diet"; labeled points: doc about astronomy, doc about movie stars, doc about mammal behavior]

Page 14:

Documents in 3D Space

Page 15:

Content Analysis Summary

• Content Analysis: transforming raw text into more computationally useful forms
• Words in text collections exhibit interesting statistical properties
– Word frequencies have a Zipf distribution
– Word co-occurrences exhibit dependencies
• Text documents are transformed into vectors
– Pre-processing includes tokenization, stemming, collocations/phrases
– Documents occupy multi-dimensional space.

Page 16:

[Figure: retrieval system diagram with boxes labeled Information need, text input, Parse, Query, Collections, Pre-process, Index, Rank]

How is the index constructed?

Page 17:

Inverted Index

• This is the primary data structure for text indexes
• Main Idea:
– Invert documents into a big index
• Basic steps:
– Make a “dictionary” of all the tokens in the collection
– For each token, list all the docs it occurs in.
– Do a few things to reduce redundancy in the data structure

Page 18:

Inverted Indexes

We have seen “Vector files” conceptually.

An Inverted File is a vector file “inverted” so that rows become columns and columns become rows.

docs  t1  t2  t3
D1    1   0   1
D2    1   0   0
D3    0   1   1
D4    1   0   0
D5    1   1   1
D6    1   1   0
D7    0   1   0
D8    0   1   0
D9    0   0   1
D10   0   1   1

Terms  D1  D2  D3  D4  D5  D6  D7  …
t1     1   1   0   1   1   1   0
t2     0   0   1   0   1   1   1
t3     1   0   1   0   1   0   0

Page 19:

How Are Inverted Files Created

• Documents are parsed to extract tokens. These are saved with the Document ID.

Doc 1: Now is the time for all good men to come to the aid of their country

Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

Term      Doc #
now       1
is        1
the       1
time      1
for       1
all       1
good      1
men       1
to        1
come      1
to        1
the       1
aid       1
of        1
their     1
country   1
it        2
was       2
a         2
dark      2
and       2
stormy    2
night     2
in        2
the       2
country   2
manor     2
the       2
time      2
was       2
past      2
midnight  2

Page 20:

How Inverted Files are Created

• After all documents have been parsed, the inverted file is sorted alphabetically.

Term      Doc #
a         2
aid       1
all       1
and       2
come      1
country   1
country   2
dark      2
for       1
good      1
in        2
is        1
it        2
manor     2
men       1
midnight  2
night     2
now       1
of        1
past      2
stormy    2
the       1
the       1
the       2
the       2
their     1
time      1
time      2
to        1
to        1
was       2
was       2

Page 21:

How Inverted Files are Created

• Multiple term entries for a single document are merged.
• Within-document term frequency information is compiled.

Term      Doc #  Freq
a         2      1
aid       1      1
all       1      1
and       2      1
come      1      1
country   1      1
country   2      1
dark      2      1
for       1      1
good      1      1
in        2      1
is        1      1
it        2      1
manor     2      1
men       1      1
midnight  2      1
night     2      1
now       1      1
of        1      1
past      2      1
stormy    2      1
the       1      2
the       2      2
their     1      1
time      1      1
time      2      1
to        1      2
was       2      2

Page 22:

How Inverted Files are Created

• Then the file can be split into
– A Dictionary file and
– A Postings file

Page 23:

How Inverted Files are Created

Merged file (before the split):

Term      Doc #  Freq
a         2      1
aid       1      1
all       1      1
and       2      1
come      1      1
country   1      1
country   2      1
dark      2      1
for       1      1
good      1      1
in        2      1
is        1      1
it        2      1
manor     2      1
men       1      1
midnight  2      1
night     2      1
now       1      1
of        1      1
past      2      1
stormy    2      1
the       1      2
the       2      2
their     1      1
time      1      1
time      2      1
to        1      2
was       2      2

Dictionary:

Term      N docs  Tot Freq
a         1       1
aid       1       1
all       1       1
and       1       1
come      1       1
country   2       2
dark      1       1
for       1       1
good      1       1
in        1       1
is        1       1
it        1       1
manor     1       1
men       1       1
midnight  1       1
night     1       1
now       1       1
of        1       1
past      1       1
stormy    1       1
the       2       4
their     1       1
time      2       2
to        1       2
was       1       2

Postings: the Doc # and Freq columns of the merged file, in the same order.
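The four steps walked through on these slides (parse, sort, merge, split) can be sketched as follows. This is a minimal illustration under stated assumptions: the regular-expression tokenizer and the in-memory dict layout are choices made for the example, and `build_inverted_file` is a hypothetical name. Real systems store the postings on disk and keep pointers from the dictionary into it.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Assumed tokenizer: lowercase alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())

def build_inverted_file(docs):
    """Return (dictionary, postings) for a {doc_id: text} collection.

    dictionary: term -> (number of docs containing it, total frequency)
    postings:   term -> list of (doc_id, within-document frequency)
    """
    # Step 1: parse each document into (term, doc_id) pairs.
    pairs = [(term, doc_id) for doc_id, text in docs.items() for term in tokenize(text)]
    # Step 2: sort alphabetically by term, then by document id.
    pairs.sort()
    # Step 3: merge duplicate pairs into within-document frequencies.
    merged = Counter(pairs)
    # Step 4: split into a postings structure and a dictionary.
    postings = defaultdict(list)
    for (term, doc_id), freq in sorted(merged.items()):
        postings[term].append((doc_id, freq))
    dictionary = {term: (len(plist), sum(freq for _, freq in plist))
                  for term, plist in postings.items()}
    return dictionary, dict(postings)

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}
dictionary, postings = build_inverted_file(docs)
print(dictionary["the"], postings["the"])
```

Run on the two example documents, the sketch yields the same dictionary and postings entries as the tables on the slides.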

Page 24:

Inverted indexes

• Permit fast search for individual terms
• For each term, you get a list consisting of:
– document ID
– frequency of term in doc (optional)
– position of term in doc (optional)
• These lists can be used to solve Boolean queries:
– country -> d1, d2
– manor -> d2
– country AND manor -> d2
• Also used for statistical ranking algorithms

Page 25:

How Inverted Files are Used

(Dictionary and Postings files as on Page 23.)

Boolean Query on “time” AND “dark”

2 docs with “time” in dictionary -> IDs 1 and 2 from posting file

1 doc with “dark” in dictionary -> ID 2 from posting file

Therefore, only doc 2 satisfies the query.
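The Boolean AND above amounts to intersecting the document-id lists of the query terms. A minimal sketch follows, assuming postings are held as a term -> [(doc_id, freq)] mapping; that layout and the name `boolean_and` are illustrative choices, and only the four terms used in the slide examples are included.

```python
def boolean_and(postings, *terms):
    """Doc ids that contain every one of the given terms."""
    doc_sets = [{doc_id for doc_id, _freq in postings.get(term, [])} for term in terms]
    return sorted(set.intersection(*doc_sets)) if doc_sets else []

# Postings for the terms used in the slide examples: term -> [(doc_id, freq)].
postings = {
    "time": [(1, 1), (2, 1)],
    "dark": [(2, 1)],
    "country": [(1, 1), (2, 1)],
    "manor": [(2, 1)],
}
print(boolean_and(postings, "time", "dark"))
print(boolean_and(postings, "country", "manor"))
```

A term that is missing from the postings contributes an empty set, so any AND query that includes it returns no documents.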

Page 26:

Vector Space Model

• Documents are represented as vectors in term space
– Terms are usually stems
– Documents represented by binary vectors of terms
• Queries represented the same as documents
• Query and Document weights are based on length and direction of their vector
• A vector distance measure between the query and documents is used to rank retrieved documents
• This makes partial matching possible

Page 27:

Documents in 3D Space

Assumption: Documents that are “close together” in space are similar in meaning.

Page 28:

Vector Space Documents and Queries

docs  t1  t2  t3  RSV=Q.Di
D1    1   0   1   4
D2    1   0   0   1
D3    0   1   1   5
D4    1   0   0   1
D5    1   1   1   6
D6    1   1   0   3
D7    0   1   0   2
D8    0   1   0   2
D9    0   0   1   3
D10   0   1   1   5
D11   1   0   1   4
Q     1   2   3
      q1  q2  q3

[Figure: D1 through D11 placed in the space spanned by t1, t2 and t3, grouped by Boolean term combinations]

Q is a query – also represented as a vector
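The RSV column is the inner product of the query vector with each document vector. A minimal sketch, using the binary document vectors and the query weights (1, 2, 3) from the table; the function name `rsv` is an illustrative choice. Note that D11 has the same vector as D1, so its inner product with Q is 4.

```python
def rsv(query, doc):
    """Retrieval status value: the inner product of query and document vectors."""
    return sum(q * d for q, d in zip(query, doc))

query = (1, 2, 3)  # weights q1, q2, q3
docs = {
    "D1": (1, 0, 1), "D2": (1, 0, 0), "D3": (0, 1, 1), "D4": (1, 0, 0),
    "D5": (1, 1, 1), "D6": (1, 1, 0), "D7": (0, 1, 0), "D8": (0, 1, 0),
    "D9": (0, 0, 1), "D10": (0, 1, 1), "D11": (1, 0, 1),
}
scores = {doc_id: rsv(query, vec) for doc_id, vec in docs.items()}
# Rank documents by descending score (ties keep their original order).
ranking = sorted(scores, key=lambda doc_id: -scores[doc_id])
print(ranking)
```

Unlike a Boolean query, this produces a graded ranking, so documents that match only some of the query terms still receive a score.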

Page 29:

Documents in Vector Space

[Figure: documents D1 through D11 plotted against the axes t1, t2 and t3]

Page 30:

Assigning Weights to Terms

• Binary Weights
• Raw term frequency
• tf x idf
– Recall the Zipf distribution
– Want to weight terms highly if they are
  • frequent in relevant documents … BUT
  • infrequent in the collection as a whole

Page 31:

Binary Weights

• Only the presence (1) or absence (0) of a term is included in the vector

docs  t1  t2  t3
D1    1   0   1
D2    1   0   0
D3    0   1   1
D4    1   0   0
D5    1   1   1
D6    1   1   0
D7    0   1   0
D8    0   1   0
D9    0   0   1
D10   0   1   1
D11   1   0   1

Page 32:

Raw Term Weights

• The frequency of occurrence for the term in each document is included in the vector

docs  t1  t2  t3
D1    2   0   3
D2    1   0   0
D3    0   4   7
D4    3   0   0
D5    1   6   3
D6    3   5   0
D7    0   8   0
D8    0   10  0
D9    0   0   1
D10   0   3   5
D11   4   0   1

Page 33:

Assigning Weights

• tf x idf measure:
– term frequency (tf)
– inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution
• Goal: assign a tf * idf weight to each term in each document

Page 34:

tf x idf

$w_{ik} = tf_{ik} \cdot \log(N / n_k)$

$T_k$ = term $k$ in document $D_i$
$tf_{ik}$ = frequency of term $T_k$ in document $D_i$
$idf_k$ = inverse document frequency of term $T_k$ in $C$
$N$ = total number of documents in the collection $C$
$n_k$ = the number of documents in $C$ that contain $T_k$
$idf_k = \log(N / n_k)$

Page 35:

Inverse Document Frequency

• IDF provides high values for rare words and low values for common words

For a collection of 10000 documents:

$\log(10000 / 10000) = 0$
$\log(10000 / 5000) = 0.301$
$\log(10000 / 20) = 2.698$
$\log(10000 / 1) = 4$
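The weighting formula and the IDF values above can be reproduced with a short sketch. It assumes base-10 logarithms, which is what the slide's values imply (log(10000 / 1) = 4); the function names `idf` and `tf_idf` are illustrative.

```python
import math

def idf(n_docs, doc_freq):
    """Inverse document frequency, log10(N / n_k). Base 10 is an assumption."""
    return math.log10(n_docs / doc_freq)

def tf_idf(tf, n_docs, doc_freq):
    """Weight w_ik = tf_ik * log10(N / n_k)."""
    return tf * idf(n_docs, doc_freq)

N = 10000  # collection size used on the slide
for n_k in (10000, 5000, 20, 1):
    print(n_k, idf(N, n_k))

# A term occurring 3 times in a document and in 20 of the 10000 documents:
print(tf_idf(3, N, 20))
```

A term that appears in every document gets an IDF of zero, so it contributes nothing to the weight no matter how often it occurs within a document.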

Page 36:

Similarity Measures

Simple matching (coordination level match): $|Q \cap D|$

Dice’s Coefficient: $\dfrac{2\,|Q \cap D|}{|Q| + |D|}$

Jaccard’s Coefficient: $\dfrac{|Q \cap D|}{|Q \cup D|}$

Cosine Coefficient: $\dfrac{|Q \cap D|}{|Q|^{1/2} \times |D|^{1/2}}$

Overlap Coefficient: $\dfrac{|Q \cap D|}{\min(|Q|, |D|)}$
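For binary (set-valued) representations, the five coefficients can be written directly with Python sets. This is a sketch under the assumption that Q and D are non-empty sets of terms; the example query and document are hypothetical.

```python
import math

def simple_matching(q, d):
    """Coordination level match: number of shared terms."""
    return len(q & d)

def dice(q, d):
    return 2 * len(q & d) / (len(q) + len(d))

def jaccard(q, d):
    return len(q & d) / len(q | d)

def cosine(q, d):
    return len(q & d) / math.sqrt(len(q) * len(d))

def overlap(q, d):
    return len(q & d) / min(len(q), len(d))

# Hypothetical query and document term sets (both must be non-empty).
query = {"nova", "galaxy"}
doc = {"nova", "galaxy", "heat", "film"}
for measure in (simple_matching, dice, jaccard, cosine, overlap):
    print(measure.__name__, measure(query, doc))
```

All but simple matching normalize the shared-term count by the sizes of the two sets, which keeps long documents from winning simply because they contain more terms.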

Page 37:

Computing Similarity Scores -- Preview

$D_1 = (0.8, 0.3)$
$D_2 = (0.2, 0.7)$
$Q = (0.4, 0.8)$
$\cos \alpha_1 = 0.74$
$\cos \alpha_2 = 0.98$

[Figure: Q, D_1 and D_2 drawn as vectors on two axes running from 0 to 1.0; α_1 is the angle between Q and D_1, α_2 the angle between Q and D_2]
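The two cosine values in this preview can be recomputed from the vectors. A minimal sketch, assuming the usual definition of the cosine as the dot product divided by the product of the vector lengths; the function name is illustrative. Computed this way, the value for D1 comes out slightly below the 0.74 shown on the slide, while the value for D2 rounds to 0.98.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)
print(cosine(Q, D1))  # angle alpha_1
print(cosine(Q, D2))  # angle alpha_2
```

D2 points in nearly the same direction as Q, so it ranks above D1 even though D1 is the longer vector.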

Page 38:

Document Space has High Dimensionality

• What happens beyond 2 or 3 dimensions?
• Similarity still has to do with how many tokens are shared in common.
• More terms -> harder to understand which subsets of words are shared among similar documents.
• Next time we will look in detail at ranking methods
• One approach to handling high dimensionality: Clustering

Page 39:

Vector Space Visualization

Page 40:

Text Clustering

• Finds overall similarities among groups of documents

• Finds overall similarities among groups of tokens

• Picks out some themes, ignores others

Page 41:

Text Clustering

Clustering is “The art of finding groups in data.” Kaufman and Rousseeuw, Finding Groups in Data, 1990

[Figure: documents plotted on axes Term 1 and Term 2]

Page 42:

Text Clustering

Clustering is “The art of finding groups in data.” Kaufman and Rousseeuw, Finding Groups in Data, 1990

[Figure: documents plotted on axes Term 1 and Term 2]

Page 43:

Types of Clustering

• Hierarchical vs. Flat
• Hard vs. Soft vs. Disjunctive (set vs. uncertain vs. multiple assignment)

Page 44:

Pair-wise Document Similarity

Terms (columns, in order): nova, galaxy, heat, h’wood, film, role, diet, fur

Doc  Non-zero counts (left to right)
A    1 3 1
B    5 2
C    2 1 5
D    4 1

How to compute document similarity?

Page 45:

Pair-wise Document Similarity (no normalization for simplicity)

(Same table as on Page 44.)

$D_1 = w_{11}, w_{12}, \ldots, w_{1t}$
$D_2 = w_{21}, w_{22}, \ldots, w_{2t}$

$sim(D_1, D_2) = \sum_{i=1}^{t} w_{1i} \times w_{2i}$

$sim(A, B) = (1 \times 5) + (2 \times 3) = 11$
$sim(A, C) = 0$
$sim(A, D) = 0$
$sim(B, C) = 0$
$sim(B, D) = 0$
$sim(C, D) = (2 \times 4) + (1 \times 1) = 9$
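The unnormalized similarities above can be reproduced with a short sketch. The placement of each count under a specific term column is an assumption here: the values are laid out so that A and B share the first two terms and C and D share the fourth and fifth, which is one arrangement that reproduces the similarities given on the slide.

```python
from itertools import combinations

TERMS = ["nova", "galaxy", "heat", "h'wood", "film", "role", "diet", "fur"]

# Assumed column placement, chosen to reproduce the slide's similarity values.
docs = {
    "A": [1, 3, 1, 0, 0, 0, 0, 0],
    "B": [5, 2, 0, 0, 0, 0, 0, 0],
    "C": [0, 0, 0, 2, 1, 5, 0, 0],
    "D": [0, 0, 0, 4, 1, 0, 0, 0],
}

def sim(d1, d2):
    """Unnormalized similarity: sum over terms of w_1i * w_2i."""
    return sum(w1 * w2 for w1, w2 in zip(d1, d2))

pairwise = {(a, b): sim(docs[a], docs[b]) for a, b in combinations(docs, 2)}
for pair, value in pairwise.items():
    print(pair, value)
```

Pairs with no terms in common score zero, so only (A, B) and (C, D) come out as similar.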

Page 46:

Pair-wise Document Similarity (cosine normalization)

$D_1 = w_{11}, w_{12}, \ldots, w_{1t}$
$D_2 = w_{21}, w_{22}, \ldots, w_{2t}$

Unnormalized:

$sim(D_1, D_2) = \sum_{i=1}^{t} w_{1i} \times w_{2i}$

Cosine normalized:

$sim(D_1, D_2) = \dfrac{\sum_{i=1}^{t} w_{1i} \times w_{2i}}{\sqrt{\sum_{i=1}^{t} (w_{1i})^2 \times \sum_{i=1}^{t} (w_{2i})^2}}$

Page 47:

Document/Document Matrix

$$
\begin{array}{c|cccc}
 & D_1 & D_2 & \cdots & D_n \\
\hline
D_1 & d_{11} & d_{12} & \cdots & d_{1n} \\
D_2 & d_{21} & d_{22} & \cdots & d_{2n} \\
\vdots & \vdots & \vdots & & \vdots \\
D_n & d_{n1} & d_{n2} & \cdots & d_{nn}
\end{array}
$$

$d_{ij}$ = similarity of $D_i$ to $D_j$
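Combining the cosine-normalized similarity with the document/document matrix gives a small sketch like the one below. The weight vectors are hypothetical, and returning 0.0 for an all-zero vector is a convention chosen for the example rather than something stated in the lecture.

```python
import math

def dot(d1, d2):
    return sum(w1 * w2 for w1, w2 in zip(d1, d2))

def cosine_sim(d1, d2):
    """Dot product divided by the product of the two vector lengths."""
    denom = math.sqrt(dot(d1, d1) * dot(d2, d2))
    return dot(d1, d2) / denom if denom else 0.0

def similarity_matrix(vectors):
    """Entry [i][j] is the similarity of document i to document j."""
    return [[cosine_sim(a, b) for b in vectors] for a in vectors]

# Hypothetical weight vectors for three documents.
vectors = [[1, 3, 1, 0], [5, 2, 0, 0], [0, 0, 0, 4]]
matrix = similarity_matrix(vectors)
for row in matrix:
    print([round(value, 3) for value in row])
```

The matrix is symmetric with ones on the diagonal, so a clustering routine only needs the entries above the diagonal.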

Page 48:

Hierarchical Clustering (Agglomerative)

[Figure: clustering tree over the documents A B C D E F G H I]

Page 49:

Hierarchical Clustering (Agglomerative)

[Figure: clustering tree over the documents A B C D E F G H I]

Page 50:

Hierarchical Clustering (Agglomerative)

[Figure: clustering tree over the documents A B C D E F G H I]

Page 51:

Page 52:

Types of Hierarchical Clustering

• Top-down vs. Bottom-up
• O(n**2) vs. O(n**3)
• Single-link vs. complete-link (local coherence vs. global coherence)
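A bottom-up (agglomerative) pass with a choice of single-link or complete-link scoring can be sketched as below. This is a deliberately naive illustration, not an efficient implementation: it rescans every cluster pair on each merge. The one-dimensional points and the negative-distance similarity are hypothetical choices made for the example.

```python
def agglomerative(points, similarity, linkage="single", target_clusters=1):
    """Bottom-up clustering: repeatedly merge the two most similar clusters.

    linkage="single" scores a cluster pair by its most similar members,
    linkage="complete" by its least similar members.
    """
    pick = max if linkage == "single" else min
    clusters = [[i] for i in range(len(points))]  # start with singletons
    merges = []
    while len(clusters) > target_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                score = pick(similarity(points[i], points[j])
                             for i in clusters[a] for j in clusters[b])
                if best is None or score > best[0]:
                    best = (score, a, b)
        _, a, b = best
        merges.append((clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters, merges

def neg_distance(p, q):
    """Assumed similarity: closer points are more similar."""
    return -abs(p - q)

points = [0.0, 1.0, 5.0, 6.0, 20.0]  # hypothetical 1-D "documents"
clusters, merges = agglomerative(points, neg_distance, "single", target_clusters=2)
print(clusters)
```

Single-link tends to chain nearby items together, while complete-link favors compact groups; swapping the `linkage` argument on less tidy data shows the difference.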

Page 53:

Page 54:

Page 55:

Flat Clustering

• K-Means
– Hard
– O(n)
• EM (soft version of K-Means)

Page 56:

K-Means Clustering

• 1 Create a pair-wise similarity measure
• 2 Find K centers
• 3 Assign each document to nearest center, forming new clusters
• 4 Repeat 3 as necessary
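The steps above can be sketched as follows. Several details are assumptions made for the example: Euclidean distance stands in for the pair-wise similarity measure, the initial centers are picked by hand, and each pass recomputes every center as the mean of its members (the usual K-Means update, which the slide leaves implicit in "repeat").

```python
import math

def kmeans(points, initial_centers, iterations=10):
    """Hard K-Means: assign to the nearest center, recompute centers, repeat."""
    centers = list(initial_centers)
    assignment = []
    for _ in range(iterations):
        # Step 3: assign each point to its nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # Step 4: stop once nothing changes.
        assignment = new_assignment
        # Recompute each center as the mean of the points assigned to it.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centers

# Hypothetical 2-D document vectors and hand-picked starting centers (K = 2).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
assignment, centers = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(assignment, centers)
```

Each point belongs to exactly one cluster, which is what makes this a hard clustering; the result depends on the starting centers, so different initial choices can give different clusters.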

Page 57:

Page 58:

Page 59:

Scatter/Gather

Cutting, Pedersen, Tukey & Karger 92, 93
Hearst & Pedersen 95

• Cluster sets of documents into general “themes”, like a table of contents

• Display the contents of the clusters by showing topical terms and typical titles

• User chooses subsets of the clusters and re-clusters the documents within

• Resulting new groups have different “themes”

Page 60:

S/G Example: query on “star”

Encyclopedia text

14 sports
8 symbols
47 film, tv
68 film, tv (p)
7 music
97 astrophysics
67 astronomy (p)
12 stellar phenomena
10 flora/fauna
49 galaxies, stars
29 constellations
7 miscellaneous

Clustering and re-clustering are entirely automated

Page 61:

Page 62:

Page 63:

Page 64:

Another use of clustering

• Use clustering to map the entire huge multidimensional document space into a huge number of small clusters.

• “Project” these onto a 2D graphical representation:

Page 65:

Clustering Multi-Dimensional Document Space

Wise, Thomas, Pennock, Lantrip, Pottier, Schur, Crow, “Visualizing the Non-Visual: Spatial Analysis and Interaction with Information from Text Documents,” 1995

Page 66:

Clustering Multi-Dimensional Document Space

Wise et al., 1995

Page 67:

Concept “Landscapes”

Browsing without search

[Figure: concept map with regions labeled Pharmacology, Anatomy, Legal, Disease, Hospitals]

(e.g., Xia Lin, “Visualization for the Document Space,” 1992)

Based on Kohonen feature maps; see http://websom.hut.fi/websom/

Page 68:

More examples of information visualization

• Stuart Card, Jock Mackinlay, Ben Shneiderman (eds.), Readings in Information Visualization (San Francisco: Morgan Kaufmann, 1999)
• Martin Dodge, www.cybergeography.org

Page 69:

Clustering

• Advantages:
– See some main themes
• Disadvantage:
– Many ways documents could group together are hidden
• Thinking point: what is the relationship to classification systems and faceted queries?

e.g.,
f1: (osteoporosis OR ‘bone loss’)
f2: (drugs OR pharmaceuticals)
f3: (prevention OR cure)

Page 70:

More information on content analysis and clustering

• Christopher Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing (Cambridge, MA: MIT Press, 1999)
• Daniel Jurafsky and James Martin, Speech and Language Processing (Upper Saddle River, NJ: Prentice Hall, 2000)

Page 71:

Next Time

• Vector Space Ranking

• Probabilistic Models and Ranking