9/13/2001 Information Organization and Retrieval
Vector Representation, Term Weights and Clustering
Ray Larson & Warren Sack
University of California, Berkeley
School of Information Management and Systems
SIMS 202: Information Organization and Retrieval
Lecture authors: Marti Hearst, Ray Larson & Warren Sack
Last Time
• Content Analysis:
  – Transformation of raw text into more computationally useful forms
• Words in text collections exhibit interesting statistical properties
  – Zipf distribution
  – Word co-occurrences non-independent
Document Processing Steps
[Figure: document processing pipeline]
Zipf Distribution

[Figure: histogram of word frequencies by rank bin, frequencies ranging from 0 to 350]

Rank = order of words’ frequency of occurrence. The product of the frequency of words (f) and their rank (r) is approximately constant:

  f = C × (1/r),  where C ≈ N/10
Zipf Distribution
• The Important Points:
  – a few elements occur very frequently
  – a medium number of elements have medium frequency
  – many elements occur very infrequently
Zipf Distribution
[Figure: the same Zipf curve plotted on linear and log scales]
Statistical Independence
Two events x and y are statistically independent if the product of the probabilities of each happening individually equals the probability of their happening together:

  P(x) × P(y) = P(x, y)
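The independence test can be checked directly from document counts. A minimal sketch with hypothetical counts (the words and numbers below are invented for illustration):

```python
# Hypothetical counts over a collection of 1000 documents:
# event x = "nova" appears in a doc, event y = "galaxy" appears in a doc.
n_docs = 1000
n_x = 200   # docs containing "nova"
n_y = 100   # docs containing "galaxy"
n_xy = 80   # docs containing both

p_x = n_x / n_docs
p_y = n_y / n_docs
p_xy = n_xy / n_docs

# Independence would require P(x) * P(y) == P(x, y).
print(round(p_x * p_y, 6))  # 0.02 expected under independence
print(p_xy)                 # 0.08 observed: far more co-occurrence than chance
```

Since the observed joint probability exceeds the product of the marginals, the two word events are dependent, which is the typical case in real text.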
Today
• Document Vectors
• Inverted Files
• Vector Space Model
• Term Weighting
• Clustering
Document Vectors
• Documents are represented as “bags of words”
• Represented as vectors when used computationally
  – A vector is like an array of (floating point) numbers
  – Has direction and magnitude
  – Each vector holds a place for every term in the collection
  – Therefore, most vectors are sparse
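As a minimal sketch of the idea (the vocabulary and text below are invented for illustration), a bag-of-words vector can be stored sparsely, keeping only the nonzero counts:

```python
from collections import Counter

def doc_vector(text, vocabulary):
    """Sparse bag-of-words vector: only nonzero term counts are stored."""
    counts = Counter(text.lower().split())
    return {term: counts[term] for term in vocabulary if counts[term] > 0}

vocab = ["nova", "galaxy", "heat", "hollywood", "film", "role", "diet", "fur"]
vec = doc_vector("nova nova galaxy heat nova", vocab)
print(vec)  # {'nova': 3, 'galaxy': 1, 'heat': 1}
```

Storing only nonzero entries is what makes very high-dimensional term spaces practical.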
Document Vectors: One location for each word.

     nova  galaxy  heat  h’wood  film  role  diet  fur
A     10     5      3
B      5    10
C                   10      8     7
D                    9     10     5
E                                10    10
F                           9    10
G                                       5     7     9
H                                 6    10     2     8
I                           7     5           1     3

“Nova” occurs 10 times in text A; “Galaxy” occurs 5 times in text A; “Heat” occurs 3 times in text A. (Blank means 0 occurrences.)
Document Vectors: One location for each word.

(Same term-frequency matrix as above.)

“Hollywood” occurs 7 times in text I; “Film” occurs 5 times in text I; “Diet” occurs 1 time in text I; “Fur” occurs 3 times in text I.
Document Vectors

(Same term-frequency matrix as above; A–I are document ids.)
We Can Plot the Vectors
[Figure: 2-D plot with axes “Star” and “Diet”: a doc about astronomy and a doc about movie stars lie toward the Star axis; a doc about mammal behavior lies toward the Diet axis.]
Documents in 3D Space
[Figure: documents plotted as vectors in a 3-D term space]
Content Analysis Summary
• Content Analysis: transforming raw text into more computationally useful forms
• Words in text collections exhibit interesting statistical properties
  – Word frequencies have a Zipf distribution
  – Word co-occurrences exhibit dependencies
• Text documents are transformed into vectors
  – Pre-processing includes tokenization, stemming, collocations/phrases
  – Documents occupy multi-dimensional space
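A toy version of that pre-processing pipeline. The tiny suffix-stripping stemmer below is a crude stand-in for a real stemmer such as Porter’s, not the course’s actual tooling:

```python
import re

def tokenize(text):
    """Lowercase and split into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(token):
    # Toy stand-in for a real stemmer (e.g., Porter): strip a few suffixes.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = [crude_stem(t) for t in tokenize("Clustering finds groups in documents.")]
print(tokens)  # ['cluster', 'find', 'group', 'in', 'document']
```

A real pipeline would also detect collocations/phrases and drop stopwords before the vectors are built.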
[Figure: retrieval pipeline — an information need is parsed into a query; collections of text input are pre-processed into an index; documents are ranked against the query. How is the index constructed?]
Inverted Index
• This is the primary data structure for text indexes
• Main Idea: invert documents into a big index
• Basic steps:
  – Make a “dictionary” of all the tokens in the collection
  – For each token, list all the docs it occurs in
  – Do a few things to reduce redundancy in the data structure
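The basic steps above can be sketched as follows (a simplified illustration: real systems also compress postings and handle punctuation, stopwords, and so on):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: {doc_id: within-doc frequency}}."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for token in text.lower().split():   # tokens form the "dictionary"
            index[token][doc_id] += 1        # postings with term frequency
    return index

docs = {1: "now is the time", 2: "the time was past midnight"}
index = build_inverted_index(docs)
print(dict(index["time"]))  # {1: 1, 2: 1}
print(dict(index["dark"]))  # {} -- term absent from both docs
```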
Inverted Indexes
We have seen “vector files” conceptually. An inverted file is a vector file “inverted” so that rows become columns and columns become rows:

docs  t1  t2  t3
D1     1   0   1
D2     1   0   0
D3     0   1   1
D4     1   0   0
D5     1   1   1
D6     1   1   0
D7     0   1   0
D8     0   1   0
D9     0   0   1
D10    0   1   1

Terms  D1  D2  D3  D4  D5  D6  D7  …
t1      1   1   0   1   1   1   0
t2      0   0   1   0   1   1   1
t3      1   0   1   0   1   0   0
How Are Inverted Files Created?

• Documents are parsed to extract tokens. These are saved with the document ID.

Doc 1: Now is the time for all good men to come to the aid of their country

Doc 2: It was a dark and stormy night in the country manor. The time was past midnight.

Term      Doc #
now         1
is          1
the         1
time        1
for         1
all         1
good        1
men         1
to          1
come        1
to          1
the         1
aid         1
of          1
their       1
country     1
it          2
was         2
a           2
dark        2
and         2
stormy      2
night       2
in          2
the         2
country     2
manor       2
the         2
time        2
was         2
past        2
midnight    2
How Inverted Files Are Created

• After all documents have been parsed, the inverted file is sorted alphabetically.

Term      Doc #
a           2
aid         1
all         1
and         2
come        1
country     1
country     2
dark        2
for         1
good        1
in          2
is          1
it          2
manor       2
men         1
midnight    2
night       2
now         1
of          1
past        2
stormy      2
the         1
the         1
the         2
the         2
their       1
time        1
time        2
to          1
to          1
was         2
was         2
How Inverted Files Are Created

• Multiple term entries for a single document are merged.
• Within-document term frequency information is compiled.

Term      Doc #  Freq
a           2     1
aid         1     1
all         1     1
and         2     1
come        1     1
country     1     1
country     2     1
dark        2     1
for         1     1
good        1     1
in          2     1
is          1     1
it          2     1
manor       2     1
men         1     1
midnight    2     1
night       2     1
now         1     1
of          1     1
past        2     1
stormy      2     1
the         1     2
the         2     2
their       1     1
time        1     1
time        2     1
to          1     2
was         2     2
How Inverted Files Are Created

• Then the file can be split into:
  – a Dictionary file, and
  – a Postings file
How Inverted Files Are Created

Dictionary                         Postings
Term      N docs  Tot Freq         Doc #  Freq
a           1       1                2     1
aid         1       1                1     1
all         1       1                1     1
and         1       1                2     1
come        1       1                1     1
country     2       2                1     1
                                     2     1
dark        1       1                2     1
for         1       1                1     1
good        1       1                1     1
in          1       1                2     1
is          1       1                1     1
it          1       1                2     1
manor       1       1                2     1
men         1       1                1     1
midnight    1       1                2     1
night       1       1                2     1
now         1       1                1     1
of          1       1                1     1
past        1       1                2     1
stormy      1       1                2     1
the         2       4                1     2
                                     2     2
their       1       1                1     1
time        2       2                1     1
                                     2     1
to          1       2                1     2
was         1       2                2     2
Inverted Indexes
• Permit fast search for individual terms
• For each term, you get a list consisting of:
  – document ID
  – frequency of term in doc (optional)
  – position of term in doc (optional)
• These lists can be used to solve Boolean queries:
  – country -> d1, d2
  – manor -> d2
  – country AND manor -> d2
• Also used for statistical ranking algorithms
How Inverted Files Are Used

(Using the Dictionary and Postings files shown above.)

Boolean query: “time” AND “dark”
• 2 docs with “time” in the dictionary -> IDs 1 and 2 from the postings file
• 1 doc with “dark” in the dictionary -> ID 2 from the postings file
• Therefore, only doc 2 satisfies the query.
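The Boolean AND above is just an intersection of posting lists. A sketch using a few terms from the two example documents, with the posting sets written out by hand:

```python
# Posting lists for a few terms from Doc 1 and Doc 2 above.
postings = {
    "time": {1, 2},
    "dark": {2},
    "country": {1, 2},
    "manor": {2},
}

def boolean_and(term1, term2, postings):
    """AND = set intersection of the two terms' posting lists."""
    return postings.get(term1, set()) & postings.get(term2, set())

print(boolean_and("time", "dark", postings))      # {2}
print(boolean_and("country", "manor", postings))  # {2}
```

OR and NOT map to set union and difference in the same way.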
Vector Space Model
• Documents are represented as vectors in term space
  – Terms are usually stems
  – Documents represented by binary vectors of terms
• Queries are represented the same as documents
• Query and document weights are based on the length and direction of their vectors
• A vector distance measure between the query and documents is used to rank retrieved documents
• This makes partial matching possible
Documents in 3D Space
[Figure: documents as points in a 3-D term space]
Assumption: Documents that are “close together” in space are similar in meaning.
Vector Space Documents and Queries

docs  t1  t2  t3   RSV = Q·Di
D1     1   0   1       4
D2     1   0   0       1
D3     0   1   1       5
D4     1   0   0       1
D5     1   1   1       6
D6     1   1   0       3
D7     0   1   0       2
D8     0   1   0       2
D9     0   0   1       3
D10    0   1   1       5
D11    1   0   1       4
Q      1   2   3

[Figure: documents D1–D11 and the query Q plotted in the t1–t2–t3 term space, showing Boolean term combinations]

Q is a query, also represented as a vector
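The RSV column is just the inner product of Q with each document vector. A sketch using three rows of the table:

```python
def rsv(query, doc):
    """Retrieval status value: inner product of query and document vectors."""
    return sum(q * d for q, d in zip(query, doc))

Q = (1, 2, 3)
docs = {"D1": (1, 0, 1), "D3": (0, 1, 1), "D5": (1, 1, 1)}
ranked = sorted(docs, key=lambda name: rsv(Q, docs[name]), reverse=True)
print([(name, rsv(Q, docs[name])) for name in ranked])
# [('D5', 6), ('D3', 5), ('D1', 4)]
```

Sorting by RSV gives the ranked retrieval order, with partial matches (documents sharing only some query terms) still receiving nonzero scores.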
Documents in Vector Space
[Figure: documents D1–D11 plotted as vectors in the t1–t2–t3 term space]
Assigning Weights to Terms
• Binary Weights
• Raw term frequency
• tf x idf
  – Recall the Zipf distribution
  – Want to weight terms highly if they are:
    • frequent in relevant documents … BUT
    • infrequent in the collection as a whole
Binary Weights

• Only the presence (1) or absence (0) of a term is included in the vector

docs  t1  t2  t3
D1     1   0   1
D2     1   0   0
D3     0   1   1
D4     1   0   0
D5     1   1   1
D6     1   1   0
D7     0   1   0
D8     0   1   0
D9     0   0   1
D10    0   1   1
D11    1   0   1
Raw Term Weights

• The frequency of occurrence of the term in each document is included in the vector

docs  t1  t2  t3
D1     2   0   3
D2     1   0   0
D3     0   4   7
D4     3   0   0
D5     1   6   3
D6     3   5   0
D7     0   8   0
D8     0  10   0
D9     0   0   1
D10    0   3   5
D11    4   0   1
Assigning Weights
• tf x idf measure:
  – term frequency (tf)
  – inverse document frequency (idf): a way to deal with the problems of the Zipf distribution
• Goal: assign a tf x idf weight to each term in each document
tf x idf

  w_ik = tf_ik × log(N / n_k)

where:
  T_k    = term k
  tf_ik  = frequency of term T_k in document D_i
  idf_k  = inverse document frequency of term T_k in collection C
  N      = total number of documents in collection C
  n_k    = number of documents in C that contain T_k

  idf_k = log(N / n_k)
Inverse Document Frequency

• IDF provides high values for rare words and low values for common words

For a collection of 10000 documents:

  log(10000 / 10000) = 0
  log(10000 / 5000)  = 0.301
  log(10000 / 20)    = 2.699
  log(10000 / 1)     = 4
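These values are base-10 logarithms; a quick sketch reproducing them and applying the tf x idf weight:

```python
import math

N = 10_000  # total documents in the collection

def idf(n_k):
    """Inverse document frequency, log base 10 as in the slide's examples."""
    return math.log10(N / n_k)

def tfidf(tf_ik, n_k):
    # w_ik = tf_ik * log(N / n_k)
    return tf_ik * idf(n_k)

for n_k in (10_000, 5_000, 20, 1):
    print(n_k, round(idf(n_k), 3))
# 10000 0.0
# 5000 0.301
# 20 2.699
# 1 4.0
```

A term appearing in every document gets weight 0 no matter how often it occurs, which is exactly the behavior wanted for stopword-like terms.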
Similarity Measures

  Simple matching (coordination level match):  |Q ∩ D|

  Dice’s Coefficient:     2|Q ∩ D| / (|Q| + |D|)

  Jaccard’s Coefficient:  |Q ∩ D| / |Q ∪ D|

  Cosine Coefficient:     |Q ∩ D| / (|Q|^½ × |D|^½)

  Overlap Coefficient:    |Q ∩ D| / min(|Q|, |D|)
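For binary (set-valued) representations these coefficients are easy to compute directly; a sketch with two invented term sets:

```python
import math

def simple_match(Q, D): return len(Q & D)
def dice(Q, D):         return 2 * len(Q & D) / (len(Q) + len(D))
def jaccard(Q, D):      return len(Q & D) / len(Q | D)
def cosine(Q, D):       return len(Q & D) / math.sqrt(len(Q) * len(D))
def overlap(Q, D):      return len(Q & D) / min(len(Q), len(D))

Q = {"star", "diet"}
D = {"star", "film", "role"}
print(simple_match(Q, D), dice(Q, D), jaccard(Q, D), overlap(Q, D))
# 1 0.4 0.25 0.5
print(round(cosine(Q, D), 3))  # 0.408
```

All but simple matching are normalized to lie between 0 and 1, so documents of very different lengths remain comparable.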
Computing Similarity Scores: Preview

[Figure: two-dimensional example.
  Q  = (0.4, 0.8)
  D1 = (0.8, 0.3)
  D2 = (0.2, 0.7)
  cos θ1 ≈ 0.74 (angle between Q and D1)
  cos θ2 ≈ 0.98 (angle between Q and D2)]
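The figure's numbers can be checked in a few lines (the slide's ≈0.74 is a rounding of ≈0.733):

```python
import math

def cosine(v, w):
    """Cosine of the angle between two dense vectors."""
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.hypot(*v) * math.hypot(*w))

Q, D1, D2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)
print(round(cosine(Q, D1), 2))  # 0.73 (the slide rounds this to 0.74)
print(round(cosine(Q, D2), 2))  # 0.98
```

D2 points in nearly the same direction as Q, so it gets the higher score even though D1 is the longer vector.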
Document Space Has High Dimensionality
• What happens beyond 2 or 3 dimensions?
• Similarity still has to do with how many tokens are shared in common.
• More terms -> harder to understand which subsets of words are shared among similar documents.
• Next time we will look in detail at ranking methods
• One approach to handling high dimensionality: Clustering
Vector Space Visualization
[Figure: visualization of documents in vector space]
Text Clustering
• Finds overall similarities among groups of documents
• Finds overall similarities among groups of tokens
• Picks out some themes, ignores others
Text Clustering
Clustering is “The art of finding groups in data.” (Kaufmann and Rousseeuw, Finding Groups in Data, 1990)
[Figure: documents scattered over Term 1 / Term 2 axes, ungrouped]
Text Clustering
[Figure: the same Term 1 / Term 2 scatter, now with cluster groupings drawn]
Types of Clustering
• Hierarchical vs. Flat
• Hard vs. Soft vs. Disjunctive
  (set vs. uncertain vs. multiple assignment)
Pair-wise Document Similarity

     nova  galaxy  heat  h’wood  film  role  diet  fur
A      1     3      1
B      5     2
C                                        2     1    5
D                                        4     1

How to compute document similarity?
Pair-wise Document Similarity (no normalization, for simplicity)

  D_1 = w_11, w_12, …, w_1t
  D_2 = w_21, w_22, …, w_2t

  sim(D_1, D_2) = Σ (i = 1..t) w_1i × w_2i

Using the table above:
  sim(A, B) = (1×5) + (3×2) = 11
  sim(A, C) = 0
  sim(A, D) = 0
  sim(B, C) = 0
  sim(B, D) = 0
  sim(C, D) = (2×4) + (1×1) = 9
Pair-wise Document Similarity (cosine normalization)

  D_1 = w_11, w_12, …, w_1t
  D_2 = w_21, w_22, …, w_2t

  unnormalized:       sim(D_1, D_2) = Σ (i = 1..t) w_1i × w_2i

  cosine-normalized:  sim(D_1, D_2) = [ Σ (i = 1..t) w_1i × w_2i ] / [ sqrt(Σ w_1i²) × sqrt(Σ w_2i²) ]
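Both versions applied to the slide's A–D vectors, stored as sparse dicts so only shared terms contribute to the sum:

```python
import math

# Term weights for documents A-D from the pair-wise similarity example.
A = {"nova": 1, "galaxy": 3, "heat": 1}
B = {"nova": 5, "galaxy": 2}
C = {"role": 2, "diet": 1, "fur": 5}
D = {"role": 4, "diet": 1}

def dot(u, v):
    """Unnormalized similarity: sum of products over shared terms."""
    return sum(u[t] * v[t] for t in u if t in v)

def cosine(u, v):
    """Cosine-normalized similarity."""
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return dot(u, v) / (norm(u) * norm(v))

print(dot(A, B), dot(C, D))  # 11 9  (unnormalized, matching the slide)
print(dot(A, C), dot(B, D))  # 0 0   (no shared terms)
print(round(cosine(A, B), 2), round(cosine(C, D), 2))
```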
Document/Document Matrix

        D_1   D_2   …   D_n
  D_1   d_11  d_12  …   d_1n
  D_2   d_21  d_22  …   d_2n
  …
  D_n   d_n1  d_n2  …   d_nn

  d_ij = similarity of D_i to D_j
Hierarchical Clustering (Agglomerative)

[Figure: dendrogram over items A B C D E F G H I, built bottom-up over three successive merge steps]
Types of Hierarchical Clustering
• Top-down vs. Bottom-up
• O(n²) vs. O(n³)
• Single-link vs. complete-link
  (local coherence vs. global coherence)
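A naive bottom-up single-link sketch over 1-D points with a made-up distance function (real use would start from the document/document similarity matrix). This brute-force version is the O(n³) variant:

```python
def single_link(points, dist, n_clusters):
    """Naive bottom-up single-link clustering: repeatedly merge the two
    clusters whose closest members are nearest each other."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: min(
            dist(a, b) for a in clusters[ij[0]] for b in clusters[ij[1]]))
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid
    return clusters

points = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
result = single_link(points, lambda a, b: abs(a - b), 3)
print(sorted(sorted(c) for c in result))  # [[0.0, 0.1, 0.2], [5.0, 5.1], [9.0]]
```

Swapping the inner `min` for `max` over member pairs would give complete-link, trading local coherence for global coherence.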
Flat Clustering
• K-Means
  – Hard
  – O(n)
• EM (a soft version of K-Means)
K-Means Clustering
1. Create a pair-wise similarity measure
2. Find K centers
3. Assign each document to the nearest center, forming new clusters
4. Repeat step 3 as necessary
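The four steps above in a toy 1-D form (the points and K are invented; absolute distance stands in for step 1's similarity measure, and a real run would use the document vectors from the earlier slides):

```python
import random

def kmeans(points, k, iters=10, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # step 2: find K centers
    for _ in range(iters):                       # step 4: repeat as necessary
        clusters = [[] for _ in range(k)]
        for p in points:                         # step 3: assign to nearest center
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
print(sorted(sorted(c) for c in kmeans(points, 2)))
# [[0.8, 1.0, 1.2], [7.9, 8.0, 8.3]]
```

Each pass over the points is linear in n, which is the O(n) behavior the slide contrasts with hierarchical methods.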
Scatter/Gather
Cutting, Pedersen, Tukey & Karger 1992, 1993; Hearst & Pedersen 1995
• Cluster sets of documents into general “themes”, like a table of contents
• Display the contents of the clusters by showing topical terms and typical titles
• User chooses subsets of the clusters and re-clusters the documents within
• Resulting new groups have different “themes”
S/G Example: query on “star” (encyclopedia text)

  14  sports
   8  symbols
  47  film, tv
  68  film, tv (p)
   7  music
  97  astrophysics
  67  astronomy (p)
  12  stellar phenomena
  10  flora/fauna
  49  galaxies, stars
  29  constellations
   7  miscellaneous

Clustering and re-clustering is entirely automated.
Another use of clustering
• Use clustering to map the entire huge multidimensional document space into a huge number of small clusters.
• “Project” these onto a 2D graphical representation:
Clustering Multi-Dimensional Document Space
Wise, Thomas, Pennock, Lantrip, Pottier, Schur, Crow“Visualizing the Non-Visual: Spatial analysis and interaction with Information from text documents,” 1995
Clustering Multi-Dimensional Document Space
[Figure: document-space visualization from Wise et al., 1995]
Concept “Landscapes”: Browsing without Search

[Figure: concept landscape with regions labeled Pharmacology, Anatomy, Legal, Disease, and Hospitals]

(e.g., Xia Lin, “Visualization for the Document Space,” 1992)
Based on Kohonen feature maps; see http://websom.hut.fi/websom/
More Examples of Information Visualization
• Stuart Card, Jock Mackinlay, Ben Shneiderman (eds.), Readings in Information Visualization (San Francisco: Morgan Kaufmann, 1999)
• Martin Dodge, www.cybergeography.org
Clustering
• Advantages:
  – See some main themes
• Disadvantage:
  – Many ways documents could group together are hidden
• Thinking point: what is the relationship to classification systems and faceted queries?
  e.g., f1: (osteoporosis OR ‘bone loss’), f2: (drugs OR pharmaceuticals), f3: (prevention OR cure)
More Information on Content Analysis and Clustering
• Christopher Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing (Cambridge, MA: MIT Press, 1999)
• Daniel Jurafsky and James Martin, Speech and Language Processing (Upper Saddle River, NJ: Prentice Hall, 2000)
Next Time
• Vector Space Ranking
• Probabilistic Models and Ranking