cs707_030112


Transcript of cs707_030112

  • Hierarchical Clustering algorithms

    Agglomerative (bottom-up):
    Precondition: Start with each document as a separate cluster.
    Postcondition: Eventually all documents belong to the same cluster.

    Divisive (top-down):
    Precondition: Start with all documents belonging to the same cluster.
    Postcondition: Eventually each document forms a cluster of its own.

    Hierarchical clustering does not require the number of clusters k in advance, but it needs a termination/readout condition.

  • Hierarchical Agglomerative Clustering (HAC): Algorithm

    Start with all instances in their own cluster.
    Until there is only one cluster:
      Among the current clusters, determine the two clusters, ci and cj, that are most similar.
      Replace ci and cj with a single cluster ci ∪ cj.

    A minimal code sketch of this loop follows.
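    A minimal, unoptimized Python sketch of the HAC loop above, assuming a user-supplied cluster-similarity function (here called cluster_sim, an illustrative name, e.g. single-link or complete-link as defined later); as written it runs in O(n³):

```python
# Naive HAC: repeatedly merge the two most similar clusters until one remains.
def hac(instances, cluster_sim):
    clusters = [[x] for x in instances]          # start: every instance is its own cluster
    merges = []                                  # record of merges (the dendrogram)
    while len(clusters) > 1:
        # find the most similar pair of current clusters
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cluster_sim(clusters[i], clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))
        # replace ci and cj with the single cluster ci ∪ cj
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges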

  • Dendrogram: Document Example

    As clusters agglomerate, docs are likely to fall into a hierarchy of topics or concepts.

    [Dendrogram over documents d1 through d5: d1 and d2 merge; d4 and d5 merge; d3 then joins {d4, d5}.]

  • Key notion: cluster representative

    We want a notion of a representative point in a cluster, to represent the location of each cluster.
    The representative should be some sort of typical or central point in the cluster, e.g.:
      the point inducing the smallest radii to docs in the cluster,
      the point inducing the smallest squared distances, or
      the point that is the average of all docs in the cluster: the centroid or center of gravity.
    Measure intercluster distances by distances between centroids.

  • Example: n = 6, k = 3, closest pair of centroids

    [Figure: six documents d1 through d6; the centroid is shown after the first merge step and again after the second.]

  • Outliers in centroid computation

    Outliers can be ignored when computing the centroid.
    What is an outlier? There are many statistical definitions, e.g. the moment of a point to the centroid exceeds M times some cluster moment; say M = 10.

    [Figure: a cluster with its centroid and a distant outlier.]

  • Closest pair of clusters

    There are many variants of defining the closest pair of clusters:
    Single-link: similarity of the most cosine-similar pair of points.
    Complete-link: similarity of the furthest points, i.e. the least cosine-similar pair.
    Centroid: clusters whose centroids (centers of gravity) are the most cosine-similar.
    Average-link: average cosine similarity between pairs of elements.

    A SciPy sketch of these variants follows.
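    As an illustration (not from the slides), the same four variants are available in SciPy's hierarchical-clustering module, assuming SciPy is installed; the toy documents below are random stand-ins. Cosine distances are used for single/complete/average linkage; centroid linkage in SciPy is defined for Euclidean distances, so it is shown on the raw vectors:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

docs = np.random.rand(6, 10)                   # 6 toy "documents" in a 10-term space

d = pdist(docs, metric="cosine")               # condensed pairwise cosine distances
Z_single   = linkage(d, method="single")       # most-similar pair of points
Z_complete = linkage(d, method="complete")     # least-similar pair of points
Z_average  = linkage(d, method="average")      # average over all cross-pairs
Z_centroid = linkage(docs, method="centroid")  # distance between centroids (Euclidean)

# e.g. cut the single-link dendrogram into 3 clusters
labels = fcluster(Z_single, t=3, criterion="maxclust")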

  • Single Link Agglomerative Clustering

    Use the maximum similarity over pairs:

    $\mathrm{sim}(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} \mathrm{sim}(x, y)$

    Can result in straggly (long and thin) clusters due to the chaining effect.

    After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:

    $\mathrm{sim}((c_i \cup c_j), c_k) = \max\big(\mathrm{sim}(c_i, c_k),\, \mathrm{sim}(c_j, c_k)\big)$

  • Single Link Example (figure)

  • Complete Link Agglomerative Clustering

    Use the minimum similarity over pairs:

    $\mathrm{sim}(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} \mathrm{sim}(x, y)$

    Makes tighter, spherical clusters that are typically preferable.

    After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:

    $\mathrm{sim}((c_i \cup c_j), c_k) = \min\big(\mathrm{sim}(c_i, c_k),\, \mathrm{sim}(c_j, c_k)\big)$

    [Figure: clusters Ci, Cj, Ck.]

    Both merge-update rules are sketched in code below.
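    The two merge-update rules fit in a few lines; this sketch assumes a precomputed similarity table `sim` keyed by frozensets of cluster ids (illustrative names, not from the slides):

```python
# Merge-update rules for single-link and complete-link, as in the formulas above.
# `sim` is an assumed dict mapping frozenset({a, b}) -> similarity of clusters a, b.
def merged_similarity(sim, ci, cj, ck, linkage="single"):
    s_ik = sim[frozenset({ci, ck})]
    s_jk = sim[frozenset({cj, ck})]
    # sim((ci ∪ cj), ck) = max(...) for single-link, min(...) for complete-link
    return max(s_ik, s_jk) if linkage == "single" else min(s_ik, s_jk)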

  • Complete Link Example (figure)

  • Computational Complexity

    In the first iteration, all HAC methods need to compute the similarity of all pairs of n individual instances, which is O(n²).
    In each of the subsequent n − 2 merging iterations, compute the distance between the most recently created cluster and all other existing clusters.
    In order to maintain an overall O(n²) performance, computing the similarity to each cluster must be done in constant time.
    The cost is often O(n³) if done naively, or O(n² log n) if done more cleverly.

  • Group Average Agglomerative Clustering

    Use the average similarity across all pairs within the merged cluster to measure the similarity of two clusters.
    This is a compromise between single and complete link.
    Two options:
      averaged across all ordered pairs in the merged cluster, or
      averaged over all pairs between the two original clusters.
    There is no clear difference in efficacy.

    $\mathrm{sim}(c_i, c_j) = \frac{1}{|c_i \cup c_j|\,(|c_i \cup c_j| - 1)} \sum_{x \in c_i \cup c_j} \; \sum_{y \in c_i \cup c_j,\ y \neq x} \mathrm{sim}(x, y)$

  • Computing Group Average Similarity

    Assume cosine similarity and vectors normalized to unit length.
    Always maintain the sum of vectors in each cluster:

    $\vec{s}(c_j) = \sum_{\vec{x} \in c_j} \vec{x}$

    Then the similarity of clusters can be computed in constant time:

    $\mathrm{sim}(c_i, c_j) = \frac{\big(\vec{s}(c_i) + \vec{s}(c_j)\big) \cdot \big(\vec{s}(c_i) + \vec{s}(c_j)\big) - \big(|c_i| + |c_j|\big)}{\big(|c_i| + |c_j|\big)\big(|c_i| + |c_j| - 1\big)}$

    A short code sketch of this constant-time computation follows.
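    A sketch of the constant-time computation, assuming unit-length document vectors and per-cluster sums maintained incrementally (function and variable names are illustrative):

```python
import numpy as np

# Group-average similarity in O(1) given per-cluster vector sums, as in the formula above.
def group_average_sim(sum_i, n_i, sum_j, n_j):
    s = sum_i + sum_j                 # s(ci) + s(cj), maintained incrementally per cluster
    n = n_i + n_j
    # (s·s - n) is the sum of dot products over all distinct ordered pairs,
    # since each unit vector's self-dot-product contributes exactly 1.
    return (np.dot(s, s) - n) / (n * (n - 1))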

  • Medoid as Cluster Representative

    The centroid does not have to be a document.
    Medoid: a cluster representative that is one of the documents (e.g., the document closest to the centroid).
    One reason this is useful: consider the representative of a large cluster (>1K docs); the centroid of this cluster will be a dense vector, whereas the medoid remains an actual (sparse) document.

    A small code sketch follows.
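    A small sketch of picking the medoid as the document closest to the centroid, assuming the rows of `docs` are length-normalized (illustrative names):

```python
import numpy as np

# Pick the medoid: the document with the highest cosine similarity to the centroid.
def medoid(docs):
    docs = np.asarray(docs)
    centroid = docs.mean(axis=0)
    sims = docs @ centroid            # proportional to cosine when rows have unit length
    return docs[np.argmax(sims)]      # an actual document from the cluster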

  • Efficiency: Using approximations

    In the standard algorithm, we must find the closest pair of centroids at each step.
    Approximation: instead, find a nearly closest pair; use some data structure that makes this approximation easier to maintain.
    A simplistic example: maintain the closest pair based on distances in a projection onto a random line (sketched in code below).
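    A sketch of that simplistic example: project the centroids onto a random direction and only compare centroids that end up adjacent along that line (illustrative names, not the slides' implementation):

```python
import numpy as np

# Approximate the closest pair of centroids via a 1-D projection onto a random line.
def approx_closest_pair(centroids, rng=np.random.default_rng()):
    centroids = np.asarray(centroids)
    direction = rng.standard_normal(centroids.shape[1])
    order = np.argsort(centroids @ direction)        # positions along the random line
    best = None
    for a, b in zip(order[:-1], order[1:]):          # only adjacent pairs are checked
        dist = np.linalg.norm(centroids[a] - centroids[b])
        if best is None or dist < best[0]:
            best = (dist, a, b)
    return best[1], best[2]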

  • Major issue: labeling

    After the clustering algorithm finds clusters, how can they be made useful to the end user?
    We need a pithy label for each cluster.
    In search results, say "Animal" or "Car" in the "jaguar" example.
    In topic trees (Yahoo), we need navigational cues; labeling is often done by hand, a posteriori.

  • How to Label Clusters

    Show titles of typical documents:
      Titles are easy to scan; authors create them for quick scanning!
      But you can only show a few titles, which may not fully represent the cluster.
    Show words/phrases prominent in the cluster:
      More likely to fully represent the cluster.
      Use distinguishing words/phrases (differential labeling).

  • Labeling

    Common heuristic: list the 5 to 10 most frequent terms in the centroid vector (drop stop-words; stem).
    Differential labeling by frequent terms:
      Within a collection on "Computers", all clusters have the word "computer" as a frequent term.
      Use discriminant analysis of centroids.
      Perhaps better: a distinctive noun phrase.

    A small code sketch of the frequent-terms heuristic follows.
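    A sketch of the frequent-terms heuristic, assuming a dense centroid over a `vocabulary` list and a `stop_words` set (illustrative names):

```python
import numpy as np

# Label a cluster with the highest-weight terms of its centroid, minus stop-words.
def cluster_label(doc_vectors, vocabulary, stop_words, k=5):
    centroid = np.asarray(doc_vectors).mean(axis=0)
    order = np.argsort(centroid)[::-1]               # highest-weight terms first
    terms = [vocabulary[i] for i in order if vocabulary[i] not in stop_words]
    return terms[:k]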

  • What is a Good Clustering?

    Internal criterion: a good clustering will produce high-quality clusters in which:
      the intra-class (that is, intra-cluster) similarity is high, and
      the inter-class similarity is low.
    The measured quality of a clustering depends on both the document representation and the similarity measure used.

  • External criteria for clustering quality

    Quality is measured by the clustering's ability to discover some or all of the hidden patterns or latent classes in gold-standard data.
    Assessing a clustering with respect to ground truth requires labeled data.
    Assume documents with C gold-standard classes, while our clustering algorithm produces K clusters, ω1, ω2, …, ωK, where cluster ωi has ni members.

  • External Evaluation of Cluster Quality

    A simple measure: purity, the ratio between the size of the dominant class in cluster ωi and the size of cluster ωi:

    $\mathrm{Purity}(\omega_i) = \frac{1}{n_i} \max_{j \in C} n_{ij}$

    Other measures include the entropy of classes in clusters (or the mutual information between classes and clusters).

  • Purity example

    [Figure: three clusters of points drawn from three classes.]

    Cluster I:   Purity = 1/6 · max(5, 1, 0) = 5/6
    Cluster II:  Purity = 1/6 · max(1, 4, 1) = 4/6
    Cluster III: Purity = 1/5 · max(2, 0, 3) = 3/5
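    The three purity values can be recomputed directly from the per-cluster class counts in the figure:

```python
# Per-cluster counts of documents from each of the three gold-standard classes.
counts = {
    "Cluster I":   [5, 1, 0],
    "Cluster II":  [1, 4, 1],
    "Cluster III": [2, 0, 3],
}
for name, c in counts.items():
    purity = max(c) / sum(c)
    print(name, purity)   # 5/6 ≈ 0.833, 4/6 ≈ 0.667, 3/5 = 0.6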

  • Rand Index

    Number of points:

                                          Same cluster      Different clusters
                                          in clustering     in clustering
    Same class in ground truth                  A                   C
    Different classes in ground truth           B                   D

  • Rand index: symmetric version

    $RI = \frac{A + D}{A + B + C + D}$

    Compare with standard Precision and Recall:

    $P = \frac{A}{A + B} \qquad R = \frac{A}{A + C}$

  • Rand Index example: RI = 0.68

                                          Same cluster      Different clusters
                                          in clustering     in clustering
    Same class in ground truth                 20                  24
    Different classes in ground truth          20                  72
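    Plugging the table's pair counts into the formulas from the previous slide:

```python
# Rand index, precision and recall for the pair counts in the table above.
A, B, C, D = 20, 20, 24, 72

RI = (A + D) / (A + B + C + D)  # (20 + 72) / 136 ≈ 0.68
P  = A / (A + B)                # 20 / 40 = 0.5
R  = A / (A + C)                # 20 / 44 ≈ 0.45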

  • Final word and resources

    In clustering, clusters are inferred from the data without human input (unsupervised learning).
    However, in practice it is a bit less clear-cut: there are many ways of influencing the outcome of clustering, such as the number of clusters, the similarity measure, the representation of documents, and so on.

  • SKIP WHAT FOLLOWS

  • Term vs. document space

    So far, we clustered docs based on their similarities in term space.
    For some applications, e.g., topic analysis for inducing navigation structures, we can dualize:
      use docs as axes,
      represent (some) terms as vectors,
      base proximity on co-occurrence of terms in docs,
      now clustering terms, not docs.

  • Term vs. document space (continued)

    Cosine computation:
      constant for docs in term space,
      grows linearly with corpus size for terms in doc space.
    Cluster labeling: clusters have clean descriptions in terms of noun-phrase co-occurrence.
    Application of term clusters.

  • Multi-lingual docs

    E.g., Canadian government docs: every doc is in English with an equivalent in French.
    We must cluster by concepts rather than by language.
    Simplest approach: pad docs in one language with dictionary equivalents in the other; thus each doc has a representation in both languages.
    Axes are terms in both languages.

  • Feature selection

    Which terms to use as axes for the vector space?
    IDF is a form of feature selection, but it can exaggerate noise, e.g. mis-spellings.
    Better to use the highest-weight mid-frequency words: the most discriminating terms.
    Pseudo-linguistic heuristics, e.g.:
      drop stop-words,
      stemming/lemmatization,
      use only nouns/noun phrases.

  • Evaluation of clustering

    Perhaps the most substantive issue in data mining in general: how do you measure goodness?
    Most measures focus on computational efficiency (time and space).
    For the application of clustering to search: measure retrieval effectiveness.

  • Approaches to evaluating

    Anecdotal
    Ground truth comparison
    Cluster retrieval
    Purely quantitative measures:
      probability of generating the clusters found,
      average distance between cluster members.
    Microeconomic / utility

  • Anecdotal evaluation

    Probably the commonest (and surely the easiest): "I wrote this clustering algorithm and look what it found!"
    No benchmarks, no comparison possible.
    Any clustering algorithm will pick up the easy stuff, like partition by languages.
    Generally of unclear scientific value.

  • Ground truth comparison

    Take a union of docs from a taxonomy (Yahoo!, ODP, newspaper sections) and cluster.
    Compare clustering results to the baseline, e.g., "80% of the clusters found map cleanly to taxonomy nodes".
    But is it the right answer? There can be several equally right answers.
    For the docs given, the static prior taxonomy may be incomplete or wrong in places, so the comparison is subjective.

  • Microeconomic viewpoint

    Anything, including clustering, is only as good as the economic utility it provides.
    For clustering: the net economic gain produced by an approach (vs. another approach).
    Strive for a concrete optimization problem.
    Examples:
      recommendation systems,
      clock time for interactive search (expensive).

  • Evaluation example: Cluster retrieval

    Ad-hoc retrieval: cluster the docs in the returned set, identify the best cluster, and only retrieve docs from it.
    How do various clustering methods affect the quality of what is retrieved?
    Concrete measure of quality: precision as measured by user judgements for these queries.

  • Evaluation

    Compare two IR algorithms:
      1. send query, present ranked results;
      2. send query, cluster results, present clusters.
    The experiment was simulated (no users):
      results were clustered into 5 clusters,
      clusters were ranked according to the percentage of relevant documents,
      documents within clusters were ranked according to similarity to the query.

  • Latent Semantic Indexing

    Adapted from lectures by Prabhaker Raghavan, Christopher Manning, and Thomas Hoffmann.

  • Today's topic: Latent Semantic Indexing

    Term-document matrices are very large, but the number of topics that people talk about is small (in some sense): clothes, movies, politics, …
    Can we represent the term-document space by a lower-dimensional latent space?

  • Linear Algebra Background

  • Eigenvalues & Eigenvectors

    Eigenvectors (for a square m × m matrix S): a (right) eigenvector $\vec{v}$ and eigenvalue $\lambda$ satisfy

    $S\vec{v} = \lambda \vec{v}$

    How many eigenvalues are there at most? $(S - \lambda I)\vec{v} = 0$ only has a non-zero solution if $|S - \lambda I| = 0$; this is an m-th order equation in $\lambda$, which can have at most m distinct solutions (the roots of the characteristic polynomial). They can be complex even though S is real.

  • Matrix-vector multiplication

    $S = \begin{pmatrix} 30 & 0 & 0 \\ 0 & 20 & 0 \\ 0 & 0 & 1 \end{pmatrix}$ has eigenvalues 30, 20, 1 with corresponding eigenvectors

    $\vec{v}_1 = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \quad \vec{v}_2 = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}, \quad \vec{v}_3 = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}$

    On each eigenvector, S acts as a multiple of the identity matrix, but as a different multiple on each.

    Any vector, say $\vec{x} = \begin{pmatrix} 2 \\ 4 \\ 6 \end{pmatrix}$, can be viewed as a combination of the eigenvectors: $\vec{x} = 2\vec{v}_1 + 4\vec{v}_2 + 6\vec{v}_3$.

  • Matrix-vector multiplication (continued)

    Thus a matrix-vector multiplication such as $S\vec{x}$ (S, $\vec{x}$ as on the previous slide) can be rewritten in terms of the eigenvalues/eigenvectors:

    $S\vec{x} = S(2\vec{v}_1 + 4\vec{v}_2 + 6\vec{v}_3) = 2S\vec{v}_1 + 4S\vec{v}_2 + 6S\vec{v}_3 = 2\lambda_1\vec{v}_1 + 4\lambda_2\vec{v}_2 + 6\lambda_3\vec{v}_3 = 60\vec{v}_1 + 80\vec{v}_2 + 6\vec{v}_3$

    Even though $\vec{x}$ is an arbitrary vector, the action of S on $\vec{x}$ is determined by the eigenvalues/eigenvectors.

  • Matrix-vector multiplication (continued)

    Observation: the effect of "small" eigenvalues is small. If we ignored the smallest eigenvalue (1), then instead of

    $\begin{pmatrix} 60 \\ 80 \\ 6 \end{pmatrix}$ we would get $\begin{pmatrix} 60 \\ 80 \\ 0 \end{pmatrix}$

    These vectors are similar (in terms of cosine similarity), or close (in terms of Euclidean distance).

    A short numerical check follows.
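    A quick NumPy check of the last two slides, assuming NumPy is available:

```python
import numpy as np

# S acts on x through its eigenvalues/eigenvectors.
S = np.diag([30.0, 20.0, 1.0])
x = np.array([2.0, 4.0, 6.0])           # x = 2·v1 + 4·v2 + 6·v3

print(S @ x)                            # [60. 80. 6.]

# Dropping the contribution of the smallest eigenvalue (1) barely changes the result:
S_approx = np.diag([30.0, 20.0, 0.0])
print(S_approx @ x)                     # [60. 80. 0.], close to [60, 80, 6]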

  • Eigenvalues & Eigenvectors (symmetric matrices)

    For symmetric matrices, eigenvectors for distinct eigenvalues are orthogonal:

    $S\vec{v}_{1,2} = \lambda_{1,2}\vec{v}_{1,2}$ and $\lambda_1 \neq \lambda_2 \Rightarrow \vec{v}_1 \cdot \vec{v}_2 = 0$

    All eigenvalues of a real symmetric matrix are real:

    $|S - \lambda I| = 0$ and $S = S^{\mathsf{T}} \Rightarrow \lambda \in \mathbb{R}$

    All eigenvalues of a positive semi-definite matrix are non-negative:

    $\forall \vec{w} \in \mathbb{R}^n,\ \vec{w}^{\mathsf{T}} S \vec{w} \geq 0$; then $S\vec{v} = \lambda\vec{v} \Rightarrow \lambda \geq 0$

  • Example

    Let $S = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$ (real, symmetric).

    Then $|S - \lambda I| = \begin{vmatrix} 2-\lambda & 1 \\ 1 & 2-\lambda \end{vmatrix} = (2-\lambda)^2 - 1 = 0$.

    The eigenvalues are 1 and 3 (non-negative, real). Plug in these values and solve for the eigenvectors.
    The eigenvectors $\begin{pmatrix} 1 \\ -1 \end{pmatrix}$ and $\begin{pmatrix} 1 \\ 1 \end{pmatrix}$ are orthogonal (and real).
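    A NumPy check of this example (eigh is NumPy's eigensolver for symmetric matrices):

```python
import numpy as np

# Eigenvalues 1 and 3, with orthonormal (real) eigenvectors.
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])
vals, vecs = np.linalg.eigh(S)   # eigh: for symmetric matrices
print(vals)                      # [1. 3.]
print(vecs.T @ vecs)             # identity: the eigenvectors are orthonormal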

  • Eigen/diagonal Decomposition

    Let S be a square m × m matrix with m linearly independent eigenvectors (a non-defective matrix).
    Theorem: there exists an eigen decomposition (cf. the matrix diagonalization theorem)

    $S = U \Lambda U^{-1}$

    where $\Lambda$ is diagonal, the columns of U are the eigenvectors of S, and the diagonal elements of $\Lambda$ are the eigenvalues of S.
    The decomposition is unique for distinct eigenvalues.

  • Diagonal decomposition: why/how

    Let U have the eigenvectors as columns: $U = \begin{pmatrix} \vec{v}_1 & \cdots & \vec{v}_n \end{pmatrix}$.

    Then SU can be written

    $SU = S\begin{pmatrix} \vec{v}_1 & \cdots & \vec{v}_n \end{pmatrix} = \begin{pmatrix} \lambda_1\vec{v}_1 & \cdots & \lambda_n\vec{v}_n \end{pmatrix} = U\Lambda$

    Thus $SU = U\Lambda$, or $U^{-1}SU = \Lambda$, and $S = U\Lambda U^{-1}$.

  • Diagonal decomposition: example

    Recall $S = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$ with eigenvalues $\lambda_1 = 1$, $\lambda_2 = 3$.

    The eigenvectors $\begin{pmatrix} 1 \\ -1 \end{pmatrix}$ and $\begin{pmatrix} 1 \\ 1 \end{pmatrix}$ form $U = \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix}$.

    Inverting, we have $U^{-1} = \begin{pmatrix} 1/2 & -1/2 \\ 1/2 & 1/2 \end{pmatrix}$.

    Then $S = U\Lambda U^{-1} = \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 0 & 3 \end{pmatrix}\begin{pmatrix} 1/2 & -1/2 \\ 1/2 & 1/2 \end{pmatrix}$.

    Recall that $U U^{-1} = I$.
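    Verifying the worked decomposition numerically with NumPy:

```python
import numpy as np

# Check that S = U Λ U⁻¹ recovers the original matrix.
U    = np.array([[1.0, 1.0],
                 [-1.0, 1.0]])          # columns: eigenvectors for λ = 1 and λ = 3
Lam  = np.diag([1.0, 3.0])
Uinv = np.linalg.inv(U)                 # [[ 0.5, -0.5], [ 0.5, 0.5]]

print(U @ Lam @ Uinv)                   # [[2. 1.] [1. 2.]] = S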

  • Example continued

    Let's divide U (and multiply $U^{-1}$) by $\sqrt{2}$. Then

    $S = \underbrace{\begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}}_{Q}\begin{pmatrix} 1 & 0 \\ 0 & 3 \end{pmatrix}\underbrace{\begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}}_{Q^{-1} = Q^{\mathsf{T}}}$

    Why? Stay tuned.
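    And checking that the rescaled Q is orthogonal and still reconstructs S:

```python
import numpy as np

# Normalizing the eigenvectors (dividing U by √2) makes Q orthogonal: Q⁻¹ = Qᵀ.
Q = np.array([[1.0, 1.0],
              [-1.0, 1.0]]) / np.sqrt(2.0)
Lam = np.diag([1.0, 3.0])

print(np.allclose(Q.T, np.linalg.inv(Q)))   # True: Q is orthogonal
print(Q @ Lam @ Q.T)                        # [[2. 1.] [1. 2.]] = S again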

  • Symmetric Eigen Decomposition

    If S is a symmetric matrix:
    Theorem: there exists a (unique) eigen decomposition $S = Q \Lambda Q^{\mathsf{T}}$ where Q is orthogonal:
      $Q^{-1} = Q^{\mathsf{T}}$,
      the columns of Q are normalized eigenvectors,
      the columns are orthogonal,
      (everything is real).