Scalable Clustering of Categorical Data
HKU CS Database Research Seminar, August 11th, 2004
Panagiotis Karras


Page 2

The Problem

Clustering is a problem of great importance: partitioning data into groups so that similar objects are grouped together.
Clustering of numerical data is well treated.
Clustering of categorical data is more challenging: there is no inherent distance measure.

Page 3

An Example

A Movie Relation:

Distance or similarity between values is not immediately obvious.

Page 4

Some Information Theory

A Mutual Information measure is employed.
Clusters should be informative about the data they contain: given a cluster, we should be able to predict the attribute values of its objects accurately.
Information loss should be minimized.

Page 5

An Example

In the Movie Relation, clustering C is better than clustering D according to this measure (why?).

Page 6

The Information Bottleneck Method

Formalized by Tishby et al. [1999].
Clustering: the compression of one random variable that preserves as much information as possible about another.

Conditional entropy of A given T:

H(A|T) = -\sum_{t \in T} p(t) \sum_{a \in A} p(a|t) \log p(a|t)

Captures the uncertainty of predicting the values of A given the values of T.
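As a concrete check of this definition, here is a minimal Python sketch; the joint-count dictionary is a made-up toy input, not data from the slides:

```python
import math

def conditional_entropy(joint):
    """H(A|T) = -sum_t p(t) sum_a p(a|t) log p(a|t),
    estimated from a {(t, a): count} dictionary."""
    total = sum(joint.values())
    # marginal p(t)
    p_t = {}
    for (t, _a), n in joint.items():
        p_t[t] = p_t.get(t, 0.0) + n / total
    h = 0.0
    for (t, _a), n in joint.items():
        p_ta = n / total                       # joint p(t, a)
        h -= p_ta * math.log2(p_ta / p_t[t])   # p(a|t) = p(t, a) / p(t)
    return h
```

When T fully determines A the conditional entropy is zero; when A is uniform and independent of T over two values, it is one bit.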

Page 7

The Information Bottleneck Method

Mutual Information quantifies the amount of information that variables convey about each other [Shannon, 1948]:

I(A;T) = H(T) - H(T|A) = H(A) - H(A|T)
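A small sketch of this quantity, using the equivalent form I(A;T) = \sum_{t,a} p(t,a) log( p(t,a) / (p(t) p(a)) ); the input format is the same toy joint-count dictionary as above, an assumption for illustration:

```python
import math

def mutual_information(joint):
    """I(A;T) = H(A) - H(A|T), computed equivalently as
    sum over (t, a) of p(t, a) * log2( p(t, a) / (p(t) * p(a)) )."""
    total = sum(joint.values())
    p_t, p_a = {}, {}
    for (t, a), n in joint.items():
        p_t[t] = p_t.get(t, 0.0) + n / total
        p_a[a] = p_a.get(a, 0.0) + n / total
    return sum(
        (n / total) * math.log2((n / total) / (p_t[t] * p_a[a]))
        for (t, a), n in joint.items()
    )
```

Independent variables give zero mutual information; two perfectly correlated binary variables share one bit.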

Page 8

The Information Bottleneck Method

Consider a set of n tuples on m attributes, and let d be the size of the set of all possible attribute values.
The data can then be conceptualized as an n×d matrix M, with M[t,a] = 1 iff tuple t contains value a.
The rows of the normalized M contain the conditional probability distributions p(A|t).
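The construction above can be sketched directly; the tiny relation below is invented for illustration (the slides' actual Movie Relation is not reproduced in this transcript):

```python
# Each tuple sets exactly m entries of its row to 1, so after
# normalization every set entry becomes p(a|t) = 1/m.
tuples = [
    ("Scorsese", "De Niro", "Crime"),
    ("Coppola", "De Niro", "Crime"),
    ("Hitchcock", "Stewart", "Thriller"),
]
# d = number of distinct attribute values across the relation
values = sorted({v for t in tuples for v in t})
index = {v: j for j, v in enumerate(values)}

M = []
for t in tuples:
    row = [0.0] * len(values)
    for v in t:
        row[index[v]] = 1.0
    m = len(t)
    M.append([x / m for x in row])  # row now holds p(A|t)
```

Every row of M sums to 1, as a conditional distribution must.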

Page 9

The Information Bottleneck Method

In the Movie Relation example:

Page 10

The Information Bottleneck Method

Clustering is a problem of maximizing the Mutual Information I(A;C) between attribute values and cluster identities, for a given number k of clusters [Tishby et al. 1999].
Finding the optimal clustering is NP-complete.
The Agglomerative Information Bottleneck was proposed by Slonim and Tishby [1999]: it starts with n clusters and reduces them by one at each step, so that the Information Loss in I(A;C) is minimized.
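The greedy reduction can be sketched as follows. Clusters are represented as pairs (p(c), p(A|c)), and `merge_loss` is an assumed hook for any pairwise information-loss function, e.g. the Jensen-Shannon-based distance defined for DCFs:

```python
def aib(clusters, merge_loss, k):
    """Agglomerative Information Bottleneck sketch: repeatedly merge
    the pair of clusters with the smallest information loss until
    only k clusters remain."""
    clusters = list(clusters)
    while len(clusters) > k:
        # find the cheapest pair to merge
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: merge_loss(clusters[ij[0]], clusters[ij[1]]),
        )
        (p1, d1), (p2, d2) = clusters[i], clusters[j]
        p = p1 + p2
        # merged cluster: summed probability, probability-weighted mixture
        merged = (p, [(p1 * x + p2 * y) / p for x, y in zip(d1, d2)])
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)]
        clusters.append(merged)
    return clusters
```

Each iteration scans all pairs, which is what makes the exact algorithm quadratic per step and motivates LIMBO's summary model.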

Page 11

LIMBO Clustering

scaLable InforMation BOttleneck.
Keeps only sufficient statistics in memory.
Compact summary model; clustering is based on the model.

Page 12

What is a DCF?

A cluster is summarized in a Distributional Cluster Feature (DCF): the pair of the probability of cluster c and the conditional probability distribution of the attribute values given c:

DCF(c) = ( p(c), p(A|c) )

The distance between DCFs is defined as the Information Loss incurred by merging the corresponding clusters (computed via the Jensen-Shannon divergence).
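A sketch of this distance, under the standard weighted Jensen-Shannon formulation (the exact weighting used in the paper is assumed here, not spelled out on the slide):

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence (base 2) between two distributions."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def dcf_merge_loss(dcf1, dcf2):
    """Information loss of merging two DCFs (p(c), p(A|c)): the merged
    cluster's probability times the weighted Jensen-Shannon divergence
    of the two conditional distributions."""
    (p1, d1), (p2, d2) = dcf1, dcf2
    p = p1 + p2
    w1, w2 = p1 / p, p2 / p
    mix = [w1 * a + w2 * b for a, b in zip(d1, d2)]
    return p * (w1 * kl(d1, mix) + w2 * kl(d2, mix))
```

Merging identical distributions loses nothing; merging two disjoint distributions loses the full bit that separated them.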

Page 13

The DCF Tree

A height-balanced tree of branching factor B.
The DCFs at the leaves define the clustering of the tuples.
Non-leaf nodes merge the DCFs of their children.
A compact hierarchical summarization of the data.

Page 14

The LIMBO algorithm

Three phases.
Phase 1: Insertion into the DCF tree.
o Each tuple t is converted to DCF(t).
o It follows a path downward in the tree along the closest non-leaf DCFs.
o At the leaf level, let DCF(c) be the entry closest to DCF(t).
o If there is an empty entry in the leaf of DCF(c), DCF(t) is placed there.
o If there is no empty entry but sufficient free space, the leaf is split in two halves, with the two farthest DCFs as seeds for the new leaves. The split moves upward as necessary.
o Else, if there is no space, the two closest DCF entries in {leaf, t} are merged.

Page 15

The LIMBO algorithm

Phase 2: Clustering.
o For a given value of k, the DCF tree is used to produce k DCFs that serve as representatives of k clusters, employing the Agglomerative Information Bottleneck algorithm.
Phase 3: Associating tuples with clusters.
o A scan over the data set is performed and each tuple is assigned to the closest cluster.

Page 16

Intra-Attribute Value Distance

How do we define the distance between categorical values of the same attribute?
Values should be placed within a context: similar values appear in similar contexts.
What is a suitable context? The distribution an attribute value induces on the remaining attributes.

Page 17

Intra-Attribute Value Distance

The distance between two values is then defined as the Information Loss incurred about the other attributes if we merge these values.
In the Movie example, Scorsese and Coppola are the most similar directors.
The distance between tuples is the sum of the distances between their attribute values.

Page 18

Experiments - Algorithms

Four algorithms are compared:
ROCK. An agglomerative algorithm by Guha et al. [1999].
COOLCAT. A scalable non-hierarchical algorithm, most similar to LIMBO, by Barbará et al. [2002].
STIRR. A dynamical systems approach using a hypergraph of weighted attribute values, by Gibson et al. [1998].
LIMBO. In addition to the space-bound version, LIMBO was implemented in an accuracy-control version, where a distance threshold is imposed on the decision of merging two DCFs, as a multiple φ of the average mutual information of all tuples. The two versions differ only in Phase 1.

Page 19

Experiments - Data Sets

The following data sets are used:
Congressional Votes (435 boolean tuples on 16 issues, from 1984, classified as Democrat or Republican).
Mushroom (8,124 tuples with 22 attributes, classified as poisonous or edible).
Database and Theory bibliography (8,000 tuples on research papers, with 4 attributes).
Synthetic data sets (5,000 tuples, 10 attributes; DS5 and DS10 for 5 and 10 classes).
Web data (web pages - a tuple set of authorities, with the hubs that link to them as attributes).

Page 20

Experiments - Quality Measures

Several measures are used to capture the subjectivity of clustering quality:
Information Loss. The lower the better.
Category Utility. The difference between the expected number of correctly guessed attribute values with and without a clustering.
Min Classification Error. For tuples already classified.
Precision (P), Recall (R). P measures the accuracy with which a cluster reproduces a class, and R the completeness with which this is done.
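For labelled data, the last two measures can be sketched as follows; matching each cluster to its majority class is an assumption of this sketch, not a rule stated on the slide:

```python
from collections import Counter

def precision_recall(assignments, labels):
    """Per-cluster precision and recall, matching each cluster to its
    most frequent class label (assumed matching rule)."""
    members = {}
    for c, y in zip(assignments, labels):
        members.setdefault(c, []).append(y)
    class_sizes = Counter(labels)
    scores = {}
    for c, ys in members.items():
        cls, hits = Counter(ys).most_common(1)[0]
        # precision: fraction of the cluster in its matched class;
        # recall: fraction of the class captured by the cluster
        scores[c] = (hits / len(ys), hits / class_sizes[cls])
    return scores
```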

Page 21

Quality-Efficiency trade-offs for LIMBO

Whether we control the size (S) or the accuracy (φ) of the model, there is a trade-off between expressiveness (large S, small φ) and compactness (small S, large φ).
For branching factor B = 4 we obtain:
For large S and small φ, the bottleneck is Phase 2.

Page 22

Quality-Efficiency trade-offs for LIMBO

Still, in Phase 1 we can obtain significant compression of the data sets at no expense in the final quality.
This consistency can be attributed in part to the effect of Phase 3, which assigns tuples to cluster representatives.
Even for large values of φ and small values of S, LIMBO obtains essentially the same clustering quality as AIB, but in linear time.

Page 23

Comparative Evaluations

The tables show the results for all algorithms and all quality measures on the Votes and Mushroom data sets.
LIMBO's quality is superior to ROCK and COOLCAT.
COOLCAT comes closest to LIMBO.

Page 24

Web Data

Authorities are clustered into three clusters with information loss 61%.
LIMBO accurately characterizes the structure of the web graph.
The three clusters correspond to different viewpoints (pro, against, irrelevant).

Page 25

Scalability Evaluation

Four data sets of size 500K, 1M, 5M, 10M tuples (10 clusters, 10 attributes each).
Phase 1 in detail for LIMBOφ:
For 1.0 < φ < 1.5, manageable size and fast execution time.

Page 26

Scalability Evaluation

We set φ = 1.2, 1.3 and S = 1MB, 5MB.
Time scales linearly with the data set size.
The number of attributes was also varied, again with linear behavior.

Page 27

Scalability - Quality Results

The quality measures remain the same for different data set sizes.

Page 28

Conclusions

LIMBO has advantages over other information-theoretic clustering algorithms in terms of scalability and quality.
LIMBO is the only hierarchical, scalable categorical clustering algorithm, thanks to its compact summary model.

Page 29

Main Reference

P. Andritsos, P. Tsaparas, R. J. Miller, K. C. Sevcik. LIMBO: Scalable Clustering of Categorical Data. In 9th International Conference on Extending Database Technology (EDBT), Heraklion, Greece, 2004.

Page 30

References

D. Barbará, J. Couto, and Y. Li. COOLCAT: An entropy-based algorithm for categorical clustering. In CIKM, McLean, VA, 2002.
D. Gibson, J. M. Kleinberg, and P. Raghavan. Clustering Categorical Data: An Approach Based on Dynamical Systems. In VLDB, New York, NY, 1998.
S. Guha, R. Rastogi, and K. Shim. ROCK: A Robust Clustering Algorithm for Categorical Attributes. In ICDE, Sydney, Australia, 1999.
C. Shannon. A Mathematical Theory of Communication, 1948.
N. Slonim and N. Tishby. Agglomerative Information Bottleneck. In NIPS, Breckenridge, 1999.
N. Tishby, F. C. Pereira, and W. Bialek. The Information Bottleneck Method. In 37th Annual Allerton Conference on Communication, Control and Computing, Urbana-Champaign, IL, 1999.