Clustering Categorical Data: An Approach Based on Dynamical Systems (1998) David Gibson, Jon Kleinberg, Prabhakar Raghavan VLDB Journal: Very Large Data Bases Aaron Sherman

Clustering Categorical Data: An Approach Based on Dynamical Systems (1998)

David Gibson, Jon Kleinberg, Prabhakar Raghavan VLDB Journal: Very Large Data Bases

Aaron Sherman

Presentation

– What is this presentation about?
– Definitions and Algorithms
– Evaluations with Generated Data
– Real World Test
– Conclusions + Q&A

Categorize this!

Categorizing integers is easy, but what about words like “red,” “blue,” “August,” and “Moorthy”?

STIRR – Sieving Through Iterated Relational Reinforcement

Why is STIRR Better?

– No a priori quantization
– Correlation vs. categorical similarity
– New methods for hypergraph clustering

Definitions

Table of Relational Data
– Set T of tuples
– Set of k fields (columns), each with many possible values
– Abstract node – one for each possible value in each field
– Tuple Γ ∈ T – consists of one node from each field
Configuration w – assigns a weight w_v to each node v
N(w) – Normalization function – rescales all weights so their squares add up to 1
Dynamical system – repeated application of a function f
Fixed point – point u where f(u) = u
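
A minimal sketch of these definitions, assuming a node is represented as a (field index, value) pair. The table contents and the names `normalize`, `nodes`, and `config` are illustrative, not taken from the paper.

```python
import math

def normalize(weights):
    """N(w): rescale all weights so that their squares add up to 1."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        raise ValueError("cannot normalize an all-zero configuration")
    return {node: w / norm for node, w in weights.items()}

# Hypothetical table: each tuple holds one value per field (column).
table = [("red", "august"), ("blue", "august"), ("red", "july")]

# One abstract node per (field index, value) pair that occurs in the table.
nodes = {(i, value) for row in table for i, value in enumerate(row)}

# A configuration assigns a weight to every node; here all equal, then normalized.
config = normalize({node: 1.0 for node in nodes})
```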

Where is all this going?

Weighting Scheme

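To update the weight w_v:
– For each tuple Γ = {v, u_1, …, u_{k-1}} containing v:
  • x_Γ ← ⊕(u_1, …, u_{k-1}), where ⊕ is the combining operator applied to the weights of u_1, …, u_{k-1}
– w_v ← Σ_Γ x_Γ
Applying N() to the updated weights gives f(w)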
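
The update can be sketched as one function application, assuming nodes are (field index, value) pairs and that every node is updated from the previous configuration at once (a synchronous update, which is an assumption here). The name `stirr_step` and the example table are hypothetical.

```python
import math
from collections import defaultdict

def stirr_step(table, weights, combine=sum):
    """One application of f: update every node's weight, then normalize."""
    new = defaultdict(float)
    for row in table:
        row_nodes = list(enumerate(row))          # node = (field index, value)
        for v in row_nodes:
            # Combine the weights of the other nodes in this tuple ...
            others = [weights[u] for u in row_nodes if u != v]
            # ... and add the result into the new weight of v.
            new[v] += combine(others)
    norm = math.sqrt(sum(w * w for w in new.values()))
    return {node: w / norm for node, w in new.items()}

# Hypothetical data: "red" and "august" co-occur more often than "blue" and "july".
table = [("red", "august", "a"), ("red", "august", "b"), ("blue", "july", "c")]
nodes = {(i, val) for row in table for i, val in enumerate(row)}
w = {node: 1.0 for node in nodes}
for _ in range(5):                                 # dynamical system: iterate f
    w = stirr_step(table, w)
```

With the addition operator as `combine`, the frequently co-occurring values end up with the larger weights.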

Combining Operator Π

Product operator Π: ⊕(w_1, …, w_k) = w_1 · w_2 · … · w_k
– Non-linear term – encodes co-occurrence strongly
– Does not converge
– Relatively small number of large basins
– Very useful data in early iterations

Combining Operator +

Addition operator +: ⊕(w_1, …, w_k) = w_1 + w_2 + … + w_k
– Linear
– Does a good job converging

Combining Operator Sp

Sp – Combining rule: ⊕(w_1, …, w_k) = ((w_1)^p + … + (w_k)^p)^(1/p)
– Non-linear term – encodes co-occurrence strongly
– Does a good job converging

Combining Operator S∞

S∞ – Limiting version of Sp as p → ∞
– Takes the largest value among the weights
– Easy to compute, sum-like properties
– Converges the best of all options shown
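
The four combining operators can be written side by side. This is a sketch, assuming Sp is the rule that sums the p-th powers and takes the p-th root, and assuming non-negative weights so that fractional powers are well defined. The function names are illustrative.

```python
import math

def product(ws):
    """Product operator: w_1 * w_2 * ... * w_k."""
    return math.prod(ws)

def addition(ws):
    """Addition operator: w_1 + w_2 + ... + w_k."""
    return sum(ws)

def s_p(ws, p=2):
    """Sp rule (assumed form): (w_1^p + ... + w_k^p)^(1/p), for non-negative weights."""
    return sum(w ** p for w in ws) ** (1.0 / p)

def s_inf(ws):
    """Limiting version of Sp: the largest of the weights."""
    return max(ws)

ws = [0.1, 0.2, 0.4]
```

As p grows, `s_p(ws, p)` approaches `s_inf(ws)`, which is why the max rule is described as the limiting version.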

Initial Configuration

Uniform initialization – all weights = 1
Random initialization – independently choose a value in [0, 1] for each weight, then normalize
– Some operators are more sensitive to initial configurations than others
Masking / Modification – a specific rule that sets certain nodes to a higher or lower value
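
A small sketch of the three options. Normalizing the uniform configuration and pinning a single node to a fixed value are assumptions made for illustration; the function names and example nodes are hypothetical.

```python
import math
import random

def normalize(weights):
    """Rescale weights so that their squares add up to 1."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {node: w / norm for node, w in weights.items()}

def uniform_init(nodes):
    """All weights set to 1 (normalized here so configurations are comparable)."""
    return normalize({node: 1.0 for node in nodes})

def random_init(nodes, rng):
    """Each weight drawn independently from [0, 1), then normalized."""
    return normalize({node: rng.random() for node in nodes})

def mask(weights, node, value):
    """Hypothetical masking rule: force one node's weight to a chosen value."""
    masked = dict(weights)
    masked[node] = value
    return masked

nodes = ["red", "blue", "august", "july"]
uniform = uniform_init(nodes)
randomized = random_init(nodes, random.Random(0))
masked = mask(uniform, "red", 0.0)
```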

Run Time - Linear

Quasi-Random Input

Create semi-random data, then add tuples to the data to create artificial clusters
– Use this to test whether STIRR works
Questions
• Number of iterations
• Density of cluster relative to background
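
One way such test data could be generated, as a sketch: random background tuples plus extra tuples confined to a small set of values per field. The function name, parameters, and sizes are all hypothetical and are not the settings used in the paper.

```python
import random

def planted_table(n_random, n_extra, n_fields=3, n_values=20,
                  cluster_size=5, seed=0):
    """Random background tuples plus extra tuples that form a planted cluster."""
    rng = random.Random(seed)
    # Background: every field value chosen uniformly at random.
    rows = [tuple(rng.randrange(n_values) for _ in range(n_fields))
            for _ in range(n_random)]
    # Planted cluster: extra tuples use only the first `cluster_size`
    # values of each field, so those values co-occur more than average.
    rows += [tuple(rng.randrange(cluster_size) for _ in range(n_fields))
             for _ in range(n_extra)]
    return rows

table = planted_table(n_random=200, n_extra=50)
```

Varying `n_extra` against `n_random` changes the density of the cluster relative to the background.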

How well does STIRR distill a cluster of nodes with above-average co-occurrence?
– Number of iterations
– Purity

How well does STIRR separate distinct planted clusters?
– Will the data partition?
– How long to partition?
S(A,B) = (|a0 – b0| + |a1 – b1|) / total nodes
For clusters A and B: a0 is the number of nodes from cluster A at one end of the weight ordering, and a1 the number at the other end (b0 and b1 likewise for B)
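
The separation measure is simple arithmetic. The sketch below assumes a0 and a1 count cluster-A nodes at the two ends of the weight ordering, with b0 and b1 the same for cluster B; the function name and the example counts are illustrative.

```python
def separation(a0, a1, b0, b1, total_nodes):
    """S(A,B) = (|a0 - b0| + |a1 - b1|) / total nodes."""
    return (abs(a0 - b0) + abs(a1 - b1)) / total_nodes

# All 10 nodes of A at one end, all 10 nodes of B at the other.
perfect = separation(a0=10, a1=0, b0=0, b1=10, total_nodes=20)   # 1.0

# Both clusters split evenly between the two ends.
mixed = separation(a0=5, a1=5, b0=5, b1=5, total_nodes=20)       # 0.0
```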

How well does STIRR cope with clusters in a few columns with the rest random?

Want to mask irrelevant factors (columns)

Effect of Convergence Operator

– Max function is the best
– Product rule does not converge
– Sum rule is good, but slow

Real World Data

Papers on theory and database systems
– Tuples of the form (Author 1, Author 2, Journal, Year)
– The two sets of papers were clearly separated in the STIRR representation
– Done using Sp
– Grouped most theoretical papers around 1976

Login Data from IBM Servers

Masked one user who logged in / out very frequently

The 4 highest-weight (similar) users – root, help, and 2 administrators' names

8pm-12am very similar

Conclusion

– Powerful technique to categorize data
– Relatively fast algorithm, O(n)
– Questions?

Additional References

Data Clustering Techniques – Qualifying Oral Examination Paper – Periklis Andritsos
– http://www.cs.toronto.edu/~periklis/pubs/depth.pdf