GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

by Istvan Jonyer, Lawrence B. Holder and Diane J. Cook, The University of Texas at Arlington


Transcript of GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

Page 1: GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

by Istvan Jonyer, Lawrence B. Holder and Diane J. Cook

The University of Texas at Arlington

Page 2: GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

Outline

What is hierarchical conceptual clustering?
Overview of Subdue
Conceptual clustering in Subdue
Evaluation of hierarchical clusterings
Experiments and results
Conclusions

Page 3: GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

What is clustering?

Page 4: GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

What is hierarchical conceptual clustering?

Unsupervised concept learning
Generating hierarchies to explain data
Applications
– Hypothesis generation and testing
– Prediction based on groups
– Finding taxonomies

Page 5: GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

Example hierarchical conceptual clustering

Animals
  HeartChamber: four, BodyTemp: regulated, Fertilization: internal
    Name: mammal, BodyCover: hair
    Name: bird, BodyCover: feathers
  BodyTemp: unregulated
    Name: reptile, BodyCover: cornified-skin, HeartChamber: imperfect-four, Fertilization: internal
    Fertilization: external
      Name: fish, BodyCover: scales, HeartChamber: two
      Name: amphibian, BodyCover: moist-skin, HeartChamber: three

Page 6: GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

The Problem

Hierarchical conceptual clustering in discrete-valued structural databases

Existing systems:
– Continuous-valued
– Discrete but unstructured
– We can do better! (Field underexplored)

Page 7: GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

Related Work

Cobweb
Labyrinth
AutoClass
Snob
In Euclidean space: Chameleon, Cure

Unsupervised learning algorithms

Page 8: GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

The Solution

Take Subdue and extend it!

Page 9: GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

Overview of Subdue

Data mining in graph representations of structural databases

[Figure: example input graph with vertices labeled A, B, C, D (two occurrences each), E and F, joined by edges labeled a through g]

Page 10: GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

Overview of Subdue

Iteratively searching for the best substructure using the MDL (minimum description length) heuristic

[Figure: the best substructure, with vertices A, B, C, D and edges a, b, c]
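As a rough illustration of the heuristic, the sketch below scores a candidate substructure by how much it shrinks the graph. It is not Subdue's actual encoding: description length is approximated here by simply counting vertices and edges, and the names `Graph`, `size` and `compression_value` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Graph:
    """A labeled graph reduced to the two counts this sketch needs."""
    num_vertices: int
    num_edges: int


def size(g: Graph) -> int:
    # Assumption: vertices + edges stands in for a true description length in bits.
    return g.num_vertices + g.num_edges


def compression_value(graph: Graph, sub: Graph, instances: int) -> float:
    """Return (size(S) + size(G|S)) / size(G); a lower value means better compression.

    `instances` is the number of non-overlapping occurrences of `sub` in `graph`.
    """
    # Each instance collapses to a single vertex, so its internal
    # vertices and edges disappear from the compressed graph.
    compressed = Graph(
        graph.num_vertices - instances * sub.num_vertices + instances,
        graph.num_edges - instances * sub.num_edges,
    )
    return (size(sub) + size(compressed)) / size(graph)
```

For a graph with 10 vertices and 10 edges and a substructure with 4 vertices and 3 edges that occurs twice, `compression_value(Graph(10, 10), Graph(4, 3), 2)` gives (7 + 8) / 20 = 0.75 under this simplified measure.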

Page 11: GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

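Overview of Subdue

Compress using the best substructure

[Figure: the compressed graph, in which each instance of the substructure is replaced by a vertex S; vertices E and F and edges d, e, f, g remain]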
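The compression step can be sketched as follows. This is a minimal illustration, assuming the instances have already been found and do not overlap; the graph encoding (a dict of vertex labels plus a list of labeled edges) and the function name `compress` are hypothetical and not taken from Subdue's code.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int, str]  # (source vertex id, target vertex id, edge label)


def compress(
    vertices: Dict[int, str],
    edges: List[Edge],
    instances: List[Set[int]],
    new_label: str = "S",
) -> Tuple[Dict[int, str], List[Edge]]:
    """Collapse each instance (a set of vertex ids) into one vertex labeled `new_label`."""
    # Map every vertex inside an instance to the id of its replacement vertex.
    replacement: Dict[int, int] = {}
    new_vertices: Dict[int, str] = {}
    next_id = max(vertices) + 1
    for instance in instances:
        for v in instance:
            replacement[v] = next_id
        new_vertices[next_id] = new_label
        next_id += 1

    # Keep the vertices that belong to no instance.
    for v, label in vertices.items():
        if v not in replacement:
            new_vertices[v] = label

    # Drop edges internal to an instance; reroute the others to the replacement vertices.
    new_edges: List[Edge] = []
    for src, dst, label in edges:
        new_src = replacement.get(src, src)
        new_dst = replacement.get(dst, dst)
        if src in replacement and new_src == new_dst:
            continue  # both endpoints lie in the same instance
        new_edges.append((new_src, new_dst, label))
    return new_vertices, new_edges
```

Applied to a graph with two instances of a four-vertex substructure plus two other vertices, the result has two S vertices, the two untouched vertices, and only the edges that crossed instance boundaries.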

Page 12: GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

Overview of Subdue

Fuzzy match
– Inexact matching of subgraphs
– Applications:
  Defining fuzzy concepts
  Evaluation of clusterings

Page 13: GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

Conceptual Clustering with Subdue

Use Subdue to identify clusters
– The best subgraph in an iteration defines a cluster

When to stop within an iteration?
1) Use -limit option
2) Use -size option
3) Use first minimum heuristic (new)

Page 14: GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

The First Minimum Heuristic

Use subgraph at first local minimum
– Detect it using -prune2 option

[Figure: plot illustrating the first local minimum; axis ticks run from 0.75 to 1.05]

Page 15: GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

The First Minimum Heuristic

Not a greedy heuristic!
– Although first local minimum is usually the global minimum
– First local minimum is caused by a smaller, more frequently occurring subgraph
– Subsequent minima are caused by bigger, less frequently occurring subgraphs
=> First subgraph is more general
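A minimal sketch of the idea, assuming the search produces a sequence of evaluation values (one per candidate subgraph, lower is better) in the order the candidates are explored. The function name is hypothetical, and the slides do not show how Subdue detects the minimum internally.

```python
from typing import Optional, Sequence


def first_local_minimum(values: Sequence[float]) -> Optional[int]:
    """Return the index where the values first stop falling and turn upward.

    Returns None if the sequence never turns upward after a non-increasing step.
    """
    for i in range(1, len(values) - 1):
        if values[i] <= values[i - 1] and values[i] < values[i + 1]:
            return i
    return None
```

With the values `[1.0, 0.9, 0.8, 0.85, 0.7, 0.95]`, the function returns index 2 (value 0.8) even though the global minimum, 0.7, comes later. Stopping there selects the smaller, more frequent, and therefore more general subgraph.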

Page 16: GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

The First Minimum Heuristic

A multi-minimum search space:

[Figure: plot of a search space with several local minima; axis ticks run from 0.6 to 1.1]

Page 17: GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

Lattice vs. Tree

Previous work defined classification trees
– Inadequate in structured domains

Better hierarchical description: classification lattice
– A cluster can have more than one parent
– A parent can be at any level (not only one level above)
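The difference from a tree can be shown with a small data structure. This is an illustrative sketch only; the class `Cluster` and its methods are hypothetical names, not part of Subdue.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)  # identity-based equality avoids recursive comparisons
class Cluster:
    """A node in a classification lattice; unlike a tree node it may have several parents."""
    name: str
    parents: List["Cluster"] = field(default_factory=list)
    children: List["Cluster"] = field(default_factory=list)

    def add_parent(self, parent: "Cluster") -> None:
        self.parents.append(parent)
        parent.children.append(self)

    def depth(self) -> int:
        # Depth is the longest path to a root, so parents may sit at different levels.
        if not self.parents:
            return 0
        return 1 + max(p.depth() for p in self.parents)


root = Cluster("root")
a = Cluster("a")
b = Cluster("b")
c = Cluster("c")
a.add_parent(root)
b.add_parent(a)
c.add_parent(root)  # one parent at the top level ...
c.add_parent(b)     # ... and another two levels further down
```

Here `c` has two parents at different levels, which a classification tree cannot express.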

Page 18: GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

Hierarchical Clustering in Subdue

Subdue can compress by a subgraph after each iteration

Subsequent clusters may be defined in terms of previously defined clusters

This results in a hierarchy

Page 19: GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

Hierarchical Conceptual Clustering of an Artificial Domain

Page 20: GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

Hierarchical Conceptual Clustering of an Artificial Domain

[Figure: cluster hierarchy for the artificial domain, starting at Root]

Page 21: GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

Evaluation of Clusterings

Traditional evaluation:

\[ \text{ClusteringQuality} = \frac{\text{InterClusterDistance}}{\text{IntraClusterDistance}} \]

– Not applicable to hierarchical domains

No known evaluation for hierarchical clusterings
– Most hierarchical evaluations are anecdotal
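For flat clusterings of points in Euclidean space, the traditional ratio can be computed as in the sketch below. The particular choices here are assumptions made for illustration: inter-cluster distance is taken as the mean distance between cluster centroids, and intra-cluster distance as the mean distance of each point to its own centroid. It also assumes at least two clusters and a non-zero spread.

```python
import math
from itertools import combinations
from typing import List, Sequence

Point = Sequence[float]


def centroid(points: List[Point]) -> List[float]:
    # Coordinate-wise mean of the points in one cluster.
    return [sum(coord) / len(points) for coord in zip(*points)]


def clustering_quality(clusters: List[List[Point]]) -> float:
    """Mean centroid-to-centroid distance divided by mean point-to-own-centroid distance."""
    centers = [centroid(c) for c in clusters]
    pairs = list(combinations(centers, 2))
    inter = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
    spreads = [
        math.dist(p, center)
        for cluster, center in zip(clusters, centers)
        for p in cluster
    ]
    intra = sum(spreads) / len(spreads)
    return inter / intra
```

Two tight clusters that lie far apart score high; the measure has no notion of parent and child clusters, which is why it does not carry over to hierarchies.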

Page 22: GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

New Evaluation Heuristic for Hierarchical Clusterings

Properties of a good clustering:
– Small number of clusters
  Large coverage => good generality
– Big cluster descriptions
  More features => more inferential power
– Minimal or no overlap between clusters
  More distinct clusters => better defined concepts

Page 23: GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

New Evaluation Heuristic for Hierarchical Clusterings

Big clusters: bigger distance between disjoint clusters
Overlap: less overlap => bigger distance
Few clusters: averaging comparisons

[Equation: definition of the clustering quality CQ_C of a cluster C with c child clusters. It combines nested sums over pairs of child clusters (i, j) and over their instances (k, l) of distance(H_{i,k}, H_{j,l}), a max(...) normalization term, and the qualities CQ_{H_i} of the children.]
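The sketch below shows one possible reading of such a heuristic; it is not the formula from the slide. The assumptions are: instances are attribute-value records, the distance between two instances is the fraction of attributes that differ (a stand-in for an inexact graph match cost), the summed distances between two child clusters are divided by the size of the larger one, the result is averaged over all pairs of children, and the children's own qualities are added recursively. All names are hypothetical, and every cluster is assumed to hold at least one instance.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Dict, List

Instance = Dict[str, str]  # attribute name -> value


@dataclass
class Node:
    instances: List[Instance]  # instances covered by this cluster
    children: List["Node"] = field(default_factory=list)


def distance(a: Instance, b: Instance) -> float:
    # Assumption: fraction of attributes whose values differ.
    keys = set(a) | set(b)
    return sum(a.get(k) != b.get(k) for k in keys) / len(keys)


def pair_distance(h_i: List[Instance], h_j: List[Instance]) -> float:
    # Sum over all instance pairs, normalized by the larger cluster.
    total = sum(distance(x, y) for x in h_i for y in h_j)
    return total / max(len(h_i), len(h_j))


def quality(node: Node) -> float:
    """Average distance between child clusters plus the quality of each child."""
    kids = node.children
    below = sum(quality(k) for k in kids)
    if len(kids) < 2:
        return below  # nothing to compare at this level
    pairs = list(combinations(kids, 2))
    own = sum(pair_distance(a.instances, b.instances) for a, b in pairs) / len(pairs)
    return own + below
```

Averaging over pairs keeps a clustering from scoring higher merely by having more clusters, while disjoint, well-separated children raise the score.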

Page 24: GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

Experiments and Results

Validation in an artificial domain
Validation in unstructured domains
Comparison to existing systems
Real-world applications

Page 25: GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

The Animal Domain

Name       Body Cover      Heart Chamber   Body Temp.   Fertilization
mammal     hair            four            regulated    internal
bird       feathers        four            regulated    internal
reptile    cornified-skin  imperfect-four  unregulated  internal
amphibian  moist-skin      three           unregulated  external
fish       scales          two             unregulated  external

[Figure: graph representation of the mammal instance: an "animal" vertex with edges labeled Name, BodyCover, HeartChamber, BodyTemp and Fertilization pointing to the vertices mammal, hair, four, regulated and internal]
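The conversion from a table row to a graph can be sketched as below. The encoding (tuples for vertices and edges) and the function name `instance_to_graph` are illustrative assumptions, not Subdue's input format; the attribute values come from the mammal row of the table.

```python
from typing import Dict, List, Tuple

Vertex = Tuple[int, str]     # (vertex id, label)
Edge = Tuple[int, int, str]  # (source id, target id, edge label)


def instance_to_graph(
    row: Dict[str, str], root_label: str = "animal"
) -> Tuple[List[Vertex], List[Edge]]:
    """Turn one table row into a star graph: a root vertex with one labeled edge per attribute."""
    vertices: List[Vertex] = [(0, root_label)]
    edges: List[Edge] = []
    for next_id, (attribute, value) in enumerate(row.items(), start=1):
        vertices.append((next_id, value))      # the attribute value becomes a vertex
        edges.append((0, next_id, attribute))  # the attribute name labels the edge
    return vertices, edges


mammal = {
    "Name": "mammal",
    "BodyCover": "hair",
    "HeartChamber": "four",
    "BodyTemp": "regulated",
    "Fertilization": "internal",
}
vertices, edges = instance_to_graph(mammal)
```

Each of the five animals becomes one such star graph with six vertices and five edges, so an unstructured attribute-value table can be handed to a graph-based learner.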

Page 26: GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

Hierarchical Clustering of the Animal Domain

Animals
  HeartChamber: four, BodyTemp: regulated, Fertilization: internal
    Name: mammal, BodyCover: hair
    Name: bird, BodyCover: feathers
  BodyTemp: unregulated
    Name: reptile, BodyCover: cornified-skin, HeartChamber: imperfect-four, Fertilization: internal
    Fertilization: external
      Name: fish, BodyCover: scales, HeartChamber: two
      Name: amphibian, BodyCover: moist-skin, HeartChamber: three

Page 27: GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

Hierarchical Clustering of the Animal Domain by Cobweb

animals
  mammal/bird
    mammal
    bird
  reptile
  amphibian/fish
    fish
    amphibian

Page 28: GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

Comparison of Subdue and Cobweb

Quality of Subdue’s lattice (tree): 2.60
Quality of Cobweb’s tree: 1.74
Therefore, by this measure, Subdue’s clustering is better
Reasons for a higher score:
– Better generalization resulting in fewer clusters
– Eliminating overlap between (reptile) and (amphibian/fish)

Page 29: GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

Chemical Application: Clustering of a DNA sequence

Page 30: GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

Chemical Application: Clustering of a DNA sequence

Coverage
– 61%
– 68%
– 71%

[Figure: cluster lattice rooted at DNA; the cluster definitions are molecular fragments, including a phosphate group (P bonded to O, OH and a double-bonded O), the same group extended by an O and CH2 chain, and ring fragments of C, N and O atoms]

Page 31: GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

Conclusions

Goal of hierarchical conceptual clustering of structured databases was achieved
Synthesized classification lattice
Developed new evaluation heuristic for hierarchical clusterings
Good performance in comparison to other systems, even in unstructured domains

Page 32: GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

Future Work

More experiments on real-world domains
Comparison to other systems
Incorporation of evaluation tool into Subdue