GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

Post on 25-Feb-2016


GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING, by Istvan Jonyer, Lawrence B. Holder and Diane J. Cook, The University of Texas at Arlington. Outline: What is hierarchical conceptual clustering? Overview of Subdue. Conceptual clustering in Subdue. Evaluation of hierarchical clusterings.

GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

by

Istvan Jonyer, Lawrence B. Holder and Diane J. Cook

The University of Texas at Arlington

Outline

What is hierarchical conceptual clustering?
Overview of Subdue
Conceptual clustering in Subdue
Evaluation of hierarchical clusterings
Experiments and results
Conclusions

What is clustering?

What is hierarchical conceptual clustering?

Unsupervised concept learning
Generating hierarchies to explain data
Applications
– Hypothesis generation and testing
– Prediction based on groups
– Finding taxonomies

Example hierarchical conceptual clustering

Animals
  HeartChamber: four, BodyTemp: regulated, Fertilization: internal
    Name: mammal, BodyCover: hair
    Name: bird, BodyCover: feathers
  BodyTemp: unregulated
    Name: reptile, BodyCover: cornified-skin, HeartChamber: imperfect-four, Fertilization: internal
    Fertilization: external
      Name: fish, BodyCover: scales, HeartChamber: two
      Name: amphibian, BodyCover: moist-skin, HeartChamber: three

The Problem

Hierarchical conceptual clustering in discrete-valued structural databases

Existing systems:
– Continuous-valued
– Discrete but unstructured
– We can do better! (Field underexplored)

Related Work

Cobweb
Labyrinth
AutoClass
Snob
In Euclidean space: Chameleon, Cure

Unsupervised learning algorithms

The Solution

Take Subdue and extend it!

Overview of Subdue

Data mining in graph representations of structural databases

[Figure: example input graph with vertices labeled A to F and edges labeled a to g; the vertices A, B, C, D and the edges a, b, c each appear twice]

Overview of Subdue

Iteratively searching for best substructure by MDL heuristic

[Figure: the best substructure found, with vertices A, B, C, D and edges a, b, c]

Overview of Subdue

Compress using best substructure

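[Figure: the graph after compression, with two S vertices in place of the substructure instances, the remaining vertices E and F, and edges d, e, f, g]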
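The compression step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph topology below is made up (the slide figure is not recoverable), instances are given by hand rather than discovered, and "size" counts vertices plus edges as a crude stand-in for the description length that the MDL heuristic measures. The helper names `size` and `compress` are hypothetical and are not part of Subdue.

```python
def size(vertices, edges):
    """Crude stand-in for description length: count vertices and edges."""
    return len(vertices) + len(edges)


def compress(vertices, edges, instances, label="S"):
    """Replace every instance (a set of vertex ids) by one vertex labeled `label`."""
    owner = {}          # original vertex id -> id of the vertex that replaces it
    new_vertices = {}
    for n, instance in enumerate(instances):
        sid = f"{label}{n}"
        new_vertices[sid] = label
        for v in instance:
            owner[v] = sid
    for v, vlabel in vertices.items():
        if v not in owner:
            new_vertices[v] = vlabel
    new_edges = []
    for u, v, elabel in edges:
        cu, cv = owner.get(u, u), owner.get(v, v)
        if u in owner and cu == cv:
            continue    # edge lies inside one instance, so it is absorbed
        new_edges.append((cu, cv, elabel))
    return new_vertices, new_edges


# Made-up graph: two copies of A-B-C-D (edges a, b, c) joined through E and F.
vertices = {1: "A", 2: "B", 3: "C", 4: "D",
            5: "A", 6: "B", 7: "C", 8: "D", 9: "E", 10: "F"}
edges = [(1, 2, "a"), (1, 3, "b"), (3, 4, "c"),
         (5, 6, "a"), (5, 7, "b"), (7, 8, "c"),
         (4, 9, "d"), (9, 10, "e"), (10, 5, "f"), (8, 9, "g")]
instances = [{1, 2, 3, 4}, {5, 6, 7, 8}]

substructure_size = 4 + 3   # vertices A, B, C, D plus edges a, b, c
comp_vertices, comp_edges = compress(vertices, edges, instances)
ratio = (substructure_size + size(comp_vertices, comp_edges)) / size(vertices, edges)
print(ratio)  # (7 + 8) / 20 = 0.75
```

A ratio below 1 means that describing the substructure once and the compressed graph afterwards takes less space than the original graph, which is the property the search looks for.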

Overview of Subdue

Fuzzy match
– Inexact matching of subgraphs
– Applications:
   Defining fuzzy concepts
   Evaluation of clusterings

Conceptual Clustering with Subdue

Use Subdue to identify clusters
– The best subgraph in an iteration defines a cluster
When to stop within an iteration?
1) Use -limit option
2) Use -size option
3) Use first minimum heuristic (new)

The First Minimum Heuristic

Use subgraph at first local minimum
– Detect it using -prune2 option

[Figure: plot illustrating the first local minimum; axis ticks from 0.75 to 1.05]

The First Minimum Heuristic

Not a greedy heuristic!
– Although first local minimum is usually the global minimum
– First local minimum is caused by a smaller, more frequently occurring subgraph
– Subsequent minima are caused by bigger, less frequently occurring subgraphs
=> First subgraph is more general
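Finding the first local minimum in a sequence of evaluation values can be sketched as follows. This is a minimal illustration, not a description of how the -prune2 option works internally; the function name `first_local_minimum` and the sample values are made up.

```python
def first_local_minimum(values):
    """Return the index of the first value that is followed by a larger one.

    If the sequence never rises, the last index is returned.
    """
    for i in range(len(values) - 1):
        if values[i + 1] > values[i]:
            return i
    return len(values) - 1


# Hypothetical values of the evaluation measure as the search proceeds.
values = [1.00, 0.92, 0.81, 0.86, 0.90, 0.78, 0.95]
first = first_local_minimum(values)
overall = values.index(min(values))
print(first, overall)
```

In this made-up sequence the first local minimum (index 2) is not the global one (index 5), which is the situation the heuristic deliberately accepts in exchange for a more general subgraph.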

The First Minimum Heuristic

A multi-minimum search space:

[Figure: plot of a search space with several local minima; axis ticks from 0.6 to 1.1]

Lattice vs. Tree

Previous work defined classification trees
– Inadequate in structured domains
Better hierarchical description: classification lattice
– A cluster can have more than one parent
– A parent can be at any level (not only one level above)

Hierarchical Clustering in Subdue

Subdue can compress by a subgraph after each iteration

Subsequent clusters may be defined in terms of previously defined clusters

This results in a hierarchy
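One way to picture how such a hierarchy arises is to record, for each cluster, which previously defined clusters its definition refers to. The sketch below assumes a simplified representation in which a cluster definition is just the set of labels it uses; the function `build_lattice` and the cluster names S1 to S4 are hypothetical.

```python
def build_lattice(definitions):
    """Map each cluster to its parents.

    `definitions` is a list of (name, labels) pairs in discovery order.
    A cluster's parents are the earlier clusters named in its labels;
    a cluster that refers to none of them hangs directly under "Root".
    """
    parents = {}
    known = set()
    for name, labels in definitions:
        referenced = sorted(label for label in labels if label in known)
        parents[name] = referenced or ["Root"]
        known.add(name)
    return parents


definitions = [
    ("S1", {"A", "B"}),
    ("S2", {"C", "D"}),
    ("S3", {"S1", "S2", "E"}),   # built from two earlier clusters
    ("S4", {"S1", "F"}),
]
lattice = build_lattice(definitions)
print(lattice)
```

Because S3 refers to both S1 and S2 it has two parents, so the result is a lattice rather than a tree.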

Hierarchical Conceptual Clustering of an Artificial Domain

[Figure: classification lattice for the artificial domain, with the top node labeled Root]

Evaluation of Clusterings

Traditional evaluation:

$$\mathrm{ClusteringQuality} = \frac{\mathrm{InterClusterDistance}}{\mathrm{IntraClusterDistance}}$$

– Not applicable to hierarchical domains
No known evaluation for hierarchical clusterings
– Most hierarchical evaluations are anecdotal

New Evaluation Heuristic for Hierarchical Clusterings

Properties of a good clustering:
– Small number of clusters
   Large coverage => good generality
– Big cluster descriptions
   More features => more inferential power
– Minimal or no overlap between clusters
   More distinct clusters => better defined concepts

New Evaluation Heuristic for Hierarchical Clusterings

Big clusters: bigger distance between disjoint clusters
Overlap: less overlap => bigger distance
Few clusters: averaging comparisons

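$$CQ_C = \frac{\displaystyle\sum_{i=1}^{c-1} \sum_{j=i+1}^{c} \sum_{k=1}^{|H_i|} \sum_{l=1}^{|H_j|} \frac{\mathrm{distance}(H_{i,k}, H_{j,l})}{\max_{\mathrm{size}}(H_{i,k}, H_{j,l})}}{\displaystyle\sum_{i=1}^{c-1} \sum_{j=i+1}^{c} |H_i|\,|H_j|} + \sum_{i=1}^{c} CQ_{H_i}$$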
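The quality measure can be sketched in code as follows. Several things here are assumptions made only for illustration: H_{i,k} is read as the k-th member of cluster H_i, |H_i| as the number of members, each member is modeled as a set of features, the size of the symmetric difference stands in for Subdue's graph distance, and CQ of a child is taken to be the quality of the clustering formed by that child's own children. The class `Cluster` and both functions are hypothetical names.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Cluster:
    members: list                       # H_{i,k}: non-empty frozensets of features
    children: list = field(default_factory=list)


def distance(a, b):
    """Stand-in distance: number of features not shared by a and b."""
    return len(a ^ b)


def clustering_quality(clusters):
    """Averaged, size-normalized distance between sibling clusters,
    plus the quality of the clustering below each of them."""
    total = 0.0
    comparisons = 0
    for hi, hj in combinations(clusters, 2):
        comparisons += len(hi.members) * len(hj.members)
        for a in hi.members:
            for b in hj.members:
                total += distance(a, b) / max(len(a), len(b))
    level = total / comparisons if comparisons else 0.0
    return level + sum(clustering_quality(h.children) for h in clusters)


ab, ac, de = frozenset("ab"), frozenset("ac"), frozenset("de")
h1 = Cluster([ab, ac], children=[Cluster([ab]), Cluster([ac])])
h2 = Cluster([de])
flat_score = clustering_quality([Cluster([ab, ac]), h2])
nested_score = clustering_quality([h1, h2])
print(flat_score, nested_score)
```

In this toy input the two top-level clusters share no features, so every comparison contributes the maximum of 2 under the stand-in distance; the overlapping children of the first cluster contribute only 1, which mirrors the point that less overlap gives a bigger distance.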

Experiments and Results

Validation in an artificial domain
Validation in unstructured domains
Comparison to existing systems
Real-world applications

The Animal Domain

Name       Body Cover      Heart Chamber   Body Temp.   Fertilization
mammal     hair            four            regulated    internal
bird       feathers        four            regulated    internal
reptile    cornified-skin  imperfect-four  unregulated  internal
amphibian  moist-skin      three           unregulated  external
fish       scales          two             unregulated  external

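[Figure: graph representation of the mammal row: a central "animal" vertex with edges labeled Name, BodyCover, HeartChamber, BodyTemp and Fertilization leading to the vertices mammal, hair, four, regulated and internal]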
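Turning the attribute table into this kind of graph can be sketched as below. The vertex ids and the function name `to_graph` are made up for the example, and the output is a plain Python structure rather than Subdue's input file format.

```python
ATTRIBUTES = ["Name", "BodyCover", "HeartChamber", "BodyTemp", "Fertilization"]

ROWS = [
    ("mammal", "hair", "four", "regulated", "internal"),
    ("bird", "feathers", "four", "regulated", "internal"),
    ("reptile", "cornified-skin", "imperfect-four", "unregulated", "internal"),
    ("amphibian", "moist-skin", "three", "unregulated", "external"),
    ("fish", "scales", "two", "unregulated", "external"),
]


def to_graph(rows):
    """Build one star-shaped subgraph per row.

    Each row becomes an "animal" vertex with one labeled edge per attribute,
    pointing to a vertex that carries the attribute value as its label.
    """
    vertices = {}   # vertex id -> label
    edges = []      # (source id, target id, edge label)
    for r, row in enumerate(rows):
        hub = f"animal{r}"
        vertices[hub] = "animal"
        for attribute, value in zip(ATTRIBUTES, row):
            vid = f"{hub}.{attribute}"
            vertices[vid] = value
            edges.append((hub, vid, attribute))
    return vertices, edges


animal_vertices, animal_edges = to_graph(ROWS)
print(len(animal_vertices), len(animal_edges))
```

Each of the five animals contributes one hub vertex, five value vertices and five edges, so shared attribute values such as "regulated" show up as repeated substructures that a graph-based search can pick up.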

Hierarchical Clustering of the Animal Domain

Animals
  HeartChamber: four, BodyTemp: regulated, Fertilization: internal
    Name: mammal, BodyCover: hair
    Name: bird, BodyCover: feathers
  BodyTemp: unregulated
    Name: reptile, BodyCover: cornified-skin, HeartChamber: imperfect-four, Fertilization: internal
    Fertilization: external
      Name: fish, BodyCover: scales, HeartChamber: two
      Name: amphibian, BodyCover: moist-skin, HeartChamber: three

Hierarchical Clustering of the Animal Domain by Cobweb

animals
  amphibian/fish
    fish
    amphibian
  mammal/bird
    mammal
    bird
  reptile

Comparison of Subdue and Cobweb

Quality of Subdue’s lattice (tree): 2.60
Quality of Cobweb’s tree: 1.74
Subdue’s clustering therefore scores higher
Reasons for a higher score:
– Better generalization resulting in fewer clusters
– Eliminating overlap between (reptile) and (amphibian/fish)

Chemical Application: Clustering of a DNA sequence

Coverage
– 61%
– 68%
– 71%

DNA

O |O == P — OH

C — N C — C

C — C \ O

O |O == P — OH | O | CH2

C \ N — C \ C

O \ C / \ C — C N — C / \O C

Conclusions

Goal of hierarchical conceptual clustering of structured databases was achieved
Synthesized classification lattice
Developed new evaluation heuristic for hierarchical clusterings
Good performance in comparison to other systems, even in unstructured domains

Future Work

More experiments on real-world domains
Comparison to other systems
Incorporation of evaluation tool into Subdue