3.6 Constraint-Based Cluster Analysis

Clustering: Constraint-Based Cluster Analysis

1

Constraint-Based Clustering

Constraint-based clustering finds clusters that satisfy user-specified preferences or constraints
It is desirable for the clustering process to take user preferences and constraints into consideration
  Expected number of clusters
  Maximal / minimal cluster size
  Weights for dimensions / important dimensions
Mining becomes focused

2

Categories of Constraints

Constraints on individual objects
  Example: luxury mansions worth over a million dollars
  Handled through selection before clustering
Constraints on the selection of clustering parameters
  Number of clusters, radius, MinPts
  Not strictly constraint-based clustering
Constraints on distance or similarity functions
  Different measures for specific attributes or objects
  Weighting process
  Example: clustering with obstacle objects
User-specified constraints on properties of individual clusters
  Clusters must satisfy the given properties
Semi-supervised clustering based on partial supervision
  Pairwise constraints

3

Clustering with Obstacle Objects

City: rivers, lakes, bridges, roads, etc.
Obstacles must be avoided
The distance function between objects must be redefined
  Straight-line distance is meaningless
With a partitioning approach, distance calculation with obstacles becomes expensive
  k-means is not suitable, as a cluster centre may lie on an obstacle
  k-medoids can be used, and the distance between objects can be determined using triangulation

4

Clustering with Obstacles

Point p is visible from q in region R if the straight line between p and q does not intersect any obstacle
Visibility graph (VG)
  Each vertex of an obstacle has a corresponding node
  There is an edge between two vertices only if they are visible to each other
  Additional points can be added and paths between them can be determined
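
The visibility-graph idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of any particular system: obstacles are assumed to be convex polygons given as vertex lists, the input is assumed to be in general position (no segment grazes an obstacle corner on its way into the interior), and the names `visible` and `obstacle_distance` are hypothetical. The two query points are added to the graph as extra nodes, and the obstacle-avoiding distance is the shortest path found by Dijkstra's algorithm.

```python
import heapq
from itertools import combinations
from math import dist


def _orient(a, b, c):
    """Signed area test: > 0 if a, b, c turn counter-clockwise, < 0 if clockwise."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def _properly_cross(p, q, a, b):
    """True if segments pq and ab cross at a point interior to both."""
    d1, d2 = _orient(p, q, a), _orient(p, q, b)
    d3, d4 = _orient(a, b, p), _orient(a, b, q)
    return d1 * d2 < 0 and d3 * d4 < 0


def _strictly_inside(pt, poly):
    """True if pt lies strictly inside a convex polygon (vertices in order)."""
    n = len(poly)
    signs = [_orient(poly[i], poly[(i + 1) % n], pt) for i in range(n)]
    return all(s > 0 for s in signs) or all(s < 0 for s in signs)


def visible(p, q, obstacles):
    """p is visible from q if segment pq does not cut through any obstacle."""
    mid = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)
    for poly in obstacles:
        n = len(poly)
        if any(_properly_cross(p, q, poly[i], poly[(i + 1) % n]) for i in range(n)):
            return False
        # Catches diagonals between two vertices of the same obstacle.
        if _strictly_inside(mid, poly):
            return False
    return True


def obstacle_distance(p, q, obstacles):
    """Shortest obstacle-avoiding distance from p to q via the visibility graph."""
    # Node 0 is p, node 1 is q, the rest are obstacle vertices.
    nodes = [p, q] + [v for poly in obstacles for v in poly]
    graph = {i: [] for i in range(len(nodes))}
    for i, j in combinations(range(len(nodes)), 2):
        if visible(nodes[i], nodes[j], obstacles):
            w = dist(nodes[i], nodes[j])
            graph[i].append((j, w))
            graph[j].append((i, w))
    # Dijkstra from node 0 to node 1.
    best = {0: 0.0}
    heap = [(0.0, 0)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == 1:
            return d
        if d > best.get(u, float("inf")):
            continue
        for v, w in graph[u]:
            nd = d + w
            if nd < best.get(v, float("inf")):
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")
```

For example, with a square obstacle `[(1, -1), (3, -1), (3, 1), (1, 1)]` between `(0, 0)` and `(4, 0)`, the straight line is blocked and the path has to go around a pair of corners. Rebuilding the graph on every query is wasteful; it is done here only to keep the sketch short.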

5

Clustering with Obstacles

To reduce the cost of distance computation, points can be grouped into micro-clusters
  Triangulate the region
  Group nearby points in the same triangle into micro-clusters
  Process micro-clusters instead of individual points
Shortest paths are computed in terms of:
  VV indices: for pairs of obstacle vertices
  MV indices: for pairs of a micro-cluster and an obstacle vertex
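
A minimal sketch of the micro-clustering step, assuming the triangulation is already available as a list of vertex triples. The greedy grouping rule (join the first micro-cluster in the same triangle whose centroid lies within `radius`) and the names `triangle_of` and `micro_clusters` are illustrative assumptions; the slides do not specify how "nearby" is decided.

```python
from math import dist


def _orient(a, b, c):
    """Signed area test used for the point-in-triangle check."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def triangle_of(pt, triangles):
    """Index of the first triangle containing pt (boundary included), else None."""
    for idx, (a, b, c) in enumerate(triangles):
        s = (_orient(a, b, pt), _orient(b, c, pt), _orient(c, a, pt))
        if all(x >= 0 for x in s) or all(x <= 0 for x in s):
            return idx
    return None


def _centroid(members):
    n = len(members)
    return (sum(m[0] for m in members) / n, sum(m[1] for m in members) / n)


def micro_clusters(points, triangles, radius):
    """Group nearby points of the same triangle; return (centroid, count) pairs."""
    by_triangle = {}  # triangle index -> list of member lists
    for pt in points:
        groups = by_triangle.setdefault(triangle_of(pt, triangles), [])
        for members in groups:
            if dist(pt, _centroid(members)) <= radius:
                members.append(pt)
                break
        else:
            # No micro-cluster close enough in this triangle: start a new one.
            groups.append([pt])
    return [(_centroid(m), len(m)) for groups in by_triangle.values() for m in groups]
```

Each micro-cluster is then represented by its centroid and size, so later distance computations touch far fewer objects than the raw point set.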

6

Clustering with Obstacles

7

User-Constrained Cluster Analysis

Example: relocating package delivery centres
  N customers: high-value and ordinary customers
  Determine locations for k service stations
  Constraints: each station should serve
    At least 100 high-value customers
    At least 5000 ordinary customers
This is a constrained optimization problem
  A direct mathematical approach is expensive
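
One way to write the example down as a constrained optimization problem is sketched below. The notation is an assumption introduced for illustration: x_i is the location of customer i, c_j the location of station j, a(i) the station assigned to customer i, d a distance function, and H and O the index sets of high-value and ordinary customers.

```latex
\min_{c_1,\dots,c_k,\; a}\ \sum_{i=1}^{N} d\bigl(x_i,\, c_{a(i)}\bigr)
\quad \text{subject to} \quad
\bigl|\{\, i \in H : a(i) = j \,\}\bigr| \ge 100,
\qquad
\bigl|\{\, i \in O : a(i) = j \,\}\bigr| \ge 5000
\qquad \text{for } j = 1,\dots,k
```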

8

User-Constrained Cluster Analysis

Micro-clustering
  Initially find a partition into k groups satisfying the given constraints
  Iteratively refine the solution
    Move m customers from cluster Ci to Cj if Ci has at least m surplus customers
    The move is made only if the total sum of distances between objects and their cluster centres is reduced
    Can be directed by selecting promising points
    Deadlock (a situation in which a constraint cannot be satisfied) has to be avoided
  Micro-clusters can be processed instead of individual points
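
The refinement step can be sketched as follows. To keep the example short it makes several simplifying assumptions that are not in the slides: customers are moved one at a time (m = 1), cluster centres are held fixed during refinement, and the starting assignment is already feasible. The function name `refine` and its parameters are hypothetical.

```python
from math import dist


def refine(points, kinds, assign, centers, minimum):
    """Greedy constrained refinement.

    points  : list of (x, y) customer locations
    kinds   : customer type per point, e.g. "high" or "ordinary"
    assign  : initial cluster index per point (assumed to satisfy the constraints)
    centers : list of (x, y) cluster centres (kept fixed in this sketch)
    minimum : dict mapping each kind to the minimum count required per cluster
    """
    k = len(centers)
    assign = list(assign)
    counts = [{kind: 0 for kind in minimum} for _ in range(k)]
    for kind, c in zip(kinds, assign):
        counts[c][kind] += 1

    improved = True
    while improved:
        improved = False
        for i, (pt, kind) in enumerate(zip(points, kinds)):
            src = assign[i]
            # Only move a customer if the source cluster has a surplus of its kind.
            if counts[src][kind] <= minimum[kind]:
                continue
            best = min(range(k), key=lambda j: dist(pt, centers[j]))
            # Move only if the total sum of distances strictly decreases.
            if dist(pt, centers[best]) < dist(pt, centers[src]):
                assign[i] = best
                counts[src][kind] -= 1
                counts[best][kind] += 1
                improved = True
    return assign
```

Because every accepted move strictly lowers the total distance and never drops a cluster below its minimum, the loop terminates with a feasible assignment. A fuller version would also recompute the centres between passes.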

9

Semi-Supervised Cluster Analysis

Constraint-based semi-supervised clustering
  Relies on user-provided labels or constraints
  Initialize based on labeled objects
  Modify the objective function
Distance-based semi-supervised clustering
  Adaptive distance measure trained to satisfy the labels or constraints
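
As an illustration of "initialize based on labeled objects", the sketch below seeds a plain k-means loop with the means of the labeled points, one initial centre per label. This is one simple way to use partial supervision and is not claimed to be the specific algorithm the slides have in mind; `seeded_kmeans` and its parameters are hypothetical names.

```python
from math import dist


def _mean(pts):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(pts)
    return tuple(sum(p[d] for p in pts) / n for d in range(len(pts[0])))


def _nearest(p, centers):
    """Index of the centre closest to p."""
    return min(range(len(centers)), key=lambda c: dist(p, centers[c]))


def seeded_kmeans(points, seeds, iterations=10):
    """k-means whose initial centres come from labeled objects.

    points : all objects to cluster (labeled and unlabeled)
    seeds  : dict mapping each label to a list of labeled points
    """
    labels = sorted(seeds)
    centers = [_mean(seeds[label]) for label in labels]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            groups[_nearest(p, centers)].append(p)
        # Keep the old centre if a cluster happens to receive no points.
        centers = [_mean(g) if g else centers[j] for j, g in enumerate(groups)]
    assignment = [labels[_nearest(p, centers)] for p in points]
    return centers, assignment
```

The number of clusters is taken from the number of distinct labels, so even a single labeled object per cluster fixes both k and the starting positions.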

10

CLTree (CLustering based on decision TREEs)

Integrates unsupervised clustering with supervised classification
Transforms the clustering task into a classification task
  Points to be clustered are labeled “Y”
  A set of non-existence points, labeled “N”, is added

11

Semi-Supervised Cluster Analysis

Non-existence points
  Not added physically
  For decision tree construction, only the number of “N” points is needed, not the actual points
  At the root node, the number of inherited “N” points is 0
  At any current node E, if the number of “N” points inherited from the parent node of E is less than the number of “Y” points in E, the number of “N” points for E is increased to the number of “Y” points in E
  The basic idea is to use as many “N” points as there are “Y” points
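
The counting rule above reduces to a single comparison per node. The sketch below assumes the inherited “N” count is supplied by the caller; the function name is hypothetical.

```python
def n_points_for_node(inherited_n, y_count):
    """Number of "N" points used at a node.

    If the node inherited fewer "N" points than it has "Y" points,
    the "N" count is raised to the "Y" count; otherwise it is kept.
    """
    return max(inherited_n, y_count)


# At the root nothing is inherited, so the "N" count equals the "Y" count.
root_n = n_points_for_node(inherited_n=0, y_count=1000)

# A node that inherited more "N" points than it has "Y" points keeps them.
child_n = n_points_for_node(inherited_n=500, y_count=200)
```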

12

Semi-Supervised Cluster Analysis

Decision tree splitting
  Based on information gain
  CLTree forms initial cuts and looks ahead to find better partitions that cut less into cluster regions
CLTree
  Handles high-dimensional spaces
  Determines subspace clusters
  Can also detect empty regions
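
Information gain for a candidate cut can be computed from the “Y” and “N” counts on each side alone, which is why the “N” points never need to be materialized. The sketch below is the standard entropy-based gain for a two-class split; it does not include CLTree's look-ahead step, and the function names are illustrative.

```python
from math import log2


def entropy(y, n):
    """Entropy of a region holding y "Y" points and n "N" points."""
    total = y + n
    if total == 0:
        return 0.0
    h = 0.0
    for count in (y, n):
        if count:
            p = count / total
            h -= p * log2(p)
    return h


def information_gain(y_left, n_left, y_right, n_right):
    """Gain of a cut that splits a region into a left and a right part."""
    left, right = y_left + n_left, y_right + n_right
    total = left + right
    parent = entropy(y_left + y_right, n_left + n_right)
    children = (left / total) * entropy(y_left, n_left) \
        + (right / total) * entropy(y_right, n_right)
    return parent - children
```

A cut that puts all “Y” points on one side and all “N” points on the other has the highest gain, which corresponds to separating a dense cluster region from an empty one.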

13
