Clustering and illustration with spatial and categorical data


  • 1

Clustering and illustration with spatial and categorical data

© Vladimir Estivill-Castro, School of Computing and Information Technology

© Vladimir Estivill-Castro 2

Knowledge Discovery
- The nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in large data sets.
- Data:
  - The geo-referenced layers.
  - The census data.
- Information:
  - The average population per administrative region.
  - Comparison of mortality rates by ethnic background.
- Knowledge:
  - The patterns of growth of population densities and valid explanations for them.

• © Vladimir Estivill-Castro 3

Mining Different Kinds of Knowledge
- Characterization: Generalize, summarize, and possibly contrast data characteristics, e.g., dry vs. wet regions.
- Association: Rules like "inside(x, city) → near(x, highway)".
- Classification: Classify data based on the values of a classifying attribute, e.g., classify countries based on climate.
- Clustering: Cluster data to form new classes, e.g., cluster houses to find distribution patterns.
- Trend and deviation analysis: Find and characterize evolution trends, sequential patterns, similar sequences, and deviation data, e.g., housing market analysis.
- Pattern-directed analysis: Find and characterize user-specified patterns in large databases, e.g., volcanoes on Mars.

© Vladimir Estivill-Castro 4

Visual illustration (2d)

• © Vladimir Estivill-Castro 5

Clustering
- The task of segmenting a heterogeneous population into a number of more homogeneous subgroups or clusters [Berry & Linoff 97].
  - The top-down view: we partition into subgroups.
- The task of finding groups in a data set by some natural criteria of similarity [Duda 73].
  - The bottom-up view: we agglomerate together those that are nearby.

© Vladimir Estivill-Castro 6

Spatial data mining
- The discovery of interesting, implicit knowledge in spatial databases.
  - An important step for understanding the use of the spatial data.
- It is an exploratory (rather than confirmatory) type of analysis for spatial association rules in GIS.
  - Look out for patterns or anomalies with reduced explicit direction on
    - where to look,
    - what to look for,
    - when to look.
- Foster the philosophy: "Let the data speak for itself".

• © Vladimir Estivill-Castro 7

Spatial Data Mining
- Not to be confused with building the data sets
  - Remote sensing, classification, supervised learning.
  - This is collecting the data.
- Set out not just to work in geographic space, but to operate simultaneously in the two other principal domains of Geographic Information Systems:
  - temporal space
  - attribute space.

© Vladimir Estivill-Castro 8

Spatial association rule
- A rule indicating associations and relationships among a set of spatial and possibly non-spatial predicates.
- Examples:
  - Most big Canadian cities are close to the USA border.
  - Most big Australian cities are on the East coast.
- Note: From implicit to explicit knowledge.

• © Vladimir Estivill-Castro 9

Generalization-based Spatial Data Mining (J. Han)
- Non-spatial dominant generalization
- Spatial-dominant generalization

ZOOM OUT → LOSE DETAIL → GAIN KNOWLEDGE

© Vladimir Estivill-Castro 10

Non-spatial dominant generalization

[Figure: a layer of parcels linked to attribute data (e.g., parcel 28, Donald, $125,000), generalized to a map with a small number of regions described at a high level by house-price ranges: 0-50,000; 50,000-100,000; 100,000-500,000; above 500,000.]

Last step: find correlations between data layers.

• © Vladimir Estivill-Castro 11

Spatial-dominant generalization
- Data generalized by spatial references
- Spatial data hierarchies (geographic or administrative regions)
  - supplied by users/experts
  - data structures (quadtrees, R-trees)
- Clustering
- Find a description as a high-level concept (as a set of predicates)

© Vladimir Estivill-Castro 12

Spatial dominant generalization

[Figure: layer of parcels → cluster → find high-level description, with clusters labeled Affordable and Expensive.]

"Expensive single houses in the Vancouver urban area are along the beach and around two city parks."

• © Vladimir Estivill-Castro 13

Clustering
- "Clustering is the task of identifying groups in a data set by some natural criteria of similarity."
  - Classic reference on clustering: Duda, R.O. and Hart, P.E. "Pattern Classification and Scene Analysis", John Wiley & Sons, New York, US (1973).
- For geo-referenced data, the most obvious measure of similarity is distance (Euclidean distance).
  - Similarity is relatively well-defined for geo-referenced data.
- For categorical data, the most obvious measure of similarity is the Hamming distance (see the sketch below).
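As a point of reference, here is a minimal sketch of the two dissimilarity measures just mentioned; the function names and the tuple representation of records are illustrative, not taken from the slides.

```python
import math

def euclidean_distance(p, q):
    """Distance between two geo-referenced points given as (x, y) tuples."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def hamming_distance(r, s):
    """Number of attributes on which two categorical records disagree."""
    assert len(r) == len(s)
    return sum(1 for a, b in zip(r, s) if a != b)

# Example: two categorical records of dimension 5 disagree on 2 attributes
print(hamming_distance(("a", "b", "a", "c", "a"), ("a", "b", "c", "c", "b")))  # 2
```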

© Vladimir Estivill-Castro 14

Distance-Based Spatial Clustering Analysis (Related Work)
- Statistical approaches: scan data frequently, iterative optimization, hierarchical clustering, etc.
- CLARANS (Ng & Han '94): randomized search (sampling) + PAM (a distance-based clustering algorithm).
- DBSCAN (Ester et al. '96): density-based clustering using spatial data structures (R*-tree).
- BIRCH (Zhang et al. '96): Balanced iterative reducing and clustering using hierarchies.
  - Focus on densely occupied portions of the data space.
  - Measurement reflects the "natural" closeness of points.
  - A height-balanced tree (CF-tree) is used for clustering.
- Describe aggregate proximity relationships (Knorr & Ng '96).

• © Vladimir Estivill-Castro 15

Why clustering?
- It provides a means of generalization of the spatial component of the data associated with a GIS.
- This generalization is complementary to the techniques for generalization used in
  - inductive machine learning (symbolic learning):
    - generalizes from attribute-vector instances to logic rules.
  - data mining:
    - generalizes the content of a database to some rule about the instances of the database.

© Vladimir Estivill-Castro 16

Illustration

• © Vladimir Estivill-Castro 17

Illustration

© Vladimir Estivill-Castro 18

Illustration

• © Vladimir Estivill-Castro 19

Illustration

© Vladimir Estivill-Castro 20

Illustration

• © Vladimir Estivill-Castro 21

© Vladimir Estivill-Castro 22

Clustering
- It is a central operation inside a knowledge discovery query.
- Clustering may need to be performed as a result of a knowledge discovery set on a new slice of the data:
  - using different attributes of the attribute-oriented data
  - using a new selection of the attribute-oriented data
  - using different layers of the spatial data
  - using a new selection of layers of the spatial data
- It may not be possible to store a pre-computed clustering.
- It may not be possible to use indexing data structures to support the clustering algorithms
  - unless such an index can be computed fast.

• © Vladimir Estivill-Castro 23

Clustering
- Based on optimizing some measure of the cohesion within clusters vs. the dissimilarity between clusters.
- Non-hierarchical clustering (bottom-up)
  - Concentrates on measuring the cohesion: which data points belong together.
  - Alternative measures of cluster cohesion result in different clusters.
- Hierarchical clustering (top-down)
  - Finds the biggest difference first, which creates two categories, then recursively clusters those categories.
  - Usually O(n²); thus, impractical for KDDM.

© Vladimir Estivill-Castro 24

Approaches to non-hierarchical clustering
- Group criterion
  - All pairs of points inside a cluster should be as similar as possible.
- Representative-based approaches
  - The total distance of the points in a cluster to their typical representative should be as small as possible.
  - Center clusters:
    - The representative can be an average outside the current set of observations.
  - Median (medoid) clusters:
    - The representative has to be one of the observations.
- Density-based [Ester, Kriegel, Sander and Xu, KDD-96]

• © Vladimir Estivill-Castro 25

Group criterion

© Vladimir Estivill-Castro 26

Group criterion
- Minimize, among all partitions P = {P_1, ..., P_k} of the n data points into k groups, the quality measure (see the sketch below)

  GC(P) = Σ_{i=1..k} Σ_{s_u, s_v ∈ P_i} w_u w_v d(s_u, s_v)
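A minimal sketch of how GC(P) could be evaluated for a given partition; the dictionary-based representation of the partition and the default uniform weights are assumptions for illustration.

```python
from itertools import combinations

def group_criterion(partition, d, w=None):
    """GC(P): sum over clusters of the weighted pairwise distances inside each cluster.

    partition: dict mapping a cluster id to a list of points
    d:         distance function between two points
    w:         optional dict of per-point weights (defaults to 1.0)
    """
    total = 0.0
    for cluster in partition.values():
        for s_u, s_v in combinations(cluster, 2):   # each unordered pair counted once
            w_u = w.get(s_u, 1.0) if w else 1.0
            w_v = w.get(s_v, 1.0) if w else 1.0
            total += w_u * w_v * d(s_u, s_v)
    return total
```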

• © Vladimir Estivill-Castro 27

Center criterion

© Vladimir Estivill-Castro 28

Center criterion
- Minimize, among all sets C of k points, the quality measure

  CC(C) = Σ_{i=1..n} w_i d(s_i, rep[s_i, C])²

  where the representative of s_i in C is rep[s_i, C] (the point of C nearest to s_i).

• © Vladimir Estivill-Castro 29

Clustering in KDD
- Scaling k-Means
  - Bradley, Fayyad and Reina, KDD-98
- Scaling Expectation Maximization
  - Bradley, Fayyad and Reina, NIPS-98
- Literature
  - k-Means is sensitive to initial points
  - EM depends on the structural model

© Vladimir Estivill-Castro 30

k-means
- Originally, for numerical data
- Algorithm (see the sketch below):
  - choose a random set of k representatives
  - repeat until the representatives do not change:
    - assign each point to its nearest representative
    - compute the average (mean) of each class and make these the new set of representatives
- Complexity is Θ(tmkn) where
  - t is the number of hill-climbing steps (loop iterations)
  - m is the number of attributes
  - k is the number of clusters
  - n is the number of records (data points)
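A minimal sketch of the k-means loop just described, using NumPy; the convergence test and the initialization by sampling k records are the standard choices, assumed here rather than taken from the slides.

```python
import numpy as np

def k_means(X, k, rng=np.random.default_rng(0), max_steps=100):
    """X: (n, m) array of numerical records. Returns (labels, centers)."""
    n, m = X.shape
    centers = X[rng.choice(n, size=k, replace=False)]   # k random representatives
    for _ in range(max_steps):                           # t hill-climbing steps
        # assign each point to its nearest representative
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute the mean of each class (keep the old center if a class is empty)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):            # representatives unchanged
            break
        centers = new_centers
    return labels, centers
```

Each iteration of the loop costs on the order of mkn operations, which matches the Θ(tmkn) bound quoted above.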

• © Vladimir Estivill-Castro 31

k-means

© Vladimir Estivill-Castro 32

k-means
- When the number n of records is large, the complexity is Θ(tmkn) = Θ(n)
  - IT IS LINEAR!
- The disadvantages:
  - for numerical data
    - fixed with k-modes [Huang 97a, Huang 97b]
  - it is statistically biased
  - it is very sensitive to the presence of noise and outliers as well as to the initial random clustering
  - representatives may not be valid records

• © Vladimir Estivill-Castro 33

Bias (illustration)

© Vladimir Estivill-Castro 34

The quality of the clustering: centroid-based

• © Vladimir Estivill-Castro 35

Clustering desiderata (Bradley, Fayyad and Reina)
- Linear time
- "Anytime algorithms": best answer ready
- Stoppable, resumable
- Work within the confines of the given RAM
- Can use the data in the order it is present in the DB
- Incremental: can incorporate additional data
- Operate on a forward-only cursor over a view

© Vladimir Estivill-Castro 36

Median (medoid) criterion

• © Vladimir Estivill-Castro 37

Median (medoid) criterion
- Minimize, among all sets M ⊆ S of k points, the quality measure (see the sketch below)

  MC(M) = Σ_{i=1..n} w_i d(s_i, rep[s_i, M])

  where the representative of s_i in M ⊆ S is rep[s_i, M] (the medoid nearest to s_i).
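A minimal sketch of evaluating MC(M) exactly: each record is charged the distance to its nearest medoid. The list-of-records data layout and the default uniform weights are assumptions for illustration.

```python
def medoid_criterion(S, M, d, w=None):
    """MC(M): sum over all records of the (weighted) distance to the nearest medoid.

    S: list of records, M: list of medoids (a subset of S),
    d: distance function, w: optional list of per-record weights.
    """
    total = 0.0
    for i, s in enumerate(S):
        nearest = min(d(s, m) for m in M)        # distance to rep[s_i, M]
        total += (w[i] if w else 1.0) * nearest
    return total
```

One call to this function costs Θ(kn) = Θ(n) time for fixed k, the cost quoted on the "Medoids criterion" slide below.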

© Vladimir Estivill-Castro 38

Medoids criterion
- Benefits:
  - The search space has the discrete structure of a graph.
    - It can be searched by hill-climbers efficiently.
    - The search can guarantee local optimality.
  - Evaluating MC(M) on a set of medoids requires linear time, that is, Θ(n) time.
  - Robust to outliers.
  - There is a natural meaning to the notion of representative.
  - If MC(M) is good, GC(P) is usually not bad.
- It is equivalent to the p-median problem (NP-complete).

• © Vladimir Estivill-Castro 39

Environment for searching a (very large) graph
- A graph is a structure with
  - nodes and
  - arcs.
- A discrete search space with the structure of a graph is a graph where each node has a quality criterion Q.
  - Q is also called the objective function
  - Q : set of nodes → ℝ

© Vladimir Estivill-Castro 40

How is medoid-based clustering a search problem in a graph?
- The nodes of the graph are the subsets M ⊆ S of size k.
- Two nodes (subsets) are adjacent if they differ in exactly one site (point).
  - That is, the set of medoids M1 is adjacent to the set of medoids M2 if and only if |M1 ∩ M2| = k-1 (see the sketch below).
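A minimal sketch of this neighborhood structure: given a current set of medoids, every neighbor is obtained by swapping one medoid for one non-medoid. The generator below is illustrative, not from the slides.

```python
def medoid_neighbors(S, M):
    """Yield every set of medoids adjacent to M in the search graph.

    Each neighbor replaces exactly one medoid m in M by one non-medoid s in S,
    so |M ∩ neighbor| = k-1. There are k * (n - k) neighbors in total.
    """
    M = set(M)
    for m in M:
        for s in S:
            if s not in M:
                yield (M - {m}) | {s}

# Example: with n = 5 sites and k = 2 medoids there are 2 * 3 = 6 neighbors.
sites = ["s1", "s2", "s3", "s4", "s5"]
print(sum(1 for _ in medoid_neighbors(sites, {"s1", "s2"})))  # 6
```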

• © Vladimir Estivill-Castro 41

Versions of hill-climbing
- Global hill-climbing
  - Explore all the neighbors of a node.
- Randomized hill-climbing
  - Explore a random sample of size m of the neighbors.
- Restricted hill-climbing
  - Explore a neighborhood of the current node defined by those interchanges with sites no farther away than a parameter r from the medoid they would be swapped with.
- Local hill-climbing
  - Move to a neighbor as soon as one with an improvement is found, but continue the exploration with the next potential swap (do not evaluate a site for a swap again until all others have had a chance [Tabu search]).

© Vladimir Estivill-Castro 42

A comparison (CPU seconds)

  n    | Local Hill Climbing | Randomized Hill Climbing | Global Hill Climbing
  100  |    10.1       2.0   |      44.2        4.2     |      61.0       8.1
  200  |    41.2       5.3   |     204.3       32.4     |     350.3      60.9
  500  |   338.2      10.3   |    1435.3      123.8     |    2318.7     240.2
  1000 |   939.2      42.3   |

• © Vladimir Estivill-Castro 43

A comparison (CPU seconds)

© Vladimir Estivill-Castro 44

[Figure: the n data points split into the k current representatives and the remaining n-k points.]

• © Vladimir Estivill-Castro 45

[Figure: a point from the n-k non-representatives attempts to swap with one of the k current representatives.]

© Vladimir Estivill-Castro 46

[Figure: if an improvement is found, the swap is performed and the point joins the k representatives.]

• © Vladimir Estivill-Castro 47

[Figure: if an improvement is NOT found, the point returns to the n-k non-representatives.]

© Vladimir Estivill-Castro 48

Local Hill-climbing
- A combination of hill-climbing and Tabu search (see the sketch below)
- For each data point s_i (not a medoid):
  - For each point m_j selected as a medoid:
    - Explore whether the swap s_i ↔ m_j results in an improvement.
  - If an improvement: take the improvement (swap s_i with the best m_j).
  - If no improvement: place s_i at the end of a queue.
- Stop when a round does not cause an improvement.
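A minimal sketch of this local hill-climber. The queue stands in for the Tabu circular list, and the details (first-improvement per point, best medoid to evict, the inlined medoid criterion) are my reading of the slide rather than a verbatim transcription.

```python
from collections import deque

def local_hill_climbing(S, d, seed_medoids):
    """Improve a set of medoids by first-improvement swaps with a Tabu-style queue."""

    def mc(M):
        # Medoid criterion MC(M): total distance of every record to its nearest medoid.
        return sum(min(d(s, m) for m in M) for s in S)

    M = list(seed_medoids)
    best = mc(M)
    queue = deque(s for s in S if s not in M)      # non-medoid sites, in circular order
    improved_in_round = True
    while improved_in_round:
        improved_in_round = False
        for _ in range(len(queue)):
            s = queue.popleft()
            best_j, best_val = None, best
            for j in range(len(M)):                 # try swapping s with every medoid m_j
                candidate = M[:j] + [s] + M[j + 1:]
                val = mc(candidate)
                if val < best_val:
                    best_j, best_val = j, val
            if best_j is not None:                  # take the best improving swap for s
                queue.append(M[best_j])             # the evicted medoid becomes a candidate again
                M[best_j] = s
                best = best_val
                improved_in_round = True
            else:
                queue.append(s)                     # no improvement: back of the queue
    return M, best
```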

• © Vladimir Estivill-Castro 49

The O(n²) complexity of local hill-climbing
- There are some O(n) methods from statistics
  - k-means
    - a hill-climber
- They have disadvantages:
  - convergence
  - outliers
  - sensitivity to the initial starting point
- Trade-off of complexity vs. quality

© Vladimir Estivill-Castro 50

The quality of the clustering: centroid-based

• © Vladimir Estivill-Castro 51

The quality of the clustering: medoid-based

© Vladimir Estivill-Castro 52

The behavior of the discriminant

• © Vladimir Estivill-Castro 53

The behavior of the Tabu circular list

© Vladimir Estivill-Castro 54

Using the Delaunay Triangulation
- The algorithm requires O(n log n) time
- The number k of clusters can be predicted

• © Vladimir Estivill-Castro 55

Robust clustering
- For large sets of categorical data: O(n)
- For large sets of spatial data: O(n log n)
- Distance-based clustering
  - but it is approximating density information
- Representatives are data items
- Experiments show merits
  - in its complexity
  - in the quality of the groups
  - trade-off in time vs. quality as u is enlarged

© Vladimir Estivill-Castro 56

Idea
- Evaluating MC[M] takes Ω(n) time because MC[M] = Σ_{i=1..n} w_i d(s_i, rep[s_i, M]):
  - Ω(n) time to find rep[s_i, M] (for all n records s_i)
  - Ω(n) time since the sum has n terms
- Approximating MC[M] in constant time gives an O(n) algorithm:
  - consider only the most important terms in the sum
  - ask what it is that MC[M] intends to measure about the cluster

• © Vladimir Estivill-Castro 57

Intent of the measurement

  MC(M) = Σ_{i=1..n} w_i d(s_i, rep[s_i, M])

  where the representative of s_i in M ⊆ S is rep[s_i, M].

- The expected discrepancy between an item and its representative.
- In the initial steps, large distances are considered, but they are of points that would not belong to that representative.

© Vladimir Estivill-Castro 58

Idea: Sum only the distances of those that are near the representative
- Good representatives are those that have u neighbors very near.
- It is more important to find a few true clusters than lots of imaginary ones.

• © Vladimir Estivill-Castro 59

Eliminating the contribution of outliers to a cluster
- Preprocessing (see the sketch below):
  - For each s_i, we find the u/2 records that are the u/2 nearest neighbors of s_i.
  - For each s_i, find another u/2 records at random.
  - We construct a regular graph of degree u:
    - each s_i is linked to its u/2 nearest neighbors and another u/2 random elements.
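A minimal sketch of this preprocessing step: for every record it collects u neighbors (u/2 nearest plus u/2 random). A brute-force nearest-neighbor scan is used purely for clarity, where the slides would use the trie (categorical data) or the Delaunay triangulation (spatial data).

```python
import random

def proximity_graph(S, u, d, rng=random.Random(0)):
    """Return {index: list of u neighbor indices} for the records in S."""
    n = len(S)
    graph = {}
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # u/2 nearest neighbors of s_i (brute force here, for illustration only)
        nearest = sorted(others, key=lambda j: d(S[i], S[j]))[: u // 2]
        # another u/2 records chosen at random from the rest
        rest = [j for j in others if j not in nearest]
        randoms = rng.sample(rest, min(u // 2, len(rest)))
        graph[i] = nearest + randoms
    return graph
```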

© Vladimir Estivill-Castro 60

Efficiently approximating MC[M']
- When evaluating a new set of medoids M' (it differs from M in just one medoid), see the sketch below:
  1. bring in all the u nearest neighbors of the k representatives in M';
  2. build a set N with them (N may have fewer than k(u+1) records);
  3. classify and evaluate MC[M'] using only N (if N has fewer than k(u+1) records, complete to k(u+1) terms using the largest distance found in step 2).
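A minimal sketch of this approximate evaluation, reusing the proximity-graph dictionary sketched on the previous slide; the exact padding rule for a short N and the data layout are my assumptions for illustration.

```python
def approx_medoid_criterion(S, medoid_idx, graph, d):
    """Approximate MC[M'] using only the u neighbors of each of the k medoids.

    S:          list of records
    medoid_idx: indices of the k medoids in S
    graph:      {index: list of u neighbor indices} from the preprocessing step
    d:          distance function
    """
    k = len(medoid_idx)
    u = max(len(graph[m]) for m in medoid_idx)
    # steps 1 and 2: the set N = the medoids plus their u neighbors
    N = set(medoid_idx)
    for m in medoid_idx:
        N.update(graph[m])
    # step 3: classify only the records in N against the medoids
    terms = [min(d(S[i], S[m]) for m in medoid_idx) for i in N]
    # pad to k(u+1) terms with the largest distance seen, if N is short
    quota = k * (u + 1)
    if len(terms) < quota:
        terms += [max(terms)] * (quota - len(terms))
    return sum(terms)
```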

• © Vladimir Estivill-Castro 61

Illustration

[Figure: k = 2, u = 7. Only 14 elements are considered, although 2(7+1) = 16 is the quota. The red line is counted 3 times; the blue lines are not counted.]

© Vladimir Estivill-Castro 62

How can we find u neighbors for each record?
- Using the Delaunay Triangulation

• © Vladimir Estivill-Castro 63

Successful Example
- Recent analysis of a Bank of America loan database
  - 250 fields per customer
  - back to 1914!
  - over nine million records
- A clustering tool was used to automatically segment customers into groups with many similar categorical attributes.
  - 14 groups identified; only one could be explained.
  - The interesting cluster had 2 properties:
    - 39% of customers had business and personal accounts with the bank
    - this cluster accounted for 27% of the 11% of customers that had been classified by a decision tree as likely respondents to a home equity loan offer

© Vladimir Estivill-Castro 64

First experiments to illustrate Robustness
- Reproducing the experiment on cluster quality [Huang 97]
  - Huang designed methods to construct the initial clustering.
    - The methods depend on the order of records in the file.
- Using the soy-bean data of the Machine Learning community
  - 47 records (classified)
  - 35 categorical attributes
  - 4 classes (10, 10, 10, 17)
- A method achieved a "good clustering" if it recovered the original classes with 5 or fewer errors.
- The distance metric is the Hamming distance (maximum value is 35).

• © Vladimir Estivill-Castro 65

Results

  Misclassified | k-Means for categorical attributes                       | k-Medoids
  records       | random init. | different-point init. | frequency init.   | random init.
  0             |      12      |          20           |       29          |   0
  1             |       2      |          15           |       17          | 101 (MC[M]=206)
  2             |       8      |          26           |       50          |  11 (MC[M]=208)
  3             |       2      |          20           |       19          |  88 (MC[M]=206)
  4             |       2      |          25           |       12          |   0
  5             |       2      |           3           |        3          |   0
  more than 5   |  172 (86%)   |      111 (55%)        |    70 (35%)       |   0 (0%)

© Vladimir Estivill-Castro 66

The medoids results
- Never make a mistake with the first 3 classes.
- The record misclassified is always the same.
- Two records are at the same distance from class 3 as from class 4;
  - thus the equal MC[M] value but a different number of misclassifications.

• © Vladimir Estivill-Castro 67

The second experiment
- The data set is expanded by inserting randomly generated records.
  - The i-th attribute is obtained by randomly selecting a record in the original file and copying its i-th attribute.

© Vladimir Estivill-Castro 68

Results

[Figure: two plots against the % of noise, comparing k-Means and k-Medoids; one shows the % of poor clusterings, the other the largest absolute number of misclassifications.]

• © Vladimir Estivill-Castro 69

How costly?
- Typically, k is small and n very large (in data mining and knowledge discovery, k is very close to 10 and no more than 50, but n can easily be above 100 and usually 1000).
- Evaluating the objective function on a node requires O(kn) = O(n) time.

© Vladimir Estivill-Castro 70

How can we find u neighbors for each record?
- Records have categorical attributes
  - attribute-vectors of dimension d

• © Vladimir Estivill-Castro 71

Idea: Sum only the distances of those that are near the representative
- Good representatives are those that have u neighbors very near.
- It is more important to find a few true clusters than lots of imaginary ones.

© Vladimir Estivill-Castro 72

A trie of common prefixes

[Figure: a trie over the alphabet {a, b, c} for five records of dimension d = 5:
record 1 = aabcc, record 2 = abacb, record 3 = abaca, record 4 = babcc, record 5 = abaca.
Records sharing a prefix share a path; records 3 and 5 are identical and end at the same leaf.]
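A minimal sketch of building such a trie of common prefixes over categorical records; the nested-dictionary representation and the leaf lists of record ids are my implementation choices, not the original data structure's layout.

```python
def build_trie(records):
    """records: dict {record_id: string of categorical symbols}. Returns a nested-dict trie."""
    trie = {}
    for rid, symbols in records.items():
        node = trie
        for symbol in symbols:
            node = node.setdefault(symbol, {})    # descend, creating nodes on demand
        node.setdefault("_ids", []).append(rid)   # identical records share a single leaf
    return trie

# The five records shown in the figure above
records = {1: "aabcc", 2: "abacb", 3: "abaca", 4: "babcc", 5: "abaca"}
trie = build_trie(records)
print(trie["a"]["b"]["a"]["c"]["a"]["_ids"])   # [3, 5]: records 3 and 5 share one leaf
```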

• © Vladimir Estivill-Castro 73

Finding at least the nearest neighbor for a record s_i
- Find the nearest neighbor of each record in the trie.
  - Time independent of n (amortized constant time).
- Find all neighbors at Hamming distance less than Λ.
  - Time independent of n but exponential in d.
  - Λ should be 2 or 3.

© Vladimir Estivill-Castro 74

Complete to u nearby records
- Apply a transitive closure on the links found in the step before (see the sketch below):
  - using Breadth-First Search from each node
  - and stopping when u nearby records are found.
- Time is O(u) per record.
- Total preprocessing is O(n) time.
- Idea:
  - If y is nearby x and x is nearby z, then y is nearby z.
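A minimal sketch of this completion step: starting from the few links produced by the trie search, a breadth-first walk collects up to u nearby records for every node. The adjacency-dictionary input mirrors the earlier proximity-graph sketch and is an assumption for illustration.

```python
from collections import deque

def complete_to_u_neighbors(links, u):
    """links: {record_id: list of record_ids found close by the trie search}.
    Returns {record_id: list of up to u nearby record_ids}, computed by BFS
    (a transitive closure truncated as soon as u records have been collected)."""
    completed = {}
    for start in links:
        seen = {start}
        nearby = []
        frontier = deque(links.get(start, ()))
        while frontier and len(nearby) < u:
            x = frontier.popleft()
            if x in seen:
                continue
            seen.add(x)
            nearby.append(x)                      # a neighbor of a neighbor counts as nearby
            frontier.extend(links.get(x, ()))
        completed[start] = nearby
    return completed

# Example: record 1 is linked to 2, and 2 to 3, so 3 becomes nearby to 1 as well.
print(complete_to_u_neighbors({1: [2], 2: [3], 3: []}, u=2))  # {1: [2, 3], 2: [3], 3: []}
```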

• © Vladimir Estivill-Castro 75

First experiment
- Confirming that the algorithm has linear complexity
- Nursery Data Set (also UCI Repository)
  - 12,960 records
  - 8 categorical dimensions
  - clustering into k = 5 groups
- Ten random subsets
  - 100, 200, 500, 1000, 2000, 4000, 6000, 8000, 10000 and 12000

© Vladimir Estivill-Castro 76

Results

[Figure: observed average CPU time against n, the number of records.]

• © Vladimir Estivill-Castro 77

Second experiment
- For the Soy-bean data there is a set of medoids that results in no misclassification
  - but suboptimal MC[M].
- Use these as virtual centers.
- Find the radius of each class.
- Generate records by choosing one center and then a number between 1 and the radius of the class (uniformly).
- Data sets of arbitrary size can be generated, but they look like the Soy-bean data set.

© Vladimir Estivill-Castro 78

Results

  File | k-Means for categorical attr. | k-Medoids, random init.,  | k-Medoids, random init.,
  size | (frequency initialization)    | interchange as Tabu       | with proximity graph
       | recovery | no recovery        | recovery | no recovery    | recovery | no recovery
  100  |   48%    |    52%             |  100%    |     0%         |   97%    |     3%
  200  |   66%    |    34%             |   94%    |     6%         |   92%    |     8%

• © Vladimir Estivill-Castro 79

Summary
- Clustering of spatial data is performed using, as the similarity criterion between two sites, some measure of the distance in space between the two sites.
- There are many ways to formalize a criterion for the quality of a clustering; however, almost all of them are infeasible to solve to optimality for large data sets.
  - The medoid approach offers several advantages over others.
- We have presented a revision of local hill-climbing for use with large data sets that
  - is computationally more efficient than previous approaches
  - does not compromise the quality of the clustering.

© Vladimir Estivill-Castro 80

Food for thought
- "Most conventional statistical approaches are limited to datamining. Statisticians historically placed an emphasis on managing the probability of type one error. That is, they have sought to control the likelihood of accepting a proposition when it is false. This is appropriate for hypothesis testing, where there is a single hypothesis under consideration. However, it is often not appropriate for datamining ...; it can be undesirable ... to reject a hypothesis that is (partially) true."
- Data Mining is not hypothesis testing but hypothesis generation.