Transcript of: Clustering and illustration with spatial and categorical data · 2016. 8. 26.

Clustering and illustration with spatial and categorical data
© Vladimir Estivill-Castro, School of Computing and Information Technology
Knowledge Discovery
• The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in large data sets.
• Data:
  • The geo-referenced layers.
  • The census data.
• Information:
  • The average population per administrative region.
  • Comparison of mortality rates by ethnic background.
• Knowledge:
  • The patterns of growth of population densities, and valid explanations for them.
Mining Different Kinds of Knowledge
• Characterization: Generalize, summarize, and possibly contrast data characteristics, e.g., dry vs. wet regions.
• Association: Rules like “inside(x, city) → near(x, highway)”.
• Classification: Classify data based on the values of a classifying attribute, e.g., classify countries based on climate.
• Clustering: Cluster data to form new classes, e.g., cluster houses to find distribution patterns.
• Trend and deviation analysis: Find and characterize evolution trends, sequential patterns, similar sequences, and deviation data, e.g., housing market analysis.
• Pattern-directed analysis: Find and characterize user-specified patterns in large databases, e.g., volcanoes on Mars.
Visual illustration (2d)
Clustering
• The task of segmenting a heterogeneous population into a number of more homogeneous subgroups or clusters [Berry & Linoff 97].
  • The top-down view: we partition into subgroups.
• The task of finding groups in a data set by some natural criteria of similarity [Duda 73].
  • The bottom-up view: we agglomerate those that are nearby.
Spatial data mining
• The discovery of interesting, implicit knowledge in spatial databases.
  • An important step toward understanding the use of the spatial data.
• It is an exploratory (rather than confirmatory) type of analysis for spatial association rules in GIS.
  • Look out for patterns or anomalies with reduced explicit direction on
    • where to look,
    • what to look for,
    • when to look.
• Foster the philosophy: “Let the data speak for itself”.
Spatial Data Mining
• Not to be confused with building the data sets:
  • remote sensing, classification, supervised learning;
  • this is collecting the data.
• Set out not just to work in geographic space, but to operate simultaneously in the two other principal domains of Geographic Information Systems:
  • temporal space,
  • attribute space.
Spatial association rule
• A rule indicating associations and relationships among a set of spatial and possibly non-spatial predicates.
• Examples:
  • Most big Canadian cities are close to the USA border.
  • Most big Australian cities are on the East coast.
• Note: from implicit to explicit knowledge.
Generalization-based Spatial Data Mining (J. Han)
• Non-spatial-dominant generalization.
• Spatial-dominant generalization.

ZOOM OUT: LOSE DETAIL, GAIN KNOWLEDGE
Non-spatial-dominant generalization

[Figure: a layer of parcels plus attribute data (e.g. house “28 Donald”, price $125,000) is generalized into a map with a small number of regions carrying a high-level description, using price bands: 0–50,000; 50,000–100,000; 100,000–500,000; above 500,000.]

• Last step: find correlations between data layers.
Spatial-dominant generalization
• Data generalized by spatial references:
  • spatial data hierarchies (geographic or administrative regions)
    • supplied by users/experts,
  • data structures (quadtrees, R-trees),
  • clustering.
• Find a description as a high-level concept (as a set of predicates).
Spatial-dominant generalization

[Figure: a layer of parcels is clustered (“Affordable” vs. “Expensive”), then a high-level description is found.]

“Expensive single houses in the Vancouver urban area are along the beach and around two city parks.”
Clustering
• “Clustering is the task of identifying groups in a data set by some natural criteria of similarity.”
  • Classic reference on clustering: Duda, R.O. and Hart, P.E., “Pattern Classification and Scene Analysis”, John Wiley & Sons, New York, US (1973).
• For geo-referenced data, the most obvious measure of similarity is distance (Euclidean distance).
  • Similarity is relatively well defined for geo-referenced data.
• For categorical data, the most obvious measure of similarity is the Hamming distance.
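The Hamming distance between two categorical records simply counts mismatching attribute positions. A minimal sketch (the function name is mine):

```python
def hamming(a, b):
    """Number of attribute positions in which two categorical records differ."""
    assert len(a) == len(b), "records must have the same attributes"
    return sum(x != y for x, y in zip(a, b))

# Two records differing only in the second attribute:
print(hamming(["red", "small", "round"], ["red", "large", "round"]))  # 1
```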
Distance-Based Spatial Clustering Analysis (Related Work)
• Statistical approaches: scan data frequently, iterative optimization, hierarchical clustering, etc.
• CLARANS (Ng & Han ’94): randomized search (sampling) + PAM (a distance-based clustering algorithm).
• DBSCAN (Ester et al. ’96): density-based clustering using spatial data structures (R*-tree).
• BIRCH (Zhang et al. ’96): Balanced Iterative Reducing and Clustering using Hierarchies.
  • Focuses on densely occupied portions of the data space.
  • The measurement reflects the “natural” closeness of points.
  • A height-balanced tree (CF-tree) is used for clustering.
• Describing aggregate proximity relationships (Knorr & Ng ’96).
Why clustering?
• It provides a means of generalizing the spatial component of the data associated with a GIS.
• This generalization is complementary to the generalization techniques used in
  • inductive machine learning (symbolic learning):
    • generalizes from attribute-vector instances to logic rules;
  • data mining:
    • generalizes the content of a database to some rule about the instances of the database.
Illustration

[A sequence of figure slides illustrating the clustering steps; the images are not reproduced in this transcript.]
Clustering
• It is a central operation inside a knowledge discovery query.
• Clustering may need to be performed, as a result of a knowledge discovery task, on a new slice of the data:
  • using different attributes of the attribute-oriented data,
  • using a new selection of the attribute-oriented data,
  • using different layers of the spatial data,
  • using a new selection of layers of the spatial data.
• It may not be possible to store a pre-computed clustering.
• It may not be possible to use indexing data structures to support the clustering algorithms
  • unless such an index can be computed fast.
Clustering
• Based on optimizing some measure of the cohesion within clusters vs. the dissimilarity between clusters.
• Non-hierarchical clustering (bottom-up):
  • Concentrates on measuring cohesion: which data points belong together.
  • Alternative measures of cluster cohesion result in different clusters.
• Hierarchical clustering (top-down):
  • Finds the biggest difference first, creating two categories, then recursively clusters those categories.
  • Usually O(n²); thus, impractical for KDDM.
Approaches to non-hierarchical clustering
• Group criterion:
  • All pairs of points inside a cluster should be as similar as possible.
• Representative-based approaches:
  • The total distance of the points in a cluster to their typical representative should be as small as possible.
  • Center clusters:
    • The representative can be an average outside the current set of observations.
  • Median (medoid) clusters:
    • The representative has to be one of the observations.
• Density-based [Ester, Kriegel, Sander and Xu, KDD-96].
Group criterion

GC(P) = Σi=1..k Σsu,sv∈Pi wu wv d(su, sv)

• Minimize, among all partitions P = {P1, …, Pk} of the n data points into k groups, the quality measure GC(P).
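A sketch of evaluating the group criterion (assuming each unordered pair inside a cluster is counted once; the function name is mine):

```python
from itertools import combinations

def group_criterion(partition, d, w):
    """GC(P): over all k groups, sum the weighted pairwise distances
    w_u * w_v * d(s_u, s_v) between points placed in the same group."""
    return sum(w[su] * w[sv] * d(su, sv)
               for cluster in partition
               for su, sv in combinations(cluster, 2))

# Toy 1-D data: the natural partition has a smaller GC than a mixed one.
pts = [0.0, 1.0, 10.0, 11.0]
w = [1.0, 1.0, 1.0, 1.0]
d = lambda i, j: abs(pts[i] - pts[j])
print(group_criterion([[0, 1], [2, 3]], d, w))  # 2.0
print(group_criterion([[0, 2], [1, 3]], d, w))  # 20.0
```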
Center criterion

CC(C) = Σi wi d(si, rep[si, C])²

• The representative of si in C is rep[si, C].
• Minimize, among all sets C of k points, the quality measure CC(C).
Clustering in KDD
• Scaling k-Means
  • Bradley, Fayyad and Reina, KDD-98.
• Scaling Expectation Maximization
  • Bradley, Fayyad and Reina, NIPS-98.
• Literature:
  • k-Means is sensitive to initial points.
  • EM depends on the structural model.
k-means
• Originally, for numerical data.
• Algorithm:
  • choose a random set of k representatives;
  • repeat until the representatives do not change:
    • assign each point to its nearest representative;
    • compute the average (mean) of each class and make these the new set of representatives.
• Complexity is Θ(tmkn), where
  • t is the number of hill-climbing steps (loop iterations),
  • m is the number of attributes,
  • k is the number of clusters,
  • n is the number of records (data points).
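The loop above can be sketched for 1-D numeric data (Lloyd's iteration; variable names are mine):

```python
import random

def k_means(points, k, seed=0):
    """Choose k random representatives, then repeat: assign each point to
    its nearest representative and recompute each class mean, until the
    representatives no longer change."""
    reps = random.Random(seed).sample(points, k)
    while True:
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: abs(p - reps[i]))].append(p)
        new_reps = [sum(c) / len(c) if c else reps[i]
                    for i, c in enumerate(clusters)]
        if new_reps == reps:          # converged after t loop iterations
            return reps
        reps = new_reps

data = [0.9, 1.0, 1.1, 9.8, 10.0, 10.2]
print(sorted(k_means(data, 2)))  # two means, one near 1.0 and one near 10.0
```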
k-means

[Figure illustrating k-means iterations.]
k-means
• When the number n of records is large, the complexity is Θ(tmkn) = Θ(n).
  • IT IS LINEAR!
• The disadvantages:
  • for numerical data only
    • fixed with k-modes [Huang97a, Huang97b];
  • it is statistically biased;
  • it is very sensitive to the presence of noise and outliers, as well as to the initial random clustering;
  • representatives may not be valid records.
Bias (illustration)

The quality of the clustering: centroid-based
Clustering desiderata (Bradley, Fayyad and Reina)
• Linear time.
• “Anytime algorithms”: best answer always ready.
• Stoppable, resumable.
• Works within the confines of the given RAM.
• Can use the data in the order presented in the DB.
• Incremental: accepts additional data.
• Operates on a forward-only cursor over a view.
Median (medoid) criterion

MC(M) = Σi wi d(si, rep[si, M])

• The representative of si in M ⊆ S is rep[si, M].
• Minimize, among all sets M ⊆ S of k points, the quality measure MC(M).
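A direct evaluation of MC(M) (unit weights assumed; Θ(k·n) work, consistent with the linear-time claim below):

```python
def medoid_criterion(points, medoids, d):
    """MC(M): each point contributes its distance to the nearest medoid
    (its representative rep[s_i, M]); medoids are themselves data points."""
    return sum(min(d(s, m) for m in medoids) for s in points)

pts = [0, 1, 2, 10, 11, 12]
d = lambda a, b: abs(a - b)
print(medoid_criterion(pts, [1, 11], d))  # 4
```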
Medoids criterion
• Benefits:
  • The search space has the discrete structure of a graph.
    • It can be searched efficiently by hill-climbers.
    • The search can guarantee local optimality.
  • Evaluating MC(M) on a set of medoids requires linear time, that is, Θ(n) time.
  • Robust to outliers.
  • There is a natural meaning to the notion of representative.
  • If MC(M) is good, GC(P) is usually not bad.
• It is equivalent to the p-median problem (NP-complete).
Environment for searching a (very large) graph
• A graph is a structure with
  • nodes and
  • arcs.
• A discrete search space with the structure of a graph is a graph where each node has a quality criterion Q.
  • Q is also called the objective function: Q : set of nodes → R.
How is medoid-based clustering a search problem in a graph?
• The nodes of the graph are the subsets M ⊆ S of size k.
• Two nodes (subsets) are adjacent if they differ in exactly one site (point).
  • That is, the set of medoids M1 is adjacent to the set of medoids M2 if and only if |M1 ∩ M2| = k − 1.
Versions of hill-climbing
• Global hill-climbing:
  • Explore all the neighbors of a node.
• Randomized hill-climbing:
  • Explore a random sample of size m of the neighbors.
• Restricted hill-climbing:
  • Explore a neighborhood of the current node defined by those interchanges with sites no further away than a parameter r from the site they would be swapped with.
• Local hill-climbing:
  • Move to a neighbor as soon as one with an improvement is found, but continue the exploration with the next potential swap (do not re-evaluate a site for a swap until all others have had a chance [Tabu search]).
A comparison (CPU seconds)

n       Local Hill Climbing    Randomized Hill Climbing    Global Hill Climbing
100         10.1     2.0            44.2     4.2                61.0     8.1
200         41.2     5.3           204.3    32.4               350.3    60.9
500        338.2    10.3          1435.3   123.8              2318.7   240.2
1000       939.2    42.3
A comparison (CPU seconds)

[Chart comparing the CPU time of the three hill-climbers.]
[Diagram slides: the data set is split into the k current representatives and the remaining n − k points; a swap of a point with a current representative is attempted; one slide shows the case where an improvement is found, another the case where an improvement is NOT found.]
Local Hill-climbing
• A combination of hill-climbing and Tabu search.
• For each data point si (not a medoid):
  • For each point mj selected as a medoid:
    • Explore whether the swap si ↔ mj results in an improvement.
  • If an improvement is found, swap si with the best such mj.
  • If no improvement, place si at the end of a queue.
• Stop when a round does not cause an improvement.
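A sketch of this scheme (unit weights; the queue plays the Tabu role of giving every point a chance before a point is retried; names are mine):

```python
import random
from collections import deque

def local_hill_climb(points, k, d, seed=0):
    """Local hill-climbing over medoid sets: accept an improving swap as
    soon as one is found, and send non-improving points to the back of a
    queue so all others get a chance first (Tabu-style)."""
    rng = random.Random(seed)
    medoids = set(rng.sample(range(len(points)), k))
    cost = lambda M: sum(min(d(points[i], points[m]) for m in M)
                         for i in range(len(points)))
    best = cost(medoids)
    queue = deque(i for i in range(len(points)) if i not in medoids)
    since_improvement = 0
    while queue and since_improvement < len(queue):
        si = queue.popleft()
        for mj in list(medoids):
            trial = (medoids - {mj}) | {si}
            if cost(trial) < best:
                medoids, best = trial, cost(trial)
                queue.append(mj)      # displaced medoid becomes a candidate
                since_improvement = 0
                break
        else:
            queue.append(si)          # back of the queue, try others first
            since_improvement += 1    # stop after a full round of no gains
    return medoids, best

pts = [0, 1, 2, 10, 11, 12]
M, mc = local_hill_climb(pts, 2, lambda a, b: abs(a - b))
print(sorted(pts[i] for i in M), mc)  # [1, 11] 4
```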
The O(n²) complexity of local hill-climbing
• There are some O(n) methods from statistics:
  • k-means
    • a hill-climber.
• They have disadvantages:
  • convergence,
  • outliers,
  • sensitivity to the initial starting point.
• Trade-off of complexity vs. quality.
The quality of the clustering: centroid-based

The quality of the clustering: medoid-based
The behavior of the discriminant

The behavior of the Tabu circular list
Using the Delaunay Triangulation
• The algorithm requires O(n log n) time.
• The number k of clusters can be predicted.
Robust clustering
• For large sets of categorical data: O(n).
• For large sets of spatial data: O(n log n).
• Distance-based clustering,
  • but it approximates density information.
• Representatives are data items.
• Experiments show merits
  • in its complexity,
  • in the quality of the groups,
  • with a trade-off of time vs. quality as u is enlarged.
Idea
• Evaluating MC[M] is Ω(n) time because
  • MC[M] = Σi wi d(si, rep[si, M]):
    • Ω(n) time to find rep[si, M] (for all n records si);
    • Ω(n) time since the sum has n terms.
• Approximating MC[M] in constant time gives an O(n) algorithm:
  • consider only the most important terms in the sum;
  • consider what MC[M] intends to measure about the cluster.
Intent of the measurement

MC(M) = Σi wi d(si, rep[si, M])

• The representative of si in M ⊆ S is rep[si, M].
• MC(M) measures the expected discrepancy between an item and its representative.
• In the initial steps, large distances are considered, but they belong to points that would not belong to that representative.
Idea: sum only the distances of those that are near the representative
• Prefer representatives that have u neighbors very near.
• It is more important to find a few true clusters than lots of imaginary ones.
Eliminating the contribution of outliers to a cluster
• Preprocessing:
  • For each si, find the u/2 records that are the u/2 nearest neighbors of si.
  • For each si, find another u/2 records at random.
  • Construct a regular graph of degree u:
    • each si is linked to its u/2 nearest neighbors and to another u/2 random elements.
Efficiently approximating MC[M′]
• When evaluating a new set of medoids M′ (it differs from M in just one medoid):
  1. Bring in all the u nearest neighbors of the k representatives in M′.
  2. Build a set N with them (N may have fewer than k(u+1) records).
  3. Classify and evaluate MC[M′] using only N (if N has fewer than k(u+1) records, complete to k(u+1) terms using the largest distance found in step 2).
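A sketch of steps 1–3 in their simplest form (unit weights, and without the completion to k(u+1) terms; names are mine):

```python
def approx_mc(points, medoids, u, d):
    """Approximate MC[M'] using only N, the union of the u nearest
    non-medoid neighbors of each of the k representatives."""
    N = set()
    for m in medoids:
        others = sorted((s for s in points if s not in medoids),
                        key=lambda s: d(s, m))
        N.update(others[:u])
    return sum(min(d(s, m) for m in medoids) for s in N)

pts = [0, 1, 2, 10, 11, 12, 100]      # 100 is an outlier
d = lambda a, b: abs(a - b)
print(approx_mc(pts, [1, 11], 2, d))  # 4: the outlier contributes nothing
```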
Illustration
• k = 2, u = 7.
• 14 elements are considered, although 2(7+1) = 16 is the quota.
• The red line is accounted for 3 times; the blue lines are not accounted for.
How can we find u neighbors for each record?
• Using the Delaunay Triangulation.
Successful Example
• Recent analysis of a Bank of America loan database:
  • 250 fields per customer,
  • going back to 1914!
  • over nine million records.
• A clustering tool was used to automatically segment customers into groups with many similar categorical attributes.
  • 14 groups were identified; only one could be explained.
• The interesting cluster had 2 properties:
  • 39% of its customers had both business and personal accounts with the bank;
  • this cluster accounted for 27% of the 11% of customers that had been classified by a decision tree as likely respondents to a home-equity loan offer.
First experiments to illustrate robustness
• Reproducing the experiment on cluster quality [Huang97].
  • Huang designed methods to construct the initial clustering.
    • These methods depend on the order of the records in the file.
• Using the soy-bean data set of the Machine Learning community:
  • 47 records (classified),
  • 35 categorical attributes,
  • 4 classes (10, 10, 10, 17).
• A method achieved a “good clustering” if it recovered the original classes with 5 or fewer errors.
• The distance metric is the Hamming distance (maximum value is 35).
Results

Misclassified    k-Means for categorical attributes                     k-Medoids
records        random init.   different-point init.   frequency init.   random init.
0                  12                 20                    29               0
1                   2                 15                    17             101 (MC[M]=206)
2                   8                 26                    50              11 (MC[M]=208)
3                   2                 20                    19              88 (MC[M]=206)
4                   2                 25                    12               0
5                   2                  3                     3               0
more than 5       172 (86%)         111 (55%)              70 (35%)         0 (0%)
The medoids results
• Never makes a mistake with the first 3 classes.
• The record misclassified is always the same.
  • Two records are at the same distance from class 3 as from class 4.
  • Thus the equal MC[M] value but a different number of misclassifications.
The second experiment
• The data set is expanded by inserting randomly generated records.
  • The i-th attribute is obtained by randomly selecting a record in the original file and copying its i-th attribute.
Results

[Charts: percentage of poor clusterings and the largest absolute number of misclassifications vs. the percentage of noise, for k-Means and k-Medoids.]
How costly?
• Typically, k is small and n very large (in data mining and knowledge discovery, k is very close to 10 and no more than 50, but n can easily be above 100,000 and usually 1,000,000).
• Evaluating the objective function on a node requires O(kn) = O(n) time.
How can we find u neighbors for each record?
• Records have categorical attributes:
  • attribute vectors of dimension d.
A trie of common prefixes

Records (alphabet {a, b, c}, dimension d = 5):
  1: aabcc
  2: abacb
  3: abaca
  4: babcc
  5: abaca

[Figure: the trie; records 3 and 5 are identical, so they share the leaf labeled “3,5”.]
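A sketch of such a trie over the slide's five records (nested dicts; the "ids" key marks a leaf):

```python
def build_trie(records):
    """Insert each record along a path of its attribute values; identical
    records end at the same leaf, which stores all their record ids."""
    root = {}
    for rid, rec in enumerate(records, start=1):
        node = root
        for value in rec:
            node = node.setdefault(value, {})
        node.setdefault("ids", []).append(rid)
    return root

records = ["aabcc", "abacb", "abaca", "babcc", "abaca"]
trie = build_trie(records)
node = trie
for value in "abaca":             # walk the common-prefix path
    node = node[value]
print(node["ids"])  # [3, 5]: records 3 and 5 are identical
```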
Finding at least the nearest neighbor for a record si
• Find the nearest neighbor of each record in the trie.
  • Time independent of n (amortized constant time).
• Find all neighbors at Hamming distance less than Λ.
  • Time independent of n but exponential in d.
  • Λ should be 2 or 3.
Complete to u nearby records
• Apply a transitive closure on the links found in the step before:
  • using Breadth-First Search from each node,
  • stopping when u nearby records are found.
• Time is O(u) per record.
• Total preprocessing is O(n) time.
• Idea:
  • If y is nearby x and x is nearby z, then y is nearby z.
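A sketch of the completion step (the adjacency lists stand in for the degree-u proximity graph built during preprocessing; names are mine):

```python
from collections import deque

def complete_to_u(adj, start, u):
    """Breadth-first search from `start` over the proximity links,
    stopping as soon as u nearby records have been collected."""
    seen, order = {start}, []
    q = deque([start])
    while q and len(order) < u:
        node = q.popleft()
        for nb in adj.get(node, []):
            if nb not in seen:
                seen.add(nb)
                order.append(nb)
                q.append(nb)
                if len(order) == u:
                    break
    return order

# Nearby is treated transitively: 0-1 and 1-2 are linked, so 2 counts as nearby 0.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(complete_to_u(adj, 0, 3))  # [1, 2, 3]
```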
First experiment
• Confirming that the algorithm has linear complexity.
• Nursery Data Set (also in the UCI Repository):
  • 12,960 records,
  • 8 categorical dimensions,
  • clustering into k = 5 groups.
• Ten random subsets:
  • 100, 200, 500, 1000, 2000, 4000, 6000, 8000, 10000 and 12000 records.
Results

[Chart: observed average CPU time vs. n = number of records.]
Second experiment
• For the soy-bean data, there is a set of medoids that results in no misclassification,
  • but suboptimal MC[M].
• Use these as virtual centers:
  • find the radius of each class;
  • generate records by choosing one center and then a number between 1 and the radius of the class (uniformly).
• Data sets of arbitrary size can be generated that look like the soy-bean data set.
Results

File    k-Means for               k-Medoids (random init.,       k-Medoids (random init.,
size    categorical attributes    frequency init., Tabu-style    with proximity graph)
                                  interchange)
        recovery  no recovery     recovery  no recovery          recovery  no recovery
100       48%        52%            100%        0%                 97%         3%
200       66%        34%             94%        6%                 92%         8%
Summary
• Clustering of spatial data is performed using, as the similarity criterion between two sites, some measure of the distance in space between them.
• There are many ways to formalize a criterion for the quality of a clustering; however, almost all of them are infeasible to solve to optimality for large data sets.
  • The medoid approach offers several advantages over others.
• We have presented a revision of local hill-climbing for use with large data sets that
  • is computationally more efficient than previous approaches,
  • does not compromise the quality of the clustering.
Food for thought
• “Most conventional statistical approaches are limited to datamining. Statisticians historically placed an emphasis on managing the probability of type one error. That is, they have sought to control the likelihood of accepting a proposition when it is false. This is appropriate for hypothesis testing, where there is a single hypothesis under consideration. However, it is often not appropriate for datamining …; it can be undesirable … to reject a hypothesis that is (partially) true.”
• Data mining is not hypothesis testing but hypothesis generation.