Clustering and illustration with spatial and categorical data


  • 1

Clustering and illustration with spatial and categorical data

© Vladimir Estivill-Castro, School of Computing and Information Technology

© Vladimir Estivill-Castro 2

Knowledge Discovery
- The nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in large data sets.
- Data:
  - The geo-referenced layers.
  - The census data.
- Information:
  - The average population per administrative region.
  - Comparison of mortality rates by ethnic background.
- Knowledge:
  - The patterns of growth of population densities and valid explanations for them.

• © Vladimir Estivill-Castro 3

Mining Different Kinds of Knowledge
- Characterization: Generalize, summarize, and possibly contrast data characteristics, e.g., dry vs. wet regions.
- Association: Rules like "inside(x, city) → near(x, highway)".
- Classification: Classify data based on the values of a classifying attribute, e.g., classify countries based on climate.
- Clustering: Cluster data to form new classes, e.g., cluster houses to find distribution patterns.
- Trend and deviation analysis: Find and characterize evolution trends, sequential patterns, similar sequences, and deviation data, e.g., housing market analysis.
- Pattern-directed analysis: Find and characterize user-specified patterns in large databases, e.g., volcanoes on Mars.

© Vladimir Estivill-Castro 4

Visual illustration (2d)

• © Vladimir Estivill-Castro 5

Clustering
- The task of segmenting a heterogeneous population into a number of more homogeneous subgroups or clusters [Berry & Linoff 97].
  - The top-down view: we partition into subgroups.
- The task of finding groups in a data set by some natural criteria of similarity [Duda 73].
  - The bottom-up view: we agglomerate together those that are nearby.

© Vladimir Estivill-Castro 6

Spatial data mining
- The discovery of interesting, implicit knowledge in spatial databases.
  - An important step for understanding the use of the spatial data.
- It is an exploratory (rather than confirmatory) type of analysis for spatial association rules in GIS.
  - Look out for patterns or anomalies with reduced explicit direction on
    - where to look,
    - what to look for,
    - when to look.
- Foster the philosophy: "Let the data speak for itself".

• © Vladimir Estivill-Castro 7

Spatial Data Mining
- Not to be confused with building the data sets
  - Remote sensing, classification, supervised learning.
  - This is collecting the data.
- Set out not just to work in geographic space, but to operate simultaneously in the two other principal domains of Geographic Information Systems:
  - temporal space
  - attribute space.

© Vladimir Estivill-Castro 8

Spatial association rule
- A rule indicating associations and relationships among a set of spatial and possibly non-spatial predicates.
- Examples:
  - Most big Canadian cities are close to the USA border.
  - Most big Australian cities are on the East coast.
- Note: From implicit to explicit knowledge.

• © Vladimir Estivill-Castro 9

Generalization-based Spatial Data Mining (J. Han)
- Non-spatial dominant generalization
- Spatial-dominant generalization

ZOOM OUT → LOSE DETAIL → GAIN KNOWLEDGE

© Vladimir Estivill-Castro 10

Non-spatial dominant generalization

[Figure: a layer of parcels linked to attribute data (e.g., parcel 28, Donald, $125,000), generalized to a map with a small number of regions described at a high level by house-price ranges: 0-50,000; 50,000-100,000; 100,000-500,000; above 500,000.]

Last step: find correlations between data layers.

• © Vladimir Estivill-Castro 11

Spatial-dominant generalization
- Data generalized by spatial references
- Spatial data hierarchies (geographic or administrative regions)
  - supplied by users/experts
  - data structures (quadtrees, R-trees)
- Clustering
- Find a description as a high-level concept (as a set of predicates)

© Vladimir Estivill-Castro 12

Spatial dominant generalization

[Figure: layer of parcels → cluster → find high-level description, with clusters labeled Affordable and Expensive.]

"Expensive single houses in the Vancouver urban area are along the beach and around two city parks."

• © Vladimir Estivill-Castro 13

Clustering
- "Clustering is the task of identifying groups in a data set by some natural criteria of similarity."
  - Classic reference on clustering: Duda, R.O. and Hart, P.E. "Pattern Classification and Scene Analysis", John Wiley & Sons, New York, US (1973).
- For geo-referenced data, the most obvious measure of similarity is distance (Euclidean distance).
  - Similarity is relatively well-defined for geo-referenced data.
- For categorical data, the most obvious measure of similarity is the Hamming distance (see the sketch below).
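As a point of reference, here is a minimal sketch of the two dissimilarity measures just mentioned; the function names and the tuple representation of records are illustrative, not taken from the slides.

```python
import math

def euclidean_distance(p, q):
    """Distance between two geo-referenced points given as (x, y) tuples."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def hamming_distance(r, s):
    """Number of attributes on which two categorical records disagree."""
    assert len(r) == len(s)
    return sum(1 for a, b in zip(r, s) if a != b)

# Example: two categorical records of dimension 5 disagree on 2 attributes
print(hamming_distance(("a", "b", "a", "c", "a"), ("a", "b", "c", "c", "b")))  # 2
```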

© Vladimir Estivill-Castro 14

Distance-Based Spatial Clustering Analysis (Related Work)
- Statistical approaches: scan data frequently, iterative optimization, hierarchical clustering, etc.
- CLARANS (Ng & Han '94): randomized search (sampling) + PAM (a distance-based clustering algorithm).
- DBSCAN (Ester et al. '96): density-based clustering using spatial data structures (R*-tree).
- BIRCH (Zhang et al. '96): Balanced iterative reducing and clustering using hierarchies.
  - Focus on densely occupied portions of the data space.
  - Measurement reflects the "natural" closeness of points.
  - A height-balanced tree (CF-tree) is used for clustering.
- Describe aggregate proximity relationships (Knorr & Ng '96).

• © Vladimir Estivill-Castro 15

Why clustering?
- It provides a means of generalization of the spatial component of the data associated with a GIS.
- This generalization is complementary to the techniques for generalization used in
  - inductive machine learning (symbolic learning):
    - generalizes from attribute-vector instances to logic rules.
  - data mining:
    - generalizes the content of a database to some rule about the instances of the database.

© Vladimir Estivill-Castro 16

Illustration

• © Vladimir Estivill-Castro 17

Illustration

© Vladimir Estivill-Castro 18

Illustration

• © Vladimir Estivill-Castro 19

Illustration

© Vladimir Estivill-Castro 20

Illustration

• © Vladimir Estivill-Castro 21

© Vladimir Estivill-Castro 22

Clustering
- It is a central operation inside a knowledge discovery query.
- Clustering may need to be performed as a result of a knowledge discovery set on a new slice of the data:
  - using different attributes of the attribute-oriented data
  - using a new selection of the attribute-oriented data
  - using different layers of the spatial data
  - using a new selection of layers of the spatial data
- It may not be possible to store a pre-computed clustering.
- It may not be possible to use indexing data structures to support the clustering algorithms
  - unless such an index can be computed fast.

• © Vladimir Estivill-Castro 23

Clustering
- Based on optimizing some measure of the cohesion within clusters vs. the dissimilarity between clusters.
- Non-hierarchical clustering (bottom-up)
  - Concentrates on measuring the cohesion: which data points belong together.
  - Alternative measures of cluster cohesion result in different clusters.
- Hierarchical clustering (top-down)
  - Finds the biggest difference first, which creates two categories, then recursively clusters those categories.
  - Usually O(n²); thus, impractical for KDDM.

© Vladimir Estivill-Castro 24

Approaches to non-hierarchical clustering
- Group criterion
  - All pairs of points inside a cluster should be as similar as possible.
- Representative-based approaches
  - The total distance of the points in a cluster to their typical representative should be as small as possible.
  - Center clusters:
    - The representative can be an average outside the current set of observations.
  - Median (medoid) clusters:
    - The representative has to be one of the observations.
- Density-based [Ester, Kriegel, Sander and Xu, KDD-96]

• © Vladimir Estivill-Castro 25

Group criterion

© Vladimir Estivill-Castro 26

Group criterion
- Minimize, among all partitions P = {P_1, ..., P_k} of the n data points into k groups, the quality measure (see the sketch below)

  GC(P) = Σ_{i=1..k} Σ_{s_u, s_v ∈ P_i} w_u w_v d(s_u, s_v)
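A minimal sketch of how GC(P) could be evaluated for a given partition; the dictionary-based representation of the partition and the default uniform weights are assumptions for illustration.

```python
from itertools import combinations

def group_criterion(partition, d, w=None):
    """GC(P): sum over clusters of the weighted pairwise distances inside each cluster.

    partition: dict mapping a cluster id to a list of points
    d:         distance function between two points
    w:         optional dict of per-point weights (defaults to 1.0)
    """
    total = 0.0
    for cluster in partition.values():
        for s_u, s_v in combinations(cluster, 2):   # each unordered pair counted once
            w_u = w.get(s_u, 1.0) if w else 1.0
            w_v = w.get(s_v, 1.0) if w else 1.0
            total += w_u * w_v * d(s_u, s_v)
    return total
```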

• © Vladimir Estivill-Castro 27

Center criterion

© Vladimir Estivill-Castro 28

Center criterion
- Minimize, among all sets C of k points, the quality measure

  CC(C) = Σ_{i=1..n} w_i d(s_i, rep[s_i, C])²

  where the representative of s_i in C is rep[s_i, C] (the point of C nearest to s_i).

• © Vladimir Estivill-Castro 29

Clustering in KDD
- Scaling k-Means
  - Bradley, Fayyad and Reina, KDD-98
- Scaling Expectation Maximization
  - Bradley, Fayyad and Reina, NIPS-98
- Literature
  - k-Means is sensitive to initial points
  - EM depends on the structural model

© Vladimir Estivill-Castro 30

k-means
- Originally, for numerical data
- Algorithm (see the sketch below):
  - choose a random set of k representatives
  - repeat until the representatives do not change:
    - assign each point to its nearest representative
    - compute the average (mean) of each class and make these the new set of representatives
- Complexity is Θ(tmkn) where
  - t is the number of hill-climbing steps (loop iterations)
  - m is the number of attributes
  - k is the number of clusters
  - n is the number of records (data points)
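A minimal sketch of the k-means loop just described, using NumPy; the convergence test and the initialization by sampling k records are the standard choices, assumed here rather than taken from the slides.

```python
import numpy as np

def k_means(X, k, rng=np.random.default_rng(0), max_steps=100):
    """X: (n, m) array of numerical records. Returns (labels, centers)."""
    n, m = X.shape
    centers = X[rng.choice(n, size=k, replace=False)]   # k random representatives
    for _ in range(max_steps):                           # t hill-climbing steps
        # assign each point to its nearest representative
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute the mean of each class (keep the old center if a class is empty)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):            # representatives unchanged
            break
        centers = new_centers
    return labels, centers
```

Each iteration of the loop costs on the order of mkn operations, which matches the Θ(tmkn) bound quoted above.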

• © Vladimir Estivill-Castro 31

k-means

© Vladimir Estivill-Castro 32

k-means
- When the number n of records is large, the complexity is Θ(tmkn) = Θ(n)
  - IT IS LINEAR!
- The disadvantages:
  - for numerical data
    - fixed with k-modes [Huang 97a, Huang 97b]
  - it is statistically biased
  - it is very sensitive to the presence of noise and outliers as well as to the initial random clustering
  - representatives may not be valid records

• © Vladimir Estivill-Castro 33

Bias (illustration)

© Vladimir Estivill-Castro 34

The quality of the clustering: centroid-based

• © Vladimir Estivill-Castro 35

Clustering desiderata (Bradley, Fayyad and Reina)
- Linear time
- "Anytime algorithms": best answer ready
- Stoppable, resumable
- Work within the confines of the given RAM
- Can use the data in the order it is present in the DB
- Incremental: can incorporate additional data
- Operate on a forward-only cursor over a view

© Vladimir Estivill-Castro 36

Median (medoid) criterion

• © Vladimir Estivill-Castro 37

Median (medoid) criterion
- Minimize, among all sets M ⊆ S of k points, the quality measure (see the sketch below)

  MC(M) = Σ_{i=1..n} w_i d(s_i, rep[s_i, M])

  where the representative of s_i in M ⊆ S is rep[s_i, M] (the medoid nearest to s_i).
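A minimal sketch of evaluating MC(M) exactly: each record is charged the distance to its nearest medoid. The list-of-records data layout and the default uniform weights are assumptions for illustration.

```python
def medoid_criterion(S, M, d, w=None):
    """MC(M): sum over all records of the (weighted) distance to the nearest medoid.

    S: list of records, M: list of medoids (a subset of S),
    d: distance function, w: optional list of per-record weights.
    """
    total = 0.0
    for i, s in enumerate(S):
        nearest = min(d(s, m) for m in M)        # distance to rep[s_i, M]
        total += (w[i] if w else 1.0) * nearest
    return total
```

One call to this function costs Θ(kn) = Θ(n) time for fixed k, the cost quoted on the "Medoids criterion" slide below.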

© Vladimir Estivill-Castro 38

Medoids criterion
- Benefits:
  - The search space has the discrete structure of a graph.
    - It can be searched by hill-climbers efficiently.
    - The search can guarantee local optimality.
  - Evaluating MC(M) on a set of medoids requires linear time, that is, Θ(n) time.
  - Robust to outliers.
  - There is a natural meaning to the notion of representative.
  - If MC(M) is good, GC(P) is usually not bad.
- It is equivalent to the p-median problem (NP-complete).

• © Vladimir Estivill-Castro 39

Environment for searching a (very large) graph
- A graph is a structure with
  - nodes and
  - arcs.
- A discrete search space with the structure of a graph is a graph where each node has a quality criterion Q.
  - Q is also called the objective function
  - Q : set of nodes → ℝ

© Vladimir Estivill-Castro 40

How is medoid-based clustering a search problem in a graph?
- The nodes of the graph are the subsets M ⊆ S of size k.
- Two nodes (subsets) are adjacent if they differ in exactly one site (point).
  - That is, the set of medoids M1 is adjacent to the set of medoids M2 if and only if |M1 ∩ M2| = k-1 (see the sketch below).
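A minimal sketch of this neighborhood structure: given a current set of medoids, every neighbor is obtained by swapping one medoid for one non-medoid. The generator below is illustrative, not from the slides.

```python
def medoid_neighbors(S, M):
    """Yield every set of medoids adjacent to M in the search graph.

    Each neighbor replaces exactly one medoid m in M by one non-medoid s in S,
    so |M ∩ neighbor| = k-1. There are k * (n - k) neighbors in total.
    """
    M = set(M)
    for m in M:
        for s in S:
            if s not in M:
                yield (M - {m}) | {s}

# Example: with n = 5 sites and k = 2 medoids there are 2 * 3 = 6 neighbors.
sites = ["s1", "s2", "s3", "s4", "s5"]
print(sum(1 for _ in medoid_neighbors(sites, {"s1", "s2"})))  # 6
```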

• © Vladimir Estivill-Castro 41

Versions of hill-climbing
- Global hill-climbing
  - Explore all the neighbors of a node.
- Randomized hill-climbing
  - Explore a random sample of size m of the neighbors.
- Restricted hill-climbing
  - Explore a neighborhood of the current node defined by those interchanges with sites no farther away than a parameter r from the medoid they would be swapped with.
- Local hill-climbing
  - Move to a neighbor as soon as one with an improvement is found, but continue the exploration with the next potential swap (do not evaluate a site for a swap again until all others have had a chance [Tabu search]).

© Vladimir Estivill-Castro 42

A comparison (CPU seconds)

  n    | Local Hill Climbing | Randomized Hill Climbing | Global Hill Climbing
  100  |    10.1       2.0   |      44.2        4.2     |      61.0       8.1
  200  |    41.2       5.3   |     204.3       32.4     |     350.3      60.9
  500  |   338.2      10.3   |    1435.3      123.8     |    2318.7     240.2
  1000 |   939.2      42.3   |

• © Vladimir Estivill-Castro 43

A comparison (CPU seconds)

© Vladimir Estivill-Castro 44

[Figure: the n data points split into the k current representatives and the remaining n-k points.]

• © Vladimir Estivill-Castro 45

[Figure: a point from the n-k non-representatives attempts to swap with one of the k current representatives.]

© Vladimir Estivill-Castro 46

[Figure: if an improvement is found, the swap is performed and the point joins the k representatives.]

• © Vladimir Estivill-Castro 47

[Figure: if an improvement is NOT found, the point returns to the n-k non-representatives.]

© Vladimir Estivill-Castro 48

Local Hill-climbing
- A combination of hill-climbing and Tabu search (see the sketch below)
- For each data point s_i (not a medoid):
  - For each point m_j selected as a medoid:
    - Explore whether the swap s_i ↔ m_j results in an improvement.
  - If an improvement: take the improvement (swap s_i with the best m_j).
  - If no improvement: place s_i at the end of a queue.
- Stop when a round does not cause an improvement.
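A minimal sketch of this local hill-climber. The queue stands in for the Tabu circular list, and the details (first-improvement per point, best medoid to evict, the inlined medoid criterion) are my reading of the slide rather than a verbatim transcription.

```python
from collections import deque

def local_hill_climbing(S, d, seed_medoids):
    """Improve a set of medoids by first-improvement swaps with a Tabu-style queue."""

    def mc(M):
        # Medoid criterion MC(M): total distance of every record to its nearest medoid.
        return sum(min(d(s, m) for m in M) for s in S)

    M = list(seed_medoids)
    best = mc(M)
    queue = deque(s for s in S if s not in M)      # non-medoid sites, in circular order
    improved_in_round = True
    while improved_in_round:
        improved_in_round = False
        for _ in range(len(queue)):
            s = queue.popleft()
            best_j, best_val = None, best
            for j in range(len(M)):                 # try swapping s with every medoid m_j
                candidate = M[:j] + [s] + M[j + 1:]
                val = mc(candidate)
                if val < best_val:
                    best_j, best_val = j, val
            if best_j is not None:                  # take the best improving swap for s
                queue.append(M[best_j])             # the evicted medoid becomes a candidate again
                M[best_j] = s
                best = best_val
                improved_in_round = True
            else:
                queue.append(s)                     # no improvement: back of the queue
    return M, best
```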

• © Vladimir Estivill-Castro 49

The O(n²) complexity of local hill-climbing
- There are some O(n) methods from statistics
  - k-means
    - a hill-climber
- They have disadvantages:
  - convergence
  - outliers
  - sensitivity to the initial starting point
- Trade-off of complexity vs. quality

© Vladimir Estivill-Castro 50

The quality of the clustering: centroid-based

• © Vladimir Estivill-Castro 51

The quality of the clustering: medoid-based

© Vladimir Estivill-Castro 52

The behavior of the discriminant

• © Vladimir Estivill-Castro 53

The behavior of the Tabu circular list

© Vladimir Estivill-Castro 54

Using the Delaunay Triangulation
- The algorithm requires O(n log n) time
- The number k of clusters can be predicted

• © Vladimir Estivill-Castro 55

Robust clustering
- For large sets of categorical data: O(n)
- For large sets of spatial data: O(n log n)
- Distance-based clustering
  - but it is approximating density information
- Representatives are data items
- Experiments show merits
  - in its complexity
  - in the quality of the groups
  - trade-off in time vs. quality as u is enlarged

© Vladimir Estivill-Castro 56

Idea
- Evaluating MC[M] takes Ω(n) time because MC[M] = Σ_{i=1..n} w_i d(s_i, rep[s_i, M]):
  - Ω(n) time to find rep[s_i, M] (for all n records s_i)
  - Ω(n) time since the sum has n terms
- Approximating MC[M] in constant time gives an O(n) algorithm:
  - consider only the most important terms in the sum
  - ask what it is that MC[M] intends to measure about the cluster

• © Vladimir Estivill-Castro 57

Intent of the measurement

  MC(M) = Σ_{i=1..n} w_i d(s_i, rep[s_i, M])

  where the representative of s_i in M ⊆ S is rep[s_i, M].

- The expected discrepancy between an item and its representative.
- In the initial steps, large distances are considered, but they are of points that would not belong to that representative.

© Vladimir Estivill-Castro 58

Idea: Sum only the distances of those that are near the representative
- Good representatives are those that have u neighbors very near.
- It is more important to find a few true clusters than lots of imaginary ones.

• © Vladimir Estivill-Castro 59

Eliminating the contribution of outliers to a cluster
- Preprocessing (see the sketch below):
  - For each s_i, we find the u/2 records that are the u/2 nearest neighbors of s_i.
  - For each s_i, find another u/2 records at random.
  - We construct a regular graph of degree u:
    - each s_i is linked to its u/2 nearest neighbors and another u/2 random elements.
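A minimal sketch of this preprocessing step: for every record it collects u neighbors (u/2 nearest plus u/2 random). A brute-force nearest-neighbor scan is used purely for clarity, where the slides would use the trie (categorical data) or the Delaunay triangulation (spatial data).

```python
import random

def proximity_graph(S, u, d, rng=random.Random(0)):
    """Return {index: list of u neighbor indices} for the records in S."""
    n = len(S)
    graph = {}
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # u/2 nearest neighbors of s_i (brute force here, for illustration only)
        nearest = sorted(others, key=lambda j: d(S[i], S[j]))[: u // 2]
        # another u/2 records chosen at random from the rest
        rest = [j for j in others if j not in nearest]
        randoms = rng.sample(rest, min(u // 2, len(rest)))
        graph[i] = nearest + randoms
    return graph
```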

© Vladimir Estivill-Castro 60

Efficiently approximating MC[M']
- When evaluating a new set of medoids M' (it differs from M in just one medoid), see the sketch below:
  1. bring in all the u nearest neighbors of the k representatives in M';
  2. build a set N with them (N may have fewer than k(u+1) records);
  3. classify and evaluate MC[M'] using only N (if N has fewer than k(u+1) records, complete to k(u+1) terms using the largest distance found in step 2).
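A minimal sketch of this approximate evaluation, reusing the proximity-graph dictionary sketched on the previous slide; the exact padding rule for a short N and the data layout are my assumptions for illustration.

```python
def approx_medoid_criterion(S, medoid_idx, graph, d):
    """Approximate MC[M'] using only the u neighbors of each of the k medoids.

    S:          list of records
    medoid_idx: indices of the k medoids in S
    graph:      {index: list of u neighbor indices} from the preprocessing step
    d:          distance function
    """
    k = len(medoid_idx)
    u = max(len(graph[m]) for m in medoid_idx)
    # steps 1 and 2: the set N = the medoids plus their u neighbors
    N = set(medoid_idx)
    for m in medoid_idx:
        N.update(graph[m])
    # step 3: classify only the records in N against the medoids
    terms = [min(d(S[i], S[m]) for m in medoid_idx) for i in N]
    # pad to k(u+1) terms with the largest distance seen, if N is short
    quota = k * (u + 1)
    if len(terms) < quota:
        terms += [max(terms)] * (quota - len(terms))
    return sum(terms)
```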

• © Vladimir Estivill-Castro 61

Illustration

[Figure: k = 2, u = 7. Only 14 elements are considered, although 2(7+1) = 16 is the quota. The red line is counted 3 times; the blue lines are not counted.]

© Vladimir Estivill-Castro 62

How can we find u neighbors for each record?
- Using the Delaunay Triangulation

• © Vladimir Estivill-Castro 63

Successful Example
- Recent analysis of a Bank of America loan database
  - 250 fields per customer
  - back to 1914!
  - over nine million records
- A clustering tool was used to automatically segment customers into groups with many similar categorical attributes.
  - 14 groups identified; only one could be explained.
  - The interesting cluster had 2 properties:
    - 39% of customers had business and personal accounts with the bank
    - this cluster accounted for 27% of the 11% of customers that had been classified by a decision tree as likely respondents to a home equity loan offer

© Vladimir Estivill-Castro 64

First experiments to illustrate Robustness
- Reproducing the experiment on cluster quality [Huang 97]
  - Huang designed methods to construct the initial clustering.
    - The methods depend on the order of records in the file.
- Using the soy-bean data of the Machine Learning community
  - 47 records (classified)
  - 35 categorical attributes
  - 4 classes (10, 10, 10, 17)
- A method achieved a "good clustering" if it recovered the original classes with 5 or fewer errors.
- The distance metric is the Hamming distance (maximum value is 35).

• © Vladimir Estivill-Castro 65

Results

  Misclassified | k-Means for categorical attributes                       | k-Medoids
  records       | random init. | different-point init. | frequency init.   | random init.
  0             |      12      |          20           |       29          |   0
  1             |       2      |          15           |       17          | 101 (MC[M]=206)
  2             |       8      |          26           |       50          |  11 (MC[M]=208)
  3             |       2      |          20           |       19          |  88 (MC[M]=206)
  4             |       2      |          25           |       12          |   0
  5             |       2      |           3           |        3          |   0
  more than 5   |  172 (86%)   |      111 (55%)        |    70 (35%)       |   0 (0%)

© Vladimir Estivill-Castro 66

The medoids results
- Never make a mistake with the first 3 classes.
- The record misclassified is always the same.
- Two records are at the same distance from class 3 as from class 4;
  - thus the equal MC[M] value but a different number of misclassifications.

• © Vladimir Estivill-Castro 67

The second experiment
- The data set is expanded by inserting randomly generated records.
  - The i-th attribute is obtained by randomly selecting a record in the original file and copying its i-th attribute.

© Vladimir Estivill-Castro 68

Results

[Figure: two plots against the % of noise, comparing k-Means and k-Medoids; one shows the % of poor clusterings, the other the largest absolute number of misclassifications.]

• © Vladimir Estivill-Castro 69

How costly?
- Typically, k is small and n very large (in data mining and knowledge discovery, k is very close to 10 and no more than 50, but n can easily be above 100 and usually 1000).
- Evaluating the objective function on a node requires O(kn) = O(n) time.

© Vladimir Estivill-Castro 70

How can we find u neighbors for each record?
- Records have categorical attributes
  - attribute-vectors of dimension d

• © Vladimir Estivill-Castro 71

Idea: Sum only the distances of those that are near the representative
- Good representatives are those that have u neighbors very near.
- It is more important to find a few true clusters than lots of imaginary ones.

© Vladimir Estivill-Castro 72

A trie of common prefixes

[Figure: a trie over the alphabet {a, b, c} for five records of dimension d = 5:
record 1 = aabcc, record 2 = abacb, record 3 = abaca, record 4 = babcc, record 5 = abaca.
Records sharing a prefix share a path; records 3 and 5 are identical and end at the same leaf.]
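A minimal sketch of building such a trie of common prefixes over categorical records; the nested-dictionary representation and the leaf lists of record ids are my implementation choices, not the original data structure's layout.

```python
def build_trie(records):
    """records: dict {record_id: string of categorical symbols}. Returns a nested-dict trie."""
    trie = {}
    for rid, symbols in records.items():
        node = trie
        for symbol in symbols:
            node = node.setdefault(symbol, {})    # descend, creating nodes on demand
        node.setdefault("_ids", []).append(rid)   # identical records share a single leaf
    return trie

# The five records shown in the figure above
records = {1: "aabcc", 2: "abacb", 3: "abaca", 4: "babcc", 5: "abaca"}
trie = build_trie(records)
print(trie["a"]["b"]["a"]["c"]["a"]["_ids"])   # [3, 5]: records 3 and 5 share one leaf
```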

• © Vladimir Estivill-Castro 73

Finding at least the nearest neighbor for a record s_i
- Find the nearest neighbor of each record in the trie.
  - Time independent of n (amortized constant time).
- Find all neighbors at Hamming distance less than Λ.
  - Time independent of n but exponential in d.
  - Λ should be 2 or 3.

© Vladimir Estivill-Castro 74

Complete to u nearby records
- Apply a transitive closure on the links found in the step before (see the sketch below):
  - using Breadth-First Search from each node
  - and stopping when u nearby records are found.
- Time is O(u) per record.
- Total preprocessing is O(n) time.
- Idea:
  - If y is nearby x and x is nearby z, then y is nearby z.
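A minimal sketch of this completion step: starting from the few links produced by the trie search, a breadth-first walk collects up to u nearby records for every node. The adjacency-dictionary input mirrors the earlier proximity-graph sketch and is an assumption for illustration.

```python
from collections import deque

def complete_to_u_neighbors(links, u):
    """links: {record_id: list of record_ids found close by the trie search}.
    Returns {record_id: list of up to u nearby record_ids}, computed by BFS
    (a transitive closure truncated as soon as u records have been collected)."""
    completed = {}
    for start in links:
        seen = {start}
        nearby = []
        frontier = deque(links.get(start, ()))
        while frontier and len(nearby) < u:
            x = frontier.popleft()
            if x in seen:
                continue
            seen.add(x)
            nearby.append(x)                      # a neighbor of a neighbor counts as nearby
            frontier.extend(links.get(x, ()))
        completed[start] = nearby
    return completed

# Example: record 1 is linked to 2, and 2 to 3, so 3 becomes nearby to 1 as well.
print(complete_to_u_neighbors({1: [2], 2: [3], 3: []}, u=2))  # {1: [2, 3], 2: [3], 3: []}
```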

• © Vladimir Estivill-Castro 75

First experiment
- Confirming that the algorithm has linear complexity
- Nursery Data Set (also UCI Repository)
  - 12,960 records
  - 8 categorical dimensions
  - clustering into k = 5 groups
- Ten random subsets
  - 100, 200, 500, 1000, 2000, 4000, 6000, 8000, 10000 and 12000

© Vladimir Estivill-Castro 76

Results

[Figure: observed average CPU time against n, the number of records.]

• © Vladimir Estivill-Castro 77

Second experiment
- For the Soy-bean data there is a set of medoids that results in no misclassification
  - but suboptimal MC[M].
- Use these as virtual centers.
- Find the radius of each class.
- Generate records by choosing one center and then a number between 1 and the radius of the class (uniformly).
- Data sets of arbitrary size can be generated, but they look like the Soy-bean data set.

© Vladimir Estivill-Castro 78

Results

  File | k-Means for categorical attr. | k-Medoids, random init.,  | k-Medoids, random init.,
  size | (frequency initialization)    | interchange as Tabu       | with proximity graph
       | recovery | no recovery        | recovery | no recovery    | recovery | no recovery
  100  |   48%    |    52%             |  100%    |     0%         |   97%    |     3%
  200  |   66%    |    34%             |   94%    |     6%         |   92%    |     8%

• © Vladimir Estivill-Castro 79

Summary
- Clustering of spatial data is performed using, as the similarity criterion between two sites, some measure of the distance in space between the two sites.
- There are many ways to formalize a criterion for the quality of a clustering; however, almost all of them are infeasible to solve to optimality for large data sets.
  - The medoid approach offers several advantages over others.
- We have presented a revision of local hill-climbing for use with large data sets that
  - is computationally more efficient than previous approaches
  - does not compromise the quality of the clustering.

© Vladimir Estivill-Castro 80

Food for thought
- "Most conventional statistical approaches are limited to datamining. Statisticians historically placed an emphasis on managing the probability of type one error. That is, they have sought to control the likelihood of accepting a proposition when it is false. This is appropriate for hypothesis testing, where there is a single hypothesis under consideration. However, it is often not appropriate for datamining ...; it can be undesirable ... to reject a hypothesis that is (partially) true."
- Data Mining is not hypothesis testing but hypothesis generation.