
Unit-03: Cluster Analysis

    GAURAV JAISWAL, Dept. of CS, AITM

CLUSTER ANALYSIS

The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. Clustering is also called data segmentation in some applications because clustering partitions large data sets into groups according to their similarity. Clustering can also be used for outlier detection, where outliers (values that are far away from any cluster) may be more interesting than common cases. As a data mining function, cluster analysis can be used as a stand-alone tool to gain insight into the distribution of data, to observe the characteristics of each cluster, and to focus on a particular set of clusters for further analysis. Alternatively, it may serve as a preprocessing step for other algorithms, such as characterization, attribute subset selection, and classification, which would then operate on the detected clusters and the selected attributes or features.

    The following are typical requirements of clustering in data mining:

Scalability: Many clustering algorithms work well on small data sets containing fewer than several hundred data objects; however, a large database may contain millions of objects. Clustering on a sample of a given large data set may lead to biased results. Highly scalable clustering algorithms are needed.

    Ability to deal with different types of attributes: Many algorithms are designed to cluster

    interval-based (numerical) data. However, applications may require clustering other types of

    data, such as binary, categorical (nominal), and ordinal data, or mixtures of these data types.

    Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters

based on Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters with similar size and density. However, a cluster

    could be of any shape. It is important to develop algorithms that can detect clusters of

    arbitrary shape.

Minimal requirements for domain knowledge to determine input parameters: Many clustering algorithms require users to input certain parameters in cluster analysis (such as the number of desired clusters). The clustering results can be quite sensitive to input parameters. Parameters are often difficult to determine, especially for data sets containing high-dimensional objects. This not only burdens users, but it also makes the quality of clustering difficult to control.

    Ability to deal with noisy data: Most real-world databases contain outliers or missing,

    unknown, or erroneous data. Some clustering algorithms are sensitive to such data and may

    lead to clusters of poor quality.

    Incremental clustering and insensitivity to the order of input records : Some clustering

    algorithms cannot incorporate newly inserted data (i.e., database updates) into existing

    clustering structures and, instead, must determine a new clustering from scratch. Some

clustering algorithms are sensitive to the order of input data.

High dimensionality: A database or a data warehouse can contain several dimensions or

    attributes. Many clustering algorithms are good at handling low-dimensional data, involving

    only two to three dimensions.

    Constraint-based clustering: Real-world applications may need to perform clustering

under various kinds of constraints. Suppose that your job is to choose the locations for a given number of new automatic banking machines (ATMs) in a city. To decide upon this, you may

cluster households while considering constraints such as the city's rivers and highway


networks, and the type and number of customers per cluster. A challenging task is to find groups of data with good clustering behavior that satisfy specified constraints.

    Interpretability and usability: Users expect clustering results to be interpretable,

    comprehensible, and usable. That is, clustering may need to be tied to specific semantic

    interpretations and applications.

DATA TYPES IN CLUSTER ANALYSIS

    Main memory-based clustering algorithms typically operate on either of the following two data

    structures.

Data matrix (or object-by-variable structure): This represents n objects, such as persons, with p variables (also called measurements or attributes), such as age, height, weight, gender, and so on. The structure is in the form of a relational table, or n-by-p matrix (n objects × p variables):

$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

Dissimilarity matrix (or object-by-object structure): This stores a collection of proximities that are available for all pairs of n objects. It is often represented by an n-by-n table:

$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$

where d(i, j) is the measured difference or dissimilarity between objects i and j. In general, d(i, j) is a nonnegative number that is close to 0 when objects i and j are highly similar or near each other, and becomes larger the more they differ. Since d(i, j) = d(j, i) and d(i, i) = 0, only the lower triangle of the matrix above needs to be stored.

    1. Interval-Scaled Variables

Interval-scaled variables are continuous measurements on a roughly linear scale. Typical examples include weight and height, latitude and longitude coordinates (e.g., when clustering houses), and weather temperature. The measurement unit used can affect the clustering analysis. To standardize measurements, one choice is to convert the original measurements to unitless variables. Given measurements for a variable f, this can be performed as follows.

Calculate the mean absolute deviation, s_f:

$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$


where x_1f, ..., x_nf are n measurements of f, and m_f is the mean value of f, that is, m_f = (1/n)(x_1f + x_2f + ... + x_nf).

Calculate the standardized measurement, or z-score:

$$z_{if} = \frac{x_{if} - m_f}{s_f}$$

The mean absolute deviation, s_f, is more robust to outliers than the standard deviation, σ_f. When computing the mean absolute deviation, the deviations from the mean (i.e., |x_if - m_f|) are not squared; hence, the effect of outliers is somewhat reduced.
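The following short Python sketch (not part of the original notes) illustrates this standardization; the sample measurements are made up for illustration.

```python
# Illustrative sketch: standardize one variable f using the mean absolute deviation s_f.
def standardize(values):
    n = len(values)
    m_f = sum(values) / n                              # mean m_f
    s_f = sum(abs(x - m_f) for x in values) / n        # mean absolute deviation s_f
    return [(x - m_f) / s_f for x in values]           # z-scores z_if

heights_cm = [150, 160, 170, 180, 230]                 # hypothetical data; 230 is an outlier
print(standardize(heights_cm))
```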

    After standardization, or without standardization in certain applications, the dissimilarity (or

    similarity) between the objects described by interval-scaled variables is typically computed based on

    the distance between each pair of objects. The most popular distance measure is Euclidean distance,

which is defined as

$$d(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{in} - x_{jn})^2}$$

where i = (x_i1, x_i2, ..., x_in) and j = (x_j1, x_j2, ..., x_jn) are two n-dimensional data objects.

Another well-known metric is Manhattan (or city block) distance, defined as

$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{in} - x_{jn}|$$

For example, let x1 = (1, 2) and x2 = (3, 5) represent two objects. The Euclidean distance between the two is √(2² + 3²) = √13 ≈ 3.61. The Manhattan distance between the two is 2 + 3 = 5.
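A small, illustrative Python sketch of both distance measures, using the example objects above:

```python
import math

def euclidean(i, j):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(i, j)))

def manhattan(i, j):
    return sum(abs(a - b) for a, b in zip(i, j))

x1, x2 = (1, 2), (3, 5)          # the example objects above
print(euclidean(x1, x2))         # sqrt(2**2 + 3**2) = 3.605...
print(manhattan(x1, x2))         # 2 + 3 = 5
```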

2. Binary Variables

The dissimilarity between objects can be computed from either symmetric or asymmetric binary variables. A binary variable is symmetric if both of its states are equally valuable and carry the same weight; that is, there is no preference on which outcome should be coded as 0 or 1. One such example could be the attribute gender, having the states male and female. Dissimilarity that is based on symmetric binary variables is called symmetric binary dissimilarity.

A binary variable is asymmetric if the outcomes of the states are not equally important, such as the positive and negative outcomes of a disease test. By convention, we shall code the most important outcome, which is usually the rarest one, by 1 (e.g., HIV positive) and the other by 0 (e.g., HIV negative). Let q be the number of variables that equal 1 for both objects i and j, r the number that equal 1 for i but 0 for j, s the number that equal 0 for i but 1 for j, and t the number that equal 0 for both. The dissimilarity based on such variables is called asymmetric binary dissimilarity, where the number of negative matches, t, is considered unimportant and thus is ignored in the computation:

$$d(i, j) = \frac{r + s}{q + r + s}$$

Asymmetric binary similarity between the objects i and j, or sim(i, j), can be computed as

$$sim(i, j) = \frac{q}{q + r + s} = 1 - d(i, j)$$

The coefficient sim(i, j) is called the Jaccard coefficient.
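A minimal Python sketch of asymmetric binary dissimilarity and the Jaccard coefficient; the two binary vectors are hypothetical, coded so that 1 marks the rare, important outcome:

```python
def asymmetric_binary_dissimilarity(i, j):
    """d(i, j) = (r + s) / (q + r + s); negative matches (t) are ignored."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)   # 1-1 matches
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    return (r + s) / (q + r + s)

def jaccard_similarity(i, j):
    return 1.0 - asymmetric_binary_dissimilarity(i, j)

# hypothetical patients coded over asymmetric attributes (1 = Y or P, 0 = N)
patient_a = [1, 0, 1, 0, 0, 0]
patient_b = [1, 0, 1, 0, 1, 0]
print(asymmetric_binary_dissimilarity(patient_a, patient_b))   # 1/3
print(jaccard_similarity(patient_a, patient_b))                # 2/3
```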


For example, suppose that a patient record table contains the attributes name, gender, fever, cough, test-1, test-2, test-3, and test-4, where name is an object identifier, gender is a symmetric attribute, and the remaining attributes are asymmetric binary. For asymmetric attribute values, let the values Y (yes) and P (positive) be set to 1, and the value N (no or negative) be set to 0. Suppose that the distance between objects (patients) is computed based only on the asymmetric variables. The distance between each pair of the three patients, Jack, Mary, and Jim, can then be computed with the equation above.

    3. Categorical, Ordinal, and Ratio-Scaled Variables

3.1 Categorical Variables

A categorical variable is a generalization of the binary variable in that it can take on more than two states. For example, map color is a categorical variable that may have, say, five states: red, yellow, green, pink, and blue. Let the number of states of a categorical variable be M. The states can be denoted by letters, symbols, or a set of integers, such as 1, 2, ..., M.

The dissimilarity between two objects i and j can be computed based on the ratio of mismatches:

$$d(i, j) = \frac{p - m}{p}$$

where m is the number of matches (i.e., the number of variables for which i and j are in the same state), and p is the total number of variables.
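As a quick illustration (not from the notes), the mismatch ratio can be computed as follows; the two objects and their attribute values are made up:

```python
def categorical_dissimilarity(i, j):
    """d(i, j) = (p - m) / p, where m is the number of matching states."""
    p = len(i)
    m = sum(1 for a, b in zip(i, j) if a == b)
    return (p - m) / p

obj1 = ["red", "circle"]        # hypothetical objects with two categorical variables
obj2 = ["red", "square"]
print(categorical_dissimilarity(obj1, obj2))    # (2 - 1) / 2 = 0.5
```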


    3.2 Ordinal Variables

A discrete ordinal variable resembles a categorical variable, except that the M states of the ordinal value are ordered in a meaningful sequence. Ordinal variables are very useful for registering subjective assessments of qualities that cannot be measured objectively. For example, professional ranks are often enumerated in a sequential order, such as assistant, associate, and full for professors. A continuous ordinal variable looks like a set of continuous data of an unknown scale; that is, the relative ordering of the values is essential but their actual magnitude is not.

Ordinal variables may also be obtained from the discretization of interval-scaled quantities by splitting the value range into a finite number of classes. The values of an ordinal variable can be mapped to ranks. For example, suppose that an ordinal variable f has M_f states. These ordered states define the ranking 1, ..., M_f.

Suppose that f is a variable from a set of ordinal variables describing n objects. The dissimilarity computation with respect to f involves the following steps (a small sketch follows the list):

1. The value of f for the ith object is x_if, and f has M_f ordered states, representing the ranking 1, ..., M_f. Replace each x_if by its corresponding rank, r_if ∈ {1, ..., M_f}.

2. Since each ordinal variable can have a different number of states, it is often necessary to map the range of each variable onto [0.0, 1.0] so that each variable has equal weight. This can be achieved by replacing the rank r_if of the ith object in the fth variable by

$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

3. Dissimilarity can then be computed using any of the distance measures described for interval-scaled variables, treating z_if as the value of f for the ith object.

    3.3 Ratio-Scaled Variables

A ratio-scaled variable makes a positive measurement on a nonlinear scale, such as an exponential scale, approximately following the formula

$$Ae^{Bt} \quad \text{or} \quad Ae^{-Bt}$$

where A and B are positive constants, and t typically represents time. Common examples include the growth of a bacteria population or the decay of a radioactive element. There are three methods to handle ratio-scaled variables for computing the dissimilarity between

    objects.

    Treat ratio-scaled variables like interval-scaled variables. This, however, is not usually a good

    choice since it is likely that the scale may be distorted.

Apply a logarithmic transformation to a ratio-scaled variable f having value x_if for object i by using the formula y_if = log(x_if).

Treat x_if as continuous ordinal data and treat their ranks as interval-valued.

    3.4 Variables of Mixed Types

In many real databases, objects are described by a mixture of variable types. One approach is to group each kind of variable together, performing a separate cluster analysis for each variable type.

    A more preferable approach is to process all variable types together, performing a single cluster

    analysis. One such technique combines the different variables into a single dissimilarity matrix,

    bringing all of the meaningful variables onto a common scale of the interval [0.0,1.0].


Suppose that the data set contains p variables of mixed type. The dissimilarity d(i, j) between objects i and j is defined as

$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)}\, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$

where the indicator δ_ij^(f) = 0 if either (1) x_if or x_jf is missing (i.e., there is no measurement of variable f for object i or object j), or (2) x_if = x_jf = 0 and variable f is asymmetric binary; otherwise, δ_ij^(f) = 1. The contribution of variable f to the dissimilarity between i and j, that is, d_ij^(f), is computed dependent on its type:

If f is interval-based: d_ij^(f) = |x_if - x_jf| / (max_h x_hf - min_h x_hf), where h runs over all nonmissing objects for variable f.

If f is binary or categorical: d_ij^(f) = 0 if x_if = x_jf; otherwise d_ij^(f) = 1.

If f is ordinal: compute the ranks r_if and z_if = (r_if - 1)/(M_f - 1), and treat z_if as interval-scaled.
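The following Python sketch (an illustration, not the notes' own code) combines these per-variable contributions into a single dissimilarity; all parameter names and the handling conventions are assumptions made for demonstration:

```python
def mixed_dissimilarity(obj_i, obj_j, types, ranges=None, orders=None):
    """d(i, j) as a weighted average of per-variable contributions d_ij^(f).

    types[f] is 'interval', 'asymmetric_binary', 'categorical', or 'ordinal';
    ranges[f] = (min, max) for interval variables; orders[f] = ordered states
    for ordinal variables. Missing values are represented by None.
    """
    num = den = 0.0
    for f, t in enumerate(types):
        x_if, x_jf = obj_i[f], obj_j[f]
        if x_if is None or x_jf is None:                       # delta_ij^(f) = 0
            continue
        if t == 'asymmetric_binary' and x_if == 0 and x_jf == 0:
            continue                                           # 0-0 match also gives delta = 0
        if t == 'interval':
            lo, hi = ranges[f]
            d = abs(x_if - x_jf) / (hi - lo)
        elif t == 'ordinal':
            M = len(orders[f])
            z_i = orders[f].index(x_if) / (M - 1)
            z_j = orders[f].index(x_jf) / (M - 1)
            d = abs(z_i - z_j)
        else:                                                  # binary or categorical
            d = 0.0 if x_if == x_jf else 1.0
        num += d                                               # delta_ij^(f) = 1 here
        den += 1.0
    return num / den if den else 0.0

# hypothetical objects: (age, test result, map color, professor rank)
types = ['interval', 'asymmetric_binary', 'categorical', 'ordinal']
ranges = {0: (0, 100)}
orders = {3: ['assistant', 'associate', 'full']}
print(mixed_dissimilarity((30, 1, 'red', 'assistant'),
                          (50, 0, 'red', 'full'), types, ranges, orders))   # 0.55
```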

3.5 Vector Objects

In some applications, such as information retrieval, text document clustering, and biological taxonomy, we need to compare and cluster complex objects (such as documents) containing a large number of symbolic entities (such as keywords and phrases). To measure the distance between complex objects, it is often desirable to abandon traditional metric distance computation and introduce a nonmetric similarity function.

There are several ways to define such a similarity function, s(x, y), to compare two vectors x and y. One popular way is to define the similarity function as a cosine measure, as follows:

$$s(x, y) = \frac{x^{t} \cdot y}{\|x\|\,\|y\|}$$

where x^t is a transposition of vector x, ||x|| is the Euclidean norm of vector x, ||y|| is the Euclidean norm of vector y, and s is essentially the cosine of the angle between vectors x and y.
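A minimal Python sketch of the cosine measure; the two term-frequency vectors are hypothetical:

```python
import math

def cosine_similarity(x, y):
    """s(x, y) = (x . y) / (||x|| * ||y||)"""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

doc1 = [5, 0, 3, 0, 2]      # hypothetical term-frequency vectors for two documents
doc2 = [3, 0, 2, 0, 1]
print(cosine_similarity(doc1, doc2))
```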

CATEGORIES OF CLUSTERING METHODS

    In general, the major clustering methods can be classified into the following categories.

Partitioning methods: Given a database of n objects or data tuples, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n. That is, it classifies the data into k groups, which together satisfy the following requirements: (1) each group must contain at least one object, and (2) each object must belong to exactly one group.

Given k, the number of partitions to construct, a partitioning method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another. The general criterion of a good partitioning is that objects in the


same cluster are close or related to each other, whereas objects of different clusters are far apart or very different.

    To achieve global optimality in partitioning-based clustering would require the exhaustive

enumeration of all of the possible partitions. Instead, most applications adopt one of a few popular heuristic methods, such as (1) the k-means algorithm, where each cluster is represented by the mean value of the objects in the cluster, and (2) the k-medoids algorithm, where each cluster is represented by one of the objects located near the center of the cluster.

Hierarchical methods: A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as being either agglomerative or divisive,

    based on how the hierarchical decomposition is formed. The agglomerative approach, also called the

    bottom-up approach, starts with each object forming a separate group. It successively merges the

    objects or groups that are close to one another, until all of the groups are merged into one (the

topmost level of the hierarchy), or until a termination condition holds. The divisive approach, also called the top-down approach, starts with all of the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until eventually each object is in one cluster, or until a termination condition holds.

Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be undone. There are two approaches to improving the quality of hierarchical clustering: (1) perform careful analysis of object linkages at each hierarchical partitioning, such as in Chameleon, or (2) integrate hierarchical agglomeration and other approaches by first using a hierarchical

    agglomerative algorithm to group objects into microclusters, and then performing macroclustering

    on the microclusters using another clustering method such as iterative relocation, as in BIRCH.

Density-based methods: Clustering methods have been developed based on the notion of density. Their general idea is to continue growing the given cluster as long as the density (number of objects or data points) in the neighborhood exceeds some threshold; that is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points. Such a method can be used to filter out noise (outliers) and discover clusters of arbitrary shape. DBSCAN and its extension, OPTICS, are typical density-based methods that grow clusters according to a density-based connectivity analysis.

Grid-based methods: Grid-based methods quantize the object space into a finite number of cells that form a grid structure. All of the clustering operations are performed on the grid structure (i.e.,

    on the quantized space). The main advantage of this approach is its fast processing time, which is

    typically independent of the number of data objects and dependent only on the number of cells in

    each dimension in the quantized space.

    Model-based methods: Model-based methods hypothesize a model for each of the clusters and find

the best fit of the data to the given model. A model-based algorithm may locate clusters by constructing a density function that reflects the spatial distribution of the data points. It also leads to a way of automatically determining the number of clusters based on standard statistics, taking noise or outliers into account and thus yielding robust clustering methods.

    NOTE: - The choice of clustering algorithm depends both on the type of data available and on

    the particular purpose of the application.


Partitioning Methods

Given D, a data set of n objects, and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k ≤ n), where each partition represents a cluster. The clusters

    are formed to optimize an objective partitioning criterion, such as a dissimilarity function based on

    distance, so that the objects within a cluster are similar, whereas the objects of different clusters

    are dissimilar in terms of the data set attributes.

    The most well-known and commonly used partitioning methods are k-means, k-medoids, and their

    variations.

    Centroid-Based Technique: The k-Means Method

The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low. Cluster similarity is measured in regard to the mean value of the objects in a cluster, which can be viewed as the cluster's centroid or center of gravity. The k-means method, however, can be applied only when the mean of a cluster is defined.

The k-means algorithm proceeds as follows. First, it randomly selects k of the objects, each of which initially represents a cluster mean or center. For each of the remaining objects, an object is assigned to the cluster to which it is the most similar, based on the distance between the object and the cluster mean. It then computes the new mean for each cluster. This process iterates until the criterion function converges. Typically, the square-error criterion is used, defined as

$$E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$$

    (In other words, for each object in each cluster, the distance from the object to its cluster center is

    squared, and the distances are summed.)

where E is the sum of the square error for all objects in the data set; p is the point in space representing a given object; and m_i is the mean of cluster C_i (both p and m_i are multidimensional).

Algorithm: k-means. The k-means algorithm for partitioning, where each cluster's center is represented by the mean value of the objects in the cluster.

    Input:

    k: the number of clusters,

D: a data set containing n objects.

Output: A set of k clusters.

Method:
1. arbitrarily choose k objects from D as the initial cluster centers;
2. repeat
3. (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster;
4. update the cluster means, i.e., calculate the mean value of the objects for each cluster;
5. until no change;
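A compact Python sketch of this procedure (illustrative only; the initial centers are drawn at random, so results can vary between runs):

```python
import random

def k_means(data, k, max_iters=100):
    """Minimal k-means sketch; data is a list of numeric tuples."""
    centers = random.sample(data, k)                     # step 1: arbitrary initial centers
    clusters = []
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in data:                                   # step 3: assign to the nearest mean
            idx = min(range(k), key=lambda i: sum((a - b) ** 2
                                                  for a, b in zip(p, centers[i])))
            clusters[idx].append(p)
        new_centers = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[i]
                       for i, c in enumerate(clusters)]  # step 4: recompute cluster means
        if new_centers == centers:                       # step 5: stop when nothing changes
            break
        centers = new_centers
    return centers, clusters

data = [(1, 1), (1, 2), (8, 8), (9, 8), (0, 1), (8, 9)]  # hypothetical 2-D objects
print(k_means(data, k=2))
```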


The algorithm attempts to determine k partitions that minimize the square-error function. It works well when the clusters are compact clouds that are rather well separated from one another. The method is relatively scalable and efficient in processing large data sets because the computational complexity of the algorithm is O(nkt), where n is the total number of objects, k is the number of clusters, and t is the number of iterations.

    Representative Object-Based Technique: The k-Medoids Method

    The k-means algorithm is sensitive to outliers because an object with an extremely large value may

substantially distort the distribution of data. This effect is particularly exacerbated due to the use of the square-error function. Instead of taking the mean value of the objects in a cluster as a reference point, we can pick actual objects to represent the clusters, using one representative object per cluster. Each remaining object is clustered with the representative object to which it is the most similar. The partitioning method is then performed based on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point. That is, an absolute-error criterion is used, defined as

$$E = \sum_{j=1}^{k} \sum_{p \in C_j} |p - o_j|$$

    where E is the sum of the absolute error for all objects in the data set; p is the point in space

representing a given object in cluster C_j; and o_j is the representative object of C_j.

The initial representative objects (or seeds) are chosen arbitrarily. The iterative process of replacing representative objects by nonrepresentative objects continues as long as the quality of the resulting clustering is improved. This quality is estimated using a cost function that measures the average dissimilarity between an object and the representative object of its cluster. To determine whether a nonrepresentative object, o_random, is a good replacement for a current representative object, o_j, the following four cases are examined for each of the nonrepresentative objects, p:

Case 1: p currently belongs to representative object o_j. If o_j is replaced by o_random as a representative object and p is closest to one of the other representative objects, o_i, i ≠ j, then p is reassigned to o_i.


Case 2: p currently belongs to representative object o_j. If o_j is replaced by o_random as a representative object and p is closest to o_random, then p is reassigned to o_random.

Case 3: p currently belongs to representative object o_i, i ≠ j. If o_j is replaced by o_random as a representative object and p is still closest to o_i, then the assignment does not change.

Case 4: p currently belongs to representative object o_i, i ≠ j. If o_j is replaced by o_random as a representative object and p is closest to o_random, then p is reassigned to o_random.

If the total cost is negative, then o_j is replaced or swapped with o_random, since the actual absolute error E would be reduced. If the total cost is positive, the current representative object, o_j, is

    considered acceptable, and nothing is changed in the iteration.

Algorithm: k-medoids. PAM (Partitioning Around Medoids), a k-medoids algorithm for partitioning based on medoid or central objects.

Input:
k: the number of clusters, D: a data set containing n objects.

Output: A set of k clusters.

Method:
1. arbitrarily choose k objects in D as the initial representative objects or seeds;
2. repeat
3. assign each remaining object to the cluster with the nearest representative object;
4. randomly select a nonrepresentative object, o_random;
5. compute the total cost, S, of swapping representative object, o_j, with o_random;
6. if S < 0 then swap o_j with o_random to form the new set of k representative objects;
7. until no change;

In general, the algorithm iterates until, eventually, each representative object is actually the medoid, or most centrally located object, of its cluster. This is the basis of the k-medoids method for grouping n objects into k clusters.
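The following Python sketch illustrates the PAM idea; for brevity it tries every possible swap and accepts any swap that lowers the absolute error E, rather than sampling a random o_random as in the algorithm above:

```python
import random

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def total_cost(data, medoids, dist):
    """Absolute-error criterion E: each object contributes its distance to the nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in data)

def pam(data, k, dist, max_iters=100):
    """Simplified PAM-style sketch: swap a medoid with a non-medoid whenever it lowers E."""
    medoids = random.sample(data, k)
    for _ in range(max_iters):
        improved = False
        for m in list(medoids):
            for candidate in data:
                if candidate in medoids:
                    continue
                trial = [candidate if x == m else x for x in medoids]
                if total_cost(data, trial, dist) < total_cost(data, medoids, dist):
                    medoids, improved = trial, True
        if not improved:                 # no beneficial swap found in a full pass
            break
    return medoids

data = [(1, 1), (2, 1), (8, 8), (9, 9), (1, 2), (50, 50)]   # (50, 50) is an outlier
print(pam(data, k=2, dist=manhattan))
```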


Hierarchical Methods

    A hierarchical clustering method works by grouping data objects into a tree of clusters. Hierarchical

    clustering methods can be further classified as either agglomerative or divisive.

    Agglomerative hierarchical clustering: This bottom-up strategy starts by placing each

    object in its own cluster and then merges these atomic clusters into larger and larger clusters,

    until all of the objects are in a single cluster or until certain termination conditions are

    satisfied.

Divisive hierarchical clustering: This top-down strategy does the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster. It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until it satisfies certain termination conditions.

    Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling

    Chameleon is a hierarchical clustering algorithm that uses dynamic modeling to determine the

    similarity between pairs of clusters. In Chameleon, cluster similarity is assessed based on how well-

connected objects are within a cluster and on the proximity of clusters. That is, two clusters are merged if their interconnectivity is high and they are close together. Chameleon does not depend on a static, user-supplied model and can automatically adapt to the internal characteristics of the clusters being merged.


    Chameleon uses a k-nearest-neighbor graph approach to construct a sparse graph, where each vertex

    of the graph represents a data object, and there exists an edge between two vertices (objects) if one

    object is among the k-most-similar objects of the other. The edges are weighted to reflect the

    similarity between objects. Chameleon uses a graph partitioning algorithm to partition the k-nearest-

    neighbor graph into a large number of relatively small subclusters. It then uses an agglomerative

    hierarchical clustering algorithm that repeatedly merges subclusters based on their similarity. To

    determine the pairs of most similar subclusters, it takes into account both the interconnectivity as

    well as the closeness of the clusters.

    The graph-partitioning algorithm partitions the k-nearest-neighbor graph such that it minimizes the

edge cut. That is, a cluster C is partitioned into subclusters C_i and C_j so as to minimize the weight of the edges that would be cut by the partition.

    Chameleon determines the similarity between each pair of clusters Ci and Cj according to their

    relative interconnectivity, RI(Ci, Cj), and their relative closeness, RC(Ci, Cj):

The relative interconnectivity, RI(C_i, C_j), between two clusters, C_i and C_j, is defined as the absolute interconnectivity between C_i and C_j, normalized with respect to the internal interconnectivity of the two clusters, C_i and C_j. That is,

$$RI(C_i, C_j) = \frac{|EC_{\{C_i, C_j\}}|}{\frac{1}{2}\left(|EC_{C_i}| + |EC_{C_j}|\right)}$$

where EC_{C_i, C_j} is the edge cut, defined as above, for a cluster containing both C_i and C_j. Similarly, EC_{C_i} (or EC_{C_j}) is the minimum sum of the cut edges that partition C_i (or C_j) into two roughly equal parts.

The relative closeness, RC(C_i, C_j), between a pair of clusters, C_i and C_j, is the absolute closeness between C_i and C_j, normalized with respect to the internal closeness of the two clusters, C_i and C_j. It is defined as

$$RC(C_i, C_j) = \frac{\bar{S}_{EC_{\{C_i, C_j\}}}}{\frac{|C_i|}{|C_i| + |C_j|}\,\bar{S}_{EC_{C_i}} + \frac{|C_j|}{|C_i| + |C_j|}\,\bar{S}_{EC_{C_j}}}$$

where S̄_{EC_{C_i, C_j}} is the average weight of the edges that connect vertices in C_i to vertices in C_j, and S̄_{EC_{C_i}} (or S̄_{EC_{C_j}}) is the average weight of the edges that belong to the min-cut bisector of cluster C_i (or C_j).


    CURE (Clustering Using REpresentatives)

    CURE is an efficient data clustering algorithm for large databases that is more robust to outliers and

identifies clusters having non-spherical shapes and wide variances in size. To avoid the problems with non-uniform sized or shaped clusters, CURE employs a novel hierarchical clustering algorithm that adopts a middle ground between the centroid-based and all-point extremes.

In CURE, a constant number c of well-scattered points of a cluster are chosen, and they are shrunk towards the centroid of the cluster by a fraction α. The scattered points after shrinking are used as representatives of the cluster. The clusters with the closest pair of representatives are the clusters that are merged at each step of CURE's hierarchical clustering algorithm. This enables CURE to correctly identify the clusters and makes it less sensitive to outliers.

    To handle large databases, CURE employs a combination of random sampling and partitioning: a

    random sample is first partitioned, and each partition is partially clustered. The partial clusters are

    then clustered in a second pass to yield the desired clusters.

    The algorithm is given below.

Draw a random sample s. Partition the sample into p partitions, each of size s/p.

Partially cluster each partition into s/(pq) clusters.

Eliminate outliers

o by random sampling;

o if a cluster grows too slowly, eliminate it.

Cluster the partial clusters.

Label the data on disk.

Strength:

Produces high-quality clusters in the presence of outliers.

Allows clusters of complex shapes and different sizes.

Requires only one scan of the entire database.

    Weakness:

    CURE does not handle categorical attributes.


Density-Based Methods

To discover clusters with arbitrary shape, density-based clustering methods have been developed.

DBSCAN: A Density-Based Clustering Method Based on Connected Regions with Sufficiently High Density

    The algorithm grows regions with sufficiently high density into clusters and discovers clusters of

arbitrary shape in spatial databases with noise. It defines a cluster as a maximal set of density-connected points.

    The basic ideas of density-based clustering involve a number of new definitions.

The neighborhood within a radius ε of a given object is called the ε-neighborhood of the object. If the ε-neighborhood of an object contains at least a minimum number, MinPts, of objects, then the object is called a core object.

Given a set of objects, D, we say that an object p is directly density-reachable from object q if p is within the ε-neighborhood of q, and q is a core object.

An object p is density-reachable from object q with respect to ε and MinPts in a set of objects, D, if there is a chain of objects p1, ..., pn, where p1 = q and pn = p, such that p_{i+1} is directly density-reachable from p_i with respect to ε and MinPts, for 1 ≤ i ≤ n, p_i ∈ D.

An object p is density-connected to object q with respect to ε and MinPts in a set of objects, D, if there is an object o ∈ D such that both p and q are density-reachable from o with respect to ε and MinPts.

Consider a figure in which ε is represented by the radius of the circles around the points, and, say, let MinPts = 3. Based on the above definitions:

Of the labeled points, m, p, o, and r are core objects because each is in an ε-neighborhood containing at least three points.

q is directly density-reachable from m. m is directly density-reachable from p and vice versa.

q is (indirectly) density-reachable from p because q is directly density-reachable from m and m is directly density-reachable from p. However, p is not density-reachable from q because


    q is not a core object. Similarly, r and s are density-reachable from o, and o is density-reachable from r.

    o, r, and s are all density-connected.

DBSCAN searches for clusters by checking the ε-neighborhood of each point in the database. If the ε-neighborhood of a point p contains more than MinPts points, a new cluster with p as a core object is created. DBSCAN then iteratively collects directly density-reachable objects from these core objects, which may involve the merging of a few density-reachable clusters. The process terminates when no new point can be added to any cluster.
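A minimal Python sketch of this search (illustrative only; eps corresponds to ε, and the label -1 marks noise):

```python
def region_query(data, p, eps, dist):
    """Indices of all points in the eps-neighborhood of point p (p itself included)."""
    return [q for q in range(len(data)) if dist(data[p], data[q]) <= eps]

def dbscan(data, eps, min_pts, dist):
    """Minimal DBSCAN sketch; returns one cluster label per point (-1 marks noise)."""
    labels = [None] * len(data)
    cluster_id = -1
    for p in range(len(data)):
        if labels[p] is not None:
            continue
        neighbors = region_query(data, p, eps, dist)
        if len(neighbors) < min_pts:          # p is not a core object
            labels[p] = -1
            continue
        cluster_id += 1                       # start a new cluster at core object p
        labels[p] = cluster_id
        seeds = list(neighbors)
        while seeds:
            q = seeds.pop()
            if labels[q] == -1:               # noise reachable from a core point joins the cluster
                labels[q] = cluster_id
            if labels[q] is not None:
                continue
            labels[q] = cluster_id
            q_neighbors = region_query(data, q, eps, dist)
            if len(q_neighbors) >= min_pts:   # q is also a core object: expand further
                seeds.extend(q_neighbors)
    return labels

def euclid(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (50, 50)]   # hypothetical data
print(dbscan(points, eps=2.0, min_pts=3, dist=euclid))                # [0, 0, 0, 1, 1, 1, -1]
```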

    OPTICS: Ordering Points to Identify the Clustering Structure

Although DBSCAN can cluster objects given input parameters such as ε and MinPts, it still leaves the user with the responsibility of selecting parameter values that will lead to the discovery of acceptable clusters. To help overcome this difficulty, a cluster analysis method called OPTICS was proposed. Rather than produce a data set clustering explicitly, OPTICS computes an augmented cluster ordering for automatic and interactive cluster analysis. The cluster ordering can be used to extract basic clustering information (such as cluster centers or arbitrary-shaped clusters) as well as provide the intrinsic clustering structure.

Two values need to be stored for each object: core-distance and reachability-distance.

The core-distance of an object p is the smallest ε′ value that makes {p} a core object. If p is not a core object, the core-distance of p is undefined.

The reachability-distance of an object q with respect to another object p is the greater value of the core-distance of p and the Euclidean distance between p and q. If p is not a core object, the reachability-distance between p and q is undefined.

    The OPTICS algorithm creates an ordering of the objects in a database, additionally storing the core-

    distance and a suitable reachability distance for each object. An algorithm was proposed to extract

    clusters based on the ordering information produced by OPTICS. Such information is sufficient for

the extraction of all density-based clusterings with respect to any distance ε′ that is smaller than the distance ε used in generating the order.
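A small sketch of the two stored values (illustrative; it assumes the convention that an object counts itself among its MinPts neighbors):

```python
def core_distance(data, p, eps, min_pts, dist):
    """Smallest radius that would make p a core object; None (undefined) if p is not core at eps."""
    dists = sorted(dist(data[p], data[q]) for q in range(len(data)) if q != p)
    if len(dists) < min_pts - 1 or dists[min_pts - 2] > eps:
        return None
    return dists[min_pts - 2]               # distance to the MinPts-th object, counting p itself

def reachability_distance(data, p, q, eps, min_pts, dist):
    """max(core-distance of p, dist(p, q)); None (undefined) if p is not a core object."""
    cd = core_distance(data, p, eps, min_pts, dist)
    return None if cd is None else max(cd, dist(data[p], data[q]))

def euclid(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8)]               # hypothetical data
print(core_distance(points, 0, eps=2.0, min_pts=3, dist=euclid))        # 1.0
print(reachability_distance(points, 0, 4, eps=2.0, min_pts=3, dist=euclid))
```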


The reachability plot for a simple two-dimensional data set presents a general overview of how the data are structured and clustered. The data objects are plotted in cluster order

    (horizontal axis) together with their respective reachability-distance (vertical axis). The three

    Gaussian bumps in the plot reflect three clusters in the data set. Methods have also been developed

    for viewing clustering structures of high-dimensional data at various levels of detail.


Grid-Based Methods

The grid-based clustering approach uses a multiresolution grid data structure. It quantizes the object

    space into a finite number of cells that form a grid structure on which all of the operations for

clustering are performed. The main advantage of the approach is its fast processing time, which is typically independent of the number of data objects, yet dependent on only the number of cells in

    each dimension in the quantized space.

    STING: STatistical INformation Grid

    STING is a grid-based multiresolution clustering technique in which the spatial area is divided into

    rectangular cells. Because STING uses a multiresolution approach to cluster analysis, the quality of

    STING clustering depends on the granularity of the lowest level of the grid structure. There are

    usually several levels of such rectangular cells corresponding to different levels of resolution, and

    these cells form a hierarchical structure: each cell at a high level is partitioned to form a number of

    cells at the next lower level. Statistical information regarding the attributes in each grid cell (such as

    the mean, maximum, and minimum values) is precomputed and stored.

    Statistical parameters of higher-level cells can easily be computed from the parameters of the lower-

    level cells. These parameters include the following: the attribute-independent parameter, count; the

    attribute-dependent parameters, mean, stdev (standard deviation), min (minimum), max

(maximum); and the type of distribution that the attribute value in the cell follows, such as normal,

    uniform, exponential, or none (if the distribution is unknown).
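As a small illustration (not from the notes) of why higher-level statistics are easy to derive, the count, mean, min, and max of a parent cell can be combined from its child cells as follows (stdev is omitted for brevity):

```python
def aggregate_cells(children):
    """children: list of (count, mean, min, max) tuples for the lower-level cells."""
    total = sum(c[0] for c in children)
    mean = sum(c[0] * c[1] for c in children) / total    # count-weighted mean
    lo = min(c[2] for c in children)
    hi = max(c[3] for c in children)
    return total, mean, lo, hi

# three hypothetical child cells of one parent cell
children = [(10, 5.0, 1.0, 9.0), (20, 7.0, 2.0, 12.0), (5, 4.0, 3.0, 6.0)]
print(aggregate_cells(children))    # (35, 6.0, 1.0, 12.0)
```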

The statistical parameters can be used in a top-down, grid-based method as follows. First, a layer within the hierarchical structure is determined from which the query-answering process is to start. This layer typically contains a small number of cells. For each cell in the current layer, we compute the confidence interval (or estimated range of probability) reflecting the cell's relevancy to the given query. The irrelevant cells are removed from further consideration. Processing of the next lower level examines only the remaining relevant cells. This process is repeated until the bottom layer is reached. At this time, if the query specification is met, the regions of relevant cells that satisfy the query are returned. Otherwise, the data that fall into the relevant cells are retrieved and further processed until they meet the requirements of the query.


    STING offers several advantages:

    1. The grid-based computation is query-independent, because the statistical information stored

in each cell represents the summary information of the data in the grid cell, independent of the query;

2. The grid structure facilitates parallel processing and incremental updating; and

3. The method's efficiency is a major advantage: STING goes through the database once to

    compute the statistical parameters of the cells, and hence the time complexity of generating

    clusters is O(n), where n is the total number of objects. After generating the hierarchical

    structure, the query processing time is O(g), where g is the total number of grid cells at the

    lowest level, which is usually much smaller than n.

    CLIQUE: A Dimension-Growth Subspace Clustering Method

    CLIQUE (CLustering In QUEst) was the first algorithm proposed for dimension-growth subspace

    clustering in high-dimensional space. In dimension-growth subspace clustering, the clustering

process starts at single-dimensional subspaces and grows upward to higher-dimensional ones. CLIQUE partitions each dimension like a grid structure and determines whether a cell is dense based on the number of points it contains; it can therefore be viewed as both grid-based and density-based.

    The ideas of the CLIQUE clustering algorithm are outlined as follows:

    Given a large set of multidimensional data points, the data space is usually not uniformly

occupied by the data points. CLIQUE's clustering identifies the sparse and the crowded areas in space (or units), thereby discovering the overall distribution patterns of the data set.

A unit is dense if the fraction of total data points contained in it exceeds an input model parameter. In CLIQUE, a cluster is defined as a maximal set of connected dense units.

    CLIQUE performs multidimensional clustering in two steps:

    1. In the first step, CLIQUE partitions the d-dimensional data space into nonoverlapping

rectangular units, identifying the dense units among these. This is done (in 1-D) for each dimension.

2. In the second step, CLIQUE generates a minimal description for each cluster as follows. For each cluster, it determines the maximal region that covers the cluster of connected dense units. It then determines a minimal cover (logic description) for each cluster.

    CLIQUE automatically finds subspaces of the highest dimensionality such that high-density clusters

exist in those subspaces. It is insensitive to the order of input objects and does not presume any canonical data distribution. It scales linearly with the size of input and has good scalability as the number of dimensions in the data is increased. However, obtaining meaningful clustering results is dependent on proper tuning of the grid size (which is a stable structure here) and the density

    threshold.


Model-Based Clustering Methods

Model-based clustering methods attempt to optimize the fit between the given data and some

    mathematical model. Such methods are often based on the assumption that the data are generated

    by a mixture of underlying probability distributions.

    Neural Network Approach

    The neural network approach is motivated by biological neural networks. Roughly speaking, a neural

    network is a set of connected input/output units, where each connection has a weight associated

    with it.

    Neural networks have several properties: First, neural networks are inherently parallel and

    distributed processing architectures. Second, neural networks learn by adjusting their

interconnection weights so as to best fit the data. This allows them to normalize or prototype the patterns and act as feature (or attribute) extractors for the various clusters. Third, neural networks

    process numerical vectors and require object patterns to be represented by quantitative features

    only.

    The neural network approach to clustering tends to represent each cluster as an exemplar. An

    exemplar acts as a prototype of the cluster and does not necessarily have to correspond to a

    particular data example or object. New objects can be distributed to the cluster whose exemplar is

    the most similar, based on some distance measure.

    Self-organizing feature maps (SOMs) are one of the most popular neural network methods for

cluster analysis. They are sometimes referred to as Kohonen self-organizing feature maps, or as topologically ordered maps. The SOM's goal is to represent all points in a high-dimensional source space

    by points in a low-dimensional (usually 2-D or 3-D) target space, such that the distance and proximity

    relationships (hence the topology) are preserved as much as possible. With SOMs, clustering is

performed by having several units competing for the current object. The unit whose weight vector is closest to the current object becomes the winning or active unit. The organization of units is said to

    form a feature map. The SOM approach has been used successfully for Web document clustering.

    Outlier Analysis

    Very often, there exist data objects that do not comply with the general behavior or model of the data.

    Such data objects, which are grossly different from or inconsistent with the remaining set of data, are

    called outliers. Outliers can be caused by measurement or execution error. For example, the display

of a person's age as -999 could be caused by a program default setting of an unrecorded age.

    Alternatively, outliers may be the result of inherent data variability. Many data mining algorithms try

to minimize the influence of outliers or eliminate them altogether. This, however, could result in the loss of important hidden information.

    In other words, the outliers may be of particular interest, such as in the case of fraud detection, where

    outliers may indicate fraudulent activity. Thus, outlier detection and analysis is an interesting data

mining task, referred to as outlier mining. Outlier mining can be described as follows: Given a set of n data points or objects and k, the expected number of outliers, find the top k objects that are

    considerably dissimilar, exceptional, or inconsistent with respect to the remaining data.


The outlier mining problem can be viewed as two subproblems: (1) define what data can be considered as inconsistent in a given data set, and (2) find an efficient method to mine the outliers so defined. The problem of defining outliers is nontrivial. Approaches to outlier detection can be categorized into four groups: the statistical approach, the distance-based approach, the density-based local outlier approach, and the deviation-based approach.

1. Statistical Distribution-Based Outlier Detection

The statistical distribution-based approach to outlier detection assumes a distribution or probability model for the given data set (e.g., a normal or Poisson distribution) and then identifies outliers with respect to the model using a discordancy test. Application of the test requires knowledge of the data set parameters (such as the assumed data distribution), knowledge of distribution parameters (such as the mean and variance), and the expected number of outliers.
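A minimal sketch of such a discordancy test under an assumed normal distribution: objects whose z-score exceeds a chosen threshold are flagged. The threshold and data below are illustrative only:

```python
def statistical_outliers(values, threshold=3.0):
    """Flag values whose z-score (under an assumed normal model) exceeds the threshold."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((x - mean) ** 2 for x in values) / n) ** 0.5   # sample includes suspect values
    return [x for x in values if abs(x - mean) / std > threshold]

ages = [23, 25, 31, 28, 30, 27, 26, -999]      # -999 is a recording error
print(statistical_outliers(ages, threshold=2.0))   # [-999]
```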

    There are two basic types of procedures for detecting outliers:

1) Block procedures: In this case, either all of the suspect objects are treated as outliers or all of them are accepted as consistent.

2) Consecutive (or sequential) procedures: An example of such a procedure is the inside-out procedure. Its main idea is that the object that is least likely to be an outlier is tested first. If it is found to be an outlier, then all of the more extreme values are also considered outliers; otherwise, the next most extreme object is tested, and so on. This procedure tends to be more effective than block procedures.