Post on 28-Dec-2015
DATA MINING
CLUSTERING: K-Means
Clustering Definition
• Techniques used to divide data objects into groups
– A form of classification, in that it labels each object with a class (cluster) label; the labels, however, are derived from the data itself
• Cluster analysis is categorized as unsupervised classification
– When you have no idea how to define the groups in advance, a clustering method can be useful
Types of Clustering
• Hierarchical vs. Partitional
– Hierarchical: nested clusters, organized as a tree
– Partitional: a division of the data into non-overlapping clusters
• Exclusive vs. Overlapping vs. Fuzzy
– Exclusive: each object is assigned to a single cluster
– Overlapping: an object can simultaneously belong to more than one cluster
– Fuzzy: every object belongs to every cluster with a membership weight between 0 and 1
• Complete vs. Partial
– Complete: assigns every object to a cluster
– Partial: not all objects are assigned
Types of Clusters
• Well-separated
• Prototype-based
• Graph-based
• Density-based
• Shared-property (conceptual clusters)
K-Means
• Partitional clustering
• Prototype-based
• One level
Basic K-Means
• k, the number of clusters to be formed, must be decided before beginning
• Step 1
– Select k data points to act as the seeds (initial cluster centroids)
• Step 2
– Assign each record to its nearest centroid, forming k clusters
• Step 3
– Recalculate the centroids of the new clusters; if any centroid moved, go back to Step 2
Basic K-Means (2)
[Figure: the three steps illustrated — assign each record to the nearest centroid, calculate the new centroids, determine the cluster boundaries]
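The three steps above can be sketched in pure Python. This is an illustrative sketch, not code from the slides; the function name, the `seed` parameter, and the tuple-based point representation are my own choices.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Basic k-means sketch: points is a list of numeric tuples."""
    rng = random.Random(seed)
    # Step 1: select k records to act as the seeds (initial centroids).
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Step 2: assign each record to the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Step 3: recompute each centroid as the mean of its cluster.
        new_centroids = []
        for c, members in zip(centroids, clusters):
            if members:  # keep the old centroid if a cluster went empty
                new_centroids.append(tuple(sum(v) / len(members)
                                           for v in zip(*members)))
            else:
                new_centroids.append(c)
        if new_centroids == centroids:  # no centroid moved: converged
            break
        centroids = new_centroids       # otherwise, go back to Step 2
    return centroids, clusters
```

In practice a library implementation such as scikit-learn's `sklearn.cluster.KMeans` would be used instead; the sketch only mirrors the three steps of the slide.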
Choosing Initial Centroids
• Random initial centroids
– Often poor
– Can produce empty clusters
• Limiting the damage of random initialization
– Perform multiple runs, each with a different set of randomly chosen centroids, then select the clustering with the minimum SSE
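The multiple-run strategy can be sketched as follows; it is self-contained, so it repeats a minimal k-means pass. All function names here (`one_run`, `sse`, `best_of`) are my own, not from the slides.

```python
import random

def one_run(points, k, rng, iters=100):
    """A single k-means pass from one random initialization."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        new = [tuple(sum(v) / len(m) for v in zip(*m)) if m else c
               for c, m in zip(centroids, clusters)]
        if new == centroids:
            break
        centroids = new
    return centroids, clusters

def sse(centroids, clusters):
    """Total squared distance of every point to its cluster centroid."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, c))
               for c, members in zip(centroids, clusters) for p in members)

def best_of(points, k, runs=10, seed=0):
    """Run k-means several times; keep the clustering with minimum SSE."""
    rng = random.Random(seed)
    results = [one_run(points, k, rng) for _ in range(runs)]
    return min(results, key=lambda r: sse(*r))
```

scikit-learn's `KMeans` applies the same idea through its `n_init` parameter.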
Similarity, Association, and Distance
• The method just described assumes that each record can be described as a point in a metric space
– This is not easily done for many data sets (e.g., categorical and some numeric variables); pre-processing is often necessary
• Records in a cluster should have a natural association, so a measure of similarity is required
– Euclidean distance is often used, but it is not always suitable
– Euclidean distance treats changes in each dimension equally, but changes in one field may be more important than changes in another
• Changes of the same "size" in different fields can have very different significances
• e.g., a 1-metre difference in height vs. a $1 difference in annual income
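The height-vs-income example can be made concrete. The sketch below uses invented numbers and my own helper names (`euclidean`, `weighted_euclidean`, `standardize`); it shows raw Euclidean distance being dominated by the large-scale field, and two standard remedies — per-dimension weights and z-score standardization.

```python
def euclidean(x, y):
    """Unweighted Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def weighted_euclidean(x, y, w):
    """Euclidean distance with a per-dimension weight w."""
    return sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x, y)) ** 0.5

def standardize(rows):
    """Rescale each column to zero mean and unit variance (z-scores)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [(sum((v - m) ** 2 for v in c) / len(c)) ** 0.5
            for c, m in zip(cols, means)]
    return [tuple((v - m) / s for v, m, s in zip(r, means, stds))
            for r in rows]

# Hypothetical records: (height in metres, annual income in dollars).
a = (1.60, 50_000.0)
b = (1.90, 50_001.0)   # very different height, near-identical income
c = (1.60, 50_300.0)   # identical height, modestly different income

# On raw values, the income axis swamps the height axis entirely:
# a looks closer to b than to c purely because of the dollar scale.
assert euclidean(a, c) > euclidean(a, b)

# Weighting (here: ignore income) restores the importance of height.
assert abs(weighted_euclidean(a, b, (1.0, 0.0)) - 0.3) < 1e-9
```

`standardize` is the usual pre-processing step: after it, a one-standard-deviation change costs the same in every field.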
Measures of Similarity
• Euclidean distance between vectors X and Y:
d(X, Y) = sqrt( Σ_i (x_i − y_i)² )
• Weighting: each dimension can be given a weight w_i:
d(X, Y) = sqrt( Σ_i w_i (x_i − y_i)² )
Redefining Cluster Centroids
• Sum of the Squared Error (SSE) for data in Euclidean space:
SSE = Σ_i Σ_{x ∈ C_i} dist(c_i, x)²
• The centroid (mean) of the i-th cluster C_i, containing m_i points, is defined as:
c_i = (1 / m_i) Σ_{x ∈ C_i} x
• Other cases:
Proximity Function | Centroid | Objective Function
Manhattan (L1) | median | Minimize the sum of L1 distances of each object to its cluster centroid
Squared Euclidean (L2²) | mean | Minimize the sum of squared L2 distances of each object to its cluster centroid
Cosine | mean | Maximize the sum of the cosine similarity of each object to its cluster centroid
Bregman divergence | mean | Minimize the sum of the Bregman divergences of each object to its cluster centroid
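A tiny check of the table's first row: under the L1 (Manhattan) objective the median, not the mean, is the better centroid, while the mean wins under the squared-L2 objective. The data here is invented purely for illustration.

```python
# 1-D points with one outlier; hypothetical data for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 100.0]

mean = sum(xs) / len(xs)            # pulled toward the outlier: 22.0
median = sorted(xs)[len(xs) // 2]   # robust to the outlier: 3.0

def l1_cost(center, points):
    """Sum of L1 (absolute) distances to a candidate centroid."""
    return sum(abs(x - center) for x in points)

def l2_cost(center, points):
    """Sum of squared L2 distances to a candidate centroid."""
    return sum((x - center) ** 2 for x in points)

# The median gives the lower total L1 distance...
assert l1_cost(median, xs) < l1_cost(mean, xs)
# ...while the mean gives the lower total squared distance.
assert l2_cost(mean, xs) < l2_cost(median, xs)
```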
Bisecting K-Means
• Basic idea:
– Split the set of all points into two clusters
– Select one of these clusters to split
– Continue until K clusters have been produced
• Choosing the cluster to split:
– The cluster with the largest SSE
– The cluster with the largest size
– Both, or some other criterion
• Bisecting K-means is less susceptible to initialization problems
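A self-contained sketch of this procedure, using the largest-SSE criterion from the slide; the function names and the degenerate-split fallback are my own additions.

```python
import random

def two_means(points, rng, iters=50):
    """Split one cluster into two with a plain 2-means pass."""
    centroids = rng.sample(points, 2)
    for _ in range(iters):
        halves = [[], []]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            halves[d.index(min(d))].append(p)
        new = [tuple(sum(v) / len(h) for v in zip(*h)) if h else c
               for c, h in zip(centroids, halves)]
        if new == centroids:
            break
        centroids = new
    return halves

def sse(cluster):
    """SSE of one cluster around its own mean."""
    m = [sum(v) / len(cluster) for v in zip(*cluster)]
    return sum(sum((a - b) ** 2 for a, b in zip(p, m)) for p in cluster)

def bisecting_kmeans(points, k, seed=0):
    """Repeatedly split one cluster in two until k clusters exist."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        # Criterion from the slide: split the cluster with the largest SSE.
        worst = max((c for c in clusters if len(c) > 1), key=sse)
        clusters.remove(worst)
        halves = [h for h in two_means(worst, rng) if h]
        if len(halves) < 2:              # degenerate split: peel off one point
            halves = [worst[:1], worst[1:]]
        clusters.extend(halves)
    return clusters
```

Swapping the `max` key for `len` gives the largest-size criterion instead.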
Strengths and Weaknesses
• Strengths
– Simple, and can be used for a wide variety of data types
– Computationally efficient
• Weaknesses
– Not suitable for all types of data
– Sensitive to outliers, which should be removed beforehand
– Restricted to data for which there is a notion of a center (centroid)