
Machine Learning, STOR 565

Clustering: Overview and Basic Methods

Andrew Nobel

February, 2021


Overview

Task: Divide a set of objects (e.g. data points) into a small number of disjoint groups such that objects in the same group are close together, and objects in different groups are far apart.


Example (http://rosettacode.org)


Example (https://apandre.files.wordpress.com)


General Setting

Given: Objects x_1, ..., x_n in feature space X

- Dissimilarity or distance d(x_i, x_j) between pairs of objects

Goal: Divide x_1, ..., x_n into disjoint groups C_1, ..., C_k, called clusters, such that

- Objects in the same cluster are close together
- Objects in different clusters are far apart
- The number of clusters k is small

Distinction: A clustering is complete if it partitions X, and incomplete if it partitions only x_1, ..., x_n.


Clustering: Areas of Application

Genomics, Biology

Data Compression

Psychology

Computer Science

Social and Political Science


Feature Vectors

Objects x ∈ X are typically represented by a feature vector

x = (x_1, ..., x_p)^t

where x_i is a numerical or categorical measurement of interest:

- x_i ∈ R: numerical feature
- x_i ∈ {a, b, ...}: categorical feature


Examples

Medicine

- Object = patient
- Feature x_i = outcome of a diagnostic test on the patient

Microarrays (Genomics)

- Object = tissue sample
- Feature x_i = measured expression level of gene i in that sample

Data Mining

- Object = consumer
- Feature x_i = type, location, or amount of recent purchases


Dissimilarities/Distances Between Feature Vectors

Euclidean: d(u, v) = √(∑_i (u_i − v_i)^2)

Manhattan: d(u, v) = ∑_i |u_i − v_i|

Correlation: d(u, v) = 1 − corr(u, v)

Hamming: d(u, v) = ∑_i I{u_i ≠ v_i}

Mixtures of these
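In R, these dissimilarities are available directly; a minimal sketch on a numeric matrix (the choice of mtcars columns is purely illustrative):

x <- as.matrix(mtcars[, c("mpg", "hp", "wt")])  # rows = objects, columns = features
d_euclidean   <- dist(x, method = "euclidean")  # sqrt of sum of squared differences
d_manhattan   <- dist(x, method = "manhattan")  # sum of absolute differences
d_correlation <- as.dist(1 - cor(t(x)))         # 1 - corr(u, v), rows as vectors
# Hamming distance between two categorical vectors: number of disagreeing features
u <- c("a", "b", "a"); v <- c("a", "b", "b")
d_hamming <- sum(u != v)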


Basic Steps in Clustering

Objects x_1, ..., x_n

Selection and Extraction of Features

Dissimilarity matrix D = {d(x_i, x_j) : 1 ≤ i, j ≤ n}

Clustering Algorithm

Partition π = {C_1, ..., C_k} of x_1, ..., x_n
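As a sketch, the whole pipeline takes a few lines of R (standardizing the features and using agglomerative clustering are my own choices here, not prescribed by the slides):

x  <- scale(mtcars)          # selection/extraction of features (here: standardize)
D  <- dist(x)                # dissimilarity matrix D = {d(x_i, x_j)}
hc <- hclust(D)              # a clustering algorithm (agglomerative, see later slides)
groups <- cutree(hc, k = 3)  # partition {C_1, ..., C_k} with k = 3 clusters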


Some Clustering Methods

Hierarchical: Candidate divisions of the data described by a binary tree

- Agglomerative (bottom-up)
- Divisive (top-down)

Iterative: Search for a local minimum of a simple cost function

- k-means and variants
- Partitioning around medoids, self-organizing maps

Model-based: Fit feature vectors with a finite mixture model

Spectral: Threshold eigenvectors of the Laplacian of the dissimilarity matrix


Features of Clusters

Features of clusters can affect the performance of different procedures, e.g., whether clusters are

- Spherical or elliptical in shape
- Similar in overall variance/spread
- Similar in size (number of points)


The k-Means Algorithm



Setting: Objects x_1, ..., x_n ∈ R^p are vectors. Seek k clusters

Approach: Focus on cluster centers

- Find good cluster centers c_1, ..., c_k ∈ R^p
- Let cluster C_j = vectors x_i closer to c_j than to any other center c_l

Optimization: Select centers to minimize sum of squares (SoS) cost function

Cost(c_1, ..., c_k) = ∑_{i=1}^n min_{1≤j≤k} ||x_i − c_j||^2

Problem: Exact solution of the optimization problem is not feasible. Resort to iterative methods that find a local minimum


Ingredient 1: Centroids

Definition: The centroid of vectors v_1, ..., v_m ∈ R^p is their (vector) average

c = (1/m) ∑_{i=1}^m v_i

- Centroid c is the center of mass of the point configuration v_1, ..., v_m
- Centroid c is an optimal representative for v_1, ..., v_m, in the sense that

∑_{i=1}^m ||v_i − c||^2 ≤ ∑_{i=1}^m ||v_i − v||^2 for every vector v ∈ R^p
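The optimality claim follows from a short standard expansion (not on the slide): write v_i − v = (v_i − c) + (c − v) and note that the cross term sums to zero, since ∑_i (v_i − c) = mc − mc = 0. In LaTeX:

\begin{align*}
\sum_{i=1}^m \|v_i - v\|^2
  &= \sum_{i=1}^m \|v_i - c\|^2
     + 2\Big\langle \sum_{i=1}^m (v_i - c),\; c - v \Big\rangle
     + m\,\|c - v\|^2 \\
  &= \sum_{i=1}^m \|v_i - c\|^2 + m\,\|c - v\|^2
  \;\ge\; \sum_{i=1}^m \|v_i - c\|^2,
\end{align*}

with equality if and only if v = c.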


Ingredient 2: Nearest Neighbor Partitions

Idea: Given centers c_1, ..., c_k ∈ R^p one can partition R^p into corresponding cells A_1, ..., A_k, where

A_j = {x : ||x − c_j|| ≤ ||x − c_s|| for all s ≠ j}

contains the vectors that are closer to center c_j than to any other center c_s (where we break ties by index)

Definition: The cells {A_1, ..., A_k} are called the nearest neighbor or Voronoi partition of R^p generated by the centers c_1, ..., c_k

Note: A_j = ⋂_{s≠j} {x : ||x − c_j|| ≤ ||x − c_s||} is an intersection of half-spaces
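A small sketch of this assignment step in R (the helper name nearest_center is my own; base R only):

nearest_center <- function(x, centers) {
  k <- nrow(centers)
  # squared distances from each row of x to each of the k centers
  d2 <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k, drop = FALSE]^2
  max.col(-d2, ties.method = "first")  # index of nearest center, ties broken by index
}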


The k-Means Algorithm

Initialize: Centers C_0 = {a_1, ..., a_k}

Iterate: For m = 1, 2, ... do:

- Let π_m be the nearest neighbor partition of the centers C_{m−1}
- Let C_m be the centroids of the vectors in each cell of π_m

Stop: When Cost(C_m) is close to Cost(C_{m+1})

Key Fact: The cost function decreases at each iteration of the algorithm. Recall

Cost(c_1, ..., c_k) = ∑_{i=1}^n min_{1≤j≤k} ||x_i − c_j||^2
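A sketch of the full iteration in base R (the function name kmeans_lloyd and the random initialization are my own choices; R's built-in kmeans does this properly and faster):

kmeans_lloyd <- function(x, k, max_iter = 100, tol = 1e-8) {
  x <- as.matrix(x)
  n <- nrow(x)
  centers <- x[sample(n, k), , drop = FALSE]  # C_0: k data points chosen at random
  old_cost <- Inf
  for (m in seq_len(max_iter)) {
    # nearest neighbor partition pi_m of the centers C_{m-1}
    d2 <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k, drop = FALSE]^2
    cell <- max.col(-d2, ties.method = "first")
    cost <- sum(d2[cbind(seq_len(n), cell)])  # Cost of the current centers
    if (old_cost - cost < tol) break          # stop: cost has (nearly) stopped decreasing
    old_cost <- cost
    # C_m: centroid of the vectors in each cell of pi_m
    for (j in seq_len(k))
      if (any(cell == j)) centers[j, ] <- colMeans(x[cell == j, , drop = FALSE])
  }
  list(centers = centers, cluster = cell, cost = cost)
}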


k-means (Yu-Zhong Chen, ResearchGate)


The k-Means Algorithm

In practice:

- Choose multiple initial sets of representative vectors C_0 = {c_1, ..., c_k}
- Run the iterative k-means procedure
- Choose the partition associated with the smallest final cost (see the sketch below)

Example: http://onmyphd.com/?p=k-means.clustering.

k-means tends to perform best when clusters are spherical and similar in variance and size
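R's built-in kmeans implements this restart strategy through its nstart argument; a minimal usage sketch (the data, k = 3, and 25 restarts are illustrative choices):

km <- kmeans(scale(mtcars), centers = 3, nstart = 25)  # keep best of 25 random starts
km$tot.withinss   # final sum-of-squares cost of the selected partition
table(km$cluster) # cluster sizes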


3-means (onmyphd.com)


2-Means (onmyphd.com)


Agglomerative Clustering


Binary Trees

1. Distinguished node called the root, with zero or two children but no parent

2. Every other node has one parent and zero or two children

- Nodes with no children are called leaves
- Nodes with two children are called internal

Note: Tree usually drawn upside-down, with root node at the top


Agglomerative Clustering

Stage 0: Assign each object x_i to its own cluster

Stage k:

- Find the two closest clusters at stage k − 1
- Combine them into a single cluster

Stop: When all objects x_i belong to a single cluster

Output: Binary tree T called a dendrogram

Note: Closeness of clusters C, C′ can be measured in different ways (see below)
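Base R's hclust carries out exactly this bottom-up procedure; a minimal sketch on toy data (my own example):

set.seed(1)
x  <- matrix(rnorm(20 * 2), ncol = 2)  # 20 objects in R^2
hc <- hclust(dist(x))                  # agglomerative clustering
plot(hc)                               # draws the dendrogram T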


Distances Between Clusters

Single Linkage: d_s(C, C′) = min_{x_i ∈ C, x_j ∈ C′} d(x_i, x_j)

Average Linkage: d_a(C, C′) = (1 / (|C| |C′|)) ∑_{x_i ∈ C, x_j ∈ C′} d(x_i, x_j)

Total Linkage: d_t(C, C′) = max_{x_i ∈ C, x_j ∈ C′} d(x_i, x_j)
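These linkages correspond to the method argument of hclust; note that hclust's name for total linkage is "complete" (the mapping to the slide's notation is my own annotation):

x <- matrix(rnorm(40), ncol = 2)              # toy data
D <- dist(x)
hc_single  <- hclust(D, method = "single")    # d_s: minimum pairwise distance
hc_average <- hclust(D, method = "average")   # d_a: average pairwise distance
hc_total   <- hclust(D, method = "complete")  # d_t: maximum pairwise distance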


Dendrogram

Binary tree associated with the agglomerative clustering procedure: it is a graphical record of the clustering process

Initialize: Each singleton cluster {x_i} corresponds to a node at height 0

Update: If two clusters C, C′ are combined, their respective nodes are joined to a parent node at height d(C, C′)

Each node of the dendrogram corresponds to a set of objects. Objects associated with two nodes are merged when forming their parent

- Leaves correspond to individual objects
- The root corresponds to all objects


Cities by Distance (blogs.sas.com)


Salmon by Genetic Similarity


Dendrogram, cont.

Note: Dendrogram T represents many possible clusterings, one for each (rooted) subtree.

Methods for selecting a clustering/subtree:

- Ad hoc selection (by eye)
- "Cutting" the dendrogram at a fixed level (see the sketch below)
- Penalized pruning

Visualization of clustering structure:

- Order objects in the same way as the leaves of the dendrogram
- Caveat: many orderings are possible
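In R, both fixed-level cutting and fixed-k selection are done with cutree; a brief sketch (the height and k values are illustrative):

hc <- hclust(dist(scale(mtcars)))
cutree(hc, k = 3)   # the clustering with k = 3 clusters
cutree(hc, h = 5)   # cut every branch of the dendrogram at height 5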


Cars Data

- Samples: 32 unique cars
- Variables: 11 descriptive variables, including gas mileage, horsepower, number of cylinders, etc.
- Freely available in R: data(mtcars)
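The two dendrograms on the following slides can be reproduced directly in base R (the titles match the figure captions):

data(mtcars)
plot(hclust(dist(mtcars), method = "single"),
     main = "Single Linkage Clustering on Cars data", ylab = "Distance")
plot(hclust(dist(mtcars), method = "average"),
     main = "Average Linkage Clustering on Cars data", ylab = "Distance")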


[Figure: dendrogram titled "Single Linkage Clustering on Cars data", produced by hclust(dist(mtcars), method = "single"); leaves are the 32 car models, vertical axis is Distance.]


[Figure: dendrogram titled "Average Linkage Clustering on Cars data", produced by hclust(dist(mtcars), method = "average"); leaves are the 32 car models, vertical axis is Distance.]


TCGA Data

Gene expression data from The Cancer Genome Atlas (TCGA)

- Samples:
  - 95 Luminal A breast tumors
  - 122 Basal breast tumors
- Variables: 2000 randomly selected genes


TCGA Data

- Clustered samples (breast tumor subtype)
- Colors: Luminal A and Basal


Important Questions

- What is the right number of clusters?
- What is the right measure of distance?
- What is the best clustering method for the data?
- How robust is an observed clustering to small perturbations of the data?
- What significance can be assigned to the clusters?