Download - CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Page 1: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

CS260: Machine Learning AlgorithmsLecture 5: Clustering

Cho-Jui HsiehUCLA

Jan 23, 2019

Page 2: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Supervised versus Unsupervised Learning

Supervised Learning:

Learning from labeled observations

Classification, regression, . . .

Unsupervised Learning:

Learning from unlabeled observations

Discover hidden patterns

Clustering (today)

Page 3: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Kmeans Clustering

Page 4: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Clustering

Given {x1, x2, . . . , xn} and K (number of clusters)

Output A(xi ) ∈ {1, 2, . . . ,K} (cluster membership)

Page 5: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Two circles

Can we split the data into two clusters?

Page 6: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Two circles

Can we split the data into two clusters?

Page 7: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Clustering is Subjective

Non-trivial to say one partition is better than others

Each algorithm has two parts:Define the objective functionDesign an algorithm to minimize this objective function

Page 8: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

K-means Objective Function

Partition dataset into C1,C2, . . . ,CK to minimize the followingobjective:

J =K∑

k=1

∑x∈Ck

‖x −mk‖22,

where mk is the mean of Ck .

Multiple ways to minimize this objective

Hierarchical Agglomerative ClusteringKmeans Algorithm (Today). . .

Page 9: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

K-means Objective Function

Partition dataset into C1,C2, . . . ,CK to minimize the followingobjective:

J =K∑

k=1

∑x∈Ck

‖x −mk‖22,

where mk is the mean of Ck .

Multiple ways to minimize this objective

Hierarchical Agglomerative ClusteringKmeans Algorithm (Today). . .

Page 10: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

K-means Algorithm

Page 11: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

K-means Algorithm

Re-write objective:

J =N∑

n=1

K∑k=1

rnk‖xn −mk‖22,

where rnk ∈ {0, 1} is an indicator variable

rnk = 1 if and only if xn ∈ Ck

Alternative optimization between {rnk} and {mk}Fix {mk} and update {rnk}Fix {rnk} and update {mk}

Page 12: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

K-means Algorithm

Step 0: Initialize {mk} to some values

Step 1: Fix {mk} and minimize over {rnk}:

rnk =

{1 if k = arg minj ‖xn −mj‖220 otherwise

Step 2: Fix {rnk} and minimize over {mk}:

mk =

∑n rnkxn∑n rnk

Step 3: Return to step 1 unless stopping criterion is met

Page 13: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

K-means Algorithm

rnk =

mk =

∑n rnkxn∑n rnk

Page 14: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

K-means Algorithm

rnk =

mk =

∑n rnkxn∑n rnk

Page 15: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

K-means Algorithm

rnk =

mk =

∑n rnkxn∑n rnk

Page 16: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

K-means Algorithm

Equivalent to the following procedure:

Step 0: Initialize centers {mk} to some values

Step 1: Assign each xn to the nearest center:

A(xn) = arg minj‖xn −mj‖22

Update clusters:

Ck = {xn : A(xn) = k} ∀k = 1, . . . ,K

Step 2: Calculate mean of each cluster Ck :

mk =1

|Ck |∑

xn∈Ck

xn

Page 17: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

More on K-means Algorithm

Always decrease the objective function for each update

Objective function will remain unchanged when step 1 doesn’t changecluster assignment ⇒ Converged

May not converge to global minimum

Sensitive to initial values

Kmeans++: A better way to initialize the clusters

Page 18: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Page 19: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Page 20: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Graph Clustering

Page 21: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Graph Clustering

Given a graph G = (V ,E ,W )

V : nodes {v1, · · · , vn}E : edges {e1, · · · , em}W : weight matrix

Wij =

{wij , if (i , j) ∈ E

0 otherwise

Goal: Partition V into k clusters of nodes

V = V1 ∪ V2 ∪ · · · ∪ Vk , Vi ∩ Vj = ϕ, ∀i , j

Page 22: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Similarly Graph

Example: similarity graph

Given samples x1, . . . , xnWeight (similarities) indicates “closeness of samples”

Page 23: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Similarity Graph

E.g., Gaussian kernel Wij = e−‖xi−xj‖2/σ2

Page 24: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Social graph

Nodes: users in social network

Edges: Wij = 1 if user i and j are friends,

otherwise Wij = 0

Page 25: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Partitioning into Two Clusters

Partition graph into two sets V1,V2 to minimize the cut value:

cut(V1,V2) =∑

vi∈V1,vj∈V2

Wij

Also, the size of V1,V2 needs to be similar (balance)

One classical way of enforcing balance:

minV1,V2

cut(V1,V2)

s.t. |V1| = |V2|, V1 ∪ V2 = {1, · · · , n},V1 ∩ V2 = ϕ

⇒ this is NP-hard (cannot be solved in polynomial time)

Page 26: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

cut(V1,V2) =∑

vi∈V1,vj∈V2

Wij

minV1,V2

cut(V1,V2)

s.t. |V1| = |V2|, V1 ∪ V2 = {1, · · · , n},V1 ∩ V2 = ϕ

Page 27: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

cut(V1,V2) =∑

vi∈V1,vj∈V2

Wij

minV1,V2

cut(V1,V2)

s.t. |V1| = |V2|, V1 ∪ V2 = {1, · · · , n},V1 ∩ V2 = ϕ

Page 28: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Kernighan-Lin Algorithm

Starts with some partitioning V1,V2

Calculate change in cut if 2 vertices are swapped

Swap the vertices (1 in V1 & 1 in V2) that decease the cut the most

Iterative until convergence

Used when we need exact balanced clusters

(e.g., circuit design)

Page 29: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Kernighan-Lin Algorithm

Starts with some partitioning V1,V2

Calculate change in cut if 2 vertices are swapped

Swap the vertices (1 in V1 & 1 in V2) that decease the cut the most

Iterative until convergence

Used when we need exact balanced clusters

(e.g., circuit design)

Page 30: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Objective function that considers balance

Ratio-Cut:

minV1,V2

{Cut(V1,V2)

|V1|+

Cut(V1,V2)

|V2|

}:= RC(V1,V2)

Normalized-Cut:

minV1,V2

{Cut(V1,V2)

deg(V1)+

Cut(V1,V2)

deg(V2)

}:= NC(V1,V2),

wheredeg(Vc) :=

∑vi∈Vc ,(i ,j)∈E

Wi ,j = links(Vc ,V )

Page 31: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Generalize to k clusters

Ratio-Cut:

minV1,··· ,Vk

k∑c=1

Cut(Vc ,V − Vc)

|Vc |

Normalized-Cut:

minV1,··· ,Vk

k∑c=1

Cut(Vc ,V − Vc)

deg(Vc)

Page 32: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Reformulation

Recall deg(Vc) = links(Vc ,V )

Define a diagonal matrix

D =

deg(V1) 0 0 · · ·

0 deg(V2) 0 · · ·0 0 deg(V3) · · ·...

......

. . .

yc = {0, 1}n: indicator vector for the c-th cluster

We have

yTc yc = |Vc |yTc Dyc = deg(Vc)

yTc W yc = links(Vc ,Vc)

Page 33: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Reformulation

Recall deg(Vc) = links(Vc ,V )

Define a diagonal matrix

D =

deg(V1) 0 0 · · ·

0 deg(V2) 0 · · ·0 0 deg(V3) · · ·...

......

. . .

yc = {0, 1}n: indicator vector for the c-th cluster

We have

yTc yc = |Vc |yTc Dyc = deg(Vc)

yTc W yc = links(Vc ,Vc)

Page 34: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Ratio Cut

Rewrite the ratio-cut objective:

RC (V1, · · · ,Vk) =k∑

c=1

Cut(Vc ,V − Vc)

|Vc |

=k∑

c=1

deg(Vc)− links(Vc ,Vc)

|Vc |

=k∑

c=1

yTc Dyc − yTc W ycyTc yc

=k∑

c=1

yTc (D −W )ycyTc yc

=k∑

c=1

yTc LycyTc yc

(L = D −W is “Graph Laplacian”)

Page 35: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

More on Graph Laplacian

L is symmetric positive semi-definite

For any x ,

xTLx =1

2

∑(i ,j)

Wij(xi − xj)2

Page 37: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Solving Ratio-Cut

We have shown Ratio-Cut is equivalent to

RCut =k∑

c=1

yTc LycyTc yc

=k∑

c=1

(yc‖yc‖

)TLyc‖yc‖

Define yc = yc/‖yc‖ (normalized indicator),

Y = [y1, y2, · · · , yk ] ⇒ Y TY = I

Relaxed to real valued problem:

minY TY=I

Trace(Y TLY )

Solution: Eigenvectors corresponding to the smallest k eigenvalues of L

Page 38: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Solving Ratio-Cut

RCut =k∑

c=1

yTc LycyTc yc

=k∑

c=1

(yc‖yc‖

)TLyc‖yc‖

Y = [y1, y2, · · · , yk ] ⇒ Y TY = I

minY TY=I

Trace(Y TLY )

Page 39: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Solving Ratio-Cut

RCut =k∑

c=1

yTc LycyTc yc

=k∑

c=1

(yc‖yc‖

)TLyc‖yc‖

Y = [y1, y2, · · · , yk ] ⇒ Y TY = I

minY TY=I

Trace(Y TLY )

Page 40: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Solving Ratio-Cut

Let Y ∗ ∈ Rn×k be these eigenvectors. Are we done?

No, Y ∗ does not have 0/1 values (not indicators)

(since we are solving a relaxed problem)

Solution: Run k-means on the rows of Y ∗

Summary of Spectral clustering algorithms:

Compute Y ∗ ∈ Rn×k : eigenvectors corresponds to k smallest eigenvaluesof (normalized) Laplacian matrixRun k-means to cluster rows of Y ∗

Page 41: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Solving Ratio-Cut

Page 42: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Solving Ratio-Cut

Page 43: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Solving Ratio-Cut

Page 44: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Eigenvectors of Laplacian

If graph is disconnected ( k connected components), Laplacian is blockdiagonal and first k eigen-vectors are:

Page 45: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

What if the graph is connected?

There will be only one smallest eigenvalue/eigenvector:

L1 = (D − A)1 = 0

(1 = [1, 1, · · · , 1]T is the eigenvector with eigenvalue 0)

However, the 2nd to k-th smallest eigenvectors are still useful forclustering

Page 46: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

L1 = (D − A)1 = 0

Page 47: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

L1 = (D − A)1 = 0

Page 48: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Normalized Cut

Rewrite Normalized Cut:

NCut =k∑

c=1

Cut(Vc ,V − Vc)

deg(Vc)

=k∑

c=1

yTc (D − A)ycyTc Dyc

Let yc = D1/2yc‖D1/2yc‖

, then

NCut =k∑

c=1

yTc D−1/2(D − A)D−1/2yc

yTc yc

Normalized Laplacian:

L = D−1/2(D − A)D−1/2 = I − D−1/2AD−1/2

Normalized Cut → eigenvectors correspond to the smallest eigenvaluesof L

Page 49: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Kmeans vs Spectral Clustering

Kmeans: decision boundary is linear

Spectral clustering: boundary can be non-convex curves

σ in Wij = e−‖xi−xj‖

2

σ2 controls the clustering results (focus on local orglobal structure)

Page 50: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Kmeans vs Spectral Clustering

Page 51: CS260: Machine Learning Algorithmsweb.cs.ucla.edu/~chohsieh/teaching/CS260_Winter2019/lecture5.pdfnkg: r nk = (1 if k = arg min j kx n m jk22 0 otherwise Step 2: Fix fr nkgand minimize

Conclusions

Kmeans objective: minimize the distance to centers

Kmeans algorithm: an optimization algorithm to minimize this objective

Graph clustering ⇒ related to eigenvectors of graphs

Questions?