K-Means: A Simple Machine Learning Algorithm to Parallelize€¦ · •Technology watch •Derwent...

K-Means: A Simple Machine Learning

Algorithm to ParallelizeWei Wang

Spring 2020 CS4823 Parallel Programming / CS6643 Parallel Processing 1

Areas that Frequently Use Parallel Computing

• High Performance/Scientific Computing• Requires knowledge numerical algorithms

• Web services• Mostly parallelization is done in the underlying framework (e.g., web

server), does not require programmers to parallelize their code

• Artificial Intelligence and Machine Learning• Used by games, data analysis, intelligent assistance

• Mostly likely you will need to parallelize one AI/ML algorithm in your job


One Algorithm for the Whole Semester

• We will use on ML algorithm – K-Means – as the example for this semester

• Hopefully using ML algorithms will be more beneficial to your future career than numerical algorithms

• The programming assignment will be parallelizing K-Means using different programming models: OpenMP, Pthread, MPI, CUDA etc

• You still need to know how to parallelize other algorithms – we will do in-class exercises for those


Unsupervised Learning and Clustering with K-Means


Slides based on from Dr. Rita Osadchy’s lecture

What is Clustering?

• Inputs:• Data points 𝑝1, 𝑝2, ⋯ , 𝑝𝑚 ∈ 𝑅2

• Outputs: group input points into classes of similar objects –cohesive “clusters.”

• High intra-cluster similarity

• Low inter-cluster similarity

• The most basic type of unsupervised learning


First (?) Application of Clustering

• John Snow, a London physician plotted the location of cholera deaths on a map during an outbreak in the 1850s.

• The locations indicated that cases were clustered around certain intersections where there were polluted wells -- thus exposing both the problem and the solution.


Application of Clustering

• Astronomy• SkyCat: Clustered 2x10 9 sky objects into stars, galaxies, quasars, etc

based on radiation emitted in different spectrum bands.


Applications of Clustering cont’d

• Image Segmentation• Find interesting “objects” in images to focus attention at


M. Pawan Kumar and D. Koller, REGION SELECTION FOR SCENE UNDERSTANDING,

Applications of Clustering cont’d

• Data Mining• Google news

• Technology watch• Derwent Database, contains all patents filed in the last 10 years worldwide

• Searching by keywords leads to thousands of documents

• Find clusters in the database and find if there are any emerging technologies and what competition is up to

• Marketing• Customer database

• Find clusters of customers and tailor marketing schemes to them


K-Means Algorithm: An Example –Step 1


-picture by Andrew Ng

Initial data points to

be clustered




Two cluster centers

randomly generated




Data points partitioned into two

groups based on their distances to the two centers




Two new centers are computed for the two

clusters




Data points repartitioned based on the new centers




New centers calculated based on the latest

partition, and we have found the two cluster

K-Means Algorithm: An Example – All Steps



The k-means Clustering Algorithm

1. Initialized cluster centers 𝑢1, 𝑢2, ⋯ , 𝑢𝑘 randomly

2. Repeat multiple times (or until convergence):1. for every data point xi, find the nearest center to it

let the nearest center of pi be ci:𝑐𝑖 = 𝑗 𝑝𝑖 − 𝑢𝑗 𝑖𝑠 𝑡ℎ𝑒 𝑚𝑖𝑛𝑚𝑢𝑚 𝑓𝑜𝑟 𝑎𝑙𝑙 𝑐𝑒𝑛𝑡𝑒𝑟𝑠

Essentially, we are assigning 𝑝𝑖 to the cluster that centered at 𝑢𝑗2. Update the centers of each new cluster

𝑢𝑗 = 𝑎𝑣𝑒𝑟𝑎𝑔𝑒 𝑚𝑒𝑎𝑛 𝑜𝑓 𝑝𝑜𝑖𝑛𝑡𝑠 𝑎𝑠𝑠𝑖𝑔𝑛𝑒𝑑 𝑡𝑜 𝑐𝑙𝑢𝑠𝑡𝑒𝑟 𝑗

=σ𝑖,𝑐𝑖==𝑗

𝑝𝑖

𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑝𝑜𝑖𝑛𝑡𝑠 𝑖𝑛 𝑐𝑙𝑢𝑠𝑡𝑒𝑟 𝑗


K-Means Algorithm: The Intuition

• The inner-loop of the algorithm repeatedly carries out two steps: 1) Assign each point 𝑝𝑖 to the closest cluster center.

2) Moving each cluster center to the mean of the points assigned to it.


K-Means Algorithm: Pseudocode

• Algorithm in next slide



K-Means Pseudocode

------------------------------------------------------------------

Input: p = {p[1], p[2], …, p[m]} /* m points to cluster */

k /*number of clusters */

Iter /* number of iterations */

Output: u = {u[1], u[2], …, u[k]} /* centers of the clusters */

------------------------------------------------------------------

for j = 1 to k

u[i] = random(); /* initialize each center to be a random point */

endfor

for l = 1 to Iter

/* find the nearest center to each point */

for i = 1 to m

min_dist = MAX_DOUBLE; /* init distance to the nearest center to be maximum value of DOUBLE number */

for j = 1 to k /* find the cluster c[i] for p[i] by computing p[i]’s distance to all centers */

dist = distance(p[i], u[j]); /* compute the distance between p[i] and u[j] */

if (dist < min_dist) /* if u[j] is nearer, than put p[i] in cluster j */

min_dist = dist;

c[i] = j;

endif

endfor

endfor

/* based on the cluster assignment, update the center for each cluster */

for j = 1 to k

sum = 0;

cluster_size = 0;

for i = 1 to m

if (c[i] == j) /* point p[i] is in cluster j, add p[i] to sum */

sum += p[i];

cluster_size++; /* increase the number points found in cluster j */

endif

endfor

if(cluster_size > 0)

u[j] = sum / cluster_size; /* computer the average of the points in this cluster */

else

u[j] = random(); /* if on points are assigned to cluster j, then we need to re-generate a new center for it */

endif

endfor

endfor

return u = {u[1], u[2], …, u[k]};


Notes on summing points and getting average:

Because the each point has two coordinates x and y, we should sum the the x coordinate and y coordinate for all points, and

get the average of the x coordinate and y coordinate.

Compute Euclidean Distance: function distance(p, u)

---------------------------------------------------------------------

Input p = (px, py); /* each point has two coordinates x and y */

u = (ux, uy); /* each center point also has two coordinates */

Output dist; /* distance between p and u */

---------------------------------------------------------------------

dist = sqrt(pow(px - ux, 2) + pow(py – uy, 2));

return dist;

K-Means: A Simple Machine Learning Algorithm to Parallelize€¦ · •Technology watch •Derwent...

Documents

Transcript of K-Means: A Simple Machine Learning Algorithm to Parallelize€¦ · •Technology watch •Derwent...