K-Means: A Simple Machine Learning Algorithm to Parallelize
Wei Wang
Spring 2020 CS4823 Parallel Programming / CS6643 Parallel Processing
Areas that Frequently Use Parallel Computing
• High Performance/Scientific Computing
  • Requires knowledge of numerical algorithms
• Web services
  • Parallelization is mostly done in the underlying framework (e.g., the web server), so programmers do not need to parallelize their own code
• Artificial Intelligence and Machine Learning
  • Used by games, data analysis, intelligent assistants
  • Most likely you will need to parallelize an AI/ML algorithm in your job
One Algorithm for the Whole Semester
• We will use one ML algorithm – K-Means – as the example for this semester
• Hopefully, using ML algorithms will be more beneficial to your future career than numerical algorithms
• The programming assignments will parallelize K-Means using different programming models: OpenMP, Pthreads, MPI, CUDA, etc.
• You still need to know how to parallelize other algorithms – we will do in-class exercises for those
Unsupervised Learning and Clustering with K-Means
Slides based on Dr. Rita Osadchy’s lecture
What is Clustering?
• Inputs:
  • Data points 𝑝1, 𝑝2, ⋯ , 𝑝𝑚 ∈ ℝ²
• Outputs: group the input points into classes of similar objects – cohesive “clusters.”
  • High intra-cluster similarity
  • Low inter-cluster similarity
• The most basic type of unsupervised learning
First (?) Application of Clustering
• John Snow, a London physician, plotted the locations of cholera deaths on a map during an outbreak in the 1850s.
• The locations indicated that cases were clustered around certain intersections where there were polluted wells -- thus exposing both the problem and the solution.
Application of Clustering
• Astronomy
  • SkyCat: clustered 2×10⁹ sky objects into stars, galaxies, quasars, etc., based on radiation emitted in different spectrum bands.
Applications of Clustering cont’d
• Image Segmentation
  • Find interesting “objects” in images to focus attention on
M. Pawan Kumar and D. Koller, “Region Selection for Scene Understanding”
Applications of Clustering cont’d
• Data Mining
  • Google News
• Technology watch
  • Derwent Database: contains all patents filed worldwide in the last 10 years
  • Searching by keywords leads to thousands of documents
  • Find clusters in the database to see whether there are any emerging technologies and what the competition is up to
• Marketing
  • Customer database
  • Find clusters of customers and tailor marketing schemes to them
K-Means Algorithm: An Example – Step 1
-picture by Andrew Ng
Initial data points to be clustered
K-Means Algorithm: An Example – Step 2
-picture by Andrew Ng
Two cluster centers randomly generated
K-Means Algorithm: An Example – Step 3
-picture by Andrew Ng
Data points partitioned into two groups based on their distances to the two centers
K-Means Algorithm: An Example – Step 4
-picture by Andrew Ng
Two new centers are computed for the two clusters
K-Means Algorithm: An Example – Step 5
-picture by Andrew Ng
Data points repartitioned based on the new centers
K-Means Algorithm: An Example – Step 6
-picture by Andrew Ng
New centers calculated based on the latest partition, and we have found the two clusters
K-Means Algorithm: An Example – All Steps
-picture by Andrew Ng
The k-means Clustering Algorithm
1. Initialize cluster centers 𝑢1, 𝑢2, ⋯ , 𝑢𝑘 randomly
2. Repeat multiple times (or until convergence):
   1. For every data point 𝑝𝑖, find the nearest center. Let the nearest center of 𝑝𝑖 be 𝑐𝑖:
      𝑐𝑖 = the index 𝑗 such that |𝑝𝑖 − 𝑢𝑗| is the minimum over all centers
      Essentially, we are assigning 𝑝𝑖 to the cluster centered at 𝑢𝑗
   2. Update the center of each new cluster:
      𝑢𝑗 = average (mean) of points assigned to cluster 𝑗
         = ( Σ over all 𝑖 with 𝑐𝑖 == 𝑗 of 𝑝𝑖 ) / ( number of points in cluster 𝑗 )
K-Means Algorithm: The Intuition
• The inner loop of the algorithm repeatedly carries out two steps:
  1) Assign each point 𝑝𝑖 to the closest cluster center.
  2) Move each cluster center to the mean of the points assigned to it.
K-Means Algorithm: Pseudocode
• Algorithm in next slide
K-Means Pseudocode
------------------------------------------------------------------
Input: p = {p[1], p[2], …, p[m]} /* m points to cluster */
k /*number of clusters */
Iter /* number of iterations */
Output: u = {u[1], u[2], …, u[k]} /* centers of the clusters */
------------------------------------------------------------------
for j = 1 to k
u[j] = random(); /* initialize each center to be a random point */
endfor
for l = 1 to Iter
/* find the nearest center to each point */
for i = 1 to m
min_dist = MAX_DOUBLE; /* initialize the distance to the nearest center to the maximum double value */
for j = 1 to k /* find the cluster c[i] for p[i] by computing p[i]’s distance to all centers */
dist = distance(p[i], u[j]); /* compute the distance between p[i] and u[j] */
if (dist < min_dist) /* if u[j] is nearer, then put p[i] in cluster j */
min_dist = dist;
c[i] = j;
endif
endfor
endfor
/* based on the cluster assignment, update the center for each cluster */
for j = 1 to k
sum = 0;
cluster_size = 0;
for i = 1 to m
if (c[i] == j) /* point p[i] is in cluster j, add p[i] to sum */
sum += p[i];
cluster_size++; /* increase the number of points found in cluster j */
endif
endfor
if(cluster_size > 0)
u[j] = sum / cluster_size; /* compute the average of the points in this cluster */
else
u[j] = random(); /* if no points are assigned to cluster j, re-generate a new center for it */
endif
endfor
endfor
return u = {u[1], u[2], …, u[k]};
Notes on summing points and getting average:
Because each point has two coordinates, x and y, we sum the x coordinates and the y coordinates of all points separately, and then take the average of each coordinate.
Compute Euclidean Distance: function distance(p, u)
---------------------------------------------------------------------
Input p = (px, py); /* each point has two coordinates x and y */
u = (ux, uy); /* each center point also has two coordinates */
Output dist; /* distance between p and u */
---------------------------------------------------------------------
dist = sqrt(pow(px - ux, 2) + pow(py - uy, 2));
return dist;