K-Means: A Simple Machine Learning Algorithm to Parallelize€¦ · •Technology watch •Derwent...

21
K-Means: A Simple Machine Learning Algorithm to Parallelize Wei Wang Spring 2020 CS4823 Parallel Programming / CS6643 Parallel Processing 1

Transcript of K-Means: A Simple Machine Learning Algorithm to Parallelize€¦ · •Technology watch •Derwent...

Page 1: K-Means: A Simple Machine Learning Algorithm to Parallelize€¦ · •Technology watch •Derwent Database, contains all patents filed in the last 10 years worldwide •Searching

K-Means: A Simple Machine Learning

Algorithm to ParallelizeWei Wang

Spring 2020 CS4823 Parallel Programming / CS6643 Parallel Processing 1

Page 2: K-Means: A Simple Machine Learning Algorithm to Parallelize€¦ · •Technology watch •Derwent Database, contains all patents filed in the last 10 years worldwide •Searching

Areas that Frequently Use Parallel Computing

• High Performance/Scientific Computing• Requires knowledge numerical algorithms

• Web services• Mostly parallelization is done in the underlying framework (e.g., web

server), does not require programmers to parallelize their code

• Artificial Intelligence and Machine Learning• Used by games, data analysis, intelligent assistance

• Mostly likely you will need to parallelize one AI/ML algorithm in your job

Spring 2020 CS4823 Parallel Programming / CS6643 Parallel Processing 2

Page 3: K-Means: A Simple Machine Learning Algorithm to Parallelize€¦ · •Technology watch •Derwent Database, contains all patents filed in the last 10 years worldwide •Searching

One Algorithm for the Whole Semester

• We will use on ML algorithm – K-Means – as the example for this semester

• Hopefully using ML algorithms will be more beneficial to your future career than numerical algorithms

• The programming assignment will be parallelizing K-Means using different programming models: OpenMP, Pthread, MPI, CUDA etc

• You still need to know how to parallelize other algorithms – we will do in-class exercises for those

Spring 2020 CS4823 Parallel Programming / CS6643 Parallel Processing 3

Page 4: K-Means: A Simple Machine Learning Algorithm to Parallelize€¦ · •Technology watch •Derwent Database, contains all patents filed in the last 10 years worldwide •Searching

Unsupervised Learning and Clustering with K-Means

Spring 2020 CS4823 Parallel Programming / CS6643 Parallel Processing 4

Slides based on from Dr. Rita Osadchy’s lecture

Page 5: K-Means: A Simple Machine Learning Algorithm to Parallelize€¦ · •Technology watch •Derwent Database, contains all patents filed in the last 10 years worldwide •Searching

What is Clustering?

• Inputs:• Data points 𝑝1, 𝑝2, ⋯ , 𝑝𝑚 ∈ 𝑅2

• Outputs: group input points into classes of similar objects –cohesive “clusters.”

• High intra-cluster similarity

• Low inter-cluster similarity

• The most basic type of unsupervised learning

Spring 2020 CS4823 Parallel Programming / CS6643 Parallel Processing 5

Page 6: K-Means: A Simple Machine Learning Algorithm to Parallelize€¦ · •Technology watch •Derwent Database, contains all patents filed in the last 10 years worldwide •Searching

First (?) Application of Clustering

• John Snow, a London physician plotted the location of cholera deaths on a map during an outbreak in the 1850s.

• The locations indicated that cases were clustered around certain intersections where there were polluted wells -- thus exposing both the problem and the solution.

Spring 2020 CS4823 Parallel Programming / CS6643 Parallel Processing 6

Page 7: K-Means: A Simple Machine Learning Algorithm to Parallelize€¦ · •Technology watch •Derwent Database, contains all patents filed in the last 10 years worldwide •Searching

Application of Clustering

• Astronomy• SkyCat: Clustered 2x10 9 sky objects into stars, galaxies, quasars, etc

based on radiation emitted in different spectrum bands.

Spring 2020 CS4823 Parallel Programming / CS6643 Parallel Processing 7

Page 8: K-Means: A Simple Machine Learning Algorithm to Parallelize€¦ · •Technology watch •Derwent Database, contains all patents filed in the last 10 years worldwide •Searching

Applications of Clustering cont’d

• Image Segmentation• Find interesting “objects” in images to focus attention at

Spring 2020 CS4823 Parallel Programming / CS6643 Parallel Processing 8

M. Pawan Kumar and D. Koller, REGION SELECTION FOR SCENE UNDERSTANDING,

Page 9: K-Means: A Simple Machine Learning Algorithm to Parallelize€¦ · •Technology watch •Derwent Database, contains all patents filed in the last 10 years worldwide •Searching

Applications of Clustering cont’d

• Data Mining• Google news

• Technology watch• Derwent Database, contains all patents filed in the last 10 years worldwide

• Searching by keywords leads to thousands of documents

• Find clusters in the database and find if there are any emerging technologies and what competition is up to

• Marketing• Customer database

• Find clusters of customers and tailor marketing schemes to them

Spring 2020 CS4823 Parallel Programming / CS6643 Parallel Processing 9

Page 10: K-Means: A Simple Machine Learning Algorithm to Parallelize€¦ · •Technology watch •Derwent Database, contains all patents filed in the last 10 years worldwide •Searching

K-Means Algorithm: An Example –Step 1

Spring 2020 CS4823 Parallel Programming / CS6643 Parallel Processing 10

-picture by Andrew Ng

Initial data points to

be clustered

Page 11: K-Means: A Simple Machine Learning Algorithm to Parallelize€¦ · •Technology watch •Derwent Database, contains all patents filed in the last 10 years worldwide •Searching

K-Means Algorithm: An Example –Step 2

Spring 2020 CS4823 Parallel Programming / CS6643 Parallel Processing 11

-picture by Andrew Ng

Two cluster centers

randomly generated

Page 12: K-Means: A Simple Machine Learning Algorithm to Parallelize€¦ · •Technology watch •Derwent Database, contains all patents filed in the last 10 years worldwide •Searching

K-Means Algorithm: An Example –Step 3

Spring 2020 CS4823 Parallel Programming / CS6643 Parallel Processing 12

-picture by Andrew Ng

Data points partitioned into two

groups based on their distances to the two centers

Page 13: K-Means: A Simple Machine Learning Algorithm to Parallelize€¦ · •Technology watch •Derwent Database, contains all patents filed in the last 10 years worldwide •Searching

K-Means Algorithm: An Example –Step 4

Spring 2020 CS4823 Parallel Programming / CS6643 Parallel Processing 13

-picture by Andrew Ng

Two new centers are computed for the two

clusters

Page 14: K-Means: A Simple Machine Learning Algorithm to Parallelize€¦ · •Technology watch •Derwent Database, contains all patents filed in the last 10 years worldwide •Searching

K-Means Algorithm: An Example –Step 5

Spring 2020 CS4823 Parallel Programming / CS6643 Parallel Processing 14

-picture by Andrew Ng

Data points repartitioned based on the new centers

Page 15: K-Means: A Simple Machine Learning Algorithm to Parallelize€¦ · •Technology watch •Derwent Database, contains all patents filed in the last 10 years worldwide •Searching

K-Means Algorithm: An Example –Step 6

Spring 2020 CS4823 Parallel Programming / CS6643 Parallel Processing 15

-picture by Andrew Ng

New centers calculated based on the latest

partition, and we have found the two cluster

Page 16: K-Means: A Simple Machine Learning Algorithm to Parallelize€¦ · •Technology watch •Derwent Database, contains all patents filed in the last 10 years worldwide •Searching

K-Means Algorithm: An Example – All Steps

Spring 2020 CS4823 Parallel Programming / CS6643 Parallel Processing 16

-picture by Andrew Ng

Page 17: K-Means: A Simple Machine Learning Algorithm to Parallelize€¦ · •Technology watch •Derwent Database, contains all patents filed in the last 10 years worldwide •Searching

The k-means Clustering Algorithm

1. Initialized cluster centers 𝑢1, 𝑢2, ⋯ , 𝑢𝑘 randomly

2. Repeat multiple times (or until convergence):1. for every data point xi, find the nearest center to it

let the nearest center of pi be ci:𝑐𝑖 = 𝑗 𝑝𝑖 − 𝑢𝑗 𝑖𝑠 𝑡ℎ𝑒 𝑚𝑖𝑛𝑚𝑢𝑚 𝑓𝑜𝑟 𝑎𝑙𝑙 𝑐𝑒𝑛𝑡𝑒𝑟𝑠

Essentially, we are assigning 𝑝𝑖 to the cluster that centered at 𝑢𝑗2. Update the centers of each new cluster

𝑢𝑗 = 𝑎𝑣𝑒𝑟𝑎𝑔𝑒 𝑚𝑒𝑎𝑛 𝑜𝑓 𝑝𝑜𝑖𝑛𝑡𝑠 𝑎𝑠𝑠𝑖𝑔𝑛𝑒𝑑 𝑡𝑜 𝑐𝑙𝑢𝑠𝑡𝑒𝑟 𝑗

=σ𝑖,𝑐𝑖==𝑗

𝑝𝑖

𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑝𝑜𝑖𝑛𝑡𝑠 𝑖𝑛 𝑐𝑙𝑢𝑠𝑡𝑒𝑟 𝑗

Spring 2020 CS4823 Parallel Programming / CS6643 Parallel Processing 17

Page 18: K-Means: A Simple Machine Learning Algorithm to Parallelize€¦ · •Technology watch •Derwent Database, contains all patents filed in the last 10 years worldwide •Searching

K-Means Algorithm: The Intuition

• The inner-loop of the algorithm repeatedly carries out two steps: 1) Assign each point 𝑝𝑖 to the closest cluster center.

2) Moving each cluster center to the mean of the points assigned to it.

Spring 2020 CS4823 Parallel Programming / CS6643 Parallel Processing 18

Page 19: K-Means: A Simple Machine Learning Algorithm to Parallelize€¦ · •Technology watch •Derwent Database, contains all patents filed in the last 10 years worldwide •Searching

K-Means Algorithm: Pseudocode

• Algorithm in next slide

Spring 2020 CS4823 Parallel Programming / CS6643 Parallel Processing 19

Page 20: K-Means: A Simple Machine Learning Algorithm to Parallelize€¦ · •Technology watch •Derwent Database, contains all patents filed in the last 10 years worldwide •Searching

Spring 2020 CS4823 Parallel Programming / CS6643 Parallel Processing 20

K-Means Pseudocode

------------------------------------------------------------------

Input: p = {p[1], p[2], …, p[m]} /* m points to cluster */

k /*number of clusters */

Iter /* number of iterations */

Output: u = {u[1], u[2], …, u[k]} /* centers of the clusters */

------------------------------------------------------------------

for j = 1 to k

u[i] = random(); /* initialize each center to be a random point */

endfor

for l = 1 to Iter

/* find the nearest center to each point */

for i = 1 to m

min_dist = MAX_DOUBLE; /* init distance to the nearest center to be maximum value of DOUBLE number */

for j = 1 to k /* find the cluster c[i] for p[i] by computing p[i]’s distance to all centers */

dist = distance(p[i], u[j]); /* compute the distance between p[i] and u[j] */

if (dist < min_dist) /* if u[j] is nearer, than put p[i] in cluster j */

min_dist = dist;

c[i] = j;

endif

endfor

endfor

/* based on the cluster assignment, update the center for each cluster */

for j = 1 to k

sum = 0;

cluster_size = 0;

for i = 1 to m

if (c[i] == j) /* point p[i] is in cluster j, add p[i] to sum */

sum += p[i];

cluster_size++; /* increase the number points found in cluster j */

endif

endfor

if(cluster_size > 0)

u[j] = sum / cluster_size; /* computer the average of the points in this cluster */

else

u[j] = random(); /* if on points are assigned to cluster j, then we need to re-generate a new center for it */

endif

endfor

endfor

return u = {u[1], u[2], …, u[k]};

Page 21: K-Means: A Simple Machine Learning Algorithm to Parallelize€¦ · •Technology watch •Derwent Database, contains all patents filed in the last 10 years worldwide •Searching

Spring 2020 CS4823 Parallel Programming / CS6643 Parallel Processing 21

Notes on summing points and getting average:

Because the each point has two coordinates x and y, we should sum the the x coordinate and y coordinate for all points, and

get the average of the x coordinate and y coordinate.

Compute Euclidean Distance: function distance(p, u)

---------------------------------------------------------------------

Input p = (px, py); /* each point has two coordinates x and y */

u = (ux, uy); /* each center point also has two coordinates */

Output dist; /* distance between p and u */

---------------------------------------------------------------------

dist = sqrt(pow(px - ux, 2) + pow(py – uy, 2));

return dist;