Page 1: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL

Clustering with Spark

Sandy Ryza / Data Science / Cloudera

Me

● Data scientist at Cloudera
● Recently led Apache Spark development at Cloudera
● Before that, committing on Apache Hadoop
● Before that, studying combinatorial optimization and distributed systems at Brown

Sometimes you find yourself with lots of stuff

Large Scale Learning

Network Packets

Detect Network Intrusions

Credit Card Transactions

Detect Fraud

Movie Viewings

Recommend Movies

Unsupervised Learning

● Learn hidden structure of your data
● Interpret new data as it relates to this structure

Two Main Problems

● Designing a system for processing huge data in parallel

● Taking advantage of it with algorithms that work well in parallel

MapReduce

[Diagram: a row of Map tasks feeding a smaller row of Reduce tasks]

Key advances by MapReduce:

• Data locality: automatic split computation and launching of mappers close to their data

• Fault tolerance: writing out intermediate results and making mappers restartable made it possible to run on commodity hardware

• Linear scalability: the combination of locality and a programming model that forces developers to write generally scalable solutions to problems

MapReduce

[Diagram: a row of Map tasks feeding a smaller row of Reduce tasks]

Limitations of MapReduce:

• Each job reads data from HDFS

• No concept of a session

• Jobs are rigidly map-then-reduce

Spark is a general-purpose computation framework geared towards massive data, and more flexible than MapReduce

Extra properties:

• Leverages distributed memory

• Full directed graph expressions for data-parallel computations

• Improved developer experience

Yet retains: linear scalability, fault tolerance, and data locality

RDDs

val lines = sc.textFile("bigfile.txt")
val numbers = lines.map((x) => x.toDouble)
numbers.sum()

[Diagram: bigfile.txt (HDFS), lines, numbers, each split into partitions; sum, Driver]

RDDs

val lines = sc.textFile("bigfile.txt")
val numbers = lines.map((x) => x.toInt)
numbers.cache()
  .sum()

[Diagram: bigfile.txt (HDFS), lines, numbers, each split into partitions; sum, Driver]

numbers.sum()

[Diagram: bigfile.txt, lines, numbers (three partitions); sum, Driver]

Spark MLlib

Supervised, discrete: Classification
● Logistic regression (and regularized variants)
● Linear SVM
● Naive Bayes
● Random Decision Forests (soon)

Supervised, continuous: Regression
● Linear regression (and regularized variants)

Unsupervised, discrete: Clustering
● K-means

Unsupervised, continuous: Dimensionality reduction, matrix factorization
● Principal component analysis / singular value decomposition
● Alternating least squares

Using it

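import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load the data: one point per line, coordinates separated by spaces
val data = sc.textFile("kmeans_data.txt")

// KMeans.train expects an RDD of MLlib vectors (Spark 1.0 and later)
val parsedData = data.map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

// Cluster the data into two classes using KMeans
val numIterations = 20
val numClusters = 2
val clusters = KMeans.train(parsedData, numClusters, numIterations)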

K-Means

● Choose some initial centers
● Then alternate between two steps:
  ○ Assign each point to a cluster based on existing centers
  ○ Recompute cluster centers from the points in each cluster
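
The two alternating steps can be sketched on a single machine in plain Scala. This is only an illustration under stated assumptions, not MLlib's implementation: points are plain `Array[Double]`, the iteration count is fixed, and the names `LloydSketch`, `closest`, `step`, and `run` are hypothetical.

```scala
object LloydSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Assignment step: index of the center closest to the point.
  def closest(centers: Seq[Point], p: Point): Int =
    centers.indices.minBy(j => squaredDistance(centers(j), p))

  // Update step: each center becomes the mean of the points assigned to it.
  // A center with no assigned points is left where it is.
  def step(points: Seq[Point], centers: Seq[Point]): Seq[Point] = {
    val assigned = points.groupBy(p => closest(centers, p))
    centers.indices.map { j =>
      assigned.get(j) match {
        case Some(members) =>
          val dims = members.head.length
          Array.tabulate(dims)(d => members.map(_(d)).sum / members.size)
        case None => centers(j)
      }
    }
  }

  // Alternate the two steps for a fixed number of iterations.
  def run(points: Seq[Point], initial: Seq[Point], iterations: Int): Seq[Point] =
    (1 to iterations).foldLeft(initial)((centers, _) => step(points, centers))
}
```

For example, `LloydSketch.run(points, initialCenters, 20)` returns the centers after 20 rounds of assignment and update.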

K-Means - very parallelizable

● Alternate between two steps:
  ○ Assign each point to a cluster based on existing centers
    ■ Process each data point independently
  ○ Recompute cluster centers from the points in each cluster
    ■ Average across partitions

// Find the sum and count of points mapping to each center
val totalContribs = data.mapPartitions { points =>
  val k = centers.length
  val dims = centers(0).vector.length

  val sums = Array.fill(k)(BDV.zeros[Double](dims).asInstanceOf[BV[Double]])
  val counts = Array.fill(k)(0L)

  points.foreach { point =>
    val (bestCenter, cost) = KMeans.findClosest(centers, point)
    costAccum += cost
    sums(bestCenter) += point.vector
    counts(bestCenter) += 1
  }

  val contribs = for (j <- 0 until k) yield {
    (j, (sums(j), counts(j)))
  }
  contribs.iterator
}.reduceByKey(mergeContribs).collectAsMap()
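
The slide references `mergeContribs` without showing it. A minimal sketch of what such a merge could look like, assuming it adds the sums element-wise and adds the counts, and using plain `Array[Double]` in place of Breeze vectors. The partitions are simulated with nested sequences, and the names `ContribsSketch`, `partitionContribs`, and `totalContribs` are hypothetical. Unlike the slide's code, centers that attract no points are simply absent from the result.

```scala
object ContribsSketch {
  type Point = Array[Double]
  // (sum of points, number of points) contributed to one center
  type Contrib = (Array[Double], Long)

  // Assumed behavior of the merge: add sums element-wise, add counts.
  def mergeContribs(a: Contrib, b: Contrib): Contrib =
    (a._1.zip(b._1).map { case (x, y) => x + y }, a._2 + b._2)

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // What one partition computes locally: a (sum, count) pair per center index.
  def partitionContribs(points: Seq[Point], centers: Seq[Point]): Map[Int, Contrib] =
    points
      .groupBy(p => centers.indices.minBy(j => squaredDistance(centers(j), p)))
      .map { case (j, ps) => j -> ps.map(p => (p, 1L)).reduce(mergeContribs) }

  // Merge the per-partition results by center index (the role of reduceByKey).
  def totalContribs(partitions: Seq[Seq[Point]], centers: Seq[Point]): Map[Int, Contrib] =
    partitions
      .flatMap(part => partitionContribs(part, centers).toSeq)
      .groupBy(_._1)
      .map { case (j, cs) => j -> cs.map(_._2).reduce(mergeContribs) }
}
```

Because the merge is associative and commutative, the partial results can be combined in any order, which is what makes the averaging step safe to run across partitions.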

// Update the cluster centers and costs
var changed = false
var j = 0
while (j < k) {
  val (sum, count) = totalContribs(j)
  if (count != 0) {
    sum /= count.toDouble
    val newCenter = new BreezeVectorWithNorm(sum)
    if (KMeans.fastSquaredDistance(newCenter, centers(j)) > epsilon * epsilon) {
      changed = true
    }
    centers(j) = newCenter
  }
  j += 1
}
if (!changed) {
  logInfo("Run " + run + " finished in " + (iteration + 1) + " iterations")
}
cost = costAccum.value

The Problem

● K-Means is very sensitive to the initial set of center points chosen.

● The best existing algorithm for choosing centers is highly sequential.

K-Means++

● Start with a random point from the dataset
● Pick another one randomly, with probability proportional to its squared distance from the closest center already chosen
● Repeat until k initial centers are chosen
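
The seeding procedure can be sketched in plain Scala as follows. It is a single-machine illustration, not MLlib's code: each candidate is weighted by its squared distance to the nearest center chosen so far (the D² weighting used by K-Means++), and the names `KMeansPlusPlusSketch`, `costTo`, `sampleIndex`, and `chooseCenters` are hypothetical.

```scala
import scala.util.Random

object KMeansPlusPlusSketch {
  type Point = Array[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Squared distance from p to its nearest already-chosen center.
  def costTo(centers: Seq[Point], p: Point): Double =
    centers.map(c => squaredDistance(c, p)).min

  // Pick an index with probability proportional to its weight.
  def sampleIndex(weights: Seq[Double], rng: Random): Int = {
    val total = weights.sum
    if (total <= 0.0) rng.nextInt(weights.size) // every point coincides with a center
    else {
      val target = rng.nextDouble() * total
      val cumulative = weights.scanLeft(0.0)(_ + _).tail
      val i = cumulative.indexWhere(_ > target)
      if (i >= 0) i else weights.lastIndexWhere(_ > 0.0) // guard against rounding
    }
  }

  // One uniformly random first center, then k - 1 weighted picks.
  // Each pick recomputes the weights, which is one pass over the data.
  def chooseCenters(points: IndexedSeq[Point], k: Int, rng: Random): Seq[Point] = {
    require(k >= 1 && points.nonEmpty)
    val first = points(rng.nextInt(points.size))
    (1 until k).foldLeft(Vector(first)) { (centers, _) =>
      val weights = points.map(p => costTo(centers, p))
      centers :+ points(sampleIndex(weights, rng))
    }
  }
}
```

A point that is already a center has weight zero, so it is not picked again while any other point remains at a positive distance.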

K-Means++

● The initial clustering has expected cost within a factor of O(log k) of the optimum

K-Means++

● Requires k passes over the data

K-Means||

● Do only a few (~5) passes
● Sample m points on each pass
● Oversample
● Run K-Means++ on sampled points to find initial centers
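
The oversampling passes can be sketched in plain Scala as below. The slide does not give the sampling rule, so this assumes the one from the K-Means|| paper: on each pass, every point is kept independently with probability m · d²(p, C) / cost(C), which yields about m new candidates per pass. Only the candidate collection is shown; the final reduction to k centers (running K-Means++ on the candidates) is left out. The names `KMeansParallelSketch`, `samplePass`, and `candidates` are hypothetical.

```scala
import scala.util.Random

object KMeansParallelSketch {
  type Point = Array[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Squared distance from p to its nearest current candidate.
  def costTo(centers: Seq[Point], p: Point): Double =
    centers.map(c => squaredDistance(c, p)).min

  // One pass: keep each point independently with probability
  // m * cost(p) / totalCost, so roughly m points are kept.
  // Points far from every current candidate are the most likely to be kept.
  def samplePass(points: Seq[Point], centers: Seq[Point], m: Double, rng: Random): Seq[Point] = {
    val costs = points.map(p => costTo(centers, p))
    val total = costs.sum
    if (total == 0.0) Seq.empty // every point already coincides with a candidate
    else points.zip(costs)
      .filter { case (_, c) => rng.nextDouble() < m * c / total }
      .map(_._1)
  }

  // Start from one random point and run a small, fixed number of passes.
  def candidates(points: IndexedSeq[Point], m: Double, passes: Int, rng: Random): Seq[Point] = {
    val first = points(rng.nextInt(points.size))
    (1 to passes).foldLeft(Seq(first)) { (centers, _) =>
      centers ++ samplePass(points, centers, m, rng)
    }
  }
}
```

Each pass depends only on the candidates from earlier passes, so the per-point sampling within a pass can run independently on every partition.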

Then on the full data...
