Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Clustering with Spark
Sandy Ryza / Data Science / Cloudera
Me
● Data scientist at Cloudera
● Recently led Apache Spark development at Cloudera
● Before that, committing on Apache Hadoop
● Before that, studying combinatorial optimization and distributed systems at Brown
Sometimes you find yourself with lots of stuff
Large Scale Learning
● Network packets: detect network intrusions
● Credit card transactions: detect fraud
● Movie viewings: recommend movies
Unsupervised Learning
● Learn hidden structure of your data
● Interpret new data as it relates to this structure
Two Main Problems
● Designing a system for processing huge data in parallel
● Taking advantage of it with algorithms that work well in parallel
CONFIDENTIAL - RESTRICTED*
MapReduce
[Diagram: twelve Map tasks feeding four Reduce tasks]
Key advances by MapReduce:
• Data locality: Splits are computed automatically and mappers are launched close to their data
• Fault tolerance: Intermediate results are written out and mappers are restartable, which makes it possible to run on commodity hardware
• Linear scalability: Locality combined with a programming model that forces developers to write generally scalable solutions to problems
MapReduce
Limitations of MapReduce
• Each job reads its data from HDFS
• No concept of a session
• Jobs are rigidly map-then-reduce
Spark is a general purpose computation framework geared towards massive data - more flexible than MapReduce
Extra properties:
• Leverages distributed memory
• Full directed graph expressions for data-parallel computations
• Improved developer experience
Yet retains: linear scalability, fault tolerance, and data locality
RDDs
[Diagram: bigfile.txt (HDFS) → lines → numbers, each split into partitions; sum returned to the driver]
val lines = sc.textFile("bigfile.txt")
val numbers = lines.map((x) => x.toDouble)
numbers.sum()
RDDs
[Diagram: bigfile.txt (HDFS) → lines → numbers, each split into partitions; sum returned to the driver]
val lines = sc.textFile("bigfile.txt")
val numbers = lines.map((x) => x.toInt)
numbers.cache()
  .sum()
numbers.sum()
[Diagram: bigfile.txt, lines, numbers; partitions of numbers → sum at the driver]
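The effect of cache() can be illustrated without a cluster. The sketch below is an analogy in plain Scala collections, not Spark: a lazy view stands in for an uncached RDD, whose map is re-run by every action, and a materialized Vector stands in for the cached one. The object name CacheSketch and the parse counter are made up for illustration.

```scala
object CacheSketch {
  // Counts how many times a line is parsed, to make recomputation visible.
  var parses = 0

  def parse(s: String): Int = { parses += 1; s.toInt }

  val lines = Vector("1", "2", "3", "4")

  // Lazy view: like an uncached RDD, the map is re-run for every sum.
  // Returns (sum of both totals, number of parse calls).
  def uncachedTwice(): (Int, Int) = {
    parses = 0
    val numbers = lines.view.map(parse)
    val total = numbers.sum + numbers.sum
    (total, parses)
  }

  // Materialized once: like numbers.cache(), the second sum reuses the result.
  def cachedTwice(): (Int, Int) = {
    parses = 0
    val numbers = lines.view.map(parse).toVector
    val total = numbers.sum + numbers.sum
    (total, parses)
  }
}
```

In the uncached variant every line is parsed once per sum; in the cached variant it is parsed once overall.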
Spark MLlib
Supervised, discrete: Classification
● Logistic regression (and regularized variants)
● Linear SVM
● Naive Bayes
● Random Decision Forests (soon)
Supervised, continuous: Regression
● Linear regression (and regularized variants)
Unsupervised, discrete: Clustering
● K-means
Unsupervised, continuous: Dimensionality reduction, matrix factorization
● Principal component analysis / singular value decomposition
● Alternating least squares
Using it
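import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load space-separated numbers, one point per line
val data = sc.textFile("kmeans_data.txt")
// KMeans.train takes an RDD[Vector] as of Spark 1.0
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
// Cluster the data into two clusters using KMeans
val numIterations = 20
val numClusters = 2
val clusters = KMeans.train(parsedData, numClusters, numIterations)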
K-Means
● Choose some initial centers
● Then alternate between two steps:
  ○ Assign each point to a cluster based on existing centers
  ○ Recompute cluster centers from the points in each cluster
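The two alternating steps can be sketched on a single machine with plain Scala collections. This is a minimal illustration, not the MLlib implementation; the names LloydSketch, step and run are made up here, and it runs a fixed number of iterations with no convergence check.

```scala
object LloydSketch {
  type Point = Array[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  // Index of the center nearest to p.
  def closest(centers: Seq[Point], p: Point): Int =
    centers.indices.minBy(j => squaredDistance(centers(j), p))

  // One round: assign each point to its nearest center, then replace each
  // center by the mean of its points. Empty clusters keep their old center.
  def step(points: Seq[Point], centers: Seq[Point]): Seq[Point] = {
    val byCenter = points.groupBy(p => closest(centers, p))
    centers.indices.map { j =>
      byCenter.get(j) match {
        case Some(members) =>
          val dims = centers(j).length
          Array.tabulate(dims)(d => members.map(_(d)).sum / members.size)
        case None => centers(j)
      }
    }
  }

  // Repeat the round a fixed number of times, starting from `initial`.
  def run(points: Seq[Point], initial: Seq[Point], iterations: Int): Seq[Point] =
    (1 to iterations).foldLeft(initial)((cs, _) => step(points, cs))
}
```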
K-Means - very parallelizable
● Alternate between two steps:
  ○ Assign each point to a cluster based on existing centers
    ■ Process each data point independently
  ○ Recompute cluster centers from the points in each cluster
    ■ Average across partitions
// Find the sum and count of points mapping to each center
val totalContribs = data.mapPartitions { points =>
  val k = centers.length
  val dims = centers(0).vector.length
  val sums = Array.fill(k)(BDV.zeros[Double](dims).asInstanceOf[BV[Double]])
  val counts = Array.fill(k)(0L)
  points.foreach { point =>
    val (bestCenter, cost) = KMeans.findClosest(centers, point)
    costAccum += cost
    sums(bestCenter) += point.vector
    counts(bestCenter) += 1
  }
  val contribs = for (j <- 0 until k) yield {
    (j, (sums(j), counts(j)))
  }
  contribs.iterator
}.reduceByKey(mergeContribs).collectAsMap()
// Update the cluster centers and costs
var changed = false
var j = 0
while (j < k) {
  val (sum, count) = totalContribs(j)
  if (count != 0) {
    sum /= count.toDouble
    val newCenter = new BreezeVectorWithNorm(sum)
    if (KMeans.fastSquaredDistance(newCenter, centers(j)) > epsilon * epsilon) {
      changed = true
    }
    centers(j) = newCenter
  }
  j += 1
}
if (!changed) {
  logInfo("Run " + run + " finished in " + (iteration + 1) + " iterations")
}
cost = costAccum.value
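The mergeContribs function passed to reduceByKey is not shown on the slide. A plausible reading is that it adds two (sum, count) pairs for the same center, so that partial results from different partitions combine into one total. The sketch below shows that idea with plain arrays instead of Breeze vectors; the Contrib type, partitionContrib and center are hypothetical helpers introduced only for this illustration.

```scala
object MergeSketch {
  // (running sum of points, number of points) for one center.
  type Contrib = (Array[Double], Long)

  // Hypothetical stand-in for mergeContribs: add sums element-wise, add counts.
  def mergeContribs(a: Contrib, b: Contrib): Contrib =
    (a._1.zip(b._1).map { case (x, y) => x + y }, a._2 + b._2)

  // Contribution of one partition to a single center: sum and count of its points.
  def partitionContrib(points: Seq[Array[Double]], dims: Int): Contrib =
    points.foldLeft((Array.fill(dims)(0.0), 0L)) { (acc, p) =>
      mergeContribs(acc, (p, 1L))
    }

  // New center = merged sum / merged count, as in the update loop above.
  def center(c: Contrib): Array[Double] = c._1.map(_ / c._2)
}
```

Because the merge is associative and commutative, the partitions can be combined in any order, which is what makes the averaging step safe to run in parallel.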
The Problem
● K-Means is very sensitive to the initial set of center points chosen.
● The best existing algorithm for choosing centers is highly sequential.
K-Means++
● Start with a random point from the dataset
● Pick another one randomly, with probability proportional to its squared distance from the closest center already chosen
● Repeat until all k initial centers are chosen
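The seeding procedure can be sketched in plain Scala as follows. This is a single-machine illustration under stated assumptions, not MLlib code: the names KMeansPlusPlusSketch, costTo and weightedIndex are made up, and points are weighted by squared distance to the nearest chosen center, the usual K-Means++ weighting.

```scala
import scala.util.Random

object KMeansPlusPlusSketch {
  type Point = Array[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  // Squared distance from p to the nearest center chosen so far.
  def costTo(centers: Seq[Point], p: Point): Double =
    centers.map(c => squaredDistance(c, p)).min

  // Draw an index with probability proportional to its non-negative weight.
  def weightedIndex(weights: Seq[Double], rng: Random): Int = {
    val target = rng.nextDouble() * weights.sum
    val i = weights.scanLeft(0.0)(_ + _).tail.indexWhere(_ > target)
    // Fallback covers all-zero weights and floating-point edge cases.
    if (i >= 0) i else math.max(0, weights.lastIndexWhere(_ > 0.0))
  }

  def init(points: IndexedSeq[Point], k: Int, rng: Random): Seq[Point] = {
    // First center: uniformly at random from the data.
    var centers = Vector(points(rng.nextInt(points.length)))
    // Every further center needs a full pass to recompute the weights,
    // and each pass depends on the centers picked before it.
    while (centers.length < k) {
      val weights = points.map(p => costTo(centers, p))
      centers = centers :+ points(weightedIndex(weights, rng))
    }
    centers
  }
}
```

The while loop makes the sequential dependency visible: the weights for one center cannot be computed until the previous center is known.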
K-Means++
● Initial centers have expected cost within a factor of O(log k) of the optimum
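For reference, the guarantee is usually written as below, where φ is the K-Means cost (sum of squared distances from each point to its nearest center) of the K-Means++ initial centers and φ_OPT is the optimal cost for k clusters. The constant comes from the standard analysis of K-Means++ and does not appear on the slide.

```latex
\mathbb{E}[\phi] \;\le\; 8\,(\ln k + 2)\,\phi_{\mathrm{OPT}} \;=\; O(\log k)\cdot\phi_{\mathrm{OPT}}
```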
K-Means++
● Requires k passes over the data
K-Means||
● Do only a few (~5) passes
● Sample m points on each pass
● Oversample
● Run K-Means++ on sampled points to find initial centers
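A single-machine sketch of this idea in plain Scala is shown below. It is an illustration under assumptions, not the MLlib code: the names are made up, the expected sample size per pass is set to 2k, and each candidate is weighted by the number of points nearest to it before a weighted K-Means++ reduces the candidates to k centers.

```scala
import scala.util.Random

object KMeansParallelSketch {
  type Point = Array[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  def nearest(centers: IndexedSeq[Point], p: Point): Int =
    centers.indices.minBy(j => squaredDistance(centers(j), p))

  def costTo(centers: IndexedSeq[Point], p: Point): Double =
    squaredDistance(centers(nearest(centers, p)), p)

  // Draw an index with probability proportional to its non-negative weight.
  def pick(weights: IndexedSeq[Double], rng: Random): Int = {
    val target = rng.nextDouble() * weights.sum
    val i = weights.scanLeft(0.0)(_ + _).tail.indexWhere(_ > target)
    if (i >= 0) i else math.max(0, weights.lastIndexWhere(_ > 0.0))
  }

  // A few passes. Within a pass every point is sampled independently, so the
  // pass is one parallel scan. `oversample` bounds the expected picks per pass.
  def candidates(points: IndexedSeq[Point], oversample: Double, passes: Int,
                 rng: Random): IndexedSeq[Point] = {
    var chosen = IndexedSeq(points(rng.nextInt(points.length)))
    for (_ <- 1 to passes) {
      val costs = points.map(p => costTo(chosen, p))
      val total = costs.sum
      val picked = points.zip(costs).filter { case (_, c) =>
        total > 0 && rng.nextDouble() < oversample * c / total
      }.map(_._1)
      chosen = chosen ++ picked
    }
    chosen
  }

  // Weighted K-Means++ over the small candidate set, cheap enough for one machine.
  def init(points: IndexedSeq[Point], k: Int, passes: Int, rng: Random): IndexedSeq[Point] = {
    val cands = candidates(points, 2.0 * k, passes, rng) // 2k is an assumption
    val weights = Array.fill(cands.length)(0.0)
    points.foreach(p => weights(nearest(cands, p)) += 1.0)
    var centers = IndexedSeq(cands(pick(weights.toIndexedSeq, rng)))
    while (centers.length < k) {
      val w = cands.indices.map(i => weights(i) * costTo(centers, cands(i)))
      centers = centers :+ cands(pick(w, rng))
    }
    centers
  }
}
```

The number of passes over the full data is fixed (for example 5) rather than k; the sequential K-Means++ loop only touches the sampled candidates.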
Then on the full data...