Juliet Hougland, Data Scientist, Cloudera at MLconf NYC
© Cloudera, Inc. All rights reserved.
Juliet Hougland Data Scientist @j_houg
Matrix Decomposition at Scale
The Singular Value Decomposition
SVD is applied everywhere
• Dimensionality Reduction/PCA
• Feature dimension reduction
• Visualization of gene expression data
• Latent Semantic Indexing
• Low Rank Approximations
• Digital Signal Processing
A Global Map of Human Gene Expression. Lukk et al. [1]
Define SVD
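For reference, the standard definition of the SVD, written with M for the matrix as on the Stochastic SVD slide. The dimensions m and n, the index k, and the column names u_i and v_i are notation assumed here, not taken from the slide:

```latex
M = U \Sigma V^{*}, \qquad
M \in \mathbb{C}^{m \times n},\quad
U \in \mathbb{C}^{m \times m},\quad
\Sigma \in \mathbb{R}^{m \times n},\quad
V \in \mathbb{C}^{n \times n}

U^{*} U = I_m, \qquad V^{*} V = I_n, \qquad
\Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \dots), \quad
\sigma_1 \ge \sigma_2 \ge \dots \ge 0

% Rank-k truncation: the best rank-k approximation of M in the
% spectral and Frobenius norms (Eckart-Young theorem).
M_k = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{*}
```

For a real matrix, the conjugate transpose V* is simply the transpose. The rank-k truncation is what the "Low Rank Approximations" and PCA use cases rely on.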
Totally awesome LANL video
This doesn’t work on distributed, commodity setups
Good Cluster / Bad Cluster
3 Distributed OSS SVD Implementations
• Mahout: Lanczos
• Mahout: Stochastic
• Spark: Lanczos
Lanczos’ Method
Lanczos’ Method
• Iterative, with the dominant cost a matrix-vector multiply
• Requires at least k iterations to get k singular vectors
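To make the two points above concrete, here is a minimal single-machine sketch of the Lanczos iteration in plain Python. It is not the Mahout or Spark code; the function names (`matvec`, `dot`, `lanczos`) and the tiny example matrix are hypothetical, chosen only for illustration. Each loop pass performs exactly one matrix-vector multiply and adds one basis vector, which is why k vectors need at least k iterations. For an SVD of M, the iteration is applied to the symmetric matrix M^T M, whose eigenvalues are the squared singular values of M.

```python
import math

def matvec(A, x):
    """Dense matrix-vector product: the dominant cost of each Lanczos step."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def lanczos(A, v0, k):
    """Run up to k Lanczos steps on a symmetric matrix A.

    Returns (alphas, betas, Q): the diagonal and off-diagonal entries of the
    small tridiagonal matrix T, and the list of Lanczos basis vectors.
    """
    norm = math.sqrt(dot(v0, v0))
    q = [x / norm for x in v0]
    q_prev = [0.0] * len(v0)
    beta = 0.0
    alphas, betas, Q = [], [], []
    for _ in range(k):
        Q.append(q)
        w = matvec(A, q)                       # one matvec per iteration
        alpha = dot(q, w)
        w = [wi - alpha * qi - beta * pi for wi, qi, pi in zip(w, q, q_prev)]
        beta = math.sqrt(dot(w, w))
        alphas.append(alpha)
        if beta < 1e-12:                       # invariant subspace found, stop early
            break
        betas.append(beta)
        q_prev, q = q, [wi / beta for wi in w]
    return alphas, betas[:len(alphas) - 1], Q

# Hypothetical example: M = diag(3, 1), so A = M^T M = diag(9, 1).
A = [[9.0, 0.0], [0.0, 1.0]]
alphas, betas, Q = lanczos(A, [1.0, 1.0], k=2)
# T = [[5, 4], [4, 5]] has eigenvalues 9 and 1, i.e. singular values 3 and 1 for M.
```

In a distributed setting the `matvec` call is the step that touches the whole matrix, so every iteration costs a full pass over the data.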
Stochastic SVD
• Randomly project original matrix to lower dimensional space
• Factorize the projected matrix
• Unproject
M ≈ QQ*M
Finding Structure in Randomness. Halko et al. http://bit.ly/19VVRXp
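The projection and unprojection steps can be sketched in plain Python as follows. This is a single-machine illustration under stated assumptions, not Mahout's implementation: the helper names (`matmul`, `transpose`, `orthonormal_columns`, `range_finder`) and the rank-2 test matrix are hypothetical. The code builds Q by multiplying M with a random Gaussian matrix and orthonormalizing the result, then checks that QQ*M reproduces M. The middle step, factorizing the small matrix B = Q*M, is left out because the standard library has no SVD routine.

```python
import random

def matmul(A, B):
    """Multiply two dense matrices stored as lists of rows."""
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def orthonormal_columns(Y):
    """Modified Gram-Schmidt on the columns of Y; returns Q with the same shape."""
    basis = []
    for col in transpose(Y):
        v = list(col)
        for q in basis:
            proj = sum(a * b for a, b in zip(q, v))
            v = [a - proj * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        basis.append([a / norm for a in v])
    return transpose(basis)

def range_finder(M, k, seed=0):
    """Project M onto k random directions and orthonormalize the result."""
    rng = random.Random(seed)
    n = len(M[0])
    omega = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n)]
    return orthonormal_columns(matmul(M, omega))    # Q is m x k

# Hypothetical rank-2 test matrix (6 x 4), built as a sum of two outer products.
u1, u2 = [1, 2, 3, 4, 5, 6], [1, 0, -1, 0, 1, 0]
v1, v2 = [1, 1, 0, 2], [0, 3, 1, -1]
M = [[a * c + b * d for c, d in zip(v1, v2)] for a, b in zip(u1, u2)]

Q = range_finder(M, k=2)
B = matmul(transpose(Q), M)      # small k x n matrix: the one to factorize
approx = matmul(Q, B)            # "unproject": Q Q* M
error = max(abs(a - b) for ra, rb in zip(M, approx) for a, b in zip(ra, rb))
```

Because the test matrix has rank exactly 2 and k = 2, `error` should be at floating-point noise level. For a general matrix the approximation quality depends on how fast the singular values decay, which is why Halko et al. recommend oversampling (choosing a few more random directions than the target rank).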
Frameworks
Mahout
• What I test is written on MapReduce
• Driver programs launch the series of required MapReduce jobs
• Lots of writing intermediate data to disk
Spark
• Using the MLlib component
• Relies on Spark core
• => tries to pin data in memory
Note!
Mahout Scala & Spark Bindings are integrated in Mahout. Version 0.10 release next month will move these methods
The Scala DSL for linear algebra:
val g = bt.t %*% bt - c - c.t + (s_q cross s_q) * (xi dot xi)
Performance Comparisons
[3]
MapReduce
[4]
Go Bananas tuning!
[5]
My Cluster
6 Nodes running CDH 5.3*
Per Node:
2 physical cores 24, with hyper threading => 144 total available cores
64 GB Memory
100 TB free in HDFS
*Running Spark 1.3
[6]
What am I factorizing?
[7]
What am I timing?
[8]
Think of the polar bears
[9]
Varying Columns
Varying Rows
Varying Sparsity
Progress in Numerical Computation
[10]
Thanks for the images!
1. Genome PCA: http://bit.ly/1OxXMRy
2. SVD at LANL: http://bit.ly/193IIdY
3. Apples and Oranges: http://bit.ly/1xd1Q4d
4. Sound Board: http://bit.ly/19okavV
5. Bananas: http://bit.ly/1EGxh4p
6. Eniac: http://bit.ly/1F0GOWC
7. Big data pix tumblr: http://bigdatapix.tumblr.com/
8. Watch: http://bit.ly/1FZtIKX
9. Polar Bears: http://bit.ly/1G0gXQw
10. Progress in numerical computing: http://bit.ly/1ID8WR5
@j_houg https://github.com/jhlch/svd-benchmark