Juliet Hougland, Data Scientist, Cloudera at MLconf NYC

© Cloudera, Inc. All rights reserved.

Juliet Hougland Data Scientist @j_houg

Matrix Decomposition at Scale


The Singular Value Decomposition


SVD is applied everywhere

• Dimensionality Reduction/PCA
• Feature dimension reduction
• Visualization of gene expression data
• Latent Semantic Indexing
• Low Rank Approximations
• Digital Signal Processing

A Global Map of Human Gene Expression. Lukk et al. [1]


Define SVD
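The definition on this slide was a figure and did not survive extraction. As a reminder, the standard textbook statement is given here, with the symbols $U$, $\Sigma$, $V$, $u_i$, $v_i$, and $\sigma_i$ chosen for illustration rather than taken from the slide. For a real $m \times n$ matrix $M$:

```latex
\[
M = U \Sigma V^{*}, \qquad U^{*}U = I_m, \quad V^{*}V = I_n,
\]
\[
\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_{\min(m,n)}) \in \mathbb{R}^{m \times n},
\qquad \sigma_1 \ge \sigma_2 \ge \cdots \ge 0 .
\]
% Keeping only the k largest singular values gives the rank-k truncation:
\[
M_k = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{*} = U_k \Sigma_k V_k^{*} .
\]
```

The truncation $M_k$ is the best rank-$k$ approximation of $M$ in the spectral and Frobenius norms (Eckart-Young), which is why a few leading singular vectors are often enough for the applications listed earlier.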


Totally awesome LANL video


This doesn’t work on distributed, commodity setups

Good Cluster / Bad Cluster


3 Distributed OSS SVD Implementations

• Mahout: Lanczos
• Mahout: Stochastic
• Spark: Lanczos


Lanczos’ Method


• Iterative, with the dominant cost a matrix-vector multiply
• Requires at least k iterations to get k singular vectors
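A minimal single-machine NumPy sketch of the Lanczos idea on this slide. It is not the Mahout or Spark implementation. It assumes a real matrix, runs Lanczos on the symmetric matrix M^T M, and uses full reorthogonalization for numerical safety. The function name `lanczos_svd` and its parameters are hypothetical.

```python
import numpy as np

def lanczos_svd(M, num_iters, seed=0):
    """Approximate singular values and right singular vectors of a real
    matrix M by running Lanczos on M^T M (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    n = M.shape[1]
    num_iters = min(num_iters, n)          # the Krylov space cannot exceed n dims
    V = np.zeros((n, num_iters))           # Lanczos basis vectors (columns)
    alpha = np.zeros(num_iters)            # diagonal of the tridiagonal matrix T
    beta = np.zeros(max(num_iters - 1, 0)) # off-diagonal of T

    v0 = rng.standard_normal(n)
    V[:, 0] = v0 / np.linalg.norm(v0)
    steps = num_iters
    for j in range(num_iters):
        # Dominant cost per iteration: one multiply by M and one by M^T.
        w = M.T @ (M @ V[:, j])
        alpha[j] = V[:, j] @ w
        # Full reorthogonalization against all basis vectors so far (two passes).
        basis = V[:, :j + 1]
        for _ in range(2):
            w = w - basis @ (basis.T @ w)
        if j + 1 == num_iters:
            break
        beta[j] = np.linalg.norm(w)
        if beta[j] < 1e-12:                # invariant subspace found; stop early
            steps = j + 1
            break
        V[:, j + 1] = w / beta[j]

    # Small tridiagonal matrix T = V^T (M^T M) V; its eigenpairs are Ritz pairs.
    T = (np.diag(alpha[:steps])
         + np.diag(beta[:steps - 1], 1)
         + np.diag(beta[:steps - 1], -1))
    evals, evecs = np.linalg.eigh(T)
    order = np.argsort(evals)[::-1]        # largest eigenvalue first
    sigma = np.sqrt(np.clip(evals[order], 0.0, None))
    right_vectors = V[:, :steps] @ evecs[:, order]
    return sigma, right_vectors
```

Each iteration adds one vector to the basis, so k iterations yield at most k approximate singular vectors. In practice more than k iterations are usually needed before the leading k have converged.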


Stochastic SVD

• Randomly project original matrix to lower dimensional space
• Factorize the projected matrix
• Unproject

$M \approx QQ^{*}M$

Finding Structure in Randomness. Halko et al. http://bit.ly/19VVRXp
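The three steps of stochastic SVD (project, factorize, unproject) can be sketched in NumPy on a single machine. This follows the basic randomized range-finder scheme from Halko et al. and is not Mahout's distributed implementation. The function name `randomized_svd` and the `oversample` parameter are assumptions made for illustration.

```python
import numpy as np

def randomized_svd(M, rank, oversample=10, seed=0):
    """Approximate rank-`rank` SVD of a real matrix M via random projection
    (illustrative sketch of the basic scheme, no power iterations)."""
    rng = np.random.default_rng(seed)
    n = M.shape[1]

    # 1. Randomly project: sample the range of M with a Gaussian test matrix.
    omega = rng.standard_normal((n, rank + oversample))
    Y = M @ omega
    Q, _ = np.linalg.qr(Y)      # orthonormal basis Q, so that M ~ Q Q^T M

    # 2. Factorize the small projected matrix B = Q^T M.
    B = Q.T @ M
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)

    # 3. Unproject: map the left singular vectors back to the original space.
    U = Q @ U_small
    return U[:, :rank], s[:rank], Vt[:rank, :]

# Usage on a matrix that is exactly rank 5, where the sketch recovers M.
rng = np.random.default_rng(1)
M = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 80))
U, s, Vt = randomized_svd(M, rank=5)
reconstruction = (U * s) @ Vt
```

For real matrices $Q^{*}$ is just the transpose. The oversampling columns make it more likely that Q captures the dominant range of M when the singular values decay slowly.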


Frameworks

Mahout
• What I test is written on MapReduce
• Driver programs launch the series of required MapReduce jobs
• Lots of writing intermediate data to disk

Spark
• Using the MLlib component
• Relies on Spark core
• => tries to pin data in memory


Note!

Mahout Scala & Spark bindings are integrated in Mahout. The version 0.10 release next month will move these methods

The Scala DSL for linear algebra:

val g = bt.t %*% bt - c - c.t + (s_q cross s_q) * (xi dot xi)


Performance Comparisons


[3]


MapReduce

[4]


Go Bananas tuning!

[5]


My Cluster

• 6 nodes running CDH 5.3*
• Per node: 2 physical cores, 24 with hyper-threading => 144 total available cores
• 64 GB memory
• 100 TB free in HDFS

*Running Spark 1.3

[6]


What am I factorizing?

[7]


What am I timing?

[8]


Think of the polar bears

[9]


Varying Columns


Varying Rows


Varying Sparsity


Progress in Numerical Computation

[10]


Thanks for the images!

1. Genome PCA: http://bit.ly/1OxXMRy
2. SVD at LANL: http://bit.ly/193IIdY
3. Apples and Oranges: http://bit.ly/1xd1Q4d
4. Sound Board: http://bit.ly/19okavV
5. Bananas: http://bit.ly/1EGxh4p
6. Eniac: http://bit.ly/1F0GOWC
7. Big data pix tumblr: http://bigdatapix.tumblr.com/
8. Watch: http://bit.ly/1FZtIKX
9. Polar Bears: http://bit.ly/1G0gXQw
10. Progress in numerical computing: http://bit.ly/1ID8WR5


@j_houg https://github.com/jhlch/svd-benchmark
