Juliet Hougland, Data Scientist, Cloudera at MLconf NYC

© Cloudera, Inc. All rights reserved. Juliet Hougland, Data Scientist, @j_houg. Matrix Decomposition at Scale

Transcript of Juliet Hougland, Data Scientist, Cloudera at MLconf NYC

Page 1: Juliet Hougland, Data Scientist, Cloudera at MLconf NYC

Juliet Hougland Data Scientist @j_houg

Matrix Decomposition at Scale

Page 2: Juliet Hougland, Data Scientist, Cloudera at MLconf NYC

The Singular Value Decomposition

Page 3: Juliet Hougland, Data Scientist, Cloudera at MLconf NYC

SVD is applied everywhere

• Dimensionality Reduction/PCA
• Feature dimension reduction
• Visualization of gene expression data
• Latent Semantic Indexing
• Low Rank Approximations
• Digital Signal Processing

A Global Map of Human Gene Expression. Lukk et al. [1]

Page 4: Juliet Hougland, Data Scientist, Cloudera at MLconf NYC

Define SVD
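
The slide's definition did not survive extraction. For reference, the standard (thin) SVD and its rank-k truncation are given below. The symbols m, n, r and k are chosen here for illustration and are not taken from the slide; the conjugate-transpose star matches the notation used later in the deck.

```latex
% Thin SVD of an m-by-n matrix M of rank r
M = U \Sigma V^{*}, \qquad
U \in \mathbb{R}^{m \times r},\quad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r),\quad
V \in \mathbb{R}^{n \times r},
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0,
\qquad U^{*}U = V^{*}V = I_r .

% Keeping only the k largest singular triplets gives the best rank-k
% approximation in the spectral and Frobenius norms (Eckart-Young)
M_k = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{*}, \qquad k \le r .
```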

Page 5: Juliet Hougland, Data Scientist, Cloudera at MLconf NYC

Totally awesome LANL video

Page 6: Juliet Hougland, Data Scientist, Cloudera at MLconf NYC

This doesn’t work on distributed, commodity setups

Good Cluster / Bad Cluster

Page 7: Juliet Hougland, Data Scientist, Cloudera at MLconf NYC

3 Distributed OSS SVD Implementations

• Mahout: Lanczos
• Mahout: Stochastic
• Spark: Lanczos

Page 8: Juliet Hougland, Data Scientist, Cloudera at MLconf NYC

Lanczos’ Method

Page 9: Juliet Hougland, Data Scientist, Cloudera at MLconf NYC

Lanczos’ Method

• Iterative, with the dominant cost a matrix-vector multiply
• Requires at least k iterations to get k singular vectors
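
To make the two points on this slide concrete, here is a minimal single-machine sketch of the Lanczos iteration applied to M^T M. It is not the Mahout or Spark code. The helper names (`gram_matvec`, `lanczos`), the tiny test matrix and the use of full reorthogonalization are all assumptions made for illustration. Each step performs one multiply by M^T M, and k steps produce k Lanczos vectors plus a k-by-k tridiagonal matrix T. The square roots of T's eigenvalues approximate singular values of M.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def gram_matvec(M, v):
    """Compute (M^T M) v as M^T (M v), without ever forming M^T M."""
    Mv = [dot(row, v) for row in M]
    return [sum(M[i][j] * Mv[i] for i in range(len(M))) for j in range(len(v))]


def lanczos(matvec, v0, k, tol=1e-12):
    """Run up to k Lanczos steps on a symmetric operator.

    Returns (alphas, betas, Q): the diagonal and off-diagonal entries of the
    tridiagonal matrix T, and the orthonormal Lanczos vectors.
    """
    norm0 = math.sqrt(dot(v0, v0))
    Q = [[x / norm0 for x in v0]]
    alphas, betas = [], []
    for j in range(k):
        w = matvec(Q[j])  # the dominant cost of every iteration
        alphas.append(dot(w, Q[j]))
        # Full reorthogonalization against all earlier vectors: simple and
        # numerically safe, but more work than the three-term recurrence.
        for q in Q:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        beta = math.sqrt(dot(w, w))
        if j == k - 1 or beta < tol:
            break
        betas.append(beta)
        Q.append([wi / beta for wi in w])
    return alphas, betas, Q


# Hypothetical 4x3 example matrix, chosen only so the sketch can be run.
M = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 3.0], [1.0, 1.0, 1.0]]
alphas, betas, Q = lanczos(lambda v: gram_matvec(M, v), [1.0, 0.0, 0.0], k=3)
```

In a distributed setting only `matvec` touches the full matrix, which is why the cost of that one operation dominates.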

Page 10: Juliet Hougland, Data Scientist, Cloudera at MLconf NYC

Stochastic SVD

M ≈ QQ*M

• Randomly project original matrix to lower dimensional space
• Factorize the projected matrix
• Unproject

Finding Structure in Randomness. Halko et al. http://bit.ly/19VVRXp
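
The projection step can be sketched in a few lines of plain Python. This is an illustration of the idea behind M ≈ QQ*M for a real matrix (so Q* is the transpose), not Mahout's implementation. The function names, the Gram-Schmidt orthogonalization and the small rank-2 example are assumptions, and the sketch assumes the projected matrix has full column rank.

```python
import math
import random


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def orthonormal_columns(Y):
    """Modified Gram-Schmidt on the columns of Y (assumes full column rank)."""
    basis = []
    for col in transpose(Y):
        w = list(col)
        for q in basis:
            c = sum(wi * qi for wi, qi in zip(w, q))
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        basis.append([wi / norm for wi in w])
    return transpose(basis)


def gaussian_matrix(rows, cols, seed=0):
    """Random test matrix with independent standard normal entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(M, omega):
    """Return (Q, B): Q is an orthonormal basis for range(M omega), B = Q^T M."""
    Q = orthonormal_columns(matmul(M, omega))
    B = matmul(transpose(Q), M)
    return Q, B


# Hypothetical rank-2 matrix: row 2 is twice row 1, and row 1 = row 3 + 2 * row 4.
M = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
Q, B = project(M, gaussian_matrix(3, 2))
approx = matmul(Q, B)  # Q Q^T M, the "unprojected" approximation of M
```

The small k-by-n matrix `B` is the one that would be factorized locally. If B = Ũ Σ V*, then U = Q Ũ gives the approximate left singular vectors of M.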

Page 11: Juliet Hougland, Data Scientist, Cloudera at MLconf NYC

Frameworks

Mahout
• What I test is written on MapReduce
• Driver programs launch the series of required MapReduce jobs
• Lots of writing intermediate data to disk

Spark
• Using the MLlib component
• Relies on Spark core
• => tries to pin data in memory
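
The difference between the two frameworks matters because an iterative method repeats the same pass over the matrix many times. The sketch below shows the general map-and-reduce shape of one such pass, computing (M^T M) v over row partitions. It is plain Python with hypothetical function names, not code from Mahout or MLlib. On MapReduce each pass like this would be a separate job that reads from and writes to disk, whereas Spark can try to keep the partitions in memory between passes.

```python
from functools import reduce


def partial_gram_matvec(rows, v):
    """Map step: one partition's contribution, the sum over its rows r of (r . v) * r."""
    out = [0.0] * len(v)
    for r in rows:
        s = sum(a * b for a, b in zip(r, v))
        for j, a in enumerate(r):
            out[j] += s * a
    return out


def add(x, y):
    """Reduce step: element-wise sum of two partial results."""
    return [a + b for a, b in zip(x, y)]


def distributed_gram_matvec(partitions, v):
    """Combine the per-partition results into (M^T M) v."""
    return reduce(add, (partial_gram_matvec(p, v) for p in partitions))


# Hypothetical matrix split row-wise into two partitions.
partitions = [
    [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 3.0], [1.0, 1.0, 1.0]],
]
result = distributed_gram_matvec(partitions, [1.0, 0.0, 0.0])
```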

Page 12: Juliet Hougland, Data Scientist, Cloudera at MLconf NYC

Note!

The Mahout Scala & Spark bindings are integrated in Mahout. The version 0.10 release next month will move these methods.

The Scala DSL for linear algebra:

val g = bt.t %*% bt - c - c.t + (s_q cross s_q) * (xi dot xi)

Page 13: Juliet Hougland, Data Scientist, Cloudera at MLconf NYC

Performance Comparisons

Page 14: Juliet Hougland, Data Scientist, Cloudera at MLconf NYC

[3]

Page 15: Juliet Hougland, Data Scientist, Cloudera at MLconf NYC

MapReduce

[4]

Page 16: Juliet Hougland, Data Scientist, Cloudera at MLconf NYC

Go Bananas tuning!

[5]

Page 17: Juliet Hougland, Data Scientist, Cloudera at MLconf NYC

My Cluster

6 Nodes running CDH 5.3*
Per Node: 2 physical cores 24, with hyper threading => 144 total available cores
64 GB Memory
100 TB free in HDFS

*Running Spark 1.3

[6]

Page 18: Juliet Hougland, Data Scientist, Cloudera at MLconf NYC

What am I factorizing?

[7]

Page 19: Juliet Hougland, Data Scientist, Cloudera at MLconf NYC

What am I timing?

[8]

Page 20: Juliet Hougland, Data Scientist, Cloudera at MLconf NYC

Think of the polar bears

[9]

Page 21: Juliet Hougland, Data Scientist, Cloudera at MLconf NYC

Varying Columns

Page 22: Juliet Hougland, Data Scientist, Cloudera at MLconf NYC

Varying Rows

Page 23: Juliet Hougland, Data Scientist, Cloudera at MLconf NYC

Varying Sparsity

Page 24: Juliet Hougland, Data Scientist, Cloudera at MLconf NYC

Progress in Numerical Computation

[10]

Page 25: Juliet Hougland, Data Scientist, Cloudera at MLconf NYC

Thanks for the images!

1. Genome PCA: http://bit.ly/1OxXMRy
2. SVD at LANL: http://bit.ly/193IIdY
3. Apples and Oranges: http://bit.ly/1xd1Q4d
4. Sound Board: http://bit.ly/19okavV
5. Bananas: http://bit.ly/1EGxh4p
6. Eniac: http://bit.ly/1F0GOWC
7. Big data pix tumblr: http://bigdatapix.tumblr.com/
8. Watch: http://bit.ly/1FZtIKX
9. Polar Bears: http://bit.ly/1G0gXQw
10. Progress in numerical computing: http://bit.ly/1ID8WR5

Page 26: Juliet Hougland, Data Scientist, Cloudera at MLconf NYC

[email protected] @j_houg https://github.com/jhlch/svd-benchmark