Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Description

These slides give an overview of the technology and tools used by Data Scientists at Pivotal Data Labs, including procedural languages like PL/Python, PL/R, PL/Java and PL/Perl, and the parallel, in-database machine learning library MADlib. The slides also highlight the power and flexibility of the Pivotal platform, from embracing open source libraries in Python, R or Java to adopting new computing paradigms such as Spark on Pivotal HD.

Transcript of Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Page 1: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

1 © Copyright 2013 Pivotal. All rights reserved.

Pivotal Data Labs – Technology and Tools in our Data Scientist’s Arsenal

Srivatsan Ramanujam Senior Data Scientist Pivotal Data Labs 15 Oct 2014

Page 2: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Agenda

- Pivotal: Technology and Tools Introduction
  – Greenplum MPP Database and Pivotal Hadoop with HAWQ
- Data Parallelism
  – PL/Python, PL/R, PL/Java, PL/C
- Complete Parallelism
  – MADlib
- Python and R Wrappers
  – PyMADlib and PivotalR
- Open Source Integration
  – Spark and PySpark examples
- Live Demos – Pivotal Data Science Tools in Action
  – Topic and Sentiment Analysis
  – Content Based Image Search

Page 3: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Technology and Tools

Page 4: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


MPP Architectural Overview – Think of it as multiple PostgreSQL servers

Segments/Workers

Master

Rows are distributed across segments by a particular field (or randomly)

Page 5: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Implicit Parallelism – Procedural Languages

Page 6: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Data Parallelism – Embarrassingly Parallel Tasks

- Little or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks.

- Example: the map() function in Python:

>>> x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> map(lambda e: e*e, x)
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

www.slideshare.net/SrivatsanRamanujam/python-powered-data-science-at-pivotal-pydata-2013
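The same embarrassingly-parallel property means the squaring task can be fanned out to workers with no coordination at all. A minimal local sketch (our own illustration, not from the slides), using a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

def square(e):
    return e * e

def parallel_map(fn, xs, workers=4):
    # Each task is independent, so the inputs can be processed
    # concurrently with no communication between tasks.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, xs))

print(parallel_map(square, range(1, 11)))
# [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
```

In the database, this same property is what lets each segment apply a function to its local rows independently.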

Page 7: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


PL/X : X in {pgsql, R, Python, Java, Perl, C, etc.}

- The interpreter/VM of the language ‘X’ is installed on each node of the Greenplum Database Cluster
- Data Parallelism: PL/X piggybacks on Greenplum/HAWQ’s MPP architecture
- Allows users to write Greenplum/PostgreSQL functions in pgsql, R, Python, Java, Perl or C

[Slide diagram: SQL submitted to the Master Host (with a Standby Master) is distributed over the Interconnect to multiple Segment Hosts, each running several Segments]

Page 8: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


User Defined Functions – PL/Python Example

- Procedural languages need to be installed on each database used.
- The syntax is like a normal Python function with the function definition line replaced by a SQL wrapper – or, alternatively, like a SQL User Defined Function with Python inside.

CREATE FUNCTION pymax (a integer, b integer)
  RETURNS integer
AS $$
  if a > b:
    return a
  return b
$$ LANGUAGE plpythonu;

(The CREATE FUNCTION header and the closing $$ LANGUAGE plpythonu; line are the SQL wrapper; the body between the $$ markers is normal Python.)
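Because the body between the $$ markers is ordinary Python, it can be exercised locally before being wrapped in SQL. A quick local check of the same logic (our own sketch, not from the slides):

```python
# The pymax UDF body as plain Python, testable outside the database.
def pymax(a, b):
    if a > b:
        return a
    return b

print(pymax(3, 7))   # 7
print(pymax(10, 2))  # 10
```

Once wrapped, the function is invoked from SQL like any built-in, e.g. SELECT pymax(b, c) FROM t;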

Page 9: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Returning Results

- Postgres primitive types (int, bigint, text, float8, double precision, date, NULL etc.)
- Composite types can be returned by creating a composite type in the database:

CREATE TYPE named_value AS (
  name  text,
  value integer
);

- Then you can return a list, tuple or dict (not a set) whose structure matches the composite type:

CREATE FUNCTION make_pair (name text, value integer)
  RETURNS named_value
AS $$
  return [ name, value ]
  # or alternatively, as tuple: return ( name, value )
  # or as dict: return { "name": name, "value": value }
  # or as an object with attributes .name and .value
$$ LANGUAGE plpythonu;

- For functions which return multiple rows, prefix “SETOF” before the return type.

http://www.slideshare.net/PyData/massively-parallel-process-with-prodedural-python-ian-huston

Page 10: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Returning more results

You can return multiple results by wrapping them in a sequence (tuple, list or set), an iterator or a generator:

Sequence:

CREATE FUNCTION make_pair (name text)
  RETURNS SETOF named_value
AS $$
  return ([ name, 1 ], [ name, 2 ], [ name, 3 ])
$$ LANGUAGE plpythonu;

Generator:

CREATE FUNCTION make_pair (name text)
  RETURNS SETOF named_value
AS $$
  for i in range(3):
      yield (name, i)
$$ LANGUAGE plpythonu;

Page 11: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Accessing Packages

- On Greenplum DB: to be available, packages must be installed on the individual segment nodes.
  – Can use the “parallel ssh” tool gpssh to conda/pip install on every node
  – Currently Greenplum DB ships with Python 2.6 (!)
- Then just import as usual inside the function:

CREATE FUNCTION make_pair (name text)
  RETURNS SETOF named_value
AS $$
  import numpy as np
  return ((name, i) for i in np.arange(3))
$$ LANGUAGE plpythonu;

Page 12: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


UCI Auto MPG Dataset – A toy problem

[Slide shows sample data]

- Sample Task: Aerodynamics aside (attributable to body style), what is the effect of engine parameters (bore, stroke, compression_ratio, horsepower, peak_rpm) on the highway mpg of cars?
- Solution: Build a Linear Regression model for each body style (hatchback, sedan) using the features bore, stroke, compression_ratio, horsepower and peak_rpm, with highway_mpg as the target label.
- This is a data parallel task which can be executed in parallel by simply piggybacking on the MPP architecture: one segment can build a model for hatchbacks while another builds one for sedans.

http://archive.ics.uci.edu/ml/datasets/Auto+MPG
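The one-model-per-group pattern can be sketched locally. The following is a hypothetical, down-scaled illustration with made-up numbers: one feature (horsepower) instead of five, and closed-form simple OLS in place of scikit-learn; in the database, each group would be fitted on a separate segment.

```python
# One regression model per body style, fitted group by group.
from collections import defaultdict

rows = [  # (body_style, horsepower, highway_mpg) -- illustrative values only
    ("hatchback", 70, 38), ("hatchback", 90, 33), ("hatchback", 110, 29),
    ("sedan", 100, 30), ("sedan", 130, 25), ("sedan", 160, 21),
]

groups = defaultdict(list)
for style, hp, mpg in rows:
    groups[style].append((hp, mpg))

def ols(points):
    """Closed-form simple linear regression: returns (slope, intercept)."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return slope, (sy - slope * sx) / n

models = {style: ols(pts) for style, pts in groups.items()}
for style, (m, b) in models.items():
    print(style, round(m, 3), round(b, 2))
```

Each group's fit is entirely independent of the others, which is exactly why the task parallelizes so cheaply.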

Page 13: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Ridge Regression with scikit-learn on PL/Python

[Slide shows a code screenshot: a User Defined Function, User Defined Type and User Defined Aggregate – normal Python inside a SQL wrapper]

Page 14: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


PL/Python + scikit-learn : Model Coefficients

[Slide shows annotated query output: choose features, build the feature vector, invoke the UDF; one model per body style, along with the physical machine on the cluster on which each regression model was built]

Page 15: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Parallelized R in Pivotal via PL/R: An Example

- With placeholders in SQL, write functions in the native R language
- Accessible, powerful modeling framework

http://pivotalsoftware.github.io/gp-r/

Page 16: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Parallelized R in Pivotal via PL/R: An Example

- Execute the PL/R function
- A plain and simple table is returned

http://pivotalsoftware.github.io/gp-r/

Page 17: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Parallelized R in Pivotal via PL/R: Parallel Bagged Decision Trees

[Slide diagram: each tree makes a prediction; the predictions are aggregated to obtain the final prediction]

http://pivotalsoftware.github.io/gp-r/

Page 18: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Complete Parallelism

Page 19: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Complete Parallelism – Beyond Data Parallel Tasks

- Data parallel computation via the PL/X languages only allows us to run ‘n’ independent models in parallel.

- This works great when we are building one model for each value of the group-by column, but we need parallelized algorithms to be able to build a single model on all the available data.

- For this, we use MADlib – an open source library of parallel, in-database machine learning algorithms.

Page 20: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


MADlib: Scalable, in-database Machine Learning

http://madlib.net

Page 21: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


MADlib In-Database Functions

Predictive Modeling Library

Linear Systems
•  Sparse and Dense Solvers

Matrix Factorization
•  Singular Value Decomposition (SVD)
•  Low-Rank

Generalized Linear Models
•  Linear Regression
•  Logistic Regression
•  Multinomial Logistic Regression
•  Cox Proportional Hazards Regression
•  Elastic Net Regularization
•  Sandwich Estimators (Huber white, clustered, marginal effects)

Machine Learning Algorithms
•  Principal Component Analysis (PCA)
•  Association Rules (Affinity Analysis, Market Basket)
•  Topic Modeling (Parallel LDA)
•  Decision Trees
•  Ensemble Learners (Random Forests)
•  Support Vector Machines
•  Conditional Random Field (CRF)
•  Clustering (K-means)
•  Cross Validation

Descriptive Statistics

Sketch-based Estimators
•  CountMin (Cormode-Muthukrishnan)
•  FM (Flajolet-Martin)
•  MFV (Most Frequent Values)
Correlation
Summary

Support Modules
•  Array Operations
•  Sparse Vectors
•  Random Sampling
•  Probability Functions

Page 22: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Linear Regression: Streaming Algorithm

- Finding linear dependencies between variables

- How can we compute this with a single scan over the data?
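For the simple one-feature case, the regression coefficients depend on the data only through a handful of running sums, which can be accumulated in a single pass. A sketch of the idea (our own illustration, not MADlib code):

```python
# Single-scan simple linear regression: accumulate the sufficient
# statistics (n, Σx, Σy, Σxy, Σx²) while streaming over the rows
# once, then solve for the coefficients at the end.
def streaming_linregr(rows):
    n = sx = sy = sxy = sxx = 0
    for x, y in rows:          # one scan of the data
        n += 1
        sx += x
        sy += y
        sxy += x * y
        sxx += x * x
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# y = 2x + 1 exactly, so the fit recovers slope 2 and intercept 1.
data = [(x, 2 * x + 1) for x in range(10)]
print(streaming_linregr(data))  # (2.0, 1.0)
```

The multivariate case works the same way with the matrices XᵀX and Xᵀy as the accumulated statistics.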

Page 23: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Linear Regression: Parallel Computation

Xᵀy = Σᵢ xᵢᵀ yᵢ

[Slide diagram: Xᵀ and y shown as blocks, partitioned row by row]

Page 24: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Linear Regression: Parallel Computation

[Slide diagram: Segment 1 computes X₁ᵀy₁ and Segment 2 computes X₂ᵀy₂ on their local rows; the master adds the partial products: X₁ᵀy₁ + X₂ᵀy₂ = Xᵀy]
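The additivity that makes this work can be checked with a toy computation (our own illustration, made-up numbers): partial products from disjoint row chunks sum to the full Xᵀy.

```python
# Toy check: X^T y computed on two "segments" (disjoint row chunks)
# and merged by addition equals X^T y computed on all rows at once.
def xt_y(rows):
    """rows: list of (feature_vector, y); returns the vector X^T y."""
    dim = len(rows[0][0])
    acc = [0.0] * dim
    for x, y in rows:
        for j in range(dim):
            acc[j] += x[j] * y
    return acc

rows = [([1.0, 2.0], 3.0), ([0.5, 1.0], 2.0),
        ([2.0, 0.0], 1.0), ([1.0, 1.0], 4.0)]
segment1, segment2 = rows[:2], rows[2:]

merged = [a + b for a, b in zip(xt_y(segment1), xt_y(segment2))]
print(merged == xt_y(rows))  # True
```

Because the merge step is just vector addition, the master only ever receives small partial aggregates, never the raw rows.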

Page 25: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Linear Regression: Parallel Computation

[Same diagram, next animation frame: the combined Xᵀy = X₁ᵀy₁ + X₂ᵀy₂ now resides on the master]

Page 26: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Performing a linear regression on 10 million rows in seconds

Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of the VLDB Endowment 5.12 (2012): 1700-1711.

Page 27: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Calling MADlib Functions: Fast Training, Scoring

- MADlib allows users to easily create models without moving data out of the system
  – Model generation
  – Model validation
  – Scoring (evaluation of) new data
- All the data can be used in one model
- Built-in functionality to create multiple smaller models (e.g. classification grouped by feature)
- Open source lets you tweak and extend methods, or build your own

SELECT madlib.linregr_train(          -- MADlib model function
    'houses',                         -- table containing training data
    'houses_linregr',                 -- table in which to save results
    'price',                          -- column containing the dependent variable
    'ARRAY[1, tax, bath, size]');     -- features included in the model

https://www.youtube.com/watch?v=Gur4FS9gpAg

Page 28: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Calling MADlib Functions: Fast Training, Scoring

SELECT madlib.linregr_train(          -- MADlib model function
    'houses',                         -- table containing training data
    'houses_linregr',                 -- table in which to save results
    'price',                          -- column containing the dependent variable
    'ARRAY[1, tax, bath, size]',      -- features included in the model
    'bedroom');                       -- create multiple output models (one per value of bedroom)

- MADlib allows users to easily create models without moving data out of the system
  – Model generation
  – Model validation
  – Scoring (evaluation of) new data
- All the data can be used in one model
- Built-in functionality to create multiple smaller models (e.g. classification grouped by feature)
- Open source lets you tweak and extend methods, or build your own

https://www.youtube.com/watch?v=Gur4FS9gpAg

Page 29: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Calling MADlib Functions: Fast Training, Scoring

SELECT madlib.linregr_train(
    'houses',
    'houses_linregr',
    'price',
    'ARRAY[1, tax, bath, size]');

SELECT houses.*,
       madlib.linregr_predict(        -- MADlib model scoring function
           ARRAY[1, tax, bath, size],
           m.coef) AS predict
FROM houses,                          -- table with data to be scored
     houses_linregr m;                -- table containing the model

- MADlib allows users to easily create models without moving data out of the system
  – Model generation
  – Model validation
  – Scoring (evaluation of) new data
- All the data can be used in one model
- Built-in functionality to create multiple smaller models (e.g. classification grouped by feature)
- Open source lets you tweak and extend methods, or build your own

Page 30: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Python and R wrappers to MADlib

Page 31: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


PivotalR: Bringing MADlib and HAWQ to a familiar R interface

- Challenge: we want to harness the familiarity of R’s interface and the performance & scalability benefits of in-DB analytics
- Simple solution: translate R code into SQL

PivotalR:

d <- db.data.frame("houses")
houses_linregr <- madlib.lm(price ~ tax
                                  + bath
                                  + size,
                            data = d)

Generated SQL code:

SELECT madlib.linregr_train(
    'houses',
    'houses_linregr',
    'price',
    'ARRAY[1, tax, bath, size]');

https://github.com/pivotalsoftware/PivotalR
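The translation idea can be sketched in a few lines. The toy translator below is entirely our own illustration, not PivotalR's implementation: it parses an R-style formula string and emits the corresponding madlib.linregr_train call.

```python
# Hypothetical toy sketch of the R -> SQL translation idea behind PivotalR.
# Function name, parsing and output format are our own assumptions.
def formula_to_sql(formula, source_table, out_table):
    target, rhs = [s.strip() for s in formula.split("~")]
    features = [f.strip() for f in rhs.split("+")]
    array = "ARRAY[1, " + ", ".join(features) + "]"   # leading 1 = intercept term
    return (
        "SELECT madlib.linregr_train("
        f"'{source_table}', '{out_table}', '{target}', '{array}');"
    )

sql = formula_to_sql("price ~ tax + bath + size", "houses", "houses_linregr")
print(sql)
# SELECT madlib.linregr_train('houses', 'houses_linregr', 'price', 'ARRAY[1, tax, bath, size]');
```

The real package also handles factors, operators and data-frame semantics, but the core design is this kind of mechanical rewrite from R syntax to in-database SQL calls.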

Page 32: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


PivotalR: Bringing MADlib and HAWQ to a familiar R interface

- Challenge: we want to harness the familiarity of R’s interface and the performance & scalability benefits of in-DB analytics
- Simple solution: translate R code into SQL

PivotalR:

# Build a regression model with a different
# intercept term for each state
# (state=1 as baseline).
# Note that PivotalR supports automated
# indicator coding a la as.factor()
d <- db.data.frame("houses")
houses_linregr <- madlib.lm(price ~ as.factor(state)
                                  + tax
                                  + bath
                                  + size,
                            data = d)

https://github.com/pivotalsoftware/PivotalR

Page 33: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


PivotalR Design Overview

[Slide diagram: (1) PivotalR translates R into SQL; (2) RPostgreSQL sends the SQL to be executed on the Database/Hadoop cluster with MADlib, where the data lives; (3) only computation results are returned to R – no data on the client]

•  Call MADlib’s in-DB machine learning functions directly from R
•  Syntax is analogous to native R functions
•  Data doesn’t need to leave the database
•  All heavy lifting, including model estimation & computation, is done in the database

https://github.com/pivotalsoftware/PivotalR

Page 34: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


http://pivotalsoftware.github.io/pymadlib/

PyMADlib : Power of MADlib + Flexibility of Python

Current PyMADlib Algorithms
– Linear Regression
– Logistic Regression
– K-Means
– LDA

Extras
– Support for categorical variables
– Pivoting

[Slide shows linear and logistic regression code screenshots]

Page 35: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Visualization

Page 36: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Visualization

[Slide shows logos of commercial and open source visualization tools]

Page 37: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Hack one when needed – Pandas_via_psql

http://vatsan.github.io/pandas_via_psql/

[Slide diagram: a SQL client piping query results from the DB into pandas]

Page 38: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Integration with Open Source – (Py)Spark Example

Page 39: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Apache Spark Project – Quick Overview

http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf

•  Apache Project, originated in the AMPLab at UC Berkeley
•  Supported on Pivotal Hadoop 2.0!

Page 40: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


MapReduce vs. Spark

http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf

Page 41: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Data Parallelism in PySpark – A Simple Example

•  Next we’ll take the UCI automobile dataset example from PL/Python and demonstrate how to run it in PySpark

Page 42: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Scikit-Learn on PySpark – UCI Auto Dataset Example

•  This is in essence similar to the PL/Python example from the earlier slides, except we’re using data stored on HDFS (Pivotal HD) with Spark as the platform in place of HAWQ/Greenplum
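The shape of such a job can be sketched with plain Python stand-ins (a hypothetical, down-scaled illustration with made-up data and helper names: parse each line, group by body style, fit per group; a real job would use sc.textFile, map, groupByKey and mapValues, and would call scikit-learn inside fit):

```python
# Plain-Python emulation of the per-group pipeline's shape.
from itertools import groupby

lines = [  # stand-in for an HDFS text file: body_style,horsepower,highway_mpg
    "sedan,100,30", "hatchback,70,38", "sedan,130,25", "hatchback,90,33",
]

def parse(line):
    style, hp, mpg = line.split(",")
    return style, (float(hp), float(mpg))

def fit(points):
    """Stand-in 'model': mean mpg of the group (a real job would fit a regression)."""
    return sum(y for _, y in points) / len(points)

pairs = sorted(map(parse, lines))                 # like rdd.map(parse).sortByKey()
models = {k: fit([v for _, v in g])               # like groupByKey().mapValues(fit)
          for k, g in groupby(pairs, key=lambda kv: kv[0])}
print(models)  # {'hatchback': 35.5, 'sedan': 27.5}
```

As with PL/Python on Greenplum, the groups are independent, so Spark can fit them on different workers without any cross-talk.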

Page 43: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Large Scale Topic and Sentiment Analysis of Tweets

Social Media Demo

Page 44: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Pivotal GNIP Decahose Pipeline

[Slide diagram: the Twitter Decahose (~55 million tweets/day) is ingested (source: http, sink: hdfs) into HDFS by nightly cron jobs; PXF external tables enable parallel parsing of the JSON; topic analysis through MADlib pLDA and unsupervised sentiment analysis (PL/Python) run in-database; results are visualized with D3.js]

http://www.slideshare.net/SrivatsanRamanujam/a-pipeline-for-distributed-topic-and-sentiment-analysis-of-tweets-on-pivotal-greenplum-database

Page 45: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Data Science + Agile = Quick Wins

- The Team
  – 1 Data Scientist
  – 2 Agile Developers
  – 1 Designer (part-time)
  – 1 Project Manager (part-time)

- Duration
  – 3 weeks!

Page 46: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Live Demo – Topic and Sentiment Analysis

Page 47: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

47 Pivotal Confidential–Internal Use Only

Content Based Image Search

CBIR Live Demo

Page 48: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Content Based Information Retrieval - Task

http://blog.pivotal.io/pivotal/features/content-based-image-retrieval-using-pivotal-hd-or-pivotal-greenplum-database

Page 49: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


CBIR - Components

http://blog.pivotal.io/pivotal/features/content-based-image-retrieval-using-pivotal-hd-or-pivotal-greenplum-database

Page 50: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Live Demo – Content Based Image Search

http://blog.pivotal.io/pivotal/features/content-based-image-retrieval-using-pivotal-hd-or-pivotal-greenplum-database

Page 51: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Appendix

Page 52: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal


Acknowledgements

•  Ian Huston, Woo Jung, Sarah Aerni, Gautam Muralidhar, Regunathan Radhakrishnan, Ronert Obst, Hai Qian, the MADlib Engineering Team, Sumedh Mungee, Girish Lingappa