Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
![Page 1: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/1.jpg)
1 © Copyright 2013 Pivotal. All rights reserved.
Pivotal Data Labs – Technology and Tools in our Data Scientist’s Arsenal
Srivatsan Ramanujam
Senior Data Scientist, Pivotal Data Labs
15 Oct 2014
![Page 2: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/2.jpg)
Agenda
• Pivotal: Technology and Tools Introduction
– Greenplum MPP Database and Pivotal Hadoop with HAWQ
• Data Parallelism – PL/Python, PL/R, PL/Java, PL/C
• Complete Parallelism – MADlib
• Python and R Wrappers – PyMADlib and PivotalR
• Open Source Integration – Spark and PySpark examples
• Live Demos – Pivotal Data Science Tools in Action
– Topic and Sentiment Analysis
– Content Based Image Search
![Page 3: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/3.jpg)
Technology and Tools
![Page 4: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/4.jpg)
MPP Architectural Overview
Think of it as multiple PostgreSQL servers
Segments/Workers
Master
Rows are distributed across segments by a particular field (or randomly)
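As a rough illustration of distributing rows by a field, the sketch below hashes a distribution key and takes it modulo the number of segments. The hash function (CRC32), the helper names and the sample rows are assumptions made for this example; Greenplum's actual hashing scheme is not described in the slides.

```python
import zlib
from collections import defaultdict


def assign_segment(key, n_segments):
    """Map a distribution-key value to a segment index (illustrative hash only)."""
    return zlib.crc32(str(key).encode("utf-8")) % n_segments


def distribute(rows, key_field, n_segments):
    """Group rows (dicts) by the segment their distribution key hashes to."""
    segments = defaultdict(list)
    for row in rows:
        segments[assign_segment(row[key_field], n_segments)].append(row)
    return dict(segments)


# Made-up rows; car_id plays the role of the distribution field.
rows = [{"car_id": i, "body_style": "sedan" if i % 2 else "hatchback"} for i in range(8)]
placement = distribute(rows, "car_id", n_segments=4)
```

Rows with the same key value always land on the same segment, which is what lets per-key work run locally on each segment.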
![Page 5: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/5.jpg)
Implicit Parallelism – Procedural Languages
![Page 6: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/6.jpg)
Data Parallelism – Embarrassingly Parallel Tasks
• Little or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks.
• Examples:
– map() function in Python:
>>> x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> map(lambda e: e*e, x)
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
www.slideshare.net/SrivatsanRamanujam/python-powered-data-science-at-pivotal-pydata-2013
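The interpreter session above is Python 2, where map() returns a list. A minimal Python 3 sketch of the same idea follows, with an executor's map applying the function to each element independently. Using a thread pool here is an assumption made to keep the example small; CPU-bound work would more typically use processes or, as in this deck, database segments.

```python
from concurrent.futures import ThreadPoolExecutor


def square(e):
    return e * e


x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Python 3: map() is lazy, so wrap it in list() to materialize the results.
serial = list(map(square, x))

# Executor.map applies the same function to each element independently;
# results come back in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(square, x))
```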
![Page 7: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/7.jpg)
PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}
• Allows users to write Greenplum/PostgreSQL functions in the R, Python, Java, Perl, pgsql or C languages
• The interpreter/VM of the language ‘X’ is installed on each node of the Greenplum Database Cluster
• Data Parallelism:
– PL/X piggybacks on Greenplum/HAWQ’s MPP architecture
[Diagram: SQL is submitted to the Master Host (Master, with a Standby Master), which connects through the Interconnect to four Segment Hosts, each running two Segments]
![Page 8: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/8.jpg)
User Defined Functions – PL/Python Example
• Procedural languages need to be installed on each database used.
• Syntax is like a normal Python function with the function definition line replaced by a SQL wrapper. Alternatively, like a SQL User Defined Function with Python inside.
CREATE FUNCTION pymax (a integer, b integer)
  RETURNS integer
AS $$
  if a > b:
    return a
  return b
$$ LANGUAGE plpythonu;
[Slide annotations: the CREATE FUNCTION ... AS $$ and $$ LANGUAGE plpythonu; lines are the SQL wrapper; the body in between is normal Python]
![Page 9: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/9.jpg)
Returning Results
• Postgres primitive types (int, bigint, text, float8, double precision, date, NULL etc.)
• Composite types can be returned by creating a composite type in the database:
CREATE TYPE named_value AS (
  name text,
  value integer
);
• Then you can return a list, tuple or dict (not sets) whose structure matches the composite type:
CREATE FUNCTION make_pair (name text, value integer)
  RETURNS named_value
AS $$
  return [ name, value ]
  # or alternatively, as tuple: return ( name, value )
  # or as dict: return { "name": name, "value": value }
  # or as an object with attributes .name and .value
$$ LANGUAGE plpythonu;
• For functions which return multiple rows, prefix “setof” before the return type
http://www.slideshare.net/PyData/massively-parallel-process-with-prodedural-python-ian-huston
![Page 10: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/10.jpg)
Returning more results
You can return multiple results by wrapping them in a sequence (tuple, list or set), an iterator or a generator:
Sequence
CREATE FUNCTION make_pair (name text)
  RETURNS SETOF named_value
AS $$
  return ([ name, 1 ], [ name, 2 ], [ name, 3 ])
$$ LANGUAGE plpythonu;
Generator
CREATE FUNCTION make_pair (name text)
  RETURNS SETOF named_value
AS $$
  for i in range(3):
    yield (name, i)
$$ LANGUAGE plpythonu;
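Because the function bodies are ordinary Python, they can be tried outside the database first. The sketch below wraps the two bodies from this slide in plain functions (the function names are made up for this example) to show the rows each form produces. Note that range(3) counts from 0, so the two variants do not return identical values.

```python
def make_pair_sequence(name):
    # Sequence form: all rows are built up front.
    return ([name, 1], [name, 2], [name, 3])


def make_pair_generator(name):
    # Generator form: rows are produced one at a time.
    for i in range(3):
        yield (name, i)


seq_rows = [tuple(r) for r in make_pair_sequence("a")]
gen_rows = list(make_pair_generator("a"))
```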
![Page 11: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/11.jpg)
Accessing Packages
• On Greenplum DB: To be available, packages must be installed on the individual segment nodes.
– Can use “parallel ssh” tool gpssh to conda/pip install
– Currently Greenplum DB ships with Python 2.6 (!)
• Then just import as usual inside function:
CREATE FUNCTION make_pair (name text)
  RETURNS SETOF named_value
AS $$
  import numpy as np
  return ((name, i) for i in np.arange(3))
$$ LANGUAGE plpythonu;
![Page 12: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/12.jpg)
UCI Auto MPG Dataset – A toy problem
Sample Data
• Sample Task: Aerodynamics aside (attributable to body style), what is the effect of engine parameters (bore, stroke, compression_ratio, horsepower, peak_rpm) on the highway mpg of cars?
• Solution: Build a Linear Regression model for each body style (hatchback, sedan) using the features bore, stroke, compression_ratio, horsepower and peak_rpm with highway_mpg as the target label.
• This is a data parallel task which can be executed in parallel by simply piggybacking on the MPP architecture. One segment can build a model for hatchbacks, another for sedans.
http://archive.ics.uci.edu/ml/datasets/Auto+MPG
![Page 13: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/13.jpg)
Ridge Regression with scikit-learn on PL/Python
[Code screenshot. Annotations: User Defined Type; User Defined Function (Python body inside a SQL wrapper); User Defined Aggregate]
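The slide's code did not survive in this transcript. As a stand-in, here is a minimal sketch of the computation each group would perform: one ridge model per body style. It uses the closed-form ridge solution in plain Python rather than scikit-learn, omits the intercept for brevity, and runs on made-up rows; all function names are hypothetical.

```python
from collections import defaultdict


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ridge_fit(xs, ys, alpha=1.0):
    """Closed-form ridge: w = (X^T X + alpha*I)^-1 X^T y (no intercept)."""
    d = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) + (alpha if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(d)]
    return solve(xtx, xty)


def fit_per_group(rows, alpha=1.0):
    """rows: (group, features, target) tuples; returns one coefficient list per group."""
    groups = defaultdict(lambda: ([], []))
    for group, features, target in rows:
        groups[group][0].append(features)
        groups[group][1].append(target)
    return {g: ridge_fit(xs, ys, alpha) for g, (xs, ys) in groups.items()}


# Made-up rows: (body_style, [feature_1, feature_2], target)
rows = [
    ("sedan", [1.0, 0.0], 2.0), ("sedan", [0.0, 1.0], 3.0),
    ("hatchback", [1.0, 0.0], 5.0), ("hatchback", [0.0, 1.0], 7.0),
]
models = fit_per_group(rows, alpha=0.0)
```

Each group's fit depends only on that group's rows, which is what makes the task data parallel.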
![Page 14: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/14.jpg)
PL/Python + scikit-learn : Model Coefficients
[Query and results screenshot. Annotations: Choose Features; Build Feature Vector; Invoke UDF; One model per body style; Physical machine on the cluster in which the regression model was built]
![Page 15: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/15.jpg)
Parallelized R in Pivotal via PL/R: An Example
• With placeholders in SQL, write functions in the native R language
• Accessible, powerful modeling framework
http://pivotalsoftware.github.io/gp-r/
![Page 16: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/16.jpg)
Parallelized R in Pivotal via PL/R: An Example
• Execute PL/R function
• Plain and simple table is returned
http://pivotalsoftware.github.io/gp-r/
![Page 17: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/17.jpg)
Parallelized R in Pivotal via PL/R: Parallel Bagged Decision Trees
[Diagram: Each tree makes a prediction; aggregate and obtain final prediction]
http://pivotalsoftware.github.io/gp-r/
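The bagging idea on this slide (each tree predicts, then the predictions are aggregated) can be sketched in a few lines. The version below is plain Python, not the PL/R code from the talk: each "tree" is a one-split stump on a single feature, trained on a bootstrap resample, and the data and function names are made up for illustration.

```python
import random
from collections import Counter


def fit_stump(sample):
    """Fit a one-feature threshold classifier (a depth-1 tree) on (x, label) pairs."""
    best = None
    for threshold in sorted({x for x, _ in sample}):
        left = [lab for x, lab in sample if x <= threshold]
        right = [lab for x, lab in sample if x > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(lab != left_label for lab in left)
                  + sum(lab != right_label for lab in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label


def bagged_predict(data, x, n_trees=25, seed=0):
    """Train each stump on a bootstrap resample, then take a majority vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_trees):
        sample = [rng.choice(data) for _ in data]   # sample with replacement
        votes.append(fit_stump(sample)(x))          # each tree makes a prediction
    return Counter(votes).most_common(1)[0][0]      # aggregate the votes


data = [(1, "low"), (2, "low"), (3, "low"), (8, "high"), (9, "high"), (10, "high")]
```

Since the trees are trained independently, each one could be built on a different segment and only the votes need to be combined.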
![Page 18: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/18.jpg)
Complete Parallelism
![Page 19: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/19.jpg)
Complete Parallelism – Beyond Data Parallel Tasks
• Data-parallel computation via PL/X libraries only allows us to run ‘n’ models in parallel.
• This works great when we are building one model for each value of the group-by column, but we need parallelized algorithms to be able to build a single model on all the available data.
• For this, we use MADlib – an open source library of parallel in-database machine learning algorithms.
![Page 20: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/20.jpg)
MADlib: Scalable, in-database Machine Learning
http://madlib.net
![Page 21: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/21.jpg)
MADlib In-Database Functions
Predictive Modeling Library
Linear Systems
• Sparse and Dense Solvers
Matrix Factorization
• Singular Value Decomposition (SVD)
• Low-Rank
Generalized Linear Models
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Sandwich Estimators (Huber-White, clustered, marginal effects)
Machine Learning Algorithms
• Principal Component Analysis (PCA)
• Association Rules (Affinity Analysis, Market Basket)
• Topic Modeling (Parallel LDA)
• Decision Trees
• Ensemble Learners (Random Forests)
• Support Vector Machines
• Conditional Random Field (CRF)
• Clustering (K-means)
• Cross Validation
Descriptive Statistics
Sketch-based Estimators
• CountMin (Cormode-Muthukrishnan)
• FM (Flajolet-Martin)
• MFV (Most Frequent Values)
Correlation
Summary
Support Modules
Array Operations
Sparse Vectors
Random Sampling
Probability Functions
![Page 22: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/22.jpg)
Linear Regression: Streaming Algorithm
• Finding linear dependencies between variables
• How to compute with a single scan?
![Page 23: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/23.jpg)
Linear Regression: Parallel Computation
$$X^T y = \sum_i x_i^T y_i$$
![Page 24: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/24.jpg)
Linear Regression: Parallel Computation
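[Diagram: the rows of X and y are split across Segment 1 and Segment 2; each segment computes its partial product and the Master adds them]
$$X^T y = X_1^T y_1 + X_2^T y_2$$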
![Page 25: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/25.jpg)
Linear Regression: Parallel Computation
[Diagram, next build step: Segment 1 and Segment 2 send their partial products to the Master]
$$X_1^T y_1 + X_2^T y_2 = X^T y$$
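The decomposition above is what makes a single-scan, parallel fit possible: each segment accumulates its own $X_k^T X_k$ and $X_k^T y_k$ in one pass, and the master only has to add the small partial results and solve. A minimal plain-Python sketch follows, assuming two segments, a two-coefficient model (intercept plus one feature) and made-up data. The helper names are hypothetical and this is not MADlib's implementation.

```python
def partial_sums(rows):
    """One scan over a segment's rows: accumulate X^T X and X^T y."""
    d = len(rows[0][0])
    xtx = [[0.0] * d for _ in range(d)]
    xty = [0.0] * d
    for x, y in rows:
        for i in range(d):
            xty[i] += x[i] * y
            for j in range(d):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def merge(a, b):
    """Master step: partial results from two segments simply add."""
    (xtx_a, xty_a), (xtx_b, xty_b) = a, b
    xtx = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(xtx_a, xtx_b)]
    xty = [p + q for p, q in zip(xty_a, xty_b)]
    return xtx, xty


def solve_2x2(xtx, xty):
    """Solve (X^T X) w = X^T y for two coefficients using Cramer's rule."""
    (a, b), (c, d) = xtx
    det = a * d - b * c
    return [(xty[0] * d - b * xty[1]) / det, (a * xty[1] - c * xty[0]) / det]


# Made-up data following y = 1 + 2*t, with rows [1, t] split over two segments.
segment_1 = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0)]
segment_2 = [([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0)]

coef = solve_2x2(*merge(partial_sums(segment_1), partial_sums(segment_2)))
```

Only the d-by-d and d-by-1 partial results cross the interconnect, regardless of how many rows each segment holds.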
![Page 26: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/26.jpg)
Performing a linear regression on 10 million rows in seconds
Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of the VLDB Endowment 5.12 (2012): 1700-1711.
![Page 27: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/27.jpg)
Calling MADlib Functions: Fast Training, Scoring
• MADlib allows users to easily create models without moving data out of the system
– Model generation
– Model validation
– Scoring (evaluation of) new data
• All the data can be used in one model
• Built-in functionality to create multiple smaller models (e.g. classification grouped by feature)
• Open-source lets you tweak and extend methods, or build your own
SELECT madlib.linregr_train(      -- MADlib model function
  'houses',                       -- Table containing training data
  'houses_linregr',               -- Table in which to save results
  'price',                        -- Column containing dependent variable
  'ARRAY[1, tax, bath, size]'     -- Features included in the model
);
https://www.youtube.com/watch?v=Gur4FS9gpAg
![Page 28: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/28.jpg)
Calling MADlib Functions: Fast Training, Scoring
SELECT madlib.linregr_train(      -- MADlib model function
  'houses',                       -- Table containing training data
  'houses_linregr',               -- Table in which to save results
  'price',                        -- Column containing dependent variable
  'ARRAY[1, tax, bath, size]',    -- Features included in the model
  'bedroom'                       -- Create multiple output models (one for each value of bedroom)
);
https://www.youtube.com/watch?v=Gur4FS9gpAg
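When these calls are issued from a script, the statement can be assembled from its parts. The helper below is a hypothetical sketch that only builds the SQL string shown on the slide, with the grouping column as an optional fifth argument. It does no escaping or validation and assumes the names come from trusted code.

```python
def linregr_train_sql(source_table, out_table, dependent, features, grouping_col=None):
    """Build the SELECT statement shown on the slide (string assembly only)."""
    args = [source_table, out_table, dependent,
            "ARRAY[1, {}]".format(", ".join(features))]
    if grouping_col is not None:
        args.append(grouping_col)
    quoted = ", ".join("'{}'".format(a) for a in args)
    return "SELECT madlib.linregr_train({});".format(quoted)


single_model = linregr_train_sql("houses", "houses_linregr", "price",
                                 ["tax", "bath", "size"])
per_bedroom = linregr_train_sql("houses", "houses_linregr", "price",
                                ["tax", "bath", "size"], grouping_col="bedroom")
```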
![Page 29: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/29.jpg)
Calling MADlib Functions: Fast Training, Scoring
SELECT madlib.linregr_train('houses', 'houses_linregr', 'price', 'ARRAY[1, tax, bath, size]');
SELECT houses.*,
       madlib.linregr_predict(ARRAY[1, tax, bath, size], m.coef) AS predict   -- MADlib model scoring function
FROM houses,             -- Table with data to be scored
     houses_linregr m;   -- Table containing model
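Scoring a row with a fitted linear model amounts to a dot product between the row's feature vector (with the leading 1 for the intercept) and the coefficient vector. The sketch below shows that arithmetic in plain Python. The function name, the coefficients and the row values are made up for illustration and are not output from MADlib.

```python
def predict_row(features, coef):
    """Score one row: the dot product of its feature vector and the coefficients."""
    if len(features) != len(coef):
        raise ValueError("feature vector and coefficient vector differ in length")
    return sum(f * c for f, c in zip(features, coef))


# Hypothetical coefficients for [intercept, tax, bath, size]; not fitted values.
coef = [10.0, 0.5, 2.0, 0.01]
row = {"tax": 100.0, "bath": 2.0, "size": 1000.0}
predicted = predict_row([1, row["tax"], row["bath"], row["size"]], coef)
```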
![Page 30: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/30.jpg)
Python and R wrappers to MADlib
![Page 31: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/31.jpg)
PivotalR: Bringing MADlib and HAWQ to a familiar R interface
• Challenge: Want to harness the familiarity of R’s interface and the performance & scalability benefits of in-DB analytics
• Simple solution: Translate R code into SQL
PivotalR
d <- db.data.frame("houses")
houses_linregr <- madlib.lm(price ~ tax
                                  + bath
                                  + size
                            , data=d)
SQL Code
SELECT madlib.linregr_train('houses',
                            'houses_linregr',
                            'price',
                            'ARRAY[1, tax, bath, size]');
https://github.com/pivotalsoftware/PivotalR
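To make the "translate R code into SQL" step concrete, here is a toy sketch that turns a purely additive formula into the two string arguments used in the SQL call. It is an illustration only, with a hypothetical function name; PivotalR's real translation handles far more than this and is not reproduced here.

```python
def formula_to_linregr_args(formula):
    """Split 'y ~ a + b' into a dependent column and an ARRAY[...] feature expression."""
    lhs, rhs = formula.split("~")
    features = [term.strip() for term in rhs.split("+")]
    return lhs.strip(), "ARRAY[1, {}]".format(", ".join(features))


dependent, independent = formula_to_linregr_args("price ~ tax + bath + size")
```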
![Page 32: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/32.jpg)
PivotalR: Bringing MADlib and HAWQ to a familiar R interface
• Challenge: Want to harness the familiarity of R’s interface and the performance & scalability benefits of in-DB analytics
• Simple solution: Translate R code into SQL
PivotalR
# Build a regression model with a different
# intercept term for each state
# (state=1 as baseline).
# Note that PivotalR supports automated
# indicator coding a la as.factor()
d <- db.data.frame("houses")
houses_linregr <- madlib.lm(price ~ as.factor(state)
                                  + tax
                                  + bath
                                  + size
                            , data=d)
https://github.com/pivotalsoftware/PivotalR
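Indicator coding replaces a categorical column with one 0/1 column per non-baseline level, so each level gets its own intercept shift relative to the baseline. A minimal Python sketch of that encoding, using made-up state values and a hypothetical helper name:

```python
def indicator_code(values, baseline):
    """Return the non-baseline levels and one 0/1 indicator row per value."""
    levels = sorted(set(values) - {baseline})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# state=1 is the baseline, so it maps to all zeros.
levels, rows = indicator_code([1, 2, 3, 2], baseline=1)
```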
![Page 33: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/33.jpg)
PivotalR Design Overview
[Diagram: PivotalR (no data here) connects through RPostgreSQL to the Database/Hadoop w/ MADlib (data lives here). 1. R → SQL; 2. SQL to execute; 3. Computation results]
• Call MADlib’s in-DB machine learning functions directly from R
• Syntax is analogous to native R function
• Data doesn’t need to leave the database
• All heavy lifting, including model estimation & computation, is done in the database
https://github.com/pivotalsoftware/PivotalR
![Page 34: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/34.jpg)
PyMADlib : Power of MADlib + Flexibility of Python
http://pivotalsoftware.github.io/pymadlib/
[Code screenshots: Linear Regression; Logistic Regression]
Current PyMADlib Algorithms
– Linear Regression
– Logistic Regression
– K-Means
– LDA
Extras
– Support for Categorical variables
– Pivoting
![Page 35: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/35.jpg)
Visualization
![Page 36: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/36.jpg)
Visualization
Commercial Open Source
![Page 37: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/37.jpg)
Hack one when needed – Pandas_via_psql
http://vatsan.github.io/pandas_via_psql/
[Diagram labels: SQL Client, DB]
![Page 38: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/38.jpg)
Integration with Open Source – (Py)Spark Example
![Page 39: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/39.jpg)
Apache Spark Project – Quick Overview
http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf
• Apache Project, originated in AMPLab Berkeley
• Supported on Pivotal Hadoop 2.0!
![Page 40: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/40.jpg)
MapReduce vs. Spark
http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf
![Page 41: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/41.jpg)
Data Parallelism in PySpark – A Simple Example
• Next we’ll take the UCI automobile dataset example from PL/Python and demonstrate how to run in PySpark
![Page 42: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/42.jpg)
Scikit-Learn on PySpark – UCI Auto Dataset Example
• This is in essence similar to the PL/Python example from the earlier slide, except we’re using data stored on HDFS (Pivotal HD) with Spark as the platform in place of HAWQ/Greenplum
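The slide's code is not in this transcript, so the following is a sketch of the pattern under stated assumptions: group records by body style and fit one model per group on the executors. A one-feature least-squares line stands in for the scikit-learn model, records are passed in directly instead of being read from HDFS, and `sc` is assumed to be an existing SparkContext. The function names are hypothetical.

```python
def fit_line(points):
    """Least-squares slope and intercept for (x, y) pairs; stands in for the scikit-learn model."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_per_body_style(sc, records):
    """records: (body_style, (x, y)) pairs; sc: an existing SparkContext.

    groupByKey gathers each body style's rows together, and mapValues
    fits one model per group where those rows live.
    """
    return (sc.parallelize(records)
              .groupByKey()
              .mapValues(lambda pts: fit_line(list(pts)))
              .collectAsMap())
```

The per-group function is ordinary Python, so it can be checked locally before it is handed to Spark.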
![Page 43: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/43.jpg)
Large Scale Topic and Sentiment Analysis of Tweets
Social Media Demo
![Page 44: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/44.jpg)
Pivotal GNIP Decahose Pipeline
[Pipeline diagram. Labeled components: Twitter Decahose (~55 million tweets/day); Source: http, Sink: hdfs; HDFS; PXF External Tables; Parallel Parsing of JSON; Nightly Cron Jobs; Topic Analysis through MADlib pLDA; Unsupervised Sentiment Analysis (PL/Python); D3.js]
http://www.slideshare.net/SrivatsanRamanujam/a-pipeline-for-distributed-topic-and-sentiment-analysis-of-tweets-on-pivotal-greenplum-database
![Page 45: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/45.jpg)
Data Science + Agile = Quick Wins
• The Team
– 1 Data Scientist
– 2 Agile Developers
– 1 Designer (part-time)
– 1 Project Manager (part-time)
• Duration
– 3 weeks!
![Page 46: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/46.jpg)
Live Demo – Topic and Sentiment Analysis
![Page 47: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/47.jpg)
47 Pivotal Confidential–Internal Use Only
Content Based Image Search
CBIR Live Demo
![Page 48: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/48.jpg)
Content Based Information Retrieval - Task
http://blog.pivotal.io/pivotal/features/content-based-image-retrieval-using-pivotal-hd-or-pivotal-greenplum-database
![Page 49: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/49.jpg)
CBIR - Components
http://blog.pivotal.io/pivotal/features/content-based-image-retrieval-using-pivotal-hd-or-pivotal-greenplum-database
![Page 50: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/50.jpg)
Live Demo – Content Based Image Search
http://blog.pivotal.io/pivotal/features/content-based-image-retrieval-using-pivotal-hd-or-pivotal-greenplum-database
![Page 51: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/51.jpg)
Appendix
![Page 52: Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal](https://reader034.fdocuments.in/reader034/viewer/2022051609/547e42dab37959582b8b548c/html5/thumbnails/52.jpg)
Acknowledgements
• Ian Huston, Woo Jung, Sarah Aerni, Gautam Muralidhar, Regunathan Radhakrishnan, Ronert Obst, Hai Qian, MADlib Engineering Team, Sumedh Mungee, Girish Lingappa