District Professional Development Collaboratively Learning About Our New Curriculum
Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced...
Transcript of Data Science in Action · 2015-07-28 · machine learning collaboratively • Have open-sourced...
© Copyright 2014 Pivotal. All rights reserved. © Copyright 2014 Pivotal. All rights reserved.
Data Science in Action Practical use cases that demonstrate how businesses generate value from data
21 March 2014 – SDS2014, Winterthur
© Copyright 2014 Pivotal. All rights reserved.
Introduction to Pivotal Data Labs
• Leading edge tools to implement machine learning collaboratively
• Have open-sourced several of our own tools for wide-spread use
Our Tools
• Parallelized a wide variety of machine learning algorithms for optimum performance on the Pivotal platform
• Agile, test-driven, customer focused
Our Methods
P(A|B) = !P(B|A) P(A)!
P(B) !
num_
depe
nden
ts_0
male_
mem_
1su
b_ge
nder_
mfem
ale_m
em_0
avg_
mem_
age_
56_6
5su
b_ag
e_56
_65
over_
avg_
age
sub_
age_
0_17
avg_
mem_
age_
0_17
avg_
mem_
age_
41_5
5su
b_ag
e_41
_55
num_
depe
nden
ts_1
sub_
age_
18_2
5av
g_me
m_ag
e_18
_25
num_
depe
nden
ts_3
num_
depe
nden
ts_2
sub_
benif
it_typ
e_i_c
urnu
m_de
pend
ents_
4ma
le_me
m_4
male_
mem_
3fem
ale_m
em_4
female
_mem
_3ma
le_me
m_2
female
_mem
_2su
b_reg
ion_1
sub_
age_
66_p
lusav
g_me
m_ag
e_66
_plus
ca_re
siden
tsu
b_reg
ion_2
sub_
region
_3ma
le_me
m_0
female
_mem
_1su
b_ge
nder_
fun
der_a
vg_a
gesu
b_ag
e_26
_40
avg_
mem_
age_
26_4
0me
m_clm
_cnt_
0urb
ansu
b_reg
ion_4 rural
Males Under 25
Males 26−40
Males 41−55 or 66 and Over
Males 56−65
Females 20−40
Females 55 and Over
Married with Children
Married 56−65
Normalized,k=8
• More than 100 customer assignments carried out in the past 18 months
• Ensures quality and best practice in all our assignments
Our Experience
• Analytical workflow aligned with business needs and optimized for speed
• Supports iterative and collaborative working
Our Process
Data Warehouse
Data Sources Structured Data • dbSNP • refSeq
Analysis
Raw Sequence Data Unstructured Unstructured
Data
Raw Sequence Data
Image Data
External Data Sources
Visualization/Quality Reporting
• High caliber global team of machine learning experts from a wide variety of quantitative backgrounds
• Equally capable in coding & statistics
Our Team
© Copyright 2014 Pivotal. All rights reserved.
4%
32%
64%
18%
43%
39%
18%
11%
25%
28%
11%
7%
Pivotal Data Science Team
PhD
By Degree By University By Subject
Bachelor
Masters
Top 25 Other
Top 50
Physics
Math
Statistics & OR
Computer Science
Engineering
Other
© Copyright 2014 Pivotal. All rights reserved.
What does the Pivotal Data Science team do?
Deliver Data Science Labs
• Deliver value for customers by applying best practice Data Science
• Kick-off use and spread of the Pivotal platform
Engagement & Enablement
• Enable customers to build on and extend Data Science Labs
• Train and enable customers, partners and Pivotal teams
• Recognized as thought leaders
IP Development
• Develop leading approaches • Develop and parallelize
algorithms • Develop and file patents
Data Scientists as catalysts for data-driven transformation
© Copyright 2014 Pivotal. All rights reserved.
Data Science in Action
� The Value of Data Science
� The Practice of Data Science
� In-depth Use Case: Traffic Prediction
� Use Case Overviews
� Q&A
© Copyright 2014 Pivotal. All rights reserved.
What do we mean with Data Science?
Complexity
Value of Analytics
($)
Descriptive Analytics
Diagnostic Analytics
Predictive Analytics
Prescriptive Analytics
What happened?
Why did it happen?
What will happen?
How can we make it happen?
© Copyright 2014 Pivotal. All rights reserved.
Big Data & Data Science
Decision = Data + Rules
“Big Data” Data Science
© Copyright 2014 Pivotal. All rights reserved.
Social Media Commercial & Public Data Dark Data
“Big Data”
Opera7onal Data
© Copyright 2014 Pivotal. All rights reserved.
Combining data sources: Example IPSQ (Quality)
Owner: TS Production team Test flags from production line
1 year ~300GB
APDM Owner: TS Production team
Full vehicle history including IPST (technical), IPSL (logistics), IPSQ test flags and all test
results. 30 years ~TBs
FASTA Owner: Aftersales
Dealership electronic tests Identifies early issues with cars
>25TB
IQS: Initial Quality Survey from JD Power Owner: R&D
Survey responses from new owners after 90 days for approx 1700 vehicles
Few thousand lines ~MB
Social Data Owner: R&D
Pulling 500MB per day from Twitter
TQP Owner: Supplier management
PDFs of parts spec sheets ~ 500GB
© Copyright 2014 Pivotal. All rights reserved.
Generating value from data: Car configurator example
Sales
Configurations
Customers
Sales
Configurations
Customers
Basic Recommendation Engine
Sales
Configurations
Customers
Advanced Recommendation Engine
Sales
Configurations
Customers
Yield Optimization Engine
All car elements: • Attribute frequencies (colors etc) • Attribute combination frequencies
For instance: • Browsing history • Usage patterns • Demographic insights
Ideally: • Volumes • Pricing • By market • Linkable to configurations
© Copyright 2014 Pivotal. All rights reserved.
Traditional Systems “Big Data” “Fast Data”
The value of data over time
Time
Value of Data ($)
µs ms s hour day month year yr+
Pivotal Data Science Labs
© Copyright 2014 Pivotal. All rights reserved.
Data Science
� The use of statistical and machine learning techniques on big multi-structured data in a distributed computing environment to identify correlations and causal relationships, classify and predict events, identify patterns and anomalies, and infer probabilities, interest, and sentiment.
� In order to drive automated low latency actions in response to events of interest
© Copyright 2014 Pivotal. All rights reserved.
Why do Data Science?
Time
Return on Analytics
($)
Business Transformation
Innovation
Optimization
Expansion
“low hanging fruit”
© Copyright 2014 Pivotal. All rights reserved.
What transforms businesses today?
� Digitization
� Internet of Things
� Pervasive Computing
� Pervasive Connectivity
© Copyright 2014 Pivotal. All rights reserved.
Example: major paradigm shifts in automotive
Genesis
Not a horse
Mass Production
Mass availability “You can have any
color of car, provided it’s black”
Modern Manufacturing
Brand proliferation You can have
any color
What’s Next?
Connected, autonomous
vehicles
Platform Strategy
Globalization You can have
anything anywhere
1908 1950s 1885 1980s 2020
© Copyright 2014 Pivotal. All rights reserved.
The Connected Car drives innovation
Communication
Handsfree telephony
WiFi hotspot Car2X
comms
Rich media comms
Telematics
eMobility solutions
Car2X solutions
Remote activation
Floating car data
Stolen vehicle recovery Remote
diagnosis
Driver assistance
Real-time parking info
Behaviour monitoring
Navigation Hybrid
Navigation
Vehicle Tracking
Map/PoI updates
Geo fencing
Fleet Management
Responsive PoIs
Real-time traffic info
Concierge services
Next gen Navigation
Social Media
Share my trip
Car sharing
Car search
Traffic updates
eCommerce
City toll
Payment solutions
Parking space reservation
Pay as you drive
Road tolls
Entertainment
Music streaming
Online games
Web radio
VoD Environmental
browsing
© Copyright 2014 Pivotal. All rights reserved.
Data Science in Action
� The Value of Data Science
� The Practice of Data Science
� In-depth Use Case: Traffic Prediction
� Use Case Overviews
� Q&A
© Copyright 2014 Pivotal. All rights reserved.
Data Warehouse
Visualization/Quality Reporting
Traditional Analytics Process Data Sources
Structured Data • Sensors • Flight
recordings
Analysis Scheduling
Analysis
Operations Analysis
Time-to-Insights
sample
forecast
In-memory statistics tool
In-memory optimization tool solution
© Copyright 2014 Pivotal. All rights reserved.
Data Warehouse
Augmenting an analytical architecture Data Sources
Structured Data • Sensors • Flight recordings
Analysis
Raw Sequence Data Unstructured
In Database Analytics
Unstructured Data
Image Data Geolocation Data Voice Transciption
External Data Sources
Visualization/Quality Reporting
Benefits of a new architecture: • Eliminates data movement • Enables rapid data re-processing • Seamless integration of additional
external resources into analyses
Weather Data
Open Gov Data
© Copyright 2014 Pivotal. All rights reserved.
Machine Learning and Big Data Getting the whole picture improves predictive power
Sensor Data
• More data from different sources • Provides a more complete view • Improves statistics and inference
False positive rate
True
pos
itive
rate
Sensor Data
Flight Records
Sensor Data
Geoloc & Weather
Flight Records
Sensor Data
Geoloc & Weather Imaging Flight
Records
© Copyright 2014 Pivotal. All rights reserved.
The main types of use cases in practice
Data Mining Predicting Behavior Optimization
• Categorize types (segmentation)
• Categorize behaviors/usage
• Identify co-occurrences and associations
• Identify anomalies
• Identify attitudes
• Resolve entities
• Predict churn likelihood
• Predict cross/up sales potential
• Predict fraud/waste/abuse likelihood
• Predict performance
• Predict reliability/quality
• Make a “recommendation”
• Optimize processes
• Optimize process parameters
• Optimize asset allocation
© Copyright 2014 Pivotal. All rights reserved.
1 Find Data Platforms • Greenplum DB • Pivotal HD • Hadoop (other) • SAS HPA • AWS
2 Write Code
Editing Tools • Vi/Vim • Emacs • Smultron • TextWrangler • Eclipse • Notepad++ • IPython • Sublime
Languages • SQL • Bash scripting • C • C++ • C# • Java • Python • R
3 Run Code Interfaces • pgAdminIII • psql • psycopg2 • Terminal • Cygwin • Putty • Winscp
4 Write Code for Big Data In-Database • SQL • PL/Python • PL/Java • PL/R • PL/pgSQL
Hadoop • Pig • Hive • Java • HAWQ
5 Implement Algorithms
Libraries • MADlib Java • Mahout R • (Too many to list!) Text • OpenNLP • NLTK • GPText C++ • opencv
Python • numpy • scipy • scikit-learn • Pandas Programs • Alpine Miner • Rstudio • MATLAB • SAS • Stata
6 Show Results
Visualization • python-matplotlib • python-networkx • D3.js • Tableau
• GraphViz • Gephi • R (ggplot2, lattice,
shiny) • Excel
7 Collaborate
Sharing Tools • Chorus • Confluence • Socialcast • Github • Google Drive &
Hangouts
PIVOTAL DATA SCIENCE TOOLKIT
A large and varied tool box!
© Copyright 2014 Pivotal. All rights reserved.
As Data Scientists, what do we want?
Infrastructure Independent
Fast & Scalable
Schema Free
Easy to Use
Real Time
• Open source
• PaaS
• In-database analytics
• MPP
• Hadoop • In-memory data grids embedded into the platform
• SQL, not Java
• Faster than Hive
© Copyright 2014 Pivotal. All rights reserved.
Data Science in Action
� The Value of Data Science
� The Practice of Data Science
� In-depth Use Case: Traffic Prediction
� Use Case Overviews
� Q&A
© Copyright 2014 Pivotal. All rights reserved. © Copyright 2014 Pivotal. All rights reserved.
What does traffic data look like?
© Copyright 2014 Pivotal. All rights reserved.
…like this?
© Copyright 2014 Pivotal. All rights reserved.
…or this?
(Note: This is the least offensive topic
cluster in our Twitter data!)
© Copyright 2014 Pivotal. All rights reserved.
Velocity by Time of Day
© Copyright 2014 Pivotal. All rights reserved.
Distribution of Velocity over Time
© Copyright 2014 Pivotal. All rights reserved.
0 50 100 150 200
0.00
00.
005
0.01
00.
015
0.02
00.
025
0.03
0
dens
ity
Link 1000064869
km/h
Velocity Distribution
© Copyright 2014 Pivotal. All rights reserved.
Find Velocity Groups � Velocity distributions can be fit well
with Gaussians
� An ‘overlay’ of multiple Gaussians is called Gaussian Mixture Model
� GMM fitting of the velocity distribution is done by Expectation-Maximization algorithm
� Shapes and positions of Gaussians determine velocity groups
0 50 100 150 200
0.00
00.
005
0.01
00.
015
0.02
00.
025
0.03
0
dens
ity
CombinedComponent 1Component 2Component 3Component 4
Link 1000064869
km/h
© Copyright 2014 Pivotal. All rights reserved.
Gaussian Mixture Model
0 50 100 150 200
0.000
0.005
0.010
0.015
0.020
0.025
0.030
dens
ity
CombinedComponent 1Component 2Component 3Component 4
Link 1000064869
km/h 0 50 100 150 200
0.00
00.
005
0.01
00.
015
0.02
00.
025
0.03
0
dens
ity
CombinedComponent 1Component 2Component 3Component 4
Link 1000064869
km/h
0 50 100 150 200
0.00
00.
005
0.01
00.
015
0.02
00.
025
0.03
0
dens
ity
CombinedComponent 1Component 2Component 3Component 4
Link 1000064869
km/h
0 50 100 150 200
0.00
00.
005
0.01
00.
015
0.02
00.
025
0.03
0
dens
ity
CombinedComponent 1Component 2Component 3Component 4
Link 1000064869
km/h
© Copyright 2014 Pivotal. All rights reserved.
Decision Trees Example
0 50 100 150 200 250
0.00
0.01
0.02
0.03
0.04
dens
ity
CombinedComponent 1Component 2Component 3
hour >= 14
hour < 20
weekday = 1,2,3,4,5 weekday = 1,2,3,4,5
weekday = 1,2,3,4,5
nextlink = −1,100000
< 14
>= 20
6,7 6,7
6,7
100002
10.47
10.69
10.76
10.81
30.85
20.55
20.73
30.65
20.56
20.63
20.68
30.49
30.73
© Copyright 2014 Pivotal. All rights reserved.
Sneak Peek at our TfL Data Demo � Used the freely
accessible TfL data for a demo
� Shows # of active disruptions over different days in London
Ø Rush hour effects visible Ø Nights are more quiet, but more disruptions on weekend nights
© Copyright 2014 Pivotal. All rights reserved.
Data Science in Action
� The Value of Data Science
� The Practice of Data Science
� In-depth Use Case: Traffic Prediction
� Use Case Overviews – not presented
� Q&A
BUILT FOR THE SPEED OF BUSINESS