Post on 23-Apr-2020
Presented by:
DATA
Big Data & Predictive Analytics
SO YOU WANT TO BE A DATA SCIENTIST
© Data-Magnum 2016
Four Perspectives
Data
Tools
Data Science Skills
Business / Employer
Why Start with Data?
80 % CRISP-DM
All the Hoopla over Hadoop
A Little History (timeline shown: 2002, 2004, 2006, 2008, 2009)
• Google develops proprietary search indexing tool based on Big Table and MapReduce
• Doug Cutting working on open source version of the same, “Nutch”
• Google releases research papers (10/03 and 12/04), read by Cutting and others
• Cutting at Yahoo. Renamed Hadoop. First prototype launched 2006.
• Hadoop becomes open source at the Apache Software Foundation
• Yahoo is first commercial implementation, 2008
• First Hadoop Developers Conference
• Facebook, Twitter, eBay adopt
• Multiple startups spin off to commercialize, incl. Hortonworks, Cloudera, MapR
Some Data is Big – But Not Very Often
1,220 Respondents 72 countries
Rexer Analytics
Respondents reported that their ‘typical’ data set size was:
• 90%: typically < 1 to 100 million records
• 60%: typically < 100,000 to 1 million records
How NoSQL Changed Data Science
Structured: RDBMS
Semi-Structured: Key Value, Document, Column, Graph
Unstructured: Key Value
Insights are: Specific / Directional
Predictive Modeling • Natural Language Processing • Recommenders • IoT • Deep Learning • Data Lakes • Reinforcement Learning
The Tools Perspective
All Those Algorithms Answer Only 5 Questions
1. Is this A or B?
2. Is this weird?
3. How much – or – How many?
4. How is this organized?
5. What should I do next?
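One way to read the five questions is as a lookup from question to algorithm family. The sketch below is illustrative only: the family names (classification, anomaly detection, regression, clustering, reinforcement learning) are the conventional labels for these questions and are an assumption, not text from the slide. `QUESTION_TO_FAMILY` and `family_for` are hypothetical names.

```python
# Hypothetical lookup: the family names are conventional labels,
# assumed here rather than taken from the slide.
QUESTION_TO_FAMILY = {
    "Is this A or B?": "classification",
    "Is this weird?": "anomaly detection",
    "How much – or – How many?": "regression",
    "How is this organized?": "clustering",
    "What should I do next?": "reinforcement learning",
}


def family_for(question: str) -> str:
    """Return the assumed algorithm family for one of the five questions."""
    try:
        return QUESTION_TO_FAMILY[question]
    except KeyError:
        raise ValueError(f"not one of the five questions: {question!r}") from None


if __name__ == "__main__":
    for question, family in QUESTION_TO_FAMILY.items():
        print(f"{question:28} -> {family}")
```

Starting from the question rather than the algorithm name narrows the choice of tools before any modeling begins.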
Three Types of Machine Learning
Supervised: • Have Data • Data Has Labels • Learn by Example
Reinforcement: • No Data • Learn by Trial and Error
Unsupervised: • Have Data • No Labels • Learn by Example • See If There’s a Pattern in There
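The three learning styles can be contrasted with toy code. This is a minimal sketch using only the Python standard library and made-up one-dimensional numbers. The function names (`fit_centroids`, `predict`, `two_means`, `bandit`) are hypothetical, and the specific techniques (nearest-mean classification, 2-means clustering, an epsilon-greedy bandit) are swapped in for illustration rather than taken from the slides.

```python
import random


# --- Labeled data: learn by example -----------------------------------
def fit_centroids(points, labels):
    """Learn one mean per label from labeled 1-D examples."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Assign x the label whose learned mean is nearest."""
    return min(centroids, key=lambda y: abs(centroids[y] - x))


# --- Unlabeled data: see if there is a pattern ------------------------
def two_means(points, iterations=10):
    """Split unlabeled 1-D points into two groups (k-means with k=2)."""
    lo, hi = min(points), max(points)
    if lo == hi:  # all points identical: nothing to split
        return lo, hi
    for _ in range(iterations):
        left = [x for x in points if abs(x - lo) <= abs(x - hi)]
        right = [x for x in points if abs(x - lo) > abs(x - hi)]
        lo, hi = sum(left) / len(left), sum(right) / len(right)
    return lo, hi


# --- No data up front: learn by trial and error -----------------------
def bandit(payout_probs, pulls=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: estimate each arm's value from its own pulls."""
    rng = random.Random(seed)
    values = [0.0] * len(payout_probs)
    counts = [0] * len(payout_probs)
    for _ in range(pulls):
        if rng.random() < epsilon:
            arm = rng.randrange(len(payout_probs))  # explore
        else:
            arm = max(range(len(values)), key=values.__getitem__)  # exploit
        reward = 1.0 if rng.random() < payout_probs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # running mean
    return values


if __name__ == "__main__":
    toy = [1, 2, 9, 10]  # made-up numbers, for illustration only
    model = fit_centroids(toy, ["A", "A", "B", "B"])
    print(predict(model, 3))
    print(two_means(toy))
    print(bandit([0.2, 0.8]))
```

Only the first function is handed labels. The second gets the same numbers without them, and the third starts with no data at all and generates its own by acting.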
Three Types of Machine Learning
Supervised: Decision trees / Random Forest • Naïve Bayes classification • Least squares regression • Logistic regression • Support vector machines • Ensemble methods – Bagging, Boosting, Super Learners • Neural Networks • Linear Genetic Programs
Reinforcement: Q-Learning • PyBrain • Mostly Custom Agents
Unsupervised: Clustering • Centroid-based algorithms • Connectivity-based algorithms • Density-based algorithms • Probabilistic • Dimensionality Reduction • Neural networks / Deep Learning • Principal Component Analysis • Singular Value Decomposition • Independent Component Analysis
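As a concrete instance of one item on the list, least squares regression on a single input has a closed-form solution. The sketch below assumes plain Python lists and a hypothetical function name, `least_squares`. The sample numbers are made up for illustration, and real work would normally use a statistics library instead.

```python
def least_squares(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must contain at least two distinct values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Made-up points lying exactly on y = 2x + 1.
    slope, intercept = least_squares([0, 1, 2, 3], [1, 3, 5, 7])
    print(slope, intercept)
```

The fitted line answers a "How much – or – How many?" style of question: given a new x, predict a number.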
2015 Algorithm Usage
1,220 Respondents 72 countries
Rexer Analytics
R versus Python versus SAS
Which do you prefer to use? Most DS use multiple languages but everyone has a favorite.
Burtch Works www.burtchworks.com/2015/05/21/2015-sas-vs-r-survey-results
The Data Scientist’s Perspective
Data Wrangler
Model Jockey
Data Scientist
What We Do
Types of Data Scientists – Self Described
250 respondents internationally
Source: “Analyzing the Analyzers” by Harris, Murphy, and Vaisman, 2013. http://www.oreilly.com/data/free/analyzing-the-analyzers.csp
• Leader, Businessperson, Entrepreneur
• Jack of all trades, Artist, Hacker
• Developer, Engineer
• Researcher, Scientist, Statistician
What You Need to Know
• Foundational Statistical Theory
– Probability, statistical analysis, sampling theory, hypothesis testing, statistical distributions, correlation, standard deviation, basic regression
• Foundational Programming Skills
– R, SAS, Python, SQL
• Machine Learning
– Supervised and Unsupervised (leave Reinforcement Learning for later)
• Big Data Toolbox
– Hadoop, Spark, how to operationalize predictive models to create business value
Amy Gershkoff, Chief Data Officer, Zynga
The Business or Employer’s Perspective
Two Markets
The Big Web Developers Market
The Core Data Science Market
Banking • Insurance • Mortgage Lending • Brokerage • Telecom • Healthcare • e-commerce • Brick-and-Mortar Retail • Utilities • Manufacturing • Transportation • Education • Government Services
The Opportunity – Good News / Bad News
• Salary Increases as Experience & Responsibility Increase. Median Base $112,000
• 2nd Best Work/Life Balance and Plenty of Openings Going Unfilled
• Market Penetration – 12% in 2012 (Gartner). Guesstimating Maybe 20% to 25% Today.
• Citizen Data Scientists and Fully Automated DS
Summing It Up
• Should you specialize?
• Build 3 competencies (Your Focus)
– Industry
– Business Process (e.g. customer acquisition, fraud detection)
– Tool Sets (languages, analytic platforms, data platforms)
• Have a life. Join a team. Decide where you want to live.
Some additional references
How to Become a Data Scientist http://www.datasciencecentral.com/profiles/blogs/how-to-become-a-data-scientist
So You Want to be a Data Scientist http://www.datasciencecentral.com/profiles/blogs/so-you-want-to-be-a-data-scientist
The New Rules for Becoming a Data Scientist http://www.datasciencecentral.com/profiles/blogs/the-new-rules-for-becoming-a-data-scientist
Become a member (for free) of DataScienceCentral.com. Use the search feature and search for ‘how to become a data scientist’: http://www.datasciencecentral.com/page/search
Join some Meet Ups – Westlake Village Data Science Meet Up 2nd Tuesday of each month at 5:30
Practice on some Kaggle competitions https://www.kaggle.com/
Other Blogs by Bill Vorhies http://www.datasciencecentral.com/profiles/blog/list?user=0h5qapp2gbuf8
Contact Information
Bill Vorhies
President & Chief Data Scientist
Data-Magnum
Bill@Data-Magnum.com
www.Data-Magnum.com
818.257.2035
“I shall find a way or make one.” Admiral Robert Peary