DATA SCIENCE POP UP
AUSTIN
Data Do's and Don'ts: Lessons From the Front Line
Ryan Orban
VP of Product and Strategy,
Data Scientist, Galvanize
ryanorban
#datapopupaustin
April 13, 2016
Galvanize, Austin Campus
Data Do’s and Don’ts: Lessons from the Frontline
Ryan Orban @ryanorban
Co-Founder & CEO, Zipfian Academy
EVP of Product and Strategy, Galvanize
We believe an opportunity belongs to anyone with aptitude and ambition.
Galvanize 2015
NODES ON THE NETWORK
COLORADO (BOULDER, DENVER, FORT COLLINS)
SEATTLE, WA
SAN FRANCISCO, CA
AUSTIN, TX (OPENING Q1 2016)
Programs: Full Stack Immersive, Data Science Immersive, Entrepreneurship
Programs: Full Stack Immersive, Data Science Immersive, Data Engineering Immersive, Masters of Science in Data Science, Entrepreneurship
5 PROGRAMS
• Full Stack Immersive
• Data Science Immersive
• Data Engineering Immersive
• Master of Science in Data Science (University of New Haven)
• Startup Membership
Projected: over 500 student member graduates in 2015
Currently over 1,500 members
PLACEMENT STATS
FULL STACK IMMERSIVE
• Pre-program Salary: $43K
• Average Starting Salary: $77K
• Placement Rate: 97%*
DATA SCIENCE IMMERSIVE
• Pre-program Salary: $72K
• Average Starting Salary: $114K
• Placement Rate: 94%*
*Galvanize is a founding member of NESTA (New Economy Skills Training Association), a trade organization founded to regulate the new “bootcamp” market. This placement rate is calculated 6 months after graduation, using a method more rigorous than that requested by state licensure agencies.
How These Things Relate to Each Other
[Figure: diagram of related skills. Full-Stack Web Development and Data Science are in gray circles. The orange words are the most important things we teach. Labels: Software Engineering, Data Science, Data Analysis, Data Engineering, Machine Learning, Java, Linux, UNIX, Mobile Development, Objective C, C, C++, C#, Web Development, Ruby on Rails, JavaScript, Front-end, PHP, Full-Stack, Excel, Python, SQL, NLP, Hadoop, Databases, Network Analysis, Assembly, Statistics, R]
DATA SCIENCE IMMERSIVE
Week 1 - Exploratory Data Analysis and Software Engineering Best Practices
Week 2 - Statistical Inference, Bayesian Methods, A/B Testing, Multi-Armed Bandit
Week 3 - Regression, Regularization, Gradient Descent
Week 4 - Supervised Machine Learning: Classification, Validation, Ensemble Methods
Week 5 - Clustering, Topic Modeling (NMF, LDA), NLP
Week 6 - Network Analysis, Matrix Factorization, and Time Series
Week 7 - Hadoop, Hive, and MapReduce
Week 8 - Data Visualization with D3.js, Data Products, and Fraud Detection Case Study
Weeks 9-10 - Capstone Projects
Week 12 - Onsite Interviews
Data Manipulation | Model Creation | Prediction
Data Manipulation
Don’t
• Assume your data is friendly
  • ETL and feature engineering are largely opaque to others (and to yourself after enough time away)
Do
• Automate cleaning and transformation pipelines
  • Jupyter and RStudio are great for EDA, but have issues with collaboration and version control
  • Build functional code to be reused; export into plain code files, track with Git
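One way to picture "automate cleaning" and "build functional code to be reused" is a pipeline made of small plain functions kept in an ordinary .py file. This is only a sketch: the field names (`age`, `city`), the helper names, the plausibility range for age, and the sample rows are all hypothetical and not from the talk.

```python
"""clean.py: a small, reusable cleaning pipeline (hypothetical example)."""
from functools import reduce


def strip_whitespace(row):
    """Trim surrounding whitespace from every string value."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}


def parse_age(row):
    """Convert 'age' to int; unparseable or implausible values become None."""
    try:
        age = int(row.get("age"))
    except (TypeError, ValueError):
        age = None
    if age is not None and not 0 <= age <= 120:  # assumed plausible range
        age = None
    return {**row, "age": age}


def normalize_city(row):
    """Title-case the city so 'austin' and 'AUSTIN' compare equal."""
    city = row.get("city")
    return {**row, "city": city.title() if city else None}


# The pipeline is an ordered list of plain functions, so each step can be
# unit-tested on its own and the whole file can be tracked with Git.
PIPELINE = [strip_whitespace, parse_age, normalize_city]


def clean(rows, steps=PIPELINE):
    """Apply every step, in order, to every row."""
    return [reduce(lambda acc, step: step(acc), steps, row) for row in rows]


# Unfriendly input: padded strings, a non-numeric age, a negative age.
raw = [
    {"age": " 34 ", "city": "austin"},
    {"age": "unknown", "city": "  SEATTLE "},
    {"age": "-5", "city": ""},
]
cleaned = clean(raw)
```

Exploration can still happen in a notebook; the notebook would then import `clean` from the file instead of redefining the steps in cells.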
Model Creation
Don’t
• Never use accuracy as your main metric
  • You can have 99% accuracy but 0% predictive power
  • Unbalanced classes; sampling
Do
• Use metrics like precision and recall
  • Aggregate metrics like F1-score, AUC/AIC/BIC also good
  • Remember that models with highest scores are not always the ones you need; permissive vs. conservative based on use case
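The "99% accuracy but 0% predictive power" point, and the permissive vs. conservative trade-off, can be made concrete with a few lines of plain Python. The data below is made up for illustration: 1,000 examples with 10 positives, and a hypothetical list of model scores. Nothing here comes from the talk itself.

```python
def metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs),
            "precision": precision, "recall": recall, "f1": f1}


def predict(scores, threshold):
    """Label an example positive when its score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


# Unbalanced classes: 10 positives followed by 990 negatives.
y_true = [1] * 10 + [0] * 990

# A "model" that always says negative: high accuracy, finds nothing.
lazy = metrics(y_true, [0] * 1000)

# Hypothetical scores from some real model, in the same order as y_true.
scores = ([0.9] * 6 + [0.5] * 3 + [0.1] * 1          # the 10 positives
          + [0.6] * 5 + [0.4] * 20 + [0.05] * 965)   # the 990 negatives

conservative = metrics(y_true, predict(scores, 0.7))  # few false alarms
permissive = metrics(y_true, predict(scores, 0.3))    # few misses

for name, m in [("always negative", lazy),
                ("conservative", conservative),
                ("permissive", permissive)]:
    print(name, {k: round(v, 3) for k, v in m.items()})
```

The always-negative model scores 0.99 accuracy with zero recall. The two thresholds trade precision against recall on the same scores, and which one is "better" depends on whether a missed positive or a false alarm costs more in the use case.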
Don’t
• Don’t start with the most complicated models first (deep learning, gradient boosting, SVMs, etc.)
• Don’t focus on the algorithm
  • “More data always beats better algorithms”
  • But better features usually beat better algorithms*
Do
• Start with a baseline model, then continuously “close the loop”
  • Create a base case to optimize against
  • Does 1% greater F1-score outweigh a 10x training time in production? Not usually unless you’re Google-scale.
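A base case can be as simple as "always predict the training mean". The sketch below uses a regression setting with mean absolute error rather than the F1-score mentioned on the slide, purely to keep the example short. The data, function names and the choice of a one-variable least-squares fit as the "candidate" model are all assumptions for illustration.

```python
from statistics import mean


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return mean([abs(t - p) for t, p in zip(y_true, y_pred)])


def fit_mean_baseline(x_train, y_train):
    """Baseline: ignore the input and always predict the training mean."""
    avg = mean(y_train)
    return lambda x: avg


def fit_simple_linear(x_train, y_train):
    """Candidate: ordinary least squares with a single feature."""
    x_bar, y_bar = mean(x_train), mean(y_train)
    sxx = sum((x - x_bar) ** 2 for x in x_train)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_train, y_train))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept


# Made-up data with a held-out test split.
x_train = [1, 2, 3, 4, 5, 6]
y_train = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
x_test = [7, 8]
y_test = [14.2, 15.8]

models = {
    "mean baseline": fit_mean_baseline(x_train, y_train),
    "simple linear": fit_simple_linear(x_train, y_train),
}

# Score every model the same way so later candidates have a bar to clear.
results = {name: mae(y_test, [model(x) for x in x_test])
           for name, model in models.items()}
improvement = results["mean baseline"] - results["simple linear"]

for name, err in results.items():
    print(f"{name}: MAE = {err:.3f}")
```

Each new model idea then gets added to `models` and judged by how much it improves on the baseline, which makes it easier to ask whether a small gain justifies extra training or serving cost.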
Prediction
Don’t
• Assume your cross-validation metrics will hold up against real-life data
Do
• Separate your application and prediction code
  • Fast iteration cycles are key. Create a “scoring service” that is uncoupled from application code.
  • APIs & service oriented architectures typically work best
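A "scoring service" can be sketched as a tiny WSGI application that accepts JSON features and returns a JSON score, so the application only ever sees the HTTP/JSON contract and the model behind it can be retrained or swapped freely. Everything specific here is hypothetical: the `amount` feature, the hand-written rule standing in for a trained model, and the helper names.

```python
import io
import json


def score(features):
    """Stand-in for a trained model: a hypothetical hand-written rule."""
    risk = 0.8 if features.get("amount", 0) > 1000 else 0.1
    return {"fraud_probability": risk}


def scoring_app(environ, start_response):
    """Minimal WSGI app: JSON object of features in, JSON score out."""
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
        features = json.loads(environ["wsgi.input"].read(length) or b"{}")
        status, body = "200 OK", score(features)
    except (ValueError, AttributeError):
        # Malformed JSON, or JSON that is not an object.
        status, body = "400 Bad Request", {"error": "invalid JSON object"}
    payload = json.dumps(body).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(payload)))])
    return [payload]


def call_locally(app, features):
    """Exercise the app without a network, the way a test client would."""
    raw = json.dumps(features).encode("utf-8")
    environ = {"REQUEST_METHOD": "POST",
               "CONTENT_LENGTH": str(len(raw)),
               "wsgi.input": io.BytesIO(raw)}
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], json.loads(body)


status, result = call_locally(scoring_app, {"amount": 2500})
print(status, result)
```

To try it over HTTP, the standard library's `wsgiref.simple_server.make_server("", 8000, scoring_app).serve_forever()` would serve the same callable. Replacing `score` with a call to a real model changes nothing on the application side.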
Communication
Don’t
• Don’t focus on the “how”, i.e. cover every trial and tribulation
Do
• Cut to the chase
• After a presentation, I always ask the class two questions:
  • What is one sentence that describes what the speaker learned?
  • Why do I care?
TALENT
• Early Access to Students
• Candidate Matching
• Curriculum Development
• Corporate Student Sponsorship
• Diversity
ACCESS
• Membership
• Organic Relationships
• Course Content
• Mentorship
• Community
• Events
EXPERTISE
• Galvanize Experts
• Capstone Projects
• Internship
• Corporate Training
THANK YOU
RYAN ORBAN | EVP, STRATEGY [email protected] @ryanorban
www.galvanize.com
@datapopup #datapopupaustin