Post on 10-Feb-2017
Me:
CTO and Co-founder of Skymind
GU Faculty Advisor
Book Author - Deep Learning: A Practitioner's Approach
Benefits/Trade-offs
Monolithic - one app, easy to update at first
Microservices - meant for scale, modular components, easier for bigger teams
Software Development Life Cycle
Starts small
Early - minimum viable product / move fast and break things
Mid stage - the company will actually last; now focus a bit on scale, and expect growing pains
Late stage - too many cooks in the kitchen; needs separation of concerns
No Silver Bullet - Different stages/sizes
In the SDLC - different incentives for different teams/companies of different sizes
Microservices can also go wrong: http://martinfowler.com/articles/microservice-trade-offs.html
Analytics and Products
A/B Testing (does this button increase my revenue/CTR?)
Data/Analytics Products: Think BI + some machine learning depending on the application
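One rough way to answer the A/B testing question above ("does this button increase my CTR?") is a two-proportion z-test. The sketch below is only an illustration under stated assumptions: the function name and the click/view counts are made up, and a real test would also need a sample size and stopping rule decided up front.

```python
import math


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two_sided_p) for the difference in click-through rates."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled rate under the null hypothesis that both variants share one CTR.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: variant A is the old button, variant B is the new one.
z, p = two_proportion_z_test(clicks_a=200, views_a=10_000,
                             clicks_b=260, views_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference in CTR is unlikely to be noise; it says nothing about whether the lift is large enough to matter for revenue.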
Machine Learning
Given some observations make some inference based on trends in data
Label stuff (classification)
Predict something (regression)
Group stuff (clustering)
Regression
Given some attributes (features) predict some continuous value
Attributes of house - predict price
Pricing movements in stock market
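As a minimal sketch of the house price case, here is ordinary least squares with a single feature in plain Python. The data points, the square_feet feature, and the fit_line helper are all made up for illustration; a real model would use more features and a library.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is covariance(x, y) divided by variance(x).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data: one attribute (square feet) and the sale price.
square_feet = [1000, 1500, 2000, 2500]
prices = [200_000, 290_000, 410_000, 500_000]

slope, intercept = fit_line(square_feet, prices)

# Predict a continuous value for a house the model has not seen.
predicted_price = slope * 1800 + intercept
print(f"predicted price for 1800 sq ft: {predicted_price:.0f}")
```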
Classification
Churn prediction (will churn or not churn)
Big spender or not big spender
Spam or not spam
Fraud or not fraud
Picture of? (cat/dog/cow)
Exploratory Data Analysis
Extract, Transform, Load
Normalize (may be part of the loading process, depending on the data warehousing process, if any)
Visualize
Determine a way to solve the problem, if any
Get some cursory results
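The normalize step listed above could look something like the following. This is a sketch of one common choice (z-score standardization), not necessarily the one the talk had in mind; the standardize helper and the sample values are hypothetical.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Made-up column of house sizes in square feet.
scaled = standardize([1000, 1500, 2000, 2500])
print(scaled)
```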
Parallels to software engineering
Both have a lifecycle to follow, with different standards for different stages
Data science teams are also similar: you start by building the data infrastructure and add the analysis later
Machine Learning + Software
Machine learning is integrated into software
Often a second class citizen in engineering standards
Data scientists don’t often think about production
ML Software
Data pipelines (the complete path of the data all the way through to the model) are often messy/ad hoc
Doing it right involves 2+ teams: data engineers AND data scientists
Data engineers often don’t know much ML, and data scientists often don’t know much about software infrastructure
Components
ETL/Vectorization
Data Manipulation
Data Exploration
Model building
Model integration with serving system (lambda architecture)
Should be separated
ETL not tied to machine learning models
Everything should be swappable
Should run interchangeably on different platforms
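One way to picture "separated" and "swappable" is to treat each component as a plain function passed into the pipeline, so the ETL code never imports a model. The sketch below is only an illustration: build_pipeline, the field names, and both toy models are hypothetical, and a real serving system would put these stages behind service boundaries.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, float]
Vector = List[float]


def build_pipeline(
    extract: Callable[[], Iterable[Row]],
    vectorize: Callable[[Row], Vector],
    predict: Callable[[Vector], str],
) -> Callable[[], List[str]]:
    """Wire three independent stages together without coupling them."""
    def run() -> List[str]:
        return [predict(vectorize(row)) for row in extract()]
    return run


def extract_from_memory() -> Iterable[Row]:
    # Stand-in for the ETL stage; made-up records.
    return [
        {"logins_last_30d": 1.0, "support_tickets": 4.0},
        {"logins_last_30d": 25.0, "support_tickets": 0.0},
    ]


def vectorize(row: Row) -> Vector:
    # Vectorization: turn a record into a fixed-order feature vector.
    return [row["logins_last_30d"], row["support_tickets"]]


def rule_model(vector: Vector) -> str:
    # Toy "model": few logins means likely churn.
    return "churn" if vector[0] < 5 else "stay"


def always_stay_model(vector: Vector) -> str:
    # A second toy model, swapped in without touching ETL or vectorization.
    return "stay"


pipeline_a = build_pipeline(extract_from_memory, vectorize, rule_model)
pipeline_b = build_pipeline(extract_from_memory, vectorize, always_stay_model)
print(pipeline_a())
print(pipeline_b())
```

Because the model is just an argument, replacing it (or the extract step) does not require changes anywhere else in the pipeline.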