Data Science: The Product Manager's Primer
Transcript of Data Science: The Product Manager's Primer
by Andrew Koller and Doron Bergman
/Productschool
@ProdSchool
/ProductmanagementSF
Who we are
Andrew Koller
Five years of experience as an entrepreneur and product manager, including a background in statistical physics modeling and data science.
Doron Bergman
PhD in theoretical solid state physics. Three years of experience in large tech companies and startups.
Overview
Everything you need to know to understand the world of data science in startups
➔ Back to the Basics
An overview of statistics and the mathematical basis
➔ Data Science and AI
How data science differs from statistics and makes predictions in the real world
➔ Putting it all together with an example
Provide a simple unifying message for what is to come
➔ Data Scientist perspectives
How to understand objections your data scientists might have and how to be their hero
NOTE: Find this presentation at tiny.cc/ProductSchoolDataScience.
How popular is Data Science really?
Really Popular! (Especially in the Bay Area)
Q: What is data science and AI? Where do I start?
A: Statistics
Statistical Background
A quick review and overview of the statistics you’ll need to communicate with your team
➔ Regression
Describing a relationship between two variables.
➔ Confidence
Understanding, to some measure, how valid and strong the result is.
Q: What is statistics and why are you talking about that? I’m here for data science.
A: Statistics is a way to describe and make generalizations about a population of data. It forms the basis of Data Science.
Regression is the description of a variety of data points with a function. The most basic form is linear regression, in the form of y = Mx + b.
Money and Wins
Money might not be able to buy happiness, but it sure seems to buy wins in Major League Baseball.
Y = M * x + b
Wins = 0.153 wins/$Million * Payroll (in $Million) + 66.4 Wins
Or about $6.5 Million per win
Actual calculation will be left as an exercise
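As a rough sketch of that exercise, the fitted line above can be turned into two small functions. The coefficients are the ones quoted on the slide; the function names and the $100 million payroll are hypothetical, chosen only for illustration.

```python
# Coefficients quoted on the slide: wins per $ million of payroll, and baseline wins.
SLOPE_WINS_PER_MILLION = 0.153
INTERCEPT_WINS = 66.4


def predicted_wins(payroll_millions: float) -> float:
    """Wins predicted by the line Wins = M * Payroll + b."""
    return SLOPE_WINS_PER_MILLION * payroll_millions + INTERCEPT_WINS


def cost_per_extra_win_millions() -> float:
    """Extra payroll (in $ millions) that raises the prediction by one win."""
    return 1.0 / SLOPE_WINS_PER_MILLION


if __name__ == "__main__":
    # Hypothetical payroll of $100 million, not a real team.
    print(round(predicted_wins(100.0), 1))
    print(round(cost_per_extra_win_millions(), 2))
```

The cost per win is just the inverse of the slope, which is where the figure of about $6.5 million comes from.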
Image source: New York Times, 2010
Q: How well does a linear relationship apply to everything?
A: Pretty well*
US Population
From 1650 to 1850 the US population grew nonlinearly
Y = A * (b * x)^n
Sadly not linear... or is it?
Image source: http://onlinestatbook.com/2/transformations/tukey.html
US Population
But after taking the log of the population, the function is
log(Y) = M * x + b
Image source: http://onlinestatbook.com/2/transformations/tukey.html
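A minimal sketch of the log trick, under a stated assumption: the data grows exponentially as y = A * b^x. The values of A and b and the "years" are synthetic, not census figures, and fit_line is a hypothetical helper written here for illustration.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x


# Synthetic exponential growth (an assumption, NOT real population data).
A, b = 2.0, 1.03
years = list(range(0, 200, 20))
population = [A * b ** x for x in years]

# Fit a straight line to log(y) instead of y.
slope, intercept = fit_line(years, [math.log(y) for y in population])

# Undo the log to recover the growth parameters.
recovered_b = math.exp(slope)
recovered_A = math.exp(intercept)

if __name__ == "__main__":
    print(recovered_A, recovered_b)
```

Because log(A * b^x) = log(A) + x * log(b), the transformed data lies on a straight line and ordinary linear regression applies.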
Many datasets can be made linear after just one transformation
Q: What if that isn’t good enough?
The next step beyond linear regression is called polynomial regression, in the form of y = Nx^2 + Mx + b or higher order. Each additional term lets the curve follow the data more closely.
More general (math nerd) form: y = ∑ M_n x^n
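To make the general form concrete, here is a small sketch that evaluates y = ∑ M_n x^n for a list of coefficients. The function name and the coefficient values are hypothetical, picked only to show that the linear and quadratic forms are special cases.

```python
def polynomial(coefficients, x):
    """Evaluate y = sum of M_n * x**n, where coefficients[n] is M_n."""
    return sum(m_n * x ** n for n, m_n in enumerate(coefficients))


# Linear regression y = Mx + b is the two-term case: [b, M].
line = [1.0, 2.0]

# Polynomial regression y = Nx^2 + Mx + b adds one more term: [b, M, N].
parabola = [1.0, 2.0, 0.5]

if __name__ == "__main__":
    print(polynomial(line, 3.0))
    print(polynomial(parabola, 3.0))
```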
Q: How accurate is my regression line? How is it measured?
Accuracy (or alternatively error) is measured by taking the difference between the actual value and the value predicted by the regression. This is called the residual. Data scientists often express residuals as residuals squared.
A’s have a positive residual
Mets have a negative residual
Which team would you rather be?
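A short sketch of the residual calculation, using the payroll line quoted earlier in the deck. The two teams below are made up for illustration; their payrolls and win totals are not real figures for the A’s, the Mets, or anyone else.

```python
def predicted_wins(payroll_millions):
    """The regression line quoted in the deck: Wins = 0.153 * Payroll + 66.4."""
    return 0.153 * payroll_millions + 66.4


def residual(actual_wins, payroll_millions):
    """Actual minus predicted: positive means more wins than the payroll suggests."""
    return actual_wins - predicted_wins(payroll_millions)


# Hypothetical teams (invented numbers, for illustration only).
thrifty_team = residual(actual_wins=90, payroll_millions=60)
spendy_team = residual(actual_wins=75, payroll_millions=140)

if __name__ == "__main__":
    for name, r in [("thrifty", thrifty_team), ("spendy", spendy_team)]:
        print(name, round(r, 2), round(r ** 2, 2))
```

Squaring the residual removes the sign, so over-performing and under-performing points both count as error when judging the fit.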
Great! So now we can make predictions right?
Right?
Data Science
Understand how data science builds off of statistics to make predictions and power some of the most common uses.
➔ Numeric Predictions
Describing a relationship between two variables.
➔ Categorization
Understanding, to some measure, how valid and strong the result is.
Statistics primarily DESCRIBE data sets, but are not set up to PREDICT the values.
Example, please???
Which line on the right is the “best”, and if there were another point in the set, where do you think it might be?
Statistics says the black line is the best.
Human intuition thinks the green line might be better.
But we still don’t know.
Another example:
The graphs to the right show an increasing number of polynomial terms used to fit data on house size vs price.
Adapted from http://www.astroml.org/sklearn_tutorial/practical.html
Q: I think I get it. What do we do about this problem?
Data scientists divide the data into two sets: Train and Test
The training set is used to set up the model, aka fitting parameters.
The test set is used to measure how well the model predicts the value of data it has not seen.
Comparing the difference between actual and predicted values - residuals - indicates whether your model is down for the count …
Or worthy of a championship
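The train and test idea can be sketched in a few lines. Everything here is an assumption made for illustration: the data is synthetic (y = 3x + 2 with a small alternating wobble), the helper names are hypothetical, and every fourth point is held out so the result is reproducible. Real projects usually shuffle the data and split at random.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x


def mean_squared_residual(xs, ys, m, b):
    """Average of (actual - predicted)^2 over the given points."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data, not from the deck.
xs = list(range(20))
ys = [3 * x + 2 + (0.5 if x % 2 == 0 else -0.5) for x in xs]

# Hold out every fourth point as the test set; train on the rest.
train_x = [x for x in xs if x % 4 != 3]
train_y = [ys[x] for x in train_x]
test_x = [x for x in xs if x % 4 == 3]
test_y = [ys[x] for x in test_x]

# Fit on the training set only, then score both sets.
m, b = fit_line(train_x, train_y)
train_error = mean_squared_residual(train_x, train_y, m, b)
test_error = mean_squared_residual(test_x, test_y, m, b)

if __name__ == "__main__":
    print(m, b, train_error, test_error)
```

The model never sees the test points while fitting, so the test error is the more honest estimate of how it will do on new data.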
Great. I now understand the data science process, but it’s not yet magic. What else can it do?
Using a model based on parameters, a computer can group items into categories and make choices
Uhhh OK… Example Please?
Problem:
We want to predict if a stranger at the gate of our castle is a Stark or a Lannister.
Should we trust them?
We don’t want to get killed.
Name    Eye Color    Hair Color    Stark
Ned     Grey         Dark Brown    Y
Robb    Blue         Dark Brown    Y
Sansa   Blue         Red           Y
Arya    Grey         Brown         Y
Bran    Brown        Brown         Y
Rickon  Blue         Brown         Y
Lyanna  Blue         Brown         Y
Benjen  Brown        Brown         Y
Tywin   Green        Blonde        N
Tyrion  Green/Black  Dirty Blonde  N
Jamie   Green        Brown         N
Cersei  Green        Blonde        N
Name    Eye Color    Eye Number  Hair Color    Hair Number  Stark
Ned     Grey         4           Dark Brown    5            1
Robb    Blue         2           Dark Brown    5            1
Sansa   Blue         2           Red           2            1
Arya    Grey         4           Brown         3            1
Bran    Brown        3           Brown         3            1
Rickon  Blue         2           Brown         3            1
Lyanna  Blue         2           Brown         3            1
Benjen  Brown        3           Brown         3            1
Tywin   Green        1           Blonde        1            0
Tyrion  Green/Black  1.5         Dirty Blonde  2            0
Jamie   Green        1           Brown         3            0
Cersei  Green        1           Blonde        1            0
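The step from the first table to the second is an encoding: each color is replaced by a number so a regression can use it. A minimal sketch, with the mappings read off the table above (the dictionary and function names are hypothetical):

```python
# Color-to-number mappings as they appear in the table above.
EYE_NUMBER = {"Green": 1, "Green/Black": 1.5, "Blue": 2, "Brown": 3, "Grey": 4}
HAIR_NUMBER = {"Blonde": 1, "Dirty Blonde": 2, "Red": 2, "Brown": 3, "Dark Brown": 5}


def encode(eye_color, hair_color, is_stark):
    """Turn one row of the first table into (eye number, hair number, label)."""
    label = 1 if is_stark == "Y" else 0
    return EYE_NUMBER[eye_color], HAIR_NUMBER[hair_color], label


if __name__ == "__main__":
    print(encode("Grey", "Dark Brown", "Y"))           # Ned
    print(encode("Green/Black", "Dirty Blonde", "N"))  # Tyrion
```

The numbers are arbitrary labels chosen by whoever builds the model, so a different assignment would produce a different fitted equation.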
Training
Name    Eye Number  Hair Number  Stark
Ned     4           5            1
Sansa   2           5            1
Bran    3           2            1
Rickon  2           3            1
Tyrion  1.5         2            0
Jamie   1           3            0

Testing
Name    Eye Number  Hair Number  Stark
Robb    2           5            1
Arya    4           3            1
Tywin   1           1            0
Cersei  1           1            0
Training
Name    Eye Number  Hair Number  Stark  Model Outcome  Residuals Squared
Ned     4           5            1      1.3235         0.10465225
Sansa   2           5            1      0.7275         0.07425625
Bran    3           2            1      0.7786         0.04901796
Rickon  2           3            1      0.5629         0.19105641
Tyrion  1.5         2            0      0.3316         0.10995856
Jamie   1           3            0      0.2649         0.07017201
Stark = 0.298 (eye number) + 0.0823 (hair number) - 0.28
Test
Name    Eye Number  Hair Number  Stark  Model Outcome  Residuals Squared
Robb    2           5            1      0.7275         0.07425625
Arya    4           3            1      1.1589         0.02524921
Tywin   1           1            0      0.1003         0.01006009
Cersei  1           1            0      0.1003         0.01006009
Average Residual Squared
Training: 0.0998
Test: 0.0299
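The numbers in the two result tables can be reproduced with a short sketch. The equation and the rows come from the slides; the function names are hypothetical, and the 0.5 cutoff used to turn a score into a Stark or not-Stark decision is an assumption added here, since the deck does not state one.

```python
def stark_score(eye_number, hair_number):
    """The fitted equation from the slide."""
    return 0.298 * eye_number + 0.0823 * hair_number - 0.28


# Rows as (name, eye number, hair number, Stark label), copied from the slides.
TRAINING = [("Ned", 4, 5, 1), ("Sansa", 2, 5, 1), ("Bran", 3, 2, 1),
            ("Rickon", 2, 3, 1), ("Tyrion", 1.5, 2, 0), ("Jamie", 1, 3, 0)]
TEST = [("Robb", 2, 5, 1), ("Arya", 4, 3, 1), ("Tywin", 1, 1, 0), ("Cersei", 1, 1, 0)]


def average_residual_squared(rows):
    """Mean of (label - model outcome)^2 over the rows."""
    total = sum((label - stark_score(eye, hair)) ** 2 for _, eye, hair, label in rows)
    return total / len(rows)


def looks_like_stark(eye_number, hair_number, threshold=0.5):
    """Assumed decision rule: call it a Stark when the score reaches the threshold."""
    return stark_score(eye_number, hair_number) >= threshold


if __name__ == "__main__":
    print(round(average_residual_squared(TRAINING), 4))
    print(round(average_residual_squared(TEST), 4))
    for name, eye, hair, label in TEST:
        print(name, looks_like_stark(eye, hair), bool(label))
```

With the assumed 0.5 cutoff, every person in both tables is sorted into the correct house, and the averages agree with the slide's figures up to rounding in the last digit.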
Working with data scientists
How to have better two-way conversations with your data science team and handle objections
➔ Data cleanliness
Describing a relationship between two variables.
➔ Model fit
Understanding, to some measure, how valid and strong the result is.
“The data just isn’t clean enough to work with”
But it’s in a database. Isn’t that good enough?
Real world data is never as clean as one would hope. There is always the danger of missing fields, mistyped entries, previous wrong answers, etc.
Solutions:
● Removing bad data columns and checking the effect on the user
● Making assumptions about missing data
● Spending time tracking down better data
● Finding alternative sources for data
Always check the effect that a change in the data has on the user experience.
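Two of those options, removing bad records and making an assumption about missing values, can be sketched as follows. The house records, field names, and helper functions are all hypothetical, and filling a gap with the mean is only one possible assumption.

```python
# Hypothetical house records; None marks a missing field.
records = [
    {"size_sqft": 1000, "price_k": 200},
    {"size_sqft": None, "price_k": 310},
    {"size_sqft": 2000, "price_k": 400},
    {"size_sqft": 1500, "price_k": 290},
]


def drop_missing(rows, field):
    """Option 1: remove every record where the field is missing."""
    return [row for row in rows if row[field] is not None]


def fill_missing_with_mean(rows, field):
    """Option 2: assume a missing value equals the mean of the known values."""
    known = [row[field] for row in rows if row[field] is not None]
    mean = sum(known) / len(known)
    return [{**row, field: mean if row[field] is None else row[field]} for row in rows]


dropped = drop_missing(records, "size_sqft")
filled = fill_missing_with_mean(records, "size_sqft")

if __name__ == "__main__":
    print(len(dropped), filled[1])
```

Either choice changes what the model sees, which is why the effect on the user experience should be checked afterwards.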
“We can’t ship. The learning curve is broken.”
Going back to our housing problem will help us identify what might be going on.
A learning curve shows how the model performs on the training and test data sets as the size of the training set increases.
Learning curves are used to understand the basic characteristics of the model fit.
A learning curve can show bias, meaning the test and training sets both give similar answers but the answers are incorrect.
Data scientists can often fix this problem with more work.
The other issue is called variance, meaning the test and training sets give different answers.
This type of issue is the hardest for your data science team to deal with.
Some solutions to variance or underfit might be:
● Increasing the amount of data
● Figuring out the exact impact on users
● Increasing complexity of the model
  ○ Higher order model
  ○ More variables
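A learning curve can be computed by refitting the model on larger and larger slices of the training data and recording both errors each time. The sketch below makes several assumptions for illustration: the data is synthetic (y = 3x + 2 with a small alternating wobble), the model is a straight line, and all helper names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x


def mean_squared_residual(xs, ys, m, b):
    """Average of (actual - predicted)^2 over the given points."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def learning_curve(pool, test, sizes):
    """Return (training size, train error, test error) for each size."""
    test_x, test_y = zip(*test)
    curve = []
    for n in sizes:
        train_x, train_y = zip(*pool[:n])  # first n training points
        m, b = fit_line(train_x, train_y)
        curve.append((n,
                      mean_squared_residual(train_x, train_y, m, b),
                      mean_squared_residual(test_x, test_y, m, b)))
    return curve


# Synthetic data, not from the deck; every fourth point is held out for testing.
points = [(x, 3 * x + 2 + (0.5 if x % 2 == 0 else -0.5)) for x in range(20)]
test = [p for p in points if p[0] % 4 == 3]
pool = [p for p in points if p[0] % 4 != 3]

curve = learning_curve(pool, test, sizes=[2, 5, 10, 15])

if __name__ == "__main__":
    for n, train_error, test_error in curve:
        print(n, round(train_error, 4), round(test_error, 4))
```

With only two training points the line fits them perfectly but misses the test points badly, the large gap described above as variance. As more data is added, the two errors move toward each other.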
“This is cool. Where can I learn more?”
Bibliography
For more information, there are many great resources online that give overviews as well as in-depth info made for people who want to get into data science
➔ Andrew Ng’s Course
Machine Learning on Coursera
➔ Coursera
◆ Machine Learning Foundations: A Case Study Approach
◆ A Crash Course in Data Science
➔ Kaggle
Hosts competitions and datasets. Also has a tutorial to walk through a machine learning example
➔ scikit-learn Documentation
Documentation for the most popular machine learning library
Upcoming Courses
San Francisco
Weeknights: September 6th
Weekends: September 10th
Apply at www.productschool.com
Upcoming Workshops
RSVP on Eventbrite
August 3: From Building Products to Managing Them
August 10: Coding For Non Coders
August 17: Product Owners: How to Get Your Development Team to Love You
August 24: PM Life at an Early Stage Startup