Data Science: The Product Manager's Primer

by Andrew Koller and Doron Bergman / Productschool / @ProdSchool / ProductmanagementSF

Transcript of Data Science: The Product Manager's Primer

Page 1: Data Science: The Product Manager's Primer

Data Science: The Product Manager's Primer
by Andrew Koller and Doron Bergman

/Productschool

@ProdSchool

/ProductmanagementSF

Page 2: Data Science: The Product Manager's Primer

Who we are

Andrew Koller

Five years of experience as an entrepreneur and product manager, including a background in statistical physics modeling and data science.

Doron Bergman
PhD in theoretical solid state physics. Three years of experience at large tech companies and startups.

Page 3: Data Science: The Product Manager's Primer

Overview
Everything you need to know to understand the world of data science in startups

➔ Back to the Basics
An overview of statistics and the mathematical basis

➔ Data Science and AI
How data science differs from statistics and makes predictions in the real world

➔ Putting it all together with an example
Provide a simple unifying message for what is to come

➔ Data Scientist perspectives
How to understand objections your data scientists might have and how to be their hero

NOTE: Find this presentation at tiny.cc/ProductSchoolDataScience.

Page 4: Data Science: The Product Manager's Primer

How popular is Data Science really?

Page 5: Data Science: The Product Manager's Primer

Really Popular! (Especially in the Bay Area)

Page 6: Data Science: The Product Manager's Primer

Data Science

Page 7: Data Science: The Product Manager's Primer

Data Science!

Page 8: Data Science: The Product Manager's Primer

Q: What is data science and AI? Where do I start?
A: Statistics

Page 9: Data Science: The Product Manager's Primer

Statistical Background
A quick review and overview of the statistics you’ll need to communicate with your team

➔ Regression
Describing a relationship between two variables.

➔ Confidence
Understanding, to some measure, how valid and strong the result is.

Page 10: Data Science: The Product Manager's Primer

Q: What is statistics and why are you talking about that? I’m here for data science.

Page 11: Data Science: The Product Manager's Primer

A: Statistics is a way to describe and make generalizations about a population of data. It forms the basis of Data Science.

Page 12: Data Science: The Product Manager's Primer

Regression describes a set of data points with a function. The most basic form is linear regression, which fits a straight line of the form y = Mx + b.
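To make the idea concrete, here is a minimal sketch of fitting y = Mx + b by least squares. The five data points are hypothetical and chosen only for illustration; the formulas for M and b are the standard closed-form solution for a single input variable.

```python
import numpy as np

# Hypothetical (x, y) points that lie roughly along a straight line.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# Closed-form least squares for one variable:
#   M = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   b = mean_y - M * mean_x
M = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - M * x.mean()

print(f"y = {M:.2f} x + {b:.2f}")
```

In practice a data scientist would usually call a library routine rather than write the formulas out, but the result is the same line.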

Page 13: Data Science: The Product Manager's Primer

Money and Wins
Money might not be able to buy happiness, but it sure seems to buy wins in Major League Baseball

Y = M * x + b

Wins = 0.153 wins/$Million * Payroll (in $Million) + 66.4 Wins

Or about $6.5 Million per win

Actual calculation will be left as an exercise

Image source: New York Times, 2010
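The "$6.5 Million per win" figure follows directly from the slope: if each extra $1 million of payroll adds 0.153 wins, one extra win costs 1 / 0.153 million dollars. A short sketch of that arithmetic, using the slope and intercept from the slide (the $100 million payroll is a hypothetical input):

```python
SLOPE_WINS_PER_MILLION = 0.153  # from the slide
INTERCEPT_WINS = 66.4           # from the slide


def predicted_wins(payroll_millions: float) -> float:
    """Wins predicted by the regression line for a given payroll."""
    return SLOPE_WINS_PER_MILLION * payroll_millions + INTERCEPT_WINS


# Cost of one additional win is the inverse of the slope.
cost_per_extra_win = 1.0 / SLOPE_WINS_PER_MILLION

print(f"About ${cost_per_extra_win:.2f} million per extra win")
print(f"Predicted wins at a hypothetical $100M payroll: {predicted_wins(100.0):.1f}")
```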

Page 14: Data Science: The Product Manager's Primer

Q: How well does a linear relationship apply to everything?
A: Pretty well*

Page 15: Data Science: The Product Manager's Primer

US Population
From 1650 to 1850 the US population grew nonlinearly

Y = A * (b * x)^n

Sadly not linear... or is it?

Image source: http://onlinestatbook.com/2/transformations/tukey.html

Page 16: Data Science: The Product Manager's Primer

US Population
But after taking the log of the population, the function is

Y = M * x + b

Image source: http://onlinestatbook.com/2/transformations/tukey.html
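The log trick can be sketched in a few lines. This example assumes exponential growth of the form Y = A * r^x, which is the case where taking the log of Y alone gives a straight line in x (a pure power law in x would need a log on both axes instead). The series below is synthetic, not the real census data.

```python
import numpy as np

# Hypothetical series growing 3% per year from a starting value of 50,000.
years = np.arange(0, 200, 10, dtype=float)
population = 50_000.0 * 1.03 ** years

# Fit a straight line to log(population) versus year.
M, b = np.polyfit(years, np.log(population), 1)

# Undo the transformation to read the parameters in the original units.
growth_rate = np.exp(M) - 1.0  # per-year growth
start_value = np.exp(b)        # population at year 0

print(f"growth per year: {growth_rate:.3f}, starting population: {start_value:.0f}")
```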

Page 17: Data Science: The Product Manager's Primer

Many datasets can be made linear after just one transformation
Q: What if that isn’t good enough?

Page 18: Data Science: The Product Manager's Primer

The next step beyond linear regression is called polynomial regression, in the form y = Nx^2 + Mx + b or higher order. Each additional term lets the curve follow the existing data more closely.

More general (math nerd) form: y = ∑ M_n x^n
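As an illustration, the sketch below fits polynomials of degree 1, 2, and 3 to the same points and reports the error on those points. The data is synthetic (a quadratic plus noise) and the helper name `training_error` is made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a quadratic trend plus a little noise.
x = np.linspace(-1.0, 1.0, 30)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.1, size=x.size)


def training_error(degree: int) -> float:
    """Mean squared residual of a degree-`degree` polynomial fit on (x, y)."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    return float(np.mean(residuals ** 2))


errors = {degree: training_error(degree) for degree in (1, 2, 3)}
for degree, err in errors.items():
    print(f"degree {degree}: mean squared residual {err:.4f}")
```

The error on the fitted points can only go down as terms are added. Whether the extra terms also help on new data is a separate question.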

Page 19: Data Science: The Product Manager's Primer

Q: How accurate is my regression line? How is it measured?

Page 20: Data Science: The Product Manager's Primer

Accuracy (or alternatively error) is measured by taking the distance between the actual value and the value predicted by the regression. This distance is called the residual. Data scientists often express residuals as residuals squared.

A’s have a positive residual

Mets have a negative residual

Which team would you rather be?
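A residual is simply actual minus predicted, so a team above the line has a positive residual. The sketch below uses the regression line from the payroll slide; the two teams and their numbers are hypothetical stand-ins, not the real A's or Mets figures.

```python
def predicted_wins(payroll_millions: float) -> float:
    """Regression line from the payroll slide."""
    return 0.153 * payroll_millions + 66.4


# Hypothetical teams: (payroll in $ millions, actual wins).
teams = {
    "Team A": (60.0, 85),   # low payroll, many wins
    "Team B": (130.0, 79),  # high payroll, fewer wins
}

residuals = {
    name: wins - predicted_wins(payroll)
    for name, (payroll, wins) in teams.items()
}
residuals_squared = {name: r ** 2 for name, r in residuals.items()}

for name in teams:
    print(f"{name}: residual {residuals[name]:+.2f}, squared {residuals_squared[name]:.2f}")
```

Squaring removes the sign, so large misses in either direction count equally against the model.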

Page 21: Data Science: The Product Manager's Primer

Great! So now we can make predictions right?

Right?

Page 22: Data Science: The Product Manager's Primer

Data Science
Understand how data science builds on statistics to make predictions and power some of the most common uses.

➔ Numeric Predictions
Describing a relationship between two variables.

➔ Categorization
Understanding, to some measure, how valid and strong the result is.

Page 23: Data Science: The Product Manager's Primer

Statistics primarily DESCRIBE data sets, but are not set up to PREDICT the values.
Example, please???

Page 24: Data Science: The Product Manager's Primer

Which line on the right is the “best”, and if there were another point in the set, where do you think it might be?

Statistics says the black line is the best.

Human intuition thinks the green line might be better.

But we still don’t know.

Page 25: Data Science: The Product Manager's Primer

Another example:

The graphs to the right show an increasing number of polynomial terms used to fit data on house size vs. price.

Adapted from http://www.astroml.org/sklearn_tutorial/practical.html

Page 26: Data Science: The Product Manager's Primer

Q: I think I get it. What do we do about this problem?

Page 27: Data Science: The Product Manager's Primer

Data scientists divide the data into two sets: Train and Test

Page 28: Data Science: The Product Manager's Primer

The training set is used to set up the model, also known as fitting the parameters.

The test set is used to measure how well the model predicts values it has not seen.
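A minimal sketch of the train/test idea, assuming synthetic house size and price data and a 30/10 split (both are illustrative choices, not values from the talk). Parameters are fit on the training rows only, and the error is then measured separately on each set.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: house size (thousands of sq ft) vs. price (thousands of $).
size = rng.uniform(0.5, 3.0, 40)
price = 100.0 + 150.0 * size + rng.normal(scale=20.0, size=40)

# Shuffle once, then hold out the last 10 points as the test set.
order = rng.permutation(40)
train_idx, test_idx = order[:30], order[30:]


def mean_squared_error(coeffs, idx):
    """Mean squared residual of the fitted polynomial on the rows in idx."""
    residuals = price[idx] - np.polyval(coeffs, size[idx])
    return float(np.mean(residuals ** 2))


results = {}
for degree in (1, 6):
    # Fit parameters on the training set only.
    coeffs = np.polyfit(size[train_idx], price[train_idx], degree)
    results[degree] = (
        mean_squared_error(coeffs, train_idx),
        mean_squared_error(coeffs, test_idx),
    )

for degree, (train_mse, test_mse) in results.items():
    print(f"degree {degree}: train MSE {train_mse:.1f}, test MSE {test_mse:.1f}")
```

The more flexible model always looks at least as good on the training set. The test set is what tells you whether that improvement is real.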

Page 29: Data Science: The Product Manager's Primer

Comparing the differences between actual and predicted values (the residuals) indicates whether your model is down for the count...

Page 30: Data Science: The Product Manager's Primer

Or worthy of a championship

Page 31: Data Science: The Product Manager's Primer

Great. I now understand the data science process, but it’s not yet magic. What else can it do?

Page 32: Data Science: The Product Manager's Primer

Using a model based on parameters, a computer can group items into categories and make choices

Page 33: Data Science: The Product Manager's Primer

Uhhh OK… Example Please?

Page 34: Data Science: The Product Manager's Primer
Page 35: Data Science: The Product Manager's Primer

Problem:
We want to predict if a stranger at the gate of our castle is a Stark or a Lannister.

Should we trust them?

We don’t want to get killed.

Page 36: Data Science: The Product Manager's Primer

Name     Eye Color     Hair Color     Stark
Ned      Grey          Dark Brown     Y
Robb     Blue          Dark Brown     Y
Sansa    Blue          Red            Y
Arya     Grey          Brown          Y
Bran     Brown         Brown          Y
Rickon   Blue          Brown          Y
Lyanna   Blue          Brown          Y
Benjen   Brown         Brown          Y
Tywin    Green         Blonde         N
Tyrion   Green/Black   Dirty Blonde   N
Jamie    Green         Brown          N
Cersei   Green         Blonde         N

Page 37: Data Science: The Product Manager's Primer

Name     Eye Color     Eye Number   Hair Color     Hair Number   Stark
Ned      Grey          4            Dark Brown     5             1
Robb     Blue          2            Dark Brown     5             1
Sansa    Blue          2            Red            2             1
Arya     Grey          4            Brown          3             1
Bran     Brown         3            Brown          3             1
Rickon   Blue          2            Brown          3             1
Lyanna   Blue          2            Brown          3             1
Benjen   Brown         3            Brown          3             1
Tywin    Green         1            Blonde         1             0
Tyrion   Green/Black   1.5          Dirty Blonde   2             0
Jamie    Green         1            Brown          3             0
Cersei   Green         1            Blonde         1             0
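Turning the text columns into numbers is a lookup step. The sketch below uses the number assignments shown in the table; the dictionary and function names are made up for this example.

```python
# Number assignments taken from the table on the slide.
EYE_NUMBER = {"Green": 1, "Green/Black": 1.5, "Blue": 2, "Brown": 3, "Grey": 4}
HAIR_NUMBER = {"Blonde": 1, "Dirty Blonde": 2, "Red": 2, "Brown": 3, "Dark Brown": 5}


def encode(eye_color: str, hair_color: str, stark: str):
    """Convert one row of the text table into (eye number, hair number, label)."""
    return (EYE_NUMBER[eye_color], HAIR_NUMBER[hair_color], 1 if stark == "Y" else 0)


print(encode("Grey", "Dark Brown", "Y"))
print(encode("Green/Black", "Dirty Blonde", "N"))
```

How the numbers are assigned is a modeling choice. A different assignment would produce a different model.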

Page 38: Data Science: The Product Manager's Primer

Training
Name     Eye Number   Hair Number   Stark
Ned      4            5             1
Sansa    2            5             1
Bran     3            2             1
Rickon   2            3             1
Tyrion   1.5          2             0
Jamie    1            3             0

Testing
Name     Eye Number   Hair Number   Stark
Robb     2            5             1
Arya     4            3             1
Tywin    1            1             0
Cersei   1            1             0

Page 39: Data Science: The Product Manager's Primer

Training

Name     Eye Number   Hair Number   Stark   Model Outcome   Residuals Squared
Ned      4            5             1       1.3235          0.10465225
Sansa    2            5             1       0.7275          0.07425625
Bran     3            2             1       0.7786          0.04901796
Rickon   2            3             1       0.5629          0.19105641
Tyrion   1.5          2             0       0.3316          0.10995856
Jamie    1            3             0       0.2649          0.07017201

Stark = 0.298 (eye number) + 0.0823 (hair number) - 0.28
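The talk does not say how the coefficients were obtained, but an ordinary least-squares fit on the six training rows gives values very close to the ones on the slide. A sketch using NumPy's `lstsq`, with the rows entered exactly as they appear in the training table:

```python
import numpy as np

# Training rows as listed on the slide: (eye number, hair number, Stark).
training = np.array(
    [
        [4.0, 5.0, 1.0],  # Ned
        [2.0, 5.0, 1.0],  # Sansa
        [3.0, 2.0, 1.0],  # Bran
        [2.0, 3.0, 1.0],  # Rickon
        [1.5, 2.0, 0.0],  # Tyrion
        [1.0, 3.0, 0.0],  # Jamie
    ]
)

# Design matrix: eye number, hair number, and a column of ones for the intercept.
X = np.column_stack([training[:, :2], np.ones(len(training))])
y = training[:, 2]

coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
eye_weight, hair_weight, intercept = coeffs

print(f"Stark = {eye_weight:.3f} (eye number) + {hair_weight:.4f} (hair number) + ({intercept:.2f})")
```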

Page 40: Data Science: The Product Manager's Primer

Test

Name     Eye Number   Hair Number   Stark   Model Outcome   Residuals Squared
Robb     2            5             1       0.7275          0.07425625
Arya     4            3             1       1.1589          0.02524921
Tywin    1            1             0       0.1003          0.01006009
Cersei   1            1             0       0.1003          0.01006009

Average Residual Squared
Training: 0.0998   Test: 0.0299
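These averages can be reproduced from the model equation and the two tables. The sketch below does that, then adds one step that is not on the slides: treating a model outcome of 0.5 or more as "Stark" to turn the score into a yes/no answer. That cutoff is an assumption made for illustration.

```python
def stark_score(eye_number: float, hair_number: float) -> float:
    """Model from the slide: Stark = 0.298*eye + 0.0823*hair - 0.28."""
    return 0.298 * eye_number + 0.0823 * hair_number - 0.28


# Rows as listed on the slides: (eye number, hair number, Stark).
training = [(4, 5, 1), (2, 5, 1), (3, 2, 1), (2, 3, 1), (1.5, 2, 0), (1, 3, 0)]
testing = [(2, 5, 1), (4, 3, 1), (1, 1, 0), (1, 1, 0)]


def mean_squared_residual(rows) -> float:
    """Average of (actual - predicted)^2 over the given rows."""
    squared = [(label - stark_score(eye, hair)) ** 2 for eye, hair, label in rows]
    return sum(squared) / len(squared)


train_msr = mean_squared_residual(training)
test_msr = mean_squared_residual(testing)
print(f"Training: {train_msr:.5f}  Test: {test_msr:.5f}")

# Assumed decision rule (not from the slides): score >= 0.5 means Stark.
predicted_labels = [1 if stark_score(eye, hair) >= 0.5 else 0 for eye, hair, _ in testing]
print(predicted_labels)
```

The training average comes out near 0.09985, which the slide shows as 0.0998. With the assumed cutoff, all four test strangers are labeled correctly.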

Page 41: Data Science: The Product Manager's Primer

Working with data scientists
How to have better two-way conversations with your data science team and handle objections

➔ Data cleanliness
Describing a relationship between two variables.

➔ Model fit
Understanding, to some measure, how valid and strong the result is.

Page 42: Data Science: The Product Manager's Primer

“The data just isn’t clean enough to work with”

But it’s in a database. Isn’t that good enough?

Page 43: Data Science: The Product Manager's Primer

Real-world data is never as clean as one would hope. There is always the danger of missing fields, mistyped entries, previously wrong answers, etc.

Page 44: Data Science: The Product Manager's Primer

Solutions:
● Removing bad data columns and checking the effect on the user
● Making assumptions about missing data
● Spending time tracking down better data
● Finding alternative sources for data

Always check the effect that a change in the data has on the user experience.
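Two of these options can be sketched in plain Python. The records below are hypothetical, and the example drops incomplete rows rather than whole columns. Filling a missing size with the average of the known sizes is one possible assumption among many.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing field.
records = [
    {"size": 1200, "price": 300},
    {"size": None, "price": 280},
    {"size": 1500, "price": None},
    {"size": 900, "price": 210},
]

# Option 1: remove records that have any missing field.
complete = [r for r in records if None not in r.values()]

# Option 2: make an assumption about missing data, here that a missing
# size equals the average of the sizes that are known.
known_sizes = [r["size"] for r in records if r["size"] is not None]
fill_value = mean(known_sizes)
imputed = [
    {**r, "size": r["size"] if r["size"] is not None else fill_value}
    for r in records
]

print(f"{len(complete)} of {len(records)} records are complete")
print(f"missing sizes filled with {fill_value}")
```

Either choice changes what the model sees, which is why the effect on the user experience should be checked afterwards.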

Page 45: Data Science: The Product Manager's Primer

“We can’t ship. The learning curve is broken.”

Page 46: Data Science: The Product Manager's Primer

Going back to our housing problem will help us identify what might be going on.

Page 47: Data Science: The Product Manager's Primer

A learning curve shows how performance on the training and test data sets changes as the size of the data set increases.

Learning curves are used to understand the basic characteristics of the model fit.
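A learning curve can be computed by refitting the model on larger and larger slices of the training data and recording both errors each time. The sketch below assumes synthetic linear data, a straight-line model, and an arbitrary list of training sizes. None of these come from the talk.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: a linear trend plus noise.
x = rng.uniform(0.0, 1.0, 200)
y = 3.0 * x + 2.0 + rng.normal(scale=0.3, size=200)

x_train, y_train = x[:150], y[:150]
x_test, y_test = x[150:], y[150:]


def mean_squared_error(coeffs, xs, ys):
    """Mean squared residual of the fitted polynomial on (xs, ys)."""
    return float(np.mean((ys - np.polyval(coeffs, xs)) ** 2))


curve = []
for n in (5, 10, 20, 50, 100, 150):
    # Fit on the first n training points only.
    coeffs = np.polyfit(x_train[:n], y_train[:n], 1)
    train_error = mean_squared_error(coeffs, x_train[:n], y_train[:n])
    test_error = mean_squared_error(coeffs, x_test, y_test)
    curve.append((n, train_error, test_error))

for n, train_error, test_error in curve:
    print(f"n={n:3d}  train error {train_error:.3f}  test error {test_error:.3f}")
```

Plotting the two error columns against n gives the learning curve. The gap between them, and the level they settle at, is what the data science team reads from it.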

Page 48: Data Science: The Product Manager's Primer

A learning curve can show bias, meaning the test and training sets both give similar answers, but the answers are incorrect.

Data scientists can often fix this problem with more work.

Page 49: Data Science: The Product Manager's Primer

The other issue is called variance, meaning the test and training sets give different answers.

This type of issue is the hardest for your data science team to deal with.

Page 50: Data Science: The Product Manager's Primer

Some solutions to variance or underfit might be:
● Increasing the amount of data
● Figuring out the exact impact on users
● Increasing complexity of the model
  ○ Higher order model
  ○ More variables

Page 51: Data Science: The Product Manager's Primer

“This is cool. Where can I learn more?”

Page 52: Data Science: The Product Manager's Primer

Bibliography
For more information, there are many great resources online that give overviews as well as in-depth information for people who want to get into data science.

➔ Andrew Ng’s Course
Machine Learning on Coursera

➔ Coursera
◆ Machine Learning Foundations: A Case Study Approach
◆ A Crash Course in Data Science

➔ Kaggle
Hosts competitions and datasets. Also has a tutorial that walks through a machine learning example.

➔ scikit-learn Documentation
Documentation for the most popular machine learning library
