Data Science: The Product Manager's Primer

by Andrew Koller and Doron Bergman

Who we are

Andrew Koller

Five years' experience as an entrepreneur and product manager, including a background in statistical physics modeling and data science.

Doron Bergman

PhD in theoretical solid state physics. Three years' experience in large tech and startups.

Overview
Everything you need to know to understand the world of data science in startups

➔ Back to the Basics
An overview of statistics and the mathematical basis

➔ Data Science and AI
How data science differs from statistics and makes predictions in the real world

➔ Putting it all together with an example
Provide a simple unifying message for what is to come

➔ Data Scientist perspectives
How to understand objections your data scientists might have and how to be their hero

NOTE: Find this presentation at tiny.cc/ProductSchoolDataScience.

How popular is Data Science really?

Really Popular! (Especially in the Bay Area)


Q: What is data science and AI? Where do I start?

A: Statistics

Statistical Background
A quick review and overview of the statistics you’ll need to communicate with your team

➔ Regression
Describing a relationship between two variables.

➔ Confidence
Understanding, to some measure, how valid and strong the result is.

Q: What is statistics and why are you talking about that? I’m here for data science.

A: Statistics is a way to describe and make generalizations about a population of data. It forms the basis of Data Science.

Regression describes a set of data points with a function. The most basic form is linear regression, in the form y = Mx + b, where M is the slope and b is the intercept.

Money and Wins
Money might not be able to buy happiness, but it sure seems to buy wins in Major League Baseball

Y = M * x + b

Wins = 0.153 wins/$Million * Payroll (in $Million) + 66.4 Wins

Or about $6.5 Million per win

Actual calculation will be left as an exercise
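The arithmetic behind the slide can be sketched in a few lines of Python. The slope and intercept are the values quoted above; the function name `predicted_wins` and the $100 million payroll are illustrative choices, not something from the talk.

```python
# Linear model quoted on the slide: Wins = 0.153 * Payroll + 66.4
SLOPE_WINS_PER_MILLION = 0.153   # wins gained per extra $1M of payroll
INTERCEPT_WINS = 66.4            # wins the line predicts at a payroll of $0


def predicted_wins(payroll_millions: float) -> float:
    """Return the wins the line predicts for a payroll given in $ millions."""
    return SLOPE_WINS_PER_MILLION * payroll_millions + INTERCEPT_WINS


# The cost of one extra win is the inverse of the slope ($M per win).
cost_per_win_millions = 1 / SLOPE_WINS_PER_MILLION

print(f"Predicted wins at $100M payroll: {predicted_wins(100):.1f}")
print(f"Cost per additional win: ${cost_per_win_millions:.1f}M")
```

The "$6.5 Million per win" figure is simply one divided by the slope.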

Image source: New York Times, 2010

Q: How well does a linear relationship apply to everything?

A: Pretty well*

US Population
From 1650 to 1850 the US population grew nonlinearly

Y = A * (b * x)^n

Sadly not linear…. Or is it?

Image source: http://onlinestatbook.com/2/transformations/tukey.html

US Population
But after taking the log of the population, the function is

Y = M * x + b

Image source: http://onlinestatbook.com/2/transformations/tukey.html
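A minimal sketch of the idea, using made-up numbers rather than the census data: if a quantity grows by a fixed percentage each period, its logarithm grows by a fixed amount, so a straight line fits the logged values. The starting value, the 3% growth rate, and the helper `fit_line` are all assumptions for illustration.

```python
import math


def fit_line(xs, ys):
    """Least-squares slope M and intercept b for y = M * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic series (not real census figures): 3% growth per year from 1.0.
years = list(range(0, 200, 10))
population = [1.0 * 1.03 ** t for t in years]

# Straight line through the raw values: the curve bends away from it.
m_raw, b_raw = fit_line(years, population)
worst_raw = max(abs(y - (m_raw * t + b_raw)) for t, y in zip(years, population))

# Straight line through the logged values: essentially exact.
log_pop = [math.log(y) for y in population]
m_log, b_log = fit_line(years, log_pop)
worst_log = max(abs(y - (m_log * t + b_log)) for t, y in zip(years, log_pop))

print(f"worst miss, raw fit: {worst_raw:.2f}")
print(f"worst miss, log fit: {worst_log:.2e}")
print(f"recovered yearly growth: {math.exp(m_log) - 1:.3f}")
```

The slope of the logged fit is the log of the yearly growth factor, so the transformation loses nothing: the original curve can be recovered by exponentiating.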

Many datasets can be made linear after just one transformation

Q: What if that isn’t good enough?

The next step beyond linear regression is called polynomial regression, in the form y = Nx^2 + Mx + b or higher order. Each additional term lets the curve fit the data it was built from more closely.

More general (math nerd) form: y = ∑ M_n x^n
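As a rough illustration, the sketch below fits polynomials of increasing order to synthetic data with NumPy's `polyfit`. The data, the noise level, and the helper name `mean_squared_residual` are invented for the example. Note that the error here is measured on the same points used for the fit.

```python
import numpy as np

# Synthetic data (illustrative only): a gentle curve plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=0.1, size=x.size)


def mean_squared_residual(degree: int) -> float:
    """Fit a polynomial of the given degree and return its mean squared residual."""
    coefficients = np.polyfit(x, y, degree)   # coefficients, highest power first
    predictions = np.polyval(coefficients, x)
    return float(np.mean((y - predictions) ** 2))


errors = {degree: mean_squared_residual(degree) for degree in (1, 2, 3, 5)}
for degree, error in errors.items():
    print(f"degree {degree}: mean squared residual = {error:.4f}")
```

Because each higher-order model contains the lower-order ones as a special case, the error on the fitted points cannot go up as terms are added.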

Q: How accurate is my regression line? How is it measured?

Accuracy (or, alternatively, error) is measured by taking the distance between the value predicted by the regression and the actual value. This distance is called the residual. Data scientists often work with squared residuals, which count a miss above the line the same as a miss below it.

A’s have a positive residual

Mets have a negative residual

Which team would you rather be?
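A small sketch of the residual calculation, assuming the usual convention that a residual is the actual value minus the predicted value. The line is the one quoted earlier in the deck; the two teams and their numbers are hypothetical, invented only to show one positive and one negative residual.

```python
def predicted_wins(payroll_millions: float) -> float:
    """Regression line quoted earlier in the deck."""
    return 0.153 * payroll_millions + 66.4


def residual(actual_wins: float, payroll_millions: float) -> float:
    """Actual minus predicted: positive means more wins than the line expects."""
    return actual_wins - predicted_wins(payroll_millions)


# Hypothetical teams, invented for illustration (not real season data).
teams = {
    "Thrifty": {"payroll": 50.0, "wins": 81},
    "Spendy": {"payroll": 130.0, "wins": 79},
}

for name, team in teams.items():
    r = residual(team["wins"], team["payroll"])
    print(f"{name}: residual = {r:+.2f}, squared = {r ** 2:.2f}")
```

A positive residual means the team won more games than its payroll alone would suggest.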

Great! So now we can make predictions right?

Right?

Data Science
Understand how data science builds on statistics to make predictions and power some of its most common uses.

➔ Numeric Predictions
Describing a relationship between two variables.

➔ Categorization
Understanding, to some measure, how valid and strong the result is.

Statistics primarily DESCRIBE data sets, but are not set up to PREDICT the values.

Example, please???

Which line on the right is the “best”? If there were another point in the set, where do you think it might be?

Statistics says the black line is the best.

Human intuition thinks the green line might be better.

But we still don’t know.

Another example: The graphs to the right show an increasing number of polynomial terms used to fit data on house size vs. price.

Adapted from http://www.astroml.org/sklearn_tutorial/practical.html

Q: I think I get it. What do we do about this problem?

Data scientists divide the data into two sets: Train and Test

The training set is used to set up the model, also known as fitting its parameters.

The test set is used to measure how well the model predicts the values of data it was not fitted on.

Comparing the difference between actual and predicted values - residuals - indicates whether your model is down for the count …

Or worthy of a championship
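The train/test idea can be sketched as follows. The "house size vs price" numbers are synthetic, and the 30/10 split, the polynomial orders, and the helper `mse` are arbitrary choices made for the example.

```python
import numpy as np

# Synthetic "house size vs price" data, invented for illustration.
rng = np.random.default_rng(42)
size = rng.uniform(0.5, 3.5, 40)                          # thousands of sq ft
price = 100 + 80 * size + rng.normal(scale=20, size=40)   # thousands of $

# Shuffle once, then hold out a quarter of the rows as the test set.
order = rng.permutation(size.size)
test_idx, train_idx = order[:10], order[10:]


def mse(coefficients, idx):
    """Mean squared residual of a fitted polynomial on the chosen rows."""
    predictions = np.polyval(coefficients, size[idx])
    return float(np.mean((price[idx] - predictions) ** 2))


results = {}
for degree in (1, 3, 6):
    # Parameters are fitted on the training rows only.
    coefficients = np.polyfit(size[train_idx], price[train_idx], degree)
    results[degree] = (mse(coefficients, train_idx), mse(coefficients, test_idx))
    print(f"degree {degree}: train MSE = {results[degree][0]:.1f}, "
          f"test MSE = {results[degree][1]:.1f}")
```

Training error can only fall as terms are added, so on its own it cannot flag an over-complicated model. The held-out rows are what show whether the extra terms help on data the model has not seen.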

Great. I now understand the data science process, but it’s not yet magic. What else can it do?

Using a model based on parameters, a computer can group items into categories and make choices

Uhhh OK… Example Please?

Problem: We want to predict if a stranger at the gate of our castle is a Stark or a Lannister.

Should we trust them?

We don’t want to get killed.

Name     Eye Color     Hair Color     Stark
Ned      Grey          Dark Brown     Y
Robb     Blue          Dark Brown     Y
Sansa    Blue          Red            Y
Arya     Grey          Brown          Y
Bran     Brown         Brown          Y
Rickon   Blue          Brown          Y
Lyanna   Blue          Brown          Y
Benjen   Brown         Brown          Y
Tywin    Green         Blonde         N
Tyrion   Green/Black   Dirty Blonde   N
Jamie    Green         Brown          N
Cersei   Green         Blonde         N

Name     Eye Color     Eye Number   Hair Color     Hair Number   Stark
Ned      Grey          4            Dark Brown     5             1
Robb     Blue          2            Dark Brown     5             1
Sansa    Blue          2            Red            2             1
Arya     Grey          4            Brown          3             1
Bran     Brown         3            Brown          3             1
Rickon   Blue          2            Brown          3             1
Lyanna   Blue          2            Brown          3             1
Benjen   Brown         3            Brown          3             1
Tywin    Green         1            Blonde         1             0
Tyrion   Green/Black   1.5          Dirty Blonde   2             0
Jamie    Green         1            Brown          3             0
Cersei   Green         1            Blonde         1             0
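The step from words to numbers can be sketched as a simple lookup. The number codes below are copied from the table; the dictionary and function names are made up for the example. The codes are the presenters' choice, and a different numbering would give a different model.

```python
# Number codes copied from the table above.
EYE_NUMBER = {"Grey": 4, "Brown": 3, "Blue": 2, "Green/Black": 1.5, "Green": 1}
HAIR_NUMBER = {"Dark Brown": 5, "Brown": 3, "Red": 2, "Dirty Blonde": 2, "Blonde": 1}


def encode(eye_color: str, hair_color: str, is_stark: str) -> tuple:
    """Turn one row of text labels into (eye number, hair number, Stark as 1/0)."""
    stark_flag = 1 if is_stark == "Y" else 0
    return EYE_NUMBER[eye_color], HAIR_NUMBER[hair_color], stark_flag


print(encode("Grey", "Dark Brown", "Y"))           # Ned's row
print(encode("Green/Black", "Dirty Blonde", "N"))  # Tyrion's row
```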

Training

Name     Eye Number   Hair Number   Stark
Ned      4            5             1
Sansa    2            5             1
Bran     3            2             1
Rickon   2            3             1
Tyrion   1.5          2             0
Jamie    1            3             0

Testing

Name     Eye Number   Hair Number   Stark
Robb     2            5             1
Arya     4            3             1
Tywin    1            1             0
Cersei   1            1             0

Training

Name     Eye Number   Hair Number   Stark   Model Outcome   Residuals Squared
Ned      4            5             1       1.3235          0.10465225
Sansa    2            5             1       0.7275          0.07425625
Bran     3            2             1       0.7786          0.04901796
Rickon   2            3             1       0.5629          0.19105641
Tyrion   1.5          2             0       0.3316          0.10995856
Jamie    1            3             0       0.2649          0.07017201

Stark = 0.298 (eye number) + 0.0823 (hair number) - 0.28
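The slides do not say how the model was fitted. Assuming ordinary least squares on the six training rows, a sketch with NumPy's `linalg.lstsq` produces coefficients that agree with the equation above to about three decimal places. The variable names are illustrative.

```python
import numpy as np

# Training rows from the table above: eye number, hair number, Stark (1/0).
training = np.array([
    [4.0, 5.0, 1.0],   # Ned
    [2.0, 5.0, 1.0],   # Sansa
    [3.0, 2.0, 1.0],   # Bran
    [2.0, 3.0, 1.0],   # Rickon
    [1.5, 2.0, 0.0],   # Tyrion
    [1.0, 3.0, 0.0],   # Jamie
])

# Design matrix: eye number, hair number, and a column of ones for the intercept.
features = np.column_stack([training[:, 0], training[:, 1], np.ones(len(training))])
target = training[:, 2]

# Ordinary least squares (assumed method): minimizes the sum of squared residuals.
(eye_weight, hair_weight, intercept), *_ = np.linalg.lstsq(features, target, rcond=None)

print(f"Stark = {eye_weight:.3f} (eye number) + {hair_weight:.4f} (hair number) "
      f"{intercept:+.2f}")
```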

Test

Name     Eye Number   Hair Number   Stark   Model Outcome   Residuals Squared
Robb     2            5             1       0.7275          0.07425625
Arya     4            3             1       1.1589          0.02524921
Tywin    1            1             0       0.1003          0.01006009
Cersei   1            1             0       0.1003          0.01006009

Average Residual Squared
Training: 0.0998
Test: 0.0299
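The numbers in the two tables can be reproduced with a short sketch that applies the slide's equation to each row. The function names are made up, and the 0.5 cut-off used to turn the score into a yes/no answer is an assumption; the slides do not state a threshold.

```python
def stark_score(eye_number: float, hair_number: float) -> float:
    """Model from the slide: Stark = 0.298 * eye + 0.0823 * hair - 0.28."""
    return 0.298 * eye_number + 0.0823 * hair_number - 0.28


# (name, eye number, hair number, Stark) rows copied from the tables above.
training = [("Ned", 4, 5, 1), ("Sansa", 2, 5, 1), ("Bran", 3, 2, 1),
            ("Rickon", 2, 3, 1), ("Tyrion", 1.5, 2, 0), ("Jamie", 1, 3, 0)]
testing = [("Robb", 2, 5, 1), ("Arya", 4, 3, 1), ("Tywin", 1, 1, 0),
           ("Cersei", 1, 1, 0)]


def mean_squared_residual(rows) -> float:
    """Average of (actual - model outcome)^2 over the rows."""
    return sum((stark - stark_score(eye, hair)) ** 2
               for _, eye, hair, stark in rows) / len(rows)


THRESHOLD = 0.5  # assumed cut-off, not from the slides


def looks_like_a_stark(eye_number: float, hair_number: float) -> bool:
    """Turn the numeric score into a yes/no call using the assumed threshold."""
    return stark_score(eye_number, hair_number) >= THRESHOLD


print(f"training: {mean_squared_residual(training):.4f}")
print(f"test:     {mean_squared_residual(testing):.4f}")
for name, eye, hair, stark in testing:
    print(name, "-> Stark" if looks_like_a_stark(eye, hair) else "-> not a Stark")
```

The averages match the slide's figures to within rounding. With the assumed 0.5 cut-off, every row in both tables lands on the correct side.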

Working with data scientists
How to have better two-way conversations with your data science team and handle objections

➔ Data cleanliness
Describing a relationship between two variables.

➔ Model fit
Understanding, to some measure, how valid and strong the result is.

“The data just isn’t clean enough to work with”

But it’s in a database. Isn’t that good enough?

Real-world data is never as clean as one would hope. There is always the danger of missing fields, mistyped entries, previous wrong answers, etc.

Solutions:
● Remove bad data columns and check the effect on the user
● Make assumptions about missing data
● Spend time tracking down better data
● Find alternative sources for data

Always check the effect that a change in the data has on a user experience.
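Two of the options above, removing bad data and making an assumption about missing data, can be compared with a small sketch. The records, field names, and the choice of the mean as the fill-in value are all hypothetical.

```python
from statistics import mean

# Hypothetical records, invented for illustration; None marks a missing field.
records = [
    {"user": "a", "age": 34, "spend": 120.0},
    {"user": "b", "age": None, "spend": 80.0},
    {"user": "c", "age": 29, "spend": None},
    {"user": "d", "age": 41, "spend": 200.0},
]


def drop_incomplete(rows):
    """Option 1: remove any row that has a missing field."""
    return [r for r in rows if all(v is not None for v in r.values())]


def fill_with_mean(rows, field):
    """Option 2: assume a missing value equals the mean of the known ones."""
    known = mean(r[field] for r in rows if r[field] is not None)
    return [{**r, field: known if r[field] is None else r[field]} for r in rows]


dropped = drop_incomplete(records)
filled = fill_with_mean(records, "spend")

# Compare a downstream number under each choice before committing to one.
print("rows kept after dropping:", len(dropped))
print("mean spend, dropped rows:", mean(r["spend"] for r in dropped))
print("mean spend, filled rows: ", mean(r["spend"] for r in filled))
```

The two clean-up choices give different answers for the same downstream number, which is why the effect on the user should be checked before picking one.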

“We can’t ship. The learning curve is broken.”

Going back to our housing problem will help us identify what might be going on.

A learning curve shows how the model performs on the training and test sets as the size of the training set increases.

Learning curves are used to understand the basic characteristics of the model fit.

A learning curve can show bias, meaning the test and training sets both give similar answers, but the answers are incorrect.

Data scientists can often fix this problem with more work.

The other issue is called variance, meaning the test and training sets give different answers.

This type of issue is the hardest for your data science team to deal with.

Some solutions to variance or underfit might be:
● Increasing the amount of data
● Figuring out the exact impact on users
● Increasing complexity of the model
  ○ Higher order model
  ○ More variables
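A learning curve can be sketched by refitting the model on larger and larger slices of the training data and recording the error on both sets each time. Everything below is an assumption made for illustration: the synthetic data, the set sizes, and the use of a polynomial fit as the model.

```python
import numpy as np

# Synthetic data, invented for illustration: a curved trend plus noise.
rng = np.random.default_rng(7)
x = rng.uniform(0.0, 3.0, 120)
y = 50 + 30 * x**2 + rng.normal(scale=15, size=x.size)

# Fixed test set of 40 rows; the remaining 80 rows feed the training set.
x_test, y_test = x[:40], y[:40]
x_pool, y_pool = x[40:], y[40:]


def learning_curve(degree, sizes):
    """Return (size, train MSE, test MSE) for each training-set size."""
    rows = []
    for n in sizes:
        coefficients = np.polyfit(x_pool[:n], y_pool[:n], degree)
        train_mse = np.mean((y_pool[:n] - np.polyval(coefficients, x_pool[:n])) ** 2)
        test_mse = np.mean((y_test - np.polyval(coefficients, x_test)) ** 2)
        rows.append((n, float(train_mse), float(test_mse)))
    return rows


sizes = (10, 20, 40, 80)
for degree in (1, 2):
    print(f"polynomial degree {degree}")
    for n, train_mse, test_mse in learning_curve(degree, sizes):
        print(f"  n={n:3d}  train MSE={train_mse:8.1f}  test MSE={test_mse:8.1f}")
```

When reading the output, two patterns are worth looking for. If both errors settle at a similar but high value, that typically points to bias, and a more complex model may help. If a wide gap remains between training and test error, that typically points to variance, and more data may help.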

“This is cool. Where can I learn more?”

Bibliography
For more information, there are many great resources online that give overviews as well as in-depth information for people who want to get into data science.

➔ Andrew Ng’s Course
Machine Learning on Coursera

➔ Coursera
◆ Machine Learning Foundations: A Case Study Approach
◆ A Crash Course in Data Science

➔ Kaggle
Hosts competitions and datasets. Also has a tutorial that walks through a machine learning example

➔ scikit-learn Documentation
Documentation for the most popular machine learning library

