Things gone bye.. How to Predict The Future Either the world is driven completely by random chance...

How to Predict The Future

• Either the world is driven completely by random chance events (and your best bet for predicting the future is using Tarot cards or a Magic 8 Ball™), or there are detectable patterns in the world.

• If you talk to a preschool teacher or a PhD in math, they will tell you that math is all about pattern detection.

What We Want….

• We want to do deterministic modeling where we’re able to fill in a table like this:

Weeks of Gestation (in weeks)    Weight at birth (in lbs)
29 weeks                         ?
38 weeks                         ?
39 weeks                         ?
40 weeks                         ?

…and express it with a simple formula like this:

lbs = weeks * something

where “something” is the β (beta) value.
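
One way to find the "something" in lbs = weeks * something is least squares for a line through the origin. The sketch below is a minimal illustration, not the deck's own method; the four (weeks, lbs) pairs are invented, and `fit_beta` is a hypothetical helper name.

```python
# Hypothetical (weeks, lbs) pairs, invented purely for illustration.
observations = [(29, 2.9), (38, 6.8), (39, 7.2), (40, 7.5)]


def fit_beta(pairs):
    """Least-squares slope for a line through the origin: lbs = weeks * beta."""
    numerator = sum(weeks * lbs for weeks, lbs in pairs)
    denominator = sum(weeks * weeks for weeks, _ in pairs)
    return numerator / denominator


beta = fit_beta(observations)
print(f"beta = {beta:.4f} lbs per week")
for weeks, lbs in observations:
    print(f"{weeks} weeks: observed {lbs} lbs, predicted {weeks * beta:.2f} lbs")
```

Forcing the line through zero keeps the formula as simple as the one on the slide; later slides add a baseline term.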

What (else) We Want….

• Once we have made guesses at those numbers, we want to say how confident we are that they are right.

The Process

• The process of going from a single predictor or a set of predictors to a predicted outcome is called statistical modeling.

• People get far too excited about figuring out which statistic (with its accompanying p-value anxiety) to use for the factors that go into models.

The Steps

• Say what you are testing.

• Note the scale (nominal, ordinal, interval) of all the predictors.

• Describe the predictors numerically and graphically.

– Measures of central tendency and variability

• Look for association between the predictors and the outcome.

• Look at the strength of the association.

• Look for interactions.
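
The numeric parts of these steps can be sketched in a few lines. This is an illustration under assumptions: the data are invented, `describe` and `pearson_r` are hypothetical helper names, and Pearson's r is only one possible measure of the strength of an association (it suits two interval-scale variables).

```python
import math
import statistics

# Hypothetical data, invented for illustration only.
weeks = [29, 34, 36, 38, 39, 40, 41]
pounds = [2.9, 5.0, 5.9, 6.8, 7.2, 7.5, 7.9]


def describe(values):
    """Central tendency and variability for one variable."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
    }


def pearson_r(xs, ys):
    """Pearson correlation: strength of the linear association, from -1 to 1."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(spread_x * spread_y)


print("weeks: ", describe(weeks))
print("pounds:", describe(pounds))
print(f"r = {pearson_r(weeks, pounds):.3f}")
```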

What is a model and why care?

• The predictors and the outcomes can be on a continuous scale (time in days) or categorical factors (mom smoked, yes or no).

• Generally we try to use all the information available when we make a prediction about the future.

– The amount of blood ejected each time the heart beats (continuous scale) as opposed to whether or not the heart is beating

– The number of cancer cells seen on a slide (or the presence or absence of malignant cells)

• The models we build are remarkably similar regardless of whether we have categorical or continuous outcomes.

The Structure of a Model

• All the models I learned in school were formulated at their core like this:

Outcome = baseline + predictor + predictor

• The math can get ugly very quickly depending on the properties of the outcome (continuous, count, categories) but the core idea is that these models are all using additive contributions from some predictors!

Baby’s Weight = some number + Weeks * a number (impact of time) + a number (impact of being a smoker)
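
The additive structure translates directly into code. In this sketch all three coefficients are hypothetical placeholders chosen only so the output looks plausible; they are not estimates from any data set, and `predicted_weight` is an illustrative name.

```python
# All three coefficients are hypothetical placeholders, not real estimates.
BASELINE_LBS = -9.0        # "some number": the baseline
LBS_PER_WEEK = 0.41        # "a number": weight gained per week of gestation
SMOKER_EFFECT_LBS = -0.5   # "a number": shift applied when mom smoked


def predicted_weight(weeks, smoker):
    """Outcome = baseline + impact of time + impact of being a smoker."""
    impact_of_time = weeks * LBS_PER_WEEK
    impact_of_smoking = SMOKER_EFFECT_LBS if smoker else 0.0
    return BASELINE_LBS + impact_of_time + impact_of_smoking


print(f"40 weeks, non-smoker: {predicted_weight(40, smoker=False):.2f} lbs")
print(f"40 weeks, smoker:     {predicted_weight(40, smoker=True):.2f} lbs")
```

The categorical predictor enters as a number that is either added or not, which is why continuous and categorical predictors fit the same structure.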

What Makes a Bad Model

• Predicts some outcomes poorly

• Is strongly influenced by a small number of data points

• Shows systematic patterns in how it fails to predict

Goals

I see modeling as having two goals:

• Estimate parameters.

– How much weight gain occurs each week as a baby is developing?

• Estimate how well it describes your data. (Is your guess precise?)

– How far off will my guess be when I predict the next child?

– Are there regions where my guesses are far off, like premature or late deliveries?

– Is there a lot of variability at one point and not at others?

– Can I see any problems when I fit the model to THIS data?

Looking for Errors

• Statisticians use the word “error” differently than everyone else.

– You know that you will not have perfect prediction. Instead, you will be off. That is error. It does not mean somebody made a mistake! It just means you can’t make a perfect prediction.

– Specifying how far you will be off is the fun and interesting part of statistics. The rest is just math.

Looking at Errors

Outcome = baseline + predictor + predictor + error

Baby Weight = some number + Weeks * a number (impact of time) + a number (impact of being a smoker) + a number drawn from a bell-shaped distribution
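
To see what the error term does, you can simulate it. The sketch below assumes hypothetical coefficients and a hypothetical error standard deviation of 1 lb; `simulate_weight` is an illustrative name, and the seed is fixed only so the run is repeatable.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable


def simulate_weight(weeks, smoker, error_sd=1.0):
    """Return (systematic part, error, observed outcome) for one simulated baby."""
    # Coefficients and error_sd are hypothetical placeholders.
    systematic = -9.0 + 0.41 * weeks + (-0.5 if smoker else 0.0)
    error = rng.gauss(0.0, error_sd)  # a number drawn from a bell-shaped distribution
    return systematic, error, systematic + error


draws = [simulate_weight(40, smoker=False) for _ in range(10_000)]
errors = [error for _, error, _ in draws]

print(f"mean error:   {statistics.mean(errors):+.3f}")
print(f"sd of errors: {statistics.stdev(errors):.3f}")
```

The predictors give every 40-week non-smoker the same systematic value; all of the baby-to-baby differences come from the error draw.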

Looking for Errors

• Hopefully you will see that, given any specific predictor value, your guessed values for the outcome will be close to the values you actually observe in the outcome. Also, any observed outcome values that stray too far from your guess are unlikely.

• That pattern of how far off your guesses are from your observed data can frequently be described by a bell-shaped (“normal”) histogram. So, if you measure errors between your prediction and the observed outcomes, the distribution should be “normal.”

Guesses and Errors

[Figure: two histograms for births at 40 weeks. “Histogram of actual weights at 40 week births”: my model guesses 7.5 lbs, with the axis marked at 5.5 lbs, 7.5 lbs, and 9.5 lbs. “Histogram of errors at 40 weeks”: 0 error if the child was 7.5 lbs; most errors are off by just a bit; I guessed way too high rarely; I guessed way too low rarely.]
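
The histogram of errors can be tallied by hand. In this sketch the 7.5 lbs guess comes from the slide, but the observed weights and the 1.5 lbs cutoff for "way off" are invented for illustration. An error here is the observed weight minus the guess, so a negative error means the guess was too high.

```python
from collections import Counter

MODEL_GUESS_LBS = 7.5  # the model's guess at 40 weeks, from the slide

# Hypothetical observed weights at 40 weeks, invented for illustration.
observed_lbs = [5.6, 6.4, 6.9, 7.1, 7.3, 7.5, 7.5, 7.6, 7.8, 8.0, 8.5, 9.4]

errors = [weight - MODEL_GUESS_LBS for weight in observed_lbs]


def bucket(error, cutoff=1.5):
    """Label an error by its size and direction (cutoff is a hypothetical choice)."""
    if error <= -cutoff:
        return "guessed way too high"  # guess far above the observed weight
    if error >= cutoff:
        return "guessed way too low"   # guess far below the observed weight
    return "off by just a bit"


tally = Counter(bucket(error) for error in errors)
for label in ("guessed way too high", "off by just a bit", "guessed way too low"):
    print(f"{label:22s} {'#' * tally[label]}")
```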

Variance vs. Standard Error

• The variability around a continuous outcome is frequently described as a variance. The variability of a statistic from sample to sample (its sampling distribution) is frequently described as a standard error.

– The variability follows patterns that depend on the number of people in the sample.
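
For the mean, the dependence on sample size is the familiar formula: standard error = standard deviation / sqrt(n). A minimal sketch, assuming invented birth weights; `standard_error_of_mean` is an illustrative helper name.

```python
import math
import statistics


def standard_error_of_mean(values):
    """Sample standard deviation divided by the square root of the sample size."""
    return statistics.stdev(values) / math.sqrt(len(values))


# Hypothetical birth weights in lbs, invented for illustration.
small_sample = [6.1, 7.0, 7.4, 7.9, 8.6]
# The same values repeated: a similar spread with 20 times as many babies.
large_sample = small_sample * 20

for name, sample in [("n = 5", small_sample), ("n = 100", large_sample)]:
    print(f"{name:8s} variance = {statistics.variance(sample):.3f}  "
          f"standard error = {standard_error_of_mean(sample):.3f}")
```

The variance stays in the same neighborhood as the sample grows, while the standard error of the mean shrinks.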

Looking at Errors

• There are some kinds of errors that you will be unwilling to accept.

• If I want to predict the number of times an evil lackey proposes marriage to a mad scientist, I will not accept a negative number!

• If I am predicting the chance of someone developing cancer, I will not accept a number less than 0% or greater than 100%.

• Specifying the type of errors is a critical part of building a model.

More on Errors

• In addition to specifying the range of legal values, another critical component is specifying the variability in the errors.

– You have met several probability distributions which let you quantify what is an unusual score given a few parameters describing your data.

• Continuous outcomes

– Uniform, Normal, T, F

• Categorical outcomes

– The Binomial, Bernoulli, Chi-square

Ordinary Least Squares

• Perhaps the easiest models to draw and understand are ones where you have a continuous outcome like weight and a continuous predictor like time.

• The model is just a line….

• Y = mX + b

Weight = estimated weight gain each week after conception * number of weeks + weight at 0 weeks
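
For one predictor, the least-squares slope and intercept have a closed form, so the line can be fit without any statistics package. The sketch below assumes invented data; `fit_line` is an illustrative helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for Y = m*X + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the line must pass through the point of means.
    b = mean_y - m * mean_x
    return m, b


# Hypothetical data, invented for illustration.
weeks = [29, 34, 36, 38, 39, 40, 41]
pounds = [2.9, 5.0, 5.9, 6.8, 7.2, 7.5, 7.9]

m, b = fit_line(weeks, pounds)
print(f"Weight = {m:.3f} * weeks + ({b:.3f})")
```

Here m plays the role of "estimated weight gain each week" and b the role of "weight at 0 weeks". A value of b far from anything sensible is a reminder that the line is only trustworthy inside the range of weeks that were observed.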

Maximum Likelihood Visual

[Figure: plot of FETAL_WGT_ (axis from 1000 to 5000) against GWKS_DEL (axis from 30 to 40).]

Bad Models

• All models are wrong.

• Your data is sacred (after you remove the pregnant men) and you fit models to the data. You do not fit data to a model. That difference is not a minor semantic detail.

Poor Predictions

• Sometimes you have data points that are not well fit by the model. Go to extreme measures to document those points. If a data point is not a true error, then run the analysis with it and without it. Include the point(s) in all your plots with a special symbol, and if one person changes your inferences, consider excluding them.

– You may have different subgroups that you have not identified yet.
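
Running the analysis with and without a point is easy to script. In this sketch the (weeks, weight) pairs are invented, with the last pair standing in for the poorly fit observation; `fit_line` and `fit_points` are illustrative helper names.

```python
def fit_line(xs, ys):
    """Ordinary least squares for Y = m*X + b; returns (m, b)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    return m, mean_y - m * mean_x


def fit_points(pts):
    """Fit the line to a list of (x, y) pairs."""
    return fit_line([x for x, _ in pts], [y for _, y in pts])


# Hypothetical (weeks, weight) pairs; the last one plays the suspicious point.
points = [(30, 1500), (33, 2100), (35, 2600), (37, 3000),
          (39, 3400), (40, 3600), (36, 4900)]

m_with, b_with = fit_points(points)
m_without, b_without = fit_points(points[:-1])
shift = (m_with * 36 + b_with) - (m_without * 36 + b_without)

print(f"with the point:    weight = {m_with:.1f} * weeks + ({b_with:.1f})")
print(f"without the point: weight = {m_without:.1f} * weeks + ({b_without:.1f})")
print(f"prediction at 36 weeks shifts by {shift:.1f}")
```

If the two fits lead to the same inference, the point is worth documenting but not worth worrying about; if they differ, report both.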

A True Outlier

[Figure: plot of FETAL_WGT_ (axis from 1000 to 5000) against GWKS_DEL (axis from 30 to 40), with one point labeled “Induced because of HUGE size”.]

Looking at Residuals

• A critical step in examining the quality of a model is graphically looking at the residuals.

• Residuals are the differences between the estimated values and the observed values for each person/critter/observation.

• Look for curves, changing variability across the range of values or changes over time.
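
Residuals are simple to compute once a model is fit. The sketch below fits a straight line to deliberately curved, invented data and prints observed minus estimated for each observation; `fit_line` is an illustrative helper name. In practice you would plot these rather than print them.

```python
def fit_line(xs, ys):
    """Ordinary least squares for Y = m*X + b; returns (m, b)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    return m, mean_y - m * mean_x


# Hypothetical curved data: size grows with the square of age.
ages = list(range(0, 51, 5))
sizes = [0.05 * age ** 2 for age in ages]

m, b = fit_line(ages, sizes)
# Residual = observed value minus the value the line estimates.
residuals = [size - (m * age + b) for age, size in zip(ages, sizes)]

for age, residual in zip(ages, residuals):
    print(f"age {age:2d}: residual {residual:+7.2f}")
```

The signs run positive, then negative, then positive again across the range of ages. That kind of systematic run is the "curve" to look for.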

Patterns in Residuals

From Crawley: Statistical Computing

Curve Fitting

• Linear models can model curves

– The math is not too bad….

• You can use explicit mathematical formulas. If you see curves in your residuals, you can use things like:

– Polynomials or inverse polynomials

– Exponentials

– Power functions

Nonlinear Regression

• Often the formulas to describe your data are extraordinarily complicated and you want to use non-linear or non-parametric modeling instead.

• Key words you will see include:

– Non-parametric smoothing

• Lowess regression

• Spline regression

– GAM

– Tree models

A Bad Fit

• What happens when you fit a straight linear model to curvilinear data?

[Figure: two panels plotting Y = Size (0 to 140) against X = Age (0 to 50), with one residual marked. Caption: “Is this better than a flat line at the mean?”]

Is it good?

• A tiny p-value does not mean a good model!

• Where on the output does it tell you that this is a good or a poor model?

Residuals?

[Figure: plot of Y = Size (0 to 140) against X = Age (0 to 50).]

Flatten the line, then look up and down to see if you are systematically off.

Curve Fitting!

• You can build a model that has a curve using a polynomial… the degree of the polynomial determines how many “bends” appear in a curve. So a 2nd degree polynomial would use x and x^2 while a 3rd degree polynomial would use x, x^2, and x^3. These squared or cubed values don’t do anything especially complicated. They are just like adding new variables.
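
The "just like adding new variables" point can be made concrete: build columns for 1, x, and x^2, then run ordinary least squares on them. The sketch below does this with raw powers and a small hand-written solver. The data are invented and `solve` and `fit_polynomial` are illustrative helper names. R's `poly()` builds orthogonal polynomial terms by default, so its coefficients will differ from these even though the fitted curve is the same.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    rows = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, n + 1):
                rows[r][c] -= factor * rows[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(rows[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (rows[r][n] - tail) / rows[r][r]
    return solution


def fit_polynomial(xs, ys, degree):
    """Least squares with columns 1, x, x^2, ... treated as ordinary predictors."""
    columns = [[x ** power for x in xs] for power in range(degree + 1)]
    # Normal equations: (X'X) * coefficients = X'y
    xtx = [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(a * y for a, y in zip(ci, ys)) for ci in columns]
    return solve(xtx, xty)


# Hypothetical curved data, invented for illustration.
ages = [0, 10, 20, 30, 40, 50]
sizes = [3 + 0.5 * age + 0.04 * age ** 2 for age in ages]

intercept, linear, quadratic = fit_polynomial(ages, sizes, degree=2)
print(f"size = {intercept:.3f} + {linear:.3f} * age + {quadratic:.4f} * age^2")
```

Solving the normal equations directly is fine for a toy example; statistical packages use numerically safer methods for the same fit.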

Polynomials

[Figure: two panels plotting Y = Size (0 to 140) against X = Age (0 to 50), one for each fit below.]

size = intercept + X * something + X^2 * something else

poly2 = lm(y~poly(x,2))

size = intercept + X * something + X^2 * something else + X^3 * another thing

poly3 = lm(y~poly(x,3))

Generalized Linear Models

• You will eventually move out of the realm of predicting continuous outcomes with normal error. When you do, you will move into the realm of Generalized Linear Models (GLM).

• You want to have a linear model predicting an outcome where you restrict the possible outcome values (e.g., only allow values between 0 and 1) and deal with errors not being consistently normal across the entire range.

• You can change (transform) your outcome and model this with just another linear model similar to what I have shown.

GLM in English

• If you are predicting the number of bacteria you see in a Petri dish, you cannot possibly see a negative number of bacteria. A GLM can be written so that your predicted values cannot be negative.

• Contrast this with the baby weight example where, with a bit of bad data for your predictor value, you could have the formula spit out a negative weight or a baby weighing a ton.

GLM

• Instead of modeling like this:

Outcome = baseline + predictor + predictor + error (normal/bell-shaped)

• You can model with GLM like this:

Tweaked outcome = baseline + predictor + predictor + not normal error

log(odds of event) = baseline + predictor * β1 + predictor * β2 + binomial error

Ordinary Regression

• So, the ordinary least squares regression models are really just a special case of GLM. In these cases I specify that the tweak to the outcome is to just make the outcome identical to what it was originally and the error is normal.

• The tweak to the outcome is called the link, and in this case the link is called identity.

Mort = 389 - 5.98 * latitude

Link Functions

• The tweaks to the outcome are called links:

• Identity link = predicting a continuous outcome (baby weight)

• Log link = if you can’t have negative values

• Logit link = if you have to restrict the range to between 0 and 1

• There are other links.
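
Each link comes with an inverse that turns the linear predictor (baseline + predictors, which can be any real number) back into a legal value for the outcome. A minimal sketch of the three links named above; the function names are illustrative, not from any library.

```python
import math


def identity_link(mu):
    return mu


def identity_inverse(eta):
    return eta                       # any real number is allowed


def log_link(mu):
    return math.log(mu)


def log_inverse(eta):
    return math.exp(eta)             # always greater than 0


def logit_link(p):
    return math.log(p / (1 - p))     # log of the odds


def logit_inverse(eta):
    return 1 / (1 + math.exp(-eta))  # always between 0 and 1


for eta in (-5.0, 0.0, 5.0):
    print(f"linear predictor {eta:+.1f}: "
          f"identity {identity_inverse(eta):+.3f}, "
          f"log {log_inverse(eta):.3f}, "
          f"logit {logit_inverse(eta):.3f}")
```

However negative the linear predictor gets, the log link hands back a positive value and the logit link hands back a value strictly between 0 and 1.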

Error Structure

• Why bother to specify an error structure other than normal?

– Strong skew, kurtosis errors, bounded errors, negative counts

• The shape of the error distribution is not a bell-shaped curve. Rather than worrying about the math to describe those curves, you simply need to know that different types of data have different error structures.

– Normal errors - continuous outcomes

– Poisson errors - counts

– Binomial errors - proportions

– Gamma errors - variation

Binary Response

• If you are not dealing with a continuous outcome, or count data, you will likely have a binary (yes/no scored as 1 or 0) outcome.

• Clearly you need to do some major tweaking to the outcome because linear models, as we have seen, can predict very large and small numbers.

• Also, the variability of a binary outcome is very different from a continuous variable.

Logistic Regression

• The solution is to specify a link that limits values to be between 0 and 1 (think of the changed outcome as being the probability of being scored 1) and use an error term that behaves well with binary outcomes.

• This is a GLM with a logit link and binomial errors.

• This kind of analysis is so popular that most people don’t know it is a GLM. Rather, they know it only as logistic regression.

Logistic model

log(odds of high) = -17.81 + 0.4539 * latitude
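
To read a fitted logistic model, push the log odds back through the inverse of the logit link. The sketch below uses the two coefficients from the slide and assumes that "log" means the natural logarithm, as is usual for logistic regression; the latitudes tried and the function name are illustrative.

```python
import math

INTERCEPT = -17.81  # from the slide
SLOPE = 0.4539      # from the slide, per unit of latitude


def probability_of_high(latitude):
    """Convert the model's log odds into a probability between 0 and 1."""
    log_odds = INTERCEPT + SLOPE * latitude
    return 1 / (1 + math.exp(-log_odds))


for latitude in (30, 35, 40, 45, 50):
    print(f"latitude {latitude}: P(high) = {probability_of_high(latitude):.3f}")
```

However extreme the latitude, the result stays between 0 and 1, which is exactly what the logit link was chosen to guarantee.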

So Long (and thanks for all the fish)

• Drop by and say hi or send me an email if you have questions in the future.
