Review of Statistical Models and Linear Regression Concepts


Transcript of Review of Statistical Models and Linear Regression Concepts

Page 1: Review of  Statistical  Models and  Linear Regression Concepts

Review of Statistical Models and Linear Regression Concepts

STAT E-150 Statistical Methods

Page 2: Review of  Statistical  Models and  Linear Regression Concepts

2

Statistical Models are used to make predictions, understand relationships, and assess differences.

A statistical model can be written as 

Data = model + error, or Y = f(x) + ε 

where Y is the response variable, x is the explanatory variable, and ε is the error

The error term, ε, represents the part of the response variable that is not explained by its relationship to the predictor variable. We often consider the probability distribution of this error term as part of our assessment of the model.

Page 3: Review of  Statistical  Models and  Linear Regression Concepts

3

The Four-Step Process for statistical modeling:

1. Choose a form for the model: identify the variables and their types, and examine graphs to help identify the appropriate model

2. Fit the model to the data: use the sample data to estimate the values of the model parameters

3. Assess how well the model fits the data: compare models and examine the residuals

4. Use the model to make predictions, explain relationships, assess differences

The appropriate model depends on the type of variables and the role each variable plays in the analysis.
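As a generic illustration of these four steps in R (a minimal sketch, not from the slides, using R's built-in cars data set with speed as the explanatory variable and stopping distance, dist, as the response):

> # 1. Choose a form: a scatterplot suggests a roughly linear relationship
> plot(cars$speed, cars$dist)
> # 2. Fit the model: estimate the parameters from the sample data
> fit <- lm(dist ~ speed, data = cars)
> # 3. Assess the fit: examine the coefficients and the residuals
> summary(fit)
> plot(fitted(fit), resid(fit)); abline(h = 0)
> # 4. Use the model: for example, predict the response at speed = 15
> predict(fit, data.frame(speed = 15))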

Page 4: Review of  Statistical  Models and  Linear Regression Concepts

4

Example:

Medical researchers have noted that adolescent females are more likely to deliver low-birthweight (LBW) babies than are adult females. Because LBW babies tend to have higher mortality rates, studies have been conducted to examine the relationship between birthweight and the mother’s age.

One such study is discussed in the article “Body Size and Intelligence in 6-Year-Olds: Are Offspring of Teenage Mothers at Risk?” (Maternal and Child Health Journal [2009], pp. 847-856.)

Page 5: Review of  Statistical  Models and  Linear Regression Concepts

5

The following data is consistent with summary values given in the article, and with data published by the National Center for Health Statistics:

What are the observational units? Teenage mothers and their babies

Which is the response variable? The baby’s weight (in grams)

Which is the explanatory variable? The mother’s age (in years)

Observation               1     2     3     4     5     6     7     8     9    10
Maternal Age (in years)  15    17    18    15    16    19    17    16    18    19
Birthweight (in grams) 2289  3393  3271  2648  2897  3327  2970  2535  3138  3573
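For reference, a minimal R sketch (not part of the slides; the vector names maternal_age and birthweight are chosen here just for illustration) that enters these ten observations:

> maternal_age <- c(15, 17, 18, 15, 16, 19, 17, 16, 18, 19)
> birthweight  <- c(2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573)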


Page 7: Review of  Statistical  Models and  Linear Regression Concepts

7

Simple Linear Regression is used to investigate whether there is a linear relationship between two quantitative variables. If a linear relationship exists, we can create a model for the relationship, and use this model to answer these questions: 

What is the relationship between the variables? What does the slope of this linear model tell us? When is it appropriate to use this linear model to make predictions?

Page 8: Review of  Statistical  Models and  Linear Regression Concepts

8

A First-Order Linear Model is of the form

y = β0 + β1x + ε

where

y = the response variable
x = the independent, or predictor, or explanatory variable
ε = the random error

β0 = the y-intercept; the regression line crosses the y-axis at the point (0, β0)

β1 = the slope of the regression line = (change in y)/(change in x), the change in y for every one-unit increase in x

Page 9: Review of  Statistical  Models and  Linear Regression Concepts

9

y = β1x + β0

Page 10: Review of  Statistical  Models and  Linear Regression Concepts

10

Steps in regression

1. Hypothesize the form of the model for E(y), the mean or expected value of y

2. Collect the sample data

3. Use the sample data to estimate the unknown parameters in the model.

4. Specify the probability distribution of ε and estimate any unknown parameters in the distribution. Check the validity of the assumptions made about the probability distribution.

5. Statistically check the usefulness of the model

6. If the model is useful, use the model for appropriate prediction and estimation

Page 11: Review of  Statistical  Models and  Linear Regression Concepts

11

Notation:

Recall that Data = model + error, or Y = f(x) + ε

μy (or μy|x) is the mean value of y for a particular value of x

ε is the deviation from that mean value at a value of x

In a simple linear regression model:

μy = f(x) = β1x + β0  (the mean value of y at a given value of x)

and y = f(x) + ε = β1x + β0 + ε (the actual value of y for a given x)

Page 12: Review of  Statistical  Models and  Linear Regression Concepts

12

In our example,

μbirthweight = β1age + β0

The actual birthweights are represented by

Birthweight = β1age + β0 + ε

Page 13: Review of  Statistical  Models and  Linear Regression Concepts

13

The first step in determining whether there is a linear relationship between the variables is to create a scatterplot of the data, with the explanatory variable on the x-axis and the response variable on the y-axis.
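A minimal sketch of this step in R, assuming the maternal_age and birthweight vectors entered in the earlier sketch:

> plot(maternal_age, birthweight, xlab = "Maternal age (years)", ylab = "Birthweight (grams)")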

Page 14: Review of  Statistical  Models and  Linear Regression Concepts

14

Does there appear to be a linear relationship?

What does the scatterplot tell you about the strength and direction of the linear relationship? Write your answer in the context of the scenario.

Page 15: Review of  Statistical  Models and  Linear Regression Concepts

15

Does there appear to be a linear relationship? The scatter diagram shows a positive linear relationship.

What does the scatterplot tell you about the strength and direction of the linear relationship? Write your answer in the context of the scenario.

Page 16: Review of  Statistical  Models and  Linear Regression Concepts

What does the scatterplot tell you about the strength and direction of the linear relationship? Write your answer in the context of the scenario. The scatter diagram shows that there is a fairly strong positive linear relationship between the two variables: as the mother’s age increases, the child’s birthweight also increases. That is, higher birthweights are associated with older mothers.

Page 17: Review of  Statistical  Models and  Linear Regression Concepts

17

Fitting a Simple Linear Model 

If the data appears to show a linear relationship, the method of least squares finds the line that best fits the data. That is, it will provide the best estimates for β0 and β1.

We can find the vertical distance between the observed value of y and the predicted value of y for each value of x. This difference is called the residual:

residual = y – ŷ = observed value – predicted value

Page 18: Review of  Statistical  Models and  Linear Regression Concepts

18

The points should be scattered about a straight line, with deviations from the line determined by ε. This vertical distance between the observed value of y and the predicted value of y is called the residual. 

Residual = observed value - predicted value

  

 

 

Page 19: Review of  Statistical  Models and  Linear Regression Concepts

19

We want the size of the residuals to be as small as possible; since some residuals are positive and some are negative, we square the residuals and minimize the squares.

SSE, the sum of squared errors, is a measure of how well the line predicts the actual values.

The Least Squares line is the line for which SSE = Σ(yi – ŷi)² is minimized.

The equation of the least squares line is ŷ = β̂1x + β̂0
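As a rough sketch of what minimizing SSE produces (these are the standard least squares formulas, not output shown in the slides; it assumes the maternal_age and birthweight vectors entered earlier):

> Sxx <- sum((maternal_age - mean(maternal_age))^2)
> Sxy <- sum((maternal_age - mean(maternal_age)) * (birthweight - mean(birthweight)))
> b1  <- Sxy / Sxx                                        # estimated slope
> b0  <- mean(birthweight) - b1 * mean(maternal_age)      # estimated intercept
> SSE <- sum((birthweight - (b0 + b1 * maternal_age))^2)  # sum of squared residuals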

Page 20: Review of  Statistical  Models and  Linear Regression Concepts

20

Some notation: Consider the ith value in the dataset: yi = β0 + β1xi + εi.

β0 and β1 are the true values for the population; these are parameters.

β̂0 and β̂1 are estimates of the coefficients based on the sample data; these are statistics.

Page 21: Review of  Statistical  Models and  Linear Regression Concepts

21

In our example, the equation of the least squares line is

 weight = 245.15 age – 1163.45

What does the value 245.15 represent, in context?   

What does the value -1163.45 represent, in context?

Page 22: Review of  Statistical  Models and  Linear Regression Concepts

22

In our example, the equation of the least squares line is

 weight = 245.15 age – 1163.45

What does the value 245.15 represent, in context? The child’s birthweight is expected to increase by 245.15 g for each additional year in the age of the mother.

What does the value -1163.45 represent, in context?

If the mother’s age is 0 years, the child’s birthweight is expected to be -1163.45 g.
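A minimal R sketch of fitting this line (assuming the maternal_age and birthweight vectors entered earlier); the coefficient estimates should match the slide’s values of 245.15 and -1163.45:

> model <- lm(birthweight ~ maternal_age)
> coef(model)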


Page 24: Review of  Statistical  Models and  Linear Regression Concepts

24

In our example, the equation of the least squares line is

 weight = 245.15 age – 1163.45

What birthweight would you expect for the baby of a mother who is 16 years old?

weight = 245.15 age – 1163.45 = 245.15(16) – 1163.45 = 3922.4 – 1163.45 = 2758.95

What was the birthweight for the baby of a mother who was 16 years old? 2897 g

What is the residual? 2897 – 2758.95 = 138.05 g

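A quick arithmetic check of this prediction and residual, using the fitted coefficients from the slide (a minimal sketch, not part of the slides):

> predicted <- 245.15 * 16 - 1163.45    # fitted birthweight for a 16-year-old mother
> predicted
[1] 2758.95
> 2897 - predicted                      # residual = observed - predicted
[1] 138.05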

Page 28: Review of  Statistical  Models and  Linear Regression Concepts

28

In our example, the equation of the least squares line is

 weight = 245.15 age – 1163.45

What birthweight would you expect for the baby of a mother who is 11 years old?

Page 29: Review of  Statistical  Models and  Linear Regression Concepts

29

Conditions for a Simple Linear Model 

Linearity - the scatterplot shows a general linear pattern
Zero Mean - the distribution of the errors is centered at zero
Constant Variance - the variability of the errors is the same for all values of the predictor variable
Independence - the errors are independent of each other

Page 30: Review of  Statistical  Models and  Linear Regression Concepts

30

Conditions for Inference also include: 

Random - the data was obtained through a random process
Normality - the distribution of the errors is approximately Normal
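The Random condition has to be judged from how the data were collected; the Normality condition can be examined graphically. A minimal sketch in R, assuming a fitted model such as model <- lm(birthweight ~ maternal_age) from the earlier sketch:

> hist(resid(model))                           # roughly symmetric and bell-shaped?
> qqnorm(resid(model)); qqline(resid(model))   # points close to the line suggest approximate Normality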

Page 31: Review of  Statistical  Models and  Linear Regression Concepts

31

More about Residuals: A residual plot is a scatterplot of the regression residuals against the explanatory variable. Residual plots help us assess the fit of a regression line. Here are examples of residual plots:

This residual plot shows no systematic pattern; it shows a uniform scatter of the points about the fitted line, and indicates that the regression line fits the data well.

Page 32: Review of  Statistical  Models and  Linear Regression Concepts

32

A curved pattern shows that the data is not linear, so a straight line is not a good fit for the data.

This residual plot shows that there is more spread for larger values of the explanatory variable, indicating that predictions will be less accurate when x is large.

You should also note any values with large residuals. These points are outliers in the vertical (y) direction because they lie far from the line that describes the overall pattern.

Page 33: Review of  Statistical  Models and  Linear Regression Concepts

33

The Simple Linear Regression Model

For a quantitative response variable Y and a single quantitative explanatory variable X, the simple linear regression model is

Y = β0 + β1X + ε

where ε follows a normal distribution, that is, ε ~ N(0, σε), and the errors are independent of one another.
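To make the model concrete, here is a small simulation sketch (illustrative values only, not from the slides): responses are generated as a straight line plus independent Normal errors with constant standard deviation.

> set.seed(1)
> x <- 1:50
> y <- 10 + 2 * x + rnorm(50, mean = 0, sd = 5)   # Y = beta0 + beta1*X + epsilon, epsilon ~ N(0, sigma)
> plot(x, y); abline(10, 2)                       # points scatter about the true line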

Page 34: Review of  Statistical  Models and  Linear Regression Concepts

Assessing Conditions

To check the Linearity Condition, consider a scatterplot of the data to see if the points suggest a linear relationship.

34

Page 35: Review of  Statistical  Models and  Linear Regression Concepts

Assessing Conditions

Check the Constant Variance Condition with a plot of the residuals.

Graphs of the residuals can also help to determine whether the conditions are met.

35

Page 36: Review of  Statistical  Models and  Linear Regression Concepts

36

Page 37: Review of  Statistical  Models and  Linear Regression Concepts

 

This tells us that the slope of the regression line is -3.23 and the y-intercept is the point (0, 1676.4).  And so the equation of the regression line is

Mortality = -3.23 calcium + 1676.4

37

Coefficients(a)

                  Unstandardized Coefficients   Standardized Coefficients
Model             B            Std. Error       Beta                        t        Sig.
1   (Constant)    -1163.450    783.138                                      -1.486   .176
    Age           245.150      45.908           .884                        5.340    .001

a. Dependent Variable: Birthweight
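This is the SPSS coefficients output for the birthweight model fitted earlier. A comparable table can be produced in R (a sketch, assuming the model object from the earlier lm() sketch):

> summary(model)$coefficients   # Estimate, Std. Error, t value, Pr(>|t|): the same quantities as the SPSS table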

Page 38: Review of  Statistical  Models and  Linear Regression Concepts

Mortality = -3.23 calcium + 1676.4

In other words, if the calcium level increases by one ppm, the mortality rate is expected to decrease by 3.23 deaths per 100,000, on average. The y-intercept tells us that if the calcium level is 0 ppm, the mortality rate would be 1676 deaths per 100,000. However, in this case, this would be an extrapolation.

38

Page 39: Review of  Statistical  Models and  Linear Regression Concepts

To add the graph of the regression line to the scatterplot:

> plot(x, y)
> abline(name of model)

For our data, these commands produced this graph:

> plot(calcium, mortality)
> abline(model)

39

Page 40: Review of  Statistical  Models and  Linear Regression Concepts

Making Predictions; Interpolation and Extrapolation

The linear model makes it possible to make reasonable predictions about any mean response within the range of the explanatory variable.

Statements about the mean at values of the explanatory variable not in the data set but within the range of the observed values are called interpolations.

Making predictions for values outside of the range of the data is called extrapolation and is not necessarily valid.

40

Page 41: Review of  Statistical  Models and  Linear Regression Concepts

To make a prediction: First create a data structure called a data frame that contains the value(s) of the explanatory variable that you want to use in your prediction; you may use any appropriate name:

> newdata = data.frame(predictor = value)

Then attach this new value to make it available to R:

> attach(newdata)

41

Page 42: Review of  Statistical  Models and  Linear Regression Concepts

Now you can make your prediction. You may choose to include these arguments:

- a confidence interval or a prediction interval (default = none)
- level of confidence (default is .95)

> predict(model, newdata, interval="confidence", level=.95)

42

Page 43: Review of  Statistical  Models and  Linear Regression Concepts

Example: Predict the mortality rate in a town where the hardness level of the water is 105 ppm of calcium.  

> newdata=data.frame(calcium=105)
> attach(newdata)

> predict(model, newdata, interval="confidence", level=.95)
       fit      lwr      upr
1 1337.616 1270.624 1404.608

 

The mortality rate would be about 1338 deaths per 100,000.

43
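For the mortality rate of an individual town, rather than the mean response, a prediction interval can be requested instead (a sketch, using the model and newdata objects from the slide's example; the resulting interval would be wider than the confidence interval shown above):

> predict(model, newdata, interval="prediction", level=.95)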

Page 44: Review of  Statistical  Models and  Linear Regression Concepts

We have predicted a mortality rate of about 1338 deaths per 100,000 for a town with a calcium level of 105 ppm. However, there is a town with this calcium level, and the mortality rate for this town is 1247 deaths per 100,000.

A Residual is the difference between the observed value and the predicted value of the response variable for a particular value of the explanatory variable:

Residual = observed value – predicted value

And so the residual for 105 ppm of calcium is 1247 – 1338 = -91.

A residual plot is a scatterplot of the regression residuals against the explanatory variable. Residual plots help us assess the fit of a regression line.

44

Page 45: Review of  Statistical  Models and  Linear Regression Concepts

Here are examples of residual plots:

This residual plot shows no systematic pattern; it shows a uniform scatter of the points about the fitted line, and indicates that the regression line fits the data well.

45

Page 46: Review of  Statistical  Models and  Linear Regression Concepts

Here are examples of residual plots:

A curved pattern shows that the data is not linear, so a straight line is not a good fit for the data.

46

Page 47: Review of  Statistical  Models and  Linear Regression Concepts

Here are examples of residual plots:

This residual plot shows that there is more spread for larger values of the explanatory variable, indicating that predictions will be less accurate when x is large.

47

Page 48: Review of  Statistical  Models and  Linear Regression Concepts

You should also note any values with large residuals. These points are outliers in the vertical (y) direction because they lie far from the line that describes the overall pattern.

 

48

Page 49: Review of  Statistical  Models and  Linear Regression Concepts

The R commands to create a residual plot and show the line for a zero residual are:

> plot(fitted(model), resid(model))
> abline(h=0)

49

Page 50: Review of  Statistical  Models and  Linear Regression Concepts

Robustness of Least Squares Inference

What if the assumptions for this analysis are not met? What if the scatterplot does not show a linear relationship between the variables?

The United Nations Development Programme (UNDP) collects data in the developing world to help countries solve global and national development challenges. One summary measure used by the agency is the Human Development Index (HDI) which attempts to summarize in a single number the progress in health, education, and economics of a country. In 2006 the HDI was as high as 0.965 for Norway and as low as 0.331 for Niger. The gross domestic product per capita (GDPPC), by contrast, is often used to summarize the overall economic strength of a country.

Is there a relationship between the HDI and the GDPPC?

50

Page 51: Review of  Statistical  Models and  Linear Regression Concepts

Here is a scatterplot of GDPPC against HDI. Is it appropriate to fit a linear model to this data? Why or why not?

51

Page 52: Review of  Statistical  Models and  Linear Regression Concepts

Here are histograms of the GDPPC values and the log of those values. How would you describe these distributions?

52

Page 53: Review of  Statistical  Models and  Linear Regression Concepts

How would you describe the relationship between the HDI and the log(GDPPC)?

> cor(HDI, log(GDPPC))
[1] 0.9207729

53

Page 54: Review of  Statistical  Models and  Linear Regression Concepts

How would you describe the relationship between the HDI and the log(GDPPC)?

> UN = lm(HDI~log(GDPPC))
> UN

Call:
lm(formula = HDI ~ log(GDPPC))

Coefficients:
(Intercept)  log(GDPPC)
    -0.5177      0.1422

The regression equation is HDI = 0.1422 log(GDPPC) – 0.5177

54
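Following the same pattern used earlier for adding a regression line to a scatterplot, the fitted model could be displayed on a plot of the transformed data (a sketch, assuming the HDI and GDPPC vectors used in the slides):

> plot(log(GDPPC), HDI)
> abline(UN)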