
ISyE 6414: Regression Analysis

Lectures: MWF 8:00-10:30, MRDC #2404
Early five-week session: May 14 - June 15
(8:00-9:10; 10-min break; 9:20-10:30)

Instructor: Dr. Yajun Mei ("YA_JUNE MAY")
• Email: ymei@isye.gatech.edu; Tel: 404-894-2334 (O)
• Office Hours: MWF 10:30-11:00, after class or in Groseclose #343
• Course Homepage: Canvas (all HWs due on Canvas); backup: http://www.isye.gatech.edu/~ymei/6414
• HW#1 due on Friday, May 18 for on-campus students, and on Wednesday, May 23 for distance-learning students

My academic pathway

• Undergraduate: Math, Peking Univ., BS in 1996
• Worked as a computer programmer in a Chinese bank, 1996-1998
• Graduate: PhD in Math with a minor in EE, Caltech, 1998-2003 (advisor: Dr. Gary Lorden)
• Postdoc in biostatistics: FHCRC, Seattle, 2003 - Sep 2005 (supervisor: Dr. Sarah Holte)
• New Research Fellow: SAMSI & Duke Univ., Fall 2005
• Joined ISyE at Georgia Tech in Jan 2006. Currently a tenured associate professor.

About this course

• Regression Analysis is the key building block for many modern Machine Learning, Artificial Intelligence, and Business Analytics techniques and methods (such as Neural Networks, Deep Learning, Boosting, Random Forests, etc.)

• This course aims to help you
 Understand its theoretical aspects (HW#1, #2, #4, and a midterm)
 Understand its computational aspects (HW#3, and a course project)


Organization of the Course

Textbooks (notes/slides provided):
• Kutner, Nachtsheim, Neter and Li, Applied Linear Statistical Models, 5th ed.
• Faraway, Practical Regression and ANOVA using R (freely downloadable online)

www.abebooks.com/servlet/SearchResults?isbn=9780073108742

Topics:
• Simple Linear Regression (Ch 1-4)
• Multiple Linear Regression (Ch 5-11) (2 weeks, Midterm)
• Advanced Regression (Ch 13-14) (2 weeks)
• Design of Experiments (Ch 13, 14)

Organization of the Course

Grading Policy (the past AVG GPA is [3.7, 3.9]):
• Class attendance (5%)
• Homework (4*10% = 40%): Collaboration encouraged, but you cannot look at any other solutions before submitting.
• One in-class Midterm (25%): 9:15am-10:30am, Friday, May 25 (happy Memorial Day weekend)
• Class project (30%): a team of 2-4 or by yourself. See the handout for possible project topics.
 Proposal (1-3 pages): May 30 (Wed)
 Presentation file: due 7am on June 13 (Wed) (only for on-campus students, not required for DL students)
 Final report: June 15 (Friday)

[Only for Distance Learning students: a two-lecture delay for homeworks and the class project proposal, and a one-week delay for the midterm and the final report.]

Part A

• Basic background on probability and statistics. We might not discuss this background part in detail, but I list some slides here so that you can brush up your memory if necessary.

• Three key probability distributions: Binomial, Poisson, and Normal.

Probability Review

See Appendix A of our text.

• Probability
• Discrete Random Variables
• Continuous Random Variables
• Joint Distributions

Probability

Basics of Probability Theory
• Random experiments, e.g., flip a fair coin three times and observe "Heads" or "Tails"
• Sample space: the set of all possible outcomes, e.g., S = {HHH, THH, HTH, HHT, HTT, THT, TTH, TTT}
• An event: a subset of the sample space of a random experiment, e.g., observe one "Heads"
• Union/Intersection/Complement of events; Counting Techniques; Axioms of Probability; Conditional Probability; Independence; Bayes' Theorem

Random Variable

• A random variable is a function that assigns a real number to each outcome in the sample space of a random experiment.

• Example: Let X be the number of heads when flipping a fair coin three times. Rigorously,

  w     HHH  HHT  HTH  THH  HTT  THT  TTH  TTT
  X(w)   3    2    2    2    1    1    1    0

Discrete Random Variable X

X takes countably many possible values.
• Probability mass function: $p(x) = P(X = x)$
• Cumulative distribution function: $F(x) = P(X \le x) = \sum_{t \le x} p(t)$
• Mean: $\mu = E(X) = \sum_x x\, p(x)$
• Variance: $\sigma^2 = Var(X) = E(X - \mu)^2 = \sum_x (x - \mu)^2 p(x)$
• Standard deviation: $\sigma = \sqrt{Var(X)}$

Important discrete RVs

• Discrete Uniform
• Binomial(n, p)
• Geometric(p)
• Poisson($\lambda$)

• What are the mean and Var/SD of each? (See the R sketch below for a numerical check.)
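As a quick numerical check (not part of the original slides), the mean and variance of these distributions can be computed in R from their probability mass functions; the parameter values below are arbitrary illustrations.

n <- 10; p <- 0.3; lambda <- 2

## Binomial(n, p): mean n*p, variance n*p*(1-p)
x <- 0:n
pmf <- dbinom(x, size = n, prob = p)
c(mean = sum(x * pmf), var = sum(x^2 * pmf) - sum(x * pmf)^2)

## Poisson(lambda): mean = variance = lambda (truncate the infinite support)
x <- 0:100
pmf <- dpois(x, lambda)
c(mean = sum(x * pmf), var = sum(x^2 * pmf) - sum(x * pmf)^2)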

Continuous Random Variable

• Probability density function: $f(x)$, with $P(a \le X \le b) = \int_a^b f(x)\,dx$
• Cumulative distribution function: $F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$
• Mean: $\mu = E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$
• Variance: $\sigma^2 = Var(X) = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx$
• Standard deviation: $\sigma = \sqrt{Var(X)}$

Important Continuous RVs

• Gamma/Weibull/Lognormal/Beta distributions
• What are the mean and Var/SD of each?

Central Limit Theorem

a. If X is Binomial(n, p), then
$$Z = \frac{X - np}{\sqrt{np(1-p)}} \approx N(0,1)$$
(with a continuity correction in practice).

b. If $X_1, X_2, \ldots, X_n$ are iid with mean $\mu$ and variance $\sigma^2$, then
$$Z = \frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \approx N(0,1) \qquad \left(\text{or } Z = \frac{X_1 + \cdots + X_n - n\mu}{\sqrt{n}\,\sigma} \approx N(0,1)\right)$$
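A small R check of part (a), comparing the exact binomial probability with its normal approximation (with continuity correction); n, p, and the cutoff k below are arbitrary choices for illustration.

n <- 50; p <- 0.3; k <- 20
pbinom(k, size = n, prob = p)                     # exact P(X <= k)
pnorm((k + 0.5 - n * p) / sqrt(n * p * (1 - p)))  # CLT approximation with continuity correction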

Statistical Review

• Population parameter vs. Sample statistic

• Point Estimation

• Confidence Interval

• Hypothesis Testing


Population Parameter vs Sample Statistic

• Population: a set of entities about which statistical inferences are to be drawn. Typically the population is very large, making a census or a complete enumeration of all the values in the population impractical or impossible.

• Sample: a subset of observed objects from the population. The sample represents a subset of manageable size (though possibly still massive).

• Parameter: a (typically unobservable) quantity that indexes a family of probability distributions. It can be regarded as a numerical characteristic of a population or a model.

• Statistic: a measure of some attribute of a sample. It is calculated by applying a function to the values of the items comprising the sample.

[Population parameter vs. Sample statistic]

Important Sample Statistics

• Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$
• Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$
• Sample standard deviation: $s = \sqrt{s^2}$
• Sample range: $r = \max(x_i) - \min(x_i)$
• Quartiles:
 • The lower quartile $q_1$: 25% of the data is less than $q_1$
 • The median $q_2$: 50% of the data is less than $q_2$
 • The upper quartile $q_3$: 75% of the data is less than $q_3$
• As a measure of variability, the interquartile range (IQR) is defined as IQR = $q_3 - q_1$

• Plots: Stem-and-Leaf Diagram/Plot, Histogram, Box Plots, Probability Plots (or Normal QQ plots)
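These sample statistics and plots are all built into R; a minimal sketch on an arbitrary data vector:

x <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9)   # any numeric data
mean(x); var(x); sd(x)                           # sample mean, variance, SD
max(x) - min(x)                                  # sample range
quantile(x, c(0.25, 0.50, 0.75))                 # quartiles q1, q2 (median), q3
IQR(x)                                           # interquartile range q3 - q1
stem(x); hist(x); boxplot(x); qqnorm(x)          # stem-and-leaf, histogram, box plot, normal QQ plot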


Normal Distribution

Assume $X_1, X_2, \ldots, X_n$ are iid normal with mean $\mu$ and variance $\sigma^2$.

• Sample mean: $\bar{X} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right)$, or equivalently $\frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \sim N(0,1)$

• Sample variance: $S^2 = \frac{\sum (X_i - \bar{X})^2}{n-1}$ satisfies $\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$ (chi-square distribution)

Normal Distribution (Cont.)

Assume $X_1, X_2, \ldots, X_n$ are iid normal with mean $\mu$ and variance $\sigma^2$.

• The sample mean $\bar{X}$ is independent of the sample variance $S^2 = \frac{\sum (X_i - \bar{X})^2}{n-1}$. Moreover,
$$\frac{\sqrt{n}(\bar{X} - \mu)}{S} = \frac{N(0,1)}{\sqrt{\chi^2_{n-1}/(n-1)}}$$
has a t-distribution with df = n-1. [In many cases, $\frac{\hat{\theta} - \theta}{s.e.(\hat{\theta})}$ often has a t-distribution.]

• In Appendix B on page 1317, the t-distribution critical point is tabulated as $t_{\alpha, df} = t(A, df)$ with $A = 1 - \alpha$, so $t_{0.025, 18} = t(0.975, 18) = 2.101$.
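Instead of the table in Appendix B, the same critical point can be obtained in R with qt() (a quick check):

qt(0.975, df = 18)   # t_{0.025, 18} = t(0.975, 18) = 2.101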


Point Estimation

• The bias of an estimator $\hat{\theta}$ is $Bias(\hat{\theta}) = E(\hat{\theta}) - \theta$. An estimator is unbiased if the bias is 0.
• The variance of the estimator is $Var(\hat{\theta})$.
• The mean squared error of the estimator is $MSE(\hat{\theta}) = E(\hat{\theta} - \theta)^2 = Var(\hat{\theta}) + Bias(\hat{\theta})^2$.
• The standard error of $\hat{\theta}$ is $s.e.(\hat{\theta}) = \sqrt{Var(\hat{\theta})}$.

Methods of Point Estimation

• There are three methodologies for creating point estimates of a population parameter:
 A. Method of moments (MOM)
 B. Method of maximum likelihood (MLE)
 C. Bayesian estimation of parameters

MOM & MLE

• The method of moments (MOM) estimators are found by equating the population moments to the sample moments and solving the resulting equations, e.g.,
$$h(\theta) = E(X) = \bar{X} = \frac{X_1 + \cdots + X_n}{n}.$$

• The maximum likelihood estimator (MLE) is the value of $\theta$ that maximizes the likelihood function
$$L(\theta) = f(x_1) f(x_2) \cdots f(x_n).$$
If the domain of $f(x)$ does not depend on $\theta$, solving $\frac{d \log L(\theta)}{d\theta} = 0$ yields the MLE. Otherwise, plot $L(\theta)$ and find the maximum.
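A minimal R sketch of both ideas for a Poisson($\lambda$) sample (the data vector below is hypothetical): the MOM estimate is the sample mean, and maximizing the log-likelihood numerically with optimize() recovers essentially the same value, as it should for the Poisson model.

x <- c(2, 0, 3, 1, 4, 2, 1, 0, 2, 3)                  # hypothetical Poisson data
mean(x)                                               # MOM: solve E(X) = lambda = xbar
loglik <- function(lambda) sum(dpois(x, lambda, log = TRUE))
optimize(loglik, interval = c(0.01, 20), maximum = TRUE)$maximum   # MLE (numerical)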

Confidence Interval & Hypothesis Testing

One sample:
1. Normal mean with known variance (one-sided)
2. Normal mean with unknown variance
3. Normal variance
4. Proportion of a Binomial distribution

Two samples (inference on the mean difference):
5. Two independent normal distributions: variances known
6. Two independent normal distributions: unknown and equal variances
7. Two independent normal distributions: unknown and unequal variances
8. Paired samples

Part B

• Overview of Supervised Learning

• Simple Linear Regression


Overview of Supervised Learning

Supervised Learning (directed data mining, learning with a teacher):
• The observed data are of the form $(Y_i, X_{i1}, \ldots, X_{ip})$ for $i = 1, \ldots, n$, where the variables can be split into two groups:
 independent variables (explanatory variables, inputs, predictors) $X = (X_1, \ldots, X_p)$, and
 one (or more) dependent variables (outputs, responses) $Y$.
• The objective is to predict Y given values of the input X.

Supervised Learning

• Observed data (training data): $(Y_i, X_{i1}, \ldots, X_{ip})$ for $i = 1, \ldots, n$
• Objective: find a function $f(x_{new}) = f(x_1, \ldots, x_p)$ that can predict $Y$ well for any given input $x_{new} = (x_1, \ldots, x_p)$.
• Is the relationship deterministic? (Many classification tasks in machine learning.)

The Additive Error Model

• Key statistical idea: Observed Data = True Value + Noise
• For the observed training data,
$$Y_i = f(x_{i1}, \ldots, x_{ip}) + \epsilon_i, \quad i = 1, \ldots, n,$$
where the errors $\epsilon_i$ are iid with mean 0 and are independent of the $X$'s.
• Find the function $f(x_1, \ldots, x_p)$ or find its approximation! (Generative vs. predictive models)
• The simplest case: when $p = 1$, $f(x) = \beta_0 + \beta_1 x$, which gives simple linear regression: $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$.
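A minimal R sketch (with arbitrary parameter values) of generating data from this additive error model in the simplest case p = 1:

set.seed(1)
n <- 50; beta0 <- 2; beta1 <- -0.5; sigma <- 1            # arbitrary true values
x <- runif(n, 0, 10)
y <- beta0 + beta1 * x + rnorm(n, mean = 0, sd = sigma)   # Y = f(x) + noise
plot(x, y)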

The first Main Topic

• Simple linear regression


Empirical Models: Regression

• Many engineering and scientific problems are concerned with determining a relationship between a set of variables.
• For example: Y = college GPA in the 1st year, X = high school GPA; or Y = mortality rate, X = immunization rate.
• Knowledge of such a relationship would enable us to predict the output Y.
• Regression analysis is a statistical technique that is very useful for these types of problems, as it can be used to build a model to predict Y at a given X value.

Example: Immunized and Mortality

• Suppose one wants to investigate the relationship between the percentage of children who have been immunized against the infectious diseases diphtheria, pertussis, and tetanus (DPT) in a given country and the corresponding mortality rate for children under five years of age in that country.

• The UN Children's Fund (UNICEF) considers the under-five mortality rate to be one of the most important indicators of the level of well-being for children.

Data

X = percentage of children immunized against DPT;
Y = under-five mortality rate per 1000 live births, in 1992

Nation          X   Y    Nation    X   Y    Nation    X   Y
Bolivia         77  118  Ethiopia  13  208  Mexico    91  33
Brazil          69  65   Finland   95  7    Poland    98  16
Cambodia        32  184  France    95  9    Russia    73  32
Canada          85  8    Greece    54  9    Senegal   47  145
China           94  43   India     89  124  Turkey    76  87
Czech Republic  99  12   Italy     95  10   UK        90  9
Egypt           89  55   Japan     87  6

Look at Scatter Plot

The scatter plot shows that the mortality rate tends to decrease as the percentage of children immunized increases.

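A short R sketch for drawing this scatter plot from the data table above:

x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9)
plot(x, y, xlab = "% of children immunized against DPT",
     ylab = "Under-five mortality rate per 1000 live births")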

Question

X = percentage of children immunized against DPT;
Y = under-five mortality rate per 1000 live births, in 1992

Questions:
• Are Y and X related (associated), and how?
• Does better immunization improve the mortality rate?
• Can we use the data to develop a model for predicting the under-five mortality rate from the percentage of children immunized against DPT?

Linear Regression

• It is interesting both theoretically, because of the elegance of the underlying theory, and from an applied point of view, because of the wide variety of uses.
• Fit a model for a dependent variable as a function of one or more independent variables.
• We will talk about:
 Building models
 Assessing fit and reliability
 Drawing conclusions

A Simple Linear Regression

• We are interested in developing a linear equation that best summarizes the relationship in a sample between the response variable (Y) and the predictor variable (or independent variable) x:
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
where the $\epsilon_i$'s are independent with mean 0 and variance $\sigma^2$.

• The equation is also used to predict Y from X.

(a) How to estimate the $\beta$'s

• We observe n data points $(Y_i, x_i)$ and assume
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
where the $\epsilon_i$'s are independent with mean 0 and variance $\sigma^2$.

• How do we estimate the $\beta$'s?

Method of Least Squares

• The (ordinary) least squares estimator: choose $\beta_0$ and $\beta_1$ to minimize the residual sum of squares (RSS)
$$RSS(\beta_0, \beta_1) = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 x_i)^2.$$

Why Least Squares?

• It gives the maximum likelihood estimators (MLE) of $\beta_0$ and $\beta_1$ when the errors $\epsilon_i$ are iid $N(0, \sigma^2)$.
• It leads to the best linear unbiased estimators (BLUE) of $\beta_0$ and $\beta_1$, whether or not the errors $\epsilon_i$ are normally distributed.

[A linear estimator is of the form $\sum_{i=1}^n c_i Y_i$. The meaning of BLUE for $\beta_1$:
minimize $Var(\sum c_i Y_i) = \sigma^2 \sum c_i^2$
subject to $E(\sum c_i Y_i) = \sum c_i (\beta_0 + \beta_1 x_i) = \beta_1$ for all $\beta_0$ and $\beta_1$,
i.e., subject to $\sum c_i \beta_0 = 0$ (so $\sum c_i = 0$) and $\sum c_i x_i = 1$.]

Method of Least Squares

• Minimizing the residual sum of squares (RSS) gives the solutions
$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},$$
where $S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2$ and $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n\bar{x}\bar{y}$.

Example (Cont.)

X = percentage of children immunized against DPT;
Y = under-five mortality rate per 1000 live births, in 1992 (data table as shown earlier)

Answer

• For our data:
$$n = 20, \quad \bar{x} = 77.4, \quad \bar{y} = 59, \quad \sum x_i^2 = 130446, \quad \sum x_i y_i = 68626$$
$$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2 = 10630.8$$
$$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n\bar{x}\bar{y} = -22706$$
$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{-22706}{10630.8} = -2.1359; \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 59 + 2.1359 \times 77.4 = 224.3163$$

• Thus, the fitted (simple linear regression) model is
$$Y = 224.3163 - 2.1359\,x + \epsilon, \quad \text{or} \quad E(Y) = 224.3163 - 2.1359\,x.$$

(b) Example (Cont.)

The fitted (simple linear regression) model is
$$Y = 224.3163 - 2.1359\,x + \epsilon$$

• Estimate the mean under-five mortality rate per 1000 live births when x = 10.
• Repeat the question when x = 90.

[Answers: 202.9573 and 32.0853, respectively.]

(c) How to estimate $\sigma^2$?

• Recall that the model is $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, where the $\epsilon_i$'s are iid with mean 0 and variance $\sigma^2$.

• We have the estimators $\hat{\beta}_0$ and $\hat{\beta}_1$; how do we estimate the third parameter, $\sigma^2$?

Answer:
• It is natural to use the observed fitting errors (residuals) $e_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$ and the residual sum of squares $RSS = \sum_{i=1}^n e_i^2$.

• The estimator of $\sigma^2$ is $\hat{\sigma}^2 = \frac{RSS}{n-2}$ [and $\frac{(n-2)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-2}$].

• In practice, it is easier to compute RSS as follows:
$$RSS = \sum_{i=1}^n e_i^2 = S_{yy} - \hat{\beta}_1 S_{xy} = S_{yy} - \frac{S_{xy}^2}{S_{xx}}.$$

Example (Cont.)

In our example, the fitted (simple linear regression) model is $Y = 224.3163 - 2.1359\,x + \epsilon$. Find an estimate of $\sigma^2 = Var(\epsilon)$.

• Two ways to calculate the residual sum of squares RSS:
 Calculate the observed fitting errors (residuals) $e_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$ and then $RSS = \sum_{i=1}^n e_i^2 = 29000.95$.
 Use $S_{xx} = 10630.8$, $S_{xy} = -22706$, $S_{yy} = 77498$, and
$$RSS = S_{yy} - \hat{\beta}_1 S_{xy} = S_{yy} - \frac{S_{xy}^2}{S_{xx}} = 77498 - (-22706)^2/10630.8 = 29000.95.$$

• The estimator of $\sigma^2$ is $\hat{\sigma}^2 = \frac{RSS}{n-2} = 1611.164$ (or $\hat{\sigma} = \sqrt{1611.164} = 40.1393$).

R code (calculator-type):

x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9)

Sxx <- sum(x * x) - length(x) * (mean(x))^2
Sxy <- sum(x * y) - length(x) * mean(x) * mean(y)
Syy <- sum(y * y) - length(y) * (mean(y))^2

beta1hat <- Sxy / Sxx
beta0hat <- mean(y) - beta1hat * mean(x)

### Two ways to compute RSS
error <- y - (beta0hat + beta1hat * x)
RSS <- sum(error * error)
### Or
RSS <- Syy - Sxy^2 / Sxx
sigma2hat <- RSS / (length(x) - 2)

c(beta0hat, beta1hat, sigma2hat)

(d) Properties of the OLS estimators

• To derive the statistical inference for the (ordinary) least squares estimators $\hat{\beta}_1$ and $\hat{\beta}_0$, we need to find $E(\hat{\beta}_i)$ and $Var(\hat{\beta}_i)$. Then, by the central limit theorem, asymptotically
$$\frac{\hat{\beta}_i - E(\hat{\beta}_i)}{\sqrt{Var(\hat{\beta}_i)}} \approx N(0,1).$$

Key Steps

$$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2, \qquad S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n\bar{x}\bar{y}$$

Assumption: the $x_i$'s are constants, and the $Y_i$'s are independent with $E(Y_i) = \beta_0 + \beta_1 x_i$ and $Var(Y_i) = \sigma^2$.

• $\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \sum_{i=1}^n c_i Y_i$, where $c_i = \frac{x_i - \bar{x}}{S_{xx}}$ satisfies the following three properties:
 $\sum_{i=1}^n c_i = 0$
 $\sum_{i=1}^n c_i x_i = 1$
 $\sum_{i=1}^n c_i^2 = \frac{1}{S_{xx}}$

• $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x} = \sum_{i=1}^n \left(\frac{1}{n} - c_i \bar{x}\right) Y_i$


(d) Properties of OLS

• Unbiased: $E(\hat{\beta}_1) = \beta_1$ and $E(\hat{\beta}_0) = \beta_0$.
• Variance: $Var(\hat{\beta}_1) = \frac{\sigma^2}{S_{xx}}$ and $Var(\hat{\beta}_0) = \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)$, where $S_{xx} = \sum (x_i - \bar{x})^2$.
• Note that they are correlated: $Cov(\hat{\beta}_0, \hat{\beta}_1) = -\frac{\sigma^2 \bar{x}}{S_{xx}}$.
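The unbiasedness and the variance/covariance expressions above can be checked with a small Monte Carlo simulation in R (a sketch; the "true" parameter values are arbitrary, and the x values are those of the DPT example):

set.seed(6414)
beta0 <- 224.3; beta1 <- -2.14; sigma <- 40          # arbitrary true values
x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
n <- length(x); Sxx <- sum((x - mean(x))^2)
est <- replicate(10000, {
  y <- beta0 + beta1 * x + rnorm(n, 0, sigma)
  coef(lm(y ~ x))
})
rowMeans(est)                                                        # approx (beta0, beta1): unbiasedness
c(var(est["x", ]), sigma^2 / Sxx)                                    # Var(beta1hat) vs sigma^2 / Sxx
c(var(est["(Intercept)", ]), sigma^2 * (1/n + mean(x)^2 / Sxx))      # Var(beta0hat) vs formula
c(cov(est["(Intercept)", ], est["x", ]), -sigma^2 * mean(x) / Sxx)   # Cov(beta0hat, beta1hat) vs formula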

CI and Tests

• Since $\sigma^2$ is unknown, consider $\hat{\sigma}^2 = \frac{RSS}{n-2}$, and thus $\frac{(n-2)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-2}$.

• Then $\frac{\hat{\beta}_1 - \beta_1}{se(\hat{\beta}_1)}$ and $\frac{\hat{\beta}_0 - \beta_0}{se(\hat{\beta}_0)}$ have t-distributions with n-2 degrees of freedom.

(d1) Inference on $\beta_1$

• When testing $H_0: \beta_1 = 0$ versus $H_1: \beta_1 \ne 0$, the test statistic is
$$T_{obs} = \frac{\hat{\beta}_1}{se(\hat{\beta}_1)} = \frac{\hat{\beta}_1}{\hat{\sigma}/\sqrt{S_{xx}}}$$
and we reject $H_0$ if $|T_{obs}| \ge t_{\alpha/2, n-2}$.

• A $1-\alpha$ confidence interval on $\beta_1$ is
$$\hat{\beta}_1 \pm t_{\alpha/2, n-2}\, \frac{\hat{\sigma}}{\sqrt{S_{xx}}}.$$
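As a quick R sketch of these formulas, plugging in the summary numbers from the earlier slides ($S_{xx}$, $\hat{\sigma}$, and $\hat{\beta}_1$) reproduces the test statistic and confidence interval reported on the next slides:

beta1hat <- -2.1359; sigmahat <- 40.1393; Sxx <- 10630.8; n <- 20
se1 <- sigmahat / sqrt(Sxx)                     # se(beta1hat)
beta1hat / se1                                  # T_obs, about -5.49
beta1hat + c(-1, 1) * qt(0.975, n - 2) * se1    # 95% CI, about (-2.95, -1.32)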

Example (Cont.)

The fitted (simple linear regression) model is $Y = 224.3163 - 2.1359\,x + \epsilon$.

• Test $H_0: \beta_1 = 0$ versus $H_1: \beta_1 \ne 0$ at the $\alpha = 0.05$ level.

[Recall $S_{xx} = 10630.8$, $\hat{\sigma} = 40.1393$, and $t_{\alpha/2, n-2} = t_{0.025, 18} = 2.101$. Then $T_{obs} = \hat{\beta}_1 / (\hat{\sigma}/\sqrt{S_{xx}}) = -2.1359 / (40.1393/\sqrt{10630.8}) = -5.486$. Since $|T_{obs}| = 5.486 \ge 2.101$, we reject $H_0$.]

Example (Cont.)

The fitted (simple linear regression) model is $Y = 224.3163 - 2.1359\,x + \epsilon$.

• Find a 95% confidence interval on $\beta_1$.

[Recall $S_{xx} = 10630.8$, $\hat{\sigma} = 40.1393$, and $t_{\alpha/2, n-2} = t_{0.025, 18} = 2.101$, so $\hat{\beta}_1 \pm t_{\alpha/2, n-2}\, \hat{\sigma}/\sqrt{S_{xx}} = -2.1359 \pm 0.8179 = (-2.9538, -1.3180)$.]

(d2) Inference on $\beta_0$

• When testing $H_0: \beta_0 = b_0$ versus $H_1: \beta_0 \ne b_0$, the test statistic is
$$T_{obs} = \frac{\hat{\beta}_0 - b_0}{se(\hat{\beta}_0)} = \frac{\hat{\beta}_0 - b_0}{\hat{\sigma}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}}$$
and we reject $H_0$ if $|T_{obs}| \ge t_{\alpha/2, n-2}$.

• A $1-\alpha$ confidence interval on $\beta_0$ is
$$\hat{\beta}_0 \pm t_{\alpha/2, n-2}\, \hat{\sigma}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}.$$

Example (Cont.)

The fitted (simple linear regression) model is $Y = 224.3163 - 2.1359\,x + \epsilon$.

• Test $H_0: \beta_0 = 210$ versus $H_1: \beta_0 \ne 210$ at the $\alpha = 0.05$ level.

[Recall $S_{xx} = 10630.8$, $\hat{\sigma} = 40.1393$, $se(\hat{\beta}_0) = \hat{\sigma}\sqrt{1/n + \bar{x}^2/S_{xx}} = 31.44$, and $t_{\alpha/2, n-2} = t_{0.025, 18} = 2.101$. Then $T_{obs} = (\hat{\beta}_0 - b_0)/se(\hat{\beta}_0) = (224.3163 - 210)/31.44 = 0.455$. Since $|T_{obs}| = 0.455 < 2.101$, we do not reject $H_0$.]

Example (Cont.)

The fitted (simple linear regression) model is $Y = 224.3163 - 2.1359\,x + \epsilon$.

• Find a 95% confidence interval on $\beta_0$.

[Recall $S_{xx} = 10630.8$, $\hat{\sigma} = 40.1393$, and $t_{\alpha/2, n-2} = t_{0.025, 18} = 2.101$, so $\hat{\beta}_0 \pm t_{\alpha/2, n-2}\, \hat{\sigma}\sqrt{1/n + \bar{x}^2/S_{xx}} = 224.3163 \pm 66.0537 = [158.26, 290.37]$.]

(d3) Inference on $\beta_0 + \beta_1 x_{new}$

For the simple linear regression model $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$: for a given $x_{new}$, what is the confidence interval for the mean response $E(Y) = \beta_0 + \beta_1 x_{new}$?

Point estimator: $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x_{new} = \sum_{i=1}^n \left[\frac{1}{n} + c_i (x_{new} - \bar{x})\right] Y_i$

• $E(\hat{\beta}_0 + \hat{\beta}_1 x_{new}) = \beta_0 + \beta_1 x_{new}$

• $Var(\hat{\beta}_0 + \hat{\beta}_1 x_{new}) = \sigma^2\left[\frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}\right]$

• The $1-\alpha$ confidence interval on the mean response is
$$\hat{\beta}_0 + \hat{\beta}_1 x_{new} \pm t_{\alpha/2, n-2}\, \hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}}.$$

Example (Cont.)

The fitted (simple linear regression) model is $Y = 224.3163 - 2.1359\,x + \epsilon$.

• Find a 95% confidence interval on the mean under-five mortality rate when x = 10.

[Recall $S_{xx} = 10630.8$, $\hat{\sigma} = 40.1393$, $\bar{x} = 77.4$, and $t_{0.025, 18} = 2.101$, so $\hat{Y} \pm t_{\alpha/2, n-2}\, \hat{\sigma}\sqrt{1/n + (x_{new} - \bar{x})^2/S_{xx}} = 202.9576 \pm 58.2619 = [144.6957, 261.2195]$.]

(e) Prediction of a New Observation

For the simple linear regression model $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$: how do we predict a future observation Y corresponding to a given $x_{new}$?

• Point estimator: $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x_{new}$
• How about a confidence interval on Y? This is often called a prediction interval.

Key Idea

For the future response $Y = \beta_0 + \beta_1 x_{new} + \epsilon_{future}$, consider the estimator $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x_{new}$. Then:

• $E(Y - \hat{Y}) = 0$
• $Var(Y - \hat{Y}) = Var\big(\beta_0 + \beta_1 x_{new} + \epsilon_{future} - \hat{\beta}_0 - \hat{\beta}_1 x_{new}\big) = Var(\epsilon_{future}) + Var(\hat{\beta}_0 + \hat{\beta}_1 x_{new})$
$$= \sigma^2 + \frac{\sigma^2}{n} + (x_{new} - \bar{x})^2 \frac{\sigma^2}{S_{xx}} = \sigma^2\left[1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}\right]$$

Key Idea (Cont.)

For the future response $Y = \beta_0 + \beta_1 x_{new} + \epsilon$, consider the estimator $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x_{new}$. Then:

• $\dfrac{Y - \hat{Y}}{\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}}} \sim N(0,1)$

• So $\dfrac{Y - \hat{Y}}{\hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}}} \sim T_{n-2}$

Prediction Interval

For the simple linear regression model $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$: how do we predict a future observation Y corresponding to a given $x_{new}$?

• Point estimator: $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x_{new}$
• The $1-\alpha$ prediction interval is
$$\hat{Y} \pm t_{\alpha/2, n-2}\, \hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}}.$$

Example (Cont.)

The fitted (simple linear regression) model is $Y = 224.3163 - 2.1359\,x + \epsilon$.

• Find a 95% prediction interval on Y when x = 10.

[Recall $S_{xx} = 10630.8$, $\hat{\sigma} = 40.1393$, $\bar{x} = 77.4$, and $t_{0.025, 18} = 2.101$, so $\hat{Y} \pm t_{\alpha/2, n-2}\, \hat{\sigma}\sqrt{1 + 1/n + (x_{new} - \bar{x})^2/S_{xx}} = 202.9576 \pm 102.4984 = [100.4592, 305.4560]$.]

Example (Cont.)

The fitted (simple linear regression) model is $Y = 224.3163 - 2.1359\,x + \epsilon$.

• Find a 95% prediction interval on Y when x = 90.

[Recall $S_{xx} = 10630.8$, $\hat{\sigma} = 40.1393$, $\bar{x} = 77.4$, and $t_{0.025, 18} = 2.101$, so $\hat{Y} \pm t_{\alpha/2, n-2}\, \hat{\sigma}\sqrt{1 + 1/n + (x_{new} - \bar{x})^2/S_{xx}} = 32.0881 \pm 87.0244 = [-54.9364, 119.1125]$.]

Summary (I): Point Estimation

Assume that we observe $(x_i, y_i)$ for $i = 1, \ldots, n$, and we consider the simple linear regression model
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
where the $\epsilon_i$'s are iid with mean 0 and variance $\sigma^2$.

• Define
$$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2, \quad S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n\bar{x}\bar{y}, \quad S_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - n\bar{y}^2$$

• The least squares estimators are
$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.$$

Summary (II): Estimation of $\sigma^2$ and Inference

• The estimator of $\sigma^2$ is $\hat{\sigma}^2 = \frac{RSS}{n-2}$, where $RSS = \sum_{i=1}^n e_i^2$ and the residuals are $e_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$. In practice, it is better to use
$$RSS = \sum_{i=1}^n e_i^2 = S_{yy} - \hat{\beta}_1 S_{xy} = S_{yy} - \frac{S_{xy}^2}{S_{xx}}.$$

• $\dfrac{\hat{\beta}_1 - \beta_1}{se(\hat{\beta}_1)} \sim T_{n-2}$, with $se(\hat{\beta}_1) = \dfrac{\hat{\sigma}}{\sqrt{S_{xx}}}$

• $\dfrac{\hat{\beta}_0 - \beta_0}{se(\hat{\beta}_0)} \sim T_{n-2}$, with $se(\hat{\beta}_0) = \hat{\sigma}\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{xx}}}$

Summary (III): Inference

At a given $x_{new}$:
• The point estimator of Y is $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x_{new}$.
• A $1-\alpha$ confidence interval on the mean response is
$$\hat{Y} \pm t_{\alpha/2, n-2}\, \hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}}$$
• A $1-\alpha$ prediction interval on a future observation is
$$\hat{Y} \pm t_{\alpha/2, n-2}\, \hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}}$$
(appropriate for testing data)

Part C

• Introduction to R

What is R

• R is a system for statistical computation and graphics

• It consists of a language plus a run-time environment with graphics, a debugger, access to certain system functions, and the ability to run programs stored in script files

• Free software
• OS: Windows, Unix, Linux
• Homepage: http://www.r-project.org

Installing R Under Windows

• Needs Windows OS (32/64-bit)
• Go to any CRAN site (see http://cran.r-project.org/mirrors.html for a list) and follow the instructions
• Download R 3.1.0 for Windows ("R-3.1.0-win.exe", size: 54Mb), double-click on the icon, and follow the instructions to install

Data With R

• Objects: vector, factor, array, matrix, data.frame, ts, list
• Mode (numerical, character, complex, and logical); Length
• Reading data stored in text (ASCII) files: read.table(), scan(), and read.fwf()
• Saving data: write(x, file="data.txt"); write.table() writes a data.frame to a file
• Generating data (see the sketch below)
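A minimal sketch of these data functions (the file name data.txt is just an illustration):

dat <- data.frame(x = 1:5, y = rnorm(5))                 # generating data
write.table(dat, file = "data.txt", row.names = FALSE)   # saving a data.frame to a file
dat2 <- read.table("data.txt", header = TRUE)            # reading it back
str(dat2)                                                # object structure (mode, length, ...)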


Linear Regression in R

x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9)

fm1 <- lm(y ~ x)
fm1

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
    224.316       -2.136

> summary(fm1)

Call:
lm(formula = y ~ x)

Residuals:
      Min        1Q    Median        3Q       Max
-99.97934 -16.57854   0.06684  20.84946  89.77608

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 224.3163    31.4403   7.135 1.20e-06 ***
x            -2.1359     0.3893  -5.486 3.28e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40.14 on 18 degrees of freedom
Multiple R-Squared: 0.6258,     Adjusted R-squared: 0.605
F-statistic: 30.1 on 1 and 18 DF,  p-value: 3.281e-05

Confidence Intervals on the Coefficients

> confint(fm1)
                 2.5 %     97.5 %
(Intercept) 158.262579 290.369998
x            -2.953763  -1.317976

> confint(fm1, level = 0.99)
                 0.5 %     99.5 %
(Intercept) 133.817133 314.815444
x            -3.256453  -1.015286

Intervals for xnew

> xnew <- data.frame(x = c(10, 90))
## Confidence intervals on the mean response
> predict(fm1, xnew, interval = "confidence", level = 0.95)
        fit       lwr       upr
1 202.95759 144.69566 261.21953
2  32.08805  10.59907  53.57702

## Prediction intervals for future observations
> predict(fm1, xnew, interval = "prediction", level = 0.95)
        fit        lwr      upr
1 202.95759  100.45917 305.4560
2  32.08805  -54.93637 119.1125