STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55...

42
STA305/1004 - Review of Statistical Theory September 10, 2019

Transcript of STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55...

Page 1: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

STA305/1004 - Review of Statistical Theory

September 10, 2019

Page 2: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

Data

Experimental data describes the outcome of the experimental run. For example10 successive runs in a chemical experiment produce the following data:set.seed(100)

# Generate a random sample of 5 observations

# from a N(60,10^2)

dat <- round(rnorm(5,mean = 60,sd = 10),1)

dat

## [1] 55.0 61.3 59.2 68.9 61.2

Page 3: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

Distributions

Distributions can be displayed graphically or numerically.

A histogram is a graphical summary of a data set.summary(dat)

## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 55.00 59.20 61.20 61.12 61.30 68.90

p p p p p pSmallest 25 th 50thbeuuwh.WS th

Value percentile . percentile

Page 4: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

Distributions

hist(dat)

Histogram of dat

dat

Frequency

55 60 65 70

0.0

1.0

2.0 Lobs.

between SS - Go

3 bins .

Page 5: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

Distributions

I The total aggregate of observations that might occur as a result ofrepeatedly performing a particular operation is called a population ofobservations.

I The observations that actually occur are a sample from the population.

Page 6: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

Continuous Distributions

I A continuous random variable X is fully characterized by it’s densityfunction f (x).

I f (x) Ø 0, f is piecewise continuous, ands Œ

≠Œ f (x)dx = 1.

I The cumulative distribution function (CDF) of X is defined as:

F (x) = P(X Æ x) =⁄ x

≠Œf (x)dx .

ri . rea .

Page 7: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

Continuous Distributions

I If f is continuous at x then F Õ(x) = f (x) (fundamental theorem ofcalculus).

I The CDF can be used to calculate the probability that X falls in the interval(a, b). This is the area under the density curve which can also be expressedin terms of the CDF:

P (a < X < b) =⁄ b

af (x)dx = F (b) ≠ F (a).

I In R a list of all the common distributions can be obtained by the commandhelp("distributions").

I For example, the normal density and CDF are given by dnorm() andpnorm().

* ¥¥¥¥*e.

If .

Tknsity.

Page 8: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

Continuous Distributions

100 observations (using rchisq()) from a Chi-square distribution on 10 degreesof freedom ‰2

10. The density function of the ‰210 is superimposed over the

histogram of the sample.

Histogram of x

x

Frequency

0 5 10 15 20

05

15

Page 9: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

Randomness

I A random drawing is where each member of the population has an equalchance of being selected.

I The hypothesis of random sampling may not apply to real data.I For example, cold days are usually followed by cold days.I So daily temperature not directly representable by random drawings.I In many cases we can’t rely on the random sampling property although

design can make this assumption relevant.

Page 10: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

Parameters and Statistics

What is the di�erence between a parameter and a statistic?I A parameter is a population quantity and a statistic is a quantity based on

a sample drawn from the population.

Example: The population of all adult (18+ years old) males in Toronto, Canada.I Suppose that there are N adult males and the quantity of interest, y , is age.I A sample of size n is drawn from this population.I The population mean is µ =

qNi=1 yi /N.

I The sample mean is y =qn

i=1 yi /n.OO- fixed

Page 11: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

Residuals and Degress of Freedom

yi ≠ y is called a residual.I Since

q(yi ≠ y) = 0 any n ≠ 1 completely determine the the last

observation.I This is a constraint on the the residuals.I So n residuals have n ≠ 1 degrees of freedom since the last residual cannot

be freely chosen.

nn - -

- ,

?Li-

Ey-

E Li- nj

⇐ = n . Tf - niy = O

Page 12: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

The Normal Distribution

The density function of the normal distribution with mean µ and standarddeviation ‡ is:

„(x) = 1‡

Ô2fi

exp3

≠12

1x ≠ µ‡

224

The cumulative distribution function (CDF) of a N(0, 1) distribution,

�(x) = P(X < x) =⁄ x

≠Œ„(x)dx

S -d or O ? variance .

O O

It

⇐ expftz .se )µ

-

-42=1 .

Page 13: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

The Normal Distribution

x <- seq(-4,4,by=0.1)

plot(x,dnorm(x),type="l",main = "The Standard Normal Distribution",

ylab=expression(paste(phi(x))))

−4 −2 0 2 4

0.0

0.2

0.4

The Standard Normal Distribution

x

φ(x)

#normal density function .

= -

Page 14: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

The Normal Distribution

plot(x <- seq(-2,2,by=0.1),pnorm(x),type="l",

xlab="x",ylab=expression(paste(Phi(x))),

main = "Standard Normal CDF")

−2 −1 0 1 2

0.0

0.4

0.8

Standard Normal CDF

x

Φ(x)

A

p *← . ,pace)

Page 15: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

The Normal Distribution

A random variable X that follows a normal distribution with mean µ andvariance ‡2 will be denoted by

X ≥ N!µ, ‡2" .

If Y ≥ N!µ, ‡2" then

Z ≥ N(0, 1),

whereZ = Y ≠ µ

‡.

-mean

( variance.

✓location

⇒ y = MtEZ,Z - Naya

Scala .

Yn NC µ ,02 )

Page 16: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

The Normal Distribution

X ≥ N(5, 3). Use R to find P(4 < X < 6).pnorm(6,mean = 5,sd = sqrt(3))-pnorm(4,mean = 5,sd = sqrt(3))

## [1] 0.4362971

, III i an u

(Cumulative

%st.funchm\#¥X" "

Page 17: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

Normal Quantile Plots

The following data are the weights from 11 tomato plants.

## [1] 29.9 11.4 26.6 23.7 25.3 28.5 14.2 17.9 16.5 21.1 24.3

Do the weights follow a Normal distribution?

Page 18: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

Normal Quantile Plots

A normal quantile plot in R can be obtained using qqnorm() for the normalprobability plot and qqline() to add the straight line.qqnorm(tomato.data$pounds); qqline(tomato.data$pounds)

−1.5 −0.5 0.5 1.0 1.5

1525Normal Q−Q Plot

Theoretical Quantiles

Sam

ple

Qua

ntile

sthese plots indicate

a Systematicdeathfromstraight

lire .

--

Systematicdeviation

indicates

Non - normality .

t÷÷÷ I

Page 19: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

Central Limit Theorem

The central limit theorem states that if X1, X2, ... is an independent sequence ofidentically distributed random variables with mean µ = E(Xi ) and variance‡2 = Var(Xi ) then

limnæŒ

P3

X ≠ µ‡Ôn

Æ x4

= �(x),

where X =qn

i=1 Xi /n and �(x) is the standard normal CDF. This means thatthe distribution of X is approximately N

1µ, ‡Ôn

2.K

Page 20: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

Central Limit Theorem

Example: A fair coin is flipped 50 times. What is the distribution of the averagenumber of heads?

Xc - = f- if H A Binomial

Pki-1=0.5 { if T D Bernoulli

D t

v :*:* 9.?:c:::* ,

O

F- (Eiji) = to IiExit = xsoxus

= 25

ra YE÷=¥iar⇐⇒To

= t.

= fois Var Hit =¥z Sox . 5×5

Page 21: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

E

-7¥i Nfas

, asx÷ )u

proportion - f heads in So tosses of a

fair coin .

Page 22: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

Central Limit Theorem

set.seed(100)

Total.heads <- rbinom(100,50,0.5); Ave.heads <- Total.heads/50;

hist(Ave.heads, main = "Distribution - Average Number of Heads")

Distribution − Average Number of Heads

Ave.heads

Frequency

0.35 0.45 0.55 0.65

020

40( I

# °" > " " " * "

O

too Simulations of 50 Coin tosses C want Coin ) .

O

Page 23: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

Central Limit Theoremset.seed(100)

x<- rbinom(100,50,0.5)/50 # draw a sample of 100 from bin(50,.5)

h <- hist(x, main = "", ) # create the histogram

# superimpoise normal density over histogram

xfit<-seq(min(x),max(x),length=40)

yfit <- dnorm(xfit,mean = .5,sd = sqrt((.5*.5)/50))

yfit <- yfit*diff(h$mids[1:2])*length(x)

lines(xfit,yfit)

x

Frequency

0.35 0.45 0.55 0.65

020

40 Nsm = too

= SoN Sample size

Page 24: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

Chi-Square Distribution

Let X1, X2, ..., Xn be independent and identically distributed random variablesthat have a N(0, 1) distribution. The distribution of

nÿ

i=1

X 2i ,

has a chi-square distribution on n degrees of freedom or ‰2n.

The mean of a ‰2n is n with variance 2n.

Page 25: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

Chi-Square Distribution

Let X1, X2, ..., Xn be independent with a N(µ, ‡2) distribution. What is thedistribution of the sample variance S2 =

qni=1(Xi ≠ X)2/(n ≠ 1)?

-

( n - I ) S2

→r XIN- I )

distributionof Sample Variance

.

Page 26: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

t Distribution

If X ≥ N(0, 1) and W ≥ ‰2n then the distribution of XÔ

W /nhas a t distribution

on n degrees of freedom or XÔW /n

≥ tn.

Page 27: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

t Distribution

Let X1, X2, ... is an independent sequence of identically distributed randomvariables that have a N(0, 1) distribution. What is the distribution of

X ≠ µSÔn≠1

where S2 =qn

i=1(Xi ≠ X)2/(n ≠ 1)?

one - Sample t - test

Ho =µ=µ ,

E test States K2 .

-7Sample Sol .

ten ,

Page 28: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

t Distribution

−4 −2 0 2 4

0.0

0.1

0.2

0.3

0.4

Comparison of t Distributions

x value

Den

sity

Distributionsdf=1df=3df=8df=30normal

0

a. . . I

Page 29: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

F DistributionLet X ≥ ‰2

m and Y ≥ ‰2n be independent. The distribution of

W = X/mY /n ≥ Fm,n,

where Fm,n denotes the F distribution on m, n degrees of freedom. The Fdistribution is right skewed (see graph below). For n > 2, E(W ) = n/(n ≠ 2). Italso follows that the square of a tn random variable follows an F1,n.

0 1 2 3 4 5

0.0

0.3

0.6

F Distributions

x

Density

DistributionsF(7,10)F(7,4)

~numerator ol f-

( denominator If

Page 30: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

Linear Regression

Lea (1965) discussed the relationship between mean annual temperature andmortality index for a type of breast cancer in women taken from regions inEurope (example from Wu and Hammada).

The data is shown below.#Breast Cancer data

M <- c(102.5, 104.5, 100.4, 95.9, 87.0, 95.0, 88.6, 89.2,

78.9, 84.6, 81.7, 72.2, 65.1, 68.1, 67.3, 52.5)

T <- c(51.3, 49.9, 50.0,49.2, 48.5, 47.8, 47.3, 45.1,

46.3, 42.1, 44.2, 43.5, 42.3, 40.2, 31.8, 34.0)

0

Page 31: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

Linear Regression

A linear regression model of mortality versus temperature is obtained byestimating the intercept and slope in the equation:

yi = —0 + —1xi + ‘i , i = 1, ..., n

where ‘i ≥ N(0, ‡2). The values of —0, —1 that minimize the sum of squares

nÿ

i=1

(yi ≠ (—0 + —1xi ))2,

are called the least squares estimators. They are given by:I —0 = y ≠ —1xI —1 = r Sy

Sx

r is the correlation between y and x , and Sx , Sy are the sample standarddeviations of x and y respectively.

4Pa, pi ) =O I = o

2130I = o

2B i

Solvefor

rn.

in

Page 32: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

Linear Regressionplot(T,M,xlab="temperature",ylab="mortality index")

35 40 45 50

6070

8090

100

temperature

mor

talit

y in

dex /

Temp .

Page 33: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

Linear Regressionreg1 <- lm(M~T)

summary(reg1) # Parameter estimates and ANOVA table

##

## Call:

## lm(formula = M ~ T)

##

## Residuals:

## Min 1Q Median 3Q Max

## -12.8358 -5.6319 0.4904 4.3981 14.1200

##

## Coefficients:

## Estimate Std. Error t value Pr(>|t|)

## (Intercept) -21.7947 15.6719 -1.391 0.186

## T 2.3577 0.3489 6.758 9.2e-06 ***

## ---

## Signif. codes: 0 �***� 0.001 �**� 0.01 �*� 0.05 �.� 0.1 � � 1

##

## Residual standard error: 7.545 on 14 degrees of freedom

## Multiple R-squared: 0.7654, Adjusted R-squared: 0.7486

## F-statistic: 45.67 on 1 and 14 DF, p-value: 9.202e-06

①- Regression of Ton M

York dependent N independent variable ,

yr x , the

↳TE ✓

Ho :B an

Ho- pi

asQian -

tnlltf① meypla.net( of > f variation

' "

byrmeofjlet.

Page 34: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

13^0=-21.7947 } estimates of intercept

pi -

- 2.3577 and Slope .

RZ .7654 ooo approx . 77% of

the variation in mortality es

explained by the regression model

of mortality and temp .

y^= -21.7947T 2.3577T

obtain filled values by pluggingm T valves .

Page 35: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

Linear Regressionplot(T,M,xlab="temperature",ylab="mortality index")

abline(reg1) # Add regression line to the plot

35 40 45 50

6070

8090

100

temperature

mor

talit

y in

dex

i

""

""

Page 36: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

Linear Regression

#plot residuals vs. fitted

plot(reg1$fitted,reg1$residuals);

abline(h=0) # add horizontal line at 0

60 70 80 90 100

−10

010

reg1$fitted

reg1$residuals

if there are no

Systematic patterns

then.

#'Eyre

Ideallyhave a -

random Scatter of points .

Page 37: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

Linear Regression

#check normality of residuals

qqnorm(reg1$residuals); qqline(reg1$residuals)

−2 −1 0 1 2

−10

010

Normal Q−Q Plot

Theoretical Quantiles

Sam

ple

Qua

ntile

sp -

rates are

Valid providedresiduals

are normallydistributed .

Page 38: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

Linear RegressionIf there is more than one independent variable then the above model is called amultiple linear regression model.

yi = —0 + —1xi1 + —2xi2 + · · · + —kxik + ‘i , i = 1, ..., n,

where ‘i ≥ N(0, ‡2).

This can also be expressed in matrix notation as

y = X— + ‘

The least squares estimator is

— =!X T X

"≠1 X T y .

The covariance matrix of — is!X T X

"≠1‡2. An estimator of ‡2 is

‡2 = 1n ≠ k

nÿ

i=1

(yi ≠ yi )2,

where yi = —0 + —1xi1 + · · · + —kxik is the predicted value of yi .

' "

Ay 3h

s

Page 39: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

Weighing Problem

Harold Hotelling in 1949 wrote a paper on how to obtain more accurateweighings through experimental design.

Method 1Weigh each apple separately.

Method 2Obtain two weighings by

1. Weighing two apples in one pan.2. Weighing one apple in one pan and the other apple in the other pan

/@ @ Right

left an ¥ pan .

Page 40: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

Weighing Problem

Let w1, w2 be the weights of apples one and two. Each weighing has standarderror ‡. So the precision of the estimates from method 1 is ‡.

If the objects are weighed together in one pan, resulting in measurement m1,then in opposite pans, resulting in measurement m2, we have two equations forthe unknown weights w1, w2:

w1 + w2 = m1

w1 ≠ w2 = m2

This illustrates that experimental

var @c) =P design Can impact the

precision of the estimatesobtained .

}m the aw ,

⇒wi )var @D= Var(mY me - ma = zwz

= -410702) ⇒ wT=m2

= Eye -- 047 var ( wi )

2

Page 41: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

Weighing Problem

This can also be viewed as a linear regression problem y = X— + ‘:

y = (m1, m2)Õ, X =3

1 11 ≠1

4, — = (w1, w2)Õ.

Xie ={In ibpameasurenentis in Left

if measurement in Right

pan .

Page 42: STA305/1004 - Review of Statistical TheoryDistributions hist(dat) Histogram of dat dat Frequency 55 60 65 70 0.0 1.0 2.0. betweenLobs SS-Go3 bins.

Weighing Problem

The least-squares estimates can be found using R.#step-by-step matrix mutiplication example for weighing problem

X <- matrix(c(1,1,1,-1),nrow=2,ncol=2) #define X matrix

Y <- t(X)%*%X # multiply X^T by X (X^T*X) NB: t(X) is transpose of X

W <- solve(Y) # calculate the inverse

W %*% t(X) # calculate (X^T*X)^(-1)*X^T

## [,1] [,2]

## [1,] 0.5 0.5

## [2,] 0.5 -0.5

W # print (X^T*X)^(-1) for SE

## [,1] [,2]

## [1,] 0.5 0.0

## [2,] 0.0 0.5

wecould use Inn C )

(transpose

*

O

←←%*% matrix

[ ] ( mm: ) -

. ( Ii )multiplication .

[ so'

=