Linear Regression Chp1


Transcript of Linear Regression Chp1

  • 8/11/2019 Linear Regression Chp1


    Linear Statistical Analysis I

    Fall 2013

    () Linear Statistical Analysis I Fall 2013 1 / 101


    Outline

    Introduction to R statistical software

    Introduction to regression and model building


    R language

    R is a popular statistical software package especially suitable for
    data analysis and graphical representation.

    It is free, open-source software.

    The R package can be downloaded from
    http://cran.us.r-project.org/ where you can also find useful
    information about R. Note that there are different versions for
    different operating systems such as Windows, Linux, and MacOS.

    You can also find online tutorials by googling the key words R
    language tutorial.


    R language

    Open the R package.


    type commands after the prompt symbol >

    For example, to compute 3/(9*4-5) and the square root of 9:

    > 3/(9*4-5)

    [1] 0.09677419

    > sqrt(9)

    [1] 3

    If you want to make comments on the commands, use the symbol

    #. The expression after this symbol will not be executed.

    > 3/(9*4-5) # This is a comment.

    [1] 0.09677419

    > sqrt(9) # compute the square root of 9.

    [1] 3


    workspace and working directory

    The workspace is the place in your computer where R reads or
    saves data or R objects. The directory of the workspace is the
    working directory.

    Find the current working directory

    > getwd()

    [1] "F:/Teaching/2013_spring_Computing"

    change the current working directory

    > setwd("F:/Teaching")

    > getwd()

    [1] "F:/Teaching"

    or you can use the File menu in the Console window of R.

    If there is an error when you input data, first check whether the
    data file is in the working directory.


    arithmetic expressions and variables

    arithmetic expressions: type expressions after the prompt symbol;
    R will evaluate their values.

    > 2+3

    [1] 5

    > 3-4

    [1] -1

    > 3*5

    [1] 15

    > 3/5

    [1] 0.6

    > 13%%5 # compute the remainder of 13 divided by 5

    [1] 3

    > 13%/%5 #the integer quotient of 13 divided by 5

    [1] 2

    > 3^2

    [1] 9


    arithmetic expressions and variables

    > sqrt(9)

    [1] 3

    > log(3)

    [1] 1.098612

    > log2(2)

    [1] 1

    > log10(100)

    [1] 2

    > exp(3)

    [1] 20.08554

    > factorial(3) #compute 3!

    [1] 6

    > log((1+exp(5)))*3^(-2)

    [1] 0.5563017


    arithmetic expressions and variables

    variable: the name is case sensitive, that is, upper and lower case

    letters are distinct.

    assignment operator <-

    > x<-13+exp(2)

    > x # check the value of x

    [1] 20.38906

    equivalently, you can use

    > x=13+exp(2)

    > x

    [1] 20.38906


    arithmetic expressions and variables

    you can include variables in expressions; however, the values of
    the variables must be assigned before the expressions are used.

    > x=2+3

    > (2+3)*4

    [1] 20

    > x=2+3

    > x*4

    [1] 20

    > y*4

    Error: object 'y' not found

    > y=x^2

    > y

    [1] 25


    Relational and Logical Operators

    logical values: TRUE (can be denoted by T) and FALSE
    (can be denoted by F).

    we can apply the arithmetic operators to logical values; in this
    case, TRUE is treated as 1 and FALSE as 0

    > TRUE+TRUE

    [1] 2

    > TRUE+FALSE

    [1] 1

    > TRUE*FALSE

    [1] 0

    > x=TRUE

    > x

    [1] TRUE

    > y=FALSE

    > y-x

    [1] -1


    Relational and Logical Operators

    The relational operators are <, <=, >, >=, == (equal to), and
    != (not equal to).

    > 1>2 # logical expression with two possible values

    [1] FALSE

    > 1<2

    [1] TRUE

    > 1<=2

    [1] TRUE

    > 1>=2

    [1] FALSE

    > (1+3)==4

    [1] TRUE

    > 3*2!=6

    [1] FALSE

    > x=(3*2!=6)

    > x

    [1] FALSE


    Relational and Logical Operators

    logical operators: can be used to connect two or more logical

    expressions

    > (1>2)&((1+3)==4) # "&" means "and". The combined
    # expression is true only if
    # both expressions are true.

    [1] FALSE

    > (1<2)&((1+3)==4)

    [1] TRUE

    > TRUE&FALSE

    [1] FALSE

    > (1>2)|((1+3)==4) # "|" means "or". The combined
    # expression is true if and only if
    # at least one of the two
    # expressions is true.

    [1] TRUE


    Relational and Logical Operators

    > (1<2)&((1+3)==5)

    [1] FALSE

    > (1>2)|((1+3)==5)

    [1] FALSE

    > TRUE|FALSE

    [1] TRUE

    > ((1+2)==1)

    [1] FALSE

    > !((1+2)==1) #"!" means "NOT".

    [1] TRUE

    > !TRUE

    [1] FALSE

    > !FALSE

    [1] TRUE


    basic data types in R

    vector: concatenate several numbers or vectors into a vector by
    using the operator c()

    > c(1,3,2)

    [1] 1 3 2

    > c(c(1,2),c(3,5),c(3,0,0))

    [1] 1 2 3 5 3 0 0

    > x=c(3,4,9)

    > x

    [1] 3 4 9

    > y=c(x,x,c(0,1,2))

    > y

    [1] 3 4 9 3 4 9 0 1 2


    basic data types in R

    Special vectors:

    > 1:9

    [1] 1 2 3 4 5 6 7 8 9

    > 5:1

    [1] 5 4 3 2 1

    > rep(0.5,6) # repeat 0.5 six times

    [1] 0.5 0.5 0.5 0.5 0.5 0.5

    > rep(-3,8)

    [1] -3 -3 -3 -3 -3 -3 -3 -3

    > seq(0,1, length.out=11) # get 11 numbers from
    # 0 to 1 with equal spacing.

    [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

    > seq(1,2,length.out=21)

    [1] 1.00 1.05 1.10 1.15 1.20 1.25 1.30 1.35

    [9] 1.40 1.45 1.50 1.55 1.60 1.65 1.70 1.75

    [17] 1.80 1.85 1.90 1.95 2.00


    basic data types in R

    logical vector: consists of logical values

    > x=c(T,T,F,F) #TRUE can be denoted by T

    > x

    [1] TRUE TRUE FALSE FALSE

    > x=1:5

    > x==1 # each component of x is compared with 1

    [1] TRUE FALSE FALSE FALSE FALSE

    > y=(x!=1)

    > y

    [1] FALSE TRUE TRUE TRUE TRUE


    basic data types in R

    extract or modify subsets of a vector by using index vector

    > x=c(3,2,5,5,7,8)

    > x[c(1,3,5)] # extract the 1st, 3rd, 5th numbers in x

    [1] 3 5 7

    > y=x[c(3,1,5)]

    > y

    [1] 5 3 7

    > z=x[-c(2,4)] # remove the 2nd and 4th numbers in x

    > z

    [1] 3 5 7 8

    > x[1:3]

    [1] 3 2 5

    > x[1:3]=c(1,2,3)

    > x

    [1] 1 2 3 5 7 8


    basic data types in R

    find all possible values in a vector

    > x=c(3,2,5,5,7,8,7,8,9)

    > unique(x)

    [1] 3 2 5 7 8 9


    basic data types in R

    > x[c(1,3,5)]=0

    > x

    [1] 0 2 0 5 0 8

    extract or modify subsets of a vector by using logical vector

    > x=c(3,2,5,5,7,8,3,6,9) #we will extract the
    #numbers larger than 4

    > index=(x>4)

    > index

    [1] FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE

    > x[index] # only values corresponding to TRUE
    # in "index" will be selected

    [1] 5 5 7 8 6 9


    basic data types in R

    extract or modify subsets of a vector by using logical vector

    > x=1:10 #extract numbers less or equal
    # to 6 and larger than 3

    > x[(x<=6)&(x>3)]

    [1] 4 5 6

    > x=c(3,2,5,5,7,8,3,6,9)

    > y=c(1,3,5,7,6,4,2,3,5)

    > #we extract the numbers in

    > #the positions corresponding to y>4

    > x[y>4]

    [1] 5 5 7 9


    basic data types in R

    extract the indices corresponding to the numbers which satisfy
    specific conditions in a vector.

    > x=c(T,T,F,T,F,F,T,T,T,F)

    > which(x)

    [1] 1 2 4 7 8 9

    > x=c(3,2,5,5,7,8,3,6,9)

    > which(x>5)

    [1] 5 6 8 9

    > x[which(x>5)]

    [1] 7 8 6 9


    operations on vectors

    The elementary arithmetic operators +, -, *, /, ^ (for raising to a
    power), and functions such as log, exp, sin, cos, tan, sqrt, can be
    applied to vectors in an element-by-element sense.

    > x=1:5

    > x

    [1] 1 2 3 4 5

    > y=10:15

    > y

    [1] 10 11 12 13 14 15

    > y-x

    [1] 9 9 9 9 9 14

    Warning message:

    In y - x : longer object length is not

    a multiple of shorter object length


    operations on vectors

    > x

    [1] 1 2 3 4 5

    > y=11:15 #x and y should have the same lengths.

    > y-x

    [1] 10 10 10 10 10

    > 2*x

    [1] 2 4 6 8 10

    > x^2

    [1] 1 4 9 16 25

    > y/x

    [1] 11.000000 6.000000 4.333333 3.500000 3.000000

    > x+1

    [1] 2 3 4 5 6

    > x*y

    [1] 11 24 39 56 75


    operations on vectors

    get the length, minimum, maximum, mean, variance of a vector

    > z=c(43, 45, 5, 44, 767, 57, 68,33,111)

    > length(z)

    [1] 9

    > min(z)

    [1] 5

    > max(z)

    [1] 767

    > which(z==min(z))

    [1] 3

    > which(z==max(z))

    [1] 5

    > mean(z)

    [1] 130.3333

    > var(z)

    [1] 57815.75

    > sd(z) # the standard deviation

    [1] 240.4491


    operations on vectors

    sort the numbers in a vector in an increasing or decreasing order

    > z=c(43, 45, 5, 44, 767, 57, 68,33,111)

    > sort(z)

    [1] 5 33 43 44 45 57 68 111 767

    > sort(z,decreasing = T)

    [1] 767 111 68 57 45 44 43 33 5

    > y=sort(z,index.return = T)

    > y

    $x

    [1] 5 33 43 44 45 57 68 111 767

    $ix

    [1] 3 8 1 4 2 6 7 9 5

    > y$x

    [1] 5 33 43 44 45 57 68 111 767

    > z[y$ix]

    [1] 5 33 43 44 45 57 68 111 767
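The index vector returned in $ix can also be obtained directly with order(), a base R function; a small sketch:

```r
# order() returns the permutation of indices that sorts the vector --
# the same information as the $ix component of sort(z, index.return = T)
z = c(43, 45, 5, 44, 767, 57, 68, 33, 111)
idx = order(z)
idx        # 3 8 1 4 2 6 7 9 5
z[idx]     # 5 33 43 44 45 57 68 111 767, i.e. sort(z)
```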


    matrix

    we start with some simple matrices

    > matrix(1.5, nrow=2, ncol=3)

    [,1] [,2] [,3]

    [1,] 1.5 1.5 1.5

    [2,] 1.5 1.5 1.5

    > diag(3.3,3)

    [,1] [,2] [,3]
    [1,] 3.3 0.0 0.0

    [2,] 0.0 3.3 0.0

    [3,] 0.0 0.0 3.3

    > y=diag(-2,3,4)

    > y

    [,1] [,2] [,3] [,4]

    [1,] -2 0 0 0

    [2,] 0 -2 0 0

    [3,] 0 0 -2 0


    matrix

    form a matrix from a vector

    > x=matrix(c(1,2,3,4,5,6), 2,3)

    > x

    [,1] [,2] [,3]

    [1,] 1 3 5

    [2,] 2 4 6

    > matrix(c(1,2,3,4,5,6), 3,2)

    [,1] [,2]

    [1,] 1 4

    [2,] 2 5

    [3,] 3 6

    > matrix(c(1,2,3,4,5,6), 3,2,byrow=T)

    [,1] [,2]

    [1,] 1 2

    [2,] 3 4

    [3,] 5 6


    matrix

    form a matrix by combining vectors

    > x=c(1,2,3)

    > y=c(4,5,6)

    > z=c(7,8,9)

    > cbind(x,y,z)#combine three vectors as columns

    x y z
    [1,] 1 4 7
    [2,] 2 5 8
    [3,] 3 6 9

    > rbind(x,y,z)#combine three vectors as rows

    [,1] [,2] [,3]

    x 1 2 3

    y 4 5 6

    z 7 8 9


    matrix

    form a matrix by combining matrices

    > w=cbind(x,y)

    > w

    x y

    [1,] 1 4

    [2,] 2 5

    [3,] 3 6

    > v=cbind(z,y)

    > v

    z y

    [1,] 7 4

    [2,] 8 5

    [3,] 9 6


    matrix

    form a matrix by combining matrices

    > cbind(w,v)

    x y z y

    [1,] 1 4 7 4
    [2,] 2 5 8 5
    [3,] 3 6 9 6

    > rbind(w,v)

    x y

    [1,] 1 4

    [2,] 2 5

    [3,] 3 6

    [4,] 7 4

    [5,] 8 5

    [6,] 9 6


    matrix

    extract or modify subsets of a matrix

    > x=matrix(1:15,3,5)

    > x

    [,1] [,2] [,3] [,4] [,5]

    [1,] 1 4 7 10 13

    [2,] 2 5 8 11 14

    [3,] 3 6 9 12 15

    > x[2,4] # the number in the second row and the fourth column

    [1] 11

    > y=x[1,]
    > # extract the first row,
    > # which is converted to a vector.

    > y

    [1] 1 4 7 10 13

    > str(y)

    int [1:5] 1 4 7 10 13
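As str(y) shows, extracting a single row drops the result down to a plain vector. If the one-row matrix shape should be kept instead, the drop argument of [ can be used; a small sketch:

```r
# By default, x[1,] drops the single row down to a plain vector;
# drop = FALSE keeps it as a 1-by-5 matrix instead.
x = matrix(1:15, 3, 5)
y = x[1, , drop = FALSE]
y          # still a matrix with one row
dim(y)     # 1 5
```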


    matrix

    > z=x[,2]
    > # extract the second column

    > z

    [1] 4 5 6

    > str(z)

    int [1:3] 4 5 6

    > w=x[1:2,1:3]

    > w

    [,1] [,2] [,3]

    [1,] 1 4 7

    [2,] 2 5 8

    > v=x[c(1,3),]

    > v

    [,1] [,2] [,3] [,4] [,5]

    [1,] 1 4 7 10 13

    [2,] 3 6 9 12 15


    matrix

    extract irregularly distributed subsets of a matrix

    > index=cbind(c(2,2,3),c(2,4,5)) # a matrix of (row, column)
    # index pairs

    > x[index]=0

    > x

    [,1] [,2] [,3] [,4] [,5]
    [1,] 1 4 7 10 13
    [2,] 2 0 8 0 14
    [3,] 3 6 9 12 0

    > which(x==0, arr.ind = T)

    row col
    [1,] 2 2
    [2,] 2 4
    [3,] 3 5

    > x[which(x==0, arr.ind = T)]

    [1] 0 0 0


    operations on matrices

    The elementary arithmetic operators +, -, *, /, ^ (for raising to a
    power), and functions such as log, exp, sin, cos, tan, sqrt, can be
    applied to matrices in an element-by-element sense.

    > x=matrix(1:6,2,3)

    > x

    [,1] [,2] [,3]

    [1,] 1 3 5

    [2,] 2 4 6

    > sqrt(x)

    [,1] [,2] [,3]
    [1,] 1.000000 1.732051 2.236068

    [2,] 1.414214 2.000000 2.449490


    operations on matrices

    > y=matrix(11:16,2,3)

    > y

    [,1] [,2] [,3]

    [1,] 11 13 15
    [2,] 12 14 16

    > x+y

    [,1] [,2] [,3]

    [1,] 12 16 20

    [2,] 14 18 22


    operations on matrices

    transpose of a matrix

    > y=matrix(11:16,2,3)

    > y

    [,1] [,2] [,3]

    [1,] 11 13 15

    [2,] 12 14 16

    > t(y)

    [,1] [,2]

    [1,] 11 12

    [2,] 13 14

    [3,] 15 16

    > y=1:3 # vector: a special matrix with one column

    > t(y)

    [,1] [,2] [,3]

    [1,] 1 2 3


    matrix multiplication

    > A=matrix(1:6,3,2)

    > A

    [,1] [,2]

    [1,] 1 4

    [2,] 2 5

    [3,] 3 6

    > B=matrix(11:16,2,3)

    > B

    [,1] [,2] [,3]

    [1,] 11 13 15

    [2,] 12 14 16

    > A%*%B # different from A*B

    [,1] [,2] [,3]

    [1,] 59 69 79

    [2,] 82 96 110

    [3,] 105 123 141


    Linear equations and inversion of a matrix

    For example, to solve the following equations

    2x+y+3z=1
    3x-y+z=4
    x+y+z=2

    > A=cbind(c(2,3,1),c(1,-1,1),c(3,1,1))

    > A # the matrix of coefficients

    [,1] [,2] [,3]
    [1,] 2 1 3

    [2,] 3 -1 1

    [3,] 1 1 1

    > a=c(1,4,2)

    > solve(A,a)

    [1] 2.333333 1.333333 -1.666667

    > v=solve(A,a)

    > A%*%v

    [,1]
    [1,] 1
    [2,] 4
    [3,] 2


    Linear equations and inversion of a matrix


    > A%*%v-a

    [,1]

    [1,] 0.000000e+00

    [2,] -4.440892e-16

    [3,] 4.440892e-16

    > solve(A) # the inverse of A

    [,1] [,2] [,3]

    [1,] -0.3333333 0.3333333 0.6666667

    [2,] -0.3333333 -0.1666667 1.1666667[3,] 0.6666667 -0.1666667 -0.8333333


    character vector and factor

    A character vector is a vector whose components are character strings

    instead of numbers. For example, we construct vectors of names and
    genders of five persons in one office.

    > names=c("Emily","John", "Lily","Grace","William")

    > str(names)

    chr [1:5] "Emily" "John" "Lily" "Grace" "William"

    > gender=c("F","M","F","F","M")

    > str(gender)

    chr [1:5] "F" "M" "F" "F" "M"

    In the vector gender, there are two categories, male and female,
    labeled by M and F. Sometimes we are interested in
    information such as how many elements are in each category. In this
    case, we can convert the character vector to a factor.



    > gender=as.factor(gender)

    > gender

    [1] F M F F M

    Levels: F M

    > levels(gender)

    [1] "F" "M"

    Different levels denote different categories.
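The counts per category mentioned earlier can be obtained with base R's table() function; a small sketch:

```r
# table() counts how many elements fall in each level of a factor
gender = as.factor(c("F", "M", "F", "F", "M"))
table(gender)
# gender
# F M
# 3 2
```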


    data frame


    The data frame is one of the most important data types in R.

    It is similar to a matrix: its elements are organized in rows and
    columns.

    However, in a matrix, all elements are numbers.

    In a data frame, it is often the case that some elements are
    numbers and the others are character strings.

    When R reads data from an external data file, R always saves the
    data as a data frame.

    It is not efficient to type the data directly into R if the data size is
    large. We can read the data from a file where the data is saved.

    It is important that the file is in the working directory; otherwise, R
    cannot find the file. Or you can tell R where the file is.


    input data from a file



    The data has been stored in a file called ch.1.ex.1.dat.
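A minimal sketch of reading such a file with read.table, assuming ch.1.ex.1.dat is whitespace-separated with no header row (the file's actual contents are an assumption; a stand-in file is written first so the sketch is self-contained):

```r
# Stand-in contents for ch.1.ex.1.dat (assumed: whitespace-separated,
# no header row); writing the file here keeps the sketch self-contained.
writeLines(c("1 2 3", "2 4 7", "4 5 7", "5 6 8"), "ch.1.ex.1.dat")

data = read.table("ch.1.ex.1.dat")  # defaults: whitespace separator, header=FALSE
data
#   V1 V2 V3
# 1  1  2  3
# 2  2  4  7
# 3  4  5  7
# 4  5  6  8
```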


    The data has been stored in a file called ch.1.ex.2.dat. The
    numbers are separated by commas.


    > data=read.table("ch.1.ex.2.dat",sep=",")

    > str(data)

    data.frame: 4 obs. of 3 variables:

    $ V1: int 1 2 4 5

    $ V2: int 2 4 5 6

    $ V3: int 3 7 7 8

    > data

    V1 V2 V3

    1 1 2 3

    2 2 4 7

    3 4 5 7

    4 5 6 8


    > data=read.table("ch.1.ex.3.dat",header=T)

    > str(data)

    data.frame: 4 obs. of 3 variables:

    $ A: int 1 2 4 5

    $ B: int 2 4 5 6

    $ C: int 3 7 7 8

    > data

    A B C

    1 1 2 3

    2 2 4 7

    3 4 5 7

    4 5 6 8


    input data from an excel file


    The data has been stored in a CSV (comma-separated values,
    Excel-compatible) file called ch.1.ex.4.csv.


    > data=read.csv("ch.1.ex.4.csv")

    > str(data)

    data.frame: 4 obs. of 3 variables:

    $ A: int 1 5 4 5

    $ B: int 2 4 5 8

    $ C: int 3 4 6 7

    > data

    A B C

    1 1 2 3

    2 5 4 4

    3 4 5 6

    4 5 8 7


    The data has been stored in a CSV file called ch.1.ex.5.csv.


    > data=read.csv("ch.1.ex.5.csv",header=F)

    > data

    V1 V2 V3

    1 1 2 3

    2 5 4 4

    3 4 5 6

    4 5 8 7


    data sets in R


    R itself contains some data sets which can be loaded directly. You

    can use

    > data()

    to check available data sets in R.

    For example, let us consider a data set named sleep. We can

    load the data set by using

    > data(sleep)

    > sleep # check the data

    extra group ID

    1 0.7 1 1

    2 -1.6 1 2

    3 -0.2 1 3

    ...........

    ...........

    20 3.4 2 10


    > str(sleep)

    data.frame: 20 obs. of 3 variables:

    $ extra: num 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8

    $ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1

    $ ID : Factor w/ 10 levels "1","2","3","4",..:

    Data which show the effect of two drugs on 10 patients.

    There are 20 observations and three variables.

    the second variable represents the types of drugs given.

    the third is the patient's ID.

    the first is the increase in hours of sleep compared to control.


    extract or modify subsets of a data frame

    > # two different ways to extract the first variable

    > sleep$extra

    [1] 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0.0 2.0 1.9 0.8 1.1 0.1 -0.1
    [16] 4.4 5.5 1.6 4.6 3.4

    > sleep[,1]

    [1] 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0.0 2.0 1.9 0.8 1.1 0.1 -0.1
    [16] 4.4 5.5 1.6 4.6 3.4

    > sleep[1,1]

    [1] 0.7

    > sleep[1,]

    extra group ID
    1 0.7 1 1


    # extract the measurements for the second drug.

    > sleep[sleep$group=="2",1]

    [1] 1.9 0.8 1.1 0.1 -0.1 4.4

    [7] 5.5 1.6 4.6 3.4

    > # extract the measurements for the fifth patient

    > sleep[sleep$ID=="5",1]

    [1] -0.1 -0.1

    > # extract the measurements for the third
    > # patient when he took the second type of drug.

    > sleep[(sleep$ID=="3")&(sleep$group=="2"),1]

    [1] 1.1

    > names(sleep) # the names of variables

    [1] "extra" "group" "ID"> names(sleep)=c("hours","Drug","Patient")

    > sleep

    hours Drug Patient

    1 0.7 1 1

    2 -1.6 1 2

    If all elements of a data frame are numbers, the data frame can be

    converted to a matrix.


    > data=read.csv("ch.1.ex.5.csv",header=F)

    > str(data)

    data.frame: 4 obs. of 3 variables:

    $ V1: int 1 5 4 5

    $ V2: int 2 4 5 8

    $ V3: int 3 4 6 7

    > X=as.matrix(data)

    > str(X)

    int [1:4, 1:3] 1 5 4 5 2 4 5 8 3 4 ...

    > X

    V1 V2 V3

    [1,] 1 2 3
    [2,] 5 4 4

    [3,] 4 5 6

    [4,] 5 8 7


    Conversely, a matrix can be converted to a data frame

    > X=matrix(1:15,5,3)

    > X

    [,1] [,2] [,3]

    [1,] 1 6 11

    [2,] 2 7 12

    [3,] 3 8 13

    [4,] 4 9 14

    [5,] 5 10 15

    > str(X)

    int [1:5, 1:3] 1 2 3 4 5 6 7 8 9 10 ...

    > Y=data.frame(X)

    > Y

    X1 X2 X3
    1 1 6 11

    2 2 7 12

    3 3 8 13

    4 4 9 14

    5 5 10 15

    > str(Y)

    data.frame: 5 obs. of 3 variables:

    $ X1: int 1 2 3 4 5

    $ X2: int 6 7 8 9 10
    $ X3: int 11 12 13 14 15


    control flow


    In addition to typing the commands after the prompt symbol, you
    can write a group of commands in a text editor such as
    Notepad, MS Word, and so on, then copy and paste these
    commands after the prompt symbol in R.

    For example, I write the following commands in Notepad,


    Then I copy and paste them to R

    > data=read.csv("ch.1.ex.5.csv",header=F)

    > X=as.matrix(data)

    > names(X)=c("x1","x2","x3")

    > print(X)

    V1 V2 V3

    [1,] 1 2 3

    [2,] 5 4 4

    [3,] 4 5 6

    [4,] 5 8 7


    The control flow specifies the order in which computations are
    performed.

    Some commonly used control-flow methods: If-Else statement,
    loops (including for and while statements)

    If-Else statement: Example: compute the absolute value of a. I

    write the following commands in Notepad,


    Then I copy and paste them to R

    > a=1

    > if(a>0)

    + {absolute=a

    + print(absolute)

    + }else

    + {absolute=-a

    + print(absolute)

    + }

    [1] 1


    Example 2


    > x=-3

    > if((x<1)|(x>2))
    + {f=0

    + print(f)

    + }else

    + {if(x


    > x=8

    > if((x<1)|(x>2))
    + {f=0

    + print(f)

    + }else

    + {if(x


    the for loop

    > sum=0 #set the initial values of the sum

    > for (i in 1:6)

    + {sum=sum+i

    + print(c("i=",i, "sum=",sum))

    + }

    [1] "i=" "1" "sum=" "1"

    [1] "i=" "2" "sum=" "3"

    [1] "i=" "3" "sum=" "6"

    [1] "i=" "4" "sum=" "10"

    [1] "i=" "5" "sum=" "15"

    [1] "i=" "6" "sum=" "21"


    the for loop

    > sum=0 #set the initial value of the sum

    > for (i in c(2,4,6,8,10))

    + {sum=sum+i

    + print(c("i=",i, "sum=",sum))

    + }

    [1] "i=" "2" "sum=" "2"

    [1] "i=" "4" "sum=" "6"

    [1] "i=" "6" "sum=" "12"

    [1] "i=" "8" "sum=" "20"

    [1] "i=" "10" "sum=" "30"


    Data management

    http://find/
  • 8/11/2019 Linear Regression Chp1

    73/102

    Data input has been introduced. I will now introduce topics
    including data output, missing data, data manipulation, and merging,
    combining, and subsetting data sets.

    save an R object: for example,


    save an R object


    > X=matrix(1:15,3,5)

    > X

    [,1] [,2] [,3] [,4] [,5]

    [1,] 1 4 7 10 13

    [2,] 2 5 8 11 14

    [3,] 3 6 9 12 15

    > save(X,file="ex.RData")> X=0

    > X

    [1] 0

    > load("ex.RData")

    > X

    [,1] [,2] [,3] [,4] [,5]

    [1,] 1 4 7 10 13

    [2,] 2 5 8 11 14

    [3,] 3 6 9 12 15

    output data to external files


    > data(sleep)

    > str(sleep)

    data.frame: 20 obs. of 3 variables:

    $ extra: num 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0

    $ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1

    $ ID : Factor w/ 10 levels "1","2","3","4",..: 1 2

    > sleep.1=sleep[sleep$group=="1",c(1,3)]

    > sleep.1

    extra ID

    1 0.7 1

    2 -1.6 2

    3 -0.2 3

    ...........

    > write.table(sleep.1,file="sleep.1.dat")


By default, both the row names and the column names are written to the file:

    > write.table(sleep.1,file="sleep.1.dat")
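If the row names are not wanted in the file, they can be suppressed with the row.names argument of write.table(); quote=FALSE similarly drops the quotation marks around character values:

```r
# sleep.1 as constructed above from the built-in sleep data
sleep.1 = sleep[sleep$group == "1", c(1, 3)]
# write the data frame without row names and without quotation marks
write.table(sleep.1, file = "sleep.1.dat", row.names = FALSE, quote = FALSE)
```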


If we want the file to be saved as a CSV file, use

    > write.csv(sleep.1,file="sleep.1.csv")


    missing data


It is very common in practice that there are missing values in a data set. For example, in the following data file, the missing values are denoted by *.


> X=read.table("ch.1.ex.6.dat",na.strings = "*")
> X
V1 V2 V3 V4 V5
1 1 2 3 4 NA
2 1 3 5 7 8
3 7 NA 9 0 1

In R, all missing values are denoted by NA, which can be identified by the following command:

> x=X[,2]
> x
[1] 2 3 NA

    > is.na(x)

    [1] FALSE FALSE TRUE


    > is.na(X)

    V1 V2 V3 V4 V5

    [1,] FALSE FALSE FALSE FALSE TRUE

    [2,] FALSE FALSE FALSE FALSE FALSE

    [3,] FALSE TRUE FALSE FALSE FALSE

> sum(is.na(X)) # the number of missing values
[1] 2
> mean(X[,1])
[1] 3
> mean(X[,2])
[1] NA
> # calculate the mean with the missing values excluded
> mean(X[,2],na.rm=TRUE)
[1] 2.5
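Two related built-in functions are useful here (a short illustration on a small data frame with missing values): colMeans() with na.rm=TRUE computes all the column means at once, and complete.cases() flags the rows that contain no missing values.

```r
X = data.frame(V1 = c(1, 1, 7), V2 = c(2, 3, NA), V3 = c(3, 5, 9))
colMeans(X, na.rm = TRUE)  # V2's mean is computed from 2 and 3 only
complete.cases(X)          # [1]  TRUE  TRUE FALSE
X[complete.cases(X), ]     # keep only the fully observed rows
```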


Example: write a function that replaces the missing values in a data frame by the mean of the corresponding column, computed with the missing values excluded. If a whole column is missing, we remove the column.

    > missing.replace=function(X)

    + {

    + temp=NULL

    + for (i in 1:dim(X)[2])

    + {if(sum(is.na(X[,i]))==0)

    + {temp=cbind(temp,X[,i])

    + }else

    + {if(sum(!is.na(X[,i]))>0)

    + {y=X[,i]

    + y[is.na(X[,i])]=mean(y,na.rm=TRUE)

    + temp=cbind(temp,y)

    + }

    + }

    + }

    + temp=data.frame(temp)

    + temp

    + }


    > X

    V1 V2 V3 V4 V5

    1 1 2 3 4 NA

    2 1 3 5 7 8

    3 7 NA 9 0 1

    > missing.replace(X)

    V1 y V3 V4 y.1

    1 1 2.0 3 4 4.5

    2 1 3.0 5 7 8.0

    3 7 2.5 9 0 1.0

> Y=X
> Y[,2]=NA

    > Y

    V1 V2 V3 V4 V5

    1 1 NA 3 4 NA

    2 1 NA 5 7 8

    3 7 NA 9 0 1

    > missing.replace(Y)

    V1 V2 V3 y

1 1 3 4 4.5
2 1 5 7 8.0

    3 7 9 0 1.0
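The same replacement can be written more compactly with lapply() (an alternative sketch, not the version from the slides):

```r
missing.replace2 = function(X)
{
  # drop columns that are entirely missing
  X = X[, colSums(!is.na(X)) > 0, drop = FALSE]
  # replace each remaining column's NAs by that column's mean
  data.frame(lapply(X, function(col) {
    col[is.na(col)] = mean(col, na.rm = TRUE)
    col
  }))
}
```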


    merging data


Suppose that X and Y are the data frames for the two data sets, respectively.

> merge(X,Y,by="id")
id year female inc maxval

    1 1 81 0 5500 1800

    2 1 80 0 5000 1800

    3 1 82 0 6000 1800

    4 2 82 1 3300 2400

    5 2 80 1 2000 2400

    6 2 81 1 2200 2400

> merge(X,Y,by="id",all=T)

id year female inc maxval
1 1 81 0 5500 1800

    2 1 80 0 5000 1800

    3 1 82 0 6000 1800

    4 2 82 1 3300 2400

    5 2 80 1 2000 2400

    6 2 81 1 2200 2400

    7 3 82 0 1000 NA

    8 3 80 0 3000 NA

9 3 81 0 2000 NA
10 4 NA NA NA 1900
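The slides do not show how X and Y were read in; the following data frames are a reconstruction (values taken from the merged output above, not from the original files) that makes the example reproducible:

```r
# X: three yearly records for each of ids 1-3; Y: one record each for ids 1, 2, 4
X = data.frame(id     = rep(1:3, each = 3),
               year   = rep(80:82, times = 3),
               female = rep(c(0, 1, 0), each = 3),
               inc    = c(5000, 5500, 6000, 2000, 2200, 3300,
                          3000, 2000, 1000))
Y = data.frame(id = c(1, 2, 4), maxval = c(1800, 2400, 1900))
merge(X, Y, by = "id")              # keeps only ids present in both data frames
merge(X, Y, by = "id", all = TRUE)  # keeps all ids; unmatched entries become NA
```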


    graphical exploration of data


Example: This famous (Fisher's or Anderson's) iris data set gives the measurements, in centimeters, of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

    > X=read.table("cell.1.dat")

    > str(X)

    data.frame: 1460 obs. of 7 variables:

    $ V1: int 0 0 0 0 0 1 1 1 1 1 ...

$ V2: int 100 100 100 100 100 100 100 100 100 100 ...
$ V3: int 0 0 0 0 0 0 0 0 0 0 ...

    $ V4: int 0 0 0 0 0 0 0 0 0 0 ...

    $ V5: int 0 0 0 0 0 0 0 0 0 0 ...

    $ V6: int 0 0 0 0 0 10 6 12 16 8 ...

    $ V7: int 1600 1600 1604 1590 1581 1568 1569 1577 1570 1577 ...

    > Y=read.table("cell.2.dat")

    > Z=read.table("cell.3.dat")


    scatter plot


    > plot(iris$Sepal.Length,iris$Petal.Length)

[Scatter plot of iris$Petal.Length against iris$Sepal.Length]


If we want to change the x-axis label, the y-axis label, and the title of the plot:


    > plot(iris$Sepal.Length,iris$Petal.Length,xlab="Sepal Length"

    + ,ylab="Petal Length",main="The Scatter plot")

[Scatter plot titled "The Scatter plot", with x-axis "Sepal Length" and y-axis "Petal Length"]

    () Linear Statistical Analysis I Fall 2013 88 / 101

Matrix of scatterplots

If we want to draw the scatterplots of all pairs of the variables:

    > pairs(iris)


[Matrix of pairwise scatterplots of Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species]


boxplot

> boxplot(iris$Sepal.Length[iris$Species=="setosa"],iris$Sepal.Length[iris$Species=="versicolor"],
+ iris$Sepal.Length[iris$Species=="virginica"],names=c("setosa", "versicolor", "virginica"),
+ main="Sepal Length of three species")

[Side-by-side boxplots of Sepal Length for the three species, titled "Sepal Length of three species"]
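The same grouped boxplot can be produced more compactly with the formula interface of boxplot():

```r
# one box per level of Species, with the group labels set automatically
boxplot(Sepal.Length ~ Species, data = iris,
        main = "Sepal Length of three species")
```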


    Regression


Regression is the study of the relationship, or dependence, between two sets of variables, based on the data or observations made on these variables.

Based on the relationship, we can make predictions about one set of variables from the values of the other set.

    Example 1: Predict the price of a stock in 6 months from now, on

    the basis of company performance measures and economic data.

Example 2: Estimate the amount of glucose in the blood of a diabetic person, from the infrared absorption spectrum of that person's blood.

Example 3: Examine the correlation between the level of prostate-specific antigen and a number of clinical measures such as cancer volume, prostate weight, age, and so on.


    variables


    Variables are quantitative or qualitative measurements of

    characteristics of objects under our study. Typically, the values of

    variables vary for different objects.

    They are considered as random variables with some probability

    distributions.

We will use uppercase letters, such as X and Y, to denote variables and lowercase letters, such as x and y, to denote their particular values.


In this course, for each regression model, the variables will be classified into two categories:

independent variables, also called inputs, features, predictors, regressors, or explanatory variables; typically denoted by X1, X2, ...

dependent variables, also called outputs, responses, or outcome variables; typically denoted by Y1, Y2, ... In this course, we only consider one response.

The partition into dependent and independent variables is obvious in some data sets but may vary according to the purpose of the study in others.

For example, suppose that we collected the temperature and precipitation data of a city over a period. If we want to construct a model to predict the temperature from the precipitation, then the response is temperature and the predictor is precipitation. However, if precipitation is to be predicted from temperature, then the response is precipitation and the predictor is temperature.


Example 1: Y = price of the stock in 6 months from now, X1 = company performance measures, and X2 = economic data.

Example 2: Y = the amount of glucose in the blood of a diabetic person, X = the infrared absorption spectrum of that person's blood.

Example 3: Y = level of prostate-specific antigen, and X1 = cancer volume, X2 = prostate weight, X3 = age, and so on.


    regression models


The goal is to construct a model from which, given any values of the predictors X1, X2, ..., we can give a good guess of the value of Y, such that the difference between the guess and the true value is as small as possible. The guess is called the predicted value, denoted by Ŷ.

deterministic system: given any values of X1, X2, ..., the value of Y is deterministic. That is, if x_i^(1) and x_i^(2), i = 1, 2, ..., are any two observations of X1, X2, ..., with corresponding true values of Y equal to y^(1) and y^(2), then x_i^(1) = x_i^(2), i = 1, 2, ..., implies y^(1) = y^(2).

Mathematically, Y can be considered as a function of X1, X2, ..., that is, Y = f(X1, X2, ...).


No randomness is involved in a deterministic system.

However, in practice, there are many factors affecting the responses, and we cannot find and measure all of those factors.

In statistics, we do not consider deterministic models. Instead, we will consider models having randomness.

These models are more flexible and more appropriate.


Specifically, consider

Y = f(X1, X2, ...) + ε,

where f is an (unknown) function of the variables X1, X2, ..., which are of interest and easy to observe or measure.

The term ε is a random variable used to account for the randomness due to unobserved factors or variables, noise, or errors.

A standard assumption is that ε is independent of X1, X2, ... and that its expectation is 0.

Then we have E(Y | X1, X2, ...) = f(X1, X2, ...). More assumptions will be imposed in the following classes.
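As an illustration (the choice of f and of the error distribution here is ours, not the slides'), such a model is easy to simulate in R:

```r
# simulate Y = f(X) + epsilon with f(x) = sin(x) and N(0, 0.1^2) errors
set.seed(1)
x   = runif(100, 0, 2 * pi)           # observed predictor values
eps = rnorm(100, mean = 0, sd = 0.1)  # mean-zero noise, independent of x
y   = sin(x) + eps                    # responses
plot(x, y)                            # points scatter around the curve sin(x)
```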


    linear regression models


The function f(X1, X2, ...) is typically unknown and not a function of simple form.

In this course, we will assume the function is linear; that is, if we have p explanatory variables, then

f(X1, X2, ..., Xp) = β0 + β1 X1 + β2 X2 + ... + βp Xp,

where β0, β1, ..., βp are coefficients which are typically unknown parameters; their estimation based on data is a main topic of the course.

Why do we assume f is a linear function?


The reasons we assume f is a linear function:

Since f can be a very general function, it is hopeless to find its exact form based on a limited sample. Hence, an approximation to f is desirable. Actually, the linear function is the first-order approximation to any smooth function over a range of X1, X2, ... that is not too large.

Although there are other, better approximations, computation is a very important consideration, especially in the precomputer age of statistics. It is much easier to perform analysis with linear models than with other models.


Even in today's computer era there are still good reasons to study and use them. They are simple and often provide an adequate and interpretable description of how the inputs affect the output.

    For prediction purposes they can sometimes outperform fancier

    nonlinear models, especially in situations with small numbers of

    training cases, low signal-to-noise ratio or sparse data.

    Finally, linear methods can be applied to transformations of the

    inputs and this considerably expands their scope.
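In R, such linear models are fit with lm(); the second call below applies the model to a transformed input, illustrating the last point (the iris variables are used here only as an example):

```r
# a simple linear regression: Petal.Length on Sepal.Length
fit1 = lm(Petal.Length ~ Sepal.Length, data = iris)
# the same method applied to a transformed input (a quadratic term)
fit2 = lm(Petal.Length ~ Sepal.Length + I(Sepal.Length^2), data = iris)
coef(fit1)  # the estimated coefficients beta0 and beta1
```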
