Linear Regression Chp1


Transcript of Linear Regression Chp1

  • 8/11/2019 Linear Regression Chp1


    Linear Statistical Analysis I

    Fall 2013

    () Linear Statistical Analysis I Fall 2013 1 / 101


    Outline

    Introduction to R statistical software

    Introduction to regression and model building


    R language

    R is a popular statistical software package especially suitable for
    data analysis and graphical representation.

    It is free, open-source software.

    The R package can be downloaded from
    http://cran.us.r-project.org/ where you can also find useful
    information about R. Note that there are different versions for
    different operating systems such as Windows, Linux, and MacOS.

    You can also find online tutorials by googling the key words R
    language tutorial.


    R language

    Open the R package.


    type commands after the prompt symbol >

    For example, to compute 3/(9*4-5) and the square root of 9:

    > 3/(9*4-5)

    [1] 0.09677419

    > sqrt(9)

    [1] 3

    If you want to make comments on the commands, use the symbol

    #. The expression after this symbol will not be executed.

    > 3/(9*4-5) # This is a comment.

    [1] 0.09677419

    > sqrt(9) # compute the square root of 9.

    [1] 3


    workspace and working directory

    The workspace is the place in your computer where R reads or
    saves data or R objects. The directory of the workspace is the
    working directory.

    Find the current working directory

    > getwd()

    [1] "F:/Teaching/2013_spring_Computing"

    change the current working directory

    > setwd("F:/Teaching")

    > getwd()

    [1] "F:/Teaching"

    or you can use the File menu in the Console window of R.

    If there is an error when you input data, first check whether the
    data file is in the working directory.


    arithmetic expressions and variables

    arithmetic expressions: type expressions after the prompt symbol;
    R will evaluate their values.

    > 2+3

    [1] 5

    > 3-4

    [1] -1

    > 3*5

    [1] 15

    > 3/5

    [1] 0.6

    > 13%%5 # compute the remainder of 13 divided by 5

    [1] 3

    > 13%/%5 #the integer quotient of 13 divided by 5

    [1] 2

    > 3^2

    [1] 9


    arithmetic expressions and variables

    > sqrt(9)

    [1] 3

    > log(3)

    [1] 1.098612

    > log2(2)

    [1] 1

    > log10(100)

    [1] 2

    > exp(3)

    [1] 20.08554

    > factorial(3) #compute 3!

    [1] 6

    > log((1+exp(5)))*3^(-2)

    [1] 0.5563017


    arithmetic expressions and variables

    variable: the name is case sensitive, that is, upper and lower case

    letters are distinct.

    assignment operator <-

    > x<-13+exp(2)

    > x # check the value of x

    [1] 20.38906

    equivalently, you can use

    > x=13+exp(2)

    > x

    [1] 20.38906


    arithmetic expressions and variables

    you can include variables in expressions; however, the values of
    the variables must be assigned before the expressions are used.

    > x=2+3

    > (2+3)*4

    [1] 20

    > x=2+3

    > x*4

    [1] 20

    > y*4

    Error: object 'y' not found

    > y=x^2

    > y

    [1] 25


    Relational and Logical Operators

    logical values: TRUE (can be denoted by T) and FALSE
    (can be denoted by F).

    we can apply the arithmetic operators to logical values; in this
    case, TRUE is treated as 1 and FALSE as 0

    > TRUE+TRUE

    [1] 2

    > TRUE+FALSE

    [1] 1

    > TRUE*FALSE

    [1] 0

    > x=TRUE

    > x

    [1] TRUE

    > y=FALSE

    > y-x

    [1] -1


    Relational and Logical Operators

    The relational operators are <, <=, >, >=, == (equal to), and
    != (not equal to).

    > 1>2 # logical expression with two possible values

    [1] FALSE

    > 1<2

    [1] TRUE

    > 1<=2

    [1] TRUE

    > 1>=2

    [1] FALSE

    > (1+3)==4

    [1] TRUE

    > 3*2!=6

    [1] FALSE

    > x=(3*2!=6)

    > x

    [1] FALSE


    Relational and Logical Operators

    logical operators: can be used to connect two or more logical

    expressions

    > (1>2)&((1+3)==4) # "&" means "and". The combined
    # expression is true only if
    # both expressions are true.

    [1] FALSE

    > (1<2)&((1+3)==4)

    [1] TRUE

    > TRUE&FALSE

    [1] FALSE

    > (1>2)|((1+3)==4) # "|" means "or". The combined
    # expression is true if and only if
    # at least one of the two
    # expressions is true.

    [1] TRUE


    Relational and Logical Operators

    > (1<2)&((1+3)==5)

    [1] FALSE

    > (1>2)|((1+3)==5)

    [1] FALSE

    > TRUE|FALSE

    [1] TRUE

    > ((1+2)==1)

    [1] FALSE

    > !((1+2)==1) #"!" means "NOT".

    [1] TRUE

    > !TRUE

    [1] FALSE

    > !FALSE

    [1] TRUE


    basic data types in R

    vector: concatenate several numbers or vectors into a vector by
    using the operator c()

    > c(1,3,2)

    [1] 1 3 2

    > c(c(1,2),c(3,5),c(3,0,0))

    [1] 1 2 3 5 3 0 0

    > x=c(3,4,9)

    > x

    [1] 3 4 9

    > y=c(x,x,c(0,1,2))

    > y

    [1] 3 4 9 3 4 9 0 1 2


    basic data types in R

    Special vectors:

    > 1:9

    [1] 1 2 3 4 5 6 7 8 9

    > 5:1

    [1] 5 4 3 2 1

    > rep(0.5,6) # repeat 0.5 six times

    [1] 0.5 0.5 0.5 0.5 0.5 0.5

    > rep(-3,8)

    [1] -3 -3 -3 -3 -3 -3 -3 -3

    > seq(0,1, length.out=11) # get 11 numbers from
    # 0 to 1 with equal spacing.

    [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

    > seq(1,2,length.out=21)

    [1] 1.00 1.05 1.10 1.15 1.20 1.25 1.30 1.35

    [9] 1.40 1.45 1.50 1.55 1.60 1.65 1.70 1.75

    [17] 1.80 1.85 1.90 1.95 2.00


    basic data types in R

    logical vector: consists of logical values

    > x=c(T,T,F,F) #TRUE can be denoted by T

    > x

    [1] TRUE TRUE FALSE FALSE

    > x=1:5

    > x==1 # each component of x is compared with 1

    [1] TRUE FALSE FALSE FALSE FALSE

    > y=(x!=1)

    > y

    [1] FALSE TRUE TRUE TRUE TRUE


    basic data types in R

    extract or modify subsets of a vector by using index vector

    > x=c(3,2,5,5,7,8)

    > x[c(1,3,5)] # extract the 1st, 3rd, 5th numbers in x

    [1] 3 5 7

    > y=x[c(3,1,5)]

    > y

    [1] 5 3 7

    > z=x[-c(2,4)] # remove the 2nd and 4th numbers in x

    > z

    [1] 3 5 7 8

    > x[1:3]

    [1] 3 2 5

    > x[1:3]=c(1,2,3)

    > x

    [1] 1 2 3 5 7 8


    basic data types in R

    find all possible values in a vector

    > x=c(3,2,5,5,7,8,7,8,9)

    > unique(x)

    [1] 3 2 5 7 8 9


    basic data types in R

    > x[c(1,3,5)]=0

    > x

    [1] 0 2 0 5 0 8

    extract or modify subsets of a vector by using logical vector

    > x=c(3,2,5,5,7,8,3,6,9) #we will extract the
    #numbers larger than 4

    > index=(x>4)

    > index

    [1] FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE

    > x[index] # only values corresponding to TRUE
    # in "index" will be selected

    [1] 5 5 7 8 6 9


    basic data types in R

    extract or modify subsets of a vector by using logical vector

    > x=1:10 #extract numbers less or equal
    # to 6 and larger than 3

    > x[(x<=6)&(x>3)]

    [1] 4 5 6

    > x=c(3,2,5,5,7,8,3,6,9)

    > y=c(1,3,5,7,6,4,2,3,5)

    > #we extract the numbers in

    > #the positions corresponding to y>4

    > x[y>4]

    [1] 5 5 7 9


    basic data types in R

    extract the indices corresponding to the numbers which satisfy
    specific conditions in a vector.

    > x=c(T,T,F,T,F,F,T,T,T,F)

    > which(x)

    [1] 1 2 4 7 8 9

    > x=c(3,2,5,5,7,8,3,6,9)

    > which(x>5)

    [1] 5 6 8 9

    > x[which(x>5)]

    [1] 7 8 6 9


    operations on vectors

    The elementary arithmetic operators +, -, *, /, ^ (for raising to a
    power), and functions such as log, exp, sin, cos, tan, sqrt, can be
    applied to vectors in an element-by-element sense.

    > x=1:5

    > x

    [1] 1 2 3 4 5

    > y=10:15

    > y

    [1] 10 11 12 13 14 15

    > y-x

    [1] 9 9 9 9 9 14

    Warning message:

    In y - x : longer object length is not

    a multiple of shorter object length


    operations on vectors

    > x

    [1] 1 2 3 4 5

    > y=11:15 #x and y should have the same lengths.

    > y-x

    [1] 10 10 10 10 10

    > 2*x

    [1] 2 4 6 8 10

    > x^2

    [1] 1 4 9 16 25

    > y/x

    [1] 11.000000 6.000000 4.333333 3.500000 3.000000

    > x+1

    [1] 2 3 4 5 6

    > x*y

    [1] 11 24 39 56 75


    operations on vectors

    get the length, minimum, maximum, mean, variance of a vector

    > z=c(43, 45, 5, 44, 767, 57, 68,33,111)

    > length(z)

    [1] 9

    > min(z)

    [1] 5

    > max(z)

    [1] 767

    > which(z==min(z))

    [1] 3

    > which(z==max(z))

    [1] 5

    > mean(z)

    [1] 130.3333

    > var(z)

    [1] 57815.75

    > sd(z) # the standard deviation

    [1] 240.4491


    operations on vectors

    sort the numbers in a vector in an increasing or decreasing order

    > z=c(43, 45, 5, 44, 767, 57, 68,33,111)

    > sort(z)

    [1] 5 33 43 44 45 57 68 111 767

    > sort(z,decreasing = T)

    [1] 767 111 68 57 45 44 43 33 5

    > y=sort(z,index.return = T)

    > y

    $x

    [1] 5 33 43 44 45 57 68 111 767

    $ix

    [1] 3 8 1 4 2 6 7 9 5

    > y$x

    [1] 5 33 43 44 45 57 68 111 767

    > z[y$ix]

    [1] 5 33 43 44 45 57 68 111 767
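The index vector returned in $ix can also be obtained directly with order(), a base R function; a small sketch:

```r
# order() returns the permutation of indices that sorts the vector --
# the same information as the $ix component of sort(z, index.return = T)
z = c(43, 45, 5, 44, 767, 57, 68, 33, 111)
idx = order(z)
idx        # 3 8 1 4 2 6 7 9 5
z[idx]     # 5 33 43 44 45 57 68 111 767, i.e. sort(z)
```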


    matrix

    we start with some simple matrices

    > matrix(1.5, nrow=2, ncol=3)

    [,1] [,2] [,3]

    [1,] 1.5 1.5 1.5

    [2,] 1.5 1.5 1.5

    > diag(3.3,3)

    [,1] [,2] [,3]
    [1,] 3.3 0.0 0.0

    [2,] 0.0 3.3 0.0

    [3,] 0.0 0.0 3.3

    > y=diag(-2,3,4)

    > y

    [,1] [,2] [,3] [,4]

    [1,] -2 0 0 0

    [2,] 0 -2 0 0

    [3,] 0 0 -2 0


    matrix

    form a matrix from a vector

    > x=matrix(c(1,2,3,4,5,6), 2,3)

    > x

    [,1] [,2] [,3]

    [1,] 1 3 5

    [2,] 2 4 6

    > matrix(c(1,2,3,4,5,6), 3,2)

    [,1] [,2]

    [1,] 1 4

    [2,] 2 5

    [3,] 3 6

    > matrix(c(1,2,3,4,5,6), 3,2,byrow=T)

    [,1] [,2]

    [1,] 1 2

    [2,] 3 4

    [3,] 5 6


    matrix

    form a matrix by combining vectors

    > x=c(1,2,3)

    > y=c(4,5,6)

    > z=c(7,8,9)

    > cbind(x,y,z)#combine three vectors as columns

    x y z
    [1,] 1 4 7
    [2,] 2 5 8
    [3,] 3 6 9

    > rbind(x,y,z)#combine three vectors as rows

    [,1] [,2] [,3]

    x 1 2 3

    y 4 5 6

    z 7 8 9


    matrix

    form a matrix by combining matrices

    > w=cbind(x,y)

    > w

    x y

    [1,] 1 4

    [2,] 2 5

    [3,] 3 6

    > v=cbind(z,y)

    > v

    z y

    [1,] 7 4

    [2,] 8 5

    [3,] 9 6


    matrix

    form a matrix by combining matrices

    > cbind(w,v)

    x y z y

    [1,] 1 4 7 4
    [2,] 2 5 8 5
    [3,] 3 6 9 6

    > rbind(w,v)

    x y

    [1,] 1 4

    [2,] 2 5

    [3,] 3 6

    [4,] 7 4

    [5,] 8 5

    [6,] 9 6


    matrix

    extract or modify subsets of a matrix

    > x=matrix(1:15,3,5)

    > x

    [,1] [,2] [,3] [,4] [,5]

    [1,] 1 4 7 10 13

    [2,] 2 5 8 11 14

    [3,] 3 6 9 12 15

    > x[2,4] # the number in the second row and the fourth column

    [1] 11

    > y=x[1,]
    > # extract the first row,
    > # which is converted to a vector.

    > y

    [1] 1 4 7 10 13

    > str(y)

    int [1:5] 1 4 7 10 13
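As str(y) shows, extracting a single row drops the result down to a plain vector. If the one-row matrix shape should be kept instead, the drop argument of [ can be used; a small sketch:

```r
# By default, x[1,] drops the single row down to a plain vector;
# drop = FALSE keeps it as a 1-by-5 matrix instead.
x = matrix(1:15, 3, 5)
y = x[1, , drop = FALSE]
y          # still a matrix with one row
dim(y)     # 1 5
```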


    matrix

    > z=x[,2]
    > # extract the second column

    > z

    [1] 4 5 6

    > str(z)

    int [1:3] 4 5 6

    > w=x[1:2,1:3]

    > w

    [,1] [,2] [,3]

    [1,] 1 4 7

    [2,] 2 5 8

    > v=x[c(1,3),]

    > v

    [,1] [,2] [,3] [,4] [,5]

    [1,] 1 4 7 10 13

    [2,] 3 6 9 12 15


    matrix

    extract irregularly distributed subsets of a matrix

    > index=cbind(c(2,2,3),c(2,4,5)) # a matrix of (row, column)
    # index pairs

    > x[index]=0

    > x

    [,1] [,2] [,3] [,4] [,5]
    [1,] 1 4 7 10 13
    [2,] 2 0 8 0 14
    [3,] 3 6 9 12 0

    > which(x==0, arr.ind = T)

    row col
    [1,] 2 2
    [2,] 2 4
    [3,] 3 5

    > x[which(x==0, arr.ind = T)]

    [1] 0 0 0


    operations on matrices

    The elementary arithmetic operators +, -, *, /, ^ (for raising to a
    power), and functions such as log, exp, sin, cos, tan, sqrt, can be
    applied to matrices in an element-by-element sense.

    > x=matrix(1:6,2,3)

    > x

    [,1] [,2] [,3]

    [1,] 1 3 5

    [2,] 2 4 6

    > sqrt(x)

    [,1] [,2] [,3]
    [1,] 1.000000 1.732051 2.236068

    [2,] 1.414214 2.000000 2.449490


    operations on matrices

    > y=matrix(11:16,2,3)

    > y

    [,1] [,2] [,3]

    [1,] 11 13 15
    [2,] 12 14 16

    > x+y

    [,1] [,2] [,3]

    [1,] 12 16 20

    [2,] 14 18 22


    operations on matrices

    transpose of a matrix

    > y=matrix(11:16,2,3)

    > y

    [,1] [,2] [,3]

    [1,] 11 13 15

    [2,] 12 14 16

    > t(y)

    [,1] [,2]

    [1,] 11 12

    [2,] 13 14

    [3,] 15 16

    > y=1:3 # vector: a special matrix with one column

    > t(y)

    [,1] [,2] [,3]

    [1,] 1 2 3


    matrix multiplication

    > A=matrix(1:6,3,2)

    > A

    [,1] [,2]

    [1,] 1 4

    [2,] 2 5

    [3,] 3 6

    > B=matrix(11:16,2,3)

    > B

    [,1] [,2] [,3]

    [1,] 11 13 15

    [2,] 12 14 16

    > A%*%B # different from A*B

    [,1] [,2] [,3]

    [1,] 59 69 79

    [2,] 82 96 110

    [3,] 105 123 141


    Linear equations and inversion of a matrix

    For example, to solve the following equations

    2x+y+3z=1
    3x-y+z=4
    x+y+z=2

    > A=cbind(c(2,3,1),c(1,-1,1),c(3,1,1))

    > A # the matrix of coefficients

    [,1] [,2] [,3]
    [1,] 2 1 3

    [2,] 3 -1 1

    [3,] 1 1 1

    > a=c(1,4,2)

    > solve(A,a)

    [1] 2.333333 1.333333 -1.666667

    > v=solve(A,a)

    > A%*%v

    [,1]
    [1,] 1
    [2,] 4
    [3,] 2


    Linear equations and inversion of a matrix


    > A%*%v-a

    [,1]

    [1,] 0.000000e+00

    [2,] -4.440892e-16

    [3,] 4.440892e-16

    > solve(A) # the inverse of A

    [,1] [,2] [,3]

    [1,] -0.3333333 0.3333333 0.6666667

    [2,] -0.3333333 -0.1666667 1.1666667[3,] 0.6666667 -0.1666667 -0.8333333


    character vector and factor

    A character vector is a vector whose components are character strings

    instead of numbers. For example, we construct vectors of names and
    genders of five persons in one office.

    > names=c("Emily","John", "Lily","Grace","William")

    > str(names)

    chr [1:5] "Emily" "John" "Lily" "Grace" "William"

    > gender=c("F","M","F","F","M")

    > str(gender)

    chr [1:5] "F" "M" "F" "F" "M"

    In the vector gender, there are two categories, male and female,
    labeled by M and F. Sometimes we are interested in
    information such as how many elements are in each category. In this
    case, we can convert the character vector to a factor.



    > gender=as.factor(gender)

    > gender

    [1] F M F F M

    Levels: F M

    > levels(gender)

    [1] "F" "M"

    Different levels denote different categories.
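The counts per category mentioned earlier can be obtained with base R's table() function; a small sketch:

```r
# table() counts how many elements fall in each level of a factor
gender = as.factor(c("F", "M", "F", "F", "M"))
table(gender)
# gender
# F M
# 3 2
```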


    data frame


    The data frame is one of the most important data types in R.

    It is similar to a matrix: its elements are organized in rows and
    columns.

    However, in a matrix, all elements are numbers.

    In a data frame, it is often the case that some elements are
    numbers and the others are character strings.

    When R reads data from an external data file, R always saves the
    data as a data frame.

    It is not efficient to type the data directly into R if the data size is
    large. We can read the data from a file where the data is saved.

    It is important that the file is in the working directory; otherwise, R
    cannot find the file. Or you can tell R where the file is.


    input data from a file



    The data has been stored in a file called ch.1.ex.1.dat.
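A minimal sketch of reading such a file with read.table, assuming ch.1.ex.1.dat is whitespace-separated with no header row (the file's actual contents are an assumption; a stand-in file is written first so the sketch is self-contained):

```r
# Stand-in contents for ch.1.ex.1.dat (assumed: whitespace-separated,
# no header row); writing the file here keeps the sketch self-contained.
writeLines(c("1 2 3", "2 4 7", "4 5 7", "5 6 8"), "ch.1.ex.1.dat")

data = read.table("ch.1.ex.1.dat")  # defaults: whitespace separator, header=FALSE
data
#   V1 V2 V3
# 1  1  2  3
# 2  2  4  7
# 3  4  5  7
# 4  5  6  8
```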


    The data has been stored in a file called ch.1.ex.2.dat. The
    numbers are separated by commas.


    > data=read.table("ch.1.ex.2.dat",sep=",")

    > str(data)

    data.frame: 4 obs. of 3 variables:

    $ V1: int 1 2 4 5

    $ V2: int 2 4 5 6

    $ V3: int 3 7 7 8

    > data

    V1 V2 V3

    1 1 2 3

    2 2 4 7

    3 4 5 7

    4 5 6 8


    > data=read.table("ch.1.ex.3.dat",header=T)

    > str(data)

    data.frame: 4 obs. of 3 variables:

    $ A: int 1 2 4 5

    $ B: int 2 4 5 6

    $ C: int 3 7 7 8

    > data

    A B C

    1 1 2 3

    2 2 4 7

    3 4 5 7

    4 5 6 8


    input data from an excel file


    The data has been stored in a CSV (comma-separated values,
    Excel-compatible) file called ch.1.ex.4.csv.


    > data=read.csv("ch.1.ex.4.csv")

    > str(data)

    data.frame: 4 obs. of 3 variables:

    $ A: int 1 5 4 5

    $ B: int 2 4 5 8

    $ C: int 3 4 6 7

    > data

    A B C

    1 1 2 3

    2 5 4 4

    3 4 5 6

    4 5 8 7


    The data has been stored in a CSV file called ch.1.ex.5.csv.


    > data=read.csv("ch.1.ex.5.csv",header=F)

    > data

    V1 V2 V3

    1 1 2 3

    2 5 4 4

    3 4 5 6

    4 5 8 7


    data sets in R


    R itself contains some data sets which can be loaded directly. You

    can use

    > data()

    to check available data sets in R.

    For example, let us consider a data set named sleep. We can

    load the data set by using

    > data(sleep)

    > sleep # check the data

    extra group ID

    1 0.7 1 1

    2 -1.6 1 2

    3 -0.2 1 3

    ...........

    ...........

    20 3.4 2 10


    > str(sleep)

    data.frame: 20 obs. of 3 variables:

    $ extra: num 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8

    $ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1

    $ ID : Factor w/ 10 levels "1","2","3","4",..:

    Data which show the effect of two drugs on 10 patients.

    There are 20 observations and three variables.

    the second variable represents the types of drugs given.

    the third is the patient's ID.

    the first is the increase in hours of sleep compared to control.


    extract or modify subsets of a data frame

    > # two different ways to extract the first variable

    > sleep$extra

    [1] 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0.0 2.0 1.9 0.8 1.1 0.1 -0.1
    [16] 4.4 5.5 1.6 4.6 3.4

    > sleep[,1]

    [1] 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0.0 2.0 1.9 0.8 1.1 0.1 -0.1
    [16] 4.4 5.5 1.6 4.6 3.4

    > sleep[1,1]

    [1] 0.7

    > sleep[1,]

    extra group ID
    1 0.7 1 1


    # extract the measurements for the second drug.

    > sleep[sleep$group=="2",1]

    [1] 1.9 0.8 1.1 0.1 -0.1 4.4

    [7] 5.5 1.6 4.6 3.4

    > # extract the measurements for the fifth patient

    > sleep[sleep$ID=="5",1]

    [1] -0.1 -0.1

    > # extract the measurements for the third
    > # patient when he took the second type of drug.

    > sleep[(sleep$ID=="3")&(sleep$group=="2"),1]

    [1] 1.1

    > names(sleep) # the names of variables

    [1] "extra" "group" "ID"> names(sleep)=c("hours","Drug","Patient")

    > sleep

    hours Drug Patient

    1 0.7 1 1

    2 -1.6 1 2

    If all elements of a data frame are numbers, the data frame can be

    converted to a matrix.


    > data=read.csv("ch.1.ex.5.csv",header=F)

    > str(data)

    data.frame: 4 obs. of 3 variables:

    $ V1: int 1 5 4 5

    $ V2: int 2 4 5 8

    $ V3: int 3 4 6 7

    > X=as.matrix(data)

    > str(X)

    int [1:4, 1:3] 1 5 4 5 2 4 5 8 3 4 ...

    > X

    V1 V2 V3

    [1,] 1 2 3
    [2,] 5 4 4

    [3,] 4 5 6

    [4,] 5 8 7


    Conversely, a matrix can be converted to a data frame

    > X=matrix(1:15,5,3)

    > X

    [,1] [,2] [,3]

    [1,] 1 6 11

    [2,] 2 7 12

    [3,] 3 8 13

    [4,] 4 9 14

    [5,] 5 10 15

    > str(X)

    int [1:5, 1:3] 1 2 3 4 5 6 7 8 9 10 ...

    > Y=data.frame(X)

    > Y

    X1 X2 X3
    1 1 6 11

    2 2 7 12

    3 3 8 13

    4 4 9 14

    5 5 10 15

    > str(Y)

    data.frame: 5 obs. of 3 variables:

    $ X1: int 1 2 3 4 5

    $ X2: int 6 7 8 9 10
    $ X3: int 11 12 13 14 15


    control flow


    In addition to typing the commands after the prompt symbol, you
    can write a group of commands in a text editor such as
    Notepad, MS Word, and so on, then copy and paste these
    commands after the prompt symbol in R.

    For example, I write the following commands in Notepad,


    Then I copy and paste them to R

    > data=read.csv("ch.1.ex.5.csv",header=F)

    > X=as.matrix(data)

    > names(X)=c("x1","x2","x3")

    > print(X)

    V1 V2 V3

    [1,] 1 2 3

    [2,] 5 4 4

    [3,] 4 5 6

    [4,] 5 8 7


    The control flow specifies the order in which computations are
    performed.

    Some commonly used control-flow methods: If-Else statement,
    loops (including for and while statements)

    If-Else statement: Example: compute the absolute value of a. I

    write the following commands in Notepad,


    Then I copy and paste them to R

    > a=1

    > if(a>0)

    + {absolute=a

    + print(absolute)

    + }else

    + {absolute=-a

    + print(absolute)

    + }

    [1] 1


    Example 2


    > x=-3

    > if((x<1)|(x>2))
    + {f=0

    + print(f)

    + }else

    + {if(x


    > x=8

    > if((x<1)|(x>2))
    + {f=0

    + print(f)

    + }else

    + {if(x


    the for loop

    > sum=0 #set the initial values of the sum

    > for (i in 1:6)

    + {sum=sum+i

    + print(c("i=",i, "sum=",sum))

    + }

    [1] "i=" "1" "sum=" "1"

    [1] "i=" "2" "sum=" "3"

    [1] "i=" "3" "sum=" "6"

    [1] "i=" "4" "sum=" "10"

    [1] "i=" "5" "sum=" "15"

    [1] "i=" "6" "sum=" "21"


    the for loop

    > sum=0 #set the initial value of the sum

    > for (i in c(2,4,6,8,10))

    + {sum=sum+i

    + print(c("i=",i, "sum=",sum))

    + }

    [1] "i=" "2" "sum=" "2"

    [1] "i=" "4" "sum=" "6"

    [1] "i=" "6" "sum=" "12"

    [1] "i=" "8" "sum=" "20"

    [1] "i=" "10" "sum=" "30"


    Data management

    http://find/
  • 8/11/2019 Linear Regression Chp1

    73/102

    Data input has been introduced. I will now introduce topics
    including data output, missing data, data manipulation, and merging,
    combining, and subsetting data sets.

    save an R object: for example,


    save an R object


    > X=matrix(1:15,3,5)

    > X

    [,1] [,2] [,3] [,4] [,5]

    [1,] 1 4 7 10 13

    [2,] 2 5 8 11 14

    [3,] 3 6 9 12 15

    > save(X,file="ex.RData")> X=0

    > X

    [1] 0

    > load("ex.RData")

    > X

    [,1] [,2] [,3] [,4] [,5]

    [1,] 1 4 7 10 13

    [2,] 2 5 8 11 14

    [3,] 3 6 9 12 15

    output data to external files


    > data(sleep)

    > str(sleep)

    data.frame: 20 obs. of 3 variables:

    $ extra: num 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0

    $ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1

    $ ID : Factor w/ 10 levels "1","2","3","4",..: 1 2

    > sleep.1=sleep[sleep$group=="1",c(1,3)]

    > sleep.1

    extra ID

    1 0.7 1

    2 -1.6 2

    3 -0.2 3

    ...........

    > write.table(sleep.1,file="sleep.1.dat")


By default, both the row names and the column names are written to the file:

    > write.table(sleep.1,file="sleep.1.dat")
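If the row names are not wanted in the file, they can be suppressed with the row.names argument of write.table(); quote=FALSE similarly drops the quotation marks around character values:

```r
# sleep.1 as constructed above from the built-in sleep data
sleep.1 = sleep[sleep$group == "1", c(1, 3)]
# write the data frame without row names and without quotation marks
write.table(sleep.1, file = "sleep.1.dat", row.names = FALSE, quote = FALSE)
```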


If we want the file to be saved as a CSV file, use

    > write.csv(sleep.1,file="sleep.1.csv")


    missing data


It is very common in practice that there are missing values in a data set. For example, in the following data file, the missing values are denoted by *.


> X=read.table("ch.1.ex.6.dat",na.strings = "*")
> X
V1 V2 V3 V4 V5
1 1 2 3 4 NA
2 1 3 5 7 8
3 7 NA 9 0 1

In R, all missing values are denoted by NA, which can be identified by the following command:

> x=X[,2]
> x
[1] 2 3 NA

    > is.na(x)

    [1] FALSE FALSE TRUE


    > is.na(X)

    V1 V2 V3 V4 V5

    [1,] FALSE FALSE FALSE FALSE TRUE

    [2,] FALSE FALSE FALSE FALSE FALSE

    [3,] FALSE TRUE FALSE FALSE FALSE

> sum(is.na(X)) # the number of missing values
[1] 2
> mean(X[,1])
[1] 3
> mean(X[,2])
[1] NA
> # calculate the mean with the missing values excluded
> mean(X[,2],na.rm=TRUE)
[1] 2.5
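Two related built-in functions are useful here (a short illustration on a small data frame with missing values): colMeans() with na.rm=TRUE computes all the column means at once, and complete.cases() flags the rows that contain no missing values.

```r
X = data.frame(V1 = c(1, 1, 7), V2 = c(2, 3, NA), V3 = c(3, 5, 9))
colMeans(X, na.rm = TRUE)  # V2's mean is computed from 2 and 3 only
complete.cases(X)          # [1]  TRUE  TRUE FALSE
X[complete.cases(X), ]     # keep only the fully observed rows
```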


Example: write a function that replaces the missing values in a data frame by the mean of the corresponding column, computed with the missing values excluded. If a whole column is missing, we remove the column.

    > missing.replace=function(X)

    + {

    + temp=NULL

    + for (i in 1:dim(X)[2])

    + {if(sum(is.na(X[,i]))==0)

    + {temp=cbind(temp,X[,i])

    + }else

    + {if(sum(!is.na(X[,i]))>0)

    + {y=X[,i]

    + y[is.na(X[,i])]=mean(y,na.rm=TRUE)

    + temp=cbind(temp,y)

    + }

    + }

    + }

    + temp=data.frame(temp)

    + temp

    + }


    > X

    V1 V2 V3 V4 V5

    1 1 2 3 4 NA

    2 1 3 5 7 8

    3 7 NA 9 0 1

    > missing.replace(X)

    V1 y V3 V4 y.1

    1 1 2.0 3 4 4.5

    2 1 3.0 5 7 8.0

    3 7 2.5 9 0 1.0

> Y=X
> Y[,2]=NA

    > Y

    V1 V2 V3 V4 V5

    1 1 NA 3 4 NA

    2 1 NA 5 7 8

    3 7 NA 9 0 1

    > missing.replace(Y)

    V1 V2 V3 y

1 1 3 4 4.5
2 1 5 7 8.0

    3 7 9 0 1.0
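The same replacement can be written more compactly with lapply() (an alternative sketch, not the version from the slides):

```r
missing.replace2 = function(X)
{
  # drop columns that are entirely missing
  X = X[, colSums(!is.na(X)) > 0, drop = FALSE]
  # replace each remaining column's NAs by that column's mean
  data.frame(lapply(X, function(col) {
    col[is.na(col)] = mean(col, na.rm = TRUE)
    col
  }))
}
```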


    merging data


Suppose that X and Y are the data frames for the two data sets, respectively.

> merge(X,Y,by="id")
id year female inc maxval

    1 1 81 0 5500 1800

    2 1 80 0 5000 1800

    3 1 82 0 6000 1800

    4 2 82 1 3300 2400

    5 2 80 1 2000 2400

    6 2 81 1 2200 2400

> merge(X,Y,by="id",all=T)

id year female inc maxval
1 1 81 0 5500 1800

    2 1 80 0 5000 1800

    3 1 82 0 6000 1800

    4 2 82 1 3300 2400

    5 2 80 1 2000 2400

    6 2 81 1 2200 2400

    7 3 82 0 1000 NA

    8 3 80 0 3000 NA

9 3 81 0 2000 NA
10 4 NA NA NA 1900
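The slides do not show how X and Y were read in; the following data frames are a reconstruction (values taken from the merged output above, not from the original files) that makes the example reproducible:

```r
# X: three yearly records for each of ids 1-3; Y: one record each for ids 1, 2, 4
X = data.frame(id     = rep(1:3, each = 3),
               year   = rep(80:82, times = 3),
               female = rep(c(0, 1, 0), each = 3),
               inc    = c(5000, 5500, 6000, 2000, 2200, 3300,
                          3000, 2000, 1000))
Y = data.frame(id = c(1, 2, 4), maxval = c(1800, 2400, 1900))
merge(X, Y, by = "id")              # keeps only ids present in both data frames
merge(X, Y, by = "id", all = TRUE)  # keeps all ids; unmatched entries become NA
```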


    graphical exploration of data


Example: This famous (Fisher's or Anderson's) iris data set gives the measurements, in centimeters, of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

    > X=read.table("cell.1.dat")

    > str(X)

    data.frame: 1460 obs. of 7 variables:

    $ V1: int 0 0 0 0 0 1 1 1 1 1 ...

$ V2: int 100 100 100 100 100 100 100 100 100 100 ...
$ V3: int 0 0 0 0 0 0 0 0 0 0 ...

    $ V4: int 0 0 0 0 0 0 0 0 0 0 ...

    $ V5: int 0 0 0 0 0 0 0 0 0 0 ...

    $ V6: int 0 0 0 0 0 10 6 12 16 8 ...

    $ V7: int 1600 1600 1604 1590 1581 1568 1569 1577 1570 1577 ...

    > Y=read.table("cell.2.dat")

    > Z=read.table("cell.3.dat")


    scatter plot


    > plot(iris$Sepal.Length,iris$Petal.Length)

[Scatter plot of iris$Petal.Length against iris$Sepal.Length]


If we want to change the x-axis label, the y-axis label, and the title of the plot:


    > plot(iris$Sepal.Length,iris$Petal.Length,xlab="Sepal Length"

    + ,ylab="Petal Length",main="The Scatter plot")

[Scatter plot titled "The Scatter plot", with x-axis "Sepal Length" and y-axis "Petal Length"]

    () Linear Statistical Analysis I Fall 2013 88 / 101

Matrix of scatterplots

If we want to draw the scatterplots of all pairs of the variables:

    > pairs(iris)


[Matrix of pairwise scatterplots of Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species]


boxplot

> boxplot(iris$Sepal.Length[iris$Species=="setosa"],iris$Sepal.Length[iris$Species=="versicolor"],
+ iris$Sepal.Length[iris$Species=="virginica"],names=c("setosa", "versicolor", "virginica"),
+ main="Sepal Length of three species")

[Side-by-side boxplots of Sepal Length for the three species, titled "Sepal Length of three species"]
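The same grouped boxplot can be produced more compactly with the formula interface of boxplot():

```r
# one box per level of Species, with the group labels set automatically
boxplot(Sepal.Length ~ Species, data = iris,
        main = "Sepal Length of three species")
```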


    Regression


Regression is the study of the relationship, or dependence, between two sets of variables, based on the data or observations made on these variables.

Based on the relationship, we can make predictions about one set of variables from the values of the other set.

    Example 1: Predict the price of a stock in 6 months from now, on

    the basis of company performance measures and economic data.

Example 2: Estimate the amount of glucose in the blood of a diabetic person, from the infrared absorption spectrum of that person's blood.

Example 3: Examine the correlation between the level of prostate-specific antigen and a number of clinical measures such as cancer volume, prostate weight, age, and so on.


    variables


    Variables are quantitative or qualitative measurements of

    characteristics of objects under our study. Typically, the values of

    variables vary for different objects.

    They are considered as random variables with some probability

    distributions.

We will use uppercase letters, such as X and Y, to denote variables and lowercase letters, such as x and y, to denote their particular values.


In this course, for each regression model, the variables will be classified into two categories:

independent variables, also called inputs, features, predictors, regressors, or explanatory variables; typically denoted by X1, X2, ...

dependent variables, also called outputs, responses, or outcome variables; typically denoted by Y1, Y2, ... In this course, we only consider one response.

The partition into dependent and independent variables is obvious in some data sets but may vary according to the purpose of the study in others.

For example, suppose that we collected the temperature and precipitation data of a city over a period. If we want to construct a model to predict the temperature from the precipitation, then the response is temperature and the predictor is precipitation. However, if precipitation is to be predicted from temperature, then the response is precipitation and the predictor is temperature.


Example 1: Y = price of the stock in 6 months from now, X1 = company performance measures, and X2 = economic data.

Example 2: Y = the amount of glucose in the blood of a diabetic person, X = the infrared absorption spectrum of that person's blood.

Example 3: Y = level of prostate-specific antigen, and X1 = cancer volume, X2 = prostate weight, X3 = age, and so on.


    regression models


The goal is to construct a model from which, given any values of the predictors X1, X2, ..., we can give a good guess of the value of Y, such that the difference between the guess and the true value is as small as possible. The guess is called the predicted value, denoted by Ŷ.

deterministic system: given any values of X1, X2, ..., the value of Y is deterministic. That is, if x_i^(1) and x_i^(2), i = 1, 2, ..., are any two observations of X1, X2, ..., with corresponding true values of Y equal to y^(1) and y^(2), then x_i^(1) = x_i^(2), i = 1, 2, ..., implies y^(1) = y^(2).

Mathematically, Y can be considered as a function of X1, X2, ..., that is, Y = f(X1, X2, ...).


No randomness is involved in a deterministic system.

However, in practice, there are many factors affecting the responses, and we cannot find and measure all of those factors.

In statistics, we do not consider deterministic models. Instead, we will consider models having randomness.

These models are more flexible and more appropriate.


Specifically, consider

Y = f(X1, X2, ...) + ε,

where f is an (unknown) function of the variables X1, X2, ..., which are of interest and easy to observe or measure.

The term ε is a random variable used to account for the randomness due to unobserved factors or variables, noise, or errors.

A standard assumption is that ε is independent of X1, X2, ... and that its expectation is 0.

Then we have E(Y | X1, X2, ...) = f(X1, X2, ...). More assumptions will be imposed in the following classes.
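As an illustration (the choice of f and of the error distribution here is ours, not the slides'), such a model is easy to simulate in R:

```r
# simulate Y = f(X) + epsilon with f(x) = sin(x) and N(0, 0.1^2) errors
set.seed(1)
x   = runif(100, 0, 2 * pi)           # observed predictor values
eps = rnorm(100, mean = 0, sd = 0.1)  # mean-zero noise, independent of x
y   = sin(x) + eps                    # responses
plot(x, y)                            # points scatter around the curve sin(x)
```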


    linear regression models


The function f(X1, X2, ...) is typically unknown and not a function of simple form.

In this course, we will assume the function is linear; that is, if we have p explanatory variables, then

f(X1, X2, ..., Xp) = β0 + β1 X1 + β2 X2 + ... + βp Xp,

where β0, β1, ..., βp are coefficients which are typically unknown parameters; their estimation based on data is a main topic of the course.

Why do we assume f is a linear function?


The reasons we assume f is a linear function:

Since f can be a very general function, it is hopeless to find its exact form based on a limited sample. Hence, an approximation to f is desirable. Actually, the linear function is the first-order approximation to any smooth function over a range of X1, X2, ... that is not too large.

Although there are other, better approximations, computation is a very important consideration, especially in the precomputer age of statistics. It is much easier to perform analysis with linear models than with other models.


Even in today's computer era there are still good reasons to study and use them. They are simple and often provide an adequate and interpretable description of how the inputs affect the output.

    For prediction purposes they can sometimes outperform fancier

    nonlinear models, especially in situations with small numbers of

    training cases, low signal-to-noise ratio or sparse data.

    Finally, linear methods can be applied to transformations of the

    inputs and this considerably expands their scope.
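In R, such linear models are fit with lm(); the second call below applies the model to a transformed input, illustrating the last point (the iris variables are used here only as an example):

```r
# a simple linear regression: Petal.Length on Sepal.Length
fit1 = lm(Petal.Length ~ Sepal.Length, data = iris)
# the same method applied to a transformed input (a quadratic term)
fit2 = lm(Petal.Length ~ Sepal.Length + I(Sepal.Length^2), data = iris)
coef(fit1)  # the estimated coefficients beta0 and beta1
```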
