Probability and Statistical Inference 7


  • 8/12/2019 Probability and Statistical Inference 7

    PROBABILITY & STATISTICAL INFERENCE LECTURE 9
    MSc in Computing (Data Analytics)

    Lecture Outline

    - ANOVA versus Regression
    - Correlations
    - Simple Linear Regression
    - Multiple Regression
    - Section Takeaways

    Response      Factor        Type of Analysis
    Continuous    Categorical   T-test/ANOVA
    Continuous    Continuous    Simple Linear Regression

    ANOVA vs Simple Linear Regression

    Scatter Plot

    A scatter plot is a type of chart using Cartesian coordinates to display values for two continuous variables for a set of data.

    [Figure: scatter plot of Y against x]

    Describing Linear Relationships

    - Correlation: we can quantify the relationship between two variables with correlation statistics
    - Two variables are correlated if there is a linear relationship between them
    - We can further classify correlated variables according to the type of correlation:
      - Positive: one variable tends to increase in value as the other increases in value
      - Negative: one variable tends to decrease in value as the other increases in value
      - Zero: no linear relationship between the two variables (uncorrelated)


    Pearson Correlation Coefficient

    How to Calculate Correlation?

    - The correlation coefficient between two samples x1, x2, x3, ..., xn and y1, y2, y3, ..., yn is calculated with the following formula:

      $r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$
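    The Pearson correlation coefficient divides the summed products of the deviations from the two sample means by the square root of the product of the summed squared deviations. A minimal sketch of that calculation follows; the function name `pearson_r` and the toy samples are hypothetical and are not taken from the lecture.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-products and sums of squares of deviations from the means
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical toy samples (not from the Team Ireland study)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
r_pos = pearson_r(xs, [2.0, 4.0, 6.0, 8.0, 10.0])  # y rises with x
r_neg = pearson_r(xs, [10.0, 8.0, 6.0, 4.0, 2.0])  # y falls as x rises
print(r_pos, r_neg)
```

    A perfectly increasing straight line gives a coefficient of 1 and a perfectly decreasing one gives -1, matching the positive and negative cases described earlier.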


    Caution Using Correlation

    Four sets of data with the same correlation coefficient of 0.816

    Example: The Fast Mile Test

    - You have been tasked by Team Ireland to analyse data from a study conducted to investigate how fast athletes' bodies can absorb and use up oxygen
    - The results of this study will be used to help trainers devise custom regimes for their athletes
    - A dataset has been gathered from 31 athletes, each of whom performed a fast-mile test for which their maximum pulse rate, rest pulse rate, run pulse rate, run time and oxygen consumption were measured

    Example: The Fast Mile Test

    Oxygen Consumption  Gender  Age  Weight  Runtime  Rest Pulse  Run Pulse  Max Pulse
    44.609              Male    44   89.47   11.37    62          178        182
    45.313              Male    40   75.07   10.07    62          185        185
    54.297              Female  44   85.84   8.65     45          156        168
    59.571              Male    42   68.15   8.17     40          166        172
    49.874              Female  38   89.02   9.22     55          178        180
    44.811              Female  47   77.45   11.63    58          176        176
    45.681              Male    40   75.98   11.95    70          176        180
    49.091              Male    43   81.19   10.85    64          162        170
    39.442              Female  44   81.42   13.08    63          174        176

    Example: Runtime vs Oxygen Consumption


    Demo

    Regression Model

    Can we capture the relationship between two variables in the scatter plot?

    [Figure: scatter plot of Y against x]

    Regression Model

    - Based on the scatter plot, it is probably reasonable to assume that the random variable Y is related to x by a straight line

    Regression Model

    [Figure: regression line of Y against x, with the slope $\beta_1$ marked against a one unit change in x]

    Simple Linear Regression

    - The case of simple linear regression considers a single regressor (or predictor), x, and a dependent (or response) variable, Y
    - The expected value of Y at each level of x is assumed to be a linear function of x:

      $E(Y \mid x) = \beta_0 + \beta_1 x$

    - We assume that each observation, Y, can be described by the model:

      $Y = \beta_0 + \beta_1 x + \varepsilon$

      where $\varepsilon$ is a random error term

    Simple Linear Regression

    - Suppose that we have n pairs of observations (x1, y1), (x2, y2), ..., (xn, yn)
    - The method of least squares is used to estimate the parameters, $\beta_0$ and $\beta_1$, by minimizing the sum of the squares of the vertical deviations of the data from the estimated regression model

    [Figure: observed values (y) and the estimated regression line, with the vertical deviations between them]
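    As a rough illustration of least squares, the sketch below fits a straight line to Runtime (x) and Oxygen Consumption (y) using only the nine athletes listed in the Fast Mile Test table, so its estimates will not match a fit to the full 31-athlete dataset. The function name `least_squares_line` is hypothetical; the closed-form estimates used are the standard ones, slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x).

```python
# Nine rows from the Fast Mile Test table (the full study has 31 athletes)
runtime = [11.37, 10.07, 8.65, 8.17, 9.22, 11.63, 11.95, 10.85, 13.08]
oxygen = [44.609, 45.313, 54.297, 59.571, 49.874, 44.811, 45.681, 49.091, 39.442]


def least_squares_line(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared vertical deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def sum_squared_deviations(intercept, slope, xs, ys):
    """Sum of squared vertical deviations of the data from a given line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


b0, b1 = least_squares_line(runtime, oxygen)
sse = sum_squared_deviations(b0, b1, runtime, oxygen)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}, SSE = {sse:.3f}")
```

    Any other line, for example one with a slightly different slope, gives a larger sum of squared deviations on the same data, which is what "least squares" means.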

    Example: Oxygen Consumption vs Runtime for Team Ireland

    Can we capture the relationship between Oxygen Consumption and Runtime in the Team Ireland fitness study?

    Example: Oxygen Consumption vs Runtime for Team Ireland (Regression Model)

    Yes, using a simple linear regression model, where Y is the Oxygen Consumption and x is the Runtime for an athlete

    Model Assumptions

    - Fitting a regression model requires several assumptions:
      - Errors are uncorrelated random variables with zero mean
      - Errors have constant variance
      - Errors are normally distributed
    - The analyst should always consider the validity of these assumptions to be doubtful and conduct analyses to examine the adequacy of the model

    Testing Assumptions: Residual Analysis

    - The residuals from a regression model are:

      $e_i = y_i - \hat{y}_i, \quad i = 1, 2, \ldots, n$

      where $y_i$ is an actual observation and $\hat{y}_i$ is the corresponding fitted value from the regression model
    - Analysis of the residuals is frequently helpful in checking the assumption that the errors are approximately normally distributed with constant variance, and in determining whether additional terms in the model would be useful
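    To make the residual definition concrete, the sketch below computes each residual as the observed value minus the fitted value for a straight-line fit of Oxygen Consumption on Runtime. It uses only the nine athletes listed in the Fast Mile Test table, so it is illustrative rather than a reproduction of the lecture's residual plot; the variable names are assumptions.

```python
# Nine rows from the Fast Mile Test table (the full study has 31 athletes)
runtime = [11.37, 10.07, 8.65, 8.17, 9.22, 11.63, 11.95, 10.85, 13.08]
oxygen = [44.609, 45.313, 54.297, 59.571, 49.874, 44.811, 45.681, 49.091, 39.442]

n = len(runtime)
x_bar = sum(runtime) / n
y_bar = sum(oxygen) / n

# Least-squares estimates of the slope and intercept
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(runtime, oxygen))
      / sum((x - x_bar) ** 2 for x in runtime))
b0 = y_bar - b1 * x_bar

# Residual = actual observation minus corresponding fitted value
fitted = [b0 + b1 * x for x in runtime]
residuals = [y - y_hat for y, y_hat in zip(oxygen, fitted)]

# A text stand-in for a residual plot: residuals listed in order of runtime
for x, e in sorted(zip(runtime, residuals)):
    print(f"runtime {x:6.2f}   residual {e:+7.3f}")
```

    For a least-squares line with an intercept the residuals sum to zero, so a residual plot is read for its pattern (funnel, bow, curvature) rather than its overall level.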

    Interpreting Residual Plots

    [Figure: four residual plots of $e_i$ around 0, with panels labelled Satisfactory, Funnel, Double Bow and Non-linear]

    Example: Oxygen Consumption vs Runtime for Team Ireland (Residual Plot)

    What do we think?

    Adequacy of the Regression Model

    - The quantity:

      $R^2 = SS_M / SS_T$

      is called the coefficient of determination and is often used to judge the adequacy of a regression model ($0 \le R^2 \le 1$)
    - We often refer (loosely) to $R^2$ as the amount of variability in the data explained or accounted for by the regression model

    Example: Oxygen Consumption vs Runtime for Team Ireland ($R^2$)

    - For the oxygen consumption regression model:

      $R^2 = SS_M / SS_T = 632.9 / 851.38 = 0.7434$

    - Thus, the model accounts for 74.34% of the variability in the data
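    The arithmetic on this slide can be checked in a couple of lines. The two sums of squares are the values quoted in the lecture; the error sum of squares is derived here from the standard identity SST = SSM + SSE and is not a figure stated in the slides.

```python
ss_m = 632.9    # model sum of squares, as quoted in the lecture
ss_t = 851.38   # total sum of squares, as quoted in the lecture

r_squared = ss_m / ss_t
ss_e = ss_t - ss_m  # error sum of squares implied by SST = SSM + SSE

print(f"R^2 = {r_squared:.4f}")
print(f"share of variability not explained = {ss_e / ss_t:.4f}")
```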

    Adjusted R-squared Value

    - The Adjusted R-squared Value is calculated as follows:

      $R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$

      where n is the number of observations and p is the number of factors in the model
    - The figure is adjusted to take into consideration the number of factors in the model
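    A common textbook form of the adjustment is 1 - (1 - R^2)(n - 1) / (n - p - 1), with n observations and p factors. The sketch below applies that form to the lecture's R-squared for the runtime model, assuming n = 31 athletes and p = 1 factor; the function name is hypothetical and the resulting value is computed here, not quoted from the slides.

```python
def adjusted_r_squared(r_squared, n_obs, n_factors):
    """Adjusted R-squared for a model with an intercept and n_factors regressors."""
    dof = n_obs - n_factors - 1
    if dof <= 0:
        raise ValueError("need more observations than parameters")
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / dof


r_squared = 632.9 / 851.38          # R-squared from the lecture's sums of squares
adj = adjusted_r_squared(r_squared, n_obs=31, n_factors=1)
print(f"R^2 = {r_squared:.4f}, adjusted R^2 = {adj:.4f}")

# Adding factors without improving the fit lowers the adjusted value
adj_three = adjusted_r_squared(r_squared, n_obs=31, n_factors=3)
print(f"same R^2 with three factors: adjusted R^2 = {adj_three:.4f}")
```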


    Demo

    Multiple Regression Models

    - Many applications of regression analysis involve situations in which there is more than one regressor variable
    - A regression model that contains more than one regressor variable is called a multiple regression model

    - The multiple linear regression ...

    Example: Oxygen Consumption vs Runtime for Team Ireland (Regression Model)

    - For example, suppose that we want to test the effect of both age and runtime on oxygen consumption in the Team Ireland example:

      $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$

      where:
        Y: Oxygen Consumption
        x1: Runtime
        x2: Age
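    A model with Runtime and Age as regressors can be fitted by least squares on a design matrix with a column of ones for the intercept. The sketch below does this with NumPy's `numpy.linalg.lstsq` on the nine athletes listed in the Fast Mile Test table only, so the coefficients are illustrative and will differ from a fit to all 31 athletes.

```python
import numpy as np

# Nine rows from the Fast Mile Test table (the full study has 31 athletes)
runtime = np.array([11.37, 10.07, 8.65, 8.17, 9.22, 11.63, 11.95, 10.85, 13.08])
age = np.array([44, 40, 44, 42, 38, 47, 40, 43, 44], dtype=float)
oxygen = np.array([44.609, 45.313, 54.297, 59.571, 49.874,
                   44.811, 45.681, 49.091, 39.442])

# Design matrix: intercept column, x1 = Runtime, x2 = Age
X = np.column_stack([np.ones_like(runtime), runtime, age])
coef, *_ = np.linalg.lstsq(X, oxygen, rcond=None)
b0, b1, b2 = coef

fitted = X @ coef
residuals = oxygen - fitted
ss_e = float(residuals @ residuals)
ss_t = float(((oxygen - oxygen.mean()) ** 2).sum())
r_squared = 1.0 - ss_e / ss_t

print(f"b0 = {b0:.3f}, b1 (Runtime) = {b1:.3f}, b2 (Age) = {b2:.3f}")
print(f"R^2 = {r_squared:.4f}")
```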

    Example: Oxygen Consumption vs Runtime for Team Ireland (Regression Model)

    This is a 3D scatterplot of Oxygen Consumption versus Runtime and Age

    Example: Oxygen Consumption vs Runtime for Team Ireland (Regression Model)

    The re...


    Demo

    Regression & Variable Selection

    - How do we select the best variables for use in a regression model?
    - Perform a search to see which variables are the most effective
    - Three search schemes:
      - Forward sequential selection
      - Backward sequential selection
      - Stepwise sequential selection
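    The forward scheme can be sketched as a greedy loop that starts with no variables and adds one at a time. Note the swap: the lecture's scheme admits a variable when its p-value beats an entry cutoff, whereas this sketch, to stay dependency-light, adds the variable that most improves adjusted R-squared and stops when nothing improves it. The function names are hypothetical and the data are the nine athletes from the Fast Mile Test table only.

```python
import numpy as np


def adjusted_r2(X, y):
    """Adjusted R-squared of a least-squares fit of y on the columns of X plus an intercept."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    ss_e = float(resid @ resid)
    ss_t = float(((y - y.mean()) ** 2).sum())
    return 1.0 - (ss_e / ss_t) * (n - 1) / (n - p - 1)


def forward_select(X, y, names):
    """Greedy forward selection; criterion is adjusted R-squared (an assumption, not p-values)."""
    selected, best = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = {j: adjusted_r2(X[:, selected + [j]], y) for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best:
            break  # no candidate improves the criterion, so stop
        best = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return [names[j] for j in selected]


# Nine rows from the Fast Mile Test table (the full study has 31 athletes)
runtime = [11.37, 10.07, 8.65, 8.17, 9.22, 11.63, 11.95, 10.85, 13.08]
age = [44, 40, 44, 42, 38, 47, 40, 43, 44]
weight = [89.47, 75.07, 85.84, 68.15, 89.02, 77.45, 75.98, 81.19, 81.42]
oxygen = np.array([44.609, 45.313, 54.297, 59.571, 49.874,
                   44.811, 45.681, 49.091, 39.442])

names = ["Runtime", "Age", "Weight"]
X = np.column_stack([runtime, age, weight]).astype(float)
chosen = forward_select(X, oxygen, names)
print("entry order:", chosen)
```

    Backward selection would run the same loop in reverse, starting from all variables and dropping the weakest; stepwise selection alternates the two checks.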

    Sequential Selection: Forward

    [Figure sequence: input p-values compared against the entry cutoff]

    Sequential Selection: Backward

    [Figure sequence: input p-values compared against the stay cutoff]

    Sequential Selection: Stepwise

    [Figure sequence: input p-values compared against the entry cutoff and the stay cutoff]


    Demo

    Multi-Collinearity

    - Multi-collinearity exists when two or more independent variables used in a regression are correlated with each other
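    One simple first check for multi-collinearity is to look at the pairwise correlations between the candidate regressors themselves. The sketch below does this with NumPy's `numpy.corrcoef` for Runtime, Age and Weight, using only the nine athletes listed in the Fast Mile Test table; it is a screening aid under those assumptions, not a full diagnostic.

```python
import numpy as np

# Nine rows from the Fast Mile Test table (the full study has 31 athletes)
runtime = [11.37, 10.07, 8.65, 8.17, 9.22, 11.63, 11.95, 10.85, 13.08]
age = [44, 40, 44, 42, 38, 47, 40, 43, 44]
weight = [89.47, 75.07, 85.84, 68.15, 89.02, 77.45, 75.98, 81.19, 81.42]

names = ["Runtime", "Age", "Weight"]
# Each row of the stacked array is one variable, as numpy.corrcoef expects
corr = np.corrcoef(np.vstack([runtime, age, weight]))

for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f"corr({names[i]}, {names[j]}) = {corr[i, j]:+.3f}")
```

    Pairs of regressors with a correlation close to +1 or -1 carry largely the same information, which makes their individual coefficients unstable.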


    Demo

    Regression Bits and Pieces

    - Polynomial Regression
    - Logistic Regression
    - Categorical Factors in Regression

    Polynomial Regression

    - Polynomial regression models are widely used when the response is curvilinear
    - The general principles of multiple regression will apply
    - The second-degree polynomial in one variable is:

      $Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$
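    Because a second-degree polynomial is linear in its coefficients, it can be fitted as a multiple regression with x and x squared as the two regressors. The sketch below shows this on hypothetical, noise-free data generated from known coefficients, so the fit should recover them; none of the numbers come from the lecture.

```python
import numpy as np

# Hypothetical noise-free data from y = 1 + 2x + 0.5x^2
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = 1.0 + 2.0 * x + 0.5 * x ** 2

# Treat x and x^2 as two regressors in an ordinary multiple regression
X = np.column_stack([np.ones_like(x), x, x ** 2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print("intercept, linear, quadratic:", np.round(coef, 6))
```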


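    Logistic Regression

    - The equation for a logistic regression model is:

      $\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}$

    - Choose intercept and parameter estimates to maximize:

      $\sum_{y_i = 1} \log(p_i) + \sum_{y_i = 0} \log(1 - p_i)$

    - This function is known as the log-likelihood function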
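    The log-likelihood adds log(p) for each case where the outcome occurred and log(1 - p) for each case where it did not, with p given by the logistic function of the linear predictor. The sketch below evaluates it for a single regressor; the data, the parameter values and the function name are hypothetical, and no maximization is performed.

```python
import math


def log_likelihood(beta0, beta1, xs, ys):
    """Log-likelihood of a one-regressor logistic model for 0/1 outcomes ys."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))  # predicted probability
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


# Hypothetical data: the outcome becomes less likely as x grows
xs = [8.0, 9.0, 10.0, 11.0, 12.0, 13.0]
ys = [1, 1, 0, 1, 0, 0]

ll_null = log_likelihood(0.0, 0.0, xs, ys)    # every p equals 0.5
ll_trend = log_likelihood(10.5, -1.0, xs, ys) # p falls as x increases
print(f"null: {ll_null:.3f}, with trend: {ll_trend:.3f}")
```

    Fitting the model means searching for the intercept and slope that make this value as large as possible; here the parameters that follow the trend in the data score higher than the no-information choice.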

    Categorical Factors in Regression

    - Many problems may involve categorical variables
    - The usual method of accounting for the different levels of a qualitative variable is to use indicator variables
    - For example, to introduce the variable gender into the model, we could define an indicator variable as follows:
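    One possible coding, assumed here because the lecture's own definition is not preserved in this transcript, is 1 for Female and 0 for Male. The sketch below builds that indicator for the nine athletes listed in the Fast Mile Test table and regresses Oxygen Consumption on it; with this coding the intercept is the mean for the 0 group and the coefficient is the difference between the two group means.

```python
import numpy as np

# Nine rows from the Fast Mile Test table (the full study has 31 athletes)
gender = ["Male", "Male", "Female", "Male", "Female",
          "Female", "Male", "Male", "Female"]
oxygen = np.array([44.609, 45.313, 54.297, 59.571, 49.874,
                   44.811, 45.681, 49.091, 39.442])

# Assumed coding: 1 = Female, 0 = Male (the reverse coding works equally well)
is_female = np.array([1.0 if g == "Female" else 0.0 for g in gender])

X = np.column_stack([np.ones_like(is_female), is_female])
coef, *_ = np.linalg.lstsq(X, oxygen, rcond=None)
b0, b1 = coef

print(f"intercept (Male mean) = {b0:.3f}")
print(f"indicator coefficient (Female mean - Male mean) = {b1:.3f}")
```

    This is the same comparison a two-sample t-test makes, which is one way of seeing the link between ANOVA and regression noted at the start of the lecture.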

    Section Takeaways

    - Regression models allow us to model the relationships between variables
    - Regression models can be used to evaluate the variation between variables but are also excellent to use as prediction models