
Regression analysis

Relating two data matrices/tables to each other

Purpose: prediction and interpretation

Y-data X-data

Typical examples

• Spectroscopy: Predict chemistry from spectral measurements

• Product development: Relating sensory to chemistry data

• Marketing: Relating sensory data to consumer preferences

Topics covered

• Simple linear regression

• The selectivity problem: a reason why multivariate methods are needed

• The collinearity problem: a reason why data compression is needed

• The outlier problem: why and how to detect

Simple linear regression

• One y and one x. Use x to predict y.

• Use a linear model/equation and fit it by least squares

Data structure

[Figure: one Y-variable column and one X-variable column; objects, with the same number in the x- and y-columns.]

Model: y = b0 + b1x + e

Least squares (LS) is used for estimation of the regression coefficients b0 and b1.
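A minimal sketch of the least-squares fit in Python (not from the slides; the numbers are made-up illustration data):

```python
import numpy as np

# Hypothetical data: one x-column and one y-column, same number of objects.
x = np.array([2.0, 4.0, 1.0, 7.0, 6.0, 8.0])
y = np.array([2.4, 4.1, 1.5, 6.8, 6.1, 7.9])

# Least-squares estimates for the model y = b0 + b1*x + e
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x   # fitted values
e = y - y_hat         # residuals
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```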

Simple linear regression

[Workflow diagram: Data (X, Y) and the model enter the regression analysis; the fitted model is used for interpretation and, together with future X, for prediction. Outlier checks and pre-processing feed into the analysis.]

The selectivity problem: a reason why multivariate methods are needed

Can be used for several Y’s also

Multiple linear regression

• Provides
– predicted values
– regression coefficients
– diagnostics

• If there are many highly collinear variables
– unstable regression equations
– difficult to interpret coefficients: many and unstable

y = b0 + b1x1 + b2x2 + e

The two x's have high correlation, which leads to an unstable equation/plane (in the direction with little variability).

Collinearity: the problem of correlated X-variables. Regression in this case is fitting a plane to the data (open circles).
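A small synthetic demonstration of this instability (not from the slides), assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30

# Two highly correlated x-variables: x2 is x1 plus tiny noise,
# so there is almost no variability orthogonal to the x1 = x2 line.
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + 0.1 * rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])

# Refit after small perturbations of y: the individual coefficients
# jump around, even though the fitted plane predicts almost the same.
for seed in (1, 2, 3):
    y_pert = y + 0.1 * np.random.default_rng(seed).normal(size=n)
    b, *_ = np.linalg.lstsq(X, y_pert, rcond=None)
    print("b0, b1, b2 =", np.round(b, 2))
```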

Possible solutions

• Select the most important wavelengths/variables (stepwise methods)

• Compress the variables to the most dominating dimensions (PCR, PLS)

• We will concentrate on the latter (the two approaches can be combined)

Data compression

• We will first discuss the situation with one y-variable

• Focus on ideas and principles

• Provides regression equation (as above) and plots for interpretation

Model for data compression methods

X = TPᵀ + E

y = Tq + f

T – scores, carriers of information from X to y
P, q – loadings
E, f – residuals (noise)

X and y are centred.
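A sketch of fitting this model by PCA followed by regression on the scores (i.e. PCR); the function names here are made up, and the SVD is one standard way to obtain T and P:

```python
import numpy as np

def pcr_fit(X, y, A):
    """Fit X = T P' + E and y = T q + f with A components (centred data)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean

    # PCA via SVD: scores T = U*S, loadings P = columns of V
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = (U * s)[:, :A]        # scores: carriers of information from X to y
    P = Vt[:A].T              # X-loadings

    # Regress y on the orthogonal scores: q = (T'T)^-1 T'y = T'y / s^2
    q = T.T @ yc / s[:A] ** 2
    return x_mean, y_mean, P, q

def pcr_predict(model, X_new):
    x_mean, y_mean, P, q = model
    T_new = (X_new - x_mean) @ P   # project new objects onto the loadings
    return y_mean + T_new @ q
```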

Regression by data compression

[Figure: PCA is used to compress the data; y is then regressed on the t-scores (regression on scores, with slope q along PC1). Schematic comparison of MLR, PCR and PLS: MLR relates y directly to x1–x4, while PCR and PLS first compress x1–x4 into scores t1, t2 and then regress y on the scores.]

PCR and PLS

For each factor/component:

• PCR: maximize variance of linear combinations of X

• PLS: maximize covariance between linear combinations of X and y

Each factor is subtracted before the next is computed.
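A compact sketch of a NIPALS-style PLS1 algorithm (one common formulation, not taken from the slides) that shows both points: each weight vector is proportional to the covariance of X with y, and the factor is deflated before the next one is computed:

```python
import numpy as np

def pls1(X, y, A):
    """PLS1 with A factors on centred data; returns weights, loadings, y-loadings."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    W, P, q = [], [], []
    for _ in range(A):
        w = Xc.T @ yc                # direction maximizing covariance with y
        w /= np.linalg.norm(w)
        t = Xc @ w                   # score vector for this factor
        p = Xc.T @ t / (t @ t)       # X-loading
        qa = yc @ t / (t @ t)        # y-loading
        Xc -= np.outer(t, p)         # deflate X: subtract the factor
        yc -= qa * t                 # deflate y
        W.append(w); P.append(p); q.append(qa)
    return np.array(W).T, np.array(P).T, np.array(q)
```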

Principal component regression (PCR)

• Uses principal components

• Solves the collinearity problem, stable solutions

• Provides plots for interpretation (scores and loadings)

• Well understood

• Outlier diagnostics

• Easy to modify

• But uses only X to determine components

PLS-regression

• Easy to compute

• Stable solutions

• Provides scores and loadings

• Often needs fewer components than PCR

• Sometimes better predictions

PCR and PLS for several Y-variables

• PCR is computed for each Y. Each Y is regressed onto the principal components

• PLS: The algorithm is easily modified. Maximises covariance between linear combinations of X and Y.

• For both methods: Regression equations and plots
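As an illustration with scikit-learn (assuming its PLSRegression, PCA and LinearRegression; the random data is purely synthetic):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))
Y = rng.normal(size=(40, 3))      # several Y-variables

# PLS2: one model handles all Y-columns jointly
pls = PLSRegression(n_components=3).fit(X, Y)
Y_hat_pls = pls.predict(X)

# PCR: regress every Y onto the same principal components
T = PCA(n_components=3).fit_transform(X)
Y_hat_pcr = LinearRegression().fit(T, Y).predict(T)
```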

Validation is important

• Measure quality of the predictor

• Determine A – number of components

• Compare methods

Prediction testing

Calibration: estimate the coefficients

Testing/validation: predict y using the coefficients

Cross-validation

Calibrate, find y = f(x), estimate the coefficients

Predict y using the coefficients

Validation

• Compute RMSEP:

RMSEP = sqrt( sum_{i=1}^{N} (ŷ_i − y_i)² / N )

• Plot RMSEP versus the number of components

• Choose the number of components with the best RMSEP properties

• Compare for different methods
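A sketch of computing a cross-validated RMSEP curve with scikit-learn (the helper name rmsep_curve is made up; PLS is used as the example method):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

def rmsep_curve(X, y, max_components=10, cv=10):
    """Cross-validated RMSEP for 1..max_components PLS factors."""
    rmsep = []
    for a in range(1, max_components + 1):
        y_hat = cross_val_predict(PLSRegression(n_components=a), X, y, cv=cv)
        rmsep.append(np.sqrt(np.mean((y - y_hat.ravel()) ** 2)))
    return np.array(rmsep)

# Choose the number of components A at (or near) the minimum of the curve.
```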

[Figure: NIR calibration of protein in wheat with MLR; 6 NIR wavelengths, 12 calibration samples, 26 test samples.]

[Figure: conceptual illustration of important phenomena: estimation error and model error.]

Prediction vs. cross-validation

• Prediction testing: measures the prediction ability of the predictor at hand. Requires a lot of data.

• Cross-validation: measures a property of the method. Better for smaller data sets.

Validation

• One should also plot measured versus predicted y-value

• Correlation can be computed, but can sometimes be misleading

[Figure: plot of measured and predicted protein (NIR calibration); example of plotting y versus predicted y.]
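A minimal plotting sketch (assuming matplotlib; the helper name is made up):

```python
import matplotlib.pyplot as plt

def plot_measured_vs_predicted(y, y_hat):
    """Scatter measured y against predicted y, with the ideal 45-degree line."""
    fig, ax = plt.subplots()
    ax.scatter(y, y_hat)
    lims = [min(y.min(), y_hat.min()), max(y.max(), y_hat.max())]
    ax.plot(lims, lims, linestyle="--")   # perfect-prediction line
    ax.set_xlabel("Measured y")
    ax.set_ylabel("Predicted y")
    return ax
```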

Outlier detection

• Instrument error or noise

• Drift of signal (over time)

• Misprints

• Samples outside normal range (different population)

Outlier detection

• Outliers can be detected because we have

– a model for the spectral data (X = TPᵀ + E)

– a model for the relationship between X and y (y = Tq + f)

Outlier detection tools

• Residuals
– X- and y-residuals
– X-residuals as before; the y-residual is the difference between measured and predicted y

• Leverage
– h_i
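A sketch of these two diagnostics computed from the scores T of a centred model (the helper name is made up; X-residuals follow from X − TPᵀ as before):

```python
import numpy as np

def outlier_diagnostics(T, y, y_hat):
    """Leverage h_i and y-residuals f_i from centred-model scores T."""
    N = T.shape[0]
    # Leverage: h_i = 1/N + sum_a t_ia^2 / (t_a' t_a)
    h = 1.0 / N + np.sum(T ** 2 / np.sum(T ** 2, axis=0), axis=1)
    f = y - y_hat            # y-residual: measured minus predicted y
    return h, f

# Objects with large leverage and/or large residuals are outlier candidates.
```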