Regression analysis
Relating two data matrices/tables to each other
Purpose: prediction and interpretation
Y-data X-data
Typical examples
• Spectroscopy: Predict chemistry from spectral measurements
• Product development: Relating sensory to chemistry data
• Marketing: Relating sensory data to consumer preferences
Topics covered
• Simple linear regression
• The selectivity problem: a reason why multivariate methods are needed
• The collinearity problem: a reason why data compression is needed
• The outlier problem: why and how to detect
Simple linear regression
• One y and one x. Use x to predict y.
• Use a linear model/equation and fit it by least squares
y = b0 + b1x + e
[Figure: plot of y against x with the fitted line; b0 is the intercept and b1 the slope]
Least squares (LS) used for estimation of regression coefficients
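A minimal sketch of the least-squares fit for one x and one y, using the textbook closed-form estimates. The function name `fit_line` and the data values are made up for illustration and are not from the slides.

```python
def fit_line(x, y):
    """Least-squares estimates (b0, b1) for the model y = b0 + b1*x + e."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means
    b0 = y_mean - b1 * x_mean
    return b0, b1


# Illustration data lying exactly on y = 1 + 2x (hypothetical values)
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
b0, b1 = fit_line(x, y)

# Predict y for a future x
y_new = b0 + b1 * 4.0
```

With noise-free data the estimates reproduce the generating line; with real data the residual e absorbs what the line cannot explain.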
Simple linear regression
Model
Data (X,Y)
Regression analysis
Future X Prediction
Regression analysis
Outliers? Pre-processing
Interpretation
Multiple linear regression
• Provides
– predicted values
– regression coefficients
– diagnostics
• If there are many highly collinear variables
– unstable regression equations
– difficult to interpret coefficients: many and unstable
y=b0+b1x1+b2x2+e
The two x’s have high correlation
Leads to unstable equation/plane (in the direction with little variability)
Collinearity, the problem of correlated X-variables
Regression in this case is fitting a plane to the data (open circles)
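The instability can be illustrated with a small simulation: fit y = b0 + b1x1 + b2x2 + e many times with fresh noise in y and look at how much the estimated coefficients vary. This is only a sketch on simulated data; the helper `coef_spread`, the noise level and the sample size are assumptions chosen for illustration.

```python
import numpy as np


def coef_spread(x1, x2, n_rep=200, noise=0.1, seed=0):
    """Standard deviation of the LS estimates of (b1, b2) over repeated noisy y."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones_like(x1), x1, x2])
    y_true = 1.0 + x1 + x2
    coefs = []
    for _ in range(n_rep):
        y = y_true + noise * rng.standard_normal(len(x1))
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        coefs.append(b[1:])          # keep b1 and b2, drop the intercept
    return np.std(coefs, axis=0)


rng = np.random.default_rng(1)
x1 = rng.standard_normal(30)
x2_collinear = x1 + 0.01 * rng.standard_normal(30)   # correlation close to 1
x2_independent = rng.standard_normal(30)             # unrelated to x1

spread_collinear = coef_spread(x1, x2_collinear)
spread_independent = coef_spread(x1, x2_independent)
```

In the collinear case the data hardly vary in the direction x1 - x2, so the plane can tilt freely in that direction and the spread of b1 and b2 becomes much larger than in the independent case.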
Possible solutions
• Select the most important wavelengths/variables (stepwise methods)
• Compress the variables to the most dominating dimensions (PCR, PLS)
• We will concentrate on the latter (can be combined)
Data compression
• We will first discuss the situation with one y-variable
• Focus on ideas and principles
• Provides regression equation (as above) and plots for interpretation
Model for data compression methods
$X = TP^{T} + E$
$y = Tq + f$
T: scores, carrier of information from X to y
P, q: loadings
E, f: residuals (noise)
Centred X and y
PCR and PLS
For each factor/component
• PCR
– Maximize variance of linear combinations of X
• PLS
– Maximize covariance between linear combinations of X and y
Each factor is subtracted before the next is computed
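A sketch of the two criteria for the first component, on simulated data. It relies on two standard results: the unit vector maximising the variance of the linear combination is the first right singular vector of centred X, and the unit vector maximising the covariance with y is the normalised vector X^T y. The data and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
# Hypothetical data: x-variable 0 has large variance, but y is built from x-variable 1
X = rng.standard_normal((n, 2)) * np.array([5.0, 1.0])
y = X[:, 1] + 0.1 * rng.standard_normal(n)

# Both methods work on centred X and y
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# PCR criterion: unit vector w maximising the variance of Xc @ w
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
w_pcr = Vt[0]

# PLS criterion: unit vector w maximising the covariance between Xc @ w and yc
w_pls = Xc.T @ yc
w_pls /= np.linalg.norm(w_pls)

var_pcr = np.var(Xc @ w_pcr)
var_pls = np.var(Xc @ w_pls)
cov_pcr = abs((Xc @ w_pcr) @ yc) / (n - 1)
cov_pls = abs((Xc @ w_pls) @ yc) / (n - 1)
```

By construction `var_pcr` is at least as large as `var_pls`, and `cov_pls` at least as large as `cov_pcr`: the PCR direction ignores y, while the PLS direction trades some X-variance for relevance to y.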
Principal component regression (PCR)
• Uses principal components
• Solves the collinearity problem, stable solutions
• Provides plots for interpretation (scores and loadings)
• Well understood
• Outlier diagnostics
• Easy to modify
• But uses only X to determine components
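One possible way to compute PCR, sketched with a singular value decomposition of centred X. The function name `pcr_fit` and the simulated data are assumptions; the slides do not prescribe an implementation.

```python
import numpy as np


def pcr_fit(X, y, n_components):
    """Principal component regression; returns (b0, b) so that y is predicted by b0 + X @ b."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Principal components of centred X (computed from X only, y is not used here)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                      # loadings, one column per component
    T = Xc @ P                                   # scores
    # Scores are orthogonal with T^T T = diag(s^2), so LS regression of y on T is simple
    q = (T.T @ yc) / (s[:n_components] ** 2)
    b = P @ q                                    # coefficients for the original x-variables
    b0 = y_mean - x_mean @ b
    return b0, b


# Simulated illustration data (hypothetical)
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = 2.0 + X @ np.array([1.0, -1.0, 0.5])

b0_full, b_full = pcr_fit(X, y, n_components=3)   # all components: same as ordinary LS
b0_two, b_two = pcr_fit(X, y, n_components=2)     # compressed solution
```

Using all components reproduces ordinary least squares; the stabilising effect comes from leaving out the components with little variability.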
PLS-regression
• Easy to compute
• Stable solutions
• Provides scores and loadings
• Often fewer components than PCR
• Sometimes better predictions
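A compact sketch of PLS regression for one y-variable, following the common NIPALS-style scheme in which each component is subtracted (deflated) from X and y before the next is computed. The name `pls1_fit` and the simulated data are assumptions for illustration; other equivalent formulations of the algorithm exist.

```python
import numpy as np


def pls1_fit(X, y, n_components):
    """PLS regression with one y; returns (b0, b) so that y is predicted by b0 + X @ b."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean            # centred X and y, deflated as we go
    W, P, q = [], [], []
    for _ in range(n_components):
        w = E.T @ f                          # direction of maximal covariance with y
        w /= np.linalg.norm(w)
        t = E @ w                            # scores
        tt = t @ t
        p = E.T @ t / tt                     # X-loadings
        qa = f @ t / tt                      # y-loading
        E = E - np.outer(t, p)               # subtract the component from X
        f = f - qa * t                       # ... and from y
        W.append(w)
        P.append(p)
        q.append(qa)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)      # coefficients for the original x-variables
    b0 = y_mean - x_mean @ b
    return b0, b


# Simulated illustration data (hypothetical)
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = 2.0 + X @ np.array([1.0, -1.0, 0.5]) + 0.05 * rng.standard_normal(20)

b0_pls, b_pls = pls1_fit(X, y, n_components=3)
```

With as many components as x-variables the result coincides with ordinary least squares; in practice fewer components are kept, and their number is chosen by validation.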
PCR and PLS for several Y-variables
• PCR is computed for each Y. Each Y is regressed onto the principal components
• PLS: The algorithm is easily modified. Maximises covariance between linear combinations of X and Y.
• For both methods: Regression equations and plots
Validation is important
• Measure quality of the predictor
• Determine A – number of components
• Compare methods
Prediction testing
Calibration: Estimate coefficients
Testing/validation: Predict y, use the coefficients
Validation
• Compute RMSEP:
$\mathrm{RMSEP} = \sqrt{\sum_{i=1}^{N} (\hat{y}_i - y_i)^2 / N}$
• Plot RMSEP versus component
• Choose the number of components with best RMSEP properties
• Compare for different methods
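RMSEP is the square root of the mean squared difference between predicted and measured y over the N test samples. A minimal sketch, with made-up numbers; the function name `rmsep` is chosen here for illustration.

```python
import math


def rmsep(y_measured, y_predicted):
    """Root mean square error of prediction over the N test samples."""
    n = len(y_measured)
    squared_errors = ((yp - ym) ** 2 for ym, yp in zip(y_measured, y_predicted))
    return math.sqrt(sum(squared_errors) / n)


# Hypothetical test set: every prediction is off by exactly one unit
y_meas = [10.0, 12.0, 11.0, 13.0]
y_pred = [11.0, 11.0, 12.0, 12.0]
error = rmsep(y_meas, y_pred)
```

RMSEP is expressed in the same units as y, which makes it easy to judge against the required accuracy.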
NIR calibration of protein in wheat. 6 NIR wavelengths, 12 calibration samples, 26 test samples
MLR
Prediction vs. cross-validation
• Prediction testing: Prediction ability of the predictor at hand. Requires much data.
• Cross-validation: Property of the method. Better for smaller data sets.
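A sketch of cross-validation in its leave-one-out form: each sample is left out in turn, the model is fitted on the rest, and the left-out sample is predicted. Leave-one-out is only one of several possible segmentations, and ordinary multiple linear regression is used here as a stand-in for whichever method is being validated. The name `loo_rmsecv` and the data are hypothetical.

```python
import numpy as np


def loo_rmsecv(X, y):
    """Leave-one-out cross-validated root mean square error for an MLR predictor."""
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                       # all samples except sample i
        A = np.column_stack([np.ones(keep.sum()), X[keep]])
        b, *_ = np.linalg.lstsq(A, y[keep], rcond=None)
        y_hat = b[0] + X[i] @ b[1:]                    # predict the left-out sample
        errors[i] = y[i] - y_hat
    return np.sqrt(np.mean(errors ** 2))


# Simulated illustration data (hypothetical)
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 2))
y_exact = 1.0 + X @ np.array([2.0, 3.0])
y_noisy = y_exact + 0.5 * rng.standard_normal(10)

cv_exact = loo_rmsecv(X, y_exact)
cv_noisy = loo_rmsecv(X, y_noisy)
```

Because the coefficients are re-estimated for every segment, the result describes the method rather than one particular fitted predictor, which matches the distinction made above.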
Validation
• One should also plot measured versus predicted y-value
• Correlation can be computed, but can sometimes be misleading
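A small made-up example of how correlation can mislead: predictions that are systematically wrong in offset and slope still correlate perfectly with the measured values. The numbers and the helper `pearson_r` are for illustration only.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    a_mean, b_mean = sum(a) / n, sum(b) / n
    sab = sum((ai - a_mean) * (bi - b_mean) for ai, bi in zip(a, b))
    saa = sum((ai - a_mean) ** 2 for ai in a)
    sbb = sum((bi - b_mean) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)


y_meas = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [2.0 * v + 3.0 for v in y_meas]     # wrong slope and wrong offset

r = pearson_r(y_meas, y_pred)
rmse = math.sqrt(sum((p - m) ** 2 for m, p in zip(y_meas, y_pred)) / len(y_meas))
```

Here the correlation equals 1 although every prediction is far from its measured value, which is why the measured versus predicted plot and RMSEP should be inspected as well.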
Outlier detection
• Instrument error or noise
• Drift of signal (over time)
• Misprints
• Samples outside normal range (different population)
Outlier detection
• Outliers can be detected because we have
– a model for the spectral data ($X = TP^{T} + E$)
– a model for the relationship between X and y ($y = Tq + f$)
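A sketch of how the model for X can be used for outlier detection. The two statistics computed here, the size of each sample's X-residual (its row of E) and its leverage (how extreme its scores are), are commonly used diagnostics but are an assumption on my part; the slides do not name specific statistics. The function `pca_outlier_stats` and the data are hypothetical.

```python
import numpy as np


def pca_outlier_stats(X, n_components):
    """Per-sample X-residual norm and leverage from a principal component model of X."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]     # scores
    P = Vt[:n_components].T                        # loadings
    E = Xc - T @ P.T                               # residuals: the part of X not modelled
    x_resid = np.sqrt((E ** 2).sum(axis=1))
    # Leverage: 1/N plus the squared scores scaled by each component's sum of squares
    leverage = 1.0 / len(X) + (U[:, :n_components] ** 2).sum(axis=1)
    return x_resid, leverage


# Simulated data (hypothetical): samples along one direction, sample 7 pushed off it
rng = np.random.default_rng(0)
t = np.linspace(-5.0, 5.0, 20)
X = np.outer(t, [1.0, 1.0, 1.0]) + 0.01 * rng.standard_normal((20, 3))
X[7] += np.array([1.0, -1.0, 0.0])

x_resid, leverage = pca_outlier_stats(X, n_components=1)
suspect = int(np.argmax(x_resid))                  # sample that fits the X-model worst
```

A large X-residual points to a sample that does not fit the spectral model, while a high leverage points to a sample that is extreme within the model. The y-residuals f from y = Tq + f can be inspected in the same way.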