3 principal components analysis

Post on 10-May-2015

10.856 views 2 download

Tags:

Transcript of 3 principal components analysis

Biology

Chemistry

Informatics

Principal Component Analysis (PCA) of metabolomic sample processing methods

Prin

cipa

l Com

pone

nts

Anal

ysis

Goal:Use PCA to identify the major modes of variance

(Used DATA: Pumpkin data 1.csv)

Topics: 1. Principal component number selection2. Data pretreatment3. PCA results visualization

Biology

Chemistry

Informatics

Principal Components Analysis Pr

inci

pal C

ompo

nent

s An

alys

is

Steps1. Calculate a PCA model2. Select optimal model principal component (PC) 3. Overview PCA scores and loadings plots4. Repeat steps 1-2 using data centering and scaling

Visualize:5. Sample scores annotated by extraction and treatment6. Leverage and DmodX (distance from model plane)7. Variable loadings and biplots

Exercise:8. How many PCs are needed to capture 80% variance for raw data and

scaled data?9. Are their any moderate or extreme outliers?10.What variables contribute most to the variance for raw and scaled data?

Biology

Chemistry

Informatics

PCA Variance Explained(raw data)

Pri

nci

pal

Co

mp

on

ents

An

alys

is

• PCs can be selected to explain a minimum %variance in the data (~80%)

• PCs explaining below 1% variance can be excluded

• q2 is the cross-validated PCA prediction of left out data

Biology

Chemistry

Informatics

PCA Scores (raw data)P

rin

cip

al C

om

po

nen

ts A

nal

ysis

• Hotelling's T2 ellipse shows 95% CI for bivariate normal distribution

• Samples lying outside of the ellipse could be outliers

Biology

Chemistry

Informatics

PCA Loadings (raw data, centered)P

rin

cip

al C

om

po

nen

ts A

nal

ysis

• Unscaled data PCA loadings are highly correlated with magnitude

Biology

Chemistry

Informatics

PCA Biplot (raw data)P

rin

cip

al C

om

po

nen

ts A

nal

ysis

• Biplots can be used to rapidly overview the correlation between sample scores and variable loadings

Sample contains high maleic acid

Sample contains high sucrose (low maleic acid)

Biology

Chemistry

Informatics

PCA Leverage and DmodX(raw data)

Pri

nci

pal

Co

mp

on

ents

An

alys

is

Leverage is the distance to samples center in the PCA plane (extreme outliers)

Distance to model X (DmodX) is the orthogonal distance to the PCA plane (moderate outliers)

Biology

Chemistry

Informatics

PCA Variance Explained(autoscaled)

Pri

nci

pal

Co

mp

on

ents

An

alys

is

Biology

Chemistry

Informatics

PCA Scores (autoscaled)P

rin

cip

al C

om

po

nen

ts A

nal

ysis

• Loadings on PC1 describe differences due to extraction

• Loadings on PC2 describe differences due to treatment

Biology

Chemistry

Informatics

PCA Leverage and DmodX(autoscaled)

Pri

nci

pal

Co

mp

on

ents

An

alys

is

• Samples with both high leverage and DomdX are likely outliers

• Evaluate PCA results after their removal

Biology

Chemistry

Informatics

PCA Loadings (autoscaled)P

rin

cip

al C

om

po

nen

ts A

nal

ysis

• Scaled loadings are independent of variable magnitude and show a rich variance structure of the data

Biology

Chemistry

Informatics

Pri

nci

pal

Co

mp

on

ents

An

alys

isRelationship between scores and loadings (autoscaled)

Higher in 100% MeOH

ExtractionLower in 100% MeOH

Biology

Chemistry

Informatics

Loadings and ScoresP

rin

cip

al C

om

po

nen

ts A

nal

ysis

Highest positive loading on PC1

Highest negative loading on PC1