Principal Component Analysis in R

2018 Ontario Summer School on HPC

Marcelo Ponce

May 2018

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

https://support.scinet.utoronto.ca/education/

Basics

Principal Component Analysis

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.

PCA is mostly used as a tool in exploratory data analysis and for making predictive models.

Unsupervised.

PCA is sensitive to the relative scaling of the original variables.

SVD, dimensionality reduction, ...

Also related to clustering algorithms (k-means), ...

“PCA ≈ fitting an n-dimensional ellipsoid to the data; the principal components are its axes”
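
The definition above can be checked numerically. The following is a minimal sketch, assuming the built-in USArrests data as an illustrative input (the variable names are not from the slides): it verifies that the rotation matrix is orthogonal and that the resulting scores are uncorrelated even though the original variables are not.

```r
# Minimal sketch: PCA as an orthogonal rotation that decorrelates the data.
# USArrests ships with base R; all object names here are illustrative.
x <- scale(USArrests)              # centre and scale each variable
pca <- prcomp(x)

# The rotation matrix is orthogonal: t(R) %*% R should be the identity
R <- pca$rotation
orthogonality_error <- max(abs(crossprod(R) - diag(ncol(R))))

# Largest off-diagonal correlation before and after the transformation
cor_original <- cor(x)
cor_scores <- cor(pca$x)
max_cor_original <- max(abs(cor_original[upper.tri(cor_original)]))
max_cor_scores <- max(abs(cor_scores[upper.tri(cor_scores)]))

print(c(orthogonality_error = orthogonality_error,
        max_cor_original = max_cor_original,
        max_cor_scores = max_cor_scores))
```

The first and third numbers should be zero up to floating-point error, while the second reflects the correlation present in the raw variables.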

It’s often used to visualize genetic distance and relatedness between populations.

PCA has successfully found linear combinations of the different markers that separate out different clusters corresponding to different lines of individuals’ Y-chromosomal genetic descent.

A principal components analysis scatterplot of Y-STR haplotypes calculated from repeat-count values for 37 Y-chromosomal STR markers from 354 individuals.

Computing PCA in R

princomp: uses eigenvalues/eigenvectors of the covariance matrix

prcomp: uses singular value decomposition (SVD)

prcomp(x, retx = TRUE, center = TRUE, scale. = FALSE, tol = NULL, ...)
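
The two routes named above give the same components. The following is a minimal sketch, assuming the built-in USArrests data and illustrative variable names, that computes the component standard deviations by hand both ways and compares them with prcomp and princomp. The only difference it should reveal is the variance divisor: prcomp uses n - 1, princomp uses n.

```r
# Minimal sketch: eigen-decomposition vs SVD, on unscaled USArrests.
x <- as.matrix(USArrests)
n <- nrow(x)
xc <- scale(x, center = TRUE, scale = FALSE)   # centre only

# Route 1: eigen-decomposition of the covariance matrix
eig <- eigen(cov(xc))
sdev_eigen <- sqrt(eig$values)

# Route 2: SVD of the centred data matrix
sv <- svd(xc)
sdev_svd <- sv$d / sqrt(n - 1)

# Reference values from the two built-in functions
sdev_prcomp <- unname(prcomp(x)$sdev)
sdev_princomp <- unname(princomp(x)$sdev)

print(rbind(sdev_eigen, sdev_svd, sdev_prcomp, sdev_princomp))
```

The SVD route avoids forming the covariance matrix explicitly, which is generally the numerically preferable choice.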

# USArrests data vary by orders of magnitude, so scaling is appropriate
> prcomp(USArrests) # inappropriate
> pc1 <- prcomp(USArrests, scale. = TRUE)
> pc2 <- prcomp(~ Murder + Assault + Rape, data = USArrests, scale. = TRUE)
> summary(pc1)
> summary(pc2)
> plot(pc1)
> plot(pc2)

Importance of components:
                          Comp.1    Comp.2    Comp.3     Comp.4
Standard deviation     1.5748783 0.9948694 0.5971291 0.41644938
Proportion of Variance 0.6200604 0.2474413 0.0891408 0.04335752
Cumulative Proportion  0.6200604 0.8675017 0.9566425 1.00000000

Importance of components:
                          PC1     PC2    PC3     PC4
Standard deviation          …       … 6.4894 2.48279
Proportion of Variance 0.9655 0.02782 0.0058 0.00085
Cumulative Proportion       …       … 0.9991 1.00000
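
The rows of an "Importance of components" table follow directly from the component standard deviations. The following is a minimal sketch (the helper `importance` is a hypothetical name, not part of R) that rebuilds the three rows for the scaled and unscaled USArrests fits, and then shows why the unscaled fit is inappropriate: the variable with the largest variance dominates the first component.

```r
# Minimal sketch: rebuild the importance table from sdev.
pc_scaled <- prcomp(USArrests, scale. = TRUE)
pc_raw <- prcomp(USArrests)

# Hypothetical helper, for illustration only
importance <- function(fit) {
  variance <- fit$sdev^2
  rbind("Standard deviation"     = fit$sdev,
        "Proportion of Variance" = variance / sum(variance),
        "Cumulative Proportion"  = cumsum(variance) / sum(variance))
}

print(round(importance(pc_scaled), 4))
print(round(importance(pc_raw), 4))

# Without scaling, compare the raw variances with the PC1 loadings
print(sapply(USArrests, var))
print(round(pc_raw$rotation[, 1], 3))
```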

The iris dataset: PCA

> library(ggfortify)
# select only numerical data...
> iris_data <- iris[c(1, 2, 3, 4)]
# compute PCA...
> iris_pca <- prcomp(iris_data)
# visualize PCA
> autoplot(iris_pca, data = iris, colour = 'Species')
# draw eigenvectors...
> autoplot(iris_pca, data = iris, colour = 'Species', loadings = TRUE)
# attach eigenvector labels and options...
> autoplot(iris_pca, data = iris, colour = 'Species', loadings = TRUE, loadings.colour = 'blue', loadings.label = TRUE, loadings.label.size = 3)
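
What these plots display can be approximated with base R alone. The following is a rough sketch, not ggfortify's implementation: it extracts the scores and eigenvectors from the prcomp object and draws them with base graphics. The axis scaling may differ from what autoplot produces, and the variable names are illustrative.

```r
# Rough base-R sketch of a PCA scatterplot with eigenvectors.
iris_data <- iris[, 1:4]
iris_pca <- prcomp(iris_data)

scores <- iris_pca$x            # observations in PC coordinates
loadings <- iris_pca$rotation   # eigenvectors, one column per component

# Percentage of total variance carried by each component
var_pct <- 100 * iris_pca$sdev^2 / sum(iris_pca$sdev^2)

plot(scores[, 1], scores[, 2],
     col = as.integer(iris$Species), pch = 19,
     xlab = sprintf("PC1 (%.1f%%)", var_pct[1]),
     ylab = sprintf("PC2 (%.1f%%)", var_pct[2]))
arrows(0, 0, loadings[, 1], loadings[, 2], col = "blue", length = 0.1)
text(loadings[, 1], loadings[, 2], rownames(loadings), col = "blue", pos = 3)
legend("topright", legend = levels(iris$Species), col = 1:3, pch = 19)
```

The scores are simply the centred data multiplied by the rotation matrix, which is why the species separate along PC1 without the labels ever being used in the fit.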

The iris dataset: Cluster and Local Fisher Analyses

> library(cluster)
# cluster analysis...
> autoplot(clara(iris_data, 3))
# draw convex hull for each cluster...
> autoplot(fanny(iris_data, 3), frame = TRUE)
# draw probability ellipse
> autoplot(pam(iris_data, 3), frame = TRUE, frame.type = 'norm')
# Local Fisher Discriminant Analysis (LFDA)
> library(lfda)
> model <- lfda(iris[-5], iris[, 5], r = 3, metric = 'plain')
> autoplot(model, data = iris, frame = TRUE, frame.colour = 'Species')
# Semi-supervised LFDA (SELF)
> model <- self(iris[-5], iris[, 5], beta = 0.1, r = 3, metric = 'plain')
> autoplot(model, data = iris, frame = TRUE, frame.colour = 'Species')
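
To judge how well an unsupervised clustering recovers the known groups, the cluster labels can be cross-tabulated against the species. The following is a minimal sketch using pam from the cluster package, under the assumption that a 2D cluster plot is drawn in the coordinates of the first two principal components; the variable names are illustrative.

```r
# Minimal sketch: k-medoids clustering of iris, compared with the true species.
library(cluster)   # recommended package, normally installed with R

iris_data <- iris[, 1:4]

# k-medoids clustering with k = 3
fit <- pam(iris_data, k = 3)

# Cross-tabulate cluster labels against the known species
confusion <- table(cluster = fit$clustering, species = iris$Species)
print(confusion)

# Project onto the first two PCs: colour = cluster, symbol = species
scores <- prcomp(iris_data)$x[, 1:2]
plot(scores, col = fit$clustering, pch = as.integer(iris$Species))
```

Points whose colour and symbol disagree with their neighbours are the observations the clustering assigns to a different group than their species label.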

3D PCA

Additional packages required:

install.packages("rgl")
install.packages("pca3d")

> library(rgl)
> library(pca3d)
# PCA analysis of 'metabo' data:
# relative abundances of metabolites from serum samples of three groups
> data(metabo)
> dim(metabo) # 136 424
# PCA analysis, including all columns but the 'group' column
> pca <- prcomp(metabo[,-1], scale. = TRUE)
# 2D PCA
> pca2d(pca, group=metabo[,1])
# 3D PCA
> pca3d(pca, group=metabo[,1])
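
A 3D view is worthwhile only if the third component carries a meaningful share of the variance. The following is a minimal sketch that uses the iris measurements as a stand-in (the metabo data needs the pca3d package) to extract the three coordinates a 3D scatter plot would use and to compare the variance retained by two and three components. The variable names are illustrative.

```r
# Minimal sketch: coordinates and retained variance for a 3D PCA view.
# iris is a stand-in here; metabo is only available via the pca3d package.
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# First three columns of the scores: one (x, y, z) point per observation
coords3d <- pca$x[, 1:3]

# Cumulative share of variance retained by the leading components
cumvar <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

print(head(coords3d))
print(round(cumvar[2:3], 4))
```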

References

PCA

http://uc-r.github.io/pca

http://genomicsclass.github.io/book/pages/pca_svd.html

https://cran.r-project.org/web/packages/ggfortify/vignettes/plot_pca.html

https://cran.r-project.org/web/packages/HSAUR/vignettes/Ch_principal_components_analysis.pdf
