Download - Principal Component Analysis in Rmponce/courses/London2018/PCA.pdf · Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert

Transcript

Principal Component Analysis

Principal Component Analysis in R2018 Ontario Summer School on HPC

Marcelo Ponce

May 2018

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

Basics

Principal Component AnalysisPrincipal component analysis (PCA) is a statistical procedure that uses anorthogonal transformation to convert a set of observations of possiblycorrelated variables into a set of values of linearly uncorrelated variablescalled principal components.

PCA is mostly used as a tool inexploratory data analysis and formaking predictive models.

Unsupervised.

PCA is sensitive to the relative scalingof the original variables.

SVD, dimensionality reduction, ...

Also related to clusterization algs.(k-means) ...

“PCA ≈ fitting an n-dim

ellipsoid: PC;axes”

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

Basics

It’s often used to visualize genetic distance and relatedness betweenpopulations.

PCA has successfully foundlinear combinations of thedifferent markers, thatseparate out different clusterscorresponding to differentlines of individuals’Y-chromosomal geneticdescent.

A principal components analysis scatterplot of Y-STR haplotypes calculated from

repeat-count values for 37 Y-chromosomal STR markers from 354 individuals.

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

Basics

Computing PCA in Rß pricomp

uses eigen-values/vectors andcovariance matrix

à prcomp

uses singular value decomposition(SVD)

prcomp(x, retx = TRUE, center = TRUE, scale = FALSE, tol = NULL, ...)

# USArrests data vary by orders of

magnitude, so scaling is appropriate

> prcomp(USArrests) # inappropriate

> pc1 <- prcomp(USArrests, scale = T)

> pc2 <- prcomp(~Murder + Assault +

Rape, data = USArrests, scale = TRUE)

> summary(pc1)

> summary(pc2)

> plot(pc1)

> plot(pc2)

Importance of components: Comp.1 Comp.2

Comp.3 Comp.4

Standard deviation 1.5748783 0.9948694

0.5971291 0.41644938

Proportion of Variance 0.6200604

0.2474413 0.0891408 0.04335752

Cumulative Proportion 0.6200604 0.8675017

0.9566425 1.00000000

Importance of components: PC1 PC2 PC3

PC4

Standard deviation 83.7324 14.21240

6.4894 2.48279

Proportion of Variance 0.9655 0.02782

0.0058 0.00085

Cumulative Proportion 0.9655 0.99335

0.9991 1.00000

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

Basics

Computing PCA in Rß pricomp

uses eigen-values/vectors andcovariance matrix

à prcomp

uses singular value decomposition(SVD)

prcomp(x, retx = TRUE, center = TRUE, scale = FALSE, tol = NULL, ...)

# USArrests data vary by orders of

magnitude, so scaling is appropriate

> prcomp(USArrests) # inappropriate

> pc1 <- prcomp(USArrests, scale = T)

> pc2 <- prcomp(~Murder + Assault +

Rape, data = USArrests, scale = TRUE)

> summary(pc1)

> summary(pc2)

> plot(pc1)

> plot(pc2)

Importance of components: Comp.1 Comp.2

Comp.3 Comp.4

Standard deviation 1.5748783 0.9948694

0.5971291 0.41644938

Proportion of Variance 0.6200604

0.2474413 0.0891408 0.04335752

Cumulative Proportion 0.6200604 0.8675017

0.9566425 1.00000000

Importance of components: PC1 PC2 PC3

PC4

Standard deviation 83.7324 14.21240

6.4894 2.48279

Proportion of Variance 0.9655 0.02782

0.0058 0.00085

Cumulative Proportion 0.9655 0.99335

0.9991 1.00000

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

Basics

Computing PCA in Rß pricomp

uses eigen-values/vectors andcovariance matrix

à prcomp

uses singular value decomposition(SVD)

prcomp(x, retx = TRUE, center = TRUE, scale = FALSE, tol = NULL, ...)

# USArrests data vary by orders of

magnitude, so scaling is appropriate

> prcomp(USArrests) # inappropriate

> pc1 <- prcomp(USArrests, scale = T)

> pc2 <- prcomp(~Murder + Assault +

Rape, data = USArrests, scale = TRUE)

> summary(pc1)

> summary(pc2)

> plot(pc1)

> plot(pc2)

Importance of components: Comp.1 Comp.2

Comp.3 Comp.4

Standard deviation 1.5748783 0.9948694

0.5971291 0.41644938

Proportion of Variance 0.6200604

0.2474413 0.0891408 0.04335752

Cumulative Proportion 0.6200604 0.8675017

0.9566425 1.00000000

Importance of components: PC1 PC2 PC3

PC4

Standard deviation 83.7324 14.21240

6.4894 2.48279

Proportion of Variance 0.9655 0.02782

0.0058 0.00085

Cumulative Proportion 0.9655 0.99335

0.9991 1.00000

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

Basics

The iris dataset: PCA

> library(ggfortify)

# select only numerical data...

> iris data <- iris[c(1, 2, 3, 4)]

# compute PCA...

> iris pca <- prcomp(iris data)

# visualize PCA

> autoplot(iris pca, data = iris, colour =

’Species’)

# draw eigenvectors...

> autoplot(iris pca, data = iris, colour =

’Species’, loadings = TRUE)

# attach eigenvector labels and options...

> autoplot(prcomp(df), data = iris, colour =

’Species’, loadings = TRUE, loadings.colour

= ’blue’, loadings.label = TRUE, load-

ings.label.size = 3)

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

Basics

The iris dataset: PCA

> library(ggfortify)

# select only numerical data...

> iris data <- iris[c(1, 2, 3, 4)]

# compute PCA...

> iris pca <- prcomp(iris data)

# visualize PCA

> autoplot(iris pca, data = iris, colour =

’Species’)

# draw eigenvectors...

> autoplot(iris pca, data = iris, colour =

’Species’, loadings = TRUE)

# attach eigenvector labels and options...

> autoplot(prcomp(df), data = iris, colour =

’Species’, loadings = TRUE, loadings.colour

= ’blue’, loadings.label = TRUE, load-

ings.label.size = 3)

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

Basics

The iris dataset: PCA

> library(ggfortify)

# select only numerical data...

> iris data <- iris[c(1, 2, 3, 4)]

# compute PCA...

> iris pca <- prcomp(iris data)

# visualize PCA

> autoplot(iris pca, data = iris, colour =

’Species’)

# draw eigenvectors...

> autoplot(iris pca, data = iris, colour =

’Species’, loadings = TRUE)

# attach eigenvector labels and options...

> autoplot(prcomp(df), data = iris, colour =

’Species’, loadings = TRUE, loadings.colour

= ’blue’, loadings.label = TRUE, load-

ings.label.size = 3)

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

Basics

The iris dataset: PCA

> library(ggfortify)

# select only numerical data...

> iris data <- iris[c(1, 2, 3, 4)]

# compute PCA...

> iris pca <- prcomp(iris data)

# visualize PCA

> autoplot(iris pca, data = iris, colour =

’Species’)

# draw eigenvectors...

> autoplot(iris pca, data = iris, colour =

’Species’, loadings = TRUE)

# attach eigenvector labels and options...

> autoplot(prcomp(df), data = iris, colour =

’Species’, loadings = TRUE, loadings.colour

= ’blue’, loadings.label = TRUE, load-

ings.label.size = 3)

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

Basics

The iris dataset: PCA

> library(ggfortify)

# select only numerical data...

> iris data <- iris[c(1, 2, 3, 4)]

# compute PCA...

> iris pca <- prcomp(iris data)

# visualize PCA

> autoplot(iris pca, data = iris, colour =

’Species’)

# draw eigenvectors...

> autoplot(iris pca, data = iris, colour =

’Species’, loadings = TRUE)

# attach eigenvector labels and options...

> autoplot(prcomp(df), data = iris, colour =

’Species’, loadings = TRUE, loadings.colour

= ’blue’, loadings.label = TRUE, load-

ings.label.size = 3)

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

Basics

The iris dataset: Cluster and Local Fisher Analyses

> library(cluster)

# cluster analysis...

> autoplot(clara(iris data,3))

# draw convex for each cluster...

> autoplot(fanny(iris data,3), frame=TRUE)

# draw probability ellipse

> autoplot(pam(iris pca,3), frame=TRUE,

frame.type=’norm’)

# Local Fisher Discriminant Analysis (LFDA)

> library(lfda)

> model <- lfda(iris[-5], iris[,5], r=3, met-

ric=’plain’)

> autoplot(model, data = iris, frame=TRUE,

frame.colour = ’Species’)

# Semi-supervised LFDA (SELF)

> model <- self(iris[-5], iris[, 5], beta =

0.1, r = 3, metric="plain")

> autoplot(model, data = iris, frame = TRUE,

frame.colour = ’Species’)

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

Basics

The iris dataset: Cluster and Local Fisher Analyses

> library(cluster)

# cluster analysis...

> autoplot(clara(iris data,3))

# draw convex for each cluster...

> autoplot(fanny(iris data,3), frame=TRUE)

# draw probability ellipse

> autoplot(pam(iris pca,3), frame=TRUE,

frame.type=’norm’)

# Local Fisher Discriminant Analysis (LFDA)

> library(lfda)

> model <- lfda(iris[-5], iris[,5], r=3, met-

ric=’plain’)

> autoplot(model, data = iris, frame=TRUE,

frame.colour = ’Species’)

# Semi-supervised LFDA (SELF)

> model <- self(iris[-5], iris[, 5], beta =

0.1, r = 3, metric="plain")

> autoplot(model, data = iris, frame = TRUE,

frame.colour = ’Species’)

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

Basics

The iris dataset: Cluster and Local Fisher Analyses

> library(cluster)

# cluster analysis...

> autoplot(clara(iris data,3))

# draw convex for each cluster...

> autoplot(fanny(iris data,3), frame=TRUE)

# draw probability ellipse

> autoplot(pam(iris pca,3), frame=TRUE,

frame.type=’norm’)

# Local Fisher Discriminant Analysis (LFDA)

> library(lfda)

> model <- lfda(iris[-5], iris[,5], r=3, met-

ric=’plain’)

> autoplot(model, data = iris, frame=TRUE,

frame.colour = ’Species’)

# Semi-supervised LFDA (SELF)

> model <- self(iris[-5], iris[, 5], beta =

0.1, r = 3, metric="plain")

> autoplot(model, data = iris, frame = TRUE,

frame.colour = ’Species’)

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

Basics

The iris dataset: Cluster and Local Fisher Analyses

> library(cluster)

# cluster analysis...

> autoplot(clara(iris data,3))

# draw convex for each cluster...

> autoplot(fanny(iris data,3), frame=TRUE)

# draw probability ellipse

> autoplot(pam(iris pca,3), frame=TRUE,

frame.type=’norm’)

# Local Fisher Discriminant Analysis (LFDA)

> library(lfda)

> model <- lfda(iris[-5], iris[,5], r=3, met-

ric=’plain’)

> autoplot(model, data = iris, frame=TRUE,

frame.colour = ’Species’)

# Semi-supervised LFDA (SELF)

> model <- self(iris[-5], iris[, 5], beta =

0.1, r = 3, metric="plain")

> autoplot(model, data = iris, frame = TRUE,

frame.colour = ’Species’)

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

Basics

The iris dataset: Cluster and Local Fisher Analyses

> library(cluster)

# cluster analysis...

> autoplot(clara(iris data,3))

# draw convex for each cluster...

> autoplot(fanny(iris data,3), frame=TRUE)

# draw probability ellipse

> autoplot(pam(iris pca,3), frame=TRUE,

frame.type=’norm’)

# Local Fisher Discriminant Analysis (LFDA)

> library(lfda)

> model <- lfda(iris[-5], iris[,5], r=3, met-

ric=’plain’)

> autoplot(model, data = iris, frame=TRUE,

frame.colour = ’Species’)

# Semi-supervised LFDA (SELF)

> model <- self(iris[-5], iris[, 5], beta =

0.1, r = 3, metric="plain")

> autoplot(model, data = iris, frame = TRUE,

frame.colour = ’Species’)

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

3D PCA

3D PCAAdditional Packages required:

install.packages("rgl")

install.packages("pca3d")

> library(rgl)

> library(pca3d)

# PCA analysis of ’metabo’ data

# relative abundances of metabolites

from serun samples of three groups

> data(metabo)

> dim(metabo) # 136 424

# PCA analysis, including all rows but

the ’group’ column

> pca <- prcomp(metabo[,-1],

scale=TRUE)

# 2D PCA

> pca2d(pca, group=metabo[,1])

# 3D PCA

> pca3d(pca, group=metabo[,1])

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

3D PCA

3D PCAAdditional Packages required:

install.packages("rgl")

install.packages("pca3d")

> library(rgl)

> library(pca3d)

# PCA analysis of ’metabo’ data

# relative abundances of metabolites

from serun samples of three groups

> data(metabo)

> dim(metabo) # 136 424

# PCA analysis, including all rows but

the ’group’ column

> pca <- prcomp(metabo[,-1],

scale=TRUE)

# 2D PCA

> pca2d(pca, group=metabo[,1])

# 3D PCA

> pca3d(pca, group=metabo[,1])

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

3D PCA

References

PCA

http://uc-r.github.io/pca

http://genomicsclass.github.io/book/pages/pca_svd.html

https://cran.r-project.org/web/packages/ggfortify/vignettes/plot_pca.html

https://cran.r-project.org/web/packages/HSAUR/vignettes/Ch_principal_components_analysis.pdf

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)