High Dimensional Biological Data Analysis and Visualization

Dmitry Grapov, PhD Metabolomic Data Analysis for the Study of Diseases


Examples of data analysis and visualization of high dimensional metabolomic data.

Transcript of High Dimensional Biological Data Analysis and Visualization

Page 1: High Dimensional Biological Data Analysis and Visualization

Dmitry Grapov, PhD

Metabolomic Data Analysis for the Study of Diseases

Page 2: High Dimensional Biological Data Analysis and Visualization

State of the art facility producing massive amounts of biological data…

>13,000 samples/yr; >160 studies; ~32,000 data points/study

Page 3: High Dimensional Biological Data Analysis and Visualization

Goals?

Page 4: High Dimensional Biological Data Analysis and Visualization

Analysis at the Metabolomic Scale

Page 5: High Dimensional Biological Data Analysis and Visualization

Univariate vs. Multivariate

Univariate (Group 1 vs. Group 2): hypothesis testing (t-Test, ANOVA, etc.)

Multivariate: PCA

Predictive Modeling: O-/PLS/-DA

Page 6: High Dimensional Biological Data Analysis and Visualization

univariate/bivariate vs. multivariate

mixed-up samples? outliers?

Univariate vs. Multivariate

Page 7: High Dimensional Biological Data Analysis and Visualization

Data Complexity

[Figure: data matrix of n samples × m variables; complexity increases from 1-D to 2-D to m-D]

Meta Data = Experimental Design

Variable # = dimensionality

Page 8: High Dimensional Biological Data Analysis and Visualization

Statistical Analysis

• Identify differences in sample population means

• sensitive to distribution shape

• parametric = assumes normality

• error in Y, not in X (Y = mX + error)

• optimal for long data

• assumed independence

• false discovery rate (FDR)

[Figure labels: long, wide, n-of-one]

Page 9: High Dimensional Biological Data Analysis and Visualization

Achieving “significance” is a function of:

significance level (α) and power (1-β)

effect size (standardized difference in means)

sample size (n)
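How these three quantities trade off can be sketched numerically. The example below is a rough illustration only: it assumes a two-sided, two-sample comparison with equal group sizes and uses a normal (z-test) approximation rather than an exact t-test calculation. The function name approx_power is hypothetical.

```python
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power (1 - beta) of a two-sided, two-sample z-test.

    effect_size is the standardized difference in means; equal group
    sizes and a normal approximation are assumed.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected shift of the test statistic under the alternative hypothesis.
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability of landing in either rejection region.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


# Power grows with sample size for a fixed effect size and alpha.
for n in (10, 20, 50, 100):
    print(n, round(approx_power(0.5, n), 3))
```

With an effect size of zero the function returns alpha itself, which is the Type I error rate discussed on the next slide.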

Page 10: High Dimensional Biological Data Analysis and Visualization

• Type I Error: False Positives

• Type II Error: False Negatives

• Type I risk = 1-(1-p.value)^m

m = number of variables tested
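The Type I risk formula can be checked numerically. This is a minimal sketch that assumes the m tests are independent and uses a per-test threshold alpha in place of p.value; the function name is hypothetical.

```python
def familywise_error_risk(alpha, m):
    """Chance of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m


# The risk climbs quickly as more variables are tested at the same threshold.
for m in (1, 10, 100, 148):
    print(m, round(familywise_error_risk(0.05, m), 4))
```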

FDR correction

• p-value adjustment or estimate of FDR (Fdr, q-value)

False Discovery Rate (FDR)

Bioinformatics (2008) 24 (12):1461-1462

Page 11: High Dimensional Biological Data Analysis and Visualization

FDR correction

[Figure: FDR adjusted p-value vs. p-value]

Benjamini & Hochberg (1995) (“BH”)

• Accepted standard

Bonferroni

• Very conservative

• adjusted p-value = p-value * # of tests (e.g. 0.005 * 148 = 0.74)
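Both adjustments are short enough to sketch directly. The code below is an illustrative plain-Python version, not the implementation used in the slides; the function names and the example p-values are made up for demonstration.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print("Bonferroni:", [round(p, 3) for p in bonferroni(pvals)])
print("BH:        ", [round(p, 3) for p in benjamini_hochberg(pvals)])
```

Bonferroni scales every p-value by the full number of tests, while BH scales by the number of tests divided by the rank, which is why it is less conservative.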

Page 12: High Dimensional Biological Data Analysis and Visualization

Multivariate Analysis

Clustering

• Grouping based on similarity/dissimilarity

Principal Components Analysis (PCA)

• Identify modes of variance in the data

Partial Least Squares (PLS)

• Identify modes of variance in the data correlated with a hypothesis

Page 13: High Dimensional Biological Data Analysis and Visualization

Cluster Analysis

Use similarity/dissimilarity to group a collection of samples or variables

Approaches

• hierarchical (HCA)

• non-hierarchical (k-NN, k-means)

• distribution (mixture models)

• density (DBSCAN)

• self organizing maps (SOM)

[Figure panels: Linkage, k-means, Distribution, Density]

Page 14: High Dimensional Biological Data Analysis and Visualization

Hierarchical Cluster Analysis

similarity/dissimilarity defines “nearness” or distance

objects are grouped based on linkage methods
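The two ingredients, a distance and a linkage rule, can be illustrated with a small agglomerative clustering sketch. It assumes Euclidean distance; the function name and the toy sample coordinates are invented for illustration, and the brute-force search is only suitable for small data sets.

```python
from math import dist


def hierarchical_clusters(points, n_clusters, linkage=min):
    """Agglomerative clustering on Euclidean distance.

    linkage=min gives single linkage; linkage=max gives complete linkage.
    Returns lists of point indices, one list per cluster.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters that are nearest under the linkage rule.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(dist(points[i], points[j])
                            for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the nearest pair and continue up the hierarchy.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy samples (made up): two tight groups and one isolated point.
samples = [(0.0, 0.0), (0.2, 0.0), (0.1, 0.4),
           (5.0, 5.1), (5.2, 4.9), (9.0, 0.2)]
print(hierarchical_clusters(samples, 3))
print(hierarchical_clusters(samples, 3, linkage=max))
```

Stopping at different cluster counts corresponds to cutting the hierarchy at different heights.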

Page 15: High Dimensional Biological Data Analysis and Visualization

Hierarchy of Similarity

[Figure axis: Similarity]

How does my metadata match my data structure?

Hierarchy of effect sizes

Page 16: High Dimensional Biological Data Analysis and Visualization

Projection of Data

The algorithm defines the position of the light source

Principal Components Analysis (PCA)

• unsupervised

• maximize variance (X)

Partial Least Squares Projection to Latent Structures (PLS)

• supervised

• maximize covariance (Y ~ X)

James X. Li, 2009, VisuMap Tech.

[Figure axes: PC1, PC2]

http://www.scholarpedia.org/article/Eigenfaces

[Figure panels: Raw data, PCA dimensions]

Page 17: High Dimensional Biological Data Analysis and Visualization

Interpreting PCA Results

Variance explained (eigenvalues)

Row (sample) scores and column (variable) loadings
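The three outputs named above can be computed for the first principal component with a short power-iteration sketch. This is an illustrative plain-Python version under simple assumptions (mean-centered data, sample covariance, first component only); the function name and the small two-variable data set are not from the slides.

```python
def first_principal_component(rows, n_iter=200):
    """Return (loadings, scores, variance_explained) for PC1."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    # Sample covariance matrix of the variables (columns).
    cov = [[sum(c[j] * c[k] for c in centered) / (n - 1) for k in range(m)]
           for j in range(m)]
    # Power iteration; an uneven start vector lowers the chance of beginning
    # exactly orthogonal to the leading eigenvector.
    v = [1.0 + j for j in range(m)]
    for _ in range(n_iter):
        w = [sum(cov[j][k] * v[k] for k in range(m)) for j in range(m)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    eigenvalue = sum(v[j] * sum(cov[j][k] * v[k] for k in range(m))
                     for j in range(m))
    total_variance = sum(cov[j][j] for j in range(m))
    # Scores are the centered samples projected onto the loadings.
    scores = [sum(c[j] * v[j] for j in range(m)) for c in centered]
    return v, scores, eigenvalue / total_variance


data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
        (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]
loadings, scores, explained = first_principal_component(data)
print("loadings:", [round(x, 3) for x in loadings])
print("variance explained:", round(explained, 3))
```

Each sample gets one score per component and each variable gets one loading per component; the eigenvalue divided by the total variance gives the fraction of variance explained.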

Page 18: High Dimensional Biological Data Analysis and Visualization

How are scores and loadings related?

Page 19: High Dimensional Biological Data Analysis and Visualization

Centering and Scaling

PMID: 16762068
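As a rough illustration of what centering and scaling do to each variable, the sketch below applies three commonly used pretreatments: mean centering, autoscaling (unit variance), and Pareto scaling. Which methods the slide figure showed is an assumption here; the function name and the toy values are hypothetical.

```python
from statistics import mean, stdev


def scale_columns(rows, method="auto"):
    """Center each variable (column) and optionally scale it.

    method: "center" (mean centering only), "auto" (divide by the standard
    deviation), or "pareto" (divide by the square root of the standard
    deviation).
    """
    scaled_cols = []
    for col in zip(*rows):
        mu, sd = mean(col), stdev(col)
        if method == "center":
            divisor = 1.0
        elif method == "auto":
            divisor = sd
        elif method == "pareto":
            divisor = sd ** 0.5
        else:
            raise ValueError(f"unknown method: {method}")
        scaled_cols.append([(x - mu) / divisor for x in col])
    # Transpose back to one row per sample.
    return [list(r) for r in zip(*scaled_cols)]


# Two variables on very different scales (toy values).
raw = [(1.0, 100.0), (2.0, 300.0), (3.0, 200.0)]
for method in ("center", "auto", "pareto"):
    print(method, scale_columns(raw, method))
```

Without scaling, the high-magnitude variable dominates variance-based methods such as PCA; autoscaling gives every variable equal weight, and Pareto scaling sits in between.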

Page 20: High Dimensional Biological Data Analysis and Visualization

Use PLS to test a hypothesis

[Figure: time = 0 vs. 120 min.]

Partial Least Squares (PLS) is used to identify planes of maximum correlation between X measurements and Y (hypothesis)

[Figure panels: PCA, PLS]
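The difference from PCA can be seen in a single-component PLS sketch for one response (often called PLS1). This is a simplified illustration: the weight vector is taken as the direction in X with maximal covariance with Y, and only the first latent variable is computed. The function name and the toy data (three samples per time point, one noisy variable and one informative variable) are invented.

```python
def pls1_first_component(X, y):
    """Return (weights, scores, fitted_y) for the first PLS latent variable."""
    n, m = len(X), len(X[0])
    x_means = [sum(r[j] for r in X) / n for j in range(m)]
    y_mean = sum(y) / n
    Xc = [[r[j] - x_means[j] for j in range(m)] for r in X]
    yc = [v - y_mean for v in y]
    # Weights: covariance of each X variable with y, normalized to unit length.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(m)]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    # Scores: samples projected onto the weight vector.
    scores = [sum(Xc[i][j] * w[j] for j in range(m)) for i in range(n)]
    # Regress y on the scores to obtain fitted values.
    q = sum(t * v for t, v in zip(scores, yc)) / sum(t * t for t in scores)
    fitted = [y_mean + q * t for t in scores]
    return w, scores, fitted


# Variable 1 has large variance unrelated to y; variable 2 tracks y.
X = [(5.0, 1.0), (-4.0, 1.2), (-1.0, 0.9),
     (4.5, 2.1), (-5.0, 1.9), (0.5, 2.0)]
y = [0, 0, 0, 1, 1, 1]  # e.g. time = 0 vs. 120 min.
weights, scores, fitted = pls1_first_component(X, y)
print("weights:", [round(v, 3) for v in weights])
print("fitted: ", [round(v, 2) for v in fitted])
```

In this toy case the first PLS weight vector loads on variable 2, whereas a variance-only method would be drawn toward the noisier variable 1.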

Page 21: High Dimensional Biological Data Analysis and Visualization

PLS model validation is critical

Determine in-sample (Q2) and out-of-sample error (RMSEP) and compare to a random model

• permutation tests

• training/testing
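The permutation idea, comparing the real model against models built on shuffled labels, can be sketched as follows. To stay short, the example uses the squared correlation between a single score vector and Y as a stand-in for a model quality statistic such as Q2 or RMSEP; in practice the full model would be refit for every permutation. The function names and toy values are hypothetical.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def permutation_p_value(x, y, n_permutations=999, seed=0):
    """Fraction of label shuffles that fit at least as well as the real labels."""
    rng = random.Random(seed)
    observed = r_squared(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break the link between x and y
        if r_squared(x, shuffled) >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (hits + 1) / (n_permutations + 1)


model_scores = [0.9, 1.2, 1.0, 1.1, 2.1, 1.9, 2.0, 2.2]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print("permutation p-value:", permutation_p_value(model_scores, labels))
```

A model that performs no better than its permuted counterparts is likely overfit, regardless of how well it separates the training samples.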

Page 22: High Dimensional Biological Data Analysis and Visualization

Databases for organism specific biochemical information:

Multiple organisms

• KEGG

• BioCyc

• Reactome

Human

• HMDB

• SMPDB

Biochemical domain information

Page 23: High Dimensional Biological Data Analysis and Visualization

Pathway Enrichment Analysis

http://www.metaboanalyst.ca/MetaboAnalyst/faces/UploadView.jsp

[Figure axes: enrichment, topological importance]
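One common way to score pathway enrichment is over-representation analysis with a hypergeometric test. Whether the tool linked above uses exactly this calculation for a given setting is not stated in the slides, so treat the sketch below as a generic illustration; the function name and the example counts are made up.

```python
from math import comb


def enrichment_p_value(hits_in_pathway, pathway_size, n_hits, n_background):
    """P(X >= hits_in_pathway) under the hypergeometric distribution.

    n_background: metabolites that could have been detected
    pathway_size: background metabolites annotated to the pathway
    n_hits: metabolites flagged as significantly changed
    hits_in_pathway: flagged metabolites that fall in the pathway
    """
    upper = min(pathway_size, n_hits)
    total = comb(n_background, n_hits)
    tail = sum(comb(pathway_size, k)
               * comb(n_background - pathway_size, n_hits - k)
               for k in range(hits_in_pathway, upper + 1))
    return tail / total


# Hypothetical counts: 5 of 20 changed metabolites fall in a 10-member
# pathway, out of 200 measured metabolites.
print(enrichment_p_value(5, 10, 20, 200))
```

Because many pathways are tested at once, the resulting p-values would normally be adjusted for multiple testing as well.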

Page 24: High Dimensional Biological Data Analysis and Visualization

Biochemical Network Mapping

doi:10.1186/1471-2105-13-99

Structural Similarity
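A structural similarity network can be sketched by scoring pairs of metabolites with the Tanimoto coefficient and keeping pairs above a cutoff as edges. In this illustration each fingerprint is represented as a set of "on" bit positions; the bit values, the metabolite names, the function names, and the 0.7 cutoff are all assumptions made for the example rather than values taken from the slides.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient between two sets of fingerprint bit positions."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_edges(fingerprints, threshold=0.7):
    """Return (name_a, name_b, score) for every pair at or above threshold."""
    edges = []
    for (name_a, fp_a), (name_b, fp_b) in combinations(fingerprints.items(), 2):
        score = tanimoto(fp_a, fp_b)
        if score >= threshold:
            edges.append((name_a, name_b, round(score, 3)))
    return edges


# Toy fingerprints (not real chemical data).
toy = {
    "metabolite_A": {1, 2, 3, 4},
    "metabolite_B": {1, 2, 3, 4, 5},
    "metabolite_C": {7, 8, 9},
}
print(similarity_edges(toy))
```

The resulting edge list can be loaded into a network viewer, and edges from biochemical reaction databases can be added alongside the similarity edges.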

Page 25: High Dimensional Biological Data Analysis and Visualization

Data visualization as a form of analysis

DM

Liver CYP2D6

Dextromethorphan = additives in

dextrorphan

• high fructose corn syrup

• antioxidants

• flavor

Page 26: High Dimensional Biological Data Analysis and Visualization

Identification of relationships between altered metabolites

[Figure labels: urea cycle; nucleotide synthesis; protein glycosylation]

Page 27: High Dimensional Biological Data Analysis and Visualization

Identification of treatment effects

Page 28: High Dimensional Biological Data Analysis and Visualization

Analysis of differential metabolic responses

Treatment 1 Treatment 2

Page 29: High Dimensional Biological Data Analysis and Visualization

Resources

• DeviumWeb: Dynamic multivariate data analysis and visualization platform
url: https://github.com/dgrapov/DeviumWeb

• imDEV: Microsoft Excel add-in for multivariate analysis
url: http://sourceforge.net/projects/imdev/

• MetaMapR: Network analysis tools for metabolomics
url: https://github.com/dgrapov/MetaMapR

• TeachingDemos: Tutorials and demonstrations
url: http://sourceforge.net/projects/teachingdemos/?source=directory
url: https://github.com/dgrapov/TeachingDemos

• CDS Blog: Data analysis case studies
url: http://imdevsoftware.wordpress.com/

Page 30: High Dimensional Biological Data Analysis and Visualization

metabolomics.ucdavis.edu

This research was supported in part by NIH 1 U24 DK097154