Dmitry Grapov, PhD
Metabolomic Data Analysis for the Study of Diseases
State of the art facility producing massive amounts of biological data…
• >13,000 samples/yr
• >160 studies
• ~32,000 data points/study
Goals?
Analysis at the Metabolomic Scale
Univariate vs. Multivariate
• Univariate: hypothesis testing (t-test, ANOVA, etc.), e.g. Group 1 vs. Group 2
• Multivariate: PCA
• Predictive modeling: O-/PLS/-DA
Univariate vs. Multivariate
• univariate/bivariate vs. multivariate views
• mixed-up samples? outliers?
Data Complexity
• data: n samples × m variables
• variable # = dimensionality
• complexity: 1-D, 2-D, m-D
• meta data: experimental design
Statistical Analysis
• identify differences in sample population means
• sensitive to distribution shape
• parametric = assumes normality
• error in Y, not in X (Y = mX + error)
• optimal for long data
• assumed independence
• false discovery rate (FDR)
[Figure: data shapes: long, wide, n-of-one]
Achieving “significance” is a function of:
• significance level (α) and power (1-β)
• effect size (standardized difference in means)
• sample size (n)
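The relationship between these four quantities can be sketched with a normal approximation for a two-sided comparison of two equally sized groups. This is only an illustration: the function names are made up for this example, and an exact calculation based on the t distribution gives slightly larger sample sizes.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)

def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power (1 - beta) of a two-sided, two-group comparison.

    Normal approximation; the negligible contribution of the opposite
    tail is ignored.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = effect_size * math.sqrt(n_per_group / 2)
    return _STD_NORMAL.cdf(shift - z_crit)

def approx_n_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate samples per group needed to reach the requested power."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = _STD_NORMAL.inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

# Smaller standardized effects need many more samples for the same power.
for d in (0.2, 0.5, 0.8):
    n = approx_n_per_group(d)
    print(f"effect size {d}: ~{n} samples per group, "
          f"power ~{approx_power(d, n):.2f}")
```

Halving the effect size roughly quadruples the required sample size, which is why small effects are hard to detect in small studies.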
• Type I Error: False Positives
• Type II Error: False Negatives
• Type I risk = 1 - (1 - p.value)^m
  m = number of variables tested
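A short sketch of how the Type I risk grows with the number of variables tested. It assumes the tests are independent and treats the p-value in the formula as the per-test significance threshold; the function name and the values of m are illustrative only.

```python
def familywise_type1_risk(threshold, m):
    """Chance of at least one false positive across m independent tests."""
    return 1.0 - (1.0 - threshold) ** m

# Risk of one or more false positives at a per-test threshold of 0.05.
for m in (1, 10, 100, 1000):
    print(f"m = {m:5d}: Type I risk = {familywise_type1_risk(0.05, m):.3f}")
```

With a single test the risk equals the threshold, but with hundreds of metabolites at least one false positive is almost guaranteed unless the p-values are adjusted.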
FDR correction
• p-value adjustment or estimate of FDR (Fdr, q-value)
False Discovery Rate (FDR)
Bioinformatics (2008) 24 (12):1461-1462
FDR correction
[Figure: FDR adjusted p-value vs. p-value]
Benjamini & Hochberg (1995) (“BH”)
• Accepted standard
Bonferroni
• Very conservative
• adjusted p-value = p-value * # of tests (e.g. 0.005 * 148 = 0.74)
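Both adjustments can be written in a few lines. The sketch below is a plain-Python illustration rather than a reference implementation; the function names and the example p-values are invented for the demonstration.

```python
def bonferroni(pvalues, m=None):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(pvalues) if m is None else m
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Benjamini & Hochberg (BH) adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# The slide's example: one p-value of 0.005 among 148 tests.
print(bonferroni([0.005], m=148))

# Invented p-values, to compare the two adjustments side by side.
example = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(example))
print(benjamini_hochberg(example))
```

BH scales each p-value by m / rank instead of m, so it is never more conservative than Bonferroni.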
Multivariate Analysis
Clustering
• Grouping based on similarity/dissimilarity
Principal Components Analysis (PCA)
• Identify modes of variance in the data
Partial Least Squares (PLS)
• Identify modes of variance in the data correlated with a hypothesis
Cluster Analysis
Use similarity/dissimilarity to group a collection of samples or variables
Approaches
• hierarchical (HCA)
• non-hierarchical (k-NN, k-means)
• distribution (mixture models)
• density (DBSCAN)
• self-organizing maps (SOM)
[Figure panels: Linkage, k-means, Distribution, Density]
Hierarchical Cluster Analysis
• similarity/dissimilarity defines “nearness” or distance
• objects are grouped based on linkage methods
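A minimal sketch of agglomerative clustering, assuming Euclidean distance as the dissimilarity and a choice of single, complete, or average linkage. The function names and the four toy points are illustrative; for real data an optimized library routine is a better choice than this quadratic loop.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hca(points, linkage="average"):
    """Agglomerative clustering; returns merges as (id_a, id_b, distance).

    Original points have ids 0..n-1; each merge creates the next id.
    Any linkage other than "single" or "complete" falls back to average.
    """
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                dists = [euclidean(points[p], points[q])
                         for p in clusters[a] for q in clusters[b]]
                if linkage == "single":
                    d = min(dists)
                elif linkage == "complete":
                    d = max(dists)
                else:
                    d = sum(dists) / len(dists)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges

# Two tight pairs of samples that are far apart from each other.
toy_points = [[0, 0], [0, 1], [5, 5], [5, 6]]
for a, b, d in hca(toy_points):
    print(f"merge {a} + {b} at distance {d:.2f}")
```

The merge distances form the heights of the dendrogram: nearby samples join first and the two distant groups join last.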
Hierarchy of Similarity
[Figure: dendrogram; axis: Similarity]
How does my metadata match my data structure?
Hierarchy of effect sizes
Projection of Data
The algorithm defines the position of the light source
Principal Components Analysis (PCA)
• unsupervised
• maximize variance (X)
Partial Least Squares Projection to Latent Structures (PLS)
• supervised
• maximize covariance (Y ~ X)
James X. Li, 2009, VisuMap Tech.
[Figure axes: PC1, PC2]
http://www.scholarpedia.org/article/Eigenfaces
[Figure panels: raw data, PCA dimensions]
Interpreting PCA Results
Variance explained (eigenvalues)
Row (sample) scores and column (variable) loadings
How are scores and loadings related?
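One way to see how these pieces relate is to compute the first principal component by hand. The sketch below uses power iteration on the covariance matrix, which is only one of several ways to obtain it; the function names and the small data set are invented for illustration. Scores are the centered data projected onto the loadings, and the variance of the scores equals the eigenvalue.

```python
import math

def center(X):
    """Subtract each column mean."""
    n, m = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    return [[row[j] - means[j] for j in range(m)] for row in X]

def covariance(Xc):
    """Sample covariance matrix of centered data."""
    n, m = len(Xc), len(Xc[0])
    return [[sum(r[a] * r[b] for r in Xc) / (n - 1) for b in range(m)]
            for a in range(m)]

def first_pc(C, iters=500):
    """Leading eigenvector (loadings) and eigenvalue by power iteration.

    Assumes the start vector is not orthogonal to the leading eigenvector
    and that C is not all zeros.
    """
    m = len(C)
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iters):
        w = [sum(C[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(C[a][b] * v[b] for b in range(m))
                     for a in range(m))
    return v, eigenvalue

def project(Xc, loadings):
    """Scores: one value per row (sample) of the centered data."""
    return [sum(x * l for x, l in zip(row, loadings)) for row in Xc]

# Invented data: 5 samples (rows) by 2 variables (columns).
X = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]]
Xc = center(X)
C = covariance(Xc)
loadings, eigenvalue = first_pc(C)
scores = project(Xc, loadings)
total_variance = sum(C[j][j] for j in range(len(C)))
print("loadings:", loadings)
print("scores:", scores)
print("fraction of variance explained:", eigenvalue / total_variance)
```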
Centering and Scaling
PMID: 16762068
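A minimal sketch of three common pretreatments: mean centering, autoscaling (unit variance), and Pareto scaling. The function name, the method labels, and the example values are assumptions made for this illustration, and a column with zero variance would need special handling.

```python
import math
from statistics import mean, stdev

def scale_columns(X, method="auto"):
    """Center each column, then divide by a method-specific factor.

    "center": no scaling; "auto": divide by the standard deviation;
    "pareto": divide by the square root of the standard deviation.
    """
    columns = list(zip(*X))
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]
    if method == "center":
        divisors = [1.0] * len(columns)
    elif method == "auto":
        divisors = sds
    elif method == "pareto":
        divisors = [math.sqrt(sd) for sd in sds]
    else:
        raise ValueError(f"unknown scaling method: {method}")
    return [[(row[j] - means[j]) / divisors[j] for j in range(len(row))]
            for row in X]

# Two variables measured on very different scales.
X = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
print(scale_columns(X, "center"))
print(scale_columns(X, "auto"))
print(scale_columns(X, "pareto"))
```

Without scaling, the variable with the largest magnitude dominates variance-based methods such as PCA. Autoscaling gives every variable equal weight, while Pareto scaling is a compromise that keeps part of the original magnitude.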
Use PLS to test a hypothesis
Partial Least Squares (PLS) is used to identify planes of maximum correlation between X measurements and Y (hypothesis)
[Figure: PCA vs. PLS; time = 0 vs. 120 min.]
PLS model validation is critical
Determine in-sample (Q²) and out-of-sample (RMSEP) error and compare to a random model
• permutation tests
• training/testing
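The comparison to a random model can be sketched with a one-component PLS1 model, a leave-one-out estimate of Q², and a permutation test in which Y is shuffled to break its link with X. Everything here is an illustrative assumption: the function names, the synthetic data, the single latent variable, and the choice of leave-one-out as the cross-validation scheme.

```python
import random

def pls1_fit(X, y):
    """Fit a one-component PLS1 model (single response, single latent variable)."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(m)] for row in X]
    yc = [v - y_mean for v in y]
    # Weights: the direction in X with maximum covariance with y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(m)]
    norm = sum(v * v for v in w) ** 0.5
    if norm == 0:  # no covariance at all: fall back to predicting the mean
        return x_mean, y_mean, [0.0] * m, 0.0
    w = [v / norm for v in w]
    t = [sum(Xc[i][j] * w[j] for j in range(m)) for i in range(n)]  # scores
    b = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    return x_mean, y_mean, w, b

def pls1_predict(model, row):
    x_mean, y_mean, w, b = model
    score = sum((row[j] - x_mean[j]) * w[j] for j in range(len(row)))
    return y_mean + b * score

def q2_leave_one_out(X, y):
    """Q2 = 1 - PRESS / total sum of squares, from leave-one-out predictions."""
    y_mean = sum(y) / len(y)
    press = 0.0
    for i in range(len(y)):
        model = pls1_fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - pls1_predict(model, X[i])) ** 2
    return 1.0 - press / sum((v - y_mean) ** 2 for v in y)

def permutation_pvalue(X, y, n_permutations=199, seed=0):
    """Share of shuffled-y models that do at least as well as the real one."""
    rng = random.Random(seed)
    observed = q2_leave_one_out(X, y)
    hits = 0
    for _ in range(n_permutations):
        shuffled = y[:]
        rng.shuffle(shuffled)
        if q2_leave_one_out(X, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_permutations + 1)

# Synthetic data: y depends strongly on the first column of X.
X = [[float(i), float((i * 7) % 5), float((i * 3) % 4)] for i in range(12)]
y = [2.0 * row[0] + 0.5 * row[1] for row in X]
q2, p_value = permutation_pvalue(X, y)
print(f"Q2 = {q2:.3f}, permutation p-value = {p_value:.3f}")
```

A model whose Q² is no better than that of the shuffled models has not learned anything about Y, however well it separates the training samples.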
Databases for organism specific biochemical information:
Multiple organisms
• KEGG
• BioCyc
• Reactome
Human
• HMDB
• SMPDB
Biochemical domain information
Pathway Enrichment Analysis
http://www.metaboanalyst.ca/MetaboAnalyst/faces/UploadView.jsp
[Figure axes: enrichment, topological importance]
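Enrichment is commonly scored with an over-representation test. The sketch below uses the hypergeometric distribution, which is one standard choice and not necessarily what any particular tool does internally. The function name and all counts are hypothetical.

```python
from math import comb

def enrichment_pvalue(hits, pathway_size, selected, background):
    """P(at least `hits` pathway members among the selected metabolites).

    hits:         selected (e.g. significantly changed) metabolites in the pathway
    pathway_size: measured metabolites that belong to the pathway
    selected:     total number of selected metabolites
    background:   total number of measured metabolites
    """
    upper = min(pathway_size, selected)
    total = comb(background, selected)
    tail = sum(comb(pathway_size, k) * comb(background - pathway_size, selected - k)
               for k in range(hits, upper + 1))
    return tail / total

# Hypothetical counts: 200 measured metabolites, 20 of them changed,
# and 5 of the 10 measured members of one pathway are among the changed.
print(enrichment_pvalue(hits=5, pathway_size=10, selected=20, background=200))
```

When many pathways are tested, the resulting p-values need the same multiple-testing adjustment as any other set of hypothesis tests.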
Biochemical Network Mapping
doi:10.1186/1471-2105-13-99
Structural Similarity
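Structural similarity between metabolites is often measured with the Tanimoto coefficient computed on substructure fingerprints, and pairs above a cutoff become edges in a network. The sketch below assumes fingerprints are given as sets of "on" bit positions. The metabolite names, the fingerprints, and the 0.7 cutoff are all hypothetical placeholders.

```python
from itertools import combinations

def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: shared bits divided by bits set in either."""
    union = bits_a | bits_b
    if not union:
        return 0.0
    return len(bits_a & bits_b) / len(union)

def similarity_edges(fingerprints, threshold=0.7):
    """Return (name_a, name_b, score) for each pair at or above the threshold."""
    edges = []
    for (name_a, bits_a), (name_b, bits_b) in combinations(fingerprints.items(), 2):
        score = tanimoto(bits_a, bits_b)
        if score >= threshold:
            edges.append((name_a, name_b, score))
    return edges

# Hypothetical fingerprints: each set lists the substructure bits that are "on".
toy_fingerprints = {
    "metabolite_A": {1, 2, 3, 4, 5},
    "metabolite_B": {1, 2, 3, 4, 6},
    "metabolite_C": {7, 8, 9},
}
for a, b, score in similarity_edges(toy_fingerprints, threshold=0.6):
    print(f"{a} -- {b}: {score:.2f}")
```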
Data visualization as a form of analysis
Dextromethorphan (DM) → dextrorphan (liver CYP2D6)
Additives:
• high fructose corn syrup
• antioxidants
• flavor
Identification of relationships between altered metabolites
[Figure labels: urea cycle, nucleotide synthesis, protein glycosylation]
Identification of treatment effects
Analysis of differential metabolic responses
[Figure panels: Treatment 1, Treatment 2]
Resources
• DeviumWeb: dynamic multivariate data analysis and visualization platform
  url: https://github.com/dgrapov/DeviumWeb
• imDEV: Microsoft Excel add-in for multivariate analysis
  url: http://sourceforge.net/projects/imdev/
• MetaMapR: network analysis tools for metabolomics
  url: https://github.com/dgrapov/MetaMapR
• TeachingDemos: tutorials and demonstrations
  url: http://sourceforge.net/projects/teachingdemos/?source=directory
  url: https://github.com/dgrapov/TeachingDemos
• CDS Blog: data analysis case studies
  url: http://imdevsoftware.wordpress.com/
metabolomics.ucdavis.edu
This research was supported in part by NIH 1 U24 DK097154