Introduction to Metabolomic Data Analysis
Dmitry Grapov, PhD
Introduction
Important
• This is an introduction to a series of 8 tutorials for metabolomic data analysis
• Download all the required files and software here:
https://sourceforge.net/projects/teachingdemos/files/Winter%202014%20LC-MS%20and%20Statistics%20Course/
• Then follow the directions in software/startup.R to launch all accompanying software
Goals?
Analysis at the Metabolomic Scale
Cycle of Scientific Discovery
[Figure: cycle of scientific discovery with stages Hypothesis, Data Acquisition, Data Processing, Data, Data Analysis, and Hypothesis Generation]
Univariate vs. Multivariate
• Univariate: hypothesis testing (t-test, ANOVA, etc.)
• Multivariate: PCA
• Predictive modeling: O-/PLS/-DA
[Figure: Group 1 vs. Group 2 under each approach]
Univariate vs. Multivariate
• univariate/bivariate vs. multivariate views of the same samples
• mixed-up samples? outliers?
Data Analysis Goals
Exploration
• Are there any trends in my data?
– analytical sources
– meta data/covariates
• Useful Methods
– matrix decomposition (PCA, ICA, NMF)
– cluster analysis
Classification
• Differences/similarities between groups?
– discrimination, classification, significant changes
• Useful Methods
– analysis of variance (ANOVA), mixed effects models
– partial least squares discriminant analysis (O-/PLS-DA)
– Others: random forest, CART, SVM, ANN
Prediction
• What is related to or predictive of my variable(s) of interest?
– regression, correlation
• Useful Methods
– correlation
– partial least squares (O-/PLS)
Data Complexity
[Figure: data matrix of n samples by m variables, with complexity increasing from 1-D to 2-D to m-D; labels: Data, Meta Data, Experimental Design]
Variable # = dimensionality
Univariate Qualities
• length (sample size)
• center (mean, median, geometric mean)
• dispersion (variance, standard deviation)
• range (min/max)
• quantiles
• shape (skewness, kurtosis, normality, etc.)
[Figure: distribution annotated with its mean and standard deviation]
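The univariate qualities listed above can be computed for a single variable with a few lines of code. The following is a minimal sketch using only the Python standard library; the function name `univariate_summary` and the intensity values are hypothetical and chosen only for illustration.

```python
import statistics


def univariate_summary(values):
    """Summarize one variable: length, center, dispersion, range, quantiles, shape."""
    n = len(values)
    mean = statistics.mean(values)
    pop_sd = statistics.pstdev(values)  # population SD, used only for skewness
    # Skewness as the mean cubed z-score (simple moment estimator).
    skewness = sum(((v - mean) / pop_sd) ** 3 for v in values) / n
    return {
        "length": n,
        "mean": mean,
        "median": statistics.median(values),
        "geometric_mean": statistics.geometric_mean(values),  # needs positive values
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "standard_deviation": statistics.stdev(values),
        "range": (min(values), max(values)),
        "quartiles": statistics.quantiles(values, n=4),
        "skewness": skewness,
    }


# Hypothetical metabolite intensities for eight samples (illustrative only).
intensities = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
summary = univariate_summary(intensities)
for name, value in summary.items():
    print(name, value)
```

A positive skewness, as in this example, indicates a longer right tail, which is one reason a log transformation is often considered before parametric tests.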
Data Quality
Metrics
• Precision
• Accuracy
Remedies
• normalization
• outlier detection
*Start lab 1-statistical analysis
Univariate Analyses
• Identify differences in sample population means
• sensitive to distribution shape
• parametric = assumes normality
• error in Y, not in X (Y = mX + error)
• optimal for long data
• assumed independence
• false discovery rate (FDR)
[Figure: data shapes: long, wide, n-of-one]
• Type I Error: False Positives
• Type II Error: False Negatives
• Type I risk = $1 - (1 - \text{p.value})^{m}$
m = number of variables tested
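The Type I risk expression above can be evaluated directly to see how quickly false positives accumulate as more variables are tested. This is a minimal sketch that assumes the m tests are independent; the function name `familywise_type1_risk` is hypothetical.

```python
def familywise_type1_risk(p_value, m):
    """Chance of at least one false positive across m independent tests."""
    return 1.0 - (1.0 - p_value) ** m


# Risk at a per-test threshold of 0.05 for an increasing number of variables.
for m in (1, 10, 100, 1000):
    print(m, familywise_type1_risk(0.05, m))
```

With hundreds of metabolites tested at a per-test threshold of 0.05, at least one false positive is nearly certain, which motivates the corrections discussed next.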
FDR correction
• p-value adjustment or estimate of FDR (Fdr, q-value)
False Discovery Rate (FDR)
Bioinformatics (2008) 24 (12):1461-1462
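One widely used p-value adjustment is the Benjamini-Hochberg procedure. The sketch below implements that procedure as an illustration; it is not the method of the reference cited above, and the function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four metabolites.
p_values = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(p_values)
print(adjusted)
```

Each adjusted value can be read as the smallest FDR level at which that variable would still be declared significant.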
Achieving “significance” is a function of:
• significance level (α) and power (1-β)
• effect size (standardized difference in means)
• sample size (n)
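The relationship between these quantities can be made concrete with a standardized effect size (Cohen's d) and a normal-approximation sample size formula for a two-sided, two-sample comparison. This is a rough sketch under those assumptions, not an exact power calculation; the function names and default values are hypothetical.

```python
import math
import statistics
from statistics import NormalDist


def cohens_d(group1, group2):
    """Standardized difference in means using the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    pooled_var = (
        (n1 - 1) * statistics.variance(group1)
        + (n2 - 1) * statistics.variance(group2)
    ) / (n1 + n2 - 2)
    return (statistics.mean(group1) - statistics.mean(group2)) / math.sqrt(pooled_var)


def samples_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group (normal approximation, two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


d = cohens_d([1.0, 2.0, 3.0], [3.0, 4.0, 5.0])
print("effect size:", d)
for effect in (0.5, 1.0, 2.0):
    print(effect, samples_per_group(effect))
```

Halving the effect size roughly quadruples the required sample size, so small metabolic changes need much larger studies to reach the same power.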
*finish lab 1-statistical analysis
Clustering
• Identify
– patterns
– group structure
– relationships
• Evaluate/refine hypothesis
• Reduce complexity
Artist: Chuck Close
Cluster Analysis
Use the concept of similarity/dissimilarity to group a collection of samples or variables
Approaches
• hierarchical (HCA)
• non-hierarchical (k-NN, k-means)
• distribution (mixture models)
• density (DBSCAN)
• self organizing maps (SOM)
[Figure panels: Linkage, k-means, Distribution, Density]
Hierarchical Cluster Analysis
• similarity/dissimilarity defines “nearness” or distance
[Figure panels: distance between X and Y under euclidean, manhattan, Mahalanobis, and non-euclidean metrics]
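The choice of distance changes which samples count as "near". The sketch below compares Euclidean, Manhattan, and Mahalanobis distances for two points; the Mahalanobis version is restricted to two dimensions so the covariance matrix can be inverted by hand, and all function names and values are hypothetical.

```python
import math


def euclidean(x, y):
    """Straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def mahalanobis_2d(x, y, cov):
    """Distance scaled by a 2 x 2 covariance matrix ((a, b), (c, d))."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - y[0], x[1] - y[1]
    # Quadratic form diff^T * inverse(cov) * diff written out for 2 x 2.
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)


x, y = (0.0, 0.0), (3.0, 4.0)
print(euclidean(x, y), manhattan(x, y))
# With an identity covariance, Mahalanobis reduces to Euclidean distance.
print(mahalanobis_2d(x, y, ((1.0, 0.0), (0.0, 1.0))))
```

Because Mahalanobis distance rescales each direction by its variance, a difference along a highly variable axis counts for less than the same difference along a stable one.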
Hierarchical Cluster Analysis
[Figure panels: single, complete, centroid, and average linkage]
Agglomerative/linkage algorithm defines how points are grouped
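To illustrate how the linkage rule controls grouping, here is a naive agglomerative clustering sketch supporting single, complete, and average linkage (centroid linkage is omitted for brevity). It assumes Euclidean distance; the function names and sample points are hypothetical, and the quadratic search is only suitable for small examples.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def linkage_distance(cluster_a, cluster_b, points, method):
    """Distance between two clusters of point indices under a linkage rule."""
    pairwise = [euclidean(points[i], points[j]) for i in cluster_a for j in cluster_b]
    if method == "single":
        return min(pairwise)  # nearest members
    if method == "complete":
        return max(pairwise)  # farthest members
    if method == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown linkage: " + method)


def agglomerate(points, n_clusters, method="average"):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage_distance(clusters[a], clusters[b], points, method)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return [sorted(c) for c in clusters]


# Hypothetical samples described by two variables.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
for method in ("single", "complete", "average"):
    print(method, agglomerate(points, 3, method))
```

Recording the distance at each merge, rather than only the final groups, is what yields the heights drawn in a dendrogram.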
Dendrograms
[Figure: dendrograms with a similarity axis; panels: Exploration, Confirmation]
How does my metadata match my data structure?
Hierarchical Cluster Analysis
*finish lab 2-Cluster Analysis
Projection of Data
The algorithm defines the position of the light source
Principal Components Analysis (PCA)
• unsupervised
• maximize variance (X)
Partial Least Squares Projection to Latent Structures (PLS)
• supervised
• maximize covariance (Y ~ X)
James X. Li, 2009, VisuMap Tech.
Interpreting PCA Results
Variance explained (eigenvalues)
Row (sample) scores and column (variable) loadings
How are scores and loadings related?
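One way to see how eigenvalues, scores, and loadings fit together is to compute the first principal component by hand. The sketch below uses power iteration on the covariance matrix, which is a simple stand-in for the SVD used by most software; the function names and the small data set are hypothetical.

```python
import math


def center_columns(rows):
    """Subtract each column (variable) mean."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def covariance_matrix(centered):
    n, p = len(centered), len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def first_component(cov, iterations=500):
    """Leading eigenvector (loadings) and eigenvalue by power iteration."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)
    )
    return v, eigenvalue


# Hypothetical data: four samples (rows), two perfectly correlated variables.
data = [[2.0, 1.0], [4.0, 2.0], [6.0, 3.0], [8.0, 4.0]]
centered = center_columns(data)
cov = covariance_matrix(centered)
loadings, eigenvalue = first_component(cov)
# Scores are the centered data projected onto the loadings.
scores = [sum(x * l for x, l in zip(row, loadings)) for row in centered]
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
print(loadings, scores, explained)
```

The scores are the centered data multiplied by the loadings, and the variance of the scores equals the eigenvalue, which is why eigenvalues report variance explained.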
Centering and Scaling
PMID: 16762068
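Centering and scaling decide how much each variable contributes to a multivariate model. The sketch below shows three common choices (mean centering, autoscaling, and Pareto scaling) applied column-wise; the function names and the example matrix are hypothetical, and the sketch assumes no variable has zero variance.

```python
import math


def column_stats(rows):
    """Mean and sample standard deviation of each column (variable)."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    sds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / (n - 1))
        for col, m in zip(columns, means)
    ]
    return means, sds


def scale(rows, method="auto"):
    """Center each column, then divide by 1, the SD, or sqrt(SD)."""
    means, sds = column_stats(rows)
    if method == "center":
        divisors = [1.0] * len(means)
    elif method == "auto":
        divisors = sds  # unit variance for every variable
    elif method == "pareto":
        divisors = [math.sqrt(s) for s in sds]  # milder than autoscaling
    else:
        raise ValueError("unknown method: " + method)
    return [[(v - m) / d for v, m, d in zip(row, means, divisors)] for row in rows]


# Hypothetical data: a low-abundance and a high-abundance metabolite.
data = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
for method in ("center", "auto", "pareto"):
    print(method, scale(data, method))
```

Without scaling, the high-abundance variable would dominate a variance-based method such as PCA simply because of its units.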
*finish lab 3-Principal Components Analysis
Use PLS to test a hypothesis
[Figure: samples at time = 0 and 120 min.]
Partial Least Squares (PLS) is used to identify planes of maximum covariance between X measurements and Y (hypothesis)
[Figure panels: PCA, PLS]
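The supervised nature of PLS can be sketched by computing a single latent variable for one response (PLS1): the weight vector is the direction in X with maximum covariance with Y. This is a minimal illustration without deflation, orthogonal correction, or validation; the function name and the data are hypothetical.

```python
import math


def pls1_first_component(X, y):
    """Weights, scores, and fitted values for the first PLS1 latent variable."""
    n, p = len(X), len(X[0])
    col_means = [sum(col) / n for col in zip(*X)]
    Xc = [[v - m for v, m in zip(row, col_means)] for row in X]
    y_mean = sum(y) / n
    yc = [v - y_mean for v in y]
    # Weights: covariance of each X variable with y, scaled to unit length.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    # Scores: projection of each sample onto the weight vector.
    t = [sum(x * wj for x, wj in zip(row, w)) for row in Xc]
    # Inner regression of y on the scores.
    b = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    fitted = [y_mean + b * ti for ti in t]
    return w, t, fitted


# Hypothetical data: variable 1 tracks y, variable 2 is unrelated to y.
X = [[1.0, 1.0], [2.0, -1.0], [3.0, -1.0], [4.0, 1.0]]
y = [2.0, 4.0, 6.0, 8.0]
weights, scores, fitted = pls1_first_component(X, y)
print(weights, fitted)
```

The unrelated variable receives a weight near zero, whereas PCA would weight it by its variance regardless of Y.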
Modeling multifactorial relationships
dynamic changes among groups ~ two-way ANOVA
PLS Related Objects
Model
• dimensions, latent variables (LV)
• performance metrics (Q2, RMSEP, etc.)
• validation (training/testing, permutation, cross-validation)
• orthogonal correction
Samples
• scores
• predicted values
• residuals
Variables
• Loadings
• Coefficients, summary of loadings based on all LVs
• VIP, variable importance in projection
• Feature selection
“goodness” of the model is all about the perspective
Determine in-sample (Q2) and out-of-sample error (RMSEP) and compare to a random model
• permutation tests
• training/testing
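The comparison to a random model can be sketched as follows. For brevity, a leave-one-out simple linear regression stands in for the PLS model (an assumption of this example, not the tutorial's method); Q2 and RMSEP are computed from held-out predictions, and a permutation test asks how often shuffled Y values score as well. All function names and data are hypothetical.

```python
import math
import random


def rmsep(observed, predicted):
    """Root mean squared error of prediction."""
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )


def q2(observed, predicted):
    """1 - PRESS / total sum of squares, using held-out predictions."""
    mean = sum(observed) / len(observed)
    press = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    total = sum((o - mean) ** 2 for o in observed)
    return 1.0 - press / total


def loo_predictions(x, y):
    """Leave-one-out predictions from a simple linear regression (stand-in model)."""
    preds = []
    for k in range(len(x)):
        xs, ys = x[:k] + x[k + 1:], y[:k] + y[k + 1:]
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sum(
            (a - mx) ** 2 for a in xs
        )
        preds.append(my + slope * (x[k] - mx))
    return preds


def permutation_p_value(x, y, n_permutations=199, seed=0):
    """Fraction of Y permutations whose Q2 is at least the observed Q2."""
    rng = random.Random(seed)
    observed = q2(y, loo_predictions(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if q2(shuffled, loo_predictions(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


# Hypothetical predictor and response with a strong linear relationship.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]
predictions = loo_predictions(x, y)
q2_value = q2(y, predictions)
p_value = permutation_p_value(x, y)
print(q2_value, rmsep(y, predictions), p_value)
```

A model is only convincing when its cross-validated performance clearly exceeds what the same procedure achieves on permuted responses.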
*finish lab 4-Partial Least Squares and lab 5-Data Analysis Case Study
Biological Interpretation
• Visualization
• Enrichment
• Networks
– biochemical
– structural
– spectral
– empirical
Projection or mapping of analysis results into a biological context.
Organism specific biochemical relationships and information
Multiple organism DBs
• KEGG
• BioCyc
• Reactome
Human
• HMDB
• SMPDB
Identification of alterations in biochemical domains
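A common way to test whether a biochemical domain is altered is over-representation analysis with a hypergeometric test. The sketch below assumes that approach; the function name and the counts are hypothetical, and real analyses would take pathway membership from one of the databases listed above.

```python
from math import comb


def hypergeometric_enrichment_p(n_background, n_in_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) if n_selected metabolites were drawn at random."""
    upper = min(n_in_pathway, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_in_pathway, k) * comb(n_background - n_in_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 10 measured metabolites, 4 in the pathway,
# 3 significantly changed, 2 of those in the pathway.
p = hypergeometric_enrichment_p(10, 4, 3, 2)
print(p)
```

The background should be the set of metabolites that could have been detected, not the whole database, otherwise the p-values are too optimistic.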
*finish lab 6-Metabolite Enrichment Analysis
1. Generate Connections
2. Calculate Mappings
3. Create Network
Grapov D., Fiehn O., Multivariate and network tools for analysis and visualization of metabolomic data, ASMS, June 08, 2013, Minneapolis, MN
Network Mapping
Connections and Contexts
Biochemical (substrate/product)
• Database lookup
• Web query
Chemical (structural or spectral similarity)
• fingerprint generation
Empirical (dependency)
• correlation, partial-correlation
BMC Bioinformatics 2012, 13:99 doi:10.1186/1471-2105-13-99
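Empirical connections can be generated by correlating metabolite profiles across samples and keeping the strong pairs as edges. This is a minimal sketch using Pearson correlation with a hard threshold; partial correlation is not shown, and the function names, threshold, and profiles are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_edges(profiles, threshold=0.8):
    """Edge list (a, b, r) for metabolite pairs with |r| >= threshold."""
    edges = []
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        if abs(r) >= threshold:
            edges.append((a, b, r))
    return edges


# Hypothetical intensities for four metabolites across five samples.
profiles = {
    "glucose": [1.0, 2.0, 3.0, 4.0, 5.0],
    "lactate": [2.1, 3.9, 6.2, 8.0, 9.8],
    "alanine": [5.0, 4.1, 2.9, 2.0, 1.2],
    "citrate": [3.0, 1.0, 4.0, 1.0, 3.0],
}
edges = correlation_edges(profiles)
for a, b, r in edges:
    print(a, b, r)
```

Plain correlation also links metabolites that merely share a common driver, which is the motivation for partial correlation as a stricter measure of dependency.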
Mapping Analysis Results
[Figure panels: Analysis results, Network Annotation, Mapped Network]
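Mapping amounts to joining per-metabolite statistics onto network nodes and translating them into visual attributes. The sketch below shows one possible encoding (color for direction of change, size for magnitude); the attribute names, colors, cutoff, and data are all hypothetical choices for illustration.

```python
import math


def map_results_to_network(edges, results, p_cutoff=0.05):
    """Attach display attributes to every node that appears in the edge list."""
    nodes = sorted({name for edge in edges for name in edge})
    mapped = {}
    for name in nodes:
        stats = results.get(name)
        if stats is None or stats["p_value"] > p_cutoff:
            # Unmeasured or non-significant nodes stay neutral.
            mapped[name] = {"color": "gray", "size": 1.0}
            continue
        color = "red" if stats["fold_change"] > 1 else "blue"
        # Node size grows with the magnitude of the log2 fold change.
        size = 1.0 + abs(math.log2(stats["fold_change"]))
        mapped[name] = {"color": color, "size": size}
    return mapped


# Hypothetical network connections and analysis results.
edges = [("glucose", "lactate"), ("lactate", "alanine"), ("alanine", "pyruvate")]
results = {
    "glucose": {"fold_change": 2.0, "p_value": 0.01},
    "lactate": {"fold_change": 0.5, "p_value": 0.03},
    "alanine": {"fold_change": 1.1, "p_value": 0.60},
}
mapped = map_results_to_network(edges, results)
for node, attributes in mapped.items():
    print(node, attributes)
```

Keeping the statistics and the network annotation in separate tables makes it easy to re-map the same network with results from a different comparison.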
*finish lab 7-Network Mapping I
Biochemical Relationships
http://www.genome.jp/dbget-bin/www_bget?rn:R00975
Structural Similarity
http://pubchem.ncbi.nlm.nih.gov//score_matrix/score_matrix.cgi
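Structural similarity between two molecules is commonly scored with the Tanimoto coefficient of their substructure fingerprints. The sketch below assumes each fingerprint is given as the set of "on" bit positions; the function name and the bit sets are hypothetical and do not correspond to real compounds.

```python
def tanimoto(fingerprint_a, fingerprint_b):
    """Shared bits divided by the bits set in either fingerprint."""
    a, b = set(fingerprint_a), set(fingerprint_b)
    union = a | b
    if not union:
        return 0.0  # two empty fingerprints: define similarity as zero
    return len(a & b) / len(union)


# Hypothetical fingerprints given as positions of the bits that are set.
molecule_1 = {1, 2, 3, 4}
molecule_2 = {3, 4, 5, 6}
similarity = tanimoto(molecule_1, molecule_2)
print(similarity)
```

Pairs scoring above a chosen cutoff can then be added to the network as structural similarity edges.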
Mass Spectral Connections
Watrous J et al. PNAS 2012;109:E1743-E1752
*finish lab 8-Network Mapping II