Big Data Training for Translational Omics Research BigTaP ... 1 day 1-Liu.pdf · In Translational...
Big Data Training for Translational Omics Research
BigTaP: Week II
Wanqing Liu, PhD
Assistant Professor
Department of Medicinal Chemistry and Molecular Pharmacology
Purdue University
Min Zhang, MD, PhD
Professor
Department of Statistics
Purdue University
Learning Objectives
• Main Topics:
– Biomarker discovery and development using omics data
– GWAS for binary and quantitative traits
• By taking this course, you will be able to
– Reinforce what you have learned in Week I
– Understand the principles of biomarker discovery and GWAS
  • Study design
  • Process
  • Potential challenges
– Understand the basic operational steps for translational omics data analysis
  • Data downloading
  • Data processing
  • Data analysis
  • Data visualization
  • Validation
  • Result interpretation
– Understand the literature in relevant areas
– Discuss with biostatisticians and bioinformaticians
Teaching Logistics
• http://www.stat.purdue.edu/bigtap/schedule.html
• Lecture → operation → lecture → operation…
• Online resources: database and tools
• Case study:
– http://www.ncbi.nlm.nih.gov/pubmed/15193263
– http://www.ncbi.nlm.nih.gov/pubmed/27010727
– You’ll be tested!
• Homework:
– Transcriptome biomarker: case-control/continuous
– GWAS: case-control/continuous
• Grouping
Session 1: Principles of Biomarker Discovery and Development in Translational Medicine
Liu
Day 1, Session I, 8-10am
Philosophy of Translational Research
• As a biomedical researcher, how can I produce something that benefits patients?
• I work with cell lines and mice. How can the omics approach help me understand the mechanism, especially causality?
• Can the key molecule(s) I identified in cells and animals be used in humans?
Lab researchers, grant writers, physicians…
Key Words
• Biomarker: A characteristic that is objectively measured and evaluated as an indicator of normal biologic processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention.
NIH Biomarkers Definition Working Group
• Translational: Translational research aims to aid in the transformation of biological knowledge into solutions that can be applied in a clinical setting.
Atkinson, et al., Clin Pharm Ther, 2001.
Azuaje F. Bioinformatics and Biomarker Discovery, 2010.
Why Biomarker?
A Core Question in Modern Medicine: How to Address Patient Heterogeneity?
Patient Heterogeneity
Biomarker → Personalized Medicine
Patient group | Drug | Response rate (RR)
CML patients | Gleevec | 90%
All breast cancer patients | Herceptin | 10–15%
HER2+ breast cancer patients | Herceptin | 35–45%
All NSCLC patients | Iressa | 10–15%
EGFR MT+ NSCLC patients | Iressa | 60–70%
Slamon et al. NEJM 2001; Kantarjian et al. NEJM 2002; Vogel et al. JCO 2002. 20:3; Douillard et al. JCO 2010.
Biomarkers are especially important in diseases with low response rates in the overall population
Biomarker → Personalized Medicine
[Figure: discovery, drug development and implementation. Cancer targets: EGFR, KRAS, ALK, HER2, BRAF. Precision molecules: Gefitinib, ARS-853?, Crizotinib, Herceptin, Vemurafenib. Other common diseases: GeneA, GeneB, GeneC, GeneD, GeneE.]
Precision Medicine
To deliver the right treatment to the right patient with the right dose and at the right time
Clinical Application of Biomarker
• Deal with patient heterogeneity
– Early risk assessment
– Disease prevention
– Assist diagnosis
– Optimize treatment: high effectiveness, low risk
– Match the patient to therapeutic strategy
– Monitor therapy success/disease recurrence
– Long-term management
[Figure labels: Risk, Diagnosis, Treatment, Monitoring]
Biomarker in Preclinical Studies
• To characterize the phenotype
• To monitor the response
• To identify potential translational biomarkers for humans
Omics Approach in Basic Research
• Explore molecular mechanism
• Hypothesis generation
• Identify therapeutic targets and strategies
• Establish intermediate phenotypes
Types of Biomarkers
• Prognostic marker (a): before treatment
• Predictive marker (b): before treatment
• Pharmacodynamic marker (c): after treatment
• Surrogate marker (d): during treatment
Gosho, et al. Sensors 2012, 12, 8966-8986
Prognostic Marker
• Signature separates a population with respect to the outcome (risk)
• Regardless of the types of therapies or treatments
– Markers associated with overall survival regardless of treatment
• Distinguish outcome (poor or good) following the test and standard treatments
• Cannot guide the choice of a particular treatment
• Can determine the aggressiveness of treatment
Ballman KL, JCO. 2015.63.3651
Predictive Biomarker
• Predicts the differential outcome of a particular therapy or treatment
• Prospectively identifies patients who are likely to have a favorable clinical outcome from a specific treatment
• Therefore, a predictive biomarker can guide the choice of treatment
Ballman KL, JCO. 2015.63.3651
Prognostic and Predictive Markers
• A biomarker can predict both disease susceptibility or progression and certain treatment outcomes
– ER status and breast cancer: prognostic
– ER status and antiestrogen therapy: predictive
• Be careful about the word “prediction”
Ballman KL, JCO. 2015.63.3651
Pharmacodynamic Markers
• PD biomarkers provide information about the pharmacologic effects of a drug on its target
• Measured after treatment
• A clinical endpoint to be measured
• Application:
– Proof of mechanism: i.e., does the drug hit its intended target?
– Proof of concept: i.e., does hitting the drug target alter the biology of the tumor?
– Selection of optimal biologic dosing
– Understanding response/resistance mechanisms
• Examples:
– Protein phosphorylation markers, e.g. p-EGFR, p-ERK, to evaluate changes in target protein phosphorylation or the activation status of downstream signaling/adapter molecules
– Apoptosis (TUNEL assay) to assess pharmacologic effect on cell death
Surrogate Biomarker
• Substitute for a clinical endpoint
• Expected to predict clinical benefit (or lack of benefit, or harm) based on epidemiologic, therapeutic, pathophysiologic, or other scientific evidence
• During or after treatment
• Examples:
– Glucose level for monitoring the treatment of diabetes
– Imaging-based measurement for anti-cancer therapy
Questions
• What kind of biomarker is HOXB13:IL17BR in the first paper?
• What kind of biomarker is blood concentration of R-/S-methadone?
Examples of FDA Approved Biomarkers
Gosho, et al. Sensors 2012, 12, 8966-8986
Biomarker Discovery and Development in the Omics Era
[Figure: timeline with points at 1970s, 1980s, 1990s, >2005]
Biomarker Discovery and Development in the Omics Era
• Genomics
• Transcriptomics
• miRNomics
• lncRNomics
• Epigenomics
• Proteomics
• Metabolomics
• Lipidomics
• Exposomics
Prognostic-Diagnostic Markers
• Genes for ~50% of rare diseases identified
Nature Reviews Genetics 14, 681–691 (2013)
Prognostic-Diagnostic Markers
• 11,907 SNPs strongly associated with common diseases
Pharmacogenomic Markers
• 166 FDA approved PGx markers for drug treatment
Transcriptomic Biomarkers
• MammaPrint test
– Agendia
– 70-gene signature for breast cancer prognosis
• Oncotype Dx test
– Genomic Health
– 21 gene-expression biomarkers for predicting recurrence in breast cancer patients, and predicting response to both chemotherapy and radiation therapy
• H/I test
– AviaraDx
– 2-gene signature used to estimate the risk of recurrence and response to therapy in breast cancer patients
Biomarker Development Pipeline
• Stages: Discovery → Confirmation → Assay development → Validation/Refinement → Clinical Validation → Clinical Adoption
• Technical development:
– Discovery platforms: Genomics, Transcriptomics, Proteomics, Metabolomics, Lipidomics, Epigenomics, Exposomics, Imaging
– Integrated technologies and platforms; multi-analyte assays
– Robust validated assays; clinical grade assays; accurate, specific, reproducible, reliable
– Clinical grade assays; instruments
• Other stage labels in the figure: Target selection, Lead identification, Preclinical, Retrospective, Clinical trials, Marketing, clinical use
• Figure axes: number of analytes, number of samples
https://is.muni.cz
Institute of Medicine Roadmap for omics-based tumor biomarker test development
Hayes BMC Medicine 2013, 11:221
Data Acquisition Strategies
• Retrospective:
– Clinical samples collected before the design of the biomarker study, and before comparison with control samples
– Looks back at past, recorded data to find evidence of marker-disease relationships
– Inexpensive, rapid
– Potentially biased, noisy
– Weak evidence
• Prospective:
– The biomarker-based prediction or classification model is applied to patients at the time of patient enrolment
– Clinical outcomes or disease occurrence are unknown at the time of enrolment
– Less biased
– Strong evidence
– Expensive, time-consuming
• Pro-retrospective
FDA approval!!
Study Design Considerations
• Biomarker discovery studies require careful planning and design
• Study style: retrospective, prospective, pro-retrospective
• Sample collection
• Phenotype
• Sample size and power estimation
• Other covariates
• Data collection
• Platform
• Replication, validation and application
• Data analysis plan
Sample Collection, Assay Design, Data Analysis Plan
• Establish methods
– Specimen collection
– Processing
– Storage
• Establish criteria
– Quantity and quality
– Minimum amount
• Feasibility
– Obtaining specimens
• Assay design
– Communication with core/service provider
• Data analysis
– Communication with biostatistician and bioinformatician
Sample and Materials
• Biospecimen
– Tissue
– Blood
– Oral swab
– Hair
– Tear
– Urine
– Saliva
– Feces
– …
• Test materials
– DNA
– RNA
– Protein
– Small molecules
– Lipids
• Principles:
– Non-invasive
– Reproducible
– Reliable
– Specific
– Accurate
– Inexpensive
– Point-of-care
[Figure label: invasiveness]
Ethical, Legal, and Regulatory Issues
• Establish communication with regulatory agencies, e.g. IRB, FDA
• Regulatory approvals
• Documents:
– Informed consent
– Study protocol
• Intellectual property issues
• CLIA-lab based test for clinical trials involving patient selection
Sample Size and Power Estimation
• Power setting: 0.8
• Statistical significance:
– Discovery: multiple hypotheses (p corrected according to the number of tests)
– Validation: usually one hypothesis (p<0.05)
• Input parameters: previous publication or pilot study
• Online tools:
– piface.jar by Lenth (2006)
  • http://homepage.stat.uiowa.edu/~rlenth/Power/
– Microarray power/sample size estimation
  • http://bioinformatics.mdanderson.org/MicroarraySampleSize/
– RNA-seq data:
  • Scotty: http://bioinformatics.bc.edu/marthlab/scotty/scotty.php
  • RnaSeqSampleSize: https://cqs.mc.vanderbilt.edu/shiny/RnaSeqSampleSize/
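As a rough cross-check on the online tools, the per-group sample size for a two-group comparison can be approximated with the normal-approximation formula n = 2(z_(1-α/2) + z_power)² / d². The sketch below assumes a two-sided test, equal group sizes, and a standardized effect size d that would normally come from a publication or pilot study. The function name and the example effect size are hypothetical, and dedicated tools remain preferable for microarray or RNA-seq designs.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(effect_size, alpha=0.05, power=0.8, n_tests=1):
    """Approximate per-group n for a two-sided, two-group comparison.

    Uses the normal approximation n = 2 * (z_(1-alpha/2) + z_power)^2 / d^2.
    alpha is Bonferroni-corrected for n_tests hypotheses (discovery setting).
    """
    corrected_alpha = alpha / n_tests
    z_alpha = NormalDist().inv_cdf(1 - corrected_alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


# Validation: one hypothesis at p < 0.05, power 0.8, hypothetical effect size d = 1.0
n_validation = samples_per_group(1.0)
# Discovery: same effect size, but corrected for 1000 tests
n_discovery = samples_per_group(1.0, n_tests=1000)
print(n_validation, n_discovery)
```

The discovery setting needs noticeably more samples than the validation setting because the corrected significance threshold is much stricter.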
Key Principles: Big Data in Biomarker
• Phenotype (“Digits”) × Molecular Profiles (“Digits”)
• Linked by: Statistics, Bioinformatics, Network, …
Always Start Your Design and Analysis From Data Evaluation!
• What kind of phenotypic and marker data do I have/should I use/collect?
• Are my data normally distributed?
• What kind of models should I choose?
• What factors may possibly confound my analyses?
• How do covariate data correlate with my phenotype?
Phenotype to Digits
• Nominal data: no order
– Yes or no (binary): disease vs normal, response vs no response
– Cancer type: breast, lung, colon…
• Ordinal data: some order
– Pathologic: tumor stage: I, II, III
– Disease progression: no, mild, severe, death
• Continuous data:
– Glucose level, LDL, drug concentration, gene expression
• Survival data: time to event
– Death, occurrence of disease, onset of toxicity, in hr, day, wk, month, yr, etc.
Molecular Data Collection
• Platforms: Genomics, Transcriptomics, miRNomics, lncRNomics, Epigenomics, Proteomics, Metabolomics, Lipidomics
• Platform → Raw data → “Digits”
– Ordinal data: 0, 1, 2
– Continuous variables: -1.2, -1.1, 0.58, 1.09, 2.34…
Basic Statistical Methods
Phenotype (numerical data) × Molecular Profiles (numerical data)
• Phenotype data types: nominal, ordinal, continuous, survival
• Molecular profile data types: nominal, ordinal, continuous
• Descriptive and exploratory association: Chi-square test, t-test, ANOVA, correlation, log rank
• Statistical models
Basic Statistical Methods
• Continuous data
– Normally distributed: parametric method
– Non-normal distribution/ordinal data: non-parametric method
  • Winsorization
  • Log transformation: log2
Parametric | Non-parametric
t-test | Mann-Whitney rank-sum test
Paired t-test | Wilcoxon signed-rank test
ANOVA | Kruskal-Wallis test
Pearson correlation | Spearman correlation
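The two transformations named on this slide can be sketched in a few lines. This is a minimal illustration, assuming percentile-based winsorization at the 5th and 95th percentiles and a pseudocount of 1 before taking log2. Both settings and the expression values are hypothetical choices, not taken from the course material.

```python
from math import log2


def winsorize(values, lower_pct=0.05, upper_pct=0.95):
    """Clamp values outside the given percentiles to the percentile values."""
    ordered = sorted(values)
    n = len(ordered)
    low = ordered[int(lower_pct * (n - 1))]
    high = ordered[int(upper_pct * (n - 1))]
    return [min(max(v, low), high) for v in values]


def log2_transform(values, pseudocount=1.0):
    """Return log2(x + pseudocount); the pseudocount avoids log2(0)."""
    return [log2(v + pseudocount) for v in values]


# Hypothetical expression values with one extreme outlier
expression = [3, 5, 4, 6, 5, 4, 7, 5, 6, 4, 5, 6, 3, 5, 4, 6, 5, 7, 4, 1023]
clipped = winsorize(expression)
transformed = log2_transform(expression)
print(max(clipped), transformed[-1])
```

Winsorization pulls the outlier back to the upper percentile value, while the log2 transform keeps the ordering but compresses the scale.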
Statistical Models
• Univariate models
– Logistic regression: binary/categorical phenotype
– Linear regression: continuous phenotype
– Kaplan-Meier (KM) method: survival phenotype
• Multivariate models
– Multivariate regressions: linear or logistic
– Cox regression: survival phenotype
• Other sophisticated models
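For the survival phenotype, the Kaplan-Meier estimate multiplies, at each event time, the fraction of at-risk subjects who survive that time. The sketch below is a minimal pure-Python version under the usual convention that subjects censored at a given time are still counted as at risk at that time. The function name and the follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times  : follow-up time per subject
    events : 1 if the event (e.g. death) was observed, 0 if censored
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both events and censored subjects leave the risk set after time t.
        at_risk -= leaving
    return curve


# Hypothetical follow-up (months) for six subjects; 0 marks a censored subject
times = [2, 3, 3, 5, 8, 8]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```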
Multiple Testing Issue
• Example
– P value cutoff = 0.05
– 1000 genes: 50 genes expected by chance (error) at this significance level
– If 60 genes have p<0.05, many might be due to noise (false positives)
• Common correction methods
– Bonferroni correction
  • Adjusted p value: p × n, e.g. p=0.0005, n=1000 genes, adjusted p = 0.0005 × 1000 = 0.5
  • Equivalently, corrected p value cutoff = 0.05/n
  • Explanation: among all genes selected, the probability of at least one false positive is ≤0.05
– False discovery rate (FDR)
  • FDR=0.1 means that among all genes selected (e.g. 100), we would expect 10 to be false positives
  • FDR as high as 0.5 may be acceptable to biologists
  • Several different approaches to estimation (Benjamini & Hochberg, B&H, is the most popular)
• Data filtering in the processing step can also reduce the number of genes tested
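Both corrections on this slide are short enough to write out. The sketch below shows Bonferroni adjustment (p × n, capped at 1) and the Benjamini & Hochberg step-up adjustment. The function names and the five example p values are hypothetical; in practice an established statistics package would normally be used.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p values: min(1, p * n)."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p values for five genes
p = [0.001, 0.008, 0.039, 0.041, 0.6]
print(bonferroni(p))
print(benjamini_hochberg(p))
```

With these inputs, two genes stay below 0.05 after Bonferroni correction, and the same two stay below 0.05 under B&H; the third and fourth genes fall just above it.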
Azuaje F. Bioinformatics and Biomarker Discovery, 2010
Basic Biomarker Discovery Pipeline
Data Processing
• Data pre-processing
– Data filtering and QC
  • Remove samples with failed experiments
  • Exclude markers with very low variance
  • Exclude markers with very low expression levels, e.g. RNA-seq
– Data normalization
  • To transform the data into a format that is compatible or comparable between different samples or assays
  • To level potential differences caused by experimental factors, such as labelling and hybridization
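The filtering and normalization steps can be sketched as follows. This is only an illustration: the variance and mean-expression cutoffs are hypothetical, and per-sample median centering is one simple normalization chosen for the example, not necessarily the method used in the course or in any particular platform's pipeline.

```python
from statistics import median, pvariance


def filter_markers(matrix, min_variance=0.01, min_mean=1.0):
    """Keep markers (rows) whose variance and mean expression pass the cutoffs.

    matrix maps marker name -> list of expression values, one per sample.
    """
    return {
        name: values
        for name, values in matrix.items()
        if pvariance(values) >= min_variance
        and sum(values) / len(values) >= min_mean
    }


def median_center_samples(matrix):
    """Subtract each sample's median so samples share a common center."""
    names = list(matrix)
    n_samples = len(next(iter(matrix.values())))
    centers = [median(matrix[g][j] for g in names) for j in range(n_samples)]
    return {g: [v - c for v, c in zip(matrix[g], centers)] for g in names}


# Hypothetical expression matrix: three genes measured in four samples
expr = {
    "geneA": [5.0, 7.0, 6.0, 9.0],  # informative
    "geneB": [4.0, 4.0, 4.0, 4.0],  # no variance
    "geneC": [0.1, 0.3, 0.0, 0.2],  # very low expression
}
kept = filter_markers(expr)
centered = median_center_samples(expr)
print(list(kept))
```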
Why Remove Genes with Low Variance?
[Figure: two panels of gene expression (y-axis, 0 to 4) in case vs. control; p=0.004 and p=0.008]
Data Reduction
• Focus on smaller sets of potentially novel and interesting data patterns (e.g. groups of samples or gene sets)
• Confirm the initial hypothesis about the relevance of the features available and guide future experimental and computational analysis
• Exploratory univariate analyses
– t-test
– Chi-square test
– Correlation
– Univariate regression
Data Matrix
• Data matrix
– Color-coded representations of absolute or relative expression levels
[Figure: expression data matrix; axis labels: Expression, Samples]
Data Visualization
• Statistical plotting: GraphPad
• Dendrogram and heatmap: R, GENE-E, Gitools
[Figure label: dendrogram]
Exploratory Analysis
• Univariate analysis
– Single marker vs phenotype
– Multiple-hypothesis testing corrections
– DEG (differentially expressed genes)
– Fold change
– Statistical model: t-test, correlation, univariate regression
– P values and other cut-offs
• Unsupervised classification (clustering) and visualization
• Filtering: to remove uninformative, highly noisy or redundant markers for subsequent analyses
• Supervised classification
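A single-marker comparison of cases and controls can be illustrated with a fold change and a permutation p value. The sketch below uses a permutation test on the difference in group means instead of a t-test, because it needs only the standard library. The function names, the number of permutations, and the expression values are hypothetical.

```python
import random
from math import log2
from statistics import mean


def log2_fold_change(case, control):
    """log2 of the ratio of group means (values assumed positive)."""
    return log2(mean(case) / mean(control))


def permutation_p_value(case, control, n_permutations=2000, seed=0):
    """Two-sided permutation p value for the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(case) - mean(control))
    pooled = list(case) + list(control)
    n_case = len(case)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_case]) - mean(pooled[n_case:]))
        if diff >= observed:
            hits += 1
    # Add 1 to numerator and denominator so the estimate is never exactly 0.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical expression of one gene in cases and controls
case = [8.1, 7.9, 8.4, 8.8, 8.2, 8.6]
control = [4.0, 4.3, 3.9, 4.1, 4.4, 3.8]
fold_change = log2_fold_change(case, control)
p_value = permutation_p_value(case, control)
print(round(fold_change, 2), p_value)
```

When this is repeated across thousands of markers, the resulting p values still need a multiple-testing correction before a DEG list is called.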
Data Integration
• Further reduction
• Which markers should be chosen for predictive model construction?
• To estimate the potential relevance of the identified markers and relationships
• To discover other significant genes and relationships (e.g. gene-gene or gene-disease) not found in previous data-driven analysis steps
• Tools:
– Human gene annotation databases (e.g. GO)
– Metabolic pathway databases (e.g. KEGG)
– Gene-disease association extractors from public databases (e.g. Endeavour)
– Other functional catalogues
– IPA
• Resulting data- and knowledge-driven findings, patterns or predictions provide a selected catalogue of genes, pathways and (gene-gene and gene-disease) relationships relevant to the phenotype classes investigated
Don’t Forget Covariate Data!
• Don’t forget these:
– Demographic
  • Age, gender, race (often a PCA component), smoking, drinking, lifestyle, etc.
– Physiological
  • BMI, weight, height, etc.
– Clinical
  • Blood tests, urine tests, other analytes
• Integrate information
– Molecular data
– Knowledge-driven data
– Covariates
• Multivariate regression
– Model training
– Model validation
– Model assessment
  • ROC
Data Integration is Critical
• Provide more reliable information
• Increase the prediction value
• Insight into the mechanism
• Reliable hypothesis generation
• But can be biased as well
[Figure: DNA → (transcription) → RNA → (translation) → Protein → (catalysis) → Metabolites; Genome → Transcriptome → Proteome → Metabolome/Lipidome → Clinical endpoint; labels: genetic effect, environmental effect, dysregulation]
Examples of Cardiovascular Biomarkers with Integrated Data
Vasan, 2006; Gerszten and Wang, 2008
Building Predictive Models
• If … Then …: build a model based on selected markers
• Sample sets: discovery set, validation set, pro-retrospective set, prospective set
$Y = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_i X_i$
Predictive Models
• Multivariable models
– Linear regression
  • Continuous data
– Logistic regression
  • Presence/absence of disease
– Cox regression
  • Survival data
• Algorithmic models: machine learning
– Support vector machines (SVM)
– Artificial neural networks (ANN)
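The simplest case of the regression models listed here is a linear model with one marker, Y = β0 + β1·X, whose coefficients have a closed-form least-squares solution. The sketch below fits that one-marker model on a hypothetical discovery set and applies it to a new sample. The function names and data are illustrative, and real analyses with several markers and covariates would use a statistics package.

```python
from statistics import mean


def fit_simple_linear(x, y):
    """Least-squares estimates (intercept b0, slope b1) for y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def predict(b0, b1, x_new):
    """Predicted phenotype for a new marker value."""
    return b0 + b1 * x_new


# Hypothetical discovery set: marker level (x) and continuous phenotype (y)
marker = [1.0, 2.0, 3.0, 4.0, 5.0]
phenotype = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = fit_simple_linear(marker, phenotype)
print(round(b0, 2), round(b1, 2), round(predict(b0, b1, 6.0), 2))
```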
Validation Strategies
• Internal validation
– Cross-validation
– Random/non-random split of samples into training and test sets
• External validation
– Independent sample and dataset
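Cross-validation repeatedly splits the same samples into a training part and a held-out test part. The sketch below generates the index sets for k-fold cross-validation with a seeded random shuffle. The function name, the choice of k=5, and the fixed seed are assumptions made for the illustration.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Example: 60 samples split into 5 folds
splits = list(k_fold_indices(60, k=5))
for train, test in splits:
    print(len(train), len(test))
```

Each sample appears in exactly one test fold, so every sample is predicted once by a model that never saw it during training.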
Assessment of Performance
• Basic parameters
– Sensitivity: the proportion of the true positive outcomes (e.g. truly diseased subjects) that are predicted to be positive
– Specificity: the proportion of the true negative outcomes (e.g. truly disease-free subjects) that are predicted to be negative
Assessment of Performance
• Receiver Operating Characteristic (ROC) curve
• Area under the curve (AUC, also “AUROC”)
– AUC=0.5: no association
– AUC=1: perfect association
– AUC<0.6: no medical value
– AUC>0.75: reasonable
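The AUC can be read as the probability that a randomly chosen positive subject receives a higher score than a randomly chosen negative subject, which gives a direct way to compute it without drawing the curve. The sketch below uses that pairwise definition, counting ties as one half. The function name, scores, and labels are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of positive-negative pairs ranked correctly.

    Ties count as 0.5. labels: 1 = positive (e.g. diseased), 0 = negative.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical model scores for three positives and three negatives
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)
print(round(auc, 3))
```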
Case study #1
Cancer Cell. 2004;5(6):607-16. PMID: 15193263
Hormone receptor status and tamoxifen response
[Figure: non-responsiveness rate (0% to 100%) by hormone receptor status: ER+/PR+, ER+/PR-, ER-/PR+, ER-/PR-]
Biomarker is needed!
• Who would respond to TAM?
• Alternative therapy
– Aromatase inhibitors
– HER1 and HER2 inhibitors
– Other chemotherapy
• Save time
Design
• Sample selection: frozen biopsies from 103 ER+ female BC patients at MGH
• Phenotyping: 60 female breast cancer patients uniformly treated with TAM alone
– 54% (N=32) disease free >10 yrs
– 46% (N=28) metastatic (recurrent) ~4 yrs
• Data collection: microarray, 22k genes
• Data filtering: <25% variance; expression level?
• Data reduction (discovery set): 5,475 genes → 19 DEG → 3 DEG
– t-test, P=0.001
– Permutation test, p<0.04
• Technical validation of DEG
• LCD9
Technical Validation
Data Evaluation, Refinement and Selection
• 3 DEG (TAM recurrent vs. TAM non-recurrent): HOXB13, IL17BR, EST (unknown)
• Expression ratio: HOXB13:IL17BR
• Univariate analysis (t-test, AUC of ROC, ROC AUC comparison)
– HOXB13, IL17BR, expression ratio
– Age, tumor size, grade, lymph node status
– ERBB2, EGFR, ESR1, PGR
• Multivariate analysis: logistic model
Association Model: to determine the predictive factors
• Univariate analysis: select factors
• Multivariate model: test dependence
• Predictive model: include the independent factors
$Y = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_i X_i$
Validation
• Cross validation?
• Independent validation
• Technical validation: qPCR (why?)
• FFPE (why?)
Predictive Model
• qPCR was established as a feasible strategy for diagnostic purposes
• The ratio is the only predictive factor after controlling for other factors identified by univariate analysis, e.g. tumor size, other genes, etc.
• Can this ratio accurately predict the recurrence status?
[Figure: HOXB13:IL17BR ratio, $Y = \beta_1 X + \beta_0$, predicting TAM recurrence vs. TAM non-recurrence?]
Predictive model evaluation
• Sensitivity, specificity and accuracy
 | Recur | Non-recur
Predicted | 21 | 5
Non-predicted | 6 | 27
Accuracy = (21+27)/59 = 81%
Sensitivity = 21/(21+6) = 78%
Specificity = 27/(27+5) = 84%
Positive predictive value = 21/(21+5) = 81%
Negative predictive value = 27/(27+6) = 82%
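The five measures on this slide all follow from the four cells of the confusion matrix. The sketch below recomputes them from the counts that appear in the slide's formulas, taking 21 and 27 as the correct calls and 6 and 5 as the errors. The function name and the dictionary keys are hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, PPV and NPV from confusion counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Counts read from the slide's formulas: tp=21, fn=6, tn=27, fp=5
metrics = classification_metrics(tp=21, fp=5, tn=27, fn=6)
for name, value in metrics.items():
    print(f"{name}: {value:.0%}")
```

The same function can be applied to the counts of any other evaluation set, such as an independent validation cohort.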
Independent Evaluation for the Predictive Model
• Validation set: 20 FFPE samples
[Figure: HOXB13:IL17BR ratio, $Y = \beta_1 X + \beta_0$, predicting TAM recurrence vs. TAM non-recurrence?]
Accuracy = (9+7)/20 = 80%
Sensitivity = 7/(7+3) = 70%
Specificity = 9/(9+1) = 90%
+predictive value =
-predictive value =
Evaluation: Other outcomes
Why these two genes? Mechanism
• Correlative studies in tissue samples: association
• Mechanistic studies in BC cell lines: causality
• Q: Is this step necessary?
Our Validation Set
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse6532
Evening Session I
Genomespace: http://www.genomespace.org/
cBioPortal: http://www.cbioportal.org/