Big Data Training for Translational Omics Research BigTaP ... 1 day 1-Liu.pdf · In Translational...

Big Data Training for Translational Omics Research

BigTaP: Week II

Wanqing Liu, PhD
Assistant Professor
Department of Medicinal Chemistry and Molecular Pharmacology
Purdue University

Min Zhang, MD, PhD
Professor
Department of Statistics
Purdue University


Learning Objectives
• Main topics:
  – Biomarker discovery and development using omics data
  – GWAS for binary and quantitative traits
• By taking this course, you will be able to:
  – Reinforce what you have learnt in Week I
  – Understand the principles of biomarker discovery and GWAS
    • Study design
    • Process
    • Potential challenges
  – Understand the basic operation steps for translational omics data analysis
    • Data downloading
    • Data processing
    • Data analysis
    • Data visualization
    • Validation
    • Result interpretation
  – Understand the literature in relevant areas
  – Discuss with biostatisticians and bioinformaticians


Teaching Logistics
• http://www.stat.purdue.edu/bigtap/schedule.html
• Lecture → operation → lecture → operation…
• Online resources: database and tools
• Case study:
  – http://www.ncbi.nlm.nih.gov/pubmed/15193263
  – http://www.ncbi.nlm.nih.gov/pubmed/27010727
  – You’ll be tested!
• Homework:
  – Transcriptome biomarker: case-control/continuous
  – GWAS: case-control/continuous
• Grouping


Session 1: Principles of Biomarker Discovery and Development in Translational Medicine

Liu
Day 1, Session I, 8-10am


Philosophy of Translational Research

• As a biomedical researcher, how can I make something to benefit patients?

• I am working on cell lines and mice. How can the omics approach help me understand the mechanism, especially causality?

• Can the key molecule(s) I identified in cells and animals be used in humans?

Lab researchers, grant writers, physicians…


Key Words
• Biomarker: A characteristic that is objectively measured and evaluated as an indicator of normal biological processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention. (NIH Biomarkers Definition Working Group; Atkinson, et al., Clin Pharm Ther, 2001)

• Translational: Translational research aims to aid in the transformation of biological knowledge into solutions that can be applied in a clinical setting. (Azuaje F. Bioinformatics and Biomarker Discovery, 2010)


Why Biomarker?


A Core Question in Modern Medicine: How to Address Patient Heterogeneity?


Patient Heterogeneity


Biomarker → Personalized Medicine

Response rate (RR) by patient population and drug:
• CML patients, Gleevec: 90% RR
• All breast cancer patients, Herceptin: 10–15% RR
• HER2+ breast cancer patients, Herceptin: 35–45% RR
• All NSCLC patients, Iressa: 10–15% RR
• EGFR MT+ NSCLC patients, Iressa: 60–70% RR

Slamon et al. NEJM 2001; Kantarjian et al. NEJM 2002; Vogel et al. JCO 2002. 20:3; Douillard et al. JCO 2010.

Biomarkers are especially important in diseases with low response rates in the overall population


Biomarker → Personalized Medicine

[Figure: from discovery through drug development to implementation.
Cancer: markers EGFR, KRAS, ALK, HER2, BRAF; precision molecules Gefitinib, ARS-853?, Crizotinib, Herceptin, Vemurafenib.
Other common diseases: GeneA, GeneB, GeneC, GeneD, GeneE, ALK.]


Precision Medicine
To deliver the right treatment to the right patient with the right dose and at the right time


Clinical Application of Biomarker

• Deal with patient heterogeneity
  – Early risk assessment
  – Disease prevention
  – Assist diagnosis
  – Optimize treatment: high effectiveness, low risk
  – Match the patient to a therapeutic strategy
  – Monitor therapy success/disease recurrence
  – Long-term management

[Figure labels: Risk, Diagnosis, Treatment, Monitoring]


Biomarker in Preclinical Studies
• To characterize the phenotype
• To monitor the response
• To identify potential translational biomarkers for humans


Omics Approach in Basic Research
• Explore molecular mechanism
• Hypothesis generating
• Identify therapeutic targets and strategies
• Establish intermediate phenotypes


Types of Biomarkers
• Prognostic marker (a): before treatment
• Predictive marker (b): before treatment
• Pharmacodynamic marker (c): after treatment
• Surrogate marker (d): during treatment

Gosho, et al. Sensors 2012, 12, 8966-8986


Prognostic Marker
• Signature separates a population with respect to the outcome (risk)
• Regardless of the types of therapies or treatments
  – Markers associated with overall survival regardless of treatment
• Distinguish outcome (poor or good) following the test and standard treatments
• Cannot guide the choice of a particular treatment
• Can determine the aggressiveness of treatment

Ballman KL, JCO. 2015.63.3651


Predictive Biomarker


• Predicts the differential outcome of a particular therapy or treatment
• Prospectively identifies patients who are likely to have a favorable clinical outcome from a specific treatment
• Can therefore guide the choice of treatment


Prognostic and Predictive Markers


• Biomarkers can be predictive of both disease susceptibility or progression and certain treatment outcomes

• ER status and breast cancer: prognostic
• ER status and antiestrogen therapy: predictive

• Be careful about the word “prediction”


Pharmacodynamic Markers
• PD biomarkers provide information about the pharmacologic effects of a drug on its target
• Measured after treatment
• A clinical endpoint to be measured
• Application:
  – Proof of mechanism, i.e., does the drug hit its intended target?
  – Proof of concept, i.e., does hitting the drug target alter the biology of the tumor?
  – Selection of optimal biologic dosing
  – Understanding response/resistance mechanisms
• Examples:
  – Protein phosphorylation markers, e.g. p-EGFR, p-ERK, to evaluate changes in target protein phosphorylation or the activation status of downstream signaling/adapter molecules
  – Apoptosis (TUNEL assay) to assess pharmacologic effect on proliferation


Surrogate Biomarker
• Substitute for a clinical endpoint
• Expected to predict clinical benefit (or lack of benefit, or harm) based on epidemiologic, therapeutic, pathophysiologic, or other scientific evidence
• During or after treatment
• Examples:
  – Glucose level for monitoring the treatment of diabetes
  – Imaging-based measurement for anti-cancer therapy


Questions

§ What kind of biomarker is HOXB13:IL17BR in the first paper?

§ What kind of biomarker is blood concentration of R-/S-methadone?


Examples of FDA Approved Biomarkers



Biomarker Discovery and Development in the Omics Era

1970s 1980s 1990s

>2005


Biomarker Discovery and Development in the Omics Era

Genomics, Transcriptomics, miRNomics, lncRNomics, Epigenomics, Proteomics, Metabolomics, Lipidomics, Exposomics


Prognostic-Diagnostic Markers
• Genes for ~50% of rare diseases identified

Nature Reviews Genetics 14, 681–691 (2013)


Prognostic-Diagnostic Markers
• 11,907 SNPs strongly associated with common diseases


Pharmacogenomic Markers
• 166 FDA approved PGx markers for drug treatment


Transcriptomic Biomarkers
• MammaPrint test
  – Agendia
  – 70-gene signature for breast cancer prognosis
• Oncotype Dx test
  – Genomic Health
  – 21 gene-expression biomarkers for predicting recurrence in breast cancer patients, and predicting response to both chemotherapy and radiation therapy
• H/I test
  – AviaraDx
  – 2-gene signature that is used to estimate the risk of recurrence and response to therapy of breast cancer patients


Biomarker Development Pipeline

[Figure: pipeline stages alongside technical development; axes labelled number of analytes and number of samples. Source: https://is.muni.cz]
• Biomarker stages: Discovery, Confirmation, Assay development, Validation/Refinement, Clinical Validation, Clinical Adoption
• Parallel labels: Target selection, Lead identification, Preclinical, Retrospective, Clinical trials, Marketing/clinical use
• Discovery platforms: Genomics, Transcriptomics, Proteomics, Metabolomics, Lipidomics, Epigenomics, Exposomics, Imaging
• Integrated technologies and platforms; multi-analyte assays
• Robust validated assays; clinical grade assays; accurate, specific, reproducible, reliable
• Clinical grade assays; instruments


Institute of Medicine Roadmap for omics-based tumor biomarker test development

Hayes BMC Medicine 2013, 11:221


Institute of Medicine Roadmap for omics-based tumor biomarker test development Hayes BMC Medicine 2013, 11:221


Data Acquisition Strategies
• Retrospective:
  – Clinical samples collected before the design of the biomarker study, and before comparison with control samples
  – Looks back at past, recorded data to find evidence of marker-disease relationships
  – Inexpensive, rapid
  – Potentially biased, noisy
  – Weak evidence
• Prospective:
  – The biomarker-based prediction or classification model is applied to patients at the time of patient enrolment
  – Clinical outcomes or disease occurrence are unknown at the time of enrolment
  – Less biased
  – Strong evidence
  – Expensive, time-consuming
• Pro-retrospective

FDA approval!!


Study Design Considerations
• Biomarker discovery studies require careful planning and design
• Study style: retrospective, prospective, pro-retrospective
• Sample collection
• Phenotype
• Sample size and power estimation
• Other covariates
• Data collection
• Platform
• Replication, validation and application
• Data analysis plan


Sample Collection, Assay Design, Data Analysis Plan

• Establish methods
  – Specimen collection
  – Processing
  – Storage
• Establish criteria
  – Quantity and quality
  – Minimum amount
• Feasibility
  – Obtaining specimens
• Assay design
  – Communication with core/service provider
• Data analysis
  – Communication with biostatistician and bioinformatician


Sample and Materials
• Biospecimen (figure axis: invasiveness)
  – Tissue
  – Blood
  – Oral swab
  – Hair
  – Tear
  – Urine
  – Saliva
  – Feces
  – …
• Test materials
  – DNA
  – RNA
  – Protein
  – Small molecules
  – Lipids
• Principles:
  – Non-invasive
  – Reproducible
  – Reliable
  – Specific
  – Accurate
  – Inexpensive
  – Point-of-care


Ethical, Legal, and Regulatory Issues

• Establish communication with regulatory agencies, e.g. IRB, FDA
• Regulatory approvals
• Documents:
  – Informed consent
  – Study protocol
• Intellectual property issues
• CLIA-lab based test for clinical trials involving patient selection


Sample Size and Power Estimation
• Power setting: 0.8
• Statistical significance:
  – Discovery: multiple hypotheses (p corrected according to the number of tests)
  – Validation: usually one hypothesis (p<0.05)
• Input parameters: previous publication or pilot study
• Online tools:
  – piface.jar by Lenth (2006)
    • http://homepage.stat.uiowa.edu/~rlenth/Power/
  – Microarray power/sample size estimation
    • http://bioinformatics.mdanderson.org/MicroarraySampleSize/
  – RNA-seq data:
    • Scotty: http://bioinformatics.bc.edu/marthlab/scotty/scotty.php
    • RnaSeqSampleSize: https://cqs.mc.vanderbilt.edu/shiny/RnaSeqSampleSize/
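
To make the power settings above concrete, here is a minimal sketch of a sample size calculation. It is not what the listed online tools compute: it assumes a two-sided, two-group comparison, a normal approximation, and a standardized effect size, and the function name `n_per_group` and the effect size of 0.5 are illustrative assumptions. The discovery case applies a Bonferroni-style corrected significance level.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.8, n_tests=1):
    """Approximate samples per group for a two-sided, two-group comparison.

    Normal approximation: n = 2 * (z_{1-a/2} + z_{power})^2 / d^2, where d is
    the standardized effect size (mean difference / SD) and a = alpha / n_tests
    is the significance level after a Bonferroni-style correction.
    """
    corrected_alpha = alpha / n_tests
    z_alpha = NormalDist().inv_cdf(1 - corrected_alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


# Validation: one hypothesis at p < 0.05 (effect size 0.5 is an assumed input).
validation_n = n_per_group(0.5)
# Discovery: the same effect size, but corrected for 1000 tested markers.
discovery_n = n_per_group(0.5, n_tests=1000)
print(validation_n, discovery_n)
```

The comparison illustrates why discovery studies with many markers need larger samples than single-hypothesis validation studies at the same power.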


Key Principles: Big Data in Biomarker

[Figure: Phenotype (“digits”) × Molecular profiles (“digits”), linked by statistics, bioinformatics, network, …]


Always Start Your Design and Analysis From Data Evaluation!

• What kind of phenotypic and marker data do I have/should I use/collect?
• Are my data normally distributed?
• What kind of models should I choose?
• What factors may possibly confound my analyses?
• How do covariate data correlate with my phenotype?


Phenotype to Digits
• Nominal data: no order
  – Yes or no (binary): disease vs normal, response vs no response
  – Cancer type: breast, lung, colon…
• Ordinal data: some order
  – Pathologic: tumor stage I, II, III
  – Disease progression: no, mild, severe, death
• Continuous data:
  – Glucose level, LDL, drug concentration, gene expression
• Survival data: time to event
  – Death, occurrence of disease, onset of toxicity, in hr, day, wk, month, yr, etc.
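
The data types above can be turned into numbers in a few lines. The sketch below is illustrative only: the patient records, field names, and coding schemes are hypothetical, and the point is simply that binary and ordinal phenotypes map to integer codes, nominal phenotypes with more than two classes map to indicator (one-hot) columns because they have no order, and continuous phenotypes are used as they are.

```python
# Hypothetical patient records; field names and values are made up.
patients = [
    {"status": "disease", "cancer": "lung", "stage": "II", "ldl": 131.0},
    {"status": "normal", "cancer": "breast", "stage": "I", "ldl": 98.5},
    {"status": "disease", "cancer": "colon", "stage": "III", "ldl": 154.2},
]

BINARY_CODES = {"normal": 0, "disease": 1}   # nominal, two classes
STAGE_CODES = {"I": 1, "II": 2, "III": 3}    # ordinal: order is meaningful
CANCER_TYPES = ["breast", "lung", "colon"]   # nominal: no order


def encode(patient):
    """Return one numeric row: binary code, one-hot cancer type, stage, LDL."""
    one_hot = [1 if patient["cancer"] == c else 0 for c in CANCER_TYPES]
    return [
        BINARY_CODES[patient["status"]],
        *one_hot,
        STAGE_CODES[patient["stage"]],
        patient["ldl"],  # continuous data needs no recoding
    ]


encoded = [encode(p) for p in patients]
for row in encoded:
    print(row)
```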


Molecular Data Collection

• Platform: Genomics, Transcriptomics, miRNomics, lncRNomics, Epigenomics, Proteomics, Metabolomics, Lipidomics
• Raw data → “Digits”
  – Ordinal data: 0, 1, 2
  – Continuous variables: -1.2, -1.1, 0.58, 1.09, 2.34…


Basic Statistical Methods

[Figure: Phenotype × Molecular Profiles, both as numerical data]
• Phenotype data types: nominal, ordinal, continuous, survival
• Molecular profile data types: nominal, ordinal, continuous
• Methods: Chi-square test, t-test, ANOVA, correlation, log rank, statistical models
• Descriptive and exploratory association


Basic Statistical Methods

• Continuous data
  – Normally distributed: parametric method
  – Non-normal distribution/ordinal data: non-parametric method
• Winsorization
• Log transformation: log2

Parametric             Non-parametric
t-test                 Mann-Whitney rank-sum test
Paired t-test          Wilcoxon signed-rank test
ANOVA                  Kruskal-Wallis test
Pearson correlation    Spearman correlation
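
Winsorization and log2 transformation, mentioned above, can be sketched in plain Python. This is a minimal illustration under stated assumptions: the percentile cutoffs, the nearest-rank percentile rule, and the pseudocount of 1 are choices made here for the example, not settings given in the lecture.

```python
import math


def winsorize(values, lower_pct=0.10, upper_pct=0.90):
    """Clamp values below/above the chosen percentiles to those percentiles.

    Percentiles are taken by nearest rank on the sorted data (an assumption;
    other percentile definitions give slightly different limits).
    """
    ordered = sorted(values)
    last = len(ordered) - 1
    low = ordered[round(lower_pct * last)]
    high = ordered[round(upper_pct * last)]
    return [min(max(v, low), high) for v in values]


def log2_transform(values, pseudocount=1.0):
    """log2(x + pseudocount); the pseudocount avoids log2(0)."""
    return [math.log2(v + pseudocount) for v in values]


raw = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]   # one extreme outlier
print(winsorize(raw))                      # outlier pulled in to the 90th percentile
print(log2_transform([1, 3, 7]))
```

Either step reduces the influence of extreme values before a parametric test is applied; a non-parametric test from the table above is the alternative when the data stay far from normal.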


Statistical Models

• Univariate models
  – Logistic regression: binary/categorical phenotype
  – Linear regression: continuous phenotype
  – Kaplan-Meier (KM) method: survival phenotype
• Multivariate models
  – Multivariate regressions: linear or logistic
  – Cox regression: survival phenotype
• Other sophisticated models


Multiple Testing Issue

• Example
  – P value cutoff = 0.05
  – 1000 genes: 50 genes expected by chance (error) at this significance level
  – If 60 genes have p<0.05, many might be due to noise (false positives)
• Common correction methods
  – Bonferroni correction
    • Adjusted p value: p × n, e.g. p = 0.0005, n = 1000 genes, adjusted p = 0.0005 × 1000 = 0.5
    • Corrected p value cutoff = 0.05/n
    • Explanation: among all genes selected, the probability of at least one false positive is ≤ 0.05
  – False discovery rate (FDR)
    • FDR = 0.1 means that among all genes selected (e.g. 100), we would expect 10 to be false positives
    • FDR as high as 0.5 may be acceptable to biologists
    • Several different approaches to estimate (Benjamini & Hochberg, B&H, is the most popular)
• Data filtering in the processing step can also reduce the number of genes


Azuaje F. Bioinformatics and Biomarker Discovery, 2010

Basic Biomarker Discovery Pipeline


Data Processing
• Data pre-processing
  – Data filtering and QC
    • Remove samples with failed experiments
    • Exclude markers with very low variance
    • Exclude markers with very low expression levels, e.g. in RNA-seq
  – Data normalization
    • To transform the data into a format that is compatible or comparable between different samples or assays
    • To level potential differences caused by experimental factors, such as labelling and hybridization
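
The filtering step described above can be sketched as follows. The gene names, expression values, and both thresholds are hypothetical; suitable cutoffs depend on the platform and on how the data were normalized.

```python
from statistics import mean, variance

# Hypothetical expression values (one list per gene, one value per sample).
expression = {
    "gene_a": [5.1, 7.9, 6.2, 9.4],   # expressed and variable: keep
    "gene_b": [6.0, 6.0, 6.1, 6.0],   # almost constant: low variance
    "gene_c": [0.1, 0.0, 0.9, 0.2],   # barely detected: low expression
}


def filter_markers(data, min_mean=1.0, min_variance=0.05):
    """Keep markers whose mean and sample variance reach both thresholds."""
    return {
        gene: values
        for gene, values in data.items()
        if mean(values) >= min_mean and variance(values) >= min_variance
    }


kept = filter_markers(expression)
print(list(kept))
```

Removing uninformative markers before testing also reduces the number of hypotheses, which eases the multiple testing burden.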


Why Remove Genes with Low Variance?

[Figure: gene expression (y-axis, 0 to 4) in case vs. control for two genes; p=0.004 and p=0.008]


Data Reduction
• Focus on smaller sets of potentially novel and interesting data patterns (e.g. groups of samples or gene sets)
• Confirm the initial hypothesis about the relevance of the features available and guide future experimental and computational analysis
• Exploratory univariate analyses
  – T-test
  – Chi-square test
  – Correlation
  – Univariate regression


Data Matrix

• Data matrix
• Color-coded representations of absolute or relative expression levels

[Figure: heatmap; axes: Expression, Samples]


Data Visualization

[Figure: heatmap with dendrogram]

• Statistical plotting: Graphpad
• Dendrogram and heatmap: R, GENE-E, Gitools


Exploratory Analysis
• Univariate analysis
• Single marker vs phenotype
• Multiple-hypothesis testing corrections
  – DEG
  – Fold change
  – Statistical model: t-test, correlation, univariate regression
  – P values and other cut-offs
• Unsupervised classification (clustering) and visualization
• Filtering: to remove uninformative, highly noisy or redundant markers for subsequent analyses
• Supervised classification


Data Integration
• Further reduction
• Which markers to choose for the predictive model construction
• To estimate the potential relevance of the identified markers and relationships
• To discover other significant genes and relationships (e.g. gene-gene or gene-disease) not found in previous data-driven analysis steps
• Tools:
  – Human gene annotation databases (e.g. GO)
  – Metabolic pathway databases (e.g. KEGG)
  – Gene-disease association extractors from public databases (e.g. Endeavour)
  – Other functional catalogues
• Resulting data- and knowledge-driven findings, patterns or predictions provide a selected catalogue of genes, pathways and (gene-gene and gene-disease) relationships relevant to the phenotype classes investigated

IPA


Don’t Forget Covariate Data!
• Don’t forget these:
  – Demographic
    • Age, gender, race (often a PCA component), smoking, drinking, lifestyle, etc.
  – Physiological
    • BMI, weight, height, etc.
  – Clinical
    • Blood tests, urine tests, other analytes
• Integrate information
  – Molecular data
  – Knowledge-driven data
  – Covariates
• Multivariate regression
  – Model training
  – Model validation
  – Model assessment
    • ROC


Data Integration is Critical
• Provide more reliable information
• Increase the prediction value
• Insight into the mechanism
• Reliable hypothesis generation
• But can be biased as well

[Figure: DNA (genome) → transcription → RNA (transcriptome) → translation → protein (proteome) → catalysis → metabolites (metabolome/lipidome) → clinical endpoint; labels: genetic effect, environmental effect, dysregulation]


Examples of Cardiovascular Biomarkers with Integrated Data

Vasan, 2006; Gerszten and Wang, 2008


Building Predictive Models

If … Then …
Build up a model based on selected markers

Sets: discovery set, validation set, pro-retrospective set, prospective set

$Y = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_i X_i$


Predictive Models
• Multivariable models
  – Linear regression
    • Continuous data
  – Logistic regression
    • Presence/absence of disease
  – Cox regression
    • Survival data
• Algorithmic models: machine learning
  – Support vector machines (SVM)
  – Artificial neural networks (ANN)
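
Once the coefficients of a multivariable model have been estimated, applying it to a new sample is a weighted sum of the marker values. The sketch below assumes the coefficients are already known; the numbers and function names are hypothetical and no model fitting is shown. For a logistic model, the same linear predictor is passed through the logistic function to obtain a probability of disease.

```python
import math


def linear_predictor(intercept, coefficients, markers):
    """Y = b0 + b1*X1 + b2*X2 + ... for one sample."""
    return intercept + sum(b * x for b, x in zip(coefficients, markers))


def logistic_probability(intercept, coefficients, markers):
    """Probability of the positive class under a logistic regression model."""
    score = linear_predictor(intercept, coefficients, markers)
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical fitted coefficients and one patient's two marker values.
b0, betas = -1.0, [2.0, 0.5]
patient = [1.0, 2.0]

score = linear_predictor(b0, betas, patient)
prob = logistic_probability(b0, betas, patient)
print(score, round(prob, 3))
```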


Validation Strategies

• Internal validation
  – Cross-validation
  – Random/non-random split of samples into training and test sets
• External validation
  – Independent sample and dataset
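
Internal validation by cross-validation can be illustrated with a random k-fold split. This is a minimal sketch: the function name, the fixed random seed, and the choice of 5 folds on 60 samples are assumptions made for the example.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Samples are shuffled once, then dealt into k folds; each fold serves
    as the test set exactly once while the rest form the training set.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(n_samples=60, k=5))
for train, test in splits:
    print(len(train), len(test))
```

A model would be fitted on each training set and scored on the matching test set; external validation still requires an independent dataset.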


Assessment of Performance
• Basic parameters
  – Sensitivity: the proportion of the true positive outcomes (e.g. truly diseased subjects) that are predicted to be positive
  – Specificity: the proportion of the true negative outcomes (e.g. truly disease-free subjects) that are predicted to be negative


Assessment of Performance
• Receiver Operating Characteristic (ROC) curve
• Area under the curve (AUC), also called “AUROC”
  – AUC=0.5: no association
  – AUC=1: perfect association
  – AUC<0.6: no medical value
  – AUC>0.75: reasonable
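
The AUC values above can be computed without drawing the curve, because the area under the ROC curve equals the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one (ties count as one half). The sketch below uses that pairwise definition; the scores and labels are made up for illustration.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(positives) * len(negatives))


# Hypothetical marker scores; label 1 = diseased, 0 = disease-free.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
auc = roc_auc(scores, labels)
print(auc)
```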


Case study #1

Cancer Cell. 2004;5(6):607-16. PMID: 15193263


Hormone receptor status and tamoxifen response


[Figure: non-responsiveness rate (0% to 100%) by hormone receptor status: ER+/PR+, ER+/PR-, ER-/PR+, ER-/PR-]


Biomarker is needed!

• Who would respond to TAM?
• Alternative therapy
  – Aromatase inhibitors
  – HER1 and HER2 inhibitors
  – Other chemotherapy
• Save time


Design

[Flow diagram: study design, discovery set]
• Sample selection: 103 ER+ female BC patients in MGH; frozen biopsy; 60 female breast cancer patients uniformly treated with TAM alone
• Phenotyping: 54% (N=32) disease free >10 yrs; 46% (N=28) metastatic (recurrent) ~4 yrs
• Data collection: microarray, 22k genes
• Data filtering: <25% variance; expression level? 5,475 genes
• Data reduction: t-test P=0.001; permutation test p<0.04; 19 DEG; LCD 9 DEG; 3 DEG
• Technical validation


Technical Validation


Data Evaluation, Refinement and Selection

[Flow diagram]
• 3 DEG (TAM recurrent vs TAM non-recurrent): HOXB13, IL17BR, EST (unknown)
• Expression ratio: HOXB13, IL17BR
• Other factors: age, tumor size, grade, lymph node status, ERBB2, EGFR, ESR1, PGR
• Univariate analysis: t-test, logistic model, AUC of ROC, AUC ROC comparison
• Multivariate analysis: expression ratio


Association Model: to determine the predictive factors

• Univariate analysis: select factors
• Multivariate model: test dependence
• Predictive model: include the independent factors

$Y = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_i X_i$


Validation

• Cross validation?
• Independent validation
• Technical validation: qPCR. Why?
• FFPE. Why?


Predictive Model
• qPCR determined a feasible strategy for diagnosis purposes
• The ratio is the only predictive factor after controlling for other factors identified by univariate analysis, e.g. tumor size, other genes, etc.
• Can this ratio accurately predict the recurrence status?

[Figure: HOXB13:IL17BR ratio, Y=β1X+β0, TAM recur vs. TAM non-recur?]


Predictive model evaluation
• Sensitivity, specificity and accuracy

[Figure: 2×2 tables of predicted vs. non-predicted against recur vs. non-recur]

Accuracy = (21+27)/59 = 81%
Sensitivity = 21/(21+6) = 78%
Specificity = 27/(27+5) = 84%
+predictive value = 21/(21+5) = 81%
-predictive value = 27/(27+6) = 82%
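
The five figures above all come from the same four counts of a 2×2 table. As a quick check, the sketch below recomputes them from the counts that appear in the formulas (21 correctly predicted recurrences, 6 missed recurrences, 27 correctly predicted non-recurrences, 5 false alarms); the function and variable names are illustrative.

```python
def classification_metrics(tp, fn, tn, fp):
    """Performance measures from the four cells of a 2x2 table."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true positives among all recurrences
        "specificity": tn / (tn + fp),   # true negatives among all non-recurrences
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
    }


metrics = classification_metrics(tp=21, fn=6, tn=27, fp=5)
percentages = {name: round(value * 100) for name, value in metrics.items()}
print(percentages)
```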


Independent Evaluation for the Predictive Model

[Figure: HOXB13:IL17BR ratio, Y=β1X+β0, TAM recur vs. TAM non-recur?]

Validation Set: 20 FFPE samples

Accuracy = (9+7)/20 = 80%
Sensitivity = 7/(7+3) = 70%
Specificity = 9/(9+1) = 90%
+predictive value =
-predictive value =


Evaluation: Other outcomes


Why these two genes?--Mechanism

• Correlative studies in tissue samples: association
• Mechanistic studies in BC cell line: causality

• Q: Is this step necessary?


Our Validation Set

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse6532


Evening Session I

Genomespace: http://www.genomespace.org/
cBioPortal: http://www.cbioportal.org/
