NetBioSIG2014-Talk by David Amar
1
Pathways as robust biomarkers for cancer
classification: the power of big expression data
David Amar, Tom Hait, and Ron Shamir
Blavatnik School of Computer Science
Tel Aviv University
2
Motivation and introduction
3
Comparative genomics
Standard expression experiments: cases vs. controls -> differential genes -> interpretation
Problems
  Small number of samples
  Non-specific signal
  Interpretation of a gene set / gene ranking
Goal: find specific changes for a tested disease
  E.g., an up-regulated pathway
  Crucial for clinical studies
Previous integrative classification studies
  Huang et al. 2010 PNAS (9,160 samples); Schmid et al. PNAS 2012 (3,030); Lee et al. Bioinformatics 2013 (~14,000)
  Multilabel classification
  Global expression patterns
  Only 1-3 platforms
  Many datasets were removed from GEO
  No "healthy" class (Huang); no diseases (Lee)
Pathprint (Altschuler et al. 2013)
  Use pathways
  Tissue classification (as in Lee et al.)
4
Integrating pathways and molecular profiles
Enrichment tests
  Improves interpretability
  GSEA\GSA
    Rank based
    Higher statistical power
Classification
  Extract pathway features
    Example: given a pathway, remove non-differential genes
  Not clear if prediction performance improves compared to using genes (Staiger et al. 2013)
5
6
Pathway-based gene expression database
[Figure: construction of the database]
  Pathways: KEGG, Reactome, Biocarta, NCI
  Expression profiles: GSE, GDS, TCGA
  Sample labels: disease; dataset\sample description
  Single sample - single pathway analysis: for each pathway, mean and SD
  Output: matrix XP (samples x pathway features) and matrix Y (samples)
[Figure: single sample analysis of platform data]
  Sample j: ranked genes\transcripts g1, g2, g3, ..., gk (from low expression to high expression)
  Weighted ranks: w1, w2, w3, ..., wk (weight W_i given by a formula in i, e and i/k that is garbled in the source)
  Standardized profile
7
Single sample analysis
Input: an expression profile of a sample
  A vector of real values for each patient
Step 1: rank the genes
Step 2: calculate a score for each gene
  [Formula figure, annotated with: rank of gene g in sample s; total number of ranked genes] (Yang et al. 2012, 2013)
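The two steps can be sketched in a few lines. The weighting function used here, w = r * exp(-r / k) with r the rank and k the total number of ranked genes, is an assumption made for illustration only, since the slide's own formula is not recoverable from this transcript. The helper names, the example genes and the rank direction (rank 1 = lowest expression) are hypothetical as well.

```python
import math

def rank_genes(profile):
    """Step 1: map each gene to its rank; rank 1 is the lowest value (assumed direction)."""
    ordered = sorted(profile, key=profile.get)
    return {gene: i + 1 for i, gene in enumerate(ordered)}

def weighted_ranks(profile):
    """Step 2: score each gene from its rank r and the total number of ranked genes k.
    The form r * exp(-r / k) is an illustrative assumption, not the talk's formula."""
    ranks = rank_genes(profile)
    k = len(ranks)
    return {gene: r * math.exp(-r / k) for gene, r in ranks.items()}

# Toy expression profile of one sample (made-up values).
sample = {"TP53": 7.2, "MYC": 9.1, "GAPDH": 12.4, "BRCA1": 5.0}
scores = weighted_ranks(sample)
```

Because only ranks enter the score, the result does not depend on the platform's measurement scale, which is what makes profiles from different platforms comparable.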
8
Pathway features
1723 pathways in total
  Covering 7842 genes
  Mean size: 36.35 (median 15)
Score all genes that are in the pathway databases
Pathway statistics:
  Mean score
  Standard deviation
  Skewness
  KS test
Pathway DBs: KEGG, Reactome, Biocarta, NCI
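A minimal sketch of turning per-gene scores into the four pathway statistics listed above. The slide does not say what the KS test compares, so this sketch assumes pathway genes versus all other scored genes and reports only the two-sample KS statistic, not a p-value. The function names and the toy data are hypothetical.

```python
import statistics

def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    i = j = 0
    d = 0.0
    while i < len(xs) and j < len(ys):
        v = min(xs[i], ys[j])
        while i < len(xs) and xs[i] == v:  # consume all copies of v in xs
            i += 1
        while j < len(ys) and ys[j] == v:  # consume all copies of v in ys
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d

def pathway_features(gene_scores, pathway_genes):
    """Summarise one pathway in one sample: mean, SD, skewness, KS statistic."""
    inside = [s for g, s in gene_scores.items() if g in pathway_genes]
    outside = [s for g, s in gene_scores.items() if g not in pathway_genes]
    mean = statistics.fmean(inside)
    sd = statistics.pstdev(inside)
    skew = 0.0 if sd == 0 else sum(((s - mean) / sd) ** 3 for s in inside) / len(inside)
    return {"mean": mean, "sd": sd, "skewness": skew,
            "ks": ks_statistic(inside, outside)}

# Toy scores for six genes; the pathway holds the three highest-scoring ones.
gene_scores = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0, "e": 5.0, "f": 6.0}
features = pathway_features(gene_scores, {"d", "e", "f"})
```

Applying `pathway_features` to every pathway of every sample would give one row of the XP matrix per sample, with four columns per pathway.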
9
Patient labels
Unite ~180 datasets, >14,000 samples
Public databases contain 'free text'
Problem: automatic mapping fails, example:
  GDS4358: "lymph-node biopsies from classic Hodgkins lymphoma HIV- patients before ABVD chemotherapy"
  MetaMap top score: "HIV infections"
Solution: manual analysis
  Read descriptions and papers
10
Current microarray data
Data from GEO
  13,314 samples
  17 platforms
Sample annotation
  Ignore terms with fewer than 100 samples / 5 datasets
  48 disease terms
[Figure: matrix XP (samples x pathway features) and matrix Y (samples x disease terms, entries in {0,1})]
11
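The term filter from the sample-annotation slide and the resulting binary matrix Y can be sketched as follows. Reading the thresholds as "at least 100 samples and at least 5 datasets" is an assumption, and the input layout (one tuple per sample) and the function name are hypothetical.

```python
from collections import defaultdict

def build_label_matrix(samples, min_samples=100, min_datasets=5):
    """samples: list of (sample_id, dataset_id, set_of_disease_terms).
    Returns the kept terms and a samples x terms matrix with entries in {0, 1}."""
    term_samples = defaultdict(set)
    term_datasets = defaultdict(set)
    for sample_id, dataset_id, terms in samples:
        for term in terms:
            term_samples[term].add(sample_id)
            term_datasets[term].add(dataset_id)
    # Keep a term only if it passes both thresholds (assumed interpretation).
    kept = sorted(t for t in term_samples
                  if len(term_samples[t]) >= min_samples
                  and len(term_datasets[t]) >= min_datasets)
    Y = [[1 if t in terms else 0 for t in kept] for _, _, terms in samples]
    return kept, Y

# Toy input with tiny thresholds so the filter is visible.
toy = [("s1", "d1", {"cancer"}),
       ("s2", "d2", {"cancer", "leukemia"}),
       ("s3", "d2", set())]
kept, Y = build_label_matrix(toy, min_samples=2, min_datasets=2)
```

In the toy input "leukemia" is dropped because it appears in a single sample, and the control sample s3 becomes an all-zero row.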
12
Analysis and results
13
Multi-label classification algorithms
Learn a single classifier for each disease
  Ignore class dependencies
Adaptation: Bayesian Correction
  Learn single classifiers
  Correct errors using the DO (Disease Ontology) DAG
Transformation: use the label power sets and learn a multiclass model
  Using RF: multi-label trees
  Was better than most approaches in an experimental study (Madjarov et al. 2012)
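To make the difference between the single-classifier approach and the power-set transformation concrete, here is a small sketch of how each one reshapes the label matrix Y before any learner is trained. It covers only the target construction, not the random forest or the Bayesian correction; the function names are hypothetical.

```python
def binary_relevance_targets(Y):
    """One binary target vector per disease column (a single classifier per disease)."""
    return [list(column) for column in zip(*Y)]

def label_powerset_targets(Y):
    """Map each distinct label combination to one class id for a multiclass learner."""
    classes = {}
    targets = []
    for row in Y:
        key = tuple(row)
        # A new combination gets the next free class id.
        targets.append(classes.setdefault(key, len(classes)))
    return targets, classes

# Four samples, two disease terms.
Y = [[1, 0], [1, 1], [0, 0], [1, 0]]
per_disease = binary_relevance_targets(Y)
targets, classes = label_powerset_targets(Y)
```

The first form discards co-occurrence between diseases, while the second keeps it at the cost of one class per observed label combination.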
14
How to validate a classifier?
Use leave-dataset-out cross-validation
  Global AUC scores: each prediction Pij vs. the correct label Yij
  Disease-based AUC scores: consider each column separately
[Figure: the output of a multi-label learner on the test set: matrix Y (samples x disease terms, entries in {0,1}) and matrix P (samples x disease terms, probabilities in [0,1])]
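A sketch of the validation scheme under simple assumptions: each sample carries a dataset id, a fold holds out one whole dataset, and AUC is computed by direct pairwise comparison (quadratic time, fine for illustration only). All function names and the toy matrices are hypothetical.

```python
def leave_dataset_out(dataset_ids):
    """Yield (held_out_dataset, train_indices, test_indices), one fold per dataset."""
    for held in sorted(set(dataset_ids)):
        test = [i for i, d in enumerate(dataset_ids) if d == held]
        train = [i for i, d in enumerate(dataset_ids) if d != held]
        yield held, train, test

def auc(labels, scores):
    """ROC AUC: probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined without both classes
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def global_auc(Y, P):
    """Pool every entry P[i][j] against its label Y[i][j]."""
    return auc([y for row in Y for y in row], [p for row in P for p in row])

def disease_aucs(Y, P):
    """One AUC per disease column."""
    return [auc(list(cy), list(cp)) for cy, cp in zip(zip(*Y), zip(*P))]

folds = list(leave_dataset_out(["a", "a", "b"]))
Y = [[1, 0], [0, 1], [0, 0]]
P = [[0.9, 0.2], [0.3, 0.5], [0.1, 0.6]]
overall = global_auc(Y, P)      # 0.875 on this toy input
per_disease = disease_aucs(Y, P)  # [1.0, 0.5]
```

Holding out whole datasets keeps samples from one study out of both training and test sets, so the score is not inflated by dataset-specific batch effects.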
15
A problem (!)
What is in the background?
For a disease D define:
  Positives: disease samples
  Negatives: direct controls
  Background controls (BGCs)
Example: 500 positives, 500 negatives, 10000 BGCs
[Figure: matrices Y and P]
16
Multistep validation
It is recommended to use several scores (Lee et al. 2013)
Measure global AUPR
For each disease we calculate three scores
Measure, with the (additional) information it uses:
  AUPR: check separation between positives and all others
    Uses: sick vs. not sick
  ROC: test for separation between positives and negatives
    Uses: direct use of negatives
  Meta analysis p-value: calculate the overall separation significance within the original datasets (a p-value)
    Uses: mapping of samples to datasets
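The first two per-disease scores can be sketched as below, assuming each test sample is tagged as a positive, a direct negative, or a background control. AUPR is approximated by average precision, which is a swapped-in simplification. The meta-analysis p-value is left out because the slides do not name the method. All names and values are hypothetical.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos if n_pos else None

def roc_auc(labels, scores):
    """Probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    return sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))

def disease_scores(groups, scores):
    """groups[i] is 'pos', 'neg' (direct control) or 'bgc' (background control)."""
    # AUPR: positives vs. all others, background controls included.
    aupr = average_precision([1 if g == "pos" else 0 for g in groups], scores)
    # ROC: positives vs. direct negatives only, background controls dropped.
    direct = [(g, s) for g, s in zip(groups, scores) if g != "bgc"]
    roc = roc_auc([1 if g == "pos" else 0 for g, _ in direct], [s for _, s in direct])
    return {"aupr_pos_vs_all": aupr, "roc_pos_vs_neg": roc}

groups = ["pos", "pos", "neg", "neg", "bgc", "bgc", "bgc", "bgc"]
scores = [0.9, 0.4, 0.8, 0.7, 0.1, 0.2, 0.1, 0.3]
result = disease_scores(groups, scores)
```

In this toy case the AUPR against everything is 0.75 while the ROC against direct controls is 0.5: easy background samples can hide the fact that the classifier does not separate patients from their own controls, which is why several scores are reported.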
17
Performance results
[Figure: per-disease scores, showing positives vs. negatives ROC and AUPR; filled boxes mark meta analysis q-value < 0.001]
18
Performance results
8.5% improvement in recall, 12% in precision, compared to Huang et al.
Validation on RNA-Seq
  Data from TCGA: 1,699 samples
19
Pathway-Disease network
Steps (for each of the selected diseases):
1. Disease-pathway edges
   1. RF importance: select the top features
   2. Test for disease relevance
2. Add edges between diseases
   1. Use the DO structure
3. Add edges between pathways
   1. Based on significant overlap in genes
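Step 3 can be illustrated with a gene-overlap test. The slide only says "significant overlap", so the hypergeometric tail test, the threshold `alpha` and the absence of multiple-testing correction are all assumptions of this sketch; the function names and the toy pathways are hypothetical.

```python
from itertools import combinations
from math import comb

def overlap_pvalue(a, b, universe_size):
    """P(overlap >= observed) when b is drawn at random from the gene universe."""
    observed = len(a & b)
    upper = min(len(a), len(b))
    tail = sum(comb(len(a), x) * comb(universe_size - len(a), len(b) - x)
               for x in range(observed, upper + 1))
    return tail / comb(universe_size, len(b))

def pathway_edges(pathways, universe_size, alpha=0.01):
    """Return (pathway1, pathway2, p) for every pair with p below alpha."""
    edges = []
    for (n1, g1), (n2, g2) in combinations(sorted(pathways.items()), 2):
        p = overlap_pvalue(g1, g2, universe_size)
        if p < alpha:
            edges.append((n1, n2, p))
    return edges

# Toy pathways over a universe of 20 genes (integers stand in for gene ids).
pathways = {"A": {1, 2, 3, 4, 5}, "B": {1, 2, 3, 4, 6}, "C": {10, 11, 12}}
edges = pathway_edges(pathways, universe_size=20)
```

Here only A and B are connected: they share four of five genes, while C shares none with either.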
20
Cancer network
[Figure: cancer network; legend: Down, Up]
23
Cardiovascular disease
[Figure: network with a gastric cancers group; legend: Down, Up]
25
Summary
Large scale integration
Multi-label learning
Careful validation
Pathway based features as biomarkers
Summary of the results in a network
Currently
  Add genes: overcome missing values
  Shows improvement in validation
Acknowledgements
  Ron Shamir
  Tom Hait