10/6/11
CSCI2950-C: Topics in Computational Biology
Lecture 4, Part 2: Testing Groups of Genes
Ben Raphael (lecture given by Fabio Vandin)
October 3, 2011
http://cs.brown.edu/courses/csci2950-c/
Functional Mutations
Statistical approach: find recurrent mutations in many cancer genomes.
Mutations and other genomic measurements
• Hundreds of cancer samples
• Dozens of cancer types
Motivation
A large amount of data is coming! The Cancer Genome Atlas: 206 glioblastoma multiforme samples [TCGA, Nature 2008].
What genes are altered/mutated?
Motivation
Cancer is a disease of "pathways" [Hanahan and Weinberg, Cell 2000].
What pathways are altered/mutated?
[Figure: chart with percentages 36%, 2%, 6%, 18%, 21%, 3%, 2%, 5%, 3%, 4%; category labels not recoverable]
Finding Mutated Pathways
Standard practice: assess enrichment of mutations on known pathways/sets of genes
Gene Ontology
www.geneontology.org
Gene annotations for Yeast
KEGG Pathways
Testing Predefined Groups of Genes
Hypergeometric Test (see notes)
Finding Mutated Pathways
Standard practice: assess enrichment of mutations on known pathways/sets of genes
Limitations:
• Only known pathways/sets are tested!
• Pathways are interconnected (crosstalk)
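The slide defers the hypergeometric test to the notes, so the following is only a minimal sketch of the usual enrichment formulation, and the notes may set it up differently. It assumes a universe of `n_genes` genes of which `n_mutated` are mutated, and asks how surprising it is that a predefined set of `set_size` genes contains at least `n_overlap` mutated genes. The function and parameter names are hypothetical.

```python
from math import comb

def hypergeom_enrichment_pvalue(n_genes, n_mutated, set_size, n_overlap):
    """Upper-tail hypergeometric probability P[X >= n_overlap].

    X counts mutated genes in a gene set of size set_size drawn without
    replacement from n_genes genes, n_mutated of which are mutated.
    """
    total = comb(n_genes, set_size)
    upper = min(n_mutated, set_size)
    tail = sum(
        comb(n_mutated, i) * comb(n_genes - n_mutated, set_size - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total

# Example: 10 genes, 3 mutated, a pathway of 2 genes, both mutated.
p = hypergeom_enrichment_pvalue(10, 3, 2, 2)
```

A small p-value suggests the gene set carries more mutated genes than expected by chance; when many sets are tested, the p-values still need a multiple-testing correction.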
Finding Mutated Pathways
Manually constructed network of interactions [TCGA, Nature 2008]:
• RB1: 78% of samples
• p53: 87% of samples
• RTK/RAS/PI(3)K: 88% of samples
Are these results significant? Other interesting pathways?
Problem
Given:
1. Large-scale interaction network
2. Mutation data from multiple cancer samples
Find: Subnetworks mutated in a significant number of samples
[Figure: interaction network of genes; marked genes = alteration]
Problem Definition
Given:
1. Interaction network G = (V, E): V = genes, E = interactions between genes
2. Binary alteration matrix (genes × samples)
Find: Subnetworks mutated in a significant number of samples
– subnetwork = connected subgraph
– a subnetwork is mutated in a sample if ≥ 1 of its genes is mutated in that sample
Computational Formulation
For subnetwork S:
N_S = number of samples in which S is mutated with random alterations
m = number of observed samples in which S is mutated
Goal: Find S such that Pr[N_S ≥ m] < ε under a suitable null distribution
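The quantities m and Pr[N_S ≥ m] can be illustrated with a short sketch. The slides leave the null distribution unspecified ("suitable"), so the null used here, drawing a random gene set of the same size and ignoring network connectivity, is purely an assumption for illustration. The data layout (a dict from gene to the set of samples in which it is altered) and the function names are hypothetical.

```python
import random

def samples_mutated(alteration, genes):
    """Number of samples in which at least one gene of the set is altered."""
    covered = set()
    for g in genes:
        covered |= alteration.get(g, set())
    return len(covered)

def empirical_pvalue(alteration, genes, trials=1000, seed=0):
    """Monte Carlo estimate of Pr[N_S >= m].

    ASSUMED null: S is replaced by a uniformly random gene set of the same
    size (connectivity is ignored). This is not the null from the lecture.
    """
    rng = random.Random(seed)
    all_genes = list(alteration)
    m = samples_mutated(alteration, genes)
    hits = 0
    for _ in range(trials):
        random_genes = rng.sample(all_genes, len(genes))
        if samples_mutated(alteration, random_genes) >= m:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (trials + 1)

alteration = {"a": {1, 2}, "b": {2, 3}, "c": set(), "d": {4}}
m = samples_mutated(alteration, ["a", "b"])
p_hat = empirical_pvalue(alteration, ["a", "b"], trials=200)
```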
Mutated Subnetworks: Naïve Method
Find: S such that Pr[N_S ≥ m] < ε under a suitable null distribution
Naïve Method: Test each S
Problems:
1. Multiple hypothesis testing: > 10^20 candidate subnetworks with < 6 genes
2. Network topology: TP53 has 238 neighbors in the HPRD network
Naïve Method
• Example: 91 samples of glioblastoma multiforme [TCGA, Nature 2008]
• Consider only paths with 3 nodes:
– FDR ≤ 0.01: 1700 significant paths!
– few genes altered in many samples
• Remove significant genes?
HotNet
F. Vandin, E. Upfal, B. J. Raphael. “Algorithms for Detecting Significantly Mutated Pathways in Cancer”, JCB 2011.
Contribution
1. Methods for de novo discovery of mutated subnetworks
I. Combinatorial model
II. Enhanced influence model: HotNet
2. Definition of Influence Graph: identify subnetworks using both frequency of alteration and network topology
3. Statistical tests to assess the significance
(Local) Topology Matters
• Single path between mutated genes
• Path between mutated genes is one of many through a node
Influence Graph
alteration = unit source of heat
Heat diffusion: influence measure
Easily derived from the Laplacian matrix of G
Influence Graph
alteration = unit source of heat
The influence graph encodes the topology of G.
Heat equation
$f(t) = (f_1(t), \ldots, f_n(t))^T$: heat on vertices at time $t$.
$\frac{df_i}{dt} = \sum_j a_{ij} \, (f_j(t) - f_i(t))$
$\frac{df}{dt} = (A - D) f(t) = -L f(t)$
$A = [a_{ij}]$ = adjacency matrix of G; $D$ = diagonal matrix of vertex degrees.
$L = D - A$ = Laplacian matrix of G.
$f(t) = e^{-Lt} f(0)$
$e^{-Lt}$ is the heat kernel of G.
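To make the heat kernel concrete, here is a minimal pure-Python sketch that approximates e^{-Lt} for a small graph with a truncated Taylor series plus scaling and squaring. It is an illustration only: the function names and the dense list-of-lists representation are assumptions, the graph is assumed to have no self-loops, and a real network the size of HPRD would call for a numerical library instead.

```python
def laplacian(adj):
    """L = D - A for a symmetric adjacency matrix without self-loops."""
    n = len(adj)
    return [
        [(sum(adj[i]) if i == j else 0.0) - adj[i][j] for j in range(n)]
        for i in range(n)
    ]

def matmul(a, b):
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]

def heat_kernel(adj, t, terms=20, squarings=10):
    """Approximate exp(-L t): Taylor series of exp(-L t / 2^s), squared s times."""
    n = len(adj)
    lap = laplacian(adj)
    scale = t / (2 ** squarings)
    m = [[-lap[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = matmul(term, m)                      # term = m^k / (k-1)!
        term = [[x / k for x in row] for row in term]  # term = m^k / k!
        result = [
            [result[i][j] + term[i][j] for j in range(n)] for i in range(n)
        ]
    for _ in range(squarings):
        result = matmul(result, result)
    return result

# Two genes joined by one interaction, heat diffused for time t = 1.
kernel = heat_kernel([[0, 1], [1, 0]], 1.0)
```

Column j of the kernel gives f(t) when a unit of heat is placed on gene j at time 0, so entry (i, j) can be read as the influence of gene j on gene i.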
Discovering Significant Subnetworks
Two methods:
1. Combinatorial model
2. Enhanced Influence Model: HotNet
Based on Influence Graph
Statistical tests to assess significance
Combinatorial Model
Fix K: find the subnetwork with K genes mutated in the maximum number of samples.
Connected maximum coverage problem (“graph version” of the maximum coverage problem; NP-hard)
[Figure: influence graph and alteration matrix (genes × samples)]
Connected maximum coverage problem
1. Thm. NP-hard for general graphs.
2. Thm. NP-hard for star graphs.
3. Thm. (1 – 1/e)-approximation algorithm for spider graphs.
4. Thm. 1/(cr)-approximation algorithm for general graphs
– c = (2e – 1)/(e – 1)
– r = radius of the optimal solution in G
Combinatorial Model: Statistical Test
Fix K: find the subnetwork with K genes mutated in the maximum number of samples.
• Test the number of altered samples only
• Only 1 hypothesis: no multiple-testing correction!
Limitation: inadequate representation of the heterogeneity of cancer alterations
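The connected maximum coverage objective can be made concrete with a brute-force sketch for tiny instances. This is not one of the approximation algorithms from the theorems above; it simply enumerates every set of K genes, keeps the connected ones, and returns the set covering the most samples. The data layout (`edges` maps each gene to its neighbor set, `alteration` maps each gene to its altered samples) and the function names are hypothetical.

```python
from itertools import combinations

def is_connected(nodes, edges):
    """True if the subgraph induced by nodes is connected (depth-first search)."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in edges.get(u, ()):
            if v in nodes and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == nodes

def best_connected_subnetwork(edges, alteration, k):
    """Exhaustive search: exponential time, suitable for toy examples only."""
    best, best_cov = None, -1
    for subset in combinations(sorted(edges), k):
        if not is_connected(subset, edges):
            continue
        covered = set()
        for g in subset:
            covered |= alteration.get(g, set())
        if len(covered) > best_cov:
            best, best_cov = set(subset), len(covered)
    return best, best_cov

# Path graph a - b - c - d.
edges = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
alteration = {"a": {1, 2}, "b": set(), "c": {3}, "d": {4, 5}}
best, cov = best_connected_subnetwork(edges, alteration, 2)
```

In this toy example the pair {a, d} would cover four samples, but it is not connected, so the search returns {c, d} with three covered samples.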
Enhanced Influence Model (EIM)
[Figure: influence graph; alteration matrix (genes × samples, marks = alterations); steps (1) and (2); enhanced influence graph with a hot-to-cold scale]
Extract “significantly hot” subnetworks
Two-stage multi-hypothesis test
EIM: Statistical test
X_s = number of subnetworks with ≥ s genes using a “random” alteration matrix.
H_0^s: X_s ≥ η_s, s = 1, …, N, where N = # genes.
Two-stage multi-hypothesis test
1. Let s* = smallest s where H_0^s is rejected.
Reject when Pr[X_s ≥ η_s] < α / N (Bonferroni correction); # hypotheses = # values of s ≤ # measured genes.
[Figure: example with 2 subnetworks with 2 or more genes]
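The first stage can be sketched as a scan over sizes s with a Bonferroni threshold. The sketch assumes the tail probabilities Pr[X_s ≥ η_s] have already been estimated (for example from random alteration matrices) and are passed in as a list; the function name and the input format are hypothetical.

```python
def smallest_rejected_size(p_values, alpha=0.05):
    """Return s* = smallest s whose null hypothesis is rejected, else None.

    p_values[s - 1] is an estimate of Pr[X_s >= eta_s]; with N hypotheses,
    the Bonferroni-corrected threshold is alpha / N.
    """
    n = len(p_values)
    for s, p in enumerate(p_values, start=1):
        if p < alpha / n:
            return s
    return None

# Four sizes tested, so the threshold is 0.05 / 4 = 0.0125.
s_star = smallest_rejected_size([0.5, 0.2, 0.001, 0.0001], alpha=0.05)
```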
EIM: Statistical test
X_s = number of subnetworks with ≥ s genes using a “random” alteration matrix.
H_0^s: X_s ≥ η_s, s = 1, …, N, where N = # genes.
Two-stage multi-hypothesis test
2. Bound the false discovery rate (FDR) for the list of identified subnetworks.
Thm. Fix β_1, …, β_N such that Σ_i β_i ≤ β. Let s* be the smallest s such that η_s ≥ E[X_s] / β_s. If all subnetworks of size ≥ s* are returned as significant, then FDR ≤ β.
[Figure: example with 2 subnetworks with 2 or more genes]
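The condition in the theorem translates directly into a scan over s. In this sketch, `eta`, `expected` and `betas` are hypothetical lists holding η_s, an estimate of E[X_s], and β_s for s = 1, …, N; how E[X_s] is estimated is left open here.

```python
def fdr_size_threshold(eta, expected, betas):
    """Return the smallest s with eta_s >= E[X_s] / beta_s, or None.

    All three lists are indexed so that position s - 1 corresponds to size s.
    """
    for s, (eta_s, exp_s, beta_s) in enumerate(zip(eta, expected, betas), start=1):
        if eta_s >= exp_s / beta_s:
            return s
    return None

# betas sum to 0.05, so returning all subnetworks of size >= s* would
# give FDR <= 0.05 under the theorem's assumptions.
s_star_fdr = fdr_size_threshold(
    eta=[10, 6, 3],
    expected=[8.0, 1.0, 0.01],
    betas=[0.01, 0.02, 0.02],
)
```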
Experimental Results
Interaction network
HPRD: 18796 nodes, 37107 edges
Datasets
1. Glioblastoma Multiforme (GBM) [TCGA, Nature, 2008]: 601 sequenced genes in 91 samples; array copy number data on all genes
2. Lung Adenocarcinoma [Ding et al., Nature, 2008]: 623 sequenced genes in 188 samples
3. Ovarian Carcinoma (OV) [TCGA, Nature, 2011]: all genes sequenced in 316 samples; array copy number data on all genes
GBM [TCGA, Nature 2008]
Manually created pathways: RB1, p53, RTK/RAS/PI(3)K. Significant?
GBM: Mutations + Copy number
[Results table not recoverable; surviving column header: FDR]
Lung Adenocarcinoma [Ding et al., Nature 2008]
Results: Lung Adenocarcinoma enrichment
[Table columns: s, #net ≥ s, p-val, KEGG pathway/p-val. Surviving values: 6, 3; the rest of the table is not recoverable.]
Ovarian Carcinoma [TCGA, Nature 2011]
Dendrix
F. Vandin, E. Upfal, B. J. Raphael. “De novo Discovery of Mutated Driver Pathways in Cancer”, Genome Research, to appear.
Network Methods
Inputs: mutation matrix (patients × genes) and interaction network
• Networks are noisy!
• Too many groups of genes to test exhaustively
De novo method?
Pathways and Mutations
[Figure: pathway from the cell membrane to transcription factors (genes shown: RAS, RAF, EGFR, MEK, MAPK), next to a mutation matrix (patients × genes)] [Thomas, et al. (2007)]
1. Exclusivity: Driver mutations are rare. A cancer pathway has one driver mutation (gene) per patient [Vogelstein and K. W. Kinzler (2004), Yeang, McCormick, and Levine (2008)]
Pathways and Mutations
[Figure: pathway from the cell membrane to transcription factors (genes shown: RAS, RAF, EGFR, MEK, MAPK)]
1. Exclusivity: Driver mutations are rare. A cancer pathway has one driver mutation (gene) per patient [Vogelstein and K. W. Kinzler (2004), Yeang, McCormick, and Levine (2008)]
2. Coverage: Most patients have a mutation in an important cancer pathway.
Exclusive (Column) Submatrix
[Figure: mutation matrix (patients × genes)]
Exclusivity and Coverage
Coverage: Γ(g) = {patients in which gene g is mutated}
Γ(M) = ∪_i Γ(g_i) = {patients in which ≥ 1 of {g_1, g_2, …, g_k} is mutated}
Exclusivity and Coverage
[Figure: mutation matrix (patients × genes) with coverage |Γ(M)| marked]
Maximum Coverage Exclusive Submatrix Problem: Given k > 0, find the exclusive set M of k genes that maximizes |Γ(M)|.
Theorem: The Maximum Coverage Exclusive Submatrix Problem is NP-hard.
Relaxing Exclusivity
“Approximately exclusive”, high coverage submatrix
For a set M of genes:
Coverage overlap: γ(M) = Σ_i |Γ(g_i)| - |Γ(M)|
γ(M) = 0 if and only if M is exclusive
Goal: |Γ(M)| large and γ(M) small
Approximate Exclusivity
Goal: |Γ(M)| large and γ(M) small
Weight: W(M) = |Γ(M)| - γ(M) = 2 |Γ(M)| - Σ_i |Γ(g_i)|
Maximum Weight Submatrix Problem: Given k > 0, find the set M of k genes that maximizes W(M).
Theorem: The Maximum Weight Submatrix Problem is NP-hard.
[Figure: mutation matrices (patients × genes)]
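The definitions of Γ(M), γ(M) and W(M) can be checked on a toy example. The sketch below assumes the mutation data is stored as a dict from gene to the set of patients in which it is mutated; that layout and the function names are illustrative choices, not part of the slides.

```python
def coverage(mutations, gene_set):
    """Gamma(M): patients in which at least one gene of M is mutated."""
    covered = set()
    for g in gene_set:
        covered |= mutations[g]
    return covered

def coverage_overlap(mutations, gene_set):
    """gamma(M) = sum_i |Gamma(g_i)| - |Gamma(M)|; zero iff M is exclusive."""
    total = sum(len(mutations[g]) for g in gene_set)
    return total - len(coverage(mutations, gene_set))

def weight(mutations, gene_set):
    """W(M) = |Gamma(M)| - gamma(M) = 2 |Gamma(M)| - sum_i |Gamma(g_i)|."""
    return len(coverage(mutations, gene_set)) - coverage_overlap(mutations, gene_set)

mutations = {"A": {1, 2}, "B": {3, 4}, "C": {5}, "D": {1, 3}}
w_exclusive = weight(mutations, ["A", "B", "C"])   # exclusive set
w_overlap = weight(mutations, ["A", "B", "D"])     # D overlaps A and B
```

In this example {A, B, C} is exclusive and covers five patients, so W = 5, while {A, B, D} covers four patients with an overlap of two, so W = 2.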
Summary of Results
1. Simple greedy algorithm
– Proof for a particular generative model
2. MCMC algorithm for any type of data
– Proof of rapid convergence
Greedy Algorithm
[Figure: greedy algorithm on a mutation matrix (patients × genes), set size k = 3]
Time complexity: O(n^2)
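The slide does not spell out the greedy steps, so the following is a sketch of one natural greedy strategy that is consistent with the stated O(n^2) cost: pick the pair of genes with maximum weight W, then repeatedly add the gene that maximizes W of the enlarged set. Treat the details as an assumption; the dict-based data layout and the function names are hypothetical.

```python
from itertools import combinations

def weight(mutations, genes):
    """W(M) = 2 |Gamma(M)| - sum_i |Gamma(g_i)|."""
    covered = set().union(*(mutations[g] for g in genes))
    return 2 * len(covered) - sum(len(mutations[g]) for g in genes)

def greedy_gene_set(mutations, k):
    """Greedy heuristic for the Maximum Weight Submatrix Problem (assumes k >= 2)."""
    genes = list(mutations)
    # Step 1: best pair, examined exhaustively (quadratic in the number of genes).
    best = list(max(combinations(genes, 2), key=lambda pair: weight(mutations, pair)))
    # Step 2: extend one gene at a time by the largest resulting weight.
    while len(best) < k:
        candidates = [g for g in genes if g not in best]
        best.append(max(candidates, key=lambda g: weight(mutations, best + [g])))
    return set(best)

mutations = {"A": {1, 2}, "B": {3, 4}, "C": {5}, "D": {1, 3}}
chosen = greedy_gene_set(mutations, 3)
```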
Independent Genes Model
Reasonable assumption for some mutation types
[Figure: mutation matrices (patients × genes); M* = {driver genes}; per-gene mutation probabilities p_1, p_2, … ∈ [p_L, p_U]]
Modeling Random Mutations
Thm. The greedy algorithm returns M* with high probability provided:
• Mutations are generated from the Independent Genes Model
• Number of patients ≥ β log n, where n = number of genes and β depends on the parameters of the Independent Genes Model
• For reasonable parameters, ≈1800 patients are needed
Markov Chain Monte Carlo
Sample gene sets M with |M| = k according to W(M)
Markov chain:
• States: sets M
• Transitions: replace a gene in M
Generate a sequence of states: M(1), M(2), M(3), …
Markov Chain Convergence Thm: M(i) → Pr[M] ~ W(M)
[Figure: states M1, M2, M3, M4 with transitions]
If a Markov chain converges to its stationary distribution in polynomial time, it is rapidly mixing
Metropolis-Hastings
Define the transition probabilities of the Markov chain so that Pr[.] = desired distribution
In general: no guarantees on rate of convergence
Markov chain:
• States: sets M
• Transitions: replace a gene in M
[Figure: states M1, M2, M3, M4 with transitions]
MCMC approach
Thm. The Markov chain is rapidly mixing.
• Returns a distribution on sets [proportional to W(M)], not just the optimal [max W(M)] set
• No assumptions on the distribution of mutations
• Can handle various mutation types
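A Metropolis-Hastings chain over gene sets of size k can be sketched as follows. The slides state the target only as Pr[M] ~ W(M); because W(M) can be negative, this sketch assumes a target proportional to exp(c · W(M)) with a hypothetical constant `c`, and that choice is an assumption rather than something the slides specify. Each step proposes swapping one gene in M for one gene outside M, which is a symmetric proposal, and accepts with probability min(1, exp(c · ΔW)).

```python
import math
import random

def weight(mutations, genes):
    """W(M) = 2 |Gamma(M)| - sum_i |Gamma(g_i)|."""
    covered = set().union(*(mutations[g] for g in genes))
    return 2 * len(covered) - sum(len(mutations[g]) for g in genes)

def mcmc_gene_sets(mutations, k, steps=10000, c=0.5, seed=0):
    """Return visit counts per gene set; requires k < number of genes.

    ASSUMED target distribution: Pr[M] proportional to exp(c * W(M)).
    """
    rng = random.Random(seed)
    genes = sorted(mutations)
    current = rng.sample(genes, k)
    counts = {}
    for _ in range(steps):
        outside = [g for g in genes if g not in current]
        proposal = current[:]
        proposal[rng.randrange(k)] = rng.choice(outside)  # replace one gene
        delta = weight(mutations, proposal) - weight(mutations, current)
        if delta >= 0 or rng.random() < math.exp(c * delta):
            current = proposal
        key = frozenset(current)
        counts[key] = counts.get(key, 0) + 1
    return counts

mutations = {"A": {1, 2}, "B": {3, 4}, "C": {5}, "D": {1, 3}}
counts = mcmc_gene_sets(mutations, k=3, steps=2000, c=2.0)
most_visited = max(counts, key=counts.get)
```

The visit counts approximate a distribution over gene sets, so suboptimal but high-weight sets remain visible alongside the most frequently sampled one.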
Experimental Results
Simulated data
Cancer data
1. Known mutations in multiple cancer types [Thomas et al. (2007)]: 238 known mutations in 1000 patients of 17 different cancer types
2. Lung Adenocarcinoma [Ding et al. (2008)]: 623 sequenced genes in 188 patients
3. Brain cancer (GBM) [TCGA (2008)]: 601 sequenced genes in 84 patients; array copy number data on all genes
Known mutations in multiple cancer types
238 known mutations; 1000 tumors from 17 cancer types [Thomas, et al. (2007)]
• 8 mutations cover 94% of mutated tumors
• Only 15 tumors with >1 mutation
[Figure: result (p < 0.01); legend: mutation exclusive, co-occurring, none]
Lung Adenocarcinoma
623 sequenced genes in 188 patients [Ding et al. (2008)]
[Figure: two results, each with p < 0.01]
Brain Cancer (GBM)
Mutations (single-nucleotide and copy number) for 601 genes in 84 samples [TCGA (2008)]
[Figure: three results, each with p < 0.01]