CSCI2950(C:TopicsinComputaonalBiology* Lecture4,Part2*...

10/6/11

1

CSCI2950-‐C: Topics in Computa6onal Biology Lecture 4, Part 2

Tes6ng Groups of Genes

Ben Raphael Lecture given by Fabio Vandin

[email protected] October 3, 2011

hPp://cs.brown.edu/courses/csci2950-‐c/

Func6onal Muta6ons Sta$s$cal approach: Find recurrent muta*ons in many cancer genomes.

Muta6ons and other genomic measurements

•  Hundreds of cancer samples •  Dozens of cancer types

10/6/11

2

Mo6va6on Large amount of data is coming! The Cancer Genome Atlas: 206 glioblastoma mul6forme samples [TCGA, Nature 2008] What genes are altered/mutated?

Mo6va6on Cancer is a disease of “pathways”

[Hanahan and Weinberg, Cell 2000]

What pathways are altered/mutated?

36%

2%

6%

18%

21%

3% 2% 5% 3%

4%

10/6/11

3

Finding Mutated Pathways Standard prac6ce: assess enrichment of muta6ons on

known pathways/sets of genes

Gene Ontology

www.geneontology.org

10/6/11

4

Gene annota6ons for Yeast

KEGG Pathways

10/6/11

5

10/6/11

6

Tes6ng Predefined Groups of Genes

Hypergeometric Test (see notes)

Finding Mutated Pathways

Limita6ons: •  Only known pathways/sets are tested! •  Pathways are interconnected (crosstalk)

Standard prac6ce: assess enrichment of muta6ons on known pathways/sets of genes

10/6/11

7

Finding Mutated Pathways

Are these results significant? Other interes6ng pathways?

Manually constructed network of interac6ons

[TCGA, Nature 2008] RB1: 78% of samples

p53: 87% of samples

RTK/RAS/PI(3)K: 88% of samples

Problem Given:

1.  Large-‐scale interac6on network 2.  Muta6on data from mul6ple cancer

samples Find: Subnetworks mutated in a significant number of samples

gene

= alteration

10/6/11

8

Problem Defini6on

Given: 1.  Interac6on network G = (V, E)

V = genes. E = interac6ons b/w genes 2.  Binary altera6on matrix Find: Subnetworks mutated in a significant number of samples

– subnetwork = connected subgraph – subnetwork mutated in sample if ≥ 1 gene mutated in sample

Samples

Gene

s

Computa6onal Formula6on

Goal: Find S such that Pr[NS ≥ m] < ε under suitable null distribu*on

For subnetwork S: NS = number of samples in which S is mutated with

random altera6ons m = number of observed samples in which S mutated

Samples

Gene

s

10/6/11

9

Mutated subnetworks: Naïve Method

Problems 1.  Mul6ple hypothesis tes6ng:

> 1020 candidate subnetworks with < 6 genes 2.  Network topology :

TP53 has 238 neighbors in HPRD network

Find: S such that Pr[NS ≥ m] < ε under suitable null distribu6on Naïve Method: Test each S

Naïve Method

•  Example: 91 samples of glioblastoma mul6forme [TCGA, Nature 2008]

•  Consider only paths with 3 nodes: – FDR ≤ 0.01: 1700 significant paths! –  few genes altered in many samples

•  Remove significant genes?

10/6/11

10

HotNet

F. Vandin, E. Upfal, B. J. Raphael. “Algorithms for Detec6ng Significantly Mutated Pathways in Cancer”, JCB 2011.

Contribu6on 1. Methods for de novo discovery of mutated

subnetworks I.  Combinatorial model II.  Enhanced influence model: HotNet

2.  Defini6on of Influence Graph: Iden6fy subnetworks using both frequency of altera6on and network topology

3. Sta*s*cal tests to assess the significance

10/6/11

11

(Local) Topology MaPers

Single path between mutated genes

Path between mutated genes is one of many through node.

Influence Graph alteration = unit source of heat

Heat diffusion Influence measure

Easily derived from Laplacian matrix of G

10/6/11

12

Influence Graph alteration = unit source of heat

.

.

.

Influence graph

Encodes the topology of G

Heat equa6on

f(t) = (f1(t), ..., fn(t) )T heat on ver6ces at 6me t.

f(t) = e-‐L t f(0)

df/dt = (A – D) f(t) A = [aij] = adjacency matrix of G.

dfidt

=�

j

aij(fj(t)� fi(t))

L = D – A = Laplacian matrix of G.

e-‐L t is heat kernel of G

10/6/11

13

Discovering Significant Subnetworks

Two methods: 1.  Combinatorial model 2.  Enhanced Influence Model: HotNet

Based on Influence Graph Sta6s6cal tests to assess significance

Combinatorial Model

Fix K: find the subnetwork with K genes mutated in the maximum number of samples Connected maximum coverage problem

(“graph version” of maximum coverage problem – NP-‐Hard)

Influence graph

Samples

Gene

s

10/6/11

14

Connected maximum coverage problem

1.  Thm. NP-‐Hard for general graphs.

2.  Thm. NP-‐Hard for star graphs.

3.  Thm. 1 – 1/e approx. alg. for spider graphs

4.  Thm. 1/(cr) approx. alg. for general graphs –  c=(2e-‐1)/(e-‐1) –  r= radius of the op6mal solu6on in G

Combinatorial Model: Sta6s6cal Test

Fix K: find the subnetwork with K genes mutated in the maximum number of samples tes6ng the number of altered samples only 1 hypothesis no mul6ple correc6on!

Limita6on: inadequate representa6on of heterogeneity of cancer altera6ons

10/6/11

15

Influence Graph

= alterations Samples

Gene

s

Altera$on Matrix

Enhance Influence Model (EIM)

(1)

(2)

Extract “significantly hot” subnetworks

Two-‐stage mul6-‐hypothesis test

Enhanced Influence

Hot

Cold

EIM: Sta6s6cal test Xs = number of subnetworks with ≥ s genes using “random” altera6on matrix.

H0s : Xs ≥ ηs, s = 1, …, N = # genes.

Two-‐stage mul$-‐hypothesis test 1.  Let s* = smallest s where H0s is rejected.

Pr [ Xs ≥ ηs ] < α / N (Bonferroni correc6on) # hypotheses = #s ≤ # measured genes.

2 subnetworks with 2 or more genes

10/6/11

16

EIM: Sta6s6cal test

Thm. Fix β1, …, βN such that Σi βi ≤ β. Let s* be smallest s such that ηs ≥ E[Xs] / βs. If return all subnetworks of size ≥ s* as significant, then FDR ≤ β.

Xs = number of subnetworks with ≥ s genes using “random” altera6on matrix.

H0s : Xs ≥ ηs, s = 1, …, N = # genes.

Two-‐stage mul$-‐hypothesis test 2.  Bound false discovery rate (FDR) for list of iden7fied subnetworks.

2 subnetworks with 2 or more genes

Experimental Results Interac$on network

HPRD: 18796 nodes, 37107 edges

Datasets 1.   Glioblastoma Mul$forme (GBM) [TCGA, Nature, 2008]

601 sequenced genes in 91 samples Array copy number data on all genes

2.   Lung Adenocarcinoma [Ding et al., Nature, 2008] 623 sequenced genes in 188 samples

3.   Ovarian Carcinoma (OV) [TCGA, Nature, 2011]

All genes sequenced in 316 samples Array copy number data on all genes

10/6/11

17

GBM [TCGA, Nature 2008]

RB1

p53

RTK/RAS/PI(3)K

Manually created Significant?

GBM: Muta6ons + Copy number

FDR

10/6/11

18

Lung Adenocarcinoma [Ding et al., Nature 2008]

Results: Lung Adenocarcinoma enrichment

s #net ≥ s p-‐val KEGG pathway/p-‐val 6 3

10/6/11

19

Ovarian Carcinoma [TCGA, Nature 2011]

Dendrix

F. Vandin, E. Upfal, B. J. Raphael. “De novo Discovery of Mutated Driver Pathways in Cancer”, Genome Research, to appear.

10/6/11

20

Network Methods

Mutation matrix

Interaction network

Networks are noisy! Too many groups of genes to test exhaustively

Pa$e

nts

Genes

De novo method?

Pathways and Muta6ons

RAS

RAF

EGFR

MEK

MAPK

cell membrane

Transcription factors Patients

Gen

es

[Thomas, et al. (2007)]

Driver mutations are rare. Cancer pathway has one driver mutation (gene) per patient [Vogelstein and K. W. Kinzler (2004), Yeang, McCormick, and Levine (2008)]

1. Exclusivity

10/6/11

21

Pathways and Muta6ons

RAS

RAF

EGFR

MEK

MAPK

cell membrane

Most patients have mutation in important cancer pathway. 2. Coverage

Transcription factors

Driver mutations are rare. Cancer pathway has one driver mutation (gene) per patient [Vogelstein and K. W. Kinzler (2004), Yeang, McCormick, and Levine (2008)]

1. Exclusivity

Exclusive (Column) Submatrix

Mutation Matrix

genes ;

patie

nts

Exclusivity and Coverage

Coverage: Γ(g) = {pa6ents in which gene g mutated} Γ(M) = Ui Γ(gi) = {pa6ents in which ≥ 1 of {g1,g2,…,gk} is mutated}

10/6/11

22

Exclusivity and Coverage genes

;

patie

nts

coverage |Γ(M)|

Maximum Coverage Exclusive Submatrix Problem: Given k>0, find the exclusive set M of k genes that maximizes |Γ(M)| Theorem: Maximum Coverage Exclusive Submatrix Problem is NP-Hard.

genes ;

patie

nts

Relaxing Exclusivity

“Approximately exclusive”, high coverage submatrix

For set M of genes: Coverage overlap: γ(M) = Σi|Γ(gi)| - |Γ(M)| γ(M) = 0 if and only if M is exclusive Goal: |Γ(M)| large and γ(M) small

10/6/11

23

Approximate Exclusivity Goal: Γ(M) large and γ(M) small Weight: W(M) = |Γ(M)| - γ(M)

= 2 |Γ(M)| - Σi|Γ(gi)|

genes ;

patie

nts

Maximum Weight Submatrix Problem: Given k>0, find the set M of k genes that maximizes W(M) Theorem. Maximum Weight Submatrix Problem is NP-hard.

genes ;

patie

nts

Summary of Results

1.  Simple greedy algorithm

–  Proof for par6cular genera6ve model

2.  MCMC algorithm for any type of data –  Proof of rapid convergence

10/6/11

24

Greedy Algorithm

set size k = 3

Time complexity: O(n2)

genes ;

patie

nts

Independent Genes Model Reasonable assump6on for some muta6on types

;a

M* = {driver genes}

;a

p1

patie

nts

genes genes

patie

nts

p2 …

p1, p2,… ϵ[pL;pU]

10/6/11

25

Modeling Random Muta6ons

Thm. The greedy algorithm returns M* with high probability provided:

•  Mutations from Independent Genes Model •  Number of patients ≥ β log n, n = number of genes.

β depends on parameters of Independent Genes Model

•  For reasonable parameters ≈1800 patients needed

Markov Chain Monte Carlo Sample gene sets |M| = k according to W(M)

Generate sequence of states: M(1), M(2), M(3), … Markov Chain Convergence Thm: M(i) Pr[M] ~ W(M)

Markov chain States: sets M Transitions : replace gene in M

M1

M3 M4

M2

If Markov Chains converges to stationary distribution in polynomial time, it is rapidly mixing

10/6/11

26

Metropolis-‐Has6ngs Define transi6on probabili6es of Markov Chain so Pr[.] = desired distribu6on

In general: no guarantees on rate of convergence

M1

M3 M4

M2 Markov chain States: sets M Transitions : replace gene in M

MCMC approach Thm. Markov Chain is rapidly mixing. Returns a distribution on sets [proportional to W(M)], not just optimal [max W(M)] set No assumptions on distribution of mutations

•  Can handle various mutation types

10/6/11

27

Experimental Results Simulated data

Cancer data 1.   Known muta$ons in mul$ple cancer types [Thomas et al.

(2007)] 238 known muta6ons in 1000 pa6ents of 17 different cancer types

2.   Lung Adenocarcinoma [Ding et al. (2008)] 623 sequenced genes in 188 pa6ents

3.   Brain cancer (GBM) [TCGA (2008)]

601 sequenced genes in 84 pa6ents Array copy number data on all genes

Known muta6ons in mul6ple cancer types

(p < 0.01)

238 known mutations. 1000 tumors from 17 cancer types [Thomas, et al. (2007)]

8 mutations cover 94% of mutated tumors Only 15 tumors with >1 mutation

Mutation: Exclusive

Co-occurring None

10/6/11

28

Lung Adenocarcinoma

(p < 0.01) (p < 0.01)

623 sequenced genes in 188 patients [Ding et al. (2008)]

Brain Cancer (GBM)

(p < 0.01)

(p < 0.01) (p < 0.01)

Mutations (single-nucleotide and copy number) for 601 genes in 84 samples [TCGA (2008)]

CSCI2950(C:TopicsinComputaonalBiology* Lecture4,Part2*...

Documents

Transcript of CSCI2950(C:TopicsinComputaonalBiology* Lecture4,Part2*...

CSCI2950(C:*Topics*in*Computaonal*Biology* Lecture*4,*Part2*...

Documents

Transcript of CSCI2950(C:*Topics*in*Computaonal*Biology* Lecture*4,*Part2*...

CSCI2950(C:TopicsinComputaonalBiology* Lecture4,Part2*...

Transcript of CSCI2950(C:TopicsinComputaonalBiology* Lecture4,Part2*...