CSCI2950(C:*Topics*in*Computaonal*Biology* Lecture*4,*Part2*...

28
10/6/11 1 CSCI2950C: Topics in Computa6onal Biology Lecture 4, Part 2 Tes6ng Groups of Genes Ben Raphael Lecture given by Fabio Vandin [email protected] October 3, 2011 hPp://cs.brown.edu/courses/csci2950c/ Func6onal Muta6ons Sta$s$cal approach: Find recurrent muta*ons in many cancer genomes. Muta6ons and other genomic measurements Hundreds of cancer samples Dozens of cancer types

Transcript of CSCI2950(C:*Topics*in*Computaonal*Biology* Lecture*4,*Part2*...

  • 10/6/11

    1

    CSCI2950-‐C:  Topics  in  Computa6onal  Biology  Lecture  4,  Part  2  

    Tes6ng  Groups  of  Genes  

    Ben  Raphael  Lecture  given  by  Fabio  Vandin  

    [email protected]  October  3,  2011  

    hPp://cs.brown.edu/courses/csci2950-‐c/  

     

    Func6onal  Muta6ons  Sta$s$cal  approach:  Find  recurrent  muta*ons  in  many  cancer  genomes.  

    Muta6ons  and  other  genomic  measurements  

    •   Hundreds  of  cancer  samples  •   Dozens  of  cancer  types  

  • 10/6/11

    2

    Mo6va6on  Large  amount  of  data  is  coming!    The  Cancer  Genome  Atlas:  206  glioblastoma    mul6forme  samples  [TCGA,  Nature  2008]    What  genes  are  altered/mutated?    

    Mo6va6on  Cancer  is  a  disease  of  “pathways”  

                       [Hanahan  and  Weinberg,  Cell  2000]  

    What  pathways  are  altered/mutated?  

    36%  

    2%  

    6%  

    18%  

    21%  

    3%  2%   5%   3%  

    4%  

  • 10/6/11

    3

    Finding  Mutated  Pathways  Standard  prac6ce:  assess  enrichment  of  muta6ons  on  

    known  pathways/sets  of  genes  

    Gene  Ontology  

    www.geneontology.org

  • 10/6/11

    4

    Gene  annota6ons  for  Yeast  

    KEGG  Pathways  

  • 10/6/11

    5

  • 10/6/11

    6

    Tes6ng  Predefined  Groups  of  Genes  

         Hypergeometric  Test  (see  notes)  

    Finding  Mutated  Pathways  

    Limita6ons:  •  Only  known  pathways/sets  are  tested!  •  Pathways  are  interconnected  (crosstalk)  

    Standard  prac6ce:  assess  enrichment  of  muta6ons  on  known  pathways/sets  of  genes  

  • 10/6/11

    7

    Finding  Mutated  Pathways  

    Are  these  results  significant?  Other  interes6ng  pathways?  

    Manually  constructed  network    of  interac6ons  

    [TCGA,  Nature  2008]   RB1: 78% of samples

    p53: 87% of samples

    RTK/RAS/PI(3)K: 88% of samples

    Problem  Given:    

    1.  Large-‐scale  interac6on  network  2.  Muta6on  data  from  mul6ple  cancer  

    samples            Find:  Subnetworks  mutated  in  a  significant  number  of  samples  

    gene

    = alteration

  • 10/6/11

    8

    Problem  Defini6on  

    Given:  1.  Interac6on  network  G  =  (V,  E)  

     V  =  genes.    E  =  interac6ons  b/w  genes  2.  Binary  altera6on  matrix    Find:  Subnetworks  mutated  in  a  significant  number  of  samples  

    – subnetwork  =  connected  subgraph  – subnetwork  mutated  in  sample  if  ≥  1  gene  mutated  in  sample  

    Samples  

    Gene

    s  

    Computa6onal  Formula6on  

    Goal:  Find  S  such  that  Pr[NS  ≥  m]  <  ε  under  suitable  null  distribu*on          

    For  subnetwork  S:    NS  =  number  of  samples  in  which  S  is  mutated  with    

                             random  altera6ons            m  =  number  of  observed  samples  in  which  S  mutated  

    Samples  

    Gene

    s  

  • 10/6/11

    9

    Mutated  subnetworks:  Naïve  Method  

    Problems  1.  Mul6ple  hypothesis  tes6ng:    

    >  1020  candidate  subnetworks  with  <  6  genes  2.  Network  topology  :  

     TP53  has  238  neighbors  in  HPRD  network      

    Find:  S  such  that  Pr[NS  ≥  m]  <  ε     under  suitable  null  distribu6on  Naïve  Method:  Test  each  S    

    Naïve  Method  

    •  Example:  91  samples  of  glioblastoma  mul6forme  [TCGA,  Nature  2008]  

    •  Consider  only  paths  with  3  nodes:  – FDR  ≤  0.01:  1700  significant  paths!  –  few  genes  altered  in  many  samples  

    •  Remove  significant  genes?  

  • 10/6/11

    10

    HotNet  

    F.  Vandin,  E.  Upfal,  B.  J.  Raphael.  “Algorithms  for  Detec6ng  Significantly  Mutated  Pathways  in  Cancer”,  JCB  2011.  

    Contribu6on  1. Methods  for  de  novo  discovery  of  mutated  

    subnetworks  I.  Combinatorial  model  II.  Enhanced  influence  model:  HotNet  

    2.  Defini6on  of  Influence  Graph:                                      Iden6fy  subnetworks  using  both  frequency  of  altera6on  and  network  topology  

    3. Sta*s*cal  tests  to  assess  the  significance  

  • 10/6/11

    11

    (Local)  Topology  MaPers  

    Single  path  between  mutated  genes  

    Path  between  mutated  genes  is  one  of  many  through  node.  

    Influence  Graph  alteration = unit source of heat

    Heat diffusion Influence measure

    Easily derived from Laplacian matrix of G

  • 10/6/11

    12

    Influence  Graph  alteration = unit source of heat

    .

    .

    .

    Influence graph

    Encodes the topology of G

    Heat  equa6on  

    f(t)  =  (f1(t),  ...,  fn(t)  )T    heat  on  ver6ces  at  6me  t.  

    f(t)  =  e-‐L  t  f(0)    

    df/dt  =  (A  –  D)  f(t)                A  =  [aij]  =  adjacency  matrix  of  G.    

    dfidt

    =�

    j

    aij(fj(t)� fi(t))

    L  =  D  –  A  =  Laplacian  matrix  of  G.  

    e-‐L  t  is  heat  kernel  of  G      

  • 10/6/11

    13

    Discovering  Significant  Subnetworks  

    Two  methods:  1.  Combinatorial  model  2.  Enhanced  Influence  Model:  HotNet  

    Based  on  Influence  Graph    Sta6s6cal  tests  to  assess  significance  

    Combinatorial  Model  

    Fix  K:  find  the  subnetwork  with  K  genes  mutated  in  the  maximum  number  of  samples        Connected  maximum  coverage  problem  

    (“graph  version”  of  maximum  coverage  problem  –  NP-‐Hard)  

    Influence graph

    Samples  

    Gene

    s  

  • 10/6/11

    14

    Connected  maximum  coverage  problem  

    1.  Thm.  NP-‐Hard  for  general  graphs.  

    2.  Thm.  NP-‐Hard  for  star  graphs.  

    3.  Thm.  1  –  1/e  approx.  alg.  for  spider  graphs  

    4.  Thm.  1/(cr)  approx.  alg.  for  general  graphs  –  c=(2e-‐1)/(e-‐1)  –  r=  radius  of  the  op6mal  solu6on  in  G  

    Combinatorial  Model:  Sta6s6cal  Test  

    Fix  K:  find  the  subnetwork  with  K  genes  mutated  in  the  maximum  number  of  samples    tes6ng  the  number  of  altered  samples    only    1  hypothesis                                no  mul6ple  correc6on!  

    Limita6on:  inadequate  representa6on  of  heterogeneity  of  cancer  altera6ons  

  • 10/6/11

    15

    Influence  Graph

    = alterations Samples  

    Gene

    s  

    Altera$on  Matrix  

    Enhance  Influence  Model  (EIM)  

    (1)

    (2)

    Extract  “significantly  hot”  subnetworks

    Two-‐stage    mul6-‐hypothesis  test  

    Enhanced  Influence      

    Hot

    Cold

    EIM:  Sta6s6cal  test  Xs  =  number  of  subnetworks  with  ≥  s  genes      using  “random”  altera6on  matrix.  

    H0s  :  Xs  ≥  ηs,  s  =  1,  …,  N  =  #  genes.    

    Two-‐stage  mul$-‐hypothesis  test  1.  Let  s*  =  smallest  s  where  H0s  is  rejected.    

     Pr  [  Xs  ≥  ηs  ]  <  α    /  N    (Bonferroni  correc6on)      #  hypotheses  =  #s  ≤  #  measured  genes.  

    2  subnetworks  with  2  or  more  genes  

  • 10/6/11

    16

    EIM:  Sta6s6cal  test  

    Thm.  Fix  β1,  …,  βN  such  that  Σi  βi  ≤  β.    Let  s*  be  smallest  s  such  that  ηs  ≥  E[Xs]  /  βs.    If  return  all  subnetworks  of  size  ≥  s*  as  significant,  then  FDR  ≤  β.    

    Xs  =  number  of  subnetworks  with  ≥  s  genes  using  “random”  altera6on  matrix.  

    H0s  :  Xs  ≥  ηs,  s  =  1,  …,  N  =  #  genes.    

    Two-‐stage  mul$-‐hypothesis  test  2.  Bound  false  discovery  rate  (FDR)  for  list  of            iden7fied  subnetworks.  

     

    2  subnetworks  with  2  or  more  genes  

    Experimental  Results  Interac$on  network  

     HPRD:  18796  nodes,  37107  edges            

    Datasets  1.   Glioblastoma  Mul$forme  (GBM)  [TCGA,  Nature,  2008]  

    601  sequenced  genes  in  91  samples  Array  copy  number  data  on  all  genes  

    2.   Lung  Adenocarcinoma  [Ding  et  al.,  Nature,  2008]  623  sequenced  genes  in  188  samples  

     3.   Ovarian  Carcinoma  (OV)  [TCGA,  Nature,  2011]  

    All  genes  sequenced  in  316  samples  Array  copy  number  data  on  all  genes    

     

  • 10/6/11

    17

    GBM  [TCGA,  Nature  2008]  

    RB1

    p53

    RTK/RAS/PI(3)K

    Manually created Significant?

    GBM:  Muta6ons  +  Copy  number  

       

    FDR  

  • 10/6/11

    18

    Lung  Adenocarcinoma  [Ding  et  al.,  Nature  2008]  

    Results:  Lung  Adenocarcinoma  enrichment  

    s   #net  ≥  s   p-‐val   KEGG  pathway/p-‐val  6   3  

  • 10/6/11

    19

    Ovarian  Carcinoma  [TCGA,  Nature  2011]  

    Dendrix  

    F.  Vandin,  E.  Upfal,  B.  J.  Raphael.  “De  novo  Discovery  of  Mutated  Driver  Pathways  in  Cancer”,  Genome  Research,  to  appear.  

  • 10/6/11

    20

    Network  Methods  

    Mutation matrix

    Interaction network

    Networks  are  noisy!   Too many groups of genes to test exhaustively

    Pa$e

    nts  

    Genes    

    De novo method?

    Pathways  and  Muta6ons  

    RAS  

    RAF  

    EGFR

    MEK  

    MAPK  

    cell membrane

    Transcription factors Patients

    Gen

    es

    [Thomas, et al. (2007)]

    Driver mutations are rare. Cancer pathway has one driver mutation (gene) per patient [Vogelstein and K. W. Kinzler (2004), Yeang, McCormick, and Levine (2008)]

    1. Exclusivity

  • 10/6/11

    21

    Pathways  and  Muta6ons  

    RAS  

    RAF  

    EGFR

    MEK  

    MAPK  

    cell membrane

    Most patients have mutation in important cancer pathway. 2. Coverage

    Transcription factors

    Driver mutations are rare. Cancer pathway has one driver mutation (gene) per patient [Vogelstein and K. W. Kinzler (2004), Yeang, McCormick, and Levine (2008)]

    1. Exclusivity

    Exclusive (Column) Submatrix

    Mutation Matrix

    genes ;  

    patie

    nts

    Exclusivity  and  Coverage  

    Coverage:      Γ(g) =  {pa6ents  in  which  gene  g  mutated}      Γ(M) = Ui Γ(gi) = {pa6ents  in  which  ≥  1  of  {g1,g2,…,gk}  is  mutated}  

  • 10/6/11

    22

    Exclusivity  and  Coverage  genes

    ;  

    patie

    nts

    coverage |Γ(M)|

    Maximum Coverage Exclusive Submatrix Problem: Given k>0, find the exclusive set M of k genes that maximizes |Γ(M)| Theorem: Maximum Coverage Exclusive Submatrix Problem is NP-Hard.

    genes ;  

    patie

    nts

    Relaxing  Exclusivity  

    “Approximately exclusive”, high coverage submatrix

    For set M of genes: Coverage overlap: γ(M) = Σi|Γ(gi)| - |Γ(M)| γ(M) = 0 if and only if M is exclusive Goal: |Γ(M)| large and γ(M) small

  • 10/6/11

    23

    Approximate  Exclusivity  Goal:  Γ(M)  large  and  γ(M)  small    Weight:  W(M) = |Γ(M)| - γ(M)

    = 2 |Γ(M)| - Σi|Γ(gi)|

       

    genes ;  

    patie

    nts

    Maximum Weight Submatrix Problem: Given k>0, find the set M of k genes that maximizes W(M) Theorem. Maximum Weight Submatrix Problem is NP-hard.

    genes ;  

    patie

    nts

    Summary  of  Results  

    1.  Simple  greedy  algorithm  

    –  Proof  for  par6cular  genera6ve  model  

    2.  MCMC  algorithm  for  any  type  of  data  –  Proof  of  rapid  convergence  

  • 10/6/11

    24

    Greedy  Algorithm  

    set  size k =  3  

    Time  complexity:  O(n2)

    genes ;  

    patie

    nts

    Independent  Genes  Model                  Reasonable  assump6on  for  some  muta6on  types  

    ;a    

    M* = {driver genes}

    ;a    

    p1

    patie

    nts

    genes genes

    patie

    nts

    p2 …

    p1, p2,… ϵ[pL;pU]

  • 10/6/11

    25

    Modeling  Random  Muta6ons  

    Thm. The greedy algorithm returns M* with high probability provided:

    •  Mutations from Independent Genes Model •  Number of patients ≥ β log n, n = number of genes.

    β depends on parameters of Independent Genes Model

    •  For reasonable parameters ≈1800 patients needed

    Markov  Chain  Monte  Carlo  Sample  gene  sets  |M|  =  k  according  to  W(M)          

         Generate  sequence  of  states:    M(1),  M(2),  M(3),  …    Markov  Chain  Convergence  Thm:  M(i)              Pr[M]  ~  W(M)    

    Markov chain States: sets M Transitions : replace gene in M

    M1  

    M3   M4  

    M2  

    If Markov Chains converges to stationary distribution in polynomial time, it is rapidly mixing

  • 10/6/11

    26

    Metropolis-‐Has6ngs  Define  transi6on  probabili6es  of  Markov  Chain  so    Pr[.]  =  desired  distribu6on  

               

     In  general:  no  guarantees  on  rate  of  convergence  

           

    M1  

    M3   M4  

    M2  Markov chain States: sets M Transitions : replace gene in M

    MCMC  approach  Thm. Markov Chain is rapidly mixing. Returns a distribution on sets [proportional to W(M)], not just optimal [max W(M)] set No assumptions on distribution of mutations

    •  Can handle various mutation types

  • 10/6/11

    27

    Experimental  Results  Simulated  data  

    Cancer  data  1.   Known  muta$ons  in  mul$ple  cancer  types  [Thomas  et  al.  

    (2007)]      238  known  muta6ons  in  1000  pa6ents    of  17  different  cancer  types    

    2.   Lung  Adenocarcinoma  [Ding  et  al.  (2008)]      623  sequenced  genes  in  188  pa6ents  

     3.   Brain  cancer  (GBM)  [TCGA  (2008)]  

         601  sequenced  genes  in  84  pa6ents        Array  copy  number  data  on  all  genes  

    Known  muta6ons  in  mul6ple  cancer  types  

    (p < 0.01)

    238 known mutations. 1000 tumors from 17 cancer types [Thomas, et al. (2007)]

    8 mutations cover 94% of mutated tumors Only 15 tumors with >1 mutation

    Mutation: Exclusive

    Co-occurring None

  • 10/6/11

    28

    Lung  Adenocarcinoma  

    (p < 0.01) (p < 0.01)

    623 sequenced genes in 188 patients [Ding et al. (2008)]

    Brain  Cancer  (GBM)  

    (p < 0.01)

    (p < 0.01) (p < 0.01)

    Mutations (single-nucleotide and copy number) for 601 genes in 84 samples [TCGA (2008)]