Forsharing cshl2011 sequencing

Post on 05-Jul-2015

Short overview talk on exome and genome sequencing and DNAse-seq.

High-Resolution Views of Cancer Genomes

The Central Dogma

+

Your Nature Paper

Our First Experiment

Overview of BAC in the Genome

Sequencing a BAC

Sequence Coverage

Repeats

Repeats

Repeats are not created equal

Genomic Sequencing

Targeting the Exome

  Long oligos synthesized on arrays (DNA)

  RNA baits synthesized from DNA oligo template

  RNA baits hybridized to DNA sequencing library

  Targets captured using beads and biotin-labeled baits

  RNA bait degraded, leaving sequencing library enriched for target regions

Data Flow

  FASTQ files generated by Illumina pipeline

  Aligned to reference genome (hg18, excluding _random, unmapped, and hap) using Novoalign

  SAM/BAM used extensively

  Follow Broad Institute GATK pipeline for exome capture

  Use Picard Java library for quality assessment

  Processed BAM files available via local HTTP for browsing

Data Pipeline....

  Samtools import

  Samtools sort

  Picard MarkDuplicates

  GATK Indel Realignment

  GATK Quality Recalibration

  Picard QC metrics
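To make one of these stages concrete, the following is a minimal sketch of the idea behind duplicate marking. It is a toy illustration, not Picard's implementation: the `Read` fields and the rule "same chromosome, 5' start, and strand; keep the read with the highest base-quality sum" are simplifying assumptions made for this example.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical, simplified read record used only for this sketch."""
    name: str
    chrom: str
    start: int      # 5' alignment coordinate
    strand: str     # "+" or "-"
    qual_sum: int   # sum of base qualities


def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    Reads sharing chromosome, start, and strand are treated as copies of
    one original fragment; the copy with the highest quality sum is kept.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.start, read.strand)].append(read)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: r.qual_sum)
        duplicates.update(r.name for r in group if r is not best)
    return duplicates


reads = [
    Read("r1", "chr1", 100, "+", 900),
    Read("r2", "chr1", 100, "+", 950),
    Read("r3", "chr1", 100, "-", 800),
    Read("r4", "chr2", 100, "+", 700),
]
print(sorted(mark_duplicates(reads)))
```

Here only `r1` is flagged: it shares a start and strand with `r2` but has the lower quality sum.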

Realignment around Indels

  The problem

-  Aligners align each read independently

-  Potentially leads to increased error rates around indels

  A potential solution

-  Locally realign reads in regions that might harbor an indel

-  Goal is to align reads overlying indels more accurately, reducing errors in each read and, in turn, reducing SNV call error rates
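A toy example may help show why this matters. The sketch below is not the GATK algorithm; it only counts mismatches for one read placed against a reference with and without a gap. The sequences, the 2-base deletion, and the function names are all invented for illustration.

```python
def mismatches(read, ref_segment):
    """Count positions where read and reference differ (compared pairwise)."""
    return sum(1 for a, b in zip(read, ref_segment) if a != b)


def ungapped_mismatches(read, ref, start):
    """Mismatches when the read is laid on the reference without any gap."""
    return mismatches(read, ref[start:start + len(read)])


def gapped_mismatches(read, ref, start, del_pos, del_len):
    """Mismatches when the alignment skips del_len reference bases at del_pos."""
    left_len = del_pos - start
    left = mismatches(read[:left_len], ref[start:del_pos])
    right = mismatches(read[left_len:], ref[del_pos + del_len:])
    return left + right


ref = "ACGTACGTTTGACCAGTCAG"
sample = ref[:8] + ref[10:]   # the sample carries a 2-base deletion
read = sample[4:16]           # a read that spans the deletion

ungapped = ungapped_mismatches(read, ref, start=4)
gapped = gapped_mismatches(read, ref, start=4, del_pos=8, del_len=2)
print("ungapped mismatches:", ungapped)
print("gapped mismatches:", gapped)
```

Placed without a gap, every base downstream of the deletion is shifted and looks like a mismatch, which a SNV caller could mistake for a run of variants. With the gap in place the read matches the reference exactly.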

Quality Recalibration

  Since most SNV callers will rely on quality scores to estimate error probabilities, having the best possible estimates for error rates is important

  Reported error rates from the Illumina sequencer generally reflect technical parameters of the base call process, but not other systematic biases

  Quality recalibration can include covariates to account for systematic biases

-  Cycle count, dinucleotide context, original quality, and sample/library variables
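The arithmetic behind recalibration can be sketched as follows. This is a simplified illustration, not the GATK implementation: the covariate key (reported quality, cycle) and the add-one smoothing are assumptions chosen for the example. Quality scores are Phred-scaled, so Q = -10 log10(p), where p is the error probability.

```python
import math
from collections import defaultdict


def phred_to_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def prob_to_phred(p):
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)


def empirical_qualities(observations):
    """Estimate an empirical quality for each covariate bin.

    observations: iterable of (covariate_key, is_mismatch) pairs taken at
    positions not expected to be real variants. The smoothing
    (errors + 1) / (bases + 2) is an assumption made for this sketch.
    """
    counts = defaultdict(lambda: [0, 0])   # key -> [bases seen, mismatches]
    for key, is_mismatch in observations:
        counts[key][0] += 1
        counts[key][1] += int(is_mismatch)
    return {
        key: prob_to_phred((errors + 1) / (bases + 2))
        for key, (bases, errors) in counts.items()
    }


# 998 bases reported as Q30 at cycle 75, of which 9 disagree with the reference.
key = (30, 75)
obs = [(key, True)] * 9 + [(key, False)] * 989
recalibrated = empirical_qualities(obs)
print(f"reported Q{key[0]}, empirical Q{recalibrated[key]:.1f}")
```

In this made-up bin the sequencer reported Q30 (1 error in 1000), but the observed mismatch rate is closer to 1 in 100, so the recalibrated score drops to about Q20.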

Variant Calling and Evaluation

A developing art

Sequencing Tumor/Normal Pairs

Good SNP

Suspect Variant

Somatic (tumor only) Variant

Likely False Positive (normal only)

LOH
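The categories named on these slides can be expressed as a small decision rule over paired genotypes. The sketch below is an assumption-laden simplification: the genotype labels (`ref`, `het`, `hom`) and the function are hypothetical, and it ignores the read-level evidence that separates a "good" call from a "suspect" one.

```python
def classify_pair(normal_gt, tumor_gt):
    """Classify a site from normal and tumor genotypes.

    Genotypes are 'ref' (no variant), 'het', or 'hom' (labels assumed
    for this sketch).
    """
    if normal_gt == "ref" and tumor_gt == "ref":
        return "no variant"
    if normal_gt == "ref":
        return "somatic (tumor only)"
    if tumor_gt == "ref":
        return "likely false positive (normal only)"
    if normal_gt == "het" and tumor_gt == "hom":
        return "LOH"
    if normal_gt == tumor_gt:
        return "germline variant"
    return "other"


for pair in [("ref", "het"), ("het", "ref"), ("het", "hom"), ("het", "het")]:
    print(pair, "->", classify_pair(*pair))
```

A real pipeline would also weigh coverage, allele fractions, and tumor purity before assigning any of these labels.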

NCI60 Exome Sequencing

No Normals Available!

Variants by Genomic Location

All Coding Variants

Type 1: in dbSNP, Type 2: not in dbSNP

Coding, novel (no dbSNP)
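Without matched normals, membership in dbSNP serves as a rough filter. A minimal sketch of the Type 1 / Type 2 split, assuming variants are dictionaries keyed by chromosome, position, and alleles (the field names and the `coding` flag are hypothetical):

```python
def split_by_dbsnp(variants, dbsnp_keys):
    """Split variants into Type 1 (in dbSNP) and Type 2 (not in dbSNP)."""
    type1, type2 = [], []
    for variant in variants:
        key = (variant["chrom"], variant["pos"], variant["ref"], variant["alt"])
        (type1 if key in dbsnp_keys else type2).append(variant)
    return type1, type2


# Hypothetical inputs for illustration only.
dbsnp_keys = {("chr1", 100, "A", "G")}
variants = [
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "G", "coding": True},
    {"chrom": "chr1", "pos": 250, "ref": "C", "alt": "T", "coding": True},
    {"chrom": "chr2", "pos": 900, "ref": "G", "alt": "A", "coding": False},
]

type1, type2 = split_by_dbsnp(variants, dbsnp_keys)
coding_novel = [v for v in type2 if v["coding"]]
print(len(type1), len(type2), len(coding_novel))
```

Novel variants are enriched for somatic events but still include rare germline variants, so this split is a prioritization step rather than a somatic call.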

Copy Number from Exomes

Complete Genome Sequencing

Complete Genomics Data

Data

  Delivery

-  Via USB results

  Storage

  Sizes are LARGE

-  400GB per sample as delivered with raw reads included

  Should use 2-location backed-up storage

-  Not trivial to find such storage, so might resort to multiple USB drives

  Minimize:

-  Data movement

-  Keeping multiple copies indefinitely
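A back-of-the-envelope estimate shows how quickly this adds up. The sketch uses the 400GB-per-sample figure from the slide; the function name, the default of two storage locations, and the decimal GB-to-TB conversion are assumptions.

```python
GB_PER_SAMPLE = 400  # as delivered, raw reads included (figure from the slide)


def storage_tb(n_samples, n_locations=2):
    """Total storage in decimal terabytes across all backup locations."""
    return n_samples * GB_PER_SAMPLE * n_locations / 1000


for n in (1, 10, 50):
    print(f"{n} samples -> {storage_tb(n):.1f} TB")
```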

Breakdown of Data Sizes

Data

  Delivery

  Storage

  Processing

  Data are typically tab-delimited text files, so Excel can be useful for examining individual small files

  Generally, command-line tools needed

  Mac OS and Linux are the only supported operating systems, but Windows might work....

  Some analyses (snpdiff) require large memory
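For files too large for Excel, tab-delimited text can be streamed row by row so that memory use stays flat. The sketch below uses only the Python standard library; the column names, the `#` comment convention, and the sample rows are assumptions for illustration and are not taken from the delivered file format.

```python
import csv
import io


def iter_rows(handle, comment="#"):
    """Yield one dict per data row, skipping blank and comment lines."""
    lines = (
        line for line in handle
        if line.strip() and not line.startswith(comment)
    )
    yield from csv.DictReader(lines, delimiter="\t")


# Stand-in for a large file on disk; in practice pass open(path, newline="").
text = (
    "#hypothetical header comment\n"
    "chrom\tbegin\tend\tvarType\n"
    "chr1\t100\t101\tsnp\n"
    "chr1\t200\t203\tdel\n"
)

snps = [row for row in iter_rows(io.StringIO(text)) if row["varType"] == "snp"]
print(snps)
```

Because `iter_rows` is a generator, filters like the one above never hold more than one row in memory at a time.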

Directory Structure

Workflows

  Tumor/Normal

-  Copy Number

-  Structural Variation

-  Annotated Somatic Variants

  Germline

-  List of annotated genotypes per individual, summarized into a single file that can be used for filtering
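The germline summary step can be pictured as a join of per-individual genotype lists into one table with a column per individual. This is a minimal sketch under assumed data structures; the variant keys, genotype labels, and the `NA` placeholder are hypothetical.

```python
def merge_genotypes(per_individual):
    """Combine {individual: {variant: genotype}} into one row per variant.

    A missing call is recorded as 'NA'; it is not assumed to be reference.
    """
    individuals = sorted(per_individual)
    variants = sorted({v for calls in per_individual.values() for v in calls})
    rows = []
    for variant in variants:
        row = {"variant": variant}
        for individual in individuals:
            row[individual] = per_individual[individual].get(variant, "NA")
        rows.append(row)
    return rows


calls = {
    "child": {("chr1", 100, "A", "G"): "het", ("chr2", 50, "C", "T"): "hom"},
    "mother": {("chr1", 100, "A", "G"): "het"},
    "father": {("chr2", 50, "C", "T"): "het"},
}

table = merge_genotypes(calls)
child_hom = [row for row in table if row["child"] == "hom"]
print(child_hom)
```

Once everything sits in one table, filters such as "homozygous in the child" reduce to a single pass over the rows.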

Germline Workflow

Germline Workflow

  Output

  Future directions

  Be “smarter” about inheritance framework

  Further refinements of comparison to other data types (exomes, SNP arrays, RNA-seq)

Tumor/Normal Workflow

Medvedev et al., Nature 2009

The Cancer Genome Atlas Research Network Nature 000, 1-8 (2008) doi:10.1038/nature07385

Frequent genetic alterations in three critical signalling pathways.

Chromatin

  Chromatin is the complex of protein and DNA that makes up the chromosomes. It is not a static structure.

  DNAse is an enzyme that cuts DNA at locations where DNA is accessible

  These “accessible” regions have been associated with open chromatin

  Regions of open chromatin are necessary for transcriptional and regulatory machinery to have access to gene neighborhoods and facilitate transcription

DNAse Hypersensitivity

  Method for finding regions of “open” chromatin

  In data published with the ENCODE consortium, DNAse hypersensitive (HS) sites were shown to be correlated with:

-  Histone modification

-  Transcription start sites

-  Early replicating regions

-  Transcription factor binding sites (experimentally determined by ChIP/chip, etc.)

Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. The ENCODE Consortium. Nature, 2007.

DNAse-chip Method

Crawford, G.E., Davis, S., Scacheri, P.C., Renaud, G., Halawi, M.J., Erdos, M.R., Green, R., Meltzer, P.S., Wolfsberg, T.G., and Collins, F.S. Nat Methods, 2006

DNAse-Seq Method

Crawford, G.E., Davis, S., Scacheri, P.C., Renaud, G., Halawi, M.J., Erdos, M.R., Green, R., Meltzer, P.S., Wolfsberg, T.G., and Collins, F.S. Nat Methods, 2006

DNAse Sites Relative to Genes

DNAse HS Sites and Gene Expression

  DNAse HS sites near transcription start sites are associated with actively transcribed genes.

  Distances between sequences in non-DNAse HS regions have an oscillating pattern with a period that corresponds to a single turn of the double helix

  DNAse is known to cut preferentially in the minor groove, which is exposed every 10.4 bases when DNA is wrapped around a nucleosome

  About 147 base pairs of DNA are wrapped around each nucleosome

  Implication: Nucleosomes are positioned in a highly organized, precise manner
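The two numbers on this slide fix how many helical turns fit on one nucleosome and where accessible minor-groove positions would fall. The sketch below works through that arithmetic; the starting phase of 0 and the rounding to whole bases are assumptions, since the slide gives only the 10.4-base repeat and the 147 bp length.

```python
HELICAL_REPEAT_BP = 10.4   # bases per exposure of the minor groove (from the slide)
NUCLEOSOME_BP = 147        # base pairs wrapped around one nucleosome (from the slide)

turns = NUCLEOSOME_BP / HELICAL_REPEAT_BP


def expected_cut_offsets(first=0.0):
    """Offsets (in bases) along nucleosomal DNA where the minor groove faces out.

    `first` is the offset of the first exposed position; its value is
    arbitrary here and only sets the phase.
    """
    offsets = []
    k = 0
    while first + k * HELICAL_REPEAT_BP < NUCLEOSOME_BP:
        offsets.append(round(first + k * HELICAL_REPEAT_BP))
        k += 1
    return offsets


offsets = expected_cut_offsets()
print(f"helical turns per nucleosome: {turns:.1f}")
print("expected exposed positions:", offsets)
```

If nucleosomes sat at random offsets from cell to cell, this roughly 10-base spacing would average out; seeing it in pooled data is what supports the positioning claim.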

Nucleosome Positioning

The Last Mile