Hanna bosc2010

The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Matt Hanna and Mark DePristo, Genome Sequencing and Analysis Group, Medical and Population Genetics Program, Broad Institute of Harvard and MIT

Transcript of Hanna bosc2010

Page 1: Hanna bosc2010

The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data

Matt Hanna and Mark DePristo

Genome Sequencing and Analysis Group, Medical and Population Genetics Program

Broad Institute of Harvard and MIT

Page 2: Hanna bosc2010

The Genome Analysis Toolkit Agenda

•  GATK Overview and Concepts
•  GATK Workflow
•  Example: A Simple Bayesian Genotyper

Page 3: Hanna bosc2010

GATK: Overview and Concepts Motivation

[Figure: Coverage in xMHC region of JPT individuals]

•  Dataset size greatly increases analysis complexity.
•  Implementation issues can prematurely terminate long-running jobs or introduce subtle bugs.

Page 4: Hanna bosc2010

GATK: Overview Simplifying the process of writing analysis tools for resequencing data

•  The framework is designed to support most common paradigms of analysis algorithms
   –  Provides structured access to reads in BAM format, reference context, as well as reference-associated metadata
•  General-purpose
   –  Optimized for ease of use and completeness of functionality within scope
•  Efficient
   –  Engineering investment on performance of critical data structures and manipulation routines
•  Convenient
   –  Structured plug-in model makes developing in Java against the framework relatively pain-free

Page 5: Hanna bosc2010

GATK: Overview The MapReduce design philosophy

Map: function f(x) is applied to each element of the list. Operations are independent of each other.

Reduce: function r(x, y, …, z) is recursively reduced over each f(…). The result depends on all sites.

Data elements:   a  b  c  d  e
Map, X = f(x):   A  B  C  D  E
Result is:       R = r(A, r(B, …, E))
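The map and reduce steps on this slide can be sketched in a few lines of Java. This is a minimal illustration of the general pattern, not GATK code: the class name `MapReduceSketch` and its methods are hypothetical, and string concatenation simply stands in for a real reduce function.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;

/** Illustrative only: a generic map step followed by a recursive reduce. */
final class MapReduceSketch {

    /** Applies f to every element independently, then folds the results with r. */
    static <T, M> M mapReduce(List<T> data, Function<T, M> f, BinaryOperator<M> r) {
        if (data.isEmpty()) {
            throw new IllegalArgumentException("need at least one element");
        }
        return reduceFrom(data, 0, f, r);
    }

    /** Computes r(f(x_i), r(f(x_i+1), ...)), mirroring R = r(A, r(B, ..., E)). */
    private static <T, M> M reduceFrom(List<T> data, int i, Function<T, M> f, BinaryOperator<M> r) {
        M mapped = f.apply(data.get(i));
        if (i == data.size() - 1) {
            return mapped;
        }
        return r.apply(mapped, reduceFrom(data, i + 1, f, r));
    }

    public static void main(String[] args) {
        List<String> elements = Arrays.asList("a", "b", "c", "d", "e");
        Function<String, String> f = s -> s.toUpperCase();   // map: a -> A
        BinaryOperator<String> r = (x, y) -> x + y;          // reduce: concatenate
        System.out.println(mapReduce(elements, f, r));
    }
}
```

Because each call to `f` touches only its own element, the map step could run in any order or in parallel; only the reduce step needs every mapped value.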

Page 6: Hanna bosc2010

GATK: Overview Rapid development of efficient and robust analysis tools

Genome Analysis Toolkit (GATK) infrastructure

•  Analysis tool: implemented by user
•  Traversal engine: provided by framework

Provides the boilerplate code required to perform any NGS analysis

Page 7: Hanna bosc2010

GATK: Workflow Introduction

•  GATK Overview and Concepts
•  GATK Workflow
   •  An example of one of the GATK’s most common workflows
   •  Data access pattern: by locus
   •  Inputs: reads, reference, dbSNP
•  Example: A Simple Bayesian Genotyper

Page 8: Hanna bosc2010

GATK: Workflow The sharding system: dividing data into processor-sized pieces

Inputs: Reads, Reference, dbSNP

•  Divides data into small chunks that can be processed independently
•  Handles extraction of subsets of data
•  Groups small intervals together to avoid repetitive decompression
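One way to picture the first bullet is a function that cuts a genomic interval into fixed-size pieces. The sketch below is an assumption-laden illustration, not the GATK sharding code: `ShardingSketch`, `Interval`, and the fixed `shardSize` policy are all hypothetical, and it ignores the grouping of small intervals mentioned in the last bullet.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: splits a genomic interval into fixed-size, independent shards. */
final class ShardingSketch {

    /** A closed interval [start, stop] on one contig (1-based, inclusive). */
    static final class Interval {
        final String contig;
        final long start;
        final long stop;

        Interval(String contig, long start, long stop) {
            this.contig = contig;
            this.start = start;
            this.stop = stop;
        }

        @Override
        public String toString() {
            return contig + ":" + start + "-" + stop;
        }
    }

    /** Cuts the interval into consecutive shards of at most shardSize bases. */
    static List<Interval> shard(Interval whole, long shardSize) {
        if (shardSize <= 0) {
            throw new IllegalArgumentException("shardSize must be positive");
        }
        List<Interval> shards = new ArrayList<>();
        for (long s = whole.start; s <= whole.stop; s += shardSize) {
            long e = Math.min(s + shardSize - 1, whole.stop);
            shards.add(new Interval(whole.contig, s, e));
        }
        return shards;
    }

    public static void main(String[] args) {
        for (Interval piece : shard(new Interval("chr6", 1, 2500), 1000)) {
            System.out.println(piece);
        }
    }
}
```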

Page 9: Hanna bosc2010

GATK: Workflow Traversal engines: preparing data for processing

Builds data structures easily consumed by the analysis

Page 10: Hanna bosc2010

GATK: Workflow Interaction between sharding system and traversal engines

•  Datasets are split into shards, which can be processed sequentially or in parallel
•  When processing sequentially, the reduce value of each shard is used to bootstrap the next shard.
•  When processing in parallel, the result of each shard is computed independently and then “tree-reduced” together.
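The two processing modes can be contrasted in a small sketch. It assumes each shard has already been mapped to a list of integers and that two partial reduce values can be merged by addition; the class `ShardReduceSketch` and its methods are hypothetical and do not reflect the GATK's own interfaces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.stream.Collectors;

/** Illustrative only: two ways of combining per-shard results. */
final class ShardReduceSketch {

    /** Sequential: the running reduce value from one shard seeds the next. */
    static long reduceSequentially(List<int[]> shards, long initial) {
        long sum = initial;
        for (int[] shard : shards) {
            for (int mapped : shard) {
                sum = mapped + sum; // reduce(value, sum)
            }
        }
        return sum;
    }

    /** Parallel: each shard is reduced on its own, then the partial results are tree-reduced. */
    static long reduceInParallel(List<int[]> shards) {
        List<Long> partials = shards.parallelStream()
                .map(shard -> reduceSequentially(Collections.singletonList(shard), 0L))
                .collect(Collectors.toList());
        return treeReduce(partials, Long::sum);
    }

    /** Merges neighbours pairwise, level by level, until one value remains. */
    static <R> R treeReduce(List<R> partials, BinaryOperator<R> merge) {
        if (partials.isEmpty()) {
            throw new IllegalArgumentException("no shards");
        }
        List<R> level = new ArrayList<>(partials);
        while (level.size() > 1) {
            List<R> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2) {
                next.add(i + 1 < level.size() ? merge.apply(level.get(i), level.get(i + 1)) : level.get(i));
            }
            level = next;
        }
        return level.get(0);
    }

    public static void main(String[] args) {
        List<int[]> shards = Arrays.asList(new int[]{1, 1, 0}, new int[]{1, 0}, new int[]{1, 1, 1});
        System.out.println(reduceSequentially(shards, 0L));
        System.out.println(reduceInParallel(shards));
    }
}
```

Both modes give the same answer only when the merge is consistent with the sequential reduce, as it is for a simple sum.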

Page 11: Hanna bosc2010

GATK: Workflow Walkers: Analyses written by end-users

[Figure: pileup of reads over a single reference base (ref, reads), with exons and dbsnp tracks, passed to the analysis tool]

•  Walkers (analyses) can easily be written by end users. The GATK is distributed with a significant library of walkers.
•  Only the reads, reference, and reference metadata applicable to a single-base location are presented to the analysis tool.
•  The GATK provides tools to filter the pileup automatically or on demand.

Page 12: Hanna bosc2010

GATK: Workflow Other data access patterns

Other data access patterns:

Traversal Type       Description
Reads                Call map per read, along with the reference and reference-ordered metadata spanning that read.
Duplicates           Call map for each set of duplicate reads.
Read pair (naïve)    Call map for each read and its mate (naïve, requires the input BAM to be sorted in query name order).

Straightforward (but not necessarily easy) to add any new access pattern involving streaming data.

Page 13: Hanna bosc2010

GATK: Additional features Additional inputs and outputs

Reference metadata
•  Support for additional input data that is sorted in reference order can easily be added to the GATK.
•  Input types can be added by creating two new classes: a feature (data access object) and a codec (parser).
•  New file formats are indexed automatically.
•  New data types are autodiscovered via a classpath search.
•  Joint initiative with IGV.

Additional I/O
•  Analysis parameters can be added to a walker by annotating a field in the walker with an @Argument annotation.
•  Command-line argument types can become very sophisticated.
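The feature-plus-codec split might look roughly like the following. Everything here is hypothetical: `IntervalFeature`, `LineCodec`, `IntervalCodec`, and the four-column tab-separated format are invented for illustration and are not the interfaces the GATK or IGV actually define.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: a feature (data access object) and a codec (parser). */
final class FeatureCodecSketch {

    /** Hypothetical feature: one record of a made-up, reference-ordered text format. */
    static final class IntervalFeature {
        final String contig;
        final int start;
        final int end;
        final String name;

        IntervalFeature(String contig, int start, int end, String name) {
            this.contig = contig;
            this.start = start;
            this.end = end;
            this.name = name;
        }
    }

    /** Hypothetical codec contract: parse one line into one feature. */
    interface LineCodec<F> {
        F decode(String line);
    }

    /** Parser for lines of the form contig, start, end, name separated by tabs. */
    static final class IntervalCodec implements LineCodec<IntervalFeature> {
        @Override
        public IntervalFeature decode(String line) {
            String[] fields = line.split("\t");
            if (fields.length != 4) {
                throw new IllegalArgumentException("expected 4 tab-separated fields: " + line);
            }
            return new IntervalFeature(fields[0], Integer.parseInt(fields[1]),
                    Integer.parseInt(fields[2]), fields[3]);
        }
    }

    /** Decodes every non-blank, non-comment line with the supplied codec. */
    static <F> List<F> decodeAll(List<String> lines, LineCodec<F> codec) {
        List<F> features = new ArrayList<>();
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            features.add(codec.decode(line));
        }
        return features;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("# contig\tstart\tend\tname", "chr1\t100\t200\texonA");
        for (IntervalFeature f : decodeAll(lines, new IntervalCodec())) {
            System.out.println(f.name + " at " + f.contig + ":" + f.start + "-" + f.end);
        }
    }
}
```

Keeping the parser separate from the record type means a new file format only needs a new codec, while code that consumes features is left unchanged.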

Page 14: Hanna bosc2010

Walkers: Example A simple Bayesian genotyper

•  GATK Overview and Concepts
•  GATK Workflow
•  Example: A Simple Bayesian Genotyper
   •  A functional genotyper in under 150 lines of code
   •  A minimal example: calls are much lower in quality than the UnifiedGenotyper

Page 15: Hanna bosc2010

Walkers: Example A simple Bayesian genotyper: the model

Bayesian model, with an independent base model for the data:

$$L(G \mid D) = P(G)\,P(D \mid G), \qquad P(D \mid G) = \prod_{b \in \{\text{good\_bases}\}} P(b \mid G)$$

•  L(G | D): likelihood for the genotype
•  P(G): prior for the genotype
•  P(D | G): likelihood of the data given the genotype

•  Likelihood of data computed using pileup of bases and associated quality scores at given locus
•  Only “good bases” are included: those satisfying minimum base quality, mapping read quality, pair mapping quality, NQS
•  L(G|D) computed for all 10 genotypes

See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for a more complete approach
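The per-base term P(b | G) can be written out from the walker code on the later slides. The symbols Q_b (Phred base quality), ε_b (error probability), and A_1, A_2 (the two alleles of a diploid genotype) are notation introduced here, and the log10 form assumes the priors are held in log10 space, which the walker's `+= Math.log10(…)` accumulation suggests but the slides do not state.

```latex
\[
\epsilon_b = 10^{-Q_b/10}, \qquad
P(b \mid G = A_1 A_2) = \frac{1}{2} \sum_{i=1}^{2}
\begin{cases}
1 - \epsilon_b & \text{if } b = A_i \\
\epsilon_b / 3 & \text{otherwise}
\end{cases}
\]
\[
\log_{10} L(G \mid D) = \log_{10} P(G) + \sum_{b \in \{\text{good\_bases}\}} \log_{10} P(b \mid G)
\]
```

The count of 10 genotypes follows from four bases: 4 homozygous plus 6 heterozygous unordered pairs.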

Page 16: Hanna bosc2010

Walkers: Example A simple Bayesian genotyper

•  Walker specifies the data access pattern and declares command-line arguments.
•  Inheritance defines traversal type.
•  Annotation defines command-line argument.

public class GATKPaperGenotyper extends LocusWalker<Integer,Long> {

    @Argument(fullName = "log_odds_score", shortName = "LOD", doc = "The LOD threshold", required = false)
    private double LODScore = 3.0;
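How an annotated field could receive a command-line value can be sketched with plain Java reflection. This is a guess at the general mechanism, not the GATK's argument parser: the `Arg` annotation, the `Tool` class, and the `inject` helper are hypothetical stand-ins, and the sketch handles `double` fields only.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Collections;
import java.util.Map;

/** Illustrative only: filling annotated fields from parsed command-line values. */
final class ArgumentInjectionSketch {

    /** Hypothetical stand-in for an argument annotation; not the GATK's own class. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Arg {
        String shortName();
        boolean required() default true;
    }

    /** Hypothetical tool with one optional parameter that has a default. */
    static final class Tool {
        @Arg(shortName = "LOD", required = false)
        private double lodScore = 3.0;

        double lodScore() {
            return lodScore;
        }
    }

    /** Copies values into annotated double fields; optional fields keep their defaults. */
    static void inject(Object target, Map<String, String> commandLine) {
        for (Field field : target.getClass().getDeclaredFields()) {
            Arg arg = field.getAnnotation(Arg.class);
            if (arg == null) {
                continue; // not a command-line argument
            }
            String raw = commandLine.get(arg.shortName());
            if (raw == null) {
                if (arg.required()) {
                    throw new IllegalArgumentException("missing argument -" + arg.shortName());
                }
                continue;
            }
            try {
                field.setAccessible(true);
                field.setDouble(target, Double.parseDouble(raw));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
    }

    public static void main(String[] args) {
        Tool tool = new Tool();
        inject(tool, Collections.singletonMap("LOD", "5.0"));
        System.out.println(tool.lodScore());
    }
}
```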

Page 17: Hanna bosc2010

Walkers: Example A simple Bayesian genotyper

    public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {

        double likelihoods[] = DiploidGenotypePriors.getReferencePolarizedPrior(
                ref.getBase(), DiploidGenotypePriors.HUMAN_HETEROZYGOSITY, 0.01);

        // get the bases and qualities from the pileup
        ReadBackedPileup pileup = context.getBasePileup().getPileupWithoutMappingQualityZeroReads();
        byte bases[] = pileup.getBases();
        byte quals[] = pileup.getQuals();
        …

•  Walker prepares the input dataset.
•  ReadBackedPileup utility can be used to filter pileup on demand.

Page 18: Hanna bosc2010

Walkers: Example A simple Bayesian genotyper

        for (GENOTYPE genotype : GENOTYPE.values())
            for (int index = 0; index < bases.length; index++) {
                // our epsilon is the de-Phred scored base quality
                double epsilon = Math.pow(10, quals[index] / -10.0);

                byte pileupBase = bases[index];
                double p = 0;
                for (char r : genotype.toString().toCharArray())
                    p += r == pileupBase ? 1 - epsilon : epsilon / 3;
                likelihoods[genotype.ordinal()] += Math.log10(p / genotype.length());
            }

        Integer sortedList[] = MathUtils.sortPermutation(likelihoods);

•  Calculate the likelihood for each possible genotype.
•  Determine the best of the calculated genotypes.
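The likelihood loop above depends on GATK classes, so a self-contained version may help when experimenting with the model. The sketch below reuses the same arithmetic but makes several assumptions: the `Genotype` enum, the flat (all-zero log10) priors in `main`, and the simple arg-max in `best` are stand-ins chosen here, not the GATK's `DiploidGenotypePriors` or `MathUtils.sortPermutation`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Illustrative only: log10 likelihoods of the 10 diploid genotypes for one pileup. */
final class GenotypeLikelihoodSketch {

    /** The 10 unordered diploid genotypes over A, C, G, T. */
    enum Genotype { AA, AC, AG, AT, CC, CG, CT, GG, GT, TT }

    /** Adds the log10 likelihood of every pileup base to each genotype's log10 prior. */
    static double[] log10Likelihoods(byte[] bases, byte[] quals, double[] log10Priors) {
        double[] likelihoods = log10Priors.clone();
        for (Genotype genotype : Genotype.values()) {
            String alleles = genotype.name();
            for (int i = 0; i < bases.length; i++) {
                // epsilon is the de-Phred scaled base quality (error probability)
                double epsilon = Math.pow(10, quals[i] / -10.0);
                double p = 0;
                for (char allele : alleles.toCharArray()) {
                    p += allele == bases[i] ? 1 - epsilon : epsilon / 3;
                }
                likelihoods[genotype.ordinal()] += Math.log10(p / alleles.length());
            }
        }
        return likelihoods;
    }

    /** Returns the genotype with the highest log10 likelihood. */
    static Genotype best(double[] likelihoods) {
        int bestIndex = 0;
        for (int i = 1; i < likelihoods.length; i++) {
            if (likelihoods[i] > likelihoods[bestIndex]) {
                bestIndex = i;
            }
        }
        return Genotype.values()[bestIndex];
    }

    public static void main(String[] args) {
        byte[] bases = "AAAACCCC".getBytes(StandardCharsets.US_ASCII);
        byte[] quals = new byte[bases.length];
        Arrays.fill(quals, (byte) 30);                            // Q30 everywhere
        double[] flatPriors = new double[Genotype.values().length]; // assumption: no prior preference
        System.out.println(best(log10Likelihoods(bases, quals, flatPriors)));
    }
}
```

With an even mix of high-quality A and C bases the heterozygous genotype scores highest, because every base matches one of its two alleles.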

Page 19: Hanna bosc2010

Walkers: Example A simple Bayesian genotyper

        …
        if (lod > LODScore)
            out.printf("%s\t%s\t%.4f\t%c%n", context.getLocation(), selectedGenotype, lod, (char)ref.getBase());
        return 1;
    }} // end of map() function

    public Long reduce(Integer value, Long sum) {
        return value + sum;
    }

    // receives the final reduce value, so the parameter type matches reduce()'s Long
    public void onTraversalDone(Long result) {
        out.printf("Simple Genotyper genotyped %d loci.", result);
    }

•  Conditionally output the results.
•  Use reduce to calculate number of genotypes called.
•  Writing to provided output stream is guaranteed to be thread-safe.

Page 20: Hanna bosc2010

Walkers: Threading performance A simple Bayesian genotyper

GATK performance improves nearly linearly as processors are added


Page 21: Hanna bosc2010

Genome Analysis Toolkit 1000 Genomes Project

More info: http://www.broadinstitute.org/gsa/wiki/
Support: http://www.getsatisfaction.com/gsa/

Initial alignment, MSA realignment, Q-score recalibration, Base error modeling, Genotyping, SNP filtering

•  All of these tools have been developed in the GATK
•  They are memory and CPU efficient, cluster friendly and are easily parallelized
•  They are now publicly available and are being used at many sites around the world
•  Supports any BAM-compatible aligner

Page 22: Hanna bosc2010

Acknowledgments

Genome sequencing and analysis group (MPG)
Kiran Garimella (Analysis Lead), Michael Melgar, Chris Hartl, Sherman Jia, Eric Banks (Development lead), Ryan Poplin, Guillermo del Angel, Aaron McKenna, Khalid Shakir, Brett Thomas, Corin Boyko

Broad postdocs, staff, and faculty
Anthony Philippakis, Vineeta Agarwala, Manny Rivas, Jared Maguire, Carrie Sougnez, David Jaffe, Nick Patterson, Steve Schaffner, Shamil Sunyaev, Paul de Bakker

1000 Genomes project
In general but notably: Matt Hurles, Philip Awadalla, Richard Durbin, Goncalo Abecasis, Richard Gibbs, Gabor Marth, Thomas Keane, Gil McVean, Gerton Lunter, Heng Li

Copy number group
Bob Handsaker, Jim Nemesh, Josh Korn, Steve McCarroll

Cancer genome analysis
Kristian Cibulskis, Andrey Sivachenko, Gad Getz

Integrative Genomics Viewer (IGV)
Jim Robinson, Jesse Whitworth, Helga Thorvaldsdottir

Genome Sequencing Platform
In general but notably: Lauren Ambrogio, Illumina Production Team, Tim Fennell, Kathleen Tibbetts, Alec Wysoker, Ben Weisburd, Toby Bloom

MPG directorship
Stacey Gabriel, David Altshuler, Mark Daly