BIGGIE: A Distributed Pipeline for Genomic Variant Callingkubitron/courses/... · and use the right...
Transcript of BIGGIE: A Distributed Pipeline for Genomic Variant Callingkubitron/courses/... · and use the right...
BIGGIE: A Distributed Pipeline for Genomic Variant CallingRichard Xia1, Sara Sheehan1, Yuchen Zhang1, Ameet Talwalkar1, Matei Zaharia1 Jonathan Terhorst2,
Michael Jordan1,2, Yun S. Song1,2, Armando Fox1, David Patterson1
1 Computer Science Division, UC Berkeley; 2 Department of Statistics, UC Berkeley
Motivation: faster, open-source genome variant calling tools
Impact:Human genome variation is being used more and more to impactdisease diagnosis and treatment, however:
I current tools frequently disagree on variant callsI different types of variation require specialized tools
Current Tools:I GATK [2]: slow and difficult to useI CASAVA [1]: fast, but not freeI samtools mpileup [3]: slow and some accuracy issues
Our Goal:I fast, distributed variant callerI separate the genome into regions of high and low complexity
and use the right tool for the right region Figure 1: BIGGIE pipeline
Per-base SNP caller
Main idea:I Distrubted pipeline for variant calling using Spark [4]I Assign a complexity score to each baseI Use a simple SNP caller at bases with a low complexity scoreI Use more robust structural variant callers at high complexity bases
Complexity region examples:
Figure 2: Different variant calling tools should be used for regions of the genome.
Complexity score features:
Name Weight DescriptionSubstitution 3 Number of aligned reads showing a sub-
stitution with respect to the reference.Insertion 10 Number of aligned reads showing an in-
sertion with respect to the reference.Deletion 10 Number of aligned reads showing a dele-
tion with respect to the reference.Low Quality 3 Number of reads aligned with low map
quality (a common indicator of a repeti-tive region).
Table 1: Relative weight of features for computing complexity.
Results
Simulating data:I Used reads simulated from the consensus sequence for Venter’s genomeI Better approximates the true pattern of SNPs, indels, and structural variants
found in a true genome; reads were aligned using BWA and SNAP
Effect of thresholds:
10-2 10-1 100 101 102
complexity score
0
2000
4000
6000
8000
10000
12000
acc
ura
cy
Accruacy vs. complexity score
false posfalse neg
4 5 6 7 8 9 10percentage of complex bases in region
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
acc
ura
cy
Accruacy vs. percentage complex bases
false posfalse neg
Figure 3: On the left are the per-base results, measuring false negatives only on the regions wecalled. Both accuracy measures increase as the threshold increases, but the number of correctcalls increases as well. We see a similar pattern on the right for the region results, where thenumber of false positives and false negatives increase with the density of complex bases inhigh-complexity regions, but the number of true calls increases as well.
Incorporating high complexity regions
0.8 1.0 1.2 1.4 1.6 1.8 2.0genome position 1e7
0.92
0.94
0.96
0.98
1.00
1.02
1.04
1.06
hig
h c
om
ple
xit
y
High Complexity Regions, chr 21, part 1
2.0 2.2 2.4 2.6 2.8 3.0genome position 1e7
0.92
0.94
0.96
0.98
1.00
1.02
1.04
1.06
hig
h c
om
ple
xit
y
High Complexity Regions, chr 21, part 1
Figure 4: Regions are fairly uniformly distributed, except near the chromosome ends.
I We group bases into a high-complexity region in a greedy fashion,maintaining that the overall high-complexity base density is > t
I We filter out regions that are < 500 bases long
Stats, t = 5%Number of high complexity regions 3603
Percentage of genome is high complexity regions 16.6%
Results g
Timing Results:
Algorithm RuntimeGATK 35m 17s
mpileup 49m 53sBIGGIE 4m 38s
Table 2: Timing results for GATK,mpileup, and BIGGIE. The runtimeis not significantly impacted by thecomplexity threshold.
Low vs. High Complexity:
region type false pos false neg correctlow-complexity 1824 7455 38232high-complexity 2289 2788 13046
Table 3: Our performance degrades in the highcomplexity regions, which is why a special purposevariant caller should be used.
SNP false pos SNP false neg0
2000
4000
6000
8000
10000
12000
14000
16000
BWA+GATKSNAP+GATKBWA+mpileupBWA+BIGGIE,base=0.1BWA+BIGGIE,region=5%
correct SNPs0
10000
20000
30000
40000
50000
60000BWA+GATKSNAP+GATKBWA+mpileupBWA+BIGGIE,base=0.1BWA+BIGGIE,region=5%
Figure 5: Accuracy comparison of BIGGIE with mpileup and GATK. False positives in BIGGIE areoften associated with alignment errors or confusion with a small indel. For each algorithm, a verysmall percentage of correct SNP bases actually have the incorrect (unphased) genotype.
Future Work: Use the high and low complexity regions to distribute the readsacross machines, then call variants using appropriate algorithms.
References g
[1] CASAVA. (2012) http://support.illumina.com/sequencing/sequencing software/casava.ilmn.[2] DePristo M. et al, “A framework for variation discovery and genotyping using next-generation DNA sequencing data.” Nature Genetics(2011), 43:491-498.[3] Li H. et al and 1000 Genome Project Data Processing Subgroup, “The Sequence alignment/map (SAM) format and SAMtools.”Bioinformatics (2009), 25: 2078-9.[4] Zaharia M. et al, “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing.” NSDI (2012).
{rxia, ssheehan, yuczhang}@eecs.berkeley.edu