Robust Software Tools for Variant Identification and Functional Assessment
(Boston College & University of Michigan)
Gabor Marth, Goncalo Abecasis, PIs
Informatics challenges for genomic analysis
• Tool building
• Facilitating analysis
• Widening accessibility
Intentions of the RFA
Our approach
• Complete toolbox including variant interpretation
• Full pipelines for start-to-finish analysis
• Easily accessible and well documented methods
• Cloud deployment (in addition to single machine/local compute cluster)
• Open development model
Progress in first 6 months
• Starting with two sets of tools and pipelines, geared toward high quality local analysis, battle-tested in the 1000GP data and medical sequencing projects
• The two groups follow a “divide and conquer” strategy to put critical pieces in place for making our algorithms available for the wider genomics community
• Boston College
  – A universal tool/pipeline launcher application
  – Infrastructure for dissemination
  – Cloud access via Galaxy
• University of Michigan
  – Integration of variant annotation/impact assessment
  – Pipeline/workflow control infrastructure
  – Adaptation for Amazon Cloud Services
FUNCTIONALITY & TOOLS
Scope
Include latest versions
• Tools constantly evolving (as they must to remain relevant)
• Our community toolbox to be updated with new tools as they become available
ref: TATAGAGAGAGAGAGAGAGCGAGAGAGAGAGAGAGAGGGAGAGACGGAGTTalt: TATAGAGAGAGAGAGAGCGAGAGAGAGAGAGAGAGAGGGAGAGACGGAGTT
ref: TATAGAGAGAGAGAGAGAGC--GAGAGAGAGAGAGAGAGGGAGAGACGGAGTTalt: TATAGAGAGAGAGAGAG--CGAGAGAGAGAGAGAGAGAGGGAGAGACGGAGTT
New algorithms for complex variant detection (FreeBayes)
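The contrast between the ungapped and gapped views can be illustrated with a short sketch. It does not reproduce FreeBayes's algorithm. It only counts mismatch and gap columns in two representations of the same pair of haplotypes, using short toy sequences (hypothetical, not the slide's sequences) with the same repeat structure. The function name `column_summary` is an assumption made for this illustration.

```python
def column_summary(ref: str, alt: str) -> dict:
    """Count mismatch columns and gap columns in an equal-length pairwise alignment."""
    if len(ref) != len(alt):
        raise ValueError("aligned sequences must have equal length")
    mismatches = 0
    gap_columns = 0
    for r, a in zip(ref, alt):
        if r == "-" or a == "-":
            gap_columns += 1      # insertion or deletion column
        elif r != a:
            mismatches += 1       # looks like a SNP in this representation
    return {"mismatches": mismatches, "gap_columns": gap_columns}


# Toy haplotypes (hypothetical): the same two sequences, without and with gaps.
UNGAPPED = ("TAGAGAGCGAGG", "TAGAGCGAGAGG")
GAPPED = ("TAGAGAGC--GAGG", "TAGAG--CGAGAGG")

ungapped_summary = column_summary(*UNGAPPED)
gapped_summary = column_summary(*GAPPED)
print(ungapped_summary)
print(gapped_summary)
```

In the toy data the ungapped view suggests two separate SNPs, while the gapped view explains the same difference as one 2 bp deletion plus one 2 bp insertion with no mismatches. This is the kind of ambiguity that motivates calling the event as a single complex variant.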
Include tools when ready for prime time
MEI type  Sample    RetroSeq              Tangram               Tea
                    Total  Sensitivity    Total  Sensitivity    Total  Sensitivity
ALU       NA12891   719    89%            1192   98%            1127   92%
ALU       NA12892   687    86%            1185   98%            1078   92%
ALU       NA12878   793    82%            1326   99%            1038   89%
L1        NA12891   52     78%            190    81%            286    81%
The BC mobile element insertion caller performs best in its class
EPACTS variant interpretation tools (Efficient and Parallelizable Association Container Toolbox)
• Genetic analysis tool based on VCF
  – Fast and parallelizable access to large VCF files
  – Built-in widely used single variant and burden tests
  – R/C++ interface for extending to newer tests
  – Binary & quantitative phenotypes with covariates
  – Useful visualization tools of association results
• Automated visualization
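To make the idea of a burden test concrete, here is a minimal sketch of one simple variant: rare-allele counts are collapsed into a single score per sample, and the case/control difference in mean score is assessed with a one-sided permutation test. This is not EPACTS code and does not use its interface. All function names and the toy genotypes are assumptions made for illustration.

```python
import random


def burden_scores(genotypes, rare_idx):
    """Collapse rare-variant genotypes (0/1/2 alt-allele counts) into one score per sample."""
    return [sum(sample[i] for i in rare_idx) for sample in genotypes]


def mean_difference(scores, is_case):
    """Mean burden in cases minus mean burden in controls."""
    cases = [s for s, c in zip(scores, is_case) if c]
    controls = [s for s, c in zip(scores, is_case) if not c]
    return sum(cases) / len(cases) - sum(controls) / len(controls)


def permutation_p_value(scores, is_case, n_perm=1000, seed=0):
    """One-sided permutation p-value for an excess of rare alleles in cases."""
    rng = random.Random(seed)
    observed = mean_difference(scores, is_case)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break any link between phenotype and genotype
        if mean_difference(scores, labels) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): 4 cases then 4 controls, 3 rare variants each.
genotypes = [
    [1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1],
    [0, 0, 0], [0, 1, 0], [0, 0, 0], [0, 0, 0],
]
is_case = [True] * 4 + [False] * 4

scores = burden_scores(genotypes, rare_idx=[0, 1, 2])
observed, p_value = permutation_p_value(scores, is_case)
print(observed, p_value)
```

A production tool would also handle covariates and quantitative phenotypes, as listed above; the sketch covers only the collapsing step and a basic significance test.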
PIPELINES & WORKFLOW
The UM pipeline
[Figure: UM pipeline flowchart. BAM -> samtools -> genotype likelihoods -> glfMultiples -> unfiltered VCF -> vcfCooker -> hard-filtered VCF -> SVM -> filtered VCF -> Beagle/Thunder (optional LD-aware step) -> filtered/phased VCF -> EPACTS]
UMAKE workflow system
• Makefile based approach
  – The Make utility is very good for representing dependencies
  – Picks up where it left off after a failure
• Flexible deployment
  – Local machine
  – Local cluster (Mosix)
  – Amazon Web Services Elastic Compute Cloud (EC2)
• Default options
  – User configurable
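The two properties credited to Make above, dependency tracking and resuming after a failure, can be sketched in a few lines. This is not UMAKE itself. The target names are hypothetical, and the sketch checks only whether a file exists, whereas Make also compares timestamps.

```python
import os
import tempfile

# Hypothetical target -> prerequisites map, loosely modeled on the pipeline stages.
RULES = {
    "sample.glf": ["sample.bam"],
    "unfiltered.vcf": ["sample.glf"],
    "filtered.vcf": ["unfiltered.vcf"],
}


def build(target, workdir, log):
    """Build `target` if it is missing, after building its prerequisites first."""
    for dep in RULES.get(target, []):
        build(dep, workdir, log)
    path = os.path.join(workdir, target)
    if os.path.exists(path):
        return  # already built: skipping it is what lets a rerun resume
    if target not in RULES:
        raise FileNotFoundError(f"no rule to make {target}")
    # Stand-in for running a real tool on the prerequisites.
    with open(path, "w") as handle:
        handle.write(f"built from {', '.join(RULES[target])}\n")
    log.append(target)


with tempfile.TemporaryDirectory() as workdir:
    open(os.path.join(workdir, "sample.bam"), "w").close()  # the input data
    first_run = []
    build("unfiltered.vcf", workdir, first_run)   # simulates a run that stopped early
    second_run = []
    build("filtered.vcf", workdir, second_run)    # rerun builds only what is missing

print(first_run)
print(second_run)
```

Because each step is expressed as a target with prerequisites, the same description can be handed to a local machine or to a cluster scheduler without changing the pipeline logic.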
Application of UMAKE to large-scale projects
Project    Depth/Region   N        #SNPs   %dbSNP (129)   Known Ts/Tv   Novel Ts/Tv
1000G      4x Genome      1,092    34.5M   24.4           2.14          2.16
1000G      >40x Exome     822      598K    22.1           2.96          2.80
GoT2D      4x Genome      ~2,800   26.7M   25.5           2.16          2.19
ESP        >80x Exome     ~6,900   1.92M   8.6            2.94          2.83
Sardinia   3x Genome      2,120    17.6M   38.4           2.15          2.22
Bipolar    10x Genome
Computational cost is ~1 week / 1000 samples in a 5 node mini-cluster
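The Ts/Tv columns in the table report the ratio of transitions (A<->G, C<->T) to transversions (all other single-base changes) among the called SNPs, a common callset quality check. A minimal sketch of the calculation follows; the function name and the toy SNP list are assumptions for illustration, not part of UMAKE.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = set("ACGT")


def ts_tv_ratio(snps):
    """Return transitions / transversions for a list of (ref, alt) single-base changes."""
    ts = tv = 0
    for ref, alt in snps:
        if ref not in BASES or alt not in BASES or ref == alt:
            raise ValueError(f"not a valid SNP: {ref}>{alt}")
        if (ref, alt) in TRANSITIONS:
            ts += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions: ratio is undefined")
    return ts / tv


# Toy callset (hypothetical): four transitions and two transversions.
toy_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("G", "T")]
example_ratio = ts_tv_ratio(toy_snps)
print(example_ratio)
```

Computing the ratio separately for known (in dbSNP) and novel sites, as the table does, shows whether the novel calls behave like the known ones.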
ACCESSIBILITY
Simplified installation & use
• Unified launcher application (gkno)
  – single tools (e.g. Mosaik)
  – tool “macros” (e.g. map)
  – pipelines (e.g. exome variant calling)
• Download and installation
  – All tools pulled in a single step from github
  – All tools installed
  – All tools tested
Easily configurable pipeline system
• Part of our new unified launcher system (gkno)
• Pipeline types (e.g. mapping, variant calling) and instances (exome, whole-genome)
• User-configurable: tools can be swapped in and out, parameters configured via config files
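One way to picture a config-driven pipeline in which tools can be swapped and parameters overridden is sketched below. This is not gkno's actual configuration format. The dictionary layout, the task names, and the parameter names (`threads`, `min_base_quality`) are all hypothetical.

```python
import copy
import json

# Hypothetical default pipeline description: a type, an instance, and ordered steps.
DEFAULT_PIPELINE = {
    "type": "variant-calling",
    "instance": "exome",
    "steps": [
        {"task": "align", "tool": "mosaik", "parameters": {"threads": 4}},
        {"task": "call", "tool": "freebayes", "parameters": {"min_base_quality": 20}},
    ],
}


def apply_overrides(pipeline, overrides):
    """Return a copy of `pipeline` with per-task tool and parameter overrides applied."""
    result = copy.deepcopy(pipeline)  # leave the defaults untouched
    for step in result["steps"]:
        change = overrides.get(step["task"], {})
        if "tool" in change:
            step["tool"] = change["tool"]          # swap the tool for this task
        step["parameters"].update(change.get("parameters", {}))
    return result


# A user config file would normally be read from disk; here it is an inline JSON string.
user_config = json.loads('{"align": {"tool": "bwa", "parameters": {"threads": 8}}}')
configured = apply_overrides(DEFAULT_PIPELINE, user_config)
print(json.dumps(configured, indent=2))
```

Keeping the defaults and the user overrides separate means a pipeline instance runs unchanged when no config is supplied, while one task can still be pointed at a different tool.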
Support
• Documentation
• Tutorials / Blog
• User forum
• Bug reports
DEPLOYMENT / CLOUD
Software deployment
• All software is ready for running locally on a single machine
• UMAKE adds cluster support
• Cloud deployment
  – Simple Michigan pipelines ported to Amazon
  – Porting of all project software under way
Cloud-based analysis – Galaxy
OPEN & COLLABORATIVE DEVELOPMENT MODEL
Integration
• Our workflows leverage 3rd party tools for specific functionality
• All our tools are open-source, available on github (many clones, community contributed code)
• Ensemble approach (multiple tools for critical tasks)
Ensemble approach
• Multiple tools usually benefit analysis
                                     Ts/Tv
Called in      # SNPs    %dbSNP     Novel   Known   Total
Union          907,170   22.09      2.22    2.30    2.24
2 of 5         766,608   25.33      2.38    2.33    2.37
3 of 5         696,358   27.05      2.44    2.36    2.42
4 of 5         601,132   29.62      2.49    2.40    2.46
Intersection   520,083   32.20      2.53    2.42    2.49
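The rows of the table correspond to keeping a site when at least k of the callers report it: the union is k = 1 and the intersection is k = number of callers. A minimal sketch of that selection follows, using three toy callsets keyed by (chromosome, position). The function name and data are assumptions, and this is not the SVM or MLP aggregation mentioned elsewhere in the talk.

```python
from collections import Counter


def consensus(callsets, min_callers):
    """Return the sites reported by at least `min_callers` of the given callsets."""
    counts = Counter(site for calls in callsets for site in set(calls))
    return {site for site, n in counts.items() if n >= min_callers}


# Toy callsets (hypothetical) from three callers.
caller_a = {("1", 100), ("1", 200), ("1", 300)}
caller_b = {("1", 100), ("1", 200), ("1", 400)}
caller_c = {("1", 100), ("1", 500)}
callsets = [caller_a, caller_b, caller_c]

sizes = [len(consensus(callsets, k)) for k in range(1, len(callsets) + 1)]
print(sizes)  # union first, intersection last
```

As in the table, raising k shrinks the callset, trading fewer sites for stronger agreement among callers.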
Ensemble approach
• Our pipelines will use multiple aligners (BWA, Mosaik) and variant callers (FreeBayes, glfMultiples), developed by BC/UM
In progress
• Expanding pipelines to integrate all tools
• Michigan tools -> gkno
• BC tools -> Michigan cloud ready pipelines
• Large data set analysis on the cloud
• Integrate variant interpretation tools
• Integrate SV tools as they become more robust
• Integrate consensus analysis (SVM and MLP approaches to callset aggregation)
• Minimal, functional pipeline -> Galaxy
Team
Boston College
• Alistair Ward
• Derek Barnett
• Chase Miller
• Wan-Ping Lee
• Erik Garrison
• Gabor Marth
University of Michigan
• Mary-Kate Trost
• Tom Blackwell
• Hyun-Min Kang
• Youna Hu
• Adrian Tan
• Xiaowei Zhan
• Dajiang Liu
• Goncalo Abecasis