Gabor Marth, Goncalo Abecasis, PIs

Robust Software Tools for Variant Identification and Functional Assessment (Boston College & University of Michigan) Gabor Marth, Goncalo Abecasis, PIs


Page 1: Gabor Marth, Goncalo Abecasis, PIs

Robust Software Tools for Variant Identification and Functional Assessment

(Boston College & University of Michigan)

Gabor Marth, Goncalo Abecasis, PIs

Page 2: Gabor Marth, Goncalo Abecasis, PIs

Informatics challenges for genomic analysis

• Tool building

• Facilitating analysis

• Widening accessibility

Page 3: Gabor Marth, Goncalo Abecasis, PIs

Intentions of the RFA

Page 4: Gabor Marth, Goncalo Abecasis, PIs

Our approach

• Complete toolbox including variant interpretation

• Full pipelines for start-to-finish analysis

• Easily accessible and well documented methods

• Cloud deployment (in addition to single machine/local compute cluster)

• Open development model

Page 5: Gabor Marth, Goncalo Abecasis, PIs

Progress in first 6 months

• Starting with two sets of tools and pipelines, geared toward high quality local analysis, battle-tested in the 1000GP data and medical sequencing projects

• The two groups follow a “divide and conquer” strategy to put critical pieces in place for making our algorithms available for the wider genomics community

• Boston College
– A universal tool/pipeline launcher application
– Infrastructure for dissemination
– Cloud access via Galaxy

• University of Michigan
– Integration of variant annotation/impact assessment
– Pipeline/workflow control infrastructure
– Adaptation for Amazon Cloud Services

Page 6: Gabor Marth, Goncalo Abecasis, PIs

FUNCTIONALITY & TOOLS

Page 7: Gabor Marth, Goncalo Abecasis, PIs

Scope

Page 8: Gabor Marth, Goncalo Abecasis, PIs

Include latest versions

• Tools constantly evolving (as they must to remain relevant)

• Our community toolbox to be updated with new tools as they become available

ref: TATAGAGAGAGAGAGAGAGCGAGAGAGAGAGAGAGAGGGAGAGACGGAGTT
alt: TATAGAGAGAGAGAGAGCGAGAGAGAGAGAGAGAGAGGGAGAGACGGAGTT

ref: TATAGAGAGAGAGAGAGAGC--GAGAGAGAGAGAGAGAGGGAGAGACGGAGTT
alt: TATAGAGAGAGAGAGAG--CGAGAGAGAGAGAGAGAGAGGGAGAGACGGAGTT

New algorithms for complex variant detection (FreeBayes)
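The gapped ref/alt pair above represents one haplotype difference as a combination of gap events. A minimal sketch of how such events could be read off a gapped pairwise alignment is given here. The function name and event labels are illustrative assumptions, not FreeBayes internals, and the demo uses toy sequences rather than the slide's sequences.

```python
def gap_events(ref_aln, alt_aln):
    """List (kind, column, bases) events from a gapped pairwise alignment.

    Both strings must have the same length and use '-' for gaps.
    """
    if len(ref_aln) != len(alt_aln):
        raise ValueError("aligned sequences must have equal length")
    events = []
    col = 0
    n = len(ref_aln)
    while col < n:
        r, a = ref_aln[col], alt_aln[col]
        if r == "-" and a != "-":
            # Gap in the reference row: bases present only in alt.
            start = col
            while col < n and ref_aln[col] == "-" and alt_aln[col] != "-":
                col += 1
            events.append(("insertion", start, alt_aln[start:col]))
        elif a == "-" and r != "-":
            # Gap in the alt row: reference bases missing from alt.
            start = col
            while col < n and alt_aln[col] == "-" and ref_aln[col] != "-":
                col += 1
            events.append(("deletion", start, ref_aln[start:col]))
        else:
            if r != a:
                events.append(("mismatch", col, r + ">" + a))
            col += 1
    return events


if __name__ == "__main__":
    # Toy alignment: one deleted base followed by one inserted base.
    for event in gap_events("AC-GT", "A-TGT"):
        print(event)
```

Reporting nearby events together as a single complex allele, rather than as independent indels, is the kind of grouping the slide attributes to the new algorithms.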

Page 9: Gabor Marth, Goncalo Abecasis, PIs

Include tools when ready for prime time

MEI type   Sample    RetroSeq               Tangram                Tea
                     Total   Sensitivity    Total   Sensitivity    Total   Sensitivity
ALU        NA12891   719     89%            1192    98%            1127    92%
ALU        NA12892   687     86%            1185    98%            1078    92%
ALU        NA12878   793     82%            1326    99%            1038    89%
L1         NA12891   52      78%            190     81%            286     81%

The BC mobile element insertion caller (Tangram) performs best in its class

Page 10: Gabor Marth, Goncalo Abecasis, PIs

EPACTS variant interpretation tools (Efficient and Parallelizable Association Container Toolbox)

• Genetic analysis tool based on VCF
o Fast and parallelizable access to large VCF files
o Built-in widely used single variant and burden tests
o R/C++ interface for extending to newer tests
o Binary & quantitative phenotypes with covariates
o Useful visualization tools of association results

• Automated visualization

Page 11: Gabor Marth, Goncalo Abecasis, PIs

PIPELINES & WORKFLOW

Page 12: Gabor Marth, Goncalo Abecasis, PIs

The UM pipeline

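[Figure: UM pipeline flow diagram]

BAM -> samtools -> Genotype Likelihood (one per input BAM)
Genotype Likelihood -> glfMultiples -> Unfiltered VCF
Unfiltered VCF -> vcfCooker -> Hard-filtered VCF
Hard-filtered VCF -> SVM -> Filtered VCF
Filtered VCF -> Beagle/Thunder (optional LD-aware step) -> Filtered/Phased VCF
Filtered/Phased VCF -> EPACTS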

Page 13: Gabor Marth, Goncalo Abecasis, PIs

UMAKE workflow system

• Makefile based approach
– The Make utility is very good for representing dependencies
– Pick up where it left off on failure

• Flexible deployment
– Local machine
– Local cluster (Mosix)
– Amazon Web Services Elastic Compute Cloud (EC2)

• Default options
– User configurable
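The "pick up where it left off" behaviour comes from Make's rule that a target is rebuilt only when it is missing or older than its prerequisites. The following is a minimal sketch of that rule in Python, not UMAKE's actual implementation. The step names, file names, and the `touch` helper are hypothetical stand-ins for real pipeline commands.

```python
import os
import tempfile


def is_up_to_date(target, sources):
    """True if target exists and is at least as new as every source file."""
    if not os.path.exists(target):
        return False
    target_mtime = os.path.getmtime(target)
    return all(os.path.getmtime(src) <= target_mtime for src in sources)


def run_pipeline(steps):
    """Run (target, sources, action) steps in order, skipping finished ones.

    Each action is a zero-argument callable that must create its target.
    Returns the targets that were actually rebuilt.
    """
    rebuilt = []
    for target, sources, action in steps:
        if is_up_to_date(target, sources):
            continue  # finished in an earlier run: nothing to redo
        action()
        rebuilt.append(target)
    return rebuilt


def touch(path):
    """Hypothetical stand-in for a real pipeline command: write a small file."""
    def action():
        with open(path, "w") as handle:
            handle.write("done\n")
    return action


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    bam = os.path.join(workdir, "sample.bam")
    glf = os.path.join(workdir, "sample.glf")
    vcf = os.path.join(workdir, "calls.vcf")
    touch(bam)()  # pretend the input BAM already exists
    steps = [(glf, [bam], touch(glf)), (vcf, [glf], touch(vcf))]
    print(run_pipeline(steps))  # first run builds both targets
    os.remove(vcf)              # simulate a failure that lost the last output
    print(run_pipeline(steps))  # rerun rebuilds only the missing target
```

A real Makefile expresses the same dependencies declaratively and lets Make schedule independent targets in parallel, which this sequential sketch does not attempt.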

Page 14: Gabor Marth, Goncalo Abecasis, PIs


Application of UMAKE to large-scale projects

Project    Depth   Region   N        #SNPs   %dbSNP (129)   Known Ts/Tv   Novel Ts/Tv
1000G      4x      Genome   1,092    34.5M   24.4           2.14          2.16
1000G      >40x    Exome    822      598K    22.1           2.96          2.80
GoT2D      4x      Genome   ~2,800   26.7M   25.5           2.16          2.19
ESP        >80x    Exome    ~6,900   1.92M   8.6            2.94          2.83
Sardinia   3x      Genome   2,120    17.6M   38.4           2.15          2.22
Bipolar    10x     Genome

Computational cost is ~1 week per 1000 samples on a 5-node mini-cluster
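The Ts/Tv columns report the ratio of transitions (A<->G, C<->T) to transversions (all other base changes) among the called SNPs. It serves as a quick quality check, with values here around 2.1 to 2.2 for whole genomes and 2.8 to 3.0 for exomes. A minimal sketch of the computation follows. The function name and the (ref, alt) input format are assumptions for illustration, not part of UMAKE.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = {"A", "C", "G", "T"}


def ts_tv_ratio(snps):
    """Return transitions / transversions for an iterable of (ref, alt) SNPs."""
    ts = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            raise ValueError(f"not a single-base substitution: {ref}>{alt}")
        if (ref, alt) in TRANSITIONS:
            ts += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions: ratio undefined")
    return ts / tv


if __name__ == "__main__":
    # Toy calls, not data from the table: 4 transitions and 2 transversions.
    toy_calls = [("A", "G"), ("C", "T"), ("G", "A"),
                 ("T", "C"), ("A", "C"), ("G", "T")]
    print(ts_tv_ratio(toy_calls))
```

Computing the ratio separately for SNPs already in dbSNP and for novel ones, as the table does, only requires splitting the input list before calling the function.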

Page 15: Gabor Marth, Goncalo Abecasis, PIs

ACCESSIBILITY

Page 16: Gabor Marth, Goncalo Abecasis, PIs

The Boston College tool hub

http://gkno.me (genome)

Page 17: Gabor Marth, Goncalo Abecasis, PIs

Simplified installation & use

• Unified launcher application (gkno)
– single tools (e.g. Mosaik)
– tool “macros” (e.g. map)
– pipelines (e.g. exome variant calling)

• Download and installation
– All tools pulled in a single step from github
– All tools installed
– All tools tested

Page 18: Gabor Marth, Goncalo Abecasis, PIs
Page 19: Gabor Marth, Goncalo Abecasis, PIs

Easily configurable pipeline system

• Part of our new unified launcher system (gkno)

• Pipeline types (e.g. mapping, variant calling) and instances (exome, whole-genome)

• User-configurable: tools can be swapped in and out, parameters configured via config files

Page 20: Gabor Marth, Goncalo Abecasis, PIs

Support

• Documentation

• Tutorials / Blog

• User forum

• Bug reports

Page 21: Gabor Marth, Goncalo Abecasis, PIs

DEPLOYMENT / CLOUD

Page 22: Gabor Marth, Goncalo Abecasis, PIs

Software deployment

• All software is ready for running locally on a single machine

• UMAKE adds cluster support

• Cloud deployment
– Simple Michigan pipelines ported to Amazon
– Porting of all project software under way

Page 23: Gabor Marth, Goncalo Abecasis, PIs

Cloud-based analysis – Galaxy

Page 24: Gabor Marth, Goncalo Abecasis, PIs

OPEN & COLLABORATIVE DEVELOPMENT MODEL

Page 25: Gabor Marth, Goncalo Abecasis, PIs

Integration

• Our workflows leverage 3rd-party tools for specific functionality

• All our tools are open-source, available on github (many clones, community contributed code)

• Ensemble approach (multiple tools for critical tasks)

Page 26: Gabor Marth, Goncalo Abecasis, PIs

Ensemble approach

• Multiple tools usually benefit analysis

Called in      # SNPs    %dbSNP   Novel Ts/Tv   Known Ts/Tv   Total Ts/Tv
Union          907,170   22.09    2.22          2.30          2.24
2 of 5         766,608   25.33    2.38          2.33          2.37
3 of 5         696,358   27.05    2.44          2.36          2.42
4 of 5         601,132   29.62    2.49          2.40          2.46
Intersection   520,083   32.20    2.53          2.42          2.49
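The rows of this table correspond to keeping a SNP only if at least k of the 5 call sets contain it, with k = 1 for the union and k = 5 for the intersection. A minimal sketch of that vote-counting step is shown below. It assumes each variant is keyed by a hypothetical (chrom, pos, ref, alt) tuple and is not the SVM or MLP consensus method mentioned elsewhere in the deck.

```python
from collections import Counter


def called_in_at_least(callsets, k):
    """Return the variants present in at least k of the given call sets."""
    if not 1 <= k <= len(callsets):
        raise ValueError("k must be between 1 and the number of call sets")
    support = Counter()
    for calls in callsets:
        support.update(set(calls))  # count each caller at most once per variant
    return {variant for variant, n in support.items() if n >= k}


if __name__ == "__main__":
    # Toy call sets keyed by (chrom, pos, ref, alt); not data from the slides.
    caller_a = {("1", 100, "A", "G"), ("1", 200, "C", "T"), ("1", 300, "G", "T")}
    caller_b = {("1", 100, "A", "G"), ("1", 200, "C", "T")}
    caller_c = {("1", 100, "A", "G"), ("1", 400, "T", "C")}
    callsets = [caller_a, caller_b, caller_c]
    for k in range(1, len(callsets) + 1):
        print(k, len(called_in_at_least(callsets, k)))
```

Raising k trades call-set size for stricter agreement, which matches the table's trend of fewer SNPs, higher %dbSNP, and higher Ts/Tv from union to intersection.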

Page 27: Gabor Marth, Goncalo Abecasis, PIs

Ensemble approach

• Our pipelines will use multiple aligners (BWA, Mosaik) and variant callers (FreeBayes, glfMultiples), developed by BC/UM

Page 28: Gabor Marth, Goncalo Abecasis, PIs

In progress

• Expanding pipelines to integrate all tools

• Michigan tools -> gkno

• BC tools -> Michigan cloud ready pipelines

• Large data set analysis on the cloud

• Integrate variant interpretation tools

• Integrate SV tools as they become more robust

• Integrate consensus analysis (SVM and MLP approaches to callset aggregation)

• Minimal, functional pipeline -> Galaxy

Page 29: Gabor Marth, Goncalo Abecasis, PIs

Team

Boston College
• Alistair Ward
• Derek Barnett
• Chase Miller
• Wan-Ping Lee
• Erik Garrison
• Gabor Marth

University of Michigan
• Mary-Kate Trost
• Tom Blackwell
• Hyun-Min Kang
• Youna Hu
• Adrian Tan
• Xiaowei Zhan
• Dajiang Liu
• Goncalo Abecasis