Developing Accessible Application Software for Individual de novo Genome Projects Vince Forgetta,...

Developing Accessible Application Software for Individual de novo Genome Projects Vince Forgetta, PhD Candidate Ken Dewar PhD, Supervisor Department of Human Genetics, McGill University Montreal, Quebec, Canada December 8th, 2011


Developing Accessible Application Software for Individual de novo Genome Projects

Vince Forgetta, PhD Candidate
Ken Dewar PhD, Supervisor

Department of Human Genetics, McGill University Montreal, Quebec, Canada

December 8th, 2011


Next-Gen Gap

Bacterial genome in < 1 week for ~ $3000

(Nature Methods 6, S2 - S5 (2009))

(Genome Assembly)+

“Unfortunately, the software and computer hardware demands on these analyses are not much less than those of the large Genome Centers. From this perspective, the gap between large-scale genome centers and individual investigators may seem to be growing, not shrinking, as the next-generation platforms’ apparent promise of a ‘Genome Center in a box’ may have only been half delivered, providing data without a full suite of tools.”

Download Data → Learn *NIX → Install Software and Dependencies → Run Software → … Wait? … Problems?


Three Common Methodologies in de novo Genome Analysis


1. Display and analysis of genome annotations

2. Quality assessment of a genome assembly

3. Comparison and mining of genomic data from public repositories.

One or more methodologies were used to address needs in three specific projects; the projects served as a vehicle for developing the software:

Project                            Software        Methodology
C. difficile 14 Genome Comparison  cgb             1. Genome Display
Multi-centre WGS of O. novo-ulmi   ContiGo         2. Assembly QA
E. fergusonii ECD-227              BLAST in Pivot  3. Data Mining


Assembly Quality Assessment


Assembly Analysis

• Researchers should have easy access to their assembly so they can determine its quality and perform simple analyses.

(Diagram: the researcher sends DNA to the sequencing centre, which returns an assembly.)

• Delays and limits on data access exist:
- Viewers need to be installed and have specific software (e.g. Linux) or hardware (e.g. RAM) requirements.
- Assembly data (multiple GBs) must be downloaded.


Objective

• Develop a simple assembly viewer that operates within a web-browser, allowing a researcher to rapidly analyze and access their data.


Method

Parser/Converter: Uses Python to parse, analyze, and convert assembly data into web-accessible formats (HTML, JSON, JPG images), which are stored on sequencing centre servers.

Interface: Uses a browser-based interface (HTML) to dynamically access data on the servers (JavaScript). Incorporates pre-existing web technologies (jQuery, Seadragon Deepzoom AJAX).

Usage:
- After genome assembly, the parser/converter is run on sequencing centre servers.
- The researcher accesses the interface over the internet using a modern web browser.
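To make the parser/converter step more concrete, here is a minimal Python sketch of turning contig sequences into JSON that a browser could load. It is not ContiGo's code: the FASTA input, the function names (`parse_fasta`, `contig_summary`), and the JSON field names are all assumptions chosen for illustration, since the slides do not describe the actual input or output schema.

```python
import json


def parse_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dictionary."""
    contigs = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: use the first whitespace-separated token as the name.
            name = line[1:].split()[0]
            contigs[name] = []
        elif name is not None:
            contigs[name].append(line)
    return {n: "".join(parts) for n, parts in contigs.items()}


def contig_summary(contigs):
    """Build a JSON-serializable list of per-contig statistics.

    The field names are hypothetical, not ContiGo's schema.
    """
    rows = []
    for name, seq in contigs.items():
        seq = seq.upper()
        gc = seq.count("G") + seq.count("C")
        rows.append({
            "name": name,
            "length": len(seq),
            "gc_percent": round(100.0 * gc / len(seq), 1) if seq else 0.0,
        })
    return rows


# Tiny made-up input, just to exercise the functions.
fasta = ">contig1 example\nACGTACGT\nGGCC\n>contig2\nATATAT\n"
summary = contig_summary(parse_fasta(fasta))
print(json.dumps(summary, indent=2))
```

Writing the summary once on the server, as static JSON, is what lets the browser fetch only what it needs without any server-side processing per request.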


Performance

Parser/Converter:
– Multiple platforms (Windows/OS X/Linux).
– Multi-processor support.
– Low memory usage (< 250 MB of memory per processor).

User interface:
– Client-side programming decreases server load.
– Data is downloaded on demand, which suits users with limited bandwidth.
– Sole system requirement: a modern web browser (Firefox, Opera, Google Chrome), which makes installation easy.
– Low memory usage (peaks at ~250 MB).


The Interface

Table of contig/scaffold statistics:
• Sort/filter by column
• Access to contig sequence/quality and read sequences

Assembly statistics, batch download of sequence and statistical data.

Dynamic charts:
• Toggle axis value
• Identify points
• Summarize regions

Contig assembly:
- Pan/zoom
- Identify position, read names, mismatches
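As an illustration of the kind of assembly statistics such an interface might report, the sketch below computes the contig count, total bases, largest contig, and N50 from a list of contig lengths. The slides do not say which statistics ContiGo shows; N50 is used here only because it is a commonly reported assembly metric, and the function name `assembly_stats` is hypothetical.

```python
def assembly_stats(lengths):
    """Return summary statistics for a list of contig lengths."""
    if not lengths:
        return {"contigs": 0, "total_bases": 0, "largest": 0, "n50": 0}
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    # N50: length of the contig at which the running total, counted from
    # the largest contig down, first reaches half of the assembly size.
    running = 0
    n50 = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            n50 = length
            break
    return {
        "contigs": len(ordered),
        "total_bases": total,
        "largest": ordered[0],
        "n50": n50,
    }


# Made-up contig lengths for demonstration.
stats = assembly_stats([100, 200, 300, 400, 500])
print(stats)
```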


Demo


3. Data Mining


Microsoft Research Summer Internship
Microsoft Biology Foundation
Redmond, Washington, USA

Mentor - Simon Mercer


BLAST

BLAST Pivot

Pivot

blip.codeplex.com


BLAST

ACGTCACTGACTGACTAGCTAGCTAGCTAGCATCGATCGATCGATCGATCGATCGACGTAACTAGCACGACTGACTCT

?? Species, Function, …

NCBI

Local


Limitation

(Figure: E. coli, ~5000 genes; Scientist + Programmer)

>gi|301326298|ref|ZP_07219671.1| TIM-barrel protein, nifR3 family [Escherichia coli MS 78-1]
Length=321

 Score = 583.563 bits (1503), Expect = 8.65371E-165
 Identities = 280/281 (100%), Positives = 280/281 (100%), Gaps = 0/281 (0%)
 Frame = 0

Query  1    MMSSNPQVWESDKSRLRMVHIDEPGIRTVQIAGSDPKEMADAARINVESGAQIIDINMGC  60
            MMSSNPQVWESDKSRLRMVHIDEPGIRTVQIAGSDPKEMADAARINVESGAQIIDINMGC
Sbjct  41   MMSSNPQVWESDKSRLRMVHIDEPGIRTVQIAGSDPKEMADAARINVESGAQIIDINMGC  100

Query  61   PAKKVNRKLAGSALLQYPDVVKSILTEVVNAVDVPVTLKIRTGWAPEHRNCEEIAQLAED  120
            PAKKVNRKLAGSALLQYPDVVKSILTEVVN VDVPVTLKIRTGWAPEHRNCEEIAQLAED
Sbjct  101  PAKKVNRKLAGSALLQYPDVVKSILTEVVNTVDVPVTLKIRTGWAPEHRNCEEIAQLAED  160

Query  121  CGIQALTIHGRTRACLFNGEAEYDSIRAVKQKVSIPVIANGDITDPLKARAVLDYTGADA  180
            CGIQALTIHGRTRACLFNGEAEYDSIRAVKQKVSIPVIANGDITDPLKARAVLDYTGADA
Sbjct  161  CGIQALTIHGRTRACLFNGEAEYDSIRAVKQKVSIPVIANGDITDPLKARAVLDYTGADA  220

Query  181  LMIGRAAQGRPWIFREIQHYLDTGELLPPLPLAEVKRLLCAHVRELHDFYGPAKGYRIAR  240
            LMIGRAAQGRPWIFREIQHYLDTGELLPPLPLAEVKRLLCAHVRELHDFYGPAKGYRIAR
Sbjct  221  LMIGRAAQGRPWIFREIQHYLDTGELLPPLPLAEVKRLLCAHVRELHDFYGPAKGYRIAR  280

Query  241  KHVSWYLQEHAPNDQFRRTFNAIEDASEQLEALEAYFENFA  281
            KHVSWYLQEHAPNDQFRRTFNAIEDASEQLEALEAYFENFA
Sbjct  281  KHVSWYLQEHAPNDQFRRTFNAIEDASEQLEALEAYFENFA  321
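The limitation illustrated by the hit above is that plain-text BLAST output has to be turned into structured fields before thousands of hits can be sorted or filtered. The Python sketch below shows one way that could be done with regular expressions. It is an assumption-laden illustration, not BL!P's implementation: the patterns only target the text layout shown above, and the function name `parse_hit` and the output field names are hypothetical.

```python
import re

# Patterns written against the text layout shown above (an assumption;
# other BLAST output formats would need different handling).
HEADER_PATTERN = re.compile(
    r"^>(?P<id>\S+)\s+(?P<desc>.*?)\s*\[(?P<organism>[^\]]+)\]",
    re.MULTILINE,
)
STATS_PATTERN = re.compile(
    r"Score\s*=\s*(?P<bits>[\d.]+)\s*bits.*?"
    r"Expect\s*=\s*(?P<evalue>[\d.eE+-]+).*?"
    r"Identities\s*=\s*(?P<ident>\d+)/(?P<aligned>\d+)",
    re.DOTALL,
)


def parse_hit(record):
    """Extract a few structured fields from one BLAST text hit."""
    header = HEADER_PATTERN.search(record)
    stats = STATS_PATTERN.search(record)
    if header is None or stats is None:
        raise ValueError("record does not look like a BLAST text hit")
    identical = int(stats.group("ident"))
    aligned = int(stats.group("aligned"))
    return {
        "subject_id": header.group("id"),
        "description": header.group("desc"),
        "organism": header.group("organism"),
        "bit_score": float(stats.group("bits")),
        "e_value": float(stats.group("evalue")),
        "percent_identity": round(100.0 * identical / aligned, 2),
    }


record = """>gi|301326298|ref|ZP_07219671.1| TIM-barrel protein, nifR3 family [Escherichia coli MS 78-1]
Length=321

 Score = 583.563 bits (1503), Expect = 8.65371E-165
 Identities = 280/281 (100%), Positives = 280/281 (100%), Gaps = 0/281 (0%)
"""
hit = parse_hit(record)
print(hit["organism"], hit["bit_score"], hit["percent_identity"])
```

Once each hit is a dictionary like this, fields such as organism or percent identity can serve as the facets a visual browser sorts and filters on.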


Blast in Pivot

(Figure: numbered steps 1 to 3, in which query sequences of unknown origin (??) are run through BLAST and the results are loaded into Pivot.)


E. coli ECD-227

E. coli

?????

E. coli ECD-227

Acknowledgement: Moussa Diarra, Heidi Rempel

Species?

Function?

Antibiotic Resistant!

Divergent Strain


Demo


Conclusions

ContiGo: used by clients of the Genome Centre at McGill (release soon).

BL!P: >500 downloads (blip.codeplex.com).


Acknowledgements

C. difficile
Ken Dewar, Andre Dascal, Matthew Oughton, Joana Dias, Gary Leveque, Pascale Marquis, Corina Nagy, Amelie Villeneuve, Ivan Brukner, Mark Miller, Vivian Loo, Mike Mulvey, Dale Gerding, Maya Rupnik, Elaine Mardis, V. Magrini, M. Hickenbotham, K. Haub, C. Markovic, J. Nelson

Ophiostoma novo-ulmi
Jan Kieleczawa, Michael Zianni, Robert Steen, Deborah Grove, Anoja Perera, Robert Lyons Jr., Sushmita Singh, Doug Bintzler, Scottie Adams, Deborah Grove, Gregory Grove, Robert Lyons Jr., Suzanne Genik, Chris Wright, Alvaro Hernandez, Sharon Bachman, Lorie Hetrick, Sushmita Singh, Nichole Peterson, Gary Leveque, Joana Dias, Clotilde Teiling, Tim Harkins

E. coli ECD-227
H. Rempel, Andrew Metcalfe, M. S. Diarra

BL!P/Microsoft
Simon Mercer, Xin-Yi Chua, Mauro Luigi Drago, Beatriz Diaz Acosta, Vivek Kumar, Bob Davidson, Mike Zyskowski, Xiaoji Chen, Bob Silverstein, Vikram Bapat, Jared Jackson, Wei Lu, The Pivot Team