Kitzmiller Openhelisphereproject Bosc2008

The Open HeliSphere™ project True open source from the inventors of True Single Molecule Sequencing (tSMS™). Aaron Kitzmiller BOSC 2008

Transcript of Kitzmiller Openhelisphereproject Bosc2008

Page 1: Kitzmiller Openhelisphereproject Bosc2008

The Open HeliSphere™ project

True open source from the inventors of True Single Molecule Sequencing (tSMS™).

Aaron Kitzmiller

BOSC 2008

Page 2: Kitzmiller Openhelisphereproject Bosc2008

Agenda

• Introduction to the HeliScope Single Molecule Sequencer

• Helicos and Open Source

• The Open HeliSphere project

• HeliSphere code

Page 3: Kitzmiller Openhelisphereproject Bosc2008

Single Molecule Sequencing by Synthesis

[Figure: template strands with 5' ends labeled. Step: Hybridize. Label: Primer 1 ~1/µm²]

Page 4: Kitzmiller Openhelisphereproject Bosc2008

Single Molecule Sequencing by Synthesis

[Figure: template strands with 5' ends labeled and G nucleotides added. Step: Extend 'G']

Page 5: Kitzmiller Openhelisphereproject Bosc2008

SM Sequence by Synthesis

[Figure: template strands with 5' ends labeled. Step: Wash]

Page 6: Kitzmiller Openhelisphereproject Bosc2008

SM Sequence by Synthesis

[Figure: template strands with 5' ends labeled. Step: Image]

Page 7: Kitzmiller Openhelisphereproject Bosc2008

SM Sequence by Synthesis

[Figure: template strands with 5' ends labeled. Step: Cleave]

Page 8: Kitzmiller Openhelisphereproject Bosc2008

Flow Cell Imaging

• 1 run => 2 flow cells
• 1 flow cell => 25 channels
• 1 channel => 1000 fields of view (FOV)
• 1 FOV => 4 images

• 8-10 million usable strands / channel

Flow Cell

25 Channels (1.6 x 90 mm)

~12 x 12 cm

Flow cell volume = 180 µL
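The hierarchy on this slide multiplies out to per-run totals. The following is a minimal sketch of that arithmetic; the struct and function names are illustrative assumptions, and the slide does not say whether the 4 images per FOV are taken once or once per cycle, so the image count is labeled per imaging pass.

```cpp
#include <cstdint>

// Counts taken from the slide; all names here are hypothetical.
struct RunGeometry {
    std::int64_t flow_cells_per_run = 2;
    std::int64_t channels_per_flow_cell = 25;
    std::int64_t fovs_per_channel = 1000;
    std::int64_t images_per_fov = 4;
};

// Channels available in one run (both flow cells).
constexpr std::int64_t channels_per_run(const RunGeometry& g) {
    return g.flow_cells_per_run * g.channels_per_flow_cell;
}

// Fields of view covered when every channel is imaged once.
constexpr std::int64_t fovs_per_run(const RunGeometry& g) {
    return channels_per_run(g) * g.fovs_per_channel;
}

// Images produced by one imaging pass over all fields of view.
constexpr std::int64_t images_per_pass(const RunGeometry& g) {
    return fovs_per_run(g) * g.images_per_fov;
}

// Usable strands per run for a given per-channel yield.
constexpr std::int64_t strands_per_run(const RunGeometry& g,
                                       std::int64_t strands_per_channel) {
    return channels_per_run(g) * strands_per_channel;
}
```

With the slide's numbers this gives 50 channels, 50,000 fields of view and 200,000 images per pass, and 400 to 500 million usable strands per run at 8-10 million strands per channel.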

Page 9: Kitzmiller Openhelisphereproject Bosc2008

Raw data collection

- C - A G C T - - C T - G - T A - C T - G - - A G - - A - - - - A - C - A G C - - G - - - G - T - G - - - - - - - G X C T A G C T A G C T A G C T A G C T A G C T A G C T A G - C - A - C T - - C - - G C - A - - T - - C - A - - T - G - - - A G - - A - - T - - C - A - - T - - - - A - C T - - - - - - G - T A - - T - G - - - - - T A - - T A G - - - -

Page 10: Kitzmiller Openhelisphereproject Bosc2008

HeliScope and HeliSphere

Page 11: Kitzmiller Openhelisphereproject Bosc2008

Helicos and Open Source

• Helicos is an instrument company
• The diversity of bioinformatics applications is too large for us to address internally
• Open source bioinformatics applications benefit everyone, including instrument developers

• Helicos bioinformatics applications
a) Internal development
b) Academic and industrial partnerships
c) Tool vendor partnerships
d) Open Source
   Including contributions by Helicos to other projects (BioPerl, Bioconductor, etc.)

Page 12: Kitzmiller Openhelisphereproject Bosc2008

The Open HeliSphere project

• Pre-launch – TODAY
a) SVN trunk checkout
b) Tarball download
c) openhelisphere-announce, openhelisphere-devel
d) Wiki documentation
e) Datasets

• Full product launch
a) Patch submission to HeliSphere core
b) Bug tracking
c) HeliSphere contrib repository

http://open.helicosbio.com

Page 13: Kitzmiller Openhelisphereproject Bosc2008

The Open HeliSphere project

• License
a) Dual GPL + commercial

• Infrastructure
a) MediaWiki-driven website (semantically enhanced)
b) SourceForge mail, tarballs, Subversion source code control

http://open.helicosbio.com

Page 14: Kitzmiller Openhelisphereproject Bosc2008

Bioinformatics Pipeline for Digital Gene Expression

Page 15: Kitzmiller Openhelisphereproject Bosc2008

SRF file processing

• HeliScope sequencers create SRF files
a) Consortium-driven binary read container
   • Strong Sanger involvement
   • Used for submission to the NCBI Short Read Archive
b) Reads are stored in ZTR blocks
c) Instrument and run information is stored in an XML document

SRF processing converts reads into a smaller, Helicos-oriented format called SMS.

a) Perl scripts run the srf2sms binary
b) SMS places reads into blocks that are indexed by key Helicos data fields (flowcell, channel, position)
c) Extracted instrument and run XML are used for pipeline configuration
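Indexing reads by (flowcell, channel, position) means that all reads from one channel sit next to each other in key order. The following is a minimal in-memory sketch of that idea, not the SMS file layout: `ReadKey`, `ReadIndex`, and this `select_channel` function are hypothetical names, and reads are stored as plain strings.

```cpp
#include <limits>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical key built from the three fields named on the slide.
struct ReadKey {
    int flowcell;
    int channel;
    int position;
};

// Order keys by flowcell, then channel, then position.
bool operator<(const ReadKey& a, const ReadKey& b) {
    return std::tie(a.flowcell, a.channel, a.position) <
           std::tie(b.flowcell, b.channel, b.position);
}

// Sorted container standing in for the indexed blocks.
using ReadIndex = std::multimap<ReadKey, std::string>;

// Return every read of one flowcell/channel, in position order.
std::vector<std::string> select_channel(const ReadIndex& index,
                                        int flowcell, int channel) {
    std::vector<std::string> reads;
    // The smallest possible key for this channel marks the start of its range.
    auto it = index.lower_bound(
        ReadKey{flowcell, channel, std::numeric_limits<int>::min()});
    for (; it != index.end() && it->first.flowcell == flowcell &&
           it->first.channel == channel; ++it) {
        reads.push_back(it->second);
    }
    return reads;
}
```

Because the keys are sorted, selecting a channel is one lookup followed by a contiguous scan, which is the access pattern a per-channel query needs.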

Page 16: Kitzmiller Openhelisphereproject Bosc2008

SMS file

• SMS is a general binary data container
a) Manipulate with executables: smsls, sms2txt, srf2sms, filterSMS, extractSMS, mergeSMS
b) Access data directly via C++ iterator API

    read_iterator<read_record> rit(smsfile);
    read_record read;

    // query the SMS file for the desired flowcell/channel
    rit.select_channel(flowcell, channel);

    // iterate over the result set; the default output format to an ostream is FASTA
    while (!rit.end()) {
        read = *rit;
        outf << read;
        rit++;
    }
    outf.close();

Page 17: Kitzmiller Openhelisphereproject Bosc2008

Pipeline configuration

• Pipeline is a combination of Perl modules and scripts driven by XML configuration
a) Analysis combines a Protocol and parameters with a Reference Set
b) Reference Set is a pointer to one or more FASTA files
c) Protocol is a pointer to one or more executables and parameters

a) Instrument and Run XML are extracted from the SRF file
b) analysis_controller converts XML documents into an MLDBM database

Page 18: Kitzmiller Openhelisphereproject Bosc2008

DGE analysis

• DGE pipeline features common processing steps
a) Counting of aligned transcripts
b) extractSMS, filterSMS remove poor quality sequences
   • Base addition order (CTAG) sequences
   • Quality score
   • Read length
   • Normalized alignment score
a) IndexDP alignment
   • Helicos developed aligner
   • Mismatch tolerant seeded alignment with multiple alignment modes
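The filtering step can be pictured as a predicate applied to each read. The following is a minimal sketch, not filterSMS itself: the function names are hypothetical, the thresholds (length >= 20, AT < 0.9, trimming of leading Ts) are borrowed from the yeast DGE example later in the deck, and reading "AT" as the fraction of A and T bases in the read is an assumption. Quality score, base addition order, and alignment score filters are left out because the slides do not define them.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Remove any run of 'T' at the start of the read.
std::string trim_leading_t(const std::string& read) {
    const std::size_t first = read.find_first_not_of('T');
    return first == std::string::npos ? std::string() : read.substr(first);
}

// Fraction of bases that are A or T (assumed meaning of "AT").
double at_fraction(const std::string& read) {
    if (read.empty()) return 0.0;
    const auto at = std::count_if(read.begin(), read.end(), [](char c) {
        return c == 'A' || c == 'T';
    });
    return static_cast<double>(at) / static_cast<double>(read.size());
}

// Keep a read if, after trimming, it is long enough and not AT-dominated.
bool passes_filter(const std::string& raw, std::size_t min_length = 20,
                   double max_at = 0.9) {
    const std::string read = trim_leading_t(raw);
    return read.size() >= min_length && at_fraction(read) < max_at;
}
```

Trimming happens before the length check, so a read that is long only because of a leading run of Ts is rejected.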

Page 19: Kitzmiller Openhelisphereproject Bosc2008

IndexDP

ACGTACGTACCCGTA

AAGACGTACATACCCGTATTTACTTTACGT

ACGTACATACCCGTA

AAGACGTACATACCCGTATTTACTTTACGT

10mer word

Template length 15, weight 10 w/sub

• On-the-fly indexes are constructed using template families
a) Families are arrangements of positions that accommodate a given template length, weight, and mismatch number (e.g. 20:16:2)

• BLAST, et al. match on contiguous words and then extend to support fast, gapped alignments

• IndexDP uses templates to accommodate mismatches in the words
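The example on this slide can be reproduced with a single template. The following is a minimal sketch of template-based seeding, not IndexDP's implementation: the function names and the specific template `111110000011111` (length 15, weight 10) are illustrative assumptions. A `1` marks a position that must match; a `0` marks a position that is ignored.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Concatenate the bases found at the '1' positions of the template.
std::string masked_key(const std::string& seq, std::size_t offset,
                       const std::string& tmpl) {
    std::string key;
    for (std::size_t i = 0; i < tmpl.size(); ++i) {
        if (tmpl[i] == '1') key.push_back(seq[offset + i]);
    }
    return key;
}

using SeedIndex = std::unordered_map<std::string, std::vector<std::size_t>>;

// Index every reference offset under its masked key.
SeedIndex build_index(const std::string& ref, const std::string& tmpl) {
    SeedIndex index;
    for (std::size_t o = 0; o + tmpl.size() <= ref.size(); ++o) {
        index[masked_key(ref, o, tmpl)].push_back(o);
    }
    return index;
}

// Reference offsets whose masked key equals that of the read's first window.
std::vector<std::size_t> seed_hits(const std::string& read,
                                   const std::string& ref,
                                   const std::string& tmpl) {
    std::vector<std::size_t> hits;
    if (read.size() < tmpl.size()) return hits;
    const SeedIndex index = build_index(ref, tmpl);
    const auto found = index.find(masked_key(read, 0, tmpl));
    if (found != index.end()) hits = found->second;
    return hits;
}
```

For the read `ACGTACGTACCCGTA` and the reference `AAGACGTACATACCCGTATTTACTTTACGT`, a contiguous 10-mer word (`1111111111`) finds no seed because the substitution falls inside it, while the spaced template finds the hit at reference offset 3. A template family would hold several such arrangements so that every placement of the allowed mismatches is skipped by at least one of them.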

Page 20: Kitzmiller Openhelisphereproject Bosc2008

IndexDP

• After template matching, the bitHPDP core performs a dynamic programming algorithm. Supported alignment flavors:

a) Smith-Waterman
b) Global-Local. Full length of the read against a region of the reference. End gaps against the reference have zero penalty
c) Local-Local. End gaps have no penalty
d) Global-Global. Needleman-Wunsch
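Three of these flavors differ only in how the edges of the dynamic programming matrix are treated. The following is a minimal sketch with a linear gap penalty, not the bitHPDP core: the scores (match +1, mismatch -1, gap -1) and all names are assumptions, and Local-Local is omitted because the slide does not say how it differs from Smith-Waterman.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

enum class Mode { SmithWaterman, GlobalLocal, GlobalGlobal };

// Best alignment score of read (rows) against ref (columns).
int align_score(const std::string& read, const std::string& ref, Mode mode,
                int match = 1, int mismatch = -1, int gap = -1) {
    const std::size_t n = read.size();
    const std::size_t m = ref.size();
    std::vector<std::vector<int>> h(n + 1, std::vector<int>(m + 1, 0));

    // First column: read bases aligned to nothing; free only for local.
    for (std::size_t i = 1; i <= n; ++i) {
        h[i][0] = (mode == Mode::SmithWaterman) ? 0 : static_cast<int>(i) * gap;
    }
    // First row: skipped leading reference bases; charged only for global.
    for (std::size_t j = 1; j <= m; ++j) {
        h[0][j] = (mode == Mode::GlobalGlobal) ? static_cast<int>(j) * gap : 0;
    }

    int best = 0;
    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            const int diag = h[i - 1][j - 1] +
                             (read[i - 1] == ref[j - 1] ? match : mismatch);
            int cell = std::max({diag, h[i - 1][j] + gap, h[i][j - 1] + gap});
            // Local alignment may restart anywhere, so scores never go negative.
            if (mode == Mode::SmithWaterman) cell = std::max(cell, 0);
            h[i][j] = cell;
            best = std::max(best, cell);
        }
    }

    if (mode == Mode::SmithWaterman) return best;    // best cell anywhere
    if (mode == Mode::GlobalGlobal) return h[n][m];  // both sequences end to end
    // Global-Local: whole read consumed, trailing reference bases are free.
    return *std::max_element(h[n].begin(), h[n].end());
}
```

Aligning the read `ACGT` to `TTACGTTT` scores 4 in Global-Local mode, since the flanking reference bases cost nothing, but 0 in Global-Global mode, where the four unmatched reference bases are each charged as a gap.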

Page 21: Kitzmiller Openhelisphereproject Bosc2008

QC analysis

• errorTool uses sample alignments to reference to calculate error rates
a) Uses bitHPDP core
b) Breaks down error rates on a number of dimensions (by nucleotide, by substitution type, by reference position, by image (X,Y), by incorporation cycle, etc.)
c) Error rates of < 1% are seen with Two Pass Sequencing; single pass is 7% or less

• lengthTool calculates length distribution and term+loss stats
a) Can provide length as aligned
b) Termination and loss indicate strands that stop incorporating bases
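Error rates of this kind can be counted column by column from a gapped pairwise alignment. The following is a minimal sketch, not errorTool: the names are hypothetical, and both the convention (a gap in the read is a deletion, a gap in the reference is an insertion) and the denominator (aligned reference bases) are assumptions.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

struct ErrorCounts {
    int substitutions = 0;
    int deletions = 0;        // reference base with no read base
    int insertions = 0;       // read base with no reference base
    int reference_bases = 0;  // aligned reference positions

    // All errors divided by the number of aligned reference bases.
    double total_rate() const {
        if (reference_bases == 0) return 0.0;
        return static_cast<double>(substitutions + deletions + insertions) /
               static_cast<double>(reference_bases);
    }
};

// Both rows must have equal length; '-' marks a gap.
ErrorCounts count_errors(const std::string& ref_row,
                         const std::string& read_row) {
    if (ref_row.size() != read_row.size()) {
        throw std::invalid_argument("alignment rows differ in length");
    }
    ErrorCounts counts;
    for (std::size_t i = 0; i < ref_row.size(); ++i) {
        const char r = ref_row[i];
        const char q = read_row[i];
        if (r == '-' && q == '-') continue;  // empty column, nothing to count
        if (r == '-') {
            ++counts.insertions;
            continue;
        }
        ++counts.reference_bases;
        if (q == '-') {
            ++counts.deletions;
        } else if (q != r) {
            ++counts.substitutions;
        }
    }
    return counts;
}
```

Under this convention the total rate is the sum of the three component rates, which matches the yeast DGE figures reported in this deck (0.44% + 4.72% + 1.39% = 6.55%).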

Page 22: Kitzmiller Openhelisphereproject Bosc2008

Length distributions (yeast DGE experiment)

Raw: Unfiltered reads, 6mer and above

Filtered: Quality score filter, AT < 0.9, BAO dinuc < 0.7, trim leading Ts, length >= 20, alignment against BAO, P102

Aligned: Normalized score >= 4

[Figure: number of strands (0 to 500,000) versus read length (0 to 70) for the Raw, Filtered, and Aligned read sets]

Page 23: Kitzmiller Openhelisphereproject Bosc2008

Error rates and alignments (yeast DGE experiment)

Error-rates were assessed using samples of alignments with normalized alignment score ≥4 to a high-expresser (YLR110C/CCW12)

Total    Sub      Del      Ins
6.55%    0.44%    4.72%    1.39%

Reference  GACGT-TATGGGTGATGGTAGTAACGATGATGACGAAGA-TAATGTAGACCCGCTGC-ACCGTGCTAAACAATCC
Consensus  GACGT-TATGAGTGATGGTAGTAACGATGATGACGAAGA-TAATGTAGACCCGCTGC-ATCGTGCTAAACAATCC

Reads
TGATGGTAGTAACGATGATGACGAAGA-TAA
CCCGCTG--ATCGTGCTAAACA-TC
GACGT-TATGAGTGATGGTAGTAACGATGATGA-GAAGA
GC-ATCGTGCTAAACA-TCC
A-GTATATGAGTGATGGTAGTAACGATGATGACGAAGAATA
ATCGTGCTAAACAATCC
GACGT-TATGAGTGATGGTAGTAACGATGATGACGA
AATGTAGACCCGCTGC-ATCGTGCTAAACAATCC
ACGT-TATGAGTGATG-TAGTAACGATGATGACGAAGA-TAA
GACGT-TATGAGT
ACGAAGA-TAATGTAGACCCGCTGCTATCGT-CTA
GACGT-TATGAGTGATG-TA
GA-TAATGTAGACCTGC-GC-ATCGTGCTAAACAA
GACGT-TATGAGTGATG
GA-TAAT-TAGACCCGCTG--ATCGTG-TAA-CAA
GACGT-TATGAGTGATGGTAGTAACGATGATGACG

Page 24: Kitzmiller Openhelisphereproject Bosc2008

Acknowledgments

• Ed Thayer
• Eldar Giladi
• John Healy
• Doron Lipson
• Keith Moulton
• Steve Roels

Original research shouldn’t start with copies

Page 25: Kitzmiller Openhelisphereproject Bosc2008

Hybrid development model

Source code repository

Company firewall

Read-only source code subset

User-owned packages

Secure sync

Page 26: Kitzmiller Openhelisphereproject Bosc2008

Typical closed source development

Source code repository

Company firewall

Page 27: Kitzmiller Openhelisphereproject Bosc2008

Typical open source project

Source code repository

Direct commit

Checkout

Submit patch via email

Page 28: Kitzmiller Openhelisphereproject Bosc2008

HeliScope and HeliSphere