Kitzmiller Openhelisphereproject Bosc2008
The Open HeliSphere™ project
True open source from the inventors of True Single Molecule Sequencing (tSMS™).
Aaron Kitzmiller
BOSC 2008
Agenda
• Introduction to the HeliScope Single Molecule Sequencer
• Helicos and Open Source
• The Open HeliSphere project
• HeliSphere code
Single Molecule Sequencing by Synthesis
[Figure sequence: template strands on the flow cell surface, one panel per step. Panel labels: Hybridize (Primer 1 ~1/µm²); Extend 'G'; Wash; Image; Cleave.]
Flow Cell Imaging
• 1 run => 2 flow cells
• 1 flow cell => 25 channels
• 1 channel => 1000 fields of view (FOV)
• 1 FOV => 4 images
• 8-10 million usable strands / channel
Flow Cell
25 Channels (1.6 x 90 mm)
~12 x 12 cm
Flow cell volume = 180 µL
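The hierarchy above multiplies out to per-run totals. The sketch below does that arithmetic; the struct and function names are illustrative and not part of HeliSphere. The slide does not say how often each field of view is imaged during a run, so the image count is per single scan of all fields of view.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Counts per level of the imaging hierarchy, as listed on the slide.
struct RunGeometry {
    std::uint64_t flow_cells_per_run = 2;
    std::uint64_t channels_per_flow_cell = 25;
    std::uint64_t fovs_per_channel = 1000;
    std::uint64_t images_per_fov = 4;
};

// Channels imaged in one run (both flow cells).
std::uint64_t channels_per_run(const RunGeometry& g) {
    return g.flow_cells_per_run * g.channels_per_flow_cell;
}

// Images acquired in one scan over every field of view in a run.
std::uint64_t images_per_scan(const RunGeometry& g) {
    return channels_per_run(g) * g.fovs_per_channel * g.images_per_fov;
}

// Usable strands per run for a (low, high) yield per channel.
std::pair<std::uint64_t, std::uint64_t>
strands_per_run(const RunGeometry& g, std::uint64_t low, std::uint64_t high) {
    assert(low <= high);
    return {channels_per_run(g) * low, channels_per_run(g) * high};
}
```

With the slide's figures this gives 50 channels, 200,000 images per scan, and 400 to 500 million usable strands per run.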
Raw data collection
- C - A G C T - - C T - G - T A - C T - G - - A G - - A -
- - - A - C - A G C - - G - - - G - T - G - - - - - - - G
X C T A G C T A G C T A G C T A G C T A G C T A G C T A G
- C - A - C T - - C - - G C - A - - T - - C - A - - T - G
- - - A G - - A - - T - - C - A - - T - - - - A - C T - -
- - - - G - T A - - T - G - - - - - T A - - T A G - - - -
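In the raw data rows above, each strand has one entry per base flow: a letter where it incorporated the flowed base and "-" where it did not, against the repeating C, T, A, G addition order. A minimal sketch of turning one such row into a read follows. The function name, the space-free row encoding, and treating the leading "X" column as a non-flow are assumptions for illustration, not the HeliSphere base caller.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Collapse one row of per-flow calls into a read.
// `row` has one character per column: a base where the strand
// incorporated, '-' where it did not (spaces removed).
// `flow_order` is the repeating base addition order.
// `first_flow` is the column of the first real flow; the slide's grid
// has a leading 'X' column, so the default is 1.
// Returns false if any call disagrees with the base flowed in its column.
bool collapse_flow_row(const std::string& row, std::string& read,
                       const std::string& flow_order = "CTAG",
                       std::size_t first_flow = 1) {
    assert(!flow_order.empty());
    read.clear();
    for (std::size_t col = 0; col < row.size(); ++col) {
        const char call = row[col];
        if (call == '-') continue;          // no incorporation in this flow
        if (col < first_flow) return false; // a call before the first flow
        const char flowed = flow_order[(col - first_flow) % flow_order.size()];
        if (call != flowed) return false;   // call inconsistent with flow order
        read.push_back(call);
    }
    return true;
}
```

Applied to the first row of the slide, written without spaces as `-C-AGCT--CT-G-TA-CT-G--AG--A-`, this yields the read `CAGCTCTGTACTGAGA`.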
HeliScope and HeliSphere
Helicos and Open Source
• Helicos is an instrument company
• The diversity of bioinformatics applications is too large for us to address internally
• Open source bioinformatics applications benefit everyone, including instrument developers
• Helicos bioinformatics applications
  a) Internal development
  b) Academic and industrial partnerships
  c) Tool vendor partnerships
  d) Open Source
     Including contributions by Helicos to other projects (BioPerl, Bioconductor, etc.)
The Open HeliSphere project
• Pre-launch – TODAY
  a) SVN trunk checkout
  b) Tarball download
  c) openhelisphere-announce, openhelisphere-devel
  d) Wiki documentation
  e) Datasets
• Full product launch
  a) Patch submission to HeliSphere core
  b) Bug tracking
  c) HeliSphere contrib repository
http://open.helicosbio.com
The Open HeliSphere project
• License
  a) Dual GPL + commercial
• Infrastructure
  a) MediaWiki-driven website (semantically enhanced)
  b) SourceForge mail, tarballs, Subversion source code control
Bioinformatics Pipeline for Digital Gene Expression
SRF file processing
• HeliScope sequencers create SRF files
  a) Consortium-driven binary read container
     • Strong Sanger involvement
     • Used for submission to the NCBI Short Read Archive
  b) Reads are stored in ZTR blocks
  c) Instrument and run information is stored in an XML document
• SRF processing converts reads into a smaller, Helicos-oriented format called SMS
  a) Perl scripts run the srf2sms binary
  b) SMS places reads into blocks that are indexed by key Helicos data fields (flowcell, channel, position)
  c) Extracted instrument and run XML are used for pipeline configuration
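To make the idea of blocks indexed by flowcell, channel, and position concrete, here is a small in-memory sketch. It is not the SMS on-disk format, which the slides do not describe; `ReadBlockIndex`, `BlockKey`, and the non-negative position assumption are all hypothetical. It only shows why keying on those three fields makes a per-channel query a cheap range scan.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical stand-in for an SMS block key: (flowcell, channel, position).
using BlockKey = std::tuple<int, int, int>;

class ReadBlockIndex {
public:
    // File a read under its flowcell/channel/position block.
    void add(int flowcell, int channel, int position, std::string read) {
        assert(position >= 0);  // assumption: positions are non-negative
        blocks_[BlockKey(flowcell, channel, position)].push_back(std::move(read));
    }

    // All reads for one flowcell/channel, ordered by position.
    // Tuples sort lexicographically, so the channel's blocks are contiguous.
    std::vector<std::string> select_channel(int flowcell, int channel) const {
        std::vector<std::string> out;
        auto it = blocks_.lower_bound(BlockKey(flowcell, channel, 0));
        for (; it != blocks_.end() &&
               std::get<0>(it->first) == flowcell &&
               std::get<1>(it->first) == channel; ++it) {
            out.insert(out.end(), it->second.begin(), it->second.end());
        }
        return out;
    }

private:
    std::map<BlockKey, std::vector<std::string>> blocks_;
};
```

Reads added in any order come back grouped by channel and sorted by position.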
SMS file
• SMS is a general binary data container
  a) Manipulate with executables: smsls, sms2txt, srf2sms, filterSMS, extractSMS, mergeSMS
  b) Access data directly via C++ iterator API

    read_iterator<read_record> rit(smsfile);
    read_record read;
    // query the SMS file for desired flowcell/channel
    rit.select_channel(flowcell, channel);
    // iterate over result set; default output format to ostream is FASTA
    while (!rit.end()) {
        read = *rit;
        outf << read;
        rit++;
    }
    outf.close();
Pipeline configuration
• Pipeline is a combination of Perl modules and scripts driven by XML configuration
  a) Analysis combines a Protocol and parameters with a Reference Set
  b) Reference Set is a pointer to one or more FASTA files
  c) Protocol is a pointer to one or more executables and parameters

  a) Instrument and Run XML are extracted from the SRF file
  b) analysis_controller converts XML documents into an MLDBM database
DGE analysis
• DGE pipeline features common processing steps
  a) Counting of aligned transcripts
  b) extractSMS, filterSMS remove poor quality sequences
     • Base addition order (CTAG) sequences
     • Quality score
     • Read length
     • Normalized alignment score
  c) IndexDP alignment
     • Helicos-developed aligner
     • Mismatch-tolerant seeded alignment with multiple alignment modes
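As a rough illustration of the read-filtering step, the sketch below applies three of the criteria quoted on the yeast DGE length-distribution slide: trim leading Ts, require length >= 20, and require AT < 0.9. Reading "AT" as the fraction of A and T bases in the trimmed read is an assumption, and the function names are illustrative rather than the filterSMS interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Remove any run of T at the start of the read.
std::string trim_leading_ts(const std::string& read) {
    const std::size_t first = read.find_first_not_of('T');
    return first == std::string::npos ? std::string() : read.substr(first);
}

// Fraction of bases that are A or T (assumed meaning of "AT").
double at_fraction(const std::string& read) {
    assert(!read.empty());
    const auto at = std::count_if(read.begin(), read.end(),
                                  [](char c) { return c == 'A' || c == 'T'; });
    return static_cast<double>(at) / static_cast<double>(read.size());
}

// Thresholds as quoted on the yeast DGE slide.
struct FilterParams {
    std::size_t min_length = 20;
    double max_at_fraction = 0.9;
};

// True if the read survives trimming, the length cut, and the AT cut.
bool passes_filter(const std::string& raw,
                   const FilterParams& p = FilterParams()) {
    const std::string read = trim_leading_ts(raw);
    if (read.empty() || read.size() < p.min_length) return false;
    return at_fraction(read) < p.max_at_fraction;
}
```

The quality score and base addition order filters are left out because the slides do not define how they are computed.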
IndexDP
ACGTACGTACCCGTA
AAGACGTACATACCCGTATTTACTTTACGT
ACGTACATACCCGTA
AAGACGTACATACCCGTATTTACTTTACGT
10mer word
Template length 15, weight 10 w/sub
• On-the-fly indexes are constructed using template families
  a) Families are arrangements of positions that accommodate a given template length, weight, and mismatch number (e.g. 20:16:2)
• BLAST, et al. match on contiguous words and then extend to support fast, gapped alignments
• IndexDP uses templates to accommodate mismatches in the words
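The difference between contiguous words and templates can be shown with the slide's own sequences. In the sketch below a template is a mask of "care" (1) and "don't care" (0) positions. The specific mask is a made-up example of length 15 and weight 10, not one taken from IndexDP, and the linear scan stands in for the index a real aligner would build.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Characters of `s` at the '1' positions of `mask`, starting at `offset`.
std::string sample_word(const std::string& s, std::size_t offset,
                        const std::string& mask) {
    assert(offset + mask.size() <= s.size());
    std::string word;
    for (std::size_t i = 0; i < mask.size(); ++i)
        if (mask[i] == '1') word.push_back(s[offset + i]);
    return word;
}

// Reference offsets whose sampled word equals the word sampled from the
// start of the read. A mask of all '1's reduces to contiguous word matching.
std::vector<std::size_t> template_hits(const std::string& read,
                                       const std::string& reference,
                                       const std::string& mask) {
    std::vector<std::size_t> hits;
    if (mask.size() > read.size() || mask.size() > reference.size()) return hits;
    const std::string key = sample_word(read, 0, mask);
    for (std::size_t off = 0; off + mask.size() <= reference.size(); ++off)
        if (sample_word(reference, off, mask) == key) hits.push_back(off);
    return hits;
}
```

For the read `ACGTACGTACCCGTA` and reference `AAGACGTACATACCCGTATTTACTTTACGT`, the hypothetical mask `111110010110011` reports a hit at reference offset 3, because the single substitution falls on a "don't care" position. A contiguous 10-mer taken from the start of the read finds nothing.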
IndexDP
• After template matching, the bitHPDP core performs a dynamic programming algorithm. Supported alignment flavors:
  a) Smith-Waterman
  b) Global-Local. Full length of the read against a region of the reference. End gaps against the reference have zero penalty
  c) Local-Local. End gaps have no penalty
  d) Global-Global. Needleman-Wunsch
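The flavors differ mainly in which end gaps are charged. The sketch below contrasts two of them, Global-Global and Global-Local, using a textbook dynamic programming recurrence with a linear gap penalty. It is not the bitHPDP implementation, and the scoring values are illustrative assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class Mode { GlobalGlobal, GlobalLocal };

// Illustrative linear scoring; not the values used by bitHPDP.
struct Scoring {
    int match = 1;
    int mismatch = -1;
    int gap = -1;
};

// Best alignment score of `read` against `ref` in the given mode.
int align_score(const std::string& read, const std::string& ref, Mode mode,
                const Scoring& s = Scoring()) {
    assert(s.gap <= 0);
    const std::size_t n = read.size();
    const std::size_t m = ref.size();
    std::vector<std::vector<int>> dp(n + 1, std::vector<int>(m + 1, 0));

    // Read bases left unaligned at the start are always charged.
    for (std::size_t i = 1; i <= n; ++i) dp[i][0] = static_cast<int>(i) * s.gap;
    // Leading reference bases are charged only in global-global mode.
    if (mode == Mode::GlobalGlobal)
        for (std::size_t j = 1; j <= m; ++j) dp[0][j] = static_cast<int>(j) * s.gap;

    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            const int diag = dp[i - 1][j - 1] +
                             (read[i - 1] == ref[j - 1] ? s.match : s.mismatch);
            const int up = dp[i - 1][j] + s.gap;    // read base opposite a gap
            const int left = dp[i][j - 1] + s.gap;  // reference base opposite a gap
            dp[i][j] = std::max({diag, up, left});
        }
    }
    // Global-global must end in the corner; global-local may stop anywhere
    // along the reference once the whole read is consumed.
    if (mode == Mode::GlobalGlobal) return dp[n][m];
    return *std::max_element(dp[n].begin(), dp[n].end());
}
```

Aligning `ACGT` against `TTACGTTT` scores 4 in Global-Local mode, since the flanking reference bases are free, and 0 in Global-Global mode, where the four unmatched reference bases each cost a gap.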
QC analysis
• errorTool uses a sample of alignments against a reference to calculate error rates
  a) Uses bitHPDP core
  b) Breaks down error rates on a number of dimensions (by nucleotide, by substitution type, by reference position, by image (X,Y), by incorporation cycle, etc.)
  c) Error rates of < 1% are seen with Two Pass Sequencing; single pass is 7% or less
• lengthTool calculates length distribution and termination and loss statistics
  a) Can provide length as aligned
  b) Termination and loss indicate strands that stop incorporating bases
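A minimal sketch of the kind of tally errorTool performs, assuming each alignment is given as two equal-length gapped strings (reference row and read row, with "-" for a gap). The names are illustrative. Using alignment columns as the denominator is an assumption, since the slides do not say what the rates are normalized by.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct ErrorCounts {
    std::size_t columns = 0;        // aligned columns with at least one base
    std::size_t substitutions = 0;  // both rows have a base and they differ
    std::size_t deletions = 0;      // base in reference, gap in read
    std::size_t insertions = 0;     // gap in reference, base in read

    std::size_t errors() const { return substitutions + deletions + insertions; }

    // Rate of `n` events per aligned column (assumed denominator).
    double rate(std::size_t n) const {
        return columns == 0 ? 0.0
                            : static_cast<double>(n) / static_cast<double>(columns);
    }
};

// Tally error types for one gapped reference/read pair.
ErrorCounts count_errors(const std::string& ref_aln, const std::string& read_aln) {
    assert(ref_aln.size() == read_aln.size());
    ErrorCounts c;
    for (std::size_t i = 0; i < ref_aln.size(); ++i) {
        const char r = ref_aln[i];
        const char q = read_aln[i];
        if (r == '-' && q == '-') continue;  // padding column from a pileup
        ++c.columns;
        if (r == '-') ++c.insertions;
        else if (q == '-') ++c.deletions;
        else if (r != q) ++c.substitutions;
    }
    return c;
}
```

The total is the sum of the three error types, which matches how the yeast DGE figures add up (0.44% + 4.72% + 1.39% = 6.55%).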
Length distributions (yeast DGE experiment)
Raw: Unfiltered reads, 6mer and above
Filtered: Quality score filter, AT < 0.9, BAO dinuc < 0.7, trim leading Ts, length >= 20, alignment against BAO, P102
Aligned: Normalized score >= 4
[Figure: number of strands (0 to 500,000) versus length (0 to 70) for Raw, Filtered, and Aligned reads.]
Error rates and alignments (yeast DGE experiment)
Error-rates were assessed using samples of alignments with normalized alignment score ≥4 to a high-expresser (YLR110C/CCW12)
Total   Sub     Del     Ins
6.55%   0.44%   4.72%   1.39%
GACGT-TATGGGTGATGGTAGTAACGATGATGACGAAGA-TAATGTAGACCCGCTGC-ACCGTGCTAAACAATCC  Reference
GACGT-TATGAGTGATGGTAGTAACGATGATGACGAAGA-TAATGTAGACCCGCTGC-ATCGTGCTAAACAATCC  Consensus
---------------------------------------------------------------------------
            TGATGGTAGTAACGATGATGACGAAGA-TAA      CCCGCTG--ATCGTGCTAAACA-TC   Reads
GACGT-TATGAGTGATGGTAGTAACGATGATGA-GAAGA                GC-ATCGTGCTAAACA-TCC
 A-GTATATGAGTGATGGTAGTAACGATGATGACGAAGAATA                ATCGTGCTAAACAATCC
GACGT-TATGAGTGATGGTAGTAACGATGATGACGA     AATGTAGACCCGCTGC-ATCGTGCTAAACAATCC
 ACGT-TATGAGTGATG-TAGTAACGATGATGACGAAGA-TAA
GACGT-TATGAGT                   ACGAAGA-TAATGTAGACCCGCTGCTATCGT-CTA
GACGT-TATGAGTGATG-TA                 GA-TAATGTAGACCTGC-GC-ATCGTGCTAAACAA
GACGT-TATGAGTGATG                    GA-TAAT-TAGACCCGCTG--ATCGTG-TAA-CAA
GACGT-TATGAGTGATGGTAGTAACGATGATGACG
Acknowledgments
• Ed Thayer
• Eldar Giladi
• John Healy
• Doron Lipson
• Keith Moulton
• Steve Roels
Original research shouldn’t start with copies
Hybrid development model
Source code repository
Company firewall
Read-only source code subset
User-owned packages
Secure sync
Typical closed source development
Source code repository
Company firewall
Typical open source project
Source code repository
Direct commit
Checkout
Submit patch via email
HeliScope and HeliSphere