Data analysis methods for next-generation sequencing technologies
![Page 1: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/1.jpg)
Data analysis methods for next-generation sequencing technologies
Gabor T. Marth, Boston College Biology Department
Epigenomics & Sequencing Meeting, July 14-15, 2008, Boston, MA
![Page 2: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/2.jpg)
T1. Roche / 454 FLX system
• pyrosequencing technology
• variable read-length
• the only new technology with >100bp reads
• tested in many published applications
• supports paired-end read protocols with up to 10kb separation size
![Page 3: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/3.jpg)
T2. Illumina / Solexa Genome Analyzer
• fixed-length short-read sequencer
• read properties are very close to those of traditional capillary sequences
• very low INDEL error rate
• tested in many published applications
• paired-end read protocols support short (<600bp) separation
![Page 4: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/4.jpg)
T3. AB / SOLiD system
2-base encoding (rows: 1st base; columns: 2nd base)
      A  C  G  T
A     0  1  2  3
C     1  0  3  2
G     2  3  0  1
T     3  2  1  0
• fixed-length short-read sequencer
• employs a 2-base encoding system that can be used for error reduction and improving SNP calling accuracy
• requires color-space informatics
• published applications underway / in review
• paired-end read protocols support up to 10kb separation size
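The 2-base encoding can be illustrated with a short sketch. It assumes the commonly described SOLiD color code, in which each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3); the function names are hypothetical and this is not vendor software.

```python
# Illustrative sketch of 2-base (color-space) encoding; names are hypothetical.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode_colors(seq):
    """Return one color (0-3) for each pair of adjacent bases in seq."""
    return [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]


def decode_colors(first_base, colors):
    """Rebuild the base sequence from a known first base and the colors."""
    bases = [first_base]
    for color in colors:
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ color])
    return "".join(bases)


reference = "ACGGTCA"
variant = "ACGATCA"  # single-base substitution at position 3

ref_colors = encode_colors(reference)
var_colors = encode_colors(variant)
changed = [i for i, (r, v) in enumerate(zip(ref_colors, var_colors)) if r != v]

print(ref_colors)
print(var_colors)
print("colors changed by the substitution:", changed)
print(decode_colors("A", ref_colors))
```

An isolated substitution changes two adjacent colors, whereas a single miscalled color does not correspond to any one-base change. That asymmetry is what makes the encoding useful for separating sequencing errors from SNPs.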
![Page 5: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/5.jpg)
T4. Helicos / Heliscope system
• experimental short-read sequencer system
• single molecule sequencing
• no amplification
• variable read-length
• error rate reduced with 2-pass template sequencing
![Page 6: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/6.jpg)
A1. Variation discovery: SNPs and short-INDELs
1. sequence alignment
2. dealing with non-unique mapping
3. looking for allelic differences
![Page 7: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/7.jpg)
A2. Structural variation detection
• structural variations (deletions, insertions, inversions and translocations) from paired-end read map locations
• copy number (for amplifications, deletions) from depth of read coverage
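Both signals can be sketched in a few lines. The example below is a minimal illustration, assuming a library in which concordant pairs map to opposite strands of the same chromosome within an expected separation range; the thresholds, labels, and function names are hypothetical.

```python
from statistics import median


def classify_pair(chrom1, pos1, strand1, chrom2, pos2, strand2,
                  min_len=100, max_len=600):
    """Label a read pair by how its map locations deviate from expectation."""
    if chrom1 != chrom2:
        return "translocation candidate"
    if strand1 == strand2:
        return "inversion candidate"
    separation = abs(pos2 - pos1)
    if separation > max_len:
        # Ends map farther apart on the reference than the fragment is long.
        return "deletion candidate"
    if separation < min_len:
        # Ends map closer together than expected: extra sequence in the sample.
        return "insertion candidate"
    return "concordant"


def copy_number_ratios(window_depths):
    """Read depth per window divided by the median depth across windows."""
    baseline = median(window_depths)
    if baseline == 0:
        raise ValueError("median depth is zero; cannot normalize")
    return [depth / baseline for depth in window_depths]


print(classify_pair("chr1", 1000, "+", "chr1", 1300, "-"))
print(classify_pair("chr1", 1000, "+", "chr1", 9000, "-"))
print(copy_number_ratios([10, 10, 20, 10, 5]))
```

A ratio near 2 suggests an amplification and a ratio near 0.5 a heterozygous deletion, although real data would need correction for the representational biases discussed later in the talk.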
![Page 8: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/8.jpg)
A3. Identification of protein-bound DNA
Figure labels: genome sequence, aligned reads
Chromatin structure (CHIP-SEQ): Mikkelsen et al. Nature 2007
Transcription factor binding sites: Robertson et al. Nature Methods 2007
![Page 9: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/9.jpg)
A4. Novel transcript discovery (genes)
Mortazavi et al. Nature Methods
![Page 10: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/10.jpg)
A5. Novel transcript discovery (miRNAs)
Ruby et al. Cell, 2006
![Page 11: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/11.jpg)
A6. Expression profiling by tag counting
Figure labels: aligned reads over two genes
Jones-Rhoades et al. PLoS Genetics, 2007
![Page 12: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/12.jpg)
A7. De novo organismal genome sequencing
Figure labels: assembled sequence contigs; short reads; longer reads; read pairs
Lander et al. Nature 2001
![Page 13: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/13.jpg)
C1. Read length
Figure: read length [bp] (axis 0 to 400); ranges shown: ~200-450 (var), 25-40 (fixed), 25-35 (fixed), 20-35 (var)
![Page 14: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/14.jpg)
When does read length matter?
• short reads often sufficient where the entire read length can be used for mapping:
SNPs, short-INDELs, SVs
CHIP-SEQ
short RNA discovery
counting (mRNA, miRNA)
• longer reads are needed where one must use parts of reads for mapping:
de novo sequencing
novel transcript discovery
aacttagacttacagacttacatacgta
Known exon 1 Known exon 2
accgattactatacta
![Page 15: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/15.jpg)
C2. Read error rate
• error rate dictates the stringency of the read mapper
• error rate typically 0.4 - 1%
• the more errors the aligner must tolerate, the lower the fraction of the reads that can be uniquely aligned
Figure: fraction of genome (y-axis, 0.00 to 0.40) vs. number of mismatches allowed (x-axis, 0 to 2)
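The link between error rate and mapper stringency can be made concrete with a small calculation. The sketch below assumes independent errors at a uniform per-base rate, which is a simplification (the error rate actually rises along the read); the function name and the 35 bp read length are illustrative choices.

```python
from math import comb


def prob_at_most_k_errors(read_length, error_rate, k):
    """Binomial probability that a read contains k or fewer sequencing errors."""
    return sum(
        comb(read_length, i)
        * error_rate ** i
        * (1 - error_rate) ** (read_length - i)
        for i in range(k + 1)
    )


# Fraction of 35 bp reads recoverable when the mapper allows 0, 1, or 2 mismatches.
for allowed in range(3):
    for rate in (0.004, 0.01):
        p = prob_at_most_k_errors(35, rate, allowed)
        print(f"mismatches allowed={allowed} error rate={rate}: {p:.4f}")
```

Allowing more mismatches recovers more of the error-containing reads, at the cost of making fewer genomic positions uniquely mappable.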
![Page 16: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/16.jpg)
Figure: error rate (y-axis, 0.00% to 10.00%) vs. position on read (x-axis, 0 to 40)
Error rate grows with each cycle
• this phenomenon limits useful read length
![Page 17: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/17.jpg)
Substitutions vs. INDEL errors
![Page 18: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/18.jpg)
C3. Representational biases / library complexity
fragmentation biases
amplification biases
sequencing biases
Figure labels: PCR; sequencing; low/no representation; high representation
![Page 19: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/19.jpg)
Dispersal of read coverage
• this affects variation discovery (deeper starting read coverage is needed)
• it should have a major impact on counting applications
![Page 20: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/20.jpg)
Amplification errors
many reads from clonal copies of a single fragment
• early PCR errors in “clonal” read copies lead to false positive allele calls
early amplification error gets propagated onto every clonal copy
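One common way to limit the effect of clonal copies is to keep a single read per map position and strand before calling alleles. The sketch below assumes that reads sharing chromosome, start coordinate, and strand are clonal copies; this heuristic and the function name are illustrative and are not taken from the talk.

```python
def collapse_clonal_reads(reads):
    """Keep the first read seen for each (chromosome, start, strand) key.

    reads: iterable of (chrom, start, strand, sequence) tuples.
    """
    seen = set()
    kept = []
    for chrom, start, strand, sequence in reads:
        key = (chrom, start, strand)
        if key not in seen:
            seen.add(key)
            kept.append((chrom, start, strand, sequence))
    return kept


reads = [
    ("chr1", 100, "+", "AACGTTCGCATA"),
    ("chr1", 100, "+", "AACGTTCGCATA"),  # clonal copy carrying the same early error
    ("chr1", 100, "+", "AACGTTCGCATA"),  # clonal copy
    ("chr1", 104, "+", "TTAGCATAGGCC"),
    ("chr1", 100, "-", "TATGCTAACGTT"),
]
print(len(collapse_clonal_reads(reads)))
```

After collapsing, an early amplification error counts as one observation rather than many, so it is less likely to be mistaken for a real allele.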
![Page 21: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/21.jpg)
C4. Paired-end reads
• fragment amplification: fragment length 100 - 600 bp
• fragment length limited by amplification efficiency
• circularization: 500bp - 10kb (sweet spot ~3kb)
• fragment length limited by library complexity
Korbel et al. Science 2007
• paired-end reads can improve read mapping accuracy (if unique map positions are required for both ends) or efficiency (if a fragment length constraint is used to rescue non-uniquely mapping ends)
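The rescue idea can be sketched as follows, assuming one end maps uniquely and the other has several candidate positions on the same chromosome. The function name and the 100-600 bp window are hypothetical illustrations.

```python
def rescue_mate(unique_pos, candidate_positions, min_len=100, max_len=600):
    """Pick the candidate position consistent with the expected fragment length.

    Returns the position if exactly one candidate fits, otherwise None.
    """
    compatible = [
        pos for pos in candidate_positions
        if min_len <= abs(pos - unique_pos) <= max_len
    ]
    return compatible[0] if len(compatible) == 1 else None


print(rescue_mate(1000, [1300, 50000, 900000]))  # one compatible location
print(rescue_mate(1000, [1300, 1400]))           # still ambiguous
```

Requiring exactly one compatible candidate keeps the rescue conservative: if two repeat copies both sit within the fragment-length window, the end stays unplaced.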
![Page 22: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/22.jpg)
Technologies / properties / applications
| Read properties | Roche/454 | Illumina/Solexa | AB/SOLiD |
| --- | --- | --- | --- |
| Read length | 200-450bp | 20-50bp | 25-50bp |
| Error rate | <0.5% | <1.0% | <0.5% |
| Dominant error type | INDEL | SUB | SUB |
| Quality values available | yes | yes | not really |
| Paired-end separation | < 10kb (3kb optimal) | 100 - 600bp | 500bp - 10kb (3kb optimal) |

Applications
SNP discovery ● ● ○
short-INDEL discovery ● ○
SV discovery ○ ○ ●
CHIP-SEQ ○ ● ●
small RNA/gene discovery ○ ● ●
mRNA transcript discovery ● ○ ○
Expression profiling ○ ● ●
De novo sequencing ● ? ?
![Page 23: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/23.jpg)
Resequencing-based SNP discovery
(i) base calling
(ii) micro-repeat analysis
(iii) read mapping (pair-wise alignment to genome reference)
(iv) read assembly
(v) SNP calling
(vi) SNP validation
(vii) data viewing, hypothesis generation
Figure labels: REF, IND, IND
![Page 24: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/24.jpg)
The “toolbox”
• base callers
• microrepeat finders
• read mappers
• SNP callers
• structural variation callers
• assembly viewers
![Page 25: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/25.jpg)
Reference guided read mapping
Reference-sequence guided mapping:
…you get the pieces…
…AND they give you the cover on the box
Some pieces are more unique than others
![Page 26: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/26.jpg)
MOSAIK: an anchored aligner / assembler
Step 1. initial short-hash scan for possible read locations
Step 2. evaluation of candidate locations with SW method
Michael Stromberg
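The two steps can be illustrated with a toy seed-and-evaluate mapper. This is a generic sketch of hash seeding followed by Smith-Waterman scoring, not MOSAIK's implementation; the k-mer size, scoring values, padding, and all function names are assumptions.

```python
from collections import defaultdict


def build_hash_index(reference, k):
    """Map every k-mer in the reference to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def candidate_starts(read, index, k):
    """Step 1: each k-mer hit proposes the reference offset where the read starts."""
    starts = set()
    for j in range(len(read) - k + 1):
        for pos in index.get(read[j:j + k], []):
            starts.add(max(pos - j, 0))
    return sorted(starts)


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Step 2: best local alignment score of a against b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def map_read(read, reference, index, k, pad=2):
    """Score every candidate location and return (best score, start) or None."""
    scored = []
    for start in candidate_starts(read, index, k):
        lo = max(start - pad, 0)
        window = reference[lo:start + len(read) + pad]
        scored.append((smith_waterman_score(read, window), start))
    return max(scored) if scored else None


reference = "GGCTTGACCA" + "AACGTTAGCATA" + "GGTCCATTGA"
read = "AACGTTCGCATA"  # one mismatch relative to the reference
index = build_hash_index(reference, 4)
print(map_read(read, reference, index, 4))
```

The hash scan is cheap and narrows the search to a few offsets, so the more expensive dynamic-programming step only runs on short reference windows.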
![Page 27: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/27.jpg)
Non-unique mapping, gapped alignments
1. Non-unique read mapping: optionally either only report uniquely mapped reads or report all map locations for each read (mapping quality values for all mapped reads are being implemented)
2. Gapped alignments: allow for mapping reads with insertion or deletion sequencing errors, and reads with bona fide INDEL alleles
![Page 28: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/28.jpg)
Read types aligned, paired-end read strategy
3. Aligns and co-assembles customary read types:
ABI/capillary
Illumina/Solexa
AB/SOLiD
Roche/454
Helicos/Heliscope
Figure labels: ABI/capillary, 454 FLX, 454 GS20, Illumina
4. Paired-end read alignments
![Page 29: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/29.jpg)
Other mainstream read mappers
• ELAND (Tony Cox, Illumina) -- the “official” read mapper supplied by Illumina, fast
• MAQ (Li Heng + Richard Durbin, Sanger) -- the most widely used read mapper, low RAM footprint
• SOAP (Beijing Genomics Institute) -- a new mapper developed for human next-gen reads
• SHRIMP (Michael Brudno, University of Toronto) -- full Smith-Waterman
![Page 30: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/30.jpg)
Speed
![Page 31: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/31.jpg)
Polymorphism / mutation detection
sequencing error
polymorphism
![Page 32: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/32.jpg)
Determining genotype directly from sequence
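Reads grouped by individual, each group followed by the genotype at the variable site:
AACGTTAGCATA
AACGTTAGCATA
AACGTTCGCATA
AACGTTCGCATA
A/C
AACGTTCGCATA
AACGTTCGCATA
AACGTTCGCATA
AACGTTCGCATA
C/C
AACGTTAGCATA
AACGTTAGCATA
A/A
Figure labels: individual 1, individual 2, individual 3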
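A naive version of this genotype determination can be written as an allele count at one aligned position. The sketch below uses the reads from the slide; the 20% minimum allele fraction and the function name are illustrative assumptions, and a realistic caller would also weigh base quality values.

```python
from collections import Counter


def call_genotype(alleles, min_fraction=0.2):
    """Call a diploid genotype from the bases observed at one site."""
    counts = Counter(alleles)
    total = sum(counts.values())
    present = sorted(
        base for base, n in counts.items() if n / total >= min_fraction
    )
    if len(present) == 1:
        return f"{present[0]}/{present[0]}"
    if len(present) == 2:
        return "/".join(present)
    return None  # no reads, or more than two alleles pass the threshold


groups = [
    ["AACGTTAGCATA", "AACGTTAGCATA", "AACGTTCGCATA", "AACGTTCGCATA"],
    ["AACGTTCGCATA"] * 4,
    ["AACGTTAGCATA"] * 2,
]
site = 6  # 0-based position of the variable base within each read
for reads in groups:
    print(call_genotype([read[site] for read in reads]))
```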
![Page 33: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/33.jpg)
Software
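GigaBayes
$$
P(\mathrm{SNP}) = \sum_{\text{all variable } S}
\frac{\dfrac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots \dfrac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)} \, P_{\mathrm{Prior}}(S_1, \ldots, S_N)}
{\displaystyle \sum_{S_{i_1} \in [A,C,G,T]} \cdots \sum_{S_{i_N} \in [A,C,G,T]} \dfrac{P(S_{i_1} \mid R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots \dfrac{P(S_{i_N} \mid R_N)}{P_{\mathrm{Prior}}(S_{i_N})} \, P_{\mathrm{Prior}}(S_{i_1}, \ldots, S_{i_N})}
$$
Figure labels: SNP, INS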
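The Bayesian SNP probability on this slide sums, over every base configuration that contains more than one allele, the product of per-read base probabilities weighted by a prior, and normalizes by the same sum over all configurations. A brute-force sketch for a handful of reads is given here. It is not GigaBayes code: the per-read base probabilities derived from a single error probability, the uniform single-base prior of 0.25, and the configuration prior controlled by `theta` are all assumptions made for illustration.

```python
from itertools import product

BASES = "ACGT"


def base_posterior(called, error_prob):
    """Assumed P(S | R): the called base gets 1 - e, the others share e."""
    return {b: (1 - error_prob) if b == called else error_prob / 3 for b in BASES}


def snp_probability(calls, error_probs, theta=0.001):
    """Enumerate all 4^N base configurations and return P(SNP).

    theta is an assumed total prior probability that the site is variable,
    spread evenly over all configurations with more than one allele.
    """
    n = len(calls)
    posteriors = [base_posterior(c, e) for c, e in zip(calls, error_probs)]
    n_variable = 4 ** n - 4
    numerator = 0.0
    denominator = 0.0
    for config in product(BASES, repeat=n):
        variable = len(set(config)) > 1
        term = theta / n_variable if variable else (1 - theta) / 4
        for posterior, base in zip(posteriors, config):
            term *= posterior[base] / 0.25  # P(S_i | R_i) / P_prior(S_i)
        denominator += term
        if variable:
            numerator += term
    return numerator / denominator


p_mixed = snp_probability(list("AACC"), [0.001] * 4)
p_same = snp_probability(list("AAAA"), [0.001] * 4)
print(f"two alleles observed: {p_mixed:.4f}")
print(f"one allele observed:  {p_same:.2e}")
```

Enumeration costs 4^N terms, so this form is only practical for a few reads; it is meant to show how the numerator and denominator of the formula relate, not how a production caller evaluates them.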
![Page 34: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/34.jpg)
Data visualization
1. aid software development: integration of trace data viewing, fast navigation, zooming/panning
2. facilitate data validation (e.g. SNP validation): co-viewing of multiple read types, quality value displays
3. promote hypothesis generation: integration of annotation tracks
Weichun Huang
![Page 35: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/35.jpg)
Applications
1. SNP discovery in shallow, single-read 454 coverage (Drosophila melanogaster)
2. SNP and INDEL discovery in deep Illumina short-read coverage (Caenorhabditis elegans)
3. Mutational profiling in deep 454 and Illumina read data (Pichia stipitis)
(image from Nature Biotech.)
![Page 36: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/36.jpg)
Our software is available for testing
http://bioinformatics.bc.edu/marthlab/Beta_Release
![Page 37: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/37.jpg)
Credits
http://bioinformatics.bc.edu/marthlab
Elaine Mardis (Washington University)
Andy Clark (Cornell University)
Doug Smith (Agencourt)
Research supported by: NHGRI (G.T.M.), BC Presidential Scholarship (A.R.Q.)
Derek Barnett
Eric Tsung
Aaron Quinlan
Damien Croteau-Chonka
Weichun Huang
Michael Stromberg
Chip Stewart
Michele Busby
![Page 38: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/38.jpg)
Accuracy
• As is the case for all heuristic alignment algorithms, accuracy and speed are option- and parameter-dependent
![Page 39: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/39.jpg)
C3. Quality values are important for allele calling
• PHRED base quality values represent the estimated likelihood of sequencing error and help us pick out true alternate alleles
• inaccurate or not well calibrated base quality values hinder allele calling
Q-values should be accurate … and high!
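The PHRED scale relates a quality value Q to an estimated error probability p by Q = -10 log10(p). A minimal conversion sketch follows; the function names are hypothetical.

```python
import math


def phred_to_error_prob(q):
    """Estimated probability that a base call with quality q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """PHRED quality value for an estimated error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}")
```

An allele caller that trusts these values will treat a Q30 mismatch as far stronger evidence than a Q10 mismatch, which is why miscalibrated qualities translate directly into false or missed alleles.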
![Page 40: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/40.jpg)
Software tools for next-gen sequence analysis
![Page 41: Data analysis methods for next-generation sequencing technologies](https://reader035.fdocuments.in/reader035/viewer/2022062321/56813b2d550346895da3f61e/html5/thumbnails/41.jpg)
Next-generation sequencing technologies and applications