Date posted: 21-Dec-2015
Next-Generation Sequencing: Challenges and Opportunities
Ion Mandoiu
Computer Science and Engineering Department
University of Connecticut
Outline
• Background on high-throughput sequencing
• Identification of tumor-specific epitopes
• Estimation of gene and isoform expression levels
• Viral quasispecies reconstruction
• Future work
http://www.economist.com/node/16349358
Advances in High-Throughput Sequencing (HTS)
• Roche/454 FLX Titanium: 400-600 million bases/run, 400bp avg. length
• Illumina HiSeq 2000: up to 6 billion PE reads/run, 35-100bp read length
• SOLiD 4: 1.4-2.4 billion PE reads/run, 35-50bp read length
Illumina Workflow – Library Preparation
Genomic DNA mRNA
Illumina Workflow – Cluster Generation
Illumina Workflow – Sequencing by Synthesis
Cost of Whole Genome Sequencing
[Figure: cost ($100 to $100,000,000) versus sequencing time (days, weeks, months, years); labeled points: Illumina@36x, SOLiD@12x]
HTS applications
• HTS is a transformative technology
• Numerous applications besides de novo genome sequencing:
– RNA-Seq
– Non-coding RNAs
– ChIP-Seq
– Epigenetics
– Structural variation
– Metagenomics
– Paleogenomics
– …
Genomics-Guided Cancer Immunotherapy
[Figure: Tumor mRNA Sequencing (nucleotide reads) → Tumor Specific Epitopes (peptide sequences) → Peptide Synthesis → Immune System Stimulation → Tumor Remission]
Mouse Image Source: http://www.clker.com/clipart-simple-cartoon-mouse-2.html
Bioinformatics Pipeline
[Flowchart: Tumor mRNA reads → CCDS Mapping and Genome Mapping → CCDS mapped reads and Genome mapped reads → Read Merging → Mapped reads → SNVs Detection → Tumor-specific SNVs → Haplotyping → Close SNV Haplotypes → Epitope Prediction → Tumor specific epitopes; Tumor-specific SNVs → Primers Design → Primers for Sanger Sequencing]
Mapping mRNA Reads
http://en.wikipedia.org/wiki/File:RNA-Seq-alignment.png
Read Merging
Genome     | CCDS       | Agree? | Hard Merge | Soft Merge
Unique     | Unique     | Yes    | Keep       | Keep
Unique     | Unique     | No     | Throw      | Throw
Unique     | Multiple   | No     | Throw      | Keep
Unique     | Not mapped | No     | Keep       | Keep
Multiple   | Unique     | No     | Throw      | Keep
Multiple   | Multiple   | No     | Throw      | Throw
Multiple   | Not mapped | No     | Throw      | Throw
Not mapped | Unique     | No     | Keep       | Keep
Not mapped | Multiple   | No     | Throw      | Throw
Not mapped | Not mapped | Yes    | Throw      | Throw
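The merge rules in the table can be written as a small decision function. This is only a sketch of the table, not the pipeline's actual code: the function name, the status strings, and the `mode` parameter are hypothetical.

```python
KEEP, THROW = "Keep", "Throw"

def merge_decision(genome, ccds, agree, mode="hard"):
    """Return "Keep" or "Throw" for one read.

    genome, ccds: mapping status, one of "unique", "multiple", "unmapped"
                  (hypothetical labels for the table's rows).
    agree: True if the two alignments place the read at the same locus.
    mode: "hard" or "soft" merge.
    """
    if mode not in ("hard", "soft"):
        raise ValueError("mode must be 'hard' or 'soft'")
    if genome == "unique" and ccds == "unique":
        # Two unique alignments are kept only when they agree.
        return KEEP if agree else THROW
    if "unmapped" in (genome, ccds):
        # Only one mapping exists (or none): keep it if it is unique.
        return KEEP if "unique" in (genome, ccds) else THROW
    if genome == "multiple" and ccds == "multiple":
        return THROW
    # One unique and one multiple alignment: the two merge modes differ here.
    return KEEP if mode == "soft" else THROW
```

For example, `merge_decision("unique", "multiple", False, mode="soft")` returns `"Keep"`, while the same read is thrown away under hard merging.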
SNV Detection and Genotyping
[Figure: reads aligned under the reference sequence; labels: Reference, Locus i, R_i]
r(i): base call of read r at locus i
ε_r(i): probability of error reading base call r(i)
G_i: genotype at locus i
SNV Detection and Genotyping
• Use Bayes rule to calculate posterior probabilities and pick the genotype with the largest one
• Calculate conditional probabilities by multiplying contributions of individual reads
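A minimal sketch of this Bayesian genotype call follows. The slides' formulas did not survive extraction, so the details are assumptions: a read samples either allele with probability 1/2, a miscalled base is equally likely to be any of the other three bases, error probabilities lie strictly between 0 and 1, and the default prior is uniform over the ten unordered genotypes. The function names are hypothetical.

```python
from itertools import combinations_with_replacement
from math import exp, log

BASES = "ACGT"

def base_likelihood(call, err, allele):
    # Assumed error model: a miscall is uniform over the three other bases.
    return 1.0 - err if call == allele else err / 3.0

def genotype_posteriors(calls, errs, priors=None):
    """calls: base calls r(i) of the reads covering locus i.
    errs: matching error probabilities, each strictly between 0 and 1.
    Returns {genotype: posterior probability}."""
    genotypes = list(combinations_with_replacement(BASES, 2))
    if priors is None:
        priors = {g: 1.0 / len(genotypes) for g in genotypes}  # assumed prior
    log_post = {}
    for g in genotypes:
        total = log(priors[g])
        for call, err in zip(calls, errs):
            # Each read contributes one factor; either allele is sampled
            # with probability 1/2 (assumption).
            total += log(0.5 * base_likelihood(call, err, g[0])
                         + 0.5 * base_likelihood(call, err, g[1]))
        log_post[g] = total
    # Bayes rule: normalize in log space for numerical stability.
    top = max(log_post.values())
    norm = sum(exp(v - top) for v in log_post.values())
    return {g: exp(v - top) / norm for g, v in log_post.items()}
```

The called genotype is the key with the largest posterior, e.g. `max(post, key=post.get)`.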
Data Filtering
[Figure: % of mismatches (0% to 45%) versus read position (1 to 33); series: Transcripts, Genome, Hard Merge, Soft Merge]
Accuracy per RPKM bins
[Figure: stacked bars (0% to 100%) for SOAPsnp, Maq, and SNVQ in each bin: RPKM < 1, 1 < RPKM < 5, 5 < RPKM < 10, 10 < RPKM < 50, 50 < RPKM < 100, RPKM > 100; categories: TP HomoVar, TP Hetero, FP, FN HomoVar, FN Hetero]
Haplotyping
• Human somatic cells are diploid, containing two sets of nearly identical chromosomes, one set derived from each parent.
ACGTTACATTGCCACTCAATC--TGGA
ACGTCACATTG-CACTCGATCGCTGGA
Heterozygous variants
Haplotyping
Locus | Event     | Alleles
1     | SNV       | C,T
2     | Deletion  | C,-
3     | SNV       | A,G
4     | Insertion | -,GC

Locus | Event     | Alleles Hap 1 | Alleles Hap 2
1     | SNV       | T             | C
2     | Deletion  | C             | -
3     | SNV       | A             | G
4     | Insertion | -             | GC
RefHap Algorithm
• Reduce the problem to Max-Cut
• Solve Max-Cut
• Build haplotypes according to the cut

Locus | 1 | 2 | 3 | 4 | 5
f1    | - | 0 | 1 | 1 | 0
f2    | 1 | 1 | 0 | - | 1
f3    | 1 | - | - | 0 | -
f4    | - | 0 | 0 | - | 1

[Figure: graph on fragments 1-4 with edge weights 3, 1, 1, -1, -1]

h1 00110
h2 11001
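The three steps can be traced on the slide's four fragments with a short sketch. Two points are assumptions rather than statements from the slides: the edge weight between two fragments is taken as mismatches minus matches over their shared loci (this reproduces the weights 3, 1, 1, -1, -1 shown in the graph), and an exhaustive search stands in for the Max-Cut solver, which is feasible only for tiny inputs. Function names are hypothetical.

```python
from itertools import product

def edge_weight(a, b):
    # Assumed weight: +1 per mismatching shared locus, -1 per matching one.
    w = 0
    for x, y in zip(a, b):
        if x != "-" and y != "-":
            w += 1 if x != y else -1
    return w

def best_cut(frags):
    """Exhaustive Max-Cut; fragment 0 is pinned to side 0 to break symmetry."""
    n = len(frags)
    best_score, best_sides = None, None
    for rest in product((0, 1), repeat=n - 1):
        sides = (0,) + rest
        score = sum(edge_weight(frags[i], frags[j])
                    for i in range(n) for j in range(i + 1, n)
                    if sides[i] != sides[j])
        if best_score is None or score > best_score:
            best_score, best_sides = score, sides
    return best_score, best_sides

def haplotypes_from_cut(frags, sides):
    """Majority vote per locus; side-1 fragments vote with flipped alleles."""
    h1 = []
    for locus in range(len(frags[0])):
        votes = 0
        for frag, side in zip(frags, sides):
            if frag[locus] == "-":
                continue
            allele = int(frag[locus])
            if side == 1:
                allele = 1 - allele
            votes += 1 if allele == 1 else -1
        h1.append("1" if votes > 0 else "0")  # ties go to 0 (arbitrary choice)
    h1 = "".join(h1)
    h2 = "".join("1" if c == "0" else "0" for c in h1)
    return h1, h2

frags = ["-0110", "110-1", "1--0-", "-00-1"]  # f1..f4 from the slide
score, sides = best_cut(frags)
h1, h2 = haplotypes_from_cut(frags, sides)
```

On this input the best cut separates f1 from f2, f3, and f4, and the vote recovers the slide's haplotypes 00110 and 11001.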
Immunology Background
J.W. Yewdell, E. Reits and J. Neefjes. Making sense of mass destruction: quantitating MHC class I antigen presentation. Nature Reviews Immunology, 3:952-961, 2003
Epitope Prediction
C. Lundegaard et al. MHC Class I Epitope Binding Prediction Trained on Small Data Sets. In Lecture Notes in Computer Science, 3239:217-225, 2004
Results on Tumor Data
Mouse strain: BALB/C, B10.D2, TRAMP
Tumor        | Meth-A | CMS5 | prostate1 | prostate2 | prostate3 | prostate4
#lanes       | 1      | 3    | 4         | 3         | 3         | 3
HQ Het SNPs  | 465    | 77   | 86        | 17        | 292       | 193
Dd Weak      | 119    | 17   | 14        | 12        | 63        | 70
Dd Strong    | 20     | 2    | 2         | 0         | 7         | 12
Kd Weak      | 111    | 21   | 10        | 0         | 19        | 54
Kd Strong    | 3      | 1    | 1         | 0         | 1         | 3
Ld Weak      | 99     | 12   | 25        | 4         | 47        | 75
Ld Strong    | 8      | 0    | 0         | 0         | 2         | 9
Total Weak   | 329    | 50   | 49        | 16        | 129       | 199
Total Strong | 31     | 3    | 3         | 0         | 10        | 24
Experimental Validation
• Mutations reported by [Noguchi et al 94] found by the pipeline
• Confirmed with Sanger sequencing 18 out of 20 mutations for MethA and 26 out of 28 mutations for CMS5
• Immunogenic potential under experimental validation in the Srivastava lab at UCHC
RNA-Seq
[Figure: transcripts with exons A B C D E; steps: make cDNA & shatter into fragments, sequence fragment ends, map reads. Analyses of the mapped reads: Gene Expression (GE), Isoform Discovery (ID), Isoform Expression (IE); example isoforms: A B C, A C, D E]
Alternative Splicing
[Griffith and Marra 07]
Challenges to Accurate Estimation of Gene Expression Levels
• Read ambiguity (multireads)
• What is the gene length?
A B C D E
Previous approaches to GE
• Ignore multireads
• [Mortazavi et al. 08]
– Fractionally allocate multireads based on unique read estimates
• [Pasaniuc et al. 10]
– EM algorithm for solving ambiguities
• Gene length: sum of lengths of exons that appear in at least one isoform
– Underestimates expression levels for genes with 2 or more isoforms [Trapnell et al. 10]
Read Ambiguity in IE
A B C D E
A C
Previous approaches to IE
• [Jiang&Wong 09]
– Poisson model + importance sampling, single reads
• [Richard et al. 10]
– EM Algorithm based on Poisson model, single reads in exons
• [Li et al. 10]
– EM Algorithm, single reads
• [Feng et al. 10]
– Convex quadratic program, pairs used only for ID
• [Trapnell et al. 10]
– Extends Jiang’s model to paired reads
– Fragment length distribution
Our contribution
• Unified probabilistic model and Expectation-Maximization Algorithm for IE considering
– Single and/or paired reads
– Fragment length distribution
– Strand information
– Base quality scores
Read-Isoform Compatibility
$w_{r,i} = \sum_{a} O_a Q_a F_a$
Fragment length distribution
• Paired reads
[Figure: a read pair mapped to isoforms A B C and A C implies fragment lengths i and j, with probabilities F_a(i) and F_a(j)]
Fragment length distribution
• Single reads
[Figure: a single read mapped to isoforms A B C and A C, with candidate fragment lengths i and j and probabilities F_a(i) and F_a(j)]
IsoEM algorithm
E-step
M-step
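The E-step and M-step formulas did not survive extraction. The sketch below shows one common form of such an EM for isoform frequencies, under stated assumptions: each read carries compatibility weights w_{r,i}, the E-step splits a read among isoforms in proportion to f(i) w_{r,i}, and the M-step divides expected read counts by an effective isoform length and renormalizes. The function name and the `eff_len` parameter are hypothetical, and this is not presented as the IsoEM implementation.

```python
def isoform_em(read_weights, eff_len, iters=100):
    """read_weights: one dict per read, {isoform: w_ri} for compatible isoforms.
    eff_len: {isoform: effective length} (assumed normalization).
    Returns {isoform: estimated frequency}."""
    isoforms = list(eff_len)
    f = {i: 1.0 / len(isoforms) for i in isoforms}
    for _ in range(iters):
        # E-step: expected number of reads n(i) drawn from each isoform.
        n = {i: 0.0 for i in isoforms}
        for w in read_weights:
            z = sum(f[i] * wi for i, wi in w.items())
            if z == 0:
                continue  # read is incompatible with every isoform
            for i, wi in w.items():
                n[i] += f[i] * wi / z
        # M-step: length-normalized counts, rescaled to sum to 1.
        s = sum(n[i] / eff_len[i] for i in isoforms)
        f = {i: n[i] / eff_len[i] / s for i in isoforms}
    return f

# Toy input: isoforms "ABC" and "AC" of equal effective length; 3 reads
# unique to ABC, 1 unique to AC, and 4 reads compatible with both.
reads = [{"ABC": 1.0}] * 3 + [{"AC": 1.0}] + [{"ABC": 1.0, "AC": 1.0}] * 4
freqs = isoform_em(reads, {"ABC": 1000, "AC": 1000})
```

In this toy case the ambiguous reads end up split in the same 3:1 ratio as the unique reads.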
Error Fraction Curves - Isoforms
• 30M single reads of length 25 (simulated)
[Figure: % of isoforms over threshold (0 to 100) versus relative error threshold (0 to 1); methods: Uniq, Rescue, UniqLN, Cufflinks, RSEM, IsoEM]
Error Fraction Curves - Genes
• 30M single reads of length 25 (simulated)
[Figure: % of genes over threshold (0 to 100) versus relative error threshold (0 to 1); methods: Uniq, Rescue, GeneEM, Cufflinks, RSEM, IsoEM]
Validation on MAQC Samples
[Figure: R² (0.6 to 0.85) versus million mapped bases; series: IsoEM and Cufflinks on UHRR Lib 1-6 and HBRR Lib 1-2]
Viral Quasispecies
RNA viruses (HIV, HCV)
• Many replication mistakes
• Quasispecies (qsps) = co-existing closely related variants
Variants differ in
• virulence
• ability to escape the immune system
• resistance to antiviral therapies
• tissue tropism
How do qsps contribute to viral persistence and evolution?
454 Pyrosequencing
Pyrosequencing = Sequencing by Synthesis.
GS FLX Titanium:
• Fragments (reads): 300-800 bp
• Sequence of the reads
• System software assembles reads into a single genome
We need software that assembles reads into multiple genomes!
Quasispecies Spectrum Reconstruction (QSR) Problem
Given: pyrosequencing reads from a quasispecies population of unknown size and distribution
Reconstruct: the quasispecies spectrum
• sequences
• frequencies
ViSpA: Viral Spectrum Assembler
454 Sequencing Errors
Error rate ~0.1%.
Fixed number of incorporated bases vs. light intensity value.
Incorrect resolution of homopolymers =>
• over-calls (insertions): 65-75% of errors
• under-calls (deletions): 20-30% of errors
Preprocessing of Aligned Reads
1. Deletions in reads (D): replace a deletion confirmed by a single read with either the allele value that is present in all other reads or N.
2. Insertions into reference (I): remove insertions confirmed by a single read.
3. Imputation of missing values (N)
Read Graph: Vertices
Subread = a read completely contained in some other read with ≤ n mismatches.
Superread = a read that is not a subread => a vertex in the read graph.
ACTGGTCCCTCCTGAGTGT
GGTCCCTCCT
TGGTCACTCGTGAG
ACCTCATCGAAGCGGCGTCCT
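The subread test can be sketched as follows, assuming each aligned read is given as a (start, sequence) pair in reference coordinates without gaps. The start positions in the usage example are assumptions chosen so that the first three reads overlap as their sequences suggest; they are not given in the slides. Function names are hypothetical.

```python
def mismatches(a, b):
    """Count differing bases on the overlap of two (start, seq) reads."""
    (a_start, a_seq), (b_start, b_seq) = a, b
    lo = max(a_start, b_start)
    hi = min(a_start + len(a_seq), b_start + len(b_seq))
    return sum(a_seq[p - a_start] != b_seq[p - b_start] for p in range(lo, hi))

def superreads(reads, n):
    """Return the reads that are not subreads of any other read."""
    kept = []
    for i, (s, seq) in enumerate(reads):
        is_subread = False
        for j, (t, other) in enumerate(reads):
            if i == j:
                continue
            contained = t <= s and s + len(seq) <= t + len(other)
            # Of two reads covering exactly the same span, keep the earlier one.
            same_span = t == s and len(other) == len(seq)
            if contained and not (same_span and i < j) \
                    and mismatches((s, seq), (t, other)) <= n:
                is_subread = True
                break
        if not is_subread:
            kept.append((s, seq))
    return kept

reads = [
    (0, "ACTGGTCCCTCCTGAGTGT"),
    (3, "GGTCCCTCCT"),          # exact substring of the first read
    (2, "TGGTCACTCGTGAG"),      # inside the first read, with 2 mismatches
]
strict = superreads(reads, n=0)
relaxed = superreads(reads, n=2)
```

With n = 0 the third read remains a superread because of its two mismatches; with n = 2 only the first read is left.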
Read Graph: Edges
An edge between two vertices exists if the two superreads overlap and agree on their overlap with ≤ m mismatches.
Auxiliary vertices: source and sink
Read Graph: Edge Cost
The most probable source-sink path through each vertex
Cost: uncertainty that two superreads are from the same qsps.
Overhang Δ is the shift in start positions of two overlapping superreads.
Δ
Contig Assembling
Max Bandwidth Path through vertex
• path minimizing maximum edge cost for the path and each subpath
Consensus of path’s superreads
• Each position: >70%-majority or N
• Weighted consensus obtained on all reads
Remove duplicates
• Duplicated sequences = statistical evidence
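The per-position majority rule can be illustrated with a short sketch. It assumes the superreads on a path are given as (start, sequence) pairs in a shared coordinate system and counts every read equally, so it covers only the ">70%-majority or N" step, not the weighted consensus over all reads. The function name and `threshold` parameter are hypothetical.

```python
from collections import Counter

def consensus(reads, threshold=0.7):
    """Build a consensus string over the span covered by the reads.

    A position gets the most common base if its share of the covering
    reads exceeds the threshold, and N otherwise."""
    begin = min(s for s, _ in reads)
    end = max(s + len(seq) for s, seq in reads)
    out = []
    for pos in range(begin, end):
        column = Counter(seq[pos - s] for s, seq in reads
                         if s <= pos < s + len(seq))
        if not column:
            out.append("N")  # no read covers this position
            continue
        base, count = column.most_common(1)[0]
        out.append(base if count / sum(column.values()) > threshold else "N")
    return "".join(out)
```

For instance, three reads `ACGT` and one read `ACGA` at the same start give `ACGT` (T has a 75% share at the last position), while one of each gives `ACGN`.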
$p(s, r) = \binom{l}{k} \left(\frac{t}{L}\right)^{k} \left(1 - \frac{t}{L}\right)^{l-k}$
where read r has length l, qsps s has length L, k is the number of mismatches, and t/L is the mutation rate
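Expectation Maximization
Bipartite graph:
• Q_q is a candidate with frequency f_q
• R_r is a read with observed frequency o_r
• Weight h_{q,r} = probability that read r is produced by qsps q with j mismatches:
$h_{q,r} = \binom{l}{j} \varepsilon^{j} (1 - \varepsilon)^{l-j}$, with l the read length and ε the sequencing error rate
E step: $p_{q,r} = \frac{f_q h_{q,r}}{\sum_{q'} f_{q'} h_{q',r}}$
M step: $f_q = \frac{\sum_{r} p_{q,r} o_r}{\sum_{r} o_r}$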
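A minimal sketch of this EM on the bipartite candidate-read graph follows. The binomial form of the weight and the symbol `eps` for the error rate are assumptions (the slide's formula is only partly recoverable), and the function names are hypothetical. The sketch also assumes every read has nonzero weight for at least one candidate; otherwise the frequencies would not sum to 1.

```python
from math import comb

def h_weight(l, j, eps):
    # Assumed binomial weight: j mismatches in a read of length l.
    return comb(l, j) * eps ** j * (1.0 - eps) ** (l - j)

def em_frequencies(h, o, iters=200):
    """h[q][r]: weight of candidate q for read r; o[r]: observed read count.
    Returns the estimated candidate frequencies f_q."""
    n_q, n_r = len(h), len(o)
    total = float(sum(o))
    f = [1.0 / n_q] * n_q
    for _ in range(iters):
        new_f = [0.0] * n_q
        for r in range(n_r):
            z = sum(f[q] * h[q][r] for q in range(n_q))
            if z == 0:
                continue
            for q in range(n_q):
                # E step: p_qr = f_q h_qr / z; M step accumulates p_qr * o_r.
                new_f[q] += o[r] * f[q] * h[q][r] / z
        f = [x / total for x in new_f]
    return f

# Toy input: 2 candidates, 3 distinct reads of length 10, error rate 1%.
mism = [[0, 3, 1],   # mismatches of each read against candidate 0
        [3, 0, 1]]   # mismatches of each read against candidate 1
h = [[h_weight(10, j, 0.01) for j in row] for row in mism]
f = em_frequencies(h, o=[3, 1, 4])
```

Here the read that fits both candidates equally well is divided according to the frequencies implied by the other reads, so candidate 0 ends up near 0.75.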
HCV Qsps (P. Balfe)
• 30927 reads from a 5.2Kb-long region of HCV-1a genomes
• intravenous drug user infected for less than 3 months => mutation rate is in [1.75%, 8%]
HCV Data Statistics
• 27764 reads, average length = 292bp
• Indels: ~77% of reads
– Insertion lengths: 1 (86%), 3 (9.8%)
– Deletion lengths: 1 (98%)
• N: ~7% of reads
NJ Tree for 12 Most Frequent Qsps (No Insertions)
The top sequence: 26.9% (no mismatches) and 50.4% (≤1 mismatch) of the reads.
In sum: 35.6% (no mismatches) and 64.5% (≤1 mismatch) of the reads.
Reconstructed sequence with highest frequency 99% identical to one of the ORFs obtained by cloning the quasispecies.
Conclusions & Future Work
• Implementations of these methods are freely available at http://dna.engr.uconn.edu/software/
• Ongoing work
– Monitoring immune responses by TCR sequencing
– Isoform discovery
– Computational deconvolution of heterogeneous samples
– Reconstruction & frequency estimation of virus quasispecies from Ion Torrent reads
Acknowledgments
Immunogenomics
• Jorge Duitama (KU Leuven)
• Pramod K. Srivastava, Adam Adler, Brent Graveley, Duan Fei (UCHC)
• Matt Alessandri and Kelly Gonzalez (Ambry Genetics)
IsoEM
• Marius Nicolae (UConn)
• Alex Zelikovsky, Serghei Mangul (GSU)
ViSpA
• Alex Zelikovsky, Irina Astrovskaya, Bassam Tork, Serghei Mangul (GSU), and Kelly Westbrooks (Life Technologies)
• Peter Balfe (Birmingham University, UK)
Funding
• NSF awards IIS-0546457, IIS-0916948, and DBI-0543365
• UCONN Research Foundation UCIG grant