Next-Generation Sequencing: Challenges and Opportunities Ion Mandoiu Computer Science and...

Next-Generation Sequencing: Challenges and Opportunities

Ion Mandoiu

Computer Science and Engineering Department

University of Connecticut

Outline

• Background on high-throughput sequencing
• Identification of tumor-specific epitopes
• Estimation of gene and isoform expression levels
• Viral quasispecies reconstruction
• Future work

http://www.economist.com/node/16349358

Advances in High-Throughput Sequencing (HTS)

• Roche/454 FLX Titanium: 400-600 million bases/run, 400bp avg. length
• Illumina HiSeq 2000: up to 6 billion PE reads/run, 35-100bp read length
• SOLiD 4: 1.4-2.4 billion PE reads/run, 35-50bp read length

Illumina Workflow – Library Preparation

Genomic DNA mRNA

Illumina Workflow – Cluster Generation

Illumina Workflow – Sequencing by Synthesis

Cost of Whole Genome Sequencing

[Figure: cost of whole genome sequencing (log scale, $100 to $100,000,000) versus sequencing time (days, weeks, months, years). Labeled data points include Illumina@36x and SOLiD@12x; two further labels were garbled in extraction.]

HTS applications

• HTS is a transformative technology
• Numerous applications besides de novo genome sequencing:
– RNA-Seq
– Non-coding RNAs
– ChIP-Seq
– Epigenetics
– Structural variation
– Metagenomics
– Paleogenomics
– …

Outline



Genomics-Guided Cancer Immunotherapy

[Figure: Tumor mRNA Sequencing → Tumor Specific Epitopes → Peptide Synthesis → Immune System Stimulation → Tumor Remission]

Tumor mRNA Sequencing

CTCAATTGATGAAATTGTTCTGAAACTGCAGAGATAGCTAAAGGATACCGGGTTCCGGTATCCTTTAGCTATCTCTGCCTCCTGACACCATCTGTGTGGGCTACCATG

…

AGGCAAGCTCATGGCCAAATCATGAGA

Tumor Specific Epitopes

SYFPEITHIISETDLSLLCALRRNESL

…

Mouse Image Source: http://www.clker.com/clipart-simple-cartoon-mouse-2.html

Bioinformatics Pipeline

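[Figure: pipeline flowchart]

• Tumor mRNA reads → CCDS Mapping → CCDS mapped reads
• Tumor mRNA reads → Genome Mapping → Genome mapped reads
• CCDS mapped reads + Genome mapped reads → Read Merging → Mapped reads
• Mapped reads → SNVs Detection → Tumor-specific SNVs
• Tumor-specific SNVs → Haplotyping → Close SNV Haplotypes
• Epitope Prediction → Tumor specific epitopes
• Primers Design → Primers for Sanger Sequencing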

Mapping mRNA Reads

http://en.wikipedia.org/wiki/File:RNA-Seq-alignment.png

Read Merging

Genome        CCDS          Agree?   Hard Merge   Soft Merge
Unique        Unique        Yes      Keep         Keep
Unique        Unique        No       Throw        Throw
Unique        Multiple      No       Throw        Keep
Unique        Not mapped    No       Keep         Keep
Multiple      Unique        No       Throw        Keep
Multiple      Multiple      No       Throw        Throw
Multiple      Not mapped    No       Throw        Throw
Not mapped    Unique        No       Keep         Keep
Not mapped    Multiple      No       Throw        Throw
Not mapped    Not mapped    Yes      Throw        Throw
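The merge rules in the table can be written as a small decision function. This is only a sketch of the table: the function name, the status constants, and the `mode` parameter are illustrative names, not part of the original pipeline.

```python
# Mapping status of a read against one reference (hypothetical labels).
UNIQUE, MULTIPLE, NOT_MAPPED = "unique", "multiple", "not_mapped"

def merge_decision(genome, ccds, agree, mode="hard"):
    """Return True to keep a read, False to throw it away.

    genome, ccds: mapping status against the genome and against CCDS.
    agree: whether the two alignments agree (only matters if both are unique).
    mode: "hard" or "soft" merge, as in the table.
    """
    statuses = {genome, ccds}
    if genome == UNIQUE and ccds == UNIQUE:
        return bool(agree)                 # keep only concordant unique pairs
    if statuses == {UNIQUE, NOT_MAPPED}:
        return True                        # unique on one side only: keep
    if statuses == {UNIQUE, MULTIPLE}:
        return mode == "soft"              # soft merge trusts the unique side
    return False                           # every remaining row is "Throw"
```

For example, `merge_decision("unique", "multiple", False, mode="soft")` keeps the read, while the same call with `mode="hard"` throws it away.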

SNV Detection and Genotyping

Reference:

AACGCGGCCAGCCGGCTTCTGTCGGCCAGCAGCCAGGAATCTGGAAACAATGGCTACAGCGTGC

Reads $R_i$ covering locus $i$:

AACGCGGCCAGCCGGCTTCTGTCGGCCAGCCGGCAG
  CGCGGCCAGCCGGCTTCTGTCGGCCAGCAGCCCGGA
   GCGGCCAGCCGGCTTCTGTCGGCCAGCCGGCAGGGA
      GCCAGCCGGCTTCTGTCGGCCAGCAGCCAGGAATCT
          GCCGGCTTCTGTCGGCCAGCAGCCAGGAATCTGGAA
               CTTCTGTCGGCCAGCCGGCAGGAATCTGGAAACAAT
                      CGGCCAGCAGCCAGGAATCTGGAAACAATGGCTACA
                         CCAGCAGCCAGGAATCTGGAAACAATGGCTACAGCG
                         CAAGCAGCCAGGAATCTGGAAACAATGGCTACAGCG
                            GCAGCCAGGAATCTGGAAACAATGGCTACAGCGTGC

• $r(i)$: base call of read $r$ at locus $i$
• $\varepsilon_{r(i)}$: probability of error reading base call $r(i)$
• $G_i$: genotype at locus $i$

SNV Detection and Genotyping

• Use Bayes rule to calculate posterior probabilities and pick the genotype with the largest one

SNV Detection and Genotyping

• Calculate conditional probabilities by multiplying contributions of individual reads
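The equations for these two slides did not survive extraction, so the following is a minimal sketch of the idea rather than the exact model used. It assumes a common error model in which a base call matches an allele with probability 1 - ε and is any specific other base with probability ε/3, that each read samples either allele of a diploid genotype with probability 1/2, and a uniform prior unless one is supplied. All function names are illustrative.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"

def base_likelihood(base, err, allele):
    """P(base call | true allele) under the assumed symmetric error model."""
    return 1.0 - err if base == allele else err / 3.0

def genotype_posteriors(pileup, priors=None):
    """pileup: list of (base call r(i), error probability) with 0 < error < 1.

    Returns {genotype: posterior} over the 10 unordered diploid genotypes.
    """
    genotypes = list(combinations_with_replacement(BASES, 2))
    if priors is None:  # assumption: uniform prior
        priors = {g: 1.0 / len(genotypes) for g in genotypes}
    log_post = {}
    for g in genotypes:
        lp = math.log(priors[g])
        for base, err in pileup:
            # Multiply per-read contributions (sum of logs for stability).
            lp += math.log(0.5 * base_likelihood(base, err, g[0])
                           + 0.5 * base_likelihood(base, err, g[1]))
        log_post[g] = lp
    top = max(log_post.values())
    norm = sum(math.exp(v - top) for v in log_post.values())
    return {g: math.exp(v - top) / norm for g, v in log_post.items()}

def call_genotype(pileup, priors=None):
    """Pick the genotype with the largest posterior probability."""
    post = genotype_posteriors(pileup, priors)
    return max(post, key=post.get)
```

With five `A` calls and five `C` calls at error 0.01, `call_genotype` returns the heterozygous genotype `("A", "C")`; with ten `A` calls it returns `("A", "A")`.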

Data Filtering

[Figure: % of mismatches (0% to 45%) versus read position (1 to 33) for four read sets: Transcripts, Genome, Hard Merge, Soft Merge]

Accuracy per RPKM bins

[Figure: stacked bars (0% to 100%) comparing SOAPsnp, Maq, and SNVQ within each RPKM bin (RPKM < 1, 1 < RPKM < 5, 5 < RPKM < 10, 10 < RPKM < 50, 50 < RPKM < 100, RPKM > 100). Bar segments: TPHomoVar, TPHetero, FP, FNHomoVar, FNHetero]



Haplotyping

• Human somatic cells are diploid, containing two sets of nearly identical chromosomes, one set derived from each parent.

ACGTTACATTGCCACTCAATC--TGGA
ACGTCACATTG-CACTCGATCGCTGGA

Heterozygous variants

Haplotyping

Locus   Event       Alleles
1       SNV         C,T
2       Deletion    C,-
3       SNV         A,G
4       Insertion   -,GC

Locus   Event       Alleles Hap 1   Alleles Hap 2
1       SNV         T               C
2       Deletion    C               -
3       SNV         A               G
4       Insertion   -               GC

RefHap Algorithm

• Reduce the problem to Max-Cut
• Solve Max-Cut
• Build haplotypes according to the cut

Locus   1  2  3  4  5
f1      -  0  1  1  0
f2      1  1  0  -  1
f3      1  -  -  0  -
f4      -  0  0  -  1

[Figure: graph on fragments 1, 2, 3, 4 with edge weights 3, 1, 1, -1, -1]

h1   00110
h2   11001
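The three steps can be traced on the four-fragment example. The sketch below makes two assumptions that are not stated on the slide: the edge weight between two fragments is taken as mismatches minus matches over the loci both cover (this reproduces the weights 3, 1, 1, -1, -1 that survive from the figure), and Max-Cut is solved by brute force, which only works for tiny inputs and stands in for whatever heuristic the real tool uses. Function names are illustrative.

```python
from itertools import combinations

# Fragment matrix from the slide; "-" means the locus is not covered.
fragments = {"f1": "-0110", "f2": "110-1", "f3": "1--0-", "f4": "-00-1"}

def edge_weight(a, b):
    """Assumed score: +1 per mismatch, -1 per match on shared loci."""
    return sum((1 if x != y else -1)
               for x, y in zip(a, b) if x != "-" and y != "-")

def max_cut_brute_force(frags):
    """Try every 2-way split (first fragment fixed on side 0)."""
    names = sorted(frags)
    weights = {(u, v): edge_weight(frags[u], frags[v])
               for u, v in combinations(names, 2)}
    best_side, best_value = None, float("-inf")
    for mask in range(2 ** (len(names) - 1)):
        side = {names[0]: 0}
        for i, name in enumerate(names[1:]):
            side[name] = (mask >> i) & 1
        value = sum(w for (u, v), w in weights.items() if side[u] != side[v])
        if value > best_value:
            best_side, best_value = side, value
    return best_side, best_value

def build_haplotypes(frags, side):
    """Majority vote per locus; side-1 fragments vote for the complement."""
    n_loci = len(next(iter(frags.values())))
    h1 = []
    for locus in range(n_loci):
        votes = 0
        for name, frag in frags.items():
            if frag[locus] == "-":
                continue
            allele = int(frag[locus])
            if side[name] == 1:
                allele = 1 - allele
            votes += 1 if allele == 1 else -1
        h1.append("1" if votes > 0 else "0")  # ties default to 0
    h1 = "".join(h1)
    h2 = "".join("1" if c == "0" else "0" for c in h1)
    return h1, h2

side, value = max_cut_brute_force(fragments)
h1, h2 = build_haplotypes(fragments, side)
```

On this input the best cut separates f1 from f2, f3, and f4 with total weight 5, and the vote yields `h1 = "00110"` and `h2 = "11001"`, matching the slide.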



Immunology Background

J.W. Yewdell, E. Reits and J. Neefjes. Making sense of mass destruction: quantitating MHC class I antigen presentation. Nature Reviews Immunology, 3:952-961, 2003

Epitope Prediction

C. Lundegaard et al. MHC Class I Epitope Binding Prediction Trained on Small Data Sets. In Lecture Notes in Computer Science, 3239:217-225, 2004

Results on Tumor Data

Mouse strain (column spans lost in extraction): BALB/C B10.D2 TRAMP

Tumor          Meth-A   CMS5   prostate1   prostate2   prostate3   prostate4
#lanes         1        3      4           3           3           3
HQ Het SNPs    465      77     86          17          292         193
Dd Weak        119      17     14          12          63          70
Dd Strong      20       2      2           0           7           12
Kd Weak        111      21     10          0           19          54
Kd Strong      3        1      1           0           1           3
Ld Weak        99       12     25          4           47          75
Ld Strong      8        0      0           0           2           9
Total Weak     329      50     49          16          129         199
Total Strong   31       3      3           0           10          24

Experimental Validation

• Mutations reported by [Noguchi et al 94] were found by the pipeline
• Sanger sequencing confirmed 18 out of 20 mutations for MethA and 26 out of 28 mutations for CMS5
• Immunogenic potential is under experimental validation in the Srivastava lab at UCHC

Outline



RNA-Seq

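[Figure: RNA-Seq workflow on a gene with exons A B C D E: make cDNA & shatter into fragments → sequence fragment ends → map reads. Downstream analyses: Gene Expression (GE), Isoform Discovery (ID), Isoform Expression (IE). Example transcripts: A B C, A C, D E]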

Alternative Splicing

[Griffith and Marra 07]

Challenges to Accurate Estimation of Gene Expression Levels

• Read ambiguity (multireads)

• What is the gene length?

A B C D E

Previous approaches to GE

• Ignore multireads
• [Mortazavi et al. 08]
– Fractionally allocate multireads based on unique read estimates
• [Pasaniuc et al. 10]
– EM algorithm for solving ambiguities
• Gene length: sum of lengths of exons that appear in at least one isoform
– Underestimates expression levels for genes with 2 or more isoforms [Trapnell et al. 10]
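Fractional allocation of multireads can be sketched in a few lines. This is a simplified illustration, not the published procedure: it assumes the allocation is proportional to raw unique read counts (a length-normalized estimate could be substituted), and it splits a multiread evenly when none of its genes has unique reads. The function name and input layout are hypothetical.

```python
def rescue_allocate(unique_counts, multireads):
    """Distribute each multiread across its candidate genes.

    unique_counts: {gene: number of uniquely mapped reads}
    multireads: list of gene lists, one list per ambiguous read
    Returns {gene: unique count plus fractional multiread share}.
    """
    counts = {gene: float(c) for gene, c in unique_counts.items()}
    for genes in multireads:
        total = sum(unique_counts.get(g, 0) for g in genes)
        for g in genes:
            if total > 0:
                share = unique_counts.get(g, 0) / total
            else:
                share = 1.0 / len(genes)  # assumption: even split as fallback
            counts[g] = counts.get(g, 0.0) + share
    return counts
```

With 30 unique reads on gene A, 10 on gene B, and four reads that map to both, A receives three of the four ambiguous reads and B receives one.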

Read Ambiguity in IE

A B C D E

A C

Previous approaches to IE

• [Jiang&Wong 09]
– Poisson model + importance sampling, single reads
• [Richard et al. 10]
– EM Algorithm based on Poisson model, single reads in exons
• [Li et al. 10]
– EM Algorithm, single reads
• [Feng et al. 10]
– Convex quadratic program, pairs used only for ID
• [Trapnell et al. 10]
– Extends Jiang’s model to paired reads
– Fragment length distribution

Our contribution

• Unified probabilistic model and Expectation-Maximization Algorithm for IE considering
– Single and/or paired reads
– Fragment length distribution
– Strand information
– Base quality scores

Read-Isoform Compatibility $w_{r,i}$

$$w_{r,i} = \sum_{a} O_a \, Q_a \, F_a$$

Fragment length distribution

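• Paired reads

[Figure: the same read pair aligned to isoforms A B C and A C implies different fragment lengths $i$ and $j$, scored by the fragment length distribution as $F_a(i)$ and $F_a(j)$]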

Fragment length distribution

• Single reads

[Figure: a single read aligned to isoforms A B C and A C, with lengths $i$ and $j$ scored by the fragment length distribution as $F_a(i)$ and $F_a(j)$]

IsoEM algorithm

E-step

M-step
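The E-step and M-step formulas were images and are not in the transcript, so the following is a generic sketch of an EM of this kind, not the IsoEM implementation itself. It assumes that each read comes with precomputed compatibility weights $w_{r,i}$, that the E-step assigns each read fractionally in proportion to frequency times weight, and that the M-step divides the expected read counts by isoform length and renormalizes. The function name and input layout are hypothetical.

```python
def isoform_em(weights, lengths, iterations=200):
    """Estimate isoform frequencies from read-isoform compatibility weights.

    weights: one dict per read, {isoform: w_ri}, listing compatible isoforms
    lengths: {isoform: (effective) length}
    """
    isoforms = list(lengths)
    freq = {i: 1.0 / len(isoforms) for i in isoforms}
    for _ in range(iterations):
        # E-step: expected number of reads drawn from each isoform.
        counts = {i: 0.0 for i in isoforms}
        for w in weights:
            total = sum(freq[i] * w_ri for i, w_ri in w.items())
            if total == 0.0:
                continue  # read is incompatible with every isoform
            for i, w_ri in w.items():
                counts[i] += freq[i] * w_ri / total
        # M-step: length-normalize the expected counts and renormalize.
        norm = sum(counts[i] / lengths[i] for i in isoforms)
        freq = {i: (counts[i] / lengths[i]) / norm for i in isoforms}
    return freq
```

As a sanity check, 60 reads compatible only with a 300-base isoform and 20 reads compatible only with a 200-base isoform give frequencies of 2/3 and 1/3, since 60/300 is twice 20/200.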

Error Fraction Curves - Isoforms

• 30M single reads of length 25 (simulated)

[Figure: % of isoforms over threshold (0 to 100) versus relative error threshold (0 to 1) for Uniq, Rescue, UniqLN, Cufflinks, RSEM, IsoEM]

Error Fraction Curves - Genes

• 30M single reads of length 25 (simulated)

[Figure: % of genes over threshold (0 to 100) versus relative error threshold (0 to 1) for Uniq, Rescue, GeneEM, Cufflinks, RSEM, IsoEM]

Validation on MAQC Samples

[Figure: R2 (0.6 to 0.85) versus million mapped bases. IsoEM series: UHRR Lib 1 to 6, HBRR Lib 1 and 2. Cufflinks series: UHRR Lib 1, HBRR Lib 1 and 2, plus further series whose labels were lost in extraction]

Outline



Viral Quasispecies

• RNA viruses (HIV, HCV)
• Many replication mistakes
• Quasispecies (qsps) = co-existing closely related variants
• Variants differ in
– virulence
– ability to escape the immune system
– resistance to antiviral therapies
– tissue tropism
• How do qsps contribute to viral persistence and evolution?

454 Pyrosequencing

• Pyrosequencing = sequencing by synthesis
• GS FLX Titanium:
– Fragments (reads): 300-800 bp
– Sequence of the reads
– System software assembles reads into a single genome
• We need software that assembles reads into multiple genomes!

Quasispecies Spectrum Reconstruction (QSR) Problem

• Given: pyrosequencing reads from a quasispecies population of unknown size and distribution
• Reconstruct: the quasispecies spectrum
– sequences
– frequencies

ViSpA Viral Spectrum Assembler

454 Sequencing Errors

Error rate ~0.1%.

Fixed number of incorporated bases vs. light intensity value.

Incorrect resolution of homopolymers =>
– over-calls (insertions): 65-75% of errors
– under-calls (deletions): 20-30% of errors

Preprocessing of Aligned Reads

1. Deletions in reads (D): replace a deletion supported by only a single read with either the allele value present in all other reads, or N.

2. Insertions into reference (I): remove insertions supported by only a single read.

3. Imputation of missing values (N).

Read Graph: Vertices

Subread = a read completely contained in some other read with ≤ n mismatches. Superread = a read that is not a subread; superreads are the vertices of the read graph.

ACTGGTCCCTCCTGAGTGT

GGTCCCTCCT

TGGTCACTCGTGAG

ACCTCATCGAAGCGGCGTCCT
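The subread/superread distinction can be checked mechanically once reads carry alignment start positions. The sketch below is illustrative only: the start positions assigned to the four example reads are assumptions (the slide does not give coordinates), and the function names are hypothetical.

```python
def count_mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def is_subread(read, other, n):
    """True if read lies entirely within other and differs in <= n positions.

    read, other: (start position on the reference, sequence).
    """
    (start_r, seq_r), (start_o, seq_o) = read, other
    if start_r < start_o or start_r + len(seq_r) > start_o + len(seq_o):
        return False
    offset = start_r - start_o
    return count_mismatches(seq_r, seq_o[offset:offset + len(seq_r)]) <= n

def find_superreads(reads, n):
    """Return the reads that are not subreads of any other read."""
    superreads = []
    for i, read in enumerate(reads):
        contained = False
        for j, other in enumerate(reads):
            if i == j or not is_subread(read, other, n):
                continue
            # Reads covering the same interval contain each other: keep the first.
            if is_subread(other, read, n) and i < j:
                continue
            contained = True
            break
        if not contained:
            superreads.append(read)
    return superreads

# Example reads from the slide, with assumed start positions.
reads = [
    (0, "ACTGGTCCCTCCTGAGTGT"),
    (3, "GGTCCCTCCT"),
    (2, "TGGTCACTCGTGAG"),
    (30, "ACCTCATCGAAGCGGCGTCCT"),
]
```

Under these assumed coordinates, `GGTCCCTCCT` matches the first read exactly and is always a subread, while `TGGTCACTCGTGAG` differs from it in two positions, so it stays a superread for n = 1 and becomes a subread for n = 2.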

Read Graph: Edges

An edge between two vertices exists if the two superreads overlap and agree on their overlap with ≤ m mismatches.

Auxiliary vertices: source and sink

Read Graph: Edge Cost

The most probable source-sink path through each vertex

Cost: uncertainty that two superreads are from the same qsps.

Overhang Δ is the shift in start positions of two overlapping superreads.

Δ

Contig Assembling

• Max Bandwidth Path through vertex
– path minimizing maximum edge cost for the path and each subpath
• Consensus of path’s superreads
– Each position: >70%-majority or N
– Weighted consensus obtained on all reads
• Remove duplicates
– Duplicated sequences = statistical evidence
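The per-position majority rule for a path's consensus can be sketched as follows. This covers only the unweighted ">70%-majority or N" step, not the weighted consensus over all reads; the `(start, sequence)` layout and the function name are assumptions made for illustration.

```python
from collections import Counter

def path_consensus(superreads, threshold=0.70):
    """Column-wise consensus of overlapping superreads.

    superreads: list of (start position, sequence) along one path.
    A base is emitted only if it holds more than `threshold` of the column;
    otherwise the position is reported as N.
    """
    start = min(s for s, _ in superreads)
    end = max(s + len(seq) for s, seq in superreads)
    consensus = []
    for pos in range(start, end):
        column = Counter(seq[pos - s] for s, seq in superreads
                         if s <= pos < s + len(seq))
        if not column:               # position not covered by any superread
            consensus.append("N")
            continue
        base, count = column.most_common(1)[0]
        total = sum(column.values())
        consensus.append(base if count > threshold * total else "N")
    return "".join(consensus)
```

For instance, three superreads reading `AAAA`, `AAAT`, and `AAAT` at the same position give `AAAN`: the last column is a 2-to-1 split, which falls short of the 70% majority.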

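$$p(s, r) = \binom{l}{k} \left(1 - \frac{t}{L}\right)^{l-k} \left(\frac{t}{L}\right)^{k}$$

where $r$ is a read of length $l$, $s$ is a qsps of length $L$, $k$ is the number of mismatches, and $t/L$ is the mutation rate.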
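The probability above is a binomial term: a read of length l differs from a qsps in exactly k of its positions when each position mutates independently at rate t/L. A minimal sketch (the function name is illustrative; `math.comb` needs Python 3.8 or later):

```python
from math import comb

def read_given_qsps_probability(l, k, t, L):
    """Binomial probability of k mismatches in a read of length l.

    t / L is the per-position mutation rate for a qsps of length L.
    """
    rate = t / L
    return comb(l, k) * (1.0 - rate) ** (l - k) * rate ** k
```

Summing this value over k = 0, ..., l gives 1, as expected for a binomial distribution.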

Expectation Maximization

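Bipartite graph:

• $Q_q$ is a candidate with frequency $f_q$
• $R_r$ is a read with observed frequency $o_r$
• Weight $h_{q,r}$ = probability that read $r$ is produced by qsps $q$ with $j$ mismatches:

$$h_{q,r} = \binom{l}{j} (1 - \varepsilon)^{l-j} \varepsilon^{j}$$

where $l$ is the read length and $\varepsilon$ is the per-base error rate.

E step:

$$p_{q,r} = \frac{f_q \, h_{q,r}}{\sum_{q'} f_{q'} \, h_{q',r}}$$

M step:

$$f_q = \frac{\sum_{r} p_{q,r} \, o_r}{\sum_{r} o_r}$$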
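The two steps translate directly into an iterative frequency estimate: the E step computes each candidate's share $p_{q,r}$ of read $r$ from $f_q h_{q,r}$, and the M step sets $f_q$ to the observed-frequency-weighted average of those shares. The sketch below assumes the weights $h_{q,r}$ are already computed and passed in as a matrix; the function name, the uniform initialization, and the fixed iteration count are illustrative choices.

```python
def em_frequencies(h, o, iterations=100):
    """Estimate candidate frequencies f_q.

    h: h[q][r] = probability that read r is produced by candidate q
    o: o[r] = observed frequency (or count) of read r
    """
    n_q, n_r = len(h), len(o)
    f = [1.0 / n_q] * n_q          # assumption: start from uniform frequencies
    total_o = sum(o)
    for _ in range(iterations):
        # E step: p[q][r] = f_q h_qr / sum_q' f_q' h_q'r
        p = [[0.0] * n_r for _ in range(n_q)]
        for r in range(n_r):
            denom = sum(f[q] * h[q][r] for q in range(n_q))
            if denom > 0.0:
                for q in range(n_q):
                    p[q][r] = f[q] * h[q][r] / denom
        # M step: f_q = sum_r p_qr o_r / sum_r o_r
        f = [sum(p[q][r] * o[r] for r in range(n_r)) / total_o
             for q in range(n_q)]
    return f
```

If every read can come from exactly one candidate, the estimate reduces to the read proportions: with `h = [[1, 0], [0, 1]]` and `o = [3, 1]` the result is `[0.75, 0.25]`.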

HCV Qsps (P. Balfe)

• 30927 reads from a 5.2Kb-long region of HCV-1a genomes
• Intravenous drug user infected for less than 3 months => mutation rate is in [1.75%, 8%]

HCV Data Statistics

• 27764 reads, average length = 292bp
• Indels: ~77% of reads
– Insertion lengths: 1 (86%), 3 (9.8%)
– Deletion lengths: 1 (98%)
• N: ~7% of reads

NJ Tree for 12 Most Frequent Qsps (No Insertions)

The top sequence accounts for 26.9% (no mismatches) and 50.4% (≤1 mismatch) of the reads.

In sum: 35.6% (no mismatches) and 64.5% (≤1 mismatch) of the reads.

The reconstructed sequence with the highest frequency is 99% identical to one of the ORFs obtained by cloning the quasispecies.

Conclusions & Future Work

• Freely available implementations of these methods at http://dna.engr.uconn.edu/software/
• Ongoing work
– Monitoring immune responses by TCR sequencing
– Isoform discovery
– Computational deconvolution of heterogeneous samples
– Reconstruction & frequency estimation of virus quasispecies from Ion Torrent reads



Acknowledgments

Immunogenomics
• Jorge Duitama (KU Leuven)
• Pramod K. Srivastava, Adam Adler, Brent Graveley, Duan Fei (UCHC)
• Matt Alessandri and Kelly Gonzalez (Ambry Genetics)

IsoEM
• Marius Nicolae (UConn)
• Alex Zelikovsky, Serghei Mangul (GSU)

ViSpA
• Alex Zelikovsky, Irina Astrovskaya, Bassam Tork, Serghei Mangul (GSU), and Kelly Westbrooks (Life Technologies)
• Peter Balfe (Birmingham University, UK)

Funding
• NSF awards IIS-0546457, IIS-0916948, and DBI-0543365
• UCONN Research Foundation UCIG grant
