Ngs intro_v6_public

An Introduction to NGS (Next Generation Sequencing). Part 1: principle, machines and comparative analysis. By François PAILLIER

Page 1: Ngs intro_v6_public

An Introduction to NGS (Next Generation Sequencing)

François Paillier - 22/02/2011

Page 2: Ngs intro_v6_public

Plan

• [ Reminder about Sanger Sequencing ]
• NGS Definition
• Overview of NGS technologies
• NGS Applications & examples
• Conclusion

NOT discussed here: sequence accuracy, assembly and sampling; NGS data analysis & bioinformatics tools

Page 3: Ngs intro_v6_public

A word about Sanger Sequencing (first generation sequencing machine, video)

Still a gold standard, but capillary sequencing has reached its technical limits (costs and performance will remain unchanged)

3730xl

Principle (only the tube G + dideoxyG)

From gel to capillary

Page 4: Ngs intro_v6_public

Short Reminder about « Classical » Assembly projects

Sample Libraries

n Sequencing sub-projects

Finishing: Draft (Q40)

Assembly

Annotation

Annotated Genome

Target genome

Clone selection &

Sequencing

SubTargets (BACs, cosmids, ..)

Assembly

Other strategy: WGS

Cloning

Page 5: Ngs intro_v6_public

Sequencing, what for ?

Assembly projects for example

In bioinformatics, sequence assembly refers to aligning and merging fragments of a much longer DNA sequence in order to reconstruct the original sequence. This is needed because DNA sequencing technology cannot read whole genomes in one go, but only small pieces of between 20 and 1000 bases, depending on the technology used. Typically the short fragments, called reads, result from shotgun sequencing of genomic DNA or of gene transcripts (ESTs).
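The idea of merging overlapping reads can be sketched with a toy greedy assembler. This is only an illustration under strong assumptions (error-free reads, a single strand, no repeats); the function names `overlap` and `greedy_assemble` and the example reads are hypothetical, and real assemblers are far more elaborate.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences sharing the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to merge: these are the contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three made-up reads sampled from the made-up target "ATGGCGTACGTTAGC"
reads = ["ATGGCGTA", "GCGTACGT", "ACGTTAGC"]
print(greedy_assemble(reads))
```

With these reads the overlaps chain into a single contig; reads that share no overlap of at least `min_len` bases would stay as separate contigs, which is where gaps and scaffolds come in.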

Target genome

Sequencing

Assembly

Consensus

Assembled reads

reads

4X local coverage

scaffold

gap gap gap

Page 6: Ngs intro_v6_public

Vocabulary that should be kept in mind in the sequencing field

• Assembly: the result of clustering sequences based on their local similarity

• Contig: a set of overlapping DNA segments

• Coverage (in sequencing): the mean number of times a nucleotide is sequenced in a genome (example: 10X coverage)

• Scaffold: a series of contigs that are in the right order but not necessarily connected in one contiguous stretch

• Mate pairs: two sequences known to come from the 3′ and 5′ ends of a single clone

• WGS: whole genome shotgun sequencing strategy

• ESS: environmental shotgun sequencing
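As a worked example of the coverage definition, mean coverage can be estimated as (number of reads × read length) / genome size. The sketch below uses the 454 Titanium figures quoted in this deck (about 1 million reads of about 400 nt per run); the 5 Mb genome size and the function names are assumptions chosen for illustration.

```python
import math


def mean_coverage(n_reads: int, read_len: int, genome_size: int) -> float:
    """Mean number of times each base is sequenced (the 'X' in 10X)."""
    return n_reads * read_len / genome_size


def reads_needed(target_coverage: float, read_len: int, genome_size: int) -> int:
    """Minimum number of reads required to reach a target mean coverage."""
    return math.ceil(target_coverage * genome_size / read_len)


GENOME_SIZE = 5_000_000  # hypothetical 5 Mb prokaryotic genome (assumption)

print(mean_coverage(1_000_000, 400, GENOME_SIZE))  # one 454 Titanium run
print(reads_needed(10, 400, GENOME_SIZE))          # reads for 10X coverage
```

This is a mean only: local coverage varies along the genome, so some regions end up with few or no reads even when the average looks comfortable.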

Page 7: Ngs intro_v6_public

NGS = Next Generation Sequencing

After PCR, THE new revolution in Biology ?

Page 8: Ngs intro_v6_public

First Generation: SANGER Sequencing

Second Generation: NGS = Massively Parallel Sequencing

Third Generation: NGS = HTS, Single Molecule Sequencing

NGS synonym: High-throughput Sequencing (HTS)

Page 9: Ngs intro_v6_public

Overview of current NGS technologies (second generation sequencing machines)

Roche, 454 GS-FLX Titanium Protocol a must

Illumina, GA1 then GA2

Applied Bio., Solid v3

Each machine comes with different:
- Throughput
- Sequence accuracy
- Data formats (and programs)

Year: 2005*, 2006, 2007

*NGS “proof of principle” was done in 2000 by Lynx Therapeutics, which published and marketed "MPSS", a parallelized, adapter/ligation-mediated, bead-based sequencing technology, launching "next-generation" sequencing.

Page 10: Ngs intro_v6_public

Throughput per Illumina Channel

Page 11: Ngs intro_v6_public

HOW is it Possible ?

Page 12: Ngs intro_v6_public

NGS Principle

Building sequencing devices at nanoscale

Polony: discrete clonal amplifications of a single DNA molecule, grown in a gel matrix. The clusters can then be individually sequenced, producing short reads. Polony-based sequencing is the basis of most second generation sequencers.

A typical NGS Workflow is:

1) Library construction

2) Template CLONAL amplification

3) Massively PARALLEL sequencing

Page 13: Ngs intro_v6_public

High Parallelism is Achieved in Polony Sequencing

Polony / Sanger

Page 14: Ngs intro_v6_public

Generation of Polony array: DNA Beads (454, SOLiD)

DNA Beads are generated using Emulsion PCR

Page 15: Ngs intro_v6_public

Generation of Polony array: DNA Beads (454, SOLiD)

DNA Beads are placed in wells

Page 16: Ngs intro_v6_public

Sequencing: Pyrosequencing (454)

DNA Polymerase

« pyrogram » / « Flowgram »
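How a flowgram turns into bases can be sketched as follows: nucleotides are flowed over the wells in a fixed cyclic order, and the light signal of each flow is roughly proportional to the number of identical bases incorporated. The flow order "TACG", the signal values and the function name are assumptions for illustration, not the instrument's actual base-calling software.

```python
def decode_flowgram(signals: list[float], flow_order: str = "TACG") -> str:
    """Turn per-flow light intensities into a base sequence.

    Each signal is rounded to the nearest integer, read as the length of
    the homopolymer incorporated during that flow (0 means no incorporation).
    """
    bases = []
    for k, signal in enumerate(signals):
        nucleotide = flow_order[k % len(flow_order)]
        count = int(signal + 0.5)  # round to the nearest homopolymer length
        bases.append(nucleotide * count)
    return "".join(bases)


# Made-up intensities for two cycles of T, A, C, G flows
clean = [1.02, 0.05, 0.98, 2.10, 0.03, 3.04, 0.96, 0.01]
print(decode_flowgram(clean))

# Same flows, but the long A homopolymer gives an ambiguous signal
noisy = [1.02, 0.05, 0.98, 2.10, 0.03, 3.55, 0.96, 0.01]
print(decode_flowgram(noisy))
```

In the second call, a signal halfway between 3 and 4 is called as four A's instead of three. This is why the main error type of pyrosequencing is an insertion or deletion inside homopolymers rather than a mismatch.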

Page 17: Ngs intro_v6_public

454 Process: Emulsion PCR & Pyrosequencing

Titanium:
- Read lengths approx. 400 nt
- 1 million reads / run
- 400 Mb / day

Videos:
- About pyrosequencing (1’53’’): <here>
- Summary about GS FLX (4’34’’): <click here>

Page 18: Ngs intro_v6_public
Page 19: Ngs intro_v6_public

454 GS FLX Titanium

No more cloning step

From purified DNA to sequencing

Fits on the laboratory bench top / small

LONG sequences (400 nt)

GS Junior system not so expensive

Capabilities: multiplexing & paired-ends

Well fitted to: prokaryotic genome sequencing, RNA-seq

- Sequence accuracy not so high (especially in the case of homopolymers); main error type is indel
- Cost: approx. 20K€ / Gb. Cost per base is cheaper than Sanger but still high compared with other next-gen machines

Page 20: Ngs intro_v6_public

Illumina*: Bridge PCR

GA2x version:
- Read lengths approx. 100 nt
- 240 million reads
- 1500 Mb / day
- 30000 Mb / run

Page 22: Ngs intro_v6_public

A flow cell: 8 lanes

Illumina chemistry: 4-color DNA sequencing-by-synthesis using reversible terminators with removable fluorescent dyes

Page 23: Ngs intro_v6_public

Illumina seq. Accuracy
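Sequence accuracy is commonly reported as a Phred quality score, Q = -10 × log10(P), where P is the probability that a base call is wrong; the "Q40" quoted earlier for finished drafts is such a score. A minimal sketch of the conversion, with hypothetical function names:

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with quality Q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred quality score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def expected_errors(read_len: int, q: float) -> float:
    """Expected number of wrong bases in a read of uniform quality Q."""
    return read_len * phred_to_error_prob(q)


for q in (20, 30, 40):
    print(f"Q{q}: about 1 error in {round(1 / phred_to_error_prob(q))} bases")

# A 100 nt read in which every base is Q30 (uniform quality is an assumption)
print(expected_errors(100, 30))
```

Real reads do not have uniform quality; the per-base scores usually vary along the read, so this is only a back-of-the-envelope estimate.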

Page 24: Ngs intro_v6_public

Illumina Throughput

Page 25: Ngs intro_v6_public

Illumina

No more cloning step

From purified DNA to sequencing

Fits on the laboratory bench top / small

Good sequence accuracy

Capabilities: multiplexing & paired-ends

Cost: approx. 2K€ / Gb; cost per base is cheaper than 454

Well fitted to: prokaryotic genome sequencing, RNA-seq, ChIP-Seq, Methyl-Seq

- Machine is very expensive

Main error type is mismatch

- Read lengths are still too short: not fitted to big genomes (repeats)
- Poor coverage of AT-rich regions
- Most widely used NGS platform
- Requires least DNA

Page 26: Ngs intro_v6_public

SOLiD system: 4-color DNA Sequencing by Ligation

SOLiD V3:
- Read lengths approx. 50 nt
- 400 million reads
- 1500 Mb / day
- 20000 Mb / run
- 1500€ / Gb

<Watch Video> 4’46’’

Page 27: Ngs intro_v6_public

Sequencing by ligation reaction: Fluorescently Labeled Nucleotides (ABI SOLiD)

Complementary strand elongation: DNA Ligase

Page 28: Ngs intro_v6_public

Sequencing by ligation ABI SOLiD

Page 29: Ngs intro_v6_public

Sequencing: Fluorescently Labeled Nucleotides (ABI SOLiD)

5 reading frames, each position is read twice
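One way to see why reading each position twice helps is the two-base ("color space") encoding used by SOLiD: each color reports a pair of adjacent bases, so every base contributes to two colors. The sketch below uses the usual 4-color scheme, which is equivalent to XOR-ing 2-bit base codes; the function names and the example sequence are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"


def to_colors(seq: str) -> list[int]:
    """Encode each pair of adjacent bases as one of four colors (0-3)."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]


def from_colors(first_base: str, colors: list[int]) -> str:
    """Decode a color string, given the known first base."""
    bases = [first_base]
    for color in colors:
        bases.append(BASE[CODE[bases[-1]] ^ color])
    return "".join(bases)


reference = "ATGGCA"
colors = to_colors(reference)
print(colors, from_colors("A", colors))

# A true SNP (G -> C at the third base) changes two adjacent colors
print(to_colors("ATCGCA"))

# A single wrong color, decoded naively, corrupts every downstream base
bad = colors.copy()
bad[1] = 2
print(from_colors("A", bad))
```

A real variant therefore shows up as two adjacent color changes, while an isolated color change is more likely a measurement error. It also illustrates why the technology is less intuitive: color reads have to be handled in color space rather than decoded base by base.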

Page 30: Ngs intro_v6_public

Sequencing: Fluorescently Labeled Nucleotides (ABI SOLiD)

Page 31: Ngs intro_v6_public

SOLiD

No more cloning step

From purified DNA or RNA to sequencing

Fits on the laboratory bench top / small

Good sequence accuracy

Capabilities: multiplexing & paired-ends

Cost: approx. 1.5K€ / Gb; cost per base is cheaper than Illumina

Well fitted to: resequencing, RNA-seq, ChIP-Seq, Methyl-Seq

- This technology is NOT intuitive
- Machine is VERY expensive
- HUGE amount of data produced (1500 Gb !!)
- Long run times
- It has been demonstrated that certain reads don’t match the reference!

Page 32: Ngs intro_v6_public

Focusing NGS effort on predefined targets: « Target Enrichment » Technology (Capture Array)

Page 33: Ngs intro_v6_public

Focusing NGS effort on predefined targets: « Target Enrichment » Technology (Capture Beads)

Page 34: Ngs intro_v6_public

Summary : NGS Workflows

Source: BCG

+/- Target Enrichment Strategy

Page 35: Ngs intro_v6_public

Prokaryotic Genome Sequencing Project as a mix of NGS technologies

Conclusion:

- High quality drafts can be produced for small genomes without any Sanger data input.
- We found that 454 GSFLX and Solexa/Illumina show great complementarity in producing large contigs and supercontigs with a low error rate.

Page 36: Ngs intro_v6_public

NGS Applications

• In different fields…
– Metagenomics
– Genomics
– Transcriptomics
– Proteomics

DEEPER insight into biological processes

BROADER sampling of populations (cells, viruses, ecosystems…)

Page 37: Ngs intro_v6_public

Genome
* De novo sequencing
* Targeted resequencing (SNP, indel, CNV)
* Whole genome resequencing
* Metagenome analyses

Transcriptome
* Gene expression profiling
* Small RNA analysis
* Whole transcriptome analysis

Epigenome
* Chromatin immunoprecipitation sequencing (ChIP-Seq)
* Methylation analysis

…for different purposes…
- Towards personalized medicine
- Biodiversity assessment
- De novo sequencing of prokaryotic or eukaryotic genomes (or re-sequencing)
- RNA-Seq annotation of eukaryotic genomes
- SNP calling: identification of mutations
- ChIP-Seq: identification of DNA/protein interactions

Page 38: Ngs intro_v6_public
Page 39: Ngs intro_v6_public

What is the current impact of NGS on Biology ?

• Both transcriptomics and genomics can now be addressed using one technology, with higher accuracy and robustness (instead of, e.g., Sanger sequencing + µarrays). Example: RNA-Seq

• SNP calling can rely on ultra-deep assemblies

• Whole genome overview of transcription factors binding sites

• Biodiversity assessment (Metagenomics projects)

• And so much more…

Page 40: Ngs intro_v6_public

About whole-exome sequencing:

« For the First Time, DNA Sequencing Technology Saves A Child's Life »

« Proponents of genetic medicine say DNA sequencing is the future of medicine and that soon every truly sick person will have his or her genome sequenced. Critics cite privacy concerns and note that genetic mutations and variations don’t necessarily lead to medical outcomes. Whatever the position, it’s hard to argue that this isn’t good news: the first child – plagued by undiagnosable illness – has been saved by DNA sequencing.

That may be a bit of a strong statement – six-year-old Nicholas Volker is doing well, though complications could soon arise. But it’s highly likely that the sequencing of young Nicholas’s genome saved his life. »

<Link> <Article>

Mayer et al., Genetics in Medicine, Volume xx, Number xx, 01 2011

Page 41: Ngs intro_v6_public

What’s Next ?

Second Generation: NGS = Massively Parallel Sequencing (polony sequencing)

Third Generation:
- Single molecule sequencing (no bias)
- Faster
- Cheaper (or not)
- 1000€ human genome ?

Roche, 454 GS-FLX Titanium

Illumina, GA2

Applied BioSys, Solid v3

PacBio, Ion Torrent

Page 42: Ngs intro_v6_public

Conclusion: impact of NGS

Global shift to sequencing-based technologies

Great improvements ongoing: higher throughput, longer reads

Is it the end of µarrays? Or will they become a sub-part of NGS workflows, restricted to target enrichment?

Is it the end of forward genetics? Reverse genetics only?

Biologists' education should integrate NGS knowledge

Is it the end of « big sequencing centers »? A change in their mission?

Next bottleneck: bioinformatics

- Storing data is a problem (SRA soon down ?) AND IT network speed is FAR too low: it is very difficult to share NGS data. Fridges instead of disks !?
- Analyzing data is a problem: great improvements, but a lot of work remains to be done

Page 43: Ngs intro_v6_public
Page 44: Ngs intro_v6_public

Thanks for your attention !

Page 45: Ngs intro_v6_public

Technology Summary

Platform         Read length  Sequencing technology  Throughput (per run)  Cost (1 Mbp)*
Sanger           ~800 bp      Sanger                 400 kbp               $500
454              ~400 bp      Polony                 500 Mbp               $60
Solexa/Illumina  75 bp        Polony                 20 Gbp                $2
SOLiD            75 bp        Polony                 60 Gbp                $2
Helicos          30-35 bp     Single molecule        25 Gbp                $1

*Source: Shendure & Ji, Nat Biotech, 2008
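To make the cost column concrete, raw sequencing cost can be estimated as genome size (in Mbp) × coverage × cost per Mbp. The sketch below uses the per-Mbp costs from the table above; the 5 Mbp genome, the 20X coverage and the function name are assumptions for illustration, and library preparation, labor and analysis costs are ignored.

```python
# Cost per Mbp in USD, copied from the Technology Summary table
COST_PER_MBP_USD = {
    "Sanger": 500,
    "454": 60,
    "Solexa/Illumina": 2,
    "SOLiD": 2,
    "Helicos": 1,
}


def sequencing_cost(genome_mbp: float, coverage: float, cost_per_mbp: float) -> float:
    """Raw sequencing cost: total Mbp to produce times the cost of one Mbp."""
    return genome_mbp * coverage * cost_per_mbp


GENOME_MBP = 5   # hypothetical prokaryotic genome (assumption)
COVERAGE = 20    # hypothetical target coverage (assumption)

for platform, cost in COST_PER_MBP_USD.items():
    total = sequencing_cost(GENOME_MBP, COVERAGE, cost)
    print(f"{platform:16s} ${total:>8,.0f}")
```

In practice the coverage needed differs between platforms (short reads usually call for deeper coverage than long reads), so a fair comparison would use a platform-specific coverage rather than one shared value.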

Page 46: Ngs intro_v6_public

NGS Technology Comparison

Cost
- ABI SOLiD: SOLiD 4: $495k; SOLiD PI: $240k
- Illumina GA: IIe: $470k; IIx: $250k; HiSeq: $690k
- 454 Roche FLX: Titanium: $500k

Quantity of data per run
- ABI SOLiD: SOLiD 4: 100 Gb; SOLiD PI: 50 Gb
- Illumina GA: IIe: 20-38 Gb; IIx: 50-95 Gb; HiSeq: 200 Gb+
- 454 Roche FLX: 450 Mb

Run time
- ABI SOLiD: 7 days
- Illumina GA: 4 days
- 454 Roche FLX: 9 hours

Pros
- ABI SOLiD: low error rate due to dibase probes
- Illumina GA: most widely used NGS platform; requires least DNA
- 454 Roche FLX: short run time; long reads better for de novo sequencing

Cons
- ABI SOLiD: long run times; it has been demonstrated that certain reads don’t match the reference
- Illumina GA: least multiplexing capability of the 3; poor coverage of AT-rich regions
- 454 Roche FLX: expensive reagent cost; difficulty reading homopolymer regions

Source: The University of Western Ontario