The Iso-Seq Method...2015/08/24  · The term “Iso-Seq method” can refer to any transcriptome...

Post on 07-May-2020

FIND MEANING IN COMPLEXITY

© Copyright 2015 by Pacific Biosciences of California, Inc. All rights reserved. For Research Use Only. Not for use in diagnostic procedures.

Elizabeth Tseng, Ph.D.

Senior Staff Scientist

The Iso-Seq™ Method: Transcriptome Sequencing Using Long Reads

Transcription Variation → Proteomic/Gene Complexity

Slide from G. Sheynkman, ASMS talk 2014

A Single Gene Locus → Many Transcripts

Slide from G. Sheynkman, ASMS talk 2014

Short reads cannot accurately assemble complex transcripts

Steijger et al. (2013) Assessment of transcript reconstruction methods for RNA-Seq. Nature Methods. doi:10.1038/nmeth.2714

…the complexity of higher eukaryotic genomes imposes severe limitations on transcript recall and splice product discrimination…

…assembly of complete isoform structures poses a major challenge even when all constituent elements are identified…

…Ultimately, the evolution of RNA-seq will move toward single-pass determination of intact transcripts….

Iso-Seq™ Method: PacBio Transcriptome Sequencing

• Single-molecule observation

– one read = one transcript

• Sequence transcripts in full length

– full-length transcripts from 0 to 15 kb

– no assembly required

The term “Iso-Seq method” can refer to any transcriptome (cDNA) sequencing using the PacBio System, including workflows that do not follow the recommended library preparation or the Iso-Seq bioinformatics pipeline (ICE + Quiver; see later slides).

Iso-Seq Library Workflow

• Input: polyA+ RNA, or total RNA with optional poly-A selection
• Reverse transcription (SMARTScribe RT) → full-length 1st-strand cDNA
• PCR optimization
• Large-scale amplification (Phusion DNA Polymerase) → amplified cDNA
• Size selection (gel / BluePippin / SageELF) into 1-2 kb, 2-3 kb, 3-6 kb and 5-10 kb fractions
• Re-amplification (Phusion DNA Polymerase) of each size fraction
• SMRTbell template preparation for each size fraction
• Optional size selection (BluePippin), shown for the 3-6 kb and 5-10 kb fractions
• SMRT Sequencing

Size cuts can be arbitrary

Current max FL transcript seen: 15 kb
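
Because the size cuts can be arbitrary, it can help to keep them as data rather than hard-coding them. The sketch below is only an illustration: the function name `size_fractions` and the half-open boundary convention (lower bound included, upper bound excluded) are assumptions, not part of the Iso-Seq protocol. The bin edges are the ones shown on the slide.

```python
# Size fractions from the slide, in kb. Note that 3-6 kb and 5-10 kb overlap.
SIZE_BINS_KB = [(1, 2), (2, 3), (3, 6), (5, 10)]


def size_fractions(length_bp, bins_kb=SIZE_BINS_KB):
    """Return the labels of every size fraction a cDNA of this length falls into.

    Assumption: bins are half-open, i.e. low <= length < high.
    """
    length_kb = length_bp / 1000
    return [f"{low}-{high} kb" for low, high in bins_kb if low <= length_kb < high]


print(size_fractions(1500))  # a 1.5 kb cDNA
print(size_fractions(5500))  # a 5.5 kb cDNA sits in two overlapping fractions
```

Passing a different `bins_kb` list is enough to model a different set of cuts.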

Full-Length (FL) read identification

Full-Length = 5’ primer seen, polyA tail seen, 3’ primer seen

• Identify and remove primers and polyA/T tail

• Identify transcript strandedness
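
The full-length rule above can be sketched in a few lines of Python. This is a minimal illustration, not the classifier shipped with the Iso-Seq pipeline: the primer sequences, the minimum polyA length and the function names are all hypothetical placeholders, and matching is exact, whereas a real tool would tolerate sequencing errors in the primers.

```python
import re

# Hypothetical placeholder primers; real library primers are longer.
FIVE_PRIMER = "GGTCAACG"   # expected at the start of a sense-strand read
THREE_PRIMER = "CTACTGAT"  # expected after the polyA tail on the sense strand
MIN_POLYA = 8              # assumed minimum polyA tail length

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def _trim_sense(seq):
    """Return the insert if seq is 5' primer + insert + polyA + 3' primer, else None."""
    if not (seq.startswith(FIVE_PRIMER) and seq.endswith(THREE_PRIMER)):
        return None
    body = seq[len(FIVE_PRIMER):len(seq) - len(THREE_PRIMER)]
    tail = re.search(r"A{%d,}$" % MIN_POLYA, body)
    if tail is None:
        return None
    return body[:tail.start()]


def classify_read(seq):
    """Label a read as full-length or not, and report strand and trimmed insert."""
    seq = seq.upper()
    for strand, candidate in (("+", seq), ("-", revcomp(seq))):
        insert = _trim_sense(candidate)
        if insert is not None:
            return {"full_length": True, "strand": strand, "insert": insert}
    return {"full_length": False, "strand": None, "insert": None}
```

A read seen in antisense orientation starts with the reverse complement of the 3' primer followed by a polyT stretch, so trying the reverse complement of the read covers both the polyA/T trimming and the strandedness call.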

Bioinformatics Challenge

SAMPLE INPUT
ATTTAAGGCC ATTTAAGGCC ATTTAAGGCC
GCCATG GCCATG
TATAGGCAAGTAACGTT TATAGGCAAGTAACGTT

SEQUENCING OUTPUT
ATTCAAGGCC AATTAGGGC TTTAGGCC AAT GGCCATTG
GCCATG
TATAGGCAAGTACGTT TATAGGGGCAAGTAACGTT

Need to recover the original sequence → Error Correction

POST-ERROR CORRECTION
ATTTAAGGCC: 3
GCCATG: 2
TATAGGCAAGTAACGTT: 2
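
The idea of grouping noisy reads and counting each group can be sketched with a toy greedy clustering by edit distance. This is not the ICE algorithm: the function names, the threshold `max_dist` and the choice of the first read in a cluster as its representative are assumptions made for illustration, and the example reads are made-up noisy copies of the three sample sequences rather than the reads drawn on the slide.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def greedy_cluster(reads, max_dist=2):
    """Assign each read to the first cluster whose representative is within max_dist."""
    clusters = []  # list of (representative, members)
    for read in reads:
        for representative, members in clusters:
            if edit_distance(read, representative) <= max_dist:
                members.append(read)
                break
        else:
            clusters.append((read, [read]))
    return clusters


reads = ["ATTTAAGGCC", "ATTCAAGGCC", "ATTTAAGGC",
         "GCCATG", "GCCATG",
         "TATAGGCAAGTAACGTT", "TATAGGCAAGTACGTT"]
for representative, members in greedy_cluster(reads):
    print(f"{representative}: {len(members)}")
```

With these reads the cluster sizes come out as 3, 2 and 2, matching the counts on the slide. The representative here is simply the first read seen, so a real pipeline still needs a consensus step to remove the errors that read carries.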

Error Correction: Three Approaches

Tool              Author                 Genome-Guided   Hybrid (long + short reads)   Abundance Inference
ToFU (RS_IsoSeq)  Liz T.                 N               N                             (not really)
CONVEX            Meisam R. (David T.)   N               N                             Y
LSC + IDP         Kin Fai A.             Y               Y                             Y


ToFU: The ICE + Quiver error correction pipeline

Transcript isOforms: Full-length and Unassembled

ToFU is available through SMRT Analysis (RS_IsoSeq) and GitHub (ToFU)

Methods are available in the paper's supplement

• de novo (no ref genome required)

• no assembly

• can handle any read length

• works for mixed accuracy

• post-Quiver: 99-100% accuracy

ToFU pipeline: classify → cluster (ICE) → Quiver polishing

• Per-molecule reads (ReadsOfInsert, aka CCS reads) are classified into full-length (FL) reads and non-FL (nFL) reads

• ICE groups the FL reads into isoform-level clusters (Transcript 1, Transcript 2, Transcript 3)

• FL + nFL reads are used to build clusters of transcript alignments

• Quiver produces the final transcript consensus for each cluster
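
To see why pooling several reads per cluster raises accuracy, a per-column majority vote is enough. The sketch below deliberately swaps in this much simpler technique for Quiver, which is a model-based consensus caller. The function name `majority_consensus` is hypothetical, and the reads are assumed to be already aligned and of equal length, which sidesteps the insertions and deletions a real polisher has to handle.

```python
from collections import Counter


def majority_consensus(aligned_reads):
    """Return the most common base at each column of pre-aligned, equal-length reads."""
    if not aligned_reads:
        raise ValueError("need at least one read")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be pre-aligned to the same length")
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*aligned_reads))


# Three reads of one cluster, two of them carrying a single substitution error.
cluster = ["ATTTAAGGCC", "ATTCAAGGCC", "ATTTAAGGGC"]
print(majority_consensus(cluster))
```

Each error appears in only one of the three reads, so the vote at every column recovers the original base.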

ToFU reveals transcriptional complexity in P. crispa

Gray are single gene transcripts

Green are polycistronic transcripts that span 2+ genes

Top: Short read mapping

Bottom: PacBio transcripts

Gordon & Tseng, 2015

From Novel Transcripts to Novel Proteins

Sheynkman, ASMS talk 2014

PacBio public MCF-7 dataset

• ~90% of predicted ORFs matched mass spec peptides

• 251 novel ORFs found unique to MCF-7

For Research Use Only. Not for use in diagnostic procedures. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell and Iso-Seq are trademarks of Pacific Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners.