CRC Project on Robust Transcript Discovery and Quantification from Sequencing Data


Dec. 22, 2011 live call

UCONN: Ion Mandoiu, Sahar Al Seesi

GSU: Alex Zelikovsky, Serghei Mangul, Adrian Caciula

Lifetech PI: Dumitru Brinza

Outline

1. SNV calling from RNA-Seq reads
2. Transcriptome reconstruction update

SNV Calling from RNA-Seq Reads

• RNA-Seq typically used for gene expression analysis
• SNV calling from RNA-Seq data?
• Much less expensive than genome sequencing
• Motivated by project in personalized genomic-guided immunotherapy, where we only need expressed variants

Hybrid Approach Based on Merging Alignments

[Figure: workflow. mRNA reads go through both transcript library mapping and genome mapping; the transcript mapped reads and genome mapped reads are combined by read merging into the final mapped reads.]

Converting Transcriptome Alignments to Genome Coordinates

[Figure: transcriptome alignments are converted to genome coordinates.]
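The conversion step can be illustrated with a minimal sketch. It assumes a transcript is given as a sorted list of exon intervals in genome coordinates (0-based, half-open, forward strand only) and that the transcriptome alignment is gapless; the function name `transcript_to_genome` and the coordinates are hypothetical and not taken from the project code.

```python
def transcript_to_genome(exons, t_start, length):
    """Map a gapless transcript alignment to genome blocks.

    exons:   sorted list of (genome_start, genome_end) half-open intervals
             (forward strand only, an assumption of this sketch).
    t_start: 0-based start of the alignment on the transcript.
    length:  number of aligned bases.
    Returns a list of (genome_start, genome_end) blocks; more than one
    block means the read spans a splice junction.
    """
    blocks = []
    offset = 0          # transcript coordinate of the current exon's first base
    pos = t_start       # next transcript position still to be placed
    remaining = length
    for g_start, g_end in exons:
        exon_len = g_end - g_start
        if remaining > 0 and pos < offset + exon_len:
            start_in_exon = pos - offset
            take = min(remaining, exon_len - start_in_exon)
            blocks.append((g_start + start_in_exon,
                           g_start + start_in_exon + take))
            pos += take
            remaining -= take
        offset += exon_len
    if remaining > 0:
        raise ValueError("alignment extends past the end of the transcript")
    return blocks


# A 20-base alignment starting 10 bases before the end of the first exon
# is split into two genome blocks around the intron.
print(transcript_to_genome([(100, 150), (200, 260)], 40, 20))
```

In SAM terms, the gap between consecutive blocks would become an intron (N) operation in the CIGAR string of the converted alignment.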

Merging Rules for Short Reads

Genome       Transcripts   Agree?   Hard Merge
Unique       Unique        Yes      Keep
Unique       Unique        No       Throw
Unique       Multiple      No       Throw
Unique       Not mapped    No       Keep
Multiple     Unique        No       Throw
Multiple     Multiple      No       Throw
Multiple     Not mapped    No       Throw
Not mapped   Unique        No       Keep
Not mapped   Multiple      No       Throw
Not mapped   Not mapped    Yes      Throw
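The rule table can be condensed into a small decision function. This is a sketch only: the status strings and the function name `hard_merge` are hypothetical, and `agree` is assumed to mean that the unique genome hit and the unique transcriptome hit coincide once both are in genome coordinates.

```python
def hard_merge(genome, transcriptome, agree=False):
    """Return 'keep' or 'throw' for one short read.

    genome / transcriptome: 'unique', 'multiple', or 'unmapped', the
    mapping status of the read against each reference.
    agree: True when both unique hits land on the same genomic location.
    """
    if genome == "unique" and transcriptome == "unique":
        return "keep" if agree else "throw"
    # A unique hit on one side is kept only if the other side is unmapped.
    if {genome, transcriptome} == {"unique", "unmapped"}:
        return "keep"
    return "throw"


print(hard_merge("unique", "unique", agree=True))
print(hard_merge("unmapped", "unique"))
print(hard_merge("unique", "multiple"))
```

Only three of the ten rows lead to "keep": agreeing unique hits, and a unique hit in one reference with no hit in the other.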

Merging Local Alignments of ION Reads: HardMerge at Base-Level

• Input: SAM files with alignments from genome and transcriptome mapping

• The following alignments are filtered out
  – Any local alignments of length <= 15 bases
  – All alignments of a read that has alignments on different chromosomes or different strands

• Key idea: a read base mapped to multiple locations is discarded

• Output alignments are generated from contiguous stretches of non-ambiguously mapped bases, based on the unique genomic location of these bases
  – Subject to the above filtering criteria
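A minimal sketch of the base-level merge described above, under simplifying assumptions: each local alignment is gapless and already in genome coordinates, and is represented as a hypothetical tuple `(chrom, strand, read_start, genome_start, length)` rather than a parsed SAM record. The length filter is applied before the chromosome/strand check; the slides do not state the order, so that ordering is an assumption.

```python
MIN_LEN = 15  # alignments of length <= MIN_LEN are filtered out


def hard_merge_read(alignments):
    """Merge the local alignments of one read at base level.

    alignments: list of (chrom, strand, read_start, genome_start, length).
    Returns merged alignments in the same tuple format.
    """
    alns = [a for a in alignments if a[4] > MIN_LEN]
    if not alns:
        return []
    if len({(a[0], a[1]) for a in alns}) > 1:
        return []  # hits on different chromosomes or strands: drop the read
    chrom, strand = alns[0][0], alns[0][1]

    # Collect every genomic position each read base is mapped to.
    locations = {}
    for _, _, r_start, g_start, length in alns:
        for k in range(length):
            locations.setdefault(r_start + k, set()).add(g_start + k)

    # Keep only read bases with a single genomic location.
    unique = sorted((r, next(iter(g)))
                    for r, g in locations.items() if len(g) == 1)

    # Group into stretches that are contiguous in both read and genome.
    runs, run = [], []
    for r, g in unique:
        if run and (r != run[-1][0] + 1 or g != run[-1][1] + 1):
            runs.append(run)
            run = []
        run.append((r, g))
    if run:
        runs.append(run)

    # Output stretches are subject to the same length filter.
    return [(chrom, strand, s[0][0], s[0][1], len(s))
            for s in runs if len(s) > MIN_LEN]


# Two overlapping alignments that agree on the shared bases merge into one.
print(hard_merge_read([("chr1", "+", 0, 1000, 40), ("chr1", "+", 20, 1020, 40)]))
```

When the two inputs disagree on the overlapping bases, those bases are discarded and the flanking unambiguous stretches are reported separately, provided each is longer than 15 bases.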

HardMerge Example

Input alignments in genome coordinates:

Filter multiple local alignments/sub-alignments

Output alignment:

SNV Detection and Genotyping

Reference:

AACGCGGCCAGCCGGCTTCTGTCGGCCAGCAGCCAGGAATCTGGAAACAATGGCTACAGCGTGC

Reads:

AACGCGGCCAGCCGGCTTCTGTCGGCCAGCCGGCAG
  CGCGGCCAGCCGGCTTCTGTCGGCCAGCAGCCCGGA
   GCGGCCAGCCGGCTTCTGTCGGCCAGCCGGCAGGGA
      GCCAGCCGGCTTCTGTCGGCCAGCAGCCAGGAATCT
          GCCGGCTTCTGTCGGCCAGCAGCCAGGAATCTGGAA
               CTTCTGTCGGCCAGCCGGCAGGAATCTGGAAACAAT
                      CGGCCAGCAGCCAGGAATCTGGAAACAATGGCTACA
                         CCAGCAGCCAGGAATCTGGAAACAATGGCTACAGCG
                         CAAGCAGCCAGGAATCTGGAAACAATGGCTACAGCG
                            GCAGCCAGGAATCTGGAAACAATGGCTACAGCGTGC

Locus i

r(i) : Base call of read r at locus i

εr(i) : Probability of error reading base call r(i)

Gi : Genotype at locus i

SNV Detection and Genotyping

• Use Bayes rule to calculate posterior probabilities and pick the genotype with the largest one

SNVQ Model

• Calculate conditional probabilities by multiplying contributions of individual reads
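A sketch of how such a posterior computation could look, using the notation above: r(i) are the base calls at locus i, εr(i) their error probabilities, and Gi the genotype. The specific likelihood used here is an assumption, not something stated on the slides: each read is taken to come from either allele of a diploid genotype with equal probability, and a miscalled base is equally likely to be any of the other three bases. The flat prior and all function names are also hypothetical.

```python
from itertools import combinations_with_replacement
from math import exp, log

BASES = "ACGT"


def base_likelihood(call, err, allele):
    """P(observed base call | true allele); errors spread over 3 bases."""
    return 1.0 - err if call == allele else err / 3.0


def genotype_posteriors(calls, errs, priors=None):
    """Posterior P(Gi | reads) for the 10 unordered diploid genotypes.

    calls: base calls r(i) of the reads covering locus i.
    errs:  matching error probabilities, each assumed to be in (0, 1).
    """
    genotypes = list(combinations_with_replacement(BASES, 2))
    if priors is None:
        # Flat prior over genotypes (an assumption of this sketch).
        priors = {g: 1.0 / len(genotypes) for g in genotypes}
    log_post = {}
    for h1, h2 in genotypes:
        total = log(priors[(h1, h2)])
        # Multiply the contributions of individual reads (sum of logs).
        for call, err in zip(calls, errs):
            total += log(0.5 * base_likelihood(call, err, h1)
                         + 0.5 * base_likelihood(call, err, h2))
        log_post[(h1, h2)] = total
    top = max(log_post.values())
    weights = {g: exp(v - top) for g, v in log_post.items()}
    norm = sum(weights.values())
    return {g: w / norm for g, w in weights.items()}


def call_genotype(calls, errs):
    """Pick the genotype with the largest posterior probability."""
    post = genotype_posteriors(calls, errs)
    return max(post, key=post.get)


print(call_genotype("AACCACCA", [0.01] * 8))
```

Working in log space avoids underflow at high coverage. For RNA-Seq the equal-allele assumption can fail under allele-specific expression, so the mixing weights are one of the first things a real model would revisit.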

ERCC SNV Simulation

• Random SNVs were inserted into the ERCC reference with probability 0.005

• The modified ERCC sequences were appended to the reference genome

• For each ERCC, a single-exon transcript annotation was added to the Ensembl64 transcript library (GTF).

• tmap indices were built for the reference genome and transcriptome, including the ERCCs with the simulated SNVs
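The SNV insertion step can be sketched as follows. The per-base probability 0.005 is from the slide; the function name `insert_random_snvs`, the fixed seed, and the uniform choice among the three alternative bases are assumptions made for illustration.

```python
import random


def insert_random_snvs(seq, rate=0.005, seed=0):
    """Return (mutated_seq, snvs), where snvs lists (pos, ref, alt).

    Each A/C/G/T base is substituted independently with probability
    `rate`; the alternative base is drawn uniformly from the other three.
    """
    rng = random.Random(seed)  # fixed seed so the simulation is repeatable
    out, snvs = [], []
    for pos, ref in enumerate(seq):
        if ref in "ACGT" and rng.random() < rate:
            alt = rng.choice([b for b in "ACGT" if b != ref])
            snvs.append((pos, ref, alt))
            out.append(alt)
        else:
            out.append(ref)
    return "".join(out), snvs


mutated, truth = insert_random_snvs("ACGT" * 2500, rate=0.005, seed=1)
print(len(truth), "simulated SNVs")
```

Keeping the list of inserted positions gives the ground truth against which called SNVs can later be scored as TP, FP, or FN.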

HBR Sample Statistics

UHR Sample Statistics

[Figure: four charts of TP, FN, and FP counts for ION vs. SNVQ, grouped by ERCC average coverage bin ([0,1], (1,5], (5,10], (10,50], (50,inf)) and, for SNVQ, min alternative allele coverage (1 or 2). Panels: HBR - 5 datasets average; HBR - 5 datasets, combined; UHR - 5 datasets average; UHR - 5 datasets, combined. Axis: Method, ERCC average coverage, min alternative allele coverage.]

Comparing SNVQ & Samtools on HardMerge Alignments

[Figure: two charts of TP, FN, and FP counts for SNVQ vs. Samtools on HardMerge alignments (HM/sam), grouped by ERCC average coverage bin ([0,1], (1,5], (5,10], (10,50], (50,inf)). Panels: HBR - 5 datasets, combined; UHR - 5 datasets, combined. Axis: Method, ERCC average coverage/min alternative allele coverage.]

Whole Transcriptome Comparison on NA12878 Illumina Reads

[Figure: calls by expression bin (RPKM < 1, 1 < RPKM < 5, 5 < RPKM < 10, 10 < RPKM < 50, 50 < RPKM < 100, RPKM > 100). Series: TP HomoVar, TP Hetero, FP, FN HomoVar, FN Hetero.]

Plugin Interface

Plugin Output

Outline

1. SNV calling from RNA-Seq reads
2. Transcriptome reconstruction update

Challenges and Solutions

• Challenge: Read lengths are currently much shorter than transcript lengths
  – Phasing “free” exons (no direct evidence from reads) during assembly is challenging
• Solution: Statistical reconstruction method
  – Fragment length distribution

Candidate Transcripts:

t1 : 1 2 3 4 5 6 7

t2 : 1 3 4 5 6 7

t3 : 1 2 3 4 5 7

t4 : 1 3 4 5 7

Exons : 1 2 3 4 5 6 7

Exon 2 and 6 are “free” exons : no direct evidence from reads
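To illustrate why the fragment length distribution helps with "free" exons: the fragment length implied by a read pair depends on which exons lie between the two mates, so it differs between candidates that include or skip a free exon. The sketch below uses hypothetical exon lengths, a hypothetical mean and standard deviation, and hypothetical function names; none of these values come from the slides.

```python
def implied_fragment_length(transcript, exon_len,
                            left_exon, left_off, right_exon, right_off):
    """Fragment length spanned by a read pair on one candidate transcript.

    transcript: ordered list of exon ids in the candidate.
    exon_len:   dict exon id -> length in bases.
    left_exon/left_off:   exon and 0-based offset of the fragment's first base.
    right_exon/right_off: exon and 0-based offset of the fragment's last base.
    Returns None if either end falls in an exon the transcript lacks.
    """
    if left_exon not in transcript or right_exon not in transcript:
        return None
    start, pos = {}, 0
    for exon in transcript:
        start[exon] = pos          # transcript coordinate of the exon start
        pos += exon_len[exon]
    return (start[right_exon] + right_off) - (start[left_exon] + left_off) + 1


def within_one_sd(length, mean, sd):
    """True if the implied length is within 1 std. dev. of the mean."""
    return length is not None and abs(length - mean) <= sd


exon_len = {1: 100, 2: 150, 3: 100, 4: 80, 5: 120, 6: 90, 7: 100}
with_exon2 = [1, 2, 3, 4, 5, 6, 7]
without_exon2 = [1, 3, 4, 5, 6, 7]

# One mate starts in exon 1, the other ends in exon 3; no read covers exon 2.
for cand in (with_exon2, without_exon2):
    length = implied_fragment_length(cand, exon_len, 1, 50, 3, 49)
    print(cand, length, within_one_sd(length, mean=250, sd=25))
```

With these made-up numbers the pair implies a 250-base fragment if exon 2 is included and a 100-base fragment if it is skipped, so only the first candidate is consistent with a library whose fragments average 250 bases.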

ILP based Transcriptome Reconstruction from PE reads

• SE (from PE): Splicing Graph : candidate transcripts
• PE: ILP based filtering of candidate transcripts

Splicing Graph

Genome Research(2004) : The Multiassembly Problem: Reconstructing Multiple Transcript Isoforms From EST
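Candidate transcripts correspond to source-to-sink paths in the splicing graph. A minimal sketch of that enumeration follows; the edge list is hypothetical and simply encodes a seven-exon gene in which exons 2 and 6 may be skipped, and `enumerate_transcripts` is an illustrative name.

```python
def enumerate_transcripts(edges, source, sink):
    """All source-to-sink paths in a splicing graph (a DAG over exons).

    edges: dict exon -> list of exons that can directly follow it.
    """
    paths = []
    stack = [[source]]
    while stack:
        path = stack.pop()
        last = path[-1]
        if last == sink:
            paths.append(path)
            continue
        for nxt in edges.get(last, []):
            stack.append(path + [nxt])
    return sorted(paths)


# Exons 2 and 6 are optional: 1 -> 3 and 5 -> 7 skip them.
edges = {1: [2, 3], 2: [3], 3: [4], 4: [5], 5: [6, 7], 6: [7]}
for transcript in enumerate_transcripts(edges, 1, 7):
    print(transcript)
```

The number of paths can grow exponentially with the number of optional exons, which is why the candidate set is filtered afterwards rather than reported directly.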

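Naive ILP formulation

Variables:

y(t) = 1 iff candidate transcript t is selected, 0 otherwise

x(p) = 1 iff the PE read p is mapped within 1 std. dev., 0 otherwise

Objective:

$\min \sum_{t} y(t)$

Constraints:

(1) $\sum_{t \in T(p_j)} y(t) \ge x(p_j)$ : for each read $p_j$ at least one transcript is selected

(2) $N(s - \epsilon) \le \sum_{p} x(p) \le N(s + \epsilon)$ : number of reads mapped within 1 std. dev. ~68%

Here $T(p_j)$ is the set of candidate transcripts on which read $p_j$ maps within 1 std. dev., $N$ is the number of reads, $s$ is the expected fraction of reads mapped within 1 std. dev., and $\epsilon$ is a tolerance.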
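A toy illustration of the selection problem, solved by brute-force enumeration instead of an ILP solver (a stand-in that is practical only for a handful of candidates). It assumes the formulation reads as: minimize the number of selected transcripts, a read may be counted only if at least one transcript it maps to within 1 std. dev. is selected, and the number of counted reads must lie within a tolerance `eps` of the expected fraction `s` (about 68%) of all reads. The tolerance, the data, and the function name are hypothetical.

```python
from itertools import combinations
from math import ceil, floor


def select_transcripts(candidates, compatible, s=0.68, eps=0.05):
    """Smallest transcript set explaining about a fraction s of the reads.

    candidates: list of transcript ids.
    compatible: dict read id -> set of transcripts on which the read maps
                within 1 std. dev. of the mean fragment length.
    Returns a set of transcript ids, or None if no selection is feasible.
    """
    n = len(compatible)
    lo, hi = ceil(n * (s - eps)), floor(n * (s + eps))
    for size in range(len(candidates) + 1):       # smallest sets first
        for subset in combinations(candidates, size):
            chosen = set(subset)
            covered = sum(1 for ts in compatible.values() if ts & chosen)
            # x(p) may be set to 1 for any covered read, so the number of
            # counted reads can be any integer from 0 to `covered`.
            if lo <= min(hi, covered):
                return chosen
    return None


reads = {f"p{i}": {"t1"} for i in range(6)}
reads.update({"p6": {"t1", "t2"}, "p7": {"t2"}, "p8": {"t3"}, "p9": set()})
print(select_transcripts(["t1", "t2", "t3"], reads))
```

In this example a single transcript already accounts for 7 of the 10 reads, which falls inside the allowed window, so the other two candidates are filtered out. A real instance would hand the same variables and constraints to an ILP solver.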

Sophisticated ILP Formulation

• Consider reads mapped within >1 std. dev.
• Integrate reads with different fragment lengths
  – Prepare libraries with different insert sizes
  – Reduce number of “free” exons

Preliminary results

Spec PPV

2 0.73 0.95

3 0.66 0.92

Note : results are on ~20% of UCSC genes
