RNA-seq for DE analysis: extracting counts and QC - part 4

Generating the count table and validating assumptions. RNA-seq for DE analysis training. Joachim Jacob, 20 and 27 January 2014. This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to http://www.bits.vib.be/ if you use this presentation or parts hereof.

description

Part 4 of the training session 'RNA-seq for differential expression analysis' covers extracting the count table from a mapping and performing QC to detect sample biases. See http://www.bits.vib.be

Transcript of RNA-seq for DE analysis: extracting counts and QC - part 4

Page 1: RNA-seq for DE analysis: extracting counts and QC - part 4

Generating the count table and validating assumptions

RNA-seq for DE analysis training

Joachim Jacob, 20 and 27 January 2014

This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to http://www.bits.vib.be/ if you use this presentation or parts hereof.

Page 2: RNA-seq for DE analysis: extracting counts and QC - part 4

Goal

Summarize the read counts per gene from a mapping result.

The outcome is a raw count table on which we can perform some QC.

This table is used by the differential expression algorithm to detect DE genes.

Page 3: RNA-seq for DE analysis: extracting counts and QC - part 4

Status

Page 4: RNA-seq for DE analysis: extracting counts and QC - part 4

The challenge

'Exons' are the type of features used here.

They are summarized per 'gene'

Concept:

GeneA = exon 1 + exon 2 + exon 3 + exon 4 = 215 reads

GeneB = exon 1 + exon 2 + exon 3 = 180 reads

No normalization yet! Just pure counts, aka 'raw counts'.

Overlaps no feature

Alt splicing
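The concept on this slide (a gene count is the sum of the counts of its exons) can be sketched in a few lines of Python. This is only an illustration: the per-exon numbers are invented so that they add up to the gene totals on the slide, and the function name is hypothetical.

```python
from collections import defaultdict

# Hypothetical per-exon read counts: (gene_id, exon_number, reads).
# The split over exons is invented; only the gene totals come from the slide.
exon_counts = [
    ("GeneA", 1, 60), ("GeneA", 2, 75), ("GeneA", 3, 50), ("GeneA", 4, 30),
    ("GeneB", 1, 90), ("GeneB", 2, 40), ("GeneB", 3, 50),
]


def summarize_per_gene(rows):
    """Sum raw exon counts per gene; no normalization is applied."""
    totals = defaultdict(int)
    for gene_id, _exon, reads in rows:
        totals[gene_id] += reads
    return dict(totals)


gene_counts = summarize_per_gene(exon_counts)
print(gene_counts)
```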

Page 5: RNA-seq for DE analysis: extracting counts and QC - part 4

Tools to count features

● Different tools exist to accomplish this:

http://wiki.bits.vib.be/index.php/RNAseq_toolbox#Feature_counting

Page 6: RNA-seq for DE analysis: extracting counts and QC - part 4

Dealing with ambiguity

● We focus on the gene level: merge all counts over different isoforms into one, taking into account:

● Reads that do not overlap a feature, but appear in introns. Take into account?

● Reads that align to more than one feature (exon or transcript). Transcripts can be overlapping - perhaps on different strands. (PE, and strandedness can resolve this partially).

● Reads that partially overlap a feature, not following known annotations.

Page 7: RNA-seq for DE analysis: extracting counts and QC - part 4

HTSeq count has 3 modes

http://www-huber.embl.de/users/anders/HTSeq/doc/count.html

HTSeq-count recommends the 'union mode'. But depending on your genome, you may opt for the 'intersection_strict mode'. Galaxy allows experimenting!
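To make the difference between the modes concrete, here is a simplified Python sketch of the decision rule as described in the HTSeq-count documentation. It assumes each read has already been reduced to one set of gene ids per aligned position; the function name and return labels are hypothetical, and the real tool additionally handles strandedness, alignment quality and multi-mapping reads.

```python
def assign_read(feature_sets, mode="union"):
    """Decide which gene a read counts for.

    feature_sets: one set of gene ids per aligned position of the read
    (an empty set means that position overlaps no annotated feature).
    """
    if mode == "union":
        candidates = set().union(*feature_sets)
    elif mode == "intersection_strict":
        candidates = set.intersection(*feature_sets) if feature_sets else set()
    elif mode == "intersection_nonempty":
        nonempty = [s for s in feature_sets if s]
        candidates = set.intersection(*nonempty) if nonempty else set()
    else:
        raise ValueError(f"unknown mode: {mode}")

    if not candidates:
        return "no_feature"
    if len(candidates) > 1:
        return "ambiguous"
    return next(iter(candidates))


# A read that hangs over the end of an exon into unannotated sequence.
overhang = [{"GeneA"}, {"GeneA"}, set()]
# A read inside GeneA that partly overlaps GeneB as well.
overlap = [{"GeneA"}, {"GeneA", "GeneB"}]

for mode in ("union", "intersection_strict", "intersection_nonempty"):
    print(mode, assign_read(overhang, mode), assign_read(overlap, mode))
```

In this sketch 'union' keeps the overhanging read but calls the overlapping read ambiguous, while 'intersection_strict' does the opposite. That is the trade-off to weigh for your genome.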

Page 8: RNA-seq for DE analysis: extracting counts and QC - part 4

Indicate the SE or PE nature of your data (note: mate-pair is not appropriate naming here)

The annotation file with the coordinates of the features to be counted

mode

Check with mapping QC (see earlier)

For RNA-seq DE we summarize over 'exons' grouped by 'gene_id'. Make sure these fields are correct in your GTF file.

Reverse stranded: check with mapping viz

Page 9: RNA-seq for DE analysis: extracting counts and QC - part 4

Resulting count table column

One sample !

Page 10: RNA-seq for DE analysis: extracting counts and QC - part 4

Merging to create experiment count table
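A minimal sketch of the merge step, assuming each sample's count column has been read into a dictionary of gene id to raw count. The sample names, gene names and numbers are invented, and filling a gene missing from one sample with 0 is an assumption of this sketch.

```python
def merge_count_columns(per_sample):
    """Merge {sample: {gene: count}} into (header, rows); missing genes get 0."""
    samples = sorted(per_sample)
    genes = sorted(set().union(*(c.keys() for c in per_sample.values())))
    header = ["gene_id"] + samples
    rows = [[g] + [per_sample[s].get(g, 0) for s in samples] for g in genes]
    return header, rows


per_sample = {
    "sample1": {"GeneA": 215, "GeneB": 180},
    "sample2": {"GeneA": 190, "GeneB": 240, "GeneC": 3},
}
header, rows = merge_count_columns(per_sample)
for line in [header] + rows:
    print("\t".join(str(x) for x in line))
```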

Page 11: RNA-seq for DE analysis: extracting counts and QC - part 4

Resulting count table

Page 12: RNA-seq for DE analysis: extracting counts and QC - part 4

Quality control of count table

In the end, we used about 70% of the reads. Check for your experiment.

Relative numbers Absolute numbers
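The fraction of reads that ends up in the count table can be checked per sample with a sketch like the one below. It assumes the counting tool reports unassigned reads in special rows whose names start with "__" (for example "__no_feature"); check the actual row names in your own output. All numbers are invented.

```python
def assigned_fraction(column):
    """Fraction of reads counted towards a gene.

    column maps row names to counts. Rows whose name starts with "__" are
    treated as special counters (an assumption about the counting tool's
    output naming), everything else as a gene.
    """
    assigned = sum(n for name, n in column.items() if not name.startswith("__"))
    total = sum(column.values())
    return assigned / total if total else 0.0


column = {
    "GeneA": 4000, "GeneB": 3000,
    "__no_feature": 1500, "__ambiguous": 500, "__alignment_not_unique": 1000,
}
print(f"{assigned_fraction(column):.0%} of reads assigned to genes")
```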

Page 13: RNA-seq for DE analysis: extracting counts and QC - part 4

Quality control of count table

2 types of QC:

● General metrics

● Sample-specific quality control

Page 14: RNA-seq for DE analysis: extracting counts and QC - part 4

QC: general metrics

● General numbers

Page 15: RNA-seq for DE analysis: extracting counts and QC - part 4

QC: general metrics

Which genes are most highly present? Which fractions do they occupy?

42 genes (0.63%) of the 6665 genes take 25% of all counts.

This graph can be constructed from the count table.

Gene Counts

TEF1alpha, putative ribo prot,...
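The question "how few genes take a given fraction of all counts?" can be answered directly from one column of the count table. The sketch below is an illustration with invented toy counts; the function name is hypothetical.

```python
def genes_covering(counts, fraction):
    """Smallest number of top-expressed genes whose counts reach `fraction` of the total."""
    total = sum(counts)
    running = 0
    for n, c in enumerate(sorted(counts, reverse=True), start=1):
        running += c
        if running >= fraction * total:
            return n
    return len(counts)


counts = [500, 300, 100, 50, 30, 10, 5, 3, 1, 1]  # invented toy counts
n = genes_covering(counts, 0.25)
print(f"{n} of {len(counts)} genes take 25% of all counts")
```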

Page 16: RNA-seq for DE analysis: extracting counts and QC - part 4

QC: general metrics

● General numbers

Page 17: RNA-seq for DE analysis: extracting counts and QC - part 4

QC: general metrics

● We can plot the counts per sample: filter out the zeros, and transform to log2.

log2(count)

The bulk of the genes have counts in the hundreds.

Few are extremely highly expressed

A minority have extremely low counts
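The filtering and log2 transformation behind this plot can be sketched as follows. Binning by the integer part of log2(count) and the text histogram are simplifications of the density plot on the slide, and the counts are invented.

```python
import math
from collections import Counter


def log2_histogram(counts):
    """Drop zeros, take log2, and bin by the integer part of log2(count)."""
    logs = [math.log2(c) for c in counts if c > 0]
    return Counter(math.floor(v) for v in logs)


counts = [0, 0, 1, 3, 120, 250, 300, 480, 510, 900, 65000]  # invented
hist = log2_histogram(counts)
for b in sorted(hist):
    print(f"[{b}, {b + 1}) {'#' * hist[b]}")
```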

Page 18: RNA-seq for DE analysis: extracting counts and QC - part 4

QC: log2 density graph

● We can do this for all samples, and merge

Strange deviation here

All samples show nice overlap, peaks are similar

Page 19: RNA-seq for DE analysis: extracting counts and QC - part 4

QC: log2 merging samples

Here, we take one sample, plot the log2 density graph, add the counts of another sample, and plot again, add the counts of another sample, etc. until we have merged all samples.

We see a horizontal shift of the graph, rather than a vertical shift, pointing to no saturation.

Page 20: RNA-seq for DE analysis: extracting counts and QC - part 4

QC: log2, merging samples

Here, we take one sample, plot the log2 density graph, add the counts of another sample, and plot again, add the counts of another sample, etc. until we have merged all samples.

Page 21: RNA-seq for DE analysis: extracting counts and QC - part 4

QC: rarefaction curve

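Code:

library(ggplot2)
# nonzero_counts: data frame with columns 'total' (reads sequenced so far) and 'counts' (genes with counts > 0)
ggplot(data = nonzero_counts, aes(x = total, y = counts)) +
  geom_line() +
  labs(x = "total number of sequenced reads", y = "number of genes with counts > 0")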

What is the number of total detected features, how does the feature space increase with each additional sample added?

There should be saturation, but here there is none.
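The points of such a rarefaction curve can be derived from the count table by adding samples one at a time, as sketched below in Python. The function name and the toy counts are invented; each point pairs the total number of reads with the number of genes detected so far, which is presumably what the 'total' and 'counts' columns in the plotting code hold.

```python
def rarefaction_points(samples):
    """samples: list of {gene: count}. Returns (total reads, genes detected) after each merge."""
    merged = {}
    points = []
    for sample in samples:
        for gene, n in sample.items():
            merged[gene] = merged.get(gene, 0) + n
        total = sum(merged.values())
        detected = sum(1 for n in merged.values() if n > 0)
        points.append((total, detected))
    return points


samples = [
    {"GeneA": 10, "GeneB": 0, "GeneC": 5, "GeneD": 0},
    {"GeneA": 12, "GeneB": 2, "GeneC": 4, "GeneD": 0},
    {"GeneA": 9, "GeneB": 1, "GeneC": 6, "GeneD": 0},
]
print(rarefaction_points(samples))
```

In this toy example the number of detected genes stops growing after the second sample while the read total keeps rising. A flattening curve like that indicates saturation.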

Page 22: RNA-seq for DE analysis: extracting counts and QC - part 4

QC: rarefaction curve

Saturation: OK!

rRNA genes

Axis labels: Sample A; Sample A + sample B; Sample A + sample B + sample C; etc.

Page 23: RNA-seq for DE analysis: extracting counts and QC - part 4

QC: transformations for viz

Regularized log (rLog) and 'Variance Stabilizing Transformation' (VST) as alternatives to log2.

http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html

Page 24: RNA-seq for DE analysis: extracting counts and QC - part 4

QC: count transformations

● Techniques used for microarray can be applied on VST transformed counts.

http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html

VST, rLog, Log2

Not normalizations!

http://www.biomedcentral.com/1471-2105/14/91

Page 25: RNA-seq for DE analysis: extracting counts and QC - part 4

QC including condition info

● We can also include condition information, to interpret our QC better. For this, we need to gather sample information.

● Make a separate file in which sample info is provided (metadata)

Page 26: RNA-seq for DE analysis: extracting counts and QC - part 4

QC with condition info

What are the differences in counts in each sample dependent on? Here: counts are dependent on the treatment and the strain. Must match the sample descriptions file.
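Whether the count table and the sample descriptions file agree can be checked mechanically. The sketch below assumes a tab-separated metadata file with a 'sample' column (a hypothetical layout; 'treatment' and 'strain' follow the slide) and a count table whose first column holds the gene ids.

```python
import csv
import io


def check_metadata(count_header, metadata_text):
    """Return (samples missing from the metadata, metadata rows with no count column)."""
    reader = csv.DictReader(io.StringIO(metadata_text), delimiter="\t")
    described = [row["sample"] for row in reader]
    counted = count_header[1:]  # first column holds the gene ids
    missing = [s for s in counted if s not in described]
    unused = [s for s in described if s not in counted]
    return missing, unused


metadata_text = (
    "sample\ttreatment\tstrain\n"
    "sample1\tcontrol\tWT\n"
    "sample2\ttreated\tWT\n"
    "sample3\ttreated\tmutant\n"
)
count_header = ["gene_id", "sample1", "sample2", "sample4"]
print(check_metadata(count_header, metadata_text))
```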

Page 27: RNA-seq for DE analysis: extracting counts and QC - part 4

QC with condition info

Clustering of the distance between samples based on transformed counts can reveal sample errors.

VST transformed rLog transformed

Colour scale of the distance measure between samples. Similar conditions should cluster together.
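The distance measure behind such a heatmap can be sketched as below. Note the assumptions: log2(count + 1) is used here as a simple stand-in for the VST or rLog transformation, Euclidean distance is one possible choice of measure, and the sample names and counts are invented.

```python
import math


def sample_distances(table):
    """table: {sample: [counts per gene, same gene order]}.

    Returns Euclidean distances between samples on log2(count + 1) values.
    """
    logged = {s: [math.log2(c + 1) for c in v] for s, v in table.items()}
    names = sorted(logged)
    return {(a, b): math.dist(logged[a], logged[b]) for a in names for b in names}


table = {
    "WT_1": [100, 10, 0, 500],
    "WT_2": [110, 12, 1, 480],
    "mutant_1": [20, 200, 50, 490],
}
d = sample_distances(table)
for (a, b), value in sorted(d.items()):
    print(f"{a}\t{b}\t{value:.2f}")
```

In the toy data the two WT replicates are much closer to each other than to the mutant sample. A replicate that sits far from its own condition would be a candidate sample error.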

Page 28: RNA-seq for DE analysis: extracting counts and QC - part 4

QC with condition info

Clustering of transformed counts can reveal sample errors.

VST transformed rLog transformed

Page 29: RNA-seq for DE analysis: extracting counts and QC - part 4

QC with condition info

Principal component (PC) analysis displays the samples in a 2D scatterplot based on the variability between the samples. Samples that lie close to each other resemble each other more.
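For readers who want to see what happens under the hood, below is a dependency-free Python sketch of PCA on already transformed values. It uses power iteration on the sample-by-sample Gram matrix, which is an illustrative choice and not how the plots in this presentation were necessarily made; in practice a statistics library would be used. The sample names and values are invented.

```python
import math


def pca_scores(samples, n_components=2, iterations=200):
    """samples: {name: [transformed value per gene]}. Returns {name: [PC1, PC2, ...]}."""
    names = sorted(samples)
    n_genes = len(samples[names[0]])
    # Centre every gene on its mean over the samples.
    means = [sum(samples[s][g] for s in names) / len(names) for g in range(n_genes)]
    centred = [[samples[s][g] - means[g] for g in range(n_genes)] for s in names]
    # Sample-by-sample Gram matrix; its eigenvectors give the sample scores.
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in centred] for r1 in centred]
    scores = {s: [] for s in names}
    for _component in range(n_components):
        vec = [float(i + 1) for i in range(len(names))]  # fixed, non-constant start
        eigenvalue = 0.0
        for _step in range(iterations):
            product = [sum(g * v for g, v in zip(row, vec)) for row in gram]
            norm = math.sqrt(sum(x * x for x in product))
            if norm < 1e-12:  # nothing left to explain
                eigenvalue = 0.0
                break
            vec = [x / norm for x in product]
            eigenvalue = norm
        for name, v in zip(names, vec):
            scores[name].append(math.sqrt(eigenvalue) * v)
        # Deflate: remove the component just found before searching for the next.
        for i in range(len(names)):
            for j in range(len(names)):
                gram[i][j] -= eigenvalue * vec[i] * vec[j]
    return scores


samples = {
    "WT_1": [6.6, 3.4, 0.0, 9.0],
    "WT_2": [6.8, 3.7, 1.0, 8.9],
    "mut_1": [4.4, 7.6, 5.7, 8.9],
    "mut_2": [4.5, 7.5, 5.5, 9.0],
}
scores = pca_scores(samples)
for name, (pc1, pc2) in scores.items():
    print(f"{name}\t{pc1:.2f}\t{pc2:.2f}")
```

The sign of each component is arbitrary; only the relative positions of the samples matter.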

Page 30: RNA-seq for DE analysis: extracting counts and QC - part 4

Collect enough metadata

Principal component (PC) analysis displays the samples in a 2D scatterplot based on the variability between the samples. Samples that lie close to each other resemble each other more.

Why do these resemble each other?

Page 31: RNA-seq for DE analysis: extracting counts and QC - part 4

QC with condition info

During library preparation, collect as much information as possible to add to the sample descriptions. Pay particular attention to differences between samples: e.g. day of preparation, centrifuges used, ...

Why do these resemble each other?

Page 32: RNA-seq for DE analysis: extracting counts and QC - part 4

Collect enough metadata

In the QC of the count table, you can map this additional info to the PC graph. In this case, library prep on a different day had an effect on the WT samples.

Additional metadata

Day 1

Day 2

Page 33: RNA-seq for DE analysis: extracting counts and QC - part 4

Collect enough metadata

In the QC of the count table, you can map this additional info to the PC graph. In this case, library prep on a different day had an effect on the WT samples (batch effect).

Additional metadata

Day 1

Day 2

Page 34: RNA-seq for DE analysis: extracting counts and QC - part 4

Collect enough metadata

Page 35: RNA-seq for DE analysis: extracting counts and QC - part 4

Next step

Now that we know our data inside out, we can run a DE algorithm on the count table!

Page 36: RNA-seq for DE analysis: extracting counts and QC - part 4

Keywords

Raw counts

VST

Write in your own words what the terms mean

Page 37: RNA-seq for DE analysis: extracting counts and QC - part 4

Break