Single-Cell Transcriptome Analysis of Pluripotent Stem Cells

Post on 21-Jan-2018

Nacho Caballero

Center for Regenerative Medicine, Boston University

Jun 12, 2017

From raw data to insights

Analysis pipeline

Raw data

Initial QC

Alignment and Quantification

Outlier analysis

Gene selection and clustering

Insights

Barcoded sequencing files

Demultiplex

One pair of sequencing files per cell

@NB500996:64:HNM72BGX2:3:12510:12240:9366 2:N:0:TAGTCATG
CTACTGTCTAGAGCTTGTCTCAATGGATCTAGAACTTCATCGCCCTCTGATC…
+
AAAAAEEEE<E/EEEEEEEEE6EE/6AEEE//E/EEE/AEA/EAEEEE</6A……

Millions of reads
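To make the read format concrete, here is a minimal sketch that parses a four-line FASTQ record and decodes its quality string. It assumes the common Phred+33 encoding; `parse_fastq` and `FastqRecord` are hypothetical names, and the record is a shortened, made-up example rather than data from the talk.

```python
import io
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    header: str
    sequence: str
    qualities: list  # one Phred score per base


def parse_fastq(handle) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream (4 lines per read, Phred+33 assumed)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        # Phred+33: score = ASCII code of the character minus 33
        scores = [ord(ch) - 33 for ch in quality]
        yield FastqRecord(header[1:], sequence, scores)


# Made-up record: 8 bases with 8 quality characters.
example = io.StringIO("@read_1\nCTACTGTC\n+\nAAAAAEE<\n")
records = list(parse_fastq(example))
print(records[0].sequence, records[0].qualities)
```

Each quality character encodes the confidence of one base call, which is what the quality plots in the Initial QC step summarize.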

Metadata file

Cell_id   Condition1   Condition2
Cell_01   BU3          red
Cell_02   BU3          green
Cell_03   C17          red
Cell_04   C17          green
Cell_05   BU3          red
Cell_06   BU3          green
…

Short simple names

Analysis pipeline: Initial QC

Read length is often inversely correlated with base-pair sequencing quality

[Figure: average sequence quality vs. position in read. Panels: good cDNA quality, average quality, bad quality]
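A quality-by-position profile like the one plotted can be sketched as follows. This is an illustration only: it assumes Phred+33 quality strings, `mean_quality_by_position` is a hypothetical helper, and the three quality strings are made up.

```python
def mean_quality_by_position(quality_strings, offset=33):
    """Average Phred score at each read position across reads.

    Reads may differ in length; each position is averaged over the reads
    that are long enough to cover it.
    """
    totals, counts = [], []
    for quality in quality_strings:
        for pos, ch in enumerate(quality):
            if pos == len(totals):
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(ch) - offset
            counts[pos] += 1
    return [total / count for total, count in zip(totals, counts)]


# Toy quality strings (made up): quality drops toward the end of the reads.
reads = ["EEEE6", "EEA6/", "EEE"]
profile = mean_quality_by_position(reads)
print([round(q, 1) for q in profile])
```

A profile that falls off toward the end of the read is the pattern the slide describes: longer reads tend to carry lower-quality bases at their tail.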

More reads are generally better than longer reads (safe target: 200K reads, 150-bp long)

[Figure: number of reads per cell (axis ticks: 1K, 10K, 1M) for 400 cells]

The Fluidigm protocol makes it extremely easy to lose entire rows or columns

[Figure: cells laid out in rows and columns]

Analysis pipeline: Alignment and Quantification

We quantify the gene expression in a cell by counting how many reads align to each gene

SFTPC gene

AGGCAGAGGGGCGAGATGCA…

1358 reads aligned to the SFTPC gene in this cell
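The counting step can be sketched in a few lines. This is a simplified illustration, not the tool used in the talk: it assumes alignment has already produced, for each uniquely aligned read, the list of genes it overlaps, and `count_reads_per_gene` is a hypothetical helper.

```python
from collections import Counter


def count_reads_per_gene(assignments):
    """Count reads per gene, keeping only reads assigned to exactly one gene.

    `assignments` holds, for each uniquely aligned read, the list of genes
    it overlaps (empty list = no gene, several entries = ambiguous).
    """
    counts = Counter()
    for genes in assignments:
        if len(genes) == 1:
            counts[genes[0]] += 1
    return counts


# Toy input (made up): five reads from one cell.
reads = [["SFTPC"], ["SFTPC"], [], ["SFTPC", "NKX2-1"], ["NKX2-1"]]
counts = count_reads_per_gene(reads)
print(counts)
```

Reads that hit no gene or several genes are dropped here, which is why the usable fraction of reads is smaller than the aligned fraction.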

Read type                                   Number of reads per cell
Raw                                         333,229
Unaligned                                   81,673
Aligned, but non-uniquely                   28,813
Aligned uniquely, but not to a gene         32,774
Aligned uniquely, but span multiple genes   20,838
Aligned uniquely to a single gene           167,241

40-60% of the raw reads cannot be used to quantify gene expression
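As a quick check of that range, the arithmetic for the example cell in the table works out as below. The numbers are copied from the table; the variable names are only for illustration.

```python
# Read counts per cell, copied from the table above.
raw = 333_229
unaligned = 81_673
unique_single_gene = 167_241

aligned_fraction = (raw - unaligned) / raw
usable_fraction = unique_single_gene / raw
unusable_fraction = 1 - usable_fraction

print(f"aligned somewhere:      {aligned_fraction:.1%}")
print(f"usable for quantifying: {usable_fraction:.1%}")
print(f"not usable:             {unusable_fraction:.1%}")
```

For this cell roughly three quarters of the reads align somewhere, but only about half align uniquely to a single gene and can be counted.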

Analysis pipeline: Outlier analysis

Filter out cells with fewer than 5K aligned reads

[Figure: number of aligned reads per cell (axis ticks: 1K, 10K, 1M) for 120 cells]

Filter out cells with a high percentage of mitochondrial gene counts (indicative of a broken cell membrane)

[Figure: % of mitochondrial gene counts per cell (axis ticks: 25%, 50%, 75%, 100%) for 48 cells]

Filter out cells with fewer than 2K expressed genes

[Figure: number of expressed genes per cell (axis ticks: 2K, 4K, 6K) for 30 cells]
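The three filters can be combined into one pass over per-cell statistics, as in the sketch below. The 5K-read and 2K-gene cutoffs come from the slides; the 25% mitochondrial cutoff is an assumption chosen for illustration (the slides only say "high"), and `CellStats` and `passes_qc` are hypothetical names with made-up example cells.

```python
from dataclasses import dataclass


@dataclass
class CellStats:
    cell_id: str
    aligned_reads: int
    expressed_genes: int  # genes with at least one count
    mito_counts: int      # counts assigned to mitochondrial genes
    total_counts: int     # counts assigned to any gene


def passes_qc(cell, min_aligned_reads=5_000, min_expressed_genes=2_000,
              max_mito_fraction=0.25):  # mito cutoff is an assumed value
    """Return True if the cell survives all three outlier filters."""
    if cell.aligned_reads < min_aligned_reads:
        return False
    if cell.expressed_genes < min_expressed_genes:
        return False
    if cell.total_counts == 0:
        return False
    return cell.mito_counts / cell.total_counts <= max_mito_fraction


cells = [
    CellStats("Cell_01", 167_241, 4_200, 8_000, 160_000),
    CellStats("Cell_02", 3_100, 900, 400, 3_000),         # too few reads
    CellStats("Cell_03", 90_000, 3_500, 60_000, 85_000),  # mostly mitochondrial
    CellStats("Cell_04", 50_000, 1_500, 2_000, 48_000),   # too few genes
]
kept = [cell.cell_id for cell in cells if passes_qc(cell)]
print(kept)
```

Keeping the thresholds as parameters makes it easy to adjust them after looking at the distributions for a given experiment.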

Analysis pipeline: Gene selection and clustering

Normalization corrects for differences in capture efficiency, sequencing depth and other technical bias

Raw count data

Assume that most genes are not differentially expressed

Calculate scaling factors for each cell

Apply the scaling factors and log-transform

Normalized expression data
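One way to turn these steps into code is the median-of-ratios approach sketched below. The slides do not name the method actually used, so treat this as an assumed stand-in that relies on the same premise (most genes are not differentially expressed); `size_factors` and `normalize` are hypothetical helpers and the count matrix is made up.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios scaling factor per cell.

    `counts[g][c]` is the raw count of gene g in cell c. Genes with a zero
    count in any cell are skipped, since their geometric mean is zero.
    """
    n_cells = len(counts[0])
    log_ratios = [[] for _ in range(n_cells)]
    for gene_counts in counts:
        if min(gene_counts) == 0:
            continue
        log_geo_mean = sum(math.log(x) for x in gene_counts) / n_cells
        for cell, x in enumerate(gene_counts):
            log_ratios[cell].append(math.log(x) - log_geo_mean)
    return [math.exp(median(ratios)) for ratios in log_ratios]


def normalize(counts):
    """Divide each cell by its scaling factor, then take log2(x + 1)."""
    factors = size_factors(counts)
    return [[math.log2(x / f + 1) for x, f in zip(gene_counts, factors)]
            for gene_counts in counts]


# Toy matrix (made up): 4 genes x 3 cells, cells sequenced at 1x, 2x, 4x depth.
raw = [
    [10, 20, 40],
    [100, 200, 400],
    [5, 10, 20],
    [0, 3, 1],
]
factors = size_factors(raw)
normalized = normalize(raw)
print([round(f, 2) for f in factors])
```

After scaling, the first three genes have the same value in every cell, so the depth difference no longer looks like a biological difference. Real single-cell matrices contain many zeros, so dedicated tools use more robust variants of this idea.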

[Figure: average expression vs. variance for each gene, with per-cell expression shown for three example genes: high expression and low variance, low expression and low variance, high expression and high variance]

Typical questions

What are the expression differences between my experimental groups?

What are the subpopulations in my data?

What are the gene expression patterns in each subpopulation?

TREAT CONDITIONS AS GROUPS?

NO: SELECT GENES, then ASSIGN CELLS TO GROUPS

A difference between the populations (signal) should appear among the most variable genes

[Figure: average expression vs. variance for each gene]
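Selecting the most variable genes can be sketched as below. Ranking by plain sample variance is a simplifying assumption (the talk does not specify the selection rule), `most_variable_genes` is a hypothetical helper, and the gene names and values are made up.

```python
from statistics import mean, variance


def most_variable_genes(expression, n_top):
    """Rank genes by sample variance across cells and keep the top n.

    `expression` maps a gene name to its normalized values, one per cell.
    Returns the selected gene names and the (mean, variance) of every gene.
    """
    stats = {gene: (mean(values), variance(values))
             for gene, values in expression.items()}
    ranked = sorted(stats, key=lambda gene: stats[gene][1], reverse=True)
    return ranked[:n_top], stats


# Toy data (made up): four cells, three genes.
expression = {
    "GENE_A": [5.0, 5.1, 4.9, 5.0],  # high expression, low variance
    "GENE_B": [0.1, 0.0, 0.2, 0.1],  # low expression, low variance
    "GENE_C": [0.5, 6.0, 0.4, 6.2],  # high variance
}
top, stats = most_variable_genes(expression, n_top=1)
print(top)
```

High variance only flags a candidate: a gene can also vary because of noise, which is why the selected genes still need to separate cells consistently.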

Variance is a necessary but insufficient indicator of population differences

Unique populations consistently over- or under-express a set of genes

The silhouette coefficient is a useful metric to determine the optimal number of groups

k = 2 Silhouette coefficient: 0.48

k = 3 Silhouette coefficient: 0.56

k = 4 Silhouette coefficient: 0.47
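For reference, the silhouette coefficient of a given grouping can be computed as in this sketch. It assumes Euclidean distance and group labels that are already assigned; `silhouette_coefficient` is a hypothetical helper and the points are made up, so the scores will not match the values on the slides.

```python
import math


def silhouette_coefficient(points, labels):
    """Mean silhouette width over all points (Euclidean distance).

    For each point: a = mean distance to the other points in its own group,
    b = smallest mean distance to the points of any other group,
    s = (b - a) / max(a, b). Points in singleton groups score 0.
    """
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    if len(groups) < 2:
        raise ValueError("need at least two groups")

    scores = []
    for point, label in zip(points, labels):
        own = groups[label]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # The distance from a point to itself is 0, so dividing by
        # len(own) - 1 averages over the other members only.
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        b = min(
            sum(math.dist(point, other) for other in members) / len(members)
            for other_label, members in groups.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (made up): three well-separated pairs of cells in 2D.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
s2 = silhouette_coefficient(points, [0, 0, 1, 1, 1, 1])  # k = 2
s3 = silhouette_coefficient(points, [0, 0, 1, 1, 2, 2])  # k = 3
print(round(s2, 2), round(s3, 2))
```

The grouping with the highest score is the one where cells sit closest to their own group and farthest from the others, which is how k = 3 would be chosen from the values above.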

TREAT CONDITIONS AS GROUPS?

YES: TEST GENES FOR DIFFERENTIAL EXPRESSION

[Figure: average expression vs. variance, shown twice: once highlighting the differentially expressed genes, once highlighting the highly variable genes]
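Testing one gene for differential expression between two groups can be illustrated with an exact permutation test on the difference in means. This is a generic technique chosen for the sketch, not necessarily the test used in the talk; `permutation_test` is a hypothetical helper and the expression values are made up.

```python
from itertools import combinations
from statistics import mean


def permutation_test(group_a, group_b):
    """Exact two-sided permutation test on the difference in means.

    Enumerates every way to split the pooled values into groups of the
    original sizes, so it is only practical for small groups.
    """
    pooled = list(group_a) + list(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)
    extreme = splits = 0
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        splits += 1
        if diff >= observed - 1e-12:  # tolerance for floating-point noise
            extreme += 1
    return extreme / splits


# Toy normalized expression of one gene in two groups of four cells (made up).
p_different = permutation_test([5.1, 4.8, 5.3, 5.0], [1.2, 0.9, 1.1, 1.4])
p_similar = permutation_test([2.0, 2.2, 1.9, 2.1], [2.1, 2.0, 2.2, 1.9])
print(round(p_different, 3), round(p_similar, 3))
```

In practice the test is repeated for every gene, so the resulting p-values need a multiple-testing correction before calling genes differentially expressed.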

Analysis pipeline: Insights

The ideal heatmap

Real heatmaps are a rough-draft visualization

[Figure: heatmaps of NKX2-1 and CD47 expression, comparing row-scaling and global scaling]
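The difference between the two scaling choices can be shown on a tiny matrix. This sketch assumes that scaling means z-scoring, either per gene (row) or over the whole matrix; `row_scale` and `global_scale` are hypothetical helpers and the values are made up, not measurements of these genes.

```python
from statistics import mean, pstdev


def row_scale(matrix):
    """Z-score each gene (row) using that gene's own mean and SD."""
    scaled = []
    for row in matrix:
        mu, sd = mean(row), pstdev(row)
        scaled.append([(x - mu) / sd if sd else 0.0 for x in row])
    return scaled


def global_scale(matrix):
    """Z-score every value using the mean and SD of the whole matrix."""
    values = [x for row in matrix for x in row]
    mu, sd = mean(values), pstdev(values)
    return [[(x - mu) / sd if sd else 0.0 for x in row] for row in matrix]


# Toy values (made up): one highly expressed gene, one lowly expressed gene.
expression = [
    [10.0, 12.0, 10.0, 12.0],  # hypothetical values for NKX2-1
    [1.0, 2.0, 1.0, 2.0],      # hypothetical values for CD47
]
by_row = row_scale(expression)
overall = global_scale(expression)
print(by_row)
print(overall)
```

Row-scaling makes both genes show the same low/high pattern across cells but hides that one is expressed far more than the other; global scaling keeps the magnitude difference but flattens the within-gene pattern.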

Expression patterns are better conveyed by showing individual genes

[Figure: expression of individual genes, with panels labeled CLUSTERED and RANDOM]

Geneset enrichment analysis depends on the quality of the geneset

MSigDB hallmark genesets only contain 4000 genes

MAKE YOUR OWN GENESETS FROM THE LITERATURE
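To see why geneset coverage matters, consider a simple over-representation test based on the hypergeometric distribution. This is a simpler stand-in for rank-based geneset enrichment analysis, not the method from the talk; `enrichment_p_value` is a hypothetical helper and the numbers are made up.

```python
from math import comb


def enrichment_p_value(universe_size, geneset_size, selected_size, overlap):
    """Probability of seeing at least `overlap` geneset members when
    drawing `selected_size` genes without replacement from a universe of
    `universe_size` genes, `geneset_size` of which are in the geneset
    (upper tail of the hypergeometric distribution)."""
    upper = min(geneset_size, selected_size)
    total = comb(universe_size, selected_size)
    tail = sum(
        comb(geneset_size, k)
        * comb(universe_size - geneset_size, selected_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers (made up): 20 measured genes, a geneset of 5, and 4 selected
# genes of which 3 belong to the geneset.
p = enrichment_p_value(universe_size=20, geneset_size=5,
                       selected_size=4, overlap=3)
print(round(p, 4))
```

A gene that is missing from every geneset can never contribute to the overlap, so a collection that covers only a fraction of the genome limits what any enrichment test can detect.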

Remember to provide a metadata file

Takeaways

More reads are usually better than longer reads

You will only be able to use about 50% of your reads to quantify gene expression

Assume that 50% of your cells could fail

High variance doesn’t imply subpopulations

Make your own gene lists!

Slides available at: bit.ly/crem_bioinformatics