Differential Alternative Splicing - an in silico detection approach

Gero Doose

Bled Presentation

Department of Computer Science and Interdisciplinary Center for Bioinformatics,

University of Leipzig

February 19, 2016

Table of Contents

1 Motivation

2 Method

3 Results


Section 1

Motivation

Differential alternative splicing


Alternative splicing


Section 2

Method

Overview of the algorithm

Algorithmic workflow

Core algorithm is based on splice junction supports per sample
Scanning routine for standard format (e.g. TCGA)
Pre-calculation procedure for BAM files (segemehl)

Pre-calculation procedure for BAM files (segemehl)

Starting with a list of samples (location of the mapped RNA-seq files)
Building the union of all split-mapped reads
Calling splice junctions (cluster splice sites)
Calculating table of individual sample supports
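The last two steps of the pre-calculation can be illustrated with a minimal sketch. It assumes each sample's split-mapped reads have already been reduced to (chromosome, donor, acceptor) tuples, treats identical coordinates as one junction, and leaves out the splice-site clustering step entirely. The function name `support_table` and the input layout are hypothetical and are not taken from the talk.

```python
from collections import Counter

def support_table(split_reads_by_sample):
    """Return (junctions, table); table[sample][k] is the support of junctions[k]."""
    # Count how often each junction occurs among a sample's split reads.
    counts = {s: Counter(reads) for s, reads in split_reads_by_sample.items()}
    # Union of all junctions seen in any sample, in a stable sorted order.
    junctions = sorted(set().union(*(c.keys() for c in counts.values())))
    # One row per sample; junctions a sample never shows get support 0.
    table = {s: [c[j] for j in junctions] for s, c in counts.items()}
    return junctions, table

# Toy input (hypothetical coordinates).
samples = {
    "sample_A": [("chr1", 100, 200), ("chr1", 100, 200), ("chr1", 100, 300)],
    "sample_B": [("chr1", 100, 300), ("chr2", 50, 80)],
}
junctions, table = support_table(samples)
for name, row in table.items():
    print(name, row)
```

Every sample ends up with a support value for every junction in the union, which is the shape the later matrix reduction works on.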

Intersecting with gene annotation

Normal-spliced (co-linear) junctions

Circular-spliced junctions

In/out going vs. within the gene body

Not restricted to known exon boundaries

Matrix reduction

Minimum splice junction support value (J)
Minimum number of samples (S)

Removing samples not showing (J) in at least one splice junction
Removing junctions with fewer than (S) samples showing (J)
Removing genes not containing at least 2 junctions and (S) samples per condition
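The three filters can be sketched as follows. This is one possible reading of the slide, not the original implementation: the data layout (`support`, `gene_of`, `condition_of`) is hypothetical, and the third rule is interpreted as "every condition has at least S samples reaching J in some junction of the gene", which is an assumption.

```python
def reduce_matrix(support, gene_of, condition_of, J, S):
    """support: {junction: {sample: count}}; returns the reduced mapping."""
    # 1. Keep samples that reach J in at least one junction.
    samples = {s for row in support.values() for s, v in row.items() if v >= J}
    # 2. Keep junctions where at least S of the kept samples reach J.
    kept = {}
    for junction, row in support.items():
        row = {s: v for s, v in row.items() if s in samples}
        if sum(v >= J for v in row.values()) >= S:
            kept[junction] = row
    # 3. Keep genes with at least 2 junctions and, per condition,
    #    at least S samples reaching J (assumed interpretation).
    by_gene = {}
    for junction in kept:
        by_gene.setdefault(gene_of[junction], []).append(junction)
    conditions = set(condition_of.values())
    result = {}
    for gene, junctions in by_gene.items():
        if len(junctions) < 2:
            continue
        supporting = {c: set() for c in conditions}
        for junction in junctions:
            for s, v in kept[junction].items():
                if v >= J:
                    supporting[condition_of[s]].add(s)
        if all(len(supporting[c]) >= S for c in conditions):
            for junction in junctions:
                result[junction] = kept[junction]
    return result

# Toy data: gene g1 survives, gene g2 loses junction j4 and is dropped.
support = {
    "j1": {"a1": 5, "a2": 6, "b1": 7, "b2": 8},
    "j2": {"a1": 5, "a2": 0, "b1": 9, "b2": 4},
    "j3": {"a1": 9, "a2": 9, "b1": 9, "b2": 9},
    "j4": {"a1": 0, "a2": 0, "b1": 1, "b2": 0},
}
gene_of = {"j1": "g1", "j2": "g1", "j3": "g2", "j4": "g2"}
condition_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
reduced = reduce_matrix(support, gene_of, condition_of, J=3, S=2)
print(sorted(reduced))
```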

Zero-replacement

Accounting for technical and biological variance
Sampling junction supports from a negative binomial distribution
Tracking junctions where more than 50% of a condition are replaced

Compositional data approach

Normalizing raw counts to ratios per gene

C1 = [40, 30, 10] → C1 = [0.5, 0.375, 0.125]
C2 = [120, 30, 90] → C2 = [0.5, 0.125, 0.375]

Simplex as the appropriate sample space:

\[
S^D = \left\{ [x_1, \dots, x_D] : x_i \ge 0 \text{ for } i = 1, \dots, D \text{ and } \sum_{i=1}^{D} x_i = 1 \right\}
\]

Aitchison (1986)
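The normalization of raw junction counts to per-gene ratios can be written in a few lines. The helper name `closure` is our own choice (it is the usual term in compositional data analysis, but it does not appear on the slides).

```python
def closure(counts):
    """Scale non-negative junction counts of one gene so that they sum to 1."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("at least one junction needs a positive count")
    return [c / total for c in counts]

c1 = closure([40, 30, 10])
c2 = closure([120, 30, 90])
print(c1)
print(c2)
```

Both example vectors from the slide land on the simplex: the first component is 0.5 in both samples, while the second and third components swap their shares.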

Compositional data approach

C1 = [0.5, 0.375, 0.125]
C2 = [0.5, 0.125, 0.375]

[Figure: ternary diagram]

Compositional data approach

Aitchison distance

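\[
d(x_a, x_b) = \left[ \sum_{i=1}^{D} \left( \log\frac{x_{ai}}{g(x_a)} - \log\frac{x_{bi}}{g(x_b)} \right)^{2} \right]^{\frac{1}{2}}
\quad \text{with} \quad
g(x) = \left( \prod_{i=1}^{D} x_i \right)^{\frac{1}{D}}
\]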
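A direct transcription of the Aitchison distance into Python, as a minimal sketch. It assumes all components are strictly positive (which is what the zero-replacement step is for); the function names are ours.

```python
import math

def geometric_mean(x):
    """g(x): geometric mean of a strictly positive vector."""
    return math.exp(sum(math.log(v) for v in x) / len(x))

def aitchison_distance(xa, xb):
    """Euclidean distance between the log-ratios of xa and xb to their geometric means."""
    ga, gb = geometric_mean(xa), geometric_mean(xb)
    return math.sqrt(sum(
        (math.log(a / ga) - math.log(b / gb)) ** 2 for a, b in zip(xa, xb)
    ))

c1 = [0.5, 0.375, 0.125]
c2 = [0.5, 0.125, 0.375]
print(aitchison_distance(c1, c2))
```

For C1 and C2 the geometric means are equal because one vector is a permutation of the other, so the distance reduces to sqrt(2) * ln(3), roughly 1.55. The distance is also unchanged when the raw counts [40, 30, 10] and [120, 30, 90] are used instead of the ratios, because the log-ratios do not depend on the total.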

Compositional data approach

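Central tendency

Let \(C = (x_{ij})\), \(i = 1, \dots, N\), \(j = 1, \dots, D\), be a set of N compositional vectors with D components:

\[
\mathrm{cen}(C) = \left[ \frac{g(x_{i1})}{\sum_{j=1}^{D} g(x_{ij})}, \dots, \frac{g(x_{iD})}{\sum_{j=1}^{D} g(x_{ij})} \right]
\quad \text{with} \quad
g(x_{ij}) = \left( \prod_{i=1}^{N} x_{ij} \right)^{\frac{1}{N}}
\]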
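The centre is the component-wise geometric mean over the N samples, rescaled to sum to 1. A minimal sketch, assuming strictly positive components and using a function name of our own:

```python
import math

def centre(compositions):
    """Closed component-wise geometric mean of N compositions with D components each."""
    n = len(compositions)
    d = len(compositions[0])
    # Geometric mean of component j over all N samples.
    geo = [math.exp(sum(math.log(x[j]) for x in compositions) / n) for j in range(d)]
    total = sum(geo)
    return [g / total for g in geo]

c1 = [0.5, 0.375, 0.125]
c2 = [0.5, 0.125, 0.375]
cen = centre([c1, c2])
print(cen)
```

For the two example vectors the second and third components of the centre are equal, since their shares are simply swapped between C1 and C2.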

Compositional data approach

Abundance change

\[
\mathrm{abc}(x_i) = d\left( \mathrm{cen}(C_{\mathrm{base}})_i, \mathrm{cen}(C_{\mathrm{compare}})_i \right)
\]

C1 = [0.5, 0.375, 0.125]
C2 = [0.5, 0.125, 0.375]

Detection of differential alternative splicing

For all components with |abc| ≥ 1:

Centered log-ratio transformation (clr):

\[
\mathrm{clr}(x) = \left[ \log\frac{x_1}{g(x)}, \dots, \log\frac{x_D}{g(x)} \right]
\quad \text{with} \quad
g(x) = \left( \prod_{i=1}^{D} x_i \right)^{\frac{1}{D}}
\]

Non-parametric test statistic (Wilcoxon rank-sum)
Multiple testing correction (Benjamini-Hochberg)
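The three ingredients of the detection step can be sketched in plain Python. The slides do not say which implementation of the rank-sum test was used; the version below uses the normal approximation with average ranks and no tie correction of the variance, which is an assumption, and all function names are ours.

```python
import math

def clr(x):
    """Centered log-ratio transform of a strictly positive composition."""
    log_g = sum(math.log(v) for v in x) / len(x)  # log of the geometric mean
    return [math.log(v) - log_g for v in x]

def rank_sum_p(a, b):
    """Two-sided Wilcoxon rank-sum p-value via the normal approximation."""
    values = sorted(list(a) + list(b))
    rank = {}
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        rank[values[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j for ties
        i = j
    n1, n2 = len(a), len(b)
    w = sum(rank[v] for v in a)               # rank sum of the first group
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))   # equals 2 * (1 - Phi(|z|))

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda k: pvalues[k])
    q = [0.0] * m
    running = 1.0
    for pos in range(m - 1, -1, -1):          # from the largest p-value down
        k = order[pos]
        running = min(running, pvalues[k] * m / (pos + 1))
        q[k] = running
    return q

# Toy clr values of one junction in two conditions (made-up numbers).
base = [1.0, 2.0, 3.0, 4.0, 5.0]
compare = [6.0, 7.0, 8.0, 9.0, 10.0]
p = rank_sum_p(base, compare)
qvalues = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(p, qvalues)
```

In a full pipeline, `rank_sum_p` would be applied to the clr-transformed value of each candidate junction across the samples of the two conditions, and `benjamini_hochberg` to the resulting list of p-values.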

Clustering and outlier detection

Finding the upper quartile of all genes with regard to the average distance between the centre and each of the n compositional vectors
Calculating the distances for all \(\binom{n}{2}\) pairwise sample combinations, averaged over this set of genes
Hierarchical agglomerative clustering

[Figure: dendrogram of BL and GCB samples; axis from 0.0 to 3.0]
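The last two steps can be sketched as follows. The sketch assumes the gene set has already been selected, uses the Aitchison distance between the samples' compositions per gene, and picks average linkage for the agglomeration; the slides name neither the per-gene distance explicitly nor the linkage criterion, so both are assumptions, as are all function names.

```python
import math
from itertools import combinations

def aitchison_distance(xa, xb):
    """Distance between the centered log-ratios of two positive compositions."""
    la = [math.log(v) for v in xa]
    lb = [math.log(v) for v in xb]
    ma, mb = sum(la) / len(la), sum(lb) / len(lb)
    return math.sqrt(sum(((p - ma) - (q - mb)) ** 2 for p, q in zip(la, lb)))

def mean_pairwise_distances(genes):
    """genes: list of {sample: composition}; returns {(s, t): mean distance over genes}."""
    samples = sorted(genes[0])
    return {
        (s, t): sum(aitchison_distance(g[s], g[t]) for g in genes) / len(genes)
        for s, t in combinations(samples, 2)
    }

def average_linkage(samples, dist):
    """Agglomerative clustering; returns the merge history [(cluster_a, cluster_b, height)]."""
    def d(s, t):
        return dist[(s, t)] if (s, t) in dist else dist[(t, s)]
    clusters = [(s,) for s in samples]
    merges = []
    while len(clusters) > 1:
        # Average linkage: mean distance over all cross-cluster sample pairs.
        height, a, b = min(
            (sum(d(s, t) for s in a for t in b) / (len(a) * len(b)), a, b)
            for a, b in combinations(clusters, 2)
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, height))
    return merges

# One toy gene with three samples (made-up compositions).
genes = [{
    "s1": [0.5, 0.375, 0.125],
    "s2": [0.5, 0.36, 0.14],
    "s3": [0.5, 0.125, 0.375],
}]
dist = mean_pairwise_distances(genes)
merges = average_linkage(sorted(genes[0]), dist)
for a, b, height in merges:
    print(a, b, round(height, 3))
```

The merge heights are what a dendrogram plots; a sample that only joins the tree at a large height is a candidate outlier.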

Section 3

Results

First glimpse at the hidden treasures within the ICGC RNA-seq data

ICGC data set: Germinal Center B-cell Derived Lymphomas

6 comparisons (4 conditions: GCB, BL, FL, DLBCL)
126 samples (5 GCB, 18 BL, 46 FL, 47 DLBCL)

Outlier detection

Dendrogram for BL samples
Distance distribution for BL samples

Outliers: BL_4166151, DLBCL_4181460, DLBCL_4193638, FL_4199996

BL 4166151: age 74 (median ≈ 10)
DLBCL 4193638: also an outlier in terms of expression
FL 4199996 and DLBCL 4181460: not conspicuous in terms of methylation or expression

[Figure: expression correlation; labelled samples are DLBCL and FL/DLBCL cases; legend: normal, BL, L, FL, DLBCL, others]

Subgroup detection

Dendrogram for GCB and BL samples


Differential Alternative Splicing - a statistical overview

6 comparisons (4 conditions: GCB, BL, FL, DLBCL)
126 samples (5 GCB, 18 BL, 46 FL, 47 DLBCL)
442,738 supported splice junctions
18,700 supported genes (16,208 protein coding, 2,492 lincRNA)
Significant if |abundance change| ≥ 1 and q-value < 0.01

                  GCB-BL   GCB-FL   GCB-DLBCL  BL-FL    BL-DLBCL  FL-DLBCL
splice junctions
  tested          100,315  106,890  107,498    126,667  127,919   138,212
  sig.            1,012    1,629    1,499      1,089    610       132
  percent         1.01     1.52     1.39       0.86     0.48      0.10
genes
  tested          8,229    8,419    8,450      10,304   10,426    11,418
  sig.            756      1,077    1,016      738      424       108
  percent         9.19     12.79    12.02      7.16     4.07      0.95

Randomizing group info ⇒ no significant results

Lymphoma common genes

Overlap of sig. genes

Network of overlap


Lymphoma common genes

High protein expression in lymph nodes and other immune-related tissues

[Figure: genome browser view of MYO9B (hg19, 17,320,500 to 17,322,000) with con.expr, BL.expr and FL.expr tracks]

Lymphoma common genes

Plays a role in the invasiveness of cancer cells and the formation of metastases

[Figure: genome browser view of CTTN (hg19, 70,266,000 to 70,270,000) with con.expr and BL.expr tracks]

Enriched pathways (GCB vs BL)

B-cell receptor
Apoptosis

Apoptosis pathway (GCB vs. BL)

[Figure: genome browser view of CASP8 (hg19, ticks at 202,145,000 and 202,150,000) with con.expr and BL.expr tracks]
[Figure: genome browser view of IRAK4 (hg19, 44,166,700 to 44,167,800) with con.expr and BL.expr tracks]

Validation by qPCR

PRDM10


Connection between diff. alt. splicing and diff. expression

[Figure: bar chart of the number of significant genes (0 to 900) per comparison (BL_DLBCL, BL_FL, FL_DLBCL, GCB_BL, GCB_DLBCL, GCB_FL), split into differentially expressed and not differentially expressed]

Intersection with sig. genes based on DESeq

Connection between diff. alt. splicing and RNA editing

Calling significantly differential RNA editing sites between conditions (with Methylene)
Intersected with both splice sites of all sig. diff. alt. splice junctions

[Figure: panels GCB vs. BL and GCB vs. FL]
[Figure: genome browser view of COLGALT1 (hg19, 17,690,400 to 17,691,700) with Txn Factor ChIP, BL.expr, con.expr and FL.expr tracks]

RNA editing at position chr19:17,691,052

Connection between diff. alt. splicing and DMRs

Building 2 x 2 contingency tables of gene counts in regard to diff. alt. splicing and DMR overlap.
Odds ratios and p-values from Fisher's exact test:

                 GCB-BL   GCB-FL     BL-FL
odds ratio       1.3      1.48       1.84
p-value          0.00075  6.273e-09  1.719e-15
genes (yes/yes)  336      449        400
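For reference, the odds ratio and a two-sided Fisher's exact test for one 2 x 2 table can be computed with the standard library alone. The full contingency tables are not shown on the slide, so the counts below are a made-up toy table, not the lymphoma data. The two-sided p-value sums all tables that are at most as probable as the observed one, which is one common convention and may differ from the tool used for the talk.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Odds ratio and two-sided p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(k):
        # Hypergeometric probability of k in the top-left cell with fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables at most as likely as the observed one
    # (small relative tolerance guards against float round-off).
    p = sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-7))
    odds = (a * d) / (b * c) if b * c else float("inf")
    return odds, min(1.0, p)

# Hypothetical toy table: rows = diff. alt. spliced yes/no, columns = DMR overlap yes/no.
odds, p = fisher_exact_two_sided(3, 1, 1, 3)
print(odds, p)
```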

Abundance changes

[Figure: log2 abundance change versus junctions per gene (0 to 60), and density of log2 abundance change for the junction sets BL_vs_FL, GCB_vs_FL, GCB_vs_DLBCL, BL_vs_DLBCL, GCB_vs_BL, FL_vs_DLBCL]

Abundance change as a measure for biological relevance

Junctions per gene

[Figure: junctions per gene (0 to 60) for not significant and significant genes, one panel per comparison: BL vs. DLBCL, BL vs. FL, FL vs. DLBCL, GCB vs. BL, GCB vs. DLBCL, GCB vs. FL]

Natural bias towards genes with more junctions

Zero-replacement portion

[Figure: number of significant instances (genes, junctions with zero-replacement, junctions without zero-replacement) per comparison: BL_DLBCL, BL_FL, FL_DLBCL, GCB_BL, GCB_DLBCL, GCB_FL]
[Figure: number of tested instances (genes, junctions) for the same comparisons]

Tested instances vs. significant instances

Summary

Simple and robust method

Based only on direct splice evidence

Not restricted to current annotations

Fast core algorithm (3h for 419 samples)

Plausible results


Thanks to

Peter F. Stadler

Steve Hoffmann

Stephan Bernhart

Helene Kretzmer

Reiner Siebert

Rabea Wagener

Hoffmann Group


Thank you for your attention!

Questions?!
