Synonymous mutations - from bacterial evolution to somatic ...

Post on 15-Dec-2016


Synonymous mutations - from bacterial evolution to somatic changes in human cancer

Fran Supek

1) Lehner group, CRG/EMBL Systems Biology Unit, Barcelona

2) Division of Electronics, RBI, Zagreb, Croatia

XXI Jornades de Biologia Molecular

Barcelona, 11.6.2014

synonymous mutations = changes in the gene sequence that don’t alter the protein sequence
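The definition above can be checked mechanically. The following is a minimal sketch, assuming the standard genetic code and a coding sequence given in frame; the function names `translate` and `is_synonymous` are illustrative and not from the talk.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cds):
    """Translate an in-frame coding sequence; a trailing partial codon is ignored."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def is_synonymous(cds, position, new_base):
    """True if replacing the base at `position` (0-based) leaves the protein unchanged."""
    mutated = cds[:position] + new_base + cds[position + 1:]
    return translate(mutated) == translate(cds)


if __name__ == "__main__":
    cds = "ATGGCTGAA"                      # Met-Ala-Glu
    print(is_synonymous(cds, 5, "C"))      # GCT -> GCC, both alanine
    print(is_synonymous(cds, 6, "A"))      # GAA -> AAA, Glu -> Lys
```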

Synonymous mutations

• (some) synonymous mutations are subject to evolutionary pressures

• clearly shown for many bacteria and yeasts

• likely also higher Eukarya (but weaker signal)

• how does selection for/against synonymous changes relate to gene function in (a) evolution of bacteria and (b) in carcinogenesis?

evolutionary trace across ~1000 bacterial genomes: adaptation to diverse environments

somatic mutations in ~4000 human cancers: malignant transformation

( plush microbes in photos are from http://www.giantmicrobes.com/ )

• In what way can evolution of synonymous codon preferences be used to systematically infer gene function in bacteria?

• There are other simpler (known) ways to determine gene function from the genome sequences:

• commonly/systematically applied: transfer of annotation via sequence similarity (BLAST, COG, Pfam...)

• >30% of genes end up with no known function annotated. They may not have known homologs, or their homologs may have no experimentally determined function.

• known but less common: genomic context methods, such as phyletic profiling

evolutionary trace across ~1000 bacterial genomes

adaptation to diverse environments


Phyletic (or phylogenetic) profiling

Pellegrini, Marcotte et al., PNAS (1999)

one genomic context method:

examines presence/absence patterns of homologous genes across species.
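As a rough illustration of the idea, the sketch below compares presence/absence profiles by Pearson correlation and reports the most similar gene family. The profiles and gene names are made up for the example, and this pairwise comparison is only the simplest flavor of phyletic profiling, not the specific method used in the talk.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric vectors (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def most_similar(query, profiles):
    """Name of the gene family whose profile correlates best with the query's profile."""
    others = (name for name in profiles if name != query)
    return max(others, key=lambda name: pearson(profiles[query], profiles[name]))


# Hypothetical profiles: 1 = homolog present in that species, 0 = absent.
PROFILES = {
    "geneA": [1, 1, 0, 0, 1, 0],
    "geneB": [1, 1, 0, 0, 1, 0],
    "geneC": [0, 0, 1, 1, 0, 1],
    "geneD": [1, 0, 1, 0, 1, 1],
}

if __name__ == "__main__":
    print(most_similar("geneA", PROFILES))
```

Genes with matching profiles (here geneA and geneB) would be candidates for sharing a function; anti-correlated profiles (geneA and geneC) would not.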

Kensche et al. (2008) J Royal Soc Interface. ~30 examples of success of phyletic profiling

• by 2008 -> n~=30

• by 2014 -> n~=300 (estimate)

• aim for: N > 3000

Enriching phyletic profiles with information on orthology and paralogy

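[Table: phyletic profile matrix. Rows are OMA groups (OMA 1, OMA 2, …, OMA 64051, OMA 64052); columns are Species 1, Species 2, …, Species 997, Species 998, plus a Function column. Cells mark orthologs in cliques, orthologs outside cliques, paralogs, or 0. Function column: OMA 1: GO:001, GO:007; OMA 2: ?; OMA 64051: GO:042; OMA 64052: GO:003, GO:160.]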

groups of orthologs from OMA database: Schneider, Dessimoz and Gonnet (2007) Bioinformatics

Skunca et al. PLoS Comp Biology 2013, doi:10.1371/journal.pcbi.1002852

Accuracy of predicting GO categories strongly increases when adding paralogs

[Figure: prediction accuracy for profiles built from the clique only, + orthologs (outside clique), + paralogs, and + para + ortho; bubbles are Gene Ontology categories]

Supervised machine learning is superior to common approaches based on pairwise distances

[Figure: AUC (area under ROC curve) for methods based on correlation of profiles vs. decision trees]

Decision trees

Schietgat et al. 2010. BMC Bioinfo

Experimental validation of predictions made with phyletic profiling

• knockout mutants of E. coli in predicted genes

• three selected GO categories targeted by particular antibiotics:

• ‘response to DNA damage’

• ‘translation’

• ‘peptidoglycan-based cell wall biogenesis’

• predictions: 38 genes with expected precision > 60%

[Figure: survival compared to the wild type (0% to 160%) for w.t. and the knockouts dbpA, rhlB, yhbJ, pmbA, rhlE, tldD, yidD, ynbB, envC, murE, under nalidixic acid (DNA damaging agent), ampicillin (inhibits cell wall synthesis) and kasugamycin (inhibits translation initiation)]


Does this gene participate in ‘peptidoglycan-based cell wall biogenesis’ ?


25/38 validated predictions (experimental precision = 66%; theoretically expected = 60%)

our method is useful for prioritizing genes for experimentally determining gene function

http://gorbi.irb.hr/

“We predict Gene Ontology annotations ... for about 1.3 million poorly annotated genes in 998 prokaryotes at a stringent threshold of 90% Precision...”

“...about 19000 of those are highly specific functions.”

published in: Skunca et al. PLoS Comp Biology 2013, doi:10.1371/journal.pcbi.1002852

• Codon usage biases are another useful source of evolutionary information

• ... complementary to gene presence/absence

• ... available from just the genome sequence

• ... with an established biological rationale

tRNA levels and codon usage biases

E. coli K-12, tRNA gene counts (proxy for tRNA levels)

codon

anticodon

Commonly used codons typically correspond to abundant tRNAs, particularly in highly expressed genes.
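To make "commonly used codons" concrete, the sketch below counts codons in a coding sequence and reports, for each codon, its share among the synonymous codons of the same amino acid. It assumes the standard genetic code; the function names are illustrative, and this is a plain frequency table, not the MILC measure shown in the talk.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def codon_counts(cds):
    """Count in-frame codons, skipping anything that is not a valid codon."""
    cds = cds.upper()
    codons = (cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    return Counter(c for c in codons if c in CODON_TABLE)


def relative_codon_frequencies(cds):
    """For each observed codon: its fraction among codons for the same amino acid."""
    counts = codon_counts(cds)
    per_amino_acid = defaultdict(int)
    for codon, n in counts.items():
        per_amino_acid[CODON_TABLE[codon]] += n
    return {codon: n / per_amino_acid[CODON_TABLE[codon]]
            for codon, n in counts.items()}


if __name__ == "__main__":
    # Four leucine codons (3x CTG, 1x TTA) and one lysine codon.
    print(relative_codon_frequencies("CTGCTGCTGTTAAAA"))
```

Comparing such tables between a set of highly expressed genes and the rest of the genome is one simple way to see a codon bias.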

Codon biases correlate to gene expression

[Figure: E. coli genome; MILC (non-RP genes) plotted against MILC (ribosomal protein genes) for ribosomal protein genes, other highly expressed genes and the rest of the genome]

Figure from Supek and Vlahoviček (2005) BMC Bioinformatics, doi:10.1186/1471-2105-6-182

• organisms adapt to the environment through changes in translation efficiency?

• Carbone A (2005) J Mol Evol – codon adaptation in metabolic pathways:

Photosynthesis genes in Synechocystis (Bacteria)

Methanogenesis genes in Methanosarcina (Archaea)

An example phenotype: oxygen requirement

• Man & Pilpel (2007) Nat Genet: 9 yeasts

[Figure: codon adaptation (low to high) of TCA cycle and glycolysis genes in aerobic vs. anaerobic yeasts]

• Based on these examples, we aimed to systematically link:

• Many environments/phenotypes, with

• evolutionary change in translation efficiency across many gene families

Measuring translation efficiency

Method from Supek et al. (2010) PLoS Genetics, doi:10.1371/journal.pgen.1001004

Genes are split into HE and non-HE; HE genes are 4-20% of the genome.

[Figure, panel A: for each gene (gene 1, gene 2, gene 3, ...), a classifier predicts the probability of belonging to the highly expressed genes* vs. all other protein genes, using intergenic DNA and codon usage; the question asked is whether the probability increases after adding codon usage]

* ribosome, translation elongation factors, chaperones

[Figure, panels B and C: expression levels from microarrays on 19 diverse bacteria; log2 expression ratio for OCU/non-OCU (from ref. [7]), HE/non-HE and ribosomal proteins/all genes; values shown: 3.9x, 6.0x]

Correlation vs. causality?

a randomization test to control for confounding phenotypes and phylogeny
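The general shape of a randomization test can be sketched as below: shuffle the phenotype labels across genomes many times and ask how often the shuffled data show an association at least as strong as the observed one. This is a plain label permutation under assumed inputs (one HE flag and one phenotype flag per genome); the test in the talk additionally controls for confounding phenotypes and phylogeny, which this sketch does not attempt.

```python
import random


def permutation_p_value(he_flags, phenotype, n_perm=10000, seed=0):
    """One-sided p-value for 'the gene family is HE more often in phenotype-positive genomes'.

    he_flags  -- 1/0 per genome: is the gene family highly expressed (HE) there?
    phenotype -- True/False per genome: does the organism have the phenotype?
    Both phenotype classes must be non-empty.
    """
    rng = random.Random(seed)

    def he_fraction_difference(labels):
        positive = [h for h, lab in zip(he_flags, labels) if lab]
        negative = [h for h, lab in zip(he_flags, labels) if not lab]
        return sum(positive) / len(positive) - sum(negative) / len(negative)

    observed = he_fraction_difference(phenotype)
    labels = list(phenotype)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # breaks any link between phenotype and HE status
        if he_fraction_difference(labels) >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_perm + 1)


if __name__ == "__main__":
    he = [1] * 10 + [0] * 10
    aerotolerant = [True] * 10 + [False] * 10
    print(permutation_p_value(he, aerotolerant, n_perm=2000))
```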

This passes the randomization test:

This fails (association not unique):

associations between phenotypes, and also with phylogeny:

• 514 aerotolerant vs. 214 aerointolerant: 295 COGs are significantly enriched with HE genes; after control for confounders: 23 COGs

• obligate vs. facultative aerobes: 11 COGs

• thermophiles: 16 COGs

• halophiles: 6 COGs

+ 20 other phenotypes tested

Gene families linked to aerotolerance

all experiments: Anita Kriško lab (Mediterranean Institute for Life Science, Split, Croatia); published as Kriško et al, Genome Biology 2014. doi:10.1186/gb-2014-15-3-r44

[Figure, panel A: survival, normalized to w.t. (0% to 120%), after 2.5 mM H2O2, with 5 mM NAC pretreatment, after heat shock and after osmotic shock; second axis: NAC / no NAC survival ratio (0x to 6x); significance marks **]

[Figure, panel B: survival, normalized to w.t., under heat, oxidative and osmotic shock for w.t. and the knockouts clpS, oppA, tig, ssuD, nudF, pnp, typA, mngR, lsrR, yebS, rhlE, yajL, pykF, dtd, eutD, gloB, yfcA, marR, yccX, pncB, ttdB, moaA, dsbB]

[Figure, panel C: survival (0% to 120%) under osmotic, oxidative and heat shock for w.t. and the knockouts yjjB, flgH, cysG, mnmA, nlpE, proX]

* known antioxidant proteins in E. coli (or homologs in other organisms)

* known to be regulated in response to air or oxidative stress

[Figure: ROS levels in the mutants (scale 0 to 0.8) for fre, sufD, rseC, sodA, w.t., clpA, recA, napF, lon, ybeQ, yaaU, cysD, ybhJ, gpmM, icd, lpd, yidH; assay columns: carbonylation increase, DHR-123 increase, CellROX increase, total Fe increase, dipyridyl rescue, NADPH level increase, NADPH rescue; annotations: positive control, 2 nonspecific hits]


ROS are typically not increased (except cysD, yaaU, rseC, and the positive control sodA)

Predicted functional interactions from STRING v9

Gene families whose codon biases are associated to aerobicity/aerotolerance:

Putative mechanisms of oxidative stress resistance

[Figure: the mutants grouped by putative mechanism: NAD(P)H-related, iron-related, unknown]


[Figure column labels: carbonylation increase, DHR-123 increase, CellROX increase, total Fe increase, dipyridyl rescue, NADPH level decrease, exogenous NADPH rescue]

[Figure: survival of E. coli knockout strains, normalized to w.t., under heat, oxidative and osmotic shock]

Other phenotypes: thermophilicity, halophilicity

Knockout of candidate genes affects heat shock resistance and osmotic shock resistance.

Validation using synthetic genes with introduced suboptimal codons

[Figure, panels A-D: % survival (0% to 30%) after heat shock or osmotic shock for w.t., ΔclpS, ΔclpS + clpS_w.t., ΔclpS + clpS_15, ΔclpS + clpS_20, ΔclpS + clpS_25, and for w.t., ΔyjjB, ΔyjjB + yjjB_w.t., ΔyjjB + yjjB_21, ΔyjjB + yjjB_28, ΔyjjB + yjjB_35; relative frequency vs. codon distance (MILC) to ribosomal protein genes, for ribosomal protein genes and all other E. coli genes, with the w.t. and recoded variants of clpS (15, 20, 25) and yjjB (21, 28, 35) marked]


Overall:

• 200 links between 187 different COG gene families and 24 diverse phenotypic traits, including

• spore-forming ability

• motility

• pathogenicity to plants or mammals

• affecting certain tissues/organs

• (1000s of predictions at less stringent thresholds)

• Anita Kriško - Mediterranean Institute for Life Sciences (MedILS), Split, Croatia: all experimental work shown

• Nives Škunca - ETH Zurich: phyletic profiling

Cancer

3851 cancer exomes from 11 tissues (>200 samples each)

292,405 missense and 123,193 synonymous somatic mutations

ARE THE SYNONYMOUS MUTATIONS SELECTED FOR IN CARCINOGENESIS?

from Lawrence et al (2013) Nature. Mutation rate varies widely across the genome and correlates with DNA replication time and expression level.

from Schuster-Böckler and Lehner (2012): heterochromatin correlates to SNV rates

Drivers vs. passengers

• many somatic mutations in cancer = „passengers”

• a driver = a gene that confers a selective advantage. Recurrently mutated (i.e. more than expected)

1. For missense, could be measured using the dN/dS

2.

3. commonly: find background mutation frequencies for the patient from the entire exome, then see if a gene is above that background
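The background comparison in item 3 can be sketched as a one-sided count test: given a background mutation rate, how surprising is the number of mutations seen in one gene? The version below uses a Poisson approximation with a single assumed rate per base pair per sample; the rate, gene length and sample numbers are invented for illustration, and real driver-detection methods model the background in much more detail.

```python
from math import exp


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summing the lower tail term by term."""
    term = exp(-lam)          # P(X = 0)
    lower_tail = 0.0
    for i in range(k):
        lower_tail += term    # adds P(X = i)
        term *= lam / (i + 1)
    return max(0.0, 1.0 - lower_tail)


def gene_excess_p_value(observed, gene_length_bp, n_samples, background_rate):
    """P-value for seeing at least `observed` mutations in a gene, given a background
    rate expressed per base pair per sample (an assumed, simplified model)."""
    expected = background_rate * gene_length_bp * n_samples
    return poisson_upper_tail(observed, expected)


if __name__ == "__main__":
    # Hypothetical numbers: 3000 bp gene, 500 samples, 2e-6 mutations per bp per sample.
    print(gene_excess_p_value(12, 3000, 500, 2e-6))
    print(gene_excess_p_value(3, 3000, 500, 2e-6))
```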

Intronic rates as a baseline: INVEX test, Hodis et al. (Cell 2012)

[Figure: correlation of genomic features to PC1 (30.4 % variance) and PC2 (24.3 %): regional mutation rates (carcinoma, 1 Mb; non-carcinoma, 1 Mb; pooled, 200 kb; liver, 200 kb; liver, 1 Mb; breast, 1 Mb), H3K9me3 (1 Mb), GC3, RepliSeq (1 Mb), and mRNA levels (hypothalamus; liver; skeletal & heart muscle; 6 tissues)]

[Figure: gene sets. Cancer Gene Census: oncogenes by translocation (217), missense (40), copy number (12); tumor suppressors, all mechanisms (84). Recurrently mutated genes (self-reported in literature): 39 oncogenes and 38 tumor suppressors, shown against known cancer genes in the Census (others: 336); overlap diagram of missense-activated oncogenes and recurrently mutated oncogenes (19, 18, 21). Matched sets of noncancer genes: 1517 genes (for oncogenes), 693 genes (for tumor suppressors); complete set of 13219 noncancer genes]

[Figure: cumulative distributions (0 to 1) of # mutations per 200 kb (110 cancers, pooled tissues), heterochromatin (H3K9me3 levels in 1 Mb windows), replication timing (RepliSeq signal in 1 Mb windows, early to late) and mRNA levels (avg. of 6 tissues, log2 RPKM); K-S statistics shown: D- = 0.224, P = 0.017; D+ = 0.313, P = 0.0004; D- = 0.464, P = 2.4·10^-8; D- = 0.256, P = 0.005; D+ = 0.211, P = 0.026; D- = 0.199, P = 0.043; D+ = 0.215, P = 0.025; D- = 0.185, P = 0.061]


„classical” cancer genes:

newly discovered, from cancer genomes:


Detecting positive selection on synonymous mutations in cancer

• create „matched sets” of genes closely following the oncogenes in:

• regional mutation rates

• in 1 Mb and 200 kb windows

• expression levels in different tissues

• Heterochromatin, replication timing

• G+C content

How to find a good set of genes?

A genetic algorithm. An optimization technique that can (relatively) easily handle many criteria at once. Quite efficient. Many parameters.

Operators:

...crossover

...random mutation
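The crossover and random mutation operators can be put together in a toy genetic algorithm that picks a subset of candidate genes whose feature distribution resembles that of a target gene set. This is only a sketch under strong simplifications: a single numeric feature per gene, the two-sample Kolmogorov-Smirnov statistic as the only criterion, and arbitrary population and generation sizes. All names and parameters are assumptions for illustration, not the implementation used in the talk.

```python
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample K-S statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
               for x in a + b)


def fitness(subset, candidates, target):
    """Lower is better: K-S distance between the chosen genes and the target set."""
    return ks_statistic([candidates[i] for i in subset], target)


def crossover(parent1, parent2, size, rng):
    """Child = random draw of `size` distinct genes from the union of both parents."""
    return rng.sample(sorted(set(parent1) | set(parent2)), size)


def mutate(subset, n_candidates, rng, rate=0.1):
    """Randomly swap some genes for others not yet in the subset."""
    chosen, result = set(subset), list(subset)
    for j in range(len(result)):
        if rng.random() < rate:
            new = rng.randrange(n_candidates)
            if new not in chosen:
                chosen.discard(result[j])
                chosen.add(new)
                result[j] = new
    return result


def matched_set(candidates, target, size, pop_size=30, generations=60, seed=0):
    """Return indices into `candidates` forming a set matched to `target`."""
    rng = random.Random(seed)
    n = len(candidates)
    population = [rng.sample(range(n), size) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda s: fitness(s, candidates, target))
        survivors = population[: pop_size // 2]      # keep the better half
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = rng.sample(survivors, 2)
            children.append(mutate(crossover(p1, p2, size, rng), n, rng))
        population = survivors + children
    return min(population, key=lambda s: fitness(s, candidates, target))


if __name__ == "__main__":
    pool = [i / 200 for i in range(200)]             # feature values of non-cancer genes
    cancer = [0.5 + i / 60 for i in range(30)]       # feature values of cancer genes
    best = matched_set(pool, cancer, size=20)
    print(ks_statistic(pool, cancer), fitness(best, pool, cancer))
```

Handling several criteria at once, as described above, would mean combining one such statistic per feature into the fitness function.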


Oncogenes: Tumor suppressors:

Distributions of regional mutation rates (1Mb and 200 kb), heterochromatin, etc. in the optimized sets of non-cancer genes closely match the cancer genes. Genetic algorithm tries to minimize the K-S statistic.


Expected: the oncogenes and the tumor suppressors are highly enriched with missense mutations (~1.5 - 2.5x).

However, the oncogenes are also enriched with synonymous mutations over their matched sets, ~1.2x.

Introns of oncogenes (from whole-genome sequencing) are not enriched with SNVs, compared to matched sets.

The matched sets method agrees with Invex, and with simply using neighboring genes as a baseline.

Tissue-specific oncogenes are more enriched with synonymous mutations in the corresponding tissue.

This effect is not due to mutation showers/clustered mutations, as the same cancer samples don't tend to contain both a synonymous and a missense mutation in the same gene.

Synonymous enrichment in oncogenes is detectable across cancer types.

Some oncogenes are more highly enriched with synonymous mutations than others, e.g. PDGFRA, EGFR, GATA1, ELN, NTRK1, JAK3, ALK and others (n=16).

The synonymous SNV enrichment in these genes is not paralleled by intronic SNV enrichment.

The synonymous mutations tend to cluster together to a similar extent as the missense mutations in the affected oncogenes. They also (less prominently) cluster with missense mutations.

[Figure, panels A-I:
- % of synonymous mutations leading to optimal codon gain, optimal codon loss or no change (n.s.)
- mRNA folding free energy around mutated sites (kcal/mol) in 50 nt and 100 nt windows, w.t. mRNA vs. mut. mRNA
- % of mutations in distance bins ≤30 nt, 31-70 nt, >70 nt (values shown: 1.75, 1.26, 0.45; p < 10^-4)
- log2 RPKM of exon vs. exon # in transcript ENST00000334286 (EDNRB gene, colorectal cancer): 30 random samples w/o point mutations vs. 6 samples w/ synonymous exonic mutations
- net # of gained miRNA seed sites per syn. mutation (whole cDNA; sites w/ phyloP > 1.0), 16 oncogenes vs. matched set
- normalized difference (Glass' delta) between properties of mutated positions in oncogenes vs. matched set (t-test, FDR < 10%): relative preference value at C-cap (of α helices), normalized frequency of turn in all-α class, alpha-helix indices for α-proteins, relative preference value at N' and at N'' (of α helices), normalized frequency of α-helix in all-α class
- % syn. mutations (within 30 nt of splice site) leading to enh. gain, enh. loss, sil. gain or sil. loss, for Ke et al. 2012 hexamers (values shown: 1.53, 0.83, 0.60, 1.90; p = 0.02), RESCUE-ESE (1.90, 0.53; p = 0.003) and FAS-hex2 (0.37, 2.73; p = 3·10^-4)
- actual synonymous mutations vs. randomized mutation positions in coil and in α-helix (1st a.a., middle, last a.a.; values shown: 1.43, 1.12, 0.79; p = 0.05, n.s., n.s.), and in α-helix parts: middle, next to coil only, next to β-sheet (values shown: 0.97, 1.01, 2.60; p = 4·10^-5)]

To do: Make nice schematic of alpha-helix as a legend here

Use of „optimal codons”; miRNA binding sites; secondary structures in mRNA

What could the synonymous mutations do?


No general effect was detected in any of these cases (although they may still be important in specific examples).

Exonic Splicing Enhancer

~ and ~

Exonic Splicing Silencer

From Cartegni, Chew & Krainer. Nat Rev Genet. 2002 3(4),285-98.

AGAAGA enh, GAAGAT enh, GACGTC enh, GAAGAC enh

....

CTTTTA sil, CTTTAA sil, TAGGTA sil, TAGTAG sil

Synonymous SNVs tend to be closer to splice sites in oncogenes.

They also tend to cause gains of known exonic splicing enhancer motifs, and losses of exonic splicing silencer motifs.
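Gains and losses of splicing motifs can be detected by comparing the hexamers that overlap a mutated position before and after the change. The sketch below uses only the eight example hexamers listed on the slide as stand-ins for the full enhancer and silencer sets; the function names and test sequences are made up for illustration.

```python
# Example hexamers taken from the slide's excerpt; the real motif sets are much larger.
ENHANCERS = {"AGAAGA", "GAAGAT", "GACGTC", "GAAGAC"}
SILENCERS = {"CTTTTA", "CTTTAA", "TAGGTA", "TAGTAG"}


def hexamers_covering(seq, pos, k=6):
    """All k-mers of `seq` that overlap the 0-based position `pos`."""
    first_start = max(0, pos - k + 1)
    last_start = min(pos, len(seq) - k)
    return {seq[i:i + k] for i in range(first_start, last_start + 1)}


def motif_changes(seq, pos, new_base):
    """Motifs overlapping `pos` that appear or disappear when the base is changed."""
    mutant = seq[:pos] + new_base + seq[pos + 1:]
    before = hexamers_covering(seq, pos)
    after = hexamers_covering(mutant, pos)
    gained, lost = after - before, before - after
    return {
        "enh_gain": sorted(gained & ENHANCERS),
        "enh_loss": sorted(lost & ENHANCERS),
        "sil_gain": sorted(gained & SILENCERS),
        "sil_loss": sorted(lost & SILENCERS),
    }


if __name__ == "__main__":
    print(motif_changes("CCAGAAGGCC", 7, "A"))   # creates enhancer hexamers
    print(motif_changes("GGTAGGTACC", 4, "A"))   # destroys a silencer hexamer
```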

They more often affect exons with weaker (noncanonical) splice sites.

The exonic splicing enhancers created may resemble SF2/ASF motifs.

The ESS sites that are lost upon mutation sometimes resemble hnRNP A2/B1, H2 and A1 motifs.

Roughly ½ of the putatively causal synonymous mutations alter splicing, as evidenced by examining RNA-seq data from cancer.

We don't (yet) know what the other ½ is doing. One possibility may be affecting protein folding.

In yeast: Pechmann & Frydman, Nature Struct Mol Biol 2013

[Figure: % of actual synonymous mutations vs. randomized mutation positions in coil and in α-helix (1st a.a., middle, last a.a.; values shown: 1.43, 1.12, 0.79; p = 0.05, n.s., n.s.), and in α-helix parts: middle, next to coil only, next to β-sheet (values shown: 0.97, 1.01, 2.60; p = 4·10^-5)]

[Figure: schematic of α-helix and turn positions N’’, N’, Ncap, Ccap, C’, C’’; normalized difference (Glass' delta) between mutated sites in oncogenes vs. matched set (FDR < 10%) for: relative preference value at C-cap, normalized frequency of turn in all-α class, α-helix indices for α-proteins, relative preference value at N', relative preference value at N'', normalized frequency of α-helix in all-α class]

...also in cancer: we observe an enrichment of synonymous mutations at N-termini of alpha-helices, esp. if close to beta-sheets. Suggestive of effects on folding.

known

novel

TP53 gene has a large excess of synonymous mutations, which are always near splice sites.

We found three examples of recurrent SNV that inactivate the nearby splice site.

causes a frameshift

Dosage sensitive oncogenes have many point mutations in their 3' UTRs

Take-home messages:

• oncogenes contain an excess of synonymous mutations in human cancers

• a subset of synonymous mutations target splicing motifs

• 1/5 to 1/2 of the synonymous mutations in oncogenes reported to date are acting as driver mutations

• ~6 – 8% of all driver mutations due to single nucleotide changes are likely to be synonymous mutations

• TP53 has recurrent synonymous mutations that disrupt splice sites

• an excess of mutations in 3’ UTRs of dosage-sensitive genes

published in: Supek et al. (2014) Cell. http://dx.doi.org/10.1016/j.cell.2014.01.051

Thank you!
