Quantifying contributions of mutations and homologous recombination to E. coli genomic diversity...

Post on 23-Dec-2015


Quantifying contributions of mutations and homologous recombination to E. coli genomic diversity

Sergei Maslov

Department of Biosciences

Brookhaven National Laboratory, New York

Bacterial genome evolution happens in cooperation with phages


Variation between E. coli strains

M Touchon et al. PLoS Genetics (2009)

Pan-genome of E. coli

Copy and Insert vs. Copy and Replace

FW Studier, P Daegelen, RE Lenski, S Maslov, JF Kim, JMB (2009)

Comparison of B vs K-12 strains of E. coli

Usual suspects are there but do not explain heterogeneity

Negative correlation with protein abundance: 2.5% of variation, P-value = $10^{-5}$

Positive correlation with distance from origin of replication: 0.4% of variation, P-value = $10^{-2}$

High SNP numbers are clustered along the chromosome

Recombined vs. clonal

P. Dixit, T. Y. Pang, Studier FW, Maslov S, PNAS submitted (2013)

Ratio of SNPs from recombination to SNPs from clonal mutations: $r/\mu = 6 \pm 1$

Clonal regions vs. recombined regions

Recombined regions


Strains: K-12 vs. ETEC-H10407, HS, O157-H7-Sakai

Neutral model: mutations and recombination among 70 “genes” in a population of $10^4$

C. Fraser et al. (2007) and (2009)


Phase transition Δc=1.5%


Why exponential tail?

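Time to coalescence: $\mathrm{Prob}(t) = \frac{1}{N_e}\left(1 - \frac{1}{N_e}\right)^{t-1} \approx \frac{1}{N_e}\exp(-t/N_e)$

Exponential slope $= 1/(2\mu N_e)$, or $1/\theta$

Population size $N_e = (1 \pm 0.1) \times 10^9$, consistent with earlier estimates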
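The coalescence argument above can be checked numerically. The following is a minimal sketch, not the analysis used in the talk: the function names are hypothetical, and it assumes that the divergence between two sequences is d = 2μt for coalescence time t, so that a geometric law in t becomes an exponential tail in d.

```python
import math

def coalescence_pmf(t, n_e):
    """Probability that two lineages coalesce exactly t generations back
    (geometric distribution with success probability 1/n_e)."""
    p = 1.0 / n_e
    return p * (1.0 - p) ** (t - 1)

def exponential_approx(t, n_e):
    """Large-n_e approximation of the geometric law: exp(-t/n_e) / n_e."""
    return math.exp(-t / n_e) / n_e

def divergence_decay_rate(mu, n_e):
    """Decay rate of the divergence tail, 1/(2*mu*n_e) = 1/theta.
    Assumes divergence d = 2*mu*t (an assumption of this sketch)."""
    return 1.0 / (2.0 * mu * n_e)
```

For example, `coalescence_pmf(500, 1000)` and `exponential_approx(500, 1000)` agree to within 1%. Any mutation rate passed to `divergence_decay_rate` is a placeholder to be supplied by the reader; the talk does not state one here.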

Why is $N_e \ll N$? Phages:

But: there are phages that cross species boundaries.

Also, the slope is similar for different species

Restriction modification system: Recombined segments are not continuous

[Milkman R, Bridges MM. Genetics 1990]

Recombination efficiency: 20-30 identical bases are needed to start recombination. Our slope predicts 60 bases, which roughly matches 30 at the beginning and 30 at the end

Species are defined by recombination

Are our 30+ strains a representative sample?

Fully sequenced genomes: 1000s of genes (unbiased and complete) 10s of strains (biased)

MLST data: 10s of genes (biased) 1000s of strains (unbiased, I hope)

Database http://mlst.ucc.ie: ∼3000 E. coli strains, 7 short regions of ~500 base pairs each in housekeeping genes

MLST-- Genomes

Is it really phages?

Phage capacity: 20 kb

Other strains: up to 40 kb

K-12 to B comparison

1kb: gene length

Does neutral model explain everything?

At 3 standard deviations:

• 19 1-kb regions are supervariable

• 29 1-kb regions are superconserved

Collaborators & funding
• Bill Studier (BNL)
• Purushottam Dixit (BNL)
• Tin Yau Pang (Stony Brook)
• Rich Lenski (Michigan State)
• Patrick Daegelen (France)
• Jinhyun Kim (Korea)

DOE Systems Biology Knowledgebase (KBase)
• Adam Arkin (Berkeley)
• Rick Stevens (Argonne)
• Bob Cottingham (Oak Ridge)
• Mark Gerstein (Yale)
• Doreen Ware (Cold Spring Harbor)
• Mike Schatz (Cold Spring Harbor)
• Dave Weston (ORNL)
• 60+ other collaborators

Thank you!

Genes encoded in bacterial genomes ~ packages installed on Linux computers

• Complex systems have many components
  • Genes (bacteria)
  • Software packages (Linux OS)
• Components do not work alone: they need to be assembled to work
• In individual systems only a subset of components is used
  • Genome (bacteria): bag of genes
  • Computer (Linux OS): installed packages
• Components have vastly different frequencies of use


Justin Pollard, http://www.designboom.com

IKEA: has many components


They need to be assembled to work

Different frequencies of use

Common vs. rare

What determines the frequency of use?

• Popularity (a.k.a. preferential attachment)
  • Frequency ~ self-amplifying popularity
  • Relevant for social systems: WWW links, Facebook friendships, scientific citations
• Functional role
  • Frequency ~ breadth or importance of the functional role
  • Relevant for biological and technological systems, where selection adjusts undeserved popularity

Empirical data on component frequencies

Bacterial genomes (eggnog.embl.de): 500 sequenced prokaryotic genomes, 44,000 orthologous gene families

Linux packages (popcon.ubuntu.com): 200,000 Linux packages installed on 2,000,000 individual computers

Binary tables: component is either present or not in a given system
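A binary presence/absence table of this kind reduces to one frequency per component. The sketch below is illustrative only: the function name and the toy data are made up, and it stores the table sparsely as one set of component names per system.

```python
from collections import Counter

def component_frequencies(systems):
    """Return {component: fraction of systems that contain it}.

    `systems` is a list of sets; each set holds the components present in
    one system (a genome or a computer). A component is either in the set
    or not, which is the binary table in sparse form.
    """
    counts = Counter()
    for present in systems:
        counts.update(set(present))  # count each component once per system
    n_systems = len(systems)
    return {name: c / n_systems for name, c in counts.items()}

# Toy data, invented for illustration (not taken from popcon or eggNOG)
toy_systems = [
    {"libc", "bash", "python"},
    {"libc", "bash"},
    {"libc", "emacs"},
    {"libc"},
]
freqs = component_frequencies(toy_systems)
```

Here `freqs["libc"]` is 1.0 (a "universal" component), while `freqs["emacs"]` is 0.25.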


Frequency distributions

$P(f) \sim f^{-1.5}$, except for the top $\sqrt{N}$ “universal” components with $f \sim 1$

[Figure: frequency distribution with regions labeled Core, Shell, Cloud, and ORFans]

How to quantify functional importance?

• Components do not work alone
• Breadth/importance ~ component is needed for proper functioning of other components
• Dependency network
  • A → B means A depends on B for its function
  • Formalized for Linux software packages
  • For metabolic enzymes, given by upstream-downstream positions in pathways
• Frequency ~ dependency degree, $K_{dep}$
  • $K_{dep}$ = the total number of components that directly or indirectly depend on the selected one
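The definition of the dependency degree can be sketched as a graph traversal. This is a minimal illustration, not the code behind the talk: the function name and the toy package names are hypothetical, and the input is assumed to be a dictionary mapping each component to the components it directly needs.

```python
from collections import deque

def dependency_degrees(depends_on):
    """Return {component: K_dep}, where K_dep counts every component that
    directly or indirectly depends on it.

    `depends_on` maps each component A to the components B it needs (A -> B).
    """
    # Reverse the edges: for each component, who needs it directly?
    needed_by = {name: set() for name in depends_on}
    for a, deps in depends_on.items():
        for b in deps:
            needed_by.setdefault(b, set()).add(a)

    k_dep = {}
    for start in needed_by:
        seen = set()
        queue = deque([start])
        # Breadth-first search along reversed edges collects all dependents.
        while queue:
            node = queue.popleft()
            for dependent in needed_by[node]:
                if dependent not in seen and dependent != start:
                    seen.add(dependent)
                    queue.append(dependent)
        k_dep[start] = len(seen)
    return k_dep

# Toy dependency network, invented for illustration
toy = {"app": ["lib"], "lib": ["libc"], "tool": ["libc"], "libc": []}
k = dependency_degrees(toy)
```

In the toy network `k["libc"]` is 3, because `lib`, `tool`, and (indirectly) `app` all depend on it, while `k["app"]` is 0.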

Correlation coefficient ~0.4 for both Linux and genes. Could be improved by using weighted dependency degree.

Frequency is positively correlated with functional importance


Tree-like metabolic network

Kdep=5

Kdep=15

TCA cycle


Dependency degree distribution on a critical branching tree

• $P(K) \sim K^{-1.5}$ for a critical branching tree
• Paradox: $K_{max}^{-0.5} \sim 1/N$ gives $K_{max} = N^2 > N$
• Answer: parent tree size imposes a cutoff: there will be $\sqrt{N}$ “core” nodes with $K_{max} = N$, present in almost all systems (ribosomal genes or core metabolic enzymes)
• Need a new model: in a tree $D = 1$, while in real systems $D \sim 2 > 1$

Dependency network evolution

• New components are added gradually over time
• Each new component depends on $D$ existing components selected randomly
• $K_{dep}(t) \sim (t/N)^{-D}$
• $P(K_{dep}(t) > K) = P(t/N < K^{-1/D}) = K^{-1/D}$
• $P(K_{dep}) = K_{dep}^{-(1+1/D)} = K_{dep}^{-1.5}$ for $D = 2$
• $N_{universal} = N^{(D-1)/D} = N^{0.5}$ for $D = 2$
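The growth rule above is simple enough to simulate directly. The following is a rough sketch under stated assumptions, not the simulation used for the talk: the function name is hypothetical, component t is assumed to pick min(D, t) distinct earlier components uniformly at random, and dependents are tracked as Python integer bitmasks.

```python
import random

def simulate_k_dep(n, d, seed=0):
    """Grow a dependency network and return K_dep for every component.

    Component t (added at step t) depends on min(d, t) distinct earlier
    components chosen uniformly at random. K_dep[i] is the number of
    components that directly or indirectly depend on component i.
    """
    rng = random.Random(seed)
    deps = [rng.sample(range(t), min(d, t)) for t in range(n)]

    # dependents[i] is a bitmask of all components that depend on i.
    dependents = [0] * n
    # Walk from newest to oldest: every dependent of t is newer than t,
    # so dependents[t] is already complete when t is processed.
    for t in range(n - 1, -1, -1):
        mask = dependents[t] | (1 << t)
        for b in deps[t]:
            dependents[b] |= mask
    return [bin(m).count("1") for m in dependents]

k_dep = simulate_k_dep(200, 2, seed=1)
```

Sorting `k_dep` in decreasing order and plotting it against rank on log-log axes gives a Zipf plot that can be compared with the $K^{-1/D}$ cumulative law. The oldest component always ends up with $K_{dep} = N - 1$, since every later component depends on it at least indirectly.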

$K_{dep}$ decreases with layer number

Linux vs. model with D=2

Zipf plot for $K_{dep}$ distributions

Metabolic enzymes vs. model

Linux vs. model


Why should we care about P(f)?


Metagenomes and pan-genomes

The Human Microbiome Project Consortium, Nature (2012)

For $P(f) \sim f^{-1.5}$: (pan-genome size) ~ (# of samples)$^{0.5}$
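The square-root scaling can be checked with a short numerical integration. This is a sketch under simplifying assumptions that are not from the talk: samples are treated as independent, a component of frequency f appears in at least one of n samples with probability 1 - (1 - f)^n, the "universal" components are ignored, and the lower cutoff `f_min` and the function names are arbitrary choices.

```python
import math

def expected_pan_genome_size(n_samples, exponent=1.5, f_min=1e-9, steps=20000):
    """Integrate P(f) * [1 - (1 - f)**n] over f in [f_min, 1], P(f) = f**-exponent.

    The result is unnormalized, so only ratios between different sample
    counts are meaningful.
    """
    log_lo = math.log(f_min)
    dlog = (0.0 - log_lo) / steps
    total = 0.0
    for i in range(steps):
        f = math.exp(log_lo + (i + 0.5) * dlog)  # midpoint rule in log(f)
        # 1 - (1 - f)**n, computed stably for very small f
        seen = -math.expm1(n_samples * math.log1p(-f))
        total += f ** (-exponent) * seen * f * dlog  # df = f * dlog
    return total

def scaling_exponent(n1, n2):
    """Effective exponent a in (pan-genome size) ~ n**a between n1 and n2."""
    g1 = expected_pan_genome_size(n1)
    g2 = expected_pan_genome_size(n2)
    return math.log(g2 / g1) / math.log(n2 / n1)

a = scaling_exponent(1000, 4000)
```

With these settings `a` comes out close to 0.5, in line with the scaling stated above; the small deviation comes from the upper limit f = 1 of the integral.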

Pan-genome of E. coli strains

M Touchon et al. PLoS Genetics (2009)

Genome evolution in E. coli

Studier FW, Daegelen P, Lenski RE, Maslov S, Kim JF, J. Mol. Biol. (2009)

P. Dixit, T. Y. Pang, Studier FW, Maslov S, submitted (2013)

How many transcription factors does an organism need?

Regulator genes Worker genes

S. Maslov, TY Pang, K. Sneppen, S. Krishna, PNAS (2009)

TY Pang, S. Maslov, PLoS Comp Bio (2011)

Figure adapted from S. Maslov, TY Pang, K. Sneppen, S. Krishna, PNAS (2009)

$N_R \sim N_G^2$, so $N_R/N_G \sim N_G$

Parkinson's Law

The report of the Royal Commission on the Civil Service was published on Thursday afternoon. Time has not permitted any comment in this week's issue of The Economist on the contents of the Report. But the startling discovery enunciated by a correspondent in the following article is certainly relevant to what should have been in it.

Nov 19th 1955 | From The Economist print edition

“… bureaucracy grew by 5-7% per year "irrespective of any variation in the amount of work (if any) to be done."”

Why?
1) "An official wants to multiply subordinates, not rivals"
2) "Officials make work for each other"

so that “Work expands so as to fill the time available for its completion”

Is this what happens in bacterial genomes? Probably not!

Cyril Northcote Parkinson (1909-1993)

Economies of scale in bacterial evolution

• $N_R = N_G^2/80{,}000$, so $N_G/N_R = 80{,}000/N_G$

• Economies of scale: as the genome gets larger, new pathways get shorter
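The quadratic scaling above is easy to turn into numbers. A minimal sketch follows; the function names are hypothetical, and the genome size in the usage note is an illustrative round number rather than a value from the talk.

```python
SCALE = 80_000  # constant from the slide: N_R = N_G**2 / 80,000

def regulators(n_genes):
    """Number of regulators N_R predicted for a genome with n_genes genes."""
    return n_genes ** 2 / SCALE

def genes_per_regulator(n_genes):
    """Genes per regulator, N_G / N_R = 80,000 / N_G."""
    return SCALE / n_genes
```

For a hypothetical genome of 4,000 genes this gives `regulators(4000) == 200.0` and `genes_per_regulator(4000) == 20.0`. Doubling the genome to 8,000 genes quadruples the regulators to 800 and halves the genes per regulator to 10.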

nutrient

Horizontal gene transfer: entire pathways could be added in one step

nutrient

Redundant enzymes are removed

Central metabolic core anabolic pathways biomass production

Adapted from “scope-expansion” algorithm by R. Heinrich et al.

Minimal metabolic pathways from reactions in KEGG database

(# of pathways or their regulators) ~ (# of enzymes)$^2$

[Plot axes: $N_G$ and $N_R$]

What does it all mean for regulatory networks?

Scale-free regulatory networks with “hubs” due to power-law distribution of branch sizes: $P(S) \sim S^{-3}$

Trends in complexity of regulation vs. genome size

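• $N_R \langle K_{out} \rangle = N_G \langle K_{in} \rangle$ = number of regulatory interactions (E. van Nimwegen, TIG (2003))
• $N_R/N_G = \langle K_{in} \rangle / \langle K_{out} \rangle$ increases with $N_G$
• Either $\langle K_{out} \rangle$ decreases with $N_G$: functions become more specialized
• Or $\langle K_{in} \rangle$ grows with $N_G$: regulation gets more coordinated and interconnected
• Most likely both trends at once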

nutrient

TF1

nutrient

TF2

Regulatory templates: one worker – one boss

<Kout>: <Kin>=1=const

nutrient

TF1

nutrient

Regulatory templates: long top-to-bottom regulation

<Kout>=const, <Kin>:

TF2

<Kout>: <Kin>:

nutrient

nutrient

TF1

TF2

Regulatory templates: hierarchy & middle management

TF3

Histogram of the # of SNPs in genes

FW Studier, P Daegelen, RE Lenski, S Maslov, JF Kim, JMB (2009)

Comparison of B vs K-12 strains of E. coli

• 50% of genes have very few SNPs
  • 1253: 0 SNPs
  • 445: 1 SNP
  • 232: 2 SNPs
• The remaining 50% are in an exponential tail up to 100 SNPs (10% divergence) and higher