Comparative Microbial Genomics group

Center for Biological Sequence Analysis
The Technical University of Denmark (DTU)

Gene Regulation in Microbial Genomes

DNA curvature, melting, and flexibility

Dave Ussery
Comparative Microbial Genomics & Taxonomy Workshop
Petrópolis - Rio de Janeiro
17 August, 2006



Outline

• Introduction & disclaimer

• DNA curvature

• DNA Flexibility



How to become a uropathogen: Comparative genomic analysis of extraintestinal pathogenic Escherichia coli strains

Elzbieta Brzuszkiewicz*†, Holger Bruggemann*‡, Heiko Liesegang*, Melanie Emmerth†, Tobias Olschlager†, Gabor Nagy§, Kaj Albermann¶, Christian Wagner¶, Carmen Buchrieser‡, Levente Emody§, Gerhard Gottschalk*, Jorg Hacker†, and Ulrich Dobrindt†‖

*Gottingen Genomics Laboratory, Institute of Microbiology and Genetics, Georg-August-Universitat Gottingen, Grisebachstrasse 8, 37077 Gottingen, Germany; †Institute for Molecular Biology of Infectious Diseases, Bayerische Julius-Maximilians-Universitat Wurzburg, Rontgenring 11, 97070 Wurzburg, Germany; §Institute of Medical Microbiology and Immunology, University of Pecs, Szigeti ut 12, 7624 Pecs, Hungary; ¶Biomax Informatics AG, Lochhamerstrasse 9, 82152 Martinsried, Germany; and ‡Laboratoire de Genomique des Microorganismes Pathogenes, Institut Pasteur, 75724 Paris Cedex 15, France

Edited by Roy Curtiss, Arizona State University, Tempe, AZ, and approved July 2, 2006 (received for review April 18, 2006)

Uropathogenic Escherichia coli (UPEC) strain 536 (O6:K15:H31) is one of the model organisms of extraintestinal pathogenic E. coli (ExPEC). To analyze this strain’s genetic basis of urovirulence, we sequenced the entire genome and compared the data with the genome sequence of UPEC strain CFT073 (O6:K2:H1) and to the available genomes of nonpathogenic E. coli strain MG1655 (K-12) and enterohemorrhagic E. coli. The genome of strain 536 is ≈292 kb smaller than that of strain CFT073. Genomic differences between both UPEC are mainly restricted to large pathogenicity islands, parts of which are unique to strain 536 or CFT073. Genome comparison underlines that repeated insertions and deletions in certain parts of the genome contribute to genome evolution. Furthermore, 427 and 432 genes are only present in strain 536 or in both UPEC, respectively. The majority of the latter genes is encoded within smaller horizontally acquired DNA regions scattered all over the genome. Several of these genes are involved in increasing the pathogens’ fitness and adaptability. Analysis of virulence-associated traits expressed in the two UPEC O6 strains, together with genome comparison, demonstrate the marked genetic and phenotypic variability among UPEC. The ability to accumulate and express a variety of virulence-associated genes distinguishes ExPEC from many commensals and forms the basis for the individual virulence potential of ExPEC. Accordingly, instead of a common virulence mechanism, different ways exist among ExPEC to cause disease.

fitness | genome comparison | uropathogenic Escherichia coli

Uropathogenic Escherichia coli (UPEC) are the most common cause of community-acquired urinary tract infection (UTI) and are responsible for 70–90% of the estimated 150 million UTIs diagnosed annually (1). UPEC also cause ≈40% of all nosocomial UTI, thus representing one of the most frequently isolated nosocomial pathogens (2). These frequencies illustrate the magnitude of the problem but do not reflect disease diversity in the urinary tract. UTI may be acute, symptomatic with a varying severity and localization, but may also be sporadic, recurrent, or chronic. It is essential to understand the molecular basis of disease diversity on the bacterial side that determines the different disease types. UPEC are a geno- and phenotypically heterogeneous group of isolates restricted to a small number of O-serogroups that seem to represent different subclasses of facultative pathogens (3–5).

UPEC virulence factors are frequently encoded on pathogenicity islands (PAIs) (6–9). The two O6 strains 536 (pyelonephritis isolate) and CFT073 (urosepsis isolate) became generally accepted UPEC model organisms, and several PAIs of them have been described in detail (10–16). The complete CFT073 genome sequence shows a mosaic structure in terms of the distribution of backbone genes conserved in E. coli, and “foreign” genes, which presumably have been acquired horizontally (17). Genome comparison of CFT073, O157:H7 strain EDL933, and K-12 strain MG1655 revealed that only 39.2% of their combined set of proteins are common to all three strains (17–20), underlining the astonishing diversity among E. coli. Furthermore, the genome sequence of CFT073 revealed 1,623 strain-specific genes (21.2%). Comparison of both UPEC phenotypes and their genomes with other complete E. coli genome sequences should therefore help to identify sets of “UPEC-specific” and strain-specific proteins, respectively, that may form the basis of their different individual phenotypes and uropathogenic potential.

Results and Discussion

E. coli 536 Genome Sequence Determination and Comparative Analysis. The genome consists of a single circular chromosome of 4,938,875 bp. No plasmids were found. The 536 genome is 292 kb smaller than that of strain CFT073. Essentially, the additional DNA in CFT073 harbors genes of five cryptic prophages, which are absent from strain 536, as well as genes that are located in islands absent from other E. coli. The E. coli 536 genome contains one cryptic prophage region.

For the 536 genome, 4,747 putative coding sequences were predicted, ≈3,650 of which (77%) have highly similar orthologs in MG1655 (Fig. 1). From the remaining ORFs, 524 are also present in CFT073, which means that 89% of all ORFs of E. coli 536 have highly similar orthologs in the UPEC CFT073 genome (Table 1 and Table 4, which is published as supporting information on the PNAS web site). Further comparison with the genome sequences of enterohemorrhagic E. coli (EHEC) O157:H7 strains Sakai and EDL933 (18, 19) revealed ≈3,560 ORFs (75%) with highly similar orthologs in all published complete E. coli genomes. Of the remaining ORFs present in the genomes of strain 536 and at least one of these other four E. coli strains, 427 are mainly located within a region of the cryptic prophage or within the major PAIs of strain 536 (Table 5, which is published as supporting information on the PNAS web site).

Key features of the completely sequenced E. coli genomes are

Conflict of interest statement: No conflicts declared.

This paper was submitted directly (Track II) to the PNAS office.

Abbreviations: EHEC, enterohemorrhagic E. coli; ExPEC, extraintestinal pathogenic E. coli; IPEC, intestinal pathogenic E. coli; PAI, pathogenicity islands; PTS, phosphotransferase system; UPEC, uropathogenic E. coli; UTI, urinary tract infection.

Data deposition: The complete genome sequence of E. coli strain 536 has been deposited in the GenBank database (accession no. CP000247).

‖To whom correspondence should be addressed at: Institut fur Molekulare Infektionsbiologie, Rontgenring 11, D-97070 Wurzburg, Germany. E-mail: [email protected].

© 2006 by The National Academy of Sciences of the USA

www.pnas.org/cgi/doi/10.1073/pnas.0603038103 PNAS | August 22, 2006 | vol. 103 | no. 34 | 12879–12884

MICROBIOLOGY




Fig. 1.12
Minor groove
Major groove
360° = one helical turn
10.5 bp per turn
34.3° twist angle (rotation per residue)
3.4 Å axial rise
Base pair tilt: -6°
Helix pitch: 35.7 Å
Helix diameter: 20 Å
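The numbers on this slide are related by simple arithmetic, and a short sketch makes the relationships explicit. The function names are hypothetical; the only inputs are the two values quoted on the slide (10.5 bp per turn, 3.4 Å rise), and the 200 bp fragment length is borrowed from the straight/bent DNA slide purely as an illustration.

```python
# Values taken from the slide; everything else is derived from them.
BP_PER_TURN = 10.5    # base pairs per helical turn
AXIAL_RISE_A = 3.4    # axial rise per base pair, in angstroms


def twist_angle_deg(bp_per_turn=BP_PER_TURN):
    """Rotation per base pair: one full turn (360 degrees) spread over the repeat."""
    return 360.0 / bp_per_turn


def helix_pitch_angstrom(bp_per_turn=BP_PER_TURN, rise=AXIAL_RISE_A):
    """Length along the helix axis covered by one full turn."""
    return bp_per_turn * rise


def contour_length_angstrom(n_bp, rise=AXIAL_RISE_A):
    """Axial length of a straight fragment of n_bp base pairs."""
    return n_bp * rise


print(round(twist_angle_deg(), 1))               # 34.3 (degrees per base pair)
print(round(helix_pitch_angstrom(), 1))          # 35.7 (angstroms per turn)
print(round(contour_length_angstrom(200), 1))    # 680.0 (angstroms for 200 bp)
```

The twist and pitch reproduce the 34.3° and 35.7 Å quoted on the slide, so the figure's values are internally consistent.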



Marini gel


Fig. 2.1
A. Straight DNA (200 bp)
B. Bent DNA (200 bp)



Agarose gel



Fig. 2.5
Tilt and roll, each shown for two adjacent A=T base pairs



Fig. 2.6
Random bends
Bends phased at 10.5 bp
Bends phased at 16 bp
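Why phasing matters can be illustrated with a deliberately simplified model: treat every bend as a unit vector whose direction rotates with the helix, and add the vectors up. This planar vector sum is an assumption made for illustration only (it is not a curvature prediction method from the talk), and `net_bend` is a hypothetical name.

```python
import math

HELICAL_REPEAT_BP = 10.5  # bp per turn, as quoted earlier in the slides


def net_bend(spacing_bp, n_bends, repeat_bp=HELICAL_REPEAT_BP):
    """Magnitude of the vector sum of n_bends unit bends placed spacing_bp apart.

    Each bend points in a direction set by its rotational position on the
    helix, i.e. (position / repeat) full turns.
    """
    x = y = 0.0
    for k in range(n_bends):
        angle = 2.0 * math.pi * (k * spacing_bp) / repeat_bp
        x += math.cos(angle)
        y += math.sin(angle)
    return math.hypot(x, y)


in_phase = net_bend(10.5, 4)   # bends on the same face of the helix add up
out_of_phase = net_bend(16, 4)  # bends on nearly opposite faces largely cancel
print(round(in_phase, 3), round(out_of_phase, 3))
```

With four bends, the 10.5 bp spacing gives a net bend of 4 units, while the 16 bp spacing (about one and a half turns) leaves well under one unit, which is the qualitative contrast the figure draws.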



Fig. 2.3
CCCAAAAATGTCAAAAAATAGGCAAAAAATGCCAAAAATCCCAAC
(positions marked at 1, 11, 21, 31, 41)
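The phasing of the A-tracts in this sequence can be checked directly. The sketch below locates runs of A and measures the distance between their centers; the threshold of four or more consecutive A's is an assumption chosen for this example, and the function names are hypothetical.

```python
import re

# Sequence as shown on the slide (45 nt).
SEQ = "CCCAAAAATGTCAAAAAATAGGCAAAAAATGCCAAAAATCCCAAC"


def a_tracts(seq, min_len=4):
    """Return (start, length) for every run of at least min_len A's (start is 1-based)."""
    pattern = re.compile("A{%d,}" % min_len)
    return [(m.start() + 1, m.end() - m.start()) for m in pattern.finditer(seq)]


def center_spacings(tracts):
    """Distances between the centers of consecutive tracts, in base pairs."""
    centers = [start + (length - 1) / 2 for start, length in tracts]
    return [b - a for a, b in zip(centers, centers[1:])]


tracts = a_tracts(SEQ)
spacings = center_spacings(tracts)
print(tracts)                          # [(4, 5), (13, 6), (24, 6), (34, 5)]
print(spacings)                        # [9.5, 11.0, 9.5]
print(sum(spacings) / len(spacings))   # 10.0
```

The four tracts are spaced about 10 bp apart on average, close to the helical repeat, so each tract sits on roughly the same face of the helix.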



GENOME ATLAS: E. coli pO157, GenBank Accession AF074613, 92,077 bp
Center for Biological Sequence Analysis, http://www.cbs.dtu.dk/
Genetica, 108:47-51, 2000

Labelled genes: katP, espP, L7028, L7031, etpD, etpE, etpF, etpL, EHEC-hlyC, EHEC-hlyA, EHEC-hlyB, L7072, L7081, L7086, L7091, L7095, traI
Scale: 0k to 75k (ticks every 12.5k)

Legend:
Intrinsic Curvature (dev avg): 0.07 to 0.30
Stacking Energy (dev avg): -9.52 to -6.40
Position Preference (dev avg): 0.11 to 0.17
Annotations: CDS +, CDS -
Direct Repeats (fix avg): 5.00 to 7.50
Inverted Repeats (fix avg): 5.00 to 7.50
Percent AT (fix avg): 0.30 to 0.70
Resolution: 19



Promoter structural profile

Labels: -10 “TATA box”, -35, “UP”



[Plot: Ecoli_K-12_W3110_Main: Stacking energy. x-axis: Distance from translation start (-400 to 400). y-axis: Z-score (-0.05 to 0.25). Slide label: CURVATURE]



[Plot: Ecoli_K-12_W3110_Main: Structural Profile. x-axis: Distance from translation start (-400 to 400). y-axis: Z-score (-0.05 to 0.20). Curves: AT content, Position Preference, Stacking Energy, Intrinsic Curvature, DNAase sensitivity]



Promoter structural profile

[Diagram labels: -35, -10, +1, CDS, mRNA; RNA polymerase subunits α, σ, β / β’; regions marked “melts”, “rigid”, “cruciform”; annotation “DNA curvature, flexibility important here”]



GENOME ATLAS: rrsA, 20,000 bp

Labelled genes: ubiD >, trkH >, < fadB, rrsA >, rrlA >
Scale: 0k to 20k (ticks every 2.5k)

Lanes (fix avg, 0.00 to 0.50 unless noted):
A) sigma -0.030
B) sigma -0.035
C) sigma -0.040
D) Annotations: CDS +, CDS -, rRNA, tRNA
E) sigma -0.045
F) sigma -0.050
G) sigma -0.055
H) sigma -0.060
I) sigma -0.065
J) sigma -0.070
Resolution: 9



Fig. 2.10 and Fig. 2.11
"Refined" Junction Model

[Diagram labels: Tilt (sequences CCGG, CGGC, GGCC, GCCG); Roll + Tilt (A-tract, A and T base pairs); strand ends marked 5' and 3']



nucleosomes



chromatin



Trimmed 1



Trimmed 2



trinuc.     % Out
AAT/ATT    -0.280
AAA/TTT    -0.274
CCA/TGG    -0.246
AAC/GTT    -0.205
ACT/AGT    -0.183
CCG/CGG    -0.136
ATC/GAT    -0.110
AAG/CTT    -0.081
CGC/GCG    -0.077
AGG/CCT    -0.057
GAA/TTC    -0.037
ACG/CGT    -0.033
ACC/GGT    -0.032
GAC/GTC    -0.013
CCC/GGG    -0.012
ACA/TGT    -0.006
CGA/TCG    -0.003
GGA/TCC     0.013
CAA/TTG     0.015
AGC/GCT     0.017
GTA/TAC     0.025
AGA/TCT     0.027
CTC/GAG     0.031
CAC/GTG     0.040
TAA/TTA     0.068
GCA/TGC     0.076
CTA/TAG     0.090
GCC/GGC     0.107
ATG/CAT     0.134
CAG/CTG     0.175
ATA/TAT     0.182
TCA/TGA     0.194

“Travers” trinucleotide scale:

Satchwell,S.C., Drew,H.R., and Travers,A.A., "Sequence periodicities in chicken nucleosome core DNA", J. Mol. Biol., 191:659-675, (1986).

Baldi,P., Brunak,S., Chauvin,Y., and Krogh,A., "Naturally occurring nucleosome positioning signals in human exons and introns", J. Mol. Biol., 263:503-510, (1996).

Table labels: Low position pref. (values near zero); High position pref. (large values of either sign)

NOTE: we use a (slight) modification, in which the absolute value (magnitude) of the values is used to reflect trinucleotides which tend to exclude nucleosomes.
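As a rough sketch of how the modified scale could be applied, the code below looks up each overlapping trinucleotide in a sequence and takes the absolute value, as the note describes. The table values are copied from the slide; the function names, the sliding step of one base, and the toy input sequence are assumptions for illustration, not the atlas implementation.

```python
# One entry per reverse-complement pair, values as listed on the slide.
TRAVERS_PAIRS = {
    "AAT": -0.280, "AAA": -0.274, "CCA": -0.246, "AAC": -0.205,
    "ACT": -0.183, "CCG": -0.136, "ATC": -0.110, "AAG": -0.081,
    "CGC": -0.077, "AGG": -0.057, "GAA": -0.037, "ACG": -0.033,
    "ACC": -0.032, "GAC": -0.013, "CCC": -0.012, "ACA": -0.006,
    "CGA": -0.003, "GGA": 0.013, "CAA": 0.015, "AGC": 0.017,
    "GTA": 0.025, "AGA": 0.027, "CTC": 0.031, "CAC": 0.040,
    "TAA": 0.068, "GCA": 0.076, "CTA": 0.090, "GCC": 0.107,
    "ATG": 0.134, "CAG": 0.175, "ATA": 0.182, "TCA": 0.194,
}

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


# Expand to all 64 trinucleotides: each pair shares one value.
SCALE = {}
for tri, value in TRAVERS_PAIRS.items():
    SCALE[tri] = value
    SCALE[reverse_complement(tri)] = value


def position_preference(seq):
    """Absolute scale value for every overlapping trinucleotide in seq."""
    return [abs(SCALE[seq[i:i + 3]]) for i in range(len(seq) - 2)]


def mean_position_preference(seq):
    """Average of the per-trinucleotide magnitudes (assumes len(seq) >= 3)."""
    values = position_preference(seq)
    return sum(values) / len(values)


print(position_preference("AAAT"))   # [0.274, 0.28]
```

Because magnitudes are used, AAA/TTT (strongly negative on the original scale) and TCA/TGA (strongly positive) both score high, while trinucleotides near zero such as CGA/TCG score low.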



A genomic code for nucleosome positioning

Eran Segal¹, Yvonne Fondufe-Mittendorf², Lingyi Chen², AnnChristine Thastrom², Yair Field¹, Irene K. Moore², Ji-Ping Z. Wang³ & Jonathan Widom²

Eukaryotic genomes are packaged into nucleosome particles that occlude the DNA from interacting with most DNA binding proteins. Nucleosomes have higher affinity for particular DNA sequences, reflecting the ability of the sequence to bend sharply, as required by the nucleosome structure. However, it is not known whether these sequence preferences have a significant influence on nucleosome position in vivo, and thus regulate the access of other proteins to DNA. Here we isolated nucleosome-bound sequences at high resolution from yeast and used these sequences in a new computational approach to construct and validate experimentally a nucleosome–DNA interaction model, and to predict the genome-wide organization of nucleosomes. Our results demonstrate that genomes encode an intrinsic nucleosome organization and that this intrinsic organization can explain ~50% of the in vivo nucleosome positions. This nucleosome positioning code may facilitate specific chromosome functions including transcription factor binding, transcription initiation, and even remodelling of the nucleosomes themselves.

Eukaryotic genomic DNA exists as highly compacted nucleosome arrays called chromatin. Each nucleosome contains a 147-base-pair (bp) stretch of DNA, which is sharply bent and tightly wrapped around a histone protein octamer¹. This sharp bending occurs at every DNA helical repeat (~10 bp), when the major groove of the DNA faces inwards towards the histone octamer, and again ~5 bp away, with opposite direction, when the major groove faces outward. Bends of each direction are facilitated by specific dinucleotides²,³. Neighbouring nucleosomes are separated from each other by 10–50-bp-long stretches of unwrapped linker DNA⁴; thus, 75–90% of genomic DNA is wrapped in nucleosomes. Access to DNA wrapped in a nucleosome is occluded¹ for polymerase, regulatory, repair and recombination complexes, yet nucleosomes also recruit other proteins through interactions with their histone tail domains⁵. Thus, the detailed locations of nucleosomes along the DNA may have important inhibitory or facilitatory roles⁶,⁷ in regulating gene expression.

DNA sequences differ greatly in their ability to bend sharply²,³,⁸. Consequently, the ability of the histone octamer to wrap differing DNA sequences into nucleosomes is highly dependent on the specific DNA sequence⁹,¹⁰. In vitro studies show this range of affinities to be 1,000-fold or greater¹¹. Thus, nucleosomes have substantial DNA sequence preferences. A key question is whether genomes use these sequence preferences to control the distribution of nucleosomes in vivo in a way that strongly impacts on the ability of DNA binding proteins to access particular binding sites. By controlling binding site accessibility in this way, genomes could, for example, target the binding of transcription factors towards appropriate sites and away from irrelevant, non-functional sites⁹.

One view is that the sequence preferences of nucleosomes might not be meaningful. Nucleosome positions might be regulated in cells in trans by the abundant¹² ATP-dependent nucleosome remodelling complexes¹³, which might over-ride the sequence preferences of nucleosomes and move them to new locations whenever needed. Another view, however, is that remodelling factors do not themselves determine the destinations of the nucleosomes that they mobilize. Rather, the remodelling complexes may allow nucleosomes to sample alternative positions rapidly, resulting in a thermodynamic equilibrium between the nucleosomes and the site-specific DNA binding proteins that compete with nucleosomes for occupancy along the genome. In this view, nucleosome positions are regulated in cis by their intrinsic sequence preferences, which would then have significant regulatory roles. In this cis regulation model, we expect the genome to encode a nucleosome organization, intrinsic to the DNA sequence alone, comprising sequences with both low and high affinity for nucleosomes. Many of the high-affinity sequences should then be occupied by nucleosomes in vivo. Moreover, the detailed distribution of nucleosome positions encoded by the genome should significantly influence chromosome functions genome-wide.

Here we report the results of a combined experimental and computational approach to detect the DNA sequence preferences of nucleosomes and the intrinsic nucleosome organization of the genome that these preferences dictate. Our findings demonstrate that eukaryotic genomes use a nucleosome positioning code, and link the resulting nucleosome positions to specific chromosome functions.

Validating a nucleosome–DNA interaction model

To construct a model for nucleosome–DNA interactions in yeast (Fig. 1a), we used a genome-wide assay to isolate DNA regions that were stably wrapped in nucleosomes. Our experimental method maps nucleosomes on the yeast genome with greater accuracy than previous approaches, resulting in a set of 199 mononucleosome DNA sequences of length 142–152 bp (Supplementary Fig. 1). We used this collection of sequences to construct a probabilistic model that represents the DNA sequence preferences of yeast nucleosomes (Supplementary Fig. 2). Our approach resembles that used for representing the binding specificities of transcription factors from a collection of known sites, but with two main distinctions: first, in contrast to the mononucleotide probability distributions used for

ARTICLES

¹Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot 76100, Israel. ²Department of Biochemistry, Molecular Biology and Cell Biology, Northwestern University, 2153 Sheridan Road, Evanston, Illinois 60208, USA. ³Department of Statistics, Northwestern University, 2006 Sheridan Road, Evanston, Illinois 60208, USA.

Vol 442 | 17 August 2006 | doi:10.1038/nature04979

772 © 2006 Nature Publishing Group



July 25, 2006

Scientists Say They’ve Found a Code Beyond Genetics in DNA

By NICHOLAS WADE

Researchers believe they have found a second code in DNA in addition to the genetic code.

The genetic code specifies all the proteins that a cell makes. The second code, superimposed on the first, sets the placement of the nucleosomes, miniature protein spools around which the DNA is looped. The spools both protect and control access to the DNA itself.

The discovery, if confirmed, could open new insights into the higher order control of the genes, like the critical but still mysterious process by which each type of human cell is allowed to activate the genes it needs but cannot access the genes used by other types of cell.

The new code is described in the current issue of Nature by Eran Segal of the Weizmann Institute in Israel and Jonathan Widom of Northwestern University in Illinois and their colleagues.

There are about 30 million nucleosomes in each human cell. So many are needed because the DNA strand wraps around each one only 1.65 times, in a twist containing 147 of its units, and the DNA molecule in a single chromosome can be up to 225 million units in length.

Biologists have suspected for years that some positions on the DNA, notably those where it bends most easily, might be more favorable for nucleosomes than others, but no overall pattern was apparent. Drs. Segal and Widom analyzed the sequence at some 200 sites in the yeast genome where nucleosomes are known to bind, and discovered that there is indeed a hidden pattern.

Knowing the pattern, they were able to predict the placement of about 50 percent of the nucleosomes in other organisms.

The pattern is a combination of sequences that makes it easier for the DNA to bend itself and wrap tightly around a nucleosome. But the pattern requires only some of the sequences to be present in each nucleosome binding site, so it is not obvious. The looseness of its requirements is presumably the reason it does not conflict with the genetic code, which also has a little bit of redundancy or wiggle room built into it.

Having the sequence of units in DNA determine the placement of nucleosomes would explain a puzzling feature of transcription factors, the proteins that activate genes. The transcription factors recognize short sequences of DNA, about six to eight units in length, which lie just in front of the gene to be transcribed.

[…]f the nucleosome code was “a profound insight if true,” because it would explain many aspects of how the DNA is controlled.



S. pombe - 3 chromosomes

Nature, 415:871-880, (2002).



S. pombe chr. 1

DNA Conformation and Transcription, chapt14, (edited by Takashi Ohyama, 2004).



S. pombe zoom

DNA Conformation and Transcription, chapt14, (edited by Takashi Ohyama, 2004).



GENOME ATLAS: E. coli K-12 isolate MG1655, 4,639,221 bp

Marked features: Origin, Terminus; rhsA, rhsB, rhsC, rhsD, rhsE; phnM; yagF; rrsA, rrsB, rrsC, rrsD, rrsE, rrsG, rrsH
Scale: 0M to 4M (ticks every 0.5M)

Legend:
(lane label missing): 0.16 to 0.21
(lane label missing): -8.74 to -7.63
(lane label missing): 0.14 to 0.16
Annotations: CDS +, CDS -, rRNA, tRNA
Watson Repeats (fix avg): 5.00 to 7.50
Crick Repeats (fix avg): 5.00 to 7.50
GC Skew (fix avg): -0.03 to 0.03
Percent AT (fix avg): 0.45 to 0.55
Resolution: 1856

GENOME ATLAS: E. coli, 4,639,221 bp
J. Mol. Biol, 299:907-930, (2000).

Labelled genes: ygeG, rfaJ, rpoB, flu, ycgZ, rpsQ, b2651, phnI, rpsO, nohB, yagG, atpF, lpdA, ydfO, b0725, yjhA, yhiF, yi21_2, b0373
Scale: 0M to 4M (ticks every 0.5M)

Legend:
(lane label missing): 0.16 to 0.21
(lane label missing): -8.66 to -7.71
(lane label missing): 0.14 to 0.16
Annotations: CDS +, CDS -, rRNA, tRNA
Global Direct Repeats (fix avg): 5.00 to 7.50
Global Inverted Repeats (fix avg): 5.00 to 7.50
GC Skew (dev avg): -0.05 to 0.05
Percent AT (fix avg): 0.20 to 0.80
Resolution: 1856



Structural cluster analysis

In order to investigate whether there is a connection between the predicted structural characteristics of a gene on one hand, and its function on the other hand, we performed a cluster analysis based on the five structural parameters. Briefly, we calculated the average of each measure for all 5000 bp windows centered on RNA or protein-encoding genes, normalized the values (based on the genomic average and standard deviation), and then used the resulting five values to cluster the windows. We thus treat each gene-centered window as a point in a five-dimensional "structure space", and use the Euclidean distance between them as a simple measure of structural similarity (see Methods for more details on clustering). Figure 7 is a distance tree that summarize the overall topology of this space. In this plot, all genes have been divided into 11 clusters and the tree shows the relative position of the mid-points of these (the five coordinates of each centroid, in standard deviation units, is given in the legend to Figure 7). At the base of each branch, the number of genes in that cluster is indicated.

To analyze whether there is a connection between the tree structure and gene functionality, we used word-analysis software that we have developed for investigation of yeast promoters (Jensen & Knudsen, 2000). Briefly, the approach was as follows. From the GenBank file we first collected all the functional annotations for each gene. For each cluster we then divided the annotation into two groups: one containing the annotation for the investigated cluster (the positive set) and another group containing the annotation for all the remaining genes (the negative set). By counting word frequencies in the two sets and using hypergeometric statistics, it was then possible to find annotation keywords that are significantly over-represented in the positive set (Jensen & Knudsen, 2000). On Figure 7 we have indicated the most significant words found in this way.

As it can be seen, several groups did display significant over-representation of keywords. In some cases a cluster contains only a few genes, most of which belong to one or more operons of related function. For example, among the 20 genes in cluster 11, ten belong to the phn operon, giving the entire cluster an over-representation of the keywords "phosphonate metabolism". Another example is cluster 4 which contains a number of genes from two different groups of genes that each are involved in lipopolysaccharide synthesis (the rfa and wbb genes). Cluster 4 is characterized by extreme values for all five measures, and displays a very high level of curvature, very low flexibility and extremely high stacking energy (corresponding to unstable DNA). Cluster 5 is structurally similar to cluster 4 and contains three additional rfa genes. Interestingly, E. coli strains lacking the HU protein, display a phenotype that resembles the deep-rough phenotype seen in rfa mutants (Painbeni et al., 1997). Since HU is known to interact with curved or kinked DNA, and since most of the rfa operon displays extreme structural properties (including very high curvature), it is tempting to suggest that the interaction between HU and this chromosomal region plays a role in expression of the genes, although it has been reported that apparently the HU-deficient mutants do not have a truncated lipopolysaccharide (Painbeni et al., 1997).

Cluster 10 is adjacent to clusters 4 and 5, and displays similar (but less extreme) structural features of high curvature, low flexibility, and high stacking energy. Consistently, ten of the 20 extreme 1000 bp regions discussed above (and which displayed the same structural features; Table 1) are included in one of these three clusters. In agreement with the genome-wide trend for higher curvature near the terminus, the genes in cluster 10 show a weak tendency to be located in this region (data not shown). This fact is also consistent with the observation that the keyword "phage" is over-represented in cluster 10 (20 of the 104 genes have the word in their annotation), since the terminus region is known to contain many phage and transposon-related genes (Hill, 1996).

Figure 7. Structural cluster analysis. Distance tree showing the relative location of 11 gene clusters based on average structural measures. The number of genes in each cluster is indicated at the base of the branch. Significantly over-presented annotation keywords are indicated at the end of branches. Names of genes mentioned in the text are indicated in parentheses. The centroid coordinates of the 11 clusters are given below (values are normalized and are in units of standard deviations from genomic average. They are listed in the following order: curvature, DNase I, position preference, deformability, and stacking energy).
Cluster 1: 0.7, -0.6, 0.3, -0.6, 0.6.
Cluster 2: -0.6, 0.3, 0.1, 0.6, -0.6.
Cluster 3: -0.3, 0.9, -1.6, -0.3, 0.1.
Cluster 4: 3.3, -2.4, 2.1, -4.4, 4.6.
Cluster 5: 3.0, -3.3, 2.2, -2.8, 3.1.
Cluster 6: -0.5, 2.6, -3.3, 0.7, 0.5.
Cluster 7: -1.6, 1.1, -4.6, 0.1, -0.8.
Cluster 8: -1.7, 2.0, -1.7, 0.8, -0.9.
Cluster 9: -2.1, 3.9, -2.8, 0.7, -0.6.
Cluster 10: 2.3, -1.3, 0.7, -2.8, 2.8.
Cluster 11: -2.9, 2.2, 1.1, 2.7, -2.8.

DNA Structural Atlas for Escherichia coli

J. Mol. Biol. 299:907-930 (2000).
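The "structure space" idea in the excerpt above (normalize each measure against the genomic mean and standard deviation, then compare points by Euclidean distance) can be sketched in a few lines. The centroid values below are transcribed from the Figure 7 legend as recovered from this transcript, with the assumption that the symbol preceding some values is a minus sign. The helper names `zscore`, `euclidean`, and `nearest_clusters` are hypothetical, and this is not the clustering procedure from the paper's Methods, only a distance comparison between the published centroids.

```python
import math

# Centroids in standard-deviation units, in the order:
# curvature, DNase I, position preference, deformability, stacking energy.
# Transcribed from the Figure 7 legend (signs are an assumption, see lead-in).
CENTROIDS = {
    1: (0.7, -0.6, 0.3, -0.6, 0.6),
    2: (-0.6, 0.3, 0.1, 0.6, -0.6),
    3: (-0.3, 0.9, -1.6, -0.3, 0.1),
    4: (3.3, -2.4, 2.1, -4.4, 4.6),
    5: (3.0, -3.3, 2.2, -2.8, 3.1),
    6: (-0.5, 2.6, -3.3, 0.7, 0.5),
    7: (-1.6, 1.1, -4.6, 0.1, -0.8),
    8: (-1.7, 2.0, -1.7, 0.8, -0.9),
    9: (-2.1, 3.9, -2.8, 0.7, -0.6),
    10: (2.3, -1.3, 0.7, -2.8, 2.8),
    11: (-2.9, 2.2, 1.1, 2.7, -2.8),
}


def zscore(value, genomic_mean, genomic_sd):
    """Normalize one window average against the genome-wide mean and SD."""
    return (value - genomic_mean) / genomic_sd


def euclidean(p, q):
    """Euclidean distance between two points in the five-dimensional space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def nearest_clusters(cluster_id, k=2):
    """Return the ids of the k clusters whose centroids lie closest."""
    target = CENTROIDS[cluster_id]
    ranked = sorted(
        (euclidean(target, centroid), cid)
        for cid, centroid in CENTROIDS.items()
        if cid != cluster_id
    )
    return [cid for _, cid in ranked[:k]]


if __name__ == "__main__":
    print(nearest_clusters(10))
```

With these values, the two centroids closest to cluster 10 are those of clusters 5 and 4, which agrees with the excerpt's statement that cluster 10 is adjacent to clusters 4 and 5.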

low position preference
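The keyword analysis in the excerpt relies on hypergeometric statistics: given how many genes in the whole genome carry a keyword, how surprising is the count seen inside one cluster? A minimal sketch of the upper-tail probability follows. The function name `hypergeom_upper_tail` is hypothetical, and in the example call only the cluster size (104) and keyword count (20) come from the excerpt; the genome-wide totals are placeholder assumptions, since the slides do not give them.

```python
from math import comb


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) for a hypergeometric variable.

    population: total number of genes
    successes:  genes in the population annotated with the keyword
    draws:      genes in the cluster (the positive set)
    k:          genes in the cluster annotated with the keyword
    """
    total = comb(population, draws)
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


if __name__ == "__main__":
    # 104 genes in cluster 10, 20 of them annotated "phage" (from the excerpt).
    # 4400 genes in total and 150 "phage" genes genome-wide are HYPOTHETICAL.
    p = hypergeom_upper_tail(20, population=4400, successes=150, draws=104)
    print(f"P(X >= 20) = {p:.3g}")
```

A small p-value here would mark the keyword as over-represented in the cluster; correcting for the many keywords tested is a separate step not shown.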



[Figure: four histogram panels titled "Histogram of log(affy.ex[, i])". x-axis: log(affy.ex[, i]), ticks 7 to 10; y-axis: Density, 0.0 to 0.6.]

Highly Expressed Genes
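The histogram panels above plot the density of log-transformed expression values. As a rough illustration of what such a panel computes, the sketch below takes the natural log of each intensity, bins the results, and scales the counts so that the bar areas sum to 1. The function name and the input values are hypothetical; the bins here are closed on the left, which may differ from the plotting defaults used for the original figure.

```python
import math


def log_density_histogram(values, bin_edges):
    """Density histogram of natural-log values over the given bin edges.

    Bins are [lo, hi) except the last, which also includes its upper edge.
    Values whose log falls outside the edges are ignored.
    """
    logs = [math.log(v) for v in values if v > 0]
    n_bins = len(bin_edges) - 1
    counts = [0] * n_bins
    for x in logs:
        for i in range(n_bins):
            lo, hi = bin_edges[i], bin_edges[i + 1]
            is_last = i == n_bins - 1
            if lo <= x < hi or (is_last and x == hi):
                counts[i] += 1
                break
    n = sum(counts)
    if n == 0:
        return [0.0] * n_bins
    # Divide by (count total * bin width) so the bar areas sum to 1.
    return [
        counts[i] / (n * (bin_edges[i + 1] - bin_edges[i]))
        for i in range(n_bins)
    ]


if __name__ == "__main__":
    # Hypothetical raw intensities chosen to fall in the 7 to 10 log range.
    intensities = [math.exp(x) for x in (7.5, 8.5, 8.6, 9.5)]
    print(log_density_histogram(intensities, [7, 8, 9, 10]))
```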



niceDogma


Biochimie 83:201-212 (2001).



Summary

1. DNA sequence → DNA structure → Function

2. DNA curvature is caused mainly by phased A-tracts.

3. Where you are in the chromosome is important!
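Point 2 of the summary can be made concrete with a toy scan for phased A-tracts. The sketch below uses a simplified definition (a run of at least four A's or at least four T's) and checks whether consecutive tract centers are spaced close to one helical turn, taken here as roughly 10.5 bp. The function names, the run-length threshold, and the tolerance are illustrative assumptions; this is not a curvature model, only a way to see what "phased" means.

```python
import re

HELICAL_REPEAT = 10.5  # approximate base pairs per helical turn of B-DNA


def find_a_tracts(seq, min_len=4):
    """Return (start, end) pairs for runs of >= min_len A's or >= min_len T's."""
    pattern = re.compile(r"A{%d,}|T{%d,}" % (min_len, min_len))
    return [(m.start(), m.end()) for m in pattern.finditer(seq.upper())]


def tract_spacings(tracts):
    """Center-to-center distances (bp) between consecutive tracts."""
    centers = [(start + end - 1) / 2 for start, end in tracts]
    return [b - a for a, b in zip(centers, centers[1:])]


def is_phased(spacings, tolerance=1.0):
    """True if every spacing is within `tolerance` bp of one helical turn."""
    return bool(spacings) and all(
        abs(d - HELICAL_REPEAT) <= tolerance for d in spacings
    )


if __name__ == "__main__":
    in_phase = "AAAAAGCGCG" * 4            # A5 tract every 10 bp
    out_of_phase = "AAAAAGCGCGCGCGC" * 4   # A5 tract every 15 bp
    print(is_phased(tract_spacings(find_a_tracts(in_phase))))
    print(is_phased(tract_spacings(find_a_tracts(out_of_phase))))
```

In the first toy sequence the small bends from each tract recur on the same face of the helix and can add up; in the second they recur at changing rotational positions and tend to cancel.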

