MINIREVIEWS
Comparison of 61 Sequenced Escherichia coli Genomes
Oksana Lukjancenko & Trudy M. Wassenaar &
David W. Ussery
Received: 24 February 2010 /Accepted: 23 June 2010# The Author(s) 2010. This article is published with open access at Springerlink.com
Abstract Escherichia coli is an important component ofthe biosphere and is an ideal model for studies of processesinvolved in bacterial genome evolution. Sixty-one publicallyavailable E. coli and Shigella spp. sequenced genomes arecompared, using basic methods to produce phylogenetic andproteomics trees, and to identify the pan- and core genomesof this set of sequenced strains. A hierarchical clustering ofvariable genes allowed clear separation of the strains intoclusters, including known pathotypes; clinically relevantserotypes can also be resolved in this way. In contrast,when in silico MLST was performed, many of the variousstrains appear jumbled and less well resolved. The predictedpan-genome comprises 15,741 gene families, and only 993(6%) of the families are represented in every genome,comprising the core genome. The variable or ‘accessory’genes thus make up more than 90% of the pan-genomeand about 80% of a typical genome; some of thesevariable genes tend to be co-localized on genomicislands. The diversity within the species E. coli, and theoverlap in gene content between this and related species,suggests a continuum rather than sharp species borders inthis group of Enterobacteriaceae.
Introduction
The availability of complete genome sequences frommultiple isolates of a given species has opened up a wholenew range of research strategies. By far the best-studiedbacterial species is Escherichia coli, and the highestnumber of individual genome sequences is available forthis species, which has been the working horse ofbacteriology for as long as the specialization exists.Numerous basic molecular processes have been firstcharacterized and extensively studied in E. coli, leading toinsights that could subsequently be applied to other bacteria[47]. Despite the vast amount of knowledge alreadyavailable for E. coli, based on decades of experimentalresearch, genetic manipulation and, more recently, obser-vations based on single or multiple genome sequences,comparison of a large number of E. coli genome sequencescan still provide novel insights, such as the presence ofgenomic islands, present in some pathogenicity groups, butmissing in others. At the time of writing, there are morethan 100 E. coli genome sequence projects reported, manyof which have been deposited to GenBank. Here, wecompare 61 publically available genome sequences of E.coli and Shigella spp. isolates.
Escherichia spp. and Shigella spp. are Gram-negative,facultative anaerobic, intestinal bacteria belonging to theEnterobacteriaceae, which are taxonomically placed within thegamma subdivision of the Proteobacteria phylum. AlthoughShigella spp. isolates have been rewarded their own genus,which is divided into several species (representing differentsero-groups), its separation from Escherichia spp. is mainlyhistorical. For example, in Bergey’s Manual of SystematicBacteriology, the section on Shigella phylogeny begins withthe following sentence: “Scientific evidence accumulated todate strongly supports that view that Shigella species are
O. Lukjancenko : T. M. Wassenaar :D. W. Ussery (*)Center for Biological Sequence Analysis, Building 208,Department of Systems Biology,The Technical University of Denmark,2800 Kgs. Lyngby, Denmarke-mail: [email protected]
T. M. WassenaarMolecular Microbiology and Genomics Consultants,Tannenstrasse 7(55576 Zotzenheim, Germany
Microb EcolDOI 10.1007/s00248-010-9717-3
biotypes/pathotypes or clones of E. coli” [39]. More than50 years ago it was observed that Shigella spp. and E. colihave the same fertility system [25]; in 1972, Brenner et al. [3]found that based on DNA/DNA hybridization, that Shigellaspp. and E. coli are the same species. Experiments withmultilocus enzyme electrophoresis concluded that nearly allof the Shigella species are clones from within E. coli species[35]. Further, analysis of 16S rRNA sequence alignmentplaces Shigella spp. within E. coli [6]. Thus, all currentevidence indicates that Shigella spp. should be classified as E.coli [23, 36]. Both genera contain highly diverse species,although Shigella spp. are as related to E. coli as they are toeach other. E. coli is a ubiquitous component of the intestinalgut flora of animals including humans, and can survive andmultiply in abiotic environments as well. The speciescomprises both benign and pathogenic variants, whilstShigella spp. are all enteropathogens in mammals.
E. coli isolates have in the past been divided intosubgroups in various ways. Based on established patho-genicity towards the human host, pathogenic versuscommensal E. coli have been recognized, although it isacknowledged that 'pathogenic' E. coli strains maycolonize other animal species asymptomatically. Patho-genic E. coli have further been subdivided according totheir typical site of infection and clinical manifestations inhumans, for instance enteropathogenic, uropathogenic, orextra-intestinal pathogenic E. coli, or based on theirvirulence mechanisms, such as enterohemorragic (EHEC),enterotoxigenic, enteroinvasive, and enteroaggregative E.coli [1, 13]. Other divisions that are frequently used arebased on serology (e.g., serotypes O127:H7 or K12) or,mainly for population genetic purposes, on phylogeneticproperties of particular housekeeping genes, as establishedby MULTI-ENZYME electrophoresis and later by multi-locus sequence typing (MLST) [25]. Finally, some isolatesare described simply for their source of isolation, such asenvironmental isolates or avian pathogenic E. coli.
All these subdivisions have been applied more or lessfrequently to group isolates that share particular features.We were interested to see if any of these groupings wouldhold when isolates were compared based on their completegenome sequences, considering some or all of their genes.Isolates from some groups (based on whatever grouping)have been more frequently sequenced than from others, andcomplete information on all characteristics of interest(pathogenicity, source of isolation, serotype) is not avail-able for all sequenced isolates. Despite these recognizedshortcomings in sampling bias and recorded information,comparison of these 61 genome sequences revealed thatneither the 16S gene, nor gene fragments usually used forMLST, provides biologically meaningful information on therelatedness of the sequenced isolates. The best way toanalyze this is by taking into account all the genomic
content, rather than looking at one or a few individualgenes. The E. coli core genome has been previouslyreported to be less than half the genes [13], with morethan half the E. coli genes in any given genome beingfound in some strains, but missing in others. Many of thesevariable genes can be clustered to specific regions, locatedon genomic islands in an E. coli chromosome.
Materials and Methods
Bacterial Genomes and Gene Annotations
Sixty-one bacterial genomes of E. coli and Shigella spp.were used in this study (Table 1). Of these, 39 fullysequenced genomes and 19 genomes for which thesequence was still in progress at the time of extractionwere obtained from GenBank (1). Sequence from E. coliO103 Oslo was obtained from Norwegian VeterinaryInstitute and sequences from strains LANL ECA andLANL ECF were obtained from Los Alamos NationalLab. Genome sequences of Escherichia albertii, Escherichiafergusonii, and Salmonella enterica Typhimurium LT2 wereincluded for comparison (Table 1). The ‘quality score’ foreach genome is given in Table 1, based on the suggestedscale by Chain et al. [4]. A completely sequenced genomethat has been deposited to GenBank is given a score of ‘1’,with the only exception being E. coli O157:H7 isolateEDL933, which currently has more than 4,000 “N’s” in theDNA sequence of the GenBank file, representing unfilledgaps along the chromsomal sequence—hence, this genome isgiven a lower score of ‘2’. The higher scores represent lowerquality (and often more contigs, or pieces of the DNA,although sequence quality is not measured only by this, asdescribed in [4]).
16S Ribosomal RNA Analysis
The sequences encoding 16S ribosomal RNA wereextracted from the analyzed genomes using RNAmmer[22]; sequences with an RNAmmer score above 1,400 wereconsidered reliable and were kept for analysis. From everygenome, the gene with highest similarity to rrsH of E. coliK12 MG1655 was selected and these sequences werealigned using ClustalX [24]. A phylogenic tree wasgenerated by ClustalX using the Bootstrap neighborhood-joining method, showing the bootstrap values at branchpoints, visualized by NJPlot [34].
In Silico MLST
The alleles for seven housekeeping genes used for MLST ofvarious species (www.mlst.net) were analyzed. These were
O. Lukjancenko et al.
Tab
le1
Genom
esused
inthisstud
y
GPID
Strain
Size(bp)
Noof
genes
Genedensity
(genes/Kbp
)Noof
contigs
Accession
number
Quality
score
Patho
type,serotype,
othercharacteristics
Reference
225
E.coliK12
MG16
554,63
9,67
54,14
90.89
41
U00
096
1Com
mensal,K12
[2]
226
E.coli01
57:H7Sakai
5,59
4,47
75,23
00.93
43
BA00
0007
1EHEC,O15
7:H7
[15]
259
E.coli01
57:H7EDL93
35,62
0,52
25,31
20.94
52
AE00
5174
2EHEC,O15
7:H7
[33]
313
E.coliCFT07
35,23
1,42
85,33
91.02
01
AE01
4075
1UPEC,O6:K2:H1
[45]
1395
9E.coliHS
4,64
3,53
84,37
80.94
21
CP00
0802
1Com
mensal,O9
[37]
1396
0E.coliE24
377A
5,24
9,28
84,74
90.90
47
CP00
0800
1ETEC,O13
9:H28
[37]
1557
2E.coliB7A
5,30
0,24
24,64
80.87
628
9AAJT00
0000
004
ETEC,O14
8:H28
[37]
1557
6E.coliF11
5,21
5,96
14,70
40.90
1119
AAJU
0000
0000
3ExP
EC,O6:H31
[37]
1557
7E.coliE22
5,52
8,23
85,10
50.92
312
7AAJV
0000
0000
3EPEC,O10
3:H2
[37]
1557
8E.coliE1100
195,37
6,211
4,93
40.91
713
7AAJW
0000
0000
3EPEC,O111:H9
[37]
1563
9E.coli53
638
5,37
1,79
04,80
30.89
44
AAKB00
0000
002
EIEC,O14
4TIG
R
1619
3E.coli10
1-1
4,97
9,76
74,60
70.92
591
AAMK00
0000
003
EAEC,O-:H10
[18,
37]
1623
5E.coli53
64,93
8,92
04,62
00.93
51
CP00
0247
1UPEC,O6:K15
:H31
[7]
1625
9E.coliUTI89
5,17
9,97
15,02
10.96
92
CP00
0243
1UPEC
[5]
1635
1E.coliK12
W3110
4,64
6,33
24,22
60.90
91
AP00
9048
1Com
mensal,K12
[16]
1671
8E.coliAPECO1
5,49
7,65
34,42
80.80
13
CP00
0468
1APEC,O1:K1:H7
[21]
1808
3E.coliATCC87
394,74
6,21
84,20
00.88
41
CP00
0946
1K12
derivativ
eDOEJG
I-PGF
1805
7E.coliSE11
5,15
5,62
64,67
90.90
77
AP00
9240
1Com
mensal,O15
2:H28
[32]
1828
1E.coliB
str.REL60
64,62
9,81
24,20
50.90
81
CP00
0819
1Com
mensal,strain
B[19]
1905
3E.coliSE15
4,83
9,68
34,48
80.92
72
AP00
9378
1Com
mensal,O15
0:H5
[42]
1946
9E.coliSMS-3-5
5,21
5,37
74,74
30.90
95
CP00
0970
1Env
iron
mentalisolate
[11]
2007
9E.coliK12
DH10
B4,68
6,13
74,12
60.88
01
CP00
0948
1K12
derivativ
e[8]
2071
3E.coliBL21
(DE3)
4,55
7,50
84,15
70.91
21
CP00
1509
1Com
mensal,strain
Bph
ylog
roup
Korea
Research
Institu
reof
Bioscienceand
Biotechno
logy
2773
7E.coli01
57:H7EC40
425,61
7,72
85,23
20.93
14
ABHM00
0000
002
EHEC,O15
7:H7
J.Craig
VenterInstitu
te
2773
3E.coli01
57:H7EC40
455,66
0,95
85,34
30.94
38
ABHL00
0000
002
EHEC,O15
7:H7
J.Craig
VenterInstitu
te
2774
5E.coli01
57:H7EC40
765,70
5,64
55,34
20.93
613
5ABHQ00
0000
002
EHEC,O15
7:H7
J.Craig
VenterInstitu
te
2774
3E.coli01
57:H7EC4113
5,65
5,84
75,00
50.88
423
1ABHP00
0000
003
EHEC,O15
7:H7
J.Craig
VenterInstitu
te
2773
9E.coli01
57:H7EC4115
5,70
4,17
15,31
50.93
11
CP00
1164
1EHEC,O15
7:H7
J.Craig
VenterInstitu
te
2774
1E.coli01
57:H7EC41
965,62
0,60
65,07
20.90
218
6ABHO00
0000
003
EHEC,O15
7:H7
J.Craig
VenterInstitu
te
2773
5E.coli01
57:H7EC42
065,62
9,93
25,20
20.92
37
ABHK00
0000
002
EHEC,O15
7:H7
J.Craig
VenterInstitu
te
2774
9E.coli01
57:H7EC44
015,73
3,13
35,18
50.90
418
6ABHR00
0000
003
EHEC,O15
7:H7
J.Craig
VenterInstitu
te
2775
1E.coli01
57:H7EC44
865,93
3,16
65,42
90.91
516
5ABHS00
0000
003
EHEC,O15
7:H7
J.Craig
VenterInstitu
te
2775
3E.coli01
57:H7EC45
015,67
7,18
15,12
40.90
225
0ABHT00
0000
004
EHEC,O15
7:H7
J.Craig
VenterInstitu
te
2775
5E.coli01
57:H7EC50
85,65
6,66
65,01
90.88
727
2ABHW00
0000
004
EHEC,O15
7:H7
J.Craig
VenterInstitu
te
Comparison of 61 Sequenced Escherichia coli Genomes
Tab
le1
(con
tinued)
GPID
Strain
Size(bp)
Noof
genes
Genedensity
(genes/Kbp
)Noof
contigs
Accession
number
Quality
score
Patho
type,serotype,
othercharacteristics
Reference
2775
7E.coli01
57:H7EC86
95,73
1,06
55,22
00.91
014
7ABHU00
0000
003
EHEC,O15
7:H7
J.Craig
VenterInstitu
te
2884
7E.coli01
57:H7TW14
588
5,67
0,29
75,80
31.02
310
ABKY00
0000
002
EHEC,O15
7:H7
J.Craig
VenterInstitu
te
2896
5E.coliBL21
(DE3)
4,55
7,04
14,08
70.89
61
AM94
6981
1Com
mensal,strain
BAustralianCenter
forBioph
armaceutical
Techno
logy
3003
1E.coliDH1
4,63
0,70
74,16
01
CP00
1637
1K12
derivativ
eDOEJG
I-PGF
3068
1E.coliBL21
(DE3)
4,57
0,93
84,22
80.94
91
CP00
1665
1Com
mensal,strain
BDOEJG
I-PGF
3257
1E.coli01
27:H6E23
48/69
5,06
9,67
84,55
40.89
83
FM18
0568
1EPEC,O12
7:H6
[17]
3341
3E.coli55
989
5,15
4,86
24,76
30.92
31
CU92
8145
1EAEC
[43]
3337
3E.coliIA
I14,70
0,56
04,35
30.92
61
CU92
8160
1Com
mensal,O8
[43]
3337
5E.coliS88
5,03
2,26
84,69
60.93
32
CU92
8161
1ExP
EC,O45
:K1:H7
[43]
3340
9E.coliED1a
5,20
9,54
84,91
50.94
31
CU92
8162
1Com
mensal,O81
[43]
33411
E.coliIA
I39
5,13
2,06
84,73
20.92
21
CU92
8164
1UPEC,O7:K1
[43]
3341
5E.coliUMN02
65,32
4,39
14,82
60.90
63
CU92
8163
1UPEC,O7:K1
[43]
3377
5E.coliBW29
524,57
8,15
94,08
40.89
21
CP00
1396
1K12
derivativ
e[10]
4101
3E.coli01
03:H212
009
5,52
4,86
05,12
10.92
62
AP01
0958
1EHEC,O10
3:H2
[31]
4102
1E.coliO26
:H11
1136
85,85
1,45
85,51
60.94
24
AP01
0958
1EHEC,O26
:H11
[31]
4102
3E.coli0111:H-1112
85,76
6,08
15,40
70.93
76
AP01
0960
1EHEC,O111:H-
[31]
E.coli01
03Oslo
4,96
0,07
64,57
10.92
115
3Unp
ublished
4EHEC,O10
3NorwegianVet
Institu
te
E.coliLANLECA
5,48
9,84
54,75
00.86
564
Unp
ublished
3EHEC,O15
7:H7
Los
Alamos
National
Lab
E.coliLANLECF
5,55
6,39
34,81
30.86
663
Unp
ublished
3EHEC,O15
7:H7
Los
Alamos
National
Lab
310
S.flexneri2a
301
4,82
8,82
14,17
70.86
51
AE00
5674
1Dysentery,Serog
roup
2a[20]
408
S.flexneri2a
2457
T4,59
9,35
44,06
10.88
21
AE01
4073
1Dysentery,Serog
roup
2a[44]
1637
5S.
flexneri584
014,57
4,28
44,115
0.89
91
CP00
0266
1Dysentery,Serog
roup
5[29]
1563
7S.
boydiiCDC
3083
-94
4,87
4,65
94,24
60.87
11
CP00
1063
1Dysentery,Serog
roup
1[35]
1314
6S.
boydiiSb2
274,64
6,52
04,13
40.88
91
CP00
0036
1Dysentery,Serog
roup
4[41]
1314
5S.
dysenteriaeSd1
974,56
0,911
4,27
10.93
61
CP00
0034
1Dysentery,Serog
roup
1[41]
1619
4S.
dysenteriae10
125,23
5,53
54,41
90.84
618
9AAMJ000
0000
03
Dysentery,Serotyp
e4
TIG
R,[34]
1315
1S.
sonn
eiSs046
5,05
5,31
64,21
90.83
41
CP00
0038
1Dysentery
[47]
2885
1E.albertiiTW07
627
4,69
8,53
34,38
60.93
364
ABKX01
0000
003
Enterop
atho
genic
J.Craig
VenterInstitu
te
3336
9E.ferguson
iiATCC35
469
4,64
3,86
14,31
90.93
02
CU92
8158
1Com
mensal
[43]
241
S.enterica
serovarTyp
himurium
LT2
4,95
1,37
14,52
50.91
32
AE00
6468
1Enterop
atho
genic
[27]
GPID
geno
meprojectidentifierat
NCBI(http
://www.ncbi.n
lm.nih.gov
/genom
es/lp
roks.cgi).
EPECenteropathog
enic,UPECurop
atho
genic,
ExP
ECextra-intestinal
pathog
enic,ETECenterotoxigenic,
EIECenteroinvasive,EAECenteroaggregative,
APECavianpathog
enic
E.coli
O. Lukjancenko et al.
fragments of adk, fumC, icd, gyrB, mdh, purA, and recA.The obtained DNA sequences were extracted from thegenome sequence, concatenated and phylogenically analyzedas described above. Alignments were not manually adjusted toavoid subjective interpretation of the outcome.
Predicted Proteome Analysis
The predicted proteomes comprising all protein-codinggenes were extracted from the GenBank files for thepublished genomes. For unpublished genomes, they werepredicted using EasyGene [30]. All predicted proteomeswere compared by BLASTP reciprocal pairwise compari-son. Two genes were attributed to a single gene family andconsidered 'conserved' when they shared at least 50%amino acid identity over at least 50% of the length of thelongest gene.
A hierarchical clustering was performed for the completepan-genome as described by Snipen et al. [38]. Briefly, apan-genome matrix was constructed consisting of 1 s and0 s where each row corresponds to a gene family, asdescribed above, and each column to a genome. Cell (i,j) inthe matrix is 1 if gene family i is present in genome j, or 0if it is absent. Manhattan distances were calculated andused for hierarchical clustering to generate the tree. Theplotted distance between two genomes shows the proportionof gene families where their present/absent status differs.Thus, pan-genome hierarchical clustering analyses genes thatare not conserved, but vary in their presence or absencebetween genomes. Shorter distances represent genomes withmore gene families in common. Genes only occurring in asingle genome (singletons) were not included in the analysis.Bootstrap values (per mil) were computed for each inner nodeby re-sampling the rows of the matrix.
A pan- and core genome plot was constructed accordingto [12]. The order of genomes was chosen based on thepan-genome tree, starting with the largest E. coli O157genome. For the pan-genome curve, all cumulative BLASThits found in the genomes were plotted as a running total,which increases as more genomes are added. The numberof gene families with at least one representative in everygenome was plotted for the core genome and this slowlydecreases with the addition of more genomes, as thesegenomes may lack genes from gene families that had beenconserved in the previously plotted genomes.
A BLAST atlas was constructed as described by Hallinet al. [14].
Results and Discussion
A number of characteristics of each of the 61 genomes aresummarized in Table 1, such as their size, their number of
recognized protein genes, and their gene density. Their GCcontent varies around 50% for all genomes (not shown), buttheir size and number of genes varies extensively. Thesmallest E. coli genome included is that of strain BL21(DE3) sequenced by the Korean consortium, which is only4.56 Mbp, and the smallest Shigella genome is that ofShigella dysenteriae Sd197, with 4.56 Mbp. The longestgenome of the completed genomes so far belongs to E. coliO157:H7 strain EC4115, with 5.70 Mbp. Longer genomesare listed in Table 1, but since those sequences are still inmultiple contigs, it is possible that their stated length isoverestimated. These size differences mean that around onemillion nucleotides (approximately 20% of a genome) canbe absent in one E. coli or Shigella isolate and present inanother. These 'extra' sequences are not void, as indicatedby the variation in number of genes: the longest E coligenome has 1,158 more predicted genes than the shortest E.coli genome (5,315 genes for strain EC4115 and 4157genes for BL21). Further, the observed gene density isrelatively constant, at 0.911±0.04 genes per 1,000 basepairs. It should be noted that published proteomes havebeen defined using different gene prediction programs anddefinitions, so that the observed slight variation in genedensity might be explained by non-standardized geneidentification.
Phylogeny of 16S Ribosomal RNA and MLST Genes
A phylogenetic tree based on the 16S ribosomal RNAsequences extracted from a representative set of 20 Enter-obacteriacea genomes is shown in panel a of Fig. 1, whichis in agreement with the known phylogeny of the family.The tree for the full set of the 61 E. coli and Shigellastrains, including two additional species of Escherichia andone from S. enterica is shown in Fig. 1b. From this figure,it is obvious that phylogeny of the 16S rRNA gene does notresolve well within the genus level, as is known, becausethe rRNA operons are so similar. Although some of the treenodes are predicted with uncertainty, clearly the generaShigella and Escherichia are not separated, nor are E. coligenes separated from those of E. fergusonii or E. albertii.This finding was expected, considering the close related-ness between Escherichia spp. and Shigella spp. In general,16S sequences are not suitable to analyze inter-strainrelationships within a species or between closely relatedspecies, as illustrated with this set of Enterobacteriaceaegenes. This questions the reliability to use 16S as anindicator for the species to which unknown sequencedDNA belongs [45].
Next, it was investigated if conserved housekeepinggenes, frequently assessed for MLST, provide a betterrepresentation of the relatedness of the investigatedgenomes. Various MLST schemes are in use for E. coli
Comparison of 61 Sequenced Escherichia coli Genomes
[9, 28] or Shigella spp. [35] but these are not standardizedand the genes assessed in these schemes are not conservedin all genomes. We used the combination of sevenhousekeeping genes that has been applied to a number ofbacterial species [26] (www.mlst.net). Since S. entericalacks fumC (an observation that somewhat weakens the
general applicability of this MLST gene set), that genomewas not included in the analysis. The resulting tree, shownin Fig. 2, still mixes E. coli with Shigella species, and doesnot separate all pathogenic strains from commensal strains.Some of the phylogroups previously defined by multilocusenzyme electrophoresis are clearly separated, such as the E
Buchnera aphidicola str 5A (Acyrthosiphon pisum)
Candidatus Hamiltonella defensa 5AT (Acyrthosiphon pisum)
Sodalis glossinidius 'morsitans'
Escherichia coli K-12 MG1655
Shigella boydii CDC 3083-94
Citrobacter koseri ATCC BAA-895
1000
Salmonella enterica serovar Typhimurium LT2
609
Citrobacter sakazakii ATCC BAA-894
954
Erwinia amylovora ATCC 49946
648
Eneterobacter species 638
Klebsiella pneumoniae 342
846
668
Pectobacterium atrosepticum SCRI1043
Serratia proteamaculans 568
Dickeya dadantii Ech586
586
921
737
Edwardsiella ictaluri 93-146
586
420
Photorhabdus asymbiotica ATCC 43949
Xenorhabdus bovienii SS-2004
Proteus mirabilis HI4320
356
428
876
Yersinia pestis Angola
261
926
999
Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis
0.02
Enterobacteriaceae genera 16S rRNA treeaFigure 1 Phylogenetic treebased on extracted 16S rRNAsequences. a Comparison of 20different Enterobacteriaceae,based on extracted 16S rRNAsequences from the GenBanksequence files. E. coli andShigella are shown in green. bTree of 61 sequenced E. coli(black) and related species(colored), based on the alignmentof the 16S rRNA gene sequence.Apart from Shigella spp., thegenes from E. albertii and E.fergusonii are also included(arrows). The 16S rRNA gene ofS. enterica Typhimurium LT2was used as the root. Bootstrapvalues, indicated in red, showthat most nodes are predictedwith uncertainty; nevertheless,the genera Escherichia spp.and Shigella spp. are notseparated in this tree, and thethree Escherichia species are alsomixed
O. Lukjancenko et al.
cluster containing all O157 strains, the A/B cluster ofcommensal K12 and B strains, and the B2 clustercontaining some of the uropathogenic strains, in accordanceto comparisons carried out by others [40]. Other authorsconcluded that the O157 serotype of EHEC probablyevolved in successive evolutionary events [9]; however,that conclusion is not supported by the MLST tree. Andalthough the B phylogroup is known for its commensal
isolates, one of which being used by Delbrück and Luria fortheir famous phage work, this branch also contains theenteroaggregative strain 101-1 (Fig. 2). Moreover, the twoS. dysenteriae strains are widely separated from each other.Pupo et al. [36], who used a different set of MLST genes,also found that isolates of the three species Shigellaflexneri, Shigella boydii, and S. dysenteriae, could notalways be grouped together nor separated from E. coli.
E. coli 55989S. sonnei Ss046
E. coli BL21 (DE3)E. coli BL21 (DE3)E. coli B str. REL606E. coli O103:H2 str. 12009E. coli E22
E. coli CFT073S. dysenteriae Sd197
E. coli 101-1E. albertii TW07627
E. coli B7AE. coli O26:H11 str. 11368
S. boydii CDC 3083-94S. flexneri 2a str. 301
790
707
S. boydii Sb227S. flexneri 2a str. 2457T
S. flexneri 5 str. 8401E. coli 536
510
221
E. coli E110019
196
E. coli APEC O1E. coli BL21 (DE3)
87
E. coli ED1aE. coli S88
S. dysenteriae 1012
500
E. coli O127:H6 str. E2348/69E. coli SE15
E. coli UTI89E. fergusonii ATCC35469
493
124
E. coli F11
292
E. coli Oslo O103
383
494
377
E. coli K12 str. ATCC8739E. coli HS
E. coli IAI39E. coli E24377AE. coli O111:H- str. 11128
E. coli UMN026E. coli SE11E. coli SMS-3-5E. coli O157:H7 str. TW14588E. coli O157:H7 str. SakaiE. coli O157:H7 str. EC4401E. coli O157:H7 str. EC4196E. coli O157:H7 str. EC4115E. coli O157:H7 str. EC4113E. coli O157:H7 str. EC4076E. coli O157:H7 str. EC4045E. coli O157:H7 str. EC4042E. coli IAI1E. coli 53638
E. coli K12 str. BW2952E. coli K12 str. DH1
E. coli K12 str. DH10BE. coli K12 str. MG1655
272
E. coli K12 str. W3110E. coli O157:H7 str. LANL ECA
E. coli O157:H7 str. EC508E. coli O157:H7 str. LANL ECFE. coli O157:H7 str. EC4206E. coli O157:H7 str. EC4486E. coli O157:H7 str. EC4501E. coli O157:H7 str. EC869
639
604
E. coli O157:H7 str. EDL933
225
951
774
404
423
101
56
161
108
343
S. enterica Typhimurium str. LT20.005
bFigure 1 (continued)
Comparison of 61 Sequenced Escherichia coli Genomes
Various enteroinvasive E. coli serotypes have been sug-gested as ancestral to the different Shigella serogroups [23],which could explain the lack of differentiation power ofMLST in this case. Apparently, neither MLST gene sets aresuitable to group these Enterobacteriaceae organisms in a
meaningful way. The performance of MLST could in theorybe improved by selecting different genes, for instance usinga set of genes specifically chosen to produce the desiredgrouping. However, the strength of MLST analysis shouldbe that a conserved set of genes is able to identify
E. coli BL21(DE3)E. coli BL21(DE3)E. coli BL21(DE3)E. coli B str. REL606E. coli 101-1
E. coli K12 str. MG1655E. coli K12 str. DH1E. coli K12 str. DH10B
E. coli K12 str. W3110E. coli BW2952
1000
1000
Shigella dysenteriae 1012
830
E. coli 536E. coli F11E. coli S88E. coli UTI89E. coli APEC O1E. coli CFT073
1000
1000
669
E. coli ED1aE. coli O127:H6 str. E2348/69
E. coli SE15E. coli IAI39E. coli SMS-3-5
1000
998
939
E. coli UMN026E. coli K12 str. ATCC 8739
E. coli HSE. coli 53638
741
E. coli IAI1
1000
E. coli 55989E. coli Oslo O103
E. coli E22E. coli O103:H2 str. 12009
708
1000
E. coli B7AE. coli SE11
E. coli O111:H- str. 11128E. coli O26:H11 str. 11368
E. coli E1100191000
Shigella sonnei Ss046
792
Shigella boydii CDC 3083-94Shigella boydii Sb227E. coli E24377A1000
702
Shigella dysenteriae Sd197Shigella flexneri 2a str. 2457TShigella flexneri 5 str. 8401
Shigella flexneri 2a str. 301E. coli O157:H7 str. EC4401E. coli O157:H7 str. EC4206E. coli O157:H7 str. SakaiE. coli O157:H7 str. EC4115E. coli O157:H7 str. EC4076E. coli O157:H7 str. EC4045E. coli O157:H7 str. EDL933E. coli O157:H7 str. EC4501E. coli O157:H7 str. LANL ECFE. coli O157:H7 str. EC4042E. coli O157:H7 str. EC508E. coli O157:H7 str. EC4486E. coli O157:H7 str. EC869E. coli O157:H7 str. EC4196E. coli O157:H7 str. TW14588E. coli O157:H7 str. EC4113E. coli O157:H7 str. LANL ECA
1000
1000
E. fergusonii ATCC354
1000
E. albertii TW07627
0.01
Figure 2 Phylogenetic tree ofconcatenated MLST gene alleles(adk, fumC, icd, gyrB, mdh,purA, recA), extracted from thegenome sequences. Color use isthe same as in Fig. 1
O. Lukjancenko et al.
phylogenetic relationships in any collection of isolates fromone species. If one has to select a 'standard' gene setspecifically for the species under investigation, it weakensthe general application of MLST considerably.
Pan-Genome Comparisons
MLST analyzes allelic differences in genes whosepresence has to be conserved in all genomes. However,we hypothesized that genes that are variably present couldprovide useful information as to the true relatedness of theanalyzed genomes. Since the variable fraction contain genesthat are present in some, absent in other genomes, aphylogenetic analysis cannot be performed to capture allinformation. Figure 3 displays a pan-genome clustering tree,based on the gene families that are variably present in theanalyzed genomes (gene families comprising singletons wereexcluded). The hierarchical clustering obtained by thisanalysis correctly separates the Shigella spp. and S.Typhimurium from Escherichia spp. and, within the latter
genus, separates E. coli from the other Escherichia spp(Fig. 3). Moreover, all E. coli O157:H7 genomes now clustertogether, as do the K12 derivatives (W3110, MG1655, DH1,BW2952, DH10B, and ATCC8739). The strains belongingto phylogenic group B are also positioned in one cluster,to which the non-pathogenic commensal strain HS alsoseems to belong. All these are avirulent isolates, and it isquite impressive that all these are positioned closetogether in the tree. We conclude that this analysis ofvariable genes identifies inter-strain relationships that canbe correlated to the lifestyle of the organisms.
The contribution of every genome to the completepan-genome of E. coli and related organisms is demon-strated in Fig. 4, where the pan-genome and core genome,as defined by other authors [40] of the analyzed sequencesis plotted. The number of novel gene families for everyadded genome is also shown. As can be seen, all genomescontribute to the increase of the pan-genome. Thisincrease is less strong when similar genomes are added(for instance all four K12 genomes, or the B strains). The
E. coli 101 1E. coli B7AE. coli E22E. coli E110019E. coli APECO1E. coli F11E. coli ED1aE. coli O127:H6 str. E2348/69E. coli 536E. coli SE15E. coli CFT073E. coli S88E. coli UTI89E. coli IAI39E. coli 53638E. coli UMN026E. coli SMS 3 5E. coli O103 OsloE. coli O111:H str.11128E. coli O103:H2 str. 12009E. coli O26:H11 str. 11368 E. coli E24377AE. coli 55989E. coli IAI1E. coli SE11E. coli HSE. coli BL21 (DE3 DOE)E. coli BL21 (DE3 AU)E. coli BL21 (DE3Korea)E. coli B str. REL606E. coli K12ATCC8739 E. coli K12 DH10BE. coli K12 str. BW2952E. coli K12 str. DH1E. coli K12 str. MG1655E. coli K12 str. W3110E. coli O157:H7 str. EC4045E. coli O157:H7 str. LANL ECAE. coli O157:H7 str. LANL ECFE. coli O157:H7 str. TW14588E. coli O157:H7 str. SakaiE. coli O157:H7 str. EDL933E. coli O157:H7 str. EC4401E. coli O157:H7 str. EC4206E. coli O157:H7 str. EC869E. coli O157:H7 str. EC4486E. coli O157:H7 str. EC4042E. coli O157:H7 str. EC4115E. coli O157:H7 str. EC4076E. coli O157:H7 str. EC4501E. coli O157:H7 str. EC508E. coli O157:H7 str. EC4113E. coli O157:H7 str. EC4196
100
100
100
42
61
51
61
86
83
89
40
79
38
65
96
56
86
50
85
62
60
60
97
83
58
69
35
97
42
16
99
36
67
35
82
59
82
84
100
95
S. flexneri 5 str. 8401S. flexneri 2a str. 301S. flexneri 2a str. 2457TS. sonnei Ss046S. boydii Sb227S. boydii CDC 3083 94
100
6060
0.000.15 0.10 0.050.20
Relati ve manhattan distance
S. dysenteriae Sd197S. dysenteriae 1012E. albertii TW07627 E. fergusonii ATCC 35469 Salmonella enterica Typhimurium str. LT2
99
58
69
70
70
Figure 3 Pan-genome clustering of E. coli (black) and related species(colored), based on the alignment of their variable gene content. Thegenomes now cluster according to species and a relatedness between
E. coli K12 derivatives (green block) and group B isolates (orangeblock) is visible
Comparison of 61 Sequenced Escherichia coli Genomes
addition of Shigella spp. genomes does not alter the shapeof the pan-genome curve, but addition of the otherEscherichia genomes causes a sharp increase. Thecontribution of E. fergusonii to the pan-genome has beennoted before [42].
The core genome reduces in size as more genomes areadded, with an expected significant drop when theshorter genomes are assessed (starting with E. coli K12DH10B, at position 18 in Fig. 4). The core genomereaches 1,472 gene families conserved in 53 E. coligenomes, which is further reduced to 993 gene familiesif Shigella spp. are considered as well. The bars show howmany novel gene families each genome contributes to thegrowing pan-genome. It should be noted that the order inwhich genomes are analyzed influences the number ofthese reported novel gene families other than for single-tons. When novel genes are considered, instead of novelgene families, the findings can be even more dramatic. Forinstance, six novel E. coli genome sequences identifiedapproximately 10,000 novel genes [42]. Previous work has
estimated a core genome of 1,976 genes for 20 E. coligenomes and a pan-genome of 17,831 genes. Our analysisof 53 E. coli genomes identified 1,472 conserved genefamilies and 13,296 gene families comprising the pan-genome. We prefer to report these findings as gene families,instead of individual genes, using clearly defined criteria forinclusion of genes into a gene family (described in the“Materials and Methods” section).
Where are all these variable genes located in agenome? Gene order is not strongly conserved betweenthe analyzed genomes, so that gene location dependswhich genome is considered. Nevertheless, by visualizingwhere a gene, whose presence can vary, is located on asingle reference genome provides further information,and this can be visualized in a BLAST atlas [14]. In theBLAST atlas of Fig. 5, it becomes apparent that the variablegene content is not evenly distributed over the referencegenome, but appears to be distributed over various islands.The reference chromosome of E. coli O157:H7 EC4115was chosen, as it is the largest chromosome for which a
1 : Escherichia coli 0157:H7 str. EC4196 2 : Escherichia coli 0157:H7 str. EC4113 3 : Escherichia coli 0157:H7 str. EC508 4 : Escherichia coli 0157:H7 str. EC4501 5 : Escherichia coli 0157:H7 str. EC4076 6 : Escherichia coli 0157:H7 str. EC4115 7 : Escherichia coli 0157:H7 str. EC4042 8 : Escherichia coli 0157:H7 str. EC4486 9 : Escherichia coli 0157:H7 str. EC86910 : Escherichia coli 0157:H7 str. EC420611 : Escherichia coli 0157:H7 str. EC440112 : Escherichia coli 0157:H7 str. EDL93313 : Escherichia coli 0157:H7 str. TW1458814 : Escherichia coli 0157:H7 str. Sakai15 : Escherichia coli 0157:H7 EC404516 : Escherichia coli 0157:H7 str. LANL ECF17 : Escherichia coli 0157:H7 str. LANL ECA18 : Escherichia coli K12 str. DH10B19 : Escherichia coli K12 str. MG165520 : Escherichia coli K12 str. W311021 : Escherichia coli K12 str. DH122 : Escherichia coli BW295223 : Escherichia coli ATCC873924 : Escherichia coli B REL60625 : Escherichia coli BL21 (DE3 Korea)26 : Escherichia coli BL21 (DE3 AU)27 : Escherichia coli BL21 (DE3 DOE)28 : Escherichia coli HS29 : Escherichia coli SE1130 : Escherichia coli IAI131 : Escherichia coli 5598932 : Escherichia coli E24377A33 : Escherichia coli O26:H11 str. 1136834 : Escherichia coli 0127:H6 str. E2348/6935 : Escherichia coli O103:H2 str. 1200936 : Escherichia coli O111:H- str. 1112837 : Escherichia coli 0103 Oslo38 : Escherichia coli SMS 3 539 : Escherichia coli UMN02640 : Escherichia coli 5363841 : Escherichia coli IAI3942 : Escherichia coli UTI8943 : Escherichia coli S8844 : Escherichia coli CFT07345 : Escherichia coli SE1546 : Escherichia coli 53647 : Escherichia coli ED1a48 : Escherichia coli F1149 : Escherichia coli APECO150 : Escherichia coli E11001951 : Escherichia coli E2252 : Escherichia coli B7A53 : Escherichia coli 101 154 : Shigella flexneri 2a 2457T55 : Shigella flexneri 2a 30156 : Shigella flexneri 5 840157 : Shigella boydii CDC 3083 9458 : Shigella boydii Sb22759 : Shigella sonnei Ss04660 : Escherichia fergusonii ATCC 3546961 : Escherichia albertii TW0762762 : Salmonella enterica Typhimurium LT263 : Shigella dysenteriae Sd19764 : Shigella dysenteriae 1012
New gene families
Core genome
Pan genome
5000
10000
15000
1 3 5 7 9 11 13 15 17 19 21 23 61 6325 27 33 3729 31 35 39 41 43 45 47 53 5749 51 55 59
Numberof genefamilies
Figure 4 Pan- and core genome plot of the analyzed genomes. Theblue pan-genome curve connects the cumulative number of genefamilies present in the analyzed genomes. The red core genome curve
connects the conserved number of gene families. The gray bars showthe numbers of novel gene families identified in each genome
O. Lukjancenko et al.
complete sequence is currently available. Around this,all other genomes are plotted, whereby lack of colorindicates that particular gene from EC4115 is missing inthe shown genome. The strong conservation of genepresence within the O157 serotype (in green) contrasts withthe multiple 'gaps' seen in the other lanes. Every gaprepresents multiple genes in strain EC4115, illustrating thatgene variation is not evenly distributed along the genome,but located in islands.
Concluding Remarks
“This gene is not found in E. coli”, is an expression oftenheard in discussions about novel genes in variousorganisms, and when people are looking for functionalmatches in databases. It is a sobering thought to realizethat any given E. coli genome sequenced will have onlyroughly 20% of its genes part of the E. coli core, and theremaining 80% are not found in all other E. coli genomes.After a comparison of the diversity with many sequenced
E. coli genomes, it has become clear such a statement canonly be valid when it is specified which E. coli genomesequence has been searched. Of the predicted pan-genomecomprising about 16,000 gene families, the core (slightlyless than a thousand genes) is found to be only about afifth of a typical E. coli genome which contains around5,000 genes. Many of the accessible or variable genes,making up more than 90% of the pan-genome and roughlyfour fifth of a typical genome, are often found co-localizedon genomic islands. The diversity within the species E.coli, and the overlap in gene content between this andrelated species is far greater than many had anticipated,and represents a broad set of functions for adapting tomany different environments. The comparative methodsused here are generally applicable to genomes of relatedspecies, and are considered a valuable tool to evaluatecurrent insights of species' relatedness and evolutionaryhistory.
Acknowledgments We would like to thank the Danish ResearchCouncils and the DTU Globalization funds for financial support.
0M 0.5M1M
1.5
M2M
2.5M3M3.5
M
4M
4.5
M
5M
Terminus
Origin
Escherichia coli O157:H7 strain EC4115
Figure 5 BLAST atlas. In themiddle, a genome atlas of E. coliO157:H7 strain EC4115 isshown, around which BLASTlanes are shown. Every lanecorresponds to a genome, withthe following colors (goingoutwards): green E. coli O157:H7 (15 lanes); light blue E. coliLANL strains (two lanes); darkblue Shigella spp. (eight lanes);red E. coli K12 and derivatives(six lanes); orange E. coli strainB phylogroup (four lanes);followed by all other E. coligenomes in different colors.The outermost three lanesrepresent E. fergusonii, E.albertii, and S. entericaTyphimurium LT2. Lack ofcolor indicates that the genes atthat position in strain EC4115were not found in the genomeof that lane. The position ofreplication origin and terminusis indicated
Comparison of 61 Sequenced Escherichia coli Genomes
Open Access This article is distributed under the terms of theCreative Commons Attribution Noncommercial License which per-mits any noncommercial use, distribution, and reproduction in anymedium, provided the original author(s) and source are credited.
References
1. Anjum MF, Lucchini S, Thompson A, Hinton JCD, WoodwardMJ (2003) Comparative genomic indexing reveals the phyloge-nomics of Escherichia coli pathogens. Infect Immun 71:4674–4683
2. Blattner FR, Plunkett G3, Bloch CA, Perna NT, Burland V, RileyM, Collado-Vides J, Glasner JD, Rode CK, Mayhew GF, GregorJ, Davis NW, Kirkpatrick HA, Goeden MA, Rose DJ, Mau B,Shao Y (1997) The complete genome sequence of Escherichiacoli K-12. Science 277:1453–1462
3. Brenner DJ, Fanning GR, Skerman FJ, Falkow S (1972)Polynucleotide sequence divergence among strains of Escherichiacoli and closely related organisms. J Bact 109:953–965
4. Chain PS, Grafham DV, Fulton RS, Fitzgerald MG, Hostetler J,Muzny D, Ali J, Birren B, Bruce DC, Buhay C, Cole JR, Ding Y,Dugan S, Field D, Garrity GM, Gibbs R, Graves T, Han CS,Harrison SH, Highlander S, Hugenholtz P, Khouri HM, KodiraCD, Kolker E, Kyrpides NC, Lang D, Lapidus A, Malfatti SA,Markowitz V, Metha T, Nelson KE, Parkhill J, Pitluck S, Qin X,Read TD, Schmutz J, Sozhamannan S, Sterk P, Strausberg RL,Sutton G, Thomson NR, Tiedje JM, Weinstock G, Wollam A,Genomic Standards Consortium Human Microbiome ProjectJumpstart Consortium, Detter JC (2009) Genomics. Genomeproject standards in a new era of sequencing. Science 326:236–237
5. Chen SL, Hung C, Xu J, Reigstad CS, Magrini V, Sabo A, BlasiarD, Bieri T, Meyer RR, Ozersky P, Armstrong JR, Fulton RS,Latreille JP, Spieth J, Hooton TM, Mardis ER, Hultgren SJ,Gordon JI (2006) Identification of genes subject to positiveselection in uropathogenic strains of Escherichia coli: a compar-ative genomics approach. Proc Natl Acad Sc USA 103:5977–5982
6. Cilia V, Lafay B, Christen R (1996) Sequence heterogeneitiesamong 16S ribosomal RNA sequences, and their effect onphylogenetic analyses at the species level. Mol Biol Evol13:451–461
7. Dobrindt U, Blum-Oehler G, Nagy G, Schneider G, Johann A,Gottschalk G, Hacker J (2002) Genetic structure and distributionof four pathogenicity islands (PAI I(536) to PAI IV(536)) ofuropathogenic Escherichia coli strain 536. Infect Immun70:6365–6372
8. Durfee T, Nelson R, Baldwin S, Plunkett G3, Burland V, Mau B,Petrosino JF, Qin X, Muzny DM, Ayele M, Gibbs RA, Csörgo B,Pósfai G, Weinstock GM, Blattner FR (2008) The completegenome sequence of Escherichia coli DH10B: insights into thebiology of a laboratory workhorse. J Bacteriol 190:2597–2606
9. Feng PCH, Monday SR, Lacher DW, Allison L, Siitonen A, KeysC, Eklund M, Nagano H, Karch H, Keen J, Whittam TS (2007)Genetic diversity among clonal lineages within Escherichia coliO157:H7 stepwise evolutionary model. Emerging Infect Dis13:1701–1706
10. Ferenci T, Zhou Z, Betteridge T, Ren Y, Liu Y, Feng L, ReevesPR, Wang L (2009) Genomic sequencing reveals regulatorymutations and recombinational events in the widely usedMC4100 lineage of Escherichia coli K-12. J Bacteriol191:4025–4029
11. Fricke WF, Wright MS, Lindell AH, Harkins DM, Baker-AustinC, Ravel J, Stepanauskas R (2008) Insights into the environmental
resistance gene pool from the genome sequence of the multidrug-resistant environmental isolate Escherichia coli SMS-3-5. JBacteriol 190:6779–6794
12. Friis C, Wassenaar TM, Javed MA, Snipen L, Lagersen K, HallinPF, Newell DG, Manning G, Ussery DW (Submitted forpublication) Genomic characterization of Campylobacter jejuniM1
13. Fukiya S, Mizoguchi H, Tobe T, Mori H (2004) Extensivegenomic diversity in pathogenic Escherichia coli and Shigellastrains revealed by comparative genomic hybridization micro-array. J Bacteriol 186:3911–3921
14. Hallin PF, Binnewies TT, Ussery DW (2008) The genomeBLASTatlas-a GeneWiz extension for visualization of whole-genome homology. Mol Biosyst 4:363–371
15. Hayashi T, Makino K, Ohnishi M, Kurokawa K, Ishii K,Yokoyama K, Han CG, Ohtsubo E, Nakayama K, Murata T,Tanaka M, Tobe T, Iida T, Takami H, Honda T, Sasakawa C,Ogasawara N, Yasunaga T, Kuhara S, Shiba T, Hattori M,Shinagawa H (2001) Complete genome sequence of enterohemor-rhagic Escherichia coli O157:H7 and genomic comparison with alaboratory strain K-12. DNA Res 8:11–22
16. Hayashi K, Morooka N, Yamamoto Y, Fujita K, Isono K, Choi S,Ohtsubo E, Baba T, Wanner BL, Mori H, Horiuchi T (2006)Highly accurate genome sequences of Escherichia coli K-12strains MG1655 and W3110. Mol Syst Biol 2:2006.0007
17. Iguchi A, Thomson NR, Ogura Y, Saunders D, Ooka T,Henderson IR, Harris D, Asadulghani M, Kurokawa K, Dean P,Kenny B, Quail MA, Thurston S, Dougan G, Hayashi T, ParkhillJ, Frankel G (2009) Complete genome sequence and comparativegenome analysis of enteropathogenic Escherichia coli O127:H6strain E2348/69. J Bacteriol 191:347–354
18. Itoh Y, Nagano I, Kunishima M, Ezaki T (1997) Laboratoryinvestigation of enteroaggregative Escherichia coli O untypeable:H10 associated with a massive outbreak of gastrointestinal illness.J Clin Microbiol 35:2546–2550
19. Jeong H, Barbe V, Lee CH, Vallenet D, Yu DS, Choi S, CoulouxA, Lee S, Yoon SH, Cattolico L, Hur C, Park H, Ségurens B, KimSC, Oh TK, Lenski RE, Studier FW, Daegelen P, Kim JF (2009)Genome sequences of Escherichia coli B strains REl606 and Bl21(DE3). J Mol Biol 394:644–652
20. Jin Q, Yuan Z, Xu J, Wang Y, Shen Y, Lu W, Wang J, Liu H, YangJ, Yang F, Zhang X, Zhang J, Yang G, Wu H, Qu D, Dong J, SunL, Xue Y, Zhao A, Gao Y, Zhu J, Kan B, Ding K, Chen S, ChengH, Yao Z, He B, Chen R, Ma D, Qiang B, Wen Y, Hou Y, Yu J(2002) Genome sequence of Shigella flexneri 2a: insights intopathogenicity through comparison with genomes of Escherichiacoli K12 and O157. Nucleic Acids Res 30:4432–4441
21. Johnson TJ, Kariyawasam S, Wannemuehler Y, Mangiamele P,Johnson SJ, Doetkott C, Skyberg JA, Lynne AM, Johnson JR,Nolan LK (2007) The genome sequence of avian pathogenicEscherichia coli strain O1:K1:H7 shares strong similarities withhuman extraintestinal pathogenic E. coli genomes. J Bacteriol189:3228–3236
22. Lagesen K, Hallin P, Rødland EA, Staerfeldt H, Rognes T, UsseryDW (2007) Rnammer: consistent and rapid annotation ofribosomal rna genes. Nucleic Acids Res 35:3100–3108
23. Lan R, Alles MC, Donohoe K, Martinez MB, Reeves PR (2004)Molecular evolutionary relationships of enteroinvasive Escher-ichia coli and Shigella spp. Infect Immun 72:5080–5088
24. Larkin MA, Blackshields G, Brown NP, Chenna R, McGettiganPA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R,Thompson JD, Gibson TJ, Higgins DG (2007) Clustal W andClustal X version 2.0. Bioinformatics 23:2947–2948
25. Luria SE, Burrous JW (1957) Hybridization between Escherichiacoli and Shigella. J Bacteriology 74:461–476
O. Lukjancenko et al.
26. Maiden MC, Bygraves JA, Feil E, Morelli G, Russell JE, UrwinR, Zhang Q, Zhou J, Zurth K, Caugant DA, Feavers IM, AchtmanM, Spratt BG (1998) Multilocus sequence typing: a portableapproach to the identification of clones within populations ofpathogenic microorganisms. Proc Natl Acad Sc USA 95:3140–3145
27. McClelland M, Sanderson KE, Spieth J, Clifton SW, Latreille P,Courtney L, Porwollik S, Ali J, Dante M, Du F, Hou S, LaymanD, Leonard S, Nguyen C, Scott K, Holmes A, Grewal N,Mulvaney E, Ryan E, Sun H, Florea L, Miller W, Stoneking T,Nhan M, Waterston R, Wilson RK (2001) Complete genomesequence of Salmonella enterica serovar Typhimurium LT2.Nature 413:852–856
28. Moura RA, Sircili MP, Leomil L, Matté MH, Trabulsi LR, EliasWP, Irino K, Pestana de Castro AF (2009) Clonal relationshipamong atypical enteropathogenic Escherichia coli strains isolatedfrom different animal species and humans. Appl Environ Micro-biol 75:7399–7408
29. Nie H, Yang F, Zhang X, Yang J, Chen L, Wang J, Xiong Z, PengJ, Sun L, Dong J, Xue Y, Xu X, Chen S, Yao Z, Shen Y, Jin Q(2006) Complete genome sequence of Shigella flexneri 5b andcomparison with Shigella flexneri 2a. BMC Genomics 7:173
30. Nielsen P, Krogh A (2005) Large-scale prokaryotic gene predic-tion and comparison to genome annotation. Bioinformatics21:4322–4329
31. Ogura Y, Ooka T, Iguchi A, Toh H, Asadulghani M, Oshima K,Kodama T, Abe H, Nakayama K, Kurokawa K, Tobe T, HattoriM, Hayashi T (2009) Comparative genomics reveal the mecha-nism of the parallel evolution of O157 and non-O157 enter-ohemorrhagic Escherichia coli. Proc Natl Acad Sc USA106:17939–17944
32. Oshima K, Toh H, Ogura Y, Sasamoto H, Morita H, Park S, OokaT, Iyoda S, Taylor TD, Hayashi T, Itoh K, Hattori M (2008)Complete genome sequence and comparative analysis of the wild-type commensal Escherichia coli strain SE11 isolated from ahealthy adult. DNA Res 15:375–386
33. Perna NT, Plunkett G3, Burland V, Mau B, Glasner JD, Rose DJ,Mayhew GF, Evans PS, Gregor J, Kirkpatrick HA, Pósfai G,Hackett J, Klink S, Boutin A, Shao Y, Miller L, Grotbeck EJ,Davis NW, Lim A, Dimalanta ET, Potamousis KD, Apodaca J,Anantharaman TS, Lin J, Yen G, Schwartz DC, Welch RA,Blattner FR (2001) Genome sequence of enterohaemorrhagicEscherichia coli O157:H7. Nature 409:529–533
34. Perrière G, Gouy M (1996) WWW-query: an on-line retrievalsystem for biological sequence banks. Biochimie 78:364–369
35. Pupo GM, Karaolis DK, Lan R, Reeves PR (1997) Evolutionaryrelationships among pathogenic and nonpathogenic Escherichiacoli strains inferred from multilocus enzyme electrophoresis andmdh sequence studies. Infect Immun 65:2685–2692
36. Pupo GM, Lan R, Reeves PR (2000) Multiple independentorigins of Shigella clones of Escherichia coli and convergentevolution of many of their characteristics. Proc Natl Acad ScUSA 97:10567–10572
37. Rasko DA, Rosovitz MJ, Myers GSA, Mongodin EF, Fricke WF,Gajer P, Crabtree J, Sebaihia M, Thomson NR, Chaudhuri R,Henderson IR, Sperandio V, Ravel J (2008) The pan-genomestructure of Escherichia coli: comparative genomic analysis of E.coli commensal and pathogenic isolates. J Bacteriol 190:6881–6893
38. Snipen L, Ussery DW (2010) Standard operating procedure forcomparing pan-genome trees. Standards Genomic Sciences2:135–141. doi:10.4056/sigs.38923
39. Strockbine NA, Maurelli AT (2005) “Genus XXXV-Shigella”,page 812 of Bergey’s manual of systematic bacteriology, 2nd edn.Springer publishing company, New York
40. Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D,Ward NL, Angiuoli SV, Crabtree J, Jones AL, Durkin AS,Deboy RT, Davidsen TM, Mora M, Scarselli M, Margarit yRos I, Peterson JD, Hauser CR, Sundaram JP, Nelson WC,Madupu R, Brinkac LM, Dodson RJ, Rosovitz MJ, SullivanSA, Daugherty SC, Haft DH, Selengut J, Gwinn ML, Zhou L,Zafar N, Khouri H, Radune D, Dimitrov G, Watkins K,O'Connor KJB, Smith S, Utterback TR, White O, Rubens CE,Grandi G, Madoff LC, Kasper DL, Telford JL, Wessels MR,Rappuoli R, Fraser CM (2005) Genome analysis of multiplepathogenic isolates of Streptococcus agalactiae: implications forthe microbial “pan-genome”. Proc Natl Acad Sc USA102:13950–13955
41. Toh H, Oshima K, Toyoda A, Ogura Y, Ooka T, Sasamoto H, ParkS, Iyoda S, Kurokawa K, Morita H, Itoh K, Taylor TD, Hayashi T,Hattori M (2010) Complete genome sequence of the wild-typecommensal Escherichia coli strain SE15 belonging to phyloge-netic group B2. J Bacteriol 192:1165–1166
42. Touchon M, Hoede C, Tenaillon O, Barbe V, Baeriswyl S, Bidet P,Bingen E, Bonacorsi S, Bouchier C, Bouvet O, Calteau A,Chiapello H, Clermont O, Cruveiller S, Danchin A, Diard M,Dossat C, Karoui ME, Frapy E, Garry L, Ghigo JM, Gilles AM,Johnson J, Le Bouguénec C, Lescat M, Mangenot S, Martinez-Jéhanne V, Matic I, Nassif X, Oztas S, Petit MA, Pichon C, RouyZ, Ruf CS, Schneider D, Tourret J, Vacherie B, Vallenet D,Médigue C, Rocha EPC, Denamur E (2009) Organised genomedynamics in the Escherichia coli species results in highly diverseadaptive paths. PLoS Genet 5:e1000344
43. Wei J, Goldberg MB, Burland V, Venkatesan MM, Deng W,Fournier G, Mayhew GF, Plunkett G3, Rose DJ, Darling A, MauB, Perna NT, Payne SM, Runyen-Janecky LJ, Zhou S, SchwartzDC, Blattner FR (2003) Complete genome sequence andcomparative genomics of Shigella flexneri serotype 2a strain2457T. Infect Immun 71:2775–2786
44. Welch RA, Burland V, Plunkett G3, Redford P, Roesch P, Rasko D,Buckles EL, Liou S, Boutin A, Hackett J, Stroud D, Mayhew GF,Rose DJ, Zhou S, Schwartz DC, Perna NT,Mobley HLT, DonnenbergMS, Blattner FR (2002) Extensive mosaic structure revealed by thecomplete genome sequence of uropathogenic Escherichia coli. ProcNatl Acad Sc USA 99:17020–17024
45. Woo PCY, Lau SKP, Teng JLL, Tse H, Yuen K (2008) Then andnow: use of 16S rDNA gene sequencing for bacterial identifica-tion and discovery of novel bacteria in clinical microbiologylaboratories. Clin Microbiol Infect 14:908–934
46. Yang F, Yang J, Zhang X, Chen L, Jiang Y, Yan Y, Tang X, WangJ, Xiong Z, Dong J, Xue Y, Zhu Y, Xu X, Sun L, Chen S, Nie H,Peng J, Xu J, Wang Y, Yuan Z, Wen Y, Yao Z, Shen Y, Qiang B,Hou Y, Yu J, Jin Q (2005) Genome dynamics and diversity ofShigella species, the etiologic agents of bacillary dysentery.Nucleic Acids Res 33:6445–6458
47. Zimmer C (2008) Microcosm: E. coli and the new science of life.Pantheon books, New York
Comparison of 61 Sequenced Escherichia coli Genomes
Top Related