Alan Moran_Thesis submission (1)

GENE40060 - Genetics Research Project Detection of short-term positive selection in Verotoxigenic Escherichia coli Submitted by: Alan Moran Student Number: 11452982 Supervisor: Dr Peadar Ó Gaora, BA, MSc, PhD


GENE40060 - Genetics Research Project

Detection of short-term positive selection in Verotoxigenic Escherichia coli

Submitted by: Alan Moran

Student Number: 11452982

Supervisor: Dr Peadar Ó Gaora, BA, MSc, PhD


Summary:

The aim of this study was to identify genes under short-term positive selection in Verotoxigenic Escherichia coli (VTEC), primarily genes associated with virulence or the enhancement of virulence. VTEC are responsible for a number of diseases in humans, most notably haemolytic uremic syndrome (HUS). These bacteria produce characteristic virulence factors such as verotoxins and intimin. The extended aim of this study was therefore to investigate virulence factors associated with VTEC infection beyond those that distinguish these bacteria. Positive, or Darwinian, selection refers to the situation in which a more extreme phenotype is constantly selected for within the population, resulting in an increase in the frequency of the allele. The ‘short-term’ basis of this investigation describes the situation whereby this phenotype has been selected for relatively recently; it is therefore likely that the phenotype is encoded by a single allele which exhibits no silent mutations. A number of candidate genes were detected on this basis using the software programme ‘Timezone’, which focuses primarily on constructing phylogenetic gene trees and examining mutations and hotspot mutations within these trees. A common trend was observed among the candidate genes: most of the results associated with virulence were for genes associated with bacteriophages, the membrane, transposases, and cell motility. These results agree with many other studies which have illustrated the importance to these bacteria of virulence-associated phenomena such as horizontal gene transfer and modification of the membrane in order to avoid the host immune defences. Further investigations were carried out to study the associated pattern of evolution. Here it was observed that primarily parallel hotspot mutations were occurring in these genes, an example of selection acting on these genes to induce a gain-of-function rather than a loss-of-function. Taken as a whole, this study demonstrated that selection is acting on these bacteria mainly through hotspot mutations that modify primarily commensal genes and change their function, with the effect of enhancing virulence.


Table of Contents:

Summary
1. Introduction
1.1 Background
1.2 Mechanism of Infection
1.3 Defining non-O157:H7 infections
1.4 Comparison of VTEC and commensal strains of E. coli
1.5 Short-term positive selection
1.6 Statement of Intent
2. Materials & Methods
2.1 Software
2.2 Sequence processing
2.3.1 Timezone; extraction of orthologous gene sets from multiple genomes
2.3.2 Timezone; candidate gene selection
2.4 Troubleshooting problems
3. Results
3.1 Candidate gene selection
3.2 Zonal phylogeny analysis
3.3 Candidate gene list
3.4 Core gene presence among candidates
3.5.1 DAVID analysis; O157 analysis
3.5.2 DAVID analysis; Commensal and ‘top serotype’ strains
3.6 Premature Stop Codon analysis in Commensal and ‘top serotype’ strains
3.7.1 Hotspot analysis
3.7.2 Hotspot analysis; parallel vs coincidental
3.7.3 Hotspot analysis; recombinant O157, O104, and O111 genes
4. Discussion
5. Acknowledgements
6. References
7. Appendix


1. Introduction:

1.1 Background

Escherichia coli (E. coli) is a household name for scientists and non-scientists alike. It is a natural resident of the lower intestine in humans and a very well-studied model organism. However, it more often makes negative headlines owing to the many reported outbreaks of pathotypes of this bacterium, which can have very harmful effects on the host. One such pathotype is ‘Verotoxigenic Escherichia coli’ (VTEC), which is also referred to as Shiga toxin-producing E. coli (STEC).

VTEC regularly cause sporadic infections and outbreaks in human populations. This pathotype is responsible for a wide range of diseases in humans, such as diarrhoea, haemorrhagic colitis, and haemolytic uremic syndrome (HUS) (1). Strains that belong to the serotype O157:H7 are the most common cause of infection. Farm animals such as cattle and sheep are normally the most frequent reservoirs of this bacterium. Hence, infection often occurs as a result of food contamination.

1.2 Mechanism of Infection

This pathotype of E. coli is referred to as VTEC because of one defining characteristic alluded to in its name. VTEC have the capacity to produce one or more Shiga-like verotoxins (VT), VT1 and VT2, which are also referred to as stx (1). There are two sub-types of VT1 and four sub-types of VT2, and they are encoded by bacteriophages (2). Studies have reported that VTEC expressing VT2 in human infections have a higher risk of causing severe disease (1). Studies of the mechanism of infection have shown that these toxins are AB5 toxins that bind to tissues expressing the glycolipid receptor globotriaosylceramide (Gb3). An AB5 toxin contains a polypeptide A subunit that is linked to a pentamer of identical B subunits. The A subunit is the active component, while the B subunits are responsible for mediating the entry of the holotoxin into the cell (2). This results in interference with the 60S ribosomal subunit, which inhibits protein synthesis. This action leads to cell death, or apoptosis.

Although this is the characteristic method of pathogenesis, it is not the only one. Another key factor in the virulence of VTEC is its adhesion to and colonization of specific sites such as the small intestine, in a manner similar to Enteropathogenic E. coli (EPEC) strains. In this case, attaching and effacing (AE) lesions are produced on the target cells. VTEC achieves this by producing the adhesion factor intimin, which is responsible for attachment to intestinal epithelial cells (3). Intimin is encoded by the eae gene, which is located on the chromosomal LEE pathogenicity island. The LEE pathogenicity island also harbours other important virulence genes such as tir, espA, espB, espC, and espD. The espA, espB, and espD genes are associated with the production of a Type III secretion system (TTSS), which aids the transfer of VTEC proteins into the host cell (3). VTs may be the defining disease-causing feature of these bacteria, but studies have shown that VTEC serotypes regularly implicated in disease frequently contain the LEE pathogenicity island (3).

1.3 Defining non-O157:H7 infections

VTEC have been a cause of much concern regarding foodborne illness worldwide, resulting in outbreaks in Western and developing countries alike. As mentioned above, the serotype O157:H7 has been the most highlighted cause of VTEC infections. As a result, this bacterium has been widely studied. However, it is becoming increasingly evident that there are also many disease-causing non-O157:H7 serotypes. Although these serotypes may share similar pathogenic traits with O157:H7, they must still be examined on their own merits in order for a successful diagnosis to be made. Examination of this area has resulted in what scientists now refer to as ‘the big six’, the most common infectious non-O157:H7 VTEC agents: O26, O45, O103, O111, O121, and O145 (4).

1.4 Comparison of VTEC and commensal strains of E. coli

It is important to remember that E. coli is part of the natural microflora of the human

gastrointestinal tract, and largely exists within a commensal, or even mutualistic relationship

with humans (5). However, the pathogenic E. coli clones have been able to exploit new niches

as a result of the shift from commensalism to pathogenicity. This contrast can serve as a

useful scenario for scientists who seek to explore what other differences may now be present

in the genetic makeup of VTEC.

The application of comparative genomics is extremely useful in cases such as this. For example, by contrasting the pathogenic VTEC with the natural commensal state of E. coli, one could infer where the shift to pathogenesis has occurred before and where it may occur again. This apparent shift to a pathogenic state, or pathoadaptation (6), is not uncommon among bacterial lineages. For example, Staphylococcus aureus is commonly located in the nasopharynx and moist skin folds of humans, causing no damage to the host. However, it can cause serious infection when found in other areas of the body; patients can suffer from pneumonia when this bacterium infects the lungs (6). Similarly, comparing the various VTEC serotypes with one another may allow scientists to characterize each more accurately. Results such as this would be highly desirable in a clinical setting.

1.5 Short-term positive selection

Scientific research has traditionally focused on two primary methods by which pathogenic traits are acquired: horizontal gene transfer and the accumulation of mutations in genes over long periods of time. However, another mechanism of adaptation in pathogenic bacterial species is coming to the fore: the occurrence of point mutations in genes common to all strains, also referred to as ‘core’ genes (7).

This phenomenon has been referred to by many studies as ‘short-term selection’. It describes an evolutionary strategy adopted by many pathogenic bacterial lineages to increase pathogenic fitness via pathoadaptation in commensal genes present in members of that lineage (8). Although these pathogenic adaptations are beneficial within a certain niche, there is a sacrifice involved, as they disrupt the original role of the gene. Hence, these pathoadaptations are continuously under positive, or ‘Darwinian’, selection and are also constantly selected out of the genome. This turnover accommodates the cost that must be paid in order to achieve greater virulence (8).

Many studies have focused on searching for specific pathogenic genes and their association with a certain phenotype, or niche (8). However, this type of approach is usually geared towards detecting genes that have adapted over a long evolutionary timescale, via various mutations, to confer a specifically pathogenic function, or genes that have been newly acquired via horizontal transfer. Short-term selection has often been missed by researchers because this form of diversification occurs on a relatively recent timescale, the genes involved being regularly selected for and against. Previous research has often lacked the tools required to examine this type of adaptation. However, as technology and computational approaches have developed, this type of analysis has become more feasible.

The central approach of this study involves the use of the Timezone software package, which detects one of the main footprints of short-term positive selection: hotspot, or convergent, mutations. Hotspot mutations are mutations which repeatedly occur at the same amino acid positions within genes. A hotspot mutation can be a very significant event, as it indicates that the replacement of a specific amino acid provides a specific adaptive advantage in a certain environment (9). Since these positions regularly accumulate mutations, certain functions can subsequently be selected in and out. The nature of these mutations suits the aim of short-term selection. Hence, the detection of hotspot mutations serves as a useful marker.

1.6 Statement of Intent

The chief aim of this study is to identify relevant virulence and pathogenicity genes that are undergoing short-term positive selection in a number of VTEC strains. This will be done by analysing the VTEC serotypes O157, O104, and O111. A secondary aim of this study is to recognise the associated patterns of evolution that are occurring. Further comparative studies will be made between a sub-set of Commensal strains and the foremost disease-causing VTEC serotypes. This type of study is extremely important for identifying further pathogenic factors associated with these bacteria, which will better enable us to characterize O157 and non-O157 infections. Hence, studies such as this could aid the development of new treatments against these pathogenic strains.


2. Materials & Methods:

2.1 Software

Timezone requires a Windows-based (XP or higher) operating-system (8). Table 1 outlines

the Timezone dependencies and other programs required in the study. Important programs

such as Clustal and BLAST are contained within the Timezone package. In addition, PAUP*

4.0 must be purchased and downloaded separately (10). This application must be installed

correctly for Timezone to utilize it properly, as described by Chattopadhyay et al. (8).

Table 1: A list of the software versions used in this project, and where to acquire them.

Program            Source
Timezone 1.0       http://sourceforge.net/projects/timezone1/
TreeView X 0.5.0   http://darwin.zoology.gla.ac.uk/~rpage/treeviewx/download.html
PAUP* 4.0          http://paup.csit.fsu.edu/downl.html
WinSCP 5.5.6       http://winscp.net/eng/download.php
PuTTY 0.63         http://www.chiark.greenend.org.uk/~sgtatham/putty/

2.2 Sequence processing

Relevant sequences were downloaded from NCBI along with a collection of novel strains

sequenced by the lab. Thus, this large amount of data was sorted and organised into files

representative of the strains to be analysed. The Appendix (Table A) illustrates the script that

was used to perform this task.


Figure 1: Flow-chart demonstration of the process that was followed in order to prepare

sequences for Timezone. Serotype directories were labelled O157, O111, O104,

Commensals (containing a subset of commensal strains), and ‘top serotypes’ (O157 and non-

O157 ‘big six’ strains selected on the basis of reported outbreaks over the last decade or so)

(4).

Most of the sequence files contained ‘scaffolds’. In this case, a scaffold refers to the genomic and plasmid DNA contigs. These contigs were not present together as one continuous stretch of DNA sequence. Hence, it was necessary to concatenate the files in fasta format into one file representative of the entire genome of the strain in question, as demonstrated by Figure 1. After the concatenated file had been moved into its respective directory, the lengthy fasta headers in the sequence identifier of every strain were shortened so that PAUP* could run efficiently. The script used for this is displayed in the Appendix (Table B). Further format requirements meant that all input sequences also had to be saved as ‘text’ files. Thus, the processed sequences were moved from UNIX to Windows and subsequently saved as ‘text’ files. The final instructions regarding the titles of the list of strains to be analysed were followed, as described by Chattopadhyay et al. (8).
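The concatenation and header-shortening steps described above can be sketched as follows. This is a minimal illustration only and is not the script given in the Appendix (Tables A and B): the function names, the 20-character header limit, and the example records are assumptions made here for illustration.

```python
def shorten_header(header, max_len=20):
    """Keep only the first whitespace-delimited token of a FASTA header."""
    tokens = header.lstrip(">").split()
    token = tokens[0] if tokens else ""
    # The 20-character default is an assumed limit, not a documented PAUP* value.
    return ">" + token[:max_len]


def merge_fasta(texts, max_len=20):
    """Concatenate several FASTA texts into one multi-FASTA string."""
    out = []
    for text in texts:
        for line in text.splitlines():
            line = line.strip()
            if not line:
                continue  # drop blank lines between records
            if line.startswith(">"):
                out.append(shorten_header(line, max_len))
            else:
                out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical contig records for one strain (chromosome scaffold plus plasmid).
scaffold = ">scaffold_1 hypothetical strain, whole genome shotgun sequence\nACGT\nGGCC\n"
plasmid = ">plasmid_1 hypothetical strain, plasmid sequence\nTTAA\n"
print(merge_fasta([scaffold, plasmid]), end="")
```

Working on strings rather than files keeps the sketch easy to check; in practice the same functions could be applied to file contents read from the serotype directories.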

Furthermore, it was necessary to input a fully annotated reference genome in GenBank format, against which Timezone compares the sequences to be analysed in order to obtain the entire gene set present. The reference genomes downloaded from NCBI are described in Table 2. These reference genomes were also saved as text files in ‘C:\TimeZone_v1.0\Input’.


Table 2: The profile of serotypes that were subject to analysis.

Serotype                          Number of strains analysed   Reference genome
O157                              14                           E. coli O157:H7 str. Sakai
O104                              14                           E. coli O104:H4 str. 2011C-3493
O111                              11                           E. coli O111:H- str. 11128
Commensal serotypes               14                           E. coli str. K-12 substr. MG1655
‘Top’ disease-causing serotypes   10                           E. coli O157:H7 str. Sakai

At the Timezone command prompt, instructions were followed as described by Chattopadhyay et al. (8). The cut-off value for both sequence identity and coverage of sequence length was set to 95%. Timezone began its workflow upon entering these final details. The entire workflow, along with the outputs produced, is summarized in Figure 2.

2.3.1 Timezone; extraction of orthologous gene sets from multiple genomes

Timezone extracted the orthologous gene sets from the strains to be analysed based on alignment of these sequences with the reference genome. Most of the E. coli sequences had up to 5200 genes present in their genome. Firstly, a list of sequences which contained non-ACGT characters in their genes was produced. The genes from these sequences were excluded from the creation of the orthologous gene list, as a sequence with a large number of such characters was considered to be of poor quality (8). A list of orthologous genes which contain premature stop codons (PSC) was also produced (Figure 2), but this list was likewise excluded from further analysis.
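The two exclusion checks described above can be illustrated with a short sketch. It is not Timezone's implementation: the function names and example sequences are hypothetical, the standard stop codons are assumed, and any non-ACGT character is flagged here, whereas the text refers to sequences with a large number of such characters.

```python
# Stop codons of the standard genetic code (an assumption; the translation
# table used by Timezone is not stated in the text).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_non_acgt(seq):
    """True if the sequence contains any character other than A, C, G or T."""
    return any(base not in "ACGT" for base in seq.upper())


def has_premature_stop(seq):
    """True if an in-frame stop codon occurs before the final codon."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    return any(codon in STOP_CODONS for codon in codons[:-1])


# Hypothetical gene sequences, for illustration only.
genes = {
    "geneA": "ATGAAACCCTAA",  # clean coding sequence
    "geneB": "ATGNNACCCTAA",  # contains ambiguous bases
    "geneC": "ATGTAGCCCTAA",  # in-frame stop at the second codon
}
for name, seq in genes.items():
    print(name, has_non_acgt(seq), has_premature_stop(seq))
```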


Figure 2: Flow-chart demonstration of the work-flow followed by Timezone. Genome

sequences or gene lists were used as input (red box). Outputs are highlighted in the blue box.

Specific analysis steps are shown in the Process column.


2.3.2 Timezone; candidate gene selection

Gene-specific alignments and phylogenetic trees were generated. These were subsequently used to supply the main process of Timezone, whereby genes are analysed for the presence of short-term positive selection. This is illustrated by Figure 2 and comprises numerous tests, including zonal phylogeny analysis, the calculation of the ratio of structural to silent mutations in the terminal and internal branches of phylogenetic gene trees, the rate and ratio of total structural to silent mutations in genes, and the calculation of Tajima’s D and Fu & Li’s D values for each gene set. This was followed by testing for recombination with Rec-MaxChi and Rec-Phylpro, which separated the final list of candidate genes from candidate genes that had arisen through recombination.

2.4 Troubleshooting problems

A Timezone run using over 10 sequences can take in excess of 30 hours to finish. This proved

to be problematic when running a standalone computer with regard to maintaining power, and

maintaining that type of workload. In response to this, it was necessary to set up a remote

Windows Server. In addition, Timezone was run through the Windows command line.


3. Results:

3.1 Candidate gene selection

The principle behind most of the tests carried out by Timezone is to detect changes due to positive selection. These normally take the form of an amino acid change (a structural or non-synonymous change). Secondly, the tests try to identify whether this change occurred relatively recently on an evolutionary timescale. A number of criteria signify this. A gene was selected for candidacy if it met one, or a combination, of the following criteria: significantly higher allelic diversity in the evolutionarily recent zone than in the fixed (long-term) zone (EXT>PRI diversity at P<0.05); the occurrence of evolutionarily recent structural hotspot mutations (HSfreq-EXT); a significantly higher ratio of non-synonymous to synonymous mutations in the terminal branches (Tips) than in internal branches (Twigs) (Tips>Twigs dN/dS at P<0.05); dN/dS values significantly higher than 1 (dN/dS-based selection); or a negative D* value.

Table 3: A condensed illustration of the primary output of an O157 Timezone run.

Gene Name   Product                  EXT>PRI diversity at P<0.05   HSfreq-EXT   Tips>Twigs dN/dS at P<0.05   dN/dS-based selection
ECs2998     Kil protein              sig                           0            non-sig                      Neutral
ECs1986     tail assembly protein    non-sig                       0.26087      non-sig                      Purifying
ECs1122     outer membrane protein   non-sig                       0.33333      sig                          Purifying

Table 3 displays each gene, its protein product, and the results of the candidate-determining tests that were conducted. The tests displayed are the main tests by which a gene was selected for candidacy; they were followed by testing for recombination. For ‘HSfreq-EXT’, a value greater than 0 represents significance; for ‘dN/dS-based selection’, a result of ‘positive’ represents significance.
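The ‘one, or a combination’ rule for candidacy can be made concrete with a small sketch applied to the three rows of Table 3. The dictionary keys and the function name are hypothetical and are not Timezone's own output field names; the D* value is treated as optional because it is not shown in Table 3.

```python
def criteria_met(row):
    """Return the candidacy criteria satisfied by one gene (empty list = none)."""
    met = []
    if str(row.get("ext_gt_pri", "")).lower() == "sig":
        met.append("EXT>PRI diversity")
    if row.get("hsfreq_ext", 0) > 0:
        met.append("HSfreq-EXT")
    if str(row.get("tips_gt_twigs", "")).lower() == "sig":
        met.append("Tips>Twigs dN/dS")
    if str(row.get("dnds_selection", "")).lower() == "positive":
        met.append("dN/dS-based selection")
    d_star = row.get("d_star")  # optional; not listed in Table 3
    if d_star is not None and d_star < 0:
        met.append("negative D*")
    return met


# Rows transcribed from Table 3 (key names are illustrative).
table3 = {
    "ECs2998": {"ext_gt_pri": "sig", "hsfreq_ext": 0,
                "tips_gt_twigs": "non-sig", "dnds_selection": "Neutral"},
    "ECs1986": {"ext_gt_pri": "non-sig", "hsfreq_ext": 0.26087,
                "tips_gt_twigs": "non-sig", "dnds_selection": "Purifying"},
    "ECs1122": {"ext_gt_pri": "non-sig", "hsfreq_ext": 0.33333,
                "tips_gt_twigs": "Sig", "dnds_selection": "Purifying"},
}
for gene, row in table3.items():
    print(gene, criteria_met(row))
```

Each of the three genes meets at least one criterion, which is consistent with their appearance in the candidate output.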


3.2 Zonal phylogeny analysis

This type of analysis categorizes genes as ‘RECENT’ or ‘FIXED’ in each of the strains used for analysis. A gene may either have multiple evolutionarily linked alleles differing via synonymous mutations (FIXED; Primary zone) or be encoded by single alleles exhibiting no silent mutations (RECENT; External zone). A high frequency of alleles in the external zone relative to the primary zone signifies the presence of positive selection.

Figure 3: Phylogram of the O157 gene ECs1991, which codes for an outer-membrane protein. Red boxes highlight short-term selection, whereas blue boxes highlight long-term selection. Each node label follows a format such as ‘RECENT-O157 H str H2687-n1-1S/2N-D47E/R81H’, which reads: ‘zone - strain name - number of strains representing this allele (n1) - number of synonymous and non-synonymous mutations giving rise to this allele (1S/2N) - the specific amino acid polymorphisms, including the residue position (e.g. glutamate for aspartate at position 47)’ (8).
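The node label format described in the caption of Figure 3 can be unpacked programmatically. The sketch below is an illustration based only on the single example label given there; the regular expression, the function name, and the handling of labels without amino acid changes are assumptions rather than a documented Timezone format.

```python
import re

# zone - strain name - n<strains> - <syn>S/<nonsyn>N - optional changes
LABEL = re.compile(
    r"^(?P<zone>RECENT|FIXED)-(?P<strain>.+)-n(?P<n>\d+)"
    r"-(?P<syn>\d+)S/(?P<nonsyn>\d+)N(?:-(?P<changes>.*))?$"
)
# Amino acid change: original residue, position, replacement (e.g. D47E).
CHANGE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_node(label):
    """Split a zonal phylogeny node label into its components."""
    m = LABEL.match(label)
    if m is None:
        raise ValueError(f"unrecognised node label: {label!r}")
    changes = []
    for item in (m.group("changes") or "").split("/"):
        c = CHANGE.match(item)
        if c:
            changes.append((c.group(1), int(c.group(2)), c.group(3)))
    return {
        "zone": m.group("zone"),
        "strain": m.group("strain"),
        "n_strains": int(m.group("n")),
        "synonymous": int(m.group("syn")),
        "nonsynonymous": int(m.group("nonsyn")),
        "changes": changes,
    }


print(parse_node("RECENT-O157 H str H2687-n1-1S/2N-D47E/R81H"))
```

In the parsed result, the tuple ("D", 47, "E") corresponds to glutamate (E) replacing aspartate (D) at position 47, matching the caption.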


3.3 Candidate gene list

A list of candidate genes was produced based on meeting the aforementioned criteria. This list of genes has undergone testing for recombination. Candidate genes that arose through recombination rather than mutation are not considered to be under the action of ‘true’ selection.

For each gene, Timezone reports the DNA sequence and protein alignments, the topologies of these alignments, the results of the zonal phylogeny analysis (including ZP-trees and information on mutations and HS-mutations), and the results of the other candidate-determining tests and recombination tests. It should be noted that these results were only visible for genes deemed suitable for candidacy (Figure 4), including genes considered to be recombinant. However, an annotation overview list was produced for all orthologues identified.

[Figure 4, panels A and B: bar charts of candidate gene product categories for O157 and O104. The O104 chart shows the categories Rhs element proteins, phage proteins, and transposases.]

Figure 4A, 4B & 4C: The number and profile of gene products extracted from the primary output of Timezone for O157, O104, and O111. Hypothetical proteins with no described function have been excluded from the analysis represented here. O157: total candidate gene number 74, total number of hypothetical proteins found 27. O104: total candidate gene number 32, total number of hypothetical proteins found 5. O111: total candidate gene number 68, total number of hypothetical proteins found 5. Note that the size of the bars is relative to the total number of candidate genes found for each serotype.

3.4 Core gene presence among candidates

Table 2 lists the number of strains that were analysed for each serotype; the reference genome was analysed in addition to these. Timezone reported the number and names of the strain sequences in which each candidate gene was present. Including the reference genome, 15 strains were analysed in the O157 and O104 analyses, so a gene needed to be present in all 15 strains to be considered a core gene. Likewise, 12 strains were analysed in the O111 analysis, as fewer O111 strains were available.
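The core-gene rule above reduces to a comparison between the number of strains in which a gene was found and the number of strains analysed. A minimal sketch follows; the function name and the label ‘mosaic’ for intermediate cases (borrowed from the caption of Figure 5) are choices made here for illustration.

```python
def classify_gene(n_present, n_total):
    """Classify a candidate gene by how many analysed strains carry it."""
    if not 1 <= n_present <= n_total:
        raise ValueError("n_present must be between 1 and n_total")
    if n_present == n_total:
        return "core"      # present in every strain analysed
    if n_present == 1:
        return "unique"    # present in a single sequence
    return "mosaic"        # present in some, but not all, strains


# 15 strains for O157 and O104 (14 plus the reference), 12 for O111.
print(classify_gene(15, 15))
print(classify_gene(12, 12))
print(classify_gene(9, 15))
print(classify_gene(1, 12))
```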

[Figure 4, panel C: bar chart of candidate gene product categories for O111: DNA-associated (methylation, replication, and repair), phage proteins, transposases, endonuclease, membrane proteins, and endopeptidase.]

Figure 5: The distribution of core and mosaic genes throughout the genes selected for

candidacy. The coloured-bar at the top of the graph represents this distribution from unique

(present in one sequence) to core (present in all sequences). There is a total of 25 core genes

under short-term positive selection.

3.5.1 DAVID analysis; O157 analysis

Database for Annotation, Visualization and Integrated Discovery (DAVID) analysis was carried out in order to visualize the Gene Ontology (GO) terms associated with the serotypes at the centre of this study (11) (12). Chart analysis was performed in this case, which groups genes that are represented by similar or identical GO terms.

A threshold count of 3 was applied. This determined that in order for a term to be considered

significant, it must represent a minimum gene count of 3. As a result, 50 genes were excluded

as the genes in this exclusion list may not have a relationship with any of the other genes

above the similarity threshold.


Table 4: The most commonly associated GO terms with the candidate O157 genes.

Term / Category                            Gene count   % of total candidate genes   P-value
Outer membrane                             7            9.7                          1.4e-5
Virulence-related outer membrane protein   6            8.3                          1.4e-7
Outer membrane protein, beta-barrel        6            8.3                          2.9e-7
Cell outer membrane                        6            8.3                          6.6e-5
External encapsulating structure part      6            8.3                          5.9e-4
Cell envelope                              6            8.3                          1.9e-3
Envelope                                   6            8.3                          9.8e-3
External encapsulating structure           6            8.3                          1.3e-2
Terminase small subunit                    4            5.6                          3.8e-7
Terminase small subunit                    4            5.6                          4.6e-6
DNA packaging                              4            5.6                          6.2e-6
Phage lambda membrane protein lom          4            5.6                          1.3e-5
Phage lambda minor tail protein L          4            5.6                          2.8e-5
Putative prophage tail fibre, C-terminal   4            5.6                          1.7e-4
Phage minor tail protein L                 4            5.6                          2.3e-4
Phage-related tail assembly protein I      3            4.2                          1.5e-3
Bacteriophage lambda tail assembly I       3            4.2                          6.4e-3

Table 4 shows the number of genes associated with each GO term and the percentage this represents of the total genes selected for analysis. Note that a gene can be associated with more than one GO term. The mean P-values are also given to indicate statistical significance.


3.5.2 DAVID analysis; Commensal and ‘top serotype’ strains

Cluster analysis was the main form of inspection here. This groups chart GO terms together based on common biology and similar function. Both analyses resulted in a large number of associated GO terms. Hence, the classification stringency was set to ‘highest’. In addition, the effort to maintain statistical significance was strengthened by increasing the kappa ‘similarity term overlap’. For ‘Top Serotypes’ this kappa value was raised to 6, and for ‘Commensal strains’ it was increased to 9. In order to maintain the integrity of the analysis, it was important to compare a similar number of cluster terms (<20). However, the number of cluster terms for the Commensal strains was more resistant to increases in the kappa score, so it was necessary to raise the value further for that analysis.


Figure 6A & 6B: The GO cluster terms that were most represented for ‘Commensals’ and ‘top serotypes’. The biological significance of each group of terms is graded by its enrichment score (11) (12). The percentage represents the ‘enrichment’ score for each cluster as a proportion of the total ‘enrichment’ score. The higher the percentage, the more enriched the group is, and hence the more biologically significant it is relative to the other groups. The list of terms begins with the most enriched and ends with the least enriched.
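The percentage described in the caption of Figure 6 is each cluster's enrichment score divided by the sum of all cluster scores. A minimal sketch of that calculation follows; the cluster names and scores are hypothetical placeholders, not values from this study.

```python
def enrichment_percentages(scores):
    """Express each cluster's enrichment score as a percentage of the total."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("total enrichment score must be positive")
    return {name: 100.0 * score / total for name, score in scores.items()}


# Hypothetical enrichment scores for two clusters, for illustration only.
example = {"cluster 1": 3.0, "cluster 2": 1.0}
for name, pct in enrichment_percentages(example).items():
    print(f"{name}: {pct:.1f}%")
```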

[Figure 7: bar chart of the number of genes with PSC. Top Serotype: 466; Commensal: 191.]

3.6 Premature Stop Codon analysis in Commensal and ‘top serotype’ strains

Figure 2 illustrates that genes and strains that contain PSC are listed, and subsequently

excluded from further analysis as these genes have been inactivated. Some studies have

shown that PSC are correlated to bacterial evolution (26).

Figure 7: The number of genes which had premature stop codons present in each analysis. Each bar is colour-coded, and the exact number of genes with PSC is given.

[Figure 8A: hotspot frequencies in the long-term and short-term zones, on a scale of 0% to 100%. Figure 8B: mean ratio of HS mutations to total amount of aa changes for O111, O104, and O157, on a scale of 0 to 0.5; plotted values 0.2535418, 0.39134437, and 0.30982823.]

3.7.1 Hotspot analysis

Hotspot mutations have been described as the ‘footprints of short-term positive selection’.

These types of mutations can illustrate interesting patterns of evolution upon further

inspection. For example, the frequency at which HS-mutations appear in the genome can

signify the extent to which selection acts on these mutations in order to drive evolution.

Figure 8A & 8B: A display of the ratio of HS-mutations. 8A displays the proportion of hotspot (HS) mutations in the long-term and short-term zones of O157, O104, and O111. There were only 2 HS-mutations in the long-term zone of the genes of these serotypes. Hence, the percentage of HS-mutations in the long-term zone is 0.315%, which is too small to be visible in the graph. 8B illustrates the mean ratio of HS-mutations to the total number of amino acid (aa) changes in each of the genomes of the serotypes analysed.


3.7.2 Hotspot analysis; parallel vs coincidental

The nature of these hotspot mutations is extremely important in order to determine what

pattern of evolution is being followed. Hotspot mutations can occur as either parallel or

coincidental. The former refers to a situation whereby the same amino acid replacement

occurs at each of these hotspot positions, whereas the latter refers to the occurrence of

different amino acid replacements (23). Figure 9 illustrates that parallel hotspot mutations are

predominantly occurring in these VTEC strains.

Figure 9: Different types of hotspot accumulations across the three serotypes analysed.

Here we can see the number of candidate genes in each strain that accumulated parallel

hotspot mutations only, coincidental hotspot mutations only, or both. Genes that accumulated

no hotspot mutations are not included here.
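The distinction between parallel and coincidental hotspots, and the three gene categories counted in Figure 9, can be expressed as a short sketch. It assumes that each amino acid change is written in the ‘D47E’ style used in the node labels and that a position counts as a hotspot once it has been hit by at least two independent events; both the threshold and the function names are assumptions made here, not Timezone's definitions.

```python
import re
from collections import defaultdict

CHANGE = re.compile(r"^([A-Z])(\d+)([A-Z])$")  # e.g. D47E


def classify_hotspots(events, min_hits=2):
    """Map each hotspot position to 'parallel' or 'coincidental'."""
    replacements = defaultdict(list)
    for change in events:
        m = CHANGE.match(change)
        if m is None:
            raise ValueError(f"unrecognised change: {change!r}")
        replacements[int(m.group(2))].append(m.group(3))
    result = {}
    for position, residues in replacements.items():
        if len(residues) < min_hits:
            continue  # a single hit is not treated as a hotspot
        # Same replacement every time = parallel; any difference = coincidental.
        result[position] = "parallel" if len(set(residues)) == 1 else "coincidental"
    return result


def gene_category(events):
    """Summarise a gene as in Figure 9: parallel only, coincidental only, or both."""
    kinds = set(classify_hotspots(events).values())
    if not kinds:
        return "none"
    if kinds == {"parallel"}:
        return "parallel only"
    if kinds == {"coincidental"}:
        return "coincidental only"
    return "both"


# Hypothetical independent replacement events in one gene.
events = ["D47E", "D47E", "R81H", "R81C", "A120T"]
print(classify_hotspots(events))
print(gene_category(events))
```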

[Figure 10A: pie chart with the segments #Genes Para, #Genes Coin, and #Genes both; segment values 46%, 23%, and 31%. Figure 10B: pie chart with the segments #Genes Para and #Genes Coin; segment values 59% and 41%.]

3.7.3 Hotspot analysis; recombinant O157, O104, and O111 genes

Recombination-labelled genes may point towards some interesting patterns of evolution. Parallel hotspot polymorphisms may occur as point mutations, but such changes may also occur due to recombination. Yet Figure 10A shows that there is a high proportion of genes with both parallel and coincidental hotspot mutations, and the percentage of genes with only coincidental hotspot mutations also appears to be high.

Figure 10A & 10B: The distribution of the different types of HS-mutations in candidate

genes produced through recombination. 10A displays the distribution of the nature of

hotspots in recombinant genes. 10B illustrates the total distribution of parallel and

coincidental hotspot changes. In this case, recombinant genes that have both have been

included in both the number of genes with parallel, and the number of genes with coincidental

hotspot mutations.
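The two ways of counting described in this caption (three exclusive categories for 10A, and an inclusive count for 10B in which a gene with both types is counted twice) can be sketched as follows. The function names and the input format are assumptions, and the gene counts (6, 3 and 4) are hypothetical illustration values, not thesis data.

```python
def tally_hotspot_genes(gene_labels):
    """Count genes by hotspot type.

    gene_labels maps a gene name to the set of hotspot types seen in that
    gene ("parallel", "coincidental"); this input format is hypothetical.
    Returns (exclusive, inclusive) dictionaries of gene counts.
    """
    exclusive = {"parallel only": 0, "coincidental only": 0, "both": 0}
    inclusive = {"parallel": 0, "coincidental": 0}
    for kinds in gene_labels.values():
        has_parallel = "parallel" in kinds
        has_coincidental = "coincidental" in kinds
        if has_parallel and has_coincidental:
            exclusive["both"] += 1
        elif has_parallel:
            exclusive["parallel only"] += 1
        elif has_coincidental:
            exclusive["coincidental only"] += 1
        # Inclusive counting: a gene with both types adds one to each heading.
        inclusive["parallel"] += int(has_parallel)
        inclusive["coincidental"] += int(has_coincidental)
    return exclusive, inclusive


def as_percentages(counts):
    """Convert counts to whole-number percentages of their own total."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


# Hypothetical gene set: 6 parallel-only, 3 coincidental-only, 4 with both.
genes = {}
for i in range(6):
    genes[f"par{i}"] = {"parallel"}
for i in range(3):
    genes[f"coin{i}"] = {"coincidental"}
for i in range(4):
    genes[f"both{i}"] = {"parallel", "coincidental"}

exclusive, inclusive = tally_hotspot_genes(genes)
print(as_percentages(exclusive))  # {'parallel only': 46, 'coincidental only': 23, 'both': 31}
print(as_percentages(inclusive))  # {'parallel': 59, 'coincidental': 41}
```

Note that the inclusive percentages are taken over 17 gene memberships rather than 13 genes, because the four genes with both types are counted under each heading.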


4. Discussion:

There are many genes under short-term positive selection in the serotypes of VTEC that were

studied, and many of them are associated with pathogenicity. Figure 4 shows a

prominent presence of ‘phage-related’ proteins. This grouping covers

a wide range of proteins and their functions, including DNA packaging, tail assembly,

terminases, capsid assembly, portal proteins, and holin proteins, to name just a few.

Horizontal gene transfer plays a major role in the evolution of bacteria, which can account

for this observation. This mechanism of gene transfer is commonly mediated by

bacteriophages. These phages invade their bacterial host and integrate their genomes as

prophages into the resident genetic material. Indeed, these prophages can carry important new

information such as virulence factors, or further niche adaptation mechanisms (13).

Figure 5 illustrates that nearly all aspects of the production, release, and

integration of phages are under positive selection. For example, Table 4 shows that the GO

term “Phage lambda membrane protein lom” is heavily represented. This protein is

incorporated into the host cell membrane during E. coli infection by phage lambda. Hence,

selection appears to favour this process, presumably because it benefits these bacteria.

It is apparent that this phenomenon is not just prevalent in one serotype such as O157 in

Figure 5A, but in all three serotypes that have been examined. Real-world examples of this

phenomenon exist. For example, the 2011 outbreak in Germany of haemolytic uraemic

syndrome (HUS) associated with E. coli O104:H4 was widely reported. Genomic studies have

shown that the enhanced pathogenicity of this strain was probably a result of horizontal

transfer, given the presence of stx-2 (normally present in other E. coli strains) and the

β-lactamase-encoding plasmid CTX-M-15 (often identified in other members of

Enterobacteriaceae) (14).

The membrane is under heavy selection: Table 4 shows the number of GO terms, each with a

large gene count, that are associated with the membrane in these VTEC strains. The

membrane serves as the primary contact region for host-pathogen interactions and thus it

appears as a natural candidate for positive selection since there is constant pressure to avoid

immune system recognition, and also to have the capability to invade host cells (15). There

are 2 GO term categories of interest highlighted by Table 4: Virulence-related outer

membrane protein (P-value 1.4e-7) and Outer membrane protein, beta-barrel (P-value 2.9e-7).


Upon further inspection, “Virulence-related outer membrane proteins” refers to protein family

members which confer a distinct virulent phenotype such as lom and OmpX in E. coli. The

structure of OmpX is integral to its function as it contains a highly-variable four-strand β-

sheet protruding from the cell surface which would aid the binding of external proteins with

complementary β-sheets. This type of binding promotes adhesion and invasion of mammalian

cells, as well as defence against the host immune response (16, 17). Indeed, it has been

established that adhesion inside the host system is a vital part of the VTEC virulence armoury.

In this manner, positive selection for this protein family further enhances the virulence of

these bacteria.
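The P-values quoted above for the two outer membrane GO term categories come from a term-enrichment test. As a rough illustration of what such a P-value measures, the sketch below computes a plain one-sided hypergeometric probability. This is an assumption-laden stand-in: DAVID reports an EASE score, a modified Fisher exact P-value, so this function will not reproduce the values in Table 4, and all the numbers in the example call are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """Probability of seeing at least k annotated genes in a list of n genes.

    N: genes in the background, K: background genes carrying the term,
    n: genes in the candidate list, k: candidate genes carrying the term.
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical counts: 8 of 100 candidate genes carry a term that annotates
# only 20 of 4000 background genes.
p_value = hypergeom_upper_tail(k=8, n=100, K=20, N=4000)
print(f"{p_value:.2e}")
```

A small P-value here means that so many annotated genes would rarely land in the candidate list by chance alone, which is the sense in which a GO term is described as over-represented.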

Examination into the “Outer membrane protein, beta barrel” reveals that this is a

transmembrane beta-barrel structure, or porin, that allows the passage of small, hydrophilic,

or charged molecules (15). However, this structure also has a role to play in host-immune

interaction and pathogenesis since it serves as a receptor for phages, antibiotics, and colicins

(15). This transmembrane beta-barrel structure can be found in outer membrane proteins such

as OmpA, and in the outer membrane enzyme PagP of pathogenic gram-negative bacteria.

Outer membrane protein A (OmpA) plays a multitude of roles. For example, colicins K and L

require the action of OmpA for correct functioning, and it also serves as a receptor for a

number of T-even-like phages (18). PagP, or its E. coli homolog CrcA, also helps the bacterium

to avoid the host immune system. Lipopolysaccharide (LPS) is a major component of the

outer membrane in gram-negative bacteria. It contains a hydrophobic anchor, referred to as

lipid A. In addition to this, lipid A is also an active component of the LPS endotoxin. This

promotes septic shock during a bacterial infection in extreme cases (19). However, the

pathogenic capabilities of this lipid can be further enhanced with some modification. The

aforementioned enzymes catalyse the transfer of palmitate from a phospholipid to a

glucosamine unit of lipid A. This action provides the bacteria with resistance to the response

of the innate immune system, such as cationic anti-microbial peptides (CAMPs). Furthermore,

it also antagonizes LPS-mediated signal transduction in human cells (19). Thus, a common

trend can be observed in the membrane. Positive selection in many of these

genes appears to be acting on processes associated with host-immune attack and evasion, and

binding of phages and colicins.

Selection for phage and membrane-associated activities is evident. However, O104 and O111

did not return any results for associated GO terms. This is most likely due to the fact that there is


poorer characterization of these serotypes and hence the ‘GI’ numbers used for input did not

map to any GO terms present in the database to elicit any significant results. However, the

data presented in Figure 5 suggests that transposases and transposable-elements merit further

examination.

A transposase catalyses the movement of a transposon to another part of the genome. A number

of transposases appeared to be under short-term positive selection during this analysis such as

transposase IS3, IS629 transposase OrfB, and IS1 and IS5 transposases. Opinions in the

literature differ as to the significance of positive selection for transposable

elements. Some studies suggest that insertion of transposable elements has a

negative fitness effect on the organism, and simply occurs due to the selfish nature of these

genetic elements. Genes, like organisms, struggle for existence and the most successful genes

are those that persist. Thus, it has been postulated these genes successfully persist in a manner

which is similar to the nature of pathogens persisting in their hosts (15).

However, other research has suggested theories to the contrary. For

example, it has been suggested that silent catabolic operons in E. coli can be activated by IS

elements in the presence of the substrate for that operon. In addition, this transposition occurs

at a higher rate in starving cells than in growing ones. In this case, these transposable

elements contribute to the survival of the cell (20). In any case, it is unclear as to why these

groups of genes are being positively selected for in VTEC, whether it be for selfish purposes

or for the benefit of the organism in terms of survival and pathogenesis. Despite this,

however, it cannot be denied that these transposases are being selected for, heavily so in the

case of O111, and it certainly re-opens the debate as to what role these elements are playing.

Figure 7A shows a focus on ‘Organelle membrane’ in the commensal strains,

whereas the ‘top serotypes’ display a more even distribution of terms under selection, with the

term ‘cell wall biogenesis’ being the most highly represented. A case could be made

for enhanced virulence selection in the ‘top serotypes’, as there is a notable

representation of ‘cell motility’ (13%) and ‘taxis’ (10%). Some studies have reported that

enhanced mechanisms for cell motility and chemotaxis are associated with enhanced virulence

in bacteria (25). Thus, this is a point worth highlighting here. Despite this,

the overall profiles appear to be broadly similar, with some minor exceptions.

Perhaps this is unsurprising, however, since it is largely commensal genes that are under

selection in both cases. Previous studies have indicated that there is significant mosaicism


between the genome sequences of commensal and pathogenic strains of E. coli. Indeed,

inspections such as this have revealed that traits that were largely thought to be almost unique

to the pathogenic strains can be found within the commensal genome also (22). This would

most likely aid the survival of the pathogenic species as the commensal population

continually serves as a useful ‘resource’ from which further pathogenic members can be

obtained via horizontal gene transfer in order to explore novel niches. However, the

commensal and pathogenic populations are not so diverse that the commensals cannot

maintain the primary reservoir habitat where the long-term survival of the organism mainly

lies. For example, pathoadaptive traits will be selected for in the pathogenic habitat but

selected against in the commensal habitat (23). This is the theory behind ‘source-sink’

dynamics and hence, in this manner, the commensal and opportunistic nature of E. coli can be

maintained.

In addition to this, analysis of the premature stop codons (PSC) in the commensal and ‘top

serotype’ strains is particularly interesting. Figure 7 demonstrates that the number of genes

with PSC in the top pathogenic serotype strains is almost 2.5 times the number of genes with

PSC in the commensal strains. Some studies have suggested that this is the result of the

adaptation of the pathogen to its ‘novel’ habitat. Thus far, pathoadaptation in VTEC has been

described by gain-of-function modification in order for the bacteria to better exploit their niche.

However, it is equally important for genes that are no longer compatible with the ‘pathogenic

lifestyle’ to be inactivated. In other words, this is pathoadaptation via loss-of-function

modifications. This is another direction evolution can take during adaptation to a new habitat.

At the beginning of this study, it was suggested that this type of analysis would yield a

significant presence of core genes in the results. Figure 5 further supports this hypothesis.

Although the most significant presence is technically from mosaic genes, it should be noted

that these points are mainly concentrated in the locality of the core gene region. Thus, it has

become evident that pathoadaptation is occurring in these pathogenic bacteria through the

means of mutations in commensal genes in order to confer a short-term advantage, yet these

mutations will only be mildly deleterious in the ancestral, commensal niche. This type of

pathoadaptation suits the opportunistic nature of these bacteria.

Significantly, one must observe the absence of the VTEC characteristic pathogenic genes in

the candidate list of genes. Before this study was conducted, it was expected that these genes

would naturally be present, such is their importance to these strains of E. coli. However, the


very nature of this analysis does not include these genes. This is due to the fact that the

candidate list of genes, for the most part, includes primarily commensal genes that are

possibly being hijacked in order to further enhance the bacteria’s virulence weaponry. These

genes are under short-term positive selection, and are just as likely to be selected against in

order to restore the balance. In contrast to this, the previously stated quintessential VTEC

genes are constant virulence factors for these bacteria. In other words, they are not likely to be

selected in and out of the genome; rather, they continuously serve the pathogenic efforts of

these bacteria.

The point mutations that are occurring in the commensal or ‘core’ genes are occurring mainly

as mutations in hotspot positions. Figure 8A demonstrates the extent to which short-term

positive selection uses these types of mutations as its main driver since almost none can be

witnessed to be occurring in the ‘long-term’ zone. This certainly fits in with the picture that

these protein variants which have accumulated recent hotspot mutations could be functionally

significant for short-term adaptation (23). However, such is the nature of hotspot mutations,

these protein variants could also revert to their original, commensal state. In

addition to this, Figure 8B illustrates the overall importance of hotspot mutations, as these

types of mutations account for 25-39% of the total number of changes happening in the three

VTEC serotypes that have been examined.
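The quantity behind this range, the share of amino acid changes that fall at hotspot positions, averaged over the genomes of a serotype, can be written out as a short calculation. The sketch below assumes per-genome counts are available as (hotspot changes, total changes) pairs; the function names and the example counts are hypothetical and are not taken from the thesis data.

```python
def hotspot_fraction(hotspot_changes, total_changes):
    """Fraction of amino acid changes in one genome that are hotspot changes."""
    if total_changes <= 0:
        raise ValueError("total_changes must be positive")
    return hotspot_changes / total_changes


def mean_hotspot_percentage(per_genome_counts):
    """Mean hotspot percentage over genomes.

    per_genome_counts: list of (hotspot_changes, total_changes) pairs,
    one pair per genome (hypothetical input format).
    """
    fractions = [hotspot_fraction(h, t) for h, t in per_genome_counts]
    return 100 * sum(fractions) / len(fractions)


# Hypothetical serotype with two genomes: 30 of 100 and 20 of 100 changes
# fall at hotspot positions.
print(mean_hotspot_percentage([(30, 100), (20, 100)]))
```

Averaging the per-genome ratios, as done here, weights every genome equally; pooling all changes before dividing would instead weight genomes by how many changes they carry, and the caption of Figure 8B suggests the former.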

The predominance of parallel HS-mutations shown by Figure 9 signifies that selection is

acting on these genes in order to modify the protein in a specific and directional manner, as

the same amino acid is being continuously inserted into these positions. This is in line with

the principle of positive selection which aims to produce a shift in the phenotype. In contrast

to this, if coincidental hotspot mutations were predominant, this would show that selection

was acting in order to eliminate protein function as multiple types of amino acids would be

accumulating in positions that are vital to the function of the protein (23). In the case of O111,

however, it appears that both parallel and coincidental changes are co-occurring.

Recombinant genes are showing a high frequency of parallel changes (Figure 10). This is to

be expected, as parallel HS-mutations can arise not only as point mutations but also through

recombination. Hence, there is normally a much higher frequency of recombinant genes

that display parallel HS-mutations than coincidental HS-mutations. In this case, however,

there is a high proportion of recombinant genes displaying coincidental hotspot mutations.

Figure 10B illustrates the total proportion of coincidental and parallel hotspot mutations in the


recombinant genes. This supports the observation that coincidental hotspot changes

account for a high percentage of the total number of HS-mutations in recombinant genes, higher

than would be normally expected. This is suggestive of the power of positive selection to

produce sequence changes not just through mutation, but through recombination also. This

broadens the horizons of the organism and widens its scope to adapt (8, 23).

In conclusion, this study has succeeded in achieving its aims by identifying further traits

associated with the pathogenicity of VTEC, including a more detailed characterization of the

virulence traits associated with some non-O157 strains. The evidence strongly suggests that selection is

favouring ‘phages’ in these bacteria in order to increase the transfer of virulence-associated genetic

material across the population. Although we are currently aware of the genes that identify

VTEC strains, this study has focused on the ‘short-term’ selection of other genes for similar

purposes. This is important as this ‘short-term’ focus fits in with the ‘source-sink’ life-cycle of

Escherichia coli populations. Furthermore, this study has also achieved its secondary goal in

recognizing the pattern of evolution that is occurring here. The short-term pathoadaptation of

VTEC is occurring largely through hotspot mutations. Once again, this type of mutation is

suitable as it can be manipulated to produce gain-of-function in genes for shifting to a

pathogenic state, or to produce loss-of-function in genes in order to revert to

commensalism and ultimately maintain the survival of the species. It has been observed that

many of the genes are not specialist virulence factors, rather they are commensal genes that

are now being used to improve the armoury of virulence factors when exploiting novel niches.


5. Acknowledgements:

A word of thanks to Lisa Rogers and Dr. Peadar Ó Gaora for their contribution to this study.


6. References:

(1). Karama, M., Johnson, R. P., Holtslander, R., McEwen, S. A., & Gyles, C. L. (2008).

Prevalence and characterization of verotoxin-producing Escherichia coli (VTEC) in cattle

from an Ontario abattoir. Canadian Journal of Veterinary Research, 72(4), 297.

(2). Karmali, M. A., Gannon, V., & Sargeant, J. M. (2010). Verocytotoxin Escherichia coli

(VTEC). Veterinary microbiology, 140(3), 360-370.

(3). Bolton, D. J. (2011). Verocytotoxigenic (Shiga toxin–producing) Escherichia coli:

virulence factors and pathogenicity in the farm to fork paradigm. Foodborne pathogens and

disease, 8(3), 357-365.

(4). Yin, S., Jensen, M. A., Bai, J., DebRoy, C., Barrangou, R., & Dudley, E. G. (2013). The

evolutionary divergence of Shiga toxin-producing Escherichia coli is reflected in clustered

regularly interspaced short palindromic repeat (CRISPR) spacer composition. Applied and

environmental microbiology, 79(18), 5710-5720.

(5). Nataro, J. P., & Kaper, J. B. (1998). Diarrheagenic Escherichia coli. Clinical microbiology

reviews, 11(1), 142-201.

(6). Sokurenko, E. V., Hasty, D. L., & Dykhuizen, D. E. (1999). Pathoadaptive mutations:

gene loss and variation in bacterial pathogens. Trends in microbiology, 7(5), 191-195.

(7). Chattopadhyay, S., Paul, S., Kisiela, D. I., Linardopoulou, E. V., & Sokurenko, E. V.

(2012). Convergent molecular evolution of genomic cores in Salmonella enterica and

Escherichia coli. Journal of bacteriology, 194(18), 5002-5011.

(8). Chattopadhyay, S., Paul, S., Dykhuizen, D. E., & Sokurenko, E. V. (2013). Tracking

recent adaptive evolution in microbial species using TimeZone. Nature protocols, 8(4), 652-

665.

(9). Chattopadhyay, S., Dykhuizen, D. E., & Sokurenko, E. V. (2007). ZPS: visualization of

recent adaptive evolution of proteins. BMC bioinformatics, 8(1), 187.

(10). Swofford, D. L. 2003. PAUP*. Phylogenetic Analysis Using Parsimony (*and Other

Methods). Version 4. Sinauer Associates, Sunderland, Massachusetts.

(11). Huang D.W., Sherman B.T., Lempicki R.A. (2009). Systematic and integrative analysis

of large gene lists using DAVID Bioinformatics Resources. Nature Protoc. 4(1): 44-57.


(12). Huang D.W., Sherman B.T., Lempicki R.A. (2009). Bioinformatics enrichment tools:

paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids

Research; 37(1):1-13.

(13). Asadulghani, M. D., Ogura, Y., Ooka, T., Itoh, T., Sawaguchi, A., Iguchi, A., &

Hayashi, T. (2009). The defective prophage pool of Escherichia coli O157: prophage–

prophage interactions potentiate horizontal transfer of virulence determinants. PLoS

pathogens, 5(5), e1000408.

(14). Juhas, M. (2013). Horizontal gene transfer in human pathogens. Critical reviews in

microbiology, (0), 1-8.

(15). Petersen, L., Bollback, J. P., Dimmic, M., Hubisz, M., & Nielsen, R. (2007). Genes

under positive selection in Escherichia coli. Genome research, 17(9), 1336-1343.

(16). Otto, K., & Hermansson, M. (2004). Inactivation of ompX causes increased interactions

of type 1 fimbriated Escherichia coli with abiotic surfaces. Journal of bacteriology, 186(1),

226-234.

(17). Vogt, J., & Schulz, G. E. (1999). The structure of the outer membrane protein OmpX

from Escherichia coli reveals possible mechanisms of virulence. Structure, 7(10), 1301-1309.

(18). Johansson, M. U., Alioth, S., Hu, K., Walser, R., Koebnik, R., & Pervushin, K. (2007).

A minimal transmembrane β-barrel platform protein studied by nuclear magnetic

resonance. Biochemistry, 46(5), 1128-1140.

(19). Bishop, R. E., Gibbons, H. S., Guina, T., Trent, M. S., Miller, S. I., & Raetz, C. R.

(2000). Transfer of palmitate from phospholipids to lipid A in outer membranes of

Gram‐negative bacteria. The EMBO journal, 19(19), 5071-5080.

(20). Hall, B. G. (2000). Transposable elements as activators of cryptic genes in E. coli.

In Transposable Elements and Genome Evolution (pp. 181-187). Springer Netherlands.

(21). Rasko, D. A., Rosovitz, M. J., Myers, G. S., Mongodin, E. F., Fricke, W. F., Gajer, P., &

Ravel, J. (2008). The pangenome structure of Escherichia coli: comparative genomic analysis

of E. coli commensal and pathogenic isolates. Journal of bacteriology, 190(20), 6881-6893.

(22). Sokurenko, E. V., Hasty, D. L., & Dykhuizen, D. E. (1999). Pathoadaptive mutations:

gene loss and variation in bacterial pathogens. Trends in microbiology, 7(5), 191-195.


(23). Chattopadhyay, S., Weissman, S. J., Minin, V. N., Russo, T. A., Dykhuizen, D. E., &

Sokurenko, E. V. (2009). High frequency of hotspot mutations in core genes of Escherichia

coli due to short-term positive selection. Proceedings of the National Academy of

Sciences, 106(30), 12412-12417.

(24). Maurelli, A. T. (2007). Black holes, antivirulence genes, and gene inactivation in the

evolution of bacterial pathogens. FEMS microbiology letters, 267(1), 1-8.

(25). Josenhans, C., & Suerbaum, S. (2002). The role of motility as a virulence factor in

bacteria. International Journal of Medical Microbiology, 291(8), 605-614.

(26). Wong, T. Y., Fernandes, S., Sankhon, N., Leong, P. P., Kuo, J., & Liu, J. K. (2008).

Role of premature stop codons in bacterial evolution. Journal of bacteriology, 190(20), 6718-

6725.


7. Appendix:

Table A: Bash commands used together in one script referred to as the ‘assembly

script’.

Table B. ‘Header-truncator’ script.