Recognition of fold and sugar linkage for glycosyltransferases by multivariate sequence analysis

42
Recognition of fold and sugar linkage for glycosyltransferases by multivariate sequence analysis Maria L. Rosén 1 , Maria Edman 1,2 , Michael Sjöström 2 , and Åke Wieslander 1 * From the 1 Department of Biochemistry & Biophysics, Stockholm University, SE 106 91 Stockholm, Sweden, and the 2 Department of Chemistry, Organic Chemistry, Research Group for Chemometrics, Umeå University, SE 901 87 Umeå, Sweden Running title: Fold classification by multivariate sequence analysis *Corresponding author. Phone: +46-8-16 24 63 Fax: +46-8-15 36 79 E-mail: [email protected] Full address: Department of Biochemistry & Biophysics, Stockholm University, SE 106 91 Stockholm, Sweden JBC Papers in Press. Published on May 17, 2004 as Manuscript M402925200 Copyright 2004 by The American Society for Biochemistry and Molecular Biology, Inc. by guest on January 10, 2019 http://www.jbc.org/ Downloaded from

Transcript of Recognition of fold and sugar linkage for glycosyltransferases by multivariate sequence analysis

Recognition of fold and sugar linkage for glycosyltransferases by multivariate sequence

analysis

Maria L. Rosén1, Maria Edman1,2, Michael Sjöström2, and Åke Wieslander1*

From the 1 Department of Biochemistry & Biophysics, Stockholm University,

SE 106 91 Stockholm, Sweden, and the 2 Department of Chemistry, Organic Chemistry,

Research Group for Chemometrics, Umeå University, SE 901 87 Umeå, Sweden

Running title: Fold classification by multivariate sequence analysis

*Corresponding author. Phone: +46-8-16 24 63 Fax: +46-8-15 36 79

E-mail: [email protected]

Full address:

Department of Biochemistry & Biophysics, Stockholm University,

SE 106 91 Stockholm, Sweden

JBC Papers in Press. Published on May 17, 2004 as Manuscript M402925200

Copyright 2004 by The American Society for Biochemistry and Molecular Biology, Inc.

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

2

SUMMARY

Glycosyltransferases (GTs) are among the largest groups of enzymes found and are usually

classified on basis of sequence comparisons into many families of varying similarity (CAZy

systematics). Only two different Rossman-like folds have been detected (GT-A and GT-B)

within the small number of established crystal structures. A third uncharacterized fold has

been indicated with transmembrane organisation (GT-C). We here use a method based on

multivariate data analyses (MVDA) of property patterns in amino acid sequences, and can

with high accuracy recognise the correct fold in a large data set of GTs. Likewise, a retaining

or inverting enzymatic mechanism for attachment of the donor sugar could be properly

revealed in the GT-A and GT-B fold group sequences by such analyses. Sequence alignments

could be correlated to important variables in MVDA, and the separating amino acid positions

could be mapped over the active sites. These seem to be localised to similar positions in space

for the α/β/α binding motifs in the GT-B fold group structures. Analogous, active-site

sequence positions were found for the GT-A fold group. Multivariate property patterns could

also easily group most GTs annotated in the genomes of Escherichia coli and Synechocystis to

proper fold or organisation group, according to benchmarking comparisons at the MetaServer.

We conclude that the sequence property patterns revealed by the multivariate analyses seem

more conserved than amino acid types for these GT groups, and these patterns are also

conserved in the structures. Such patterns may also potentially define substrate preferences.

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

3

INTRODUCTION

Glycosyltransferases (GTs) are one of the largest and most diverse enzyme groups in all

living cells. This enzyme group performs many critical functions such as the synthesis of

glycogen, and carbohydrate-polymers, they act on proteins that mediate cell-cell interactions

and glycosylate transcription regulators (1). Hence, the variety of acceptors that GTs act on is

highly diverse, with saccharides, lipids, proteins, and nucleic acid as the most common. To

reflect the structural variation of the acceptors and donators that glycosyltransferases can use,

and the fact that the sequence similarity within the GTases is low, a large diversity of folds of

these enzymes have been expected (2). Glycosylhydrolases, enzymes performing the reverse

reaction have been found to have many different fold types (2), but so far only two different

folds have been discovered within the solved crystal structures for the GTs, named GT-A and

GT-B respectively (3). A third glycosyltransferase group (GT-C) has been discovered by

iterative BLAST searches and by structural comparisons (4). These proteins are integral

membrane proteins with the active site in the long loop, and with the transmembrane helix

number varying between 8 to 13. The GT-C family can also be found with Hidden Markov

method searches within the GT families. This method has also identified a fourth family,

unique for eukaryotes, named GT-D (5).

The GT-A fold consists of two tightly associated β/α/β domains, of varying sizes, with

separated nucleotide (SGC) and acceptor binding domains (6). The majority of the proteins in

this fold group have a short N-terminal cytoplasmic domain followed by a transmembrane

(TM) segment, a stem region to reach out from the membrane, and finally the large globular

enzyme part (3). The GT-B group has two similar, but less tightly associated Rossman-like

β/α/β fold domains, and are frequently membrane associated (7). However, only a very

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

4

limited number of proteins from the GT-B group seem to have TM segments. The GT-A

group presently consists of nine solved crystal structures from sequence families GT-2, GT-6,

GT-7, GT-8, GT-13, and GT-43, and the GT-B group of seven structures from GT-1, GT-4,

GT-20, GT-28, GT-35, GT-63, and GT-64, respectively, in the CAZy systematics sequence

database (http://afmb.cnrs-mrs.fr/~cazy/CAZY/index.html). Reaction mechanisms of both

retaining and inverting types are included in both A and B fold groups (denoted clans). Within

the GT-A fold family with the retaining mechanism, the stereochemistry of the C1 position of

the donor sugar substrate is conserved and well studied (2,8). The retaining mechanism is

poorly understood and may consist of a few steps involving stable intermediates. Hence, there

is no coupling between reaction mechanism and fold.

Most comparative studies of glycosyltransferases have been based on amino acid sequence

comparisons using BLAST and similar methods, which mainly will account for amino acid

similarities and identities at specific positions. In the CAZy database, glycosyltransferases

have been divided into about 70 families based on such sequence similarities (9), over a

comparatively short stretch of the protein. To be classified within the same family an E value

of less then E-3 over at least 100 amino acids is needed (9). The database consists of both

predicted ORFs and fully functionally determined proteins. GTs within the same sequence

family can have highly diverse substrate specificities, e.g. like members of family 2, but still

share substantial sequence similarities (10). The opposite has also been recorded, enzymes

having very low sequence similarity can utilise the same substrates (7) and have the same fold

(11). There is often a low sequence similarity between different families, and the fold within a

family is expected to be conserved for its members (9,11,12). The catalytic mechanisms are

also expected to be conserved in each CAZy family (13). However, in general very few amino

acid positions are conserved among the GTs, and the sequence similarities for the structures

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

5

within the two established fold groups are surprisingly low. Furthermore, the number of

predicted glycosyltransferases (from translated gene sequences) within an organism does not

reflect the size of the genome. For example Escherichia coli (K12) with a genome size of

4639 kb has 34 predicted glycosyltransferases according to CAZy, Bacillus subtilis 4215 kb

has 28 GTs, Synechocystis 3573 kb has 61 GTs, and finally Mycoplasma pneumoniae of 816

kb has three predicted GTs. To predict the function of new or encoded glycosyltransferases,

the ORF of interest needs to have a close sequence similarity with an enzyme of known

function. Likewise, to predict the function of new enzyme groups by sequence similarity

methods is almost impossible.

A new approach is needed. One possible way used here to search for new GTs, would be to

look for physico-chemical properties patterns along the whole sequence. Protein sequences

behave in manners far from random and the amino acid sequence is organised to reflect the

structure (14). Hence proteins with the same (or similar) structure or function are expected to

have similar property patterns in the sequence, even at low sequence similarities, and over

their full lengths or only partially (e.g. for certain domains). This is very evident for proteins

with repeated motifs, e.g. TIM barrels (14,15), and must be valid for others as well.

Furthermore, for proteins with the same function but different folds, certain local patterns in

the sequence determining the function can still be the same. GTs of a given family performing

the same reaction would be expected to have similar properties. With the multivariate

sequence analysis methods, based on amino acid properties, proteins with no sequence

similarity can be grouped together, visualising conserved property patterns within the

proteins. The sequence is translated into values describing different amino acid properties, i.e.

hydrophobicity, bulk volume, polarisability and charge. The periodicity of properties along a

sequence is calculated followed by a multivariate analysis (16). Furthermore, sequence length

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

6

variations are of less importance. This method has been used to predict location of cellular

proteins in Synechocystis 2(Rajalahti et. al., in preparation), to characterise E.coli

compartment proteins (17), and to classify signal peptides (18).

In the present study, as a first step, a reference set of glycosyltransferases from families with

known structure members was compiled from enzymes classified by CAZy. The multivariate

analysis method could conveniently separate the three different fold types from each other and

furthermore divide the two different reaction mechanisms within the A and B structural fold

groups using sequence information. The ability to predict the fold and the sugar orientation

properly was also established. Furthermore, ORFs of unknown function and that share no

sequence homology with known glycosyltransferases could potentially be identified with this

method to be glycosyltransferases.

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

7

EXPERIMENTAL PROCEDURES

Data Set─In this study 141 selected glycosyltransferases from 13 different CAZy families

(see supplementary material) were used as reference sets. The data set mainly includes

sequences with known function or with high sequence similarity with proteins with known

function, where the probability for the same function is high. The references set include both

prokaryotic and eukaryotic enzymes, using various substrates and acceptors. Full-length

amino acid sequences were used, i.e. no signal peptide or anchor domains found especially in

the GT-A group have been removed. The number of amino acids in the protein sequences

varies between 229 and 994. This data set is non-redundant and contains no pair of sequences

that share more than 55% identity.

Multivariate Sequence Analysis─The aim of this study is to search and use periodic physical

features in the proteins that separate GTs according to structure and reaction mechanism.

Amino acids can be described and characterised in a number of ways. Parameters such as

retention-times in different chromatographic systems, electric properties, molecular mass etc

are commonly used, as described by Wold et al., 1993 (16). To decrease the number of

parameters that describes each amino acid, and at the same time keep the information in a few

variables, so-called z-scales are used. These z-scales are derived from 29 physical-chemical

experimental parameters for the amino acids with principal component analysis (19). They can

approximately be translated as z1 “hydrophobicity/hydrophilicity”, z2 “bulk of side-chain”

and z3 as “polarisability/charge”, see Table I. To describe the periodicities in a protein, auto-

cross-covariance in the z-values, are used (ACC) (16). The ACC-program multiplies the first

z-value for the first amino acid with the second, followed with the first and third up to the

highest lag. The same procedure is performed with the second and the third z-scales and all

combinations of them, see Fig. 1. The ACC-terms are the average value for every interaction-

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

8

term at different lags, hence z(1)1 x z (1)2, z(1)1 x z (2)2, z(1)1 x z(3)2 etc. For short peptides such

as signal peptides, the size of the maximal lag is dependent on the shortest peptide in a set, but

for large polypeptides such as glycosyltransferases, the optimal lag seems to be between 15-

25 aa. After optimisation, a window size of 19 was chosen in this study, which gives 171

variables (19x3x3) for each sequence. Auto covariances with lag= 1, 2, 3...L were calculated

by the equation:

Index j is used for the z-scales (j=1,2,3), index i is the aa position (i=1,2...n) and n is the

number of amino acids in the sequence (cf. Fig 1). The crossed covariances between the two

different scales j and k are given by (note the difference between ACCjk and ACCkj):

Partial least squares projections to latent structures discriminant analysis (PLS-DA) finds

the relationships between a X-matrix and a Y-matrix, i.e. find relationship between sequence

properties of the X-matrix, and a Y-matrix defining the group membership (20). In this study

the Y-matrix is composed of dummy variables, hence a value of 1 is given to members of the

same group and 0 for non-members. The method is using class membership in the Y-matrix,

that in PLS is composed of features that are responses to the variables in the X matrix, here

fold group or reaction mechanism. PLS-DA is using the assumption that sequences belonging

to the same class have common features and therefore will behave similarly in the analysis, as

visualised in the score plot. This method can also be used to predict relationships for new

unclassified sequences. In PLS-DA a multidimensional space is formed where every variable

(e.g. z(1)1z(1)2, z(1)1z(2)2, z(1)1z(3)2) represents one dimension and every object (data from one

sequence) is a point in this space. For the reference set, this means 141 points in a 171

∑−

+

×=

lagn

i

lagijijlagj lagn

zzACC ,,

,

∑−

+≠ −

×=

lagn

i

lagikijlagkj lagn

zzACC ,,

,

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

9

dimensional space. To get the first PLS component a line that best approximates the data (best

separates the objects according to group classification) is fitted to the multidimensional space

and the data points are projected onto this line in order to get coordinate values or scores. To

get the second component, a new line is fitted to the data space that describes the second-most

of the variation. Only two components are plotted at one time, hence the first, best separating

dimension and the second in this work. The PLS model contains information both regarding

the relationship among the objects (scores) and the contribution of the variables to the model

(PLS weights). A weight plot shows the contribution of the variables for the separation, hence

what periodic features that are responsible for the separation between the fold structures or

reaction mechanisms. These features can then be searched for on the sequence level. The

objects best separated, hence localised far away from the centre of the plot, are the ones best

described by the variables with the largest weight. Such variables are therefore most likely to

be identified at the sequence level in these proteins.

To evaluate the complexity i.e. the numbers of PLS components to use in the model cross

validation is preformed, where all objects are withdrawn, here 1/10 at a time, and their y-

values are predicted from an updated model based on 9/10 of the objects. This procedure is

repeated ten times. A Q (cum)2 value is calculated which describes how much of the variance

in the Y matrix can be predicted by the model. To obtain a perfect score of 1, all objects

should be predicted back to the exact position given by the the Y matrix. A Q(cum)2 larger

then 0.1 corresponds to a 95% significance of the model (21).

Sequence alignment methods─Lalign was used to make pair-wise sequence alignments

(http://www.ch.embnet.org/software/LALIGN_form.html). To combine the pair-wise

alignments ClustalW was used (22). The ACC variables were searched for in the sequences

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

10

with the use of the established alignments from the above method. The sequences were

translated to the z-values found to be important followed by sliding along the translated

sequences according to the ACC variable, e.g. aligning the first amino acid with the 18th

according to variable z(1)1z(1)18 for the GT-A group, see results and Fig. 4 below. The

products (according to e.g. z(1)1z(1)18) between the two aligned amino acids at the positions

indicated were then calculated, and the values compared for the aligned positions. Product

values with equal signs (cf. z-values in Table I), hence positive or negative values, were

marked. These positions were indicated on the sequence alignment (cf. Fig. 4 below), and in

selected protein structures using Swiss-PdbViewer (23). The prediction of transmembrane

segments was done using TMHMM at http://www.cbs.dtu.dk/services/TMHMM-2.0/ (24).

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

11

RESULTS

Fold and reaction mechanism revealed by multivariate analysis─Do glycosyltransferases

contain sequence property periodicities related to protein architecture and enzyme function?

GTs from prokaryotes and eukaryotes using many different substrates were used in the

reference sets selected from the CAZy classification, and mainly from families where

members have at least one established crystal structure. A few related families were also

included when needed. Proteins belonging to the same CAZy family are believed to have the

same fold. MVDA, here PLS-DA, could divide the glycosyltransferase sequences according

to the established (or predicted) fold (Fig. 2). In the structure division between GT-A, GT-B

and the third newly discovered GT-C group, with transmembrane topology, a Q(cum)2 value

of 0.730 was obtained (Fig. 2 A). This is a high Q(cum)2 value (“prediction ability” of the

model, maximum is 1.0; cf. experimental procedures), but the separation between GT-A and

GT-B was less pronounced and partially overlapping. Note that the number of transmembrane

segments seems to have less impact on the distribution of the GT-C sequences, i.e. no

grouping. To further investigate the differences between the different fold groups GT-A and

GT-B were each analysed separately with GT-C, yielding Q(cum)2 values of 0.868 (Fig. 2 B)

and 0.91 (data not shown), respectively. A division between the GT- A and GT-B yielded a

Q(cum)2 of 0.655 (Fig. 2 C). A further separation within the GT-A and GT-B groups was

achieved on basis of reaction mechanism (Fig. 3 A & B), and Q(cum)2 values of 0.702 and

0.562 respectively, were obtained for the inverting and retaining clans within these two fold

groups. Hence, multivariate ACC analyses of potential sequence property profiles for

glycosyltransferases revealed clear groupings for fold and reaction mechanism by robust PLS

models with high cross validation values.

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

12

Separating sequence features─A PLS weight plot showing the distribution of each variable

for the reaction mechanism division for each score plot can be drawn (cf. experimental

procedures). Here only the corresponding weight plot for the division between retaining and

inverting GTs for the GT-B group are shown (Fig. 3). Variables that are on the same side as

the sequence class of interest in the score plot are positively correlated, and the variables on

the diagonal side of the plot are negatively or oppositely correlated. The variable information

was used to investigate if properties responsible for the separation of reaction mechanism

could be outlined in the sequences. Sequences from proteins with known structure and the

same reaction mechanism and fold were aligned to search for the variables in the sequences.

The proteins used to represent the retaining enzymes of the GT-A fold group were: α3GalT

(PDB, 1FG5) (25), GTA (PDB code, 1LZ0) (26), GTB (PDB, 1LZ7) (26) from CAZy

family GT6, LgtC (PDB, 1G9R) (27) from family GT-8, and the inverting β4Gal-T1 (1FGX)

(28) GT7; GlcaT-I (PDB, 1FGG) (29) family GT43; SpsA, (PDB, 1QGQ) (30) GT2 and

GnT1 (1FO8) (8) from family GT13. The chosen sequences were aligned by various methods,

both multiple (ClustalW), and pair-wise (LALIGN) alignments, and for both full length and

partial protein sequences. This was preformed to establish if there are any conserved regions

between the different GT families achieving the same stereochemistry for the sugar linkage. A

stretch of 84 amino acids in the retaining group (cf. above) could successfully be aligned. The

members of family 6 and 8 have no significant sequence similarity when comparing the whole

proteins. However, no long gap-free sequence alignment between all the different proteins

belonging to the inverting group (cf. sequence Table in supplementary material) could be

found using the above methods. A selection of variables with high weights found to be

important for the reaction mechanism separation within fold group GT-A (as illustrated in Fig.

3) were (in rank) z(2)1z(3)13, z(2)1z(2)5, z(1)1z(1)18, z(2)1z(2)10, z(2)1z(1)4, and z(1)1z(1)14.

Important variables at shorter distances were size (z(2)) and hydrophilicity/hydrophobicity

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

13

(z(1)) patterns, e.g. positions 1 to 4 and 1 to 5, corresponding to 3 and 4 amino acids apart,

such as amino acids on the same side of a helix. Autocorrelation analyses have shown that

proteins with several helices have a strong periodicity of hydrophobicity, of approximately

3.7 (31). This is close to the 1 to 5 and 1 to 4 positions in this work (cf. above). The longer

correlation distances (1 to 13, 1 to 18 above) are longer than the average lengths of alpha

helices and beta strands in proteins, but in Rossman-like folds important functional residues

are frequently found in the connecting loops (32,33). The established alignments between the

selected retaining enzymes were used to search for the latter variables at the sequence level,

but shorter distances, hence z(2)1z(2)5, z(2)1z(1)4, may be more difficult to visualise due to the

number of helices in the structures. The variable z(2)1z(3)13 was also used in the search for

property patterns in the alignment, but the pattern was not as clear as the z(1)1z(1)18 variable

and could be of importance in other regions of the proteins. Parts of this alignment

coincidentally overlap with the UDP binding site of the donor substrates (34). Using the z-

values from Table I the correlation for the product of the variable z(1)1z(1)18 along the

sequences in the alignment was tentatively identified (Fig. 4). Within the established

alignment, 12 variables (i.e. products) were negative and 5 positive for the retaining GT-A

group (Fig. 4, top section). This variable should according to the PLS weight plot be

negatively correlated to this group. Positions in the alignment that are members of such

variable pairs can be superimposed onto each other in the corresponding crystal structures

indicating similar position in space, as illustrated for the GT-B fold group below (Fig. 5).

However, no comparison could be made between the retaining and inverting GT-A groups

(clans), since a useable comparative alignment for the inverting mechanism sequences could

not be established for the families involved.

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

14

In the GT-B fold group the most important variables were (in rank) z(2)1z(2)5, z(1)1z(1)16

z(2)1z(2)8, z(1)1z(3)14, z(3)1z(1)11, and z(2)1z(2)2 for the separation of the retaining and inverting

sequences (Fig. 3 B), according to the PLS weight plot (Fig. 3 C). More of the major variables

here seemed to involve correlations over shorter distances than for the GT-A group; e.g.

z(2)1z(2)2 indicates side-chain volumes for residues on opposite sides of β-strands, and

z(2)1z(2)5 volumes for residues next to each other on the face of an α helix, respectively. The

alignment method described above was also applied here, and established between sequences

from the GT1 and GT28 families, with no or low sequence homology over the whole proteins.

GtfB (PDB, 1IIR)(35) and MurG (PDB, 1NLM) (36) were chosen in order to correlate the

alignment to the structures. Two more sequences from each family were chosen to get a more

stable alignment. Furthermore, to compare the inverting and retaining mechanisms, four

sequences were chosen from family GT4, a retaining family, but without solved crystal

structures. Two of these have a well established (validated) fold model, the alDGS and

spMGS lipid glycosyltransferases (37). An alignment could be made for both reaction

mechanisms for these proteins over the same region, which contains the UDP-binding sites

(Fig. 4, middle/bottom sections). The variable z(1)1z(3)16 (cf. above) was traced at the

sequence level by its z products, and should be positively correlated to the inverting enzyme

group according to Fig. 3. In the alignment 18 positive and 7 negative pairs was found within

the inverting group, and 3 positive and 7 negative pairs were found for the retaining families

(Fig. 4), confirming the importance of the z(1)1z(3)16 variable. A visualisation of the donor

substrate binding regions in the MurG and GtfB structures (both inverting) showed that these

positions seem to occupy similar positions in the structure space (Fig. 5 A-C), potentially so

also in the model for a retaining enzyme (Fig. 5 D). The difference between inverting and

retaining reaction mechanism within the GT-B fold group sequences was the amino acid

hydrophobicity. In the inverting group, where positive values are important, the property

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

15

pattern pairs consist of one hydrophobic and one hydrophilic amino acid (see Table I). The

retaining group should have equal sign hydrophobicity values, hence two hydrophobic or two

hydrophilic amino acids at positions 1 and 16. The pattern was similar for the GT-A fold

group, where the variable is z(1)1z(1)18, however the pattern has not yet been identified in the

inverting enzyme clan.

Predictions from Genomes─The three GT fold groups could be well separated from each

other, where the division between GT-C and each of the two others separately were stronger

(Fig. 2). All proposed GTs from E.coli annotated in the CAZy database were used to evaluate

the prediction power of the ACC/PLS model. The results from the fold predictions are shown

in Figure 2 D and Table II. Many family members of the GT-A fold group are known to have

an N-terminal helix that anchors the protein to the membrane, TM segments in the GT-B

group are however rare. A pair-wise division between the GT-A and GT-B groups’

individually with the GT-C group was also made and the E.coli proteins were predicted into

these models (data not shown), revealing the same results as with the three fold groups

together. Evaluation of Table II shows that seven E.coli glycosyltransferases were predicted

to belong to the GT-C fold group. All these have one or more hydrophobic TM segments

each. No proteins without TM segments were incorrectly grouped to this family for the E.coli

set. E.coli has four other proteins that have one proposed TM segment each, but they were not

grouped to the GT-C fold. The latter ones only come from two families, GT8 and GT51. GT8

has a member with a solved crystal structure, LgtC of the GT-A fold group (27). The C-

terminal part of LgtC consists of a domain very rich in basic residues and several hydrophobic

and aromatic residues in an amphiphatic organisation, but no TM segments. Hence,

presence/absence of amphiphatic segments seemed of minor importance (data not shown).

This domain is believed to bind to the membrane and is relatively conserved within the family

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

16

(38). The E.coli proteins described above from GT8 have the conserved amino acids (data not

shown) indicating that these protein are correctly grouped; hence they do not have any TM

segments. Multiple fold predictions, made by the superior MetaServer (39) (at

http://bioinfo.pl/meta), were used to evaluate the ACC/PLS predictions here. The family

GT51 had neither of the established fold type according to the MetaServer, supporting the

ACC/PLS results. The E.coli glycosyltransferases with predicted transmembrane helices from

GT2 all have the GT-A fold as a separate domain in between or after the integral membrane

domains. Generally, the results from the MetaServer coincided with the results obtained with

the multivariate method used here (see Table III).

The 61 annotated, potential glycosyltransferases from Synechocystis found in the CAZy

database (data not shown), were also analysed by the same method. Here, 10 GTs were

predicted to belong to the GT-C group. The separation between the GT-A and GT-B fold, and

versus the GT-C group (like in Fig. 2 C & D), was also preformed for the Synechocystis GTs.

The results were again very similar between the different classification methods (Table III).

The proteins grouped to the GT-C fold type all have a high hydrophobic TM segment content.

There were however a few exceptions; one protein from family 19 was grouped to GT-C

group but has no predicted hydrophobic TM segment, and there were eight GTs that have

predicted TM segments, but were not recognised by the ACC method. Within the latter, three

are from CAZy GT51 family. This group was not grouped to the GT-C fold type (cf. above),

even though they all have one predicted TM segment (but not experimentally verified). A

total evaluation of the prediction results using the MetaServer for “benchmarking” revealed

that only that 10 out of 54 or 19% were “incorrectly” grouped (including TM ones). These

results include classification of proteins with TM segments as a correct result. Within these

ten, two contain both the GT-A and GT-B fold types (gene number sll1528 and slr1063) and

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

17

had intermediate PLS results, grouping to either of the groups with a prediction value close to

0.5. These enzyme sequences were spliced (in silico) according to fold type. This division

distributed the new sequence segments to the “correct” fold types (data not shown). Two other

proteins (slr1816 and slr0626) also consisted of multiple domains with different functions,

like a known GT fold linked to a Trp domain involved in protein interactions (40). The

domain with a known GT fold fell into the correct fold type group when spliced (data not

shown).

In summary, glycosyltransferases can be classified and analysed by multivariate analysis

methods on basis of sequence property patterns. The method could successfully predict the

fold of GTs and also the orientation of the catalyzed bond between the donor sugar and the

acceptor molecule, i.e. retaining or inversion mechanisms. Potentially important amino acid

positions in the donor sites were also suggested. From genomic analyses new fold types,

where the two major folds GT-A and GT-B were found within the same protein, could also be

recognised.

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

18

DISCUSSION

Fold predictions of glycosyltransferases have recently been discussed by many (4,41,42).

This is an important enzyme group, since the enzymes utilise many different sugar substrates

and also acts on an extensive number of acceptors, with only a few fold groups described. The

sequence similarity varies greatly between enzymes performing the same enzymatic reaction

but the fold can still be the same. GTs have also been used to evaluate different fold

recognitions methods, and the fold knowledge has been used in attempts to identify new GTs

in organisms, e.g. Mycobacterium tuberculosis (42). In the latter study, multivariate methods

were used to analyse a large data set, including fold prediction results, as well as molecular

mass and theoretical isoelectric points (42). In the present study a different approach was

developed. Glycosyltransferases of known structure were analysed by translating the amino

acid sequence into z-scores describing their physico-chemical properties (19), followed by

comparing property patterns between the sequences. Of course, some other supervised

classification methods that are applicable for problems with more variables than objects could

have been used here. For example support vector machines, have successfully been applied by

Chou and co-workers (43,44) to predict structural domains in whole proteins. However, since

at present, we are unaware of a more general comparison between the PLS-DA and the

support vector machine classifier we will not speculate of the outcome of a change of

classification method, but most advanced classification methods if correctly trained usually

give rather similar results.

Conservation of properties─ The bearing idea was here that properties are more conserved

than amino acid types, leading to conservation of structures (cf. Mirny & Shakhnovich 1999)

(33). Describing and comparing the GTs based on property patterns worked well both for fold

type classification and stereochemistry of the enzyme product sugar linkage (Fig. 2 & 3). The

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

19

division between different fold types was better resolved when only analysing two fold types

at the time, especially when membrane-spanning group GT-C was included. When all three

fold types were separated in the same model, a partial overlap between GT-A and GT-B was

seen. Both GT-A and GT-B consist of Rossman-like folds, with alternating alpha helices and

beta strands. This was recognised by the method, and the difference between these two fold

types was smaller than the difference between globular proteins and transmembrane ones with

a high TM helix content (i.e. GT-A plus GT-B versus GT-C) (Fig. 2 A). Amphipathic helices

were not interfering, hence property patterns are very different for this transmembrane group

compared to the cytosolic and surface-bound proteins.

Evaluation of genome data─The multivariate method was also used to predict the fold for all

GTs in E.coli and Synechocystis included in CAZy (Table II & III). In E.coli the proportion of

GTs with established functions is high, and Synechocystis is an organism with very high GT

content. The data set consists of GTs belonging to the same GT families as the reference set,

but also from other families. This method works best with proteins from families included in

the reference set; it becomes easier for the program to predict the membership probability if

the predicted protein share some homology with the proteins in the reference set, most evident

for the GT-A and GT-B fold groups.

The MVDA method used here recognised the hydrophobic helices and grouped these

proteins according to the TM helix content to the GT-C family, even when the dominating

part of the protein had the GT-A or GT-B fold. In the E.coli set, seven GTs with high TM

content were predicted to belong to the GT-C group (Table II). These proteins are annotated

in CAZy family 2 (GT-A fold) and 51 (indifferent), and are not considered part of the GT-C

group (4). Five other E.coli proteins containing transmembrane segments were not grouped to

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

20

this latter fold group. Note that the TM content within the same GT family seems to vary

greatly, and TM segments are not accounted for in CAZy. Despite this, fold and reaction

mechanism could properly be predicted for proteins grouped to other GT families than the

major ones.

The predictions were consistent, independent of the fold families included in the separation

analysis, and independent on if two or three fold classes were used. The results obtained from

the MetaServer were used as a benchmark comparison. When predicting the fold, with all

three fold groups in the reference set, 82% of the E.coli GTs were predicted to have the right

fold, and excluding the GT-C group increased the prediction correctness to 86% (Table III).

In the first fold prediction, recognition of TM segments was regarded as the correct result.

The scores were improved when the GT-C group was removed from the prediction, hence the

TM content is no longer accounted for (Table III). Predicted glycosyltransferases from

Synechocystis, was also analysed by the same method, but even though the fraction of GTs

belonging to the major families GT2 and GT4 are larger, the prediction ability was somewhat

lower (Table III). However, most of these are not as well studied as the E.coli enzymes.

Retaining and inverting mechanisms─The ability to predict retaining and inversion

mechanisms was even higher than the structural predictions, 100% for the E.coli and 86% for

the Synechocystis set within the GT-A fold group, and 60% versus 80% for the GT-B group.

The same comparison for the GT-C group cannot be made, since it seems to consist only of

inverting enzymes. Little is understood about the differences at the sequence level between

inversion and retention. The importance of acidic residues in the active site has been

suggested, but exact conserved positions have not been established (41). Extensive sequence

alignments revealed here that there are charge differences in the UDP-binding motif within

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

21

the GT-B fold family (GT4, GT20 and GT1, GT28, respectively), data not showed. The

conserved Ex7E motif has an Asp or Glu in the first position in GT4 and GT-20 (retaining),

but the inverting enzymes in GT-1 and GT-28 have a His, Lys or Arg. The Lys261 in MurG is

known to bind to a phosphate in the UDP-GalNAc donor (36), His293 in GtfA is known to

have the same function (45), hence there is a charge difference between the two reaction

mechanism within this motif.

The established alignment over the UDP-binding region was further analysed, to search for

differences between the inverting/retaining clans in the GT-B family. The differences between

the inversion enzyme clan within the GT-A group were too high and no alignment could be

established, and choosing only a few families could give incorrect results. Within the GT-B

group there are three solved crystal structures for the inversion group. There is one solved

structure (OtsA) within the retaining enzymes, and three fold models within the retaining GT4

family. The OtsA could not be aligned over the active site with the members of GT4 due to

longer loops and helixes in OtsA than the other enzymes (46). However, a comparison

between the two different reaction mechanisms could be achieved at the sequence level and

even on the structure level. The conserved positive variable patterns within the inverting clan

(Fig. 4) were found to be superimposed when comparing the structures within the GT-B fold

group (Fig. 5). The best comparison could be made for the inverting enzyme group where the

three different crystal structures have been solved, GftA(45), GtfB (35), and MurG (36,47).

GtfA was not used in the property alignment due to a different donor sugar nucleotide (45). A

preliminary comparison could also be made between inverting and retaining mechanism

within the GT-B group. When superimposing a fold model of alDGS (retaining) on to GtfB

and MurG the corresponding negative variable pairs in the retaining group (Fig. 4) were found

to be located in the same area (Fig. 5). The positions in MurG (inverting, GT-B fold) that are

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

22

known to bind UDP-GlcNAc, A264, L265, T266, E269, Q288, and Q289 (36), are located

within the alignment (Fig. 4), hence the property pattern positions might be involved in

guiding the donor substrate to the right orientation. They can also be important for the right

structure of the α/β/α motif of the active site (47). The positions in the GT-A retaining group

were also found around the nucleotide binding area (Fig. 4) and may have similar functions

(34). The property pattern positions from this alignment could also be superimposed at the

structure level, indicating a functional importance (data not shown). These GT-B and GT-A

sites are good targets for functional (mutational) analyses.

Conclusions─The ACC/PLS (multivariate) method that we describe here structurally

classifies and identifies glycosyltransferases with high accuracy, on basis of amino acid

sequence information. The method can even be used for predicting the stereochemistry of the

reaction mechanism. Potential separating sequence parameters between the inverting and

retaining mechanism have also been suggested. The positions found to be conserved within

fold groups performing the same stereo chemical reaction are good candidates for

mutagenesis, to better understand the differences between the two reaction types. This study

has also identified four proteins containing more than one fold type, i.e. Synechocystis

slr1816, sll1528, slr0626, slr1063. Multivariate analysis of all GTs annotated in the CAZy

database may find new fold groups. Detailed analyses of large GT sequence families, such as

GT-2 and GT-4, could potentially also find subgroups related to specific substrates, products

and small structure differences such as high TM content.

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

23

Acknowledgements

We thank Dr. Anders Öhman, Umeå University, for his help with the fold analysis, and the

Swedish Natural Science Research Council for financial support.

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

24

References:

1. Boix, E., Swaminathan, G. J., Zhang, Y., Natesh, R., Brew, K., and Acharya, K. R.

(2001) J. Biol. Chem. 276, 48608-48614

2. Bourne, Y., and Henrissat, B. (2001) Curr. Opin. Struct. Biol. 11, 593-600

3. Breton, C., and Imberty, A. (1999) Curr. Opin. Struct. Biol. 9, 563-571

4. Liu, J., and Mushegian, A. (2003) Protein Sci. 12, 1418-1431

5. Kikuchi, N., Kwon, Y. D., Gotoh, M., and Narimatsu, H. (2003) Biochem. Biophys.

Res. Commun. 310, 574-579

6. Unligil, U. M., and Rini, J. M. (2000) Curr. Opin. Struct. Biol. 10, 510-517

7. Breton, C., Mucha, J., and Jeanneau, C. (2001) Biochimie 83, 713-718

8. Unligil, U. M., Zhou, S., Yuwaraj, S., Sarkar, M., Schachter, H., and Rini, J. M.

(2000) EMBO J. 19, 5269-5280

9. Campbell, J. A., Davies, G. J., Bulone, V., and Henrissat, B. (1997) Biochem. J. 326,

929-939

10. Henrissat, B., and Davies, G. J. (2000) Plant Physiol. 124, 1515-1519

11. Davies, G., and Henrissat, B. (1995) Structure 3, 853-859

12. Henrissat, B., and Davies, G. (1997) Curr. Opin. Struct. Biol. 7, 637-644

13. Gebler, J., Gilkes, N. R., Claeyssens, M., Wilson, D. B., Beguin, P., Wakarchuk, W.

W., Kilburn, D. G., Miller, R. C., Jr., Warren, R. A., and Withers, S. G. (1992) J. Biol.

Chem. 267, 12559-12561

14. Rackovsky, S. (1998) Proc. Natl. Acad. Sci. U S A 95, 8580-8584

15. Wold, S., and Sjöström, M. (1998) Acta Chemica Scandinavica 52, 517-523

16. Wold, S., Jonsson, M., Sjöström, M., Sandberg, M., and Rännar, S. (1993) Anal.

Chim. Acta. 277, 239-253

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

25

17. Sjötröm, M., Rännar, S., and Wieslander, Å. (1995) Chemometrics and Intelligent

Laboratory Systems 29, 295-305

18. Edman, M., Jarhede, T., Sjöström, M., and Wieslander, A. (1999) Proteins 35, 195-

205

19. Hellberg, S., Sjöström, M., Skagerberg, B., and Wold, S. (1987) J. Med. Chem. 30,

1126-1135

20. Wold, S., Eriksson, L., and Sjöström, M. (1998) in Encyclopedia of Computational

Chemistry (Schleyer, v. R., ed), pp. 2006-2022, John Wiley & Sons, New York

21. Eriksson, L., Johansson, E., Kettaneh-Wold, N., and Wold, S. (2001) Multi- and

Megavariate Data Analysis Principles and Applications, Umetrics, Umeå

22. Thompson, J. D., Higgins, D. G., and Gibson, T. J. (1994) Nucleic. Acids. Res.22,

4673-4680

23. Guex, N., and Peitsch, M. C. (1997) Electrophoresis 18, 2714-2723

24. Krogh, A., Larsson, B., von Heijne, G., and Sonnhammer, E. L. (2001) J. Mol. Biol.

305, 567-580

25. Gastinel, L. N., Bignon, C., Misra, A. K., Hindsgaul, O., Shaper, J. H., and Joziasse,

D. H. (2001) EMBO J. 20, 638-649

26. Patenaude, S. I., Seto, N. O., Borisova, S. N., Szpacenko, A., Marcus, S. L., Palcic, M.

M., and Evans, S. V. (2002) Nat. Struct. Biol. 9, 685-690

27. Persson, K., Ly, H. D., Dieckelmann, M., Wakarchuk, W. W., Withers, S. G., and

Strynadka, N. C. (2001) Nat. Struct. Biol. 8, 166-175

28. Gastinel, L. N., Cambillau, C., and Bourne, Y. (1999) EMBO J. 18, 3546-3557

29. Pedersen, L. C., Tsuchida, K., Kitagawa, H., Sugahara, K., Darden, T. A., and

Negishi, M. (2000) J. Biol. Chem. 275, 34580-34585

30. Tarbouriech, N., Charnock, S. J., and Davies, G. J. (2001) J. Mol. Biol. 314, 655-661

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

26

31. Horne, D. S. (1988) Biopolymers 27, 451-477

32. Brändén, C.-I., and Tooze, J. (1999) Introduction to protein structure, 2. Ed., Garland,

New York

33. Mirny, L. A., and Shakhnovich, E. I. (1999) J. Mol. Biol. 291, 177-196

34. Heissigerova, H., Breton, C., Moravcova, J., and Imberty, A. (2003) Glycobiology 13,

377-386

35. Mulichak, A. M., Losey, H. C., Walsh, C. T., and Garavito, R. M. (2001) Structure 9,

547-557

36. Hu, Y., Chen, L., Ha, S., Gross, B., Falcone, B., Walker, D., Mokhtarzadeh, M., and

Walker, S. (2003) Proc. Natl. Acad. Sci. U S A 100, 845-849

37. Edman, M., Berg, S., Storm, P., Wikström, M., Vikström, S., Öhman, A., and

Wieslander, Å. (2003) J. Biol. Chem. 278, 8420–8428,

38. Wakarchuk, W. W., Cunningham, A., Watson, D. C., and Young, N. M. (1998)

Protein Engineering 11, 295-302

39. Ginalski, K., Elofsson, A., Fischer, D., and Rychlewski, L. (2003) Bioinformatics 19,

1015-1018

40. Blatch, G. L., and Lassle, M. (1999) Bioessays 21, 932-939

41. Franco, O. L., and Rigden, D. J. (2003) Glycobiology 13, 707R-712R

42. Wimmerova, M., Engelsen, S. B., Bettler, E., Breton, C., and Imberty, A. (2003)

Biochimie 85, 691-700

43. Chou, K. C., and Cai, Y. D. (2002) J. Biol. Chem. 277, 45765-45769

44. Cai, Y. D., Lin, S. L., and Chou, K. C. (2003) Peptides 24, 159-161

45. Mulichak, A. M., Losey, H. C., Lu, W., Wawrzak, Z., Walsh, C. T., and Garavito, R.

M. (2003) Proc. Natl. Acad. Sci. U S A 100, 9238-9243

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

27

46. Gibson, R. P., Tarling, C. A., Roberts, S., Withers, S. G., and Davies, G. J. (2003) J.

Biol. Chem. 279, 1950-1955

47. Ha, S., Walker, D., Shi, Y., and Walker, S. (2000) Protein Sci. 9, 1045-1052

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

28

Footnotes

1 The abbreviations used are: CAZy, carbohydrate active enzymes; MVDA, multivariate data

analyses; GT, Glycosyltransferase; BLAST, basic local alignment search tool ; SGC, SpsA

GnT I core; UDP, uridine diphosphate; GlcNAc, N-acetyl-d-glucosamine; TM,

transmembrane; PDB, protein data bank; ORF, open reading frame; ACC, auto cross

covariance; PLS-DA, Partial least squares projections to latent structures discriminant

analysis;

2 Unpublished cited work:

Rajalahti, T., Huang, F., Sjöström, M., Norling, B., and Wieslander, Å., in preparation

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

29

FIGURE LEGENDS

FIG 1. The steps used in analysing protein sequences with an ACC description of the

periodicity of the protein sequence followed by PLS-DA analysis. (A) Proteins are

collected from CAZy with a known fold type, (B) the amino acids are translated into the three

z-scales (from Table I) and are used for (C) calculation of auto covariance and cross

covariance variables, and (D) the data is collected in a data matrix. A discriminant vector for

each class is constructed and used as Y in the PLS-DA (e.g. y (GT-A) and y (GT-B)).

FIG. 2. Fold division of the glycosyltransferase reference sets based on sequence

property patterns. Q (cum)2 values (“prediction ability” of the model) for each comparison

are given in the plots, showing the two first score vectors t1 and t2 from the PLS-DA. The

predicted transmembrane segment content within each enzyme is indicated with numbers and

the fold group with colours. GT-A, red; GT-B, blue; GT-C green; and predicted E.coli GTs in

black. A, division between the GT-A, GT-B, and GT-C fold types; B, division between GT-A

and GT-C; C, GT-A and GT-B division; D, GT-A, GT-B, and GT-C fold division with the

E.coli GTs predicted. PS stands for prediction set.

FIG. 3. Division between the reaction stereochemistry within fold types. Panel (A) and

(B) showing the two first score vectors t1 and t2 from the PLS-DA. Based on sequence

property patterns; retaining (□) or inversion (▼) of the sugar linkage within the GT-A (A) and

GT-B (B) fold types. The Q (cum)2 values are given for each division. C, The corresponding

PLS-weight plot (w1*c1 / w2*c2) for the GT-B analysis in the B-panel, showing the x-variable

and y variable weights (w* and c respectively), where variables to the fare left and right in

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

30

the plot have the most separating information. A variable found on the same side as the

inversion group is positively correlated to this group and negatively correlated to the retaining

group. A weight plot for GT-A is not shown.

FIG. 4. Sequence alignment of selected GTs within the reference sets. The variable pairs

indicated were aligned with ClustalW. p, first position in positive variable pairs, e.g.

z(1)1z(1)18 (indicated from the PLS weight plots), (+) the second position (e.g. 18); n, first

position in a negative variable pair, (-) the second position. Conserved amino acids are

marked with (*), (:) amino acids with the same properties, and (.) similar properties as marked

in ClustalW.

FIG. 5. Important sequence pattern positions in active site structures. A, Ribbon

structure of MurG (PDB code) from E.coli with the UDP-GlcNAc donor substrate shown in

yellow. Amino acids conserved in the variable sequence pairs and found around the active site

are marked in green. B, The UDP-GlcNAc binding region with the variable sequence pairs

indicated: Ala253, Asp256, Ser262, Ser268, Ala271, and Pro276. C, MurG and GtfB (PDB)

ribbon structures of the UDP-binding area with the amino acids Gly300, Ala303 Gly309,

His315, Ala318, and Pro323 in GtfB superimposed onto the MurG structure. D, The

corresponding region in alDGS lipid GT from Acholeplasma laidlawii. The positions marked,

Asp233, Val235, Glu242, Val248, Glu250, and Pro257, are members of negative variables

pairs.

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

31

TABLE I Descriptor Scales for the Coded Amino Acidsa

Amino acid z1 z2 z3 Phe (F) -4.92 1.30 0.45Trp (W) -4.75 3.65 0.85

Ile (I) -4.44 -1.68 -1.03Leu (L) -4.19 -1.03 -0.98Val (V) -2.69 -2.53 -1.29Met (M) -2.49 -0.27 -0.41Tyr (Y) -1.39 2.32 0.01Pro (P) -1.22 0.88 2.23Ala (A) 0.07 -1.73 0.09Cys (C) 0.71 -0.97 4.13Thr (T) 0.92 -2.09 -1.40Ser (S) 1.96 -1.63 0.57Gln (Q) 2.19 0.53 -1.14Gly (G) 2.23 -5.36 0.30His (H) 2.41 1.74 1.11Lys (K) 2.84 1.41 -3.14Arg (R) 2.88 2.52 -3.44Glu (E) 3.08 0.039 -0.07Asn (N) 3.22 1.45 0.84Asp (D) 3.64 1.13 2.36

aDerived by a principal component analysis of 29 physico-chemical properties for the amino

acids (19). z1 reflects ‘‘hydrophobicity/hydrophilicity’’, z2 side-chain ‘‘bulk volume’’, and

z3‘‘polarizability and charge’’, respectively.

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

32

TABLE II

Predictions from the E.coli GT set. The scores are describing the probability to belong to a

group, with 1 being an absolute score. The highest scores within each group are highlighted.

The MetaServer results are shown to the right. GT26, GT51, and GT66 have none of the three

known fold types according (5), but GT26 has been proposed to have the GT-B fold type by

Liu and Mushegian, 2003 (4).

GT-A GT-B GT-A GT-B GT-C GT-A GT-B Inverting Retaining Retaining Inverting Meta server Jury score b

EC201 a 0.10 0.35 0.55 0.69 0.31 0.83 0.17 GTA 116EC202 0.15 0.30 0.55 0.81 0.19 0.64 0.36 GTA 108EC203 0.67 0.44 -0.11 0.57 0.43 0.62 0.38 GTA 99EC204 0.27 0.75 -0.02 0.23 0.77 0.61 0.39 0.94 0.06 GTA 133EC205 -0.34 0.53 0.81 0.62 0.38 0.99 0.01 GTA 128EC206 -0.05 0.44 0.61 0.74 0.26 0.58 0.42 GTA 131EC207 -0.11 0.52 0.59 0.41 0.59 1.23 -0.23 -0.21 1.21 GTA 116EC208 -0.20 0.64 0.56 0.32 0.68 1.24 -0.24 0.17 0.83 GTA 112EC209 0.28 0.79 -0.07 0.25 0.75 0.85 0.15 0.77 0.23 GTA 106EC401 -0.01 1.05 -0.04 -0.10 1.10 0.74 0.26 GTB 196EC402 0.06 0.86 0.08 0.05 0.95 1.26 -0.26 GTB 191EC403 -0.28 1.26 0.02 -0.23 1.23 0.81 0.19 GTB 183EC404 -0.12 1.02 0.10 -0.17 1.17 0.44 0.56 GTB 197EC405 -0.06 1.00 0.06 0.04 0.96 0.57 0.43 GTB 194EC501 0.13 0.70 0.18 0.27 0.73 0.66 0.34 GTB 169EC801 0.92 0.10 -0.02 0.87 0.13 0.00 1.00 GTA 101EC802 0.82 0.22 -0.04 0.78 0.22 0.17 0.83 GTA 106EC901 0.18 0.77 0.05 0.06 0.94 0.96 0.04 GTB 143EC1901 0.15 0.71 0.15 0.38 0.62 0.75 0.25 GTB 128EC1902 0.47 0.53 0.00 0.34 0.66 1.02 -0.02 GTB 138EC1903 0.48 0.50 0.03 0.37 0.63 0.74 0.26 GTB 119EC1904 -0.29 1.13 0.16 -0.13 1.13 0.62 0.38 GTB 204EC2001 0.44 0.65 -0.09 0.25 0.75 1.09 -0.09 GTB 170EC2601 0.39 0.66 -0.05 0.34 0.66 0.86 0.14 2LBP 95EC2801 -0.26 1.04 0.22 -0.12 1.12 0.25 0.75 GTB 86EC3001 -0.18 0.90 0.28 0.04 0.96 0.37 0.63 GTB 202EC3501 0.48 0.59 -0.07 0.37 0.63 0.78 0.22 GTB 518EC3502 0.31 0.78 -0.09 0.17 0.83 0.88 0.12 GTB 529EC5101 0.03 0.48 0.49 0.63 0.37 0.49 0.51 0.74 0.26 P-bind 31EC5102 0.00 0.88 0.12 0.09 0.91 0.49 0.51 0.70 0.30 P-bind 248EC5103 0.28 0.72 0.01 0.25 0.75 0.67 0.33 0.78 0.22 P-bind 204EC5104 0.56 0.42 0.03 0.46 0.54 0.40 0.60 0.53 0.47 P-bind 254EC5601 0.45 0.48 0.07 0.50 0.50 0.42 0.58 0.35 0.65 GTB 88EC6601 0.72 0.32 -0.04 0.73 0.27 0.48 0.52 0.35 0.65 1JX6 18

a EC201; E.coli GT sequence 1 of CAZy family 2. EC6601; sequence 1 of family 66 b Score in the 3D- Jury compilation of the MetaServer (39), 50 corresponds to a prediction accuracy of above 90%

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

33

Table III

Prediction yields from different fold and reaction mechanisms for the E.coli and

Synechocystis genome sets a.

Division E.coli (%) Synechosystis (%)

Allb Fold 82 67GTA/ GTB b Fold 86 80GTA Inv / Ret d 100 86GTB Inv / Ret d 60 80

a Intermediate prediction results, hence results close to 0.5 have been removed, see Table II. b GT-A, GT-B, GT-C fold types included. All sequences are included in the calculations,

however sequences from families with unknown fold type neither taken as correct nor wrong. c GTA/GTB; No true GT-C families are seen in E. coli and only one in Synechocystis. The

GT-C group have then been removed and the results recalculated based on the division

between only the two remaining fold groups. d Inv/Ret; Results from inverting (inv) and retaining (ret) mechanisms.

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

34

Protein

PLS-DA

Auto-covariance

Cross-covariance

.......

.......Sliding window of variable length

Data matrix with ACC-terms

Proteins

ACC averages K

N

Protein (i)

.......

.......z1 z2 z3 z1 z2 z3z1 z2 z3 z1 z2 z3z1 z2 z3z1 z2 z3

aa aa aa aa aa aa

z1 z2 z3z1 z2 z3 z1 z2 z3z1 z2 z3z1 z2 z3 z1 z2 z3

aa aa aa aa aa aa

A

B

C

GT-A GT-B

y-inv y-ret1 0 1 0 1 0 0 1 0 1 0 1

Protein

PLS-DA

Auto-covariance

Cross-covariance

.......

.......Sliding window of variable length

Data matrix with ACC-terms

Proteins

ACC averages K

N

Protein (i)

.......

.......z1 z2 z3 z1 z2 z3z1 z2 z3 z1 z2 z3z1 z2 z3z1 z2 z3

aa aa aa aa aa aa

z1 z2 z3z1 z2 z3 z1 z2 z3z1 z2 z3z1 z2 z3 z1 z2 z3

aa aa aa aa aa aa

.......z1 z2 z3z1 z2 z3 z1 z2 z3z1 z2 z3z1 z2 z3z1 z2 z3 z1 z2 z3z1 z2 z3z1 z2 z3z1 z2 z3z1 z2 z3z1 z2 z3

aa aa aa aa aa aa

z1 z2 z3z1 z2 z3z1 z2 z3z1 z2 z3 z1 z2 z3z1 z2 z3z1 z2 z3z1 z2 z3z1 z2 z3z1 z2 z3 z1 z2 z3z1 z2 z3

aa aa aa aa aa aa

A

B

C

GT-A GT-B

y-inv y-ret1 0 1 0 1 0 0 1 0 1 0 1 by guest on January 10, 2019

http://ww

w.jbc.org/

Dow

nloaded from

35

-5

0

5

-1 0 -5 0 5

t[2]

t[1 ]

0

0

0

0

0

00

0

0

0

0

0

11

11 1

1 11

1

1

1

1

10

000

0

011

1111

33

4

3

1

1

1 1

1

1

1

1

111 1

1

1

00 0

0

0

0

0

00

0

0

0

00

0

0

00 0

0

00

0

0

0

0

0

000 00

00

0

000

0

20

0

00

1

0

0

01

Q2=0.655C

-5

0

5

-1 0 -5 0 5 1 0

t[2]

t [1 ]

000

0

0

0

0

0

00

00

11 11

1

1

1

111

1

11

000

0

0

0

1111

11

3

3

4

3

1

111

1

1

111

111

1

1

6777

5

910

1198

12

10

1010 11

1312

10

13 1312

1011 912

10

7

15887

9

8910

-1 0

-5

0

5

-1 0 -5 0 5 1 0

t[2]

t [1 ]

0

0

0

00

00 0

0

0

001

1

111 11

11

1

1

11 00

00

00

1111

11

33 431

111

11

1

11

111

1

1

00

000

0

0 00

0

0

000000

0

0

0

00

00

0

0

0000

00

00

0 0

0

0

02

0

0

0

0

1 0

0

0

1

6

7

7

75

9

10

11

9

8

12

10

1010

11

13 121013 1312

10

11912

10

7

15887 98

9

10

A

Q2=0.868B

-1 0

-5

0

5

-1 0 -5 0 5 1 0

tPS

[2]

tP S [1 ]

10

6

0

0

4

5

2 2

0

00

0

0

00

10

0

000

0

0

0

0

0

00

1

11

10

0

00

0

00

00 0

0

0

001

1

111 11

11

1

1

11 00

00

00

1111

11

33 431

111 1

1

1

11

111

1

1

000000

0 00

0

0

000000

0

0

0

00

00

0

0

0000

00

00

0 0

0

0

02

0

0

0

0

1 0

0

0

1

6

7

7

75

9

10

11

9

8

12

10

1010

11

13 121013 1312

10

11912

10

7

15887 98

9

10

D

Q2=0.730

Q2=0.730

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

36

-5

0

5

-5 0 5

t[2]

t[1 ]

-0

-0

0

0

0

-0 -0 0 0 0

w*c

[2]

w*c[1]

z11z12

z11z22

z11z32

z11z13

z11z23

z11z33

z11z14

z11z24z11z34z11z15

z11z25

z11z35

z11z16

z11z26 z11z36

z11z17

z11z27

z11z37

z11z18

z11z28

z11z38

z11z19

z11z29

z11z39

z11z110

z11z210

z11z310

z11z111

z11z211z11z311z11z112z11z212

z11z312

z11z113z11z213

z11z313

z11z114

z11z214

z11z314

z11z115

z11z215

z11z315

z11z116

z11z216

z11z316z11z117

z11z217

z11z317z11z118

z11z218

z11z318

z11z119

z11z219

z11z319

z11z120z11z220

z11z320

z21z12

z21z22 z21z32

z21z13

z21z23

z21z33

z21z14

z21z24z21z34

z21z15z21z25z21z35

z21z16

z21z26

z21z36

z21z17z21z27

z21z37

z21z18z21z28z21z38

z21z19

z21z29

z21z39z21z110

z21z210 z21z310z21z111

z21z211

z21z311z21z112z21z212

z21z312z21z113

z21z213

z21z313

z21z114

z21z214

z21z314

z21z115

z21z215

z21z315

z21z116

z21z216

z21z316

z21z117

z21z217

z21z317

z21z118

z21z218

z21z318

z21z119

z21z219z21z319z21z120

z21z220

z21z320

z31z12z31z22

z31z32z31z13

z31z23

z31z33

z31z14

z31z24

z31z34

z31z15

z31z25

z31z35

z31z16

z31z26

z31z36

z31z17

z31z27

z31z37z31z18

z31z28

z31z38

z31z19

z31z29

z31z39

z31z110

z31z210z31z310

z31z111

z31z211

z31z311

z31z112z31z212z31z312z31z113

z31z213

z31z313

z31z114z31z214

z31z314z31z115

z31z215

z31z315

z31z116z31z216

z31z316

z31z117

z31z217

z31z317

z31z118

z31z218

z31z318

z31z119

z31z219

z31z319z31z120z31z220

z31z320

$M1

$M1.DA2

-5

0

5

-5 0 5

t[2]

t[1 ]

A

B

C

Q2=0.702

Q2=0.562

• y-inv

•y-ret

-5

0

5

-5 0 5

t[2]

t[1 ]

-0

-0

0

0

0

-0 -0 0 0 0

w*c

[2]

w*c[1]

z11z12

z11z22

z11z32

z11z13

z11z23

z11z33

z11z14

z11z24z11z34z11z15

z11z25

z11z35

z11z16

z11z26 z11z36

z11z17

z11z27

z11z37

z11z18

z11z28

z11z38

z11z19

z11z29

z11z39

z11z110

z11z210

z11z310

z11z111

z11z211z11z311z11z112z11z212

z11z312

z11z113z11z213

z11z313

z11z114

z11z214

z11z314

z11z115

z11z215

z11z315

z11z116

z11z216

z11z316z11z117

z11z217

z11z317z11z118

z11z218

z11z318

z11z119

z11z219

z11z319

z11z120z11z220

z11z320

z21z12

z21z22 z21z32

z21z13

z21z23

z21z33

z21z14

z21z24z21z34

z21z15z21z25z21z35

z21z16

z21z26

z21z36

z21z17z21z27

z21z37

z21z18z21z28z21z38

z21z19

z21z29

z21z39z21z110

z21z210 z21z310z21z111

z21z211

z21z311z21z112z21z212

z21z312z21z113

z21z213

z21z313

z21z114

z21z214

z21z314

z21z115

z21z215

z21z315

z21z116

z21z216

z21z316

z21z117

z21z217

z21z317

z21z118

z21z218

z21z318

z21z119

z21z219z21z319z21z120

z21z220

z21z320

z31z12z31z22

z31z32z31z13

z31z23

z31z33

z31z14

z31z24

z31z34

z31z15

z31z25

z31z35

z31z16

z31z26

z31z36

z31z17

z31z27

z31z37z31z18

z31z28

z31z38

z31z19

z31z29

z31z39

z31z110

z31z210z31z310

z31z111

z31z211

z31z311

z31z112z31z212z31z312z31z113

z31z213

z31z313

z31z114z31z214

z31z314z31z115

z31z215

z31z315

z31z116z31z216

z31z316

z31z117

z31z217

z31z317

z31z118

z31z218

z31z318

z31z119

z31z219

z31z319z31z120z31z220

z31z320

$M1

$M1.DA2

-5

0

5

-5 0 5

t[2]

t[1 ]

A

B

C

Q2=0.702

Q2=0.562

• y-inv

•y-ret

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

37

GT-A Z(1)1z(1)18 Retaining linkage nn p n n pn p n- p+nn nnp+n n - + -- --+ - - HsBgGtB (O14758) EILTPLFG-TLHPSFYGSSREAFTYERRPQSQAYIPKDEGDFYYMGAFFGGSVQEVQRLTRACHQAMMVDQANGIEAVWHDESHL- HsBgGtA (P16442) EILTPLFG-TLHPGFYGSSREAFTYERRPQSQAYIPKDEGDFYYLGGFFGGSVQEVQRLTRACHQAMMVDQANGIEAVWHDESHL- BsGalT (P14769) ETLGESVA-QLQAWWYKADPNDFTYERRKESAAYIPFGEGDFYYHAAIFGGTPTQVLNITQECFKGILKDKKNDIEAQWHDESHL- NmLgtC (P96945) -SLTPLWDTDLGDNWLGACIDLF-VERQEGYKQKIGMADGEYYFNAGVLLINLKKWRRHDIFKMSCEWVEQYKDVM-QYQDQDIL- * * : : : * **: * :*::*: ...: . . . . :: :.: ::*:. * GT-B Z(1)1z(1)16 Inverting linkage - + + + ++ p p nn pnn n + + -- p-- p p n p n p - p+ p pp pp p p +pp ++ + + ++ AoGtfD (gi13591783) QALFRRVAAVIHHGSAGTEHVATRAGVPQLVIPRNTD----QPYFAGRVAALGIGVAHDGPTPTFESLSAALTTVLAPETRARAEAVAGMVLTDGAAAAADLVLAAVGREKPAVPA 112 AoGtfB (P96559) QVLFGRVAAVIHHGGAGTTHVAARAGAPQILLPQMAD----QPYYAGRVAELGVGVAHDGPIPTFDSLSAALATALTPETHARATAVAGTIRTDGAAVAARLLLDAVS-------- 104 NmMurG (gi7380690) VSAYRDADLVICRAGALTIAELTAAGLGALLVPYPHAVDDHQTANARFMVQAEAGLLLPQTQLTAEKLAEILGGLNREKCLKWAENARTLALPHSADDVAEAAIACAA-------- 108 EcMurG (P17443) AAAYAWADVVVCRSGALTVSEIAAAGLPALFVPFQHK+DRQQYWNALPLEKAGAAKIIEQPQLSVDAVANTLAGWSRETLLTMAERARAASIPDATERVANEVSRVAR-------- 107 : . *: :..* * : ** :.:* * * : . . : : :: * * . ...: .* . Retaining linkage + + - p n n p n - -n pnn n - +-- - alDGS (gi22641488) VDGAVIKGAFSGADCVFFPSYEETEGIVVLEGLASKTPVVLRDIPVYYDWLFHK spMGS (gi15958596) IAPSETALYYKAADFFISASTSETQGLTYLESLASGTPVIAHGNPYLNNLISDKMFG bbMGS (gi2681362) IPWEEIYYYYKISDIFASLSKSEVYPMTVIEALTAGIPAILINDYIYKDVIKEGIN EcMtfB (P47594) VPDEDLPYLYAAARTFVYPSFYEGFGLPILEAMSCGVPVVCSNVTSLPEVVGDAG : : . * * : :*.::. .: . :

by guest on January 10, 2019 http://www.jbc.org/ Downloaded from

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

39

SUPPLEMENTARY MATERIAL Rosén et al., 2004 Recognition of fold and sugar linkage for glycosyltransferases by multivariate sequence

analysis

Glycosyltransferase sequences used as references for training of the MVDA method. Number Organsim Gi number Function TM

helixes CAZy

Famliy 1 Amycolatopsis orientalis 5971640 Tdp-Epi-Vancosaminyltransferase 0 1 2 Homo sapiens 1407590 ceramide galactosyl transferase 2 1 3 Nicotiana tabacum 20146091 glucosyltransferase NTGT3 0 1 4 Amycolatopsis orientalis 13591783 glycosyltransferase 0 1 5 Arabidopsis thaliana 9392679 glucosyltransferase 0 1 6 Brassica napus 9794913 glucosyltransferase 0 1 7 Felis catus 2842546 UDP-glucuronosyltransferase 1 1 8 Stevia rebaudiana 21435782 UDP-glucosyltransferase 0 1 9 Basillus subtilis 580877 glycosyl transferase 0 2

10 Haemophilus ducreyi 8118046 beta 1-4 glucosyltransferase 0 2 11 Sinorhizobium meliloti 605654 exoM 0 2 12 Streptococcus agalactiae 13022167 beta-1,3-glucosyltransferase 0 2 13 Klebsiella pneumoniae 5006991 glucosyl transferase 0 2 14 Bradyrhizobium sp 12642177 nodulation N-acetylglucosaminyltransferase 0 2 15 Neisseria gonorrhoeae 595813 glycosyl transferase 0 2 16 Streptococcus agalactiae 3721919 N-acetylglucosaminyltransferase 0 2 17 Streptococcus pneumoniae 2198542 ss-1,3-N-acetylglucosaminyltransferase 0 2 18 Campylobacter jejuni 12004281 beta-1,3-galactosyltransferase 0 2 19 Acetobacter xylinus 3298349 beta-D-1,6 Glucosyl transferase 0 2 20 Streptococcus agalactiae 13022168 beta-1,3-galactosyltransferase 0 2 21 Acholeplasma laidlawii 22651488 alpha1,2-glucosyltransferase 0 4 22 Lactococcus lactis 15674119 glycosyltransferase 0 4 23 Streptococcus pneumoniae 15900944 glycosyltransferase 0 4 24 Streptococcus pneumoniae 15900945 glycosyltransferase 0 4 25 Acholeplasma laidlawii 14043013 1,2-diacylglycerol 3-glucosyltransferase 0 4 26 Borrelia burgdorferi 15594799 glycosyltransferase 0 4 27 Klebsiella pneumoniae 557195 galactosyl transferase 0 4 28 Streptococcus thermophilus 1276879 glycosyltransferase 0 4 29 Arabidopsis thaliana 11357895 sulfogalactosetransferase 0 4 30 Synechocystis sp. PCC 6803 1001478 sulfogalactosetransferase 0 4 31 Gluconacetobacter xylinus 1054906 alpha-mannosyltransferse 0 4 32 Mycobacterium tuberculosis 3719234 alpha-mannosyltransferse 0 4 33 Campylobacter coli 1486283 galactosyltransferase 0 4 34 Escherichia coli 3142172 mannosyl transferase 0 4 35 Escherichia coli 598471 mannosyltransferase B 0 4 36 Pseudomonas aeruginosa 3249545 glycosyltransferase WbpX 0 4 37 Pseudomonas aeruginosa 3249551 glycosyltransferase WbpY 0 4 38 Pseudomonas aeruginosa 3249553 glycosyltransferase WbpZ 0 4 39 Salmonella typhimurium 3132887 alpha1,3-glucosyltransferase 0 4 40 Synechococcus sp 14595230 tetrahydrobiopterin glucosyltransferase 0 4

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

40

41 Callithrix sp. 1086056 alpha 1,3 galactosyltransferase 1 6 42 Homo sapiens 4590454 alpha 1,3 galactosyltransferase 1 6 43 Mus musculus 15419872 alpha-1,3-galactosyltransferase 1 6 44 Sus scrofa 609567 alpha-1,3-galactosyltransferase 1 6 45 Bos taurus 163124 alpha 1-3 galactosyltransferase 1 6 46 Gallus gallus 1469908 beta-1,4-galactosyltransferase 1 7 47 Gallus gallus 1469906 beta-1,4-galactosyltransferase 1 7 48 Homo sapiens 3132898 beta-1,4-galactosyltransferase 1 7 49 Bos taurus 127820 Beta-1,4-galactosyltransferase 1 1 7 50 Ciona intestinalis 9229932 beta 4 galactosyltransferase 1 7 51 Homo sapiens 3132896 beta-1,4-galactosyltransferase 1 7 52 Mus musulus 6651182 beta-1,4-galactosyltransferase III 1 7 53 Rattus norvegicus 3258653 beta-1,4-galactosyltransferase 1 7 54 Escherichia coli 26110705 Lipopolysaccharide 1,3-galactosyltransferase 0 8 55 Salmonella typhimurium 3132884 alpha1,3-galactosyltransferase 0 8 56 Escherichia coli 26110704 Lipopolysaccharide 1,2-glucosyltransferase 0 8 57 Salmonella typhimurium 16422285 alpha1,2-glucosyltransferase 0 8 58 Escherichia coli 3821841 alpha1,3-glucosyltransferase 0 8 59 Neisseria meningitidis 21654776 alpha 1,4 galactosyltransferase 0 8 60 Oryctolagus cuniculus 165782 acetylglucosaminyltransferase I 1 13 61 Cricetulus griseus 14388961 N-acetylglucoaminyltransferase I 1 13 62 Drosophila melanogaster 7804912 N-acetylglucoaminyltransferase I 1 13 63 Mus musculus 6754684 N-acetylglucoaminyltransferase I 1 13 64 Solanum tuberosum 6779206 N-acetylglucoaminyltransferase I 1 13 65 Xenopus laevis 15211610 N-acetylglucoaminyltransferase I 1 13 66 Arabidopsis thaliana 1865677 trehalose-6-phosphate synthase 0 20 67 Aspergillus niger 551471 trehalose-6-phosphate synthase 0 20 68 Candida albicans 1488038 trehalose-6-phosphate synthase 0 20 69 Methanothermobacter

thermautotrophicus 2104413 trehalose-6-phosphate synthase 0 20

70 Zygosaccharomyces rouxii 8886767 trehalose-6-phosphate phosphatase 0 20 71 Escherichia coli 862973 trehalose-6-phosphate synthase 0 20 72 Pichia pastoris 14718993 ceramide glucosyltransferase 3 21 73 Rattus norvegicus 4105567 ceramide glycosyltransferase 3 21 74 Gossypium arboreum 14718995 ceramide glucosyltransferase 4 21 75 Homo sapiens 1325917 ceramide glucosyltransferase 3 21 76 Magnaporthe grisea 14718991 ceramide glucosyltransferase 1 21 77 Mus musculus 9256626 alpha-mannosyltransferase 6 22 78 Oryza sativa 13161358 alpha-mannosyltransferase 7 22 79 Caenorhabditis elegans 19069522 alpha-mannosyltransferase 7 22 80 Drosophila melanogaster 23092941 alpha-mannosyltransferase 7 22 81 Schizosaccharomyces pombe 19075493 alpha-mannosyltransferase 5 22 82 Saccharomyces cerevisiae 6321296 alpha-mannosyltransferase 9 22 83 Trypanosoma brucei 7657993 alpha-mannosyltransferase 10 22 84 Caenorhabditis elegans 17566740 alpha-mannosyltransferase 10 22 85 Drosophila melanogaster 14549429 polypeptide N-acetylgalactosaminyltransferase 1 27 86 Drosophila melanogaster 7303062 polypeptide N-acetylgalactosaminyltransferase 1 27 87 Homo sapiens 1136285 polypeptide N-acetylgalactosaminyltransferase 1 27 88 Homo sapiens 971461 polypeptide N-acetylgalactosaminyl transferase 1 27 89 Homo sapiens 6318186 polypeptide N-acetylgalactosaminyltransferase 7 1 27 90 Macaca fascicularis 11041469 polypeptide N-acetylgalactosaminyltransferase 1 27 91 Mus musculus 1575723 polypeptide N-acetylgalactosaminyltransferase-T3 1 27 92 Mus musculus 13650039 polypeptide GalNAc transferase-T2 1 27 93 Mus musculus 13650041 polypeptide N-acetylgalactosaminyltransferase 7 1 27 94 Rattus norvegicus 1141792 polypeptide GalNAc transferase 1 27

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

41

95 Rattus norvegicus 4092503 polypeptide N-acetylgalactosaminyltransferase T6 1 27 96 Rattus norvegicus 14150450 polypeptide N-acetylgalactosaminyltransferase T9 1 27 97 Arabidopsis thaliana 22328179 digalactosyldiacylglycerol synthase 2 0 28 98 Arabidopsis thaliana 15229824 digalactosyldiacylglycerol synthase 1 0 28 99 Neisseria meningitidis 7380690 GlcNAc transferase murG 0 28

100 Escherichia coli 1786278 GlcNAc transferase murG 0 28 101 Cucumis sativus 7484757 beta-galactosyltransferase 0 28 102 Arabidopsis thaliana 18397057 monogalactosyldiacylglycerol synthase 0 28 103 Bacillus subtilis 1256630 beta -glucosyltransferase 0 28 104 Staphylococcus aureus 3256224 glucosyltransferase 0 28 105 Bifidobacterium longum 23326587 GlcNAc transferase murG 0 28 106 Brucella suis 23348278 GlcNAc transferase murG 0 28 107 Wigglesworthia glossinidia 25166161 GlcNAc transferase murG 0 28 108 Escherichia coli 16131293 maltodextrin phosphorylase 0 35 109 Homo sapiens 183351 glycogen phosphorylase type IV 0 35 110 Oryctolagus cuniculus 217748 glycogen phosphorylase 0 35 111 Saccharomyces cerevisiae 499700 glycogen phosphorylase 0 35 112 Nostoc sp. PCC 7120 17230642 hypothetical protein 10 39 113 Novosphingobium

aromaticivorans 23108578 O-mannosyl transferase 10 39

114 Nostoc punctiforme 23125173 O-mannosyl transferase 11 39 115 Mycobacterium leprae 15826938 arabinosyl transferase 13 39 116 Homo sapiens 14043940 beta-1,3-glucuronyltransferase 3 1 43 117 Arabidopsis thaliana 22326970 mannosyltransferase 10 50 118 Trypanosoma brucei 12246834 TbPIG-M 7 50 119 Schizosaccharomyces pombe 19113052 mannosyltransferase 15 50 120 Caenorhabditis elegans 17531359 mannosyltransferase 8 50 121 Homo sapiens 21553315 mannosyltransferase 8 50 122 Mycobacterium leprae 15826940 arabinosyl transferase 12 53 123 Mycobacterium smegmatis 20137781 arabinosyl transferase 10 53 124 Mycobacterium leprae 15826938 arabinosyl transferase 13 53 125 Mycobacterium smegmatis 11281426 arabinosyl transferase 13 53 126 Corynebacterium glutamicum 19551438 arabinosyl transferase 12 53 127 Saccharomyces cerevisiae 6324575 glycosyltransferase 10 57 128 Caenorhabditis elegans 17531619 glycosyltransferase 11 57 129 Arabidopsis thaliana 15240920 glucosyltransferase 9 57 130 Saccharomyces cerevisiae 6324641 glucosyltransferase 12 57 131 Homo sapiens 5031953 alpha-1,3-mannosyltransferase 7 58 132 Saccharomyces cerevisiae 6319389 alpha-1,3-mannosyltransferase 9 58 133 Schizosaccharomyces pombe 19114765 mannosyltransferase 8 58 134 Neurospora crassa 12802365 mannosyltransferase 9 58 135 Arabidopsis thaliana 15227129 mannosyltransferase 10 58 136 Homo sapiens 14349125 alpha-glucosyltransferase 9 59 137 Arabidopsis thaliana 11292174 alpha-glucosyltransferase 8 59 138 Caenorhabditis elegans 17509245 potassium channel regulator 12 59 139 Schizosaccharomyces pombe 19114133 alpha-glucosyltransferase 10 59 140 Enterobacteria phage T4 9632714 beta-gt protein 0 63

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from

Maria L. Rosén, Maria Edman, Michael Sjöström and Åke Wieslandersequence analysis

Recognition of fold and sugar linkage for glycosyltransferases by multivariate

published online May 17, 2004J. Biol. Chem. 

  10.1074/jbc.M402925200Access the most updated version of this article at doi:

 Alerts:

  When a correction for this article is posted• 

When this article is cited• 

to choose from all of JBC's e-mail alertsClick here

Supplemental material:

  http://www.jbc.org/content/suppl/2004/06/10/M402925200.DC1

by guest on January 10, 2019http://w

ww

.jbc.org/D

ownloaded from