
Bioinformatics

A Complementary Bioinformatics Approach to Identify Potential Plant Cell Wall Glycosyltransferase-Encoding Genes 1[w]

Jack Egelund, Michael Skjøt, Naomi Geshi, Peter Ulvskov, and Bent Larsen Petersen*

Biotechnology Group, Danish Institute of Agricultural Sciences, DK–1871 Frederiksberg C, Copenhagen, Denmark; and Center for Molecular Plant Physiology (PlaCe), The Royal Veterinary and Agricultural University, DK–1871 Frederiksberg C, Copenhagen, Denmark

Plant cell wall (CW) synthesizing enzymes can be divided into the glycan (i.e. cellulose and callose) synthases, which are multimembrane-spanning proteins located at the plasma membrane, and the glycosyltransferases (GTs), which are Golgi-localized single-membrane-spanning proteins believed to participate in the synthesis of hemicellulose, pectin, mannans, and various glycoproteins. In the Carbohydrate-Active enZYmes (CAZy) database, where e.g. glycoside hydrolases and GTs are classified into gene families primarily on the basis of amino acid sequence similarities, 415 Arabidopsis GTs have been classified. Although much is known about the composition and fine structures of the plant CW, only a handful of CW biosynthetic GT genes, all classified in the CAZy system, have been characterized. In an effort to identify CW GTs that have not yet been classified in the CAZy database, a simple bioinformatics approach was adopted. First, the entire Arabidopsis proteome was run through the Transmembrane Hidden Markov Model 2.0 server, and proteins containing one or, more rarely, two transmembrane domains within the N-terminal 150 amino acids were collected. Second, these sequences were submitted to the SUPERFAMILY prediction server, and sequences that were predicted to belong to the superfamilies NDP-sugartransferase, UDP-glycosyltransferase/glycogen phosphorylase, carbohydrate-binding domain, Gal-binding domain, or Rossmann fold were collected, yielding a total of 191 sequences. Fifty-two accessions already classified in CAZy were discarded. The resulting 139 sequences were then analyzed using the Three-Dimensional-Position-Specific Scoring Matrix and mGenTHREADER servers, and 27 sequences with similarity to either the GT-A or the GT-B fold were obtained. Proof of concept of the present approach has to some extent been provided by our recent demonstration that two members of this pool of 27 non-CAZy-classified putative GTs are xylosyltransferases involved in synthesis of pectin rhamnogalacturonan II (J. Egelund, B.L. Petersen, A. Faik, M.S. Motawia, C.E. Olsen, T. Ishii, H. Clausen, P. Ulvskov, and N. Geshi, unpublished data).

The plant cell wall (CW) consists of four major polysaccharide components, namely cellulose, callose, hemicellulose, and pectin. CW synthesis/formation can be divided into three major steps. (1) Initially, the various building blocks in the form of activated glycosyl residues (NDP-sugars) are synthesized via two different pathways, the nucleotide interconversion pathway or the salvage pathway (for overview, see Carpita, 1996). The synthesis of the NDP-sugars may occur in the cytosol and/or the Golgi apparatus depending on the type of NDP-sugar synthesized (Mohnen, 1999). (2) The synthesized nucleotide sugars are then assembled into higher-order polysaccharide structures. Apart from cellulose and callose, biosynthesis of CW polysaccharides occurs in the endomembrane system (Bolwell and Northcote, 1983; Zhang and Staehelin, 1992; Sherrier and VandenBosch, 1994), from which the polysaccharides are secreted into the wall where they undergo further modifications (Fry, 1995). (3) The final step, which constitutes the assembly of the various polysaccharide structures into the wall, remains in large part a mystery. However, self-assembly of wall components most likely plays a role (for discussion of a possible mechanism, see MacDougal et al., 1997), and both enzymatic and nonenzymatic mechanisms as well as arabinogalactan proteins and other wall structural proteins (Cosgrove, 1997) participate in the complex process.

The noncellulosic polymers hemicellulose and pectin are synthesized by glycosyltransferases (GTs) presumably located in the different compartments of the Golgi apparatus. These GTs are believed to be type II membrane-bound proteins with the catalytic domain (C-terminal) facing the lumen of the Golgi apparatus (Ridley et al., 2001; Sterling et al., 2001; Geshi et al., 2004).

Although the GTs for which the three-dimensional (3D) structures have been resolved exhibit insignificant or at best very low sequence similarity, they adopt one of the following folds at the 3D-structure level: the GT-A (SpsA and SpsA-like) fold or the GT-B (B-GT and B-GT-like) fold (Bourne and Henrissat, 2001; Hu and Walker, 2002; Coutinho et al., 2003; Wimmerova et al., 2003).

1 This work was supported by the Danish National Research Foundation and The Danish Research Agency.

* Corresponding author; e-mail [email protected]; fax 45–35282589.

[w] The online version of this article contains Web-only data.

www.plantphysiol.org/cgi/doi/10.1104/pp.104.042978.

Plant Physiology, September 2004, Vol. 136, pp. 2609–2620, www.plantphysiol.org © 2004 American Society of Plant Biologists

Copyright © 2004 American Society of Plant Biologists. All rights reserved.

The Carbohydrate-Active enZYme (CAZy; http://afmb.cnrs-mrs.fr/CAZY/) database is a versatile and comprehensive database of sequence-based carbohydrate enzymes, where e.g. glycoside hydrolases and GTs are classified into families primarily on the basis of amino acid sequence similarities (Henrissat et al., 2001). Within a given family, the 3D structure is conserved, i.e. the same 3D fold is expected to occur in each member of the family (Coutinho et al., 2003).

Although the composition of the major CW polysaccharides is reasonably well described (Carpita et al., 2001), only a handful of the biosynthetic genes have been identified. All of the seven known GTs, i.e. with proven or putative function in mannan (Edwards et al., 1999), hemicellulose (Perrin et al., 1999; Faik et al., 2002; Madson et al., 2003), and pectin synthesis (Bouton et al., 2002; Iwai et al., 2002; J. Sterling and D. Mohnen, personal communication), are classified in the CAZy database. In this study, we have set up an alternative bioinformatics scheme aimed at identifying CW GTs with a predicted type II membrane topology, which are not classified in the CAZy database. Using this alternative approach, 27 non-CAZy-classified accessions with a predicted N-terminal transmembrane domain (TMD) typical of type II membrane proteins and that were predicted to adopt the GT-A or the GT-B fold were identified.

RESULTS

In an effort to obtain GTs with a type II membrane protein topology, which have not been classified in the CAZy database, the following simple bioinformatics approach was adopted (for overview, see Fig. 1).

First, using the Transmembrane Hidden Markov Model (TMHMM) 2.0 prediction server, the entire Arabidopsis proteome (26,095 proteins) was scanned for the presence of transmembrane helices, yielding a total of 5,977 sequences with any number of predicted transmembrane helices. Within this pool, potential type II membrane proteins with either one or, in rare cases, two (derived from the predicted transmembrane helix and a hydrophobic signal peptide) predicted TMDs, which resided within the first 150 amino acids from the N terminus, were identified and extracted, yielding a total of 2,248 and 363 accessions, respectively.
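The selection rule of this first step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `HelixPrediction` container and the two `EXAMPLE_*` entries are hypothetical, and the sketch assumes that "resided within the first 150 amino acids" means that every predicted helix ends at or before residue 150. The TMD coordinates for Q9LZ77 and Q9XEE9 are taken from Table II.

```python
from dataclasses import dataclass
from typing import List, Tuple

N_TERMINAL_WINDOW = 150  # residues, as stated in the text


@dataclass
class HelixPrediction:
    """Predicted transmembrane helices for one protein (hypothetical container)."""
    accession: str
    helices: List[Tuple[int, int]]  # (start, end), 1-based residue numbers


def is_candidate(pred: HelixPrediction, window: int = N_TERMINAL_WINDOW) -> bool:
    """One or two predicted helices, all ending within the N-terminal window."""
    if len(pred.helices) not in (1, 2):
        return False
    return all(end <= window for _start, end in pred.helices)


def split_candidates(preds):
    """Return (one-TMD accessions, two-TMD accessions) in input order."""
    one_tmd, two_tmd = [], []
    for pred in preds:
        if is_candidate(pred):
            (one_tmd if len(pred.helices) == 1 else two_tmd).append(pred.accession)
    return one_tmd, two_tmd


predictions = [
    HelixPrediction("Q9LZ77", [(51, 70)]),             # TMD from Table II
    HelixPrediction("Q9XEE9", [(4, 26), (113, 135)]),  # 2TMD entry from Table II
    HelixPrediction("EXAMPLE_LATE", [(300, 322)]),     # made-up: helix far from the N terminus
    HelixPrediction("EXAMPLE_MULTI", [(10, 32), (50, 72), (90, 112)]),  # made-up: three helices
]
one_tmd, two_tmd = split_candidates(predictions)
print(one_tmd, two_tmd)
```

Keeping the one-TMD and two-TMD pools separate mirrors the way the two counts (2,248 and 363) are reported in the text.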

The 2,248 plus 363 sequences were then submitted to the SUPERFAMILY prediction server, and 191 sequences predicted, irrespective of E-value, to belong to the superfamilies NDP-sugartransferase (54), UDP-glycosyltransferase/glycogen phosphorylase (33), Gal-binding domain (23), carbohydrate-binding domain (6), or the GT-B-similar Rossmann fold (75) were collected. The 191 sequences were then searched by BLAST against the CAZy database (September 10, 2003), and sequences found in the CAZy database were removed from the dataset, leaving a total of 139 sequences (24, 25, 12, 5, and 73 from the 5 superfamilies, respectively) that were not classified in the CAZy database. The 139 sequences were then run through the mGenTHREADER and 3D-Position-Specific Scoring Matrix (3D-PSSM) servers. A local set of protein IDs (Protein Data Bank [PDB]) of proteins whose 3D structures have been resolved and which adopt either the GT-A or the GT-B fold (Table I; references for the PDB IDs can be retrieved at http://www.RCSB.ORG/), together with resolved 3D structures derived from the CAZy database GT families, was used to validate the output from each of the two servers. Twenty-seven of the 139 sequences (Table II) displayed similarity to one or more of the entries in Table I, i.e. to the proteins predicted to adopt either the GT-A or the GT-B fold. Recently, two highly identical accessions (Q9ZSJ2 and Q9ZSJ0; Table II; Fig. 2B) were shown to be CW-specific xylosyltransferases (J. Egelund, B.L. Petersen, A. Faik, M.S. Motawia, C.E. Olsen, T. Ishii, H. Clausen, P. Ulvskov, and N. Geshi, unpublished data), corroborating that the adopted bioinformatics strategy identifies GTs related to CW biosynthesis.
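The bookkeeping of this step can be checked with a short Python sketch. The per-superfamily counts are those given in the text; the dictionary layout and the `remove_classified` helper are illustrative assumptions rather than part of the original workflow.

```python
# Per-superfamily counts reported in the text: hits from SUPERFAMILY,
# and what remained after CAZy-classified accessions were removed.
superfamily_hits = {
    "NDP-sugartransferase": 54,
    "UDP-glycosyltransferase/glycogen phosphorylase": 33,
    "Gal-binding domain": 23,
    "Carbohydrate-binding domain": 6,
    "Rossmann fold": 75,
}
not_in_cazy = {
    "NDP-sugartransferase": 24,
    "UDP-glycosyltransferase/glycogen phosphorylase": 25,
    "Gal-binding domain": 12,
    "Carbohydrate-binding domain": 5,
    "Rossmann fold": 73,
}

total_hits = sum(superfamily_hits.values())
total_remaining = sum(not_in_cazy.values())
removed = {name: superfamily_hits[name] - not_in_cazy[name] for name in superfamily_hits}


def remove_classified(candidates, cazy_accessions):
    """Hypothetical helper: drop accessions already present in CAZy, keeping order."""
    classified = set(cazy_accessions)
    return [acc for acc in candidates if acc not in classified]


print(total_hits, total_remaining, sum(removed.values()))
for name, count in removed.items():
    print(f"{name}: {count} accessions already in CAZy")
```

The totals reproduce the 191 and 139 sequences of the text, and the difference matches the 52 CAZy-classified accessions that were discarded; most of these come from the NDP-sugartransferase superfamily.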

Filtering of the Arabidopsis Proteome

Choice of servers, strategies applied, and theoretical and practical considerations of the filtering process are described sequentially below.

Figure 1. Flow chart of the bioinformatics approach used to identify 27 putative GTs not classified in the CAZy database. Web sites for the various servers used in this study are listed in "Materials and Methods."


Table I. List of PDB IDs used to screen the result of the secondary structure prediction servers mGenTHREADER and 3D-PSSM

The PDB IDs were obtained from Wimmerova et al. (2003; lowercase) as well as manually from the CAZy database GT families (uppercase; only one PDB ID per family). Origins and functions of the enzymes were obtained from the PDB (http://www.pdb.mdc-berlin.de/pdb/; Berman et al., 2000).

PDB ID | Origin | Enzyme | Function

GT-A
1ABB | Rabbit | Glycogen phosphorylase | Glycogen biosynthesis
1EM6 | Human | Glycogen phosphorylase | Glycogen biosynthesis
1eyr | N. meningitidis | Selenomethionyl cytosine-5'-monophosphate-acylneuraminate synthetase | Activation of sialic acid
1fgg | Human | Glucuronyltransferase I | Heparan/chondroitin sulfate biosynthesis
1foa | Rabbit | N-Acetylglucosaminyltransferase I | Decoration of glycoproteins
1fr8 | Bovine | β-1,4-Galactosyltransferase | Decoration of glycoproteins
1frw | E. coli | MobA | Molybdopterin guanine dinucleotide biosynthesis
1g0r | Pseudomonas aeruginosa | Glucose-1-phosphate thymidylyltransferase | Bacterial cell wall biosynthesis
1g93 | Bovine | α-1,3-Galactosyltransferase | Decoration of glycoproteins
1g97 | Streptococcus pneumoniae | N-Acetylglucosamine-1-phosphate uridyltransferase | Synthesis of UDP-N-acetylglucosamine
1ga8 | N. meningitidis | Lipopolysaccharide galactosyltransferase | Lipooligosaccharide biosynthesis
1GA8 | N. meningitidis | Lipopolysaccharide galactosyltransferase | Implicated in lipooligosaccharide biosynthesis
1GZ5 | E. coli | Trehalose phosphate synthase | Trehalose biosynthesis
1h7g | E. coli | 3-Deoxy-manno-octulosonate cytidylyltransferase | Lipopolysaccharide biosynthesis
1ini | E. coli | 4-Diphosphocytidyl-2-C-methylerythritol | Isoprenoid biosynthesis
1j94 | Mouse | β-1,4-Galactosyltransferase | Lactose biosynthesis
1ll2 | Rabbit | Glycogenin glucosyltransferase | Glycogen biosynthesis
1lz0 | Human | Glycosyltransferase A | Blood group biosynthesis
1LZJ | Human | α-1,3-Galactosyltransferase | Blood group biosynthesis
1OMX | Mouse | α-1,4-N-Acetylhexosaminyltransferase | Heparan sulfate biosynthesis
1qgq | Bacillus subtilis | NDP-sugartransferase | Synthesis of spore coat
1YGP | Saccharomyces cerevisiae | Glycogen phosphorylase | Glycogen biosynthesis

GT-B
1BGT | Bacteriophage T4 | β-Glucosyltransferase | Nucleotide synthesis
1c3j | Bacteriophage T4 | β-Glucosyltransferase | Nucleotide synthesis
1f0k | E. coli | Pyrophosphoryl-undecaprenol N-acetylglucosamine transferase | Peptidoglycan biosynthesis
1f6d | E. coli | UDP-N-acetylglucosamine 2-epimerase | UDP-N-acetylglucosamine biosynthesis
1FGG | Human | Glucuronyltransferase I | Heparan/chondroitin sulfate synthesis
1FGX | Bovine | β-1,4-Galactosyltransferase | Glycoprotein and glycosphingolipid synthesis
1FO9 | Rabbit | N-Acetylglucosaminyltransferase I | Decoration of glycoproteins
1h5u | Rabbit | Glycogen phosphorylase | Glycogen biosynthesis
1iir | Amycolatopsis orientalis | UDP-glucosyltransferase GtfB | Synthesis of the vancomycin group of antibiotics
1NLM | E. coli | Pyrophosphoryl-undecaprenol N-acetylglucosamine transferase | Bacterial cell wall synthesis
1PN3 | A. orientalis | TDP-epi-vancosaminyltransferase GtfA | Synthesis of the vancomycin group of antibiotics
1QG8 | B. subtilis | Nucleotide-diphospho-sugartransferase | Spore coat synthesis
1QKJ | Bacteriophage T4 | β-Glucosyltransferase | Nucleotide synthesis
1qm5 | E. coli | Maltodextrin phosphorylase | Phosphorolysis of maltodextrin


Table II. The 27 putative GTs identified as a result of filtering the Arabidopsis proteome through the TMHMM, SUPERFAMILY, mGenTHREADER, and 3D-PSSM servers as illustrated in Figure 1

Web sites for the various servers used in this study are listed in "Materials and Methods."

TrEMBL Protein ID | SuperF (a) | Best Fit to Known GT Fold (b) | E-Value 3D-PSSM | E-Value mGenTHREADER | BLAST (NCBI) | TMD
Q9LZ77 | U | GT-B | 0.0017 | 0.001 | Plant and bacteria | 51-70
Q9M147 | U | GT-B | 0.0112 | 0.001 | Plant and bacteria | 44-66
Q9FMW3 | U | GT-B | Not found | 0.023 | Plant | 53-72
Q9LU22 | U | GT-B | Not found | 0.068 | Plant and animal | 27-49
Q9C9Z9 | N | GT-B | Not found | 0.022 | Plant and animal | 27-49
O81786 | R | GT-B | Not found | 0.979 | Plant and animal | 45-62
Q9C920 | N | GT-A | 3.56e-08 | 0.005 | Plant | 13-35
Q9LTZ5 | N | GT-A | 8.51e-05 | 0.004 | Plant and bacteria | 21-42
Q9FM26 | N | GT-A | 0.0173 | 0.005 | Plant and bacteria | 21-43
O04568 | N | GT-A | 0.00930 | 0.009 | Plant and bacteria | 22-44
Q9FXA7 | N | GT-A | 0.606 | 0.112 | Plant | 21-40
Q9C9Q6 | N | GT-A | 0.063 | 0.017 | Plant | 13-35
Q9C9Q5 | N | GT-A | 0.0993 | 0.022 | Plant | 13-35
Q9FMN8 | N | GT-A | 0.115 | 0.015 | Plant | 26-45
Q9ZSJ2 | N | GT-A | 0.271 | 0.037 | Plant | 36-55
Q9FF50 | N | GT-A | 0.278 | 0.013 | Plant | 38-60
Q9M146 | N | GT-A | 0.0355 | 0.059 | Plant | 42-61
Q9SZU2 | N | GT-A | 0.0355 | 0.012 | Plant | 44-66
Q9ZSJ0 | N | GT-A | 0.174 | 0.111 | Plant | 30-52
Q9SAD6 | N | GT-A | Not found | 0.046 | Plant | 20-42
Q9LKU7 | R | GT-A | Not found | 1.030 | Plant | 23-45
Q9LQS0 | R | GT-A | Not found | 0.774 | Plant | 45-67
Q9LYF7 | R | GT-A | 4.11 | Not found | Plant | 30-52
Q9LU27 | R | GT-A | 3.51 | Not found | Plant | 31-53
Q9T0G0 | R | GT-A | 0.660 | Not found | Plant and bacteria | 7-29
2TMD Proteins
Q9XEE9 | U | GT-A | 2.73e-05 | 0.0008 | Plant and animal | 4-26 and 113-135
Q9ZU10 | R | GT-B | Not found | 0.551 | Plant and animal | 5-27 and 47-69

TrEMBL Protein ID | Length (Amino Acids) | SignalP | Pfam Domain (c) | Pfam E-Value (c) | DxD in Hydrophobic Pocket (d) | Isoxaben Array Up/Down-Regulated | EST
Q9LZ77 | 1091 | Nonsecretory | GT1 | 0.00017 | Yes | +37% | No
Q9M147 | 963 | Signal anchor | GT1 | 3.9 × 10^-27 | Yes | +35% | Yes
Q9FMW3 | 559 | Signal anchor | None | Not found | Not found | -22% | No
Q9LU22 | 419 | Signal anchor | None | Not found | Yes | -2% | Yes
Q9C9Z9 | 533 | Signal peptide | None | Not found | Not found | -4% | Yes
O81786 | 204 | Signal anchor | None | Not found | Not found | -18% | Yes
Q9C920 | 290 | Signal peptide | CTP-GT | 2.9 × 10^-63 | Yes | +52% | Yes
Q9LTZ5 | 582 | Signal anchor | GT2 | 0.0016 | Yes | +483% | Yes
Q9FM26 | 583 | Signal anchor | GT2 | 0.0088 | Yes | Not found | Yes
O04568 | 516 | Signal anchor | None | Not found | Yes | -38% | Yes
Q9FXA7 | 383 | Signal anchor | None | Not found | Yes | Not found | Yes
Q9C9Q6 | 402 | Signal anchor | GT8 | 0.017 | Yes | Not found | Yes
Q9C9Q5 | 428 | Signal anchor | GT8 | 0.09 | Yes | -1% | Yes
Q9FMN8 | 624 | Signal anchor | None | Not found | Yes | +44% | Yes
Q9ZSJ2 | 361 | Signal anchor | None | Not found | Yes | Not found | No
Q9FF50 | 932 | Signal anchor | GT2 | Not found | Yes | Not found | Yes
Q9M146 | 360 | Signal anchor | None | Not found | Yes | +59% | Yes
Q9SZU2 | 588 | Signal anchor | GT2 | 0.011 | Yes | Not found | No
Q9ZSJ0 | 367 | Signal anchor | GT2 | 0.011 | Yes | -64% | Yes
Q9SAD6 | 371 | Signal anchor | Chemotaxis phosphatase | 0.075 | Yes | -62% | Yes
Q9LKU7 | 156 | Nonsecretory | Zinc finger domain | 0.00044 | Not found | -25% | No
Q9LQS0 | 118 | Nonsecretory | None | Not found | Not found | Not found | No
Q9LYF7 | 386 | Signal anchor | None | Not found | Not found | -20% | Yes
Q9LU27 | 384 | Signal anchor | None | Not found | Yes | -28% | No
Q9T0G0 | 389 | Signal peptide | Dehydrogenase | 3.7 × 10^-21 | Not found | +2% | Yes
2TMD Proteins
Q9XEE9 | 474 | Signal peptide | GT1 | 2.1 × 10^-19 | Yes | -22% | Yes
Q9ZU10 | 200 | Signal peptide | None | Not found | Not found | Not found | Yes

(a) N, NDP-sugartransferases; R, NAD(P)-binding Rossmann-fold domains; U, UDP-glycosyltransferase/glycogen phosphorylase; G, galactose-binding domain-like. (b) 3D-PSSM and/or mGenTHREADER. (c) Hits only shown for E-values < 0.1. (d) HCA analysis.


Filter I: Identification of Potential Type II Membrane Proteins

In two comparative tests, TMHMM 2.0 (Krogh et al., 2001) was found to be the best of the tested prediction servers, measured as having the lowest fraction of the sum of false positives and false negatives within the total number of experimentally assigned transmembrane helix (TMH) segments ([false positives + false negatives]/no. of TMH) used in the tests (Schwacke et al., 2003; Zhou and Zhou, 2003). TMHMM 2.0 was chosen as the initial filter because of its reliable and somewhat conservative prediction strategy and because this server supports batch submissions of up to 4,000 accessions.
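As a small illustration of the two quantities mentioned here, the following Python sketch computes the error fraction used to rank the prediction servers and splits a proteome into batches that respect a server's submission limit. The function names and the example counts passed to `error_fraction` are made up for illustration; only the formula, the 4,000-accession limit, and the proteome size of 26,095 come from the text.

```python
def error_fraction(false_positives: int, false_negatives: int, n_tmh: int) -> float:
    """(false positives + false negatives) / number of experimentally assigned TMH segments."""
    if n_tmh <= 0:
        raise ValueError("n_tmh must be positive")
    return (false_positives + false_negatives) / n_tmh


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Made-up counts, for illustration only.
print(error_fraction(false_positives=3, false_negatives=2, n_tmh=50))

# Placeholder accessions standing in for the 26,095 Arabidopsis proteins.
proteome = [f"protein_{i}" for i in range(26095)]
submission_sizes = [len(batch) for batch in batches(proteome, 4000)]
print(len(submission_sizes), submission_sizes[-1])
```

Under these assumptions the whole proteome fits into seven submissions. The same `batches` helper would apply to the much smaller submission limit of the SUPERFAMILY server described in the next section.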

Filters II and III: Identification of Accessions in GT-Containing Superfamilies

The SUPERFAMILY database server was chosen as the next filter because this facility incorporates an alternative approach to the seed-based PSI-BLAST approach used in the CAZy database classification scheme and supports batch submissions of up to 20 accessions. The SUPERFAMILY database contains a library of hidden Markov models (HMMs) representing all proteins of known structure (Gough et al., 2001; Gough and Chothia, 2002). The SUPERFAMILY facility is based on the Structural Classification of Proteins (SCOP) protein domain classification database, which in turn is based on multiple sequence alignments designed to represent a protein family in a structural domain-based hierarchical classification scheme with several levels, including the superfamily level (Murzin et al., 1995).

Filter IV: Identification of Putative GTs within GT-Containing Superfamilies

mGenTHREADER is based upon a multilayered neural network that was trained to combine sequence alignment score, length information, and energy potentials with PSI-BLAST searches that have been jump-started with structural alignment profiles from Fold Secondary Structure Prediction, together with the PSI-BLAST profile, secondary structure predicted by PSIPRED, and bidirectional scoring, in order to calculate the final alignment score (Jones, 1999; McGuffin and Jones, 2003).

Figure 2. Phylogenetic tree of the 27 putative GTs. Four distinct homologous groups (A–D) consisting of two to six sequences were identified in the analysis.

3D-PSSM constitutes a method for protein fold recognition using one-dimensional (1D) and 3D sequence profiles coupled with secondary structure and solvation potential information (Kelley et al., 2000).

The output of the sequence-based SUPERFAMILY server was evaluated by the mGenTHREADER (multilayered neural network) and 3D-PSSM servers, which operate at the fold level and, in addition to 1D sequence information, incorporate 3D structural information, solvation potential, etc. (see also above). The difference in the number of accessions pre-filter IV and post-filter IV (139 and 27, respectively) indicates that a major fraction of the 139 accessions predicted by the SUPERFAMILY server to belong to polysaccharide- or CW-relevant superfamilies were most likely non-GT proteins, as evidenced e.g. by accessions containing an unusually high number of Pro and Ser residues (>50% of the total amino acid residues) or by proteins with an estimated molecular mass <20 kD. The 139 non-CAZy-classified accessions resulting from the SUPERFAMILY filtering and BLAST searches against the local CAZy database are available as supplemental data (available at www.plantphysiol.org).
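The two indicators mentioned above (Pro/Ser content and estimated molecular mass) are easy to compute from sequence alone. The Python sketch below is only an illustration and is not a filter that was applied in this study: the average residue mass of 110 Da is a common rule of thumb assumed here, the function names are hypothetical, and the example sequences are synthetic.

```python
AVERAGE_RESIDUE_MASS_DA = 110.0  # rule-of-thumb average, an assumption of this sketch


def pro_ser_fraction(sequence: str) -> float:
    """Fraction of residues that are Pro (P) or Ser (S)."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("P") + seq.count("S")) / len(seq)


def approx_mass_kd(sequence: str) -> float:
    """Rough molecular mass in kD estimated from sequence length."""
    return len(sequence) * AVERAGE_RESIDUE_MASS_DA / 1000.0


def looks_like_non_gt(sequence: str, max_pro_ser: float = 0.5, min_mass_kd: float = 20.0) -> bool:
    """True if either indicator from the text is met (>50% Pro/Ser or <20 kD)."""
    return pro_ser_fraction(sequence) > max_pro_ser or approx_mass_kd(sequence) < min_mass_kd


# Synthetic sequences, for illustration only.
pro_ser_rich = "PSPSPSAA" * 30   # 240 residues, 75% Pro/Ser
small = "A" * 100                # 100 residues, about 11 kD
ordinary = "ACDEFGHIKL" * 30     # 300 residues, about 33 kD, no Pro/Ser

for name, seq in [("pro_ser_rich", pro_ser_rich), ("small", small), ("ordinary", ordinary)]:
    print(name, round(pro_ser_fraction(seq), 2), round(approx_mass_kd(seq), 1), looks_like_non_gt(seq))
```

Such a check is best read as a flag for manual inspection, since Table II lists accessions shorter than the length this mass threshold would imply.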

Elimination of False (e.g. Non-GT) Hits

The ability of the filtering process to eliminate accessions that encode enzymes that do have NDP-sugars as substrate but are non-GTs was examined by applying the sequential filtering procedure to two quite different proteins: (1) a putative membrane-bound Arabidopsis protein involved in synthesis of UDP-D-Xyl, which in plants is incorporated in glycoproteins and CW polysaccharides, including xyloglucan (XG), and (2) an Escherichia coli protein catalyzing the epimerization of UDP-N-acetyl-D-glucosamine to UDP-N-acetyl-D-mannosamine involved in bacterial lipopolysaccharide biosynthesis. The Arabidopsis UDP-glucuronic acid decarboxylase (ATUXS2, At3g62830) is predicted to adopt a typical type II membrane protein topology and is thought to be located in the Golgi apparatus (Harper and Bar-Peled, 2002). When used as a negative control, ATUXS2 passes filters II and III (Rossmann fold, non-CAZy entry) but fails to pass filter IV, i.e. ATUXS2 does not adopt a GT-A or a GT-B fold structure. Furthermore, a DxD motif (as described below) is not found in ATUXS2. The E. coli UDP-N-acetyl-D-glucosamine 2-epimerase (Kiino et al., 1993; P27828), as expected, does not adopt a typical type II membrane protein structure when run through the TMHMM version 2.0 server (filter I); it is nevertheless predicted to belong to the UDP-glycosyltransferase/glycogen phosphorylase superfamily by the SUPERFAMILY prediction server and is predicted to adopt a GT-B fold by mGenTHREADER. However, a DxD motif as described below is not found.
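A plain sequence scan for DxD (Asp, any residue, Asp) can be sketched in Python as follows. This is a simplification under stated assumptions: it only locates the tripeptide pattern and does not model the hydrophobic context assessed by HCA analysis in this study. The function name and the example sequence are made up for illustration.

```python
import re

# Lookahead so that overlapping occurrences (e.g. within "DDDD") are all reported.
DXD_PATTERN = re.compile(r"(?=(D[A-Z]D))")


def find_dxd(sequence: str):
    """Return (1-based position, tripeptide) for every DxD occurrence."""
    return [(m.start() + 1, m.group(1)) for m in DXD_PATTERN.finditer(sequence.upper())]


# Made-up peptide, for illustration only.
example = "MKDADLLDDDG"
print(find_dxd(example))
```

An empty result for a sequence corresponds to the "Not found" entries in the DxD column of the tables, with the caveat that a positive hit here says nothing about whether the motif sits in a hydrophobic pocket.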

When the six known plant CW GTs were run through the 3D-PSSM and mGenTHREADER servers, the galactomannan-specific α(1-6)galactosyltransferase and the XG-specific α(1-6)xylosyltransferase were predicted to adopt the GT-B and the GT-A fold, respectively, although both proteins are classified in CAZy family GT34 (Table III). However, as indicated by the poor E-values, discrimination between the GT-A and GT-B fold was not feasible. In this respect it should be noted that plants in general synthesize a number of plant-specific CW polymers (not found in any other kingdom). The uniqueness of such structures may be reflected in the structure of the biosynthetic enzymes, and these may thus not be clearly related to GTs of organisms from other kingdoms. None of the quite few characterized plant CW GTs have had their 3D structure resolved. It is in this context that servers like mGenTHREADER and 3D-PSSM, which besides sequence similarity use various parameters such as 3D information (see above) in their prediction strategy, were chosen as validation tools. The prediction ability of the various servers will undoubtedly improve as new plant CW GTs are identified and structurally analyzed. In summary, the filtering process applied here was quite efficient in eliminating evident types of false positives. The pool of 27 accessions may still comprise non-GT accessions and was thus subjected to a post-filtering analysis.

Table III. Known type II CW GTs and their classification in the CAZy database

The glycosyltransferases listed are all from Arabidopsis. From the top: α(1-6)-D-xylose transferase, transfers D-xylose on to the β(1-4)glucan chains of xyloglucan (Faik et al., 2002); α(1-6)-D-galactose transferase, transfers D-galactose on to the β(1-4)mannan backbone of galactomannan (Edwards et al., 1999); α(1-2)-L-fucose transferase, transfers the terminal L-fucose on to the galactosyl residue of the xyloglucan sidechain (Perrin et al., 1999); Quasimodo, involved in pectin biosynthesis* (Bouton et al., 2002); β(1-2)-D-galactose transferase, transfers D-galactose on to the α(1,6)-linked xylose in xyloglucan (Madson et al., 2003); β(1-4)-D-glucuronosyl transferase, transfers D-glucuronic acid on to α(1-4)-linked fucose in RG II (Iwai et al., 2002)*.

GT Function | TrEMBL Protein ID | CAZy Family | SuperF (a) | Best Fit to Known GT Fold (b) | E-Value 3D-PSSM | E-Value mGenTHREADER | BLAST (NCBI)
α(1-6)-D-xylT | Q9LZJ3 | GT-34 | None | GT-B | 5.08 | 0.187 | Plant and bacteria
α(1-6)-D-galT | Q9ST56 | GT-34 | None | GT-A | 0.934 | 0.152 | Plant and bacteria
α(1-2)-L-fucT | Q9LJK1 | GT-37 | None | GT-B | 4.54 | Not found | Plant and animal
Quasimodo | Q9LSG3 | GT-8 | N | GT-A | 1.08 × 10^-12 | 0.009 | Plant and animal
β(1-2)-D-galT | Q9LVB4 | GT-47 | N | None | Not found | Not found | Plant
β(1-4)-D-glcAT | Q8GSQ4 | GT-47 | N | GT-B | 5.04 | 0.090 | Plant and animal

GT Function | TMD | Length (Amino Acids) | SignalP | Pfam Domain (c) | Pfam E-Value (c) | DxD in Hydrophobic Pocket (d) | Isoxaben Array
α(1-6)-D-xylT | 21-40 | 460 | Signal anchor | GT-34 | 1.1 × 10^-121 | Yes | Down 38%
α(1-6)-D-galT | 13-35 | 435 | Signal anchor | GT-34 | 1.0 × 10^-137 | Yes | Not found
α(1-2)-L-fucT | None | 501 | Signal anchor | GT-10 | 1.6 × 10^-17 | Yes | Not found
Quasimodo | 21-43 | 599 | Signal anchor | GT-8 | 1.4 × 10^-116 | Yes | Up 118%
β(1-2)-D-galT | None | 549 | Nonsecretory | Exostosin (GT-47) | 1.2 × 10^-92 | Not found | Down 55%
β(1-4)-D-glcAT | None | 334 | Nonsecretory | Exostosin (GT-47) | 2.3 × 10^-31 | Yes | Down 42%

*Function likely; activity not unequivocally demonstrated. (a) N, NDP-sugartransferases. (b) 3D-PSSM and/or mGenTHREADER. (c) Hits only shown for E-values < 0.1. (d) HCA analysis (data not shown).

Post-Filtering Evaluation of the 27 Putative GTs

Homologous Sequences within and outside of the Plant Kingdom

The presence of homologous sequences was investigated by subjecting the putative GTs to global protein-protein BLAST (blastp). The majority of the searches (22 out of 27) gave rise to plant-specific or plant- and bacteria-specific hits (Table II). Because mycobacterial CWs, for example, contain plant CW-like polysaccharides such as arabinogalactans (Crick et al., 2001), bacterial hits do not necessarily contradict a function in plant CW synthesis. When the 27 sequences were blasted against the Arabidopsis expressed sequence tag (EST) database, 21 were represented by an EST (Table II).

Phylogeny

As a consequence of the overall bioinformatics approach adopted in this study, significant similarity throughout the 27 accessions was not anticipated. Alignment of the 27 putative GTs, however, identified four groups of clustered homologous genes, denominated A, B, C, and D, of which groups B and C (Fig. 2) display a high degree of conservation in stretches of at least 20 to 80 amino acids (data not shown). The rest of the 27 accessions constituted a heterogeneous group with extremely low or insignificant sequence identity.

The genes in group B (Q9C9Q6, Q9C9Q5, Q9FXA7, Q9M146, Q9ZSJ2, and Q9ZSJ0; Table II; Fig. 2) fall into two distinct subgroups consisting of highly identical group members (subgroup I: four sequences with 73%, 75%, and 90% identity; subgroup II: two sequences with 72% identity) but with only 11% identity between the two subgroups. The two highly identical accessions (Q9ZSJ2 and Q9ZSJ0; Table II; Fig. 2B) are the xylosyltransferases mentioned above. Genes in group C are approximately 550 amino acids long. Aside from the four GTs (accession nos. Q9LZ77, Q9M147, Q9C920, and Q9XEE9; Table II; Fig. 2C), which display significant similarity to CAZy GT-family-1 and CTP-GTs, similarity for the rest of the 27 sequences to other GTs with known function (plant or non-plant) was extremely weak or nonexistent.

Prediction of Subcellular Localization

For 24 of the 27 putative GTs, the SignalP server predicted a signal anchor or signal peptide in or close to the TMD (data not shown; Table II). When the six GTs of group B (Fig. 2) were run through the TargetP server (Krogh et al., 2001), a reliable prediction of their subcellular location could not be achieved. Similar results were obtained when the six GTs with known function in CW synthesis (Edwards et al., 1999; Perrin et al., 1999; Bouton et al., 2002; Faik et al., 2002; Iwai et al., 2002; Madson et al., 2003; Table III) were run through these servers (data not shown). Although localization of these CW GTs has not been proven, sufficient evidence exists for Golgi as the place of synthesis of at least the major building blocks of the CW (for review, see Mohnen, 1999). Thus, the TargetP prediction server is not able to generate a reliable prediction of the localization of the plant CW GTs.

Expression Data

Recently, sets of CW-specific microarray data derived from suspension-cultured cells that had been exposed to the herbicide N-[3-(1-ethyl-1-methylpropyl)-1,2-oxazol-5-yl]-2,6-dimethoxybenzamide (isoxaben) have been made public (see "Materials and Methods"). Isoxaben specifically inhibits cellulose synthesis (Scheible et al., 2001), and plants adapted to grow in isoxaben compensate for the almost complete loss of the cellulose-XG load-bearing structure by constructing walls made predominantly of pectin (Schedletzky et al., 1990; Encina et al., 2002). The array-derived expression data for the accessions Q9C920, Q9LTZ5 (and the highly homologous Q9FM26, not present in the array dataset), and Q9M146, which in this experiment are up-regulated 52%, 483%, and 59%, respectively, might indicate function in pectin biosynthesis. However, the lack of confirmed pectin biosynthetic GTs and the expression pattern (spatial and temporal) of each putative GT should be taken into consideration when interpreting the array data.

HCA Analysis

Hydrophobic cluster analysis (HCA) identified a putative sugar-nucleotide-binding domain, the so-called DxD motif (Breton et al., 1998, 2001; Wiggins and Munro, 1998; Costa et al., 2002; Stolz and Munro, 2002), or a degenerate form of this motif, [DE]-X-[DE] (compare with Tarbouriech et al., 2001), surrounded by stretches of hydrophobic amino acids in 19 of the 27 sequences, as exemplified in Figure 3. It should be stressed, however, that the occurrence of a DxD motif alone is not diagnostic of a GT function (Gastinel, 2001; Coutinho et al., 2003). Parsing of the 27 sequences through the Multiple EM for Motif Elicitation (MEME) version 3.0 server identified the putative sugar-nucleotide-binding DxD motif in both subgroups of group B (Fig. 4). Genes in group C share common overall features with the group B genes, e.g. varying sequence identity (69%, 35%, and 27%) and a DxD motif flanked by hydrophobic stretches situated approximately in the middle of the proteins (Table II; Fig. 3).
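The motif description above, a [DE]-X-[DE] tripeptide embedded in hydrophobic stretches, can be illustrated with a simple sequence scan. The sketch below is not the HCA procedure used in the study, where plots were assessed manually; it is a rough stand-in under stated assumptions. The function name `find_dxd_candidates`, the flank width, and the hydrophobic-fraction threshold are hypothetical choices, while the hydrophobic residue set (V, I, L, F, M, W, Y) is the one given in the Figure 3 legend.

```python
import re

# Hydrophobic residues as listed in the Figure 3 legend.
HYDROPHOBIC = set("VILFMWY")
# Lookahead so that overlapping [DE]-X-[DE] occurrences are all reported.
MOTIF = re.compile(r"(?=([DE][A-Z][DE]))")


def find_dxd_candidates(seq, flank=6, min_fraction=0.5):
    """Return (0-based position, motif) for [DE]-X-[DE] hits in hydrophobic surroundings.

    flank and min_fraction are illustrative assumptions, not values from the paper.
    """
    seq = seq.upper()
    hits = []
    for match in MOTIF.finditer(seq):
        start = match.start()
        left = seq[max(0, start - flank):start]
        right = seq[start + 3:start + 3 + flank]
        flanking = left + right
        if not flanking:
            continue
        fraction = sum(res in HYDROPHOBIC for res in flanking) / len(flanking)
        if fraction >= min_fraction:
            hits.append((start, match.group(1)))
    return hits
```

A scan of this kind can only nominate candidates; as noted above, a DxD motif alone is not diagnostic of GT function.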

When run through the Pfam server, a tentative CAZy GT family relationship (GT1, GT2, or GT8) could be assigned for 10 of the accessions, although the prediction power (E-value) in most cases was relatively poor (Table II).

Accessions Q9SAD6, Q9LKU7, Q9T0G0, O23479, Q9C990, and Q9C991 were predicted to contain other domains, also with varying prediction power. Of these, a putative DxD motif as defined above could not be identified for accessions Q9LKU7, Q9T0G0, and O23479, suggesting that the proposed GT function for at least these accessions should be treated with caution. The Pfam server is based on seed alignments, which also include consensus alignment sequences of the various CAZy families. The relatively low number (10) of tentative CAZy GT family assignments may reflect the difference between the sequence-based prediction strategies of Pfam and CAZy and the prediction strategies used by the mGenTHREADER and 3D-PSSM servers.

DISCUSSION

In this study, we have identified 27 putative Arabidopsis GTs that are not classified in the CAZy database. The 27 accessions were selected as putative GTs because they are typical of Golgi-localized type II membrane proteins, and they were characterized using various prediction servers, HCA analysis, and CW-specific array datasets. Proof of concept for the strategy has to some extent been obtained recently, as functions in CW biosynthesis were established for two GT members of the phylogenetically distinct group B (Fig. 2).

Figure 3. HCA analysis showing the DxD motif within a pocket of hydrophobic amino acids. The protein sequences are represented on a duplicated α-helical net, and the clusters of contiguous hydrophobic residues (V, I, L, F, M, W, and Y) are boxed. The actual assessments of the individual HCA plots were done manually in order to identify similarities between the sequences. The one-letter code is used for amino acids except for Gly, Pro, Ser, and Thr, which are represented by symbols. Vertical lines delineate hydrophobic pockets in which the DxD motif (boxed in black) can be found. Three well-known GTs were used in the analysis. A, α(1-4)galactosyltransferase LgtC (Neisseria meningitidis, TrEMBL accession no. Q8KHJ3); Persson et al. (2001). B, β(1-4)galactosyltransferase (Homo sapiens, TrEMBL accession no. Q9UBX8); Amando et al. (1999). C, Quasimodo, putatively involved in pectin biosynthesis (Arabidopsis, TrEMBL accession no. Q94BZ8); Bouton et al. (2002). Representatives from the 27 putative GTs containing an identifiable DxD motif within a hydrophobic pocket: D, Q9ZSJ2; E, Q9M146; F, Q9C9Q5; G, A9LTZ5; H, Q9FF50; I, O045498.

Although the topology of noncellulosic backbone-synthesizing enzymes remains an open question, it is tempting to suggest that the enzymes responsible for, e.g., the synthesis of the α(1-4)-linked homogalacturonan backbone or the β(1-4)glucan backbone of XG might resemble the multimembrane-spanning and processive cellulose synthases, which synthesize homopolymers with the same linkage type. Enzymatic assays of proteinase-treated intact and detergent-disrupted Golgi vesicles suggest that the catalytic site of the α(1-4)galacturonosyltransferase activity resides in the lumen of the Golgi (Sterling et al., 2001). Recently, an Arabidopsis gene, classified within CAZy GT-family-8, was cloned, heterologously expressed, and shown to possess galacturonosyltransferase activity (J. Sterling and D. Mohnen, personal communication). The predicted topology of this enzyme is that of a typical type II membrane-spanning protein, suggesting that at least the pectin-synthesizing enzymes may be type II membrane-anchored proteins.

Current estimates suggest that about 1% of the open reading frames of each genome is dedicated to the task of glycosidic bond synthesis (Coutinho et al., 2003). When the 415 CAZy-classified Arabidopsis GTs are used as an estimate of the total number of glycosidic bond-forming enzymes, this number is 1.6% in plants, partly due to the existence of the highly complex CW. Based on the number of Arabidopsis genes that are predicted to possess signal peptides, it has been estimated that well over 2,000 genes are likely to participate in biosynthesis, assembly, and modification of CWs during plant development (Carpita et al., 2001). If soluble enzymes that participate in generation of substrates and the integral membrane-associated biosynthetic CW proteins, such as the cellulose synthase, are included, it has been estimated that some 15% of the Arabidopsis genome is dedicated to CW biogenesis and modification (Carpita et al., 2001).
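As a quick check of the 1.6% figure quoted above, the calculation can be reproduced from numbers given elsewhere in the text. This is only a sketch: the text does not state which denominator was used, so taking the 26,095 proteins of the nonredundant proteome described in "Materials and Methods" is an assumption.

```python
# 415 CAZy-classified Arabidopsis GTs (from the text).
cazy_gts = 415
# Assumed denominator: size of the nonredundant proteome from Materials and Methods.
proteome_size = 26095

fraction_percent = 100 * cazy_gts / proteome_size
print(round(fraction_percent, 1))
```

Under this assumption the rounded value agrees with the percentage reported in the text.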

CAZy GT-family-1 consists primarily of soluble enzymes with function in secondary metabolism that have rather small molecules as acceptor substrates. If the 121 Arabidopsis sequences in GT-family-1 are subtracted from the total 415 sequences, 296 GTs are left for glycosylation of proteins and lipids, synthesis of various polysaccharides, and CW biosynthesis. In Arabidopsis-rich CAZy GT families, such as GT8, GT31, or GT47, alignments of Arabidopsis accessions reveal the existence of highly identical genes within the GT families, which are likely to have identical function but may be differentially expressed. For pectin alone, one of the major noncellulosic CW polysaccharides, which comprises the polysaccharides homogalacturonan and rhamnogalacturonan I and II, at least 53 distinct enzymatic activities are required for synthesis (Mohnen, 1999; Ridley et al., 2001).

In this study, the use of the most conservative transmembrane span prediction server as the first filter clearly removes an unknown number of GTs with a weak TMD profile and perhaps also GTs without a TMD, which might interact in complexes with other membrane-bound GTs. A significant number of the Arabidopsis sequences, e.g. in the CAZy database GT-family-47, do not have a predictable TMD and are therefore often referred to as soluble enzymes. Of the six noncellulosic plant CW GTs with known function, the XG-specific β(1-2)galactosyltransferase, the putative rhamnogalacturonan II-specific β(1-4)glucuronosyltransferase, and the XG-specific α(1-2)fucosyltransferase are not predicted to possess an N-terminal transmembrane helix when run through the TMHMM 2.0 server used in this study (Table III). However, when run through the various prediction servers available at the ARAMEMNON site, each of the three GTs was predicted to contain an N-terminal TMD-like structure by at least one of the programs available at this site. In conclusion, it is unresolved whether some CW GTs are soluble. Reporter gene and/or tag fusion experiments may shed some light on this.

CONCLUSION

The CAZy database serves as the most complex and rich source of carbohydrate-active enzymes. Classification of GTs in the CAZy database is based primarily on PSI-BLAST searches, using GTs with known function and in some cases proteins for which the 3D structures have been resolved, as the seed (Henrissat et al., 2001). With respect to GTs, it is widely accepted that secondary structure is more conserved than primary sequence. The classification scheme used in the CAZy database may not facilitate the identification of GTs that are only similar at a level higher than the primary sequence level (e.g. the fold level). A drawback of the present alternative approach may be the use of the SUPERFAMILY prediction server, which (as e.g. also Pfam) uses HMMs generated from alignments of proteins, the vast majority being of non-plant origin. We expect that a higher number of candidate GTs will be retrieved when it is possible to screen the entire Arabidopsis proteome using servers operating at the fold level.

Figure 4. Conserved region of the B group accessions (compare with phylogenetic analysis, Fig. 2) identified by the MEME server, showing the putative DxD motif possibly involved in binding of the nucleotide sugar.

GTs situated in the Golgi apparatus involved in synthesis of the complex Asn-linked glycans of plant glycoproteins may be found among the accessions uncovered in this study. We expect that a significant proportion of plant proteoglycan GTs are homologous to similar enzymes from other eukaryotes due to the structural similarities that exist in these glycans across kingdoms. If this assumption is valid, many of the plant proteoglycan GTs are already in CAZy.

The sequential and parallel use of several prediction servers, albeit with relatively low-stringency parameter settings, the inclusion of negative and positive controls of the filtering, and the post-filtering evaluation together make it likely that a substantial number of GTs are indeed found within the 27 accessions. It is, however, also clear that, e.g., the use of the conservative TMHMM server has eliminated relevant GTs as well and, hence, that the 27 putative GTs are but a subset of the GTs that remain to be recognized as such.

MATERIALS AND METHODS

The Arabidopsis Proteome

The Arabidopsis proteome in a nonredundant form was downloaded from EMBL (http://www.ebi.ac.uk/proteome/index.html; 08072003), converted to FASTA format, and split into 26,095 individual proteins using the Wisconsin Package version 10.3 (http://www.biobase.dk/).

TMHMM Version 2.0 Server

Predictions of transmembrane helices were carried out using the TMHMM server version 2.0 (Krogh et al., 2001; http://www.cbs.dtu.dk/services/TMHMM). All predictions were performed using standard settings. The proteome was submitted in subfiles (FASTA file format) containing the maximum of 4,000 proteins per submission. Output format was one text line per protein. The outputs from the entire proteome were collected in a single file for further screening. Proteins containing one or two TMDs starting within the first 150 N-terminal amino acid residues were extracted using the BBEdit program version 6.1.2 for MacOSX. A Unix list file containing the resulting accession numbers was generated, and these accession numbers were then used to extract the relevant protein sequences from the original proteome, which were stored in a FASTA file. This FASTA file was then used in the next filtering process.
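The TMD filter described above was carried out in a text editor; the same selection could be scripted. The following is a minimal sketch, assuming that the one-line-per-protein TMHMM output consists of tab-separated fields with `PredHel=` (number of predicted helices) and `Topology=` (e.g. `i21-40o`) entries. It also assumes that every predicted helix must start within the first 150 residues, which is one reading of the criterion. The function names are hypothetical.

```python
import re


def parse_tmhmm_line(line):
    """Split one short-format line into (accession, number of helices, helix spans)."""
    fields = line.strip().split("\t")
    accession = fields[0]
    # Remaining fields are assumed to be key=value pairs such as PredHel=1.
    info = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
    spans = [(int(a), int(b))
             for a, b in re.findall(r"(\d+)-(\d+)", info.get("Topology", ""))]
    return accession, int(info.get("PredHel", 0)), spans


def passes_tmd_filter(line, max_start=150):
    """True for proteins with one or two predicted TMDs, all starting within max_start."""
    _, n_helices, spans = parse_tmhmm_line(line)
    if n_helices not in (1, 2) or len(spans) != n_helices:
        return False
    return all(start <= max_start for start, _ in spans)
```

Applying `passes_tmd_filter` to each line of the collected output and keeping the accessions that return `True` would yield the list used to extract sequences for the next step.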

Superfamily Version 1.63 Server

The SUPERFAMILY facility (http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/) implements a searchable library (covering all proteins of known structure) consisting of 1,232 SCOP superfamilies, each of which is represented by a group of HMMs, i.e. SCOP-based single sequence HMMs (Gough et al., 2001). The post-TMHMM FASTA file was divided into files containing a maximum of 20 proteins. These files were submitted using the following parameters: scoring options, Global/Global model/sequence scoring (for exact domains); BLAST pre-filter P < 1.0 × 10^-10. The output files were collected in one large file and sorted by superfamily domain. Proteins that were classified as belonging to one of the superfamilies listed below were independently collected using a Unix list file with relevant accession numbers, and a FASTA file was generated for each of the five superfamilies used in this study: NDP-sugartransferases, Gal-binding domain-like, UDP-glycosyltransferase/glycogen phosphorylase, carbohydrate-binding domain, and Rossmann fold.
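Dividing a FASTA file into submission-sized pieces, as described above for the 20-protein SUPERFAMILY batches (and the 4,000-protein TMHMM subfiles), is straightforward to script. The sketch below is illustrative only; the original work does not say how the files were split, and the helper names `read_fasta` and `batch_records` are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def batch_records(records, batch_size):
    """Split records into consecutive batches of at most batch_size entries."""
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]
```

For example, `batch_records(read_fasta(text), 20)` returns a list of batches, each of which could be written to its own file before submission.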

Local CAZy Database

All the Arabidopsis protein accession numbers were collected from the CAZy database (September 10, 2003). These accessions were then used to generate a Unix list file that served as template for the generation of a FASTA file from the Arabidopsis proteome. The BLAST 2.6.6 program for powerpc-MacOSX was downloaded from ftp://ftp.ncbi.nih.gov and a BLASTable database built from the FASTA file as described by the provider.

The five independent FASTA files derived from the superfamily search were blasted against the local CAZy database using the BBEdit program (standard conditions with filtering off). Proteins in the dataset that were found in the local CAZy database were discarded.
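The discard step above was performed by BLAST against the local CAZy database. A simpler accession-based approximation of the same idea is sketched here; it swaps in a set lookup for the BLAST comparison, so it would only catch proteins whose accession numbers match exactly. The function name and the example accessions in the usage note are hypothetical.

```python
def discard_known(candidates, cazy_accessions):
    """Return candidate accessions that are absent from the CAZy list, preserving order."""
    known = {acc.strip().upper() for acc in cazy_accessions}
    return [c for c in candidates if c.strip().upper() not in known]
```

Called as `discard_known(["ACC1", "ACC2"], ["acc2"])`, the function would keep only `"ACC1"`, since the comparison ignores case and surrounding whitespace.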

3D-PSSM Server

Fast Web-based methods for protein fold recognition using 1D and 3D sequence profiles coupled with secondary structure information, i.e. SCOP-based profile HMMs, included the following: 3D-PSSM Web Server version 2.6.0 (http://www.sbg.bio.ic.ac.uk; Kelley et al., 2000) and mGenTHREADER, available at the PSIPRED home page (http://bioinf.cs.ucl.ac.uk/psiform.html). All predictions were performed using standard settings. Proteins larger than 800 amino acids were submitted twice, with truncations in either the N or the C terminus. The outputs were then collected and screened for known GT PDB IDs. If more than one PDB ID was present in the output from the same file, only the one with the lowest E-value was listed.
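The handling of long proteins described above can be sketched as follows. The text does not state how much of each terminus was removed, so this example assumes that each fragment is simply cut down to the 800-residue limit; the function name `submission_fragments` is hypothetical.

```python
def submission_fragments(seq, max_len=800):
    """Return the sequence itself, or a C-truncated and an N-truncated fragment if too long.

    Assumption: each fragment is trimmed to exactly max_len residues.
    """
    if len(seq) <= max_len:
        return [seq]
    c_truncated = seq[:max_len]    # keeps the N-terminal part
    n_truncated = seq[-max_len:]   # keeps the C-terminal part
    return [c_truncated, n_truncated]
```

Proteins at or below the limit are returned unchanged, so the same function can be applied to every candidate before submission.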

mGenTHREADER Server

mGenTHREADER (Jones, 1999; http://bioinf.cs.ucl.ac.uk/psipred/) is a fold recognition server based on fold library profiles that uses the PSI-BLAST profile and predicted secondary structure (PSIPRED). PSIPRED is a secondary structure prediction method, incorporating two feed-forward neural networks, which takes the output obtained from PSI-BLAST as input. mGenTHREADER, accessible from the PSIPRED Protein Structure Prediction Servers home page, was used with the following parameters: prediction method, Fold Recognition (mGenTHREADER); output, default. The outputs were then collected and screened for PDB IDs of known GTs (compare with Table I). If more than one PDB ID was present in the output from the same file, only the one with the lowest E-value was listed.
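The screening rule above (keep only hits to known GT structures, and of those list the one with the lowest E-value) can be expressed compactly. The sketch below assumes the server output has already been parsed into `(pdb_id, e_value)` pairs; the parsing itself, the function name `best_gt_hit`, and the placeholder IDs in the usage note are assumptions for illustration.

```python
def best_gt_hit(hits, gt_pdb_ids):
    """Return the (pdb_id, e_value) hit to a known GT with the lowest E-value, or None.

    hits: iterable of (pdb_id, e_value) pairs from one query's output.
    gt_pdb_ids: PDB IDs of known GTs used for screening.
    """
    known = {pdb.lower() for pdb in gt_pdb_ids}
    gt_hits = [(pdb, e) for pdb, e in hits if pdb.lower() in known]
    return min(gt_hits, key=lambda hit: hit[1]) if gt_hits else None
```

With placeholder IDs, `best_gt_hit([("1abc", 0.05), ("9zzz", 1e-9), ("2xyz", 0.002)], ["1ABC", "2XYZ"])` returns `("2xyz", 0.002)`: the hit `9zzz` has the best E-value overall but is ignored because it is not in the list of known GT structures.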

PDB IDs

The PDB IDs that were used for screening the output of both the 3D-PSSM and mGenTHREADER servers were collected from Wimmerova et al. (2003), who searched the Mycobacterium tuberculosis genome for GTs using, among other tools, fold recognition and the CAZy database (when the individual CAZy GT family contained more than one PDB ID for the particular protein, only one of the PDB IDs was used). References for the PDB IDs can be retrieved at http://www.RCSB.org/. All proteins were classified to one of the two secondary fold structures, GT-A or GT-B.

BLAST

As part of the validation process, the proteins were blasted using BLAST algorithms, which were accessible from the server at NCBI (National Center for Biotechnology Information; http://www.ncbi.nlm.nih.gov). The search included standard protein blast (blastp) and translated blast (tblastn). All searches were performed using standard settings and the BLOSUM 80 matrix. In the case of blastp, any hits, regardless of the E-value/identity, to animal, bacterial, or plant sequences were reported. Presence of ESTs was checked by blastn searches of the Arabidopsis EST database (http://www.ncbi.nlm.nih.gov).

SignalP Server

The candidate genes were scanned for the presence of signal peptides using the SignalP version 2.0.b2 World Wide Web server (http://www.cbs.dtu.dk/services/SignalP), which predicts the presence and location of signal peptide cleavage sites in amino acid sequences using HMM-based predictions (Nielsen et al., 1999). Predictions were done using the following parameters: organism group, Eukaryotes; method, HMMs; graphics, none; output format, standard. Proteins were truncated after the first 70 amino acids from the N terminus and submitted in FASTA file format.

MEME Version 3.0

Sequences for which the secondary structure resembled that of known GTs were submitted as a FASTA file to the MEME v 3.0 server (http://meme.sdsc.edu/meme/website/meme.html) in order to search for conserved domains. Standard settings were used for the search.

Alignments and Phylogenetic Analysis

All sequence alignments and calculations of sequence identities were performed by use of ClustalX version 1.81, available from Université Louis Pasteur, Strasbourg (ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX; Thompson et al., 1997). Alignments were edited and printed using the program SeqVu (SeqVu version 1.0.1; http://www.cellbiol.com/soft.htm). Trees with bootstrap values from 1,000 resampling replicates were obtained using the Njplot program, which is part of the ClustalX program package. Printed trees were modified using the TreeViewPPC version 1.6.6 software (http://taxonomy.zoology.gla.ac.uk/rod/treeview.html).
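Pairwise sequence identities such as those reported for the group B and group C members were calculated with ClustalX. For readers who want to see what such a figure means, the sketch below computes percent identity from two already-aligned sequences. It is an illustration, not the ClustalX algorithm: it assumes that columns containing a gap in either sequence are excluded from the denominator, and the function name is hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identical residues over alignment columns without gaps ('-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b)
             for a, b in zip(aligned_a.upper(), aligned_b.upper())
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)
```

Other conventions (for instance dividing by the length of the shorter sequence or by the full alignment length) give different percentages, so values are only comparable when the same convention is used throughout.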

HCA Analysis

HCA plots were obtained from the drawhca server on the Internet (http://smi.snv.jussieu.fr/hca/hca-form.html). The actual assessments of the individual HCA plots were done manually as described by Breton et al. (1998).

Pfam

Proteins were analyzed for the presence of known domains using the Pfam HMM (Bateman et al., 2004) available at the St. Louis Pfam server (http://pfam.wustl.edu). The searches were performed using individual global/local search options and a cutoff E-value > 0.1. Only the best hits were reported.

ARAMEMNON

ARAMEMNON (Schwacke et al., 2003; Arabidopsis Membrane Protein Database at http://aramemnon.botanik.uni-koeln.de/) consolidates prediction of transmembrane helices based on several TMD prediction servers. ARAMEMNON uses the following servers: Alom_v2 (http://psort.nibb.ac.jp/form2.html); HmmTop_v2 (http://www.enzim.hu/hmmtop/html/submit.html); MemSat_v1.8 (http://bioinf.cs.ucl.ac.uk/psiform.html); PHDhtm (http://cubic.bioc.columbia.edu/predictprotein/submit_exp.html#top); PHDhtm (http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_phd.html); PredTmr_v1 (http://biophysics.biol.uoa.gr/PRED-TMR/input.html); SosuiG_v1.1 (http://sosui.proteome.bio.tuat.ac.jp/cgi-bin/sosui.cgi?/sosui_submit.html); THUMBUP_v1 (http://phyyz4.med.buffalo.edu/Softwares-Services_files/thumbup.htm); Tmap (http://www.mbb.ki.se/tmap/); TMHMM_v2 (http://www.cbs.dtu.dk/services/TMHMM/); TmPred (http://www.ch.embnet.org/software/TMPRED_form.html); and TopPred_v2 (http://bioweb.pasteur.fr/seqanal/interfaces/toppred.html).

Array Data

Isoxaben array data are available at http://affymetrix.arabidopsis.info/narrays/experimentbrowse.pl.

Distribution of Materials

Upon request, all novel materials described in this publication will be made available in a timely manner for noncommercial research purposes, subject to the requisite permission from any third party owners of all or parts of the material. Obtaining any permission will be the responsibility of the requestor. Access to the novel accessions reported in this manuscript can be requested by e-mail ([email protected]).

Sequence data from this article have been deposited with the EMBL/GenBank data libraries under accession numbers Q8KHJ3, Q9UBX8, Q94BZ8, Q9ZSJ2, Q9M146, Q9C9Q5, A9LTZ5, Q9FF50, O045498, Q9C9Q6, Q9C9Q5, Q9FXA7, Q9M146, Q9ZSJ2, Q9ZSJ0, Q9LZ77, Q9M147, Q9C920, Q9XEE9, Q9C920, Q9LTZ5, Q9FM26, Q9M146, Q9SAD6, Q9LKU7, Q9T0G0, O23479, Q9C990, Q9C991, Q9LKU7, Q9T0G0, and O23479.

ACKNOWLEDGMENTS

Dr. Ahmed Faik, Michigan State University, is acknowledged for instigating this line of research in our lab; Dr. Christelle Breton, INRA France, for helpful discussions and initial HCA analysis; and Dr. Kristian Axelsen, Institute of Plant Biology, The Royal Veterinary and Agricultural University, Denmark, and Swiss Institute of Bioinformatics, Geneva, for helpful discussions and propositions throughout the process. Dr. Julian Gough and Ph.D. student Martin Madera are greatly appreciated for their skillful help with submission to the SUPERFAMILY server. Dr. William G.T. Willats is acknowledged for providing corrected array data.

Received March 17, 2004; returned for revision April 15, 2004; accepted April 20, 2004.

LITERATURE CITED

Amando M, Almeida R, Schwientek T, Clausen H (1999) Identification and characterization of large galactosyltransferase gene families: galactosyltransferases for all functions. Biochim Biophys Acta 1473: 35–53

Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer ELL, et al (2004) The Pfam protein families database. Nucleic Acids Res 32: D138–D141

Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The protein data bank. Nucleic Acids Res 28: 235–242

Bolwell GP, Northcote DH (1983) Arabinan synthase and xylan synthase activities of Phaseolus vulgaris. Subcellular localization and possible mechanism of action. Biochem J 210: 497–507

Bourne Y, Henrissat B (2001) Glycoside hydrolases and glycosyltransferases: families and functional modules. Curr Opin Struct Biol 11: 593–600

Bouton S, Leboeuf E, Mouille G, Leydecker M-T, Talbotec J, Granier G, Lahaye M, Hofte H, Truong N-H (2002) QUASIMODO1 encodes a putative membrane-bound glycosyltransferase required for normal pectin synthesis and cell adhesion in Arabidopsis. Plant Cell 14: 1–14

Breton C, Bettler E, Joziasse DH, Geremia RA, Imberty A (1998) Sequence-function relationships of prokaryotic and eukaryotic galactosyltransferases. J Biochem (Tokyo) 123: 1000–1009

Breton C, Mucha J, Jeanneau C (2001) Structural and functional features of glycosyltransferases. Biochimie 83: 713–718

Carpita N, Tierney M, Campbell M (2001) Molecular biology of the plant cell wall: searching for the genes that define structure, architecture and dynamics. Plant Mol Biol 47: 1–5

Carpita NC (1996) Structure and biogenesis of the cell walls of grasses. Annu Rev Plant Physiol Plant Mol Biol 47: 445–476

Cosgrove DJ (1997) Relaxation in a high-stress environment: the molecular bases of extensible cell walls and cell enlargement. Plant Cell 9: 1031–1041

Costa AA, Gomez FJ, Pereira M, Felipe MS, Jesuino RS, Deepe GS, de Almeida Soares CM (2002) Characterization of a gene which encodes a mannosyltransferase homolog of Paracoccidioides brasiliensis. Microbes Infect 4: 1027–1034

Coutinho PM, Deleury E, Davies GJ, Henrissat B (2003) An evolving hierarchical family classification for glycosyltransferases. J Mol Biol 328: 307–317

Crick DC, Mahapatra S, Brennan PJ (2001) Biosynthesis of the arabinogalactan-peptidoglycan complex of Mycobacterium tuberculosis. Glycobiology 11: 107R–118R

Edwards ME, Dickson CA, Chengappa S, Christopher C, Michael J, Gidley MJ, Grant Reid SJ (1999) Molecular characterisation of a membrane-bound galactosyltransferase of plant cell wall matrix polysaccharide biosynthesis. Plant J 19: 691–697

Encina A, Sevillano JM, Acebes JL, Alvarez J (2002) Cell wall modifications of bean (Phaseolus vulgaris) cell suspensions during habituation and dehabituation to dichlobenil. Physiol Plant 114: 182–191

Faik A, Price NC, Raikhel NV, Keegstra K (2002) An Arabidopsis gene encoding an α-xylosyltransferase involved in xyloglucan biosynthesis. Proc Natl Acad Sci USA 99: 7797–7802

Fry SC (1995) Polysaccharide-modifying enzymes in the plant cell wall.

Annu Rev Plant Physiol Plant Mol Biol 46: 497–520

Gastinel LN (2001) Galactosyltransferases: a structural overview of their

function and reaction mechanisms. Trends Glycosci Glycotechnol 13:

131–145

Geshi N, Jørgensen B, Ulvskov P (2004) Subcellular localization and

topology of b(1,4)galactosyltransferase in potato. Planta 218: 862–868

Gough J, Chothia C (2002) SUPERFAMILY: HMMs representing all

proteins of known structure. SCOP sequence searches, alignments and

genome assignments. Nucleic Acids Res 30: 268–272

Gough J, Karplus K, Hughey R, Chothia C (2001) Assignment of

homology to genome sequences using a library of hidden Markov models

that represent all proteins of known structure. J Mol Biol 313: 903–919

Harper AD, Bar-Peled M (2002) Biosynthesis of UDP-xylose. Cloning and

characterization of a novel Arabidopsis gene family, UXS, encoding

soluble and putative membrane-bound UDP-glucuronic acid decarbox-

ylase isoforms. Plant Physiol 130: 2188–2198

Henrissat B, Coutinho PM, Davies J (2001) A census of carbohydrate-active

enzymes in the genome of Arabidopsis thaliana. Plant Mol Biol 47: 55–72

Hu Y, Walker S (2002) Remarkable structural similarities between diverse

glycosyltransferases. Chem Biol 9: 1287–1296

Iwai H, Masaoka N, Ishii T, Satoh S (2002) A pectin glucuronosyltransfer-

ase gene is essential for intercellular attachment in the plant meristem.

Proc Natl Acad Sci USA 99: 16319–16324

Jones DT (1999) GenTHREADER: an efficient and reliable protein fold

recognition method for genomic sequences. J Mol Biol 287: 797–815

Kelley LA, MacCallum RM, Sternberg MJE (2000) Enhanced genome

annotation using structural profiles in the program 3D-PSSM. J Mol Biol

299: 499–520

Kiino DR, Licudine R, Wilt K, Yang DH, Rothman-Denes LB (1993) A

cytoplasmic protein, NfrC, is required for bacteriophage N4 adsorption.

J Bacteriol 175: 7074–7080

Krogh A, Larsson B, von Heijne G, Sonnhammer ELL (2001) Predicting

transmembrane protein topology with a hidden Markov model: appli-

cation to complete genomes. J Mol Biol 305: 567–580

MacDougal AJ, Rigby NM, Ring SC (1997) Phase separation of plant cell

wall polysaccharides and its implications for cell wall assembly. Plant

Physiol 114: 353–362

Madson M, Dunand C, Li X, Verma R, Vanzin GF, Caplan J, Shoue DA,

Carpita NC, Reiter W-D (2003) The MUR3 gene of Arabidopsis encodes

a xyloglucan galactosyltransferase that is evolutionarily related to

animal exostosins. Plant Cell 15: 1662–1670

McGuffin LJ, Jones DT (2003) Improvement of the GenTHREADER

method for genomic fold recognition. Bioinformatics 19: 874–881

Mohnen D (1999) Biosynthesis of pectins and galactomannans. In BM

Pinto, ed, Comprehensive Natural Products Chemistry, Volume 3:

Carbohydrates and Their Derivatives Including Tannins, Cellulose,

and Related Lignins. Elsevier, Oxford, pp 497–527

Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP: a structural

classification of proteins database for the investigation of sequences and

structures. J Mol Biol 247: 536–540

Nielsen H, Brunak S, von Heijne G (1999) Machine learning approaches

for the prediction of signal peptides and other protein sorting signals.

Protein Eng 12: 3–9

Perrin RM, DeRocher AE, Bar-Peled M, Zeng W, Norambuena L, Orellana

A, Raikhel NV, Keegstra K (1999) Xyloglucan fucosyltransferase,

an enzyme involved in plant cell wall biosynthesis. Science 284:

1976–1979

Persson K, Ly HD, Dieckelmann M, Wakarchuk WW, Withers SG,

Strynadka NCJ (2001) Crystal structure of the retaining

galactosyltransferase LgtC from Neisseria meningitidis in complex with donor

and acceptor sugar analogs. Nat Struct Biol 8: 166–175

Ridley BL, O’Neill MA, Mohnen D (2001) Pectins: structure, bio-

synthesis, and oligogalacturonide-related signaling. Phytochemistry

57: 929–967

Schedletzky E, Shmuel M, Delmer DP, Lamport TA (1990) Adaptation and

growth of tomato cells on the herbicide 2,6-dichlorobenzonitrile leads to

production of unique cell walls virtually lacking a cellulose-xyloglucan

network. Plant Physiol 94: 980–987

Scheible W-R, Eshed R, Richmond T, Delmer D, Somerville C (2001)

Modifications of cellulose synthase confer resistance to isoxaben and

thiazolidinone herbicides in Arabidopsis Ixr1 mutants. Proc Natl Acad

Sci USA 98: 10079–10084

Schwacke R, Schneider A, van der Graaff E, Fischer K,

Catoni E, Desimone M, Frommer WB, Flügge UI, Kunze R (2003)

ARAMEMNON, a novel database for Arabidopsis integral membrane

proteins. Plant Physiol 131: 16–26

Sherrier DJ, VandenBosch KA (1994) Secretion of cell wall polysaccharides

in Vicia root hairs. Plant J 5: 185–195

Sterling JD, Quigley HF, Orellana A, Mohnen D (2001) The catalytic

site of the pectin biosynthetic enzyme α-(1,4)-galacturonosyltransferase

is located in the lumen of the Golgi. Plant Physiol 127:

360–371

Stolz J, Munro S (2002) The components of the Saccharomyces cerevisiae

mannosyltransferase complex M-Pol I have distinct functions in man-

nan synthesis. J Biol Chem 277: 44801–44808

Tarbouriech N, Charnock SJ, Davies GJ (2001) Three-dimensional

structures of the Mn and Mg dTDP complexes of the family GT-2

glycosyltransferase SpsA: a comparison of related NDP-sugar

glycosyltransferases. J Mol Biol 314: 655–661

Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG (1997)

The CLUSTAL_X windows interface: flexible strategies for multiple

sequence alignment aided by quality analysis tools. Nucleic Acids Res

25: 4876–4882

Wimmerova M, Engelsen SB, Bettler E, Breton C, Imberty A (2003)

Combining fold recognition and exploratory data analysis for searching

for glycosyltransferases in the genome of Mycobacterium tuberculosis.

Biochimie 85: 691–700

Wiggins CA, Munro S (1998) Activity of the yeast MNN1 α-1,3-mannosyltransferase

requires a motif conserved in many other families of

glycosyltransferases. Proc Natl Acad Sci USA 95: 7945–7950

Zhang GF, Staehelin LA (1992) Functional compartmentation of the Golgi

apparatus of plant cells. Immunocytochemical analysis of high-pressure

frozen- and freeze-substituted sycamore maple suspension culture cells.

Plant Physiol 99: 1070–1083

Zhou H, Zhou Y (2003) Predicting the topology of transmembrane helical

proteins using mean burial propensity and a hidden-Markov-model-

based method. Protein Sci 12: 1547–1555

Plant Physiol. Vol. 136, 2004

Copyright © 2004 American Society of Plant Biologists. All rights reserved.