Genome-wide analysis of intronless genes in rice and Arabidopsis


ORIGINAL PAPER

Genome-wide analysis of intronless genes in rice and Arabidopsis

Mukesh Jain & Paramjit Khurana & Akhilesh K. Tyagi & Jitendra P. Khurana

Received: 3 February 2007 / Revised: 7 April 2007 / Accepted: 6 May 2007 / Published online: 20 June 2007
© Springer-Verlag 2007

Abstract Intronless genes, a characteristic feature of prokaryotes, constitute a significant portion of eukaryotic genomes. Our analysis revealed the presence of 11,109 (19.9%) and 5,846 (21.7%) intronless genes in the rice and Arabidopsis genomes, respectively, belonging to different cellular role and gene ontology categories. The distribution and conservation of rice and Arabidopsis intronless genes among different taxonomic groups have been analyzed. A total of 301 and 296 intronless genes from rice and Arabidopsis, respectively, are conserved among organisms representing the three major domains of life, i.e., archaea, bacteria, and eukaryotes. These evolutionarily conserved proteins are predicted to be involved in housekeeping cellular functions. Interestingly, among the 68% of rice and 77% of Arabidopsis intronless genes present only in eukaryotic genomes, approximately 51% and 57% of genes have orthologs only in plants, and thus may represent plant-specific genes. Furthermore, 831 and 144 intronless genes of rice and Arabidopsis, respectively, referred to as ORFans, do not exhibit homology to any of the genes in the database and may perform species-specific functions. These data can serve as a resource for further comparative, evolutionary, and functional analysis of intronless genes in plants and other organisms.

Keywords Rice · Arabidopsis · Intronless genes · Evolution

Introduction

Most eukaryotic genes are interrupted by one or more noncoding, intragenic sequences called introns. The availability of complete genomic sequences of different organisms, together with annotated and transcribed sequences, helps delineate the structure of genes by resolving exons and introns. This has resulted in the identification of a number of single-exon (intronless) genes in eukaryotic genomes, although they are considered to be a characteristic feature of prokaryotes. Genome SEGE, a database of intronless genes from nine completely sequenced eukaryotic genomes, is available (Sakharkar and Kangueane 2004). The study of intronless genes from different organisms is particularly interesting because it helps in understanding gene evolution. Recently, evolutionary analyses of intronless genes in human and mouse have been reported (Agarwal and Gupta 2005; Sakharkar et al. 2006). Some large gene families, including G-protein receptors and olfactory receptors in human and mouse, are intronless (Gentles and Karlin 1999; Takeda et al. 2002).

The information on the occurrence and analysis of intronless genes in plants is scanty. To our knowledge, no detailed analysis of intronless genes at the whole-genome level has been performed in plants. However, several isolated reports on the presence of intronless genes belonging to large gene families, such as F-box proteins, DEAD box RNA helicases, and pentatricopeptide repeat (PPR) containing proteins in Arabidopsis, are available (Aubourg et al. 1999; Gagne et al. 2002; Lecharny et al. 2003; Lurin et al. 2004). Recently, we reported all 58 members of the early auxin-responsive SAUR (small auxin-up RNAs) gene family in rice to be intronless (Jain et al. 2006b). The availability of complete genome sequences of rice, the model monocot plant, and Arabidopsis, the model dicot plant, provides not only a genetic blueprint but also an opportunity for studying functional and evolutionary genomics in plants (The Arabidopsis Genome Initiative 2000; Paterson et al. 2004b; International Rice Genome Sequencing Project 2005; Vij et al. 2006). Furthermore, the comparative genomics of rice and Arabidopsis can be used to gain knowledge of gene organization and is particularly helpful in examining genome evolution in both monocot and dicot plants.

Funct Integr Genomics (2008) 8:69–78
DOI 10.1007/s10142-007-0052-9

Electronic supplementary material The online version of this article (doi:10.1007/s10142-007-0052-9) contains supplementary material, which is available to authorized users.

M. Jain · P. Khurana · A. K. Tyagi · J. P. Khurana (*)
Interdisciplinary Centre for Plant Genomics and Department of Plant Molecular Biology, University of Delhi South Campus, Benito Juarez Road, New Delhi 110 021, India
e-mail: [email protected]

In this study, we have identified intronless genes in rice and Arabidopsis by alignments of their annotated gene and protein sequences. Cellular role and gene ontology (GO) category of the intronless genes have been predicted. The distribution and conservation of rice and Arabidopsis intronless genes among different taxonomic groups have also been analyzed. A large fraction of intronless genes were found to be conserved in rice and Arabidopsis. These data provide insights for understanding evolutionary mechanisms underlying gene/genome evolution in plants.

Materials and methods

Identification of intronless genes

All the annotated gene and protein sequences of the 12 rice chromosomes were downloaded using the batch data download tool available on The Institute for Genomic Research (TIGR) Rice Genome Annotation website (Yuan et al. 2005; http://www.tigr.org/tdb/e2k1/osa1/). TIGR Osa1 version 4 of the rice pseudomolecules includes 62,827 gene models, of which 4,734 have alternatively spliced isoforms, resulting in 55,890 non-redundant genes (loci). For Arabidopsis, all the annotated gene and protein sequences were downloaded from The Arabidopsis Information Resource (TAIR) database. TAIR release 6 contains a total of 31,407 gene models, of which 3,159 have splice variants, resulting in 26,973 non-redundant genes (loci). To identify intronless genes, all the protein sequences of rice and Arabidopsis were aligned locally to their gene sequences with the TBLASTN program (Altschul et al. 1997) without any filter, and only the top hit was extracted for each query protein sequence. The results were clustered in three steps. In the first step, proteins that aligned completely with the gene sequence in one block, with 100% identity over the entire length, were chosen. In the second step, the redundancy (different gene models representing the same gene locus) among the selected genes was removed. As a last step, genes annotated as transposable elements (TE) were removed to get a final list of non-redundant intronless genes in rice and Arabidopsis.
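The three filtering steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: it assumes the TBLASTN results are available as tab-separated lines with the hypothetical column order query ID, subject ID, percent identity, alignment length, query length, query start, query end, and gaps, and that gene model IDs take the form locus.N. All function and field names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    model: str     # protein / gene model ID, e.g. "LOC_Os01g01010.1"
    pident: float  # percent identity of the aligned block
    length: int    # alignment length in residues
    qlen: int      # full length of the query protein
    qstart: int    # first aligned query residue (1-based)
    qend: int      # last aligned query residue
    gaps: int      # number of gaps in the alignment


def parse_hits(lines):
    """Yield Hit records from tab-separated lines (assumed column order)."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        yield Hit(f[0], float(f[2]), int(f[3]), int(f[4]),
                  int(f[5]), int(f[6]), int(f[7]))


def locus_of(model_id):
    """Assumption: isoforms are named '<locus>.<n>'; strip the suffix."""
    return model_id.rsplit(".", 1)[0]


def intronless_loci(lines, te_loci=frozenset()):
    seen_models = set()
    loci = set()
    for hit in parse_hits(lines):
        if hit.model in seen_models:
            continue  # keep only the top (first-listed) hit per query
        seen_models.add(hit.model)
        # Step 1: one ungapped block, 100% identical, covering the whole protein
        full_length = (hit.qstart == 1 and hit.qend == hit.qlen
                       and hit.length == hit.qlen)
        if full_length and hit.pident == 100.0 and hit.gaps == 0:
            # Step 2: collapse gene models onto their locus
            loci.add(locus_of(hit.model))
    # Step 3: drop loci annotated as transposable elements
    return sorted(loci - set(te_loci))


sample = [
    "g1.1\tg1\t100.0\t120\t120\t1\t120\t0",   # full-length, single block
    "g1.2\tg1\t100.0\t120\t120\t1\t120\t0",   # second model of the same locus
    "g2.1\tg2\t100.0\t80\t200\t1\t80\t0",     # top hit covers one exon only
    "g2.1\tg2\t100.0\t120\t200\t81\t200\t0",  # later hit of g2.1, ignored
    "g3.1\tg3\t100.0\t90\t90\t1\t90\t0",      # full-length but annotated as TE
]
print(intronless_loci(sample, te_loci={"g3"}))
```

In this toy input, only locus g1 survives all three steps: g2 is rejected because its top hit does not span the whole protein, and g3 is removed as a transposable element.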

Analysis of intronless genes

Cellular role and GO category of all the intronless genes were predicted with the ProtFun 2.2 server. ProtFun predicts protein function or GO category solely from the protein sequence given as input; it relies not on sequence similarity but on sequence-derived features such as predicted posttranslational modifications, protein sorting signals, and physical/chemical properties calculated from amino acid composition (Jensen et al. 2002, 2003). Therefore, ProtFun allows the function of even ORFan proteins, for which no homologs can be found, to be predicted. The functional categorization of some representative proteins with known function was also confirmed manually. The presence or absence of these intronless genes in organisms belonging to different taxonomic groups was determined with the BLink (BLAST-Link) tool available at NCBI. BLink displays precomputed protein BLAST alignments of each query protein sequence in the Entrez databases by taxonomic criteria (Wheeler et al. 2005). PERL scripts and JAVA parsing tools were used to automate these steps.

Paralog identification

To identify paralogs of rice and Arabidopsis intronless genes, their protein sequences were aligned with themselves and with the protein sequences of intron-containing genes in an all-against-all BLASTP search. Two loci were defined as paralogs if they met an expect value cut-off of 1e-10 or less, with at least 20% identity over 70% of the average length of the two sequences.
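The paralog criterion can be written as a small predicate. The sketch below is illustrative only; in particular, it assumes that "over 70% of the average length" means the alignment length must be at least 70% of the mean of the two protein lengths, which is one reasonable reading of the criterion. The function name and parameters are hypothetical.

```python
def is_paralog_pair(evalue, pident, aln_len, len_a, len_b,
                    max_evalue=1e-10, min_identity=20.0, min_coverage=0.70):
    """Return True if a BLASTP hit between two loci meets the paralog cut-offs.

    evalue  -- expect value of the hit
    pident  -- percent identity of the alignment
    aln_len -- alignment length in residues
    len_a, len_b -- lengths of the two protein sequences
    """
    mean_len = (len_a + len_b) / 2
    # Assumed interpretation: alignment must span >= 70% of the mean length
    covers_enough = aln_len >= min_coverage * mean_len
    return evalue <= max_evalue and pident >= min_identity and covers_enough


# Two proteins of 200 and 180 residues: the mean length is 190,
# so the alignment must span at least 133 residues.
print(is_paralog_pair(1e-30, 35.0, 150, 200, 180))  # long enough alignment
print(is_paralog_pair(1e-30, 35.0, 100, 200, 180))  # alignment too short
```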

Conserved intronless genes in rice and Arabidopsis

To identify intronless genes conserved between rice and Arabidopsis, the protein sequences of all the predicted intronless genes of Arabidopsis were aligned against the protein sequences of all the predicted intronless genes of rice, and vice versa, by BLASTP. An expect value cut-off of 1e-6 or less was used to identify the conserved intronless genes in rice and Arabidopsis.
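A minimal sketch of this two-way comparison is given below. It assumes the BLASTP hits of each search have already been parsed into (query ID, subject ID, expect value) tuples; the function name and the gene IDs in the example are hypothetical, and whether the authors additionally required reciprocal best hits is not stated, so the sketch applies only the expect value cut-off.

```python
def conserved_between(a_vs_b_hits, b_vs_a_hits, max_evalue=1e-6):
    """Return the queries of each search that have at least one hit
    in the other species at or below the expect value cut-off.

    Each argument is an iterable of (query_id, subject_id, evalue) tuples.
    """
    a_conserved = {q for q, _subject, e in a_vs_b_hits if e <= max_evalue}
    b_conserved = {q for q, _subject, e in b_vs_a_hits if e <= max_evalue}
    return a_conserved, b_conserved


# Hypothetical example: rice queries against Arabidopsis, and vice versa
rice_vs_ath = [("Os_a", "At_x", 1e-40), ("Os_b", "At_y", 0.5)]
ath_vs_rice = [("At_x", "Os_a", 1e-38), ("At_z", "Os_b", 1e-3)]
rice_set, ath_set = conserved_between(rice_vs_ath, ath_vs_rice)
print(sorted(rice_set), sorted(ath_set))
```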

Results and discussion

Identification of intronless genes in rice and Arabidopsis

The intronless genes in rice (Oryza sativa subsp. japonica cv Nipponbare) and Arabidopsis (Arabidopsis thaliana) were identified in three steps. The first step involved the TBLASTN alignment of all the annotated protein and gene sequences. In the second step, redundant sequences representing the same gene locus were identified and removed to obtain a set of non-redundant intronless genes. Finally, all the predicted intronless genes annotated as transposable elements (TEs) were removed. The overall analysis revealed the presence of 11,109 and 5,846 intronless genes in the rice and Arabidopsis genomes, respectively.
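As a worked check, these counts can be related to the non-redundant locus totals given in Materials and methods (55,890 loci for rice and 26,973 for Arabidopsis). The short sketch below only reproduces that arithmetic; the variable and function names are illustrative.

```python
# (intronless genes, total non-redundant loci) as reported in the text
counts = {
    "rice": (11_109, 55_890),
    "Arabidopsis": (5_846, 26_973),
}


def percent(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


for species, (intronless, total) in counts.items():
    print(f"{species}: {percent(intronless, total)}% of loci are intronless")
```

The results agree with the 19.9% and 21.7% quoted in the abstract.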

The locus IDs and gene descriptions of the predicted intronless genes in rice and Arabidopsis are provided in Supplemental data files 1 and 2, respectively. The number of predicted intronless genes in Arabidopsis in our study is slightly higher than that predicted earlier (5,810) (Sakharkar and Kangueane 2004). This slight increase may be because of the use of a more recent version (TAIR6) of the Arabidopsis genome annotation in this study. A BLAST search of the predicted protein sequences of the 11,109 intronless genes of japonica rice against the annotated proteins of the indica rice (cv 93-11) genome available at the BGI RISe Rice Genome Database (http://rise.genomics.org.cn; Yu et al. 2005) revealed that at least 68% of these genes are conserved (with ≥90% identity over the entire length) in both subspecies (data not shown). This percentage may increase once a more exhaustive and accurate annotation of these two genomes becomes available.

The identified intronless genes are distributed on all the rice (6–13%) and Arabidopsis (15–25%) chromosomes (Fig. 1). In rice, the highest number of intronless genes is predicted on chromosome 1, the longest chromosome. Similarly, the longest chromosomes of Arabidopsis, 1 and 5, have the highest number of intronless genes. The distribution of intronless genes in terms of number of genes per megabase (MB) appears uniform among rice (26.6–32.8 genes per MB) and Arabidopsis (45.9–51.7 genes per MB) chromosomes. Interestingly, a strong bias towards shorter length was found for the proteins that the intronless genes encode. A strikingly large percentage of these proteins (about 39% in rice and 29% in Arabidopsis) are 101 to 200 amino acids in length. While the exact relevance of this observation is not clear, it may have some evolutionary significance. Although the distribution of intronless genes on a rice or Arabidopsis chromosome is random, several clusters of intronless genes are evident on some rice and Arabidopsis chromosomes. Generally, these clusters represent the members of multigene families, for example, auxin-responsive genes and those encoding F-box proteins, zinc finger proteins, cytochrome P450 proteins, AP2 domain proteins, and leucine-rich repeat proteins, that are arranged in tandem repeats; however, some of these tandemly repeated genes are interrupted by other genes and are not included in strictly defined tandem repeats. These results are consistent with the surprising outcome of rice and Arabidopsis genome analysis that large percentages (14–17%) of their genes are arranged in tandem repeats (The Arabidopsis Genome Initiative 2000; International Rice Genome Sequencing Project 2005).

Recently, we found that 17 (29%) members of the intronless SAUR gene family are clustered together in tandem on rice chromosome 9 (Jain et al. 2006b). It has been suggested that intronless gene families can evolve rapidly either by gene duplication or by reverse transcription/integration (Glusman et al. 2000; Lecharny et al. 2003; Lurin et al. 2004; Jain et al. 2006b). Recently, Yu et al. (2005) presented evidence for ongoing individual gene duplications in rice, which provide a never-ending supply of raw material for studying gene genesis and function.

Functional categorization

To learn about the functions of predicted rice and Arabidopsis intronless genes, their annotations available at TIGR and TAIR, respectively, were explored. A significantly large percentage (60%) of rice intronless genes are annotated as hypothetical (39%) or expressed (21%) proteins. This indicates that the functions of most of these proteins are not known in rice, or that some of the predicted hypothetical proteins may be the result of incorrect annotations. However, only 37% of Arabidopsis intronless genes are annotated as hypothetical (6%) or expressed (31%) proteins. Putative functions have been assigned to the other 40% and 63% of intronless genes in rice and Arabidopsis, respectively.

Fig. 1 Intronless genes in rice and Arabidopsis. Summary of distribution of intronless genes on rice (a) and Arabidopsis (b) chromosomes. Both percentages and numbers are shown (x-axis: chromosome number; y-axes: number of genes and percentage of genes)

Several of these genes belong to large gene families involved in various pathways, including protein synthesis and turnover (ribosomal proteins, F-box proteins), signal transduction (auxin-responsive genes, protein kinases, pentatricopeptide repeat [PPR] proteins, leucine-rich repeat [LRR] proteins), DNA binding (zinc finger proteins, AP2 domain proteins), metabolism (cytochrome P450 proteins), and disease resistance. These families are overrepresented in both plants, indicating that intronless genes perform crucial functions in plant growth and development.

To further explore the functions of intronless genes, including those annotated as expressed or hypothetical proteins, their cellular role and GO category were determined using ProtFun (Fig. 2; Supplemental data files 3 and 4). This analysis revealed that the largest number of rice and Arabidopsis intronless genes fall into the translation and energy metabolism functional categories, followed by cell envelope and amino acid biosynthesis. Metabolism also represents the most abundant functional category in rice at the whole-genome level (Goff et al. 2002; International Rice Genome Sequencing Project 2005). However, a significantly larger number of intronless genes fall into the translation, cell envelope, and amino acid biosynthesis functional categories compared to total genes, indicating their major role in basic cellular processes. Although the percentages of intronless genes predicted under various cellular role categories are similar for rice and Arabidopsis, the numbers vary significantly and are in fact higher in rice, suggesting that the components of these pathways have evolved more in rice. However, 11.14% of Arabidopsis intronless genes belong to the amino acid biosynthesis category, as compared to only 6.74% in rice. Although there is a large difference in the percentage, the number of genes included in this category is nearly the same for rice (748) and Arabidopsis (651), suggesting that components of some of the pathways are highly conserved and that the corresponding genes/gene families have not expanded during evolution. Further, about 76% and 73% of the intronless genes of rice and Arabidopsis, respectively, could be associated with a GO category, with growth factor (more than 30%) representing the most abundant category in both plants. The proteins involved in transcription regulation, transport, and immune response, as well as structural proteins, are also well represented. About 14.41% (1,215) of the rice intronless genes come under the structural protein category, whereas this category accounts for only 4.77% (204) of the genes in Arabidopsis, indicating a significantly large expansion of these genes in rice. The intronless genes involved in various cellular processes are also predicted under the immune response GO category. However, the biological relevance of this prediction is not clear. Altogether, these results shed some light on the functional significance of the proteins encoded by intronless genes in plants.

Fig. 2 Functional categorization of rice and Arabidopsis intronless genes. Cellular role (a, c) and GO category (b, d) were determined for rice (a, b) and Arabidopsis (c, d) intronless genes by ProtFun and percentage of genes included in each category are given. Cellular role categories are: AAB amino acid biosynthesis, BC biosynthesis of cofactors, CE cell envelope, CP cellular processes, CIM central intermediary metabolism, EM energy metabolism, FAM fatty acid metabolism, PP purines and pyrimidines, RF regulatory functions, RT replication and transcription, T translation, TB transport and binding. GO categories are: ST signal transducer, R receptor, H hormone, SP structural protein, T transporter, IC ion channel, VGIC voltage-gated ion channel, CC cation channel, TF transcription, TR transcription regulation, SR stress response, IR immune response, GF growth factor

Taxonomic distribution

To investigate the evolutionary conservation of rice and Arabidopsis intronless genes among different taxonomic groups (archaea, bacteria, fungi, metazoan, plants, and other eukaryotes), the precomputed BLAST results of their protein sequences were retrieved through BLink. BLink results could be retrieved for 7,658 and 5,470 intronless genes in rice and Arabidopsis, respectively. The results were analyzed in three steps. In the first step, genes conserved among different domains of life (viz. archaea, bacteria, and eukaryotes) were clustered for both rice and Arabidopsis (Fig. 3a,c; Supplemental data files 5 and 6). In the second step, the intronless genes limited specifically to one taxonomic group, or to one group in combination with another, were clustered (Fig. 3b,d; Supplemental data files 7 and 8). As a final step, the cellular role category of the intronless genes present in each taxonomic group or combination was determined (Tables 1 and 2). Such clustering provides important information on the evolution and functional conservation of intronless genes. Interestingly, 301 rice and 296 Arabidopsis intronless genes are conserved across all domains of life. These proteins are predicted to be involved in basic cellular processes (e.g., translation and energy metabolism) and are probably essential for survival in all domains of life. These intronless genes can be termed slowly evolving genes. Consistent with this idea, it has been proposed that essential genes evolve more slowly than non-essential genes (Wilson et al. 1977). Moreover, it has been demonstrated that functionally important genes are more evolutionarily conserved than less vital genes (Jordan et al. 2002).

Furthermore, of the 343 and 350 intronless genes of rice and Arabidopsis, respectively, shared by archaea and eukaryotes, 42 and 54 are present only in these two domains; the remaining 301 and 296 genes are also present in bacteria. Similarly, 1,281 (16.7%) and 766 (14%) intronless genes of rice and Arabidopsis, respectively, are shared by bacteria and eukaryotes only. These groups of proteins are well represented in the following functional classes: translation, cell envelope, energy metabolism, regulatory functions, and amino acid metabolism. None of the rice and Arabidopsis intronless genes has homologs in archaea alone, or in archaea and bacteria alone.
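The domain-level clustering described here amounts to labelling each gene by the set of domains in which homologs were found and then counting the labels. The sketch below illustrates the idea; it assumes the BLink results have already been reduced to one set of domain names per gene (excluding the query itself), which is a simplification, and all names and the toy input are hypothetical.

```python
from collections import Counter

# Order fixes the label spelling, e.g. "ABE" rather than "EBA"
DOMAIN_CODES = (("archaea", "A"), ("bacteria", "B"), ("eukaryotes", "E"))


def domain_class(domains_with_hits):
    """Map a set of domain names to a label such as 'E', 'BE' or 'ABE'.

    A gene with no homolog in any domain is labelled 'ORFan'.
    """
    found = set(domains_with_hits)
    code = "".join(letter for name, letter in DOMAIN_CODES if name in found)
    return code or "ORFan"


def tally(gene_to_domains):
    """Count genes per domain label."""
    return Counter(domain_class(d) for d in gene_to_domains.values())


# Hypothetical toy input: gene ID -> domains with at least one homolog
toy = {
    "g1": {"eukaryotes"},
    "g2": {"bacteria", "eukaryotes"},
    "g3": {"archaea", "bacteria", "eukaryotes"},
    "g4": set(),
    "g5": {"eukaryotes"},
}
print(tally(toy))

# Consistency check on the reported rice counts: genes shared by archaea and
# eukaryotes = those found only there (AE) + those also found in bacteria (ABE)
print(42 + 301)
```

With labels of this kind, the total shared by archaea and eukaryotes is simply the sum of the AE and ABE classes, which for rice gives the 343 genes quoted above.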

Fig. 3 Distribution of rice and Arabidopsis intronless genes among different taxonomic groups. a, c Venn diagram showing the classification of rice (a) and Arabidopsis (c) intronless genes into different domains (archaea, bacteria, and eukaryotes) of life. b, d Distribution of rice (b) and Arabidopsis (d) intronless genes specific to different taxonomic group combinations. A archaea, B bacteria, E eukaryotes, AB archaea and bacteria, AE archaea and eukaryotes, BE bacteria and eukaryotes, ABE archaea, bacteria, and eukaryotes, F fungi, M metazoan, P plants, OE other eukaryotes, AP archaea and plants, BP bacteria and plants, MP metazoan and plants, FP fungi and plants, OEP other eukaryotes and plants

Notably, 68% of rice and 77% of Arabidopsis intronless genes are present only in eukaryotic genomes. In addition, several rice and Arabidopsis intronless genes conserved in eukaryotic genomes are limited to a single taxonomic group (fungi, metazoan, plants, or other eukaryotes) or to one group together with plants (metazoan and plants, fungi and plants, or other eukaryotes and plants). For example, 51% and 57% of the intronless genes of rice and Arabidopsis, respectively, have orthologs only in plants and may represent plant-specific proteins. These proteins are associated with diverse cellular role categories, including translation, energy metabolism, cell envelope, amino acid biosynthesis, and transport and binding. These proteins may have evolved only in plants or may have been lost from other organisms during evolution. Interestingly, some of the rice and Arabidopsis intronless genes are present, apart from plants, only in archaea or bacteria. Furthermore, 30 rice and one Arabidopsis intronless genes have homologs in bacteria only. These genes may represent examples of lateral gene transfer (LGT) events from archaea or bacteria to plants. Although LGT has been acknowledged as a major force in the evolution of prokaryotic genomes (Boucher et al. 2003), several recent lines of evidence indicate that this process also occurs in the evolution of eukaryotic genomes, including plants (Rujan and Martin 2001; Copley and Dhillon 2002; Andersson 2005).

Table 1 Distribution of rice intronless genes in different taxonomic groups according to their functional category

Functional category     B    AE    BE   ABE     E     F     M     P    OE    AP    BP    MP    FP   OEP
AAB                     0    10    83    25   406     0     0   349     0     0    14    23     9     3
BC                      0     0    25     9   106     0     0    81     0     0     4     6     4     1
CE                      4     1   221    43   808     1     1   638     0     0    48    56    28     8
CP                      0     0     0     0     1     0     0     0     0     0     0     0     0     0
CIM                     1     1   106    31   402     0     0   288     1     0    11    38     8     4
EM                      6     6   189    55   984     2     8   766     0     0    38    74    16    11
FAM                     0     1    12     1    42     0     3    36     0     1     2     0     0     0
PP                      1     2    69    21   329     0     0   216     0     0    14    37    17     4
RF                      3     3    92    12   259     0     5   165     0     0    14    35     4     2
RT                      1     0    39     3    86     0     1    59     1     0     5     4     4     1
T                       9    18   379    62 1,323     7    12   984     0     1    89    94    24    10
TB                      5     0    66    39   425     0     0   321     0     0    17    33     5     2
Total                  30    42 1,281   301 5,172    10    30 3,903     2     2   256   400   119    46

AAB Amino acid biosynthesis, BC biosynthesis of cofactors, CE cell envelope, CP cellular processes, CIM central intermediary metabolism, EM energy metabolism, FAM fatty acid metabolism, PP purines and pyrimidines, RF regulatory functions, RT replication and transcription, T translation, TB transport and binding, B bacteria, AE archaea and eukaryotes, BE bacteria and eukaryotes, ABE archaea, bacteria, and eukaryotes, E eukaryotes, F fungi, M metazoan, P plants, OE other eukaryotes, AP archaea and plants, BP bacteria and plants, MP metazoan and plants, FP fungi and plants, OEP other eukaryotes and plants

Table 2 Distribution of Arabidopsis intronless genes in different taxonomic groups according to their functional category

Functional category     B    AE    BE   ABE     E     F     M     P    AP    BP    MP    FP   OEP
AAB                     0     6    48    38   511     0     0   423     0     6    18     9     2
BC                      0     2    14    16    93     0     0    79     0     2     7     0     1
CE                      1     1   193    39   672     0     0   550     0    30    37    17    12
CP                      0     0     0     0     0     0     0     0     0     0     0     0     0
CIM                     0     2    80    24   381     1     0   238     0     8    35    16     5
EM                      0    16   100    41   509     0     0   419     1    15    20    11     4
FAM                     0     0     7     0    34     0     1    22     0     1     1     1     1
PP                      0     0    28    28   220     0     0   148     0     5    26     6     0
RF                      0     4    41     7   294     0     0   182     0     3    33     3    11
RT                      0     0    23     5    87     0     0    60     0     2     5     1     1
T                       0    23   181    67 1,077     0     3   774     1    17    40     8    30
TB                      0     0    51    30   328     0     1   226     0     7    20     1     5
Total                   1    54   766   295 4,206     1     5 3,121     2    96   242    73    72

The description of functional categories and taxonomic groups is the same as given for Table 1.

ORFans

The species-specific unique sequences that share no significant sequence similarity with any ORFs outside the genome where they reside are referred to as ORFans, and, in particular, singleton ORFans (Fischer and Eisenberg 1999; Siew and Fischer 2003a, b). The complete genome sequencing of several organisms has demonstrated that ORFans are integral components of most sequenced genomes and that the percentage of singleton ORFans can be as high as 60%, as in the case of the human malaria parasite Plasmodium falciparum (Gardner et al. 2002; Siew and Fischer 2003a, b). Our study reveals that 831 (7.48%) intronless genes represent ORFans in rice (Fig. 3a; Supplemental data file 9). However, only 144 (2.46%) intronless genes represent ORFans in Arabidopsis (Fig. 3c; Supplemental data file 9). We investigated these genes to learn about the functions they encode.

Most rice intronless ORFans have been annotated as hypothetical proteins and a few others as expressed proteins, suggesting the possibility of mis-annotation. Therefore, to confirm the transcription of rice ORFans, their expression was analyzed using the gene expression evidence search page (http://www.tigr.org/tdb/e2k1/osa1/locus_expression_evidence.shtml) available at the TIGR rice genome annotation website. This analysis revealed that one or more massively parallel signature sequencing (MPSS) tags, FL-cDNAs, ESTs, and/or peptide sequences were available for only 292 (35%) of the 831 rice ORFans, indicating that only some of these genes are expressed and encode functional proteins. Conversely, most of the Arabidopsis intronless ORFans represent expressed genes. This can explain the large difference in the number of predicted intronless ORFans between rice and Arabidopsis. Putative functions have been assigned to only three and ten intronless ORFans of rice and Arabidopsis, respectively. Taken together, it can be speculated that ORFans may have evolved only recently in these organisms and perform specific functions that have not been characterized so far. The other possibility is that these proteins were also present in other organisms but were lost during evolution. However, the possibility that some of these proteins will move from the ORFan to the non-ORFan category cannot be ruled out as the genome sequences of more organisms become available. Nevertheless, the functions of ORFans, the most interesting genome content of these two groups of angiosperms, remain to be discovered.

Furthermore, because only a few ORFans have been experimentally characterized, it has been suggested that most ORFans are not likely to correspond to functional proteins, but rather to rapidly evolving proteins with non-essential roles (Schmid and Aquadro 2001; Long 2001; Domazet-Loso and Tautz 2003). To investigate the functions of rice and Arabidopsis intronless ORFans, their distribution among different cellular role categories was determined using ProtFun (Fig. 4). Surprisingly, about 41% of rice and 38% of Arabidopsis intronless ORFans belong to the translation category, a protein machinery that is highly conserved among different organisms. Another functional category well represented among ORFans is energy metabolism. These results indicate that even the components of basic cellular machinery, such as translation and metabolic processes, are rapidly evolving, and several of these may perform species-specific functions.

Paralogs of intronless genes

Plant genomes seem to have evolved through polyploidization and subsequent gene loss (Bancroft 2002; Bowers et al. 2003; Paterson et al. 2004a). Polyploidization gives rise to gene duplications and some of these duplicated genes are retained as paralogs during evolution. The analysis of ancient polyploids shows that a much larger fraction of functional paralogs is actually retained (Veitia 2005). In paleopolyploid yeast, Saccharomyces cerevisiae, about 12% of paralogs have retained functionality for 100 million years (Kellis et al. 2004). The fraction of retained paralogs rises to 72% in maize, through 11 million years of evolution (Ahn and Tanksley 1993); however, their functionality is yet to be assessed.

Similarly, rice and Arabidopsis also provide many examples of functional paralogs retained over several cycles of polyploidization (Veitia 2005; Chapman et al. 2006). To identify the paralogs of rice and Arabidopsis intronless genes, BLAST searches of their protein sequences were performed against themselves and against intron-containing genes. This analysis revealed that 1,709 (15%) and 1,359 (23%) of rice and Arabidopsis intronless genes, respectively, have at least one intronless or intron-containing paralog (Fig. 5). Many of these genes encode proteins containing conserved domains and are implicated in critical processes of plant growth and development. The preferential

Fig. 4 Distribution of rice and Arabidopsis intronless ORFans in various cellular role categories (y-axis: percentage; x-axis: cellular role, i.e., AAB, BC, CE, CP, CIM, EM, FAM, PP, RF, RT, T, TB; series: Arabidopsis and rice). The description of cellular role categories is similar to that given in the legend to Fig. 2

Funct Integr Genomics (2008) 8:69–78 75

retention of duplicated genes encoding key domains can be explained partly by gene homogenization processes (Chapman et al. 2006). The selective advantage of retaining duplicated genes may lie in the buffering of crucial functions (Chapman et al. 2006). Among these, 121 rice and 91 Arabidopsis intronless genes have both intronless and intron-containing paralogs. Another 127 and 82 intronless genes of rice and Arabidopsis, respectively, have only intron-containing paralogs. The genesis of intron-containing paralogs from intronless genes is explained by recombination events with reverse-transcribed pre-mRNAs (Boudet et al. 2001; Lecharny et al. 2003). Examples of gene families that evolved diverse exon/intron structures by duplication from a single ancestral gene are found in both plants and animals (Gotoh 1998; Boudet et al. 2001; Jain et al. 2006a). Moreover, the prevalence of intron gain over intron loss in the evolution of paralogous gene families has been demonstrated in eukaryotes (Babenko et al. 2004), indicating the evolution of intron-containing paralogs from intronless genes.
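The three paralog classes discussed above (only intronless paralogs, only intron-containing paralogs, or both) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the input is taken to be a precomputed mapping from each gene to its paralogs (for example, derived from BLAST hits), and the gene identifiers, data structures, and function name are hypothetical rather than the procedure actually used. The final lines only check that the per-class counts given in the text and in Fig. 5 add up to the reported totals.

```python
from collections import Counter


def classify_paralogs(paralogs, intronless):
    """Classify each intronless gene by the kinds of paralogs it has.

    paralogs   : dict mapping gene ID -> set of paralog gene IDs
    intronless : set of gene IDs annotated as intronless
    """
    counts = Counter()
    for gene in intronless:
        hits = paralogs.get(gene, set()) - {gene}  # ignore self-hits
        if not hits:
            continue  # gene has no paralog, so it is not counted
        has_intronless = any(h in intronless for h in hits)
        has_intron = any(h not in intronless for h in hits)
        if has_intronless and has_intron:
            counts["both"] += 1
        elif has_intronless:
            counts["only_intronless"] += 1
        else:
            counts["only_intron_containing"] += 1
    return counts


# Hypothetical toy data: g1-g4 are intronless, g5 and g6 contain introns.
toy_intronless = {"g1", "g2", "g3", "g4"}
toy_paralogs = {"g1": {"g2"}, "g2": {"g1", "g5"}, "g3": {"g6"}, "g4": set()}
toy_counts = classify_paralogs(toy_paralogs, toy_intronless)
print(dict(toy_counts))

# Reported counts (only intronless, both, only intron-containing).
reported = {"rice": (1461, 121, 127), "Arabidopsis": (1186, 91, 82)}
totals = {species: sum(values) for species, values in reported.items()}
print(totals)  # {'rice': 1709, 'Arabidopsis': 1359}
```

The totals match the 1,709 rice and 1,359 Arabidopsis intronless genes reported to have at least one paralog.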

Conserved intronless genes in rice and Arabidopsis

ProtFun analysis of functional category and GO annotation showed consistency between rice and Arabidopsis intronless proteins without strong biases, as described earlier. Further, by reciprocal BLASTP searches, 4,362 (74.6%) of the proteins encoded by intronless genes in Arabidopsis were found to have orthologs in rice that are also intronless (Fig. 6a). This estimate is lower than that reported recently (about 90%) for all Arabidopsis genes (International Rice Genome Sequencing Project 2005). Nearly 16.6% (972) of the Arabidopsis intronless genes are highly conserved (expect value <E-100) in rice and Arabidopsis and may perform essentially similar functions in these plants. These proteins are involved in basic cellular processes such as cell envelope, purines and pyrimidines, amino acid biosynthesis, and transport and binding (Fig. 6b). Several gene families, including LRR repeat proteins, pentatricopeptide repeat proteins, protein kinases, disease resistance proteins, and cytochrome P450 genes, are overrepresented in this list. The other 1,484 Arabidopsis intronless genes lack significant homology to rice intronless genes. Most of these are classified as hypothetical or expressed proteins, which suggests that some of these may be inaccurately predicted and others could be dicot-specific. The comparative analysis of rice and Arabidopsis at the whole-genome level also showed

Fig. 5 Venn diagram showing paralogs of rice and Arabidopsis intronless genes. The numbers of rice and Arabidopsis intronless genes with intronless and/or intron-containing paralogs are given. Rice: 1,461 with only intronless paralogs, 121 with both intronless and intron-containing paralogs, and 127 with only intron-containing paralogs. Arabidopsis: 1,186 with only intronless paralogs, 91 with both, and 82 with only intron-containing paralogs

Fig. 6 Conservation among rice and Arabidopsis intronless genes. a Protein sequences of predicted Arabidopsis intronless genes were aligned with those of rice intronless genes (BLASTP E-value ≤E-6). The numbers of proteins with expect values from <E-180 (high homology) to ≤E-6 (low homology) are shown in intervals spanning ten exponents (for example, <−180, −180 to −171, −170 to −161 and so on; y-axis: number of genes). b Distribution of highly conserved (<E-100) intronless genes in various cellular role categories (y-axis: number of genes; x-axis: cellular role, i.e., AAB, BC, CE, CP, CIM, EM, FAM, PP, RF, RT, T, TB). The description of cellular role categories is similar to that given in the legend to Fig. 2


that several unknown and some of the well-characterized Arabidopsis genes do not have rice orthologs and vice versa (Goff et al. 2002; Delseny 2003; International Rice Genome Sequencing Project 2005). The basic difference between monocots and dicots will become clear only when the functions of these largely unknown genes (both intronless and intron-containing) are determined.
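The reciprocal BLASTP comparison and the E-value intervals of Fig. 6a can be sketched in Python as follows. This is a minimal sketch under stated assumptions: orthologs are approximated here by reciprocal best hits, which is one common reading of "reciprocal BLASTP searches" and not necessarily the exact criterion applied; the hit tuples are assumed to have been parsed beforehand from BLASTP output; and all gene identifiers and function names are hypothetical.

```python
import math


def best_hits(hits, max_evalue=1e-6):
    """Keep, for each query, the subject with the smallest E-value.

    hits: iterable of (query, subject, evalue) tuples.
    Hits with an E-value above `max_evalue` are discarded.
    """
    best = {}
    for query, subject, evalue in hits:
        if evalue > max_evalue:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


def reciprocal_best_hits(a_vs_b, b_vs_a, max_evalue=1e-6):
    """Return {gene_a: (gene_b, evalue)} for pairs that are mutual best hits."""
    best_ab = best_hits(a_vs_b, max_evalue)
    best_ba = best_hits(b_vs_a, max_evalue)
    return {
        a: (b, e)
        for a, (b, e) in best_ab.items()
        if best_ba.get(b, (None,))[0] == a
    }


def exponent_bin(evalue, width=10):
    """Lower edge of the E-value exponent interval (ten exponents wide)."""
    if evalue == 0.0:
        return -math.inf  # BLAST reports 0.0 for the strongest hits
    exponent = math.floor(math.log10(evalue))
    return width * math.floor(exponent / width)


# Hypothetical toy hits (Arabidopsis-like vs rice-like identifiers).
at_vs_os = [
    ("At1", "Os1", 1e-120),
    ("At1", "Os2", 1e-30),
    ("At2", "Os2", 1e-10),
    ("At3", "Os3", 1e-3),   # above the cutoff, discarded
]
os_vs_at = [
    ("Os1", "At1", 1e-118),
    ("Os2", "At1", 1e-28),
    ("Os3", "At3", 1e-3),
]

pairs = reciprocal_best_hits(at_vs_os, os_vs_at)
print(pairs)                 # only At1/Os1 is a mutual best hit
print(exponent_bin(2e-175))  # falls in the -180 to -171 interval
```

In this toy case At2 finds Os2, but Os2's best hit is At1, so the pair is not reciprocal and is dropped; the same filter on real data would remove one-directional matches between members of large gene families.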

In conclusion, based on our large-scale alignments of annotated gene and protein sequences in rice and Arabidopsis, we estimate that about one-fifth of genes are intronless in these plants. The biological role of such a large number of intronless genes in their genomes is perplexing. The functional characterization of intronless ORFans of rice and Arabidopsis can help understand the basic difference between monocots and dicots. We believe that these results will help improve our understanding of the differential selection (as a process or force) of intronless genes in plants and other eukaryotic genomes. Furthermore, the datasets provided can serve as a resource for comparative, evolutionary, and functional studies.

Acknowledgments We are thankful to Rashmi Jain for technical assistance. This work was supported financially by the Department of Biotechnology, Government of India, and the University Grants Commission, New Delhi. MJ acknowledges the Council of Scientific and Industrial Research, New Delhi, for the award of Senior Research Fellowship.

References

Agarwal SM, Gupta J (2005) Comparative analysis of human intronless proteins. Biochem Biophys Res Commun 331:512–519

Ahn S, Tanksley SD (1993) Comparative linkage maps of the rice and maize genomes. Proc Natl Acad Sci U S A 90:7980–7984

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402

Andersson JO (2005) Lateral gene transfer in eukaryotes. Cell Mol Life Sci 62:1182–1197

Aubourg S, Kreis M, Lecharny A (1999) The DEAD box RNA helicase family in Arabidopsis thaliana. Nucleic Acids Res 27:628–636

Babenko VN, Rogozin IB, Mekhedov SL, Koonin EV (2004) Prevalence of intron gain over intron loss in the evolution of paralogous gene families. Nucleic Acids Res 32:3724–3733

Bancroft I (2002) Insights into cereal genomes from two draft genome sequences of rice. Genome Biol 3: Reviews 1015.1–1015.3

Boucher Y, Douady CJ, Papke RT, Walsh DA, Boudreau ME, Nesbo CL, Case RJ, Doolittle WF (2003) Lateral gene transfer and the origins of prokaryotic groups. Annu Rev Genet 37:283–328

Boudet N, Aubourg S, Toffano-Nioche C, Kreis M, Lecharny A (2001) Evolution of intron/exon structure of DEAD helicase family genes in Arabidopsis, Caenorhabditis, and Drosophila. Genome Res 11:2101–2114

Bowers JE, Chapman BA, Rong J, Paterson AH (2003) Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature 422:433–438

Chapman BA, Bowers JE, Feltus FA, Paterson AH (2006) Buffering of crucial functions by paleologous duplicated genes may contribute cyclicality to angiosperm genome duplication. Proc Natl Acad Sci U S A 103:2730–2735

Copley SD, Dhillon JK (2002) Lateral gene transfer and parallel evolution in the history of glutathione biosynthesis genes. Genome Biol 3:1–25

Delseny M (2003) Towards an accurate sequence of the rice genome. Curr Opin Plant Biol 6:101–105

Domazet-Loso T, Tautz D (2003) An evolutionary analysis of orphan genes in Drosophila. Genome Res 13:2213–2219

Fischer D, Eisenberg D (1999) Finding families for genomic ORFans. Bioinformatics 15:759–762

Gagne JM, Downes BP, Shiu SH, Durski AM, Vierstra RD (2002) The F-box subunit of the SCF E3 complex is encoded by a diverse superfamily of genes in Arabidopsis. Proc Natl Acad Sci U S A 99:11519–11524

Gardner MJ, Hall N, Fung E, White O, Berriman M, Hyman RW, Carlton JM, Pain A, Nelson KE, Bowman S, Paulsen IT, James K, Eisen JA, Rutherford K, Salzberg SL, Craig A, Kyes S, Chan MS, Nene V, Shallom SJ, Suh B, Peterson J, Angiuoli S, Pertea M, Allen J, Selengut J, Haft D, Mather MW, Vaidya AB, Martin DM, Fairlamb AH, Fraunholz MJ, Roos DS, Ralph SA, McFadden GI, Cummings LM, Subramanian GM, Mungall C, Venter JC, Carucci DJ, Hoffman SL, Newbold C, Davis RW, Fraser CM, Barrell B (2002) Genome sequence of the human malaria parasite Plasmodium falciparum. Nature 419:498–511

Gentles AJ, Karlin S (1999) Why are human G-protein-coupled receptors predominantly intronless? Trends Genet 15:47–49

Glusman G, Sosinsky A, Ben-Asher E, Avidan N, Sonkin D, Bahar A, Rosenthal A, Clifton S, Roe B, Ferraz C, Demaille J, Lancet D (2000) Sequence, structure, and evolution of a complete human olfactory receptor gene cluster. Genomics 63:227–245

Goff SA, Ricke D, Lan TH, Presting G, Wang R, Dunn M, Glazebrook J, Sessions A, Oeller P, Varma H, Hadley D, Hutchison D, Martin C, Katagiri F, Lange BM, Moughamer T, Xia Y, Budworth P, Zhong J, Miguel T, Paszkowski U, Zhang S, Colbert M, Sun WL, Chen L, Cooper B, Park S, Wood TC, Mao L, Quail P, Wing R, Dean R, Yu Y, Zharkikh A, Shen R, Sahasrabudhe S, Thomas A, Cannings R, Gutin A, Pruss D, Reid J, Tavtigian S, Mitchell J, Eldredge G, Scholl T, Miller RM, Bhatnagar S, Adey N, Rubano T, Tusneem N, Robinson R, Feldhaus J, Macalma T, Oliphant A, Briggs S (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296:92–100

Gotoh O (1998) Divergent structures of Caenorhabditis elegans cytochrome P450 genes suggest the frequent loss and gain of introns during the evolution of nematodes. Mol Biol Evol 15:1447–1459

International Rice Genome Sequencing Project (2005) The map-based sequence of the rice genome. Nature 436:793–800

Jain M, Kaur N, Garg R, Thakur JK, Tyagi AK, Khurana JP (2006a) Structure and expression analysis of early auxin-responsive Aux/IAA gene family in rice (Oryza sativa). Funct Integr Genomics 6:47–59

Jain M, Tyagi AK, Khurana JP (2006b) Genome-wide analysis, evolutionary expansion, and expression of early auxin-responsive SAUR gene family in rice (Oryza sativa). Genomics 88:360–371

Jensen LJ, Gupta R, Blom N, Devos D, Tamames J, Kesmir C, Nielsen H, Staerfeldt HH, Rapacki K, Workman C, Andersen CA, Knudsen S, Krogh A, Valencia A, Brunak S (2002) Prediction of human protein function from post-translational modifications and localization features. J Mol Biol 319:1257–1265


Jensen LJ, Ussery DW, Brunak S (2003) Functionality of system components: conservation of protein function in protein feature space. Genome Res 13:2444–2449

Jordan IK, Rogozin IB, Wolf YI, Koonin EV (2002) Essential genes are more evolutionarily conserved than are nonessential genes in bacteria. Genome Res 12:962–968

Kellis M, Birren BW, Lander ES (2004) Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature 428:617–624

Lecharny A, Boudet N, Gy I, Aubourg S, Kreis M (2003) Introns in, introns out in plant gene families: a genomic approach of the dynamics of gene structure. J Struct Funct Genomics 3:111–116

Long M (2001) Evolution of novel genes. Curr Opin Genet Dev 11:673–680

Lurin C, Andres C, Aubourg S, Bellaoui M, Bitton F, Bruyere C, Caboche M, Debast C, Gualberto J, Hoffmann B, Lecharny A, Le Ret M, Martin-Magniette ML, Mireau H, Peeters N, Renou JP, Szurek B, Taconnat L, Small I (2004) Genome-wide analysis of Arabidopsis pentatricopeptide repeat proteins reveals their essential role in organelle biogenesis. Plant Cell 16:2089–2103

Paterson AH, Bowers JE, Chapman BA (2004a) Ancient polyploidization predating divergence of the cereals, and its consequences for comparative genomics. Proc Natl Acad Sci U S A 101:9903–9908

Paterson AH, Bowers JE, Chapman BA, Peterson DG, Rong J, Wicker TM (2004b) Comparative genome analysis of monocots and dicots, toward characterization of angiosperm diversity. Curr Opin Biotechnol 15:120–125

Rujan T, Martin W (2001) How many genes in Arabidopsis come from cyanobacteria? An estimate from 386 protein phylogenies. Trends Genet 17:113–120

Sakharkar MK, Kangueane P (2004) Genome SEGE: a database for ‘intronless’ genes in eukaryotic genomes. BMC Bioinformatics 5:67

Sakharkar KR, Sakharkar MK, Culiat CT, Chow VT, Pervaiz S (2006) Functional and evolutionary analyses on expressed intronless genes in the mouse genome. FEBS Lett 580:1472–1478

Schmid KJ, Aquadro CF (2001) The evolutionary analysis of “orphans” from the Drosophila genome identifies rapidly diverging and incorrectly annotated genes. Genetics 159:589–598

Siew N, Fischer D (2003a) Analysis of singleton ORFans in fully sequenced microbial genomes. Proteins 53:241–251

Siew N, Fischer D (2003b) Twenty thousand ORFan microbial protein families for the biologist? Structure 11:7–9

Takeda S, Kadowaki S, Haga T, Takaesu H, Mitaku S (2002) Identification of G protein-coupled receptor genes from the human genome sequence. FEBS Lett 520:97–101

The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408:796–815

Veitia RA (2005) Paralogs in polyploids: one for all and all for one? Plant Cell 17:4–11

Vij S, Gupta V, Kumar D, Vydianathan R, Raghuvanshi S, Khurana P, Khurana JP, Tyagi AK (2006) Decoding the rice genome. Bioessays 28:421–432

Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Church DM, DiCuccio M, Edgar R, Federhen S, Helmberg W, Kenton DL, Khovayko O, Lipman DJ, Madden TL, Maglott DR, Ostell J, Pontius JU, Pruitt KD, Schuler GD, Schriml LM, Sequeira E, Sherry ST, Sirotkin K, Starchenko G, Suzek TO, Tatusov R, Tatusova TA, Wagner L, Yaschenko E (2005) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 33:D39–D45

Wilson AC, Carlson SS, White TJ (1977) Biochemical evolution. Annu Rev Biochem 46:573–639

Yu J, Wang J, Lin W, Li S, Li H, Zhou J, Ni P, Dong W, Hu S, Zeng C, Zhang J, Zhang Y, Li R, Xu Z, Li X, Zheng H, Cong L, Lin L, Yin J, Geng J, Li G, Shi J, Liu J, Lv H, Li J, Deng Y, Ran L, Shi X, Wang X, Wu Q, Li C, Ren X, Li D, Liu D, Zhang X, Ji Z, Zhao W, Sun Y, Zhang Z, Bao J, Han Y, Dong L, Ji J, Chen P, Wu S, Xiao Y, Bu D, Tan J, Yang L, Ye C, Xu J, Zhou Y, Yu Y, Zhang B, Zhuang S, Wei H, Liu B, Lei M, Yu H, Li Y, Xu H, Wei S, He X, Fang L, Huang X, Su Z, Tong W, Tong Z, Ye J, Wang L, Lei T, Chen C, Chen H, Huang H, Zhang F, Li N, Zhao C, Huang Y, Li L, Xi Y, Qi Q, Li W, Hu W, Tian X, Jiao Y, Liang X, Jin J, Gao L, Zheng W, Hao B, Liu S, Wang W, Yuan L, Cao M, McDermott J, Samudrala R, Wong GK, Yang H (2005) The Genomes of Oryza sativa: a history of duplications. PLoS Biol 3:e38

Yuan Q, Ouyang S, Wang A, Zhu W, Maiti R, Lin H, Hamilton J, Haas B, Sultana R, Cheung F, Wortman J, Buell CR (2005) The Institute for Genomic Research Osa1 rice genome annotation database. Plant Physiol 138:18–26
