
Molecular Plant • Volume 2 • Number 4 • Pages 738–754 • July 2009 RESEARCH ARTICLE

Molecular Evolution of VEF-Domain-Containing PcG Genes in Plants

Ling-Jing Chen, Zhao-Yan Diao, Chelsea Specht and Z. Renee Sung1

Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720–3102, USA

ABSTRACT

Arabidopsis VERNALIZATION2 (VRN2), EMBRYONIC FLOWER2 (EMF2), and FERTILIZATION-INDEPENDENT SEED2 (FIS2) are involved in vernalization-mediated flowering, vegetative development, and seed development, respectively. Together with Arabidopsis VEF-L36, they share a VEF domain that is conserved in plants and animals. To investigate the evolution of VEF-domain-containing genes (VEF genes), we analyzed sequences related to VEF genes across land plants. To date, 24 full-length sequences from 11 angiosperm families and 54 partial sequences from another nine families were identified. The majority of the full-length sequences identified share greatest sequence similarity with and possess the same major domain structure as Arabidopsis EMF2. EMF2-like sequences are not only widespread among angiosperms, but are also found in genomic sequences of gymnosperms, lycophyte, and moss. No FIS2- or VEF-L36-like sequences were recovered from plants other than Arabidopsis, including from rice and poplar for which whole genomes have been sequenced. Phylogenetic analysis of the full-length sequences showed a high degree of amino acid sequence conservation in EMF2 homologs of closely related taxa. VRN2 homologs are recovered as a clade nested within the larger EMF2 clade. FIS2 and VEF-L36 are recovered in the VRN2 clade. The VRN2 clade may have evolved from an EMF2 duplication event that occurred in the rosids prior to the divergence of the eurosid I and eurosid II lineages. We propose that dynamic changes in genome evolution contribute to the generation of the family of VEF-domain-containing genes. Phylogenetic analysis of the VEF domain alone showed that VEF sequences continue to evolve following EMF2/VRN2 divergence in accordance with species relationship. Existence of EMF2-like sequences in animals and across land plants suggests that a prototype form of EMF2 was present prior to the divergence of the plant and animal lineages. A proposed sequence of events, based on domain organization and occurrence of intermediate sequences throughout angiosperms, could explain VRN2 evolution from an EMF2-like ancestral sequence, possibly following duplication of the ancestral EMF2. Available data further suggest that VEF-L36 and FIS2 were derived from a VRN2-like ancestral sequence. Thus, the presence of VEF-L36 and FIS2 in a genome may ultimately be dependent upon the presence of a VRN2-like sequence.

Key words: VEF; EMF2; FIS2; VRN2; VEF-L36; Arabidopsis; PcG; phylogeny; evolution.

1 To whom correspondence should be addressed. E-mail zrsung@nature.berkeley.edu, fax (510) 642-4995, tel. (510) 642-6966.

© The Author 2009. Published by the Molecular Plant Shanghai Editorial Office in association with Oxford University Press on behalf of CSPP and IPPE, SIBS, CAS.

doi: 10.1093/mp/ssp032, Advance Access publication 19 June 2009

Received 10 March 2009; accepted 25 April 2009

INTRODUCTION

Identifying genes that act in developmental pathways and determining how they or their interactions are modified throughout organismal evolution is a major focus of the field of evolutionary developmental biology. Understanding how genes and gene networks function during the development of the model plant Arabidopsis thaliana provides a starting point for investigating how characterized developmental pathways may have played a role in the evolution of diverse plant body plans (Irish and Benfey, 2004).

The Polycomb Group protein (PcG) genes play a major role in epigenetic regulation of gene expression. Originally characterized in Drosophila, they encode a conserved group of chromatin proteins found in animals and plants. Structurally different Drosophila PcG proteins form complexes that maintain the repression of target genes. A PcG protein complex, composed of four core proteins (Suppressor of Zeste 12 (Su(z)12), Extra sex combs (Esc), P55, and Enhancer of zeste (E(z)) (Kuzmichev et al., 2002; Muller et al., 2002)), can methylate histone H3 at lysine 27 through the E(z) SET domain, providing a methyl mark for subsequent transcriptional repression and gene silencing (Cao et al., 2002; Czermin et al., 2002; Muller et al., 2002). Arabidopsis genes structurally similar to Drosophila PcG genes have been reported and their mutants characterized: CURLY LEAF (CLF) (Goodrich et al., 1997), FERTILIZATION-INDEPENDENT SEED DEVELOPMENT1 (FIS1)/MEDEA (MEA) (Grossniklaus et al., 1998; Luo et al., 1999), SWINGER (SWN) (Chanvivattana et al., 2004), FIS3/FERTILIZATION-INDEPENDENT ENDOSPERM (FIE) (Ohad et al., 1999), FERTILIZATION-INDEPENDENT SEED2 (FIS2) (Luo et al., 1999), EMBRYONIC FLOWER2 (EMF2) (Yoshida et al., 2001), VERNALIZATION2 (VRN2) (Gendall et al., 2001), and MULTICOPY SUPPRESSOR OF IRA1 (MSI1) (Hennig et al., 2003). Evidence indicates that these genes encode proteins that form putative PcG complexes involved in maintaining the silencing of Arabidopsis MADS-box genes (Chanvivattana et al., 2004; Sung and Amasino, 2004; Wood et al., 2006). Some PcG genes can be grouped into families based on sequence homology, such as CLF, MEA, and SWN (Chanvivattana et al., 2004) and EMF2, VRN2, and FIS2 (Yoshida et al., 2001). It is possible that these gene families are the result of gene duplication and subsequent diversification from ancestral sequences that were present prior to the divergence of the lineages ultimately leading to plants and animals.

Duplication and diversification of nucleotide sequences have been shown to lead to functional innovation across the tree of life (Kim et al., 2004). EMF2 is a core component of the putative PcG complex that represses flowering (Chanvivattana et al., 2004). Loss-of-function mutation in the EMF2 gene leads to elimination of vegetative growth in Arabidopsis (Yang et al., 1995), resulting in early flowering. EMF2 thus may have played a major role in plant survival and the evolution of phenological variability. Protein interactions between EMF2 and three other proteins, CLF (Goodrich et al., 1997), FIE (Kinoshita et al., 2001), and MSI1 (Hennig et al., 2003), suggest that they function as a protein complex in mediating floral repression. The putative EMF2/CLF or SWN/FIE/MSI1 complex represses the flower MADS-box genes AGAMOUS (AG), APETALA3 (AP3), and PISTILLATA (PI) during vegetative development (Moon et al., 2003; Calonje et al., 2008). CLF also represses flowering time genes, such as FLOWERING LOCUS T (FT) and AG-LIKE 19 (AGL19) (Schonrock et al., 2006; Jiang et al., 2008). FIS2 is a core component of the putative PcG complex FIS2/MEDEA (MEA)/FIE/MSI1 that regulates Arabidopsis seed development via repression of PHERES1 during gametophyte and endosperm development (Kohler et al., 2003). VRN2 is a core component of another putative PcG complex, VRN2/CLF or SWN/FIE/MSI1, that induces flowering in response to vernalization via the regulation of FLOWERING LOCUS C (FLC) (Sung and Amasino, 2004; Wood et al., 2006). It appears that the two groups of plant PcG genes, CLF-MEA-SWN and EMF2-VRN2-FIS2, have co-evolved to form multi-protein complexes that target different gene regulatory networks (Calonje and Sung, 2006).

The molecular similarity of the VEF genes suggests that they are related and may be the result of an historic gene duplication event followed by diversification. To understand how the Arabidopsis VEF gene family evolved, we investigated homologs of this gene family in Arabidopsis and other land plants. In this paper, we identified 85 partial and full-length sequences from land plants with a taxonomic focus on flowering plants. Our results suggest that EMF2 is the most plesiomorphic form of the gene and may have acted as a prototype in the generation of the VEF gene family. Intragenic sequence duplication, deletion/insertion, and intergenic exon shuffling could account for the structural and functional diversification of the VEF genes from an EMF2-like ancestor. We propose that VRN2 evolved from an EMF2-like ancestor, and that VEF-L36 and FIS2 were derived from a VRN2-like ancestral sequence in Arabidopsis and possibly in other angiosperms.

RESULTS

Domain Organization in Arabidopsis VEF Family Proteins

Using a deduced EMF2 amino acid sequence to BLAST against GenBank, four full-length Arabidopsis proteins, EMF2 (At5g51230), FIS2 (At2g35670), VRN2 (At4g16845), and VEF-L36 (At4g16810), were recovered with significant e-values (<2e–12). In addition to the common VEF domain that defines this gene family (Figure 1), EMF2, VRN2, and FIS2 share a C2H2 domain. EMF2 and VRN2 further share an N-terminal domain (N-ter) that is present in the Drosophila homolog, Su(z)12, but is absent in FIS2 and VEF-L36. However, VRN2 differs from EMF2 in lacking sequence corresponding to EMF2 exon 5 through exon 10 (E5–10), as well as a stretch of sequence at the N-terminus called the N-terminal cap (N-ter cap). VRN2 also has a 52-aa repeat in the C-terminus that is absent in EMF2. Despite these differences, globally, VRN2 and EMF2 share similar domain organization and 45% amino acid sequence identity.

Figure 1. Domain Organization of VEF-Domain-Containing Proteins of Arabidopsis.

Blue block: EMF2 N-terminal domain (N-ter), which is composed of two parts: an N-terminal cap (cap) and the remaining part (N-ter Δcap) as seen in VRN2. Orange block: EMF2-specific E5–10 domain. Green block: C2H2 zinc finger domain. Red block: VEF domain, which is uniquely located at the N-terminus of VEF-L36. Pink block: EMF2/VRN2-specific E15–17 domain. Light-blue block: VEF-L36-specific repeat domain. Dark-green block: VEF-L36-specific L36 domain. Yellow block: FIS2-specific S-rich domain. Purple block: FIS2 C-terminal tail.
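The e-value cutoff used in the BLAST screen of this section can be illustrated with a small filter over tabular BLAST output. This is a minimal sketch, not the authors' pipeline: it assumes the standard 12-column tab-separated layout (query, subject, percent identity, ..., e-value, bit score), and the names `BlastHit`, `parse_tabular_hits`, `significant`, and every row of `EXAMPLE` are hypothetical, with made-up numbers.

```python
from typing import NamedTuple


class BlastHit(NamedTuple):
    query: str
    subject: str
    percent_identity: float
    evalue: float


def parse_tabular_hits(text: str) -> list[BlastHit]:
    """Parse 12-column tab-separated BLAST lines; blank and '#' lines are skipped."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Column 3 is percent identity, column 11 is the e-value (1-based).
        hits.append(BlastHit(cols[0], cols[1], float(cols[2]), float(cols[10])))
    return hits


def significant(hits: list[BlastHit], max_evalue: float = 2e-12) -> list[BlastHit]:
    """Keep hits whose e-value is below the threshold quoted in the text."""
    return [hit for hit in hits if hit.evalue < max_evalue]


# Invented rows for illustration only; they are not results from the paper.
EXAMPLE = "\n".join([
    "# hypothetical tabular output",
    "query_1\thit_A\t100.0\t600\t0\t0\t1\t600\t1\t600\t0.0\t1200",
    "query_1\thit_B\t45.0\t420\t200\t9\t20\t440\t5\t410\t3e-40\t160",
    "query_1\thit_C\t28.0\t90\t60\t3\t300\t390\t10\t95\t1e-05\t45",
])

for hit in significant(parse_tabular_hits(EXAMPLE)):
    print(hit.subject, hit.evalue)
```

Only the first two invented hits pass the filter; the third has an e-value above 2e–12 and is dropped.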

First reported as EMF2-like 1 by Yoshida et al. (2001), VEF-L36 is a hypothetical protein, based on its predicted gene structure from TAIR (TAIR: www.Arabidopsis.org/servlets/TairObject?id=128616&type=locus). It shares only the VEF domain with the other VEF proteins (Figure 1). Unlike EMF2, VRN2, and FIS2, its VEF domain is located at the N-terminus and its C-terminus comprises a sequence with low similarity to ribosomal protein L36. There is also a stretch of repeat sequence in the middle region that is not found in any of the other VEF genes.

Widespread Occurrence of EMF2/VRN2 Homologs among Land Plants

To investigate the distribution of homologs of VEF genes in plants, we used VEF-containing proteins to perform BLAST searches against the databases listed in the Methods. Using the Arabidopsis EMF2 amino acid sequence to BLAST against GenBank, 10 full-length homologs were returned, eight from grasses (Poaceae), one from Carica (Caricaceae), and one from Silene (Caryophyllaceae) (Table 1). The grass homologs included one from wheat (Triticum aestivum), three from barley (Hordeum vulgare), two from maize (Zea mays), and two from rice (Oryza sativa). The Silene homolog is from Silene latifolia of Caryophyllaceae, a member of the core eudicots. The Chromatin Database (www.chromdb.org/) identifies three full-length sequences from poplar (Populus trichocarpa: VEF901, 902, and 904) and one partial sequence (VEF903). The full-length sequences are hereafter referred to as PtEMF2_1 for VEF901, PtEMF2_2 for VEF902, and PtEMF2_4 for VEF904 (see Table 1A).

We also sequenced six full-length cDNAs from species in five different angiosperm families representing early-diverging monocots (Acorus; Acorales), higher monocots (Asparagus, Yucca; Asparagales), basal eudicots (Eschscholzia; Papaveraceae), and the asterids (Solanum; Solanaceae) (see Methods). The Kazusa DNA Research Institute provided one full-length sequence from Lotus japonicus (Fabaceae). Using deduced amino acid sequences of these cDNAs to BLAST against GenBank, the same homologs were returned as when using the Arabidopsis EMF2 sequence. Using full-length VRN2, VEF-L36, and FIS2 to BLAST against GenBank, we found mostly the same sequences as described above, likely due to sequence homology in the VEF domain.

Pair-wise identity scores of these full-length sequences indicate that non-Arabidopsis sequences display higher identity to Arabidopsis EMF2 and VRN2 than to FIS2 and VEF-L36 (Table 2). Among these homologs, VEF-L36 shows lowest pair-wise identity to other members (average score: 8), followed by FIS2 (average score: 17). Both show higher identity to VRN2 than to other EMF2/VRN2 homologs.
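As a rough illustration of how a pair-wise identity score of this kind can be computed, the sketch below takes two rows of an existing alignment and reports the percentage of compared positions that carry the same residue. The function name `percent_identity`, the decision to skip columns where either sequence has a gap, and the toy sequences are assumptions made for illustration; the scoring scheme behind Table 2 may differ.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for res_a, res_b in zip(aligned_a.upper(), aligned_b.upper()):
        if res_a == gap or res_b == gap:
            continue  # gapped column: ignored in this sketch
        compared += 1
        if res_a == res_b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy aligned fragments (hypothetical, not taken from the paper's alignment).
TOY_A = "MKT-LVAGC"
TOY_B = "MRTALV-GC"
print(round(percent_identity(TOY_A, TOY_B), 1))
```

Other conventions divide by the full alignment length or by the shorter sequence, which gives lower values for gappy alignments, so the denominator should always be stated alongside a reported identity.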

Sequence alignment of the 24 full-length proteins was performed using MUSCLE (www.ebi.ac.uk). All non-Arabidopsis full-length sequences possess the N-terminal (N-ter), C2H2, and VEF domains homologous to those of EMF2/VRN2 sequences (Figure 2), indicating a high conservation of domain organization. These sequences are not likely to be orthologs of FIS2 or VEF-L36 due to both the presence of the EMF2/VRN2-characteristic N-ter domain and the absence of either the S-rich domain found in FIS2 or the L36 domain characteristic of VEF-L36 (Figure 1). Sixteen full-length, non-Arabidopsis sequences contain the complete N-ter that includes the N-ter cap: ZmEMF2_1, ZmEMF2_2, HvEMF2_4, HvEMF2_5, LjEMF2, OsEMF2_4, AaEMF2, YfEMF2, AoEMF2, LeEMF2_1, SlEMF2, LjEMF2, PtEMF2_1, PtEMF2_2, TaEMF2_3, CpEMF2. Five sequences, EcEMF2_2, OsEMF2_9, HvEMF2_1, ZmEMF2_2, and PtEMF2_4, lack the N-ter cap. One sequence from barley, HvEMF2_1, lacks both the N-ter cap and the VEF domain. Together with Arabidopsis EMF2 and VRN2, these full-length EMF2/VRN2 sequences represent 14 species from 11 angiosperm families (Acoraceae, Asparagaceae, Agavaceae, Poaceae, Caryophyllaceae, Fabaceae, Brassicaceae, Solanaceae, Salicaceae, Caricaceae, and Papaveraceae). No discernable FIS2 or VEF-L36 orthologs were recovered from rice or poplar, despite the availability of full genomic sequences.
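The presence/absence reasoning used above to rule out FIS2 and VEF-L36 orthology can be written as a small rule set. The sketch below is only an illustration: the function `classify_by_domains`, the domain labels, and the class names are hypothetical shorthand, and the rules encode just the criteria stated in the text (S-rich marks FIS2, L36 marks VEF-L36, N-ter marks the EMF2/VRN2 class), checked against the Arabidopsis architectures described for Figure 1.

```python
def classify_by_domains(domains: set) -> str:
    """Assign a coarse class from a set of domain labels (illustrative rules only)."""
    if "S-rich" in domains:
        return "FIS2-like"
    if "L36" in domains:
        return "VEF-L36-like"
    if "N-ter" in domains:
        return "EMF2/VRN2 class"
    return "unassigned"


# Domain sets transcribed from the Figure 1 description; labels are shorthand
# chosen for this sketch, not identifiers used by the authors.
ARABIDOPSIS = {
    "EMF2": {"N-ter cap", "N-ter", "E5-10", "C2H2", "VEF", "E15-17"},
    "VRN2": {"N-ter", "C2H2", "VEF", "E15-17"},
    "FIS2": {"C2H2", "VEF", "S-rich", "C-terminal tail"},
    "VEF-L36": {"VEF", "repeat", "L36"},
}

for name, domains in ARABIDOPSIS.items():
    print(name, "->", classify_by_domains(domains))
```

A rule set like this only separates the classes; distinguishing EMF2-like from VRN2-like sequences would additionally need the N-ter cap and E5–10 criteria discussed later in the section.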

In addition to the full-length sequences, we found ~140 incomplete sequences showing significant homology to three EMF2 domains in various genomic databases (see Methods). After the elimination of identical sequences, 54 new sequences homologous to one or more EMF2 domains were identified (Table 1B): (1) 9 ESTs possess N-terminal domain sequences, (2) 16 possess C2H2 domain sequences, and (3) 36 possess VEF domain sequences, from nine additional angiosperm families (Malvaceae, Vitaceae, Liliaceae, Rubiaceae, Nymphaeaceae, Ranunculaceae, Asteraceae, Bromeliaceae, and Euphorbiaceae) (Table 1 and Supplemental Figure 1). Altogether, 78 sequences (24 full-length and 54 partial) were identified from 20 angiosperm families.
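The three domain counts (9, 16, and 36) add up to 61, not 54, because some partial sequences cover more than one domain. One reading consistent with Table 1B is that a name listed under several domain headings is counted once. The sketch below checks that reading with set operations; the name lists are transcribed from Table 1B, while the variable names and the interpretation itself are assumptions of this example.

```python
# Sequence names transcribed from Table 1B, grouped by the domain heading they appear under.
N_TER = """CaEMF2p GmEMF2p_3 GmEMF2p_4 GrEMF2p LsEMF2p_1 MtEMF2p SbEMF2p_2
VvEMF2p_3 ZmEMF2p_3""".split()

C2H2 = """CcEMF2p_1 CsEMF2p_1 CtEMF2p EeEMF2p GmEMF2p_3 GtrEMF2p LeEMF2p_2
SbEMF2p_2 ScEMF2p SoEMF2p_2 SoEMF2p_3 SoEMF2p_1 TaEMF2p_2 ToEMF2p_1
VvEMF2p_1 ZmEMF2p_4""".split()

VEF = """AcEMF2p AfEMF2p AnanasEMF2p BnEMF2p_1 BnEMF2p_2 CcEMF2p CiEMF2p_1
CiEMF2p_2 CsEMF2p EeEMF2p GhEMF2p_1 GhEMF2p_2 GhEMF2p_3 GmEMF2p_1 HaEMF2p_1
HpEMF2p_1 LeEMF2p_3 LsEMF2p LvEMF2p MeEMF2p MtEMF2p NaEMF2p NtEMF2p PsEMF2p
SbEMF2p_1 SbEMF2p_3 LeEMF2p SoEMF2p_1 SoEMF2p_2 SoEMF2p_4 StEMF2p_2
StEMF2p_3 TaEMF2p_1 VvEMF2p_2 VvEMF2p_4 ZmEMF2p_4""".split()

# Total listings across the three headings versus distinct sequence names.
domain_hits = len(N_TER) + len(C2H2) + len(VEF)
unique_names = set(N_TER) | set(C2H2) | set(VEF)

# Names that appear under more than one domain heading.
shared = sorted(
    name for name in unique_names
    if (name in N_TER) + (name in C2H2) + (name in VEF) > 1
)

print(domain_hits, len(unique_names))
print(shared)
```

Under this reading, the difference between the number of listings and the number of distinct names is explained entirely by the sequences in `shared`, each of which spans two domains.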

Outside of the angiosperms, we identified two gymnosperm ESTs sharing homology with the EMF2 C-terminal domain from Pinaceae (Supplemental Figure 2B and 2C), one each in Pinus taeda (pine) and Picea engelmannii (spruce), and two individual ESTs from the lycophyte species Selaginella moellendorffii (Table 1C). One Selaginella partial sequence (SdEMF2p_1) contained both N-ter and C2H2 domains, showing 44% and 39% identity to the respective domains of EMF2. The other Selaginella sequence (SdEMF2p) contained only the VEF domain, showing 58% identity to EMF2’s VEF in a 145-aa region of overlap (Table 1C and Supplemental Figure 2A).

The Chromatin Database yielded three full-length sequences homologous to EMF2 from Physcomitrella patens (Bryophyta; moss), PpEMF2_1, PpEMF2_2, and PpEMF2_3 (Table 1C). Despite low sequence similarity to Arabidopsis EMF2 (~25%), the moss sequences possess N-ter, C2H2, and VEF domains. These findings that EMF2/VRN2 homologs exist in lycophytes and mosses and have similar domain structure to modern angiosperm EMF2 indicate that EMF2 was likely to have been present in the genomes of early land plants (Supplemental Figure 2D).

Table 1. Full-Length and Partial Sequences of VEF Gene Homologs.

(A) Full-length VEF gene homologs from Angiosperm.

Name | Family | Plant | Accession #
--- | --- | --- | ---
AaEMF2 | Acoraceae | Acorus americanus, sweet flag | GenBank: ABI99480
AoEMF2 | Asparagaceae | Asparagus officinalis, asparagus | GenBank: ABD85301
AtEMF2 | Brassicaceae | Arabidopsis thaliana | TAIR: AT5G51230
CpEMF2 | Caricaceae | Carica papaya | CoGe: Chr Supercontig_1321118352–2159309
EcEMF2_1 | Papaveraceae | Eschscholzia californica, California poppy | GenBank: ABD98790
EcEMF2_2 | Papaveraceae | Eschscholzia californica, California poppy | GenBank: ABD98791
FIS2_692 | Brassicaceae | Arabidopsis thaliana | TAIR: AT2G35670
HvEMF2_1 | Poaceae | Hordeum vulgare, barley | GenBank: BAD99132
HvEMF2_4 | Poaceae | Hordeum vulgare, barley | GenBank: BAD99131
HvEMF2_5 | Poaceae | Hordeum vulgare, barley | GenBank: BAD99131
LeEMF2_1 | Solanaceae | Lycopersicon esculentum | GenBank: ABI99480
LjEMF2 | Fabaceae | Lotus japonicus | Legume database
OsEMF2_4 | Poaceae | Oryza sativa, rice | TIGR: LOC_Os04g08034
OsEMF2_9 | Poaceae | Oryza sativa, rice | TIGR: LOC_Os09g13630
PtEMF2_1 | Salicaceae | Populus trichocarpa, cottonwood | ChromDB: VEF901
PtEMF2_2 | Salicaceae | Populus trichocarpa, cottonwood | ChromDB: VEF902
PtEMF2_4 | Salicaceae | Populus trichocarpa, cottonwood | ChromDB: VEF904
SlEMF2 | Caryophyllaceae | Silene latifolia, white campion | GenBank: BAD93353
TaEMF2_3 | Poaceae | Triticum aestivum, wheat | GenBank: AAX78232
VRN2_445 | Brassicaceae | Arabidopsis thaliana | TAIR: AT4G16845
VEF_L36 | Brassicaceae | Arabidopsis thaliana | TAIR: AT4G16810
YfEMF2 | Yuccaceae | Yucca filamentosa, Yucca | GenBank: ABD85300
ZmEMF2_1 | Poaceae | Zea mays, maize | ChromDB: VEF101
ZmEMF2_2 | Poaceae | Zea mays, maize | ChromDB: VEF102

(B) EMF2/VRN2-related ESTs from Angiosperm.

N-terminal (nine ESTs)

Name | Family | Plant | Accession #
--- | --- | --- | ---
CaEMF2p | Solanaceae | Capsicum annuum, pepper | GenBank: CA847455
GmEMF2p_3 | Fabaceae | Glycine max, soybean | TIGR: TC221104
GmEMF2p_4 | Fabaceae | Glycine max, soybean | TIGR: TC211671
GrEMF2p | Malvaceae | Gossypium barbadense, cotton | TIGR: TC40052
LsEMF2p_1 | Asteraceae | Lactuca saligna, lettuce | TIGR: TA10917_4236
MtEMF2p | Fabaceae | Medicago truncatula | TIGR: TC108897
SbEMF2p_2 | Poaceae | Sorghum bicolor, sorghum | TIGR: TA29013_4558
VvEMF2p_3 | Vitaceae | Vitis vinifera, grape | GenBank: CF609577
ZmEMF2p_3 | Poaceae | Zea mays, maize | TIGR: CD436196

C2H2 zinc finger (16 ESTs)

Name | Family | Plant | Accession #
--- | --- | --- | ---
CcEMF2p_1 | Rubiaceae | Coffea canephora | TIGR: TA7702_49390
CsEMF2p_1 | Asteraceae | Centaurea solstitialis | TIGR: TA4722_347529
CtEMF2p | Asteraceae | Carthamus tinctorius | TIGR: TA2823_4222
EeEMF2p | Euphorbiaceae | Euphorbia esula | TIGR: TA17942_3993
GmEMF2p_3 | Fabaceae | Glycine max, soybean | TIGR: TC221104
GtrEMF2p | Asteraceae | Gerbera hybrid cv. Terra Regina | GenBank: AJ759904
LeEMF2p_2 | Solanaceae | Lycopersicon esculentum, tomato | GenBank: AW038171
SbEMF2p_2 | Poaceae | Sorghum bicolor, sorghum | TIGR: TA29013_4558
ScEMF2p | Poaceae | Secale cereale, cereal rye | GenBank: BE587348
SoEMF2p_2 | Poaceae | Saccharum officinarum, sugarcane | TIGR: TA38345_4547
SoEMF2p_3 | Poaceae | Saccharum officinarum, sugarcane | TIGR: TC71329
SoEMF2p_1 | Poaceae | Saccharum officinarum, sugarcane | GenBank: CA098901
TaEMF2p_2 | Poaceae | Triticum aestivum, wheat | GenBank: BJ211655
ToEMF2p_1 | Asteraceae | Taraxacum officinale | TIGR: TA5836_50225
VvEMF2p_1 | Vitaceae | Vitis vinifera, grape | GenBank: CN006883
ZmEMF2p_4 | Poaceae | Zea mays, maize | TIGR: TA193846_4577

VEF domain (36 ESTs)

Name | Family | Plant | Accession #
--- | --- | --- | ---
AcEMF2p | Liliaceae | Allium cepa | GenBank: CF443745
AfEMF2p | Ranunculaceae | Aquilegia formosa | TIGR: TA14166_338618
AnanasEMF2p | Bromeliaceae | Ananas comosus | GenBank: DT339533
BnEMF2p_1 | Brassicaceae | Brassica napus | GenBank: CX194398
BnEMF2p_2 | Brassicaceae | Brassica napus | GenBank: CX188412
CcEMF2p | Rubiaceae | Coffea canephora | TIGR: TA7701_49390
CiEMF2p_1 | Asteraceae | Cichorium intybus | GenBank: EH708467
CiEMF2p_2 | Asteraceae | Cichorium intybus | TIGR: TA5136_13427
CsEMF2p | Asteraceae | Centaurea solstitialis | GenBank: EH782846
EeEMF2p | Euphorbiaceae | Euphorbia esula | TIGR: TA17942_3993
GhEMF2p_1 | Malvaceae | Gossypium hirsutum, cotton | GenBank: DW229901
GhEMF2p_2 | Malvaceae | Gossypium hirsutum, cotton | TIGR: TA37052_3635
GhEMF2p_3 | Malvaceae | Gossypium hirsutum, cotton | TIGR: TA35411_3635
GmEMF2p_1 | Fabaceae | Glycine max, soybean | TIGR: TA61896_3847
HaEMF2p_1 | Asteraceae | Helianthus annuus, sunflower | GenBank: CD848472
HpEMF2p_1 | Asteraceae | Helianthus paradoxus, sunflower | GenBank: EL487885
LeEMF2p_3 | Solanaceae | Lycopersicon esculentum, tomato | GenBank: BI932726
LsEMF2p | Asteraceae | Lactuca saligna, lettuce | TIGR: TA3490_75948
LvEMF2p | Asteraceae | Lactuca virosa, wild lettuce | GenBank: DW160707
MeEMF2p | Euphorbiaceae | Manihot esculenta, cassava | GenBank: DV449784
MtEMF2p | Fabaceae | Medicago truncatula | TIGR: TC108897
NaEMF2p | Nymphaeaceae | Nuphar advena, yellow pondlily | FGP: nad03-13ms2-e08
NtEMF2p | Solanaceae | Nicotiana tabacum, tobacco | GenBank: EB678277
PsEMF2p | Fabaceae | Pisum sativum, pea | GenBank: AAX47184
SbEMF2p_1 | Poaceae | Sorghum bicolor, sorghum | TIGR: TA34517_4558
SbEMF2p_3 | Poaceae | Sorghum bicolor, sorghum | TIGR: TA35158_4558
LeEMF2p | Solanaceae | Solanum lycopersicum, tomato | GenBank: AW038171
SoEMF2p_1 | Poaceae | Saccharum officinarum, sugarcane | GenBank: CA098901
SoEMF2p_2 | Poaceae | Saccharum officinarum, sugarcane | TIGR: TA38345_4547
SoEMF2p_4 | Poaceae | Saccharum officinarum, sugarcane | GenBank: CA098901
StEMF2p_2 | Solanaceae | Solanum tuberosum | TIGR: TA35890_4113
StEMF2p_3 | Solanaceae | Solanum tuberosum | GenBank: BQ505017
TaEMF2p_1 | Poaceae | Triticum aestivum, wheat | TIGR: TA70383_4565
VvEMF2p_2 | Vitaceae | Vitis vinifera, grape | TIGR: TA47215_29760
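The counts reported in this section can be reconciled with the total of 85 sequences quoted in the Introduction by a short tally. The dictionary keys below are labels chosen for this sketch; the numbers are the ones stated in the text (24 full-length and 54 partial angiosperm sequences, two gymnosperm ESTs, two lycophyte ESTs, and three moss sequences).

```python
# Counts as stated in the text; the labels are shorthand for this sketch.
COUNTS = {
    "angiosperm full-length": 24,
    "angiosperm partial": 54,
    "gymnosperm ESTs": 2,
    "lycophyte ESTs": 2,
    "moss full-length": 3,
}

angiosperm_total = COUNTS["angiosperm full-length"] + COUNTS["angiosperm partial"]
land_plant_total = sum(COUNTS.values())

print(angiosperm_total, land_plant_total)
```

The angiosperm subtotal matches the 78 sequences reported above, and adding the seven non-angiosperm sequences gives the land-plant total.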

Sequence Comparison of EMF2/VRN2 Class Homologs

Predicted full-length and partial EMF2/VRN2 protein homologs were aligned using MUSCLE (see Methods).

N-terminal (N-ter) Domain

An N-terminal domain for Arabidopsis EMF2 was defined by Yoshida et al. (2001) as a fragment spanning amino acids 47 to 81 (Figure 2A). The domain is also found in VRN2 and in the Drosophila Su(z)12 protein. Our alignment of the full-length sequences from all identified EMF2/VRN2 class homologs shows that a larger area is conserved across land plants, starting from the first amino acid of EMF2 to the end of exon 4 (aa 81), and is hereafter referred to as the N-ter domain (Figure 2A). Relative to EMF2, VRN2 has an abbreviated N-ter domain, starting translation from a methionine (M) corresponding to the second M of EMF2. The sequence between the two methionines of EMF2 is referred to as the N-ter cap. EMF2/VRN2 homologs of the monocots Acorus, Yucca, and Asparagus and of the basal eudicot Eschscholzia all contain the N-ter cap (Figure 2A), suggesting that the angiosperm ancestral sequence may have had both M sites, similar to Arabidopsis EMF2. Indeed, Selaginella SdEMF2p_1 and the Physcomitrella sequence PpEMF2_3 (VEF1503) have an N-ter cap (Supplemental Figure 2D), although the sequences and lengths are divergent from those found in the identified angiosperm sequences. In some sequences, the second M of the N-ter cap is replaced by a different amino acid; for example, the second M of SlEMF2 is replaced by an S (Figure 2A).
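The definition of the N-ter cap as the stretch ahead of the second methionine can be illustrated with a naive string split. This is a sketch under stated assumptions: `split_n_ter_cap` is a hypothetical helper, the cap is taken to include the initiator M, and the toy sequence is invented. As the SlEMF2 example above shows, the second M can be substituted in real homologs, so an actual analysis would locate the boundary from the alignment and not from the raw sequence.

```python
def split_n_ter_cap(protein: str) -> tuple:
    """Split a protein at its second methionine.

    Returns (cap, remainder). The cap is empty when the sequence does not
    start with M or contains no second M.
    """
    seq = protein.upper()
    if not seq.startswith("M"):
        return "", seq
    second_m = seq.find("M", 1)  # search from position 1 to skip the initiator M
    if second_m == -1:
        return "", seq
    return seq[:second_m], seq[second_m:]


# Invented toy sequence; it is not an EMF2 or VRN2 fragment.
TOY_PROTEIN = "MPGIPLVSRETSMCRQSRST"
cap, remainder = split_n_ter_cap(TOY_PROTEIN)
print(cap, remainder)
```

A sequence that begins at the position of the second M, as described for VRN2, would in this scheme correspond to the `remainder` alone.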

In species with two or more EMF2 class homologous sequences found so far, at least one sequence has such a cap, as in rice (OsEMF2_4 vs. OsEMF2_9), maize (ZmEMF2_1 vs. ZmEMF2_2), poplar (PtEMF2_1 and PtEMF2_2 vs. PtEMF2_4), and California poppy (EcEMF2_1 vs. EcEMF2_2) (Figure 2A and Supplemental Figure 1A). The early land plants also possess at least one sequence with the N-ter cap (Supplemental Figure 2A and 2D).

E5–10 Domain

VRN2 is missing most of the amino acid sequence correspond-

ing to EMF2 exon 5 through exon 10 (E5–10). Comparison of

Table 1. Continued

VEF domain (36 ESTs)Name Family Plant Accession #

VvEMF2p_4 Vitaceae Vitis vinifera, grape GenBank: AM447481

ZmEMF2p_4 Poaceae Zea mays, maize TIGR: TA193846_4577

(C) EMF2/VRN2 homologs from Gymnosperm, Spikemoss, and moss.GymnospermName Family Plant Accession #

PeEMF2p Pinaceae Picea engelmannii, spruce TIGR: TA1969_373101

PlEMF2p Pinaceae Pinus taeda, pine GenBank: CO368996

LycophyteName Family Plant Accession #

SdEMF2p Selaginellaceae Selaginella moellendorffii,Spikemoss

gnl|050718cr339|1588846_1

SdEMF2p_1 Selaginellaceae Selaginella moellendorffii,Spikemoss

gnl|050718cr339|1588846_2

MossName Family Plant Accession #

PpEMF2_1 Funariaceae Physcomitrella patens, moss ChromDB: VEF1501

PpEMF2_2 Funariaceae Physcomitrella patens, moss ChromDB: VEF1502

PpEMF2_3 Funariaceae Physcomitrella patens, moss ChromDB: VEF1503

Note: ‘p’ in the sequence name stands for partial sequence. The letters in the sequence name stand for the following plants: Aa: Acorus americanus,Ac: Allium cepa, Af: Aquilegia formosa, Ao: Asparagus officinalis, At: Arabidopsis thaliana, Bn: Brassica napus, Ca: Capsicum annuum, Cc: Coffeacanephora, Ci: Cichorium intybus, Cp: Carica papaya, Cs: Centaurea solstitialis, Ct: Carthamus tinctorius, Ec: Eschscholzia californica, Ee: Euphorbiaesula, Gh: Gossypium hirsutum, Gm: Glycine max, Gr: Gossypium barbadense, Gtr: Gerbera, Ha: Helianthus annuus, Hp: Helianthus paradoxus, Hv:Hordeum vulgare, Le: Lycopersicon esculentum, Lj: Lotus japonicus, Ls: Lactuca saligna, Lv: Lactuca virosa, Me: Manihot esculenta, Mt: Medicagotruncatula, Na: Nuphar, Nt: Nicotiana tabacum, Os: Oryza sativa, Pe: Picea engelmannii, Pl: Pinus taeda, Pp: Physcomitrella patens, Ps: Pisum sativum,Pt: Populus trichocarpa, Sb: Sorghumbicolor, Sc: Secale cereale, Sd: Spikemoss, Sl: Silene latifolia, So: Saccharumofficinarum, St: Solanum tuberosum,Ta: Triticum aestivum, To: Taraxacum officinale, Vv: Vitis vinifera, Yf: Yucca filamentosa, Zm: Zea mays.


EMF2 and VRN2 genomic DNA revealed no conserved corresponding sequence in VRN2 in this region, excluding the possibility of alternative mRNA splicing as the cause of the difference. One full-length sequence, PtEMF2_4 from Populus trichocarpa (poplar) (Figure 2 and Supplemental Figure 1B), and three partial sequences, MtEMF2p from Medicago truncatula, GmEMF2p_3 from Glycine max, and CaEMF2p from Capsicum annuum, lack both the E5–10 domain and the N-ter cap, like VRN2. Within the E5–10 domain, amino acids encoded by EMF2 exons 5 (Figure 2A), 6, and 8 were highly conserved among the non-Arabidopsis EMF2 homologs, suggesting a potentially conserved function of the E5–10 region. Alignment analysis suggested the presence of E5–10 in all three Physcomitrella sequences, though with divergent sequences (Supplemental Figure 2D).

C2H2 Zinc Finger Domain

Unlike most Arabidopsis C2H2-domain-containing genes, which have multiple C2H2 motifs in tandem (Englbrecht et al., 2004), the three VEF proteins, VRN2, FIS2, and EMF2, contain a single C2H2 motif that is encoded by exons 12 and 13 in EMF2. Previous studies found a conserved 37-aa C2H2 domain sequence in EMF2 homologs from Drosophila, human, and Arabidopsis (Yoshida et al., 2001). Alignment of the full-length sequences shows a conserved region extending from EMF2 exon 11 through 14 in the EMF2 homologs, covering a range of ~77 aa. In the EMF2/VRN2 class, VRN2’s C2H2 is most divergent from that of other members (Figure 2B). Selaginella’s SdEMF2p_1 has the C2H2 region with 39% identity to that of EMF2 (Supplemental Figure 2A). Physcomitrella’s PpEMF2_3 has a region corresponding to that of EMF2’s C2H2; its two

Table 2. Pair-Wise Alignment Scores of Full-Length VEF Protein Homologs.

Sequence name¹ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27

1. AaEMF2 -

2. AoEMF2 61 -

3. AtEMF2 54 55 -

4. CpEMF2 57 58 68 -

5. EcEMF2_1 53 55 53 54 -

6. EcEMF2_2 52 52 51 51 57 -

7. FIS2_692 18 19 16 18 19 18 -

8. HvEMF2_1 42 49 40 43 40 43 12 -

9. HvEMF2_4 52 60 49 52 50 49 17 78 -

10. HvEMF2_5 42 46 41 40 37 39 16 54 58 -

11. LeEMF2_1 58 59 58 62 53 52 18 41 50 42 -

12. LjEMF2 40 40 42 46 40 38 17 32 37 34 43 -

13. OsEMF2_4 45 50 42 45 45 43 17 53 61 54 43 32 -

14. OsEMF2_9 53 61 50 52 52 52 18 71 82 61 51 37 62 -

15. PpEMF2_1 25 25 27 26 25 25 13 17 27 23 27 21 22 24 -

16. PpEMF2_2 24 23 26 23 23 20 10 15 21 20 24 15 22 23 69 -

17. PpEMF2_3 26 28 29 28 29 26 15 20 29 22 28 22 28 29 55 53 -

18. PtEMF2_1 57 56 63 71 53 51 17 42 49 39 59 41 44 51 23 25 28 -

19. PtEMF2_2 56 58 63 70 53 50 19 40 50 42 60 42 42 51 27 21 32 84 -

20. PtEMF2_4 61 60 53 54 52 58 30 25 49 43 57 44 43 50 27 31 30 56 55 -

21. SlEMF2 53 53 57 62 50 47 18 35 45 39 58 44 39 47 23 19 27 57 58 49 -

22. TaEMF2_3 51 58 49 51 51 48 18 80 93 57 49 35 60 81 27 23 29 48 50 48 45 -

23. VEF_L36 8 8 8 8 8 7 7 5 8 7 8 6 8 8 8 9 7 8 8 12 8 8 -

24. VRN2_445 46 47 45 45 43 46 31 20 44 34 48 36 37 44 29 29 31 45 44 51 42 44 14 -

25. YfEMF2 61 82 56 58 57 52 18 50 58 47 59 40 49 60 25 24 28 57 59 60 53 57 9 47 -

26. ZmEMF2_1 51 57 48 51 48 47 16 68 75 58 50 36 60 77 23 23 27 47 48 48 46 78 8 43 55 -

27. ZmEMF2_2 55 59 51 51 51 52 17 71 80 60 53 39 62 81 24 22 29 51 51 48 49 80 8 44 59 80 -

Average score 46 49 46 48 44 43 17 42 51 41 47 35 43 51 26 25 28 47 47 46 43 51 8 40 49 49 51

Note: 1. The numbers listed in the top line represent the sequences with the same numbers listed in the first column. Calculation of pair-wise alignment scores is described in Methods. Average scores were calculated as the sum of the individual scores in one category divided by 26. Among these homologs, VEF-L36 showed the lowest identity to other members (average score: 8), followed by FIS2 (average score: 17). On the other hand, both showed higher identity to VRN2 than to other EMF2/VRN2 homologs (pair-wise alignment score between VEF-L36 and VRN2: 14; pair-wise alignment score between FIS2 and VRN2: 31). The average pair-wise alignment score of other EMF2/VRN2 members was ~44, calculated as the sum of the average scores (excluding 8 and 17) divided by 25.
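The averaging described in the note to Table 2 can be sketched in a few lines of Python. This is an illustrative reconstruction only, not the script used for the paper: the function name `average_scores` and the lower-triangular list layout are assumptions chosen to mirror the table. The first example uses the first four rows of Table 2; the second reproduces the average for AaEMF2 from its 26 pair-wise scores.

```python
def average_scores(names, lower_triangle):
    """Return {name: mean pair-wise score against every other sequence}.

    lower_triangle[i] holds the scores of sequence i against sequences
    0..i-1, mirroring the lower-triangular layout of Table 2.
    """
    n = len(names)
    if n < 2 or len(lower_triangle) != n:
        raise ValueError("need at least two sequences and one row per sequence")
    totals = [0.0] * n
    for i, row in enumerate(lower_triangle):
        if len(row) != i:
            raise ValueError(f"row {i} should have {i} scores, got {len(row)}")
        for j, score in enumerate(row):
            # Each score counts once for both members of the pair.
            totals[i] += score
            totals[j] += score
    return {name: total / (n - 1) for name, total in zip(names, totals)}


# First four rows of Table 2, used only as a small illustration.
names = ["AaEMF2", "AoEMF2", "AtEMF2", "CpEMF2"]
lower = [[], [61], [54, 55], [57, 58, 68]]
subset_means = average_scores(names, lower)

# Scores of AaEMF2 against the other 26 sequences (column 1 of Table 2).
aa_column = [61, 54, 57, 53, 52, 18, 42, 52, 42, 58, 40, 45, 53,
             25, 24, 26, 57, 56, 61, 53, 51, 8, 46, 61, 51, 55]
aa_average = sum(aa_column) / 26  # rounds to the tabulated value of 46
```

With 27 sequences, each average is a sum of 26 scores divided by 26, which is why the divisor in the note is 26 rather than 27.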


Figure 2. Alignment of Three Domains of Predicted Full-Length Plant VEF Proteins.


Cs line up with those in EMF2, but the two Hs are absent (Supplemental Figure 2D).

E15–17 Domain

E15–17 is a region encoded by EMF2 exons 15 to 17, connecting the C2H2 and VEF domains of EMF2, VRN2, and FIS2. Alignment of the EMF2 homologs shows that this region has the highest variability of all EMF2 domains in both amino acid sequence composition and total length, suggesting intensive diversification including multiple insertion and/or deletion events during the evolution of this region (Supplemental Figure 1D). All three Physcomitrella sequences, including PpEMF2_3, appear to possess this region.

VEF (C-terminal) Domain

Alignment of C-terminal sequences of EMF2, VRN2, FIS2, Su(z)12, and the human KIAA0160 led Yoshida et al. (2001) to define an acidic-W/M domain, ~130 aa from exons 18–22 in Arabidopsis EMF2, which is characterized by an acidic cluster and a sequence rich in tryptophan and methionine. A smaller region was later called the VEF domain, derived from the initials of VRN2, EMF2, and FIS2 (Birve et al., 2001), which did not include sequences in exon 18, but extended beyond that of the acidic-W/M domain (Figure 2). In this paper, we adopt a broader sense of the VEF domain, encompassing both the acidic-W/M domain defined by Yoshida et al. (2001) and the VEF domain defined by Birve et al. (2001) (Supplemental Figure 1E–1G).

VRN2 has an additional 52 aa C-terminal to the VEF domain (Supplemental Figure 1G and Figures 1 and 2C) that is not shared with other EMF2-class proteins, including VRN2-like sequences, full-length or partial, from plants other than Arabidopsis. Analysis using RADAR (www.ebi.ac.uk/Radar/) suggests that this 52-aa region is a duplication of a stretch of amino acids found within the VEF domain (Supplemental Figure 1G).

Selaginella SdEMF2p corresponds to the VEF domain (Supplemental Figure 2A). All three Physcomitrella sequences and the two partial gymnosperm sequences possess the VEF domain (Supplemental Figure 2B–2D). None of the VEF domains found in Physcomitrella, Selaginella, pine, or spruce possesses the VRN2-characteristic repeat sequence in its C-terminus, indicating that this repeat likely evolved in angiosperms after the divergence of the gymnosperm lineage. Among the three moss sequences, PpEMF2_3 is the most similar to EMF2 in that it possesses the N-ter cap, E5–10, C2H2-like, and VEF regions.

Phylogenetic Analysis of Full-Length and VEF Sequences

Phylogenetic analysis of the full-length sequences using maximum likelihood and Bayesian methods recovered various lineages reflecting organismal evolution (Figures 3 and 4). Using human and Drosophila sequences as outgroups, phylogenetic analyses of full-length sequences (Figure 3) and VEF domain alone (Figure 4) both recovered a monophyletic angiosperm lineage with monophyletic monocot and eudicot clades.

Within the monocots, the grasses (Poales) were also recovered as monophyletic in both full-length and VEF-based gene trees. For VEF domain analyses containing greater sampling of land plant diversity, gymnosperms were found to be monophyletic and sister to angiosperms, Selaginella sister to an angiosperm plus gymnosperm clade, and Physcomitrella sequences sister to remaining land plants. As with full-length sequences, monocots are recovered as monophyletic; however, Eschscholzia, unresolved in the full-length analysis, groups with the Aquilegia VEF domain (Figure 4), forming a basal eudicot clade sister to monocots. This clade is unresolved with respect to monocots and core eudicots. Within monophyletic core eudicots, the asterids and rosids largely fall out as separate clades, with a few exceptions (e.g. Silene within the rosid clade, two sequences of Gossypium recovered as sister to the rosid plus asterid sister group, Lotus japonicus within an otherwise monophyletic asterid clade, and one Helianthus sequence falling within the rosids rather than the asterids).

In addition, several sequences from core eudicot species are resolved in a clade containing VRN2, FIS2, and VEF-L36 (Figure 4). This clade is distant from AtEMF2, indicating a different evolutionary history for the VEF domain of VRN2, FIS2, and VEF-L36. In the full-length analyses, PtEMF2_4 or VEF904, a proposed VRN2 ortholog from Populus, is strongly supported within a VRN2 clade reflecting potential homology (or full-length sequence conversion) of the Populus sequence with VRN2. In the VEF domain analyses, this Populus sequence groups with other Populus sequences rather than with the VRN2 clade, indicating that the VEF domain itself is not converging on a VRN2-like VEF domain, despite full sequence and domain-level similarity. Another potential VRN2 ortholog, Medicago truncatula’s MtEMF2p, lacking the E5–10 domain and the N-ter cap, is grouped in the VRN2 clade. It remains

The T-COFFEE (Version 4.85) program was used for the sequence alignment. Vertical lines on top of the sequence mark the boundaries of EMF2 exons, and the arrows and numbers prefixed with an E on top of the sequence indicate EMF2 exons.

(A) N-ter domain. Light-blue bar on top of the sequence marks the N-ter domain. Colorless horizontal bar marks the N-ter cap. Dark-blue bar marks the N-terminal domain defined by Yoshida et al. (2001).

(B) C2H2 domain. Green bar on top of the sequence marks the C2H2 domain defined by Yoshida et al. (2001). Numbers –1, +3, and +6 denote the position relative to the start site of the α-helix of the C2H2 domain.

(C) VEF domain. Red and yellow horizontal bars on top of the sequence mark the C-terminal domain defined by Yoshida et al. (2001) and the VEF domain defined by Birve et al. (2001), respectively. Because VEF-L36 only shares VEF with other homologs, its middle and C-terminal sequences were cut off.


to be tested whether other homologs with VRN2-like domain structure will have their VEF sequence converge with the AtVRN2 VEF amino acid sequence.

Sequence Relationship between VEF-L36 and EMF2/VRN2

VEF-L36 cDNA was deduced from Arabidopsis genomic sequence (TAIR: www.Arabidopsis.org/servlets/TairObject?id=128616&type=locus) but has not been assayed for function. The 1872-bp open reading frame encodes a predicted 623-aa protein, with the 125-aa VEF located at the N-terminus and a 113-aa C-terminus with only low sequence similarity to L36. The RADAR program detected three types of repeat sequence in the middle region of VEF-L36 (Figure 1 and Supplemental Figure 3A). Except for the VEF domain, VEF-L36 shares no other domains with the other three Arabidopsis VEF proteins. Using its 495-aa sequence without the VEF domain to BLAST search against GenBank, we found three Arabidopsis fragments and one rice homolog, as well as a few sequences in non-plant organisms, such as Drosophila, Dictyostelium, Danio, and Trypanosoma, all lacking the VEF domain (Supplemental Figure 3B). The rice homolog encodes a 410-aa protein with low global homology to the non-VEF part of VEF-L36 (22% identity and 37% similarity, Supplemental Figure 3C). To date, VEF-L36 is the only gene found with both VEF and L36 domains.

The VEF domain of VEF-L36 is more closely related to that of VRN2 than to that of EMF2, as indicated by phylogenetic analyses of both the VEF domain alone and of full-length sequences (Figures 3 and 4). Among the divergent amino acids between EMF2 and VRN2, VEF-L36 shares nine with VRN2 and only three with EMF2 (Table 3). Moreover, VRN2 (AT4G16845) and VEF-L36 (AT4G16810) are closely linked on Arabidopsis chromosome 4. Among the VEF-domain-containing proteins, the VEF domain in VEF-L36 is the only one located at the N-terminus of a protein. Together, these observations suggest that the VEF domain of VEF-L36 may have been transferred from VRN2 on a sister chromatid, through an accidental intronic recombination event during meiosis (Figure 5C). This would imply that only plants with VRN2 may generate VEF-L36. So far, VEF-L36 has only been identified from Arabidopsis.

Sequence Relationship between FIS2 and EMF2/VRN2

FIS2 is similar to EMF2/VRN2 in possessing a single C2H2 and the VEF domain, which are connected by a 459-aa region with 70 serines, called the S-rich domain. In addition to the two types of repeats identified (Luo et al., 1999), RADAR identified a third type of repeat in the S-rich domain (Supplemental Figure 4A). Sequences homologous to the S-rich domain have been found in plants, fungi, bacteria, and animals, but none share the C2H2 or VEF domains with FIS2. Despite the abundance of the S-rich homologous domain in nature, the uniqueness/rareness of the S-rich domain in the VEF-domain-containing protein family suggests that FIS2 may represent a unique evolutionary event within the Arabidopsis lineage.

The C2H2 domain of FIS2 has greater sequence similarity to VRN2 than to EMF2 (Table 3). The VEF domain of FIS2 shows

Figure 3. Phylogenetic Analysis of Full-Length VEF Protein Homologs.

Phylogeny of EMF2/VRN2 using Bayesian inference; average branch lengths are shown. Measures of support are given at the nodes: Bayesian posterior probability (PP)/maximum likelihood bootstrap support (BS). Support values less than 50 are shown as a hyphen "-" and support values of 100 are shown as "+".


Figure 4. Phylogenetic Analysis of VEF Domain Sequences.

Phylogeny of VEF domain using maximum likelihood as implemented in RAxML. Measures of support are given at the nodes: Bayesian posterior probability (PP)/maximum likelihood bootstrap support (BS). Support values less than 50 are shown as hyphens (-). Taxonomic groups are indicated at right, with exceptions described in the text.


a closer phylogenetic relationship to the VEF domain of VRN2 than to that of EMF2 (Figure 4), forming a clade with the VRN2 sequence, indicating common ancestry to the exclusion of EMF2. Among the amino acids diverged between EMF2 and VRN2, FIS2 shares 20 identical amino acid residues with VRN2 and only eight with EMF2 in the VEF domain (Table 3). Globally, FIS2 shared a higher pair-wise alignment score with VRN2 than with EMF2 (31 vs. 16%; Table 2).

DISCUSSION

The VEF domain is found in chromatin proteins required for gene silencing throughout eukaryotic organisms. In addition to the universal VEF domain, the VEF proteins possess other characteristic domains that distinguish them from one another. Based on domain organization, four Arabidopsis VEF proteins were grouped into three classes: EMF2/VRN2, FIS2, and VEF-L36 (Figure 1). Our analysis of homologous sequences throughout land plants indicates the existence of EMF2 in early diverging lineages of land plants (bryophytes and lycophytes) and suggests the presence of an ancestral EMF2-like gene in early land plants. Phylogenetic results (Figures 3 and 4) are consistent with the hypothesis that VRN2 was likely derived from an EMF2-like ancestor within the angiosperms, and that FIS2 and VEF-L36 were secondarily derived from a VRN2-like ancestral sequence in Arabidopsis. Current phylogenetic hypotheses are limited in taxon sampling and in character sampling, constrained by currently available sequences that are not equally distributed across angiosperm evolution and may not represent complete genomic data for all species sampled. Such limitations reduce overall phylogenetic resolution and make it difficult to assign orthology and paralogy to the available sequences in the face of multiple gene and genome duplication events spanning angiosperm evolution. However, given current sampling, our phylogenetic results indicate that EMF2-like genes in angiosperms demonstrate an evolutionary history largely consistent with the taxonomic history of the plants in which they are found.

Proposed Evolution of VEF Genes

The EMF2/VRN2 class proteins show strong sequence similarity despite modified domain structure. Sequences with the EMF2-like domain structure are widespread, found in animals and most vascular plants. Sequences with the VRN2-like domain structure, that is, sequences that lack the N-ter cap and E5–10 like VRN2, have only been identified in poplar (PtEMF2_4), pepper (CaEMF2p), alfalfa (MtEMF2p), and soybean (GmEMF2p_3) (Table 1). In Arabidopsis, EMF2 is an essential gene, as evidenced by the short-lived and sterile nature of the emf2 mutants. VRN2 promotes vernalization-mediated flowering and vrn2 mutants flower late, but the loss of VRN2 is not lethal (Gendall et al., 2001). Alternative vernalization mechanisms that do not utilize a putative Arabidopsis VRN2 ortholog have evolved in other species (Yan et al., 2004) and may be present in

Table 3. Number of Amino Acids Shared between FIS2/VEF-L36 and VRN2 or EMF2*.

Domain         Identical aa between    Identical aa between    Identical aa between    Identical aa between
               FIS2 and VRN2           FIS2 and EMF2           VEF-L36 and VRN2        VEF-L36 and EMF2

C2H2 domain    20/131                  8/131                   na                      na

VEF domain     20/116                  5/116                   9/98                    3/98

* Among the divergent amino acids between EMF2 and VRN2, the number of aa shared with EMF2 or VRN2 out of the total number of aa in the domain. na, not applicable.
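The counting rule in the note to Table 3 can be illustrated with a short Python sketch. It is a hypothetical reimplementation, not the procedure used for the paper: the function name `shared_divergent`, the gap handling, and the toy aligned fragments are all assumptions introduced for illustration. Only columns where EMF2 and VRN2 differ are considered, and the query residue is scored as matching one, the other, or neither.

```python
def shared_divergent(emf2, vrn2, query, gap="-"):
    """Count query residues matching VRN2 or EMF2 at EMF2/VRN2-divergent columns.

    All three strings must be rows of the same alignment (equal length).
    Returns (shared_with_vrn2, shared_with_emf2).
    """
    if not (len(emf2) == len(vrn2) == len(query)):
        raise ValueError("aligned sequences must have equal length")
    with_vrn2 = with_emf2 = 0
    for e, v, q in zip(emf2, vrn2, query):
        if gap in (e, v, q) or e == v:
            continue  # skip gapped columns and columns where EMF2 and VRN2 agree
        if q == v:
            with_vrn2 += 1
        elif q == e:
            with_emf2 += 1
    return with_vrn2, with_emf2


# Toy aligned fragments invented for illustration (not real VEF sequences).
emf2_row = "MKLAEDW-TR"
vrn2_row = "MRLSEDWQSR"
query_row = "MRLSDDW-TK"
counts = shared_divergent(emf2_row, vrn2_row, query_row)
```

In Table 3 the denominators are the total number of residues in the domain, so a table entry would pair each count with the domain length rather than with the number of divergent columns.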

Figure 5. Model on VRN2, FIS2, and VEF-L36 Evolution.

(A) Proposed VRN2 evolution from EMF2. (B) FIS2 evolution from VRN2. (C) VEF-L36 evolution from VRN2.


Arabidopsis as well. While every plant sequenced thus far has at least one copy of EMF2, VRN2 is found only infrequently. The dispensable nature of VRN2 may result in its lower frequency of occurrence throughout land plants. Based on our data, it is likely that VRN2 can arise from a duplication of an EMF2-like ancestor. Once an additional EMF2 copy is present, one of the copies is no longer under strong selection and is able to diverge, potentially resulting in a VRN2-like sequence. Under this scenario, VRN2-like sequences could arise multiple times and independently following any duplication event that included the EMF2 gene. Similarity in domain structure and amino acid composition could then be the result of convergent evolution.

Genes possessing all domains found in EMF2 exist in insects and mammals (Yoshida et al., 2001; Schuettengruber et al., 2007). It can be argued, based on the presence of EMF2-like genes in animals, lycophytes, bryophytes, gymnosperms, and angiosperms, that early land plants shared an ancestral sequence having the domain structure found in modern copies of EMF2. As the gene or genome duplicated, VRN2 may have arisen from a duplication of the ancestral EMF2 (Figure 5A), followed by subsequent loss of the N-ter cap and the E5–10 domain, and the acquisition of the 52-aa C-terminal repeat. The presence of intermediary forms with partial domain structure suggests a potential step-wise evolution of VRN2 from an EMF2-like sequence. Among the full-length and partial sequences from 20 angiosperm families used in this analysis, 20 sequences contain the complete N-ter domain (Figure 2A and Supplemental Figure 1A), nine lack the N-ter cap only (Intermediary molecule #1 in Figure 5A), and four lack both the N-ter cap and the E5–10 domain (Intermediary #2 in Figure 5A; Figure 2 and Supplemental Figure 1B) but do not contain a VEF repeat. So far, no sequence that lacks E5–10 but contains the N-ter cap has been found, suggesting that the N-ter cap may need to be lost first in order for the E5–10 domain to be lost. Finally, only one VRN2-like sequence, Arabidopsis VRN2, possesses the C-terminal repeat (Supplemental Figure 1G).

Based on the frequency of the intermediary forms and results from phylogenetic analyses, we propose a three-step hypothesis for the evolution of VRN2 from a parental EMF2 following gene duplication (Figure 5A). In the first step, EMF2 loses the N-ter cap, resulting in Intermediary molecule #1. This could be achieved by mutation of the first ATG, rendering the second ATG the translation start site. In the second step, Intermediary #1 loses the E5–10 domain, resulting in Intermediary molecule #2. This could be achieved by mutation of the splice sites within exons 5–10, resulting in exon skipping (Hayashi et al., 1991). In the third step, Intermediary #2 gains a C-terminal repeat, resulting in the backbone of VRN2. Currently, this third step has only been observed in Arabidopsis. The importance of the 52-aa VEF repeat to VRN2 function remains to be tested, but the intermediate sequences may represent intermediate forms that could be in the process of evolving the VRN2 function. Comparison of structure and function between these sequences and VRN2 will be required to better understand the relationships of these genes.
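The three-step hypothesis amounts to a simple decision rule on domain presence and absence, which the following Python sketch makes explicit. It is an illustration only: the function `classify`, its boolean arguments, and the placeholder entry `hypothetical_capless` are assumptions, while the domain calls for AtEMF2, PtEMF2_4, MtEMF2p, and AtVRN2 follow the descriptions given in the text.

```python
from collections import Counter


def classify(has_nter_cap, has_e5_10, has_vef_repeat):
    """Map domain presence/absence onto the stages drawn in Figure 5A."""
    if has_nter_cap and has_e5_10:
        return "EMF2-like"
    if not has_nter_cap and has_e5_10:
        return "Intermediary #1"
    if not has_nter_cap and not has_e5_10:
        return "VRN2-like" if has_vef_repeat else "Intermediary #2"
    # N-ter cap present but E5-10 absent: not found in any sequence so far.
    return "unobserved (N-ter cap without E5-10)"


# (N-ter cap, E5-10, C-terminal VEF repeat) for a few sequences.
# "hypothetical_capless" is an invented placeholder for a sequence
# that lacks only the N-ter cap.
observations = {
    "AtEMF2": (True, True, False),
    "hypothetical_capless": (False, True, False),
    "PtEMF2_4": (False, False, False),
    "MtEMF2p": (False, False, False),
    "AtVRN2": (False, False, True),
}
stages = {name: classify(*flags) for name, flags in observations.items()}
tally = Counter(stages.values())
```

Tallying the stages over a full data set would give counts comparable to those reported above (20 with a complete N-ter domain, nine lacking the N-ter cap only, and four lacking both the cap and E5–10).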

The proposed process could happen sequentially, resulting in independent derivations of a VRN2-like sequence from an EMF2-like ancestor multiple times throughout plant evolution. Convergence of the VEF domain among the VRN2-like sequences may occur concurrently with the losses of domains during steps 1 and 2, or may occur following these structural changes due to selection on the resulting gene sequence. This latter case assumes that independently evolved VRN2 sequences would converge upon a particular function, with selection then acting in a similar manner on the individual VEF domains. Studies demonstrating the function of VRN2-like sequences in plants in which they are found would be required to understand the selection events leading to convergence of sequence data. More complete genomic and taxonomic sampling focused on VRN2-like sequences will enable us to test for possible differences in selection on the VRN2 clade in comparison with various recovered EMF2 clades.

The presence of the VEF repeat only in Arabidopsis VRN2 indicates that it may be a lineage-specific event. In this case, the ancestral VRN2 in the most recent common ancestor of Arabidopsis and Populus would not have had the VEF repeat, and the repeat was subsequently gained in the lineage leading to Arabidopsis after its divergence from the eudicot lineage leading to Populus. Phylogenetic analysis showed that the full-length Populus and Arabidopsis VRN2-like sequences are in the same clade, despite the lack of the VEF repeat in PtEMF2_4. However, in the analysis of the VEF domain alone, the VEF of PtEMF2_4 remained in the same clade as that of PtEMF2_1 and PtEMF2_2, suggesting stabilizing selection on the VEF domain in Populus since the duplication event leading to the Populus EMF2/VRN2-like divergence. This indicates that overall domain architecture of the EMF2 gene is evolving independently from within-domain protein structure, at least for the VEF domain. Studies investigating evidence for directional selection on the VEF domain following duplication of EMF2 will be helpful to assess the likelihood of VRN2 evolution following gene or genome duplication.

Phylogenetic analysis and sequence similarity comparison clearly demonstrate that the VEF domain of VEF-L36 is more closely related to that of VRN2 than to that of EMF2 (Table 3 and Figures 3 and 4). Similarly, both the C2H2 and VEF domains of FIS2 are more closely related to those of VRN2 than to those of EMF2 (Table 3 and Figures 3 and 4). These findings support the derivation of FIS2 and VEF-L36 from VRN2; only plants that have evolved VRN2 could generate sequences like Arabidopsis FIS2 and VEF-L36. FIS2 is an essential gene in Arabidopsis, but has not yet been identified in other plants, including plants with full genome sequences. FIS2 is specifically expressed in the gametophyte of Arabidopsis and prevents endosperm development prior to fertilization (Luo et al., 1999, 2000). A search against cDNA libraries constructed from various angiosperm flowers did not result in any FIS2-like homologs. In plants that did not evolve VRN2, EMF2-like or alternative sequences may have evolved to prevent endosperm development without fertilization. Alternatively, genes with functional but without


sequence conservation (Calonje et al., 2008) may have evolved to take the place of FIS2. The presence of FIS2 and VEF-L36 should be investigated across Brassicaceae and its sister family, Capparaceae (Hall et al., 2002), in order to localize the potential duplication events leading to the evolution of these sequences from a hypothetical VRN2-like ancestral sequence. FIS2 may have diverged from a duplicated VRN2, while VEF-L36 may have evolved via a translocation of a VEF domain donated by VRN2 (Figure 5B and 5C).

PRC2 components play important roles in animal development, notably in insects and mammals (Schuettengruber et al., 2007). Some animal VEF protein sequences in the database possess all domains found in Su(z)12; others possess only the VEF and C2H2, or only the VEF domain. Indeed, nematode has a sequence that shares the C2H2 and VEF domains with Su(z)12 (see GenBank’s protein databases). Protein sequence alignment based on identity/similarity did not identify any animal protein with the VEF domain linked to FIS2’s S-rich or VEF-L36’s L36 domain, despite the abundance of S-rich and L36 in nature. A comprehensive evolutionary analysis of animal VEF-containing proteins is beyond the scope of the present study. However, gene duplication and domain deletion/insertion/rearrangement apparently occurred during the evolution of animal VEF proteins as well. For example, mouse has one, chimpanzee has three, and zebrafish has two VEF protein homologs. Some animal homologs possess the N-terminal sequence, while others do not; and some domains specific to certain animals can be identified (data not shown). Gene fusion involving the human VEF homolog would lead to neoplastic tumor growth (Li et al., 2008). Future investigation of domain architectures in animal VEF proteins would provide insights into the evolutionary trends of VEF proteins in plants versus those in animals.

Dynamic Changes during VEF Gene Evolution

The evolution of the VEF genes in plants is characterized by the mobility of the VEF domain, duplication, and functional divergence of homologous sequences. In addition to its diverse location in the genome, a VEF domain can be located in the N- or C-terminus within a genetic locus. A VEF-domain-containing gene may even lose the VEF domain, as in the case of HvEMF2_1. These phenomena indicate that the VEF domain functions like a mobile functional module that plays a major role in protein evolution, facilitated by intronic recombination or exon shuffling (Patthy, 1996; Kolkman and Stemmer, 2001).

The dynamic genetic changes that occurred during the evolution of this small gene family caused varying degrees of divergence in sequences located between the conserved domains. For example, a region encoded by EMF2 exon 15 through exon 17 (E15–17), flanked by the highly conserved C2H2 zinc finger and VEF domains, is a region with the lowest conservation among the EMF2/VRN2 class homologs (Supplemental Figure 1D). While the ends use identical or similar amino acids and have almost no length variation, the center of the gene region encoded by exon 16 and the 5’ end of exon 17 requires indels for multiple sequence alignment representing up to 20–70 aa in length difference. The gradient in the degree of similarity, from highly divergent at the center region to highly conserved at the 5’ and 3’ ends, may be informative in plant phylogenetics. Finally, we note that the VEF gene tree reflected our best understanding of the organismal tree for included taxa (Grass Phylogeny Working Group, 2001). Regions with high levels of variability combined with low copy number may render EMF2, particularly the E15–17 domain, a useful phylogenetic tool for evaluating the evolutionary relationships of plants across both deep and shallow nodes.

METHODS

Identification of Sequences and Domains of VEF Genes across Land Plants

Full-length EMF2 putative protein sequence was used to BLAST (Basic Local Alignment Search Tool) search against the following databases: GenBank (www.ncbi.nih.gov/), TIGR/JCVI (www.tigr.org/), the Floral Genome Project (http://fgp.bio.psu.edu/fgp/), Plant Genome DataBase (www.plantgdb.org/), the moss genome (www.cosmoss.org/, http://genome.jgi-psf.org/Phypa1_1/Phypa1_1.home.html), the papaya genome (http://tinyurl.com/3ua95v), the pine EST database (http://fungen.botany.uga.edu/), the Plant Genome Network (http://pgn.cornell.edu/cgi-bin/blast/blast_search.pl), Brassica (http://ukcrop.net/), SOL Genomics Network (www.sgn.cornell.edu/), the poplar genome (http://genome.jgi-psf.org), the Chromatin database (www.chromdb.org/), and the Selaginella genome (http://selaginella.genomics.purdue.edu). Sequences with an e-value greater than 0.001 (non-significant homology) were eliminated, thereby eliminating all non-plant sequences. Plant sequences containing intact EMF2-like N-terminal, C2H2, and/or C-terminal domains were selected for further analysis. For identification of homologs of FIS2’s S-rich domain and VEF-L36’s L36 domain, S-rich domain and L36 domain amino acid sequences were used to BLAST search against the same databases listed above with an e-value cut-off of 0.001.
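The e-value criterion can be expressed as a one-line filter. The sketch below is a minimal illustration under stated assumptions: hits are represented as (subject identifier, e-value) pairs, the identifiers are invented placeholders, and the function name `filter_hits` is hypothetical. How the BLAST output was actually parsed is not described in the text.

```python
E_VALUE_CUTOFF = 0.001  # cut-off stated in Methods


def filter_hits(hits, cutoff=E_VALUE_CUTOFF):
    """Keep (subject_id, e_value) pairs whose e-value does not exceed the cutoff.

    Hits with an e-value greater than the cutoff are treated as
    non-significant and dropped.
    """
    return [(subject_id, e_value) for subject_id, e_value in hits
            if e_value <= cutoff]


# Invented placeholder hits, for illustration only.
hits = [
    ("hypothetical_hit_A", 3e-42),
    ("hypothetical_hit_B", 0.001),
    ("hypothetical_hit_C", 0.5),
]
kept = filter_hits(hits)
```

A hit exactly at the cut-off is retained here because the text eliminates only e-values greater than 0.001.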

Sequencing EMF2 Homolog cDNA

Plasmid cDNAs were extracted from bacterial culture according to the manufacturer’s protocols (QIAGEN Inc., Valencia, CA 91355, USA). M13 rev (5'-GGAAACAGCTATGACCATG-3') and M13 (–20) (5'-GTAAAACGACGGCCAG-3') primers were used for sequencing, with the following internal primers used as necessary to obtain full sequences: Acorus: 5'-CTCAGTAGAGCATGTCTGCTG-3', 5'-CCCATGCAATCGTGAGAATGC-3', 5'-TGACACGCTGAAAGATGATG-3', 5'-CATTAACTGCCTGATACTCTTC-3'; Asparagus: 5'-CAATACGGAATCCATCATTTCTGC-3', 5'-CTTGCTCCAATGCCATTGGC-3'; Nuphar: 5'-GATGAGGTCGATGATGATATTGC-3', 5'-CTGCCAAAACCCGCTGTTTC-3'; Yucca: 5'-GTCAATCGGGCATGTATACTG-3', 5'-CTTGCTCCAACGCCATTGGC-3'; Eschscholzia 8.1: 5'-GCTGATTACAAGGAACAGACTG-3',

Chen et al. d Molecular Evolution of VEF-Domain-Containing PcG Genes | 751

Page 15: Molecular Evolution of VEF-Domain-Containing Genes in Plants · Identifying genes that act in developmental pathways and de-termining how they or their interactions are modified

5#-CACGGAACATGACCATCTGC-3’;Eschscholzia8.2:5#-GAGGAAT-

GACAGGGTGGAAGC-3#, 5#-GTTCCAGAGATGCATAATCCTTG-3’;

Tomato: 5#-GCTTTGCCGAACTTGCCAG-3#, 5#-CCCTATGAGAATG-

AAAGAATTGCC-3#.

Sequence Alignment

T-coffee (www.ebi.ac.uk/t-coffee/) was used to produce a global amino acid alignment using the default values for protein alignment. RADAR (www.ebi.ac.uk/Radar/) was used to detect de novo repeat regions in EMF2 homologous sequences. Classification of VEF subgroups was based on domain organization in the aligned sequences.

The full-length VEF homologs were aligned using T-coffee, and pair-wise distance scores were calculated with ClustalW (version 1.83, http://www.ebi.ac.uk/Tools/Radar/) as the number of identities in the best alignment divided by the number of residues compared (gap positions excluded). Scores were initially calculated as percent identities and were converted to distances by dividing by 100 and subtracting the result from 1.0, giving the proportion of differing sites. No correction for multiple substitutions was performed.
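The conversion from percent identity to an uncorrected distance, d = 1 - (percent identity / 100), can be illustrated with a short Python sketch. It is an assumption-laden simplification: it operates on two sequences that are already aligned to equal length, whereas ClustalW derives its scores from its own pairwise alignments. The function names and the example sequences are hypothetical.

```python
def percent_identity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Percent identity over compared positions, excluding gap positions."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # positions with a gap in either sequence are not compared
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return 100.0 * identical / compared


def uncorrected_distance(seq_a: str, seq_b: str) -> float:
    """Distance = 1 - identity/100, with no multiple-substitution correction."""
    return 1.0 - percent_identity(seq_a, seq_b) / 100.0


if __name__ == "__main__":
    # Hypothetical aligned fragments: 5 compared sites, 4 identical.
    print(percent_identity("MKT-LV", "MRTALV"))      # 80.0
    print(uncorrected_distance("MKT-LV", "MRTALV"))  # approximately 0.2
```

Because no correction for multiple substitutions is applied, this distance underestimates divergence between distantly related sequences.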

Phylogenetic Analysis

EMF2/VRN2 Full-Length Sequence Data

The T-coffee alignment was used for phylogenetic analysis because its primary homology statements agreed best with prior knowledge of functional domain architecture. For example, the N-terminus-located C2H2 domain of FIS2 aligned with the EMF2 N-terminal domain when using MegAlign or ClustalW, whereas, in T-coffee, the annotated C2H2 domains aligned with one another across all sequences.

Bayesian phylogenetic analyses on aligned full-length sequences were performed with MrBayes v. 3.1.2 (Huelsenbeck and Ronquist, 2001; Ronquist and Huelsenbeck, 2003). The model of protein evolution that best fit the protein sequence data was selected using the Akaike information criterion (AIC) as implemented in ProtTest 2.0 (Abascal et al., 2005). The best-scoring model for the EMF2/VRN2 full-length alignment was the Jones-Taylor-Thornton (JTT) probability model (Jones et al., 1992), with rate variation among sites modeled as a gamma distribution (+G), and global rearrangements were sampled with a random order of input sequences. Posterior probabilities of the generated trees were approximated using a Markov chain Monte Carlo (MCMC) algorithm with four incrementally heated chains (T = 0.2) for 5 000 000 generations, sampling trees every 100 generations. Two independent runs were conducted simultaneously for each dataset, the default setting in MrBayes v. 3.1.2. Following completion, the sampled trees from each analysis were plotted against their log-likelihood scores to identify the point at which log-likelihood scores reached a maximum value. All trees prior to this point were discarded as the burn-in phase, all post-burn-in trees from each run were pooled, and a 50% majority-rule consensus tree was calculated to obtain a topology with average branch lengths as well as posterior probabilities as indicators of support for all resolved nodes.
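With sampling every 100 generations over 5 000 000 generations, each run yields roughly 50 000 sampled trees. The pooling and 50% majority-rule steps described above can be sketched as follows. This is only an illustration of the idea, not MrBayes's implementation: each sampled tree is reduced to a set of clades (a hypothetical representation), the burn-in length is supplied by the caller rather than read from a log-likelihood plot, and branch lengths are ignored.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Sequence

Clade = FrozenSet[str]   # the set of taxa below one internal node
Tree = FrozenSet[Clade]  # a sampled tree reduced to its set of clades


def pooled_clade_support(runs: Sequence[Sequence[Tree]], burn_in: int) -> Dict[Clade, float]:
    """Discard the first `burn_in` samples of each run, pool the remaining
    trees, and return the fraction of pooled trees containing each clade."""
    pooled: List[Tree] = []
    for samples in runs:
        pooled.extend(samples[burn_in:])
    if not pooled:
        raise ValueError("no trees left after burn-in")
    counts = Counter(clade for tree in pooled for clade in tree)
    return {clade: n / len(pooled) for clade, n in counts.items()}


def majority_rule_clades(support: Dict[Clade, float], threshold: float = 0.5) -> Dict[Clade, float]:
    """Keep clades found in more than `threshold` of the pooled trees."""
    return {clade: p for clade, p in support.items() if p > threshold}


if __name__ == "__main__":
    # Toy example with four hypothetical taxa A-D and two short runs.
    ab, ac, bd = frozenset({"A", "B"}), frozenset({"A", "C"}), frozenset({"B", "D"})
    abc = frozenset({"A", "B", "C"})
    t1, t2, early = frozenset({ab, abc}), frozenset({ac, abc}), frozenset({bd})
    runs = [[early, t1, t1], [early, t1, t2]]
    support = pooled_clade_support(runs, burn_in=1)
    for clade, p in majority_rule_clades(support).items():
        print(sorted(clade), p)
```

The retained frequencies play the role of the posterior probabilities reported as node support on the consensus tree.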

VEF Domain Sequence Data

The VEF domain, a region held in common by EMF2, VRN2, FIS2, and VEF-L36, was used to estimate the phylogenetic relationships among VEF gene sequences across land plants. Protein alignment of the VEF domain was performed with MUSCLE, resulting in a multiple sequence alignment of about 130 aa. ProtTest 2.0 was also used to determine the model of evolution that best fit the VEF domain alignment. The best-scoring model for the VEF alignment was also JTT+G, and global rearrangements were sampled with a random order of input sequences. Bayesian and maximum likelihood methods of phylogenetic inference were conducted on the VEF domain alignment using MrBayes (tree not shown) and RAxML-VI-HPC (Stamatakis, 2006), respectively. The analyses were performed on the computer cluster of the Cyber-Infrastructure for Phylogenetic Research project (CIPRES, www.phylo.org) at the San Diego Supercomputer Center. Clade support was assessed with nonparametric bootstrapping (Felsenstein, 1985), as implemented in RAxML-VI-HPC, based on 100 replicates. The tree with the highest log-likelihood score from the RAxML analysis was chosen for representation here.
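The resampling step behind nonparametric bootstrapping can be sketched in Python. This is a minimal illustration under stated assumptions, not RAxML's code: alignment columns are drawn with replacement to build each pseudo-replicate, and the tree inference that RAxML performs on every replicate is omitted. The function names, the seed, and the toy alignment are hypothetical.

```python
import random
from typing import Dict, List


def bootstrap_alignment(alignment: Dict[str, str], rng: random.Random) -> Dict[str, str]:
    """Resample alignment columns with replacement, keeping the original length."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("alignment must be non-empty with equal-length sequences")
    n_cols = lengths.pop()
    # The same sampled column indices are applied to every sequence,
    # so each resampled column remains a true alignment column.
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}


def bootstrap_replicates(alignment: Dict[str, str], n_replicates: int = 100,
                         seed: int = 0) -> List[Dict[str, str]]:
    """Generate pseudo-replicate alignments (100 matches the text's replicate count)."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


if __name__ == "__main__":
    toy = {"seq1": "MKTALV", "seq2": "MRTALI"}  # hypothetical aligned fragments
    replicates = bootstrap_replicates(toy)
    print(len(replicates), replicates[0])
```

Support for a clade is then the percentage of replicate trees in which that clade appears.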

Accession Numbers

Novel full-length protein sequences generated for this study were deposited in GenBank with the following accession numbers: Yucca filamentosa EMF2 (YfEMF2, GenBank accession number (acc. #) ABD85300); Asparagus officinalis EMF2 (AoEMF2, acc. # ABD85301); Eschscholzia californica EMF2 (EcEMF2_2, acc. # ABD98791); Eschscholzia californica EMF2 (EcEMF2_1, acc. # ABD98790); Tomato EMF2 (LeEMF2_1, acc. # ABI99480); Acorus americanus EMF2 (AaEMF2, acc. # ABI99481).

SUPPLEMENTARY DATA

Supplementary Data are available at Molecular Plant Online.

FUNDING

This work was supported by NSF grant #IBN 0236399 and USDA grant #03–35301–13244 to Z.R.S.

ACKNOWLEDGMENTS

The authors thank Dr Hong Ma (Pennsylvania State University), the Floral Genome Project, and the SOL Genomics Network (www.sgn.cornell.edu/) for providing EMF2 homologous cDNA clones, Kazusa DNA Research Institute for providing the Lotus japonicus EMF2 sequence to Dr Rieko Nishimura, Dr Jo Ann Banks (National Science Foundation/Purdue University) for providing Selaginella EMF2 homologous EST sequences, Dr Ralph Quatrano (Washington University) for providing access to the Physcomitrella website, Drs Hong Ma and Damon R. Lisch (UC Berkeley) for comments on the manuscript, Steve Ruzin and Denise Schichnes (Bioimaging Facility, CNR, UC Berkeley) for image processing, and our laboratory members Myriam Calonje, Tiffany Tirtadinata, Robert Luan, Heather Driscoll, and Rosario Sanchez for help and support in preparation of this work. No conflict of interest declared.

REFERENCES

Abascal, F., Zardoya, R., and Posada, D. (2005). ProtTest: selection of best-fit models of protein evolution. Bioinformatics. 21, 2104–2105.

Birve, A., Sengupta, A.K., Beuchle, D., Larsson, J., Kennison, J.A., Rasmuson-Lestander, A., and Muller, J. (2001). Su(z)12, a novel Drosophila Polycomb group gene that is conserved in vertebrates and plants. Development. 128, 3371–3379.

Calonje, M., and Sung, Z.R. (2006). Complexity beneath the silence. Curr. Opin. Plant Biol. 9, 530–537.

Calonje, M., Sanchez, R., Chen, L., and Sung, Z.R. (2008). EMBRYONIC FLOWER1 participates in Polycomb group-mediated AG gene silencing in Arabidopsis. Plant Cell. 20, 277–291.

Cao, R., Wang, L., Wang, H., Xia, L., Erdjument-Bromage, H., Tempst, P., Jones, R.S., and Zhang, Y. (2002). Role of histone H3 lysine 27 methylation in Polycomb-group silencing. Science. 298, 1039–1043.

Chanvivattana, Y., Bishopp, A., Schubert, D., Stock, C., Moon, Y.H., Sung, Z.R., and Goodrich, J. (2004). Interaction of Polycomb-group proteins controlling flowering in Arabidopsis. Development. 131, 5263–5276.

Czermin, B., Melfi, R., McCabe, D., Seitz, V., Imhof, A., and Pirrotta, V. (2002). Drosophila enhancer of Zeste/ESC complexes have a histone H3 methyltransferase activity that marks chromosomal Polycomb sites. Cell. 111, 185–196.

Englbrecht, C.C., Schoof, H., and Bohm, S. (2004). Conservation, diversification and expansion of C2H2 zinc finger proteins in the Arabidopsis thaliana genome. BMC Genomics. 5, 39.

Felsenstein, J. (1985). Confidence limits on phylogenies: an approach using the bootstrap. Evolution. 39, 783–791.

Gendall, A.R., Levy, Y.Y., Wilson, A., and Dean, C. (2001). The VERNALIZATION 2 gene mediates the epigenetic regulation of vernalization in Arabidopsis. Cell. 107, 525–535.

Goodrich, J., Puangsomlee, P., Martin, M., Long, D., Meyerowitz, E.M., and Coupland, G. (1997). A Polycomb-group gene regulates homeotic gene expression in Arabidopsis. Nature. 386, 44–51.

Grass Phylogeny Working Group (Nigel P. Barker, Lynn G. Clark, Jerrold I. Davis, Melvin R. Duvall, Gerald F. Guala, Catherine Hsiao, Elizabeth A. Kellogg, and H. Peter Linder) (2001). Phylogeny and subfamilial classification of the grasses (Poaceae). Annals of the Missouri Botanical Garden. 88, 373–457.

Grossniklaus, U., Vielle-Calzada, J.P., Hoeppner, M.A., and Gagliano, W.B. (1998). Maternal control of embryogenesis by MEDEA, a polycomb group gene in Arabidopsis. Science. 280, 446–450.

Hall, J.C., Sytsma, K.J., and Iltis, H.H. (2002). Phylogeny of Capparaceae and Brassicaceae based on chloroplast sequence data. Amer. J. Bot. 89, 1826–1842.

Hayashi, S.I., Kunisada, T., Ogawa, M., Yamaguchi, K., and Nishikawa, S.I. (1991). Exon skipping by mutation of an authentic splice site of c-kit gene in W/W mouse. Nucleic Acids Res. 19, 1267–1271.

Hennig, L., Taranto, P., Walser, M., Schonrock, N., and Gruissem, W. (2003). Arabidopsis MSI1 is required for epigenetic maintenance of reproductive development. Development. 130, 2555–2565.

Huelsenbeck, J.P., and Ronquist, F. (2001). MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics. 17, 754–755.

Irish, V.F., and Benfey, P.N. (2004). Beyond Arabidopsis: translational biology meets evolutionary developmental biology. Plant Physiol. 135, 611–614.

Jiang, D., Wang, Y., Wang, Y., and He, Y. (2008). Repression of Flowering Locus C and Flowering Locus T by the Arabidopsis Polycomb Repressive Complex 2 components. PLoS One. 3, e3404.

Jones, D.T., Taylor, W.R., and Thornton, J.M. (1992). The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8, 275–282.

Kim, S., Yoo, M., Albert, V., Farris, J., Soltis, P.S., and Soltis, D.E. (2004). Phylogeny and diversification of B-function MADS-box genes in angiosperms: evolutionary and functional implications of a 260-million-year-old duplication. Amer. J. Bot. 91, 2102–2118.

Kinoshita, T., Harada, J.J., Goldberg, R.B., and Fischer, R.L. (2001). Polycomb repression of flowering during early plant development. Proc. Natl Acad. Sci. U S A. 98, 14156–14161.

Kohler, C., Hennig, L., Spillane, C., Pien, S., Gruissem, W., and Grossniklaus, U. (2003). The Polycomb-group protein MEDEA regulates seed development by controlling expression of the MADS-box gene PHERES1. Genes Dev. 17, 1540–1553.

Kolkman, J.A., and Stemmer, W.P. (2001). Directed evolution of proteins by exon shuffling. Nat. Biotechnol. 19, 423–428.

Kuzmichev, A., Nishioka, K., Erdjument-Bromage, H., Tempst, P., and Reinberg, D. (2002). Histone methyltransferase activity associated with a human multiprotein complex containing the Enhancer of Zeste protein. Genes Dev. 16, 2893–2905.

Li, J., Wang, J., Mor, G., and Sklar, J. (2008). A neoplastic gene fusion mimics trans-splicing of RNAs in normal human cells. Science. 321, 1357–1361.

Luo, M., Bilodeau, P., Koltunow, A., Dennis, E.S., Peacock, W.J., and Chaudhury, A.M. (1999). Genes controlling fertilization-independent seed development in Arabidopsis thaliana. Proc. Natl Acad. Sci. U S A. 96, 296–301.

Luo, M., Bilodeau, P., Dennis, E.S., Peacock, W.J., and Chaudhury, A. (2000). Expression and parent-of-origin effects for FIS2, MEA, and FIE in the endosperm and embryo of developing Arabidopsis seeds. Proc. Natl Acad. Sci. U S A. 97, 10637–10642.

Moon, Y.H., Chen, L., Pan, R.L., Chang, H.S., Zhu, T., Maffeo, D.M., and Sung, Z.R. (2003). EMF genes maintain vegetative development by repressing the flower program in Arabidopsis. Plant Cell. 15, 681–693.

Muller, J., Hart, C.M., Francis, N.J., Vargas, M.L., Sengupta, A., Wild, B., Miller, E.L., O’Connor, M.B., Kingston, R.E., and Simon, J.A. (2002). Histone methyltransferase activity of a Drosophila Polycomb group repressor complex. Cell. 111, 197–208.

Ohad, N., Yadegari, R., Margossian, L., Hannon, M., Michaeli, D., Harada, J.J., Goldberg, R.B., and Fischer, R.L. (1999). Mutations in FIE, a WD Polycomb group gene, allow endosperm development without fertilization. Plant Cell. 11, 407–416.


Patthy, L. (1996). Exon shuffling and other ways of module exchange. Matrix Biol. 15, 301–310; discussion 311–312.

Ronquist, F., and Huelsenbeck, J.P. (2003). MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics. 19, 1572–1574.

Schonrock, N., Bouveret, R., Leroy, O., Borghi, L., Kohler, C., Gruissem, W., and Hennig, L. (2006). Polycomb-group proteins repress the floral activator AGL19 in the FLC-independent vernalization pathway. Genes Dev. 20, 1667–1678.

Schuettengruber, B., Chourrout, D., Vervoort, M., Leblanc, B., and Cavalli, G. (2007). Genome regulation by Polycomb and Trithorax proteins. Cell. 128, 735–745.

Stamatakis, A. (2006). RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics. 22, 2688–2690.

Sung, S., and Amasino, R.M. (2004). Vernalization and epigenetics: how plants remember winter. Curr. Opin. Plant Biol. 7, 4–10.

Wood, C.C., Robertson, M., Tanner, G., Peacock, W.J., Dennis, E.S., and Helliwell, C.A. (2006). The Arabidopsis thaliana vernalization response requires a Polycomb-like protein complex that also includes VERNALIZATION INSENSITIVE 3. Proc. Natl Acad. Sci. U S A. 103, 14631–14636.

Yan, L., Loukoianov, A., Blechl, A., Tranquilli, G., Ramakrishna, W., SanMiguel, P., Bennetzen, J., Echenique, V., and Dubcovsky, J. (2004). The wheat VRN2 gene is a flowering repressor down-regulated by vernalization. Science. 303, 1640–1644.

Yang, C.H., Chen, L.J., and Sung, Z.R. (1995). Genetic regulation of shoot development in Arabidopsis: role of the EMF genes. Developmental Biol. 169, 421–435.

Yoshida, N., Yanai, Y., Chen, L., Kato, Y., Hiratsuka, J., Miwa, T., Sung, Z.R., and Takahashi, S. (2001). EMBRYONIC FLOWER2, a novel Polycomb group protein homolog, mediates shoot development and flowering in Arabidopsis. Plant Cell. 13, 2471–2481.
