Structure and Genetics of the Partially Duplicated Gene RP Located ...



THE JOURNAL OF BIOLOGICAL CHEMISTRY © 1994 by The American Society for Biochemistry and Molecular Biology, Inc.

Vol. 269, No. 11, Issue of March 18, pp. 8466-8476, 1994. Printed in U.S.A.

Structure and Genetics of the Partially Duplicated Gene RP Located Immediately Upstream of the Complement C4A and the C4B Genes in the HLA Class III Region

MOLECULAR CLONING, EXON-INTRON STRUCTURE, COMPOSITE RETROPOSON, AND BREAKPOINT OF GENE DUPLICATION*

(Received for publication, September 13, 1993, and in revised form, December 3, 1993)

Liming Shen, Lai-chu Wu, Salih Sanlioglu, Ruju Chen, Anna R. Mendoza, Andrew W. Dangel, Michael C. Carroll, William B. Zipf, and C. Yung Yu

From the Children's Hospital Research Foundation, Columbus, Ohio 43205, the Department of Pediatrics, The Ohio State Biochemistry Program, Department of Medical Biochemistry and Department of Internal Medicine, Molecular, Cellular and Developmental Biology Program, and Department of Medical Microbiology and Immunology, The Ohio State University, Columbus, Ohio 43210, and the Department of Pathology, Harvard University Medical School, Boston, Massachusetts 02165

The correlation of many HLA-associated autoimmune and genetic diseases with the polymorphic complement C4 genes may be attributed to the presence of disease susceptibility genes in the close proximity of C4. We have cloned and characterized a pair of partially duplicated genes, RP1 and RP2, located 611 base pairs upstream of the human C4A and C4B genes, respectively. The putative RP protein, consisting of 364 amino acid residues, is basic and highly hydrophilic. There is a bipartite nuclear localization signal at residues 114-131 and therefore RP may be a nuclear protein. Northern blot analysis suggested that RP is ubiquitously expressed. The 5' region of the RP1 gene is CpG rich, which is a characteristic of housekeeping genes. The RP1 gene contains nine exons. Located in the fourth intron is a cluster of Alu elements, and a newly defined composite retroposon SVA with a SINE, multiple copies of GC-rich VNTRs and an Alu element altogether enclosed by direct terminal repeats. Members of SVA are also present in the complement C2 gene located about 20 kilobases upstream of RP1 in the HLA and in the cytochrome CYP1A1 gene. Determination of the DNA sequences for RP2 from two different HLA haplotypes revealed identical hybrid sequences which resulted from fusion of RP with the tenascin-like Gene X and truncation of the 5' regions of both genes. Cumulative data suggest that the four tandemly arranged genes RP, complement C4, steroid 21-hydroxylase (CYP21), and Gene X altogether form a modular structure, RCCX. The number of RCCX modules varies from one to three or more in the population. Absence of the truncated genes RP2 and Gene XA has been detected in genomes with single RCCX modules. Duplication of the RCCX modules probably occurred before the speciation of great apes and humans as they contain the same breakpoint region of RP and Gene X gene duplication.

* This work was supported by the Columbus Children's Hospital Research Foundation (Grant 020-313), a Basil O'Connor Starter Scholar Research Award from the March of Dimes Foundation of Birth Defects (5-FY93-0887), The Ohio State University Research Foundation Seed Grant 221398, and Pittsburgh Supercomputer Center computer usage Grants MCB9300070P and MCA920442P. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

The nucleotide sequence(s) reported in this paper has been submitted to the GenBank/EMBL Data Bank with accession number(s) L26260-L26263.

To whom correspondence should be addressed: Wexner Institute for Pediatric Research, Dept. of Pediatrics, The Ohio State University, 700 Children's Dr., Columbus, OH 43205. Tel.: 614-722-2820; Fax: 614-722-2716.

Autoimmune, genetic, and malignant diseases are associated with the major histocompatibility complex (MHC)¹ in humans (also known as the HLA) (1-5). This may be attributed to: (i) the presence of disease susceptibility genes, oncogenes, or tumor suppressor genes in the MHC; or (ii) the variable efficiencies of the MHC class I, class II, and possibly class III molecules for antigen presentation. Indeed, more than 36 new genes have been identified in the MHC and many of those gene products have been found to be involved in important cellular processes (6). Characterization of these novel genes and identification of their biological functions are essential to understanding the molecular basis of MHC-linked diseases.

The fourth component of complement (C4) is a structural subunit for the multi-molecular C3 and C5 convertases of complement activation in the immune response (7). The C4 genes are located at the class III region of the MHC (8). In humans there are two tandem C4 loci, Locus I and Locus II, which are about 12 kb apart. Locus I codes for C4A and Locus II codes for C4B, although there are exceptions (9-11). Three kb downstream of the C4A and C4B genes are the cytochrome P450 21-hydroxylase (CYP21) A and B genes. CYP21A is a pseudogene due to deleterious mutations in exons. Homozygous mutations and/or deletions of the CYP21B genes may result in congenital adrenal hyperplasia (CAH), characterized by salt-losing and/or virilizing phenomena (12-14). Concurrent deletions of C4 and CYP21 genes are well documented phenomena that may lead to various disorders (15, 16). A pair of new genes with tenascin-like sequences, XA and XB, have been discovered overlapping the CYP21A and the CYP21B genes, respectively (17, 18).

When the promoter regions of the C4A and the C4B genes were characterized, a polyadenylation (poly(A)) signal, AATAAA, was found located 631 bp upstream of the transcriptional initiation sites of each C4 gene.² Each of these poly(A) sites is followed by a stretch of GT-rich sequence that is a characteristic feature for the 3' end of a mammalian gene (19). Here we report the cloning and characterization of the novel gene(s) located immediately upstream of the C4 genes. This novel gene RP is partially duplicated in the MHC and the truncated version is often involved in a gene duplication or gene deletion process concurrent with neighboring genes. The similarity in physical locations suggests that RP is probably identical to G11 assigned by cosmid mapping (20).

¹ The abbreviations used are: MHC, major histocompatibility complex; CAH, congenital adrenal hyperplasia; kb, kilobase(s); bp, base pair(s); PCR, polymerase chain reaction; VNTR, variable number tandem repeats.

² L. C. Wu, C. Y. Yu, K. T. Belt, and R. D. Campbell, manuscript in preparation.

MATERIALS AND METHODS

Isolation of RP cDNA Clones and Determination of RP cDNA Sequences

A 653-bp BstEII-AccI genomic DNA fragment was obtained from a 1.3-kb BstEII subclone of cos 3A3 (8, 21) and labeled by the multiprime labeling method (22) with a U. S. Biochemical Corp. kit and [α-32P]dCTP (Amersham). A λ cDNA library made from the human T-cell line RPMI 8402 was a gift from Dr. Terry Rabbitts (MRC Laboratory of Molecular Biology, Cambridge). cDNA library screenings were performed using standard protocols (23).

Determination of the RP cDNA sequence from clone R1.1 was achieved by shotgun cloning of randomly sonicated DNA fragments into SmaI-cut, phosphatased M13 vector, followed by single-stranded DNA dideoxy sequencing (24, 25). Gel readings were assembled with Staden's DNA analysis software (26).

Synthetic PCR Primers

Oligonucleotides were synthesized by an Applied Biosystem Model 390B DNA Synthesis machine. Sequences of PCR primers used in this study are listed below. The relative positions of primers 1-5 in the RP1 gene (Fig. 4) are numbered in parentheses (with reverse orientation primers abbreviated as r.o.). Sequences added to the primers to facilitate cloning are in lower case:

1) RPR5, 5'-AAG AGG ATC CGA CTC CAC AGG CCC-3' (2,423-2,400, r.o.);
2) 1.6K-F1, 5'-CTC TGG GCC CGA GCG TTC-3' (908-925);
3) 1.6K-F2, 5'-CGT CAG CAG TTT TGT CAG GTG CCC-3' (1,009-1,032);
4) 0.8K-F1, 5'-TCC TCC AAA TGC AGT GAG GTT-3' (1,795-1,815);
5) HRP3, 5'-tcc gaa TTC ATG TCT CTG GCA GGC G-3' (10,964-10,946, r.o.);
6) YMR1, 5'-agg gaa tTC AGG GGC CTC TGG GGC TAA CTC-3'.
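A quick consistency check on the primer list is to compare each primer's length with the coordinate span given for it. The following is a minimal sketch, not part of the original methods. It assumes the line-broken halves of primers 1.6K-F1 and 1.6K-F2 pair up as the coordinates suggest, and the helper names (`templated_length`, `span_length`, `check_primers`, `reverse_complement`) are hypothetical.

```python
# Primer sequences and RP1 coordinates as read from the primer list.
# Lower-case bases are cloning tails and are not part of the RP1 sequence.
# YMR1 is omitted because no RP1 coordinates are given for it.
PRIMERS = {
    "RPR5":    ("AAGAGGATCCGACTCCACAGGCCC", 2423, 2400),   # reverse orientation
    "1.6K-F1": ("CTCTGGGCCCGAGCGTTC", 908, 925),
    "1.6K-F2": ("CGTCAGCAGTTTTGTCAGGTGCCC", 1009, 1032),
    "0.8K-F1": ("TCCTCCAAATGCAGTGAGGTT", 1795, 1815),
    "HRP3":    ("tccgaaTTCATGTCTCTGGCAGGCG", 10964, 10946),  # reverse orientation
}


def templated_length(seq):
    """Count only upper-case bases, i.e. those that match the gene."""
    return sum(1 for base in seq if base.isupper())


def span_length(start, end):
    """Inclusive length of a coordinate span, in either orientation."""
    return abs(end - start) + 1


def check_primers(primers):
    """Map each primer name to True if its length agrees with its span."""
    return {
        name: templated_length(seq) == span_length(start, end)
        for name, (seq, start, end) in primers.items()
    }


def reverse_complement(seq):
    """Reverse complement, useful for locating reverse-orientation primers."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq.upper()))


if __name__ == "__main__":
    for name, ok in check_primers(PRIMERS).items():
        print(name, "consistent" if ok else "MISMATCH")
```

Under these assumptions every primer with stated coordinates has a templated length equal to its span, which supports the pairing of the split sequences.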

Determination of the 5' Sequence of RP cDNA

Total RNAs were isolated from cultured cells MOLT4 (T leukemia cell line) and from HT29 (colon carcinoma cell line) by guanidine isothiocyanate lysis and CsCl ultracentrifugation (27). Reverse transcription was carried out using oligo(dT) as the first primer and 1-5 µg of RNA according to the Perkin Elmer Cetus (Norwalk, CT) RNA PCR protocol (28). Amplification of RP cDNA was achieved by two rounds of PCR. The first PCR was performed using primers corresponding to the 3' end of the RP cDNA (HRP3) and 5' end of RP1 gene (1.6K-F1 or 1.6K-F2). Approximately 10% of product from the first PCR was used for the second PCR using "nested" RP primers, 1.6K-F2 and RPR5, and 0.8K-F1 and RPR5, corresponding to the 5' end of the RP1 gene (1.6K-F2 and 0.8K-F1) and to RP1 Exon 3 (RPR5), respectively. The 1.6K-F1 and 1.6K-F2 sequences were assumed to be present in the RP1 transcript by the presence of the Kozak consensus for an initiation codon (29), although the RP cDNA sequence obtained revealed that these primers are not close to the predicted RP initiation codon (Figs. 2 and 4). PCR conditions were: 1 cycle at 94 °C for 5 min; 30 cycles at 94 °C for 1 min, 54 °C for 1 min, and 72 °C for 1 min; and 1 cycle at 72 °C for 10 min. The PCR products were cloned into TA cloning vector (Invitrogen) and sequenced.
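As a small worked example, the cycling conditions can be written down as data and the total programmed hold time computed. This is only an arithmetic sketch: the function name is hypothetical, and ramp times between temperatures (which depend on the instrument) are ignored. The second program uses the 1 min 15 s extension given for the genomic DNA PCR elsewhere in the Methods.

```python
def total_hold_seconds(initial_s, cycle_steps_s, n_cycles, final_s):
    """Sum of programmed hold times; ramping between steps is not included."""
    return initial_s + n_cycles * sum(cycle_steps_s) + final_s


# Reverse transcriptase-PCR program: 94 °C 5 min; 30 x (94 °C 1 min,
# 54 °C 1 min, 72 °C 1 min); 72 °C 10 min.
RT_PCR = total_hold_seconds(5 * 60, [60, 60, 60], 30, 10 * 60)

# Genomic DNA PCR program: same, but with a 1 min 15 s extension step.
GENOMIC_PCR = total_hold_seconds(5 * 60, [60, 60, 75], 30, 10 * 60)

if __name__ == "__main__":
    print(RT_PCR / 60, GENOMIC_PCR / 60)  # prints 105.0 112.5
```

The two programs therefore differ by 7.5 min of hold time, all of it from the longer extension step.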

Northern Blot Analysis

Total RNAs were isolated from cultured cell lines MOLT4, HepG2 (hepatoma cell line), U937 (monocytic cell line), and RPMI 8402 (pre-T lymphocytic cell line) as described (27). Poly(A+) RNA from MOLT4 was a gift from Dr. Caroline Bilsland (Harvard University). Human liver RNA was prepared as described previously (30). For each sample, ~25 µg of total RNA or 2 µg of poly(A+) RNA was resolved by formaldehyde-agarose gel (0.8%) electrophoresis and blotted to Hybond N membrane (Amersham). The blot was hybridized with the R1.1 cDNA probe and washed twice at room temperature with 2 × SSC, 0.1% SDS and twice at 65 °C with 1 × SSC, 0.5% SDS (27).


Sequence Determination of RP1 and RP2 Genes

RP1 Gene. Restriction fragments containing the entire RP1 gene were subcloned from cos 3A3 and designated pSH13 and pBH4 (Fig. 3). The majority of DNA sequences for the RP1 gene were obtained by shotgun cloning of sonicated DNA fragments into M13 mp18 or into Bluescript KS vectors (Stratagene, La Jolla, CA) and single-stranded dideoxy DNA sequencing (24, 25). Sequenase kit (U. S. Biochemical Corp., Cleveland, OH) and [35S]dATP were employed for sequencing reactions. Gel readings were compiled with Staden's DNA sequence analysis software (26). Gaps in sequence contigs were filled by further subcloning of the appropriate restriction fragments as shown in Fig. 3 and by primer walking, using new primers based on known DNA sequences for sequencing reactions. Sequence contigs were joined together through sequence determination of PCR-amplified DNA fragments overlapping the junctions (shown by thick bars in Fig. 3). Overall, each nucleotide was determined more than three times and confirmed by sequences from both strands.

RP2 Sequence from cos 4A3. A 12-kb BglII fragment spanning the intergenic region between two C4B genes from cos 4A3 (9, 31) was subcloned into pUC18. From this subclone, pBRP7, a 5-kb BamHI restriction fragment was isolated and its sonicated fragments shotgun cloned into pBluescript KS vector by blunt-end ligation. Single-stranded DNAs were prepared after rescue with M13 K07. Gel readings were obtained by SpeedReader and DNA sequences analyzed with PC/GENE software (Intelligenetics, La Jolla, CA).

RP2 Sequence from λ-JM2a. A 5.3-kb TaqI fragment corresponding to the upstream region of a C4B5 gene from λ-JM2a (32) was subcloned into pUC18 vector. From this plasmid, a 2.1-kb BamHI-TaqI fragment was sequence-determined after shotgun cloning into M13 mp18 SmaI-cut vector. Gel readings were assembled by Intelligenetics PC/GENE software.

Isolation of Genomic DNA

Human genomic DNAs were isolated following standard protocols (27) from cultured cell lines HepG2 and MOLT4, and from peripheral blood of normal individuals (Bl, L1, and SCOl), congenital adrenal hyperplasia patients (CAH-El, CAH-2, and CAH-3), Prader Willi patients (PW-1 and PW-2), and a nasopharyngeal carcinoma patient (NPC-A). Appropriate consents from blood donors were obtained according to protocols approved by the Institutional Board of Columbus Children's Hospital.

Mouse genomic DNA was isolated from myeloma cell line NSO; African green monkey genomic DNA was isolated from kidney cell line COS 7; and cotton top tamarin DNA was isolated from an Epstein-Barr virus transformed cell line NPC-LC (33). Orangutan and chimpanzee DNAs were prepared from cultured cell lines PUT2 and WES, respectively, obtained from the American Type Culture Collection.

Southern Blot Analysis

5-10 µg of genomic DNA was digested to completion with appropriate restriction enzymes, resolved in a 0.8% agarose gel, blotted to Hybond N membrane, hybridized with [α-32P]dCTP-labeled R1.1 probe, and autoradiographed after standard procedures (34).

PCR of Genomic DNAs

Locations of possible breakpoints for Gene XA and RP2 hybrid regions in primate and mouse genomes were determined by PCR (35) with primers YMR1 and HRP3. For each reaction, 500-1,000 ng of genomic DNA and ~250 ng of each primer were used. Conditions for PCR were: 1 cycle at 94 °C for 5 min; 30 cycles at 94 °C for 1 min, 54 °C for 1 min, and 72 °C for 1 min and 15 s; and 1 cycle at 72 °C for 10 min.

Protein and DNA Sequence Analyses

Comparison of the RP amino acid and DNA sequences with national data bases was performed with the GCG FASTA program at the Pittsburgh Supercomputer Center. A dendrogram of the RP variable number tandem repeats (VNTR) was generated with the PC/GENE program Clustal. Other sequence analysis programs used included DOTPLOT, BESTFIT, PILEUP, PRETTY, and PUBLISH of the GCG package (36).

RESULTS

Isolation of cDNA Clones for the Novel Gene RP Upstream of C4. A 653-bp BstEII-AccI fragment, which is 274 bp upstream of the major transcriptional start site of the C4A gene, was used to screen cDNA libraries. R1.1 was isolated from the RPMI 8402 library. Determination of the 1.1-kb DNA sequence of R1.1 revealed a 26-bp poly(A) tail which is 16 bp downstream of the polyadenylation signal. The 3' sequence of the R1.1 cDNA clone is identical to the 5' regulatory sequences for the C4A and the C4B genes. This implies that there may be a pair of duplicated genes, designated RP1 and RP2, present immediately upstream of the C4A and the C4B genes, respectively.

FIG. 1. Northern blot analysis of human RP. Total RNAs isolated from human liver tissue (lane 1) or from cultured cell lines HepG2 (lane 2), U937 (lane 3), U937 treated with phorbol ester PMA (lane 4), RPMI 8402 (lane 6), and poly(A+) RNA from MOLT4 (lane 5) were resolved by formaldehyde-agarose gel electrophoresis, blotted to Hybond N membrane, and hybridized with R1.1 probe. [Image not reproduced; lanes 1-6, with the 28S and 18S positions marked.]

The R1.1 cDNA was used as a probe for Northern blot analysis to investigate the expression of RP transcripts isolated from cell lines of different origins (Fig. 1). This probe hybridized to a message of 1.6-1.8 kb in size from RNA samples isolated from the liver (lane 1), hepatoma cell line HepG2, lymphocytic cell lines MOLT4 and RPMI 8402, and monocytic cell line U937 (lanes 3-6). RP transcripts of similar size were also detected from Northern blot analysis of RNA samples from colon carcinoma cell line HT29 and neuroblastoma cell line IMR32 (data not shown). Thus, RP appears to be ubiquitously expressed with a major transcript size about 1.6-1.8 kb.

Since R1.1 contains the poly(A) tail and is only 1.1 kb in size, the 5' region of the RP full-length cDNA is missing in this clone. Determination of the RP 5' cDNA sequence was achieved by reverse transcriptase-PCR (28) with RNAs isolated from HT29 and MOLT4, using PCR primers derived from the upstream sequence of cDNA clone R1.1, and from the 5' sequence of the RP1 gene. The coding sequence for RP1 is shown in Fig. 2. There is an open reading frame coding for 364 amino acid residues. An in-frame stop codon is located 33 nucleotides 5' to the putative initiation codon. Thus, it is likely that the predicted amino acid sequence for RP is complete.
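The argument that the predicted sequence is complete rests on two checks: a long open reading frame, and an in-frame stop codon upstream of the putative initiation codon. A minimal sketch of both checks follows. It is illustrative only and is not the software used in the paper; the function names and the toy sequence are hypothetical, only the forward strand is scanned, and distance is measured from the first base of the stop codon to the first base of the ATG.

```python
STOPS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Longest ATG...stop ORF on the forward strand.

    Returns (start, end, n_codons) with 0-based start, end just past the
    stop codon, and n_codons excluding the stop; None if no ORF is found.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                n_codons = (i - start) // 3
                if best is None or n_codons > best[2]:
                    best = (start, i + 3, n_codons)
                start = None
    return best


def upstream_inframe_stop(seq, start):
    """Distance (nt) from the nearest upstream in-frame stop to `start`."""
    seq = seq.upper()
    for i in range(start - 3, -1, -3):
        if seq[i:i + 3] in STOPS:
            return start - i
    return None


if __name__ == "__main__":
    toy = "TAGGCCATGAAACCCTGAGG"  # hypothetical sequence, not RP
    orf = longest_orf(toy)
    print(orf, upstream_inframe_stop(toy, orf[0]))
```

For scale, an open reading frame of 364 residues corresponds to 364 × 3 = 1,092 coding nucleotides, plus 3 for the stop codon.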

The RP protein is extremely hydrophilic in its NH2-terminal half. There are several alternate hydrophilic and hydrophobic regions at the carboxyl-half of the protein. Overall, there are 53 positively charged residues (Arg or Lys) and 37 negatively charged residues (Asp or Glu) in the RP protein. Thus, there is a net positive charge of 16 in the RP protein. In addition, there are 8 histidine residues which may also be positively charged.
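The net charge quoted here is a simple residue count (53 − 37 = 16). The sketch below shows the same tally for an arbitrary protein string. The function name is hypothetical, and the example input is the short nuclear localization signal peptide reported in this paper rather than the full RP sequence.

```python
def charge_summary(protein):
    """Count charged residues using one-letter amino acid codes.

    Lys/Arg are counted as positive and Asp/Glu as negative; histidines
    are reported separately because their charge depends on pH.
    """
    protein = protein.upper()
    positive = sum(protein.count(aa) for aa in "KR")
    negative = sum(protein.count(aa) for aa in "DE")
    return {
        "positive": positive,
        "negative": negative,
        "net": positive - negative,
        "histidine": protein.count("H"),
    }


if __name__ == "__main__":
    print(charge_summary("KRHHLIPPETFGVKRRRKR"))
```

Applied to the full 364-residue sequence, the same count should reproduce the figures given in the text, assuming the sequence is transcribed correctly.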

FIG. 2. cDNA and predicted amino acid sequences of human RP. The first 521 bp was cloned by reverse transcriptase-PCR technique. The rest of the sequence was derived from clone R1.1. The putative nuclear localization signal is highlighted. The basic residues (K and R) are in bold and the acidic residues (D and E) are italicized. The stop codon is asterisked. GenBank accession number: L26260. [Sequence listing not reproduced; see GenBank accession L26260.]

Located between residues 114 and 131 is a bipartite, positively charged nuclear localization signal (37), KRHHLIPPETFGVKRRRKR. Hence, the RP protein may be targeted to the nucleus. The RP protein is rich in Gly (9.6%), Pro (6.9%), and Leu/Ile (12.6%) residues. While most of the Pro and Gly residues cluster at the NH2-terminal portion, the majority of Leu/Ile residues are present at the COOH-terminal region.
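The bipartite nuclear localization signal reported for residues 114-131 can be illustrated with a simple pattern scan. The sketch below assumes the commonly cited bipartite consensus (two adjacent basic residues, a spacer of about ten residues, then at least three basic residues among the next five). That definition is an assumption here, since the text cites reference 37 without spelling out the rule, and the function name and default parameters are hypothetical.

```python
BASIC = set("KR")


def find_bipartite_nls(protein, spacer=10, window=5, min_basic=3):
    """Return 0-based start positions of candidate bipartite signals.

    A candidate is two adjacent basic residues, then `spacer` residues of
    any kind, then at least `min_basic` basic residues in the following
    `window` residues.
    """
    protein = protein.upper()
    motif_len = 2 + spacer + window
    hits = []
    for i in range(len(protein) - motif_len + 1):
        if protein[i] in BASIC and protein[i + 1] in BASIC:
            tail = protein[i + 2 + spacer:i + 2 + spacer + window]
            if sum(aa in BASIC for aa in tail) >= min_basic:
                hits.append(i)
    return hits


if __name__ == "__main__":
    print(find_bipartite_nls("KRHHLIPPETFGVKRRRKR"))
```

With the peptide quoted in the text, the leading KR pair is followed ten residues later by the window VKRRR, which contains four basic residues, so the peptide satisfies the assumed rule.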

There are no N-linked glycosylation sites in the predicted RP sequence. Search of RP for protein modification sites with the PC/GENE PROSITE program revealed a single potential tyrosine kinase phosphorylation site at residue 231; five potential kinase C phosphorylation sites at residues 80, 105, 112, 145, and 277; five potential casein kinase II phosphorylation sites at residues 6, 76, 85, 183, and 277; two amidation sites at residues 61 and 323; and nine potential N-myristoylation sites at residues 25, 27, 65, 69, 70, 74, 104, 144, and 211. Whether the endogenous RP protein is post-translationally modified as revealed by these potential sites remains to be determined.

FIG. 3. A restriction map of the human RP1 gene. Plasmid subclones derived from cos 3A3 corresponding to the RP1 and the 5' region of the complement C4A gene are shown under the restriction map. DNA sequence for RP1 was obtained by sequencing to completion the consecutive 1.6-kb BamHI, 0.8-kb BamHI, 7.5-kb BamHI-HindIII, and 3.8-kb HindIII-BglII restriction fragments. DNA sequences obtained from shotgun clones of randomly sonicated fragments are indicated as sonicated. Overlapping of sequence contigs was achieved by sequencing PCR-amplified DNA fragments across the junctions (shown by solid bars). Horizontal arrows represent the direction and location of transcriptional initiation sites of RP1 or C4A; a vertical arrow indicates the location of the RP1 poly(A) site. Restriction sites: A, AccI; B, BamHI; Bg, BglII; K, KpnI; Nh, NheI; Ps, PstI; R, EcoRI; S, SstI; Sal, SalI; Sm, SmaI; Sy, StyI; T, TaqI. [Map not reproduced.]

Gene Structure of RP1. The human RP1 gene located upstream of a C4A3 gene in a cosmid clone was subcloned into plasmids (Fig. 3). The DNA sequence for the entire RP1 gene and the intergenic region between C4A and RP has been completely determined. A sequence of 12,118 bp is shown in Fig. 4. The RP1 gene consists of 9 exons (Fig. 5). The 5' boundary for Exon 1 has not been defined precisely, partly because of the multiple initiation sites of transcription (data not shown). The 3' end of Exon 1 was located because a continuous cDNA sequence spanning 191 bp for the 3' region of Exon 1 and about 70 bp for the 5' region of Exon 2 has been obtained (data not shown). The putative initiation codon is 128 bp downstream of the splice junction of Exon 2. The coding capacity of the exons ranges from 14 amino acid residues (for Exon 9) to 74 amino acid residues (for Exon 2). Exon 9 also contains a 3'-untranslated region of 397 bp. The size of the introns ranges from 85 bp for Intron 2 to 6,136 bp for Intron 4 (Table I).

There are several notable features of the RP1 gene. First, the 5' region of the RP1 gene is rich in CpG sequences. For example, there are 130 CpG dinucleotides from nucleotide 500-3,000, compared with 30 from nucleotide 8,500-11,000 (Fig. 4). The CpG sequences are generally under-represented in mammalian DNA and their presence (at the 5' region of a gene) is strongly correlated with housekeeping genes (38). This suggests that RP is involved in an essential function.
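The CpG comparison above is a window count of CG dinucleotides. The sketch below shows one way to make that count, together with the observed/expected CpG ratio that is often used to describe CpG-rich regions. The ratio is an added illustration and is not a statistic reported in the paper; the function names and the toy sequence are hypothetical.

```python
def count_cpg(seq, start, end):
    """Count CG dinucleotides inside the 1-based inclusive window."""
    window = seq.upper()[start - 1:end]
    return sum(1 for i in range(len(window) - 1) if window[i:i + 2] == "CG")


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CG x length) / (#C x #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return count_cpg(seq, 1, len(seq)) * len(seq) / (c * g)


if __name__ == "__main__":
    toy = "ACGCGTTACGAT"  # hypothetical sequence, not RP1
    print(count_cpg(toy, 1, len(toy)), cpg_obs_exp(toy))
```

Each window quoted in the text spans 2,501 nucleotides, so the reported counts correspond to roughly 52 versus 12 CpG dinucleotides per kilobase.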

Second, there are eight complete copies of Alu elements present in Intron 4 and one of these elements (copy number 6) is actually located within another Alu element (copy number 5). Among the eight Alu elements, copy numbers 1, 2, and 8 belong to the Alu-J subfamily that is similar to the 7SL DNA (39). Copy number 4 appears to be an integral component of a composite retroelement and will be discussed below. The other four Alu elements belong to the Alu-S subfamily with copy number 6 categorized to the b branch and copy number 3 categorized to the c branch (39). Alu-Sb is considered to be a young branch of the Alu family, which is consistent with our data as the incorporation of copy number 6 into the RP gene has to be after the presence of copy number 5. Copy number 3 has an atypical trimeric structure in contrast to the dimeric structure for most Alu elements. The additional structure was labeled Alu-3.1 in Fig. 4.


Third, there is a stretch of highly repetitive DNA sequences located between nucleotides 5,717 and 6,579 (Fig. 4). These repeats can be illustrated by multiple diagonal lines in a dot-plot analysis (36) (Fig. 6C). There are 21 copies of very similar but nonidentical tandem repeats, each of which has a GC content of 72-84% and a size of 35-45 bp. A dendrogram showing the relatedness of these 21 tandem repeats is presented in Fig. 6A. These 21 copies of tandem repeats together are flanked by hexameric sequences, TGGGCA (boxed in Fig. 4). The presence of these hexamers precisely at both ends of the 35-45-bp tandem repeats suggests that these tandem repeats as a whole might exist as a structural unit in a transposition/retroposition event, which resulted in the generation of a hallmark, the (hexameric) direct repeats (40).
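The two properties used above to describe the tandem repeats, GC content and self-similarity in a dot-plot, can be illustrated with a small Python sketch. It is a stand-in for the GCG programs cited in the text (36), not a reimplementation of them; the word size, function names, and the toy repeat unit are assumptions made for illustration only.

```python
def gc_content(seq: str) -> float:
    """Percent G+C in a sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def dot_plot(seq: str, word: int = 8):
    """Self-comparison dot-plot: sorted (i, j) pairs, i < j, where the words
    of length `word` starting at 0-based positions i and j are identical."""
    positions = {}
    for i in range(len(seq) - word + 1):
        positions.setdefault(seq[i:i + word], []).append(i)
    return sorted(
        (i, j)
        for hits in positions.values()
        for a, i in enumerate(hits)
        for j in hits[a + 1:]
    )


if __name__ == "__main__":
    unit = "GGCCAGGCT"  # hypothetical 9-bp unit, not one of the RP1 repeats
    toy = unit * 3
    pairs = dot_plot(toy, word=8)
    # Tandem repeats appear as dots whose offset (j - i) is a multiple of the unit length.
    print(sorted({j - i for i, j in pairs}), round(gc_content(unit), 1))
```

In a real dot-plot the (i, j) pairs are drawn as points, so that tandem repeats show up as diagonal lines parallel to the main diagonal, as in Fig. 6C.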

A Composite Retroposon Is Present in the RP1 Gene-Comparison of the tandem repeat sequences in the RP1 gene with the GenBank data base through the GCG FASTA program (36) revealed striking similarities with a group of nonviral retroelements, SINE-R11, -R14, and -R19 (41), but arranged in the reverse orientation. Members of the SINE-R elements contain two basic components: (a) 3-6 copies of VNTR, which are highly similar to the 35-45-bp tandem repeats present in the RP1 gene; and (b) a short interspersed repeat element (SINE) of ~490 bp that is homologous to the genomic region between the env gene and the 3'-long terminal repeat of an endogenous retrovirus, HERV-K10 (42). The SINE sequence in RP1 is only 135 bp in size. Most of the sequence corresponding to the HERV-K10 long terminal repeat region in the SINE-R retroposons is absent in the RP1 gene. Immediately preceding the SINE element in RP is a stretch of T residues of 16 nucleotides.

In addition to the RP1 gene, the SINE-R related sequences have also been detected in introns of two other human genes, the complement C2 (43) and the cytochrome P450 CYP1A1 (44) genes. The copy number for the VNTRs varies from 16 in CYP1A1, 17 in complement C2 (B allele), 21 in RP1, to 23 in another complement C2 gene (A allele). Analysis of the RP, C2, and CYP1A1 genomic sequences reveals a more complex organization of the reiterative sequences than those of SINE-Rs. Located immediately downstream of the VNTRs in each gene is a highly conserved region of 370-372 bp (Fig. 7) with sequence identities of about 95%. Present in each of these sequences are three stretches of Alu-related sequences of 25, 54, and 246 bp (Fig. 7) and a less well-defined sequence of 32 bp. One of the


FIG. 4. The complete DNA sequence of the human RP1 gene. Amino acid sequences for each exon are translated under nucleotide sequences. Donor and acceptor sites of introns are underlined. Orientations of Alu and SINE are indicated by arrows. Target sites (if identified) of Alu elements are in bold and underlined. The 21 copies of VNTRs are numbered V1 to V21 with the first nucleotide of each repeat highlighted. Direct repeats flanking the VNTRs and also those flanking the SVA composite retroposon are boxed. An asterisk indicates the precise location of the feature described in the figure. GenBank accession number: L26261.


Alu-related sequences (i.e. 246 bp; Alu copy number 4 in Fig. 4) appears to be an entire Alu element and is flanked by a pair of target site repeats. This particular Alu element in the RP1, C2, or CYP1A1 genes is unusual in that there are many conserved changes, such as three deletions at nucleotides 1-22, 200-210, and 281-297; two mini-insertions at nucleotides 121-122 and 258-262; and 34 scattered point mutations, when compared with the consensus Alu sequence (39) (Fig. 8).

For the case in RP1, the poly(T) track, SINE, 21 copies of VNTRs, and the described ~370-bp Alu-related sequences are


FIG. 5. The organization of the human RP1 gene. The coding exons are in solid, black boxes and noncoding sequences in exons are in striped boxes. Categorizations of Alu elements (i.e. J and S classes, Ref. 39) are marked. An arrow shows the orientation of transcription.

TABLE I
A summary of the exon-intron structure of the human RP1 gene

        Exons                                          Introns
Exon    Nucleotide       Size     No. of amino         Size     Phase(b)
        number(a)        (bp)     acids coded          (bp)
1       -1,199(c)        ?        5'-UT                594
2       1,794-2,141      348      74                   85       1
3       2,227-2,436      210      70                   109      1
4       2,546-2,682      137      45                   6,136    0
5       8,819-8,914      96       32                   416      0
6       9,330-9,469      140      47                   895      2
7       10,365-10,462    98       33                   105      1
8       10,568-10,715    148      49                   202      2
9       10,918-11,360    442      14

(a) Nucleotide numberings correspond to those in Fig. 4.
(b) After Ref. 64.
(c) Transcriptional start site(s) of RP have not been mapped; therefore, the size of exon 1 is undefined.
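The intron phases and amino acid counts in Table I can be cross-checked from the exon sizes alone. The sketch below is illustrative and rests on one stated assumption: that the statement "128 bp downstream of the splice junction of Exon 2" means the first 128 nucleotides of Exon 2 are untranslated. The phase of an intron is taken as the number of coding nucleotides preceding it, modulo 3.

```python
# Exon sizes (bp) for the exons that are followed by an intron within the coding region (Table I).
EXON_SIZES = {2: 348, 3: 210, 4: 137, 5: 96, 6: 140, 7: 98, 8: 148}

# Amino acids coded per exon as listed in Table I.
AMINO_ACIDS = {2: 74, 3: 70, 4: 45, 5: 32, 6: 47, 7: 33, 8: 49, 9: 14}

# Assumption: 128 nt of Exon 2 lie upstream of the initiation codon.
UTR_IN_EXON_2 = 128


def intron_phases(exon_sizes, utr_in_first_coding_exon):
    """Phase of the intron following each exon: cumulative coding nucleotides mod 3."""
    phases = {}
    coding = -utr_in_first_coding_exon
    for exon in sorted(exon_sizes):
        coding += exon_sizes[exon]
        phases[exon] = coding % 3
    return phases


if __name__ == "__main__":
    print(intron_phases(EXON_SIZES, UTR_IN_EXON_2))
    print(sum(AMINO_ACIDS.values()))
```

Under this assumption the computed phases agree with the Phase column of Table I, and the per-exon amino acid counts add up to the 364 residues of the putative RP protein.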

enclosed by a direct repeat sequence of 13 bp, GATAATTCCACTA (boxed in Fig. 4). A homologous organization enclosed by a pair of direct repeats of 18 bp is present in the complement C2 gene. Hence, these retroelements appear to form a family of retroposons with discrete, composite units (i.e. SINE, VNTRs, and Alu) proliferated in the human genome. This composite retroposon is named SVA in light of its composition (Fig. 7). Part of the DNA sequence for the SVA retroposon in the CYP1A1 gene is not available, but the existing data reveal a structure very similar to the SVA-RP and SVA-C2 (also known as SINE-R.C2).
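Flanking direct repeats of the kind described above can be looked for programmatically once the boundaries of an inserted element are known. The following Python sketch is a simplified illustration, not the method used here: it only compares the bases immediately upstream of the element with those immediately downstream, the coordinates are 0-based with an exclusive end, and the toy sequence is hypothetical apart from the 13-bp repeat quoted in the text.

```python
def flanking_direct_repeat(seq: str, start: int, end: int, max_len: int = 30) -> str:
    """Return the longest k-mer (k <= max_len) that both immediately precedes
    seq[start:end] and immediately follows it; empty string if none."""
    best = ""
    for k in range(1, max_len + 1):
        if start - k < 0 or end + k > len(seq):
            break  # ran off either end of the sequence
        left = seq[start - k:start]
        right = seq[end:end + k]
        if left == right:
            best = left
    return best


if __name__ == "__main__":
    repeat = "GATAATTCCACTA"       # 13-bp direct repeat quoted in the text
    element = "TTTTTTTTGGGGCA"     # hypothetical stand-in for the inserted element
    toy = "CCCC" + repeat + element + repeat + "AAAA"
    start = 4 + len(repeat)
    end = start + len(element)
    print(flanking_direct_repeat(toy, start, end))
```

Exact matching is used for simplicity; real target site duplications may have diverged by a few substitutions, in which case an approximate comparison would be needed.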

RP Gene Is Duplicated in Most Human and Primate Genomes-Since RP sequences are present upstream of the complement C4A and the C4B genes, there may be two copies of RP genes in a haploid genome. A Southern blot analysis of BamHI-digested genomic DNAs, which were isolated from human peripheral blood lymphocytes from CAH patients (lanes 1-3), Prader Willi patients (lanes 4 and 5), a nasopharyngeal patient (lane 6), normal individuals (lanes 7, 8, and 10), human tumor cell lines (lanes 9 and 11), an African green monkey cell line (COS7), and a cotton top tamarin cell line (NPC-LC), using an RP cDNA probe (R1.1), is shown in Fig. 9. Two distinct RP-specific BamHI fragments of 9.6 and 5.0 kb in size were detected in most human samples, but only a single 9.6-kb fragment was detected in the genomes of CAH-E1 (lane 1) and of SCO1 (lane 10). CAH-E1 was a congenital adrenal hyperplasia patient with homozygous deletion of the CYP21B genes.³ SCO1 was a normal individual who was typed as HLA B8 DR3 C4AQ0 C4B1 (there is a homozygous deletion of C4A genes in this individual). Subsequent restriction mapping and DNA sequencing data revealed that the 9.6-kb BamHI fragment corresponds to the RP1 gene, while the 5.0-kb BamHI fragment corresponds to the RP2 gene. Two RP-specific BamHI fragments of 9.3 and 4.7 kb were detected in African green monkey (lane 12), but a single fragment of 10 kb was detected in cotton top tamarin (lane 13). These results suggest that the RP genes are duplicated in the majority of the human population, but in some individuals only the RP1 gene is present. They also suggest that there may be two RP genes in an Old World monkey (African green monkey) but a single RP gene in a New World monkey (cotton top tamarin). The same conclusion was obtained from a genomic Southern blot analysis of TaqI-digested DNAs with the R1.1 probe (data not shown).

³ A. R. Mendoza, W. B. Zipf, L. Shen, A. W. Dangel, and C. Y. Yu, manuscript in preparation.
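The interpretation of the human BamHI pattern described above amounts to a simple lookup: a 9.6-kb band indicates RP1 and a 5.0-kb band indicates RP2. The sketch below expresses that rule in Python for illustration; the function name and the size tolerance are assumptions, and the rule applies only to BamHI-digested human DNA probed with R1.1, not to the primate samples.

```python
RP1_BAMHI_KB = 9.6  # BamHI fragment assigned to RP1 in the text
RP2_BAMHI_KB = 5.0  # BamHI fragment assigned to RP2 in the text


def rp_genes_present(band_sizes_kb, tolerance_kb=0.2):
    """Map observed band sizes (kb) to presence or absence of RP1 and RP2.
    The tolerance is an arbitrary allowance for gel sizing error."""
    def has_band(expected):
        return any(abs(band - expected) <= tolerance_kb for band in band_sizes_kb)

    return {"RP1": has_band(RP1_BAMHI_KB), "RP2": has_band(RP2_BAMHI_KB)}


if __name__ == "__main__":
    print(rp_genes_present([9.6, 5.0]))  # pattern seen in most human samples
    print(rp_genes_present([9.6]))       # pattern seen in CAH-E1 and SCO1
```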

Partial Gene Duplications of RP and Gene X-Located in the approximately 12-kb intergenic region between C4A and C4B are the CYP21A pseudogene, RP sequence, and Gene XA, which overlaps CYP21A at the 3' ends. CYP21A is about 3.2 kb in size and located 3.0 kb downstream of the C4A gene; thus, Gene XA and the RP2 gene are localized in a region of 6 kb. This observation appeared paradoxical, as the size of the RP1 gene is about 11.5 kb (Fig. 4), while that of Gene XB may be as large as 70 kb (45, 46).

In order to solve this puzzle, a 12-kb BglII fragment was subcloned from cos 4A3, which corresponds to the intergenic region between two C4B genes in an unusual haplotype C4A2 C4B1 C4B2 (9) (Fig. 10A). The RP2-specific 5.0-kb BamHI restriction fragment (Fig. 9) is located in this subclone, which was completely sequenced by shotgun cloning and dideoxy sequencing. A comparison of this sequence with those for RP and Gene X cDNAs and the RP1 gene reveals a hybrid sequence derived from RP and Gene X (Figs. 10B and 11). Specifically, this 4,971-bp sequence contains a 1,566-bp fragment corresponding to part of the 5'-untranslated region of a C4 gene (42 bp), the RP-C4 intergenic region (611 bp), and Exon 7-Exon 9 of the RP1 gene (913 bp), which is fused to a 3,405-bp fragment corresponding to the 3' region of Gene X (Fig. 10B). The 5' ends of both RP and Gene X in this hybrid region are truncated, and therefore duplications of these two genes are incomplete. With respect to the RP1 gene, the breakpoint of gene duplication for the RP2 sequence is located in Exon 7 (Fig. 11) and is 2,093 bp downstream of the Alu clusters and SVA element (Fig. 4). A DNA sequence corresponding to Gene X cDNA is found 797 bp upstream of RP2. Nine hypothetical Gene X exons, Exons a to i, can be deduced from this 4,971-bp BamHI fragment when compared with the published cDNA sequence (Fig. 10B) (18). Gene XA is arranged in the opposite orientation with respect to the RP, C4, and CYP21 genes. The first 332 bp of the published Gene X (partial) cDNA sequence is absent in the Gene XA sequence, and therefore the Gene XB 5' exons are absent in Gene XA. In addition, there is an internal deletion of 91 bp in the hypothetical Exon e. The truncation of the 5' region and the internal deletion may change the reading frame and result in premature termination with respect to Gene XB cDNA (18). It remains to be determined whether the changes in Gene XA would lead to the generation of a new protein product. An independent study by Gitelman and colleagues (17) has shown that the DNA sequence 5' to the duplication breakpoint corresponds to intronic sequence of Gene XB. Thus, the Gene XA-RP2 hybrid was formed by a recombination at a Gene X intron and Exon 7 of RP.
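The component sizes quoted for the sequenced BamHI fragment can be checked by simple addition. The short Python sketch below does only that; the dictionary labels are descriptive names chosen here, and every number is taken from the paragraph above.

```python
# Components of the RP2 side of the hybrid, in bp (values from the text).
RP2_SIDE = {
    "C4 5'-untranslated region": 42,
    "RP-C4 intergenic region": 611,
    "RP1 Exon 7-Exon 9": 913,
}
GENE_XA_SIDE = 3405  # fragment corresponding to the 3' region of Gene X, in bp


def fragment_totals(rp2_side, gene_xa_side):
    """Return (size of the RP2-side fragment, size of the whole hybrid fragment)."""
    rp2_total = sum(rp2_side.values())
    return rp2_total, rp2_total + gene_xa_side


if __name__ == "__main__":
    rp2_total, hybrid_total = fragment_totals(RP2_SIDE, GENE_XA_SIDE)
    print(rp2_total, hybrid_total)
```

The sums reproduce the 1,566-bp and 4,971-bp figures given in the text.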

To determine if there is a common breakpoint for gene duplication of Gene X-RP-C4-(CYP21), the sequence of a 2.1-kb TaqI-BamHI restriction fragment from λ-JM2a spanning the RP2 sequence and the 5' region of a C4B5 gene of the C4A4 C4B5 haplotype (32) was determined (Fig. 10C). This fragment covers the entire 913-bp RP2 sequence and also 573 bp of the Gene X sequence, and its sequence is identical to the corresponding region obtained from cos 4A3, except for the presence of two point mutations. In other words, the breakpoint of the hybrid Gene XA-RP2 in a C4A4 C4B5 haplotype is identical to that of the C4A2 C4B1 C4B2 haplotype. (Further analysis of the polymorphism of Gene XA-RP2 sequences will be published elsewhere.)

FIG. 6. The VNTRs in the human RP1 gene. A, a dendrogram showing the relationship of the 21 copies of VNTRs (RPV1 to RPV21); B, an alignment of the VNTRs, gaps are shown by dashes; C, a dot-plot of the VNTRs, diagonal lines represent internal repeats in the sequence.

FIG. 7. A comparison of the structures of the composite SVA retroposons in RP1, C2 (B allele), and CYP1A1 genes (panels SVA-RP, SVA-C2, and SVA-1A1). Numberings represent sizes in nucleotides. Shaded circles and boxes stand for Alu-related sequences. Solid triangles represent terminal, direct repeats flanking the SVA retroposons. Hatched triangles represent direct repeats flanking Alu elements. Solid, vertical stripes stand for tandem hexameric repeats (5' AGAGGG 3') present in SVA-C2 and in SVA-1A1.

Thus, the genomic region between the C4A and C4B genes contains pseudogenes or truncated sequences for three different genes, i.e. CYP21A, Gene XA, and RP2. The tandem genes for RP, C4, CYP21, and Gene X appear to form a four-gene module, RCCX, that may be duplicated together in the MHC class III region (Fig. 12). However, duplications of the flanking RP and Gene X are incomplete. Homozygous deletions of RP2 in individuals CAH-E1 and SCO1 were concurrent with deletions of Gene XA, because the 5.0-kb BamHI restriction fragment containing the Gene XA-RP2 sequences was absent in these individuals (Fig. 9). We have also found homozygous deletions of a C4 gene and a CYP21 gene in the genomes of CAH-E1 and SCO1. In other words, these individuals have single RCCX modular structures.³


A Common Breakpoint Region for Duplication of the RCCX Modules in Great Apes-To determine if the RP-C4-CYP21-Gene X modules are duplicated with a Gene XA-RP2 hybrid region in humans and apes, PCR was performed with a set of primers corresponding to RP (Exon 9) at one end and to Gene X at the other end (Fig. 10D). As shown in Fig. 13 (Panel A), a 1.36-kb fragment was amplified from cosmid DNA (cos 5) that spans a long C4A and a short C4B gene (lane 1), from human genomic DNAs with RCCX bimodular structures, e.g. Raji (lane 3) and HepG2 (lane 4), and from chimpanzee (lane 5) and orangutan genomic DNAs (lane 6). Southern blot analysis of the samples shown in Panel A using the R1.1 probe confirmed that the 1.36-kb fragment contained RP sequence (Panel B). Thus, there is a common breakpoint region for gene duplication of the RCCX modules in the great apes. On the other hand, no amplified products were detected from human genomic DNA, CAH-E1, with a single RCCX modular structure (lane 2), or from mouse genomic DNA (lane 7). The former was expected because the corresponding PCR primers in a single modular haplotype are oriented in a head-to-head configuration and located 20-30 kb apart, and therefore could not yield an amplified product. The breakpoint of gene duplication for mouse RP is undetermined, but available sequence data corresponding to the 5' regions of the C4 and the Slp genes exclude the possibility of a breakpoint identical to that in humans and apes.
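The logic of this PCR assay, a product when the two primer sites are brought close together by the Gene XA-RP2 junction and no product otherwise, can be sketched as a small Python model. This is an illustration under stated assumptions and not a description of the actual primers: the coordinates are hypothetical, the `Primer` type and `amplicon_size` function are names introduced here, and the 5-kb ceiling on product size is an arbitrary stand-in for the practical limit of the reaction.

```python
from typing import NamedTuple, Optional


class Primer(NamedTuple):
    position: int  # coordinate of the primer 5' end on the template (bp)
    strand: str    # "+" extends toward larger coordinates, "-" toward smaller


def amplicon_size(a: Primer, b: Primer, max_product: int = 5000) -> Optional[int]:
    """Return the expected product size in bp, or None if the pair cannot amplify.
    A product requires one "+" and one "-" primer that point toward each other
    and lie no farther apart than max_product."""
    fwd, rev = (a, b) if a.strand == "+" else (b, a)
    if fwd.strand != "+" or rev.strand != "-":
        return None  # both primers on the same strand
    size = rev.position - fwd.position + 1
    if size <= 0 or size > max_product:
        return None  # primers point away from each other, or are too far apart
    return size


if __name__ == "__main__":
    # Hypothetical coordinates chosen to give a 1.36-kb product, as in bimodular templates.
    print(amplicon_size(Primer(1000, "+"), Primer(2359, "-")))
    # Primer sites about 25 kb apart, as a stand-in for a single-module template.
    print(amplicon_size(Primer(1000, "+"), Primer(26000, "-")))
```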


FIG. 8. An alignment of the Alu sequences from SVA-RP (Alu-RP), SVA-C2 (Alu-C2), and SVA-1A1 (Alu-1A1) with the Alu consensus. Deletions are shown by dots; DNA sequences identical to that of Alu-RP are represented by dashes; dissimilar sequences are shown at the corresponding positions.

FIG. 9. Southern blot analysis of RP genes in humans and monkeys. Human genomic DNAs from CAH-E1 (lane 1), CAH-2 (lane 2), CAH-3 (lane 3), PW-1 (lane 4), PW-2 (lane 5), NPC-A (lane 6), B1 (lane 7), L1 (lane 8), HepG2 (lane 9), SCO1 (lane 10), MOLT4 (lane 11), African green monkey (COS 7; lane 12), and cotton top tamarin (NPC-LC; lane 13) genomic DNAs were digested with BamHI restriction enzyme, resolved by 0.8% agarose gel electrophoresis, blotted to Hybond N membrane, and hybridized with the R1.1 probe.

DISCUSSION

Here we report the identification, cloning, and characterization of the novel gene RP located 611 bp upstream of the human complement component C4A and the C4B genes in the class III region of the HLA. The unusual modular duplication (and deletion) of RP together with its neighboring genes Gene X, complement C4, and steroid 21-hydroxylase CYP21, and the association of the HLA with autoimmune and genetic diseases motivate an intensive investigation of the structure, genetics, and function of RP.

Although the deduced amino acid sequence of RP does not reveal striking similarities to any known proteins, it sheds light on the properties and possible function of this ubiquitously expressed molecule. The presence of a bipartite nuclear localization signal suggests RP may be a nuclear protein. The highly hydrophilic and basic nature of the protein suggests that

FIG. 10. Partial gene duplications of Gene X and RP. A, a subclone of a 12-kb BglII restriction fragment from cos 4A3 spanning the intergenic region between two C4B genes; B, hypothetical exon-intron structures for Gene XA and RP2 in a 5.0-kb BamHI restriction fragment that has been completely sequenced (data base accession number L26263). The breakpoint for gene duplication for RP and Gene X at the chimeric region is marked by an arrow. C, the relative position of a 2.14-kb TaqI-BamHI restriction fragment from clone λ-JM2a corresponding to the intergenic region between the C4A4 and C4B5 genes (32). This fragment has been completely sequenced (data base accession number 26262). D, the relative location of the PCR primers to determine the breakpoint of gene duplication for the Gene XA-RP2 (please refer to Fig. 13).

the protein might interact with negatively charged molecules such as DNA or acidic domains of transcriptional factors. A comparison of the RP protein sequence with other protein sequences in national data bases revealed that the NH2-terminal portion of 157 residues in RP is 22.9% identical to the hypothetical 119.5-kDa uvr-A protein in the bacterium Micrococcus (47), while the carboxyl portion of RP is about 20% identical to the yeast RAD7 protein (48). Both uvr-A and RAD7 are involved in the DNA repair mechanism. Similar to RP, a bipartite nuclear localization signal and a leucine-rich region are present in the RAD7 sequence (48, 49). Mutation of the RAD7 gene in yeast resulted in decreased proficiency of excision repair of DNA damaged by UV light (48). While analogs for many of the components involved in the DNA repair mechanism of yeast have been isolated, the human analog for the yeast RAD7 has not been cloned.
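Percent identity values such as those quoted above are computed from a pairwise alignment. The Python sketch below shows one common way of doing so for two pre-aligned sequences; it is illustrative only, since the text does not state how the alignment was scored. The choice of denominator (all alignment columns except those where both sequences have a gap) is an assumption, and the example peptides are invented.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of alignment columns with identical residues.
    Columns where both sequences carry a gap are ignored; columns where
    only one sequence has a gap count as mismatches."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (x, y)
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
        if not (x == gap and y == gap)
    ]
    if not columns:
        raise ValueError("no alignable columns")
    matches = sum(1 for x, y in columns if x == y and x != gap)
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    # Invented six-column alignment, for illustration only.
    print(round(percent_identity("MKRA-L", "MKQAVL"), 1))
```

Other conventions divide by the length of the shorter sequence or by the number of ungapped columns, so identities reported by different programs are not always directly comparable.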

Immunological disorders such as systemic lupus erythematosus (50, 51), immunoglobulin IgA deficiency and common variable immunodeficiencies (52, 53), and malfunctions in reproduction, such as recurrent spontaneous abortions (54), have been related to the null alleles of C4A. In this case, null alleles of C4A imply a gross deletion of a C4A gene together with other genes in the RCCX module, or mutations of the C4A gene that may also be concurrent with mutations of RP, CYP21, or Gene X. A typical example of the latter can be found in an HLA B44 haplotype, where the conversion of a C4B gene to C4A in the second C4 locus was concurrent with mutations of the CYP21B gene (10, 14). The diversity of the disorders correlated with the C4A null alleles suggests deficiencies of different genes in the close proximity of C4A, and/or the presence of a malfunctioning gene with widespread functional properties. To this end, malfunctions of a transcriptional factor, a DNA repair protein, or a molecule involved in the signal transduction pathway could result in the described disorders. A deficiency of the putative nuclear protein RP could be related to some of these problems.

The mouse also contains RP genes upstream of the C4 and


FIG. 11. The breakpoint for the partial gene duplication of RP and Gene X. The RP cDNA, RP1 gene, and the Gene XA-RP2 hybrid sequences are aligned and identical sequences are marked by vertical lines. Intronic sequences corresponding to RP1 gene and to Gene XA are in lower case. The exon-intron boundaries for RP1 gene and the hybrid are underlined. The breakpoint for the partial gene duplication of RP and Gene X is marked by a ▲.

FIG. 12. A molecular map of the RCCX gene modules in the HLA class III region. Arrows represent the transcriptional orientation of functional genes. 21A, CYP21A; 21B, CYP21B. Data taken from Refs. 6, 8, 17, 21, 46, 62, and 63 and this work.

FIG. 13. A common breakpoint region for gene duplication of Gene XA-RP2 for human and great apes. A, PCR amplification of the Gene XA-RP2 breakpoint region using primers YMR1 and HRP3 (Fig. 10D) and genomic DNAs. DNA templates for PCR are from cos 5 (lane 1), CAH-E1 (with RCCX single modular structure; lane 2), Raji (with RCCX double modular structure; lane 3), HepG2 (with RCCX double modular structure; lane 4), WES (a chimpanzee cell line with RCCX bimodular structure; lane 5), and PUT (an orangutan cell line with RCCX bimodular structure; lane 6). Molecular weight markers were 1-kb ladders (Life Technologies Inc.). B, autoradiograph of the Southern blot from A hybridized with R1.1 probe.

the Slp genes, although the breakpoint of gene duplication or deletion for RP appears to be different from that in humans.⁴ Whether both RP genes in the mouse are functional is yet to be determined. It was shown that a crossover at the mouse C4-CYP21 region led to the lethality of homozygous embryos (55), which suggests the presence of essential gene(s) at the region of crossover. On the other hand, breeding experiments for congenic rats revealed the existence of a growth and reproduction complex (grc) with several genes closely linked to, if not present in, the MHC. One of the genes in the grc has been inferred to be a tumor suppressor gene (56, 57). Our zoo-blot experiment suggested that there are two copies of RP genes in the haploid genome of rat.⁵ The physical location and the structural information together suggest that genes of the RCCX modules could be related to the grc.

The concept for the modular organization of the C4 and CYP21 genes was first suggested by Klein and colleagues (15). This study extends the concept of the modular gene duplication to include the genes flanking C4 and CYP21, RP, and Gene X. In an RCCX bimodular (or trimodular) structure, the breakpoint of the four-gene duplication is present at Exon 7 of the RP1 gene and an intron of the Gene XB. This resulted in the complete duplication of a C4 gene and a CYP21 gene, but only partial duplications of RP and Gene X. The truncated sequences, RP2 and Gene XA, form a chimeric hybrid at the intergenic region of the two C4 genes. This modular duplication/deletion pattern involving four structurally and functionally unrelated genes is intriguing. Partial gene duplication has been suggested to be one of the major mechanisms leading to genetic diseases (58). This is because a partially duplicated DNA or a pseudogene DNA sequence may mutate without selection pressure and those (deleterious) mutations can be incorporated into the functional gene through recombinations or gene conversions, as observed in the CYP21 genes (reviewed in Ref. 59). Thus, the Gene XA-RP2 hybrid sequences could play a role in disrupting the gene function and in the genetic instabilities of the RCCX modules in the population.

4 L. Shen, R. Chen, and C. Y. Yu, manuscript in preparation.
5 C. Y. Yu, unpublished data.

Although the bimodular structures of RCCX are prevalent in the population, the single modular structures account for 10-30% of the human genomes (60). Single RCCX modular structures consist of the intact RP1 and Gene XB loci, and varied types of the complement C4 and CYP21 genes.3 In those single RCCX modular genomes, the absence of CYP21B leads to CAH, while the absence of C4A is a predisposing factor for systemic lupus erythematosus. Thus, it is important to understand the mechanism leading to the deletion of genes of the RCCX modules.

In many situations a repetitive DNA element such as an Alu element or an endogenous retrovirus was found at or proximal to the breakpoint of DNA rearrangements (58). In the RCCX modules, a cluster of Alu elements and a composite retroposon SVA with 21 copies of VNTRs are present at Intron 4 of the RP1 gene, which is 2,093 bp upstream of the corresponding breakpoint of RP gene duplication. Notably, dimeric sequences for Alu elements have not been reported in the C4A, CYP21A, Gene XA, RP2, C4B, and CYP21B, a genomic region more than 50 kb in size.

Elucidation of the composite retroposon SVA was the result of a deliberate sequence comparison with DNA sequences in the GenBank data base. In contrast to a simpler retroposon SINE-R, the composite retroposon SVA in the RP1, C2, or CYP1A1 genes contains 16-23 copies of unusually GC-rich VNTRs and also additional sequences with an Alu element characterized by distinct deletions and mutations among SVA retroposons. The SINE and Alu elements are arranged in the opposite, head-to-head configurations. Since possible gene products of the SVA have not been defined at this stage, the sense DNA strand of SVA cannot be specified with confidence.

However, the presence of multiple T residues at one end of the SVA could reflect the presence of a poly(A) structure similar to a messenger RNA that was reversely transcribed and subsequently incorporated into the human genome. If this were the case, the SVA would be orientated in the opposite direction with respect to the resident genes RP, C2, and CYP1A1. The striking similarities in the organization and high sequence identities of SVA among the three genes suggest that SVA may be a recently evolved retroposon. All three SVA elements described above are located in an Alu-rich region. For example, the SVA-RP is present within an Alu cluster and the entire repetitive DNA region spans 4.4 kb in size. The SVA-C2 and SVA-CYP1A1 have acquired an additional structure with 7 copies of hexameric sequences located immediately after the SVA-specific Alu element (Fig. 7). The SVA-RP contains only 135 bp of the 490-bp SINE element present in SINE-Rs. The missing region consists of a responsive element for glucocorticoids (41, 42) and therefore the gene activity of RP may not be induced by these steroids. Whether the SVA element and its GC-rich VNTRs play a role in the function of RP, or in the unusually frequent RCCX modular variations such as gene duplications and/or deletions and polymorphisms, remains to be determined. Only three SVA retroposons have been elucidated so far but two of them (i.e. SVA-RP and SVA-C2) are localized about 20 kb apart in the polymorphic HLA class III region. It is also of considerable interest to note that about seven copies of the SVA-related VNTR sequences are found close to the meiotic recombinational breakpoint of the HLA DQB1 gene in the DR7 DQw2 haplotype (61).
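The strand argument made above for the T-rich end of the SVA can be illustrated with a minimal sketch. It assumes only standard Watson-Crick complementarity; the function names, the run-length threshold, and the toy sequence are hypothetical and are not taken from the SVA sequences reported here. A run of T residues at the start of the strand as written corresponds to a poly(A) run at the 3' end of the complementary strand, which would then be the candidate sense strand.

```python
# Illustrative sketch only: hypothetical helper names and a toy sequence,
# not the SVA-RP, SVA-C2, or SVA-CYP1A1 sequences.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def leading_run(seq: str, base: str) -> int:
    """Count consecutive copies of `base` at the start of `seq`."""
    count = 0
    for ch in seq.upper():
        if ch != base:
            break
        count += 1
    return count


def poly_a_strand(seq: str, min_run: int = 6) -> str:
    """Guess which strand carries a 3' poly(A) run.

    `min_run` is an arbitrary threshold chosen for illustration.
    """
    if leading_run(seq[::-1], "A") >= min_run:
        return "given"       # poly(A) at the 3' end of the strand as written
    if leading_run(seq, "T") >= min_run:
        return "opposite"    # poly(T) at the 5' end implies poly(A) on the other strand
    return "undetermined"


toy = "TTTTTTTTGGCCGCAT"              # T-rich at one end, as described for the SVA
flipped = reverse_complement(toy)     # the same element read from the other strand
print(poly_a_strand(toy), poly_a_strand(flipped))
```

Under these assumptions, reading the element from the opposite strand turns the leading T run into a trailing A run, which is the sense in which the SVA would lie in the opposite orientation to the resident gene.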

Acknowledgments-RP was named sincerely to memorialize the late Professor Rodney Porter. We thank Dr. Terry Rabbitts (MRC, Cambridge) for the RPMI 8402 cDNA library, Dr. Carolyn Giles (Royal Postgraduate Medical School, London) and Dr. Sue O'Dorisio for blood samples, Bradley Baker for cosmid clones, and Dr. Sue O'Dorisio for constructive review of the manuscript.

REFERENCES

1. Porter, R. R. (1983) Mol. Biol. Med. 1, 161-168
2. Alper, C. A. (1991) in The Immunogenetics of Autoimmune Diseases (Farid, N. R., ed) Vol. I, pp. 166-186, CRC Press, Boca Raton, FL
3. Ryder, L. P., Svejgaard, A., and Dausset, J. (1981) Annu. Rev. Genet. 15, 169-187
4. Lu, S., Day, N. E., Degos, L., Lepage, V., Wang, P. C., Chan, S. H., Simons, M., McKnight, B., Easton, D., Zeng, Y., and de-Thé, G. (1990) Nature 346, 470-471
5. Rukstalis, D. B., Bubley, G. J., Donahue, J. P., Richie, J. P., Seidman, J. G., and DeWolf, W. C. (1989) Cancer Res. 49, 5087-5090
6. Trowsdale, J., Ragoussis, J., and Campbell, R. D. (1991) Immunol. Today 12, 443-446
7. Reid, K. B. M. (1988) in Molecular Immunology (Hames, B. D., and Glover, D. M., ed) pp. 215-241, IRL Press, Oxford
8. Carroll, M. C., Campbell, R. D., Bentley, D. R., and Porter, R. R. (1984) Nature 307, 237-241
9. Carroll, M. C., Belt, T., Palsdottir, A., and Porter, R. R. (1984) Philos. Trans. R. Soc. Lond. 306, 379-388
10. Yu, C. Y., and Campbell, R. D. (1987) Immunogenetics 25, 384-390
11. Awdeh, Z. L., and Alper, C. A. (1980) Proc. Natl. Acad. Sci. U. S. A. 77, 3576-3580
12. White, P. C., New, M. I., and DuPont, B. (1986) Proc. Natl. Acad. Sci. U. S. A. 83, 5111-5115
13. Higashi, Y., Yoshioka, H., Yamane, M., Gotoh, O., and Fujii-Kuriyama, Y. (1986) Proc. Natl. Acad. Sci. U. S. A. 83, 2841-2845
14. Rodrigues, N. R., Dunham, I., Yu, C. Y., Carroll, M. C., Porter, R. R., and Campbell, R. D. (1987) EMBO J. 6, 1653-1661
15. Kawaguchi, H., O'hUigin, C., and Klein, J. (1991) in Molecular Evolution of the Major Histocompatibility Complex (Klein, J., and Klein, D., ed) pp. 357-381, Springer-Verlag, Heidelberg
16. Carroll, M. C., Belt, K. T., Palsdottir, A., and Yu, Y. (1985) Immunol. Rev. 87, 39-60
17. Gitelman, S. E., Bristow, J., and Miller, W. L. (1992) Mol. Cell. Biol. 12, 2124-2134
18. Morel, Y., Bristow, J., Gitelman, S. E., and Miller, W. L. (1989) Proc. Natl. Acad. Sci. U. S. A. 86, 6582-6586
19. Proudfoot, N. (1991) Cell 64, 671-674
20. Sargent, C. A., Dunham, I., and Campbell, R. D. (1989) EMBO J. 8, 2305-2312
21. Yu, C. Y. (1991) J. Immunol. 146, 1057-1066
22. Feinberg, A. P., and Vogelstein, B. (1984) Anal. Biochem. 137, 266-267
23. Huynh, T. V., Young, R. A., and Davis, R. D. (1985) in DNA Cloning (Glover, D. M., ed) Vol. I, pp. 49-78, IRL Press, Oxford
24. Bankier, A. T., and Barrell, B. G. (1989) in Nucleic Acids Sequencing: A Practical Approach (Howe, C. J., and Ward, E. S., ed) pp. 37-78, IRL Press, Oxford
25. Messing, J., and Bankier, A. (1989) in Nucleic Acids Sequencing: A Practical Approach (Howe, C. J., and Ward, E. S., ed) pp. 1-36, IRL Press, Oxford
26. Staden, R. (1982) Nucleic Acids Res. 12, 505-519
27. Ausubel, F. M., Brent, R., Kingston, R. E., Moore, D. D., Seidman, J. G., Smith, J. A., and Struhl, K. (1987) Current Protocols in Molecular Biology, Greene Publishing Associates and Wiley-Interscience, New York
28. Kawasaki, E. S. (1990) in PCR Protocols: A Guide to Methods and Applications (Innis, M. A., Gelfand, D. H., Sninsky, J. J., and White, T. J., ed) pp. 21-27, Academic Press, San Diego
29. Kozak, M. (1987) Nucleic Acids Res. 15, 8125-8148
30. Wu, L.-C., Morley, B. J., and Campbell, R. D. (1987) Cell 48, 331-342
31. Belt, K. T., Yu, C. Y., Carroll, M. C., and Porter, R. R. (1985) Immunogenetics 21, 173-180
32. Yu, C. Y., Belt, K. T., Giles, C. M., Campbell, R. D., and Porter, R. R. (1986) EMBO J. 5, 2873-2881
33. Zhang, H-Y., Dangel, A. W., Takimoto, T., and Glaser, R. (1989) Intervirology 30, 52-60
34. Southern, E. M. (1975) J. Mol. Biol. 98, 503-517
35. Saiki, R. (1990) in PCR Protocols: A Guide to Methods and Applications (Innis, M. A., Gelfand, D., Sninsky, J., and White, T., eds) pp. 13-20, Academic Press, San Diego
36. Genetics Computer Group, Inc. (1991) GCG Sequence Analysis Software Package: Program Manual, Version 7, Genetics Computer Group, Inc., Madison, WI
37. Robbins, J., Dilworth, S. M., Laskey, R. A., and Dingwall, C. (1991) Cell 64, 615-623
38. Bird, A. P. (1986) Nature 321, 209-213
39. Jurka, J., and Smith, T. (1988) Proc. Natl. Acad. Sci. U. S. A. 85, 4775-4778
40. Li, W.-H., and Graur, D. (1991) in Fundamentals of Molecular Evolution (Li, W.-H., and Graur, D., ed) pp. 172-203, Sinauer Associates, Inc., Sunderland, MA
41. Ono, M., Kawakami, M., and Takezawa, T. (1987) Nucleic Acids Res. 15, 8725-8737
42. Ono, M., Yasunaga, T., Miyata, T., and Ushikubo, H. (1986) J. Virol. 60, 589-598
43. Zhu, Z. B., Hsieh, S-L., Bentley, D. R., Campbell, R. D., and Volanakis, J. E. (1992) J. Exp. Med. 175, 1783-1787
44. Kawajiri, K., Watanabe, J., Gotoh, O., Tagashira, Y., Sogawa, K., and Fujii-Kuriyama, Y. (1986) Eur. J. Biochem. 159, 219-225
45. Matsumoto, K., Arai, M., Ishihara, N., Ando, A., Inoko, H., and Ikemura, T. (1992) Genomics 12, 485-491
46. Matsumoto, K., Ishihara, N., Ando, A., Inoko, H., and Ikemura, T. (1992) Immunogenetics 36, 400-403
47. Shiota, S., and Nakayama, H. (1989) Mol. Gen. Genet. 217, 332-340
48. Perozzi, G., and Prakash, S. (1986) Mol. Cell. Biol. 6, 1497-1507
49. Schneider, R., and Schweiger, M. (1991) FEBS Lett. 283, 203-206
50. Atkinson, J. P. (1986) Springer Semin. Immunopathol. 9, 179-194
51. Fielder, A. H. L., Walport, M. J., Batchelor, J. R., Rynes, R. I., Black, C. M., Dodi, I. A., and Hughes, G. R. V. (1983) Br. Med. J. 286, 425-428
52. Schaffer, F. M., Palermos, J., Zhu, Z. B., Barger, B. O., Cooper, M. D., and Volanakis, J. E. (1989) Proc. Natl. Acad. Sci. U. S. A. 86, 8015-8019
53. Wilton, A. N., Cobain, T. J., and Dawkins, R. L. (1985) Immunogenetics 21, 333-342
54. Laitinen, T., Lokki, M. L., Tulppala, M., Ylikorkala, O., and Koskimies, S. (1991) Hum. Reprod. 6, 1384-1387
55. Shiroishi, T., Sagai, T., Natsuume-Sakai, S., and Moriwaki, K. (1987) Proc. Natl. Acad. Sci. U. S. A. 84, 2819-2823
56. Melhem, M. F., Kunz, H. W., and Gill, T. J., III (1993) Proc. Natl. Acad. Sci. U. S. A. 90, 1967-1971
57. Kunz, H. W., Gill, T. J., III, Dixon, B. D., Taylor, F. H., and Greiner, D. L. (1980) J. Exp. Med. 152, 1506-1518
58. Hu, X., and Worton, R. G. (1992) Hum. Mutat. 1, 3-12
59. Miller, W. L., and Morel, Y. (1989) Annu. Rev. Genet. 23, 371-393
60. Schneider, P. M., Carroll, M. C., Alper, C. A., Rittner, C., Whitehead, A. S., Yunis, E. J., and Colten, H. R. (1986) J. Clin. Invest. 78, 650-657
61. Satyanarayana, K., and Strominger, J. L. (1992) Immunogenetics 35, 235-240
62. Levi-Strauss, M., Carroll, M. C., Steinmetz, M., and Meo, T. (1988) Science 240, 201-204
63. Speiser, P. W., and White, P. C. (1989) DNA (N. Y.) 8, 745-751
64. Patthy, L. (1987) FEBS Lett. 214, 1-7