
THE JOURNAL OF BIOLOGICAL CHEMISTRY © 1993 by The American Society for Biochemistry and Molecular Biology, Inc. Vol. 268, No. 18, Issue of June 25, pp. 13548-13555, 1993

Printed in U.S.A.

Linker Chain L1 of Earthworm Hemoglobin STRUCTURE OF GENE AND PROTEIN: HOMOLOGY WITH LOW DENSITY LIPOPROTEIN RECEPTOR*

(Received for publication, December 28, 1992, and in revised form, March 8, 1993)

Tomohiko Suzuki‡ and Austen F. Riggs§ From the Department of Zoology, University of Texas, Austin, Texas 78712

The extracellular hemoglobins (Hbs) of annelids and tube worms are giant multisubunit proteins of up to ≈200 polypeptides and molecular masses to at least 3,900 kDa. They differ from all other Hbs in having both O2-binding chains and "linker" chains. The latter are required for assembly and structural integrity of the protein and are deficient in or lack heme. We have determined the nucleotide sequences of the cDNA and gene for linker chain L1 of the hemoglobin of Lumbricus terrestris. The cDNA-derived amino acid sequence has 225 residues and a calculated molecular mass of 25,847 Da. The chain is 21-28% identical to linker chains of the related annelid Tylorrhynchus heterochaetus and the deep-sea tube worm Lamellibrachia sp. A remarkable feature of the linker chains is a conserved 38-39-residue segment that contains a repeating pattern of cysteinyl residues: (Cys-X6)3-Cys-X5-Cys-X10-Cys. This pattern, not present in any globin sequence, corresponds exactly to the cysteine-rich repeats of the ligand binding domains of the low density lipoprotein (LDL) receptors of man and Xenopus laevis. Furthermore, the cysteine-rich segment of linker chain L1 has the sequence Asp-Gly-Ser-Asp-Glu, which is characteristic of LDL receptor repeats. Similar cysteine-rich sequences also occur in two other mammalian proteins, complement C9 and renal glycoprotein GP330. The results support the conclusion that the cysteine-rich motif of the LDL receptor and annelid Hbs is a multipurpose protein-binding unit of ancient origin which has been incorporated into diverse unrelated proteins, presumably by the process of exon shuffling.

* This work was supported in part by National Institutes of Health Grants GM28410 and GM35847, National Science Foundation Grant DMB 88-10828, and Welch Foundation Grant F-0213. A preliminary account of this work has been presented (1). The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

The nucleotide sequence(s) reported in this paper has been submitted to the GenBank™/EMBL Data Bank with accession number(s) L12688 and L12689.

‡ Supported by a grant-in-aid for scientific research from the Ministry of Education, Science and Culture of Japan. Present address: Dept. of Biology, Faculty of Science, Kochi University, Kochi 780, Japan.

§ To whom correspondence and reprint requests should be sent.

¹ The major structural linker chains will be designated L1, L2, and L3 to avoid the implication in previous nomenclature (12) that they are dimers (D1A, D1B, D2) and to conform to the nomenclature adopted for other linker chains (8, 9). Subunit D1A (chain Va) is here renamed chain L1 (see Ref. 13). Chain L2 appears to correspond to D2 on the basis of molecular mass, but D1B has not been sufficiently characterized to permit identification. A separate nomenclature (12) for linker chains (Va, Vb, and VI) and subunits (D1A, D1B, and D2) appears to be unnecessary since they are not dimers.

² The abbreviations used are: PCR, polymerase chain reaction; kbp, kilobase pairs; bp, base pairs; LDL, low density lipoprotein.

Many extracellular Hbs of annelids are composed of two kinds of heme-containing subunits, a disulfide-linked trimer of three different polypeptide chains (17-18 kDa) and a 16-kDa monomer (for review, see Ref. 2), together with additional heme-deficient chains (25-32 kDa) that appear to be required as "linkers" in the assembly of the heme-containing subunits to form the native molecule (3). Although no direct evidence exists for the structural role of these chains in assembly, preparations of Lumbricus Hb which are deficient or lacking in linker chains do not form the full-sized 3,900-kDa molecule (3-7).

The determination of the amino acid sequences of linker chains from the Hbs from the polychaete Tylorrhynchus heterochaetus (8) and the deep-sea tube worm Lamellibrachia sp. (9) has shown that these 25-28-kDa chains are all homologous but are either unrelated or very distantly related to the heme-binding chains. The coding region of the gene for heme-binding chain c of Lumbricus Hb has exactly the same two-intron, three-exon organization found in the genes for vertebrate globins (10, 11). Examination of the linker sequences initially suggested (8, 9) that they might have evolved by fusion of two genes for heme-binding chains to form a gene for a two-domain chain followed by loss of the first exon in the first domain and the last exon of the second domain. We have investigated the organization of a gene for a linker chain in Lumbricus Hb to test this hypothesis.

MATERIALS AND METHODS

Amplification of cDNA of Lumbricus Chain L1-Poly(A)+ RNA, prepared previously (11), was used as template to synthesize cDNA with oligo(dT) as primer. The cDNA corresponding to chain L1¹ was amplified from single-stranded cDNA by the polymerase chain reaction (PCR)² using oligo(dT) with an adaptor that included an XbaI site (27-mer, Promega, Madison, WI) and a 2,048-fold redundant oligomer based on the known NH2-terminal amino acid sequence (14) corresponding to positions 10-18.

5'-TTCCAGTACCTXGTXAAGAACCAGAA-3'
     T  A  TT       A  T  A

SEQUENCE 1

X stands for all four nucleotides. The reaction mixture (100 µl) contained 2.5 units of Taq polymerase (Perkin-Elmer-Cetus Instruments), 0.3 µg of oligo(dT), 10 µg of redundant oligomer, a 0.2 mM concentration of each of the four deoxynucleotide triphosphates, 40 ng of single-stranded cDNA, 10 µg of bovine serum albumin, 50 mM KCl, 10 mM Tris-HCl (pH 8.5 measured at ≈22 °C), 1.5 mM MgCl2, and 3 mM dithiothreitol. Since the enthalpy of ionization is 47.56 kJ/mol (X), the pH at 72 °C should be ≈7.1-7.2. The sample was boiled for 2 min before the addition of enzyme. The cDNA was amplified for 30 cycles (1.5 min at 94 °C, 1.5 min at 50 °C, and 2 and 3 min at 72 °C for cycles 1-20 and 21-30, respectively) with a thermal cycler (MJ Research, Waltham, MA). Electrophoresis of the amplified products in 0.7% agarose gel in 0.04 M Tris, 0.02 M acetic acid, 0.01 M EDTA gave three major bands: 1.4, 0.6, and 0.4 kbp. These products were isolated from the gel, subcloned into the SmaI site of pUC18, and sequenced by the dideoxy method using Sequenase (U. S. Biochemical Corp.). The 1.4-kbp product was found to be the target L1 cDNA.
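Two numerical statements in the amplification protocol can be checked with a short calculation: the 2,048-fold redundancy of the primer and the drop in Tris buffer pH between 22 and 72 °C. The sketch below is illustrative only. The IUPAC string `PRIMER` is an assumed transcription of Sequence 1 (each X read as N, and each base printed beneath the primer read as a two-fold alternative at that position). The pH estimate assumes a temperature-independent enthalpy of ionization (an integrated van't Hoff relation) and a fixed acid/base ratio. The function names are hypothetical.

```python
import math

# IUPAC nucleotide codes used here and the bases each one stands for.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "N": "ACGT"}

# Assumed IUPAC transcription of Sequence 1 (not given in this form in the text):
# TTY CAR TAY YTN GTN AAR AAY CAR AA
PRIMER = "TTYCARTAYYTNGTNAARAAYCARAA"


def degeneracy(primer: str) -> int:
    """Count the distinct sequences encoded by a degenerate primer."""
    count = 1
    for base in primer:
        count *= len(IUPAC[base])
    return count


def ph_at(temp_c: float, ph_ref: float = 8.5, temp_ref_c: float = 22.0,
          delta_h: float = 47560.0) -> float:
    """Estimate buffer pH at temp_c from its pH at temp_ref_c.

    Uses pKa(T) - pKa(Tref) = dH / (ln(10) * R) * (1/T - 1/Tref),
    with dH in J/mol treated as constant (an assumption).
    """
    gas_constant = 8.314462618  # J/(mol K)
    t = temp_c + 273.15
    t_ref = temp_ref_c + 273.15
    shift = delta_h / (math.log(10) * gas_constant) * (1.0 / t - 1.0 / t_ref)
    return ph_ref + shift


if __name__ == "__main__":
    print(len(PRIMER), degeneracy(PRIMER))  # 26 2048
    print(round(ph_at(72.0), 2))
```

Under these assumptions the primer is a 26-mer with 2,048 variants (two four-fold and seven two-fold positions), and the constant-enthalpy estimate for 72 °C comes out a little under pH 7.3, close to the ≈7.1-7.2 quoted in the text; the exact figure depends on the approximation used.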

The 5'-flanking sequence was obtained as follows. Single-stranded cDNA was synthesized from poly(A)+ RNA by primer extension using the nonredundant oligomer 5'-ACGTGGGTGATGTTGAGACTGCAT-3' (24-mer), which corresponds to the complementary sequence of the cDNA in positions 342-365 (Fig. 1). A poly(A) tail was then added to the 3' end of the single-stranded cDNA by terminal transferase (Stratagene, San Diego, CA). This material was used to amplify the 5' region of L1 cDNA by PCR under conditions similar to those described above. The primers used (1 pmol each) are oligo(dT) with adaptor for the XbaI site and the oligomer 5'-CTTCTTCGCCAGGTAGTCGATGTGCA-3' (26-mer) (position 101-126, Fig. 1). The products (100-150 bp) were subcloned in PCR-1000 (TA cloning system, Invitrogen, San Diego) and sequenced.
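Both oligomers in this step are antisense primers, so each should equal the reverse complement of the stated stretch of the cDNA. A minimal sketch of that check follows; the two sense-strand segments are transcribed here from Fig. 1 using its position numbering, and the variable and function names are illustrative.

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


# Sense-strand segments read from Fig. 1 (transcribed by hand for this sketch).
SENSE_342_365 = "ATGCAGTCTCAACATCACCCACGT"    # cDNA positions 342-365
SENSE_101_126 = "TGCACATCGACTACCTGGCGAAGAAG"  # cDNA positions 101-126

# Primers as quoted in the text.
PRIMER_EXTENSION = "ACGTGGGTGATGTTGAGACTGCAT"   # 24-mer used for primer extension
PRIMER_5P_PCR = "CTTCTTCGCCAGGTAGTCGATGTGCA"    # 26-mer used in the 5' PCR

if __name__ == "__main__":
    print(reverse_complement(SENSE_342_365) == PRIMER_EXTENSION)  # True
    print(reverse_complement(SENSE_101_126) == PRIMER_5P_PCR)     # True
```

The same function can be applied to the reverse primers used later for the genomic amplifications.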

Screening of the λgt10 cDNA Library-The λgt10 cDNA library of Lumbricus terrestris, constructed earlier (11), was screened with L1 clone C26. The insert of one positive clone was cut out by digestion with EcoRI, purified by agarose gel electrophoresis, subcloned in the EcoRI site of pUC18, and sequenced.

Amplification of Genomic DNA of Lumbricus L1 Chain-The genomic DNA used is the same sample as in previous work (11). For PCR, 2 µg of genomic DNA and nonredundant primers (1 pmol each)

CA GCC ATA ATC GTC AAC ATA ATG TGG TAC GTC CTA GGC CTT ATG CTG GTC GGT CTG GCG 39 M W Y V L G L M L V G L A

GCC GGG GCT AGC GAT CCC TAC CAG GAG AGG CGC TTT CAG TAC CTC GTT AAG AAC CAG AAC 99 A G A S D P Y Q E R R F Q Y L V K N Q N

CTG CAC ATC GAC TAC CTG GCG AAG AAG CTG CAC GAC ATC GAA GAG GAG TAT AAC AAG CTG 159 L H I D Y L A K K L H D I E E E Y N K L

ACC CAC GAC GTT GAC AAG AAG ACC ATT CGC CAG CTG AAG GCT CGC ATT AGC AAC CTA GAA 219

T H D V D K K T I R Q L K A R I S N L E

GAG CAC CAC TGC GAC GAG CAT GAG TCG GAA TGT CGT GGA GAT GTA CCA GAG TGC ATC CAC 279 E H H C D E H E S E C R G D V P E C I H

GAT CTG CTC TTC TGT GAC GGA GAG AAG GAC TGC AGG GAC GGA AGC GAC GAA GAC CCG GAA 339 D L L F C D G E K D C R D G S D E D P E

ACA TGC AGT CTC AAC ATC ACC CAC GTC GGC AGC TCA TAC ACC GGC CTG GCC ACC TGG ACG 399 T C S L N I T H V G S S Y T G L A T W T

AGC TGC GAG GAC CTC AAC CCT GAC CAT GCC ATC GTC ACC ATC ACC GCC GCC CAC CGC AAA 459 S C E D L N P D H A I V T I T A A H R K

TCC TTC TTC CCG AAC CGT GTC TGG CTT CGG GCC ACC CTC TCC TAC GAG TTG GAC GAG CAT 519 S F F P N R V W L R A T L S Y E L D E H

GAC CAC ACG GTC AGC ACG ACC CAG CTG AGG GGT TTC TAC AAC TTC GGC AAA CGC GAA CTC 579 D H T V S T T Q L R G F Y N F G K R E L

CTT CTC GCT CCT CTG AAA GGT CAG TCG GAG GGA TAC GGG GTG ATC TGC GAC TTC AAC CTC 639 L L A P L K G Q S E G Y G V I C D F N L

GGC GAT GAT GAC CAC GCC GAC TGC AAG ATC GTC GTT CCG TCC AGT CTG TTC GTC TGC GCA 699 G D D D H A D C K I V V P S S L F V C A

CAC TTC AAC GCC CAA AGA TAC T AGG CAC GAC AGC AGA CTA TAC ACG TAC AAT CGA CGT 757 H F N A Q R Y

GAA ATT TTT TAT GAA GAA GCA AAA TTC CGA GAT CGA ACG ATT GAT CAT CGG ACA ATC "T 817

GCA GTT GAA CTG TCT TCC GTT GTG AAA AAT CTT CTT ACA TTG TCC GCC ATC TTT ACT CCC 877

CAG CAG TCG ACA GGC AAA GTC TCG CGC CTA GCA AAA GAA GCC AGC TAG ACG CGA GAC TTT 937

GTT TCA TCC GCT GAG AAA TAT ACT TTA AAA GTC ACA CTC TGT CAC TGG TCA ACA ACG AAC 997

TTC TTA AAG CCG AAT AAC ACT TAA TTA TTT TAA AAC GAT TAA AAC GAA ACG CCT AAT TAT 1057

TAT GCA ATG ATA GCA GGA GAA CGG TCA CGT GGT TAC GAA AGA CCG CAG AAC AAG AAG GTC 1117

GCA GAA ACA TAG AAA TAA GCA CAA AAA CTG ATA AGT AAG TAT CAT TTT CTA TGG ACT GGA 1177

CTG AAC GGT AAA CAA GAC CAA AAA AAC TGC ACA CAA ATA CGG GAC AAA GTA ACA GGG AAA 1237

ACC AAA CAC GAT CTC ATA AAT TAA TTC GAA CTA CAC CCA AAG AGA AGA TCT TGT CGT TAA 1297

GGT CCA GTC GTC TGC CTC CAT TTT GTT CTT GTG GTC AGA TGC GAG TTG GAG TTA TGT TTG 1357

TCG AGA TAC TGT CGA GAA ATT CTA TGA GGT K A TCG AAA TTA TTC TGA TCA TGT TGT ACA 1417

CTG AAC TGA CGT GCG AAA ACT GAA AAT AAA TAT CGT TAG TTG AAA TAA AAC TCT CTC AAT 1477

TTG TAA AAA AAA AA 1491

FIG. 1. Nucleotide sequence of cDNA for linker chain L1 and the derived amino acid sequence. The sequence was constructed from data of three cDNA clones (5-22, 5-3, and C26) amplified independently. The sequence of nucleotides -20 to 100 is derived from clone 5-22 and nucleotides 99-1491 from clone C26. The signal sequence is underlined. Arrows mark the positions of splice junctions determined from the combined data of Figs. 1 and 2.
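The derived amino acid sequence in Fig. 1 follows from translating the open reading frame with the standard genetic code. The sketch below shows that step on the first 13 codons (positions 1-39) and on the last codons before the stop; both nucleotide strings are transcribed from Fig. 1, and the helper names are illustrative rather than anything used by the authors.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cdna: str) -> str:
    """Translate an in-frame coding sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cdna) - 2, 3):
        residue = CODON_TABLE[cdna[i:i + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


# Segments transcribed from Fig. 1 for this sketch.
ORF_START = "ATGTGGTACGTCCTAGGCCTTATGCTGGTCGGTCTGGCG"  # positions 1-39
ORF_END = "CACTTCAACGCCCAAAGATACTAG"                   # last 7 codons plus stop

if __name__ == "__main__":
    print(translate(ORF_START))  # MWYVLGLMLVGLA
    print(translate(ORF_END))    # HFNAQRY
```

The first output is the start of the signal sequence and the second is the COOH-terminal heptapeptide shown under the final coding line of the figure.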


were used. DNA was amplified for 30 cycles (1.5 min at 94 °C, 1.5 min at 55 °C, and 6 min at 72 °C). Other conditions in PCR are the same as described above.

The central region of the L1 gene was amplified using forward and reverse primers, 5'-CGCATTAGCAACCTAGAAGAGCAC-3' (position 202-226 in the cDNA (Fig. 1), 24-mer) and 5'-GTTGAAGTCGCAGATCACCCCGTA-3' (position 613-636 in the cDNA (Fig. 1), 24-mer). The amplified products were separated by electrophoresis on an agarose gel, and the target DNA was identified with a Southern blot by using the cDNA insert of clone C26 as a probe. The fragment (~400 bp) was then isolated from the gel and subcloned into PCR-1000. Three clones were sequenced completely. Subsequent results showed that the forward primer used in this amplification was inappropriate because the genomic DNA corresponding to the primer was split by intron 1 (see Fig. 2). Nevertheless, PCR worked even in this situation.

The 5' region was amplified by PCR using two primers, 5'-ATGTGGTACGTCCTAGGCCTTATG-3' (position 1-24 (Fig. 1), 24-mer) and 5'-TACATCTCCACGACATTCCGACTC-3' (position 241-264 (Fig. 1), 24-mer). The target product (~2.3 kbp) was identified as described above and subcloned into PCR-1000. One clone (clone 2-36) was sequenced completely, and three clones were partially sequenced.

Finally, the 3' region was amplified by using two primers, 5'-CACCGCAAATCCTTCTTCCCGAAC-3' (position 451-474 (Fig. 1), 24-mer) and 5'-CGCACGTCAGTTCAGTGTACAACA-3' (position 1409-1432 (Fig. 1), 24-mer). The target product (~1.7 kbp) was subcloned into PCR-1000. One clone (clone 10-2) was sequenced completely, and three clones were sequenced partially.

A FASTA search (16) for sequences similar to the cysteine-rich segment of chain L1 was performed by the Protein Identification Resource, Washington, D. C.

RESULTS

The cDNA encoding chain L1 of L. terrestris, beginning with the position 73, was successfully amplified by PCR using oligo(dT) and a 2,048-fold redundant oligomer as primers. The 1.4-kbp product was subcloned into pUC18, and two independent clones, C26 and C38, were isolated and sequenced completely (Fig. 1). The amplification of the 5' region yielded two additional clones, 5-22 and 5-3. The sequences of these clones differed in 14 positions in the coding region, but these differences were restricted to the third position of a codon and caused no amino acid differences (Table I). The noncoding 3' region showed 31 single-base differences and 9 deletion/insertion changes.

The 5'-flanking region of the cDNA was amplified separately by PCR. The products were subcloned, and clone 5-22 with the longest insert (~150 bp) and another clone 5-3 (~100 bp) were sequenced. The entire cDNA sequence was obtained by combination of the sequences of clones C26 and 5-22 (Fig. 1). The translated amino acid sequence consists of 240 residues including 15 residues of signal peptide. The sequence of the 28 NH2-terminal residues of chain L1 (excluding the signal sequence) corresponded exactly to those determined by direct sequencing (14) except for residue 20. Redetermination of the NH2-terminal sequence confirmed that histidine occupies this position. The earlier identification of leucine in position 20 evidently resulted from carryover from residue 19.

The cDNA library was screened by using the insert of clone C26 as a probe, but no clone with a full-length target insert was found. One clone (λD131) with a ~850-bp insert was isolated and sequenced. The sequence of clone λD131 corresponded to the position 509-1373 in Fig. 1. No nucleotide differences were found between λD131 and C26 (Fig. 1) in the coding region.

The nucleotide sequence of the gene for chain L1 was constructed from three overlapping fragments separately amplified by PCR. No differences in sequence were found in the overlap regions. As shown in Fig. 2, the L1 gene is characterized by three exons and two introns (intron 1, 2,011 bp; intron

TABLE I
Nucleotide differences in codons in the cDNA and genomic DNA for linker chain L1 of L. terrestris
Dots (•) indicate the same nucleotide as found in cDNA clones 5-22 and C26.

Codon  Amino acid  cDNA clones  Genomic clones

5-22 5-3 C26 C38 2-36 41 10-2 10-5

1 Ala GCT **G ... 3 Asp GAT **C

43 Asp GAC -*- **T 46 Thr ACC **G *** 52 Ala GCT -C 61 His CAC 0.T e.. e..

62 Cys TGC ..T e..

65 His 82 Phe

- ...

CAT **C e..

TTC **T 0..

105 Thr ACC -T 111 TyP TAC **. C*- 127 His CAT **C 135 Ala GCC **A 144 Arg CGT *a*

155 Leu 0.. ... *.c

TTG **e

169 Gly GGC **T ...... C..

189 Gly GGA **G ... .*. 0..

201 Asp GAT -C 0.. ..c 214 Leu CTG 0.0

218 Ala GCA 0-G ..G 0.G

0.0

... 0..

... ... 0..

.*A e..

Amino acid residues are those deduced from clones 5-22 and C26.

* Sequencing a mixture of 10 other genomic clones gave TAC, which is consistent with the cDNA; only one clone gave CAC (His).

2, 717 bp). The sequences around splice junctions, pyrimidine tracts, and possible branch points of introns were confirmed in at least four clones.

The genomic clones yielded five codon differences in addition to those found in the codons of the cDNA (Table I). Codon 111 in genomic clone 41 was CAC (His) where TAC (Tyr) was found in cDNA clones C26 and C38. This is the only amino acid difference found. However, the nucleotide sequence in this region was redetermined with a mixture of 10 other genomic clones, and codon 111 was clearly TAC with only a trace of CAC. We conclude that TAC is the major codon.
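Whether a codon difference is silent can be decided by translating both codons with the standard genetic code. The sketch below classifies three of the differences discussed here: the third-position changes at codons 1 and 3 as read from Table I, and the TAC/CAC difference at codon 111. It is an illustration of the reasoning only; the function name and output wording are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify(reference: str, variant: str) -> str:
    """Label a codon change as synonymous or as an amino acid replacement."""
    before, after = CODON_TABLE[reference], CODON_TABLE[variant]
    if before == after:
        return f"synonymous ({before})"
    return f"replacement ({before} -> {after})"


# (codon number, reference codon, variant codon)
DIFFERENCES = [
    (1, "GCT", "GCG"),    # Ala, third-position change
    (3, "GAT", "GAC"),    # Asp, third-position change
    (111, "TAC", "CAC"),  # genomic clone 41, first-position change
]

if __name__ == "__main__":
    for number, reference, variant in DIFFERENCES:
        print(number, reference, variant, classify(reference, variant))
```

Only the codon 111 difference changes the encoded residue (Tyr to His), which is why it was singled out for resequencing.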

DISCUSSION

Protein

Alignment of the amino acid sequence of L1 with those of the three other known sequences of linker chains (Fig. 3) shows that 21 residues are conserved in all four chains; the overall extent of identity is 23-30%. Twelve of the conserved residues (57%) are included in a segment that contains a repeating pattern of cysteinyl residues: (Cys-X6)3-Cys-X5-Cys-X10-Cys. This remarkable pattern, not present in any globin sequence, corresponds exactly (Fig. 4) to the seven cysteine-rich repeats of the ligand binding domain of the low density lipoprotein (LDL) receptors of man (17) and Xenopus laevis (18). Furthermore, the cysteine-rich segment of linker chain L1 contains the sequence Asp-Gly-Ser-Asp-Glu, which is characteristic of the LDL repeats (17, 18). Similar cysteine-rich sequences occur in two mammalian proteins, complement C9 (19) and a renal glycoprotein (20). These data, taken together, clearly indicate that these cysteine-rich sequences all have a common ancestry. This is the first occurrence of the cysteine-rich motif of the LDL receptor in invertebrates. The correspondence between the LDL receptor repeats and complement C9 led Sudhof et al. (17) to suggest that the motif has been incorporated into different proteins by the process of exon shuffling, an idea reinforced by the finding of introns


71 KGTGGTACG TCCTAGGCCT TATGCTGGTC GGTCl'GGCGG CCGGGGCTAG CGATCCCTAC CAGGAGAGGC 1 CXYFITCAGTA CCTCGTTAAG AACCAGAACC TGCACATCGA CTACCTXCG AAGMGCTGC ACGACATCGA ~~~~~ ~ ~~

141 AGAGGAGTAT AACAAGCTGA CCCACGACGT TGATAAGAAG ACCATTCGCC AGCTGAAGGC CCGCATTAGCl 211 AACCTAGAAGI~GTCAAC TCGGTAATCG TAATCGTACT TGTAATCTCC AAACTTCTAA AGCGCCATGG 281 GGAGTTTATC AGGGGCTAGC GGGAGGCTAA GCCCCCGGGA AGACCCAGAG TCA?Tm"rrA CCATGTAAAT 3 5 1 GGTATCTTTC CTCGGGCCTA GTCCCCCACG CGAACGATAC GAATGACTAA AAAGTCATCA A C m r ? r G G 421 CCGGGAAAAT ATTTCTCGTT TTCTGGACTG AGTCTCAACC CCAACACATT TAGCCCTCCA AGAACGGAAA 491 CCAAACTCCG CCCATGTAAA GCGCTACTCG AAAGCCAAGC GEGGCGGCAC CAACTTATTC ACGAGGGCTG 561 AGTCAAGTCA AAAGGGTTAT CCGTAAGACG GTAAGAGTAG TCCGAAGGTG ATCCAGGTCG AGT~CACTG

701 631

771 841 911 981 1051 1121 1191 1261 1331 1401 1471 1541 1611 1681 1751 1821 1891

2031 1961

2101 2171 2241 2311 2381 2451

2591 2521

TACTAGATA GTCACGCATC GCTGACAACG CAGCTCGTGA ATAAGCTGGT GCCCTGCATG CGCTTGGCTT TCGAGGGACG CTTTAGAAAT WGGCGATTG CGA?TATGGT CAGGATTATG ATTATGTTA? GTGAGACGCA A A A G A T C GTATITCCAT ATICGTTCAT TCGTTGGTGA ATTTCTATAG CCCTTCTTCA AGTGGTCTAC ATGGCATACT CAAAAGGCAC TCCTAAATCT A G C A m T A A TAGCGAACAG TT?TGAGTTG AATGTAAAGG T G e r m T A T T AGAGTCGCAG GAAGCAGGCA ACTCATGTAG GCCTAAGGAT ACTCGAACTC GCATCGTTAG

ACAGTCATGA GCTACATAAG GCAACGTGAT TCAGGCACAG AGGATAACCA TCTGAGCTGA ATTCCACCAA CATACAGACT ACAGTICCAG CGAAATATCG AATAGATTCA CGTGCGAAAC GAATAACAAT CGGTGAGGGG

ATACATCGCA TGAACCGTGT TAGAAAGAGA GATATCGAAT AAGTAAGTAA GTAAGTATGT ATGTAClTTA TGCATCGTCT TTTWAAGTT TITCGATGGT CTCCTGCTGT AACCGTATCA CATGTAGAGG TATATATAAT

TTGAACCGCC TGCAAACGCA CTGCGGGAAC AATGGACGAG AACAGAGTAT ACCTTCGTCT TTGCTCGCGT ATTGCAAAGA ATTGATCGTG TGAGATCATT ATCCAGACGG CTTAATTAAC TCCCTCCGTA TGGAATGGAC TIXFGCCCTTC TCAACACTCT GTTATAGGTC TACCATCATT TCTGTGACAT CTCTGTCTCT C A m A T A G GACCTI'CCTT CTATCCCTGG GTACCTCAAA ATGGGAGCGC TTCTGAAGGT TTTAAATACT GTGAACCTGC AAGAAACGCT TTATAATCGG TCAGTGGCGT AGTCAAGGGG GmCGGGGG TTAAACCCCC ATTGGCCTAG


Exon 1

Intron 1

GAAAAAGTTT T m A A A G G T TATTTATCCA TTTAAATAAG AAAAAAAAAC AATTGCTTGT AGTAGCAATT G- CWATTCCC TCTGTTACTT T A ~ C C T T T TTATTTACTG T A A T A ~ G T T ~GGTTGGTGT CAAGAAGGAC AGCCGGCCAG ACCATAAAAC CCCAACCCCT CCCCCCATCA TAAATTCCTG GCTACGCCAC TATAATCGGT CAAATGATAG CGAAATGTGT TAGTTAGTAA TTTAGTTAA ACGACGATGT GTGATAACAA AGACCAWAG CGACGCCCTC CTCCTTCATC CACTC= ACTGGCACTG GCAGAATGCA ACCAGTGGCG C Possible A G A A A m T TTGGCATATC ACAGCAGTAG TACTTCGGGC =GACAC TlT"CATIT C G A A G A T G A T C GAGTITITAG TPGGATGGAT GGATCCAGAG TTGACCAGGG TCTATGACAA CAGAGGCCGT TACCCTGGAC Branch TAGTGTCGGA GGCATATCAC A G C A ~ A T A CGAAGACTAT TATATCAACA AAATCACACG GACGCTGTAG ACAGCCGCCA C A T T C A ~ TAGAGGCITA TAATATATAC GTTGGTTGTC GTTTTCTTCA G~AGCACCACT

Points GCGACGAGCA TGAGTCGGAA TGTCGTGGAG ATGTACCAGA GTGCATCCAC GATCECTCT TCTGTGACGG AGAGAAGGAC TGCAGGGACG GAAGCGACGA AGACCCGGAA ACATGCAGTC TCAACATCAC CCACGTCGGC AGCTCATACA CCGGCCTGGC CACCTGGACG AGCTGCGAGG ACCTCAACCC TGACCATGCC ATCGTCACCA TCACCGCCGC CCACCGCAAA TCCTICTTCC CGAACCGTGT CTGGCITCGG GCCACCCTCT CCTACGAGTT

CITCTCGCTC CTCTGAAAGG TCAGTCGGAG GGATACGGGG TGATCKCGA CTTCAACCTC GGCGATGATG GGACGAGCAT GACCACACGG TCAGCACGAC CCAGCTGAGG GGlTITTACA ACTTCGGCAA ACGCGAACTC

Exon 2

2661 ACCACGCCGA CTGCAAGATC GTCGWCCGT CCAGTCTATT CGTCXCGCG CACTTCAACG CCCAAAGATA~ GCACGA CAGCAGACTA TACA+TACG TATTGATAAC ACATGCATGC T A T A ~ T C T GTCTTCTCGT 2731

2801 CTCTGTGTCT ATGTCTGTTT GTCTGTCCTG TTTGTGTCTG TGTGTCTCTC CCTAT"AA TTATTTGGTT 2871 GGTCTACTCT ATCCTATTCC A G C C ~ T G T GTATCTPGCT AAAGTGACTG TAAGCWAGT ACGTFPTCG Intron 2 2941 CAGAAAGATA TCAGCGAGAC GAATGACAGG TCAGACAGGC ACTGGGTCAG AGAGATTAAA TGCGGGCTCT 3011 AACAGAAGGA AACAAGTCAT GTACAGCAGG CCTAGTAGGC TATATATGTC CWCACCAGC AAAAATTGCA 3081 GGGAAAGGTG GTAACCCATA TCTCAACAAA CCTGCTCAAA GGACGACAAG TTCTCTAGAC TGGAAGGTAA 3151 TGGTGGGATA TCAAACAACT TATAAAGGTC TGATCCTGGT CCTTTCAGTG TGTACCCCGT GAAACCCTAC 3221 CAAACAAGTC GGTGGCATAC AGTATCCGTG CCAGATTGTA GCAAGCACGA CATCTAAAGG AAGTGGACAC Possible 3291 GCAGCCCAGA TGGAACGTTG CAAGAAATGC TTCTGAATTC TPXGCTTTT TGTGC- T G A C A A A A G C C D-"-L 3361 CAGGGATCAC AATATTACGA TTECTGAAC TGTGCATGAA C~TGACCITC T G T A C A A ~ ATCTT~CGCAC DIHIIWI 3431 GGTACAATCG A ~ A A C C I T CTGTACAATT TTKTTCGC ~ T A C A A T C GAFGTGAAAT T ~ T A T G A A Points 3501 IGAAGCAAAAT TCCGAGATCG AACGACTGAT CATCGGACAT TCGTTGCAGT TGAACTGTCT TCCGTTGTGA 3571 3641 3711 3781 3851 3921 3991 4061 4131

AAAATCTKT TACATTGTCC GCCATCTTTA CTCCCCAGCA GTCGACAGGC AAAGTCTCGC GCCTAGCAAA

TGTCACTGGT CAACAACGAA CTTCfiAAAG CCGAATAACA CTTAATTATT TTAAAACGAT TAAAACGAAA CGCCTAATTA TTATGCAATG ATAGCAGGAG AACGGTCACG TGGTTACGAA AGACCGCAGA ACAAGAAGGT

AGAAGCCAGC TAGACGCGAG A C T T E ~ ATCCGCTGAG A A A T A G G C ~ A T A C ~ A A A AGTCACACTC

AAACAAGACC AAAAAAACTG CACACAAATA CGGGACAAAG CAACAGGGAA AACCAAACAC GATCTCATAA ATTAATTCGA ACTACACCCA AAGAGAAGAT CTTGTCGWA AGGTCCAGTC GTCTGCCTCC ATTTWTTCT

FIG. 2. Nucleotide sequence of the gene encoding Lumbricus linker chain L1. The sequence was constructed from the data of three genomic clones (2-36, 41, and 10-2) amplified independently. The sequence of nucleotides 1-2251 is derived from clone 2-36, nucleotides 2236-2623 from clone 41, and nucleotides 2486-4166 from clone 10-2. Reverse complement and direct repeats are marked by arrows. The beginning and ending of the long 41-base direct repeats at the end of intron 2 are marked by dots (•).

between most of the repeats of the LDL receptor. The conserved cysteine-rich segments of L1 and complement C9 differ from some of the LDL receptor repeats by less than the LDL repeats differ among themselves. An unrooted phylogenetic tree shows relationships between these cysteine-rich sequences (Fig. 5). Although many similar trees can be constructed from these short sequences with different algorithms (22, 23), they all suggest a relationship between the LDL receptor repeats, complement C9, and L1. Homologues of the LDL receptor of vertebrates have not yet been found in invertebrates, but the close correspondence of the L1 sequence with that of the LDL receptor greatly enhances this possibility. The common ancestor to annelids and chordates probably dates from the Cambrian period, about 600 million years ago (24, 25), so that this cysteine-rich motif must have been conserved for at least this long.
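The cysteine spacing of the conserved segment can be located in a sequence with a regular expression. The sketch below searches part of the L1 sequence transcribed from Fig. 1 for six cysteines separated by 6, 6, 6, 5, and 10 residues, which is the spacing observed in that stretch of L1. The segment string and the names used are illustrative, and the pattern is a simple spacing filter, not the FASTA search the authors ran.

```python
import re

# Six cysteines with intervening runs of 6, 6, 6, 5 and 10 residues:
# (Cys-X6)3-Cys-X5-Cys-X10-Cys, spanning 39 residues in total.
CYS_MOTIF = re.compile(r"(?:C.{6}){3}C.{5}C.{10}C")

# Part of the derived L1 sequence, transcribed from Fig. 1 for this sketch.
L1_SEGMENT = "EHHCDEHESECRGDVPECIHDLLFCDGEKDCRDGSDEDPETCSLN"


def find_cys_repeat(sequence: str):
    """Return (start, matched_segment) for the first motif hit, or None."""
    hit = CYS_MOTIF.search(sequence)
    if hit is None:
        return None
    return hit.start(), hit.group()


if __name__ == "__main__":
    start, segment = find_cys_repeat(L1_SEGMENT)
    print(start, len(segment), segment)
    print("DGSDE" in segment)  # Asp-Gly-Ser-Asp-Glu lies inside the repeat
```

The match covers 39 residues and contains Asp-Gly-Ser-Asp-Glu between the fifth and sixth cysteines, consistent with the description of the segment in the text.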

A rigid, highly constrained, and stable conformation of the ligand binding repeats in the LDL receptor is dictated by the extensive disulfide cross-linking. The disulfide connectivity is not yet known but may be similar to that of wheat germ agglutinin (26) and to those of various peptide neurotoxins of cone shell molluscs (27) and other venoms (28-30), which display a variety of connectivities. The cross-linking in these toxins renders them extraordinarily stable; their toxicity can persist in 8 M urea and after exposure to anhydrous formic acid or 1 N HCl (31). Although none of the disulfide connections has been chemically determined in linker L1, their arrangement outside the cysteine-rich segment can be predicted (Fig. 6) from those determined for chain T1 in Tylorrhynchus Hb (Footnote 3).

The disulfide-induced constraints in the LDL receptor repeats must contribute to the specificity and tight binding of

3 T. Suzuki and T. Takagi, unpublished results.


FIG. 3. Comparison of the amino acid sequence of linker chain L1 of L. terrestris Hb with those of other linkers. LV, linker chain of Hb from the vestimentiferan, Lamellibrachia sp. (9); T1 and T2, linker chains from Hb of T. heterochaetus (8). [Alignment not reproduced here; the figure labels the signal sequence and the splice junction in L1.]

FIG. 4. Comparison of the cysteine-rich segments of linker chains with the seven repeats of the ligand binding domain of the LDL receptor of man (17). Symbols are as in Fig. 3. Circled amino acids mark codons that are split by splice junctions in the genes.


apolipoprotein E to the LDL receptor (K_D = 1.2 × 10^-10 M) (32, 33). The binding depends upon interaction between Asp and Glu residues of the receptor repeats and a corresponding set of Arg and Lys residues in a segment of helix 4 of apolipoprotein E of the cholesterol-transporting particle (33). Since linker chain L1 has all five of the Asp and Glu residues that are conserved in the ligand binding repeats of the LDL receptor, we suggest that a chain of Lumbricus Hb carries a set of positively charged residues that is stereochemically similar to the set determined for apolipoprotein E. It is likely that the conformation of the highly cross-linked cysteine-rich motif is not appreciably modified by insertion into a different protein. This conclusion is consistent with the finding that the hybrid protein formed by insertion of a snake neurotoxin (also highly cross-linked) into an Escherichia coli phosphatase has both biological activities (34). The positively charged groups might occur either in one of the heme-binding chains (a, b, or c), in the linker chains themselves, or in both. Chain d is probably not involved because the full hexagonal structure

FIG. 5. A phylogenetic tree for the seven repeats (R1-R7) of the ligand binding domain of the LDL receptor and the corresponding cysteine-rich segments of four linker chains (LV, T1, T2, and L1) and complement 9 (C9). The program used is that of Feng et al. (21).

can apparently form in its absence (4). A possible site for L1 binding is the NH2-terminal helical segment of heme-binding chain b, which has five positively charged residues in appropriate positions (Fig. 7). Presumably, nonelectrostatic interactions between subunits must also be involved because high salt does not dissociate Lumbricus Hb (12). The LDL motif/chain b may be one interface, but nonpolar interactions are likely to be the most important energetically. Linker chains may have multiple interactions with the heme-binding chains and are likely also to form linker-linker interactions.

A hydrophilicity plot of linker chain L1 (Fig. 8) shows four distinct highly hydrophilic segments in the first half of the molecule, a pattern very different from that in the last half of the molecule. The fourth hydrophilic segment includes the cysteine-rich motif (residues 60-100). The hydrophilicity patterns of chains c and L1 (Fig. 8) are clearly very different and do not suggest any obvious relationship. It is striking that almost half of the first 100 residues in L1 are charged. Perhaps the four hydrophilic segments correspond to different protein-protein contacts.
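The windowed profile behind a plot of this kind can be sketched as follows. This is a minimal illustration, not the authors' program: it assumes the standard Kyte-Doolittle hydropathy values, a 7-residue window as stated in the caption of Fig. 8, and that "hydrophilicity" is simply the negated hydropathy average. The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KD_HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=7):
    """Return (center residue number, mean hydropathy) for every full window.

    Residue numbers are 1-based; the center of the first window is
    residue window // 2 + 1.
    """
    seq = seq.upper()
    if window < 1 or window > len(seq):
        raise ValueError("window must be between 1 and the sequence length")
    scores = [KD_HYDROPATHY[aa] for aa in seq]
    half = window // 2
    return [
        (start + half + 1, sum(scores[start:start + window]) / window)
        for start in range(len(scores) - window + 1)
    ]


def hydrophilicity_profile(seq, window=7):
    """Negated hydropathy, so that hydrophilic stretches plot as positive.

    The sign convention is an assumption made for this sketch.
    """
    return [(pos, -value) for pos, value in hydropathy_profile(seq, window)]
```

Applied to a full chain sequence, the second function gives one value per window position; runs of Asp, Glu, Lys, and Arg produce the positive peaks, which is why a segment in which almost half of the residues are charged stands out.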

We have assumed that linker chain L1 does not bind heme, but we have not isolated linker chain L1 under conditions that would retain the heme. The data of the accompanying paper (13) show that one-third of the linkers have heme and are consistent with the earlier observations (12) that subunit D2 binds some heme. However, no observations have excluded the possibility that all linkers may bind some heme, albeit with low affinity. We have examined each of the 14 histidyl residues in chain L1 as a possible candidate for a proximal histidine. Residues NH2-terminal to a putative histidine should then correspond to the E and F helices of globins. The only possibility for a proximal histidine which we have identified on this basis appears to be His-136 with His-106 as the distal residue (Fig. 9). The alignment (Fig. 9) corresponds to 20-23% identical residues depending on whether gaps are included or not. The expected heme pocket residues on this basis seem plausible (Fig. 9). However, two problems arise with this idea. (i) Cys-120 is probably disulfide-linked, and this linkage is likely to perturb a potential binding site for heme. (ii) The conserved cysteine-rich motif is only 6 residues NH2-terminal to the hypothetical distal histidine. This block of disulfide-constrained residues would comprise part of the entrance to the heme pocket and would be expected to interfere with heme binding. A heme-containing linker fraction from Lumbricus Hb has recently been reported to have CO

FIG. 6. Proposed disulfide connectivity for linker chains. Both T1 and T2 occur as disulfide-linked dimers (8), but only the disulfides in T1 (solid lines) have been determined chemically (see Footnote 3). Dashed lines are those suggested for the other linker chains. The dotted connections provided for the cysteine-rich motif are arbitrary and depict the arrangement found in wheat germ agglutinin and in several cone shell toxins. The position numbers and symbols for chains are as depicted in Fig. 3. [Diagram not reproduced here; labeled positions are 10, 12, 91, 98, 105, 112, 118, 129, 151, 226, 238, and 251, and the LDL-like segment is marked.]


FIG. 7. Comparison of residues 133-158 of helix 4 of apolipoprotein E of man (32, 33) with the NH2-terminal segment of chain b of L. terrestris Hb (36). This comparison suggests a possible similarity in protein-protein interactions and is not intended to imply any homology. [Sequence comparison not reproduced here; positions 1-26 are shown.]

FIG. 8. Hydrophilicity plots of the amino acid sequences of chains L1 and c of L. terrestris, determined by the procedure of Kyte and Doolittle (35). Hydrophilicity window = 7. [Plots not reproduced here; the upper panel is labeled chain L1 (axis ticks at residues 20-220) and the lower panel has axis ticks at residues 20-160. The x axis is residue number and the y axis runs from -4.00 to 4.00.]

FIG. 9. Possible alignment of the E, F, G, and H helices of chains c and L1 of L. terrestris. The assumption is made that residues 106 and 136 of L1 (marked with *) are the distal and proximal histidines. The helix designations are those tentatively assigned to chains a, b, c, and d on the basis of alignment with known structures (36, 37) and model building (38). Residue 70 of chain c corresponds to position 7 of the E helix. The numbering of residues is for chain L1, not the bookkeeping numbers of Fig. 3.

and NO binding kinetics similar to those for intact Hb and to those of subunits containing chains a, b, c, and d (39). This finding suggests that at least one linker chain not only has heme but also may be globin-related. Unfortunately, this fraction has not been characterized as to chain composition.
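The dependence of the identity figure on gap treatment (20-23% for the alignment of Fig. 9) can be illustrated with a short sketch. The function below is hypothetical and is not the procedure used in the paper; it assumes two already-aligned strings of equal length in which "-" marks a gap.

```python
def percent_identity(aligned_a, aligned_b, count_gaps=True):
    """Percent identical positions between two aligned sequences.

    With count_gaps=True, a column holding a gap in one sequence counts
    toward the denominator (as a non-identical position). With
    count_gaps=False, such columns are skipped. Columns that are gaps in
    both sequences are always skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" and b == "-":
            continue
        has_gap = a == "-" or b == "-"
        if has_gap and not count_gaps:
            continue
        compared += 1
        if not has_gap and a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared
```

Because gapped columns enlarge only the denominator, the value computed with gaps counted can never exceed the value computed without them, which is the direction of the 20% versus 23% spread quoted in the text.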

Comparison of the sequence-derived molecular mass, 25,847 Da, with the mass determined by mass spectrometry (13), 27,728 ± 15 Da, suggests the presence of 1,881 Da of carbohydrate. Both mannose and N-acetylglucosamine have been reported for Lumbricus Hb (40, 41). N-Acetylglucosamine is usually bound to an asparagine residue in the sequence Asn-X-Ser/Thr. Although no such sequence is present in Lumbricus chains a, b, c, or d (36, 37), one possible binding site is present in L1: Asn-Ile-Thr (residues 103-105).
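Both steps of this argument, the mass difference and the search for an Asn-X-Ser/Thr site, are easy to reproduce. The sketch below is illustrative only. The helper names are hypothetical, and the exclusion of proline at the X position is the usual convention for this motif rather than something stated in the text.

```python
import re

# Asn-X-Ser/Thr, with X taken to be any residue except Pro (an assumption
# of this sketch). The lookahead lets overlapping sites all be reported.
SEQUON = re.compile(r"(?=(N[^P][ST]))")


def find_sequons(seq):
    """Return (1-based position of Asn, tripeptide) for each candidate site."""
    return [(m.start() + 1, m.group(1)) for m in SEQUON.finditer(seq.upper())]


def mass_excess(observed_da, calculated_da):
    """Mass not accounted for by the polypeptide sequence, in Da."""
    return observed_da - calculated_da
```

With the values quoted above, mass_excess(27728, 25847) gives 1881, the figure attributed to carbohydrate; running find_sequons on the full L1 sequence would be expected to report the Asn-Ile-Thr site at residue 103.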

Gene Heterogeneity-A total of 20 nucleotide differences was found in the cDNA and genomic DNA specifying the codons for chain L1 (Table I). Eighteen of the 20 differences involve changes in the third base, and 61% of these involve C ↔ T transitions. We were concerned that the PCR might be responsible for some of these differences. Although we cannot exclude this possibility completely, our reaction conditions are those that have been found to give error rates of ≤ 3 × 10^-4, or one nucleotide substitution in about 3,300 (42), which is far below the observed frequency of substitution. Comparison of the 3'-noncoding regions showed 30 nucleotide substitutions and 9 deletion/insertion differences (data not shown). This frequency is more than twice that observed in the coding region. No known mechanism exists whereby the Taq polymerase could recognize the codons in such a way as to make 90% of the supposed errors of transcription in the third position of the codons. We therefore believe that most of these differences are not transcriptional errors but reflect the presence of different genes. The more than 2-fold higher number of changes in the 3'-noncoding region supports this conclusion. However, we cannot distinguish between polymorphism and multiple alleles because the DNA was obtained from a large number of worms. A total of at least four genes appears to be required to explain the nucleotide differences in Table I.
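The two quantitative points in this paragraph can be checked with a few lines of arithmetic. This is a rough sketch under stated assumptions: the coding length is taken as 225 codons (675 bases, from the 225-residue chain reported in the abstract), the error rate as one substitution per 3,300 bases, and polymerase errors are assumed to be independent and equally likely at each codon position. None of these modeling choices comes from the paper itself.

```python
from math import comb


def expected_errors(n_bases, error_rate):
    """Expected number of substitutions in n_bases at a per-base error rate."""
    return n_bases * error_rate


def third_position_tail(n_diffs, n_third, p_third=1 / 3):
    """Binomial probability that at least n_third of n_diffs random
    substitutions fall in the third codon position."""
    return sum(
        comb(n_diffs, k) * p_third**k * (1 - p_third) ** (n_diffs - k)
        for k in range(n_third, n_diffs + 1)
    )


CODING_BASES = 225 * 3   # assumed coding length in bases
ERROR_RATE = 1 / 3300    # one substitution in about 3,300 bases (42)

if __name__ == "__main__":
    print(expected_errors(CODING_BASES, ERROR_RATE))  # well below 1
    print(third_position_tail(20, 18))                # a very small probability
```

Under these assumptions the expected number of PCR-derived substitutions in the coding region is about 0.2, against 20 observed, and the chance that 18 or more of 20 random errors would land in the third position is of the order of 10^-7. Both numbers point the same way as the argument in the text.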

The pronounced bias toward C ↔ T transitions (Table I) is about evenly distributed between C → T and T → C in contrast to the strong preference for T → C mutants in the


fidelity of DNA synthesis by Taq polymerase (43). A high proportion of mutants of human Hb appears to result from C → T transitions that are caused by deamination of cytosine residues in methylated CpG dinucleotides (44). Examination of the L1 cDNA and genomic sequences, however, shows that only one of the 5 C → T transitions is followed by a G. Perhaps the high frequency of C ↔ T transitions in the L1 gene also reflects the presence of methylation "hot spots," but the mechanism may be slightly different.

Introns-The 4,166-bp genomic sequence of chain L1 (Fig. 2) is punctuated by two introns of 2,011 and 717 bp, respectively, which separate three exons of 220+, 524, and 694 bp. The splice junctions are typical of those found in other organisms except that the first intron begins with GC rather than GT. The first splice junction immediately precedes the DNA encoding the cysteine-rich motif and is within one codon of the positions of the introns separating the DNA for the repeats of the ligand binding domain of the LDL receptor (Fig. 4). The second intron, unlike those of any globin, is outside the coding region and starts 21 bp downstream from the termination codon. Since no sequence was obtained 5' to the start codon, we do not know whether an additional precoding intron is present such as occurs in the gene for the mollusc Barbatia reeveana (45) and is also present in at least some of the genes in Lumbricus for chains a, b, c, and d (Footnote 4).
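The exon and intron lengths given above can be laid end to end as a consistency check. The sketch below assumes, purely for illustration, that the sequenced region starts at the first base of exon 1 and that exon 1 is exactly 220 bp (the text gives "220+", since the 5' end was not determined). The function name is hypothetical.

```python
# Segment lengths in bp, in gene order, as stated in the text.
SEGMENTS = [
    ("exon 1", 220),
    ("intron 1", 2011),
    ("exon 2", 524),
    ("intron 2", 717),
    ("exon 3", 694),
]


def segment_coordinates(segments, start=1):
    """Return (name, first base, last base) using 1-based inclusive numbering."""
    coords = []
    position = start
    for name, length in segments:
        coords.append((name, position, position + length - 1))
        position += length
    return coords


if __name__ == "__main__":
    for name, first, last in segment_coordinates(SEGMENTS):
        print(f"{name}: {first}-{last}")
```

The five lengths sum to 4,166 bp, matching the stated length of the genomic sequence, and under the stated assumption intron 2 ends at base 3472 with exon 3 beginning at base 3473, in agreement with the splice junction position reported for the 41-bp repeat.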

Four exact inverted repeats occur in intron 1 of 16, 12, 13, and 12 bp (see Fig. 2). The fourth inverted repeat occurs in the noncoding exon 3. The 41-bp sequence between positions 3402 and 3442 is directly repeated in positions 3443 to 3483 with only two differences. This repeat is remarkable because it goes through a splice junction between bases 3472 and 3473. Thus the splice junction itself is repeated (bases 3431 and 3432). The cDNA clones show that the second splice junction is utilized; we do not know whether the first (repeated) splice junction may also be used. Since exon 3 is completely in the noncoding region, differential splicing would have no effect on the protein but might conceivably affect the regulation of transcription or RNA processing.
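The coordinates of the direct repeat can be verified by simple offset arithmetic, and the "only two differences" statement corresponds to a mismatch count between the two copies. The helpers below are a hypothetical sketch; the default coordinates are the ones quoted in the text, and the toy sequence in the usage note is invented for illustration.

```python
def map_to_first_copy(position, first_start=3402, repeat_len=41):
    """Map a 1-based position in the second repeat copy onto the first copy."""
    second_start = first_start + repeat_len
    if not second_start <= position < second_start + repeat_len:
        raise ValueError("position is not inside the second repeat copy")
    return position - repeat_len


def tandem_mismatches(seq, first_start, repeat_len):
    """Count differences between two adjacent copies of a direct repeat.

    first_start is the 1-based position of the first copy in seq.
    """
    begin = first_start - 1
    first = seq[begin:begin + repeat_len]
    second = seq[begin + repeat_len:begin + 2 * repeat_len]
    if len(second) < repeat_len:
        raise ValueError("sequence too short for two full copies")
    return sum(1 for x, y in zip(first, second) if x != y)
```

For example, map_to_first_copy(3472) and map_to_first_copy(3473) return 3431 and 3432, the bases the text identifies as the repeated splice junction. On a toy string such as "GG" + "ACGTAC" + "ACGTTC" + "AA", tandem_mismatches(seq, 3, 6) reports one difference.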

Origin of Chain L1-We have shown that the cysteine-rich segments of linker chains are clearly related to the LDL receptor and unrelated to globin chains, but the evolutionary origin of the remaining sequence of the chain is uncertain. The positions of the splice junctions in chains L1 and c suggest that the cysteine-rich motif might have been inserted into a globin chain by a process of exon shuffling in which the first 42 residues of chain c would have been replaced. (Chain c is compared here only because the structure of the gene is known.) Although analysis of the alignment (Fig. 9) in terms of Doolittle's criteria (46) renders the relationship "improbable," the correspondence of the position of the first L1 intron with those between the repeats of the ligand binding domain of the LDL receptor suggests that the exon encoding the cysteine-rich motif may have been inserted into a gene by exon shuffling such as has evidently happened with the gene for the LDL receptor (17). If so, it seems possible that exon 1 of a globin chain has been replaced by another sequence that included the cysteine-rich motif. Such a hypothesis is consistent with the alignment in Fig. 9 and our suggestion that a binding site for heme may be present. We conclude that the sequenced linker chains may be related to globins but that they have diverged too greatly to permit unequivocal recognition. Since at least one linker chain binds heme (see above), the structures of chains L2 and L3 are likely to provide crucial evidence concerning the possible globin origin of linkers.

4 A precoding intron has been identified in the gene for chain b (R. A. Donahue and A. F. Riggs, unpublished results) and probably occurs also in the gene for chain c (11).

Acknowledgments-We are greatly indebted to Sissy M. Jhiang for poly(A)+ RNA and genomic DNA preparations. We thank Peggy Centilli for construction of the oligomers, Paul Krieg for use of a thermal cycler, the Protein Identification Resource for a data bank search, and Sandie Smith of the University of Texas Protein Sequencing Center for analyses.

REFERENCES

1. Suzuki, T., Jhiang, S. M., Donahue, R. A., and Riggs, A. F. (1991) FASEB J. 5, 1545 (abstr.)
2. Vinogradov, S. N. (1985) Comp. Biochem. Physiol. B 82, 1-15
3. Vinogradov, S. N., Lugo, S. D., Mainwaring, M. G., Kapp, O. H., and Crewe, A. V. (1986) Proc. Natl. Acad. Sci. U. S. A. 83, 8034-8038
4. Kapp, O. H., Mainwaring, M. G., Vinogradov, S. N., and Crewe, A. V. (1987) Proc. Natl. Acad. Sci. U. S. A. 84, 7532-7536
5. Suzuki, T., Takagi, T., and Ohta, S. (1988) Biochem. J. 255, 541-545
6. Vinogradov, S. N., Sharma, P. K., Qabar, A. N., Wall, J. S., Westrick, J. A., Simmons, J. H., and Gill, S. J. (1991) J. Biol. Chem. 266, 13091-13096
7. Fushitani, K., and Riggs, A. F. (1991) J. Biol. Chem. 266, 10275-10281
8. Suzuki, T., Takagi, T., and Gotoh, T. (1990) J. Biol. Chem. 265, 12168-12177
9. Suzuki, T., Takagi, T., and Ohta, S. (1990) J. Biol. Chem. 265, 1551-1555
10. Jhiang, S. M., Garey, J. R., and Riggs, A. F. (1988) Science 240, 334-336
11. Jhiang, S. M., and Riggs, A. F. (1989) J. Biol. Chem. 264, 19003-19008
12. Mainwaring, M. G., Lugo, S. D., Fingal, R. A., Kapp, O. H., and Vinogradov, S. N. (1986) J. Biol. Chem. 261, 10899-10908
13. Ownby, D. W., Zhu, H., Schneider, K., Beavis, R. C., Chait, B. T., and Riggs, A. F. (1993) J. Biol. Chem. 268, 13539-13547
14. Fushitani, K., and Riggs, A. F. (1988) Proc. Natl. Acad. Sci. U. S. A. 85, 9461-9463
15. Berger, R. L. (1969) in Biochemical Microcalorimetry (Brown, H. D., ed) pp. 221-234, Academic Press, New York
16. Pearson, W. R., and Lipman, D. J. (1988) Proc. Natl. Acad. Sci. U. S. A. 85, 2444-2448
17. Sudhof, T. C., Goldstein, J. L., Brown, M. S., and Russell, D. W. (1985) Science 228, 815-822
18. Mehta, K. D., Chen, W. J., Goldstein, J. L., and Brown, M. S. (1991) J. Biol. Chem. 266, 10406-10414
19. DiScipio, R. G., Gehring, M. R., Podack, E. R., Kan, C. C., Hugli, T. E., and Fey, G. H. (1984) Proc. Natl. Acad. Sci. U. S. A. 81, 7298-7302
20. Raychowdhury, R., Niles, J. L., McCluskey, R. T., and Smith, J. A. (1989) Science 244, 1163-1165
21. Feng, D. F., Johnson, M. S., and Doolittle, R. F. (1985) J. Mol. Evol. 21, 112-125
22. Swofford, D. L., and Olsen, G. J. (1990) in Molecular Systematics (Hillis, D. M., and Moritz, C., eds) pp. 411-501, Sinauer Associates, Inc., Sunderland, MA
23. Huelsenbeck, J. P., and Hillis, D. M. (1993) Systematic Biology, in press
24. Nichols, D. (1967) Symp. Zool. Soc. Lond. 20, 209-229
25. Morris, S. C. (1985) in The Origins and Relationships of Lower Invertebrates (Morris, S. C., George, J. D., Gibson, R., and Platt, H. M., eds) pp. 350-352, Oxford University Press, Oxford, United Kingdom
26. Wright, C. S., Gavilanes, F., and Peterson, D. L. (1984) Biochemistry 23, 280-287
27. Olivera, B. M., Rivier, J., Clark, C., Ramilo, C. A., Corpuz, G. P., Abogadie, F. C., Mena, E. E., Woodward, S. R., Hillyard, D. R., and Cruz, L. J. (1990) Science 249, 257-263
28. Kimball, M. R., Sato, A., Richardson, J. S., Rosen, L. S., and Low, B. W. (1979) Biochem. Biophys. Res. Commun. 88, 950-959
29. Calvete, J. J., Wang, T., Mann, K., Schafer, W., Niewiarowski, S., and Stewart, G. J. (1992) FEBS Lett. 309, 316-320
30. Massefski, W., Redfield, A. G., Hare, D. R., and Miller, C. (1990) Science 249, 521-524
31. Tu, A. T. (1973) Annu. Rev. Biochem. 42, 235-258
32. Mahley, R. W. (1988) Science 240, 622-629
33. Wilson, C., Wardell, M. R., Weisgraber, K. H., Mahley, R. W., and Agard, D. A. (1991) Science 252, 1817-1822
34. Gillet, D., Ducancel, F., Pradel, E., Leonetti, M., Menez, A., and Boulain, J.-C. (1992) Protein Eng. 5, 273-278
35. Kyte, J., and Doolittle, R. F. (1982) J. Mol. Biol. 157, 105-132
36. Fushitani, K., Matsuura, M. S. A., and Riggs, A. F. (1988) J. Biol. Chem. 263, 6502-6517
37. Shishikura, F., Snow, J. W., Gotoh, T., Vinogradov, S. N., and Walz, D. A. (1987) J. Biol. Chem. 262, 3123-3131
38. Garlick, R. L., and Riggs, A. F. (1982) J. Biol. Chem. 257, 9005-9015
39. Gibson, Q. H., Blackmore, R. S., Regan, R., Sharma, P. K., and Vinogradov, S. N. (1991) J. Biol. Chem. 266, 13097-13102
40. Shishikura, F., Mainwaring, M. G., Yurewicz, E. C., Lightbody, J. J., Walz, D. A., and Vinogradov, S. N. (1986) Biochim. Biophys. Acta 869, 314-
41. Shlom, J. M., and Vinogradov, S. N. (1973) J. Biol. Chem. 248, 7904-7912
42. Eckert, K. A., and Kunkel, T. A. (1990) Nucleic Acids Res. 18, 3739-3744
43. Tindall, K. R., and Kunkel, T. A. (1988) Biochemistry 27, 6008-6013
44. Perutz, M. F. (1990) J. Mol. Biol. 213, 203-206
45. Naito, Y., Riggs, C. K., Vandergon, T. L., and Riggs, A. F. (1991) Proc. Natl. Acad. Sci. U. S. A. 88, 6672-6676
46. Doolittle, R. F. (1981) Science 214, 149-159
