The Rat Casein Multigene Family

9
THE JOURNAL OF BIOLOGICAL CHEMISTRY 0 1985 by The American Society of Biological Chemists, Inc. Vol. 260, No. 11, Issue of June 10, pp. 7042-7050,1985 Printed in U.S.A. The Rat Casein Multigene Family FINESTRUCTUREANDEVOLUTIONOFTHE@-CASEINGENE* (Received for publication, October 19, 1984) William K. Jones, Li-yuan Yu-Lee$, Shirley M. Clift, Terry L. Brown§, and Jeffrey M. Rosenll From the Department of Cell Biology, Baylor College of Medicine, Houston, Texas 77030 Eight overlapping phage clones, spanning 34.4 kil- obase pairs of genomic DNA, containing the 7.2-kilo- base pair rat &casein genehave been isolated and characterized. The first 510 base pairs (bp) of 5’ flank- ing, 110 bp of 3‘ flanking, and all the exon/intron junctions have been sequenced. The &casein gene con- tains 9 exons ranging in size from 21 to 525 bp. We have attempted to identify potential regulatory ele- ments by searching for regions of sequence homology shared between milk protein genes which respond sim- ilarly to lactogenic hormones and by searching for previously reported hormone receptor-binding sites. Within the conserved first 200 bp of 5’ flanking se- quences 3 regions of greater than 70% homology were observed between the rat p- and y-casein genes. One of these contains a region 90% homologous to the chicken progesterone receptor-binding site. The conserved 5‘ noncoding recion, the highly conserved signal peptide, and the hydrophobic carboxyl-terminal region of the protein are each encoded by a separate exon. In con- trast the evolutionarily conserved phosphorylationsite of @-casein is formed by an RNA-splicing event. The exons which encode the phosphorylation sites of 8- casein appear to have resulted from an intragenic du- plication. Based upon the exon structure of the casein genes, an evolutionary model of intragenic and inter- genic exon duplications for this gene family is pro- posed. The p-casein gene is amember of a small gene family, containing at least five genes, responsive to both steroid and peptide hormones (1). Although these genes are each induced during lactation, eachcasein mRNA exhibits different kinet- ics of induction and extents of accumulation due to complex multihormonal regulation of transcription and mRNA stabil- ity (2, 3). In the rat @-casein mRNA accounts for ~20% of the cellular mRNA during lactation, thus encoding the pre- dominant calcium-sensitive casein in this species (4). Defini- tion of the structure of the casein genes, their gene products, and evolutionary interrelationships is a necessaryprerequisite * This paper is the third in a series. The first is “The Rat Casein Multigene Family. I. Fine Structure of the y-Casein Gene” (13). The costs of publication of thisarticle were defrayed inpart by the payment of page charges. This article must therefore be hereby marked “aduertisernent” in accordance with 18 U.S.C. Section 1734 solely to indicate this fact. $ Recipient of American Cancer Society Grant BC 425. 0 Recipient of National Institutes of Health Postdoctoral Fellow- ship GM09392. 7 Supported by National Institutes of Health Grant CA16303. TO whom reprint requests should be addressed Department of Cell Biology, Baylor College of Medicine, One Baylor Plaza, Houston, T X 77030. for elucidatingthe respective mechanisms by which both peptide and steroid hormones act. Analysis of rat, bovine and guinea pig casein cDNA sequences has demonstrated consid- erable divergence among the individual members of the casein gene family (5-8). In fact, the casein proteins are among the most rapidly diverging protein families studied (9). Only three regions of the casein mRNAs are conserved the 5‘ noncoding region, the signal peptide-coding regions, and the regions encoding the sites of phosphorylation and calcium binding (8). The casein gene family is believed to have evolved by both intragenic and intergenic duplication of a primordial gene containing a phosphorylation and calcium-binding site and a signal peptide sequence (8). This hypothesis is sup- ported by genetic data which revealed the clustering of the bovine casein genes and the localization of the mouse a-, @- and y-casein genes to chromosome 5 (10-12). We have previously reported the characterization of the large and complex 15-kb’ y-casein gene containing at least 9 exons and an unusual Goldberg-Hogness sequence (TTTAAAT (13)). We report the isolation, the fine mapping, and the sequencing of all the exons, the exonlintron junctions, and the 5‘ and 3‘ flanking regions of the @-caseingene. In addition, the analysis of an apparent primary gene transcript as well as various processed forms of this gene product are reported. From the comparison of the y-casein and @-casein 5‘ flanking regions, three sets of conservedsequences are described. Finally, a model of the evolution of this gene family, based upon the exon structure of the casein genes, is proposed. EXPERIMENTAL PROCEDURES~ RESULTS AND DISCUSSION As reported in the Miniprint Section, 8 overlapping phage clones, spanning 34.4 kb of genomic DNA, containing the @- casein gene locus have been isolated and mapped. The orien- tation of the gene was determined by hybridization of various restriction enzyme digestions to specific 5’ and 3’ cDNA probes. The comparison of restriction enzyme digestions of total rat DNA with the isolated clones showed that the latter were not rearranged. In addition to the 7.2-kb @-casein gene, these clones contain 14.6 kb of 5’ flanking DNA and 12.6 kb The abbreviations used are: kb, kilobase pair; bp, base pairh); 1 x SSC, 0.15 M NaC1, 0.015 M Na citrate; BSA, bovine serum albumin; SDS, sodium dodecyl sulfate. Portions of this paper (including “Experimental Procedures,” part of “Results,” Figs. 1-5, and Table I) are presented in miniprint at the end of this paper. Miniprint is easily read with the aid of a standard magnifying glass. Full size photocopies are available from the Journal of Biological Chemistry, 9650 Rockville Pike, Bethesda, MD 20814. Request Document No. 84M-3529, cite the authors, and include a check or money order for $4.00 per set of photocopies. Full size photocopies are also included in the microfilm edition of the Journal that is available from Waverly Press. 7042

Transcript of The Rat Casein Multigene Family

THE JOURNAL OF BIOLOGICAL CHEMISTRY 0 1985 by The American Society of Biological Chemists, Inc.

Vol. 260, No. 11, Issue of June 10, pp. 7042-7050,1985 Printed in U.S.A.

The Rat Casein Multigene Family FINE STRUCTURE AND EVOLUTION OF THE @-CASEIN GENE*

(Received for publication, October 19, 1984)

William K. Jones, Li-yuan Yu-Lee$, Shirley M. Clift, Terry L. Brown§, and Jeffrey M. Rosenll From the Department of Cell Biology, Baylor College of Medicine, Houston, Texas 77030

Eight overlapping phage clones, spanning 34.4 kil- obase pairs of genomic DNA, containing the 7.2-kilo- base pair rat &casein gene have been isolated and characterized. The first 510 base pairs (bp) of 5’ flank- ing, 110 bp of 3‘ flanking, and all the exon/intron junctions have been sequenced. The &casein gene con- tains 9 exons ranging in size from 21 to 525 bp. We have attempted to identify potential regulatory ele- ments by searching for regions of sequence homology shared between milk protein genes which respond sim- ilarly to lactogenic hormones and by searching for previously reported hormone receptor-binding sites. Within the conserved first 200 bp of 5’ flanking se- quences 3 regions of greater than 70% homology were observed between the rat p- and y-casein genes. One of these contains a region 90% homologous to the chicken progesterone receptor-binding site. The conserved 5‘ noncoding recion, the highly conserved signal peptide, and the hydrophobic carboxyl-terminal region of the protein are each encoded by a separate exon. In con- trast the evolutionarily conserved phosphorylation site of @-casein is formed by an RNA-splicing event. The exons which encode the phosphorylation sites of 8- casein appear to have resulted from an intragenic du- plication. Based upon the exon structure of the casein genes, an evolutionary model of intragenic and inter- genic exon duplications for this gene family is pro- posed.

The p-casein gene is a member of a small gene family, containing at least five genes, responsive to both steroid and peptide hormones (1). Although these genes are each induced during lactation, each casein mRNA exhibits different kinet- ics of induction and extents of accumulation due to complex multihormonal regulation of transcription and mRNA stabil- ity (2, 3). In the rat @-casein mRNA accounts for ~ 2 0 % of the cellular mRNA during lactation, thus encoding the pre- dominant calcium-sensitive casein in this species (4). Defini- tion of the structure of the casein genes, their gene products, and evolutionary interrelationships is a necessary prerequisite

* This paper is the third in a series. The first is “The Rat Casein Multigene Family. I. Fine Structure of the y-Casein Gene” (13). The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked “aduertisernent” in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

$ Recipient of American Cancer Society Grant BC 425. 0 Recipient of National Institutes of Health Postdoctoral Fellow-

ship GM09392. 7 Supported by National Institutes of Health Grant CA16303. TO

whom reprint requests should be addressed Department of Cell Biology, Baylor College of Medicine, One Baylor Plaza, Houston, T X 77030.

for elucidating the respective mechanisms by which both peptide and steroid hormones act. Analysis of rat, bovine and guinea pig casein cDNA sequences has demonstrated consid- erable divergence among the individual members of the casein gene family (5-8). In fact, the casein proteins are among the most rapidly diverging protein families studied (9). Only three regions of the casein mRNAs are conserved the 5‘ noncoding region, the signal peptide-coding regions, and the regions encoding the sites of phosphorylation and calcium binding (8). The casein gene family is believed to have evolved by both intragenic and intergenic duplication of a primordial gene containing a phosphorylation and calcium-binding site and a signal peptide sequence (8). This hypothesis is sup- ported by genetic data which revealed the clustering of the bovine casein genes and the localization of the mouse a-, @- and y-casein genes to chromosome 5 (10-12).

We have previously reported the characterization of the large and complex 15-kb’ y-casein gene containing at least 9 exons and an unusual Goldberg-Hogness sequence (TTTAAAT (13)). We report the isolation, the fine mapping, and the sequencing of all the exons, the exonlintron junctions, and the 5‘ and 3‘ flanking regions of the @-casein gene. In addition, the analysis of an apparent primary gene transcript as well as various processed forms of this gene product are reported. From the comparison of the y-casein and @-casein 5‘ flanking regions, three sets of conserved sequences are described. Finally, a model of the evolution of this gene family, based upon the exon structure of the casein genes, is proposed.

EXPERIMENTAL PROCEDURES~

RESULTS AND DISCUSSION

As reported in the Miniprint Section, 8 overlapping phage clones, spanning 34.4 kb of genomic DNA, containing the @- casein gene locus have been isolated and mapped. The orien- tation of the gene was determined by hybridization of various restriction enzyme digestions to specific 5’ and 3’ cDNA probes. The comparison of restriction enzyme digestions of total rat DNA with the isolated clones showed that the latter were not rearranged. In addition to the 7.2-kb @-casein gene, these clones contain 14.6 kb of 5’ flanking DNA and 12.6 kb

The abbreviations used are: kb, kilobase pair; bp, base pairh); 1 x SSC, 0.15 M NaC1, 0.015 M Na citrate; BSA, bovine serum albumin; SDS, sodium dodecyl sulfate.

Portions of this paper (including “Experimental Procedures,” part of “Results,” Figs. 1-5, and Table I) are presented in miniprint at the end of this paper. Miniprint is easily read with the aid of a standard magnifying glass. Full size photocopies are available from the Journal of Biological Chemistry, 9650 Rockville Pike, Bethesda, MD 20814. Request Document No. 84M-3529, cite the authors, and include a check or money order for $4.00 per set of photocopies. Full size photocopies are also included in the microfilm edition of the Journal that is available from Waverly Press.

7042

Structure and Evolution of the 6-Casein Gene 7043

FIG. 6. Sequencing strategy of the &casein gene. A map of the p-casein gene is shown (middle). Closed boxes rep- resent exons (I-ZX) and open boxes rep- resent introns (A-H) . Each arrow above represents one sequencing gel. Open cir- cles represent 3' labeling, and closed cir- cles represent 5' labeling. The region of the P-mRNA which each exon encodes is indicated at the bottom of the figure.

of 3' flanking DNA, none of which has yet been shown to overlap another functional casein gene. The small size of the exons made their preliminary localization by DNA blots par- ticularly difficult. Thus, alternative methods such as reduced stringency RNA sandwich blots and R-loop analysis had to be employed. The detection of apparently full-length 7.5-kb RNA transcripts confirmed the predicted size of the p-casein gene. These results are described in detail in the Miniprint Section.

Sequencing of the p-Casein Gene-The structure of the @- casein gene was determined by the sequencing strategy shown in Fig. 6. Six thousand base pairs were sequenced, of which 66% including all the exon/intron boundaries, the first 510 bp of 5' flanking, and 110 bp of 3' flanking sequences were sequenced in both directions (Fig. 7). Only 8 base pair differ- ences were observed in comparison to the published (3-cDNA sequences altering 1 amino acid (5). These changes may represent sequencing ambiguities or allelic differences. The 7.2-kb @-casein gene contains 9 exons separated by 8 introns and thus appears similar in organization to the y-casein gene which contains a minimum of 9 exons (the sizes of all the exons and introns are listed in Table I) (13). All the exon/ intron junctions of the coding exons are located between codons, and the exon/intron splice junctions are in agreement with the canonical sequence suggested by Mount (14).

5' Flanking Sequence Comparison-Our laboratory and oth- ers have been studying the structure of several milk protein genes which are all expressed in the mammary gland in response to the lactogenic hormones prolactin and hydrocor- tisone. From the comparison of the flanking regions of various milk protein genes, it has been possible to identify regions of shared homology which may represent regulatory regions. The y - and @-casein 5' flanking regions were initially compared since the expression of both of these evolutionarily related genes is induced by prolactin and glucocorticoids and inhib- ited by progesterone (15).

The p- and y-casein gene 5' flanking regions were first compared by dot matrix analysis using the stringent criteria that 7 of 10 nucleotides must be identical to be scored positive (Fig. 8). Three regions of sequence satisfying these criteria were observed within the conserved 5' flanking regions. In the @-casein gene the proximal region of homology lies be- tween -48 and -63, the medial between -106 and -119, and the distal between -130 and -165. In the y-casein these regions are at the same positions 2 3 bp. The proximal regions are composed of -25% G/C base pairs, and the distal regions have the highest G/C content of -40% G/C base pairs. Aside from these three regions, the two 5' flanking sequences show poor homology. Using the same criteria, no homology was observed between the first 200 bp of the p- or y-casein gene 5' flanking regions and the corresponding regions of the whey acidic protein (16) or a-lactalbumin (17) genes. These genes,

200bp

although not evolutionarily related to the caseins, display similar developmental and hormonal regulation (2).

The 0- and y-casein genes possess distinctly different Gold- berg-Hogness sequences. The Goldberg-Hogness sequence is believed to play an important role in determining the correct site of transcription initiation (18). In the p-casein gene, the TATA box (TATATAT) shows a greater homology to the

canonical Goldberg and Hogness sequence (TATA-A-) (18)

than the more divergent TATAs (TTTAAAT) of y-casein, a- casein,3 and whey acidic protein genes (13, 16). It has been reported in other genes that a T at the second position in the Goldberg-Hogness sequence, rather than an A, reduces its in vitro promoter efficiency (19). This difference in the TATA sequences may be involved in the relatively higher levels of p-casein gene expression in comparison to the other casein genes, but this remains to be established. The @-casein gene does not have an identifiable CAAT box (18) at the usual location -80 bp from the CAP site. The sequence CAAAT is found near -58 bp and within the proximal region of conser- vation observed within the casein family.

Both strands of sequenced regions of the p-casein gene have been searched for hormonally responsive elements such as the reported binding sites of the progesterone (20-22) and gluco- corticoid (23) receptors as well as a sequence common to estrogen-regulated genes (23). The hexanucleotide (TGTTCT) common to the reported glucocorticoid receptor- binding sites (23) was found five times in the p-casein gene at -510 in the 5' flanking region, at 26 within exon I, at 2484 within exon 111, at 5051 within exon VII, and 5800 within intron G. None of the surrounding sequences showed greater than 80% homology to any of the four glucocorticoid receptor- binding sites reported by Renkawitz et al. (23). No sequences identical to the nanomer common to estrogen-regulated genes were observed. No sequences showed greater than 80% ho- mology to the progesterone receptor-binding site reported by Mulvihill (20). However, one sequence between -157 and -143 (TGTCCCCCAGAATT) did have 86% homology to the sequence between -184 and -171 of the chicken ovalbumin gene (TGTTACCCAGAATT) which has been reported by Compton et al. (21) to be within the progesterone receptor- binding site defined by DNase I footprinting (21). This region has also been shown to be involved in progesterone and estrogen regulation by deletion analysis (22). Interestingly, this is within the distal conserved region of the (3- and y- casein genes.

The three conserved 5' flanking regions of the casein genes are candidates for regulatory elements. The conservation, which is greater than most of the coding regions of p- and y- casein, indicates that these regions have been selected for

T T A A

L.-Y. Yu-Lee, unpublished observation.

7044 Structure and Evolution of the 0-Casein Gene

t t t t a c a a a t t C a L a q L r t r a t c t t t t a t a

t q t a a a a c q c a q t c c c c t q t qtacrqaata -400

q t q a q c a a t c a a a a a 9 t a t q c t a c t t c a t q

t q q a q a t q a a a t c t t q q c c t c t c t a a a q c t

qq t taaaqcc c c c c a c t a q q ctqqqqaq*q

-150

-,"" -"" t t C C t t t C C t q a c a a q t t c c

qqqaaaqaaa ataqaaaqaa

c t c a c q a d c c acaaa t taqc -50

+l ATCCTCTGAG CTTCATCTTC

t d t t t C C d t t a t t t a t t t c a

c t t t c t q a a a a t a q a t a t t t

t q c c t a q a a a a a t t t q q t a c +I00

a t a t a a a q c c a q t q a c t a c t

1150

t t a t t t q a t t t q a t q t a t t t t a c t * q t q t a

a q c t q t t q q a aqaaqcaaae t q c a t q t t t a

t q q a a t t c t t qatc tagctc a t a q t a t t a c +so0

t t C t d C Z t t C f t t t t t d t t 4 4 4 t a t t t t a t . ._ CCtCta9qaC CCCCCtattC CattCCttCC

+650

t q q c q c c c t q c t q s -// 260 bp //- +700

c...cctt.. a q a t t c c t t t t c t c a c t t g t t c t a a t a c t .

t taqt tqaat q t a t t t a a a t t t t c c c * * * t ct tqqtqaat

a c c a q t c t a t c t t a q c a t t a t q q e q t t c t t a q t q c t c a q q

t q t q a a t t t t q a t q t c t c t a t e t c a t t q c c a c a t a q q t q a

t q c t q a q q m t q t q t a c t a q q c t q q a q a q a c a t t t c a q r t

t t c a c c a q c t t c t q a a t t q c t q c c t t q t t t a a t q t c c c c c

a c c a t t t t c t a a t c a t q t q a a c t t c t t q q a a t t a a q q a a c

a t g t c a t t a a q t a t q q t a t a t a t a c a q t c a c a q a q t c t q a

- k 5 0

-100

- 2 5 0

-I50

-100

TCTCTTGTCC XCGCTAAAG q t d d q a d c t t Caqacqtdaq +io

..ccc.t.ct a q a c t a t q t t a t a q a a a q a a q a q a t t t a c c

t a a a a a a c t q t q a q a t t a t t t t c a a q g a c a q q t c t t t t q a +ZOO

a a a t t a t a a t 'laccactqqc t t c a t a a a a t t q q q t a a a t q

a t a a a q q a q a q a c t a a q a a t t a t c t c c t a c a c a a q c t a t c

dtqC*CtC.a t t a t a t c a t q a a a a t a q a t a c t t a t q q c a t

t a a q c a a q a a a a t t q t a a t c a t a a a a c t t t t c t a q q q q c t +550

aagaa*t*tt agaaaccatg ta*aaaat*q aaacaaaqtc

t t a t t t a t t t a c a t t q c a i d a t q t t * i C c c t t C t t W t t t C C

c a q c c t q a t t C t d t q a q a q t q t n c c a c a c a acaccacclll:

+ l o o

+250

+400

+450

+600

aqaac t q a a a a c t a t q a q t a q a a t q aqttqaaqaa +loo3

a a q t t c t q q t t c a a a a t t q q t q t a t a a a a t q t c a a q g t c a t1050

a t t c a a a a t a acacaaaaac a c a a c c c t c t t t C C t t t C d . 3

t c t a q q a t q a t t q q a t t a t q t q q a a a c m a tqaaqqaaqc t t c q a q a c a a d q t q t t g d t a a t q a a t a a t a

caacatacac aqaaaaacaa acaaaaaacc a a a c c e a c t t c t t t t n a c t a a t q q q a t a q t t t t q c c d c c t

t n a a q t a t t q q a a a q a a t q c t t t a t a a a t a t a t t c a a a t t t t t a t t t t t t n a t q c t q t q a t a a t t n a c a a *l15O

tl2OO

r1250

*1300

qcaacaaaac a q t t a a t a c a c t t t q t c t a c t a a t t n a t c t a q a t q t q a a a t q t q a q a c t c c a t a q a a c t q

a q a t a t t a a t a t t t t g t t c a t t q t c t q q t c a c c t q t a a q t q c t c a q g a g t t t a a a a t n a a t g a c t t q t q q

+1400

+1k50

+1500 +1550 a a t q t t t q q c t a q t t a t q c a t t c t t c a t a q c t t t t g t a a q q c a t c c a t q . 2 a a c t q t t a q c a a a q q a o a q t

taaaaaaqa. *aqaatqqca q q a q c t q t t q a a c t q c t q a a c q c t q a t q a t qqCEqCaqCt qaccqac tqa

q t g t a q q q c t t * q t a a t a q q a g t a t t t * a q qasccagaqc qaccccaatt 9 a t t q a c t a a c t a i d t c t q t t

t1600

r1650 +1700

+I750 t C t d a t t C t t C t C t t C a t t C acdq CACTTG ACAGCC A X AAG GR' TlC A X C l T GCC TGC CTT GTG

NET LYS VAL PHE ILE LEU ALA CY5 LEU VAL

GCli CCT K T CTT GCA AGG G I G qtg tq taagaaqa q C t t C -// 466 bp //- a t g t a t t d A I A LEU ALA LEU ALA ARG GLU

+1800

a t a a t q t a c c t a t a a a a c t t a t t a a a a t a t c a a a q a q a t t q t a a a t a t a a q t t t t c a a q a t t a t q a a a q c r2300 +2350

c a t a t a t c a a t t t t t t t a t t c t q a a q a q a a taaaaaqaca t q c a a a t c a a q c a q a a q t a t q t t t a q t a a r2k00

I q l C n

C a q q q t t q a t a t a q t d a c a t a a t q t t t t t a d t t q t t t g c t t t t a c a q AAG W I T G C A TTC ACT GTG X C LYS ASP AL4 PHE THR V A L S E R

-<I>"

+ 2 5 0 0 TCT GAG q t a d q a q t t t t t a t t c q a q a c a a a t t t C c a a c t C t a a q a t t a c t a c a t t t t t a a t q a t c t t q a a q t SER GLU

+2550

C C t d d t t q q a t c t a t d a t t t a d c t a t a t a t d t q t d t q t a t t t t t t q t t t a caq ACT GGT AGT ATT TCC THR GLY SER I L E SER

+2600

+2650 AGT GAG q t a a q q C a d t t q t c t c t c d q a q q c a q C a a a t t q C C a t c t t t c t q t t g a q t t a t a S E I G L U

-/i 4 0 1 bp / I - c c q n q a t q t c a a t c q a q a q a a c c t t a q q a q t C t t d c t t q a a q c a t t t q t t

t a t t a t t t t r t c t t a q c a a t q c t q c a a a t q c c t c t c a q a q q a q t a t t c a t t t t t c a t q q t t c a r t t a a a

1 c t t a 7 c a t a t t a q c t a q c a a a c t c q q a q t q a t t d C t d C a q t a t a t t q t c a t t a q t c t q t a t t t t c a q c t

a t q t r t q t q t q t t t t q t t t t aqqaatqqea ttt.aaqatc t t a t q t a q a q aaaqaaccac gq t tqaagaq

q q d q t f d q c a t t c t t t q q a t a t c t c t t t q t a c a c c a t q c t e a t t t a t t t t t c t q t c t t q a c t q t q c t c a c

+llOO

+3150 r3200

+I250

+ I I O O 11150

+I400

a t t t t c a c q t a t f q t t t a t t t c t q q c a q AAA CTT CAG AAG GTG AAA CTC A K GOA C A G G V . C A G LYS LEU GLN LYS VAL LYS LEU ?ET GLY GL'I GTG CAG

+I700 *I750

TCC GAG q tdaac tq tc Ca tCaaqaa t tc -// 828 bp / / - agaacnnacq qaccaqanqc SCR GLU

t t c q a t t t c t q t q c t t a t t a t t q q q t a a a a q a t t a t t a a t t a i d t c a a q a a t q t c q a c t t t q c t a a a t t q t + 4 0 5 0

qaactaaaqq C t C t t t r q r t t c t a q GIT R C CTC CAG AAT AAA T T T CAC TCC GGC ATT CliG TCA ASP VAL LEU G W ASN LYS PHE H I S SER GLY I L E GLN SFR

+4900

GAA CCC CAG GCC ATT CCA T A T GCC CAG ACC A'PC TCT TGC ACT CCC A T T CCA CAA AAC A X GLU PRO GLN ALA I L F PRO 'IYR ALA GLN I W R I L E SER CY5 5ER PRO I L E PRO GLN AS': I L E

CAG CCT A l T GCT C M CCC CCT GTG GTG CCA ACT G T P GU: CCT ATC ATT TCT CCT <A4 CTG GLN PRO I L E ALA GLN PRO PRO VAL VAL PRO I W R ASP GLY PRO I L E I L E SER PRO GLU LEV

GAA X C l T C C T I GAA GCT G I A GCC ACT GTC C T T CCC AAG CAC MA CAG ATG CCC l T C CTT

+ 4 9 5 0

'5000

* S O 5 0

GLU SER PHE LEU LYS ALA LYS AU THR VAL LEU PRO LYS HIS LYS GW YET pm P H E LEU +5100

ARC TCT GAA ACC &'G C X CGC CTC T T T AAC TCT CAA ATC CCC ACT CCT GAT C T 7 GCT AAiiT ASN SER GLU THR VAL LEU ARG LEU PHE ASN SER G W I L E PRO SER LEU ASP 1.EIJ ALA 45'1

CTG CAC CTT CCT CAG TCT CCA GCC CAG C?C CAG G C A CAA ATP G I G CAG GCC TTT CCI: CAG L E U H I S L E U PRO GLN SER PRO ALA G L l l LEU GLN ALA GLN I L E V A L GLN ALA PHE PRO GL!I

6 1 5 0 + 5 2 0 0

LC.,<,,

ACT CCC GCG G X GTT TCT TCT CAG CCC CAG C R : TCT CAT CCT CAG TCC AAA ACT CAG T 4 C THR PRO A L A V A L V A L SER SER GLN PRO GLN LEU SER H I S PRO GLN SER LYS SER CL ' I TYR

C T T G T G CAG CAA CTI\ GCA CCC C X TTC CAA CAA GGT ATG CCT G x CAA GAC CT? '.T L E U V A L GLN GLN LEU ALA PRO LEU PHE GLH GLN CLY *ET PRO VAL GLN ASP LE t i LF" T:.~:

+5300

+5350 TAC CTA GAT m CTG CTT AAC ccc ACC CTC CAG r r c m KC ACT CAR CAR c y CA': -cr TYR LEU ASP LEU LEU LEU ASN PRO THR LEO GUI PHE LEU ALA THR GLN G1.N L E 3 qIS SFP

ACT TCT q t a a q t t t q a a t t t t t t c t c t t t t t a a t t c tCCtCaaaat q ta taqdtqq tqdcatadqa +5400 r5450

THR SER

q a q q a a a t q a q a q c t c t a c a a t q a a q q a a t a q a t a t a t c t c t c a t t a a t a a a t a q a t g a q c c t c a q q q t t

a q c t a t a a a a c a q a a t c a c t q q a q a t c t c q t t t c q a a c c n t t c t c a q g t q CCtdCtqCCq aqCtqCt t lC

a q a t a a c t q q t c t a q c q t a a a t a q c c a a t a a c a t t t e n a q aacaqaccaq t t a c a c a t q a c a c c t q a c t t

c t q a c t q q q a n a c t t q q a a t t c t q a c t t g c a q a a t a t t a a q c t a q a c t a a a c c a c a a t t . 3 taLCtqCCdQ

t C r L a t t a q a t t t t t t q q t a a c a a a t t t t t c t t t q c a q a q t t C t C l d d t t t t qc tqaqccq at*tatta=a +%ma

c a c t q c a q t t t t a t t t t q t t t t t c t t q a t t c q t t a a a t t t g t t c t t t t t c a t C g t C q t C d t q q a q t c a a a

a a q t t q c a a t q t t c a t t c q c cac'ltqccac a c a c t t t a q q c a a q t q t t q q a t t a a q a a q q q a a a a t q t a t

r5500

+ 5 5 5 0

+5600 +5650

+ 5 7 0 0

+5750

+5050

+5900

O t t a c t t a t t t c t a t a t t t q C a t t t a a q GTC TAA GIGGAT TTCCGGGTTA TPTTCTCCTC ATCATTTTTT +5950 +6000

V A L T E m

mTG qtaa q t t a t t q q a a a c t a a a q a q t a q a a a t q a t a a t t t t t a t a t t t t a t q t c t t t t t a t a t t t c

a t t c t c c t a t ataqac*ce.q caqaqta9.a a t t t c a t c a a a q t a t t c a t q q a c t a c t a t q I t t t C a C C q q +6100 +6150

t t q m t q c t t CttaCCaaCa tqacactqq. t a a a a t t t a t a t t t c a a q q a a c t t t c a t t t t a t C t q L C t q

c c a t t q q t a t a q q C F t q t C t t c c t t c c a a a a q c t t t t a t c t t t a c a c c t c a t * t q a t c a t a c t q t q a a a a

aaaaqcaqaa a t q q a q t t t t c t t t a t a t t t t a t q t q q a t a c a t c t t t t t q c t t t t q q t t a c a q q t c a a r t

t t t l l C t t t C C a a t a t a a g a a C C a t a q q C t C a c a t t t t c a t t a t q a a t t t a a a a t a q t t t t c a t t c a c q a a

+6050

+6200

+6250

+6300 r6150

+6kDO

r6450 +6650 Caaaa taq t t accaaqaa t t qqccac -// 183 bp //- a c q c q c c t t q q q q t t t q a a t q d c t t t t t c a c

16700 a g a q q t a t c a t a t c a q a t a t t t a c a t t a t q a t t t a t a q c a g t a q c a a a a t t a c a t t t a t q a a q t a a c t a t

a a a a t c a a t a a a a t t a t q a t t g q q q t t c a c cIcadcatq* aqqqaaact t q taccaaaqq qqc tacqcat +6750 +6800

t e q a a a a t q t tqeqaaCtCC t q a t t t a a c q c a a c a q a o c t q a t t c t a c d t C t t t c t t t c a caq AATTGAC

TGAAACTGGA AATGTACAAT CATTTTCATC TTGGGATCAT GCTACAAAAG ?GATATATTT CTGAATGAAA +6900 +6950

CTACATGGAA AAAAAGAAl iT TTTATTCTTT TATTTATTCT ATGTATTATA TGATATTCAT TTGAATTTGA

CCCATGGACT CTATATGTGT CTCAATTATA 'LTTCATATAA TACTACAAAT GTTCAATATT GAGTTTAAAT

GAAAGTCAAT AATGTATACA AAATAGCTCA XAAAAAW GITPTACTAT TATTTCTTTC GGAACCTATT

TCCTTTCCAG TCATTTCAAT AAAATAATCC TTTAGGCAT a t t t c d q t t g t C a t q t C t t C d t t a t d d t t t t

+6850

r7000

+7050

+7100 r7150

r7200

CCaaI IC ta tE t aaaqac t * t t cqq t t t t aa t q taqc ra r t q tqq ta t ta .3 a t t ' aCa tCa t a c a c q q q q +7250

FIG. 7. Sequence of the &casein gene. The 5' flanking, 3' flanking, and intron sequences are denoted by lower case letters. Exons are represented by upper case letters, and the amino acids encoded are noted where appropriate. The lengths of unsequenced regions are estimated from restriction mapping data.

during evolution. The resemblance of the proximal region to the CAAT sequence and the distal region to the progesterone receptor-binding site suggest possible functions for these two regions which can be tested experimentally. To accomplish this, the entire @-casein gene obtained from the genomic phage clone B13 (Fig. 1) has been inserted into a bovine papilloma virus expression vector to test the possible functional role of these and other sequence^.^

Y. David-Inouye, personal communication.

3' Flunking Sequence Comparison-The first 110 bp of the 3' flanking region of the 0-casein gene have been sequenced in both directions. As previously reported, the 0-casein gene has a canonical poly(A) sequence (AATAAA) located 16 bp 5' to the polyadenylation site which for the @-casein gene is TA (5). Four bp 3' to the A nucleotide is the sequence CAGTTG, a close derivative of the pentanucleotide sequence CAYUG reported by Berget (24). This pentanucleotide is proposed to hybridize with U4 RNA and be involved in mRNA

Structure and Evolution of the P-Casein Gene 7045

A \

\.

'\

'.\ \

a-Casein

B 1 CAC CACACATCTI

I I Ill II B I I C C I G A C A A G I I

-1q3 I CCAITAICTT A ICATGGCCI CAAICAAACG GTTIAAGAAC ICCCTAGAA-

I I I I I I II I I a C C I I C A C C A G C T l C T G A A T T G - C I G C - C I T G I T T A A T G I C C C C C - A G A A I

- 96 1 I I A G A A T I C G A A l G I - C I T T I T - A G G T A T T IC

B I I G G A A T I A A A A G G A A C T I T I G A A T A T C T T AC I I IIIII II IIII I I I I1

I A I G A I G C I A G A A C C T G G I I T A A A I A G T G C G GGAGCTACCC ACTCCT---- -4 I

I II I1 IIII I I I l l I I l l I I a GTCATI-AAG - - T A l G G I A I ATATACAGTC A C A G A G I C I G A I A G A C C A I C

FIG. 8. Panel A , dot matrix analysis of the 8- and y-casein 5' regions. The first 200 bp of each gene were compared using a dot matrix analysis. The location of the p- and y-casein gene sequences are indicated by the maps along the X and Y axis. T o be scored as positive, at least 7 out of 10 nucleotides were required to match. Panel B, comparison of the first 200 bp of 5' flanking DNA. The sequences ofthe p- and y-casein 5' flanking DNA were aligned to give maximum homology as indicated by the dot matrix analysis. The sequence is arranged 5' to 3' with the -1 nucleotide in the lower right-hand corner. The Goldberg-Hogness sequences and the three conserved regions are enclosed by boxes.

cleavage site selection. The downstream location of the pen- tanucleotide and the lack of such a sequence between the AATAAA and the cleavage site indicate that the 0-casein gene belongs to the class I1 SV40 late type of polyadenylation sites.

Casein Gene Structure-In addition to identifying potential regulatory elements it was also of interest to determine pos- sible evolutionary relationships among the casein genes. The casein proteins are under unusual evolutionary pressure, be- cause unlike enzymes they do not need to maintain the stringent geometry of an active site but are still required for the reproductive success of mammals. Thus the genes have undergone rapid divergence while still encoding a functional protein. Gilbert (25) and Blake (26) have proposed that pro- teins are composed of functional domains, and these domains are encoded by an exon or a group of exons. Thus, new

proteins could be easily assembled from previously evolved functional domains by the recruitment of the corresponding exons to form a new gene. It was intriguing to apply this model to the analysis of the casein genes. To fulfill their nutritional role the caseins must perform three functions: 1) to be secreted; 2) to form protein aggregates termed micelles; and 3) to be phosphorylated to allow Ca2+ binding and trans- port. When the exon structure of the p-casein gene was examined, the first two functions are clearly carried out by polypeptide sections encoded by individual exons.

The conserved 5' noncoding region and the signal peptide which allows casein protein secretion are encoded by exons I and 11, respectively (Figs. 6 and 7 ) . Exon I contains most of the 5' noncoding sequence of the gene. The site of transcrip- tion initiation is assumed to start at the A residue identified from primer extension experiments (5). This exon is similar in size as well as in sequence to the first exon of the y- and the rat a-casein genes.3 Exon I1 encodes the remainder of the 5' noncoding sequence, the entire signal peptide sequence, and the first two amino acids of the mature protein. The signal peptides are the most conserved region of the caseins and presumably constitute a functional domain.

Casein proteins aggregate and form micelles by the inter- action of their carboxyl-terminal hydrophobic domains (27). Micelle formation allows milk to contain higher concentra- tions of protein and calcium phosphate complexes than would be possible for individual protein molecules. The hydrophobic domains do not need to maintain a constrained geometry since they do not appear to close pack. This model of a loose casein micelle structure is supported by the absence of x-ray diffraction spacing indicating that the caseins do not form an ordered structure (28). The reported density of micelles formed from equal weights of a- and p-bovine caseins is -0.3- 0.4 g/cc (27) in contrast to the significantly higher density of -1.4 g/cc for a close packed globular protein (29), which also supports the loose packing model. This loose packing allows caseins to accommodate more changes in their amino acid composition than can a typical protein. Thus it is not sur- prising that the coding regions of the casein genes appear to have diverged rapidly. The @-casein hydrophobic domain, except for the terminal amino acid, is encoded by exon VII. This region is composed of 44% hydrophobic residues as compared with the more hydrophilic amino-terminal region of the protein which contains 26% hydrophobic residues, Thus, the second casein function, i.e. micelle formation, is carried out by a region of the protein encoded by a single exon. However, since the exon(s) which encode the hydropho- bic domains of the other rat casein genes have not been

Phosphorylation Site

Ser Ser Glu . Glu Ser

p-casein 1 TCC AGT GAG 1 1 ~ G A A TCC \ Exon IV 7 Exon V

Consensus for

iplice Junction - . . - . . . . . . . - . . Sequence Exon Intron Exon

FIG. 9. Structure of the &casein major phosphorylation site. The exon structure of the p-casein major phosphorylation site is shown above. Exon IV encodes the Ser-Ser-Glu residues, and exon V the Glu residue. The relationship of the conserved GAG and GAA glutamate codons to the consensus exon/intron splice junction is illustrated below.

7046 Structure and Evolution of the P-Casein Gene

sequenced, it is not known if they also have a similar exon structure.

The relationship of the exons encoding the phosphorylation sites to the structure of this conserved functional domain is more complex. The sites of phosphorylation and calcium binding are conserved among the casein genes at both the amino acid and nucleic acid levels (8). Phosphorylation sites are typically found in the amino-terminal region of the casein proteins. The casein kinase phosphorylates a serine two resi- dues to the amino-terminal side of the sequence, Ser-X-Y, where Y is either a phosphorylated serine or an acidic residue, usually glutamic acid (30). Minor phosphorylation sites con- tain only one glutamic acid residue, Ser-X-Glu, while the more common major phosphorylation sites have two glutamic acid residues, Ser-Ser-Glu-Glu. The coding region for this series of residues, one of the most highly conserved portions of the casein mRNA, is as follows (8).

A.

B. ~ C ~ ~ ~ l ~ 6 ~ 6 ~ 6 ~ ~ ~ C ~ ~ ~ S ~ 6 ~ 1 ~ 6 ~ C ~ ~ ~ C ~ 1 ~ 6 ~ ~ ~ 6 ~ pvl

0 ( C ~ S ~ l ~ l ~ 6 ~ ~ i ~ ~ C ~ ~ [ l [ ~ ~ l ~ C ~ - ~ ~ ~ ~ ~ l i 6 ~ ~ i 6 pv

6 ~ 1 ~ ~ ~ 6 ~ 1 ~ ~ ~ 1 ~ 1 ~ - ~ - ~ - ~ 1 ~ C ~ C ~ ~ ~ 6 ~ 1 ~ C ~ ~ ~ 6

( C ~ ~ ~ l ~ l [ C ~ r ~ t [ l ~ 6 ~ r [ s ~ l ~ c ~ c ~ l ~ c ~ l ~ 6 ~ ~ ~ 6 plllMinor@ PlVMajo@

p Exona

FIG. 10. Conservation of casein phosphorylation sites. The frequency of each base a t a given position within the major phospho- rylation site of the rat a- and y-caseins is shown in panel A. The conserved GAG at right corresponds to the amino-terminal glutamate codon of the major phosphorylation site. The bottom row of panel A records the most frequently used nucleotide at that position. In panel B, the nucleotide sequences of the 3’ ends of exons 111-VI of rat p- casein are listed. When a particular nucleotide agrees with the most frequently used nucleotide from panel A, it is enclosed by a bold box.

FIG. 11. Model of the evolution of the casein gene family. For discussion of the model, see text. For estimates of the rates of divergence, see Hobbs and Rosen (8). E, glutamic acid; S, serine.

TCN-AGN-GAG-GAA S e r - S e r - G l u - G l u

The conserved major phosphorylation site of /I-casein is not encoded by a single exon but rather formed by an RNA- splicing event (Fig. 9). The conserved sequence of the major phosphorylation site is split with the Ser-Ser-Glu residues, the equivalent of a minor phosphorylation site, encoded by the 3‘ region of exon IV and the final glutamic acid residue encoded by the 5’ end of exon V.

How do these exons relate to the exons encoding functional domains postulated by Gilbert (25) and Blake (26)? It appears that in the caseins the “functional phosphorylation domain,” corresponding to an exon, is a minor phosphorylation site. These minor sites are converted to major sites by a splicing event which juxtaposes a glutamate codon with the minor phosphorylation site to form a major phosphorylation site.

Interestingly, those exons which have junctions between codons, such as the caseins, and which conform to the con- sensus sequences should frequently code for Glu-Glu residues at their exon/intron junctions (14). The 3‘ exon sequence C A (-AG I GT-AGT) could only code for glutamine, lysine, and A G

glutamate while the codon starting with the 5‘ exon sequence (AG I G) would on average encode glutamate one-fourth of the time. Thus, it appears that Glu-Glu residues, character- istic of the major phosphorylation site, would be commonly encoded by exons which meet the above criteria.

The positioning of introns within the phosphorylation cod- ing region may explain the complete conservation of the glutamate codons of the phosphorylation site. The amino- terminal glutamate codon is always GAG while that of the carboxyl-terminal is GAA. The position of these codons at the 3‘ and 5‘ ends of exons, respectively, means that they form part of the exon/intron consensus sequence reported by Mount (14) (Fig. 9). Thus the codons are under selective pressure not only to maintain the phosphorylation site but also to maintain correct splicing.

The split architecture of the conserved phosphorylation site

I 17 Rat, Mouse Divergence

Insertion 6 Duplication Of 18bp Repeat

75-Mammalian Radiation

lntergenic Duplication 6 Mutation

> C 0 .- - - .- I (Sn)E H (Sn)E H Hydrophobic Intragenic Duplication

200-400-Primitive Mammals

1 Hydrophobic +(Secreted ea++ Binder) Primordial Casein

I 5’ Noncoding {-iGCS+ Phosphory la t io~ Exon

Recruited Exons 1 Signal Peptide Hydrophobic Hydrophobic Exon

Structure and Evolution of the 0-Casein Gene 7047

may reveal a relationship between the exon structure of the @-casein gene and the structure of the @-casein protein. Craik (31) has hypothesized that exon/intron junctions often encode peptides found at the surface of the native protein. If this were true for the @-casein protein, the phosphorylation sites would exist at its surface. A surface location for the phospho- rylation site would allow the polar residues, serines and glu- tamic acids, which form the phosphorylation site to be at the protein surface. Also, a surface location would seem advan- tageous since it would be more accessible to the Golgi casein kinase activity as well as for Ca2+ binding.

Exons IV and V, as well as exons I11 and VI, appear to have originated by an exon duplication event. The four exons are quite similar in sequence, especially at their 3‘ ends, both to each other and to the phosphorylation sites of rat a- and y- casein (Fig. 10). It was not surprising that exons I11 and IV of the @-casein gene were similar to the conserved phospho- rylation sites since they each encode a phosphorylation site, but the similarities of the 3’ regions of exons V and VI were surprising since they do not encode a functional phosphoryla- tion site. We hypothesize that this homology is the result of an earlier duplication which produced the 4 present-day ex- ons.

A similar duplication seems to have occurred in the bovine cu,,-casein gene (7). Two 24-bp repeats were observed in the cDNA sequence, each of which encode a minor phosphoryla- tion site at its 3’ end. If they correspond to exons, the major phosphorylation site of the bovine a.,-casein would also be formed by a splicing event identical to that in the rat @-casein. Six other 24-nucleotide blocks with 54% homology to the 24- bp repeat noted above were also found, one of which also encodes a minor phosphorylation site a t its 3‘ end. When this latter 24-bp block is compared to the preferred sequences (Fig. 10A), 14 out of the possible 16 nucleotides are identical. If the bovine cual-casein gene structure is like the rat @-casein, it is predicted that these blocks will each be encoded by separate exons.

In addition to the sequence homology, the small sizes and the grouping of the @-casein exons 111-VI, i.e. two sets of small exons separated by a small intron (I11 and IV, V and VI), suggest that they are the result of two duplication events. One small exon may have been duplicated to give two small exons separated by a small intron. Later, this pair of exons reduplicated to give the two pairs of exons observed today.

The 3‘ noncoding region of the @-casein gene, in contrast to the y-casein gene, is divided into two exons, VI11 and IX. Exon VI11 encodes the carboxyl-terminal amino acid and the termination codon. However, the rest of the 3’ noncoding region, including the polyadenylation signal AATAAA, is contained in exon IX. In contrast, the final exon of y-casein contains the entire 3’ region of the gene starting with the carboxyl-terminal amino acid.3

Casein Gene Evolution-A hypothetical model of the evo- lution of the calcium-sensitive casein gene family is proposed in Fig. 11. The casein gene is composed of several distinct regions each encoded by its own exon or exons. We propose that these exons are the result of exon recruitment to form a primitive casein gene followed by intra- and intergenic dupli- cations. Based upon the rates of divergence of the casein signal peptides (8) a calcium-sensitive casein-like gene ap- pears to have originated 300 million years ago at the time of the appearance of primitive mammals. This casein-like gene may have been the result of exon recruitment bringing to- gether the basic elements of the modern caseins: a 5‘ noncod- ing region exon, a signal peptide exon, an exon with a minor phosphorylation site at its 3’ end, and a hydrophobic domain exon.

Between 300 million years ago and the mammalian radia- tion, 75 million years ago, two types of duplications appear to have occurred. The phosphorylation exon was duplicated in- tragenically to create the several small exons observed in the @-casein gene. These exons encode several minor phospho- rylation sites which may have been converted to a major phosphorylation site by mutating the first codon of the down- stream exon to a glutamate codon. Intergenic duplications also occurred creating the individual members of the casein gene family. I t is unlikely that K-casein, a noncalcium-sensi- tive casein with a distinct signal peptide sequence, is a product of these intragenic duplications (7). Finally, before the diver- gence of the rat and the mouse 17 million years ago, the a- casein gene underwent an insertion of 9% copies of an 18-bp repeat which is flanked by a direct repeat (CCAA) (8). This accounts for the smaller size of the other mammalian a- caseins in comparison to the rat and mouse (7).

This model explains the strong homologies between the 5’ noncoding regions and the signal peptides of the casein genes. I t also accounts for the duplication of the phosphorylation sites even though they are split by an intron. The validity of this model should be tested by determining the exon structure of other casein genes from both the rat and other mammalian species.

Acknowledgments-We thank Dr. Miles Mace for his assistance in R-loop analysis, Sara Rupp for technical assistance, Drs. Tom Sar- gent, Linda Jagodzinski, and James Bonner for providing the rat DNA libraries, Dr. C. Lawrence for his assistance in computer anal- ysis, and Patricia Kettlewell for preparation of this manuscript.

1. 2.

3.

4.

5.

6.

7.

8.

9.

10.

11.

12.

13.

14. 15.

16.

17. 18.

19.

20.

21.

22.

REFERENCES

Topper, Y. S. (1970) Recent Prog. Horm. Res. 26, 286-308 Hobbs, A. A., Richards, D. A., Kessler, D. J., and Rosen, J . M.

(1982) J. Biol. Chem. 257, 3598-3605 Guyette, W. A., Matusik, R. J., and Rosen, J . M. (1979) Cell 17,

1013-1023 Rosen, J . M., Woo, S. L. C., and Comstock, J . P. (1975) Biochem-

istry 14,2895-2903 Blackburn, D. E., Hobbs, A. A., and Rosen, J. M. (1982) Nucleic

Acids Res. 10,2295-2307 Hall, L., Laird, J. E., Pascall, J. C., and Craig, R. K. (1984) Eur.

J. Biochem. 396, 1-5 Stewart, A. F., Willis, I. M., and Mackinlay, A. G . (1984) Nucleic

Acids Res. 12,3895-3907 Hobbs, A. A., and Rosen, J . M. (1982) Nucleic Acids Res. 10,

8079-8098 Dayhoff, M. 0. (1976) in Atlas of Protein Sequence and Structure

(Dayhoff, M. O., ed) Vol. 5, Suppl. 2, pp. 261-262, National Biomedical Research Foundation, Bethesda, MD

Matyukov, V. S., and Urnysher, A. P. (1980) Genetika 16, 884- 886

Grosclaud, F., Mercier, J.-C., and Ribadeau-Dumas, B. (1973) Neth. Milk Dairy J. 27, 328-340

Gupta, P., Rosen, J. M., D’Eustachio, P., and Ruddle, F. H. (1982) J. Cell Biol. 93, 199-204

Yu-Lee, L., and Rosen, J. M. (1983) J. Biol. Chem. 258, 10794- 10804

Mount, S. M. (1982) Nucleic Acids Res. 10,459-472 Rosen, J. M., O’Neal, D. L., McHugh, J. E., and Comstock, J. P.

(1978) Biochemistry 17, 290-297 Campbell, S . M., Rosen, J. M., Hennighausen, L. G., Strech-Jurk,

U., Sippel, A. E. (1984) Nucleic Acids Res. 12, 8685-8697 Qasba, P. K., and Safaya, S. K. (1984) Nature 308, 377-380 Breathnach, R., and Chambon, P. (1981) Annu. Rev. Biochem.

Concino, M., Goldman, R. A,, Caruthers, M. H., and Weinmann,

Mulvihill, E. R., LePennec, J.-P., and Chambon, P. (1982) Cell

Compton, J. G., Schrader, W. T., and O’Malley, B. W. (1983)

Dean, D. C., Gope, R., Knoll, B. J., Riser, M. E., and O’Malley,

50,349-383

R. (1983) J. Biol. Chem. 258, 8493-8496

28,621-632

Proc. Natl. Acad. Sci. U. S. A. 80, 16-20

7048 Structure and Evolution of the ,&Casein Gene

B. W. (1984) J. Biol. Chem. 259, 9967-9970 36. Smith, G. E., and Summers, M. D. (1980) Anal. Biochem. 109, 23. Renkawitz, R., Schiitz, G., Dietmar, V. D. A., and Beato, M.

(1984) Cell 37, 503-510 37. Johnson, M. L., Levy, J., Supowit, S. C., Yu-Lee, L., and Rosen, 24. Berget, S. M. (1984) Nature 309, 179-182 J. M. (1983) J. Biol. Chem. 258,10805-10811 25. Gilbert, W. (1978) Nature 271, 501 38. Prentki, P., Karch, F., Iida, S., and Meyer, J. (1981) Gene (Amst.) 26. Blake, C. C. F. (1978) Nature 273,267 14,289-299 27. Waugh, D. F., Creamer, L. K., Slattery, C. W., and Dresdner, G. 39. Vieira, J., and Messing, J. (1982) Gene (Amst.) 19, 259-268

28. Tuckey, S., Roche, H., and Clark, G. L. (1938) J. Dairy Sci. 21, 65, 75-85

29. Metzler, D. E. (1977) in Biochemistry. The Chemical Reactions of 197

30. Mercier, J.-C. (1981) Biochimie 63, 1-17 1523 31. Craik, C. S., Rutter, W. J., and Fletterick, R. (1983) Science 220, 43. Rosen J. M. (1976) Biochemistry 15, 5263-5271

1125-1129 44. Messing, J . (1983) Methods Enzymol. 101, 20-78 32. Maniatis, T., Fritsch, E. F., and Sambrook, J. (1982) Molecular 45. Maxam, A. M., and Gilbert, W. (1977) Proc. Natl. Acad. Sci. U.

Cloning, A Laboratory Manual, p. 104, Cold Spring Harbor S. A. 74,560-564 Laboratory, Cold Spring Harbor, NY 46. Maizel, J. V., Jr., and Lenk, R. P. (1981) Proc. Natl. Acad. Sci.

33. Sargent, T. D., Wu, J.-R., Sala-Trepat, J . M., Wallace, R. B., U. S. A. 78, 7655-7669 Reyes, A. A., and Bonner, J. (1979) Proc. Natl. Acad. Sci. U. S. 47. Schibler, U., and Weber, R. (1974) Anal. Biochem. 5 8 , 225-230

123-129

W. (1970) Biochemistry 9, 786-795 40. Chaconas, G., and van de Sande, J. H. (1980) Methods Enzymol.

767-776 41. Holmes, D., and Quigley, M. (1981) Anal. Biochem. 114, 193-

Living Cells, 1st Ed., p. 75, Academic Press, New York 42. Birnboim, H. C., and Doly, J. (1979) Nucleic Acids Res. 7, 1513-

A. 76,3256-3260 48. Chirgwin, J., Przybyla, A. E., MacDonald, R. J., and Rutter, W. 34. Blather, F. R., Williams, B. G., Bleckl, A. E., Denniston-Thomp- J. (1979) Biochemistry 18,5294-5298

son, K., Farber, H. E., Furlong, L.-A., Grunwald, D. J., Kiefer, 49. Bailey, J. M., and Davidson, N. (1976) Anal. Biochem. 70,75-85 D. O., Moore, D. D., Schumm, J. W., Sheldon, E. L., and 50. Shinnick, T. M., Lund, E., Smithies, O., and Blather, F. R. Smithies, 0. (1977) Science 196, 161-169 (1975) Nucleic Acids Res. 2, 1911-1929

35. Southern, E. M. (1975) J. Mol. Biol. 98, 503-517

Structure and Evolution of the P-Casein Gene 7049

Supplementary Yaterial LO

The Rat Casein Xultigene Family. Fine SLruct~ce and Evolution of the 8-CaSein Gene

William K. Jones, Li-yuan Yu-Lee. Shirley M . Clift Terry L. Brown b Jeffrey Y . Rose"

E X P E R I M W U P R O C E W R e S

Materials - All restriction enzymes were purchased from Bethesda Research LaboratOr-LeS New England Bkolabs, or Boehringer Nannhelm. and used elther in the buffers 'cecommended by the suppliers or the three salt buffers TeCOm- mended by Yaniat15 dl. 1321. DNA polymerase I was from BlOlabS. DNA polymerase 1 Klenow fragment was. from Bethesda Research Laboratories Or noehrlnger Yannheim. T4 DNA ligase. bacterial alkaline phosphatase and TI polynvcleot~de kinase were purchased from nethesda Research Laboratories. Calf intestinal phosphatase was from Boehringer Yannheim. Elutip-D COlYmnS were purchased from Schleichec and Schuell. Type XAR-5 x-ray filll Was purchased from Eastman Kodak.

G e l Transfer and Bybridiration - Cloned DNA fragments were transferred to nitrocellulose filters either by the method of Southern (35) or bidirec- tionally by the method of Smith and Summers (361. Filters were then treated and hybridized as previously described 1131. Genomic DNA was prepared from mammary glands and DNA blots performed as described 137) .

Plasid subcloning and Plamid W A Preparation - DNA fragments from phage clones were subcloned into either pBR325 I381 phage DNA was digested wzth e RI and treated with either bacterial alkaline or pUC8 139) plasmids. Total

phosphatase (401 or calf intestinal phosphatase 132). Ligations and trann-

performed by either the method of Holmes and Quigley (411 or by the alkaline formations were done as described previously 1131. Plasmid mini-lysates Were

lysis method Of Birnboim and Doly (421. Plasmid DNA was prepared as described previously 1 1 3 ) .

R - L m Analysis - Phage clones 110 ug/ml) containing &-casein DNA Were hybrlsized to a @-casein mRNA-enriched Sepharose 48 column fraction 130 ug/m11 (431 in 7 0 0 formamide. 0 . 3 H N K l , 10 mJ4 TriS-HC1 IpH 7.41, 10 mY NazEDTA. R-loop analysis was performed as described previously 113).

The sequencing of end-labeled fragments was done as described by naxam and m Sequencing - Dideoxy sequencing was done as described by Messing ( 4 4 1 .

Gilbert ( 4 5 1 with the following modification. End-labeled DNA fragments were isolated from low melt agarose gel5 using ElUtip-D columns. The 8eCt10nS Of the gel containing the fragment were excised and melted at 68-C, diluted in 0.2 M NaC1. 20 mU Tris-IICI, pH 7.4. 1.0 nM Na EDTA and the fragments isolated as recommended by the supplier. DNA %ragdents were routinely precipitated in 2.5 M NH acetate with two Y 0 1 ~ m e S of ethanol. Computer analysis of the sequence 'data was done with the HELEX Sequence Information System ( 4 6 1 .

RNA Sandwich Blots - The Subclones of interest were digested with the appro- priate enzymes and electrophoresed on agarose gels. After bidirectional

in a solution of 50% fornamide, 0.1% SDS, 0.048 S A . 0.048 polyvinylpyrroli- transfers the filters were baked for 4 hr. The filters were marked and placed

done 0 0 4 1 Ficoll. 0.6 M NaC1. 5 tel Na EDTA 50 tel TriS-RC1. pR 7.4 and 300 ug/mi s'heared Sal& sperm DNA Ihybriaization solutlonl. After 4 hr of hybridization at 37.C. 5 ug/m1 of lactating mammary gland polytA)' RNA ( 4 3 ) was added to one of the filters and hybridization was allowed to continue for another 12 hr. The filters were washed for 10 min at 68-C i n 2 x ssc, 0.10 SDS and then placed in a seal-a-bag containing the hybridization solution and nick translated cloned CDNA probes (2.5 x 106 cpWml1. The filters were hybridized at 37.C for 12 hr, and washed first with 2 X SSC and 0.1, SDS at rom temperature for 30 min, and then at 68.C for 2 hr. The filters were air dried and autoeadiographed.

m A Iulalysir by In SItu IWbrIdiratIon - Nuelel were isolated at -2O.C in 501 glycero , 50 1111 Trig-HC , pH 7.5. 5 tel Na EGTA 25 nU KC1, 0.15 tel apemidine and 0 . 2 mJ4 spermine I:,). The nucIear*RNA 'was isolated by the guanidine thiocvanate/Cscl qradient method ( 4 8 1 . Nuclear RNA was fractionated on a denatiring nethyl~nercury hydroxide agarose gel I491 and the 8-casein nni casein DNA probes 1501. transcripts *ere identified by & hybridization with nick-translated

msmTs

Identification of A Clones

~

n o independent rat genomic M A libraries wre screened, initially with a 8-casein =DNA clone and Subsequently with $-casein genomic subclone*. .?, total of 8 Overlapping )i clones encompassing 34.4 Kb of rat M A were isolated

positions of the E R t , 5 HI and N I Sites ace shorn (Fig. lA). (Fig. 11. These clones were napped with several restriction enzymes and the

Fig. 1. Panel A. Map of the Rat ~-C:asein &ne US. overlapping ~ - ~ ~ ~ ~ i ~ Specific phage Clones are shorn in the 5' to 3 ' orientation. Solid hoxes represent exons. Open boxes represent introns, and vertical lines indi-

ECO RI subclones are indicated above the line by the numbers which care restriction sites. RI. e HI. 3 I. The names of the various

correspond to their sire, in Xb, in order of appearance, a , b, c , etc. The names of the various phage clones are indicated beside each clone.

T~ confirm that no rearrangements or gross deletions had occurred during the construction and isolation Of the phage Clones, a 8-CDNA clone or

digested genOnic DNA (Fig. 1s). When the 1.9 ECO RI subclone, located in the Subcloned ECO RI genomic fragments Yere used as probes against restriction

middle of the gene, was used as a probe, it7bridized to Only a 14.5 Kb B~~ HI fragment which lies between a Site within the 2.0 E Rl fragment and a site in the 1.75b ECO RI fragment. This same = H I fragment is detected by the 2.8 ECO RI s u m o n e , containing primarlly 5 ' flanking DNA. The 2.8

X F h l n the 1.9 Eco RI fragment and t e 3 3a E RI fragment. This same 10.2 ECO RI sub== detects a 10.2 Kb

M+ I .fragment created by the 3 I Sites

Rb = I f r a g m c is detected by 8-CIXqAr but the hybridization Signal as expected is less intense. The 0-cDNA also detects a 5.5 Kb %.PI fragment and a strongly hybridizing 1.2 Kb 9 I fragment. The latter fragment lies between the YsD I Sites of the 1.9 and 2.0 RI fragments. These results

rearrangements have occurred. The absence Of additional bands in the genomic indicated t h c t h e map Of the genomic clones is Correct and no detectable

DNA blots and the isolation of only one set of OveClaPPing phage Clones from

exists in the rat genome. t w independent DNA libraries indicates that only one authentic 8-casein gene

-

23.5- Kb A B

9.7-

6.6-

4.3-

2.2- 2.1-

1.4- 1.1- 0.9-

1.9 2.0

C

I

2.8

D Kb

I "14.5

-10.2

-5.5

1.2

cDNA

Genomic ORA Blots. Rat Figure 1. Panel 8.

genomic DNA I15 uq/lanel was digested with HI. lanes A and 5 , or x I , lane C and D. The come- sponding DNA blots Were hybridized LO the , " d l - cated IzP-labeled P - casein e R 1 sUbClone5. lanes A, 8, and C: or 823, a @-casein cnNA clone, lane D. *olec"lar weights are designated by the arrows.

Cb.r.eteri..tion Of the -is Clooes

the genomic clones by DNA blots of the A clones using 5' and 3 ' 8-casein CDNA The @-casein gene was oriented and exons initially positioned within

probes isolated from the 823 CDNA clone [Fig. 21.

2 '123 4 5 6 7 8 Kb P2

Kb (1 I. I:

3.34

5' 3' 5' 3'

- $"3" PSt

I323 - 200bp

rig. 2. Ocientatim of the I Clones and Uralization of KXonS. Genomlc phage clone 812, lanes 1-4, and 82, lanes 5-8, were digested with EEO R I , lanes 1, 3, 5 and 7 and ECO RI/Bam AI! lanes 2 . 4 . 6, 8. The E t s of lanes 1, 2, 5 , a i 6 WeXthenTybrldized with the 5' p5c I fragment of 823, shown below, which had been 32P-labeled. The DNA blots of lanes 3, 4, 7, and 8 were similarly hybridized with a IZP-labeled 3 ' est I fragment Of 823. Molecular wights are indicated by the arrows.

-.w- - 1 1 - -.2?-

1 2 3 4

Fig.

prehybridized in tI lanes 3 and 4 . pane" .. . .. . lanes were then hybridized %

7050 Structure and Evolution of the @-Casein Gene

were detected in both the presence and absence of mRNA are indicated by indicated by the arrows. On the nap below, restriction fragments which

the solid line. while restriction fragments which were detected in only the presence of nRNA art indicated by the dashed line. Solid boxes indicate the pDSitionS of exons 11-VI.

In panel A. three RSa I restriction fragment., 1 . 4 . 0 . 8 7 . and 0.14 Kb feag- ments. of the 3 . 3 b e R I subclone hybridized to the 823 =DNA probe. The 0 . 8 1 Kb and 0.74 Kb fragment, which hybridized in both the presence and

contamed within the E23 =DNA clone. The 1.4 Kb fragment, which hybridized absence Of mRNA, must each contain at least one exon whose Sequence is

to the probe only in the presence Of mRNA. nust contain One Or more exons whose sequence does not overlap the 023 EDNA clone sufficiently to allow hybridization. This exon was further localized to the 0.415 Rb fragment between the Xba I Site and a Hind I11 Site by the absence of hybridization to

0.635 Rb Hindiii anTthe 0 . 4 1 5 Kb Hind III/Xba I fragments (panel El. An R- the 0.63 K b T s t I/Eco RI f r z e n t (panel A ) and the hybridization to the

loop expeeiment confirmed the locatTn of t G exons and located additional exons.

R - C m p Analysis

The genomic E12 phage clone and 8-casein mRNA were annealed under R- loop conditions and analyzed by electron microscopy as Shorn in Fig. 4. From the known lengths of exons VI1 and intron F (see below) a scaler was deter- mined which allowed the conversion Of the observed lengths of the m t r m

the locations Of the nost 5' and 3 ' exon6 within the 3.3b Eco R I fragment, loops to Rb. By comparing the intron loop sire and the 0-casein gene map,

which were mapped by the RNA sandwich blot technique d e s c r m above, were confirmed. TWO small exons, I l l and IV, located in the middle of the 3.3b fragment failed to anneal under the R-loop conditions probably due to their small sire [see below). In additional exon was found to lie 1.7 Rb 5 ' of

Since restriction mapping data had located the 3 ' nost exon T t h e middle of exon I1 Or approximately 500 bp from the 3' end Of the 2.8 EcO RI fragment.

R-caseinyne was estimated to be approximately 1.2 Kb plus 250 bp Of p 1 y l ~ I the 2.0 ECO RI fragment (data not shown1 the primary tCanSCCipt length of the

for a total length Of 1 1 . 5 Rb.

A R

Pig. 1. R-lloop Analysis Of 8-12 Clone. 812 DNA was denatured and hybridized to Sepharose I B fraction enriched @-casein nRNA as described

used to calibrate the electron micrograph in panel A and to predict the in .Experimental Procedures.. The sizes of exon VI1 and intron F were

psltions Of exons I and I1 as shown in panel n. FSIA Blots

RNA blots using lactating poly(Al+ nuclear RNA were performed in order to confirm the predicted primary transcript sire. TO avoid RNA degradation, the nuclei were isolated at -20.C in the presence of glycerol 146) and the RNR was isolated by a guanidine thiocyanate-CsC1 prOtOco1 ( 4 7 ) . TO facili- tate detection of large R N A transcripts, hybridization was performed & situ to qels without transfer to a filter. Using both the 1.9 Kb and the 3 . 3 m

Kb, identical in sire to the estimated primary transcript sire Of 7.5 Kb ECO RI Subclones as probes (Fig. 51 an apparent full length transcript of

vas detected.

Kb

7 . 5 - q y

5.7- . 1

5.2-

of other lower w l e c u l a r ~ e l g h t gene products were detected (Fig. 51. Y O S ~ In addition to an apparent primary 8-casein gene transcript a numher

of these were detected by both the 3.3b and 1.9 ECO R I subclone probes, suggesting that the two subclones are w r t of the saneprimary transcriptmn unit. However, one gene product. the 2.6 Kb band, was only detected by the 3.3b probe. R11 true precursors must retain a11 the exons of the gene. Since

VII, it cannot be a true 8-casein pre~ut-10~. This gene product appears to the 2.6 Kb gene product was not detected by the 1.9 probe containing exon

contain a POlylA) tail as it was enriched on a dT-cellulose column. ut l east two possible explanations can be suggested. The first is that the 3.3b probe is detecting the PrOdUCts of another gene due to repeats within itself. The 3.3b probe is known to contain repeats [data not shown). A Second pssibiliry is that the 2.6 Kb band is a dead end processing product Of the 0 casein gene transcripts. h'hatever its origin, this 2.6 Kb RNA molecule is present in relatively high concentrations in lactating tissue.

- K m A

I

11

111

I V

V

VI

V I 1

V I 1 1

IX

- sin

40

63

27

2 1

24

62

521

48

326

.IVU msmm

1-40

41-103

106-130

131-1s1

152-17s

176-217

218-743

744-791

792-1117

SIZE

1.7 Kb

0.68 Kb

0.12 Kb

0.97 Kb

0.09 Kb

1.1 Kb

0.18 Kb

0.86 Kb

-