Molecular Evolution of Type I Collagen (COL1a1) and Its … · 2011. 8. 12. · Skeletal diseases...

Molecular Evolution of Type I Collagen (COL1a1) and

Its Relationship to Human Skeletal Diseases

by

Daryn Amanda Stover

A Dissertation Presented in Partial Fulfillment of the Requirements for the Degree

Doctor of Philosophy

Approved November 2010 by the Graduate Supervisory Committee:

Brian C. Verrelli, Chair

Thomas E. Dowling

Michael S. Rosenberg

Anne C. Stone

Gary T. Schwartz

ARIZONA STATE UNIVERSITY

December 2010

ABSTRACT

Skeletal diseases related to reduced bone strength, like osteoporosis, vary in

frequency and severity among human populations due in part to underlying

genetic differentiation. With >600 disease-associated mutations (DAMs),

COL1a1, which encodes the primary subunit of type I collagen, the main

structural protein in bone, is most commonly associated with this phenotypic

variation. Although numerous studies have explored genotype-phenotype

relationships with COL1a1, surprisingly, no study has undertaken an evolutionary

approach to determine how changes in constraint over time can be modeled to

help predict bone-related disease factors.

Here, molecular population and comparative species genetic analyses were

conducted to characterize the evolutionary history of COL1a1. First, nucleotide

and protein sequences of COL1a1 in 14 taxa representing ~450 million years of

vertebrate evolution were used to investigate constraint across gene regions.

Protein residues of historically high conservation are significantly correlated with

disease severity today, providing a highly accurate model for disease prediction,

yet interestingly, intron composition also exhibits high conservation suggesting

strong historical purifying selection. Second, a human population genetic analysis

of 192 COL1a1 nucleotide sequences representing 10 ethnically and

geographically diverse samples was conducted. This random sample of the

population shows surprisingly high numbers of amino acid polymorphisms (albeit

rare in frequency), suggesting that not all protein variants today are highly

deleterious. Further, an unusual haplotype structure was identified across

populations, but it is associated only with noncoding variation in the 5’

region of COL1a1, where gene expression alteration is most likely. Finally, a

population genetic analysis of 40 chimpanzee COL1a1 sequences shows no amino

acid polymorphism, yet does reveal an unusual haplotype structure with

significantly extended linkage disequilibrium >30 kilobases away, as well as a

surprisingly common exon duplication that is generally highly deleterious in

humans. Altogether, these analyses indicate a history of temporally and spatially

varying purifying selection on not only coding but also noncoding COL1a1 regions

that is also reflected in population differentiation. In contrast to clinical studies,

this approach reveals potentially functional variation, which in future analyses

could explain the observed bone strength variation seen not only within humans

but also in other closely related primates.

To my family, with love

ACKNOWLEDGMENTS

I would like to thank the members of my committee: Brian Verrelli, Anne Stone,

Thomas Dowling, Michael Rosenberg, and Gary Schwartz, for their advice and

support, both with my dissertation research and with my graduate education and

professional development. I would especially like to thank my chair, Brian

Verrelli, for dedicating a considerable amount of time and energy to helping me

reach my academic and professional goals. Regardless of the topic, be it statistical

analyses, writing grant proposals and manuscripts, or how to deal with a difficult

student, he was always willing to provide advice and guidance to help get me

through. Without his efforts, I would not be here today. A special thanks also to

Anne Stone for providing the primate DNA samples used in Chapter 4 and to

Michael Rosenberg for saving me a considerable amount of time by creating the

PhaseSeqs script to help transfer data among analysis programs.

I would also like to thank my family for their love, support, and

encouragement throughout my life as well as for their dedication to my education

and to fostering my personal and professional interests. Thank you to my friends

also for their love and support, and especially for the much-needed distractions

from graduate school. I also greatly appreciate the helpful discussion of my

research with past and present members of the Verrelli Lab and colleagues at

Arizona State University (ASU). Finally, I owe a special thank you to Michael

Hammer, Elizabeth Wood, and Matthew Kaplan for originally sparking my

interest in human evolutionary genetics as an undergraduate and for providing a

strong foundation in molecular laboratory techniques.

Funding for my dissertation research was provided by the National

Science Foundation via a Doctoral Dissertation Improvement Grant

(DEB-0909637), the Graduate and Professional Students Association at ASU, and

through the generous use of Verrelli Lab start-up funds. I would also like to thank

the School of Life Sciences and the Graduate College for funding my graduate

education as a teaching and research associate.

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

CHAPTER

1 INTRODUCTION

Type I Collagen and the COL1a1 Subunit

Potential Importance of Noncoding COL1a1 Polymorphism

Bone Phenotypic Variation among Primates

Research Questions

2 COMPARATIVE VERTEBRATE EVOLUTIONARY ANALYSES OF COL1a1

Abstract

Introduction

Materials and Methods

Results

Discussion

Conclusion

Acknowledgements

3 HAPLOTYPE STRUCTURE AND AMINO ACID POLYMORPHISM AT HUMAN COL1a1

Abstract

Introduction

Materials and Methods

Results

Discussion

Conclusion

Acknowledgements

4 COMPARATIVE HUMAN AND CHIMPANZEE ANALYSES OF COL1a1

Abstract

Introduction

Materials and Methods

Results

Discussion

5 CONCLUSION

LITERATURE CITED

APPENDIX

A SUPPLEMENTARY MATERIAL: CHAPTER 2

B SUPPLEMENTARY MATERIAL: CHAPTER 3

C SUPPLEMENTARY MATERIAL: CHAPTER 4

LIST OF TABLES

Table

1. Human Clade A Collagen Gene Exon and Intron Characteristics

2. Clade A Collagen Gene Human-Chimpanzee Divergence Estimates

3. COL1a1 Human Population Diversity Estimates

4. Intraspecific and Interspecific Tests of Neutrality

5. Chimpanzee COL1a1 Diversity Estimates by Gene Region

6. Chimpanzee COL1a1 Haplogroup-Specific Diversity Estimates

LIST OF FIGURES

Figure

1. COL1a1 Gene Locus Diagram

2. COL1a1 Disease-Associated Mutations and Amino Acid Evolutionary Rates

3. Intron Length Frequency Distributions for Clade A Collagen Genes

4. Human COL1a1 Linkage Disequilibrium Patterns

5. Chimpanzee COL1a1 Exon Duplication Diagram

6. Chimpanzee Chromosome 17 Haplotypes

7. Chromosome 17 Gene and PCR Fragment Diagram

CHAPTER 1: INTRODUCTION

The incidence and severity of numerous complex human diseases vary greatly

among geographic populations (e.g., Abate and Chandalia 2003; Hajjar et al.

2006; Lau et al. 2006). Because a majority of these diseases have a genetic

component in addition to an environmental one, it is likely that variation within

the human genome is a significant source of observed phenotypic variation among

populations. Included among these diseases are skeletal disorders related to

variation in bone strength (measured as bone mineral density, BMD) like

osteoporosis, which has been shown to vary significantly among populations (e.g.,

Lauderdale et al. 1997; Looker et al. 1997; Melton 1997; Barrett-Connor et al.

2005), attributed in part to genetic differentiation (e.g., Dvornyk et al. 2003; Gong

and Haynatzki 2003; Gong et al. 2006; Koller et al. 2010). As we move into an

era of personalized, genome-based healthcare (Ng et al. 2008), characterization of

this genetic variation will enable medical practitioners to target preventative

measures to individuals with specific genotypes linked to increased risk of

developing skeletal disorders. In order to facilitate the design of novel treatments,

however, we must go a step further than simply identifying disease-associated

mutations (DAMs). Instead, it is crucial that we understand the evolutionary

context, both among populations and species, of potentially functional mutations.

Taking into account the historic effects of evolutionary pressures like natural

selection on the accumulation of this genetic variation will allow us to better

understand the origin and evolution of skeletal disease phenotypes. Specifically,

we can determine not only how and when such genetic variation is deleterious,

but, potentially, when it is beneficial, which can guide the design of innovative

treatments that could prevent the onset of disease symptoms entirely.

To date, >30 candidate genes have been associated with variation in BMD,

osteoporosis, or osteoporotic fracture (Ho et al. 2000; Shen et al. 2003; Liu et al.

2006; Gong et al. 2006; Rivadeneira et al. 2009; Ralston 2010). Of these, collagen

type I alpha 1 (COL1a1), which encodes part of the bone structural protein type I

collagen, consistently displays some of the strongest evidence of association with

disease across populations (e.g., Garcia-Giralt et al. 2002; Stewart et al. 2006;

Ioannidis et al. 2007; Jiang et al. 2007; Kaufman et al. 2008). Here, we examine

the recent and ancient evolutionary history of this >17-kb chromosome 17q21.33

locus using molecular evolutionary and population genetic approaches to better

understand the nature of human DAMs. Our results offer new insight into the

origin of skeletal disease in humans and what genetic variation may be

contributing to phenotypic differences in bone strength among populations.

Type I collagen and the COL1a1 subunit

Type I collagen, which is composed of two subunits of COL1a1 and one subunit of

COL1a2 wound together to form a triple-helix, is the most abundant protein in

vertebrates and is the main structural protein of bone, teeth, and tendon (Viguet-

Carrin et al. 2006). As such, mutations in these genes have been associated with

several skeletal and connective-tissue disorders (Dalgleish 1997; Marini et al.

2007). Within COL1a1 alone, >600 DAMs have been identified (Dalgleish 1997;

Marini et al. 2007), the majority of which are linked to osteoporosis, osteogenesis

imperfecta types I-IV, and Ehlers-Danlos Syndrome types I and VIIA, which

afflict ~200 million, ~500,000, and 200,000 individuals worldwide, respectively

(e.g., Stoll et al. 1989; Burrows 1999; Reginster and Burlet 2006). These DAMs

primarily affect protein coding regions, particularly within the triple-helix

domain, which is composed of a repeating amino acid sequence with glycine, the

smallest of the amino acids, in every third position, the repetition of which is

crucial to enabling the domain to wind into its compact structure in type I

collagen (Yamada et al. 1980; Bernard et al. 1983; Exposito et al. 2002; Boot-

Handford and Tuckwell 2003; Aouacheria et al. 2004; Wada et al. 2006). Thus,

substitutions of these glycines are often the most phenotypically severe DAMs

(e.g., resulting in lethal OI type II; Kuivaniemi et al. 1997; Dalgleish 1997;

Marini et al. 2007; Rauch et al. 2010).
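The glycine periodicity described above can be illustrated with a minimal Python sketch. It assumes a triple-helix fragment supplied in frame, so that every triplet begins with glycine; the function name and the toy sequences are hypothetical and are not taken from COL1a1.

```python
def glycine_repeat_violations(triple_helix: str) -> list[int]:
    """Return 0-based positions where a Gly-X-Y triplet does not start with glycine.

    Assumes the fragment is in frame, i.e. position 0 is the first
    residue of a triplet.
    """
    seq = triple_helix.upper()
    return [i for i in range(0, len(seq), 3) if seq[i] != "G"]


# Hypothetical toy fragments (one-letter amino acid codes), not real COL1a1 sequence.
wild_type = "GPPGAPGPQ"
mutant = "GPPSAPGPQ"  # glycine replaced by serine at position 3

print(glycine_repeat_violations(wild_type))  # []
print(glycine_repeat_violations(mutant))     # [3]
```

A check of this kind only flags disruptions of the repeat; it says nothing about phenotypic severity, which depends on position and thermostability as discussed in the text.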

The phenotypic severity associated with COL1a1 mutations, however, can

be quite variable depending on the position of the affected amino acid and how

the mutation alters the thermostability of type I collagen, both of which have been

used previously to predict the phenotypic outcome of novel mutations that affect

glycine residues (e.g., Persikov et al. 2005; Marini et al. 2007; Bodian et al. 2008,

2009). These previous methods, however, do not incorporate an evolutionary

approach. For example, genome-wide studies (e.g., Miller and Kumar 2001;

Subramanian and Kumar 2006) using evolutionary site models have shown that

DAMs in general are found more often at amino acid positions that are highly

conserved across species. Applying similar means to examine the long-term

evolutionary history of COL1a1 in vertebrates may allow for increased accuracy

in predicting the phenotypic severity of novel mutations, particularly those that

affect non-glycine positions, which have been largely ignored by previous

prediction models.
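As a rough illustration of what site-level conservation means, the Python sketch below scores each column of a protein alignment by the fraction of taxa sharing the most common residue. This simple score is an assumption made for illustration only; it is not the evolutionary site model used in the studies cited above, and the alignment is a hypothetical toy example.

```python
from collections import Counter


def column_conservation(alignment: list[str]) -> list[float]:
    """Fraction of sequences sharing the most common residue at each column.

    Gaps ("-") never count as the shared residue, but they remain in the
    denominator, so gapped columns score lower.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


# Hypothetical toy alignment of four taxa, not real COL1a1 sequence.
toy_alignment = ["GPPGAP", "GPAGAP", "GPSGEP", "GPPGAP"]
print(column_conservation(toy_alignment))  # [1.0, 1.0, 0.5, 1.0, 0.75, 1.0]
```

Columns scoring 1.0 are invariant across the sampled taxa; under the reasoning in the text, mutations at such positions would be expected to be the most severe.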

Potential importance of noncoding COL1a1 polymorphism

Within the natural population (i.e., individuals who lack clinical symptoms of

skeletal disease), COL1a1 amino acid variation is rare based on a study of 48

individuals (96 chromosomes) from each of four populations in the United States

(European-, African-, Mexican-, and Chinese-Americans) in which only 3 amino

acid mutations were identified, each at an allele frequency <2% in the

total sample (Chan et al. 2008). As such, COL1a1 protein variation is unlikely to

explain the association of this gene with bone phenotypic variation among

populations. Rather, genetic variation in noncoding regions that affects the

expression of COL1a1 may contribute significantly to these population

differences in bone-related phenotypes, as has been hypothesized for phenotypic

variation among populations in general (e.g., Ge et al. 2009; Kasowski et al.

2010).

A mutation in an Sp1 transcription factor binding site in the first intron of

COL1a1, for example, has been shown to increase gene expression and likely

accounts for the associated change in the ratio of COL1a1 and COL1a2 subunits

that reduces the structural integrity of type I collagen (Grant et al. 1996; Mann et

al. 2001; Jin, van’t Hof et al. 2009). This mutation reaches frequencies >20%

among populations of western European ancestry (e.g., Grant et al. 1996; Ralston

et al. 2006; Jiang et al. 2007), but is only rarely found among Africans (e.g.,

Beavan et al. 1998) and is absent among Asians (e.g., Han et al. 1999; Nakajima

et al. 1999; Lau et al. 2004). Thus, this single noncoding mutation has been found

to contribute significantly to population variation in bone strength and fracture

risk (e.g., Beavan et al. 1998; Bandres et al. 2005; Ralston et al. 2006; Jiang et al.

2007). Further, this mutation is found in significant linkage disequilibrium with

two other mutations in the promoter that are also associated with reduced BMD

(Garcia-Giralt et al. 2002, 2005; Stewart et al. 2006; Jiang et al. 2007; Jin, Stewart

et al. 2009). Recently, these three polymorphisms have been shown to have

haplotype-specific effects on COL1a1 expression resulting in not only low BMD,

but reduced overall bone quality as well (Jin, Stewart et al. 2009; Jin, van’t Hof et

al. 2009). Aside from these polymorphisms, however, little is known about

noncoding variation at COL1a1, particularly in the natural population (e.g., Chan

et al. 2008), as these regions have been largely ignored in previous studies in

favor of screening individuals for amino acid mutations. As such, it would be

interesting to investigate the extent of potentially functional genetic

differentiation in noncoding regions of COL1a1 in ethnically and geographically

diverse populations.

Noncoding variation that may have functional implications for COL1a1

gene expression need not be restricted to transcription factor binding sites,

however, but can rather include intron compositional properties that impact gene

structure. For example, genome-wide studies have shown that highly-expressed

genes have higher GC-content and shorter introns, which are related to increased

transcriptional efficiency (Hurst et al. 1996; Castillo-Davis et al. 2002; Urrutia

and Hurst 2001, 2003; Comeron 2004; Kudla et al. 2006). COL1a1 is likely

highly expressed given its abundance in connective tissue and its importance

during fetal development and wound repair (e.g., Gelse et al. 2003; Hildebrand et

al. 2005; Cohen 2006), yet because studies are often constrained to analyses of

soft-tissues (e.g., Su et al. 2004; Blekhman, Oshlack et al. 2008), little is known

about COL1a1 expression. However, as with coding regions, comparisons of the

long-term evolutionary history of COL1a1 introns among vertebrates could also

reveal historical selective pressures related to functional constraint, thereby

providing novel targets in the search for polymorphisms associated with bone-

related phenotypic differences among populations.
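The two intron compositional properties discussed above, length and GC-content, are straightforward to compute. The Python sketch below is a minimal illustration; the function name, the decision to ignore ambiguous bases, and the toy sequences are all assumptions and do not represent COL1a1 data.

```python
def gc_content(sequence: str) -> float:
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = sequence.upper()
    counted = [base for base in seq if base in "ACGT"]
    if not counted:
        return 0.0
    return sum(base in "GC" for base in counted) / len(counted)


# Hypothetical toy introns, not real COL1a1 sequence.
introns = {"intron_1": "GCGGCCATGC", "intron_2": "ATATTTAGCA"}

# Map each intron name to (length in bp, GC fraction).
summary = {name: (len(seq), gc_content(seq)) for name, seq in introns.items()}
print(summary)  # {'intron_1': (10, 0.8), 'intron_2': (10, 0.2)}
```

Comparing such per-intron summaries across genes or species is one simple way to ask whether a locus has the short, GC-rich introns associated with high expression.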

Bone phenotypic variation among primates

While examining COL1a1 variation in vertebrates in general will shed light on the

ancient evolutionary history of this gene, to better understand the origin of

skeletal disease in humans we must also determine how evolutionary pressures

have changed more recently within our lineage. Specifically, although bone

phenotypic differences exist among human populations, this does not mean that

such variation is unique to our species. Rather, evolutionary processes that led to

the prevalence of skeletal disease in humans could be shared with other closely-

related primate species. Although bone phenotypic data for non-human primates

are limited, two general trends have already emerged. First, BMD does vary within

other species (e.g., Sumner et al. 1989; Cerroni et al. 2000; Black et al. 2001;

Gunji et al. 2003; Havill et al. 2003), which may be due in part to underlying

genetic variation (e.g., Lipkin et al. 2001; Havill et al. 2005). Second, slight

differences in bone morphology have been documented among species, even

including differences between humans and our closest-living relative, the

chimpanzee, in osteoporotic-like symptoms, such as patterns of bone loss and the

accumulation of microfractures with age (e.g., Sumner et al. 1989; Wang et al.

1998; Gunji et al. 2003; Kikuchi et al. 2003; Mulhern and Ubelaker 2003, 2009;

Matsumura et al. 2010).

These data suggest that bone phenotypic variation common among

humans may not be isolated to our lineage. However, with the limited availability

of phenotypic data it is difficult to assess the extent of this variation within

species. Instead, population genetic comparisons of candidate genes can allow us

to make inferences about potential functional differences that may exist within

and between species, as has been done for other phenotypes like color vision and

resistance to viral infection (e.g., Wooding et al. 2005; Verrelli et al. 2008). Given

the link between COL1a1 and bone phenotypic variation in humans, this locus is a

perfect candidate for such an approach. For example, comparing patterns of

genetic variation at COL1a1 within chimpanzees to those observed in humans

would reveal if the evolutionary constraints at this locus shifted recently in our

lineage since its divergence from the last common ancestor of humans and

chimpanzees ~4-6 million years (My) ago.

Research Questions

1) Is there regional variation in historic selective pressures at COL1a1 and how

does this relate to the location and severity of DAMs?

Although COL1a1 has been extensively studied due to its direct link with

skeletal phenotype, previous studies have not only focused on coding regions, but

have largely ignored an evolutionary approach. As reported in Chapter 2, a

comparative species approach is used to examine variation in functional constraint

in both coding and noncoding regions of COL1a1 (Stover and Verrelli 2010).

Specifically, we examine evolutionary change at each amino acid site over the

past 450 My of vertebrate history to identify specific positions and overall protein

domains that are evolutionarily conserved, which are inferred to be sites or

regions of high functional constraint. Given this high constraint, these regions are

expected to result in more severe phenotypes if mutated, which we test using the

location and associated phenotypic severity of known DAMs, thereby generating

a model that can predict the severity of novel mutations.

2) Is the recent evolutionary history of COL1a1 in humans consistent with historic

selective pressures in vertebrates?

Population differences in BMD, osteoporosis, and osteoporotic fracture

are well established in humans as is the significant contribution of genetic

variation to these differences; however, the underlying cause of this genetic

diversity among populations is unknown. While it is possible that these patterns

of variation may be expected given historical selective pressures, they may,

alternatively, be an outcome of recent shifts in functional properties of bone in the

human lineage. For example, Wu and Zhang (2010) invoked variation in positive

selection as the general driving force behind genetic differentiation in skeletal

genes among populations. However, a recent weakening of purifying selection

among populations can also result in similar patterns of variation, such as an

excess of rare amino acid polymorphism (e.g., Bustamante et al. 2005; Boyko et

al. 2008; Lohmueller et al. 2008). Alternatively, genetic differentiation among

populations that results in bone phenotypic variation may simply be due to neutral

processes like genetic drift and differing demographic histories.

In Chapter 3, we use a population genetic approach to test these

possibilities for the recent evolution of COL1a1 in humans as compared to our

findings in Chapter 2 for the historic evolution of this gene. Specifically, we

collected nucleotide sequence data for the COL1a1 locus from a total of 96

individuals (192 chromosomes) representing 10 globally-distributed populations

and compare patterns of coding and noncoding diversity and haplotype structure

observed at COL1a1 with expectations under neutrality and different models of

selection. As this is the first comparative study of noncoding variation at COL1a1

among ethnically diverse, natural populations, we also discuss the potential

functional impact of genetic differentiation in introns.

3) Is the evolutionary history of COL1a1 in humans consistent with that in other

primates?

Phenotypic data suggest that skeletal variation exists both within and

between other primate species, which could imply that evolutionary processes that

have led to the prevalence of skeletal disease in humans may not be unique to our

lineage. We address this possibility in Chapter 4 using a population genetic

approach to determine if selective constraints acting on COL1a1 in humans (as

addressed in Chapter 3) are similar to those affecting the evolution of this locus in

chimpanzees. Specifically, we collected nucleotide sequence data for the COL1a1

locus from a total of 20 individuals (40 chromosomes) from the western African

Pan troglodytes verus subspecies. As with our human dataset, we compare

patterns of coding and noncoding diversity and haplotype structure with

expectations under neutrality and models of selection. By using our closest-living

relative for this comparative population approach, we can shed light on the origin

of skeletal disease in humans within the past 4-6 My as it relates to COL1a1.

CHAPTER 2: COMPARATIVE VERTEBRATE EVOLUTIONARY

ANALYSES OF COL1a1 [a]

Abstract

Collagen type I alpha 1 (COL1a1), which encodes the primary subunit of type I

collagen, the main structural and most abundant protein in vertebrates, harbors

hundreds of mutations linked to human diseases like osteoporosis and

osteogenesis imperfecta. Previous studies have attempted to predict the

phenotypic severity associated with type I collagen mutations, yet an evolutionary

analysis that compares historical and recent selective pressures, including across

non-coding regions, has never been conducted. Here, we use a comparative

genomic and species evolutionary analysis representing ~450 My of vertebrate

history to investigate functional constraints associated with both exons and introns

of the >17-kb COL1a1 gene. We find that although the COL1a1 amino acid

sequence is highly conserved, there are both spatial and temporal signatures of

varying selective constraint across protein domains. Further, sites of high

evolutionary constraint significantly correlate with the location of disease-

associated mutations, the latter of which also cluster with respect to specific

severity classes typically categorized in clinical studies. Finally, we find that

COL1a1 introns are significantly short in length with high GC-content, patterns

that are shared across highly-diverged vertebrates, and which may be a signature

of strong stabilizing selection for high COL1a1 gene expression. In conclusion,

although previous studies focused on COL1a1 coding regions, the current results

implicate introns as areas of high selective constraint and targets of bone-related

phenotypic variation. From a broader perspective, our comparative evolutionary

approach provides further resolution to models predicting mutations associated

with bone-related function and disease severity.

[a] Published as: Stover DA, Verrelli BC. 2010. Comparative vertebrate evolutionary analyses of type I collagen: potential of COL1a1 gene structure and intron variation for common bone-related diseases. Mol. Biol. Evol., doi: 10.1093/molbev/msq221.

Introduction

Fibrillar collagens are the main structural proteins in vertebrates, abundant in

connective tissue, cartilage, bone, and tendon (e.g., Gelse et al. 2003). In humans,

fibrillar collagen proteins are linked to numerous skeletal diseases, such as

osteoporosis, osteoarthritis, osteogenesis imperfecta (OI) and chondrodysplasia

(e.g., Dalgleish 1997; Cohen 2006). The fibrillar collagen gene most commonly

associated with disease is COL1a1, which encodes two of the three subunits of

type I collagen (the third being encoded by COL1a2), the most abundant

protein in mammals and the main structural protein of bone, teeth, and tendon

(Viguet-Carrin et al. 2006). Within this gene, >600 human disease-associated

mutations (DAMs) have been identified, the majority of which are linked to

osteoporosis afflicting ~200 million individuals globally, as well as OI (or

“brittle-bone disease”) types I-IV and Ehlers-Danlos Syndrome types I and VIIA

(Dalgleish 1997; Marini et al. 2007) afflicting ~500,000, and 200,000 individuals,

respectively (e.g., Stoll et al. 1989; Burrows 1999; Reginster and Burlet 2006).

Interestingly, COL1a1 is also associated with population variation in bone

strength, measured as bone mineral density (Garcia-Giralt et al. 2002; Stewart et

al. 2006; Jiang et al. 2007). Thus, as frequency differences among populations in

osteoporosis are well-documented (e.g., Lauderdale et al. 1997; Looker et al.

1997; Melton 1997; Barrett-Connor et al. 2005), COL1a1 is a leading candidate

gene in predicting not only severe bone-related diseases, but also natural bone

variation among populations in general.

The majority of known COL1a1 DAMs affect protein coding regions,

typically within the triple-helix domain that is composed of a repeating amino

acid sequence with glycine, the smallest of the amino acids, in every third

position and primarily separated by proline residues (Yamada et al. 1980; Bernard

et al. 1983; Exposito et al. 2002; Boot-Handford and Tuckwell 2003; Aouacheria

et al. 2004; Wada et al. 2006). This repetition is crucial, as it enables the triple-

helix domain to wind into its compact structure in type I collagen. As such, the

most phenotypically severe (e.g., OI type II) DAMs often result from substitutions

of these glycines, while less severe phenotypes (e.g., OI type I) often result from

alterations of the length of the triple-helix domain, which is normally encoded by

43 of 51 exons (fig. 1) within the >17-kb COL1a1 locus (Kuivaniemi et al. 1997;

Dalgleish 1997; Marini et al. 2007; Rauch et al. 2010). Studies have attempted to

predict the phenotypic severity associated with COL1a1 mutations based on their

amino acid position and effect on the thermostability of type I collagen (e.g.,

Persikov et al. 2005; Marini et al. 2007; Bodian et al. 2008, 2009); however,

surprisingly, no model has incorporated an evolutionary approach. For example,

genome-wide studies (e.g., Miller and Kumar 2001; Subramanian and Kumar

2006) using evolutionary site models have shown that DAMs in general are found

more often at amino acid positions that are highly conserved across species.

Because previous efforts to identify DAMs at COL1a1 have focused

primarily on coding regions, it is unclear whether variation in other gene regions

is functionally relevant. In fact, given the abundance of rare disease-associated

amino acid variants (Dalgleish 1997; Marini et al. 2007) and the rarity of COL1a1

protein variation in general in the natural population (Chan et al. 2008), the

observed variation in bone strength among human populations implicates

COL1a1 non-coding regions as functionally relevant as well. For example, an Sp1

transcription factor binding site mutation in the first intron of COL1a1 increases

gene expression, and likely accounts for the associated change in the ratio of

COL1a1 and COL1a2 subunits that reduces the structural integrity of type I

collagen (Grant et al. 1996; Mann et al. 2001; Jin, van’t Hof et al. 2009).

Interestingly, this mutation reaches frequencies >20% in certain populations, but

is absent in others, thus contributing to ethnic and geographic variation in bone

strength and fracture risk (Bandres et al. 2005; Ralston et al. 2006; Jiang et al.

2007).

Potentially functional non-coding variation associated with gene

expression may also include intron compositional properties that impact gene

structure. For example, genome-wide studies have shown that highly-expressed

genes have higher GC-content (Urrutia and Hurst 2001; Comeron 2004; Kudla et

al. 2006), as well as shorter introns, which is related to transcriptional efficiency

(Hurst et al. 1996; Castillo-Davis et al. 2002; Urrutia and Hurst 2003; Comeron

2004). COL1a1 is likely highly expressed given its abundance in connective

tissue and its importance during fetal development and wound repair (e.g., Gelse

et al. 2003; Hildebrand et al. 2005; Cohen 2006), yet because studies are often

constrained to analyses of soft-tissues (e.g., Su et al. 2004; Blekhman, Oshlack et

al. 2008), little is known about COL1a1 gene expression. In this respect, in

addition to evolutionary analyses of COL1a1 coding regions, similar analyses of

introns (e.g., length and GC-content) could also reveal historical selective

pressures related to functional constraint.

Attempts have been made to predict mutations associated with bone-

related disease, the majority of which include COL1a1 because of its direct link to

skeletal phenotypes. However, these studies have focused on protein sequences,

and more specifically, have often ignored an evolutionary approach. Here, we

present the first molecular evolutionary and statistical analysis of both COL1a1

coding and non-coding sites to ask several questions: (1) Is there evidence of

differential selective pressure across protein domains, and how does this vary

across recent/historical periods of time? (2) Do amino acid sites of high and low

functional constraint over evolutionary time predict the gene locations and

severity of human DAMs? (3) Finally, are there particular aspects of the gene

structure that are evolutionarily unique, compared to even other fibrillar

collagens?

Materials and Methods

DNA Amplification and Sequencing

Our DNA sequences were collected either from public genome databases

(National Center for Biotechnology Information (NCBI) and Ensembl), or were

generated directly when necessary. For example, with an estimated molecular

divergence time from our common ancestor at ~4-6 My (e.g., Kumar et al. 2005),

the chimpanzee is our closest living primate relative and is necessary for estimates

of COL1a1 nucleotide site divergence in the human lineage. However, regions of

the chimpanzee (and to some extent the human) COL1a1 gene sequences have

assembly errors in their genome databases (which is likely a result of the

repetitive nature of the sequence). Thus, DNA sequences for the >17-kb

chromosome 17q21.33 locus (fig. 1) were generated from a human and a western

African chimpanzee (Pan troglodytes verus), both sampled from the Coriell Institute

for Medical Research (Camden, NJ). Although there are several recognized

chimpanzee subspecies, P. t. verus is the most appropriate contrast with humans

because it has similar levels of nuclear diversity (Stone et al. 2002; Gilad et al.

2003; Fischer et al. 2004; Wooding et al. 2005; Verrelli et al. 2006, 2008; Claw et

al. 2010).

The overall high GC-content at COL1a1 (discussed below) often required

that our polymerase chain reaction (PCR) fragments be generated in short

sequences (e.g., as small as 500 bp) with various temperature and buffer

conditions. PCR products were purified using shrimp alkaline phosphatase and

exonuclease I (US Biochemicals, Cleveland, OH) prior to DNA sequencing using

an ABI 3730 automated sequencer (Applied Biosystems, Foster City, CA).

Sequences were aligned and edited using Sequencher v. 4.5 (Gene Codes, Ann

Arbor, MI). All PCR and sequencing primers were designed from available

human sequence (accession # NT_010783.14) and are available upon request.

Gene Structure Analyses

To test hypotheses about selective pressures acting on gene structure, we first

used a gene family approach comparing COL1a1 in humans to other closely-

related fibrillar collagens. Phylogenetic comparisons of vertebrate fibrillar

collagen proteins reveal three clades with COL1a1 being one of five “clade A”

collagens (Boot-Handford and Tuckwell 2003; Aouacheria et al. 2004; Wada et

al. 2006). As such, length and GC-content data for all exons and introns were

collected from public databases for the four other clade A collagens (COL1a2,

COL2a1, COL3a1, and COL5a2). Sequence gaps and Alu elements were omitted,

the latter to avoid polymorphic insertions (e.g., a polymorphic Alu in

COL3a1; Milewicz et al. 1996). Further, estimates of intron nucleotide divergence

were calculated from sites that best reflect “neutrality.” Although we cannot say

with certainty which sites are selectively neutral, we do expect that purifying

selection acts relatively stronger on certain sites with putative function compared

to others (e.g., McDonald and Kreitman 1991; Wray et al. 2003). Thus, we

omitted first introns, which are typically enriched for transcription factor binding

sites (e.g., Bornstein et al. 1987; Majewski and Ott 2002), as well as intron 5’ and

3’ splice sites. While this first comparison examines the “uniqueness” of COL1a1

among fibrillar collagens, a second comparison was used to also gain an overall

genomic perspective. We used the Gazave et al. (2007) dataset, which includes

length, GC-content, and human-chimpanzee nucleotide divergence estimates for a

total of 51,673 introns compiled from 7,791 genes, representing the largest

evolutionary analysis of human genome introns. As above, first introns and

intron splice sites were omitted.
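
As a rough illustration of the per-intron measurements described above, the following Python sketch computes length and GC-content for each intron after dropping the first intron and trimming the splice sites. The function names and the number of bases trimmed at each end (`splice_trim`) are assumptions made for illustration; the number of splice-site bases actually excluded is not stated here.

```python
def gc_content(seq):
    """Return the fraction of G/C bases among unambiguous A/C/G/T bases."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted


def intron_summary(introns, splice_trim=2):
    """Summarize introns as (length, GC fraction), skipping the first intron.

    introns     -- intron sequences in gene order (index 0 is the first intron)
    splice_trim -- hypothetical number of bases removed from each end
    """
    summary = []
    for seq in introns[1:]:  # first introns are omitted
        core = seq[splice_trim:len(seq) - splice_trim]
        summary.append((len(core), gc_content(core)))
    return summary
```

Ambiguous bases (e.g., N) are ignored in the GC fraction, which is one reasonable choice when sequence gaps have already been removed.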

Because we are interested in whether specific aspects of COL1a1 introns

are unusual, both independently and with respect to overall structure, analyses of

“means” alone for intron length and GC-content are not informative. Further, the

distribution of intron length in the human genome is dramatically skewed (i.e.,

lacks “normality,” Gazave et al. 2007); thus, because of the enormity of the

genome-wide sample relative to COL1a1, even non-parametric tests that compare

only means of distributions may not be statistically sensitive. Instead, we

constructed bins for the distributions of intron “length” and of GC-content

“percentage” for each clade A collagen and for the Gazave et al. (2007) dataset,

and compared these binned distributions using non-parametric row by column

(RxC) chi-squared tests (e.g., supplementary tables 1a-c, Appendix A). Among

clade A collagens, comparisons of binned distributions were also made for exon

length and GC-content. These analyses that compare distributions of site

frequency classes allow for fine-scale statistical tests and increased power,

especially in cases of unequal sample size (e.g., Akashi and Schaeffer 1997;

Verrelli and Tishkoff 2004; Hernandez et al. 2007). Statistical significance levels

for all tests were corrected for multiple comparisons using a standard Bonferroni

method.
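
The binned comparison described above can be sketched as follows. This is a minimal illustration, not the code used in the analysis: the bin edges are hypothetical, and only the chi-squared statistic and its degrees of freedom are computed (a P-value would then be read from the chi-squared distribution).

```python
from bisect import bisect_right


def bin_counts(values, edges):
    """Count values per bin; `edges` are ascending inner cut points."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return counts


def rxc_chi_squared(table):
    """Return (chi-squared statistic, degrees of freedom) for an RxC table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    # Empty rows or columns carry no information and do not add to the df
    rows = sum(1 for t in row_totals if t > 0)
    cols = sum(1 for t in col_totals if t > 0)
    return stat, (rows - 1) * (cols - 1)


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests
```

Here each row of `table` would hold the binned counts for one gene (or for the genome-wide dataset), and each column one length or GC-content bin.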

We also carried out comparative analyses across taxa to estimate how

aspects of the gene structure of COL1a1 may have changed over evolutionary

time. To avoid errors in alignment and problems associated with incomplete

sequences from available databases, we only used species for which complete

COL1a1 intron and exon sequences were publicly available. That is, species with

gaps in their genomic sequence, regardless of the length of the gap, were

excluded. As a result, the only species for which complete, high-quality COL1a1

sequences were publicly available are mouse (Mus musculus), dog (Canis

familiaris), cow (Bos taurus), western-clawed frog (Xenopus tropicalis), and

zebrafish (Danio rerio), which were added to our human and chimpanzee

sequences. This “7-species dataset” reflects a broad sampling across ~450 My of

vertebrate evolution. Unlike our previous comparisons within the human genome,

these analyses involve the same number of exons and introns. Thus, comparisons

across species were conducted for COL1a1 exon and intron length and GC-

content using Mann-Whitney U tests. To test hypotheses regarding the

conservation of independent introns across evolutionary time, we also used non-

parametric RxC chi-squared tests to compare orthologous introns and exons

among these species (e.g., supplementary tables 2a-e, Appendix A).
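
For reference, the Mann-Whitney U statistic used in the cross-species comparisons can be computed as in the sketch below. The function name and inputs are illustrative assumptions (e.g., two lists of intron lengths from two species), and the sketch returns only the U statistics; significance would be assessed from tables or a statistics package.

```python
def mann_whitney_u(x, y):
    """Return (U_x, U_y) using average ranks for tied values."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        i = j
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    n_x, n_y = len(x), len(y)
    u_x = rank_sum_x - n_x * (n_x + 1) / 2
    return u_x, n_x * n_y - u_x
```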

Nucleotide Site Divergence

To estimate the impacts of long- and short-term selective pressures on the

COL1a1 protein, we first constructed a “vertebrate” dataset reflecting ~450 My

from available complete cDNA sequences, which includes the 7-species dataset in

addition to rat (Rattus norvegicus), donkey (Equus asinus), African clawed frog

(Xenopus laevis), rainbow trout (Oncorhynchus mykiss), and goldfish (Carassius

auratus). Amino acid sequences (the only complete sequences available)

from chicken (Gallus gallus) and Japanese firebelly newt (Cynops

pyrrhogaster) were also collected. Second, we also constructed a “primate”

dataset using our human and chimpanzee sequences in combination with database

partial-cDNA sequences (see supplementary table 3, Appendix A) from gorilla

(Gorilla gorilla), orangutan (Pongo abelii), macaque (Macaca mulatta), and

marmoset (Callithrix jacchus), reflecting ~45 My in total. Sequences within the

two datasets were aligned with a Clustal analysis in MEGA v.4 (Kumar et al.

2008), with the removal of alignment gaps.

With these two datasets, we conducted several analyses of the ratio of

divergence at nonsynonymous and synonymous sites (dN/dS; e.g., Nei and

Gojobori 1986). To identify statistically significant differences, we used the

GABranch algorithm of Pond and Frost (2005a) available through the

Datamonkey on-line server (Pond and Frost 2005b). This analysis enables more

flexible testing of hypotheses in identifying different dN/dS evolutionary

“classes” of constraint using a maximum-likelihood approach. For both our

general vertebrate and primate datasets, the relationships among species were

described with a maximum parsimony tree constructed in MEGA v.4 (Kumar et

al. 2008) and the best-fit nucleotide substitution model for each dataset was

determined by the selection procedure of Pond and Frost (2005c).

Other than the triple-helix domain, COL1a1 contains N-and C-terminal

non-collagenous domains, which are found in all fibrillar collagens and

hypothesized to have different functional constraints (e.g., Exposito et al. 2002;

Boot-Handford and Tuckwell 2003; Aouacheria et al. 2004). Thus, we also

applied the GABranch algorithm (Pond and Frost 2005a) to each COL1a1 domain

separately, and used the HyPhy program (Pond et al. 2005) to assess statistical

significance. Specifically, likelihoods are generated from a substitution rate model

for dN/dS evolving independently across all lineages in each domain, and then

compared to a second set of likelihoods (using a standard likelihood ratio test

employed by HyPhy) generated from models where each domain is constrained to

fit another. For example, is the rate for the triple-helix different from that

observed in the N- or C-terminal across lineages?
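
The logic of the likelihood ratio test described above can be sketched in a few lines. This is not HyPhy's interface; the log-likelihood values and the degrees of freedom (the number of parameters constrained in the null model) are placeholders that the user would supply from the fitted models.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variate for integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over i < df/2 of (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite correction series
    tail = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    total = 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return tail + total


def likelihood_ratio_test(lnl_null, lnl_alt, df):
    """Return (test statistic, P-value) for nested models."""
    stat = 2.0 * (lnl_alt - lnl_null)
    return stat, chi2_sf(stat, df)
```

In this setting the null model would constrain two domains to share a dN/dS rate, and the alternative would let each domain evolve independently.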

Unlike the coding region analyses above, non-coding regions are

sufficiently diverged to preclude alignment across taxa, even for closely-related

non-human primates in our dataset. Thus, we estimated historical selective

constraint on the COL1a1 intron nucleotide sequence in humans with an analysis

of human-chimpanzee divergence. As above, we wish to examine how

divergence varies across introns (which is especially relevant to our analysis of

intron lengths), and not compare simple “means;” thus, we used similar RxC chi-

squared tests as above. Specifically, after removing alignment gaps and splice

sites, a distribution of divergence for COL1a1 introns was constructed as bins of

“percentage” (number of differences per nucleotide) and compared to similar bins

of human-chimpanzee divergence for introns in the other four clade A collagen

genes, as well as in the Gazave et al. (2007) dataset.

DAMs at COL1a1

Finally, we examined the association of COL1a1 amino acid sites of high and low

evolutionary conservation with where DAMs are located. Following the method

of Subramanian and Kumar (2006), we estimated the evolutionary substitution

rate at each amino acid site from our vertebrate dataset (14 species listed above

for which the entire amino acid sequence is available), using a maximum-

likelihood approach implemented in PAML v.4.2 (Yang 2007). Specifically, the

distribution of evolutionary rates among sites was categorized into an 8-class

discrete gamma model (with an estimated gamma shape parameter starting at 0.5),

and unequal substitution rates among amino acids were accounted for using the

model of Jones et al. (1992). A total of 293 unique missense and nonsense

COL1a1 DAMs that could be plotted within our vertebrate alignment were

collected from the Database of Osteogenesis Imperfecta and Type III Collagen

Mutations (Dalgleish 1997). We then used a chi-squared test to contrast the

distribution of these observed mutations with the distribution of expected

mutations in the 8-class model derived from the Subramanian and Kumar (2006)

analysis.
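
The observed-versus-expected contrast across rate classes can be illustrated with the sketch below. It assumes, for illustration only, that expected mutation counts are proportional to the number of amino acid sites falling in each rate class; the expectations actually used may have been derived differently, and all names here are hypothetical.

```python
def expected_counts(sites_per_class, n_mutations):
    """Spread mutations across rate classes in proportion to site counts."""
    total_sites = sum(sites_per_class)
    return [n_mutations * s / total_sites for s in sites_per_class]


def goodness_of_fit(observed, expected):
    """Return (chi-squared statistic, degrees of freedom)."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return stat, len(observed) - 1
```

With an 8-class model, `observed` would hold the number of DAMs in each of the eight rate classes, giving 7 degrees of freedom.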

As the majority of COL1a1 mutations are associated with OI, we also

examined this distribution according to the clinical severity of disease symptoms

(Sillence et al. 1979; Basel and Steiner 2009). Type I OI typically presents with

mild bone weakening similar to that caused by osteoporosis; OI II, the most

severe form, is typically lethal in the pre- or post-natal stage; OI III, the most

severe of survivable OI types, is characterized by severe skeletal deformities and

fragility fractures; and those that cannot be placed into types I-III are typically

classified as OI IV. With this classification, we used four categories: (1) OI I and

osteoporosis-associated mutations; (2) OI IV and related mutations (e.g., I/IV and

III/IV); (3) OI III mutations; and (4) lethal OI II and II/III mutations, which total

249 mutations that meet these criteria. We repeated the same chi-squared analyses

above using the expected distributions from the Subramanian and Kumar (2006)

analysis, but with each of these 4 severity categories as the observed distributions.

Finally, to determine if there is significant spatial clustering of these severity

categories across the COL1a1 amino acid sequence, we used the maximum-

likelihood approach of Zhang and Townsend (2009) as implemented in the

MACML program.

Results

Exon Pattern Analyses

COL1a1 exons, while seemingly short in length, are not unusual for clade A

fibrillar collagens (table 1; supplementary tables 1a and c, Appendix A). Exon

length does not differ between human and chimpanzee, and although there are

differences within our 7-species dataset (which are primarily restricted to the N-

terminal domain), they are not statistically unusual (supplementary tables 2a and

c-e, Appendix A). At 66%, COL1a1 exon GC-content is considerably higher than

that seen at human genes in general (International Human Genome Sequencing

Consortium 2001), which is not surprising given the abundance of GC-rich

glycine and proline codons typically found in collagens. However, COL1a1 exon

GC-content is still significantly greater than the other clade A fibrillar collagens

(except COL2a1), even when examining only glycine- and proline-rich triple-

helix coding regions (table 1; supplementary tables 1b and c, Appendix A).

Further, although COL1a1 exon GC-content overall differs for comparisons

between mammals and non-mammals, no differences are seen in comparisons of

orthologous exons (supplementary tables 2b-e, Appendix A).

Protein Evolution and DAMs

COL1a1 human-chimpanzee synonymous divergence is not unusual compared to

other genes (e.g., Chimpanzee Sequencing Consortium 2005), while

nonsynonymous divergence at COL1a1 (as well as at all clade A collagens), is

virtually absent (table 2). Low COL1a1 amino acid divergence is also observed

within our primate sample (dN/dS<0.15; supplementary table 3, Appendix A),

and although the GABranch analysis found evidence for two dN/dS rate classes

overall, the small number of fixations precluded analyses among domains. On the

other hand, the analysis of our vertebrate sample found evidence for six dN/dS

rate classes in the triple-helix domain (dN/dS<0.36) and three rate classes for each

of the C- and N-terminal domains (dN/dS<0.17 and <0.74, respectively;

supplementary fig. 1, Appendix A). With the HyPhy analysis, we find little

evidence for rate differences between the C- and N-terminal domains for

comparisons between primate and other mammalian lineages, whereas the rate in

the triple-helix is significantly less than that of the N-terminal, but significantly

greater than that of the C-terminal domain (likelihood ratio test, P<0.009).

Using the approach of Subramanian and Kumar (2006), our site analysis

finds spatial variation across COL1a1, with sites in the N-terminal domain having

a significantly higher evolutionary rate (fig. 2; supplementary tables 4a and b,

Appendix A). In addition, DAMs occur more often at highly conserved amino

acid positions, a relationship that is consistent for all of the phenotypic severity

categories associated with OI disease except, interestingly, the least severe

category 1 (fig. 2; supplementary tables 4a-c, Appendix A). In fact, when

focusing on the extremes of lethal (category 4) and more benign, but osteoporotic-

like (category 1) mutations, the former occur at much more highly conserved

amino acid positions a significantly greater proportion of the time (χ²=25.2,

P=0.0001). Finally, our MACML analysis finds clustering across COL1a1 in

several regions, with statistically significant evidence involving the two extreme

DAM categories, regardless of the model used from Zhang and Townsend (2009).

A significant cluster for category 1 includes residues 190-366 (P<0.05), which is

the N-terminal end of the triple-helix domain, whereas a significant cluster for

category 4 includes residues 352-1186 (P<0.05), which is centered on the triple-

helix domain.

Intron Pattern Analyses

COL1a1 introns are significantly shorter than those of the other clade A collagens

with the exception of COL2a1 (table 1, fig. 3). In fact, the intron structure at

COL1a1 - many introns, but all of them relatively short in length - is statistically

very unusual (χ²=150, P<10⁻⁶) when compared to human genes in general

(Gazave et al. 2007). We do find evidence for significant variation in length of

orthologous introns between taxa within our 7-species dataset; however, as

overall intron length has not changed dramatically (supplementary tables 5a and

c-e, Appendix A), COL1a1 intron content appears to be similar even across

vertebrates separated by ~450 My.

Intron GC-content at COL1a1 is significantly greater than that at the other

clade A collagens except COL2a1 (supplementary tables 6b and d, Appendix A).

However, we also found that GC-content was negatively correlated with intron

length at COL1a1 (supplementary table 6e, Appendix A), a pattern that has been

generally noted from genome-wide studies (e.g., International Human Genome

Sequencing Consortium 2001; Gazave et al. 2007; Pozzoli et al. 2008).

Nonetheless, even when the dataset for the 5 genes was standardized to include

introns only 80-500 bp in length to reflect intron sizes at COL1a1, the results did

not change. Unlike COL1a1, no other collagen gene shows a significant

correlation between intron length and GC-content; thus, instead of “percentage”

GC-content, we examined linear regressions of the number of G/C nucleotides in

each intron, which standardizes GC-content for intron length. F-test analyses

comparing these regressions confirm that intron GC-content at COL1a1 is

unusually higher than all other clade A collagens except COL2a1 (supplementary

table 6f, Appendix A). In fact, when compared to the Gazave et al. (2007) dataset,

and genes with introns of only 80-500 bp, COL1a1 still has significantly greater

GC-content (χ²=35, P=0.0009). Unlike intron length, intron GC-content differs

significantly in our 7-species dataset at COL1a1, regardless of whether overall

distributions or pairwise comparisons among equivalent introns are compared

(supplementary tables 5b-e, Appendix A). Overall intron GC-content is

significantly higher in mammals vs. non-mammals, except for mouse, which has

significantly lower intron GC-content than even other mammals.

Analyses of genome-wide introns indicate that length and GC-content are

also positively correlated with human-chimpanzee divergence (Gazave et al.

2007). While this seems to be the case for intron length and divergence at

COL1a1 and the other clade A collagens, the relationship for GC-content and

divergence is only weakly correlated for the collagen genes (supplementary table

6e, Appendix A). Thus, as in analyses above, we standardized datasets to include

introns 80-500 bp for divergence comparisons. Intron divergence at COL1a1 is

low, but not statistically unusual compared to the other collagens (table 2;

supplementary tables 6c and d, Appendix A) or human genes in general (χ²=8.0,

P=0.6). However, it is possible that our analysis of COL1a1 divergence is an

underestimate as it does not take into consideration the significantly high, gene-

specific GC-content (e.g., Hernandez et al. 2007). To test this latter hypothesis,

we used divergence at COL1a1 non-CG dinucleotides (0.43%) as a conservative,

“background” substitution rate (i.e., instead of the higher genome-wide estimate

of ~1%; Chimpanzee Sequencing Consortium 2005). This estimate was adjusted

for the ~10-fold increase in mutation rate expected for CG-dinucleotides in

general (Subramanian and Kumar 2003), and compared to the rate “observed” at

CG-dinucleotide sites (i.e., after using orangutan as an outgroup to correct for

changes recently in the human and chimpanzee lineages). When these two

estimates were applied to the total number of intron sites at COL1a1 to simulate

sampling variance, interestingly, observed human-chimpanzee divergence appears

to be reduced (χ²=6.14, P=0.013).
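
The expectation for CG-dinucleotide sites can be sketched as follows. The sketch takes the non-CG background rate (0.43%) and the ~10-fold CG mutation rate increase from the text; the function names are hypothetical, and the sketch stops at the expected number of differences rather than reproducing the sampling-variance simulation.

```python
def count_cg_sites(seq):
    """Count bases that fall within a CG dinucleotide on the given strand."""
    seq = seq.upper()
    in_cg = [False] * len(seq)
    for i in range(len(seq) - 1):
        if seq[i] == "C" and seq[i + 1] == "G":
            in_cg[i] = in_cg[i + 1] = True
    return sum(in_cg)


def expected_cg_differences(seq, background_rate=0.0043, cg_fold=10.0):
    """Expected number of differences at CG sites under the adjusted rate."""
    return count_cg_sites(seq) * background_rate * cg_fold
```

The expected count from `expected_cg_differences` would then be compared with the number of human-chimpanzee differences actually observed at CG sites.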

Discussion

Our evolutionary analyses of COL1a1 spanning the past ~450 My are consistent

with it being one of the most highly conserved genes in vertebrates. While this may be

the case for the protein overall, we find evidence for spatial/temporal variation in

selective pressures across domains. In fact, the C-terminal domain that is

responsible for the recognition and assembly of type I collagen subunits (e.g.,

Exposito et al. 2002; Boot-Handford and Tuckwell 2003; Aouacheria et al. 2004),

is actually the most highly conserved domain. Like the N-terminal, the C-terminal

peptide is cleaved after the subunit’s triple-helix domain has been assembled into

type I collagen; thus, interestingly, the protein region that exhibits the strongest

signature of purifying selection is also one that is not part of the mature protein.

On the other hand, the relatively weaker constraint on the COL1a1 triple-helix

domain implies more “flexibility” in this region over evolutionary time, which

may contribute to observed structural and mechanical variation in bone, including

mineral content and organization of collagen fibers, among mammals, birds, and

reptiles (e.g., Currey 1987; Wang et al. 1998; Rensberger and Watabe 2000).

It is possible that the evolutionary differences seen across domains are the

result of directional selection pressures over time. Others have suggested this

explanation (e.g., Morgan et al. 2010); however, the rate variation observed here

is not what would be typically interpreted as a signature of “positive selection.”

Specifically, the estimates of dN/dS over both recent and deeper evolutionary

lineages are more consistent with purifying selection, and thus the differences

among domains in these analyses, while statistically significant, are more

reflective of variation in functional constraint and not adaptive evolution.

Furthermore, our observations of very little amino acid variation among primates,

including no fixation between human and chimpanzee, together with the observed

slight differences in the properties of bone documented among primates (e.g.,

Wang et al. 1998; Kikuchi et al. 2003), are consistent with strong purifying

selection that simply varies from ancient to more recent evolutionary time.

The abundance of DAMs at COL1a1 is also consistent with high

functional constraint; however, the fact that so many exist, and yet are not

associated with lethality (e.g., category 1 mutations) may suggest that purifying

selection has recently become weak in the human lineage. This hypothesis has

been championed as a general explanation for human disease prevalence, as high

amino acid polymorphism in humans (albeit rare in frequency), in spite of the

little amino acid divergence with chimpanzees, is a genome-wide pattern (e.g.,

Bustamante et al. 2005; Boyko et al. 2008; Lohmueller et al. 2008). However, the

fact that purifying selection, and the absence of any positive selection, dominates

the evolutionary history at COL1a1 is somewhat novel compared to other genes

strongly associated with human disease (e.g., Blekhman, Man et al. 2008). Thus,

instead of invoking weak purifying selection in the human lineage, a better

explanation for the pattern of DAMs seen today is that selective constraints

simply differ across protein domains, which is also consistent with the overall

vertebrate pattern at COL1a1.

In fact, our comparisons of the location of DAMs with the estimates of

evolutionary constraints at their respective sites help further support this

hypothesis. For example, although previous studies have suggested that the

probability of lethality for a mutation increases as one moves away from the N-

terminus (e.g., Byers et al. 1991; Marini et al. 2007; Rauch et al. 2010), we find

that DAMs not only cluster in specific protein sites of high evolutionary

constraint, but that they even cluster with respect to classes of lesser severity,

including more common, osteoporotic-like conditions. Interestingly, the N-

terminal domain, where only four DAMs have been observed (Dalgleish 1997), is

also the region here that shows the least evolutionary constraint over recent as

well as deep evolutionary time. On the other hand, ~7% of COL1a1 positions,

including those in the triple-helix, lack known DAMs and are also the most

rapidly evolving sites. Thus, while factors such as amino acid size,

thermostability, and domain are important to bone-related disease prediction

models (e.g., Persikov et al. 2005; Marini et al. 2007; Bodian et al. 2008, 2009),

an evolutionary site model may detect bone-related phenotypes of lesser, but more

common, severity with finer resolution, particularly for non-glycine mutations,

which have been largely ignored by previous prediction models.

Our final analyses of the evolutionary history of the overall gene structure

also find unusual patterns for non-coding DNA. Of particular interest is that

COL1a1 introns have remained relatively short over the past ~450 My despite

divergence in overall genome size (e.g., haploid human genome is ~2-fold larger

than that of zebrafish; Hedges and Kumar 2002). Nonetheless, there are several

reasons why these patterns are not expected a priori. First, because of the

repetitive nature of vertebrate fibrillar collagen coding sequence, which has been

proposed to have originated through a series of duplication events from an

ancestral collagen with a single exon of ~54 bp (Bernard et al. 1983; Valkkila et

al. 2001; Exposito et al. 2002; Boot-Handford and Tuckwell 2003; Aouacheria et

al. 2004; Wada et al. 2006), there is extensive homology among gene regions.

This homology presents the potential for increased unequal crossing-over events

that result in exon duplications and deletions in the triple-helix domain, which,

although important in the history of collagen evolution, is highly deleterious today

as it typically leads to severe disease (e.g., Barsh et al. 1985; Cohn et al. 1993;

Bodian et al. 2009). As such, separating the short COL1a1 exons to reduce

“interexon homology” (e.g., Cohn et al. 1993) would actually predict longer

introns, yet this is not observed here.

A second consideration involves the greater mutation rate associated with

high GC-content in general (e.g., Subramanian and Kumar 2003), and thus, the

expectation that mutational input at COL1a1, including introns, may be relatively

high. However, our conservative estimates suggest that intron site divergence at

COL1a1 may actually be reduced. We may expect that purifying selection on

deleterious amino acid variants coincidentally reduces neutral, linked

polymorphism in introns (as a result of background selection, e.g., Charlesworth

et al. 1995). However, under this scenario, we do not expect intron site

divergence, if neutral, to be reduced as gene regions become effectively unlinked

and evolutionarily independent over longer periods of time (e.g., Hellmann et al.

2003). For example, even at COL1a1, synonymous sites show typical levels of

neutral divergence, yet nonsynonymous sites are highly conserved. Thus, the

pattern found for intron site divergence at COL1a1 may be consistent with

purifying selection, which suggests that intron site variation has functional

implications as well.

As previously noted, transcriptional efficiency is correlated with GC-

content and intron length across the genome (Hurst et al. 1996; Castillo-Davis et

al. 2002; Urrutia and Hurst 2003; Vinogradov 2003; Comeron 2004; Kudla et al.

2006; Pozzoli et al. 2008); thus, one explanation for the significantly short, GC-

rich introns at COL1a1 is that high gene expression is maintained by stabilizing

selection. As type I collagen is the most abundant protein in mammals and is

consistently required across life stages, from fetal development to wound repair

(e.g., Gelse et al. 2003; Hildebrand et al. 2005; Cohen 2006), strong selection for

COL1a1 transcriptional efficiency would appear to be necessary. In addition, for a

gene the size of COL1a1, the observation of so few transposable elements (4,

including 1 Alu) is unusual given their frequency in the genome, yet interestingly,

this pattern is also consistent with human genes with high expression

(International Human Genome Sequencing Consortium 2001; Hackenberg et al.

2005; Pozzoli et al. 2008). COL2a1 may have similarly short, GC-rich introns

with few transposable elements (4, again including only 1 Alu) as COL1a1

because of similar constraints on gene expression. This may be the case as type II

collagen, encoded by COL2a1, is the main structural protein of cartilage and is,

therefore, similarly important functionally in bone development (Gelse et al. 2003;

Cohen 2006). Finally, while certain introns (e.g., first introns) may exhibit lower

evolutionary divergence as they are known to be enriched for transcription factor

binding domains, reduced divergence across COL1a1 introns overall may also be

related to intron length. For example, transcription enhancers often accumulate

within the ~150 bp of exon-intron boundaries (e.g., Majewski and Ott 2002), and

thus, given the relatively short intron lengths at COL1a1, purifying selection may

be expected to be higher if the number of intron nucleotide sites that are

selectively neutral is also relatively reduced.

Conclusion

Although previous analyses of the COL1a1 protein have invoked positive

selection to explain the molecular rate heterogeneity observed across both

lineages and time, our analysis of a dataset spanning ~450 My of vertebrate

evolution indicates that these patterns are best explained by variation in purifying

selection. Furthermore, although low evolutionary COL1a1 site rates are

consistent with the abundance of DAMs found in the human population today, our

unique approach may predict not only the location of these mutations, but also

their degree of severity. In fact, these patterns imply that COL1a1-related diseases

are likely not isolated to humans, or primates in general, which suggests that even

distantly-related vertebrates would be suitable models for research on common

bone-related diseases such as osteoporosis. Finally, the unusual patterns seen for

the COL1a1 intron structure may be a signature of selective constraint for high

gene expression that has interesting implications. Specifically, given the inferred

history of purifying selection on them, these non-coding regions should also be

considered when predicting which variants at COL1a1 may explain potentially

deleterious, or even adaptive, bone-related phenotypes today.

Acknowledgments

The authors thank G. Perry and C. Hepp for database and PAML assistance,

respectively, and S. Kumar for comments on early versions of the analyses. This

work was supported by a National Science Foundation grant DEB-0909637 to

B.C.V. and D.A.S.

Table 1

Human Clade A Collagen Gene Exon and Intron Characteristics

Gene Characteristic      COL1a1       COL1a2       COL2a1       COL3a1       COL5a2
Gene size (bp)           16,012       35,362       30,915       37,285       145,535
No. of Exons             51           52           54           51           54
Exon length (bp)         86 ± 52      79 ± 49      83 ± 53      86 ± 53      83 ± 54
Intron length (bp)       193 ± 136    563 ± 502    416 ± 442    454 ± 316    1,380 ± 1,297
Exon GC-content (%)      66 ± 5       59 ± 8       64 ± 6       59 ± 7       57 ± 7
Intron GC-content (%)    59 ± 7       35 ± 5       55 ± 6       30 ± 4       32 ± 4

Note: Gene size is the distance from the start to the stop codon. Length and GC-content values denote means and standard deviations, excluding first introns and splice sites. See supplementary tables 1a, b, and 6a, b (Appendix A) for more information.

Table 2

Clade A Collagen Gene Human-Chimpanzee Divergence Estimates

Gene               COL1a1          COL1a2          COL2a1          COL3a1          COL5a2
Sites              n^b     d^c     n       d       n       d       n       d       n       d
Introns^a          9,463   0.69    28,160  0.97    21,619  1.09    22,242  0.94    71,769  0.74
Synonymous         1,212   1.49    1,140   0.79    1,220   1.31    1,195   0.50    1,209   1.16
Nonsynonymous      3,180   0       2,958   0.10    3,240   0.03    3,203   0.03    3,288   0.06

a Excludes first introns; see table 1 and supplementary table 6c (Appendix A) for more information.

b Number of nucleotides

c Divergence as number of differences per nucleotide (%)

Fig. 1. The COL1a1 gene locus, with all coding and non-coding regions to scale.

DNA sequenced region in human and chimpanzee includes 1,223 bp of the

promoter (striped box), 263 bp of the 5’ and 3’ mRNA untranslated regions (white

boxes), the 4,392 bp of coding sequence for all 51 exons (black boxes), with non-

coding regions in between. The “*” denotes a 496-bp gap in the collected

sequence of intron 25 due to the presence of an Alu element.

Fig. 2. Locations of DAMs are shown as vertical markers along the COL1a1

amino acid sequence for (A) category 4 mutations, which include lethal OI II and

II/III groups, and (B) category 1 mutations, which include OI I and osteoporosis-

associated groups. Long and short vertical markers reflect non-glycine and

glycine substitutions, respectively, across the three protein domains. The relative

estimated evolutionary rates along the sequence are calculated from 14 diverse

vertebrate taxa using the Subramanian and Kumar (2006) analysis, with a

trendline summarizing a sliding window of averages across windows each of 10

amino acid residues (position number based on human sequence). See Materials

and Methods and supplementary table 4a (Appendix A) for more information.

Fig. 3. Frequency distributions of intron length for human clade A collagen genes,

excluding first introns. The X-axis reflects the intron length bins (all scales are the

same), and the Y-axis shows the number of introns within these bins. See

supplementary tables 6a and d (Appendix A) for more information.

CHAPTER 3: HAPLOTYPE STRUCTURE AND AMINO ACID

POLYMORPHISM AT HUMAN COL1a1

Abstract

Bone strength and the incidence and severity of related skeletal disorders like

osteoporosis vary significantly among human populations of different ethnic

origin due in part to underlying genetic differentiation, such as at the bone

structural protein gene, collagen type I alpha 1 (COL1a1). Previous research has

shown that, not only has the COL1a1 protein been highly conserved over deep

vertebrate evolutionary time, but that exon-intron structure has been as well,

suggesting that noncoding regions represent functionally important domains for

examining bone-related phenotypes. Here, we have collected DNA sequence

variation from both coding and noncoding regions of the >17-kb COL1a1 locus in

192 chromosomes from 10 ethnically and geographically diverse natural human

population samples to determine how recent and ancient evolutionary pressures

have differed and what this predicts about bone-related variation in populations

today. Surprisingly, we find population diversity for amino acid polymorphism to

be higher than that predicted from clinical studies, significant geographic

variation for an unusual haplotype block structure that includes the 5’ upstream

region of noncoding sequence, and an ancient origin for the functionally relevant

and well-studied Sp1 binding site polymorphism. While previous studies have

long focused on amino acid variation at COL1a1 as a source of deleterious

function, our evolutionary approach has led us to conclude otherwise; specifically,

protein variation is not only more abundant and older than believed, but

noncoding regions may also greatly contribute to existing variation in the

expression of bone mineral density across diverse human populations.

Introduction

Bone strength, measured as bone mineral density (BMD), varies significantly

among populations with individuals of African ancestry tending to have greater

bone strength and better overall bone quality than other populations (Looker et al.

1998; Bachrach et al. 1999; Barrett-Connor et al. 2005; Baxter-Jones et al. 2010).

Similar trends are also seen for disorders related to skeletal strength including

osteoporosis, which is generally at its highest frequency among western

Europeans (e.g., Lauderdale et al. 1997; Looker et al. 1997; Melton 1997; Barrett-

Connor et al. 2005). Although this variation in bone strength and related disease

susceptibility is due in part to environmental differences among populations (e.g.,

dietary intake of calcium and vitamin D; Matkovic et al. 1990; Lau et al. 2005;

Adami et al. 2009; Musumeci et al. 2009), the majority of this phenotypic

variation is also genetically related given that BMD is estimated to be as high as

~80% heritable (e.g., Gueguen et al. 1995; Prentice 2001; Brown et al. 2005;

Videman et al. 2007). As bone-related disorders like osteoporosis alone impact

>200 million worldwide (Reginster and Burlet 2006), it is clear that models

leading to the characterization of potential sources of this genetic variation would

be highly valuable, not only for screening genotypes, but also for the development

of effective drug treatments (e.g., Qureshi et al. 2002).

Among the candidate genes most commonly associated with variation in

bone strength and related disease susceptibility across populations is COL1a1

(e.g., Garcia-Giralt et al. 2002; Stewart et al. 2006; Jiang et al. 2007; Ioannidis et

al. 2007; Kaufman et al. 2008), which encodes the primary subunit of the main

structural protein in bone, type I collagen. Over 600 skeletal and connective tissue

disease-associated mutations (DAMs) have been identified within this single

gene, the majority of which are linked to osteoporosis, osteogenesis imperfecta

types I-IV, and Ehlers-Danlos Syndrome types I and VIIA (Dalgleish 1997;

Marini et al. 2007). One consideration is that estimates of amino acid

polymorphism abundance and deleterious function are biased as they come from

clinical genotype screens based on bone-related disorders. In fact, our previous

comparisons of estimated single amino acid evolutionary rates across a diverse

sample of taxa with the locations of DAMs in humans developed a model that

shows a history characterized by purifying selection, but also shows clustering of

variation in selective constraint across the protein sequence (Stover and Verrelli

2010). Thus, while it is clear that amino acid polymorphism can result in

deleterious phenotypes, we may predict that not all amino acid variation in natural

populations has severe or even detectable impacts on protein function.

Another consideration is the extent to which the focus of previous studies

on COL1a1 exons has also biased our perception of where functional variation

accumulates. For example, although one study has measured COL1a1 polymorphism

(Chan et al. 2008), it sampled only admixed-American populations and focused

on coding regions; thus, our understanding of other gene regions and of the natural

global human population is deficient at best. In fact, in the first evolutionary

analysis of coding and noncoding COL1a1 regions representing ~450 My of

vertebrate divergence, we previously showed that intron structure and content

have also been historically conserved (Stover and Verrelli 2010), likely due to the

need to maintain high transcriptional efficiency of this gene given its importance

in fetal development and wound repair (e.g., Gelse et al. 2003; Hildebrand et al.

2005; Cohen 2006). As such, introns and other noncoding regions may actually

represent better candidates for functional variation related to BMD expression if

amino acid variation is sufficiently rare in frequency. Nonetheless, while it is

clear that purifying selection has historically impacted noncoding regions of

COL1a1, it is unknown how selective pressures have impacted these regions over

recent evolutionary time.

The strongest evidence of association between COL1a1 noncoding genetic

variation and variation in bone strength and osteoporotic fracture-risk is a single

nucleotide polymorphism (SNP) located in an Sp1 transcription factor binding site

in the first intron (e.g., Efstathiadou et al. 2001; Bandres et al. 2005; Ralston et al.

2006; Jiang et al. 2007). This noncoding SNP increases the expression of COL1a1

and likely accounts for the associated change in the ratio of COL1a1 and

COL1a2 subunits in type I collagen that reduces its structural integrity (Grant et

al. 1996; Mann et al. 2001; Jin, van’t Hof et al. 2009). Despite its seemingly

negative impact on function, this SNP has reached a relatively high frequency

among individuals of western European ancestry (>20%; e.g., Grant et al. 1996;

Ralston et al. 2006; Jiang et al. 2007), but is only rarely found among Africans

(e.g., Beavan et al. 1998) and is absent among Asians (e.g., Han et al. 1999;

Nakajima et al. 1999; Lau et al. 2004). While these previous studies may suggest

an origin of the Sp1 SNP in Europe given its frequency, this would imply a

relatively recent age. As this SNP is the most-studied in genotype-phenotype

relationships at COL1a1, an understanding of its age and evolutionary history

would address questions of its relevance over time with respect to bone-related

disease and gene expression in humans.

Here we conduct the first natural population genetic analyses in a global

and random sample of humans to address the functional importance of COL1a1

coding and noncoding polymorphism. We are specifically interested in testing

hypotheses to answer: (1) Is amino acid polymorphism reflective of clinical

studies in that it is typically deleterious and thus expected to be rare? (2) Are there

significant patterns of variation associated with noncoding regions that imply

functional relevance as predicted by our previous comparative species studies? (3)

Finally, what does the age and geographic pattern of the Sp1 functionally relevant

SNP predict about collagen-related gene expression in the global population?

Materials and Methods

Population Samples

As our overall objective was to obtain a general estimate of how COL1a1 genetic

diversity is distributed across natural populations from a global perspective, our

sample included 96 individuals (total of 192 chromosomes; table 3) with no

known bone abnormalities (i.e., a random sample with respect to phenotypic

diversity) representing 10 geographically and ethnically diverse populations

publicly available from the Coriell Institute of Medical Research (Camden, NJ).

These populations also reflect a random sample of human genetic diversity

outside and inside of sub-Saharan Africa, the latter typically having higher

nucleotide and haplotype diversity owing to its older history and larger estimated

effective population size (Ne) compared to the recent demographic history of

proposed expansion associated with “non-African” groups (e.g., Rosenberg et al.

2002; Cavalli-Sforza and Feldman 2003; Tishkoff and Verrelli 2003; Campbell

and Tishkoff 2008). These samples have been previously used in similar

population statistical analyses by this lab and others, and enable comparisons of

genetic diversity across loci (e.g., Bersaglieri et al. 2004; Xu et al. 2005; Evans et

al. 2006; Claw et al. 2010). Samples include: sub-Saharan African (18

chromosomes, catalog number HD12), North African (14, HD11), Middle Eastern

(20, HD05), Russian (20, HD23), Chinese (20, HD32), Japanese (20, HD07),

Southeast Asian (20, HD13), Mexican (20, HD08), Northern European (20,

HD01), and Italian (20, HD21). Finally, as the chimpanzee is our closest living

relative with an estimated molecular divergence time from our common ancestor

at ~4-6 My (e.g., Kumar et al. 2005), we used COL1a1 nucleotide sequence

previously collected by us (Stover and Verrelli 2010) from a Pan troglodytes

verus (western Africa subspecies) individual to conduct analyses that require

inferences of ancestral and derived states and estimates of locus-specific

divergence.

DNA Amplification and Sequencing

DNA sequence data were collected for a >17-kb region of the COL1a1 locus (fig.

4), which resides on chromosome 17q21 in human and chimpanzee genomes. Our

previous protocol for the generation and DNA re-sequencing of COL1a1

nucleotide sequences was followed here, specifically using short polymerase

chain reaction (PCR) fragments to avoid problems associated with sequence

secondary structure due to the repetitive nature of coding regions and the

unusually high GC-content (~60%) of this locus (Stover and Verrelli 2010). As

before, DNA sequencing of PCR fragments was conducted on an ABI 3730

(Applied Biosystems, Foster City, CA) and sequences were aligned and edited

using Sequencher v. 4.5 (Gene Codes, Ann Arbor, MI).

Statistical Analyses

Because of the documented genetic patterns associated with the different

evolutionary histories among geographic regions, all diversity analyses were

conducted for the global population as well as for each population sample

separately. Although synonymous exon sites and intron sites are not completely

“neutral,” in human datasets these “silent” sites are typically considered as the

most appropriate proxy for neutral evolution, especially when compared to

nonsynonymous sites in exons that can directly impact the protein. Other

noncoding regions may also have putative “functional” effects on gene expression

and regulation and even exhibit significant evolutionary constraint, such as first

introns (e.g., Bornstein et al. 1987; Majewski and Ott 2002), and 5’ and 3’

untranslated mRNA (UTR) and promoter regions (e.g., Wray et al. 2003;

Haygood et al. 2007; Cheung and Spielman 2009). In fact, although there is a

known minimal promoter length required for COL1a1 (Chu et al. 1985), sites as

far as 2 kb upstream of the transcription start site are known to affect COL1a1

expression (e.g., Garcia-Giralt et al. 2005). Thus, all diversity analyses were

applied to several gene regions, in addition to simply exons and introns, to test

hypotheses of functional constraint across the COL1a1 locus.

The DnaSP v. 5.1 program (Rozas et al. 2003) was used for genetic

diversity statistic estimates, unless otherwise noted. Insertion-deletion (indel)

polymorphisms were excluded from all nucleotide diversity estimates given that

they are not generally expected to reflect neutral mutation rates in the way that

SNPs do. We used the PHASE v. 2.1.1 program (Stephens et al. 2001) to

statistically resolve heterozygous sequences for each individual into haplotypes.

This analysis was repeated with 100, 500, and 800 iterations with phased

haplotypes from the run with the highest average goodness-of-fit used in

subsequent analyses (Stephens et al. 2001). Genetic diversity was estimated as

Watterson’s (1975) θW, which is based upon the number of segregating sites (S)

corrected for sample size, and as θπ, which is based upon the average number of

pairwise differences among sequences (Nei 1987). Under neutrality, these two

estimates of the parameter θ = 4Neµ (for an autosomal locus, with “µ” denoting

mutation rate per bp) are expected to be similar, which can be tested using

Tajima’s (1989) D statistic. This statistic can indicate skews in the SNP frequency

spectrum due to deviations from neutrality, yet it can also reflect demographic

attributes. To distinguish among these scenarios, we used the MS program

(Hudson 2002) to conduct coalescent simulations similar to other human analyses

in modeling parameters, such as population size and recombination (e.g., Verrelli

and Tishkoff 2004; Auton et al. 2009; Scheinfeld et al. 2009; Claw et al. 2010)

when necessary (see Results).
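As an illustration only (the estimates reported here were obtained with DnaSP, not with this code), the diversity statistics described above can be sketched in Python. The function name `diversity_stats` and the toy haplotypes are hypothetical; the sketch assumes phased, equal-length, gap-free sequences, reports per-locus rather than per-site values, and uses the variance constants given by Tajima (1989).

```python
from itertools import combinations
from math import sqrt


def diversity_stats(haplotypes):
    """Return (S, theta_W, theta_pi, Tajima's D) for aligned haplotypes.

    Values are per locus (not per site). D is None when S == 0.
    Hypothetical helper for illustration; assumes at least 4 sequences.
    """
    n = len(haplotypes)
    # Segregating sites: alignment columns carrying more than one allele.
    S = sum(1 for column in zip(*haplotypes) if len(set(column)) > 1)
    # Watterson's estimator: S corrected for sample size via a1.
    a1 = sum(1.0 / i for i in range(1, n))
    theta_w = S / a1
    # theta_pi: average number of pairwise differences among sequences.
    pairs = list(combinations(haplotypes, 2))
    theta_pi = sum(
        sum(x != y for x, y in zip(s, t)) for s, t in pairs
    ) / len(pairs)
    if S == 0:
        return S, theta_w, theta_pi, None
    # Constants for the variance of (theta_pi - theta_W), Tajima (1989).
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    tajima_d = (theta_pi - theta_w) / sqrt(e1 * S + e2 * S * (S - 1))
    return S, theta_w, theta_pi, tajima_d


# Toy data: three singleton variants, i.e., an excess of rare alleles.
toy = ["AAAA", "AAAT", "AATA", "ATAA"]
S, theta_w, theta_pi, tajima_d = diversity_stats(toy)
print(S, round(theta_w, 3), theta_pi, round(tajima_d, 3))
```

In this toy sample every variant is a singleton, so θπ falls below θW and D is negative, the same direction of skew that is expected after population expansion.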

Given the association of COL1a1 with variation in bone strength across

populations, we are interested in determining whether specific variants or

haplotypes show significant genetic differentiation among samples. We first used

an FST analysis from Hudson et al. (1992) to relate information about the

proportion of variation shared within and between groups. However, we also used

Hudson’s (2000) Snn statistic as it examines genetic differentiation across samples

by considering the actual pairwise differences among haplotypes, and not simply

haplotype frequencies. This analysis is statistically more powerful in this respect

and is less sensitive to sample size (Hudson 2000), and it also provides a

simulation of the data to detect statistical significance, with P values corrected for

multiple comparisons by a standard Bonferroni method.
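The two differentiation measures can be contrasted with a minimal Python sketch. This is not the implementation used for the analyses here: the function names are hypothetical, only two samples are compared, the within-sample term of FST is taken as an unweighted mean of the two samples, and the permutation step that Hudson (2000) uses to assess the significance of Snn is omitted.

```python
from itertools import combinations, product


def n_diff(a, b):
    """Number of sites at which two aligned haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def hudson_fst(pop1, pop2):
    """FST as 1 - Hw/Hb (after Hudson et al. 1992), two samples only.

    Hw: mean pairwise differences within samples (unweighted mean of
    the two samples, an assumption of this sketch).
    Hb: mean pairwise differences between samples (must be > 0).
    """
    def mean_within(pop):
        pairs = list(combinations(pop, 2))
        return sum(n_diff(a, b) for a, b in pairs) / len(pairs)

    h_w = (mean_within(pop1) + mean_within(pop2)) / 2.0
    between = list(product(pop1, pop2))
    h_b = sum(n_diff(a, b) for a, b in between) / len(between)
    return 1.0 - h_w / h_b


def snn(pop1, pop2):
    """Nearest-neighbor statistic (after Hudson 2000), two samples only.

    For each haplotype, take the fraction of its nearest neighbors
    (ties included) that come from its own sample; Snn is the mean.
    """
    labeled = [(s, 0) for s in pop1] + [(s, 1) for s in pop2]
    total = 0.0
    for j, (seq_j, lab_j) in enumerate(labeled):
        dists = [
            (n_diff(seq_j, s), lab)
            for k, (s, lab) in enumerate(labeled)
            if k != j
        ]
        d_min = min(d for d, _ in dists)
        nearest = [lab for d, lab in dists if d == d_min]
        total += sum(lab == lab_j for lab in nearest) / len(nearest)
    return total / len(labeled)


# Toy samples (hypothetical): strongly differentiated haplotypes.
pop_a = ["AAAA", "AAAA", "AAAT"]
pop_b = ["TTTT", "TTTT", "TTTA"]
print(hudson_fst(pop_a, pop_b), snn(pop_a, pop_b))
```

Because Snn is built from pairwise differences among individual haplotypes rather than from haplotype frequencies alone, it approaches 1 when nearest neighbors are always drawn from the same sample and falls toward the value expected by chance when samples are undifferentiated.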

While our Snn analyses are intended to detect haplotype differentiation

among groups, we also used analyses of linkage disequilibrium (LD), the

rationale of which is that haplotype structure across our sample may be a result of

different haplotypes (i.e., different combinations among SNPs in different

groups), or it could simply be the result of similar haplotypes that differ in

frequency. By using additional analyses of LD, we should be able to distinguish

between these scenarios by determining if specific associations among SNPs or

haplotype “blocks” are evident within and among groups, which may then

explain any detected significant haplotype differentiation from our Snn analyses

(e.g., Claw et al. 2010). We first used a simple model that measures correlations

among SNPs as r², with statistical significance assessed by chi-squared tests after

a Bonferroni correction, to detect LD across COL1a1. However, as LD is not

unusual in the human genome (i.e., as a result of historical population expansion

and reduced Ne), it is appropriate to incorporate a background model of

recombination when interpreting significant LD from an evolutionary perspective

(i.e., Hudson 2001). Thus, we also used the LDhat program (McVean et al. 2002),

which applies the approximate-likelihood method of Hudson (2001) and uses

permutation analyses to determine if pairwise comparisons among SNPs exhibit

significant LD given a locus-specific estimate of θ and the recombination

parameter ρ = 4Nec (where “c” denotes the recombination rate per nucleotide).

Variants that are rare in frequency and uninformative (as determined by LDhat)

were omitted prior to the analysis.
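The simple pairwise model can be sketched as follows in Python. This is an illustration under stated assumptions rather than the code used here: function names are hypothetical, both sites are assumed to be biallelic and polymorphic in phased haplotypes, the test statistic is taken as n·r² with one degree of freedom, and the Bonferroni step simply divides the significance threshold by the number of pairwise tests. The likelihood-based LDhat analysis is not reproduced.

```python
from math import erfc, sqrt


def r_squared(haplotypes, i, j):
    """r^2 between two biallelic, polymorphic sites in phased haplotypes."""
    n = len(haplotypes)
    a = haplotypes[0][i]  # arbitrary reference allele at site i
    b = haplotypes[0][j]  # arbitrary reference allele at site j
    p_a = sum(h[i] == a for h in haplotypes) / n
    p_b = sum(h[j] == b for h in haplotypes) / n
    p_ab = sum(h[i] == a and h[j] == b for h in haplotypes) / n
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


def ld_chi_square_p(haplotypes, i, j):
    """Chi-squared statistic (n * r^2, 1 df) and its P value."""
    chi2 = len(haplotypes) * r_squared(haplotypes, i, j)
    # Survival function of a 1-df chi-squared variable.
    return chi2, erfc(sqrt(chi2 / 2.0))


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


# Toy haplotypes (hypothetical): sites 0 and 1 are in complete LD.
toy = ["AC", "AC", "GT", "GT"]
chi2, p = ld_chi_square_p(toy, 0, 1)
print(r_squared(toy, 0, 1), chi2, p, bonferroni_threshold(0.05, 10))
```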

In the case of the Sp1 binding site SNP (or others that emerge) where we

desire to estimate the age of a mutation, we used the estimator of Thomson et al.

(2000), which has been similarly used by others for human datasets (e.g., Verrelli

et al. 2002; Scheinfeld et al. 2009; Claw et al. 2010) as assumptions of population

equilibrium and recombination are relaxed. The age estimate (t) involves the

relationship

$$ t = \frac{\sum_{i=1}^{n} x_i}{n\mu} $$

where x_i is the number of mutational differences between the ith sequence and the

most recent common ancestor of all sequences, n is the total number of sequences

in the sample, and µ is the mutation rate. The latter parameter is a “neutral” gene-

specific estimate based on the number of nucleotide substitutions between human

and chimpanzee COL1a1 sequences divided by twice the estimated divergence

time between species (i.e., 4-6 My; Kumar et al. 2005).
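
As a worked illustration of the age estimate, the sketch below computes the neutral rate from divergence and then applies the estimator. It assumes that µ is expressed per locus per year (so that it is on the same scale as the x_i counts, which are tallied over the whole sequence). The function names and all input numbers are hypothetical and are not values from this study.

```python
def neutral_mutation_rate(substitutions, divergence_time_years):
    """Substitutions between two species divided by twice the divergence time.

    The result is a rate per locus per year when `substitutions` is counted
    over the whole locus (an assumption of this sketch).
    """
    return substitutions / (2.0 * divergence_time_years)


def thomson_age(x, mu):
    """Age estimate t = sum(x_i) / (n * mu).

    x  : mutational differences between each sampled sequence and the
         most recent common ancestor of the sample
    mu : mutation rate per locus per year
    """
    n = len(x)
    if n == 0 or mu <= 0:
        raise ValueError("need at least one sequence and a positive mutation rate")
    return sum(x) / (n * mu)


# Hypothetical inputs: 100 substitutions over a 5-My divergence time,
# and four sampled sequences carrying 1, 2, 3, and 2 derived mutations.
mu = neutral_mutation_rate(100, 5e6)
print(thomson_age([1, 2, 3, 2], mu))
```

Because the divergence time enters through µ, using the 4-My or 6-My bound instead of a midpoint rescales the age estimate proportionally.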

Finally, we also examined human-chimpanzee divergence in a test of

neutrality that contrasts intraspecific and interspecific patterns. As levels of

polymorphism and divergence are expected to be correlated under a neutral model

of evolution, we can compare these two classes of variation at putative functional

sites with those at our defined “neutral” sites using a Fisher’s exact test (i.e.,

McDonald and Kreitman 1991). This test enables us to examine how selective

pressures have differentially shaped the magnitude of human lineage-specific

variation over both short and long evolutionary timescales.
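
The contrast described above reduces to a 2x2 table of polymorphic versus fixed counts for two site classes. The sketch below is a minimal, standard-library implementation of a two-sided Fisher's exact test, applied to the global counts reported in table 4. The function name is hypothetical, and the two-sided convention used here (summing all tables no more probable than the observed one) is an assumption, since the software used for the original test is not stated in this section.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are site classes (e.g., nonsynonymous, synonymous); columns are
    polymorphic and fixed counts. The P-value sums the hypergeometric
    probabilities of all tables that are no more likely than the observed one.
    """
    row1 = a + b
    col1 = a + c
    total = a + b + c + d
    denom = comb(total, row1)

    def prob(k):
        # Probability that k of the row-1 sites fall in column 1.
        return comb(col1, k) * comb(total - col1, row1 - k) / denom

    p_obs = prob(a)
    k_min = max(0, row1 - (total - col1))
    k_max = min(row1, col1)
    # Small relative tolerance so that ties with the observed table are included.
    return sum(prob(k) for k in range(k_min, k_max + 1)
               if prob(k) <= p_obs * (1 + 1e-9))


# Global sample, table 4: nonsynonymous (6 polymorphic, 0 fixed)
# versus synonymous (12 polymorphic, 18 fixed).
p_syn = fisher_exact_two_sided(6, 0, 12, 18)
# Nonsynonymous versus silent (109 polymorphic, 83 fixed).
p_silent = fisher_exact_two_sided(6, 0, 109, 83)
print(round(p_syn, 2), round(p_silent, 2))
```

Under these assumptions the nonsynonymous versus synonymous contrast evaluates to approximately 0.019, in line with the 0.02 reported in table 4.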

Results

Polymorphism and Divergence

A total of 16,993 bp of the COL1a1 locus was collected from each individual in

our human sample (fig. 4; supplementary table 1, Appendix B). Overall silent site

SNP diversity at COL1a1 (table 3) is consistent with the human genome average

of ~0.1% as well as with that seen in similar population samples (e.g.,

Sachidanandam et al. 2001; Verrelli et al. 2002; Verrelli and Tishkoff 2004;

Garrigan and Hammer 2006; Claw et al. 2010). While diversity in this sample

appears to be skewed towards rare alleles (as indicated by a negative Tajima's D;

table 3), this is not surprising as it is consistent with the aforementioned historical

population expansion signature seen in molecular studies. Of note is that

population samples also exhibit similar SNP diversity (i.e., θπ; table 3). In all,

COL1a1 silent site polymorphism appears to reflect what is generally expected

under neutrality.
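
For readers unfamiliar with how the rare-allele skew is summarized, the sketch below computes Tajima's D from the number of chromosomes, the number of segregating sites, and the average number of pairwise differences, using the standard constants from Tajima (1989). It is an illustration only: the function name is hypothetical, and the input for pairwise differences must be a per-locus count rather than the per-site percentage reported as θπ in table 3. The example inputs are illustrative and are not taken from this dataset.

```python
import math


def tajimas_d(n, s, pi):
    """Tajima's D.

    n  : number of sampled chromosomes
    s  : number of segregating sites
    pi : average number of pairwise differences per locus (not per site)
    """
    if n < 4 or s == 0:
        return float("nan")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)
    # Difference between the two estimators of theta, scaled by its
    # estimated standard deviation.
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


# Illustrative inputs: 10 chromosomes, 16 segregating sites, and an average
# of 3.888889 pairwise differences give a D of approximately -1.446.
print(tajimas_d(10, 16, 3.888889))
```

A negative value indicates an excess of rare alleles relative to the neutral equilibrium expectation, which is the direction of the global estimate in table 3.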

In examining variation outside of introns (and synonymous sites), only

one SNP, which is found in the 3’ UTR, reaches a frequency >5%. Thus, in

general, variation associated with promoter, UTR, and nonsynonymous sites is

relatively rare (supplementary table 1, Appendix B). Even so, we identified 6

nonsynonymous SNPs found among 6 populations, all but one of which are

singletons (supplementary table 2, Appendix B). Interestingly, while 2 of these

variants, both found in the triple-helix domain, have been reported previously (1

with unknown phenotype and 1 in association with osteopenia; Spotila et al. 1994;

Dalgleish 1997), the other 4 variants are novel (i.e., never documented in clinical

reports; supplementary table 2, Appendix B). Two of these novel variants also

occur in the triple-helix domain (including one glycine-altering mutation) and the

others in the C-terminal non-collagenous domain.

Our interspecific comparisons with chimpanzee find no fixed

nonsynonymous differences and estimates of divergence across other putatively

functional domains, such as UTR and promoter regions, are comparable to silent

sites (supplementary table 1, Appendix B). When contrasted with intraspecific

variation for these site classes, a statistical excess of nonsynonymous

polymorphism is apparent (McDonald-Kreitman test, table 4). While this pattern

may be consistent with balancing selection, under this scenario we would expect

adaptive variants not to be rare in frequency (e.g., Akashi and

Schaeffer 1992). In fact, when we removed rare variants (<5%) from all

polymorphic classes, patterns of polymorphism and divergence no longer deviate

from those expected under neutrality (data not shown).

Haplotype Structure and Age Estimates

As is typical of analyses involving sub-Saharan Africans, we find significant

population differentiation when these samples are compared with other global groups. However, while

differentiation among these and human population samples in general is ~10-15%

globally (e.g., Rosenberg et al. 2002; Claw et al. 2010), several of our groups

differ dramatically for COL1a1, with pairwise comparisons varying from 0-35%

(supplementary fig. 1, Appendix B). Although contrasts involving sub-Saharan

Africans largely contribute to this diversity, even non-African samples exhibit

differentiation as high as 22%. While sampling variance may potentially explain

this observation for FST, our statistical assessment using Hudson's Snn reveals

significant differentiation not only between contrasts involving sub-Saharan

Africans, but also among several pairs, most involving Asian population samples

(supplementary fig. 1, Appendix B).
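
The nearest-neighbor statistic used here can be sketched in a few lines. The version below is a simplified reading of Hudson's Snn, not the implementation used in this study: for each sequence it finds the nearest neighbors by Hamming distance (ties share credit equally) and records the fraction drawn from the same population sample, then averages over all sequences. Significance is approximated by permuting population labels. The function names, the toy sequences, and the number of permutations are hypothetical.

```python
import random


def hamming(a, b):
    """Number of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def snn(seqs, pops):
    """Average fraction of each sequence's nearest neighbors from its own population."""
    n = len(seqs)
    total = 0.0
    for j in range(n):
        dists = [(hamming(seqs[j], seqs[k]), k) for k in range(n) if k != j]
        d_min = min(d for d, _ in dists)
        nearest = [k for d, k in dists if d == d_min]
        same = sum(1 for k in nearest if pops[k] == pops[j])
        total += same / len(nearest)
    return total / n


def snn_permutation_p(seqs, pops, n_perm=1000, seed=0):
    """Observed Snn and the fraction of label permutations with Snn at least as large."""
    rng = random.Random(seed)
    observed = snn(seqs, pops)
    labels = list(pops)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if snn(seqs, labels) >= observed:
            hits += 1
    return observed, hits / n_perm


# Toy data (hypothetical): two fully differentiated samples of two sequences each.
toy_seqs = ["AAAA", "AAAT", "TTTT", "TTTA"]
toy_pops = ["X", "X", "Y", "Y"]
print(snn_permutation_p(toy_seqs, toy_pops))
```

Values near 1 indicate that sequences cluster by population sample, whereas values near 0.5 are expected for two equally sized samples drawn from a single panmictic population.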

As illustrated in our LD analysis (fig. 4), several significant correlations

among SNPs are evident, primarily restricted to the 5’ and 3’ ends of COL1a1.

Overall we do not find significantly different combinations of SNPs among

samples, but instead that the significant Snn patterns are largely explained by

differences in haplotype frequency (supplementary fig. 2, Appendix B). Our

LDhat analysis finds that very few pairwise comparisons are significantly

associated given a background recombination rate model, suggesting overall low

rates of crossing-over in the region. Interestingly, several SNPs primarily located

between introns 11-16 exhibit significantly less LD than expected (fig. 4);

therefore, we employed the HOTSPOTTER program of Li and Stephens (2003),

which tests the hypothesis that elevated haplotype diversity in a gene region is

inconsistent with that observed across the gene given local and locus-specific

estimates of recombination rates. We first examined the entire global dataset and

located a “hotspot” centered on intron 13 with an estimated ρ 9.75-fold greater

than the background (ρ = 4) and that is 1.9 × 10^5 times more likely (P < 0.05) to

explain the observed haplotype diversity than is a constant ρ model. When

populations were analyzed independently, this same pattern was still evident; yet,

likely due to a lack of statistical power with smaller sample sizes, significance

was not detected for the Italian, Middle Eastern, North African, and sub-Saharan

African samples.

With this elevated haplotype diversity centered on intron 13 identified by

the LD analyses, we re-examined the overall haplotype structure and significant

Snn patterns in the 5' and 3' “haploblocks” (supplementary fig. 2, Appendix B). As

such, FST and Snn were recalculated and permutation analyses performed as above

for each of the 5’ (positions 368-4245) and the 3’ (positions 5714-16094) regions,

excluding SNPs in the hotspot (positions 4560-5430; fig. 4). In the 5’ region,

several contrasts differ by as much as 62% (supplementary fig. 3, Appendix B),

yet conversely, in the 3’ region, estimates of genetic differentiation are relatively

much lower (supplementary fig. 4, Appendix B). Thus, it appears that the overall

haplotype structure variation among groups is largely explained by a ~4-kb

haploblock in the 5’ region that coincidentally includes the first-intron Sp1

binding site SNP (fig. 4). Specifically, this block is entirely absent from our

Chinese and Japanese samples and is found only once in the Southeast Asian

sample.

For our analysis of the Sp1 SNP, based on comparison with chimpanzee

sequence, the Sp1-T allele is the derived state, which shows considerable

variation in frequency (0-20%) among geographic regions (supplementary table 3,

Appendix B). For our Thomson et al. (2000) age estimate, we first constructed a

neighbor-joining tree for the entire dataset using MEGA v. 4 (Tamura et al. 2007),

which shows that Sp1-T alleles do not form a single group (supplementary fig. 5,

Appendix B). Of the 17 Sp1-T bearing haplotypes, even if we look at only the 11

that form a single group (and conservatively ignore what appear to be

recombinants), we obtain an age estimate of 190 ± 38 Ky. It is possible that

increased recombination rates could result in artificially elevated age estimates for

mutations as “old” and “young” haplotypes may recombine frequently; however,

other than the apparent “hotspot,” estimates of LD at COL1a1 suggest that

recombination rates overall are not unusual. In addition, the overall Sp1-T

haplotype frequency and associated SNP diversity (θπ = 0.04%) are not consistent

with a recent increase in its frequency by some form of population expansion or

selection. Altogether, with its geographic distribution, these data strongly support

a relatively old age estimate of the Sp1 SNP.

Discussion

This study follows up on the observations of Stover and Verrelli (2010) that

suggested introns and other noncoding regions at COL1a1 have been

evolutionarily conserved across vertebrates and represent viable candidates to

identify functional and disease-related variation today. With the first population

genetic contrasts of coding and noncoding COL1a1 gene regions in a globally

diverse and random sample of humans, our current analyses find unusual patterns

of amino acid polymorphism relative to divergence over deep time, evidence of a

geographically varying haploblock structure outside of coding regions, and

support for an ancient origin of the functionally-relevant Sp1 binding site variant.

Here, we discuss the implications these novel observations have for the long- and

short-term evolutionary pressures that have shaped bone-related phenotypes and

disease in natural human populations.

As our previous analyses of the protein sequence across ~450 My of

vertebrate evolution show a long history of strong functional constraint (Stover

and Verrelli 2010), we would predict that amino acid polymorphism in the

population today would be rare. This rationale is also supported by the

documented abundance of COL1a1 amino acid polymorphism associated with

severe disease in clinical studies. Thus, the observation of 6 amino acid

polymorphisms in our sample may be considered surprising. In addition, a

previous study that focused on coding regions in 4 populations of American-

admixed individuals identified 3 rare amino acid variants, 2 of which were novel

(Chan et al. 2008). An additional hypothesis for this study was that clinical

screens reveal COL1a1 amino acid polymorphisms because of their biased link to

disease, but that such variation in randomly sampled natural populations is likely

to have little phenotypic consequence. However, the 205 amino acid variant found

here that has previously been associated with osteopenia (Spotila et al. 1994)

implies this hypothesis is not entirely supported. Although phenotypic

information is unknown for the novel mutations, we can use the model we

generated based on evolutionary constraint at sites over time to estimate the

likelihood of severity (Stover and Verrelli 2010). Interestingly, the four novel

amino acid variants found in this study as well as one from the Chan et al. (2008)

screen occur at amino acid positions considered to have the lowest evolutionary

rates. Thus, we may predict that these amino acid polymorphisms likely have

some functional association with respect to skeletal phenotype.

Given the complete lack of amino acid divergence between human and

chimpanzee COL1a1, the pattern of rare human amino acid polymorphisms may

suggest that purifying selection against COL1a1 protein variation in human

populations has been relatively weak. However, as pointed out in Stover and

Verrelli (2010), mutations of varying degrees of severity cluster on the protein,

and these regions show varying levels of evolutionary constraint even within the

primate lineage beyond the last 4-6 My. Specifically, patterns of polymorphism

and divergence here suggest that certain mutations are rapidly removed by

purifying selection, while others may have less severe effects on bone phenotype.

Thus, purifying selection at COL1a1 is certainly not “weak,” but appears to vary

with respect to its strength across the protein sequence. In fact, individuals with a

COL1a1 amino acid polymorphism comprise >9% of our random sample, which

could be considered quite common compared to genome-wide DAMs in general.

In addition, these variants are distributed across 6 different geographic locales,

one of which is sub-Saharan Africa where most argue purifying selection has been

strongest compared to other recently colonized areas (e.g., Lohmueller et al.

2008). Thus, a random sampling not only reveals novel amino acid variants, but

also surprisingly suggests these mutations may not be as rare as predicted, and

could in fact be a source of observed bone strength variation among populations

(Looker et al. 1998; Bachrach et al. 1999; Barrett-Connor et al. 2005; Baxter-

Jones et al. 2010).

Interspecific patterns of COL1a1 intron length and content seen across

ancient and even recent evolutionary time for vertebrates implied strong historical

functional constraint in these regions (Stover and Verrelli 2010). Estimates of

silent site human polymorphism are on par with that seen at autosomal loci in

general; thus while COL1a1 has significantly high GC-content that may otherwise

cause elevated mutational input (Stover and Verrelli 2010), levels of noncoding

polymorphism are not unusual. That said, the interesting pattern of variation

associated with noncoding regions is the haplotype structure that effectively

separates the locus into two “blocks” with significant population differentiation

being associated with the 5’ region. Thus, in spite of strong purifying selection

acting on coding regions and overall exon-intron structure, likely to reduce exon

shuffling and unequal crossing-over events (Cohn et al. 1993; Stover and Verrelli

2010) that are highly deleterious (e.g., Barsh et al. 1985; Cohn et al. 1993), the 5’

end of COL1a1 appears evolutionarily independent with respect to effects of

hitchhiking. As bone phenotypic variation across human populations is likely

explained in part by gene expression variation (e.g., Dohi et al. 1998; Ota et al.

2001; Fang et al. 2003; Liu et al. 2003; Jin, van’t Hof et al. 2009), the fact that we

find highly significant geographic variation for COL1a1 haplotypes in the 5’

region, where polymorphism is most likely to alter promoter function and

expression, is most interesting and warrants inspection from a combined

evolutionary-functional perspective.

A perfect example is the Sp1 binding site functional variant. As previously

noted, although many studies have investigated its effect on COL1a1 gene

expression, few have examined its evolutionary history and origin (Gong and

Haynatzki 2003). The observed geographic distribution and age estimate suggest

that it is not recent, but instead may have existed prior to the emergence of

modern humans out of Africa (e.g., Garrigan and Hammer 2006). The possibility

that this variant and the associated impact on phenotype, such as increased

fracture risk, have been segregating in the population for a relatively long time is

further support against the hypothesis that COL1a1 amino acid polymorphism in

general is the result of very recent “weak” purifying selection. In fact, the patterns

observed here also do not rule out the possibility of the Sp1 SNP having historic

adaptive value, and thus, larger samples of this allele from global populations,

with long-range LD, could further clarify this hypothesis. Finally, it should be

noted that although haplotypes bearing the Sp1 SNP show geographic variation

among ethnically diverse groups that exhibit bone-related phenotypic variation,

this haplotype alone cannot explain the significant LD, 5’ haploblock, and

geographic differentiation observed at COL1a1. Thus, it is possible that other

expression variants are also responsible, yet it is difficult to say whether any

within the dataset here are more than candidates. However, our molecular

evolutionary analysis implies that further haplotype structure studies upstream of

the 5’ region and in larger samples would narrow down the potential sites with

which to conduct functional analyses (i.e., tests of the mutational impact on

promoter expression through reporter assays).

Conclusion

Although the abundance of DAMs discovered in clinical screens might suggest

that amino acid mutation at COL1a1 is highly deleterious in general, the

proportion of individuals with amino acid variants in our random sampling of the

natural population is not consistent with this hypothesis. This pattern suggests that

the natural population harbors considerably more COL1a1 amino acid variation than

previously believed, which could thus be contributing to the functional variation in BMD

observed across groups. In addition, the pattern of haploblock structure associated with

noncoding variation supports the prediction from our previous interspecific

comparisons among vertebrates suggesting that noncoding regions also represent

potential foci for functional variation. Similarly, the estimated ancient age and

geographic distribution of the Sp1 SNP also imply that it is not consistent with a

deleterious evolutionary model. In fact, together with the other patterns of

noncoding SNPs, the Sp1 variant may reflect gene expression variation that is

common, not new in origin, and leads to different BMD phenotypes across

populations, a rather surprising hypothesis given the functional constraint long

predicted for type I collagen.

Acknowledgments

The authors thank Michael Rosenberg for developing a PhaseSeqs data-script to

streamline the transfer of data for program analyses. This research was funded in

part by National Science Foundation grants BCS-0715972 (to B.C.V) and DEB-

0909637 (to B.C.V. and D.A.S.).

Table 3

COL1a1 Population Diversity Estimates

Population            n^a    S^b    θπ^c    D^d
Global                192    109    0.11    -1.22
Sub-Saharan African    18     61    0.14    -0.69
North African          14     41    0.11    -0.20
Middle Eastern         20     35    0.10     0.27
Russian                20     34    0.10     0.31
Chinese                20     27    0.08     0.51
Japanese               20     24    0.09     1.54
Southeast Asian        20     41    0.10    -0.23
Mexican                20     39    0.07    -1.06
Northern European      20     29    0.09     0.85
Italian                20     29    0.09     0.81

^a Number of chromosomes

^b Number of silent site SNPs (synonymous and intron sites, excluding the first intron and splice sites; see Materials and Methods)

^c Average number of pairwise differences between sequences (%)

^d Tajima’s D statistic

Table 4

Intraspecific and Interspecific Tests of Neutrality

                                        No. of Differences
Sample                 Sites            Polymorphic   Fixed   P-value
Global                 Nonsynonymous          6          0
                       Synonymous            12         18     0.02
                       Silent               109         83     0.04
Sub-Saharan African    Nonsynonymous          1          0
                       Synonymous             7         18     0.31
                       Silent                61         83     0.43
Non-African            Nonsynonymous          5          0
                       Synonymous             7         18     0.01
                       Silent                84         83     0.05

Note: “Silent” sites include synonymous and intron sites, excluding the first intron and splice sites. “P-value” refers to McDonald-Kreitman analysis (Fisher’s Exact Test), and is the result of contrasts between “Nonsynonymous” with “Synonymous” sites and “Nonsynonymous” with “Silent” sites for each of the three samples.

Fig. 4. COL1a1 gene diagram. Black boxes denote exons, white boxes denote 5’

and 3’ mRNA untranslated regions, and the striped box denotes the region of the

promoter re-sequenced here. The “*” denotes a 496-bp gap in the collected

sequence of intron 25 due to the presence of an Alu element. Positions (in bp) are

numbered starting with the first nucleotide of the first exon. Positions in bold

represent the “5’ haplotype” absent among our Asian samples (see Results). The

Sp1 binding site variant in the first intron is at position 1126. Significant pairwise

correlations among polymorphic sites >5% frequency in 192 human

chromosomes are represented with light-shaded boxes, whereas dark-shaded

boxes represent comparisons that exhibit significantly less linkage disequilibrium

than expected given a gene-specific evolutionary model of recombination (see

Materials and Methods).

CHAPTER 4: COMPARATIVE HUMAN AND CHIMPANZEE

ANALYSES OF COL1a1

Abstract

The most abundant structural protein in vertebrates, type I collagen, is encoded in

part by the collagen type I alpha 1 (COL1a1) gene, which harbors >600 mutations

linked to human skeletal diseases like osteoporosis. Our previous comparative

species work reflecting ~450 My of vertebrate evolution has shown that the

COL1a1 protein exhibits patterns of varying selective constraint across the amino

acid sequence, and surprisingly strong evidence for functional constraint on intron

composition and content. In addition, our previous population genetic analyses of

a natural, global human sample are consistent with these species comparisons in

showing several amino acid variants, in addition to an unusual haplotype structure

for noncoding regions likely related to observed differences in gene expression

across populations. Here we conduct an analysis of the >17-kb COL1a1 locus in

40 chromosome sequences from a population sample of chimpanzees, our closest

living relative, to determine whether patterns seen in humans at coding and

noncoding regions are unique. Interestingly, although we find no amino acid

variation, we reveal a significant excess of intermediate frequency polymorphism

that segregates between two haplogroups, as well as a partial exon duplication at

~20% in frequency, which is surprising given the latter are extremely rare and

deleterious in humans. Finally, long-range linkage disequilibrium analyses of

flanking regions spanning ~180-kb of the COL1a1 chromosomal region find an

ancient age for the two haplogroups predating chimpanzee-bonobo divergence.

These patterns of noncoding haplotype structure and exon diversity are discussed

in light of their differences from humans as well as the implications they have for

functional significance and skeletal disease evolution.

Introduction

Bone strength as well as the incidence and severity of related skeletal disorders

like osteoporosis vary significantly among human populations (e.g., Lauderdale et

al. 1997; Looker et al. 1997; Melton 1997; Bachrach et al. 1999; Barrett-Connor

et al. 2005; Baxter-Jones et al. 2010) due in part to genetic differentiation (e.g.,

Dvornyk et al. 2003; Gong and Haynatzki 2003; Lui et al. 2003; Gong et al. 2006;

Koller et al. 2010). Phenotypic data available from non-human primates, though

limited, also suggest that bone strength varies within other species (e.g., Sumner

et al. 1989; Cerroni et al. 2000; Black et al. 2001; Gunji et al. 2003; Havill et al.

2003), which is also likely due in part to underlying genetic variation (e.g., Lipkin

et al. 2001; Havill et al. 2005). In addition, slight differences in bone morphology

have also been documented between humans and our closest-living relatives,

chimpanzees, for osteoporotic-like symptoms, such as patterns of bone loss and

the accumulation of microfractures with age (e.g., Sumner et al. 1989; Wang et al.

1998; Gunji et al. 2003; Kikuchi et al. 2003; Mulhern and Ubelaker 2003, 2009;

Matsumura et al. 2010). As such, these data suggest that underlying genetic

variation among species at bone-related genes may contribute to phenotypic

variation, the determination of which would further our understanding of not only

human skeletal disease, but also natural variation in bone strength.

Among the genes most commonly associated with bone-related

phenotypic variation in humans is COL1a1, which encodes the primary subunit of

type I collagen, the main structural protein of bone, teeth, and tendon (Viguet-

Carrin et al. 2006). With >600 disease-associated mutations (DAMs; primarily

linked to osteoporosis, osteogenesis imperfecta types I-IV, and Ehlers-Danlos

Syndrome types I and VIIA; Dalgleish 1997; Marini et al. 2007) as well as

associations with natural variation in bone strength among human populations

(e.g., Garcia-Giralt et al. 2002; Stewart et al. 2006; Jiang et al. 2007; Ioannidis et

al. 2007; Kaufman et al. 2008), this locus is a prime candidate for research in

bone-related phenotypic variation, not only among human populations, but among

other species as well (Stover and Verrelli 2010).

The majority of known COL1a1 DAMs affect protein coding regions,

typically within the triple-helix domain that is composed of a repeating amino

acid sequence with glycine, the smallest of the amino acids, in every third

position, which enables this domain to wind into its compact structure in type I

collagen (Yamada et al. 1980; Bernard et al. 1983; Exposito et al. 2002; Boot-

Handford and Tuckwell 2003; Aouacheria et al. 2004; Wada et al. 2006). Because

type I collagen is a triple-helix comprised of two COL1a1 and one COL1a2

subunits, known protein length mutations of any size are deleterious, often lethal,

since an increase in length of one subunit disrupts helix stability without a similar

lengthening of the other subunit (e.g., Cohn et al. 1993; Pace et al. 2001; Cabral et

al. 2003). As such, large duplications/deletions of the 51 exons encoded by the

>17-kb COL1a1 locus are exceedingly rare (e.g., Barsh et al. 1985; Cohn et al.

1993; Bodian et al. 2009). Amino acid mutations have similar effects on helix

stability with varying phenotypic outcomes from lethality to mild, osteoporotic-

like symptoms (Kuivaniemi et al. 1997; Dalgleish 1997; Marini et al. 2007; Rauch

et al. 2010), which we have previously shown can be predicted based upon the

degree of evolutionary conservation of amino acid positions over the past ~450

My, with a positive correlation between site conservation and the phenotypic

severity of DAMs (Stover and Verrelli 2010).

Within the natural human population, however, although the number of

COL1a1 amino acid polymorphisms is higher than expected given the overall

high evolutionary constraint at this locus, the low frequency of these variants

suggests that the association of COL1a1 with bone-related phenotypic variation

among populations cannot be explained by protein variation alone (Stover and

Verrelli, Chapter 3). Rather, this association is also driven by noncoding

variation. For example, a first-intron Sp1 transcription factor binding site

mutation has already been shown to increase COL1a1 gene expression, causing

population differences in reduced bone strength and increased fracture risk due to

variation in the frequency of this polymorphism (e.g., Grant et al. 1996; Mann et

al. 2001; Bandres et al. 2005; Ralston et al. 2006; Jiang et al. 2007; Jin, van’t Hof

et al. 2009). Further, because COL1a1 intron composition has been highly

conserved historically, even resulting in reduced human-chimpanzee intron

divergence, introns in general may harbor phenotypically important genetic

variation (Stover and Verrelli 2010). In fact, within the human population,

noncoding regions of COL1a1 demonstrate unusually high haplotype

differentiation among populations, particularly for the 5’ region of the gene where

variants that alter expression are most likely to occur (Stover and Verrelli,

Chapter 3).

Overall, several interesting patterns at COL1a1 in humans have emerged.

First, the number of amino acid polymorphisms in the natural population is

surprisingly high (Chan et al. 2008; Stover and Verrelli, Chapter 3), suggesting

that not all protein variants today are highly deleterious. Second, length variation

of COL1a1 coding regions that is unassociated with a deleterious disease phenotype has yet

to be identified in the natural population (Chan et al. 2008; Bodian et al.

2009; Stover and Verrelli, Chapter 3). Third, noncoding variation at COL1a1

demonstrates unusual haplotype structure among populations (Stover and Verrelli,

Chapter 3). To determine how unusual these human population-level patterns may

be, however, they must be placed in an evolutionary context. Although our

comparisons among distantly-related species help identify ancient factors

affecting COL1a1, comparative population genetic analyses would reveal what

factors are similarly impacting other species today. As such, chimpanzees serve as

a perfect model for this purpose, particularly since available data suggest that

bone-related phenotypic differences may exist between our closely-related

species. Here we present the first population genetic analysis of COL1a1 in a

population sample of chimpanzees to determine if this gene has been subject to

similar selective pressures as in humans.

Materials and Methods

Population Samples

Although there are several recognized chimpanzee subspecies, the western African

Pan troglodytes verus subspecies represents the most appropriate contrast with

humans because of similar levels of nuclear population diversity (Stone et al.

2002; Gilad et al. 2003; Fischer et al. 2004; Wooding et al. 2005; Verrelli et al.

2006, 2008; Claw et al. 2010). As such, nucleotide sequence data were collected

from DNA samples of 20 wild-born, unrelated P. t. verus chimpanzees (40

chromosomes) that have been used previously in population genetic analyses of

other loci (Stone et al. 2002; Wooding et al. 2005, 2006; Verrelli et al. 2006,

2008; Claw et al. 2010). Available COL1a1 gene sequences for orangutan (Pongo

abelii) and macaque (Macaca mulatta) were also obtained from the National

Center for Biotechnology Information (NCBI) database in order to infer derived

versus ancestral states of polymorphic positions for estimates of divergence.

DNA Amplification and Sequencing

DNA sequence was collected for a total of 16,989 bp of the COL1a1 locus on

chromosome 17, including 1,223 bp of the promoter and 263 bp of the 5’ and 3’

mRNA untranslated regions. This sequence is contiguous, spanning all of the 51

exons and intervening introns except for a 508-bp gap in intron 25 that was

avoided due to the presence of an Alu element. Protocols follow those of Stover

and Verrelli (2010) in which both human and chimpanzee gene sequences were

initially generated. Polymerase chain reaction (PCR) and sequencing primers

were designed from human genome sequence (accession # NT_010783.14) and

are available upon request. PCR products were purified using shrimp alkaline

phosphatase and exonuclease I (US Biochemicals, Cleveland, OH) prior to DNA

sequencing with an Applied Biosystems (Foster City, CA) 3730 capillary

sequencer. Sequences were aligned and edited using Sequencher v. 4.5 (Gene

Codes, Ann Arbor, MI).

Statistical Analyses

While nonsynonymous exon sites are generally expected to be the most highly

conserved due to their importance in protein function, other noncoding regions

may also have putative “functional” effects on gene expression and regulation and

even exhibit significant evolutionary constraint, such as first introns (e.g.,

Bornstein et al. 1987; Majewski and Ott 2002), and 5’ and 3’ untranslated mRNA

(UTR) and promoter regions (e.g., Wray et al. 2003; Haygood et al. 2007; Cheung

and Spielman 2009). Synonymous exon sites and non-first introns on the other

hand, though not completely “silent” in terms of functional significance (e.g.,

Urrutia and Hurst 2003; Chamary and Hurst 2005), are generally less constrained

relative to these other sites (i.e., nonsynonymous, promoter, UTR) and thus, are

commonly used to reflect estimates of neutrality in population genetic studies

similar to this one (e.g., Haygood et al. 2007; Claw et al. 2010). As such, all

diversity analyses were applied to several gene regions, in addition to simply

exons and introns, to test hypotheses of functional constraint across the COL1a1

locus. The DnaSP v. 5.1 program (Rozas et al. 2003) was used for these diversity

statistic estimates, unless otherwise noted.

Heterozygous sequence data were resolved into haplotypes using PHASE

v. 2.1.1 (Stephens et al. 2001). To examine the consistency of haplotype

reconstruction among runs, this process was repeated with 100, 250, and 500

iterations, with the best-fit haplotypes from the likelihood model being used for

all subsequent analyses. Genetic diversity was estimated as Watterson’s (1975)

θW, which is based upon the number of segregating sites (S) corrected for sample

size, and as θπ, which is based upon the average number of pairwise differences

among sequences (Nei 1987). These two estimates of the population parameter θ

= 4Neµ (for an autosomal locus, with “µ” denoting mutation rate per bp) are

expected to be equal under neutrality; however, non-neutral and demographic

processes are expected to skew the SNP frequency spectrum, which can be

detected using Tajima’s (1989) D statistic. Significance of estimates of COL1a1

diversity and of D was determined through comparisons to estimates from

previously published P. t. verus studies of other autosomal loci (Gilad et al. 2003;

Yu et al. 2003; Fischer et al. 2004; Claw et al. 2010) using coalescent simulations

provided by DnaSP with 10,000 replicates. To compare putatively silent and

functional SNP diversity within and between chimpanzees and humans, we also

used McDonald and Kreitman (1991) interspecific tests of neutrality, which test

the hypothesis that functional sites exhibit patterns of diversity within recent and

historical time periods consistent with that seen at silent sites that reflect a simple

evolutionary model of drift.
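
A McDonald-Kreitman comparison of this kind reduces to a 2x2 contingency table of polymorphic versus fixed differences at functional versus silent sites. The sketch below shows one common way to evaluate such a table, a two-sided Fisher's exact test written with the standard library only. The function name and the counts are illustrative assumptions; they are not the values from supplementary table 1, and the source does not state which test statistic was applied to the table.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over every table that is no more probable than the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

# Hypothetical counts (placeholders, NOT data from this study):
#                 polymorphic within species, fixed between species
functional_poly, functional_fixed = 5, 10
silent_poly, silent_fixed = 70, 80

p_value = fisher_exact_two_sided(functional_poly, functional_fixed,
                                 silent_poly, silent_fixed)
```

Under neutrality the ratio of polymorphic to fixed differences is expected to be similar in the two rows, so a small P-value would indicate that functional sites depart from the silent-site pattern.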

Finally, to enable comparisons of human and chimpanzee haplotype

structure and linkage disequilibrium (LD) at COL1a1, we first identified

associations among SNPs using r² with significance assessed by Fisher’s exact

tests after a standard Bonferroni correction for multiple comparisons as

implemented in DnaSP. Similar to analyses in Chapter 3, we also examined

correlations using the LDhat program of McVean et al. (2002), which uses the

approximate-likelihood method of Hudson (2001) and permutation analyses to

determine if pairwise comparisons among SNPs are in significantly more or less

LD than expected given the distance between them and the background rates of

recombination and mutation. Low-frequency SNPs are uninformative for these

comparisons; therefore, only polymorphisms >5% frequency were used in

haplotype analyses to increase our statistical power to detect correlations. As

previous analyses of P. t. verus populations, including those using these same

samples here, show no evidence of historical structure either geographically or

temporally (e.g., Wooding et al. 2005; Verrelli et al. 2006, 2008; Becquet et al.

2007; Claw et al. 2010; Leuenberger and Wegmann 2010), analyses such as

Hudson’s (2001) Snn that were performed in Chapter 3 for our human sample were

unnecessary here.
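
The pairwise LD screen described above can be sketched as follows: filter SNPs by minor allele frequency, compute r² for every remaining pair from phased haplotypes, and derive the Bonferroni-adjusted significance threshold. The haplotypes are invented toy data coded as strings of '0' and '1', and the function names are assumptions for illustration; the per-pair Fisher's exact test that DnaSP applies is not reproduced here.

```python
from itertools import combinations

def allele_freq(haps, site):
    """Frequency of the allele coded '1' at one site across phased haplotypes."""
    return sum(h[site] == "1" for h in haps) / len(haps)

def r_squared(haps, i, j):
    """Squared allelic correlation r^2 between biallelic sites i and j."""
    p_i, p_j = allele_freq(haps, i), allele_freq(haps, j)
    p_ij = sum(h[i] == "1" and h[j] == "1" for h in haps) / len(haps)
    d = p_ij - p_i * p_j  # linkage disequilibrium coefficient
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    return d * d / denom if denom > 0 else float("nan")

def common_sites(haps, min_freq=0.05):
    """Sites whose minor allele frequency exceeds min_freq (>5% in the text)."""
    keep = []
    for s in range(len(haps[0])):
        p = allele_freq(haps, s)
        if min(p, 1 - p) > min_freq:
            keep.append(s)
    return keep

# Toy phased haplotypes (hypothetical, not the chimpanzee data).
haps = ["1100", "1100", "0011", "0011", "1101", "0010"]

sites = common_sites(haps)
pairs = list(combinations(sites, 2))
ld = {(i, j): r_squared(haps, i, j) for i, j in pairs}

# Bonferroni correction: divide the nominal alpha by the number of pairs tested.
bonferroni_alpha = 0.05 / len(pairs)
```

In this toy set, sites 0 and 1 always carry the same allele, so their r² is 1, whereas sites 0 and 3 are only weakly associated.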

Results

Chimpanzee COL1a1 Polymorphism and Haplotype Structure

A total of 16,989 bp of the COL1a1 locus was collected from each of the

chimpanzees with 10,677 bp representing silent sites (table 5). No

nonsynonymous mutations were detected in our sample (table 5; supplementary

fig. 1, Appendix C). Surprisingly, however, there is a duplication involving 36-bp

of exon 35 (fig. 5) found at an allele frequency of 17.5%, including one

homozygous individual. Intron splice sites surrounding this partial exon

duplication are intact, and thus this duplication, if encoded, would result in an

additional 12 amino acids being added to the COL1a1 protein without altering the

downstream reading frame (fig. 5).

Although we find no significant pattern of diversity among gene regions

strictly based on “numbers” of mutations from our McDonald-Kreitman tests

(supplementary table 1, Appendix C), silent diversity at COL1a1 (θπ), both overall

and for intron and synonymous sites independently, is significantly higher than the

average (θπ=0.1%) of previously published autosomal loci (P=0.001; table 5;

Gilad et al. 2003; Yu et al. 2003; Fischer et al. 2004; Claw et al. 2010). Given

levels of silent diversity at COL1a1, the observed positive value of silent Tajima’s

D is significantly higher than expected under a standard neutral model (P=0.006;

table 5) and is also well outside the range of previously reported values across the

genome, suggesting that there is a significant excess of high frequency SNPs at

COL1a1.
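
As a rough illustration of how the summary statistics in table 5 relate to one another, the sketch below recomputes Watterson's θW and Tajima's D for the "Total" row from the sample size (40 chromosomes), the number of segregating sites, and θπ. It uses the standard formulas of Watterson (1975) and Tajima (1989). The function names are assumptions, and because θπ is rounded in the table and DnaSP may count sites slightly differently, the result should only approximate the tabulated D.

```python
from math import sqrt

def watterson_a1(n):
    """a1 = sum_{i=1}^{n-1} 1/i for n sampled chromosomes."""
    return sum(1.0 / i for i in range(1, n))

def theta_w(n, seg_sites, length):
    """Watterson's estimator of theta, per site."""
    return seg_sites / watterson_a1(n) / length

def tajimas_d(n, seg_sites, mean_pairwise_diffs):
    """Tajima's D from n, S, and the mean pairwise differences per locus."""
    a1 = watterson_a1(n)
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (mean_pairwise_diffs - seg_sites / a1) / sqrt(variance)

# Values taken from the "Total" row of table 5.
n_chrom = 40
length_bp = 16989
S_total = 92
theta_pi_per_site = 0.21 / 100                 # table reports theta_pi in percent
k_total = theta_pi_per_site * length_bp        # mean pairwise differences per locus

theta_w_total = theta_w(n_chrom, S_total, length_bp)
d_total = tajimas_d(n_chrom, S_total, k_total)
```

Because θπ exceeds θW here, D is positive (roughly 2.4 with these rounded inputs), which is the direction expected when intermediate-frequency variants are over-represented relative to the neutral expectation.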

LD analyses show that the significant excess of common polymorphisms

at COL1a1 is likely the result of sequences or lineages being split into two

haplogroups, designated haplogroups “A” (found at 55% allele frequency) and

“B” (found at 45% frequency), that extend the length of our sequenced region

(fig. 6; supplementary fig. 2, Appendix C). After excluding 5 predicted

recombinant sequences to be conservative about shared variation among lineages,

these two haplogroups are defined by 62 polymorphisms, 4 of which fall within

promoter and UTRs and an additional 8 within the first intron (supplementary fig.

1, Appendix C). Though silent SNP diversity (θπ) associated with haplogroup B

chromosomes is over two-fold lower than that associated with haplogroup A

chromosomes (excluding recombinants; table 6), coalescent simulations suggest

that this difference is not unusual. However, given the number of silent SNPs in

our overall sample found at an allele frequency >5% (S=70), it is highly unusual

to find only 4 such SNPs associated with haplogroup B chromosomes

(P<0.00001) and even 25 such SNPs associated with haplogroup A chromosomes

(P=0.01; table 6). This pattern further supports the hypothesis that the high levels

of diversity found in our overall sample are due to the presence of these two high

frequency, and highly divergent haplogroups.

A neighbor-joining tree constructed from the number of differences

between chromosomes for SNPs >5% allele frequency using MEGA v. 4 (Tamura

et al. 2007) also supports the hypothesis that the haplotype associated with the

partial exon 35 duplication (hereafter referred to as the “exon duplication-bearing

haplotype”) occurred on the haplogroup A background (fig. 6). Further, no SNPs

>5% frequency are found among the 7 chromosomes bearing the exon duplication

(table 6). Even if we only use the 8 silent SNPs found among the non-

recombinant haplogroup A chromosomes as a conservative expectation of the

level of silent diversity associated with this haplogroup, coalescent simulations

reveal that it is statistically unusual to find 7 chromosomes with no associated

polymorphisms (P=0.02).
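
An analytic counterpart to this kind of simulation exists under the standard neutral model: for a random sample of n chromosomes and a locus-wide θ, the probability of observing no segregating sites is the product of j/(j+θ) for j from 1 to n-1. The sketch below evaluates that expression. It is not the authors' coalescent simulation, it treats the 7 chromosomes as a random sample, and the background sample size `n_background` is a placeholder assumption because the number of non-recombinant haplogroup A chromosomes is not given in the text above.

```python
def watterson_a1(n):
    """a1 = sum_{i=1}^{n-1} 1/i for n sampled chromosomes."""
    return sum(1.0 / i for i in range(1, n))

def prob_no_segregating_sites(n, theta):
    """P(S = 0) for n chromosomes under the standard neutral model.

    theta is the locus-wide (not per-site) population mutation parameter.
    """
    p = 1.0
    for j in range(1, n):
        p *= j / (j + theta)
    return p

# Conservative background diversity: 8 silent SNPs among haplogroup A.
s_background = 8
n_background = 17  # ASSUMPTION: placeholder sample size, not from the source

theta_locus = s_background / watterson_a1(n_background)
p_zero = prob_no_segregating_sites(7, theta_locus)
```

With these placeholder inputs the probability falls below 0.05, which is at least in line with the direction of the simulation result reported above; a different background sample size would shift the value.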

Polymorphism and Haplotype Structure Surrounding COL1a1

While high levels of diversity and LD appear to be characteristic of COL1a1, it is

possible that these patterns are typical of this region of chromosome 17 in

chimpanzees. Estimates of LD surrounding COL1a1 can also be informative

about the age and origin of specific polymorphisms that can then be used to test

evolutionary hypotheses of adaptive vs. neutral scenarios, i.e., is the exon 35

duplication consistent with positive selection? Thus, we collected additional

nucleotide sequence data for a series of 1-2 kb PCR fragments 5’ and 3’ of

COL1a1, starting ~10-kb away and spanning a total of ~180-kb (fig. 7). Because

we are primarily interested in variation in silent diversity for testing these

evolutionary hypotheses, PCR fragments were targeted to intergenic and intron

regions when available in both directions.

Comparisons of rates of recombination and patterns of LD over short

intervals within the human and chimpanzee genomes have revealed significant

differences even over the relatively short time separating these two species (e.g.,

Ptak et al. 2005; Winckler et al. 2005). Thus, here we sampled diversity from

bonobo (Pan paniscus), which has an estimated divergence time from the

chimpanzee lineage at ~0.8-1.8 My (e.g., Stone et al. 2002; Won and Hey 2005;

Becquet et al. 2007; Wegmann and Excoffier 2010), as this can provide estimates

of LD and haplotype structure to better contrast patterns seen here at COL1a1 for

P. t. verus. DNA from 13 bonobos (26 chromosomes) was used in PCR and

sequencing as described above to examine these same regions flanking COL1a1

both 5’ and 3’ (fig. 7).

In total, we have collected an additional 13,836 bp of sequence 5’ and

18,907 bp 3’ of COL1a1, with 13,114 bp and 13,286 bp constituting silent sites,

respectively (supplementary table 2, Appendix C), giving a grand total of ~50-kb

of sequence data in chimpanzees spread across ~180-kb of chromosome 17.

Outside of the COL1a1 locus in chimpanzees, silent SNP diversity (θπ) decreases

with regional estimates consistent with those previously reported for other

autosomal loci (supplementary table 2, Appendix C; Gilad et al. 2003; Yu et al.

2003; Fischer et al. 2004; Claw et al. 2010). Similarly, Tajima’s D decreases

outside of the COL1a1 locus falling within the range of previously reported

values (supplementary table 2, Appendix C). There is also reduced support for

haplogroups A and B outside of the COL1a1 locus as evidenced by only a single

SNP in the surrounding regions (~10-kb 3’ of the COL1a1 UTR) found in

significant LD with these haplogroups and by the significant increase in haplotype

diversity on either side of COL1a1 as indicated by our LDhat analysis (fig. 6;

supplementary fig. 2, Appendix C). However, even when considering the

background rate of recombination with our LDhat analysis, there are still

haplotypes with significant LD among sites >100-kb away, included among

which is the COL1a1 exon duplication-bearing haplotype (supplementary fig. 2,

Appendix C).

We may expect that an allele that has been affected by recent positive

directional selection may bear a signature of reduced genetic diversity and

unusually long-range LD (e.g., Tishkoff et al. 2001, 2007; Sabeti et al. 2005,

2006; Saunders et al. 2006). To determine if the exon duplication-bearing

haplotype is associated with unusually long-range LD, we used the Long-Range

Haplotype test of Sabeti et al. (2002) as implemented in the program Sweep v. 1.1

(Sabeti et al. 2007). Non-overlapping cores of between 3 and 10 polymorphisms

identified by the method of Gabriel et al. (2002) were generated for the COL1a1

locus (for a total of 9 cores). Using polymorphisms >5% frequency from our

entire chromosome 17 sequenced region, extended haplotype homozygosity

(EHH) was calculated for each COL1a1 core haplotype at increasing distances

(measured in kb) from the core (Sabeti et al. 2002). Because a fine-scale

recombination map is not yet available for the chimpanzee genome, we corrected

for local variation in recombination rate using the relative EHH (REHH) measure,

which compares EHH of each core haplotype to all other core haplotypes at

COL1a1 (Sabeti et al. 2005). As implemented in the significance calculator

available in Sweep, we assessed the significance of the extent of relative LD

associated with the exon duplication-bearing haplotype by generating a

distribution of REHH measured at ~85-kb from either side of each core haplotype

and asked whether REHH associated with the exon duplication-bearing haplotype

is significantly greater than that of other core haplotypes of similar frequency at

COL1a1 (i.e., 15-20% frequency). Relative LD associated with the exon

duplication-bearing haplotype does not extend significantly further than expected

given the frequency of this haplotype (supplementary fig. 3, Appendix C). In fact,

EHH does not begin to decay within our sequenced region for 3 COL1a1

haplotypes (with a maximum frequency of 22.5%), including the exon

duplication-bearing haplotype.
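
The EHH and REHH quantities used above can be sketched as follows, following the definitions of Sabeti et al. (2002): EHH is the probability that two randomly chosen chromosomes carrying a given core haplotype are identical at all markers from the core out to a distance x, and REHH divides that value by the pooled EHH of all other core haplotypes. This is a minimal illustration on invented toy haplotypes, not the Sweep implementation; the function names and the string encoding of haplotypes are assumptions.

```python
from collections import Counter
from math import comb

def core_of(hap, core_sites):
    """Alleles of one chromosome at the core SNP positions, as a string."""
    return "".join(hap[s] for s in core_sites)

def identical_pairs(carriers, sites):
    """Number of chromosome pairs that are identical at every listed site."""
    counts = Counter(tuple(h[s] for s in sites) for h in carriers)
    return sum(comb(m, 2) for m in counts.values())

def ehh(haps, core_sites, core, sites_to_x):
    """EHH of one core haplotype over all markers from the core out to x."""
    carriers = [h for h in haps if core_of(h, core_sites) == core]
    if len(carriers) < 2:
        return float("nan")
    return identical_pairs(carriers, sites_to_x) / comb(len(carriers), 2)

def rehh(haps, core_sites, core, sites_to_x):
    """EHH of the core divided by the pooled EHH of all other cores."""
    groups = {}
    for h in haps:
        groups.setdefault(core_of(h, core_sites), []).append(h)
    num = sum(identical_pairs(g, sites_to_x) for c, g in groups.items() if c != core)
    den = sum(comb(len(g), 2) for c, g in groups.items() if c != core)
    if den == 0 or num == 0:
        return float("nan")
    return ehh(haps, core_sites, core, sites_to_x) / (num / den)

# Toy phased chromosomes (hypothetical): core at sites 2-3, flanks at 0, 1, 4, 5.
haps = ["001100", "001100", "001100", "101101",
        "010010", "010010", "010011", "100011"]
core_sites = (2, 3)
flank = (0, 1, 4, 5)

ehh_core = ehh(haps, core_sites, "11", flank)
rehh_core = rehh(haps, core_sites, "11", flank)
```

In the toy data, three of the four "11" carriers share the same flanking alleles, so EHH is 3/6 = 0.5; the other core haplotype has one identical pair out of six, giving an REHH of 3.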

Within bonobos, although the number of polymorphisms identified in 5’

and 3’ regions was comparable to those observed in chimpanzees, the frequency

of these polymorphisms is reduced. For example, overall silent SNP diversity (θπ)

in both regions 5’ and 3’ of COL1a1 in bonobos is approximately half that

observed in chimpanzees and results in an overall trend toward negative Tajima’s

D values among all PCR fragments (supplementary table 3, Appendix C). In fact,

given levels of silent diversity within the 3’ region, the observed negative value of

silent Tajima’s D is significantly lower than expected (P=0.02; supplementary

table 3, Appendix C) indicating that there is a skew toward low frequency

polymorphism in bonobos. As such, only 45 polymorphisms in our dataset reach

an allele frequency adequate for linkage analyses (>5% frequency). Interestingly,

similar to the chimpanzee dataset, at least two long-range haplotypes exist in

bonobos with significant LD spanning sites >100-kb away (supplementary figs. 4

and 5, Appendix C).

COL1a1 Haplotype Age Estimates

We used the method of Thomson et al. (2000) to estimate the age of haplogroups

A and B as well as of the exon duplication-bearing haplotype. As previously

described (Thomson et al. 2000; Scheinfeldt et al. 2009; Claw et al. 2010), this

age estimate (t) is based upon the relationship:

$$t = \frac{\sum_{i=1}^{n} x_i}{n\mu}$$

where xi is the number of mutational differences between the ith sequence and the

estimated most recent common ancestor (MRCA) of all sequences, n is the total

number of sequences in the sample, and µ is the mutation rate. Here, as in Claw et

al. (2010), µ is estimated as the number of substitutions between human and

chimpanzee divided by twice the estimated molecular divergence time between

these species, or 5 My (± 1 My; Kumar et al. 2005). As previously mentioned,

COL1a1 human-chimpanzee divergence may be lower than expected (Stover and

Verrelli 2010), which would result in an underestimate of µ and an overestimate

of age; therefore, all ages were calculated using two estimates of µ. First, we used

a gene-specific estimate based upon the number of substitutions between human

and chimpanzee observed at COL1a1 (i.e., 92). Second, we used the number of

substitutions (excluding nonsynonymous sites) observed in our regions

surrounding COL1a1 and standardized this divergence rate by the length of our

COL1a1 sequenced region to calculate the expected number of substitutions at

this locus (i.e., 120) if it were evolving at the same rate as the surrounding

regions, which we then used to get a regional estimate of µ. A neighbor-joining

tree constructed from the number of differences at COL1a1 between

chromosomes, including the 5 haplogroup A-B recombinant chromosomes, in

MEGA v. 4 (Tamura et al. 2007) was used to determine xi (supplementary fig. 6,

Appendix C). Because we are dealing with phased haplotypes (see Materials and

Methods), only SNPs >5% allele frequency were used for these age estimates,

which will cause a slight underestimate of xi and, therefore, an underestimate of

age. Thus, to further aid the resolution of haplotype ages, we also genotyped our

bonobo samples for the partial exon duplication and the A and B haplogroups by

sequencing two PCR fragments previously used to amplify COL1a1 in

chimpanzees. The A and B haplogroups were specifically genotyped using 6 of

the segregating SNPs in the first intron of COL1a1 (positions 111-702,

supplementary fig. 1, Appendix C).

Using our gene-specific µ, we estimate the age of the MRCA of

haplogroups A and B to be 3.6 ± 0.7 My. Even with using a more conservative

regional estimate of µ, we still estimate that these haplogroups split 2.8 ± 0.6 My,

which predates the chimpanzee-bonobo molecular divergence time (0.8-1.8 My).

Our 26 bonobo chromosomes are fixed for the haplogroup B allele for 5 of our

genotyped first intron SNPs and the haplogroup A allele for the 6th, which is also

consistent with haplogroup A and B alleles existing in the ancestral population

prior to the divergence of chimpanzee and bonobo. We further estimate the

MRCAs of haplogroup A and B chromosomes within chimpanzees to be 1.5 ± 0.3

My and 169 ± 34 Ky, respectively (or 1.1 ± 0.2 My and 130 ± 26 Ky using our

regional estimate of µ).

As previously mentioned, chromosomes with the exon duplication-bearing

haplotype are not variable within the COL1a1 locus (for SNPs >5% frequency);

however, one SNP (position 7246, supplementary fig. 1, Appendix C) separates

this haplotype from the other members of haplogroup A. As such, we estimate

that the exon duplication-bearing haplotype diverged 109 ± 22 Ky (or 83 ± 17 Ky

using our regional estimate of µ). Examining our entire sequenced region, 2 SNPs

are polymorphic among chromosomes bearing this haplotype (supplementary figs.

1 and 7, Appendix C), which when using an estimate of µ based upon the number of

substitutions between human and chimpanzee for the entire region (i.e., 367),

gives an estimated MRCA of the exon duplication-bearing haplotype as 27 ± 5

Ky. Consistent with this recent origin of the exon duplication-bearing haplotype,

we do not find the duplication in our bonobo sample.
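
The arithmetic behind these age estimates can be sketched from the Thomson et al. (2000) relationship, with µ taken as the number of human-chimpanzee substitutions divided by twice the divergence time (5 ± 1 My). The example below reads the single SNP separating the exon duplication-bearing haplotype from the rest of haplogroup A as one mutational difference per chromosome, which is an interpretive assumption, as are the function names. The substitution counts (92 gene-specific, 120 regional) are the ones given in the text.

```python
def mutation_rate_per_my(substitutions, divergence_time_my):
    """Locus-wide mutation rate per My: substitutions / (2 * divergence time)."""
    return substitutions / (2.0 * divergence_time_my)

def thomson_age_my(x, mu_per_my):
    """Age t = sum(x_i) / (n * mu), with x_i the differences from the ancestor."""
    return sum(x) / (len(x) * mu_per_my)

def age_with_range(x, substitutions, t_div=5.0, t_err=1.0):
    """Point estimate plus the range implied by the divergence-time uncertainty."""
    mid = thomson_age_my(x, mutation_rate_per_my(substitutions, t_div))
    low = thomson_age_my(x, mutation_rate_per_my(substitutions, t_div - t_err))
    high = thomson_age_my(x, mutation_rate_per_my(substitutions, t_div + t_err))
    return mid, low, high

# ASSUMPTION: each of the 7 duplication-bearing chromosomes differs from the
# haplogroup A ancestor by the one SNP at position 7246.
x_dup = [1] * 7

gene_mid, gene_low, gene_high = age_with_range(x_dup, substitutions=92)
regional_mid, regional_low, regional_high = age_with_range(x_dup, substitutions=120)

gene_ky = gene_mid * 1000.0          # convert My to Ky
regional_ky = regional_mid * 1000.0
```

Under these assumptions the gene-specific and regional rates give ages of about 109 Ky and 83 Ky, and the spread from the ± 1 My divergence uncertainty is about 22 Ky for the gene-specific estimate, consistent with the values quoted above.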

Discussion

Human and Chimpanzee Amino Acid Polymorphism

Although the COL1a1 protein has been highly conserved over the past ~450 My

of vertebrate evolution, even including no amino acid divergence between human

and chimpanzee (Stover and Verrelli 2010), there is an abundance of DAMs in

humans (Dalgleish 1997; Marini et al. 2007). While this may be consistent with

varying selective pressures over space and time, it is unlikely to be the result of

simply weak purifying selection, as discussed in both Chapter 2, and especially in

Chapter 3 where we find that amino acid variation overall is not very rare. On the

other hand, the lack of amino acid variation in chimpanzees suggests strong

purifying selection within this species. It is possible that these two divergent

patterns seen in populations today simply reflect different environmental

constraints between our species, such as in locomotion, diet, and skeletal growth

periods (e.g., Larsen 1995; Abbott et al. 1996; Bogin and Smith 1996; Cotter et al.

2009; Hancock et al. 2010). Tests of gene-specific variation in a functional setting

could evaluate this hypothesis.

Polymorphism and Haplotype Structure

Contrary to amino acid polymorphism, genetic diversity at COL1a1 in

chimpanzees is high and significantly skewed toward common variants, which

segregate into two high frequency and highly divergent haplogroups (A and B).

Age estimates suggest that the variation between these haplogroups existed prior

to the divergence of chimpanzee and bonobo. On the one hand, both haplogroup

A and B alleles have become fixed at different COL1a1 sites within the bonobo

lineage. Combined with the abundance of low frequency polymorphism observed

in the regions surrounding COL1a1, this fixation may be consistent with a recent

expansion of the bonobo population, as has been previously suggested based upon

other autosomal loci (Fischer et al. 2006, but see Eriksson et al. 2004 for

mtDNA). On the other hand, variation between haplogroups A and B still remains

in chimpanzees today.

Based upon our sequenced regions surrounding COL1a1, we estimate that

LD associated with these haplogroups in chimpanzees spans ~30-40 kb.

Combined with the high frequency polymorphism that segregates between

haplogroups, this is a classic signature of balancing selection expected in a region

of low recombination (Charlesworth 2006), as is inferred for COL1a1 compared

to the surrounding regions. However, certain demographic models can also cause

similar patterns. Specifically, if two populations that differed in allele frequencies

at COL1a1 were to admix, high levels of polymorphism and LD would be

observed at this locus (e.g., Smith et al. 2001; Smith and O’Brien 2005). There

are several reasons why this possibility is unlikely here. First, as previously noted,

there is no evidence of substructure within this subspecies of chimpanzee (e.g.,

Wooding et al. 2005; Verrelli et al. 2006, 2008; Becquet et al. 2007; Claw et al.

2010; Leuenberger and Wegmann 2010). Second, patterns of diversity and LD

observed at COL1a1 are unique among previously reported studies (e.g., Stone et

al. 2002; Gilad et al. 2003; Fischer et al. 2004; Wooding et al. 2005, 2006;

Verrelli et al. 2006, 2008; Becquet et al. 2007; Claw et al. 2010), which is not

expected under a demographic model as such processes are predicted to have

genome-wide effects. Finally, given the estimated age of the split of haplogroups

A and B, it is unlikely that such a strong pattern resulting from historic population

structure would still persist today since even low levels of recombination are

expected to break apart associations among polymorphisms relatively quickly

(e.g., Clegg et al. 1980; Asmussen and Clegg 1982). We can infer that at least low

levels of recombination exist at COL1a1 in chimpanzees because: 1) we find 5

recombinant A-B chromosomes within our population sample, and 2) even though

no nonsynonymous divergence has occurred between human and chimpanzee,

there are typical levels of synonymous divergence demonstrating that these sites

have become unlinked over time (Stover and Verrelli 2010). As such, we would

expect to find less haplotype structure at COL1a1 if these patterns were purely

due to demography. Thus, given our current understanding of chimpanzee

population genetics, there is a strong possibility that the pattern here reflects

ancient balancing selection.

Polymorphic Exon Duplication

Surprisingly, a partial duplication of exon 35, which falls within the triple-helix

domain of type I collagen, has also reached a relatively high frequency of 17.5%

in our chimpanzee population. Within humans, mutations that affect protein

length, and particularly of the triple-helix domain, of fibrillar collagens in general,

let alone those of type I collagen, are exceedingly rare as they are often lethal or

result in severely deleterious phenotypes (e.g., Barsh et al. 1985; Cohn et al. 1993;

Raff et al. 2000; Cabral et al. 2003; Bodian et al. 2009). As is expected given their

deleterious nature, length variants of the COL1a1 protein have not been found in

humans among samples of the natural population (Chan et al. 2008; Stover and

Verrelli, Chapter 3), which raises the question of how a COL1a1-exon duplication

could have risen to high frequency in chimpanzees.

The rarity of polymorphism among alleles with the exon 35 duplication,

the long-range LD associated with its haplotype, and the estimated age of these

alleles could suggest that the exon duplication rose to its current frequency

relatively recently and rapidly, as might be expected under positive directional

selection (e.g., Sabeti et al. 2006). However, other haplotypes at COL1a1 of

similar frequency also have little associated polymorphism (e.g., a 22.5%

frequency haplotype on the haplogroup B background that only has 5 associated

polymorphisms within our entire sequenced region; fig. 6). Further, LD associated

with the exon duplication-bearing haplotype does not extend beyond that of other

COL1a1 haplotypes of similar frequency. However, within humans, haplotype

blocks can easily extend beyond the ~180-kb length of our sequenced region in

chimpanzees, particularly when affected by directional selection (e.g., Sabeti et al.

2002, 2005; Voight et al. 2006; Tishkoff et al. 2007; Enard et al. 2010). Thus, one

could argue that our sampled region of chromosome 17 does not span enough

distance to be able to identify a signature of positive selection based on patterns of

LD, which is supported by our EHH comparisons that show homozygosity does

not decay within our sequenced region for 3 COL1a1 haplotypes. Thus, while it is

unclear if the pattern here supports an adaptive model for this duplication, it is

highly unlikely that this duplication is deleterious, which in itself is a surprising

result given the impact COL1a1 length variants have in humans.

Functional Implications and Future Directions

Although the pattern consistent with balancing selection does not readily point to

a “functional” SNP, there are numerous polymorphisms that could have

functional implications. First and foremost is the partial exon duplication. Even

though intact intron splice sites border the exon duplication, this does not

necessarily mean that the exon is encoded as part of the mature protein. Rather,

the entire duplicated region may simply be excised during mRNA processing as

part of the original intron 34, which would further support a neutral explanation

for the high frequency of this polymorphism.

Several other possibilities exist within COL1a1, including 4

polymorphisms in the promoter and UTR, which are regions known to affect gene

expression (e.g., Wray et al. 2003; Haygood et al. 2007). Additionally, 8

polymorphisms segregate between these haplogroups in the first intron in which

numerous transcription factor binding sites have been identified (Bornstein et al.

1987; Vergeer et al. 2000; Jin, van’t Hof et al. 2009) such that these

polymorphisms may also affect COL1a1 expression. As unusual patterns of

variation associated with noncoding regions are also seen in human populations, it

remains to be seen how this variation, adaptive or not, reflects functional variation

within and between species. In addition, functional analyses to determine whether

the exon duplication is transcribed and possibly translated would have a dramatic

impact on our assessment of how protein diversity can evolve at COL1a1.

Nonetheless, our results suggest significantly unusual patterns for humans,

chimpanzees, and bonobos for a gene that is otherwise believed to be highly

conserved. This brings into question how population diversity levels may look for

other primates, both closely and distantly related. As these comparative

population and species genetic analyses continue, it is not unreasonable to

speculate that similar bone strength variation and even skeletal disorders exist

within other primate species, as the similar patterns of variation among these groups

predict. Two clear examples, the human Sp1 polymorphism and the chimpanzee exon 35

duplication, have already emerged from population samples. Thus, studying skeletal

phenotypes, other than simply rare diseases, using non-human primates as

evolutionary models becomes an attractive possibility for medical intervention.

Table 5

COL1a1 Diversity Estimates by Gene Region

Region             Sites^a    S^b    θπ^c    D^d
Total              16,989     92     0.21    2.44
Promoter            1,223      4     0.14    1.95
UTR^e                 263      1     0.19    1.66
First intron^f      1,462     11     0.30    2.05
Other introns^f     9,470     64     0.26    2.30
Synonymous          1,207     12     0.41    2.26
Nonsynonymous       3,168      0     0       n/a
Silent^g           10,677     76     0.28    2.37

a Number of nucleotides

b Number of SNPs

c Average number of pairwise differences between sequences (%)

d Tajima’s D statistic

e 5’ and 3’ mRNA untranslated regions (UTR)

f Excludes splice sites

g Includes synonymous and intron sites, excluding the first intron and splice sites
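
As a minimal sketch of how the summary statistics defined in these footnotes can be computed, the following Python code calculates S, the average number of pairwise differences, and Tajima's D from a set of aligned sequences, using the standard variance constants for D. The function name and the toy alignment are hypothetical, and the code assumes equal-length sequences with no missing data; it is not the software used to produce Table 5.

```python
from collections import Counter
from math import sqrt


def diversity_stats(seqs):
    """Return (S, pi, D) for equal-length aligned sequences without missing data.

    S  = number of segregating sites
    pi = average number of pairwise differences per pair of sequences
    D  = Tajima's D, or None when S == 0 (D is then undefined)
    """
    n = len(seqs)
    if n < 4:
        # The variance term of D is zero for n < 4, so D cannot be computed.
        raise ValueError("need at least 4 sequences")

    n_pairs = n * (n - 1) / 2
    S = 0
    pi = 0.0
    for column in zip(*seqs):
        counts = Counter(column)
        if len(counts) > 1:
            S += 1
            # Differing pairs = all pairs minus pairs carrying the same allele.
            same = sum(c * (c - 1) / 2 for c in counts.values())
            pi += (n_pairs - same) / n_pairs
    if S == 0:
        return S, pi, None

    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    D = (pi - S / a1) / sqrt(e1 * S + e2 * S * (S - 1))
    return S, pi, D


toy = ["AA", "AT", "TA", "TT"]  # hypothetical alignment of 4 chromosomes
S, pi, D = diversity_stats(toy)
pi_percent = 100 * pi / len(toy[0])  # per-site diversity in %, the unit used in the table
```

A positive D, as in the rows of Table 5 for which it is defined, reflects an excess of intermediate-frequency variants relative to neutral expectations. With no segregating sites the statistic is undefined, which corresponds to the "n/a" entry for nonsynonymous sites.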

Table 6

COL1a1 Haplogroup-Specific Diversity Estimates

Haplogroup                      n^a    S^b    θπ^c
A                               22     25     0.08
A (excluding recombinants)^d    17      8     0.04
B                               18      4     0.02
Exon duplication-bearing         7      0     0

a Number of chromosomes

b Number of silent-site SNPs >5% allele frequency

c Average number of pairwise differences between sequences (%)

d Haplogroup A members excluding 5 chromosomes predicted to have recombined with Haplogroup B
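
The frequency filter described in footnote b can be sketched as follows. This is an illustration only: the function name and toy data are hypothetical, and the code assumes that ">5% allele frequency" refers to the frequency of the less common allele at a biallelic site in the full sample.

```python
from collections import Counter


def common_snp_sites(seqs, min_freq=0.05):
    """Return indices of biallelic sites whose minor allele frequency exceeds min_freq."""
    n = len(seqs)
    sites = []
    for i, column in enumerate(zip(*seqs)):
        counts = Counter(column)
        if len(counts) != 2:
            continue  # skip monomorphic sites and sites with more than two alleles
        minor_count = min(counts.values())
        if minor_count / n > min_freq:
            sites.append(i)
    return sites


# Hypothetical sample of 40 chromosomes and 3 sites:
# site 0 has a singleton, site 1 has an allele seen 3 times, site 2 is monomorphic.
sample = ["TTA"] + ["ATA"] * 2 + ["AAA"] * 37
kept = common_snp_sites(sample)
```

In a sample of 40 chromosomes, such as the 22 haplogroup A and 18 haplogroup B chromosomes in Table 6, this threshold retains only sites at which the less common allele is observed at least three times.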

Fig. 5. Diagram of a 124-bp duplication in chimpanzees involving a partial

duplication of COL1a1 exon 35 (36 bp) that precedes the normal, full-length 54-

bp exon 35. A partial duplication of intron 34 separates these exons. “AG” and

“GT” indicate intact intron splice sites bordering the partial exon duplication.

[Fig. 5 diagram. Labels: intron 34; partial repeat (intron 34, 88 bp; exon 35, 36 bp); exon 35 (54 bp); intron 35; splice sites AG and GT.]

Fig. 6. Inferred haplotypes for all polymorphisms with an allele frequency >5%.

Chromosomes with identical haplotypes have been combined into one row with

the number per haplotype listed on the side of the figure. The derived allele for

each site, as inferred from human-chimpanzee-macaque-orangutan contrasts, is

represented with a grey box. The 85 polymorphisms at the COL1a1 locus are

indicated with a dashed line at the top of the figure. “5’ region” refers to 35

polymorphisms found within our additional sequenced fragments 5’ of COL1a1

and “3’ region” refers to 46 polymorphisms found within our additional fragments

3’ of COL1a1 as shown in fig. 7. Haplotypes belonging to COL1a1 haplogroup A

are indicated with a solid line on the side of the figure with the first two rows

being haplotypes with the exon duplication (see Results for more information);

remaining haplotypes belong to COL1a1 haplogroup B. See supplementary fig. 1

(Appendix C) for polymorphism positions and allelic states.

Fig. 7. Gene map of the chromosome 17 region surrounding COL1a1, drawn to

scale, with arrows indicating the orientation of genes in the 5’ to 3’ direction (note

this is the reverse orientation of genes in the genome in order to show COL1a1

from left to right). Positions are numbered according to the first base of the first

exon of COL1a1 as position 1 as determined from the human genome reference

sequence (NCBI build 36.1). Solid lines above the position scale indicate 1-2 kb

regions 5’ (to the left) and 3’ (to the right) of COL1a1 that were PCR amplified

and sequenced in 40 P. t. verus and 26 P. paniscus chromosomes. See Results for

more information.

[Fig. 7 gene map. Gene labels: COL1a1, TMEM92, SGCA, PPP1R9B, SAMD. Horizontal axis: Nucleotide Position (kb), from -100 to 100.]

CHAPTER 5: CONCLUSION

The main objective of this research was to examine the recent and ancient

evolutionary history of the COL1a1 gene, which codes for the primary subunit of

type I collagen, the main structural and most abundant protein in mammals, to

gain new perspectives on the molecular origins of human skeletal disease related

to reduced bone strength. The molecular variation at this gene was characterized

using three timescales: historically over the past ~450 million years (My) of

vertebrate evolution, recently within and among human populations, and within

the past 4-6 My since the divergence of human and chimpanzee. These timescales

allow for fine-scale resolution of evolutionary change at COL1a1, generating an

expectation based upon historic change and a means for estimating when shifts

may have occurred in selective pressures affecting this locus that could explain

the prevalence of disease-associated mutations (DAMs) in humans.

As discussed in Chapter 2, the COL1a1 amino acid sequence has been

highly conserved during vertebrate evolution; however, temporal and spatial

variation in selective constraint is still apparent among protein domains, which

may be contributing to bone phenotypic variation among vertebrates. Further, it

was shown that this variation in selective constraint can be used to predict the

phenotypic severity of human DAMs. In addition to the COL1a1 protein, this

locus is characterized by the conservation of unusually short, GC-rich introns,

which may be in response to strong stabilizing selection to maintain increased

gene expression. This functional constraint has even led to reduced human-chimpanzee

intron divergence relative to expectations for a GC-rich region.

Though historically considered to be non-functional, COL1a1 introns and the

variation within them may actually impact bone-related phenotypes. Given these

inferences from a molecular evolutionary model, it would be of particular interest

to determine if these patterns of functional constraint are consistent genome-wide

among genes known to be highly expressed across vertebrate lineages.

Specifically, compared to neutral expectations, is reduced human-chimpanzee

intron divergence typical of highly expressed genes? If so, this pattern would

offer further evidence in support of the importance of intron structure and

composition to the efficiency of gene expression, which would greatly impact

future research in the identification of DAMs in general.

As discussed in Chapter 3, in contrast to historic divergence, no

significant reduction in intron variation was observed recently within humans.

Nonetheless, significant haplotype differentiation was found among populations,

including the absence of an entire haplotype block from Asian samples. Increased

haplotype diversity provides evidence of a gene region with an increased rate of

recombination, the location of which suggests that the 5’ region of COL1a1 has

been evolutionarily unlinked from the 3’ region, allowing for independence

between the majority of coding and promoter variation. These results have

important implications for the design of future association studies of bone

phenotypic variation in humans. First, to accurately measure potential

associations with the COL1a1 locus, these studies should use polymorphisms both

5’ and 3’ of the region with elevated recombination. Second, genotype-phenotype

results based upon a single ethnic group cannot be extrapolated to other groups

given the high levels of population differentiation at COL1a1. Because this

differentiation primarily involves noncoding regions, and given the historic

selective constraint on COL1a1 intron structure and composition, it will be

interesting to determine if noncoding variation in general has functional

consequences that contribute to phenotypic differences in bone strength and

disease susceptibility among human populations.

Although the COL1a1 protein has been highly conserved historically,

>9% of the individuals from a random sampling of the natural human population

carry COL1a1 amino acid variation. Based on the evolutionary site model

discussed in Chapter 2, these amino acid variants are predicted to have at least

some impact on bone-related phenotypes. As with vertebrate comparisons, these

data are consistent with spatial variation in selective constraints at COL1a1, with

highly deleterious mutations being rapidly removed by purifying selection and

others of less severe phenotypic impacts remaining polymorphic within humans

and, therefore, likely contributing to population variation in skeletal phenotypes.

As discussed in Chapter 4, the absence of amino acid variation in chimpanzees, as

well as the lack of amino acid divergence between human and chimpanzee,

suggests that the abundance of DAMs and high proportion of amino acid variation

observed in the natural population in humans could be indicative of recent shifts

in selective pressures affecting COL1a1 within the past 4-6 My in the human

lineage. Thus, it would be interesting to sample additional primate populations to

see how this trend compares across species to address whether this level of

variation is truly unique to humans.

Other than protein variation, it appears that both humans and chimpanzees

share unusual patterns for noncoding COL1a1 variation. Specifically, population

variation in the 5’ region in humans, including noncoding variants like the Sp1

polymorphism, and long-range, seemingly ancient haplotype differentiation in

chimpanzees, were detected. This genetic variation may cause expression

differences across human populations and needs to be explored functionally; the

patterns in chimpanzees are likewise surprising and warrant similar analyses to

determine whether both species also show bone strength differences. While this is pure

speculation at this point, these comparative population genetic analyses are,

nonetheless, the first to identify these noncoding patterns of variation.

In addition, a partial exon duplication has reached a relatively high

frequency in chimpanzees, which, in contrast to COL1a1 exon length variants

found in humans, suggests that this duplication is not deleterious. As such, it is

important to determine if it is actually encoded, the resolution of which could

greatly improve our understanding of the evolution of fibrillar collagen gene

structure in which exon duplication was once adaptive in the proliferation of

collagen genes, but is deleterious in humans today. Unfortunately, no source of viable

tissue for the extraction of COL1a1 mRNA from a chimpanzee carrying this

duplication is currently known. If this duplication can be shown to be

transcribed and possibly even translated, it would represent the first example of viable

exon variation within populations and provide a valuable model with

which to study how type I collagen has historically evolved its repetitive

structure. In addition, its adaptive potential may also shed light on how to develop

synthetic treatments to increase bone strength, which are needed for a large

proportion of the human population.

Overall, this research provides new insight into the molecular origins of

skeletal disease in humans as it relates to the COL1a1 locus. Specifically, patterns

of genetic variation are consistent with a history of temporal and spatial variation

in purifying selection, not only affecting coding regions, but also noncoding

regions of COL1a1. From a clinical perspective, noncoding regions of this locus

represent important targets for future investigations in identifying genetic

variation that impacts bone-related phenotypic differences within and among

populations and species. Because COL1a1 is only one of dozens of genes

associated with variation in bone strength and skeletal disease susceptibility, it

will be interesting to determine if similar evolutionary histories are common

among other candidate genes. This work shows that an understanding of even

single amino acid and nucleotide changes at the sequence level over time can

radically alter our perception of bone-strength variation; thus, research programs that

involve molecular evolutionary analyses within and between populations and

species should prove successful in modeling functional bone-related

phenotypes in populations today.

LITERATURE CITED

Abate N, Chandalia M. 2003. The impact of ethnicity on type 2 diabetes. J. Diabetes Complications 17:39-58.

Abbott S, Trinkaus E, Burr DB. 1996. Dynamic bone remodeling in later Pleistocene fossil hominids. Am. J. Phys. Anthropol. 99:585-601.

Adami S, Bertoldo F, Braga V, Fracassi E, Gatti D, Gandolini G, Minisola S, Rini GB. 2009. 25-hydroxy vitamin D levels in healthy premenopausal women: Association with bone turnover markers and bone mineral density. Bone 45:423-426.

Akashi H, Schaeffer SW. 1997. Natural selection and the frequency distributions of “silent” DNA polymorphism in Drosophila. Genetics 146:295-307.

Aouacheria A, Cluzel C, Lethias C, Gouy M, Garrone R, Exposito JY. 2004. Invertebrate data predict an early emergence of vertebrate fibrillar collagen clades and an anti-incest model. J. Biol. Chem. 279:47711-47719.

Asmussen MA, Clegg MT. 1982. Rates of decay of linkage disequilibrium under 2-locus models of selection. J. Math. Biol. 14:37-70.

Auton A, Bryc K, Boyko AR, et al. (13 co-authors). 2009. Global distribution of genomic diversity underscores rich complex history of continental human populations. Genome Res. 19:795-803.

Bachrach LK, Hastie T, Wang MC, Narasimhan B, Marcus R. 1999. Bone mineral acquisition in healthy Asian, Hispanic, black, and Caucasian youth: A longitudinal study. J. Clin. Endocrinol. Metab. 84:4702-4712.

Bandres E, Pombo I, Gonzalez-Huarriz M, Rebollo A, Lopez G, Garcia-Foncillas J. 2005. Association between bone mineral density and polymorphisms of the VDR, ERalpha, COL1A1 and CTR genes in Spanish postmenopausal women. J. Endocrinol. Invest. 28:312-321.

Barrett-Connor E, Siris ES, Wehren LE, Miller PD, Abbott TA, Berger ML, Santora AC, Sherwood LM. 2005. Osteoporosis and fracture risk in women of different ethnic groups. J. Bone Miner. Res. 20:185-194.

Barsh GS, Roush CL, Bonadio J, Byers PH, Gelinas RE. 1985. Intron-mediated recombination may cause a deletion in an alpha 1 type I collagen chain in a lethal form of osteogenesis imperfecta. Proc. Natl. Acad. Sci. U. S. A. 82:2870-2874.

Basel D, Steiner RD. 2009. Osteogenesis imperfecta: Recent findings shed new light on this once well-understood condition. Genet. Med. 11:375-385.

Baxter-Jones ADG, Burrows M, Bachrach LK, Lloyd T, Petit M, Macdonald H, Mirwald RL, Bailey D, McKay H. 2010. International longitudinal pediatric reference standards for bone mineral content. Bone 46:208-216.

Beavan S, Prentice A, Dibba B, Yan L, Cooper C, Ralston SH. 1998. Polymorphism of the collagen type Ialpha1 gene and ethnic differences in hip-fracture rates. N. Engl. J. Med. 339:351-352.

Becquet C, Patterson N, Stone AC, Przeworski M, Reich D. 2007. Genetic structure of chimpanzee populations. PLoS Genet. 3:e66.

Bernard MP, Chu ML, Myers JC, Ramirez F, Eikenberry EF, Prockop DJ. 1983. Nucleotide sequences of complementary deoxyribonucleic acids for the pro alpha 1 chain of human type I procollagen. Statistical evaluation of structures that are conserved during evolution. Biochemistry 22:5213-5223.

Bersaglieri T, Sabeti PC, Patterson N, Vanderploeg T, Schaffner SF, Drake JA, Rhodes M, Reich DE, Hirschhorn JN. 2004. Genetic signatures of strong recent positive selection at the lactase gene. Am. J. Hum. Genet. 74:1111-1120.

Black A, Tilmont EM, Handy AM, Scott WW, Shapses SA, Ingram DK, Roth GS, Lane MA. 2001. A nonhuman primate model of age-related bone loss: A longitudinal study in male and premenopausal female rhesus monkeys. Bone 28:295-302.

Blekhman R, Man O, Herrmann L, Boyko AR, Indap A, Kosiol C, Bustamante CD, Teshima KM, Przeworski M. 2008. Natural selection on genes that underlie human disease susceptibility. Curr. Biol. 18:883-889.

Blekhman R, Oshlack A, Chabot AE, Smyth GK, Gilad Y. 2008. Gene regulation in primates evolves under tissue-specific selection pressures. PLoS Genet. 4:e1000271.

Bodian DL, Chan TF, Poon A, Schwarze U, Yang K, Byers PH, Kwok PY, Klein TE. 2009. Mutation and polymorphism spectrum in osteogenesis imperfecta type II: Implications for genotype-phenotype relationships. Hum. Mol. Genet. 18:463-471.

Bodian DL, Madhan B, Brodsky B, Klein TE. 2008. Predicting the clinical lethality of osteogenesis imperfecta from collagen glycine mutations. Biochemistry 47:5424-5432.

Bogin B, Smith BH. 1996. Evolution of the human life cycle. Am. J. Hum. Biol. 8:703-716.

Boot-Handford RP, Tuckwell DS. 2003. Fibrillar collagen: The key to vertebrate evolution? A tale of molecular incest. Bioessays 25:142-151.

Bornstein P, McKay J, Morishima JK, Devarayalu S, Gelinas RE. 1987. Regulatory elements in the first intron contribute to transcriptional control of the human alpha 1(I) collagen gene. Proc. Natl. Acad. Sci. U. S. A. 84:8869-8873.

Boyko AR, Williamson SH, Indap AR, et al. (14 co-authors). 2008. Assessing the evolutionary impact of amino acid mutations in the human genome. PLoS Genet. 4:e1000083.

Brown LB, Streeten EA, Shapiro JR, McBride D, Shuldiner AR, Peyser PA, Mitchell BD. 2005. Genetic and environmental influences on bone mineral density in pre- and post-menopausal women. Osteoporosis Int. 16:1849-1856.

Burrows NP. 1999. The molecular genetics of the Ehlers-Danlos syndrome. Clin. Exp. Dermatol. 24:99-106.

Bustamante CD, Fledel-Alon A, Williamson S, et al. (14 co-authors). 2005. Natural selection on protein-coding genes in the human genome. Nature 437:1153-1157.

Byers PH, Wallis GA, Willing MC. 1991. Osteogenesis imperfecta - translation of mutation to phenotype. J. Med. Genet. 28:433-442.

Cabral WA, Mertts MV, Makareeva E, Colige A, Tekin M, Pandya A, Leikin S, Marini JC. 2003. Type I collagen triplet duplication mutation in lethal osteogenesis imperfecta shifts register of alpha chains throughout the helix and disrupts incorporation of mutant helices into fibrils and extracellular matrix. J. Biol. Chem. 278:10006-10012.

Campbell MC, Tishkoff SA. 2008. African genetic diversity: Implications for human demographic history, modern human origins, and complex disease mapping. Annu. Rev. Genomics Hum. Genet. 9:403-433.

Castillo-Davis CI, Mekhedov SL, Hartl DL, Koonin EV, Kondrashov FA. 2002. Selection for short introns in highly expressed genes. Nat. Genet. 31:415-418.

Cavalli-Sforza LL, Feldman MW. 2003. The application of molecular genetic approaches to the study of human evolution. Nat. Genet. 33:266-275.

Cerroni AM, Tomlinson GA, Turnquist JE, Grynpas MD. 2000. Bone mineral density, osteopenia, and osteoporosis in the rhesus macaques of Cayo Santiago. Am. J. Phys. Anthropol. 113:389-410.

Chamary JV, Hurst LD. 2005. Evidence for selection on synonymous mutations affecting stability of mRNA secondary structure in mammals. Genome Biol. 6:R75.

Chan TF, Poon A, Basu A, Addleman NR, Chen J, Phong A, Byers PH, Klein TE, Kwok PY. 2008. Natural variation in four human collagen genes across an ethnically diverse population. Genomics 91:307-314.

Charlesworth D, Charlesworth B, Morgan MT. 1995. The pattern of neutral molecular variation under the background selection model. Genetics 141:1619-1632.

Charlesworth D. 2006. Balancing selection and its effects on sequences in nearby genome regions. PLoS Genet. 2:e64.

Cheung VG, Spielman RS. 2009. Genetics of human gene expression: Mapping DNA variants that influence gene expression. Nat. Rev. Genet. 10:595-604.

Chimpanzee Sequencing Consortium. 2005. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437:69-87.

Chu ML, Dewet W, Bernard M, Ramirez F. 1985. Fine-structural analysis of the human pro-alpha-1(I) collagen gene - promoter structure, AluI repeats, and polymorphic transcripts. J. Biol. Chem. 260:2315-2320.

Claw KG, Tito RY, Stone AC, Verrelli BC. 2010. Haplotype structure and divergence at human and chimpanzee serotonin transporter and receptor genes: Implications for behavioral disorder association analyses. Mol. Biol. Evol. 27:1518-1529.

Clegg MT, Kidwell JF, Horch CR. 1980. Dynamics of correlated genetic systems. V. Rates of decay of linkage disequilibria in experimental populations of Drosophila melanogaster. Genetics 94:217-234.

Cohen MM, Jr. 2006. The new bone biology: Pathologic, molecular, and clinical correlates. Am. J. Med. Genet. A. 140:2646-2706.

Cohn DH, Zhang XM, Byers PH. 1993. Homology-mediated recombination between type-I collagen gene exons results in an internal tandem duplication and lethal osteogenesis imperfecta. Hum. Mutat. 2:21-27.

Comeron JM. 2004. Selective and mutational patterns associated with gene expression in humans: Influences on synonymous composition and intron presence. Genetics 167:1293-1304.

Cotter MM, Simpson SW, Latimer BM, Hernandez CJ. 2009. Trabecular microarchitecture of hominoid thoracic vertebrae. Anat. Rec. (Hoboken) 292:1098-1106.

Crawford DC, Bhangale T, Li N, Hellenthal G, Rieder MJ, Nickerson DA, Stephens M. 2004. Evidence for substantial fine-scale variation in recombination rates across the human genome. Nat. Genet. 36:700-706.

Currey JD. 1987. The evolution of the mechanical properties of amniote bone. J. Biomech. 20:1035-1044.

Dalgleish R. 1997. The human type I collagen mutation database. Nucleic Acids Res. 25:181-187.

Dohi Y, Iki M, Ohgushi H, Gojo S, Tabata S, Kajita E, Nishino H, Yonemasu K. 1998. A novel polymorphism in the promoter region for the human osteocalcin gene: The possibility of a correlation with bone mineral density in postmenopausal Japanese women. J. Bone Miner. Res. 13:1633-1639.

Dvornyk V, Liu XH, Shen H, et al. (13 co-authors). 2003. Differentiation of Caucasians and Chinese at bone mass candidate genes: Implication for ethnic difference of bone mass. Ann. Hum. Genet. 67:216-227.

Efstathiadou Z, Tsatsoulis A, Ioannidis JP. 2001. Association of collagen Ialpha 1 Sp1 polymorphism with the risk of prevalent fractures: A meta-analysis. J. Bone Miner. Res. 16:1586-1592.

Enard D, Depaulis F, Roest Crollius H. 2010. Human and non-human primate genomes share hotspots of positive selection. PLoS Genet. 6:e1000840.

Eriksson J, Hohmann G, Boesch C, Vigilant L. 2004. Rivers influence the population genetic structure of bonobos (Pan paniscus). Mol. Ecol. 13:3425-3435.

Evans PD, Mekel-Bobrov N, Vallender EJ, Hudson RR, Lahn BT. 2006. Evidence that the adaptive allele of the brain size gene microcephalin introgressed into Homo sapiens from an archaic Homo lineage. Proc. Natl. Acad. Sci. U. S. A. 103:18178-18183.

Exposito JY, Cluzel C, Garrone R, Lethias C. 2002. Evolution of collagens. Anat. Rec. 268:302-316.

Fang Y, Van Meurs JBJ, Bergink AP, Hofman A, Van Duijn CM, Van Leeuwen JP, Ap Pols H, Uitterlinden AG. 2003. Cdx-2 polymorphism in the promoter region of the human vitamin D receptor gene determines susceptibility to fracture in the elderly. J. Bone Miner. Res. 18:1632-1641.

Fischer A, Pollack J, Thalmann O, Nickel B, Paabo S. 2006. Demographic history and genetic differentiation in apes. Curr. Biol. 16:1133-1138.

Fischer A, Wiebe V, Paabo S, Przeworski M. 2004. Evidence for a complex demographic history of chimpanzees. Mol. Biol. Evol. 21:799-808.

Fullerton SM, Bernardo Carvalho A, Clark AG. 2001. Local rates of recombination are positively correlated with GC content in the human genome. Mol. Biol. Evol. 18:1139-1142.

Gabriel SB, Schaffner SF, Nguyen H, et al. (18 co-authors). 2002. The structure of haplotype blocks in the human genome. Science 296:2225-2229.

Garcia-Giralt N, Enjuanes A, Bustamante M, Mellibovsky L, Nogues X, Carreras R, Diez-Perez A, Grinberg D, Balcells S. 2005. In vitro functional assay of alleles and haplotypes of two COL1A1-promoter SNPs. Bone 36:902-908.

Garcia-Giralt N, Nogues X, Enjuanes A, Puig J, Mellibovsky L, Bay-Jensen A, Carreras R, Balcells S, Diez-Perez A, Grinberg D. 2002. Two new single-nucleotide polymorphisms in the COL1A1 upstream regulatory region and their relationship to bone mineral density. J. Bone Miner. Res. 17:384-393.

Garrigan D, Hammer MF. 2006. Reconstructing human origins in the genomic era. Nat. Rev. Genet. 7:669-680.

Gazave E, Marques-Bonet T, Fernando O, Charlesworth B, Navarro A. 2007. Patterns and rates of intron divergence between humans and chimpanzees. Genome Biol. 8:R21.

Ge B, Pokholok DK, Kwan T, et al. (27 co-authors). 2009. Global patterns of cis variation in human cells revealed by high-density allelic expression analysis. Nat. Genet. 41:1216-U78.

Gelse K, Poschl E, Aigner T. 2003. Collagens--structure, function, and biosynthesis. Adv. Drug Deliv. Rev. 55:1531-1546.

Gilad Y, Bustamante CD, Lancet D, Paabo S. 2003. Natural selection on the olfactory receptor gene family in humans and chimpanzees. Am. J. Hum. Genet. 73:489-501.

Gong G, Haynatzki G, Haynatzka V, Howell R, Kosoko-Lasaki S, Fu YX, Yu F, Gallagher JC, Wilson MR. 2006. Bone mineral density-affecting genes in Africans. J. Natl. Med. Assoc. 98:1102-1108.

Gong G, Haynatzki G. 2003. Association between bone mineral density and candidate genes in different ethnic populations and its implications. Calcif. Tissue Int. 72:113-123.

Grant SFA, Reid DM, Blake G, Herd R, Fogelman I, Ralston SH. 1996. Reduced bone density and osteoporosis associated with a polymorphic Sp1 binding site in the collagen type I alpha 1 gene. Nat. Genet. 14:203-205.

Gueguen R, Jouanny P, Guillemin F, Kuntz C, Pourel J, Siest G. 1995. Segregation analysis and variance-components analysis of bone-mineral density in healthy families. J. Bone Miner. Res. 10:2017-2022.

Gunji H, Hosaka K, Huffman MA, Kawanaka K, Matsumoto-Oda A, Hamada Y, Nishida T. 2003. Extraordinarily low bone mineral density in an old female chimpanzee (Pan troglodytes schweinfurthii) from the Mahale Mountains National Park. Primates 44:145-149.

Hackenberg M, Bernaola-Galvan P, Carpena P, Oliver JL. 2005. The biased distribution of Alus in human isochores might be driven by recombination. J. Mol. Evol. 60:365-377.

Hajjar I, Kotchen JM, Kotchen TA. 2006. Hypertension: Trends in prevalence, incidence, and control. Annu. Rev. Public Health 27:465-490.

Han KO, Moon IG, Hwang CS, Choi JT, Yoon HK, Min HK, Han IK. 1999. Lack of an intronic Sp1 binding-site polymorphism at the collagen type I alpha 1 gene in healthy Korean women. Bone 24:135-137.

Hancock AM, Witonsky DB, Ehler E, et al. (11 co-authors). 2010. Colloquium paper: Human adaptations to diet, subsistence, and ecoregion are due to subtle shifts in allele frequency. Proc. Natl. Acad. Sci. U. S. A. 107 Suppl 2:8924-8930.

Havill LM, Mahaney MC, Cox LA, Morin PA, Joslyn G, Rogers J. 2005. A quantitative trait locus for normal variation in forearm bone mineral density in pedigreed baboons maps to the ortholog of human chromosome 11q. J. Clin. Endocrinol. Metab. 90:3638-3645.

Havill LM, Mahaney MC, Czerwinski SA, Carey KD, Rice K, Rogers J. 2003. Bone mineral density reference standards in adult baboons (Papio hamadryas) by sex and age. Bone 33:877-888.

Haygood R, Fedrigo O, Hanson B, Yokoyama KD, Wray GA. 2007. Promoter regions of many neural- and nutrition-related genes have experienced positive selection during human evolution. Nat. Genet. 39:1140-1144.

Hedges SB, Kumar S. 2002. Genomics. Vertebrate genomes compared. Science 297:1283-1285.

Hellmann I, Ebersberger I, Ptak SE, Paabo S, Przeworski M. 2003. A neutral explanation for the correlation of diversity with recombination rates in humans. Am. J. Hum. Genet. 72:1527-1535.

Hernandez RD, Williamson SH, Bustamante CD. 2007. Context dependence, ancestral misidentification, and spurious signatures of natural selection. Mol. Biol. Evol. 24:1792-1800.

Hildebrand KA, Gallant-Behm CL, Kydd AS, Hart DA. 2005. The basics of soft tissue healing and general factors that influence such healing. Sports Med. Arthrosc. 13:136-144.

Ho NC, Jia L, Driscoll CC, Gutter EM, Francomano CA. 2000. A skeletal gene database. J. Bone Miner. Res. 15:2095-2122.

Hudson RR, Slatkin M, Maddison WP. 1992. Estimation of levels of gene flow from DNA sequence data. Genetics 132:583-589.

Hudson RR. 2000. A new statistic for detecting genetic differentiation. Genetics 155:2011-2014.

Hudson RR. 2001. Two-locus sampling distributions and their application. Genetics 159:1805-1817.

Hudson RR. 2002. Generating samples under a wright-fisher neutral model of genetic variation. Bioinformatics 18:337-338.

Hurst LD, McVean G, Moore T. 1996. Imprinted genes have few and small introns. Nat. Genet. 12:234-237.

International Human Genome Sequencing Consortium. 2001. Initial sequencing and analysis of the human genome. Nature 409:860-921.

Ioannidis JP, Ng MY, Sham PC, et al. (28 co-authors). 2007. Meta-analysis of genome-wide scans provides evidence for sex- and site-specific regulation of bone mass. J. Bone Miner. Res. 22:173-183.

Jiang H, Lei SF, Xiao SM, Chen Y, Sun X, Yang F, Li LM, Wu S, Deng HW. 2007. Association and linkage analysis of COL1A1 and AHSG gene polymorphisms with femoral neck bone geometric parameters in both caucasian and chinese nuclear families. Acta Pharmacol. Sin. 28:375-381.

Jin H, Stewart TL, Hof RV, Reid DM, Aspden RM, Ralston S. 2009. A rare haplotype in the upstream regulatory region of COL1A1 is associated with reduced bone quality and hip fracture. J. Bone Miner. Res. 24:448-454.

Jin H, van't Hof RJ, Albagha OM, Ralston SH. 2009. Promoter and intron 1 polymorphisms of COL1A1 interact to regulate transcription and susceptibility to osteoporosis. Hum. Mol. Genet. 18:2729-2738.

Jones DT, Taylor WR, Thornton JM. 1992. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8:275-282.

Kasowski M, Grubert F, Heffelfinger C, et al. (17 co-authors). 2010. Variation in transcription factor binding among humans. Science 328:232-235.

Kaufman J, Ostertag A, Saint-Pierre A, Cohen-Solal M, Boland A, Van Pottelbergh I, Toye K, de Vernejoul M, Martinez M. 2008. Genome-wide linkage screen of bone mineral density (BMD) in european pedigrees ascertained through a male relative with low BMD values: Evidence for quantitative trait loci on 17q21-23, 11q12-13, 13q12-14, and 22q11. J. Clin. Endocrinol. Metab. 93:3755-3762.

Kikuchi Y, Udono T, Hamada Y. 2003. Bone mineral density in chimpanzees, humans, and japanese macaques. Primates 44:151-155.

Koller DL, Ichikawa S, Lai D, et al. (13 co-authors). 2010. Genome-wide association study of bone mineral density in premenopausal european-american women and replication in african-american women. J. Clin. Endocrinol. Metab. 95:1802-1809.

Kudla G, Lipinski L, Caffin F, Helwak A, Zylicz M. 2006. High guanine and cytosine content increases mRNA levels in mammalian cells. PLoS Biol. 4:e180.

Kuivaniemi H, Tromp G, Prockop DJ. 1997. Mutations in fibrillar collagens (types I, II, III, and XI), fibril-associated collagen (type IX), and network-forming collagen (type X) cause a spectrum of diseases of bone, cartilage, and blood vessels. Hum. Mutat. 9:300-315.

Kumar S, Filipski A, Swarna V, Walker A, Hedges SB. 2005. Placing confidence limits on the molecular age of the human-chimpanzee divergence. Proc. Natl. Acad. Sci. U. S. A. 102:18842-18847.

Kumar S, Nei M, Dudley J, Tamura K. 2008. MEGA: A biologist-centric software for evolutionary analysis of DNA and protein sequences. Brief Bioinform. 9:299-306.

Larsen CS. 1995. Biological changes in human-populations with agriculture. Annu. Rev. Anthropol. 24:185-213.

Lau CS, Yin G, Mok MY. 2006. Ethnic and geographical differences in systemic lupus erythematosus: An overview. Lupus 15:715-719.

Lau EMC, Choy DTK, Li M, Woo J, Chung T, Sham A. 2004. The relationship between COLI A1 polymorphisms (sp 1) and COLI A2 polymorphisms (eco R1 and puv II) with bone mineral density in chinese men and women. Calcif. Tissue Int. 75:133-137.

Lau HHL, Ng MYM, Ho AYY, Luk KDK, Kung AWC. 2005. Genetic and environmental determinants of bone mineral density in chinese women. Bone 36:700-709.

Lauderdale DS, Jacobsen SJ, Furner SE, Levy PS, Brody JA, Goldberg J. 1997. Hip fracture incidence among elderly asian-american populations. Am. J. Epidemiol. 146:502-509.

Laval G, Patin E, Barreiro LB, Quintana-Murci L. 2010. Formulating a historical and demographic model of recent human evolution based on resequencing data from noncoding regions. PLoS One 5:e10284.

Leuenberger C, Wegmann D. 2010. Bayesian computation and model selection without likelihoods. Genetics 184:243-252.

Li N, Stephens M. 2003. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics 165:2213-2233.

Lipkin EW, Aumann CA, Newell-Morris LL. 2001. Evidence for common controls over inheritance of bone quantity and body size from segregation analysis in a pedigreed colony of nonhuman primates (macaca nemestrina). Bone 29:249-257.

Liu YJ, Shen H, Xiao P, Xiong DH, Li LH, Recker RR, Deng HW. 2006. Molecular genetic studies of gene identification for osteoporosis: A 2004 update. J. Bone Miner. Res. 21:1511-1535.

Liu YZ, Liu YJ, Recker RR, Deng HW. 2003. Molecular studies of identification of genes for osteoporosis: The 2002 update. J. Endocrinol. 177:147-196.

Lohmueller KE, Indap AR, Schmidt S, et al. (12 co-authors). 2008. Proportionally more deleterious genetic variation in european than in african populations. Nature 451:994-997.

Long JR, Zhao LJ, Liu PY, et al. (11 co-authors). 2004. Patterns of linkage disequilibrium and haplotype distribution in disease candidate genes. BMC Genet. 5:11.

Looker AC, Orwoll ES, Johnston CC,Jr, Lindsay RL, Wahner HW, Dunn WL, Calvo MS, Harris TB, Heyse SP. 1997. Prevalence of low femoral bone density in older U.S. adults from NHANES III. J. Bone Miner. Res. 12:1761-1768.

Looker AC, Wahner HW, Dunn WL, Calvo MS, Harris TB, Heyse SP, Johnston CC,Jr, Lindsay R. 1998. Updated data on proximal femur bone mineral levels of US adults. Osteoporos. Int. 8:468-489.

Majewski J, Ott J. 2002. Distribution and characterization of regulatory elements in the human genome. Genome Res. 12:1827-1836.

Mann V, Hobson EE, Li BH, Stewart TL, Grant SFA, Robins SP, Aspden RM, Ralston SH. 2001. A COL1A1 Sp1 binding site polymorphism predisposes to osteoporotic fracture by affecting bone density and quality. J. Clin. Invest. 107:899-907.

Mann V, Ralston SH. 2003. Meta-analysis of COLIA1 Sp1 polymorphism in relation to bone mineral density and osteoporotic fracture. Bone 32:711-717.

Marini JC, Forlino A, Cabral WA, et al. (27 co-authors). 2007. Consortium for osteogenesis imperfecta mutations in the helical domain of type I collagen: Regions rich in lethal mutations align with collagen binding sites for integrins and proteoglycans. Hum. Mutat. 28:209-221.

Matkovic V, Fontana D, Tominac C, Goel P, Chesnut CH. 1990. Factors that influence peak bone mass formation - a study of calcium balance and the inheritance of bone mass in adolescent females. Am. J. Clin. Nutr. 52:878-888.

Matsumura A, Gunji H, Takahashi Y, Nishida T, Okada M. 2010. Cross-sectional morphology of the femoral neck of wild chimpanzees. Int. J. Primatol. 31:219-238.

McDonald JH, Kreitman M. 1991. Adaptive protein evolution at the adh locus in drosophila. Nature 351:652-654.

McVean G, Awadalla P, Fearnhead P. 2002. A coalescent-based method for detecting and estimating recombination from gene sequences. Genetics 160:1231-1241.

Melton LJ. 1997. The prevalence of osteoporosis. J. Bone Miner. Res. 12:1769-1771.

Milewicz DM, Byers PH, Reveille J, Hughes AL, Duvic M. 1996. A dimorphic alu sb-like insertion in COL3A1 is ethnic-specific. J. Mol. Evol. 42:117-123.

Miller MP, Kumar S. 2001. Understanding human disease mutations through the use of interspecific genetic variation. Hum. Mol. Genet. 10:2319-2328.

Morgan CC, Loughran NB, Walsh TA, Harrison AJ, O'Connell MJ. 2010. Positive selection neighboring functionally essential sites and disease-implicated regions of mammalian reproductive proteins. BMC Evol. Biol. 10:39.

Mulhern DM, Ubelaker DH. 2003. Histologic examination of bone development in juvenile chimpanzees. Am. J. Phys. Anthropol. 122:127-133.

Mulhern DM, Ubelaker DH. 2009. Bone microstructure in juvenile chimpanzees. Am. J. Phys. Anthropol. 140:368-375.

Musumeci M, Vadala G, Tringali G, Insirello E, Roccazzello AM, Simpore J, Musumeci S. 2009. Genetic and environmental factors in human osteoporosis from sub-saharan to mediterranean areas. J. Bone Miner. Metab. 27:424-434.

Nakajima T, Ota N, Shirai Y, Hata A, Yoshida H, Suzuki T, Hosoi T, Orimo H, Emi M. 1999. Ethnic difference in contribution of Sp1 site variation of COLIA1 gene in genetic predisposition to osteoporosis. Calcif. Tissue Int. 65:352-353.

Nei M, Gojobori T. 1986. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 3:418-426.

Nei M. 1987. Molecular Evolutionary Genetics. New York: Columbia University Press.

Ng PC, Zhao Q, Levy S, Strausberg RL, Venter JC. 2008. Individual genomes instead of race for personalized medicine. Clin. Phar. Therapeutics 84:306-309.

Ota N, Nakajima T, Nakazawa I, Suzuki T, Hosoi T, Orimo H, Inoue S, Shirai Y, Emi M. 2001. A nucleotide variant in the promoter region of the interleukin-6 gene associated with decreased bone mineral density. J. Hum. Genet. 46:267-272.

Pace JM, Atkinson M, Willing MC, Wallis G, Byers PH. 2001. Deletions and duplications of gly-xaa-yaa triplet repeats in the triple helical domains of type I collagen chains disrupt helix formation and result in several types of osteogenesis imperfecta. Hum. Mutat. 18:319-326.

Payseur BA, Nachman MW. 2002. Natural selection at linked sites in humans. Gene 300:31-42.

Persikov AV, Ramshaw JA, Brodsky B. 2005. Prediction of collagen stability from amino acid sequence. J. Biol. Chem. 280:19343-19349.

Pond SL, Frost SD, Muse SV. 2005. HyPhy: Hypothesis testing using phylogenies. Bioinformatics 21:676-679.

Pond SL, Frost SD. 2005a. A genetic algorithm approach to detecting lineage-specific variation in selection pressure. Mol. Biol. Evol. 22:478-485.

Pond SL, Frost SD. 2005b. Datamonkey: Rapid detection of selective pressure on individual sites of codon alignments. Bioinformatics 21:2531-2533.

Pond SL, Frost SD. 2005c. Not so different after all: A comparison of methods for detecting amino acid sites under selection. Mol. Biol. Evol. 22:1208-1222.

Pozzoli U, Menozzi G, Fumagalli M, Cereda M, Comi GP, Cagliani R, Bresolin N, Sironi M. 2008. Both selective and neutral processes drive GC content evolution in the human genome. BMC Evol. Biol. 8:99.

Prentice A. 2001. The relative contribution of diet and genotype to bone development. Proc. Nutr. Soc. 60:45-52.

Ptak SE, Hinds DA, Koehler K, Nickel B, Patil N, Ballinger DG, Przeworski M, Frazer KA, Paabo S. 2005. Fine-scale recombination patterns differ between chimpanzees and humans. Nat. Genet. 37:429-434.

Qureshi AM, Herd RJ, Blake GM, Fogelman I, Ralston SH. 2002. Colia1 Sp1 polymorphism predicts response of femoral neck bone density to cyclical etidronate therapy. Calcif. Tissue Int. 70:158-163.

Raff ML, Craigen WJ, Smith LT, Keene DR, Byers PH. 2000. Partial COL1A2 gene duplication produces features of osteogenesis imperfecta and ehlers-danlos syndrome type VII. Hum. Genet. 106:19-28.

Ralston SH, Uitterlinden AG, Brandi ML, et al. (32 co-authors). 2006. Large-scale evidence for the effect of the COLIA1 Sp1 polymorphism on osteoporosis outcomes: The GENOMOS study. PLoS Med. 3:e90.

Ralston SH. 2010. Genetics of osteoporosis. Ann. NY Acad. Sci. 1192:181-189.

Rauch F, Lalic L, Roughley P, Glorieux FH. 2010. Genotype-phenotype correlations in nonlethal osteogenesis imperfecta caused by mutations in the helical domain of collagen type I. Eur. J. Hum. Genet. 18:642-647.

Reginster JY, Burlet N. 2006. Osteoporosis: A still increasing prevalence. Bone 38:S4-9.

Rensberger JM, Watabe M. 2000. Fine structure of bone in dinosaurs, birds and mammals. Nature 406:619-622.

Rivadeneira F, Styrkarsdottir U, Estrada K, et al. (36 co-authors). 2009. Twenty bone-mineral-density loci identified by large-scale meta-analysis of genome-wide association studies. Nat. Genet. 41:1199-U58.

Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA, Feldman MW. 2002. Genetic structure of human populations. Science 298:2381-2385.

Rozas J, Sanchez-DelBarrio JC, Messeguer X, Rozas R. 2003. DnaSP, DNA polymorphism analyses by the coalescent and other methods. Bioinformatics 19:2496-2497.

Sabeti PC, Reich DE, Higgins JM, et al. (17 co-authors). 2002. Detecting recent positive selection in the human genome from haplotype structure. Nature 419:832-837.

Sabeti PC, Schaffner SF, Fry B, Lohmueller J, Varilly P, Shamovsky O, Palma A, Mikkelsen TS, Altshuler D, Lander ES. 2006. Positive natural selection in the human lineage. Science 312:1614-1620.

Sabeti PC, Varilly P, Fry B, et al. (267 co-authors). 2007. Genome-wide detection and characterization of positive selection in human populations. Nature 449:913-918.

Sabeti PC, Walsh E, Schaffner SF, et al. (15 co-authors). 2005. The case for selection at CCR5-Delta32. PLoS Biol. 3:e378.

Sachidanandam R, Weissman D, Schmidt SC, et al. (42 co-authors). 2001. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409:928-933.

Saunders MA, Good JM, Lawrence EC, Ferrell RE, Li WH, Nachman MW. 2006. Human adaptive evolution at myostatin (GDF8), a regulator of muscle growth. Am. J. Hum. Genet. 79:1089-1097.

Scheinfeldt LB, Biswas S, Madeoy J, Connelly CF, Schadt EE, Akey JM. 2009. Population genomic analysis of ALMS1 in humans reveals a surprisingly complex evolutionary history. Mol. Biol. Evol. 26:1357-1367.

Shen H, Recker RR, Deng HW. 2003. Molecular and genetic mechanisms of osteoporosis: Implication for treatment. Curr. Mol. Med. 3:737-757.

Sillence DO, Senn A, Danks DM. 1979. Genetic-heterogeneity in osteogenesis imperfecta. J. Med. Genet. 16:101-116.

Smith MW, Lautenberger JA, Shin HD, Chretien JP, Shrestha S, Gilbert DA, O'Brien SJ. 2001. Markers for mapping by admixture linkage disequilibrium in african american and hispanic populations. Am. J. Hum. Genet. 69:1080-1094.

Smith MW, O'Brien SJ. 2005. Mapping by admixture linkage disequilibrium: Advances, limitations and guidelines. Nat. Rev. Genet. 6:623-632.

Soares P, Achilli A, Semino O, Davies W, Macaulay V, Bandelt HJ, Torroni A, Richards MB. 2010. The archaeogenetics of europe. Curr. Biol. 20:R174-83.

Spotila LD, Colige A, Sereda L, et al. (15 co-authors). 1994. Mutation analysis of coding sequences for type-i procollagen in individuals with low bone-density. J. Bone Miner. Res. 9:923-932.

Stephens M, Smith NJ, Donnelly P. 2001. A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet. 68:978-989.

Stewart TL, Jin H, McGuigan FE, et al. (11 co-authors). 2006. Haplotypes defined by promoter and intron 1 polymorphisms of the COLIA1 gene regulate bone mineral density in women. J. Clin. Endocrinol. Metab. 91:3575-3583.

Stoll C, Dott B, Roth MP, Alembik Y. 1989. Birth prevalence rates of skeletal dysplasias. Clin. Genet. 35:88-92.

Stone AC, Griffiths RC, Zegura SL, Hammer MF. 2002. High levels of Y-chromosome nucleotide diversity in the genus pan. Proc. Natl. Acad. Sci. U. S. A. 99:43-48.

Stover DA, Verrelli BC. 2010. Comparative vertebrate evolutionary analyses of type I collagen: potential of COL1a1 gene structure and intron variation for common bone-related diseases. Mol. Biol. Evol., doi: 10.1093/molbev/msq221.

Su AI, Wiltshire T, Batalov S, et al. (13 co-authors). 2004. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc. Natl. Acad. Sci. U. S. A. 101:6062-6067.

Subramanian S, Kumar S. 2003. Neutral substitutions occur at a faster rate in exons than in noncoding DNA in primate genomes. Genome Res. 13:838-844.

Subramanian S, Kumar S. 2006. Evolutionary anatomies of positions and types of disease-associated and neutral amino acid mutations in the human genome. BMC Genomics 7:306.

Sumner DR, Morbeck ME, Lobick JJ. 1989. Apparent age-related bone loss among adult female gombe chimpanzees. Am. J. Phys. Anthropol. 79:225-234.

Tajima F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585-595.

Tamura K, Dudley J, Nei M, Kumar S. 2007. MEGA4: Molecular evolutionary genetics analysis (MEGA) software version 4.0. Mol. Biol. Evol. 24:1596-1599.

Thomson R, Pritchard JK, Shen P, Oefner PJ, Feldman MW. 2000. Recent common ancestry of human Y chromosomes: Evidence from DNA sequence data. Proc. Natl. Acad. Sci. U. S. A. 97:7360-7365.

Tishkoff SA, Reed FA, Ranciaro A, et al. (19 co-authors). 2007. Convergent adaptation of human lactase persistence in africa and europe. Nat. Genet. 39:31-40.

Tishkoff SA, Varkonyi R, Cahinhinan N, et al. (17 co-authors). 2001. Haplotype diversity and linkage disequilibrium at human G6PD: Recent origin of alleles that confer malarial resistance. Science 293:455-462.

Tishkoff SA, Verrelli BC. 2003. Patterns of human genetic diversity: Implications for human evolutionary history and disease. Annu. Rev. Genomics Hum. Genet. 4:293-340.

Urrutia AO, Hurst LD. 2001. Codon usage bias covaries with expression breadth and the rate of synonymous evolution in humans, but this is not evidence for selection. Genetics 159:1191-1199.

Urrutia AO, Hurst LD. 2003. The signature of selection mediated by expression on human genes. Genome Res. 13:2260-2264.

Valkkila M, Melkoniemi M, Kvist L, Kuivaniemi H, Tromp G, Ala-Kokko L. 2001. Genomic organization of the human COL3A1 and COL5A2 genes: COL5A2 has evolved differently than the other minor fibrillar collagen genes. Matrix Biol. 20:357-366.

Vergeer WP, Sogo JM, Pretorius PJ, de Vries WN. 2000. Interaction of Ap1, Ap2, and Sp1 with the regulatory regions of the human pro-alpha1(I) collagen gene. Arch. Biochem. Biophys. 377:69-79.

Verrelli BC, Lewis CM,Jr, Stone AC, Perry GH. 2008. Different selective pressures shape the molecular evolution of color vision in chimpanzee and human populations. Mol. Biol. Evol. 25:2735-2743.

Verrelli BC, McDonald JH, Argyropoulos G, Destro-Bisol G, Froment A, Drousiotou A, Lefranc G, Helal AN, Loiselet J, Tishkoff SA. 2002. Evidence for balancing selection from nucleotide sequence analyses of human G6PD. Am. J. Hum. Genet. 71:1112-1128.

Verrelli BC, Tishkoff SA, Stone AC, Touchman JW. 2006. Contrasting histories of G6PD molecular evolution and malarial resistance in humans and chimpanzees. Mol. Biol. Evol. 23:1592-1601.

Verrelli BC, Tishkoff SA. 2004. Signatures of selection and gene conversion associated with human color vision variation. Am. J. Hum. Genet. 75:363-375.

Videman T, Levalahti E, Battie MC, Simonen R, Vanninen E, Kaprio J. 2007. Heritability of BMD of femoral neck and lumbar spine: A multivariate twin study of finnish men. J. Bone Miner. Res. 22:1455-1462.

Viguet-Carrin S, Garnero P, Delmas PD. 2006. The role of collagen in bone strength. Osteoporosis Int. 17:319-336.

Vinogradov AE. 2003. DNA helix: The importance of being GC-rich. Nucleic Acids Res. 31:1838-1844.

Voight BF, Kudaravalli S, Wen X, Pritchard JK. 2006. A map of recent positive selection in the human genome. PLoS Biol. 4(3):e72.

Wada H, Okuyama M, Satoh N, Zhang S. 2006. Molecular evolution of fibrillar collagen in chordates, with implications for the evolution of vertebrate skeletons and chordate phylogeny. Evol. Dev. 8:370-377.

Wang X, Mabrey JD, Agrawal CM. 1998. An interspecies comparison of bone fracture properties. Biomed. Mater. Eng. 8:1-9.

Watterson GA. 1975. On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7:256-276.

Wegmann D, Excoffier L. 2010. Bayesian inference of the demographic history of chimpanzees. Mol. Biol. Evol. 27:1425-1435.

Winckler W, Myers SR, Richter DJ, et al. (11 co-authors). 2005. Comparison of fine-scale recombination rates in humans and chimpanzees. Science 308:107-111.

Won YJ, Hey J. 2005. Divergence population genetics of chimpanzees. Mol. Biol. Evol. 22:297-307.

Wooding S, Bufe B, Grassi C, Howard MT, Stone AC, Vazquez M, Dunn DM, Meyerhof W, Weiss RB, Bamshad MJ. 2006. Independent evolution of bitter-taste sensitivity in humans and chimpanzees. Nature 440:930-934.

Wooding S, Stone AC, Dunn DM, Mummidi S, Jorde LB, Weiss RK, Ahuja S, Bamshad MJ. 2005. Contrasting effects of natural selection on human and chimpanzee CC chemokine receptor 5. Am. J. Hum. Genet. 76:291-301.

Wray GA, Hahn MW, Abouheif E, Balhoff JP, Pizer M, Rockman MV, Romano LA. 2003. The evolution of transcriptional regulation in eukaryotes. Mol. Biol. Evol. 20:1377-1419.

Wu DD, Zhang YP. 2010. Positive selection drives population differentiation in the skeletal genes in modern humans. Hum. Mol. Genet. 19:2341-2346.

Xu G, Bhatnagar V, Wen G, Hamilton BA, Eraly SA, Nigam SK. 2005. Analyses of coding region polymorphisms in apical and basolateral human organic anion transporter (OAT) genes [OAT1 (NKT), OAT2, OAT3, OAT4, URAT (RST)]. Kidney Int. 68:1491-1499.

Yamada Y, Avvedimento VE, Mudryj M, Ohkubo H, Vogeli G, Irani M, Pastan I, Decrombrugghe B. 1980. The collagen gene - evidence for its evolutionary assembly by amplification of a dna segment containing an exon of 54 bp. Cell 22:887-892.

Yang Z. 2007. PAML 4: Phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24:1586-1591.

Yu N, Jensen-Seaman MI, Chemnick L, Kidd JR, Deinard AS, Ryder O, Kidd KK, Li WH. 2003. Low nucleotide diversity in chimpanzees and bonobos. Genetics 164:1511-1518.

Zhang Z, Townsend JP. 2009. Maximum-likelihood model averaging to profile clustering of site types across discrete linear sequences. PLoS Comput. Biol. 5:e1000421.

APPENDIX A

SUPPLEMENTARY MATERIAL: CHAPTER 2

Supplementary Table 1a

Human Clade A Collagen Gene Exon Lengths

Exon Length (bp)

Exon #  COL1a1  COL1a2  COL2a1  COL3a1  COL5a2
1          103      70      85      79      97
2          195      11     207     203     225
3           35      15      17      51      14
4           36      36      33     114      33
5          102      93      33      81      33
6           72      54      54      54      54
7           45      45     102      54     111
8           54      54      78      54      78
9           54      54      45      54      45
10          54      54      54      54      54
11          54      54      54      54      54
12          54      54      54      45      54
13          45      45      54      54      54
14          54      54      54      45      54
15          45      45      45      54      45
16          54      54      54      99      54
17          99      99      45      45      45
18          45      45      54      99      54
19          99      99      99      54      99
20          54      54      45     108      45
21         108     108      99      54      99
22          54      54      54      99      54
23          99      99     108      54     108
24          54      54      54      99      54
25          99      99      99      54      99
26          54      54      54      54      54
27          54      54      99      54      99
28          54      54      54      54      54
29          54      54      54      45      54
30          45      45      54      99      54
31          99      99      54     108      54
32         108     108      45      54      45
33         108      54      99      54      99
34          54      54     108      54     108
35          54      54      54      54      54
36         108      54      54     108      54
37          54     108      54      54      54
38          54      54      54      54      54
39         162      54     108     162     108
40         108     162      54     108      54
41         108     108      54     108      54
42          54     108     162      54     162
43         108      54     108     108     108
44          54     108     108      54     108
45         108      54      54     108      54
46          54     108     108      54     108
47         108      54      54     108      54
48         283     108     108     298     108
49         191     259      54     188      54
50         243     185     108     243     108
51         144     243     289     144     292
52           -     144     188       -     188
53           -       -     243       -     240
54           -       -     144       -     144

Note: Lines within columns give approximate locations of triple-helix domain boundaries. RxC chi-squared tests (see supplementary table 1c) were conducted on comparisons among genes of binned distributions with bins in increments of 50 bp, up to 300 bp (e.g., 50, 100, 150, etc.). These bin intervals enabled the best resolution in comparisons among genes and bin size did not alter the results. See Materials and Methods (Chapter 2) for additional information.
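
The binning and RxC comparison described in the note can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the analysis used in Chapter 2: bins are taken to be upper-inclusive (1-50, 51-100, ..., 251-300 bp), bins that are empty in every compared gene are dropped, and the helper names `bin_counts` and `rxc_chi_squared` are hypothetical. The input is only the first ten exon lengths of COL1a1 and COL1a2 from Supplementary Table 1a, so the statistic is not expected to match Supplementary Table 1c.

```python
from collections import Counter

BIN_WIDTH = 50   # bp, per the table note
N_BINS = 6       # bins ending at 50, 100, ..., 300 bp


def bin_counts(lengths, width=BIN_WIDTH, n_bins=N_BINS):
    """Count exons per bin; bin k holds lengths in (k*width, (k+1)*width]."""
    # Assumption: upper-inclusive bins; anything above the last edge is
    # clamped into the last bin.
    counts = Counter(min((length - 1) // width, n_bins - 1) for length in lengths)
    return [counts.get(k, 0) for k in range(n_bins)]


def rxc_chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for an RxC table."""
    # Drop columns that are empty in every row (their expected count is zero).
    cols = [c for c in zip(*table) if sum(c) > 0]
    rows = list(zip(*cols))
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in cols]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(cols) - 1)
    return stat, dof


# Exons 1-10 of COL1a1 and COL1a2, copied from Supplementary Table 1a.
col1a1 = [103, 195, 35, 36, 102, 72, 45, 54, 54, 54]
col1a2 = [70, 11, 15, 36, 93, 54, 45, 54, 54, 54]

table = [bin_counts(col1a1), bin_counts(col1a2)]
chi2, dof = rxc_chi_squared(table)
print(table)
print(f"chi-squared = {chi2:.3f}, df = {dof}")
```

With so few exons the expected counts per cell are small, so this subset only demonstrates the mechanics; the full exon lists would be needed for any meaningful comparison.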

Supplementary Table 1b

Human Clade A Collagen Gene Exon GC-Content

Exon GC-Content (%)

Exon #  COL1a1  COL1a2  COL2a1  COL3a1  COL5a2
1         62.1    47.1    67.1    46.8    43.3
2         63.6    36.4    58.0    47.3    49.3
3         62.9    33.3    58.8    58.8    42.9
4         72.2    63.9    48.5    58.8    48.5
5         76.5    69.9    57.6    48.1    60.6
6         51.4    48.1    55.6    70.4    59.3
7         71.1    64.4    69.6    59.3    62.2
8         66.7    57.4    52.6    57.4    48.7
9         63.0    64.8    71.1    61.1    57.8
10        66.7    59.3    53.7    50.0    59.3
11        61.1    55.6    64.8    48.1    59.3
12        55.6    57.4    57.4    53.3    51.9
13        60.0    53.3    59.3    68.5    61.1
14        75.9    68.5    63.0    60.0    51.9
15        64.4    66.7    64.4    64.8    64.4
16        68.5    70.4    70.4    58.6    64.8
17        71.7    64.6    75.6    53.3    53.3
18        55.6    57.8    66.7    66.7    50.0
19        70.7    68.7    68.7    61.1    64.6
20        59.3    63.0    51.1    55.6    55.6
21        68.5    63.9    67.7    61.1    63.6
22        63.0    59.3    53.7    66.7    53.7
23        64.6    64.6    69.4    64.8    60.2
24        64.8    57.4    64.8    58.6    55.6
25        67.7    64.6    70.7    63.0    62.6
26        72.2    53.7    61.1    59.3    46.3
27        66.7    59.3    65.7    57.4    65.7
28        68.5    59.3    63.0    55.6    59.3
29        61.1    63.0    63.0    64.4    61.1
30        66.7    57.8    66.7    64.6    55.6
31        67.7    63.6    63.0    59.3    59.3
32        65.7    60.2    71.1    63.0    60.0
33        63.0    53.7    67.7    61.1    54.5
34        70.4    68.5    69.4    63.0    59.3
35        74.1    70.4    61.1    61.1    50.0
36        66.7    61.1    66.7    59.3    61.1
37        64.8    59.3    64.8    61.1    63.0
38        66.7    66.7    68.5    59.3    57.4
39        65.4    63.0    69.4    65.4    57.4
40        63.9    61.1    66.7    64.8    55.6
41        63.0    60.2    66.7    58.3    61.1
42        68.5    63.9    66.7    57.4    67.3
43        69.4    64.8    66.7    60.2    66.7
44        77.8    60.2    67.6    72.2    57.4
45        66.7    59.3    68.5    56.5    59.3
46        64.8    57.4    66.7    68.5    64.8
47        65.7    63.0    68.5    57.4    57.4
48        64.0    62.0    65.7    51.3    58.3
49        55.0    55.6    59.3    43.6    51.9
50        62.6    47.6    65.7    46.1    58.3
51        59.0    46.5    64.4    43.8    56.2
52           -    42.4    51.1       -    42.0
53           -       -    56.4       -    40.0
54           -       -    54.9       -    47.9

Note: Lines within columns give approximate locations of triple-helix domain boundaries. RxC chi-squared tests (see supplementary table 1c) were conducted on comparisons among genes of binned distributions with bins starting at 40% in increments of 10%, up to 80% (e.g., 40, 50, 60, etc.). These bin intervals enabled the best resolution in comparisons among genes and bin size did not alter the results. See Materials and Methods (Chapter 2) for additional information.
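
As a small illustration of the quantity tabulated above, the sketch below computes GC content as the percentage of G and C bases in a sequence and assigns it to one of the 10% bins named in the note. The source does not spell out the computation, so the standard definition is assumed here; the bin edges ([40, 50), [50, 60), [60, 70), [70, 80]), the clamping of values outside 40-80%, the function names, and the example sequence are all hypothetical.

```python
def gc_percent(sequence):
    """Percentage of G and C bases in a nucleotide sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def gc_bin(percent, start=40.0, width=10.0, stop=80.0):
    """0-based index of the GC bin: [40, 50), [50, 60), [60, 70), [70, 80]."""
    n_bins = int((stop - start) / width)
    index = int((percent - start) // width)
    # Assumption: values outside 40-80% are clamped into the nearest bin.
    return min(max(index, 0), n_bins - 1)


# Hypothetical 54-bp exon: nine repeats of a 6-mer with four G/C bases.
exon = "GGCCAT" * 9
pct = gc_percent(exon)
print(f"{pct:.1f}% GC -> bin {gc_bin(pct)}")
```

A 54-bp exon with 36 G/C bases gives 66.7%, a value that appears repeatedly in the table for 54-bp exons.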

Supplementary Table 1c

Summary of RxC Chi-Squared Tests for Human Clade A Collagen Gene Exons

                                    Chi-squared Value
Region Compared   Included Exons    COL1a2   COL2a1   COL3a1   COL5a2
Exon Length       All                  1.0      1.0      2.0      1.0
Exon GC-content   All                17.6*      4.9    24.0*    37.5*
Exon GC-content   triple-helix       11.3*      1.1    16.1*    29.7*

Note: * denotes comparisons that are statistically significant after a Bonferroni correction, P<0.0125.
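
The threshold in the note is consistent with dividing a family-wise significance level of 0.05 by four comparisons, one per gene column. The sketch below shows that arithmetic; the value of alpha, the function name, and the example P-values are assumptions made for illustration and do not come from the dissertation.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# Assumption: family-wise alpha of 0.05 and one test per compared gene
# (COL1a2, COL2a1, COL3a1, COL5a2).
threshold = bonferroni_threshold(0.05, 4)

# Hypothetical P-values for illustration only; they are not from the dissertation.
example_p_values = {"test A": 0.004, "test B": 0.03}
significant = {name: p < threshold for name, p in example_p_values.items()}
print(threshold, significant)
```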

Supplementary Table 2a

COL1a1 Exon Lengths among Vertebrates

Exon Length (bp)

Exon #      Human Chimpanzee      Mouse        Dog        Cow    W. Frog  Zebrafish
1             103        103         76         91        103         82         82
2             195        195        195        195        195        189        192
3              35         35         32         35         35         32         23
4              36         36         36         36         36         36         36
5             102        102        102        102        102         93         93
6              72         72         69         72         69         69         69
7              45         45         45         45         45         45         45
8              54         54         54         54         54         54         54
9              54         54         54         54         54         54         54
10             54         54         54         54         54         54         54
11             54         54         54         54         54         54         54
12             54         54         54         54         54         54         54
13             45         45         45         45         45         45         45
14             54         54         54         54         54         54         54
15             45         45         45         45         45         45         45
16             54         54         54         54         54         54         54
17             99         99         99         99         99         99         99
18             45         45         45         45         45         45         45
19             99         99         99         99         99         99         99
20             54         54         54         54         54         54         54
21            108        108        108        108        108        108        108
22             54         54         54         54         54         54         54
23             99         99         99         99         99         99         99
24             54         54         54         54         54         54         54
25             99         99         99         99         99         99         99
26             54         54         54         54         54         54         54
27             54         54         54         54         54         54         54
28             54         54         54         54         54         54         54
29             54         54         54         54         54         54         54
30             45         45         45         45         45         45         45
31             99         99         99         99         99         99         99
32            108        108        108        108        108        108        108
33            108        108        108        108        108        108        108
34             54         54         54         54         54         54         54
35             54         54         54         54         54         54         54
36            108        108        108        108        108        108        108
37             54         54         54         54         54         54         54
38             54         54         54         54         54         54         54
39            162        162        162        162        162        162        162
40            108        108        108        108        108        108        108
41            108        108        108        108        108        108        108
42             54         54         54         54         54         54         54
43            108        108        108        108        108        108        108
44             54         54         54         54         54         54         54
45            108        108        108        108        108        108        108
46             54         54         54         54         54         54         54
47            108        108        108        108        108        108        108
48            283        283        283        283        283        280        280
49            191        191        191        191        191        191        191
50            243        243        243        243        243        243        243
51            144        144        144        144        144        144        144

Note: Lines within columns give approximate locations of triple-helix domain boundaries. Mann-Whitney U tests were conducted among species as were RxC chi-squared tests among orthologous exons (see supplementary tables 2d and e). See Materials and Methods (Chapter 2) for additional information.
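
For readers unfamiliar with the Mann-Whitney U test mentioned in the note, the sketch below computes the U statistics for two samples, using average ranks for tied values. It is a minimal illustration only: it stops at the statistic and does not compute a P-value, the function names are hypothetical, and the input is limited to exons 1-6 of human and W. Frog from Supplementary Table 2a rather than the full data used in Chapter 2.

```python
def midranks(values):
    """1-based ranks of values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U_x, U_y) for two samples from the pooled rank sums."""
    ranks = midranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:len(x)])
    u_x = rank_sum_x - len(x) * (len(x) + 1) / 2
    u_y = len(x) * len(y) - u_x
    return u_x, u_y


# Exons 1-6, copied from Supplementary Table 2a.
human = [103, 195, 35, 36, 102, 72]
frog = [82, 189, 32, 36, 93, 69]

u_human, u_frog = mann_whitney_u(human, frog)
print(u_human, u_frog)
```

The two statistics always sum to the product of the sample sizes (here 6 x 6 = 36), which is a quick check on the ranking step.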

Supplementary Table 2b

COL1a1 Exon GC-Content among Vertebrates

Exon GC-content (%)

Exon #  Human   Chimpanzee  Mouse   Dog     Cow     W. Frog   Zebrafish
1       62.1    62.1        59.2    61.5    61.2    45.1      54.9
2       63.6    63.6        54.9    64.1    60.5    47.6      61.5
3       62.9    62.9        43.8    57.1    57.1    46.9      52.2
4       72.2    72.2        72.2    69.4    69.4    61.1      61.1
5       76.5    77.5        71.6    74.5    76.5    67.7      68.8
6       51.4    51.4        49.3    48.6    47.8    42.0      50.7
7       71.1    71.1        68.9    71.1    71.1    68.9      71.1
8       66.7    66.7        64.8    64.8    68.5    59.3      64.8
9       63.0    64.8        63.0    70.4    68.5    61.1      63.0
10      66.7    66.7        70.4    66.7    70.4    72.2      68.5
11      61.1    61.1        61.1    59.3    61.1    61.1      61.1
12      55.6    55.6        55.6    57.4    59.3    55.6      59.3
13      60.0    60.0        64.4    62.2    62.2    64.4      60.0
14      75.9    74.1        70.4    75.9    74.1    64.8      70.4
15      64.4    64.4        64.4    64.4    64.4    62.2      62.2
16      68.5    68.5        68.5    68.5    66.7    66.7      68.5
17      71.7    72.7        69.7    74.7    68.7    62.6      65.7
18      55.6    55.6        55.6    57.8    57.8    55.6      62.2
19      70.7    70.7        68.7    72.7    72.7    63.6      66.7
20      59.3    59.3        57.4    61.1    63.0    55.6      64.8
21      68.5    68.5        64.8    68.5    68.5    63.0      67.6
22      63.0    63.0        63.0    63.0    66.7    63.0      61.1
23      64.6    65.7        64.6    64.6    67.7    60.6      62.6
24      64.8    64.8        66.7    64.8    66.7    61.1      61.1
25      67.7    66.7        66.7    68.7    68.7    60.6      63.6
26      72.2    72.2        66.7    68.5    63.0    55.6      57.4
27      66.7    66.7        64.8    66.7    64.8    61.1      63.0
28      68.5    68.5        63.0    68.5    68.5    57.4      59.3
29      61.1    61.1        63.0    66.7    64.8    61.1      61.1
30      66.7    64.4        64.4    68.9    64.4    55.6      57.8
31      67.7    66.7        62.6    64.6    69.7    57.6      63.6
32      65.7    65.7        63.9    64.8    64.8    63.0      63.0
33      63.0    63.9        60.2    62.0    64.8    54.6      59.3
34      70.4    70.4        72.2    70.4    70.4    66.7      57.4
35      74.1    75.9        72.2    74.1    75.9    68.5      63.0
36      66.7    66.7        61.1    65.7    67.6    61.1      60.2
37      64.8    64.8        63.0    66.7    66.7    63.0      66.7
38      66.7    68.5        68.5    72.2    70.4    59.3      63.0
39      65.4    64.8        64.8    68.5    70.4    62.3      60.5
40      63.9    63.0        61.1    65.7    62.0    55.6      60.2
41      63.0    63.0        63.0    64.8    65.7    63.0      63.0
42      68.5    66.7        68.5    63.0    63.0    63.0      63.0
43      69.4    69.4        70.4    72.2    72.2    63.9      63.0
44      77.8    79.6        70.4    66.7    72.2    70.4      77.8
45      66.7    66.7        60.2    63.9    67.6    59.3      61.1
46      64.8    64.8        66.7    61.1    68.5    57.4      64.8
47      65.7    65.7        66.7    66.7    65.7    55.6      59.3
48      64.0    65.0        62.2    66.1    65.0    56.4      60.7
49      55.0    55.0        54.5    54.5    54.5    48.7      51.8
50      62.6    62.6        59.7    61.7    61.7    50.6      53.5
51      59.0    59.0        56.2    55.6    60.4    50.0      49.3

Note: Lines within columns give approximate locations of triple-helix domain boundaries. Mann-Whitney U tests were conducted among species as were RxC chi-squared tests among orthologous exons (see supplementary tables 2d and e). See Materials and Methods (Chapter 2) for additional information.


Supplementary Table 2c

Summary of COL1a1 Exon Characteristics among Vertebrates

Species              No. of amino acids  Exon length (bp)  Exon GC-content (%)
Human                1464                86 ± 52           66 ± 5
Chimpanzee           1464                86 ± 52           66 ± 5
Mouse                1453                86 ± 52           64 ± 6
Dog                  1460                86 ± 52           65 ± 5
Cow                  1463                86 ± 52           66 ± 5
Western clawed frog  1449                85 ± 52           59 ± 6
Zebrafish            1447                85 ± 52           62 ± 5

Note: Length and GC-content values denote means and standard deviations.


Supplementary Table 2d

Summary of Mann-Whitney U Tests for COL1a1 Exons among Vertebrates

Mann-Whitney U

Region Compared: Exon Length. Included Exons: All
Species Compared  Human     Mouse     Dog       Cow       W. Frog
Chimpanzee        1300.5
Mouse             1293.0
Dog               1294.0    1299.0
Cow               1300.0    1293.5    1294.5
W. Frog           1285.5    1293.5    1291.5    1286.0
Zebrafish         1286.6    1294.0    1292.5    1287.0    1299.5

Region Compared: Exon Length. Included Exons: N-terminal domain (exons 1-6)
Species Compared  Human     Mouse     Dog       Cow       W. Frog
Chimpanzee        18.0
Mouse             15.5
Dog               16.5      16.5
Cow               17.5      16.0      17.0
W. Frog           14.5      17.5      15.5      15.0
Zebrafish         14.5      17.0      15.5      15.0      18.0

Region Compared: Exon GC-content. Included Exons: All
Species Compared  Human     Mouse     Dog       Cow       W. Frog
Chimpanzee        1299.5
Mouse             1091.5
Dog               1281.5    1075
Cow               1210.0    1011.5    1227.5
W. Frog           576.5*    772.0*    576.5*    533.5*
Zebrafish         745.0*    966.5     750.5*    693.5*    1039.5

Region Compared: Exon GC-content. Included Exons: triple-helix domain (exons 7-47)
Species Compared  Human     Mouse     Dog       Cow       W. Frog
Chimpanzee        832.5
Mouse             705.0
Dog               814.0     677.0
Cow               742.0     610.5     771.5
W. Frog           349.0*    433.5*    324.0*    275.5*
Zebrafish         460.5*    569.0     442.0*    377.5*    651.5

Note: * denotes comparisons that are statistically significant after a Bonferroni correction, P<0.003. Because chimpanzee is identical or highly similar to human for all comparisons, chimpanzee was excluded from further analyses. Exon length comparisons were not conducted for the C-terminal domain (exons 48-51) as only a single exon differs and the difference is only 3 bp (see supplementary table 2a).
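The Bonferroni threshold quoted in the note can be checked with a short calculation. The sketch below assumes a family-wise error rate of 0.05, which the note does not state, divided by the number of pairwise species comparisons; the function name is illustrative only.

```python
from math import comb

def bonferroni_threshold(n_groups, alpha=0.05):
    """Per-comparison P threshold for all pairwise comparisons among n_groups.

    alpha=0.05 is an assumption; the note does not give the family-wise rate.
    """
    n_comparisons = comb(n_groups, 2)  # number of unordered pairs
    return alpha / n_comparisons

# Six species remain once chimpanzee is excluded, giving 15 pairwise comparisons.
six_species = bonferroni_threshold(6)
print(f"{six_species:.4f}")
```

Under that assumption the six-species case gives roughly 0.0033, which is consistent with the P<0.003 cut-off reported here.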


Supplementary Table 2e

Summary of RxC Chi-Squared Tests for COL1a1 Exons among Vertebrates

Chi-Squared Value

Region Compared: Exon Length. Included Exons: All
Species Compared  Human     Mouse     Dog       Cow       W. Frog
Chimpanzee        0
Mouse             4.1
Dog               0.7       1.5
Cow               0.1       4.1       0.8
W. Frog           2.9       0.7       5.6       7.4
Zebrafish         9.7       2.1       7.9       9.7       6.0

Region Compared: Exon Length. Included Exons: N-terminal domain (exons 1-6)
Species Compared  Human     Mouse     Dog       Cow       W. Frog
Chimpanzee        0
Mouse             3.2
Dog               0.6       1.1
Cow               0.1       3.3       0.7
W. Frog           1.4       0.7       0.3       1.6
Zebrafish         3.2       1.9       2.2       3.3       1.5

Region Compared: Exon GC-content. Included Exons: All
Species Compared  Human     Mouse     Dog       Cow       W. Frog
Chimpanzee        0.2
Mouse             6.7
Dog               2.8       4.5
Cow               2.5       5.5       2.7
W. Frog           13.5      4.8       13.8      12.7
Zebrafish         13.0      6.3       11.3      11.3      9.5

Region Compared: Exon GC-content. Included Exons: triple-helix domain (exons 7-47)
Species Compared  Human     Mouse     Dog       Cow       W. Frog
Chimpanzee        0.1
Mouse             1.5
Dog               1.8       1.7
Cow               1.7       1.9       1.7
W. Frog           3.8       2.5       4.1       2.9
Zebrafish         3.5       3.8       4.6       3.2       2.7

Note: * denotes comparisons that are statistically significant after a Bonferroni correction, P<0.003. Because chimpanzee is identical or highly similar to human for all comparisons, chimpanzee was excluded from further analyses. Exon length comparisons were not conducted for the C-terminal domain (exons 48-51) as only a single exon differs and the difference is only 3 bp (see supplementary table 2a). The number of G and C bp were used for chi-squared comparisons of GC-content rather than percentage GC-content.
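As a rough illustration of the last point in the note, the sketch below converts percentage GC-content back to a count of G/C bp and computes a Pearson chi-squared statistic on an R x C table of those counts. The table layout (species as rows, orthologous exons as columns) and both function names are assumptions made for illustration; this section does not specify how the contingency tables were arranged.

```python
def gc_count(length_bp, gc_percent):
    """Convert an exon length and GC percentage into a whole number of G/C bp."""
    return round(length_bp * gc_percent / 100)

def chi_squared_rxc(table):
    """Pearson chi-squared statistic and degrees of freedom for an R x C count table.

    Assumes every row and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

# Exons 1-3 only, using lengths from supplementary table 2a and GC% from table 2b.
human = [gc_count(103, 62.1), gc_count(195, 63.6), gc_count(35, 62.9)]
mouse = [gc_count(76, 59.2), gc_count(195, 54.9), gc_count(32, 43.8)]
statistic, dof = chi_squared_rxc([human, mouse])
print(human, mouse, dof)
```

Working with counts rather than percentages keeps the expected values tied to the actual number of sites, so short exons carry less weight than long ones.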


Supplementary Table 3

Pairwise dN and dS Comparisons Across COL1a1 Domains in Primates

              N-terminal      Triple-helix    C-terminal
Species Pair  dN      dS      dN      dS      dN      dS
H/C           0       0.007   0       0.016   0       0.027
H/O           0       0.014   0       0.040   0       0.076
H/M           0       0.022   0       0.067   0       0.066
C/O           0       0.022   0       0.039   0       0.071
C/M           0       0.029   0       0.066   0       0.060
O/M           0       0.036   0       0.058   0       0.087

Note: H = human, C = chimpanzee, O = orangutan, and M = macaque. For the N-terminal domain, the average number of synonymous and nonsynonymous sites across species is 138 and 396 bp, respectively. For the triple-helix, the average number of synonymous and nonsynonymous sites across species is 822 and 1986 bp, respectively. For the C-terminal domain, all species have 183 synonymous and 633 nonsynonymous bp.


Supplementary Table 4a

Distributions of Amino Acid Positions and Observed Mutations among Evolutionary Rate Bins

Evolutionary   Total      Positions per Domain                     Total      Phenotypic Severity Category
Rate Category  Positions  N-terminal  Triple-helix  C-terminal     Mutations  1     2     3     4
1              889        57          670           162            275        38    45    36    112
2              64         13          17            34             5          2     2     0     1
3              136        10          103           23             7          7     0     0     0
4              94         17          60            17             1          0     0     0     1
5              78         8           60            10             3          2     0     1     0
6              66         11          42            13             1          0     0     0     1
7              68         19          43            6              1          1     0     0     0
8              38         19          14            5              0          0     0     0     0
Total          1433       154         1009          270            293        50    47    37    115

Note: Evolutionary rate categories correspond to increments of 0.125 up to 1.0 (e.g., 0.125, 0.25, 0.375, etc.). Amino acid positions were binned according to these categories after scaling down the original rate estimates (ranging from 0.291 to 3.961). The number of amino acid positions in each rate category in each protein domain is listed under “Positions per domain.” RxC chi-squared tests were conducted to compare these distributions of positions among rate categories across domains (see supplementary table 4b). “Total mutations” provides the number of disease-associated mutations occurring at amino acid positions in each of the evolutionary rate categories and “phenotypic severity category” separates these mutations into categories according to the severity of their associated phenotype. Chi-squared tests were used to compare the distributions of these mutations among rate categories (see supplementary table 4c). See Materials and Methods (Chapter 2) for additional information.
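The binning described in the note can be sketched as follows. The note says the original rate estimates were scaled down before binning but does not say how; dividing by the largest estimate (3.961) is an assumption made here purely for illustration, as is the function name.

```python
import math

def rate_category(rate, max_rate, n_bins=8):
    """Assign a rate to one of n_bins equal-width categories (1..n_bins).

    Assumption: rates are scaled to (0, 1] by dividing by the largest estimate.
    """
    scaled = rate / max_rate
    width = 1.0 / n_bins                    # 0.125 when n_bins is 8
    category = math.ceil(scaled / width)    # upper bin edge is inclusive
    return min(max(category, 1), n_bins)

# The smallest and largest estimates quoted in the note.
print(rate_category(0.291, 3.961), rate_category(3.961, 3.961))
```

With this scaling the slowest positions fall in category 1 and the fastest in category 8; a different scaling rule would shift positions between the intermediate categories.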


Supplementary Table 4b

Summary of RxC Chi-Squared Tests of Amino Acid Positions

Domain Compared  N-terminal  Triple-helix
Triple-helix     124.8*
C-terminal       52.3*       70.3*

Note: * denotes statistical significance at P<0.000001.


Supplementary Table 4c

Summary of Chi-Squared Tests of Disease-Associated Mutations

Amino Acid Mutations  Chi-squared Value
Total                 127.5*
Severity Category 1   10.6
Severity Category 2   24.4**
Severity Category 3   20.0**
Severity Category 4   61.3**

Note: * denotes statistical significance at P<0.000001. ** denotes comparisons that are statistically significant after a Bonferroni correction, P<0.0125.


Supplementary Table 5a

COL1a1 Intron Lengths among Vertebrates

Intron Length (bp)

Intron #  Human   Chimpanzee  Mouse   Dog     Cow     W. Frog   Zebrafish
2         151     144         118     82      143     287       377
3         98      98          115     99      106     99        193
4         86      86          86      95      85      796       631
5         718     719         781     814     837     113       319
6         223     223         233     263     213     860       214
7         154     154         158     159     146     92        288
8         159     159         156     191     177     417       452
9         494     494         369     526     501     874       185
10        112     112         115     116     113     982       124
11        335     335         326     325     328     92        84
12        84      84          77      98      83      88        81
13        112     112         94      111     112     680       84
14        110     110         109     126     105     100       244
15        174     174         132     159     163     344       83
16        253     253         230     257     247     494       442
17        84      84          85      84      82      84        90
18        99      99          89      87      92      167       674
19        127     127         115     128     124     788       85
20        214     214         177     124     161     84        197
21        90      90          99      76      74      738       93
22        121     118         279     261     274     647       90
23        161     161         144     170     195     113       94
24        84      84          85      88      82      84        931
25        898     911         519     589     601     141       88
26        139     139         185     169     186     543       95
27        99      99          96      99      100     115       246
28        107     107         104     112     94      119       107
29        446     446         396     426     432     125       135
30        89      89          83      88      87      451       86
31        293     293         208     269     263     616       152
32        454     454         352     422     564     150       101
33        216     216         187     186     205     1175      103
34        158     158         144     175     204     85        81
35        214     214         240     222     216     542       83
36        84      83          84      81      82      81        147
37        122     122         106     115     114     213       88
38        136     136         126     140     134     588       87
39        97      97          99      96      85      157       452
40        153     153         337     163     141     849       138
41        103     103         99      103     100     440       104
42        100     100         105     109     88      75        116
43        376     376         363     392     542     768       85
44        108     108         117     109     114     78        97
45        334     333         252     275     331     236       96
46        357     357         319     285     274     612       108
47        88      88          82      87      433     334       78
48        128     128         131     137     105     167       106
49        292     292         197     267     285     339       89
50        125     125         125     96      111     290       282

Note: Splice sites are excluded, but Alus are included in this case to provide an accurate representation of length differences that have accumulated over time. Mann-Whitney U tests were conducted among species as were RxC chi-squared tests among orthologous introns (see supplementary tables 5d and e). See Materials and Methods (Chapter 2) for additional information.


Supplementary Table 5b

COL1a1 Intron GC-Content among Vertebrates

Intron GC-Content (%)

Intron #  Human   Chimpanzee  Mouse   Dog     Cow     W. Frog   Zebrafish
2         58.2    73.6        48.3    81.7    67.1    36.9      36.9
3         74.4    58.2        48.7    58.6    56.6    40.4      28
4         39.7    74.4        54.7    75.8    75.3    35.4      33
5         54.7    40.2        40.7    39.9    40.1    33.6      42.3
6         48.7    54.3        49.8    47.9    48.8    36.6      29.9
7         42.8    48.1        53.8    49.7    52.1    29.3      30.2
8         49.4    43.4        48.7    44.5    42.9    30.7      38.3
9         54.5    49.6        46.3    49.2    50.9    34.6      33.5
10        51.9    55.4        53.9    56      54.9    38.9      29
11        45.2    51.9        44.8    49.8    50      46.7      31
12        59.8    46.4        45.5    46.9    51.8    31.8      32.1
13        62.7    59.8        55.3    48.6    55.4    34.3      31
14        58      62.7        59.6    56.3    53.3    40        28.7
15        55.7    57.5        46.2    61      57.7    38.4      18.1
16        53.6    55.7        51.7    57.6    55.1    36.6      29.9
17        65.7    53.6        54.1    59.5    53.7    35.7      27.8
18        63      64.6        60.7    65.5    62      34.1      35.6
19        66.8    63          51.3    57.8    58.1    37.1      25.9
20        67.8    67.3        60.5    68.5    62.1    33.3      29.9
21        60.3    67.8        52.5    65.8    63.5    32.8      40.9
22        63.4    60.2        56.6    59      62      36.6      26.7
23        59.5    63.4        62.5    64.7    57.9    34.5      40.4
24        61.2    58.3        52.9    53.4    59.8    28.6      42.1
25        61.6    53.3        49.7    53.8    52.6    29.1      35.2
26        54.2    60.4        53.5    64.5    57      36.1      28.4
27        54.9    60.6        55.2    65.7    67      31.3      24
28        57.3    54.2        46.2    56.2    53.2    34.5      31.8
29        61.1    54.7        50      53.8    54.4    36.8      30.4
30        54.6    55.1        51.8    59.1    56.3    30.6      38.4
31        62.5    60.4        53.8    60.2    57.8    32.3      34.9
32        57.6    54.2        46      53.8    52.8    32        34.7
33        58.9    62          53.5    61.8    60      37.3      29.1
34        66.7    57          50.7    63.6    61.8    25.9      30.9
35        60.7    59.3        57.1    59      59.3    33.8      32.5
36        61.8    66.3        46.4    64.2    61      30.9      30.6
37        62.9    61.5        52.8    58.3    61.4    35.2      28.4
38        56.9    62.5        53.2    59.3    57.5    33.5      28.7
39        61.2    62.9        50.5    70.8    70.6    28.7      31.6
40        63      58.2        48.4    62.6    61.7    32.7      34.8
41        59.8    62.1        62.6    64.1    68      38.6      33.7
42        52.8    63          57.1    65.1    70.5    30.7      38.8
43        55.7    60.4        51.8    61      53.5    35.4      36.5
44        59.9    51.9        53      56.9    57      21.8      37.1
45        60.2    55          51.6    53.8    59.2    34.3      31.2
46        63.3    59.9        52.7    56.8    57.7    34.6      29.6
47        61      61.4        51.2    64.4    66.5    30.8      30.8
48        66.4    63.3        55      65      64.8    31.7      29.2
49        73.5    61.3        52.3    59.6    62.1    32.4      36
50        55.2    67.2        54.4    67.7    72.1    32.1      33.3

Note: Splice sites are excluded, but Alus are included in this case to provide an accurate representation of length differences that have accumulated over time. Mann-Whitney U tests were conducted among species as were RxC chi-squared tests among orthologous introns (see supplementary tables 5d and e). See Materials and Methods (Chapter 2) for additional information.


Supplementary Table 5c

Summary of COL1a1 Intron Characteristics among Vertebrates

Species              Intron length (bp)  Intron GC-content (%)
Human                203 ± 167           59 ± 7
Chimpanzee           203 ± 168           59 ± 7
Mouse                188 ± 135           52 ± 5
Dog                  197 ± 150           59 ± 8
Cow                  211 ± 166           58 ± 7
Western clawed frog  374 ± 303           34 ± 4
Zebrafish            192 ± 179           32 ± 5

Note: Length and GC-content values denote means and standard deviations, excluding first introns and splice sites.


Supplementary Table 5d

Summary of Mann-Whitney U Tests for COL1a1 Introns among Vertebrates

Mann-Whitney U

Region Compared: Intron Length. Included Introns: All
Species Compared  Human     Chimp.    Mouse     Dog       Cow       W. Frog
Chimpanzee        1198.0
Mouse             1165.0    1167.5
Dog               1186.5    1187.0    1186.0
Cow               1194.5    1193.0    1163.5    1165.0
W. Frog           890.5     886.5     853.0     863.5     878.0
Zebrafish         965.0     968.0     1019.5    1004.0    1005.0    771.5*

Region Compared: Intron GC-content. Included Introns: All
Species Compared  Human     Chimp.    Mouse     Dog       Cow       W. Frog
Chimpanzee        1197.5
Mouse             442.5*    462.0*
Dog               1185.0    1169.5    508.0*
Cow               1120.0    1120.5    508.5*    1117.0
W. Frog           5.0*      4.0*      8.0*      4.0*      3.0*
Zebrafish         4.0*      4.0*      3.0*      4.0*      4.0*      907.0

Note: * denotes comparisons that are statistically significant after a Bonferroni correction, P<0.002.


Supplementary Table 5e

Summary of RxC Chi-Squared Tests for COL1a1 Introns among Vertebrates

Chi-squared Value

Region Compared: Intron Length. Included Introns: All
Species Compared  Human     Chimp.    Mouse     Dog       Cow       W. Frog
Chimpanzee        0.3
Mouse             322.8*    329.9*
Dog               203.6*    207.4*    162.8*
Cow               438.4*    445.4*    452.9*    309.1*
W. Frog           6315.1*   6341.8*   5287.8*   5626.6*   6144.0*
Zebrafish         4641.1*   4660.8*   4129.9*   4456.2*   5013.6*   7674.5*

Region Compared: Intron GC-content. Included Introns: All
Species Compared  Human     Chimp.    Mouse     Dog       Cow       W. Frog
Chimpanzee        0.5
Mouse             208.0*    201.0*
Dog               128.3*    123.4*    82.7*
Cow               295.3*    289.7*    278.7*    190.9*
W. Frog           2579.0*   2564.1*   2115.4*   2288.0*   2566.0*
Zebrafish         2116*     2115.4*   1817.3*   2001.5*   2254.2*   2943.3*

Note: * denotes comparisons that are statistically significant after a Bonferroni correction, P<0.002. The number of G and C bp were used for chi-squared comparisons of GC-content rather than percentage GC-content.


Supplementary Table 6a

Human Clade A Collagen Gene Intron Lengths

Intron Length (bp)

Intron #  COL1a1  COL1a2  COL2a1  COL3a1  COL5a2
2         151     619     1483    230     5943
3         98      648     209     413     4120
4         86      1107    100     1276    1343
5         718     1274    103     937     1392
6         223     2931    159     451     4852
7         154     88      976     748     3533
8         159     88      624     642     1920
9         494     302     107     153     949
10        112     416     397     639     501
11        335     519     773     425     1118
12        84      1539    372     133     2936
13        112     287     127     457     954
14        110     95      302     654     866
15        174     385     522     416     489
16        253     494     2813    574     3074
17        84      139     475     147     3295
18        99      110     1398    204     3151
19        127     416     293     125     511
20        214     120     88      210     108
21        90      357     185     588     1228
22        121     109     387     330     227
23        161     909     366     215     1328
24        84      458     81      409     344
25        402     396     136     113     519
26        139     549     436     306     705
27        99      146     391     508     104
28        107     270     401     350     87
29        446     946     346     562     1245
30        89      1130    240     82      774
31        293     1216    243     260     1832
32        454     663     142     1497    317
33        216     941     234     81      997
34        158     677     337     704     289
35        214     100     273     599     2756
36        84      92      233     344     206
37        122     356     370     265     413
38        136     832     249     190     401
39        97      1000    487     109     104
40        153     958     440     981     519
41        103     669     748     744     686
42        100     381     194     81      639
43        376     82      169     480     1110
44        108     136     352     505     1080
45        334     367     168     273     2297
46        357     473     160     92      563
47        88      122     178     777     1916
48        128     327     240     952     373
49        292     403     439     278     1022
50        125     706     353     733     2018
51        -       812     450     -       2464
52        -       -       339     -       1456
53        -       -       531     -       695

Note: Splice sites, Alus, and alignment gaps are excluded. RxC chi-squared tests (see supplementary table 6d) were conducted on comparisons among genes of binned distributions with bins in increments of 50 bp up to 1000 bp, with increments of 1000 bp thereafter up to 5000 bp (e.g., 50, 100, 150...1000, 2000, 3000, etc.). These bins enabled the best resolution in comparisons among genes and bin size did not alter the results. See Materials and Methods (Chapter 2) for additional information.
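The unequal bin widths described in the note can be written out explicitly. The sketch below builds the upper bin edges (50-bp steps up to 1000 bp, then 1000-bp steps up to 5000 bp) and counts intron lengths per bin. The overflow bin for introns longer than 5000 bp and the function names are assumptions added for illustration; the note does not say how such introns were handled.

```python
from bisect import bisect_left
from collections import Counter

def length_bin_edges():
    """Upper bin edges: 50-bp steps to 1000 bp, then 1000-bp steps to 5000 bp."""
    return list(range(50, 1001, 50)) + list(range(2000, 5001, 1000))

def bin_lengths(lengths, edges):
    """Count lengths per bin index; upper edges are inclusive.

    Index len(edges) acts as an overflow bin (an assumption, see lead-in).
    """
    return Counter(bisect_left(edges, length) for length in lengths)

edges = length_bin_edges()
# First five COL1a1 intron lengths (introns 2-6) from the table above.
counts = bin_lengths([151, 98, 86, 718, 223], edges)
print(sorted(counts.items()))
```

The resulting per-gene count vectors are the kind of binned distributions that an RxC chi-squared test would then compare between genes.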


Supplementary Table 6b

Human Clade A Collagen Gene Intron GC-Content

Intron GC-content (%)

Intron #  COL1a1  COL1a2  COL2a1  COL3a1  COL5a2
2         73.5    29.4    47.0    30.0    28.6
3         58.2    25.9    38.8    25.7    32.4
4         74.4    31.6    48.0    31.0    33.5
5         39.7    32.9    52.4    28.5    31.2
6         54.7    32.7    51.6    27.9    36.9
7         48.7    29.5    40.9    24.9    34.6
8         42.8    33.0    48.1    31.2    33.4
9         49.4    29.1    38.3    22.9    31.8
10        54.5    31.0    52.4    26.9    27.9
11        51.9    35.6    48.9    23.1    30.7
12        45.2    32.4    52.2    30.1    34.2
13        59.8    29.3    47.2    29.3    31.7
14        62.7    31.6    56.6    30.1    32.3
15        58.0    36.9    45.0    35.8    35.0
16        55.7    35.0    52.5    29.3    38.0
17        53.6    36.0    48.6    36.1    32.8
18        65.7    44.5    55.7    30.9    34.8
19        63.0    36.3    51.9    24.0    28.8
20        66.8    28.3    62.5    31.9    43.5
21        67.8    38.4    58.9    29.9    27.2
22        60.3    40.4    56.1    29.7    31.3
23        63.4    30.8    57.1    27.0    37.3
24        59.5    36.2    60.5    31.3    42.4
25        55.2    39.6    62.5    35.4    26.8
26        61.2    33.2    52.8    36.3    29.1
27        61.6    41.1    56.3    31.3    32.7
28        54.2    36.7    57.4    24.6    26.4
29        54.9    30.5    61.6    30.2    33.4
30        57.3    31.5    61.2    30.5    26.9
31        61.1    34.7    53.1    30.8    35.4
32        54.6    36.8    61.3    31.1    26.8
33        62.5    35.0    59.0    37.0    30.1
34        57.6    37.4    55.5    32.1    34.3
35        58.9    37.0    64.8    32.7    35.3
36        66.7    39.1    60.9    33.4    33.5
37        60.7    33.7    64.6    29.4    28.6
38        61.8    38.8    55.0    23.7    24.2
39        62.9    34.7    60.2    24.8    37.5
40        56.9    38.7    64.3    30.0    32.0
41        61.2    36.3    58.6    29.6    30.2
42        63.0    40.7    63.4    25.9    28.5
43        59.8    34.1    62.1    34.2    35.0
44        52.8    49.3    59.7    25.9    28.7
45        55.7    32.4    56.0    33.3    34.7
46        59.9    34.5    60.0    35.9    31.8
47        60.2    52.5    62.9    33.2    27.6
48        63.3    37.9    56.2    27.5    32.4
49        61.0    35.7    56.0    31.3    28.8
50        66.4    32.9    54.4    30.4    36.7
51        -       33.1    59.6    -       35.6
52        -       -       53.1    -       28.6
53        -       -       56.1    -       30.6

Note: Splice sites, Alus, and alignment gaps are excluded. RxC chi-squared tests (see supplementary table 6d) were conducted on comparisons among genes of binned distributions with bins starting at 20% in increments of 5%, up to 85% (e.g., 20, 25, 30, etc.). These bins enabled the best resolution in comparisons among genes and bin size did not alter the results. See Materials and Methods (Chapter 2) for additional information.


Supplementary Table 6c

Clade A Collagen Gene Intron Human-Chimpanzee Site Divergence

Intron Silent Divergence (%)

Intron #  COL1a1  COL1a2  COL2a1  COL3a1  COL5a2
2         0.66    0.48    1.15    2.17    0.69
3         0.00    1.39    1.44    1.69    0.87
4         1.16    0.99    0.00    1.41    0.82
5         0.70    1.02    3.88    0.64    0.36
6         0.90    1.23    0.63    0.22    0.68
7         0.00    0.00    1.02    1.07    0.76
8         0.63    1.14    1.28    0.78    0.73
9         0.20    1.66    1.87    1.31    0.84
10        0.89    0.48    1.26    0.78    0.00
11        0.30    0.39    0.78    0.47    0.45
12        2.38    0.52    1.08    2.26    0.65
13        0.00    0.35    0.79    1.31    0.73
14        0.00    2.11    0.66    0.46    0.92
15        0.57    1.30    0.96    0.96    0.82
16        0.00    1.01    1.46    0.35    0.85
17        0.00    2.88    1.26    2.04    0.79
18        1.01    2.73    1.00    0.98    1.17
19        2.36    0.24    1.02    0.80    0.39
20        1.40    0.83    0.00    1.43    0.93
21        2.22    1.40    0.00    1.19    0.65
22        0.83    2.75    1.81    1.21    0.00
23        2.48    0.66    2.19    1.40    1.20
24        0.00    0.22    1.23    0.24    1.16
25        0.00    1.01    1.47    1.77    0.58
26        0.72    0.18    0.92    1.96    0.57
27        1.01    0.68    0.51    1.38    0.96
28        0.00    1.48    0.75    0.57    0.00
29        0.67    0.74    1.45    0.53    1.12
30        1.12    1.06    0.42    0.00    0.78
31        0.68    1.32    1.23    1.15    0.71
32        0.66    0.75    2.11    1.40    0.32
33        0.46    1.28    0.43    2.47    0.70
34        0.63    0.89    1.19    0.43    0.69
35        0.93    0.00    0.00    0.67    0.58
36        0.00    2.17    0.43    1.74    0.49
37        0.82    2.53    0.54    0.75    0.48
38        0.74    1.80    1.20    0.00    0.75
39        1.03    0.80    1.03    0.92    1.92
40        1.31    0.63    1.36    0.71    0.39
41        0.97    0.60    1.20    0.54    0.73
42        1.00    1.05    0.52    0.00    0.63
43        1.06    1.22    1.18    1.25    0.90
44        0.00    0.74    1.42    0.00    0.65
45        1.80    1.09    0.60    0.00    0.70
46        0.00    0.85    3.13    0.00    0.71
47        1.14    1.64    1.69    1.67    0.52
48        0.00    0.61    0.42    0.84    0.80
49        0.34    0.74    1.14    1.08    1.27
50        0.80    0.99    0.00    0.68    0.84
51        -       0.62    0.44    -       0.69
52        -       -       0.59    -       0.48
53        -       -       1.13    -       0.43

Note: Splice sites, Alus, and alignment gaps are excluded. RxC chi-squared tests (see supplementary table 6d) were conducted on comparisons among genes of binned distributions with bins in increments of 0.25%, up to 2.25% (0.25, 0.5, 0.75, etc.). These bins enabled the best resolution in comparisons among genes and bin size did not alter the results. See Materials and Methods (Chapter 2) for additional information.


Supplementary Table 6d

Summary of RxC Chi-Squared Tests for Clade A Collagen Gene Introns

Chi-squared Value

Region Compared     Included Introns  COL1a2  COL2a1  COL3a1  COL5a2
Intron Length       All               38.0*   24.3    31.6*   63.8*
Intron GC-content   All               85.5*   9.7     94.6*   94.8*
Intron GC-content   80-500 bp         66.6*   10.1    78.0*   57.0*
Intron Divergence   80-500 bp         14.0    13.4    12.9    7.9

Note: * denotes comparisons that are statistically significant after a Bonferroni correction, P<0.0125.


Supplementary Table 6e

Summary of Correlation Coefficient (r) Analyses for Clade A Collagen Gene Introns

Gene    Length vs. GC-content  Length vs. No. of Differences  GC-content vs. Divergence
COL1a1  -0.46*                 0.51*                          0.17
COL1a2  -0.27                  0.89**                         0.25
COL2a1  -0.21                  0.95**                         -0.11
COL3a1  -0.02                  0.80**                         0.36
COL5a2  0.19                   0.95**                         0.51*

Note: * denotes statistical significance at P<0.001; ** P<10^-6.
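For readers who want to recompute coefficients of this kind from the per-intron values in supplementary tables 6a-c, a minimal sketch of the Pearson correlation coefficient follows. It assumes r here is the ordinary Pearson product-moment coefficient, which this section does not state explicitly; the function name is illustrative and only the first five COL1a1 introns are used, so the result is not expected to match the table.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# COL1a1 introns 2-6: lengths (table 6a) and GC-content (table 6b).
lengths = [151, 98, 86, 718, 223]
gc_content = [73.5, 58.2, 74.4, 39.7, 54.7]
r_subset = pearson_r(lengths, gc_content)
print(round(r_subset, 2))
```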


Supplementary Table 6f

Summary of F-tests Comparing Linear Regressions of the Number of G/C bp to the Number of A/T bp for Human Clade A Collagen Gene Introns

Genes Compared  COL1a1
COL1a2          92.3*
COL2a1          0.5
COL3a1          143.9*
COL5a2          9.8*

Note: * denotes statistical significance at P<0.002.


Supplementary Fig. 1. Phylogenetic trees displaying dN/dS rate classes for each COL1a1 domain. Specifically, color-coding is used to highlight branches that fall into the 3, 3, and 6 significantly different evolutionary rate classes within each of the (A) N-terminal domain, (B) C-terminal domain, and (C) triple-helix domain, respectively, as determined by hypothesis testing with the GABranch algorithm. See Materials and Methods (Chapter 2) for more information.

[Figure panels A, B, and C: tree images not recoverable from the extracted text.]


APPENDIX B

SUPPLEMENTARY MATERIAL: CHAPTER 3


Supplementary Table 1

Global Diversity and Human-Chimpanzee Divergence Estimates by Gene Region

Region           Sites^a  S^b  θπ^c  D^d    d^e
Total            16,993   133  0.07  -1.37  0.62
Promoter         1,223    5    0.02  -1.49  0.82
UTR^f            263      3    0.19  -0.06  0.84
First intron^g   1,459    10   0.04  -1.46  0.75
Other introns^g  9,463    97   0.12  -1.02  0.69
Synonymous       1,210    12   0.02  -2.16  1.49
Nonsynonymous    3,177    6    0.01  -1.87  0
Silent^h         10,673   109  0.11  -1.22  0.78

^a Number of nucleotides
^b Number of SNPs
^c Average number of pairwise differences between sequences (%)
^d Tajima’s D statistic
^e Divergence as number of differences per nucleotide (%)
^f 5’ and 3’ mRNA untranslated regions (UTR)
^g Excludes splice sites
^h Includes synonymous and intron sites, excluding the first intron and splice sites


Supplementary Table 2

COL1a1 Nonsynonymous Polymorphisms

Exon  cDNA    Protein       Population^a
8     C613G   Pro205Ala^b   Italian (1)
17    G1135C  Ala379Pro     Japanese (1)
24    C1663T  Pro555Ser     Russian (1)
44    G3223A  Ala1075Thr^c  Middle Eastern (1), North African (3)
49    G3979A  Gly1327Ser    Middle Eastern (1)
50    C4195T  Arg1399Cys    Sub-Saharan African (1)

Note: In “cDNA” and “Protein” columns, polymorphism listed denotes the change from ancestral to derived allele state (as inferred by comparison with chimpanzee as an outgroup). Numbers reflect position in sequence starting with either the first base of the start codon for “cDNA” or the first amino acid for “Protein.” The first 4 polymorphisms occur in the triple-helix domain and the last 2 in the C-terminal non-collagenous domain.
^a Population sample where SNP was identified with the frequency of SNP in parentheses.
^b Polymorphism previously identified in association with osteopenia (Spotila et al. 1994; Dalgleish 1997).
^c Polymorphism previously identified, but with no known associated phenotype (Dalgleish 1997).


Supplementary Table 3

Observed Sp1-T Allele Frequencies

Population           Allele Frequency (%)
Global               8.8
Sub-Saharan African  5.6
North African        14.3
Middle Eastern       10.0
Russian              10.0
Chinese              0
Japanese             0
Southeast Asian      0
Mexican              20.0
Northern European    15.0
Italian              15.0


Supplementary Fig. 1. Estimates of COL1a1 population differentiation. Pairwise FST values are listed below the diagonal and Hudson’s (2000) Snn statistic above the diagonal. Both measures were calculated using SNPs >5% in frequency. * denotes statistically significant Snn values after a Bonferroni correction, P<0.001. See Materials and Methods (Chapter 3) for additional information.

Column abbreviations: SSA = Sub-Saharan African, NAf = North African, ME = Middle Eastern, Rus = Russian, Chi = Chinese, Jap = Japanese, SEA = Southeast Asian, Mex = Mexican, NEu = Northern European, Ita = Italian.

                     SSA    NAf    ME     Rus    Chi    Jap    SEA    Mex    NEu    Ita
Sub-Saharan African  -      0.72   0.83*  0.85*  0.88*  0.93*  0.82*  0.82*  0.82*  0.82*
North African        0.11   -      0.63   0.61   0.88*  0.90*  0.77   0.70   0.76*  0.69
Middle Eastern       0.25   0.07   -      0.52   0.75*  0.73   0.57   0.50   0.35   0.38
Russian              0.23   0.07   0      -      0.81*  0.76*  0.61   0.64   0.57   0.49
Chinese              0.35   0.20   0.07   0.08   -      0.57   0.47   0.73*  0.68   0.67
Japanese             0.32   0.22   0.07   0.06   0      -      0.44   0.75*  0.58   0.63
Southeast Asian      0.26   0.14   0.02   0.01   0      0      -      0.55   0.56   0.51
Mexican              0.30   0.06   0.06   0.09   0.10   0.18   0.10   -      0.60   0.57
Northern European    0.24   0.11   0      0      0.02   0      0      0.11   -      0.38
Italian              0.26   0.08   0      0      0.02   0.02   0      0.05   0      -


Supplementary Fig. 2. Inferred haplotypes for SNPs in the 5’ (on the left) and 3’ (on the right) “haploblocks” for each of the 10 human populations. Note that haplotypes in the 5’ region have been sorted independently of those in the 3’ region. The derived allele for each site, as inferred from human-chimpanzee contrasts, is represented with a grey box. Positions (in bp) are numbered starting with the first nucleotide of the first exon. See Results and fig. 4 (Chapter 3) for additional information.

[Figure, two pages: haplotype grids not recoverable from the extracted text. First page panels: Chinese, Sub-Saharan African, North African, Middle Eastern, Russian. Second page panels: Japanese, Southeast Asian, Mexican, Northern European, Italian.]

Position (bp), 5’ region: 368, 1126, 1897, 2313, 2616, 2706, 3380, 3419, 3577, 3642, 4245
Position (bp), 3’ region: 5714, 6814, 7436, 8771, 8966, 9443, 9567, 9848, 10426, 11284, 11578, 13141, 13327, 13443, 14966, 15848, 16094

Page 175: Molecular Evolution of Type I Collagen (COL1a1) and Its … · 2011. 8. 12. · Skeletal diseases related to reduced bone strength, like osteoporosis, vary in frequency and severity

164

Supplementary Fig. 3. Estimates of COL1a1 population differentiation for the 5’

region. Pairwise FST values are listed below the diagonal and Hudson’s (2000) Snn

statistic above the diagonal. Both measures were calculated using SNPs >5% in

frequency positioned 5’ to the identified “hotspot” (i.e., SNPs 368-4245 in fig. 4,

Chapter 3). * denotes statistically significant Snn values after a Bonferroni

correction, P<0.001. See Materials and Methods (Chapter 3) for additional

information.

Population               (1)    (2)    (3)    (4)    (5)    (6)    (7)    (8)    (9)    (10)
(1) Sub-Saharan African  -      0.58   0.78*  0.81*  0.88*  0.88*  0.83*  0.81*  0.76*  0.78*
(2) North African        0.13   -      0.52   0.52   0.66   0.67   0.60   0.52   0.61   0.57
(3) Middle Eastern       0.51   0.14   -      0.43   0.50   0.53   0.48   0.48   0.47   0.39
(4) Russian              0.48   0.12   0      -      0.51   0.57   0.48   0.45   0.53   0.47
(5) Chinese              0.62   0.27   0.02   0.01   -      0.48   0.46   0.56   0.53   0.50
(6) Japanese             0.62   0.27   0.05   0.04   0      -      0.49   0.62   0.50   0.52
(7) Southeast Asian      0.53   0.17   0      0      0      0      -      0.53   0.52   0.49
(8) Mexican              0.42   0.07   0      0      0.08   0.11   0.02   -      0.60   0.52
(9) Northern European    0.51   0.15   0      0      0.02   0.01   0      0.01   -      0.42
(10) Italian             0.53   0.17   0      0      0.02   0.04   0      0      0      -

Columns (1)-(10) follow the row order.


Supplementary Fig. 4. Estimates of COL1a1 population differentiation for the 3’

region. Pairwise FST values are listed below the diagonal and Hudson’s (2000) Snn

statistic above the diagonal. Both measures were calculated using SNPs >5% in

frequency positioned 3’ to the identified “hotspot” (i.e., SNPs 5714-16094 in fig.

4, Chapter 3). * denotes statistically significant Snn values after a Bonferroni

correction, P<0.001. See Materials and Methods (Chapter 3) for additional

information.

Population               (1)    (2)    (3)    (4)    (5)    (6)    (7)    (8)    (9)    (10)
(1) Sub-Saharan African  -      0.83*  0.64   0.76*  0.75*  0.68   0.66   0.82*  0.71   0.64
(2) North African        0.10   -      0.62   0.63   0.83*  0.90*  0.78*  0.59   0.78*  0.67
(3) Middle Eastern       0.04   0.02   -      0.51   0.58   0.69   0.51   0.46   0.48   0.38
(4) Russian              0.01   0.05   0      -      0.68   0.75*  0.57   0.57   0.54   0.43
(5) Chinese              0.12   0.14   0.08   0.05   -      0.53   0.52   0.63   0.52   0.58
(6) Japanese             0.07   0.19   0.09   0.03   0.01   -      0.43   0.72*  0.60   0.64
(7) Southeast Asian      0.03   0.12   0.05   0      0      0      -      0.57   0.48   0.48
(8) Mexican              0.21   0.06   0.11   0.15   0.09   0.21   0.15   -      0.57   0.51
(9) Northern European    0.02   0.09   0.01   0      0.02   0      0      0.15   -      0.44
(10) Italian             0.01   0.02   0      0      0.02   0.03   0      0.08   0      -

Columns (1)-(10) follow the row order.


Supplementary Fig. 5. Neighbor-joining tree constructed from COL1a1 SNPs

>5% allele frequency in our global population of 192 chromosomes, rooted with

chimpanzee. Sp1-T alleles are boxed in red. Certain clades lacking these alleles

have been collapsed to conserve space. The population sample of origin and the

number of alleles per population are shown for each branch. SS = sub-Saharan

African; NA = North African; ME = Middle Eastern; RU = Russian; CH =

Chinese; JA = Japanese; SA = Southeast Asian; MX = Mexican; NE = Northern

European; and IT = Italian. The scale bar represents 2 substitutions. See Results

(Chapter 3) for additional information.

[Supplementary Fig. 5 image: neighbor-joining tree rooted with chimpanzee. The tree topology and the per-branch population counts are not recoverable from the extracted text. Individually labeled chromosomes: NA17041b, NA17329a, NA17090b, NA17327b, NA17060a, NA17381b, NA16654a. Scale bar: 2 substitutions.]

APPENDIX C

SUPPLEMENTARY MATERIAL: CHAPTER 4


Supplementary Table 1

COL1a1 Gene Region Polymorphism and Divergence Estimates

                 Number of differences
Sites            Polymorphism   Divergence
Silent           76             74
Synonymous       12             15
Nonsynonymous    0              0
Promoter         4              8
UTR              1              1
First intron     11             9

Note: “Silent” includes synonymous and intron sites (excluding intron splice sites and the first intron). “Divergence” estimates are between human and chimpanzee.


Supplementary Table 2

Chimpanzee SNP Diversity Estimates for PCR Fragments 5’ and 3’ of COL1a1

PCR  Start^a  Stop^a   Region^b    n^c    S^d  θπ^e   D^f    C-H Diver.^g  C-B Diver.^h
1    -89718   -87795   Silent      1748   4    0.037  -0.73  1.030         0.400
2    -81315   -80010   Silent      1273   0    0      n/a    1.021         0.236
3    -69735   -68058   Silent      594    3    0.181  1.14   1.178         0.337
                       Promoter    722    5    0.290  1.99   0.277         1.939
4    -66607   -65281   Silent      1260   3    0.047  -0.33  0.873         0
5    -35260   -33999   Silent      1030   2    0.054  0.35   1.845         0.194
6    -27926   -26320   Silent      1337   8    0.199  1.18   1.047         0.150
7    -21648   -17817   Silent      3430   14   0.130  1.13   0.583         0.379
8    -15499   -13839   Silent      1271   4    0.134  1.95   0.708         0.157
9    -12214   -10925   Silent      1171   7    0.197  1.11   1.025         0.171
10   25163    26632    Silent      918    3    0.100  0.65   1.198         0.109
                       Nonsyn      134    2    0.490  0.74   0.749         0
                       UTR         235    0    0      n/a    2.979         0
11   27205    28521    Silent      1161   2    0.043  0.12   0.775         0.172
12   28764    30323    Silent      1479   10   0.176  0.32   1.082         0.203
13   30239    34092    Silent      2790   12   0.115  0.44   0.932         0.215
                       Nonsyn      605    1    0.031  -0.31  1.323         0.331
14   34938    37156    Silent      654    3    0.066  -0.86  0.306         0.459
                       Nonsyn      28     0    0      n/a    0             0
                       UTR         36     0    0      n/a    2.778         0
                       Promoter    1000   5    0.146  0.62   0.900         0.100
                       1st intron  487    3    0.162  0.26   0.821         0.205
15   42564    44028    Silent      1432   4    0.029  -1.34  0.559         0.070
16   52684    53980    1st intron  1198   1    0.004  -1.12  0.584         0
17   67857    69935    Silent      1941   3    0.028  -0.48  0.567         0.309
18   81653    82973    1st intron  1301   1    0.014  -0.31  0.922         0
19   86876    88525    Silent      1298   4    0.019  -1.76  0.539         0.385
                       Nonsyn      315    1    0.130  1.06   0.318         0
20   88741    90616    Silent      1612   9    0.141  0.22   0.931         0.062
5’^i -        -        Silent      13114  45   0.107  1.15   0.938         0.252
3’^i -        -        Silent      13286  50   0.084  -0.18  0.790         0.211

^a Start and stop positions refer to the first and last bp of the PCR primers according to the first base of the first exon of COL1a1 as position 1 as determined from the human genome reference sequence (NCBI build 36.1). PCR fragments with overlapping DNA sequences have been combined into a single “fragment.” See fig. 7 (Chapter 4) for a visual representation.

^b “Silent” includes intergenic, synonymous, and intron sites (excluding first introns and splice sites; see Materials and Methods, Chapter 4). Potentially functional regions are listed separately as nonsynonymous, promoter, and UTR depending on the contents of each PCR fragment (e.g., some fragments only spanned introns).

^c Number of nucleotides

^d Number of SNPs

^e Average number of pairwise differences between sequences (%)

^f Tajima’s D statistic

^g Chimpanzee-Human divergence as number of differences per nucleotide (%)

^h Chimpanzee-Bonobo divergence as number of differences per nucleotide (%)

^i Estimates derived from concatenating sequences from all PCR fragments 5’ or 3’ of COL1a1
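As an illustration of how the diversity columns relate to one another, the following Python sketch computes the number of segregating sites (S), the average number of pairwise differences (expressed per site as a percentage, as for θπ above), and Tajima's D using the standard formulas of Tajima (1989). It is a sketch under stated assumptions rather than the software used in this study: the function names and the toy alignment are hypothetical, and `n_seq` denotes the number of sampled sequences, which is distinct from the column n (number of nucleotides) in the table.

```python
import math
from itertools import combinations
from typing import Optional, Sequence


def segregating_sites(seqs: Sequence[str]) -> int:
    """S: number of alignment columns with more than one allele."""
    return sum(len(set(col)) > 1 for col in zip(*seqs))


def mean_pairwise_differences(seqs: Sequence[str]) -> float:
    """k: average number of differences over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    return sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs) / len(pairs)


def tajimas_d(n_seq: int, s: int, k: float) -> Optional[float]:
    """Tajima's (1989) D from sample size, S, and k; None when S = 0."""
    if s == 0:
        return None  # D is undefined without segregating sites
    a1 = sum(1 / i for i in range(1, n_seq))
    a2 = sum(1 / i**2 for i in range(1, n_seq))
    b1 = (n_seq + 1) / (3 * (n_seq - 1))
    b2 = 2 * (n_seq**2 + n_seq + 3) / (9 * n_seq * (n_seq - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n_seq + 2) / (a1 * n_seq) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    return (k - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


# Toy alignment (hypothetical): 4 sequences, 4 sites.
seqs = ["AAAA", "AAAT", "AATT", "AATT"]
k = mean_pairwise_differences(seqs)
pi_percent = 100 * k / len(seqs[0])  # pairwise differences per site, in %
print(segregating_sites(seqs), round(pi_percent, 2), tajimas_d(len(seqs), 2, k))
```

Returning `None` when S = 0 mirrors the rows above in which D is listed as n/a for fragments without SNPs.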


Supplementary Table 3

Bonobo SNP Diversity Estimates for PCR Fragments 5’ and 3’ of COL1a1

PCR  Start^a  Stop^a   Region^b    n^c    S^d  θπ^e   D^f    B-H Diver.^g
1    -89718   -87795   Silent      1748   2    0.017  -0.96  1.030
2    -81315   -80010   Silent      1273   3    0      -0.09  1.021
3    -69735   -68058   Silent      594    3    0.038  -1.74  1.178
                       Promoter    722    3    0.061  -1.07  0.277
4    -66607   -65281   Silent      1260   3    0.035  -1.09  0.873
5    -35260   -33999   Silent      1030   4    0.071  -0.80  1.845
6    -27926   -26320   Silent      1337   9    0.168  -0.16  1.047
7    -21648   -17817   Silent      3430   11   0.064  -0.78  0.583
8    -15499   -13839   Silent      1271   4    0.066  -0.54  0.708
9    -12214   -10925   Silent      1171   2    0.047  0.10   1.025
10   25163    26632    Silent      918    0    0      n/a    1.198
                       Nonsyn      134    0    0      n/a    0.749
                       UTR         235    0    0      n/a    2.979
11   27205    28521    Silent      1161   3    0.026  -1.51  0.775
12   28764    30323    Silent      1479   5    0.044  -1.42  1.082
13   30239    34092    Silent      2790   7    0.045  -0.96  0.932
                       Nonsyn      605    2    0.048  -0.96  1.323
14   34938    37156    Silent      654    0    0      n/a    0.306
                       Nonsyn      28     0    0      n/a    0
                       UTR         36     0    0      n/a    2.778
                       Promoter    1000   0    0      n/a    0.900
                       1st intron  487    1    0.016  -1.16  0.821
15   42564    44028    Silent      1432   2    0.011  -1.51  0.559
16   52684    53980    1st intron  1198   3    0.067  0.03   0.584
17   67857    69935    Silent      1941   7    0.031  -2.04  0.567
18   81653    82973    1st intron  1301   2    0.045  0.25   0.922
19   86876    88525    Silent      1298   4    0.058  -0.77  0.539
                       Nonsyn      315    1    0.024  -1.16  0.318
20   88741    90616    Silent      1612   8    0.059  -1.71  0.931
5’^h -        -        Silent      13114  41   0.063  -0.87  1.022
3’^h -        -        Silent      13286  36   0.035  -1.89  0.858

^a Start and stop positions refer to the first and last bp of the PCR primers according to the first base of the first exon of COL1a1 as position 1 as determined from the human genome reference sequence (NCBI build 36.1). PCR fragments with overlapping DNA sequences have been combined into a single “fragment.” See fig. 7 (Chapter 4) for a visual representation.

^b “Silent” includes intergenic, synonymous, and intron sites (excluding first introns and splice sites; see Materials and Methods, Chapter 4). Potentially functional regions are listed separately as nonsynonymous, promoter, and UTR depending on the contents of each PCR fragment (e.g., some fragments only spanned introns).

^c Number of nucleotides

^d Number of SNPs

^e Average number of pairwise differences between sequences (%)

^f Tajima’s D statistic

^g Bonobo-Human divergence as number of differences per nucleotide (%)

^h Estimates derived from concatenating sequences from all PCR fragments 5’ or 3’ of COL1a1


Supplementary Fig. 1: Inferred haplotypes for polymorphisms >5% frequency in

chimpanzees. Chromosomes with identical haplotypes have been combined into

one row with the number per haplotype listed on the side. Positions (bp) of each

site are numbered according to the first base of the first exon of COL1a1 as

position 1 as determined from the human genome reference sequence (NCBI

build 36.1). “5’ region” and “3’ region” refer to polymorphisms found within our

additional sequenced fragments 5’ or 3’ of COL1a1, respectively. Outside of

COL1a1, polymorphisms at nonsynonymous, promoter, UTR, and first intron

sites are labeled as such. Gene regions of all sites within the COL1a1 locus are

labeled and the separation of haplotypes into haplogroups A and B is indicated on

the side. The “ancestral” state for each site was inferred from human-chimpanzee-

macaque-orangutan contrasts. See Materials and Methods and Results (Chapter 4)

for additional information.

[Supplementary Fig. 1 image (three pages): matrix of inferred chimpanzee haplotypes. The allele states of the individual haplotypes did not survive text extraction; the recoverable labels are summarized here.

Rows: the “Ancestral” sequence followed by the observed haplotypes, each with its number of chromosomes (“# chromo.”), grouped into COL1a1 haplogroups A and B.

Columns, “5’ region”, positions (bp): -89270, -88456, -69443, -69144, -69001, -68958, -68854, -68640, -65508, -65444, -34718, -27765, -27642, -27499, -27451, -27426, -21142, -20607, -20543, -19975, -19850, -19662, -19417, -19383, -18800, -18350, -15203, -14601, -14498, -14278, -12156, -12098, -11775, -11285, -11247. Four of these sites are labeled “TMEM92 Promoter”.

Columns, “COL1a1”, gene-region labels matched to positions (bp) in figure order: Promoter: -1128, -746, -708, -678, -277; Intron 1: 111, 148, 165, 284, 509, 702, 726, 1064, 1227, 1324; Intron 2: 1831, 1881, 1919; Intron 4: 2134, 2163, 2169; Exon 5: 2264, 2282; Intron 5: 2361, 2522, 2578, 2617, 2672, 2777; Intron 6: 3138, 3233; Intron 7: 3431; Exon 8: 3550; Intron 8: 3599, 3701; Intron 9: 3836, 3911, 3937, 4037, 4253; Intron 10: 4436; Intron 11: 4688, 4792; Intron 12: 4927, 4969; Exon 13: 4991; Intron 13: 5076; Intron 14: 5234; Intron 15: 5403; Intron 16: 5624, 5790; Exon 17: 5860; Intron 19: 6351, 6358, 6403; Intron 20: 6498, 6499, 6638; Exon 21: 6787; Intron 21: 6827, 6848; Intron 22: 7010; Exon 23: 7075; Intron 23: 7246; Intron 24: 7429; Intron 27: 8777; Exon 28: 8832; Intron 28: 8889; Intron 29: 9165, 9209, 9238, 9484, 9485; Intron 30: 9569; Intron 34/Exon 35: 11171; Intron 40: 12372; Exon 48: 14795; Exon 49: 15098; Intron 49: 15345, 15414, 15443; Intron 50: 15774, 15816, 15826; 3'UTR: 16026.

Columns, “3’ region”, positions (bp): 26057, 26144, 26150, 26465, 28031, 28364, 28930, 29083, 29549, 29661, 29758, 29809, 29918, 31345, 31617, 32034, 32173, 32239, 32350, 32545, 32727, 32842, 32896, 33060, 35031, 35327, 35588, 35772, 35884, 36064, 36499, 36652, 43549, 68078, 68082, 82819, 87239, 89071, 89090, 89344, 90399, 90414, 90421, 90428, 90435, 90523. Site labels, in figure order: SGCA (Arg/Leu); SGCA (Asn/Ser); SGCA (Arg/Cys); SGCA Intron 1 (2 sites); SGCA Promoter (5 sites); SAMD14 Intron 1; SAMD14 (Ser/Thr); PDK2 3'UTR.]

Supplementary Fig. 2. LD comparisons among informative polymorphisms (>5%

frequency) in 40 chimpanzee chromosomes for sequenced chromosome 17

regions in and around COL1a1. See supplementary fig. 1 for polymorphism

positions. Significant pairwise associations according to r² are represented by the

blue filled boxes above the diagonal. Filled boxes below the diagonal represent

pairwise comparisons in significantly more (blue boxes) or less (red boxes) LD

than expected given a locus-specific evolutionary model of recombination. See

Materials and Methods and Results (Chapter 4) for additional information.

[Supplementary Fig. 2 image: pairwise LD matrix; no axis labels or values survived text extraction.]
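For reference, the pairwise r² measure of LD used in the LD comparison figures of this appendix can be sketched in a few lines of Python. This is a minimal illustration, assuming phased chromosomes and biallelic, polymorphic sites; the function name `r_squared` and the toy data are hypothetical and do not reproduce the analysis pipeline or the significance tests of this study.

```python
from typing import Sequence


def r_squared(site_a: Sequence[str], site_b: Sequence[str]) -> float:
    """r^2 between two biallelic sites scored on the same phased chromosomes."""
    n = len(site_a)
    ref_a, ref_b = site_a[0], site_b[0]  # arbitrary reference alleles
    p_a = sum(x == ref_a for x in site_a) / n
    p_b = sum(y == ref_b for y in site_b) / n
    p_ab = sum(x == ref_a and y == ref_b for x, y in zip(site_a, site_b)) / n
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium
    # Both sites must be polymorphic, otherwise the denominator is zero.
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


# Toy data (hypothetical): one character per chromosome at each site.
print(r_squared("AAAATTTT", "CCCCGGGG"))  # 1.0
print(r_squared("AATT", "CGCG"))          # 0.0
```

The first pair of sites is in complete association (r² = 1), whereas the alleles in the second pair are statistically independent (r² = 0).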

Supplementary Fig. 3. REHH vs. COL1a1 core haplotype frequency. REHH is

measured ~85 kb away from each core haplotype. REHH values have been

grouped according to haplotype frequency in overlapping bins of 5% (e.g., 10-

15%, 15-20%, etc.) with horizontal lines indicating the mean REHH for each bin.

Circled data points are REHH (and EHH in parentheses) for the haplotype 5’

(3.414) and 3’ (3.200) of the COL1a1 exon duplication, which are not

significantly different from the mean REHH observed in the 15-20% frequency

bin (P>0.09). See Results (Chapter 4) for additional information.


Supplementary Fig. 4. Inferred haplotypes for polymorphisms >5% frequency in

bonobos. Chromosomes with identical haplotypes have been combined into one

row with the number per haplotype listed on the left. Positions (bp) of each site

are numbered according to the first base of the first exon of COL1a1 as position 1

as determined from the human genome reference sequence (NCBI build 36.1). “5’

region” and “3’ region” refer to polymorphisms found within our additional

sequenced fragments 5’ or 3’ of COL1a1, respectively. Polymorphisms in

nonsynonymous, promoter, UTR, and first intron sites are labeled as such. The

“ancestral” state for each site was inferred from human-chimpanzee-macaque-

orangutan contrasts. See Results (Chapter 4) for additional information.

[Supplementary Fig. 4 image: matrix of inferred bonobo haplotypes. The allele states of the individual haplotypes did not survive text extraction; the recoverable labels are summarized here.

Rows: the “Ancestral” sequence followed by the observed haplotypes, each with its number of chromosomes (“# chromo.”).

Columns, “5’ region”, positions (bp): -89390, -80761, -80509, -69267, -69195, -68977, -65607, -65359, -34900, -34780, -27633, -27499, -27392, -27115, -26561, -26505, -21142, -20923, -20379, -20192, -19749, -19746, -19553, -15229, -14515, -11285.

Columns, “COL1a1”, positions (bp): 132, 11253.

Columns, “3’ region”, positions (bp): 27465, 29882, 29972, 30735, 31268, 31536, 32173, 35518, 52871, 53484, 68441, 81946, 88063, 89074, 89324, 90232, 90587.

Site labels, in figure order: TMEM92 Promoter (3 sites); Intron 1; SGCA (Arg/His); SGCA Promoter; PPP1R9B Intron 1 (2 sites); SAMD14 Intron 1; PDK2 3'UTR.]

Supplementary Fig. 5. LD comparisons among informative polymorphisms (>5%

frequency) in 20 bonobo chromosomes for sequenced chromosome 17 regions.

Positions (in bp) of each site are numbered according to the first base of the first

exon of COL1a1 as position 1 as determined from the human genome reference

sequence (NCBI build 36.1). Significant pairwise associations according to r² are

represented by the blue filled boxes above the diagonal. Filled boxes below the

diagonal represent pairwise comparisons in significantly more (blue boxes) or less

(red boxes) LD than expected given a locus-specific evolutionary model of

recombination. See Materials and Methods and Results (Chapter 4) for additional

information.
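As a reading aid for the r² values referred to in this caption, the following is a minimal sketch of the standard pairwise r² statistic for two biallelic sites, computed from phased haplotypes. The function name `r_squared` and the toy haplotypes are hypothetical and are not taken from the dissertation. The sketch covers only the statistic itself, not the significance test or the locus-specific recombination model described in Materials and Methods.

```python
def r_squared(haplotypes, i, j):
    """Return r^2 between biallelic sites i and j.

    haplotypes: list of equal-length strings, one per phased chromosome
    (hypothetical toy input; real data would come from the sequenced regions).
    """
    n = len(haplotypes)
    # Use the alleles carried by the first haplotype as the reference alleles.
    a = haplotypes[0][i]
    b = haplotypes[0][j]
    p_a = sum(h[i] == a for h in haplotypes) / n    # frequency of allele a at site i
    p_b = sum(h[j] == b for h in haplotypes) / n    # frequency of allele b at site j
    p_ab = sum(h[i] == a and h[j] == b for h in haplotypes) / n  # frequency of the a-b haplotype
    d = p_ab - p_a * p_b                            # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic")
    return d * d / denom


# Two sites in complete association, then two sites with no association.
print(r_squared(["AC", "AC", "GT", "GT"], 0, 1))  # -> 1.0
print(r_squared(["AC", "AT", "GC", "GT"], 0, 1))  # -> 0.0
```

Because r² depends on both allele frequencies, comparisons are usually restricted to common variants, consistent with the >5% frequency filter stated in the caption.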

[Supplementary Fig. 5 axis labels, Position (bp). Negative positions: -89390, -80761, -80509, -69267, -69195, -68977, -65607, -65359, -34900, -34780, -27633, -27499, -27392, -27115, -26561, -26505, -21142, -20923, -20379, -20192, -19749, -19746, -19553, -15229, -14515, -11285. Positive positions (boundaries inferred from the unbroken digit run in the extraction): 132, 11253, 27465, 29882, 29972, 30735, 31268, 31536, 32173, 35518, 52871, 53484, 68441, 81946, 88063, 89074, 89324, 90232, 90587.]


Supplementary Fig. 6. Neighbor-joining tree based on the number of substitutions

(excluding nonsynonymous sites) among 40 chimpanzee chromosome sequences

at the COL1a1 locus. The two high-frequency COL1a1 haplogroups are indicated

with “A” and “B.” Sequences bearing the exon duplication are boxed in red. The

tree is rooted using the “Ancestral” sequence inferred from human-chimpanzee-

orangutan-macaque contrasts. The scale bar represents 5 substitutions. See

Results (Chapter 4) for more information.


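[Supplementary Fig. 6 tree labels as extracted. Root: Ancestral. Tip groups: (1b, 6b, 9a, 9b, 14b, 18b, 19b, 20b); (2b, 3a, 19a); (3b, 17b); (10b, 12b, 13b, 16b); (7a, 8b, 10a, 11a); (12a); (1a, 5a, 8a, 15a, 15b, 16a, 17a); (5b, 20a); (11b, 13a, 14a); (2a, 4a, 4b, 6a, 18a). Haplogroup labels: A, B. Scale bar: 5.]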


Supplementary Fig. 7. Neighbor-joining tree based on the number of substitutions

(excluding nonsynonymous sites) among 40 chimpanzee chromosome sequences

for the COL1a1 locus and flanking regions spanning ~180 kb (see fig. 7, Chapter

4). The two high-frequency COL1a1 haplogroups are indicated with “A” and “B.”

Sequences bearing the COL1a1 exon duplication are boxed in red. The tree is

rooted using the “Ancestral” sequence inferred from human-chimpanzee-

orangutan-macaque contrasts. The scale bar represents 10 substitutions. See

Results (Chapter 4) for more information.


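[Supplementary Fig. 7 tree labels as extracted. Root: Ancestral. Tip groups: (6b, 7b, 18b, 19b, 20b); (1b, 9a, 14b); (3b, 17b); (10b, 12b, 16b); (9b); (19a, 2b); (13b); (3a); (2a, 4b, 18a); (6a); (8a, 15a, 15b, 16a, 17a); (5b); (10a, 7a, 8b, 11a); (4a, 12a); (13a, 11b); (20a); (1a, 5a); (14a). Haplogroup labels: A, B. Scale bar: 10.]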