Title Page

An Investigation Into The

Distribution Of Human Molecular

Genetic Variation In Sub-Saharan

Africa

Krishna Ranganaden Veeramah

SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

JANUARY 2008

MINOR CORRECTIONS SUBMITTED JUNE 2008

UNIVERSITY COLLEGE LONDON (UCL)

The Centre for Genetic Anthropology

Department of Biology

Supervisor: Dr Mark G. Thomas

Second Supervisor: Dr Mike E. Weale

Declaration of Ownership

I, Krishna Ranganaden Veeramah, confirm that the work presented in this

thesis is my own. Where information has been derived from other sources, I

confirm that this has been indicated in the thesis.

Abstract

Sub-Saharan Africa is believed to possess more human genetic diversity than any other

region of the world, a likely consequence of it being the probable place of origin of

anatomically modern man. Despite its evolutionary importance, studies into the

distribution of this genetic variation have been somewhat limited in comparison to

Europe, Asia and the Americas, especially with respect to fine-scale studies that would

help elucidate local histories and the consequences of ethnic and linguistic interactions.

Another possible consequence of this lack of knowledge of genetic diversity is that much

information on functionally important genetic variants that are potentially relevant to

pharmacogenetic research is not available. This lack of information can add to an already

prevalent Eurocentric ascertainment bias in current knowledge of genetic variation,

depriving sub-Saharan African communities of the potential medical benefits

pharmacogenetics has to offer. This thesis describes three case studies that form part of an

investigation into human genetic variation in sub-Saharan Africa.

Chapter 2 uses sex-specific genetic systems to successfully differentiate between two

alternative oral histories of the ethnogenesis of the Nso´ people of Cameroon. Chapter 3

establishes that substantial male and female gene flow has occurred among the peoples of

the Cross River region of Nigeria, a region that includes multiple ethnic groups speaking

distinct languages that appear to have separated hundreds to thousands of years ago.

Chapter 4 demonstrates that the drug metabolising enzyme Flavin-containing

Monooxygenase 2, which has been shown to be non-functional in all European and

Asian individuals collected to date, has a putative functional allele in approximately one

third of sub-Saharan Africans, a finding that may have important implications for

therapeutic intervention strategies and xenobiotic exposure. This thesis demonstrates inter

alia the value of conducting genetic studies in sub-Saharan Africa using large datasets of

well-known provenance.

Table Of Contents

Title Page
Declaration of Ownership
Abstract
Table Of Contents
List of Figures and Tables
Abbreviations
Acknowledgements
1. Introduction
1.1. Rationale Of The Study
1.2. Notable Geographical Features Of Sub-Saharan Africa
1.3. The Languages Of Sub-Saharan Africa
1.3.1. Important Methods and Concepts of Historical Linguistics
1.3.2. The Distribution of sub-Saharan African Languages
1.4. Previous Work On The Distribution Of Genetic Variation In Sub-Saharan Africa
1.4.1. Classical Markers
1.4.2. Molecular Data
1.5. DNA Sampling Issues
1.6. Statement of work performed by Krishna Veeramah in this thesis
1.6.1. Chapter 2
1.6.2. Chapter 3
1.6.3. Chapter 4

2. Sex-Specific Genetic Data Support One of Two Alternative Versions Of The Foundation Of The Ruling Dynasty Of The Nso´ In Cameroon
2.1. Introduction
2.1.1. The geography, history and sociology of the Nso´
2.1.2. Expectations of sex-specific genetic variation in the Nso´
2.2. Materials and Methods
2.2.1. Sample Collection Procedure
2.2.2. Y-chromosome typing
2.2.3. mtDNA typing
2.2.4. Statistical and Population Genetic Analysis
2.2.5. Dating of the Y*(xBR,A3b2) clade
2.2.6. Comparison of duy vs nshiylav and mtaar genealogy depths
2.3. Results and Discussion
2.3.1. The NRY and mtDNA distribution in the Nso´
2.3.2. Association of the Y*(xBR,A3b2) lineage with the indigenous hunter-gatherer Visale
2.3.3. Dating of the Y*(xBR,A3b2) lineage in the Nso´
2.3.4. The possible evolution of a relaxed patrilineal system of descent for the won nto´
2.4. Conclusion
2.5. Supplementary Section for Chapter 2
2.5.1. Supplementary Section 2S.1: The expectation of NRY type frequencies in the won nto´ and duy of the Nso´

3. It All Depends On The Scale: Little Sex-Specific Genetic Variation In The Presence Of Substantial Language Variation In Peoples Of The Cross River Region Of Nigeria Assessed Within The Wider Context Of West Central Africa
3.1. Introduction
3.1.1. A brief description of the Peoples and Languages of the Cross River region
3.1.2. Genetics and Language
3.1.3. Expectations of the distribution of NRY and mtDNA variation in the Cross River region
3.2. Materials and Methods
3.2.1. Sample Collection Procedure
3.2.2. Y-chromosome typing
3.2.3. mtDNA typing
3.2.4. Statistical and Population Genetic Analysis
3.3. Results
3.3.1. The distribution Of NRY variation
3.3.2. The distribution of mtDNA variation
3.3.3. Are clan communities collected from different locations distinguishable?
3.3.4. Are different clans of the same language group collected from the same location distinguishable?
3.3.5. Are different language groups collected from the same location distinguishable?
3.3.6. Are the same language groups collected from different locations distinguishable?
3.3.7. Are speakers of the six Cross River languages distinguishable?
3.3.8. Are speakers of the six Cross River languages distinguishable when two groups from Igboland are added to the analysis?
3.3.9. Can differences between the Cross River region and Cameroonian and Ghanaian groups be established?
3.3.10. Are there correlations of genetic distances and geographic and linguistic distances?
3.3.11. The Origins of the Efik
3.4. Discussion
3.4.1. General observations regarding NRY and mtDNA variation
3.4.2. The Cross River region as a genetically homogenous region
3.4.3. Cross River, Ghana and Cameroon as genetically distinct regions
3.4.4. No genetic evidence that the Efik Uwanse have an origin in ancient Palestine
3.5. Conclusion
3.6. Supplementary Section for Chapter 3

4. The potentially deleterious functional variant FMO2*1 is at high frequency throughout sub-Saharan Africa
4.1. Introduction
4.1.1. Previous work on Flavin-containing Monooxygenase 2
4.1.2. The rationale for studying FMO2 in Africans
4.2. Materials and Methods
4.2.1. Sample Collection
4.2.2. g.23238C>T typing
4.2.3. Statistical and Population Genetic Analysis
4.3. Results
4.3.1. The distribution of 23238C>T in Africa
4.3.2. Examining FMO2 for evidence of Natural Selection
4.3.3. Analysis of NIEHS FMO2 re-sequencing data
4.4. Discussion
4.4.1. Functional FMO2 is found at high frequency throughout sub-Saharan Africa
4.4.2. The possible consequences of FMO2 functionality in Africans
4.4.3. The Evolution of FMO2
4.5. Conclusion

5. Conclusion
5.1. Implications for investigating human history and behaviour
5.2. Implications for investigating medically relevant genetic variation
5.3. Future Work
5.3.1. Future work derived from Chapter 2 (Sex-Specific Genetic Data Support One Of Two Alternative Versions Of The Foundation Of The Ruling Dynasty Of The Nso´ In Cameroon)
5.3.2. Future work derived from Chapter 3 (It All Depends On The Scale: Little Sex-Specific Genetic Variation In The Presence Of Substantial Language Variation In Peoples Of The Cross River Region Of Nigeria Assessed Within The Wider Context Of West Africa)
5.3.3. Future work derived from Chapter 4 (The potentially deleterious functional variant FMO2*1 is at high frequency throughout sub-Saharan Africa)
5.4. Final Comments
Appendix A: Criteria for and problems associated with collecting African samples for The Centre for Genetic Anthropology (TCGA) DNA bank
Appendix B: An example sociological data sheet used during DNA sample collection
Appendix C: Extraction of DNA from Buccal Swabs
Appendix D: Legends of figures and tables found on the attached CD
Appendix E: LRH test Source Code
References

List of Figures and Tables

Figure 1.1: A political map of Africa (from the Perry-Castañeda map collection).
Figure 1.2: A physical geography map of Africa (from the Perry-Castañeda map collection).
Figure 1.3: A simplified linguistic map of Africa according to the classification of Greenberg (1963) (a vectorisation by Mark Dingemanse).
Figure 1.4: Average linkage tree for 42 populations. The abscissa shows the genetic distances (modified Nei) calculated on the basis of 120 allele frequencies taken from 42 classic genetic marker systems. Taken directly from Cavalli-Sforza et al. (1994).
Table 1.1: Some examples of natural selection detected in Africans.
Figure 2.1: Map showing towns in Cameroon where samples were collected.
Figure 2.2: Lineage tree showing the relationship of won nto´ individuals and the transition of won nto´ to duy under Royal Social Status Rule A. (M = male offspring, F = female offspring, * = individual inherits the same NRY type as a fon). Won nto´ are shown in black and duy in red.
Figure 2.3: Lineage tree showing the relationship of won nto´ individuals and the transition of won nto´ to duy under Royal Social Status Rule B. (M = male offspring, F = female offspring, * = individual inherits the same NRY type as a fon). Won nto´ are shown in black and duy in red.
Figure 2.4: Genealogical relationships of UEP markers used to define NRY haplogroups.
Table 2.1: Distribution of NRY haplogroups (NRY at UEP level) in the four Nso´ social classes.
Table 2.2: Distribution of NRY haplogroups in the peoples of the western Grassfields and Tikar plain.
Figure 2.5: PCO plot of UEP-based population pairwise FST values. The PCO plot is constructed using pairwise genetic distances, FST, between the four Nso´ classes (labelled by name) and other populations of the western Grassfields and Tikar Plain (labelled using abbreviations as defined in Table 2.2). PCO1 and PCO2 explain 97.91% and 1.92% of the variation respectively.
Table 2.3: Comparison of the depth of two genealogies. The probability of observing results equal to or more extreme than the difference between the Average Squared Distance values of a) the duy and b) the nshiylav and mtaar combined. (Three independent run simulations for each set of criteria)
Table 2.4: Cultural identity of won nto´ males sampled in the study as well as the cultural identity of each sample’s father, mother, father's father and mother's mother.
Supplementary Figure 2S.1: Lineage tree showing the relationship of won nto´ individuals under Royal Social Status Rule A. (M = male offspring, F = female offspring, * = this individual inherits the same NRY type as a fon).
Supplementary Figure 2S.2: Diagram showing the relative contributions of different won nto´ lineages to the won nto´ under Royal Social Status Rule A.
Supplementary Figure 2S.3: Lineage tree showing the relationship of won nto´ individuals under Royal Social Status Rule A for the two most recent adult generations of won nto´ males. (M = male offspring, F = female offspring, * = this individual inherits the same NRY type as a fon. Numbers refer to specific Lineage Representatives).
Supplementary Table 2S.3: Probability of sampling Nso´ Y-chromosomes given Royal Social Status Rule A.
Supplementary Figure 2S.4: Lineage tree showing the transition of won nto´ to duy under Royal Social Status Rule A. (M = male offspring, F = female offspring, * = this individual inherits the same NRY type as a fon). Duy are shown in red.
Supplementary Figure 2S.5: Lineage tree showing the relationship of won nto´ individuals under Royal Social Status Rule B. (M = male offspring, F = female offspring, * = this individual inherits the same NRY type as a fon).
Supplementary Figure 2S.6: Diagram showing the relative contributions of different won nto´ lineages to the won nto´ under Royal Social Status Rule B.
Supplementary Figure 2S.7: Lineage tree showing the relationship of won nto´ individuals under Royal Social Status Rule B for the two most recent adult generations of won nto´ males. (M = male offspring, F = female offspring, * = this individual inherits the same NRY type as a fon. Numbers refer to specific Lineage Representatives).
Supplementary Table 2S.4: Probability of sampling Nso´ Y-chromosomes given Royal Social Status Rule B.
Supplementary Figure 2S.8: Lineage tree showing the transition of won nto´ to duy under Royal Social Status Rule B. (M = male offspring, F = female offspring, * = this individual inherits the same NRY type as a fon). Duy are shown in red.
Supplementary Figure 2S.9: Lineage tree showing the relationship of won nto´ individuals under Royal Social Status Rule C for the two most recent adult generations of won nto´ males. (M = male offspring, F = female offspring, * = this individual inherits the same NRY type as a fon. Numbers refer to specific Lineage Representatives).
Supplementary Figure 2S.10: Lineage tree showing the relationship of won nto´ individuals under Royal Social Status Rule D for the two most recent adult generations of won nto´ males. (M = male offspring, F = female offspring, * = this individual inherits the same NRY type as a fon. Numbers refer to specific Lineage Representatives).
Supplementary Table 2S.5: Probability of sampling Nso´ Y-chromosomes given Royal Social Status Rule C.
Supplementary Table 2S.6: Probability of sampling Nso´ Y-chromosomes given Royal Social Status Rule D.

Figure 3.1: Map showing the position where samples were collected from in West Central Africa. Political borders are shown by black lines. Colour bar indicates elevation in metres.
Table 3.1: Summary of cultural practices of Cross River ethnic groups utilised in this study.
Figure 3.2: Broad relationships of the differing language groups used or described in this chapter based on Williamson and Blench (2000). Branch lengths are not informative.
Table 3.2: Nigerian Cross River sample collection details.
Table 3.3: First languages of parents of Cross River region samples utilised in this study.
Table 3.4a: Lexicostatistic similarity percentages for various Niger-Congo languages. ‘?’ indicates no available data.
Table 3.4b: Lexicostatistic dissimilarity matrix for 6 Cross River languages, 3 Cameroon Grassfields languages and 2 Ghanaian languages.
Figure 3.3: Language network based on distance matrix inferred from partial lexicostatistic matrix (Table 3.4b).
Table 3.5: Haplogroup proportions in Cross River, Cameroonian Grassfield and Ghanaian groups.
Table 3.6: Hierarchical AMOVA results of Cross River, Cameroonian and Ghanaian groups at various molecular levels. Colour indicates significance level of Fixation Indices P-values: Yellow = 0.05>P>0.01, Orange = 0.01>P>0.001, Red = P<0.001. Each grouping is followed, indicated by ‘n’, by the number of groups and, if applicable, the number of individual populations analysed.
Figure 3.4: Consensus neighbour joining trees for Cross River population using various methods of genetic distance for both NRY and mtDNA. Only individual node bootstrap values over 30% are shown on tree.
Table 3.7a: ETPD P-values (upper triangle) at various NRY and mtDNA levels for pooled Cameroonian, Ghanaian and Nigerian datasets. Colour code is same as Table 3.6.
Table 3.7b: Genetic Distances (lower triangle) and P-values (upper triangle) at various NRY and mtDNA levels for pooled Cameroonian, Ghanaian and Nigerian datasets. Colour code is same as Table 3.6.
Figure 3.5: Various PCO plots at different NRY and mtDNA analysis levels for populations from the Cross River region, the Cameroon Grassfields and Ghana.
Table 3.8: Results of Mantel and Partial Mantel tests at different levels of NRY and mtDNA analysis using various distance matrices. Colour code is same as Table 3.6.
Table 3.9: NRY and mtDNA haplogroup frequencies in the Efik Uwanse.
Figure 3.6: Various PCO plots at different NRY and mtDNA analysis levels for populations from the Efik Uwanse and comparison populations.
Figure 4.1: Diagrammatic representation of 23238C>T SNP restriction enzyme assay.
Figure 4.2: #rs6661174 Mbo/MseI Complementary Restriction Enzyme Digest Banding Patterns.
Figure 4.3: Distribution of 10,000 EHH values calculated 0.4cM from core alleles with A) SNP haplotype density not controlled and B) SNP haplotype density controlled at 1 SNP per 0.05cM (8 SNP extended haplotypes).
Table 4.1: 23238C>T Genotype and Allele frequencies.
Figure 4.4: Map showing the percentage of individuals with at least one FMO2*1 allele in Africa and two nearby countries.
Table 4.2: Pearson's Chi Square Test on individual regions.
Table 4.3: Fisher's Exact tests between regions.
Table 4.4: Fisher's Exact tests between CEA populations.
Figure 4.5: PCO plot of 23238C>T-based population FST values.
Figure 4.6: Contour map based on FMO2*1 allele frequencies in Central East African populations with areas of rapid allele frequency change shown with blue circles.
Figure 4.7: Spatial Autocorrelation Analysis of 23238C>T allele frequency data using (A) Moran’s I and (B) Geary’s c.
Table 4.5: Table of ethnic identities found in the various populations examined in this chapter.
Table 4.6: Various Population Pairwise Fisher’s Exact Tests.
Table 4.7: P-values for EHH values calculated at various genetic (a) and physical distances (b) from alleles present at the 23238C>T locus in the upstream (-) and downstream (+) directions in the YRI, CEU and CHB+JPT datasets with a SNP haplotype density of 0.05cM per SNP in (a) and 10kb per SNP in (b).
Table 4.8: Table showing inferred haplotypes for FMO2 genomic variants from NIEHS sequencing data.

Supplementary Table 2S.1: Distribution of NRY types, defined by UEP haplogroups and microsatellite haplotypes, in the four Nso´ social classes and people of the western Grassfields and Tikar Plain.
Supplementary Table 2S.2: Distribution of mtDNA types, defined by VSO haplotypes, in the four Nso´ social classes.
Supplementary Table 2S.7: Confidence intervals for TMRCA calculations in the duy, the nshiylav and mtaar, and the won nto´ and duy, using two mutation models.
Supplementary Table 3S.1: Pairwise ETPD P-values for various levels of NRY and mtDNA analysis for Cross River samples, Cameroon and Ghana. Level of analysis shown in top left cell of matrix. Colour code is same as Table 3.6.
Supplementary Table 3S.2: Distribution of NRY types, defined by UEP haplogroups and microsatellite haplotypes, in the Cross River region, Cameroon and Nigeria.
Supplementary Table 3S.3: Pairwise genetic distances and associated P-values for various levels of NRY and mtDNA analysis. Level of analysis shown in top left cell of matrix. Colour code is same as Table 3.6.
Supplementary Table 3S.4: Distribution of mtDNA types, defined by HVS-1 mtDNA haplogroups and VSO haplotypes, in the Cross River region, Cameroon and Nigeria.
Supplementary Table 3S.5: Distribution of NRY types, defined by UEP haplogroups and microsatellite haplotypes, in Ethiopia, Israeli and Palestinian Arabs, Lake Chad and Sudan.
Supplementary Table 3S.6: Distribution of mtDNA types, defined by HVS-1 mtDNA haplogroups and VSO haplotypes, in Ethiopia, Israeli and Palestinian Arabs, Lake Chad and Sudan.
Supplementary Table 3S.7: Pairwise genetic distances and associated P-values for various levels of NRY and mtDNA analysis for Efik Uwanse comparisons. Colour code is same as Table 3.6.

Abbreviations

AQ Amodiaquine

AMOVA Analysis of Molecular Variance

ASD Average Squared Distance

CAF Central African Republic

CEA Central East Africa

CI Confidence Interval

CYP Cytochrome P450

DRC Democratic Republic of Congo

DNA Deoxyribonucleic acid

DME Drug Metabolising Enzyme

ETA Ethionamide

ETPD Exact Test of Population Differentiation

EBSP Expansion of the Bantu-speaking peoples

EHH Extended Haplotype Homozygosity

FMO Flavin-containing Monooxygenase

HIV Human Immunodeficiency Virus

HVS-1 Hypervariable Segment 1

K2 Kimura 2 parameter

LR Lineage Representative

L-SMM Linear Length Dependent Stepwise Mutation Model

LD Linkage Disequilibrium

LRH Long Range Haplotype

MS Microsatellite

mtDNA mitochondrial DNA

MRCA Most Recent Common Ancestor

MR Multiregional

NIEHS National Institute of Environmental Health Sciences

NJ Neighbour Joining

NRY Non-Recombining portion of the Y chromosome

NA North Africa

PCR Polymerase Chain Reaction

PCA Principal Component Analysis

PCO Principal Co-ordinate Analysis

RAO Recent African Origin

RFLP Restriction Fragment Length Polymorphism

SMM Simple Stepwise Mutation Model

SNP Single Nucleotide Polymorphism

SEA South East Africa

TCGA The Centre for Genetic Anthropology

TMRCA Time to the Most Recent Common Ancestor

UEP Unique Event Polymorphism

UPGMA Unweighted Pair Group Method with Arithmetic Mean

VSO Variable Site Only

WA West Africa

WMH Won ntoʹ Modal Haplotype

Acknowledgements

I thank all those individuals who gave samples as well as those who helped in the

collection of samples, especially Matthew Forka and Loveline Lum who accompanied me

during fieldwork in Cameroon.

I also thank those individuals who have made my experience at UCL, and TCGA in

particular, over the last 4 years an enjoyable (yet productive!) one. This includes (in a

somewhat chronological order) Abigail Jones, Karine Rousseu, Ian Barnes, Elizabeth

Caldwell, Charlotte Mulcare, Isabel Homlquist, Andrew Loh, Luke Warren, Gianpiero

Cavalleri (JP), Catherine Ingram, Lorenzo Zannette, Ana Texieira, Chris Plaster, Yuval

Itan, Sarah Browning, Adam Powell, Laura Horsfall, Naser Ansari Pour and Lauren

Johnson.

I am also indebted to (this time in no particular order) Professor Elizabeth Shephard,

Professor Ian Philips, Professor David Zeitlyn, Dr Bruce Connell, Professor Verkijika

Fanso, Professor Robert Griffiths, Professor Dallas Swallow, Professor Sue Povey,

Professor Nancy Mendell and (last but by no means least) Dr Mike Weale for their

invaluable guidance during the different works described in this thesis.

However my main thanks are reserved for my supervisor Dr Mark Thomas, my industrial

sponsor Dr Neil Bradman and my parents, Lutcheemee and Ven Veeramah. Without their

support I would have simply been unable to undertake and complete this work.

Chapter 1:

Introduction

1. Introduction

1.1. Rationale Of The Study

Africa is the world's second largest continent at some 30 million km², with the vast

majority of the landmass lying south of the Sahara desert, a region termed sub-Saharan

Africa. Sub-Saharan Africa (more specifically somewhere between eastern and southern

Africa (Quintana-Murci et al. 1999; Jobling, Hurles & Tyler-Smith 2004-Chapter 8;

Prugnolle, Manica & Balloux 2005; Ray et al. 2005; Amos & Manica 2006; Liu et al.

2006; Cramon-Taubadel & Lycett 2007)) is also now widely accepted as the likely place

of origin of anatomically modern man, the ancestor of all present day humans. Relative to

its actual geographical size the population of sub-Saharan Africa is small, at

approximately 700 million individuals, in comparison to the much geographically smaller

areas that constitute China and India. However the population growth rate is the highest

in the world at around 2.5% per year. Sub-Saharan Africa is linguistically diverse with

approximately 2000 different languages (Crystal 1997; Ethnologue 2005) distributed

across the region. Given the close ties that exist between language and social identity this

demonstrates a considerable variety of ethnic groups, many of which have complex

relationships with other neighbouring and distant groups. However, to much of the

developed world sub-Saharan Africa is best known for its high poverty rate, political

instability and increased incidence of infectious diseases such as Tuberculosis, Malaria

and HIV, aspects which, especially recently, have led to a great deal of media attention.

Though Africa has been described as the 'cradle of humanity', prehistory in much of

sub-Saharan Africa extends to recent times (Ki-Zerbo 1989) with many ethnic groups,

especially those found inland, having documented records dating only to the arrival of

European colonists. As a consequence, archaeology, anthropology and linguistics have

been important tools in attempts to piece together sub-Saharan African histories.

Over the past 50 or so years researchers have examined the distribution of human genetic

variation in Africa ('genetic' here referring to low resolution phenotypic data such as

blood groups as well as higher resolution molecular data). Many studies had as their focus

an attempt to distinguish between the multi-regional and out-of-Africa models for the

origins of Homo sapiens sapiens (Cann, Stoneking & Wilson 1987; Vigilant et al. 1991;

Chen et al. 1995; Horai et al. 1995; Jorde et al. 1995; Horai 1995; Seielstad et al. 1999;

Takahata, Lee & Satta 2001; Liu et al. 2006). They often compared sets of individuals of

African and non-African ancestry with the results favouring the out-of-Africa explanation

(discussed later). In addition a number of studies have examined variation associated with

infectious diseases prevalent in sub-Saharan Africa, with many reports between the 1960s

and 1980s utilising serological techniques in the investigation of blood-based disorders

such as Glucose-6-phosphate dehydrogenase deficiency. The general consensus from

studies performed up to the present day is that human genetic variation appears to be

greater in sub-Saharan Africa than in the rest of the world (Olerup et al. 1991; Vigilant et

al. 1991; Bowcock et al. 1994; Armour et al. 1996; Tishkoff et al. 1996; Seielstad et al.

1999; Kaessmann et al. 1999; Jorde et al. 2000; Yu et al. 2002; Macfarlane & Simmonds

2004; Witherspoon et al. 2006). However it is not yet clear how this greater genetic

variation in Africa is distributed.

While there has been some work examining sub-Saharan African populations on a macro-

scale (covering multiple politically defined countries), with the general trend in

conclusions being that a) genetic distance among populations increases with geographic

distance and b) genetic diversity decreases as geographic distance increases from East

Africa (Prugnolle, Manica & Balloux 2005; Handley et al. 2007), there is very little

available information on variation at a fine-scale.

For example what is the extent of variation within defined groupings: among individuals,

within villages and among towns, within and among ethnic groups and within and among

larger geographic regions? Also is ethnic identity, language spoken or location better

correlated with genetic differences among groups?

Study of the distribution of human genetic variation in sub-Saharan Africa may lead to

interesting insights into our past. Previous work elsewhere in the world has already shown

the power of genetics, including the sex-specific genetic systems, in elucidating different,

and often very specific, aspects of population demographic history, for example origins

and migration events. As mentioned previously studies have already examined African

populations in the course of trying to ascertain the geographic origins of humanity and, at

the continent wide scale, studies on the distribution of genetic variation in sub-Saharan

Africa have provided insights into the expansion of the Bantu-speaking people through

demic rather than cultural diffusion processes (Underhill et al. 2001; Cruciani et al. 2002;

Salas et al. 2002; Wood et al. 2005). This expansion is considered to be one of the largest

human migration events of the recent past and is believed to have started some 4000 years

ago when agriculturists spread from around the present day Nigerian-Cameroonian border

into much of sub-Saharan Africa, propagating the many Bantu languages now

encountered in the region. However genetic analysis has been little used in the region to

uncover population history at a fine-scale, e.g. for particular ethnic groups. Given that

many peoples in sub-Saharan Africa have a relatively recent prehistory, with accounts of

their past reliant on oral histories sometimes supplemented by archaeological and

linguistic data, genetic studies may well be a useful tool when used alongside other more

traditional disciplines.

Patterns of genetic variation observed on a fine-scale in sub-Saharan Africa should be of

particular interest to linguists working in the region. Approaches used to model the

evolution of DNA are often similar to those used to describe the evolution of languages.

Relationships in both can be represented by phylogenetic trees, though these trees are

often simplifications of much more complex processes (DNA trees can be greatly affected

by recombination, while language trees are subject to problems of horizontal

transmission/word borrowing and language replacement). Despite all the difficulties, if

phylogenies based on genetic data correspond with those based on language this will

often be seen as highly significant evidence of a particular model of human behaviour or

demographic history. However linguists have often questioned the approach of genetic

studies (MacEachern 2000), especially because of the lack of appropriate sampling,

sometimes referring to the methodology as the 'out of the freezer' approach (Blench 2006

p. 20). This may go some way to explaining why studies that show correlations between

genetic and linguistic data often involve large distances, making it difficult to disentangle

the relative contribution of geographic distance in any correlations of genetics and

language. As Africa possesses almost a third of the world's languages it is of interest to

establish whether careful and appropriate sampling at a fine-scale can help researchers

gain greater insight, especially in regard to linguists working on questions related to the

effects of language contact. If groups each speaking a different language can co-exist in

close geographic proximity, can these same groups maintain genetic isolation from each

other over long periods of time in such circumstances? At the same time linguistic

information can be important in structuring genetic studies since linguistics will often be

highly correlated with culture and therefore patterns of linguistic evolution can be used to

investigate patterns of cultural evolution. If an aim is to establish whether geography or

ethnicity is the better predictor of genetic difference, the languages spoken may be a vital

component in the differentiation and maintenance of identity.

No less important is the potentially beneficial medical application of knowledge of

diversity within the peoples of sub-Saharan Africa. The substantial genetic diversity

present in sub-Saharan Africa suggests that there are likely to be many alleles present in

individuals that contribute to increased resistance to a variety of infectious diseases

indigenous to the region. Some alleles may well be restricted to a particular group or

groups. In addition while common non-infectious diseases, under the common

disease/common variant hypothesis, would be expected to be caused by a few high

frequency variants present in a number of populations, rare alleles that probably lead to

complex disease susceptibility may well be restricted to particular groups, each with their

own specific set of causative alleles (Tishkoff & Williams 2002). Knowledge and

understanding of these resistance and susceptibility variants can aid in understanding the

causes of disease and contribute to new therapeutic interventions. However, gaining this

knowledge will require screening many individuals and groups, a task which is not

always easy in sub-Saharan Africa.

One way in which understanding patterns of distributions of genetic variation in sub-

Saharan Africa has the potential to bring relatively early benefits to its inhabitants is in

the selection, when choices are available, of the more appropriate pharmaceutical

intervention. Pharmacogenetics is the study of the genetic basis of individual variation in

drug response (Johnson 2003; Weinshilboum 2003; Wilke et al. 2007). This field is

receiving increasing attention from the scientific community as the ability to analyse ever

greater numbers of genetic variants, at greater speeds, rapidly increases as a consequence

of technological advances. Genetic variants have already been identified, especially

amongst the Cytochrome P450 genes, which have been shown to influence drug

metabolism. However since sub-Saharan African populations have generally been

underrepresented in pharmacogenetic studies compared to European and Asian

populations, there is likely to be substantial ascertainment bias in the literature in the

reporting of functional variants. Given that the rest of the world is believed to possess

only a subset of the genetic diversity present in Africa, it is likely that a large number of

potentially important variants that affect drug metabolism have yet to be uncovered.

Again these variants may well be restricted to particular regions or groups of people.

The ultimate aim of pharmacogenetics is to define individual drug administration profiles

that remove the risk of an adverse drug reaction due to an individual's genetic makeup

and maximise efficacy. In sub-Saharan Africa, at least, this is unlikely to become reality

in the foreseeable future due to a lack of appropriate infrastructure and to economic

constraints. However, population genetic theory presents the opportunity that it may be

possible to increase the probability of providing the most appropriate therapeutic

intervention by basing the decision on an understanding of the genetic characteristics of

the larger group to which a particular individual belongs. For example if it is shown that

in a particular region genetic identity is more closely tied to ethnic identity than to

geographic location, it may be possible to greatly reduce inappropriate pharmaceutical

intervention by administering drugs suited to the pharmacogenetic profile of each ethnic

group. Although it may be preferable to construct pharmacogenetic profiles for each

ethnic group it may be possible, by understanding the pattern of distribution of genetic

variation in a region, to infer the likely efficacy and reduce genetically determined

adverse events on a group by extrapolating from data obtained from genetic studies of

other groups. For example it may have been shown that the frequency of particular

variants alters in a clinal fashion across a region.

There is currently a paucity of information on the distribution of human genetic variation

in sub-Saharan Africa, especially at a fine geographic scale. There is therefore a great

need for studies that address this, which may lead to important insights into a) the

histories of populations and b) the cultural evolution of human society at a fine scale as

reflected in the relationship of genetic and linguistic patterns. A further important benefit

of studying the distribution of human genetic diversity is in improving disease

management by providing insights that may assist in more appropriate pharmaceutical

intervention.

In this thesis each of the above three aspects is addressed in a case-study format,

presenting three independent studies that illustrate the potential utility of investigating the

distribution of human genetic variation in sub-Saharan Africa.

The three studies are briefly explained below.

Sex-Specific Genetic Data Support One Of Two Alternative Versions Of The Foundation

Of The Ruling Dynasty Of The Nso´ In Cameroon

In this study sex-specific genetic data are used to shed new light on the origins of the

Nsoʹ, a prominent ethnic group in the Grassfields of Cameroon. The alternative oral

histories of their ethnogenesis have been the subject of fierce debate among historians and

anthropologists. These oral histories have been well documented (Mzeka 1978; Mzeka

1990) and the group have a unique social class system that enabled the formulation of

hypotheses that could be evaluated by analysing genetic data.

It All Depends On The Scale: Little Sex-Specific Genetic Variation In The Presence Of

Substantial Language Variation In Peoples Of The Cross River Region Of Nigeria

Assessed Within The Wider Context Of West Central Africa.

In this study sex-specific genetic data from a very large and sociologically well

characterised dataset are used to examine genetic variation among peoples speaking

different languages in the relatively small Cross River region of south-east Nigeria. Many

of the languages have been studied in detail by linguists who have shown them to

have varying degrees of divergence from each other, ranging from hundreds to thousands

of years. The analysis is undertaken within the context of the wider geographic area of

West Central Africa by extending it to include groups from Cameroon and Ghana.

The potentially deleterious functional variant FMO2*1 is at high frequency throughout

sub-Saharan Africa

This study examines the distribution across Africa of variation in the gene that encodes

the drug metabolising enzyme Flavin-containing Monooxygenase 2 (FMO2). The effect

of FMO2 protein expression has previously been largely ignored because, due to a

premature stop codon, it has been non-functional in all European and Asian individuals

examined to date. Expression of the enzyme may be important in the efficacy and safety

of drugs used to treat tuberculosis in Africa. The frequency of the putative functional

allele in sub-Saharan Africans is assessed and is found to be high across the region.

Evidence for selection and the possible date of origin of the mutation are also assessed, in

some instances using software developed in this thesis. The study illustrates the utility of

undertaking frequency surveys in DNA collections of known provenance with a view to

improving healthcare.

Before presenting these studies some background information on sub-Saharan Africa is

described with particular emphasis on previous work on human genetic variation in the

region that will allow the reader to better place in context the findings of the three case

studies.

1.2. Notable Geographical Features Of Sub-Saharan Africa

As the name suggests the area described as sub-Saharan Africa in this thesis encompasses

all of the African continent that lies south of the present-day Sahara Desert, which itself

covers most of northern Africa. The sub-Saharan African mainland consists of a total of

42 politically defined countries (see Figure 1.1 for a political map of Africa). As well as

the Sahara there are two other major desert areas, the Kalahari Desert which is semi-arid

and covers much of Botswana and parts of Namibia (though the surrounding basin

extends to Angola, Zambia, Zimbabwe and South Africa) and the more arid and hostile

Namib Desert that extends from Namibia up to southwest Angola. Marking the southern

border of the Sahara is the Sahel, a strip of semi-arid grassland (see Figure 1.2 for a

physical geography map) that stretches from the Horn of Africa to the Atlantic Ocean and

traverses the gap between the arid desert to the north and tropical wetland to the south.

For obvious reasons none of the regions are particularly densely populated.

The Congo Basin contains the world's second largest rainforest and lies in the heart of

sub-Saharan Africa. Though mostly covering the Democratic Republic of Congo (formerly

Zaire), it also extends to the north into parts of the Central African Republic and Sudan,

to the west into parts of the Republic of Congo and to the south into parts of Angola and

Zambia. The higher relief of the Great Lakes forms a boundary to the east. The Great

Lakes region comprises many east African countries and is a rather flexible term. The

Great Lakes are mainly Lake Victoria, Lake Tanganyika, Lake Malawi, Lake

Turkana, Lake Albert and Lake Kivu. Lake Victoria is the largest of all and, along with

Lake Albert and Lake Edward, forms connections with the White Nile.

The White Nile flows north from the Great Lakes region until it reaches Khartoum in

central Sudan, where it joins the Blue Nile, which has its origins in Lake Tana in Ethiopia.

From there the Nile continues north through mostly desert until it reaches the

Mediterranean. There are four other major rivers: the Congo River, which passes through

the Congo Basin towards the Atlantic; the Zambezi River, which flows from Zambia to

the Indian Ocean; the Niger River, flowing from Guinea to the Atlantic Ocean through

Nigeria via Mali; and the Orange River, which flows from eastern South Africa to the

Atlantic through Namibia.

The Chad Basin in north central Africa contains Lake Chad, which spans northeast

Nigeria, northern Cameroon, Chad and Niger. It is a particularly shallow lake that

fluctuates constantly in depth and has shrunk considerably over time.

Major highlands in sub-Saharan Africa include the high relief provided by the Great

Lakes region, the Ethiopian Highlands and the Great Escarpment in South Africa.

Another distinctive topographical feature of sub-Saharan Africa is the Rift Valley that

runs from Syria in the Middle East to Mozambique. It is formed as a result of the meeting

of tectonic plates and it contains and is responsible for part of the Great Lakes region and

the high surrounding relief.

1.3. The Languages Of Sub-Saharan Africa

1.3.1. Important Methods and Concepts of Historical Linguistics

The languages different groups of people speak play a prominent role in this thesis,

especially in Chapter 3, with an individual's cultural identity closely linked to their

linguistic affiliation. By studying the distribution and structure of these languages it is

possible to gain useful insights into the past movements of people. Before describing how

this type of analysis (part of the wider field of 'historical linguistics') relates to

sub-Saharan African languages, some important concepts such as language classification,

comparison and dating are introduced.

Figure 1.1: A political map of Africa (from the Perry-Castañeda map collection).

Figure 1.2: A physical geography map of Africa (from the Perry-Castañeda map collection).

Languages are often grouped into 'language families' based on evidence of shared

characteristics such as phonology, morphology, lexicon and syntax. This is based on the

concept that languages gradually change and diverge over time, resulting in new

languages. Therefore a language family is a group of languages that can be related by

descent (back in time) to a common ancestor. This common ancestor (which will no

longer exist) is referred to as a proto-language (e.g. proto-Bantu). Usually the proto-

language is not known because of a lack of written records but it can be reconstructed by

comparing two or more languages in the proposed family. This generally involves the

assembly of sets of words from each language that are likely to share a common origin

(i.e. they have the same definition), also called cognates, and inference, based on general

sound change laws, of the most plausible changes that are likely to have occurred in these

words for the languages compared to have diverged to their present status. The process

requires considerable linguistic expertise and is somewhat subjective, especially with

regard to the reconstruction of the proto-language. Therefore two linguists may assemble

very different reconstructions, though the confirmation that languages are of the same

family because of common descent is generally more objective.

By performing this analysis on multiple languages in a family it is possible to determine

the hierarchical relationship for how the languages have diverged, allowing the

construction of a phylolinguistic tree (though in reality language divergence is gradual,

not instantaneous as implied by a tree). Given this 'genetic' ordering of languages it is

possible to further subdivide the family into smaller phylolinguistic units. However there

is no hard and fast rule for these subdivisions, as reflected by the various terms

that can be used (subfamilies, branches, sections, groups, subgroups), for which there is no

consensus on usage (Blench 2006). These subdivisions are generally applied on a

case-by-case basis; for example when discussing the Niger-Congo family one might make

reference to the Benue-Congo subfamily of languages in one circumstance and the Cross

River subfamily in another.

Some language families themselves are considered language phyla. A language phylum

would be described as any language family for which the external affiliation cannot be

determined. For example it is not clear how Niger-Congo languages are related to other

language families (such as Afro-Asiatic) back in time with regard to a common ancestor.

Therefore this language family would be considered a language phylum (Blench 2006).

Analogous to the method used to classify language phyla, some languages are considered

isolates as they do not demonstrate any descent from a common ancestor with any other

known existing language. These include languages such as Basque of Spain/France

(Trask 1997) and Hadza of Tanzania (Sands 1998) and they are sometimes considered

language families containing only one language.

Another method to elucidate the hierarchical structure of a language family but that does

not require the reconstruction of a proto-language is lexicostatistics. This uses a list,

called a Swadesh list (named after Morris Swadesh, who developed the technique), of

around 100 concepts likely to be found in all human languages such as 'sleep' or 'nose'

and that are free of cultural meaning (Swadesh 1955). Analysis is performed in pairwise

language comparisons so that each slot on the Swadesh list is filled by a word from each

language. The percentage of concepts that have pairs of words (one from each language

analysed) that are cognates is then determined (again a decision made by a linguist),

giving a metric for how similar a pair of languages are (a lexicostatistic). If this analysis

is performed for multiple languages it is possible to build up a phylolinguistic tree, with

the highest lexicostatistic being found between the two languages that split most recently.

This analysis is affected by issues such as (like the comparative method described for the

reconstruction of proto-languages) bias from word borrowing and the imposition of

specific cultural meanings for specific words. Some researchers have questioned whether

a universal list can ever really be applied (Gudschinsky 1956).
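The pairwise calculation described above can be illustrated with a minimal Python sketch. It assumes that a linguist has already assigned a cognate-class label to each word, so that two languages share a label for a concept only when their words are judged cognate. The language names, concepts and labels below are hypothetical and are not taken from any dataset used in this thesis.

```python
from itertools import combinations

# Hypothetical cognate judgements: within one concept, languages carrying the
# same label are judged (by a linguist) to have cognate words for that concept.
COGNATE_CLASSES = {
    "lang_A": {"sleep": 1, "nose": 1, "water": 1, "eye": 1, "two": 1},
    "lang_B": {"sleep": 1, "nose": 1, "water": 1, "eye": 2, "two": 1},
    "lang_C": {"sleep": 2, "nose": 1, "water": 3, "eye": 3, "two": 2},
}


def lexicostatistic(a, b):
    """Percentage of shared concepts whose words are judged cognate."""
    shared = [concept for concept in a if concept in b]
    if not shared:
        raise ValueError("the two word lists have no concepts in common")
    matches = sum(1 for concept in shared if a[concept] == b[concept])
    return 100.0 * matches / len(shared)


def pairwise_lexicostatistics(classes):
    """Lexicostatistic for every unordered pair of languages."""
    return {
        (x, y): lexicostatistic(classes[x], classes[y])
        for x, y in combinations(sorted(classes), 2)
    }


scores = pairwise_lexicostatistics(COGNATE_CLASSES)
# The pair with the highest score is taken to have split most recently.
closest_pair = max(scores, key=scores.get)
```

A full phylolinguistic tree would require a clustering step on these pairwise values, which is omitted here; the sketch only shows how the metric itself is obtained from the cognate judgements.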

The lexicostatistic analysis described above can be extended to estimate the time when

languages diverged (Swadesh 1952). This methodology, called Glottochronology, uses

languages with known divergence dates and their associated lexicostatistics to calibrate a

linguistic clock that can be applied to the lexicostatistics of pairs of languages with

unknown divergence dates. Glottochronology remains a very controversial field because

of its various assumptions (see Renfrew, McMahon & Trask 2000).
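As a rough illustration of the calibration idea, the sketch below uses the classic glottochronological relation t = ln(c) / (2 ln(r)), where c is the fraction of cognates shared by two languages, r is the fraction of the word list retained per millennium, and t is the time since divergence in millennia. The formula is not stated in the text above and is given here as an assumption about the form of the clock. The function names and all numerical values are hypothetical, and the sketch inherits the constant-rate assumption that makes the method controversial.

```python
import math


def retention_rate(cognate_fraction, millennia_since_split):
    """Retention rate per millennium implied by one calibration pair."""
    if not 0.0 < cognate_fraction <= 1.0 or millennia_since_split <= 0.0:
        raise ValueError("need 0 < cognate_fraction <= 1 and a positive time")
    return math.exp(math.log(cognate_fraction) / (2.0 * millennia_since_split))


def calibrate(pairs):
    """Average retention rate over (cognate_fraction, millennia) pairs."""
    rates = [retention_rate(c, t) for c, t in pairs]
    return sum(rates) / len(rates)


def divergence_time(cognate_fraction, rate):
    """Estimated millennia since divergence: t = ln(c) / (2 ln(r))."""
    if not 0.0 < cognate_fraction <= 1.0 or not 0.0 < rate < 1.0:
        raise ValueError("need 0 < cognate_fraction <= 1 and 0 < rate < 1")
    return math.log(cognate_fraction) / (2.0 * math.log(rate))


# Hypothetical calibration pairs with known divergence dates (in millennia).
known_pairs = [(0.64, 1.0), (0.4096, 2.0)]
rate = calibrate(known_pairs)

# Apply the calibrated clock to a pair with an unknown divergence date.
estimate = divergence_time(0.512, rate)
```

The factor of two in the denominator reflects the assumption that both languages have been losing vocabulary independently since the split.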

The above types of analysis rely heavily on languages branching off and diverging in a

tree-like manner. However, in reality the phenomenon of 'language shift/replacement' is a

major force in the evolution of culture and will severely skew any analysis such as

lexicostatistic calculations. Language shift is the process where a community shifts from

a native language to another language for some cultural reason. This is generally a

gradual process (though the actual speed depends on the circumstances involved) and

initially the community will become bilingual before the native language is dropped. In

Africa this can be seen directly with populations learning the lingua franca of the

different regions. For example Hausa has replaced many minority languages in Nigeria

(as part of a process called Hausaisation) while small Chadic speaking communities have

adopted Arabic, with the spread of Islam being a major factor (Blench 2006).

1.3.2. The Distribution of sub-Saharan African Languages

While it is not possible to detail all the linguistic relationships found in sub-Saharan

Africa because of the sheer volume of very specialised information collected over the

years, some features that may be useful to readers of this thesis are briefly described

based on Roger Blench's book Archaeology, Language, and the African Past (2006),

which reviews the linguistic and archaeological data on Africa currently available. The

purpose of the section is to act as guide for those unaware of the histories of some of the

regions and peoples discussed later in the thesis. It should in no way be taken as a

comprehensive review of the material as the conclusions drawn for the different

migrations and histories described are of varying confidence, the details of which are

discussed in Blench (2006).

Different points of interest have simply been summarised to create as full a picture of the

linguistic history of sub-Saharan Africa as possible for the reader. Where indicated by

Blench (2006) suitable secondary sources have also been included from which the

different conclusions are drawn.

Sub-Saharan African languages can be divided into four broad language phyla:

Niger-Congo, Afro-Asiatic, Nilo-Saharan and Khoisan (Greenberg 1963) (see Figure 1.3 for a

linguistic map of Africa). In addition there are languages that do not fit into these

traditional groupings such as Jalaa in Nigeria, Ongota in Ethiopia and Kwadi in Angola.

There are relatively few of these isolates in comparison to other regions in the world,

which is possibly a result of recent large-scale population movements. Finally

Austronesian languages, which are typically spoken outside the continent, are found

throughout the island of Madagascar, brought there by mariners from southeast Asia

around 2000 years ago during the great Austronesian expansions, probably from southern

Borneo (Adelaar 1995).

Figure 1.3: A simplified linguistic map of Africa according to the

classification of Greenberg (1963) (a vectorisation by Mark Dingemanse).

Niger-Congo languages are by far the most widespread group of languages found in sub-

Saharan Africa, with speakers found in Senegal as well as the tips of South Africa, with

different languages apparently continuously connected, which suggests some form of

population expansion. A peculiar exception to this general Niger-Congo connectivity are

groups in the Nuba Hills in Sudan that speak Kordofanian, a Niger-Congo language, yet

are completely surrounded by Nilo-Saharan speakers. There are 1,514 Niger-Congo

languages according to the latest Ethnologue build (2005). Bantu languages in particular

are numerous and geographically widespread but make up only a small part of the

diversity of the family. It is likely that the proto-Niger-Congo language arose in West

Africa earlier than 7000 years before present. The proto-Niger-Congo speakers would

have overwhelmed and/or assimilated hunter-gatherer groups and expanded due to some

technological advantage, which may have been the development of agriculture leading to

a possible crop domestication centre (Renfrew 1992; Ehret 2002), though the evidence for

this is not very concrete (Neumann, Kalheber & Uebel 1998; Ehret 2002), or the use of

more effective hunting technologies such as the bow and arrow.

Bantu speakers, a branch of Benue-Congo (the largest branch of the Niger-Congo family

with regard to the number of languages, speakers and geographic area covered), appear to

have expanded to the east 4000 years ago from a northwest Cameroonian origin

(Greenberg 1955) and spread through much of sub-Saharan Africa. This movement is

commonly termed the 'Bantu Expansion' but it is more appropriate to refer to it as the

'expansion of the Bantu-speaking peoples'. The key to this expansion was almost

certainly an agropastoral lifestyle (Vansina 1990) while the development of iron

smelting may also have helped in later phases. The expansion to the south of sub-Saharan

Africa was probably split into two streams of people. One stayed on the west side of the

continent and moved through the rainforest (banana domestication may have been

important for this) to the southeast where they encountered Khoisan speakers in the

Kalahari desert (Denbow 1986; Denbow 1990). The other stream first moved east along

the north fringe of the forest to the Great Lakes. From here they further expanded to the

south of the continent along the east coast (Huffman 1998).

There are currently 400 Afro-Asiatic languages. Most are found in North Africa and the

Middle East. However a diverse range of Chadic languages are found in the Chad Basin

and Semitic, Omotic and Cushitic languages are spoken in East Africa. Afro-Asiatic

probably arose in southwest Ethiopia 9-10 thousand years ago, probably in the same

location where Omotic languages are spoken today (Bender 1997; Blench 1999a). There

does not appear to be any evidence that Omotic speakers have moved anywhere else since

this origin. It is likely that the Omotic and Cushitic proto-languages diverged when Cushitic

speakers took up animal domestication while Omotic speakers retained their hunter-

gatherer lifestyle.

Cushitic speakers probably subsequently spread rapidly in many directions including

towards Lake Chad 4-5 thousand years ago (Blench 1999b), establishing Chadic speakers

in the region as well as heading as far south as Zambia. Another group may have headed

north and would later become Berber, Egyptian and Semitic speakers. The proto-Semitic

speakers arrived in the Near East but then turned back in a westward direction towards

Ethiopia where they probably displaced Omotic and Cushitic speakers. The Chadic

linguistic diversity is probably a result of admixture with many other settled populations.

Nilo-Saharan has only relatively recently been defined as a language group and its

existence is still contested by some researchers. It appears to be very geographically

fragmented and it is very difficult to ascribe the different languages in the phylum a

simple chronological order of origin and dispersal. However Blench has attempted to do

this in his book using glottochronological estimates of dates based on the internal

diversity of subgroups correlated with archaeological/climatological evidence and a

proposed order of dispersal of major Nilo-Saharan language subgroups is described

below. However he warns that the dates “need to be treated with great caution”.

It is likely that the Nilo-Saharan heartland was somewhere in south Ethiopia. The Nilo-

Saharan speakers then spread westward approximately 18,000 years ago towards the

confluence of the Blue and White Nile rivers. They were still a small group of probably

hunter-gatherers at least 12,000 years ago and at that point some spread north along the

Nile while others migrated further westward towards Lake Chad in what may have been a

popular corridor for migration. At around 9000 years before present, the Nilo-Saharan

speakers in Lake Chad started to occupy parts of the Sahara desert (to become today's

Saharan speakers) while 6,000-8,000 years before present another group headed towards

the Niger river, the Nilo-Saharan languages now having travelled across most of the

width of the continent. Other movements occurred but, like much of the above, it is not

easy to work out how they got to where they eventually settled as the language group is

not particularly cohesive. Nubian speakers may have been a result of a back migration of

Saharan speakers from the west towards the Nile where they displaced settled Afro-

Asiatic speakers around 2,500 years ago. There has also been some debate regarding the

close relationship between Niger-Congo and Nilo-Saharan and whether they should be

placed within a macrophylum (Gregersen 1972), with Niger-Congo possibly being a

lower-level branch of Nilo-Saharan (Blench 1995).

Around 30 Khoisan or 'click' languages are spoken at present (Güldemann & Voßen

2000) in small scattered communities across southwestern Africa though groups are also

found in Angola and Tanzania (the Hadza). They are a very diverse phylum though their

'click' attributes keep them linguistically grouped together as 'click' languages are not

found elsewhere in the world. Therefore due to their common origin it has been suggested

that Khoisan speakers once extended up to Somalia though expanding Cushitic and

particularly Bantu speakers have substantially reduced their range.

Pygmies are usually by definition of short stature (eastern pygmies are smaller than

western pygmies, possibly because of less Bantu farmer introgression). Generally they

have adopted the languages of their agricultural neighbours. For example the Aka and

Biaka western pygmies speak Bantu and Adamawa respectively. However they appear to

share some common words that hint at a lost pygmy language (Bahuchet 1992; Bahuchet

1993). It has been suggested that the Pygmies were once a single group of hunter-

gatherers who have their origins in the tropical rainforest. One suggestion for their current

distribution is that reduction in the size of the rainforest around 10,000 years ago led to

separation and isolation of multiple pygmy groups. As the rainforest expanded again the

pygmy groups started to disperse at which point they encountered Bantu farmers

alongside or amongst whom they settled. A recent study by Migliano et al. (2007) suggests that

the development of characteristic pygmy height is not, as previously thought, a direct

adaptation to a specific environment such as ease of movement through dense tropical

forests (Cavalli-Sforza 1986) but is instead a by-product of the need to reproduce

relatively early in life due to particular ecological conditions causing high early mortality

rates.

1.4. Previous Work On The Distribution Of Genetic Variation In Sub-

Saharan Africa

1.4.1. Classical Markers

The first studies on human genetic variation did not directly involve DNA but rather were

based on detecting and assessing variation using so-called 'classical markers'. Types of

classical markers include the different variants found in the blood group systems (the

most basic being the A, B and O blood groups), the different forms of particular proteins

found in blood, liver and muscle such as the haemoglobins and the many different Human

Leukocyte Antigen (HLA) isoforms. The substantial use of classical markers through the

1970s and 80s for evolutionary studies was a result of large collections being made for

unrelated reasons such as tissue and blood matching coupled with the fact that

detecting these alleles was considerably easier than examining DNA at the molecular

level (techniques for DNA typing were only introduced in the mid-1980s). With

technological advances classical markers were superseded and direct analysis of DNA

undertaken. However vast amounts of data have been assembled using these classical

markers that have been very useful in assessing population history. Much of the classical

marker data up to 1986 were collated and analysed in Luigi Cavalli-Sforza and

colleagues' seminal work The History and Geography of Human Genes (1994). Though a

number of issues with regard to the methodologies used were raised after publication

many of the conclusions drawn from the study underpin present day thinking in genetic

history. The contents and timing of the book mark the period of transition from the use

of classical to molecular DNA markers, as witnessed by the limited discussion of the

latter. It did however include the well known mitochondrial DNA (mtDNA) study of

Cann et al. (1987). Though analysis in this thesis exclusively utilises molecular markers,

given the historical importance of the classical markers a brief summary of this earlier

work is provided to the extent that it relates to sub-Saharan Africa.

In The History and Geography of Human Genes sub-Saharan African samples were

analysed within two main frameworks, the first examining sub-Saharan populations as

well as populations found elsewhere in the world, the second examining sub-Saharan

populations as well as other African populations. Within the first framework 42

populations were assembled from the years 1978 to 1986 with data defining 120 classic

marker alleles. Six of these populations were considered sub-Saharan African: Bantu;

East African Ethiopians; Nilo-Saharan; West African; !Kung San from Botswana and

Namibia; Mbuti pygmies from Zaire. It should be noted that the 42 populations were

assembled by pooling data based on geographic, ethnic and linguistic affinities to gain a

suitable balance between available genetic data and the number of populations analysed.

The authors claim that the adverse effect of possible heterogeneity introduced by pooling

data should be minimal, while reducing the effect of noise caused by genetic drift in

particularly small populations.

Unweighted Pair Group Method with Arithmetic mean (UPGMA) trees based on both

Nei's genetic identity (Nei 1987) (Figure 1.4) and Reynolds' FST (Reynolds, Weir &

Cockerham 1983) measures clearly showed clustering of the six sub-Saharan

African populations and their separation from non-sub-Saharan African populations, including the Berber

population from North Africa. The separation of sub-Saharan African from non-sub-

Saharan African populations was strongly maintained when bootstrapping (a method that

creates a number of resampled sets of the observed dataset using random sampling with

replacement) the data used to construct the UPGMA trees (though to a substantially lesser

extent for the East African and San populations). A similar pattern was observed in

Principal Component Analysis (PCA) Maps (the first principal component also grouped

Europeans with sub-Saharan Africans but this was thought to be a result of the

comparatively low number of sub-Saharan African samples utilised rather than any

genetic similarity between the two regions). Pooling populations in seven distinct regions

further reinforced this strong African (in reality sub-Saharan African) and non-African

split. A crude date for the split of African and non-African populations based on these

data and calibrated using archaeological findings was estimated at around 100,000 years

before present.
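The tree-building and resampling steps described above can be sketched in a few lines of Python. This is only a minimal illustration and not the procedure of Cavalli-Sforza et al. (1994): the allele frequencies in `POPS` are invented, the population labels are placeholders, Nei's standard identity is used rather than the modified measure cited in Figure 1.4, and the bootstrap is assumed to resample loci with replacement.

```python
import math
import random

# Hypothetical allele frequencies: population -> list of loci,
# each locus being a list of allele frequencies that sum to 1.
POPS = {
    "A": [[0.90, 0.10], [0.80, 0.20]],
    "B": [[0.85, 0.15], [0.75, 0.25]],
    "C": [[0.20, 0.80], [0.30, 0.70]],
}

def nei_distance(fx, fy):
    """Nei's standard distance D = -ln(I), with I = Jxy / sqrt(Jx * Jy)."""
    jxy = sum(px * py for lx, ly in zip(fx, fy) for px, py in zip(lx, ly))
    jx = sum(p * p for lx in fx for p in lx)
    jy = sum(p * p for ly in fy for p in ly)
    return -math.log(jxy / math.sqrt(jx * jy))

def upgma(names, dmat):
    """Average-linkage (UPGMA) clustering; returns the tree as nested tuples."""
    clusters = {i: (names[i], 1) for i in range(len(names))}  # id -> (tree, size)
    d = {(i, j): dmat[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(d, key=d.get)  # closest pair of clusters
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        merged = {}
        for k in clusters:  # size-weighted mean distance to the new cluster
            dik = d[(min(i, k), max(i, k))]
            djk = d[(min(j, k), max(j, k))]
            merged[(k, next_id)] = (ni * dik + nj * djk) / (ni + nj)
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(merged)
        clusters[next_id] = ((ti, tj), ni + nj)
        next_id += 1
    return next(iter(clusters.values()))[0]

def build_tree(pops):
    names = list(pops)
    dmat = [[nei_distance(pops[a], pops[b]) for b in names] for a in names]
    return upgma(names, dmat)

def leaves(tree):
    return {tree} if isinstance(tree, str) else leaves(tree[0]) | leaves(tree[1])

def has_clade(tree, members):
    """True if some internal node of the tree contains exactly `members`."""
    if isinstance(tree, str):
        return False
    return (leaves(tree) == set(members)
            or has_clade(tree[0], members) or has_clade(tree[1], members))

def bootstrap_support(pops, clade, n_reps=200, seed=0):
    """Fraction of bootstrap trees (loci resampled with replacement) showing `clade`."""
    rng = random.Random(seed)
    n_loci = len(next(iter(pops.values())))
    hits = 0
    for _ in range(n_reps):
        idx = [rng.randrange(n_loci) for _ in range(n_loci)]
        resampled = {name: [loci[i] for i in idx] for name, loci in pops.items()}
        hits += has_clade(build_tree(resampled), clade)
    return hits / n_reps

tree = build_tree(POPS)
support = bootstrap_support(POPS, {"A", "B"})
```

A clade that appears in nearly all resampled trees is considered well supported, whereas one that appears only occasionally depends on a small number of loci.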

According to the authors the Nei's I-based tree and the linguistic tree showed a striking

correlation, including for four of the six sub-Saharan African populations. The Mbuti

Pygmies could not be analysed as it was believed that they had lost their original

language and adopted that of neighbouring farmers while the Ethiopians were on very

different branches from the Berbers, though both were classified as Afro-Asiatic speaking

groups. It is difficult to assess the significance of the lack of Ethiopian-Berber linguistic

correlation given the cultural and linguistic heterogeneity found in Ethiopia. In addition

some correlation was observed between genetic and geographic distance of populations

though to a much lesser extent than that seen in Asia.

The authors found this analysis to correspond with an origin for modern humans from

Africa some 100,000 years ago and it should be noted that though they do not rule out

limited admixture with archaic Homo sapiens they do not even consider, let alone

formally test against, the multi-regional hypothesis of the origin of modern humans.

Figure 1.4: Average linkage tree for 42 populations. The abscissa shows the

genetic distances (modified Nei) calculated on the basis of 120 allele

frequencies taken from 42 classic genetic marker systems. Taken directly

from Cavalli-Sforza et al. (1994).

When examining sub-Saharan African populations within an African framework 49

distinct African populations were assembled, again pooling data in a number of different

ways (ethnicity, language and geography). The average number of genes per population

was 47.6 but the actual average number of genes analysed in pairwise comparisons was

only 28.6. A UPGMA tree of these populations showed two distinct clusters, one with

only sub-Saharan African populations and a second with North African and East African

populations. This second cluster can be separated into North and Eastern populations

except that the Algerian population falls within the Eastern set while the Somali

population is found amongst sub-Saharan Africans rather than amongst the other Eastern

populations.

However, the majority of evolutionarily interesting groupings fall within the macro-sub-

Saharan African cluster where there are two major sub-clusters, the first a Central-

Southern cluster consisting of Bantu and Nilo-Saharan speaking populations and the

second consisting of West African populations. There are also four distinct outlying

groups of populations not found in either of these two sub-clusters. These include a group

of Khoisan populations, the Mbuti Pygmies and a group of Senegalese populations.

Overall the genetic tree appeared to display a fair level of congruence with both language

and geography. Of the main sub-Saharan African cluster a PCA plot showed the Bantu-

speaking populations to be particularly homogenous in comparison to the West African

and Nilo-Saharan speakers. The West African populations were more heterogeneous than

the Nilo-Saharan speakers and more dissimilar to the Bantu-speaking populations. In

regard to the Bantu speakers, neighbouring populations were more similar to each other,

perhaps reflecting the eastern and western streams of the expansion of the Bantu-speaking

peoples, but there was still a remarkable similarity between the north-western and south-

eastern Bantu-speaking populations. The Nilo-Saharan speakers' similarity to the Bantu

speakers should be interpreted with caution as the group was not particularly well defined

with no major representatives of the groups present in the study. The authors expressed a

view that the West African populations can be separated into three main clusters, one

Senegalese cluster, one a collection of populations lying between Senegal and Nigeria

and one consisting of four ethnically diverse Nigerian groups. The West African

heterogeneity was explained as being due to the region having had a greater time for

populations to differentiate in comparison to Bantu-speaking populations, possibly as a

consequence of a much older agricultural expansion, given that Bantu-speaking

populations have essentially had a maximum of around four thousand years (the start date

of the expansion) in which to differentiate.

The PCA plot also shows the East African populations lying between the sub-Saharan and

Northern African populations. The difference between the Northern and sub-Saharan

populations was expected given the barrier to gene flow the Sahara represents and the

differing physical features of people from the two regions. The intermediate status of the

East African populations is likely to have been a result of admixture between sub-Saharan

Africans in the south and Northern Africans and Middle Eastern individuals from the

north. Admixture percentages were estimated (using standard methods that calculate m of

a target population from two proposed parent populations based on relative gene

frequencies in the three populations (see page 55 of Cavalli-Sforza, Menozzi & Piazza

1994)) as approximately 60% of the former and 40% of the latter, though the exact level

of admixture is likely to vary depending on the location of particular groups, with

Ethiopians likely to be especially heterogeneous due to cultural and language differences

found within the country.
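The kind of gene-frequency admixture estimate referred to above can be illustrated as follows. The sketch assumes the simple linear model in which each allele frequency in the admixed population is p_h = m·p_1 + (1 − m)·p_2, and combines alleles by least squares. It is not claimed to be the exact method on page 55 of Cavalli-Sforza, Menozzi & Piazza (1994), and the function name and all frequencies shown are hypothetical.

```python
def admixture_proportion(p_hybrid, p_parent1, p_parent2):
    """Least-squares estimate of m, the contribution of parent 1.

    Each argument is a list of allele frequencies for the same alleles,
    in the same order, in the three populations.
    """
    num = sum((h - p2) * (p1 - p2)
              for h, p1, p2 in zip(p_hybrid, p_parent1, p_parent2))
    den = sum((p1 - p2) ** 2 for p1, p2 in zip(p_parent1, p_parent2))
    if den == 0:
        raise ValueError("parental populations have identical allele frequencies")
    return num / den

# Invented frequencies for three alleles (not taken from any real dataset).
parent1 = [0.80, 0.30, 0.50]
parent2 = [0.20, 0.70, 0.10]
hybrid = [0.56, 0.46, 0.34]

m = admixture_proportion(hybrid, parent1, parent2)
```

Alleles whose frequencies differ most between the two parental populations carry the most weight in this estimate, which is why markers with similar parental frequencies contribute little information about m.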

In regard to the outliers observed in the tree, the most intriguing were the positions of the click-

speaking Khoisan and the Pygmies. The Khoi, who practise farming, appear to

have been involved in substantial gene flow with southeast Bantu speakers. Admixture

for the Bantu element in the Khoi was estimated at approximately 60%. Conversely San

admixture with Bantu-speaking groups had been estimated for example in the Xhosa at

around 60%. These admixture events may have been paralleled by elements of the click

languages moving into some Bantu languages. On the other hand the San, who still retain

a hunter-gatherer lifestyle, show a loose similarity with East Africans and Asians. The

data presented by Cavalli-Sforza et al. (1994) therefore suggest that, despite the proposal

from some anthropologists that Khoisan are relics of the earliest human groups, they are

in fact the result of an early mixture between native Africans and Asians in a similar

manner to Ethiopians, though the two events occurred at distinct periods of time.

Three pygmy groups were assessed in this study using classical markers, of which the

smallest group, the Mbuti from the Democratic Republic of Congo (formerly Zaire), are

the most genetically distinct, from both the other pygmy groups and from other sub-

Saharan Africans, hence their outlier position in both worldwide and African-specific

trees. Though it should be interpreted with caution the authors suggest a separation time

from sub-Saharan Africans of 18,000 years based on the FST estimated between the two

groups, with genetic distance calibrated with archaeological separation times. In addition

the Mbuti appear to show no obvious similarity to San or East Africans. The Biaka appear

to have experienced considerable admixture with their neighbouring Bantu-speaking

farmers while the third heterogeneous 'pygmoid' group are almost indistinguishable from

other sub-Saharan Africans both genetically as well as phenotypically in terms of height

and culture. The work on Africa in this study finishes off with a particularly interesting

statement, commenting that “differences between most-sub-Saharan Africans other than

Khoisan and Pygmies seem rather small”.
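The idea of calibrating a genetic distance against an archaeologically dated separation, as used for the Mbuti estimate above, can be sketched as a simple proportional scaling. This assumes that genetic distance accumulates linearly with time since separation, which is a strong simplification. The function name and every number below are invented for illustration and are not values from Cavalli-Sforza et al. (1994).

```python
def calibrate_time(distance, ref_distance, ref_time_years):
    """Convert a genetic distance to years by proportional scaling.

    Assumes distance grows linearly with separation time, so that a
    reference pair of populations with a known (e.g. archaeological)
    separation time fixes the rate.
    """
    if ref_distance <= 0:
        raise ValueError("reference distance must be positive")
    return ref_time_years * distance / ref_distance

# Hypothetical example: a reference split dated to 100,000 years with
# distance 0.20, and a second pair of populations with distance 0.05.
estimated_years = calibrate_time(0.05, ref_distance=0.20, ref_time_years=100_000)
```

Because drift in small populations inflates genetic distance independently of time, estimates obtained this way are best read as rough orders of magnitude.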

Despite now being regarded as somewhat of a classic, the book was met with criticism in

regard to some of the analyses performed and its treatment of linguistic data. O'Grady et

al. (1989) in particular raised various issues on the choices of pooling samples and on the

linguistic tree used. Overall, they found the fit of the genetic with linguistic data less than

'remarkable'. Cavalli-Sforza and colleagues did respond to this criticism but serious

scepticism of choices of linguistic relationships and population affiliations, particularly in

Africa, that severely undermine the analysis still remains amongst anthropologists and

linguists (MacEachern 2000), with the criticism that many of the conclusions drawn from

these data are of an ad hoc nature.

Nei and Ota (1991) also criticised the use of UPGMA trees and an assumption of a

constant rate of evolution (which is not likely because of a) previous population

bottlenecks and b) some loci used may have been under selection, especially with regard to

malaria). However, while using a Neighbour-Joining (NJ) tree that allowed for

evolutionary rate heterogeneity did considerably alter much of the structure (though using

a new set of classical markers (Nei & Roychoudhury 1993)) the main African and non-

African split that supported the out-of-Africa hypothesis was still prominent and the San

were again an outlier from other sub-Saharan Africans.

While the many positive aspects of this study using classical markers have underpinned

later research using molecular data, including additional evidence of the common origin

of modern humans in sub-Saharan Africa (the book is still widely referenced), other

negative aspects of this study also persist, notably in the form of insufficiently rigorous

sampling strategies.

1.4.2. Molecular Data

Though sub-Saharan African molecular DNA has been collected and analysed in order to

investigate disease, by far the most substantial collection of samples from African

populations has been undertaken to answer questions concerning the origins of modern

humans. A comprehensive discussion of this topic, which many reviews have previously

addressed (Templeton 1997; Harpending & Rogers 2000; Hawks et al. 2000; Excoffier

2002; Brauer, Collard & Stringer 2004; Templeton 2005; Reed & Tishkoff 2006;

Garrigan & Hammer 2006; Torroni et al. 2006), is beyond the scope of this thesis but a

brief overview is presented.

1.4.2.1. The Origins of Anatomically Modern Humans (AMH)

Prior to the availability of genetic data there were two main schools of thought in regard

to the model proposed for the origin of modern humans: the 'Recent African Origin'

(RAO) model (Howells 1976; Harpending 1993) in which modern man arose somewhere

in Africa around 100,000-200,000 years ago and replaced all other archaic humans in an

expansion across the world; and the 'Multiregional' (MR) model (Weidenreich 1946;

Thorne & Wolpoff 1981) in which modern man evolved over a period of a few million

years in the presence of continuous gene flow at a global level among groups of

geographically disparate archaic humans. In addition there are also models intermediate

between these two extremes (Smith 1985; Eswaran 2002; Templeton 2002) but in much of the

early molecular work the focus was on the two extremes. As stated above the classical

marker data appeared to side with the RAO model though not particularly strongly.

There were early indications with restriction enzyme digests of the beta-globin gene

(Wainscoat et al. 1986) and other autosomal regions that molecular data would parallel

the classical data showing a split between the African and non-African haplotype

diversity. However the study that stimulated the most debate was a multiple restriction

enzyme digest of mtDNA in a 'worldwide panel' of individuals that was reported by

Cann et al. (1987), which appeared to show a most parsimonious tree with a clear split

between some African individuals and all other individuals, as well as identifying a

'mitochondrial Eve' that lived some 140,000-290,000 years ago. The African samples

appeared to be the most genetically diverse with all non-Africans possessing a subset of

this diversity. This was much vaunted as being evidence that supported the RAO model.

A number of issues concerning the methodology used in the study were raised: a) its use

of African Americans rather than 'true' Africans (Jobling, Hurles & Tyler-Smith 2004 pg

255), b) the tree used was one of a number of other most parsimonious trees that severely

altered many of the branches (Maddison 1991), c) the assumption of a continuous rate of

evolution for all branches (Wills 1992), d) its use of mid-point rooting (Darlu & Tassy

1987) and e) not taking into account the effects of natural selection acting on mtDNA

(Excoffier & Langaney 1989). Despite these criticisms the paper encouraged over the

next few years a number of other studies in which mtDNA was analysed. They improved

on the methodologies and samples used (though the actual number of African samples

included in these was still quite small). All these studies essentially came to the same

conclusion: modern man originated in Africa some 100,000-200,000 years ago

(Hasegawa & Horai 1991; Vigilant et al. 1991; Pesole et al. 1992; Horai et al. 1995).

Autosomal genes apparently showed a similar pattern using various types of marker

(Restriction Fragment Length Polymorphisms (RFLP), microsatellites) though the date of

the common ancestor appeared to fluctuate to a greater degree (Olerup et al. 1991; Nei &

Roychoudhury 1993; Bowcock et al. 1994; Armour et al. 1996; Tishkoff et al. 1996).

However it was also noted that the existence of greater genetic diversity within Africa on

its own was not sufficient to reject the MR model as this could also be the result of a

higher effective population size (Relethford & Harpending 1994), a point many of the

early studies claiming support of the RAO model had not considered. In addition some

studies did not show as great a distinction between Africans and non-Africans or as recent

a common ancestor as seen when analysing mtDNA (Jorde et al. 1995; Harding et al.

1997; Jorde et al. 1997). This led some authors to be less confident about their

conclusions concerning the best choice of model (especially without formal testing of the

models). Nevertheless the majority of researchers were agreed that Africa was a particularly

important region in human evolution (Hammer et al. 1997; Stoneking et al. 1997;

Relethford & Jorde 1999). Researchers have attempted to account for higher African

population size by attributing this to an African-specific expansion that occurred before

an out-of-Africa migration (Reich & Goldstein 1998).

More focus was placed on statistically discriminating between the two models. Takahata

et al. (2001) used computer simulations to show that a large difference in the breeding

sizes of African and non-African populations would be needed to accept the MR

hypothesis. Data that show higher African genetic diversity using different genetic

systems have continued to mount (Seielstad et al. 1999; Jorde et al. 2000; Ingman &

Gyllensten 2001; Watkins et al. 2001; Macfarlane & Simmonds 2004; Witherspoon et al.

2006) and have led many in the scientific community to consider the case more or less

closed (see opinion box 8.4 in Jobling, Hurles and Tyler-Smith (2004)).

However some studies still demonstrate results that conflict with this general belief (Zhao

et al. 2000; Yu et al. 2001). The conspicuous absence of data directly supporting the MR

hypothesis is noteworthy. The issue is rather the inability to fully distinguish the MR

from the RAO model. It is becoming clearer that the strict MR model is of very little

value (note though that some researchers claim that the model has been misrepresented in

the literature (Templeton 2007)), which has led proponents of multiregionalism to turn

to an 'assimilation' model where Africa was the predominant influence but gene flow has

been experienced at a more limited level with smaller non-African archaic human

populations vulnerable to bottleneck and extinction events (Stringer 2002). This approach

seems somewhat more useful, and a promising model proposed by Eswaran (2002), in

which humans expanded from Africa as a wave and incorporated genes of archaic

humans at the wavefront, appears to fit the current data quite well (Eswaran, Harpending

& Rogers 2005). As sequencing technologies increase in sensitivity, data indicating

possible gene flow of archaic alleles into modern human genomes are being uncovered.

This suggests that models incorporating some archaic human introgression (Wall &

Hammer 2006; Evans et al. 2006; Hawks et al. 2008) may become more acceptable in the

future causing the simple RAO model to be modified or rejected. Certainly more

information will be elucidated from application of the new high throughput sequencing

technologies, which have already successfully sequenced large amounts (~1,000,000bp)

of Neanderthal DNA (Green et al. 2006) (this particular investigation is part of a

collaborative effort between the Max Planck Institute and the 454 Life Sciences

Corporation to sequence the entire Neanderthal genome).

However even the assimilation models discussed above demonstrate the prominence of

the original African contribution to the modern human gene pool. If this is indeed the case

then current genetic data most likely place the origin of these 'ancestral Africans'

somewhere between eastern and southern Africa (Quintana-Murci et al. 1999; Jobling,

Hurles & Tyler-Smith 2004, Chapter 8; Prugnolle, Manica & Balloux 2005; Ray et al.

2005; Amos & Manica 2006; Liu et al. 2006; Cramon-Taubadel & Lycett 2007).

1.4.2.2. The Genetic History of sub-Saharan Africa based on Molecular data

Molecular genetic studies in sub-Saharan Africa have been limited but have provided

interesting insights into human history, especially at large geographical scales (fine-scale

studies have been hindered by a lack of dense sample sets). The majority of inferences

have been drawn from mtDNA studies, followed closely by studies on the non-

recombining portion of the paternally inherited Y-chromosome (NRY), while autosomal

work has been limited. Analysis of autosomal data has been constrained by many factors

including difficulties in haplotype inference (Niu 2004), evaluating the effects of

selection, unavailability of suitable genotyping and sequencing technologies and

difficulties in undertaking fieldwork. These problems are now being surmounted and a

great number of studies involving autosomal genetic systems may be expected in the

future. Studies discussed below in general support the conclusions drawn in Cavalli-

Sforza's classical marker work.

1.4.2.2.1. The Differentiation Between sub-Saharan and Northern Africa

One striking confirmation of Cavalli-Sforza's work is the clear genetic separation of sub-

Saharan African populations from North African populations. The latter more closely

resemble Middle Eastern and Eurasian populations in almost all mtDNA, NRY and

autosomal studies (for good examples see Poloni et al. (1997); Scozzari et al. (1999);

Luis et al. (2004); Cruciani et al. (2002); Salas et al. (2002); Salas et al. (2004); Terreros

et al. (2005)), demonstrating the major genetic barrier that the Sahara Desert has been

through much of modern man's occupation of the African continent. However there is

evidence of contact in both directions involving both male and female mediated gene

flow in populations lying close to the boundaries of the Sahara: the Chad Basin, Guinea

Bissau and Algeria (see Salas et al. (2002), Coia et al. (2005), Rosa et al. (2004), Rosa et

al. (2007), Cerny et al. (2007), Flores et al. (2001), Richards et al. (2003)), with the

expansions of Berbers, migrations of Fulani and the Arab slave trade possibly being

major influences.

1.4.2.2.2. The Expansion of the Bantu-speaking Peoples

Cavalli-Sforza and colleagues commented on the generally homogenous nature of sub-

Saharan Africa and this appears to be reflected in most of the molecular data (Underhill et

al. 2001; Renquin et al. 2001; Salas et al. 2002; Berniell-Lee et al. 2006). MtDNA, NRY

and autosomal data show that Niger-Congo (and especially Bantu-speaking) populations

are often indistinguishable at close (Destro-Bisol et al. 2000; Renquin et al. 2001;

Donaldson et al. 2002; Lane et al. 2002), intermediate (Scozzari et al. 1994; Lecerf et al.

2007) and large geographic distances (Donaldson et al. 2002; Steinlechner et al. 2002;

Collins-Schramm et al. 2002; Alves et al. 2005) throughout much of sub-Saharan Africa.

The expansion of the Bantu-speaking people some 3,000-5,000 years ago is likely to have

been a major driving force for this homogeneity. Both NRY and mtDNA lineages appear

to have been spread through much of sub-Saharan Africa as a result of this expansion

though the patterns observed for men and women are quite distinct. The majority of men

in Bantu-speaking populations possess the NRY defined haplogroup E3a (Underhill et al.

2001; Cruciani et al. 2002) and a particular haplotype on this E3a background defined by

six microsatellites is the modal type in numerous Bantu-speaking populations, stretching

all the way from Cameroon and Nigeria to Southern Africa (Thomas et al. 2000; Pereira

et al. 2002; Berniell-Lee et al. 2006). Given its predominant presence in Bantu-speaking

populations, its relatively low within-haplogroup diversity (Scozzari et al. 1999) and an

estimated time for the most recent common ancestor for South African Bantu possessing

E3a chromosomes of 3000-5000 years before present (Thomas et al. 2000) this

distribution is best interpreted as a signature of Bantu-speaking males expanding across

sub-Saharan Africa.
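As an illustration of how a modal haplotype and within-haplogroup diversity of the kind referred to above can be quantified, the following minimal Python sketch identifies the most frequent haplotype in a sample and computes Nei's unbiased gene diversity, h = n/(n-1) × (1 − Σp_i²). The repeat counts are invented purely for illustration and are not data from any of the cited studies, and the function names are hypothetical.

```python
from collections import Counter


def modal_haplotype(haplotypes):
    """Return the most frequent haplotype and its relative frequency."""
    counts = Counter(haplotypes)
    hap, n_modal = counts.most_common(1)[0]
    return hap, n_modal / len(haplotypes)


def gene_diversity(haplotypes):
    """Nei's unbiased gene diversity: n/(n-1) * (1 - sum of squared frequencies)."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two chromosomes")
    counts = Counter(haplotypes)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1 - sum_sq)


# Hypothetical sample of ten Y chromosomes; each tuple holds the repeat
# counts at six microsatellite loci (values are made up for illustration).
sample = (
    [(15, 12, 21, 10, 11, 13)] * 6
    + [(15, 12, 21, 10, 11, 14)] * 3
    + [(16, 12, 21, 10, 11, 13)]
)

modal, modal_freq = modal_haplotype(sample)
h = gene_diversity(sample)
print(modal, modal_freq, round(h, 3))
```

A sample dominated by one haplotype gives a low value of h, which is the sense in which low diversity on the E3a background is taken as consistent with a recent expansion.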

The pattern seen with mtDNA is somewhat different. There appear to be numerous lineages,

usually characterised by a star-like genealogy resulting from a recent founder effect,

that have different geographic origins and that have been dispersed as part of the

expansion (for example L1a, L2a1a, L2a2, L3e1 (Bandelt et al. 2001; Salas et al. 2002)).

This high Bantu mtDNA diversity appears to suggest that the expansion involved many

short bursts, with numerous women being absorbed into groups of the incoming

Bantu-speaking men at various stages (Salas et al. 2002) while the male Bantu farmers

migrated much larger distances (Wood et al. 2005) and had lower effective population

sizes as evidenced by the low NRY haplogroup diversity, overwhelming most of the pre-

Bantu expansion NRY types (mainly Haplogroups A and B) throughout the continent

(Underhill et al. 2001). The features of the female-mediated aspects of the expansion have

also allowed more ancient diversity to be retained in the genetic record such as lineages

L1a and L1b that show evidence, because of their star-like structure, of internal

migrations occurring in sub-Saharan Africa starting 60,000-80,000 years ago (Watson et al.

1997).

MtDNA, because of its diversity, has been interpreted as providing the best genetic

evidence so far that the Bantu expansion occurred in distinct east and west streams. Both

streams appear to possess common features, especially in regard to the NRY, but also

show subtle differences (Pereira et al. 2001; Salas et al. 2002; Pereira et al. 2002). The

populations that were part of the eastern stream possess some mtDNA types mostly

specific to that region that indicate that it originated in the Great Lakes region of eastern

Africa (Soodyall et al. 1996; Salas et al. 2002). The western stream appears, as expected,

to have been greatly influenced by mtDNA types that seemingly originate in West Africa

(Plaza et al. 2004; Beleza et al. 2005) and its expansion seems to have been a more

gradual process, with a great diversity of mtDNA types and very few signs of founder

effects (Beleza et al. 2005). However it also appears that the western and eastern streams

may have connected near the end of the expansion, possibly along the southern African

savannah, as suggested by the presence of apparently distinct eastern and western stream

signature mtDNA and NRY types in the same populations, with the majority of gene flow

going from the east to the west (Pereira et al. 2002; Plaza et al. 2004; Beleza et al. 2005).

1.4.2.2.3. East Central Africa

East Central Africa, which in this sense encompasses Ethiopia, Sudan, Somalia and

Eritrea, is the most genetically distinct and heterogeneous region of sub-Saharan Africa,

having experienced, because of its geographic position, possible gene flow with sub-

Saharan African, North African, Middle Eastern, European and Asian populations

(Passarino et al. 1998; Quintana-Murci et al. 1999; Underhill et al. 2001; Salas et al.

2002; Richards et al. 2003; Kivisild et al. 2004; Lovell et al. 2005; Sanchez et al. 2005).

Like much of sub-Saharan Africa, many mtDNA types are found in the region.

Haplogroup L3 is found at highest frequencies in East Africa (Salas et al. 2002) and

carriers of this lineage appear to have been among the first that migrated out of Africa

(Watson et al. 1997) (seemingly giving rise to haplogroups M and N, the most

phylogenetically ancestral haplogroups present outside Africa), with Ethiopia being

proposed as the most likely region of origin for L3 (Quintana-Murci et al. 1999). There

has also been some debate on whether haplogroup M (as well as N) arose in East Africa

(Passarino et al. 1998; Quintana-Murci et al. 1999) or is present in the region due to a

back migration from Asia (Olivieri et al. 2006; Gonzalez et al. 2007), with different

sample sets and analyses giving subtly different signals. No general consensus has yet

been reached. The NRY profile of East Africans is also relatively heterogeneous in

comparison to much of sub-Saharan Africa and appears to possess both African-specific

clades, including the most ancestral haplogroups A and B as well as clades common

outside of Africa (Underhill et al. 2000; Underhill et al. 2001).

The diversity found in Ethiopian populations encompasses most of the genetic

heterogeneity of East Africa, appearing to be a composite of sub-Saharan African, West

Eurasian, Middle Eastern, and North African mtDNA, NRY and X chromosome lineages

(Passarino et al. 1998; Scozzari et al. 1999; Underhill et al. 2001; Cruciani et al. 2002;

Semino et al. 2002; Kivisild et al. 2004; Lovell et al. 2005), with some of these non-

African lineages entering Ethiopia through past back migrations from the Levant and/or

the Horn of Africa (Semino et al. 2002; Luis et al. 2004; Kivisild et al. 2004). In addition

male-mediated non-sub-Saharan African gene flow into Ethiopia appears to have been

substantially greater than female-mediated gene flow (Passarino et al. 1998). While one

explanation for the composite Ethiopian genetic profile is admixture between sub-Saharan

African and non-sub-Saharan African people, as suggested by Cavalli-Sforza’s work on

classical markers (Cavalli-Sforza, Menozzi & Piazza 1994), Tishkoff et al. (1996)

suggested from analysis of the CD4 autosomal locus that as Ethiopian Jews (as well as

Somalians) only possessed a subset of sub-Saharan African diversity, populations from

this region ‘may represent the modern descendants of the ancestral population that

spawned the migration from Africa’ and thus already possessed a large amount of their

heterogeneity before the out-of-Africa expansion. Other points to note are a) that

haplogroup N, which is ancestral to most European mtDNA lineages, increases

significantly in frequency in a north eastern direction in Ethiopia, while haplogroup M is

much more evenly distributed (Kivisild et al. 2004) and b) previous NRY, mtDNA and

autosomal studies have been unable to differentiate between the two largest ethnic groups

of the region, the Amhara and Oromo (Scozzari et al. 1999; Sanchez-Mazas 2001;

Kivisild et al. 2004; Lovell et al. 2005). The presence of ancestral mtDNA and NRY

types in Ethiopian genetic profiles also appears to indicate some ancient link with Khoisan

speakers, a subject discussed later in this chapter.

In terms of genetic research, other parts of East Africa have been substantially less

well studied than Ethiopia. However, a study by Krings et al. (1999) did show that the

River Nile has been acting as a corridor for bidirectional gene flow, at least as far as

mtDNA diversity between Egypt and Sudan is concerned, with the greater (or older)

movement being south into Sudan. In addition there is no NRY evidence that suggests the

expansion of the Bantu-speaking peoples reached Somalia, which shows a close

similarity, as would be expected, to Ethiopia (Sanchez et al. 2005).

1.4.2.2.4. West Africa

NRY data from Senegal, Guinea Bissau, Gambia and Ghana show a high frequency of

haplogroup E3a (Semino et al. 2002; Wood et al. 2005; Rosa et al. 2007), which is

interesting given that the haplogroup is a putative signature of the Bantu expansion

(Underhill et al. 2001), suggesting it has an older and geographically more widespread

significance, possibly being a signature of the original proto Niger-Congo speakers. In

addition or alternatively the low NRY haplogroup diversity in West Africa may be a

product of agricultural expansion throughout the region or another part of the same

expansion that includes that of the Bantu-speaking peoples. Many of the mtDNA types

found in West Africa (from Senegal, Guinea Bissau and Sierra Leone) appear to be

specific to the region (Rando et al. 1998; Salas et al. 2002; Rosa et al. 2004; Jackson et al.

2005) and some of the many putative Bantu signature mtDNA lineages are not found at

high frequencies (e.g. L1a or L3e) indicating any agricultural expansion in West Africa

may have been somewhat distinct from the expansion of the Bantu-speaking peoples

towards southern Africa (Rosa et al. 2004). Given that there are no Bantu languages

spoken in West Africa and that E3a, due to its greater haplotype diversity, appears to have

been established in the region prior to the expansion of the Bantu-speaking peoples

(Cruciani et al. 2002), this region may have been the original wellspring of farming in sub-

Saharan Africa and acted as a source of knowledge for the subsequent Bantu expansion

(Rosa et al. 2007), i.e. the West African agricultural expansion was older than the later

expansion of the Bantu-speaking peoples. The Mandenka of Guinea Bissau possess a

particularly high E3a microsatellite haplotype diversity and an mtDNA L2a/L2c star

phylogeny indicative of an expansion that suggests that they benefited from this

agriculture-driven population expansion (Rosa et al. 2004; Rosa et al. 2007). In addition

studies focusing on Guinea Bissau show mtDNA lineage sharing with North African

populations (Rosa et al. 2004) while the Felupe-Djola appear to possess NRY and

mtDNA types similar to those found in East Africa, consistent with their proposed oral

tradition of a migration from Sudan in the 15th century (Rosa et al. 2004; Rosa et al.

2007).

1.4.2.2.5. South Eastern Africa

Though it was previously stated that to a considerable extent the sex-specific genetic

profiles of the peoples of sub-Saharan Africa have been homogenised by the expansion of

the Bantu-speaking peoples, East Africa does appear to show more genetic as well as

linguistic diversity than West Central and Central sub-Saharan Africa (Underhill et al.

2001; Salas et al. 2002; Wood et al. 2005), suggesting the movement and replacement of

existing populations by Bantu-speaking farmers was less complete in this region (Salas et

al. 2002). This is demonstrated best by the greater proportion of more ancestral NRY

(Haplogroups A and B) and mtDNA types (e.g. the putative Khoisan type L1d),

especially in the more sampled populations within Mozambique (Underhill et al. 2001;

Pereira et al. 2001; Salas et al. 2002) and Tanzania (Wood et al. 2005; Gonder et al.

2007). Mozambicans (Brandstatter et al. 2004), Kenyans (Brandstatter et al. 2004),

Rwandan Hutus (Tofanelli et al. 2003), Zimbabweans (Gene et al. 2001) and Ugandans

(Gene et al. 2001) have all demonstrated substantial differences with each other and/or

populations further to the west at some molecular level. Mozambique datasets have also

been shown to possess European Y-chromosome lineages at a frequency of

approximately 5% (Pereira et al. 2002), possibly due to the influence of Portuguese

slave traders.

1.4.2.2.6. The Chad Basin

The Chad Basin presents a very heterogeneous genetic profile that differs significantly

between male- and female-specific lineages, consistent with the complex population

movements the region has experienced. NRY data, mostly from datasets collected in

northern Cameroon, show a substantial proportion of R1*-M173 types (Scozzari et al.

1999; Cruciani et al. 2002), a clade not usually found elsewhere in sub-Saharan Africa but

present in Asia (Luis et al. 2004). This has been presented as evidence for a possible back

migration from Asia to sub-Saharan Africa through the Levantine corridor. However,

mtDNA showed no such signal, suggesting that admixture of the immigrating group was

primarily male-mediated, at least once they reached their destination (Coia et al. 2005).

However the mtDNA data are still very heterogeneous with many different types showing

a mostly Central African connection but with possible gene flow from East Africa and

from West Africa (Cerny et al. 2004; Cerny et al. 2007) as well as a small North African

influence (Coia et al. 2005), demonstrating that the Sahel along which the Chad Basin lies

has been a major corridor for human migration in Africa.

1.4.2.2.7. Khoisan Speakers

The NRY (Knight et al. 2003; Cerny et al. 2007), mtDNA (Scozzari et al. 1988; Watson

et al. 1996; Knight et al. 2003) and autosomal systems (Ramsay & Jenkins 1988; Patin et

al. 2006) of the click-speaking Khoisan have been shown over the past two decades to be

consistently and substantially different from those of all other sub-Saharan Africans.

Khoisan tend to possess the most ancestral (most ancestral here referring to the NRY type

least derived from the NRY of the most recent common ancestor of all present day males)

NRY and mtDNA types at high frequencies (Scozzari et al. 1999; Underhill et al. 2000;

Underhill et al. 2001; Hammer et al. 2001) and thus are usually the earliest major outlier

branch in most phylogenies when conducting inter-population comparisons (Hammer et

al. 1997; Chen et al. 2000; Forster et al. 2000; Knight et al. 2003; Wood et al. 2005). The

Khoisan almost exclusively possess the mtDNA haplogroup L1d (Salas et al. 2002),

which is thus often used as a signature of Khoisan introgression.

Though they share many genetic features, the agricultural Khoi groups (such as

the Nama) show substantial genetic differences from the hunter-gatherer San groups

(such as the !Kung) (Soodyall et al. 1996; Scozzari et al. 1999; Renquin et al. 2001;

Cruciani et al. 2002). This dissimilarity of the two groups is probably in part due to drift

acting in the presence of mutual isolation and restricted numbers. Of the two the San tend

to be the more homogeneous (Ramsay & Jenkins 1988; Watson et al. 1996). Another

reason for the differentiation (rather than simple isolation following a common origin) is

that the Khoi have experienced greater gene flow from Bantu-speaking farmers

(consistent with their farming lifestyle) which has added a substantial Bantu signature to

their mtDNA and NRY profiles (Scozzari et al. 1999; Chen et al. 2000; Cruciani et al.

2002). Conversely some southern Bantu speakers appear to have inherited Khoisan

signature types (Underhill et al. 2001; Salas et al. 2002).

The Khoisan share some of their ancestral NRY and mtDNA clades with Ethiopian

populations but this similarity appears to be quite ancient and the common clades of the

two groups have experienced substantial divergence (Scozzari et al. 1999; Cruciani et al.

2002; Salas et al. 2002; Semino et al. 2002). These clades are rare elsewhere in Africa

and form, for the NRY at least, the basal branch of the genealogical tree. This is

consistent with the assertion that the range of Khoisan-speakers once extended over a

much wider area including up to present day Ethiopia and it is contact within this wide

zone that has led to the similarity that is still observed. In this scenario the Khoisan are

an offshoot of the larger ancestral population that has experienced drift as a consequence

of their isolation (Salas et al. 2002).

Supporting evidence that the Khoisan were once much more widespread is the

presence of a click-speaking hunter-gatherer group, the Hadza, in Tanzania. A study by

Knight et al. (2003) that compared NRY and mtDNA profiles in the Hadza to San, Bantu

and East African populations showed the Hadza to be much closer to the non-San groups

but still distinct from them. This study therefore seems to provide support for the

suggestion of an ancient common origin for the present day click-speaking populations

that have become isolated and evolved independently over a wide area for perhaps tens of

thousands of years.

1.4.2.2.8. Pygmies

Five populations of Pygmy have been characterised in some way by genetic studies. They

can be broadly placed into two groups: the western pygmies (the Biaka, Mbenzele and Aka

from the Central African Republic (CAF) and the Bakolo from southern Cameroon) and

the eastern pygmies (the Mbuti of the Democratic Republic of Congo (DRC)). Like the

Khoisan, the pygmies are distinguishable from other sub-Saharan Africans (including the

Khoisan) by NRY (Cruciani et al. 2002; Coia et al. 2005), mtDNA (Watson et al. 1996)

and autosomal data (Zekraoui et al. 1997; Destro-Bisol et al. 2000; Sanchez-Mazas 2001;

Renquin et al. 2001; Patin et al. 2006) with most groups being more homogeneous

(Lucotte et al. 1994; Watson et al. 1996; Renquin et al. 2001) and with a greater time to

the most recent common ancestor (Watson et al. 1996) than most Bantu-speaking

populations.

The western (mostly from the CAF) and eastern pygmies can also be differentiated from

each other (Chen et al. 1995; Destro-Bisol et al. 2000; Batini et al. 2007) though both

groups show ancient similarities. The western pygmies possess mostly mtDNA types L1a

and L1c (Salas et al. 2002; Batini et al. 2007) and NRY haplogroup B (Wood et al. 2005),

reflecting similarity to Bantu-speaking CAF populations while the eastern pygmies

possess mtDNA types L1e and L2 (Salas et al. 2002; Batini et al. 2007) and NRY

haplogroup A (Wood et al. 2005), reflecting similarity with East African populations. The

large genetic distance between the eastern and western groups has been interpreted as

evidence of an independent origin and evolution of the pygmy characteristics in each

group (Chen et al. 1995; Destro-Bisol et al. 2004), which occurred a minimum of 18

thousand years ago, but dating of mtDNA clades points to a much older event. It should be

noted that the western pygmy populations tend to closely resemble each other and show

evidence of Bantu introgression (Destro-Bisol et al. 2000; Destro-Bisol et al. 2004; Coia

et al. 2004) though Batini et al. (2007) have suggested most of the Bantu/Pygmy similarity

is ancient rather than the result of recent introgression with a split some 30 to 60 thousand

years ago.

1.4.2.2.9. The Fulani

Genetic studies on the Fulani have been fairly limited but those that have been conducted

using NRY, mtDNA and autosomal systems have shown nomadic Fulani to be generally

distinct from neighbouring populations (Scozzari et al. 1999; Modiano et al. 2001; Cerny

et al. 2006) and mtDNA data have shown Fulani from Burkina Faso, Chad and Cameroon

to be somewhat similar to each other (Cerny et al. 2006) in comparison to neighbouring

sedentary populations, with 10% of female lineages being of non-sub-Saharan African

origin, possibly from the northern massifs of the Central Sahara. However a relatively old

mtDNA study showed that the Senegalese Fulani, the Peul, who are semi-settled farmers,

could not be discriminated from other Senegalese populations (Scozzari et al. 1988) while

an HLA class I study demonstrated significant differences between Burkina Faso and

Gambian Fulani (Modiano et al. 2001).

1.4.2.2.10. Sao Tome

The islands of Sao Tome on the Gulf of Guinea have been the subject of an unusually

large amount of genetic investigation, as they provide a potentially excellent model in

which to observe the interplay of drift and admixture. The islands were, from the 15th

century, a Portuguese colony. Eventually they were inhabited by slaves of mixed sub-

Saharan African origin. The first study of the islands looked at the HVS-1 mtDNA profile

in comparison to the neighbouring Bioko Island, which underwent a much more ancient

and smaller migration with the immigrants eventually becoming today’s Bubi tribe. As

would be expected Bioko Island was much more homogeneous while the sampled

population from Sao Tome possessed many different African mainland mtDNA types

(Mateu et al. 1997). Focusing on three Sao Tome Islands (Angolares, Forros and

Tongas) mtDNA data showed the Angolares to be most differentiated from the other two

islands but no probable European ancestry was detected (Trovoada et al. 2004). Y-

chromosome studies revealed the Angolares to be the most homogeneous island (Trovoada

et al. 2001) while 10% of lineages on the other two islands were of European ancestry

(Trovoada et al. 2007; Goncalves, Spinola & Brehm 2007). This is consistent with

Angolares being a more isolated island with only central and south western African slaves

present on the island while Forros and Tongas experienced substantial numbers of unions

between Portuguese men and African slave women. Interestingly one autosomal study

found no European component (Gusmao et al. 2001) while another found approximately

10% European admixture (Tomas et al. 2002).
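To make concrete what an admixture figure of this kind means, the sketch below applies a simple single-allele estimator in which the admixed population's allele frequency is treated as a weighted mixture of two source populations, so that m = (p_admixed − p_A) / (p_B − p_A). This is offered only as an illustration and is not necessarily the method used in the studies cited above; the allele frequencies and the function name are hypothetical.

```python
def admixture_proportion(p_admixed, p_source_a, p_source_b):
    """Estimate the fraction of ancestry from source B using one allele.

    Assumes the admixed frequency is a linear mixture:
    p_admixed = (1 - m) * p_source_a + m * p_source_b
    """
    denom = p_source_b - p_source_a
    if denom == 0:
        raise ValueError("allele is uninformative: same frequency in both sources")
    m = (p_admixed - p_source_a) / denom
    # Sampling error can push the raw estimate outside [0, 1]; clamp it.
    return min(1.0, max(0.0, m))


# Made-up frequencies: an allele at 5% in the African source population,
# 95% in the European source population and 14% in the island sample.
m = admixture_proportion(p_admixed=0.14, p_source_a=0.05, p_source_b=0.95)
print(round(m, 3))
```

An estimate from a single allele is very sensitive to sampling error and to the choice of source populations, which is one reason why different marker sets can yield different admixture estimates for the same islands.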

1.4.2.2.11. The Lemba

Another interesting genetic history investigation in sub-Saharan Africa is that on the

southern African Lemba people, whose oral tradition has been interpreted as indicating

that they are a ‘lost tribe of Israel’ having migrated from Judea to Sena (which was

possibly in the Yemen) (Parfitt 1997), before finally settling in Mozambique, Zimbabwe

and South Africa. An initial NRY survey by Spurdle et al. (1996) using a four marker

system showed that approximately 50% of NRY lineages were of possible Semitic origin

due to the presence of the p12f2 marker though Arabic and Jewish origins

could not be distinguished. However a higher-resolution NRY study by Thomas et al.

(2000) was able to establish that some of the Semitic clades in the Lemba possessed the

Cohen Modal haplotype, a signature compound biallelic / microsatellite haplotype found

at particularly high frequencies in Ashkenazic and Sephardic Cohanim Jews, thus

supporting their remarkable claim of a Jewish origin.

1.4.2.3. Pharmacogenetics in Africa

Though there is no strict definition, pharmacogenetics is the study of the genetic basis of

drug metabolism. A review of the entire field is beyond the scope of this thesis (there are

entire journals dedicated to the subject and there are many excellent review articles

available (Johnson 2003; Weinshilboum 2003; Evans 2003; Agrawal & Khan 2007; Hall

& Sayers 2007; Kayser 2007; Lanfear & McLeod 2007; Swen et al. 2007; Wilke et al.

2007)). However some previous pharmacogenetic work that has involved sub-Saharan

African populations is detailed below. Studies on the molecular genetic basis of drug

metabolism in sub-Saharan African populations have been very limited (there has been

substantially more involving African Americans), especially in comparison to European-

and Asian-specific investigations.

Much of the early work focused on establishing whether genetic variants of drug

metabolising enzyme (DME) genes that were initially discovered in Europe and Asia

were also found in sub-Saharan African populations. The majority of the focus was

directed towards the Cytochrome P450 gene CYP2D6, which was and is by far the best

characterised of the CYP family because of the remarkable correlation it demonstrated

between genotype and phenotype (particularly so far as variation in the rate of drug

metabolism is concerned) (Weinshilboum 2003). The initial studies on the Shona of

Zimbabwe by Masimirembwa et al. (1993) showed marked differences in the frequency

of the Eurasian determined variants in Africans and also noted a marked difference in

expected phenotype from the genotypic profile (Masimirembwa et al. 1996a), with a

tendency towards slower metabolisers in Africans. It was soon found, by re-sequencing

Shona individuals, that approximately 34% of individuals possessed another SNP in exon

2, defined as haplotype CYP2D6*17, that accounted for much of this phenotypic

discrepancy in Africa (Masimirembwa et al. 1996b) (other SNPs were later shown to

contribute to the resultant CYP2D6*17 phenotype (Oscarson et al. 1997)). Subsequent

studies showed this allele to be very prevalent in many sub-Saharan African populations

(South Africa = 24% (Dandara et al. 2001), Gabon = 23% (Panserat et al. 1999), Tanzania

= 17-23% (Wennerholm et al. 1999; Dandara et al. 2001; Wennerholm et al. 2001),

Ghana = 27.7% (Griese et al. 1999)) though it was much lower in Ethiopia (9%; Aklillu et

al. 1996) and was absent in Europeans (Wennerholm et al. 2002). Another allele,

CYP2D6*29, would also subsequently be shown to contribute substantially (Tanzania

= 19%) to the African-specific ‘lower rate of metabolism’ phenotype (Wennerholm et al.

2001).

The practice of assaying for Eurasian-derived variants in limited numbers of poorly

defined sub-Saharan African populations (especially in Zimbabwe Shona, South African

Venda and Ethiopians) continued for other DMEs (e.g. CYP2C19 (Masimirembwa et al.

1995; Persson et al. 1996; Bathum et al. 1999; Allabi et al. 2003), CYP3A4 (Tayeb et al.

2000; Zeigler-Johnson et al. 2002; Cavaco et al. 2003; Chelule et al. 2003), CYP2C9

(Allabi et al. 2003), N-acetyltransferase (NAT) 1 (Loktionov et al. 2002), NAT2

(Loktionov et al. 2002) and ABCB1 (Schaeffeler et al. 2001; Chelule et al. 2003)) and thus

many of the frequencies observed were unremarkable. As suggested by Bapiro et al.

(2002) for the CYP2D6*17 variant and Wojnowski et al. (2004) for CYP3A5, predicting

phenotypic effects of pharmaceuticals in African populations based on Eurasian

ascertained variation is likely to be hazardous due to the impact of African-specific

variation. Therefore the valuable insights, following the example of CYP2D6, have come and

will continue to come from sequencing African individuals in order to identify African-specific

variation that may affect phenotypic response.

Aklillu et al. (2002) (CYP1B1), Dandara et al. (2003) (NAT), Allabi et al. (2005)

(ABCB1) and Quaranta et al. (2006) (CYP3A5) have sequenced African individuals

(albeit a small number) and identified novel variants, some of which are found at high

frequencies and may affect drug metabolism. The survey by Quaranta et al. (2006) in

particular demonstrates the potentially large amount of information that has yet to be

revealed in sub-Saharan Africa in comparison to the rest of the continent. They used a

polymerase chain reaction-single strand conformational polymorphism approach to

interrogate the CYP3A5 gene in individuals from French Caucasians (‘Caucasian’ here

referring to ‘white skinned’ individuals of recent European origin), Gabonese and

Tunisians and found 8, 17 and 10 novel SNPs respectively. The study by Aklillu et al.

(2003) on CYP1A2 in Ethiopians not only found a novel variant but was also able to

show that the variant substantially lowered enzyme activity.

More recent work has been directed towards the functional aspects of alleles with

particular emphasis on the interaction of genetic variants found in sub-Saharan Africans

with pharmaceuticals that may be practically relevant. Mehlotra et al. (2006) identified

that the CYP2B6 enzyme was involved in the metabolism of artemisinin, a drug used to treat

multi-drug resistant strains of falciparum malaria, which is prevalent in sub-Saharan

Africa, and thus determined the frequencies of a range of CYP2B6 haplotypes in West

Africans while Penzak et al. (2007) examined the G516T SNP in HIV-infected

individuals in Uganda and showed the genotype to influence nevirapine (a drug used to

treat HIV infection) concentrations. Röwer et al. (2005) noted that amodiaquine (AQ) has

become the first line treatment of malaria in Ghana and, given that CYP2C8 metabolises AQ,

decided to type CYP2C8 alleles in Ghanaian children. They were able to demonstrate an

allele frequency of 17% (much higher than found previously in Caucasians) for the

CYP2C8*2 allele that is associated with a decrease in enzyme activity, and thus it is

possible that a fair proportion of Ghanaian individuals may experience adverse drug

effects during the administration of AQ. Sim et al. (2006) and Mirghani et al. (2006) have

similarly performed investigations that examine practically relevant pharmacogenetic

variation at a functional level in sub-Saharan Africans.
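The step from an allele frequency to the proportion of individuals who might be affected can be sketched as follows. The example takes the 17% CYP2C8*2 allele frequency reported above and, assuming Hardy-Weinberg proportions (an assumption made here for illustration, not a claim of the cited study), computes the expected genotype frequencies. The function name is hypothetical.

```python
def hardy_weinberg(q):
    """Expected genotype frequencies for a biallelic locus with variant frequency q.

    Assumes random mating and no selection, migration or drift
    (Hardy-Weinberg equilibrium).
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    p = 1.0 - q
    return {
        "non_carrier": p * p,       # no copies of the variant allele
        "heterozygote": 2 * p * q,  # one copy
        "homozygote": q * q,        # two copies
    }


# Variant allele frequency of 17%, as reported for CYP2C8*2 in the text.
freqs = hardy_weinberg(0.17)
carriers = freqs["heterozygote"] + freqs["homozygote"]
print({k: round(v, 4) for k, v in freqs.items()}, round(carriers, 4))
```

Under these assumptions roughly 31% of individuals would carry at least one copy of the allele and about 3% would be homozygous, which gives a sense of scale for the 'fair proportion' of individuals referred to above.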

Consistent with most genetic studies, major differences between the pharmacogenetic

profiles of Africans and non-Africans have been demonstrated (Loktionov et al. 2002;

Allabi et al. 2003; Chelule et al. 2003; Garsa, McLeod & Marsh 2005; Mehlotra et al.

2006; Quaranta et al. 2006). However very few pharmacogenetic variation studies have

involved inter-sub-Saharan African population comparisons, the best example possibly

being that of Dandara et al. (2002), which itself only examines three broadly defined east

African groups of low sample size. Hopefully this paucity of sub-Saharan African

information will be addressed in the future.

1.4.2.4. Natural Selection in Africa

Human sub-Saharan African genetic variation appears to have been substantially shaped

by demographic events, particularly the expansion of the Bantu-speaking peoples.

However natural selection can also be a major force. Table 1.1 lists some genes that show

unique signals of natural selection in sub-Saharan Africa as well as, when available,

possible reasons for this selection. This list is far from exhaustive and is simply to guide

the reader towards some of the more prominent examples present in the literature. It is

notable in Table 1.1 that the majority of the statistically significant signatures of selection

in sub-Saharan Africa have only been elucidated very recently. This is a consequence of

the amounts of available data increasing rapidly because of the recent development and

Table 1.1: Some examples of natural selection detected in Africans.

Gene name (gene symbol): Duffy blood group, chemokine receptor (DARC)
Chromosome location: 1
Type of selection: Positive selection for FY*O allele
Signal of selection: Low level of sequence variation (Hamblin & Di Rienzo 2000; Hamblin, Thompson & Di Rienzo 2002)
Statistical method of selection detection: HKA, Fu and Li's D, FST, Fay and Wu's H
Possible environmental pressure causing selection: Homozygotes are resistant to Plasmodium vivax (Livingstone 1984)

Gene name (gene symbol): β-hemoglobin (HBB)
Chromosome location: 11
Type of selection: Balancing selection of sickle cell trait
Signal of selection: High degree of haplotype similarity for βs alleles (Hanchard et al. 2006; Hanchard et al. 2007)
Statistical method of selection detection: Long range haplotype similarity
Possible environmental pressure causing selection: Heterozygotes are resistant to Plasmodium falciparum (Aidoo et al.

Gene name (gene symbol): Glucose-6-phosphate dehydrogenase (G6PD)
Chromosome location: X
Type of selection: Positive selection for G6PD A- allele
Signal of selection: Long range LD (Sabeti et al. 2002b; Saunders et al. 2005) on A- allele background and low sequence diversity (Tishkoff et al. 2001; Saunders, Hammer & Nachman 2002)
Statistical method of selection detection: Coalescent simulation of neutral microsatellite variation and LD, LRH
Possible environmental pressure causing selection: G6PD A- allele confers resistance to Plasmodium falciparum

Gene name (gene symbol): CD40 ligand (CD40LG)
Chromosome location: X
Type of selection: Positive selection for TNFSF5-CH4 haplotype (with 726C SNP)
Signal of selection: Long range LD on CH4 haplotype (Sabeti et al. 2002b) and low sequence diversity (Sabeti et al. 2002a)
Possible environmental pressure causing selection: TNFSF5 heavily involved in immune response and CH4 may confer resistance to Plasmodium falciparum

Gene name (gene symbol): Taste receptor, type 2, member 16 (TAS2R16)
Chromosome location: 7
Type of selection: Balancing selection around K172N SNP
Signal of selection: Excess of derived alleles detected but higher frequency of ancestral K172 allele specific to central Africa (Soranzo et al.
Statistical method of selection detection: Fay and Wu's H, Exact Test of
Possible environmental pressure causing selection: 172N allele increases protection against cyanogenic plant foods but K172 allele allows low level ingestion of cyanogenic foods, which increases resistance to Plasmodium falciparum

Gene name (gene symbol): Lactase (LCT)
Chromosome location: 2
Type of selection: Positive selection for lactase persistence allele -14010C
Signal of selection: Long range LD on -14010C allele background (Tishkoff et al. 2007)
Possible environmental pressure causing selection: Lactase persistence and thus drinking fresh milk offers increased nutritional benefits and milk is a good source of water in arid environments.

Gene name (gene symbol): glycosyltransferase (LARGE)
Signal of selection: Long range LD and extreme derived allele frequency difference between populations (Sabeti et al. 2007)
Possible environmental pressure causing selection: Loss of normal glycosylase function prevents α-dystroglycan modification, preventing binding of Lassa Fever

Gene name (gene symbol): Dystrophin (DMD)
Chromosome location: X
Type of selection: Positive selection
Signal of selection: Long range LD and extreme derived allele frequency difference between populations (Sabeti et al. 2007)
Possible environmental pressure causing selection: Loss of normal cytosolic adaptor function prevents normal α-dystroglycan function, preventing binding of Lassa Fever virus

Gene name (gene symbol): histocompatibility complex, class II, DR beta 1 (HLA-
Chromosome location: 2
Type of selection: Balancing selection
Signal of selection: Extreme level of heterozygosity in DRB1 gene (Renquin et al.
Statistical method of selection detection: Ewens-Watterson's and Slatkin's neutrality test
Possible environmental pressure causing selection: High levels of haplotype diversity in HLA genes such as DRB1 allow greater response to a broader range of pathogens

Gene name (gene symbol): Melanocortin 1 receptor (MC1R)
Chromosome location: 16
Type of selection: Purifying selection
Signal of selection: Deficit in expected number of non-synonymous changes (John et al. 2003)
Statistical method of selection detection: McDonald-Kreitman, Tajima's D, Fu and Li's F, D, F* and D*
Possible environmental pressure causing selection: Functional constraint required to maintain dark skin pigmentation in sub-Saharan Africans due to effect of high sunshine rate

Gene name (gene symbol): ATP-binding cassette, sub-family B (MDR/TAP), member 1 (ABCB1)
Type of selection: for mh7 haplotype
Signal of selection: Long range LD on SNP e26/3435C background (Tang et al. 2004) (note: in African Americans, not sub-Saharan Africans)
Possible environmental pressure causing selection: MDR1 may regulate drug, xenobiotic and enveloped virus traffic but actual selection stimulus somewhat unknown

Gene name (gene symbol): Opsin 1 (cone pigments), long-wave-sensitive (OPN1LW)
Chromosome location: X
Type of selection: Balancing selection
Signal of selection: Excess of non-synonymous changes (Verrelli & Tishkoff
Statistical method of selection detection: HKA, McDonald-Kreitman, Tajima's D
Possible environmental pressure causing selection: Variation in L-cone colour vision may have allowed adaptive evolution of hunter-gatherers.

Gene name (gene symbol): vav 3 guanine nucleotide exchange factor (VAV3)
Chromosome location: 1
Type of selection: Positive
Signal of selection: Long range haplotype LD (Walsh et al. 2006)
Statistical method of selection detection: LRH
Possible environmental pressure causing selection: There is no known mechanism of selection for VAV3, a hemopoietic cell-specific guanine nucleotide exchange factor

application of high-throughput DNA variation typing and sequencing technologies as well as

the development of more powerful statistical methods for analysing these data (for a

recent review see Sabeti et al. (2006)). Understanding the effect of natural selection and

being able to discriminate these effects from demographic processes will be important for

understanding patterns of genetic variation in sub-Saharan Africa, so further technological

and analytical developments will be vital in the near future.

1.5. DNA Sampling Issues

As methods to characterise the DNA of individuals, including new sequencing

technologies like 454 (Margulies et al. 2005), Solexa and SOLiD (Shendure et al. 2005)

sequencing, continue to advance at a rapid rate it is clear that DNA sampling will be the

major methodological bottleneck when conducting genetic studies in sub-Saharan Africa.

The majority of genetic studies described above have used either relatively poorly defined

or pooled sample sets (e.g. populations labelled as 'West Africa' or 'Tanzanian') or, if

samples have been more carefully defined (e.g. populations labelled as Amharic born in

Addis Ababa), there has been very little other available population data characterised in

similar detail with which such datasets can be compared. Sampling criteria have, in

addition, been very variable and the original rules governing fieldwork

collection are often no longer available. DNA sampling in sub-Saharan Africa presents a

unique challenge for the investigator and can be particularly troublesome. Inconsistent

sampling strategies, while understandable, should be avoided. As investigations are

undertaken at ever finer scales such as those described in Chapters 2 and 3 of this thesis

the criteria used to collect samples become increasingly important. In addition, subsequent

analysis must take account of the adopted sampling strategy. It is clear that stringent,

standardised and declared sampling strategies must be an important consideration of future

studies.

The Centre for Genetic Anthropology criteria are described in Appendix A together with

the practical problems encountered by the author of this thesis in the field. Definitive

criteria for collecting samples in the field in Africa have not yet been formulated but are

increasingly necessary. Until such time as they are, the minimal requirement is

that all research reports should contain in the fullest detail information of when, where,

how and by whom samples were collected so that this information can be taken into

account when data are analysed. In the three case studies (especially Chapters 2 and 3,

where appropriate population definition is critical) described in this thesis a large number

of densely sampled datasets, each of significant size, that have been relatively well

described have been used in an attempt to address some of the sampling issues of

previous studies.

1.6. Statement of work performed by Krishna Veeramah in this thesis

1.6.1. Chapter 2

All field sampling, DNA extraction and Y-chromosome typing of samples from Bafut,

Foumban, Nkambe, Wum, Bankim, Magba and Sabongari of Cameroon was performed

by Krishna Veeramah. All mtDNA typing and processing of samples from Cameroon was

performed by Krishna Veeramah. All statistical analysis was performed by Krishna Veeramah.

1.6.2. Chapter 3

All field sampling, DNA extraction, Y-chromosome typing, mtDNA typing and

processing of samples from Cameroon was performed by Krishna Veeramah. All

processing of mtDNA data from Nigeria was performed by Krishna Veeramah. All statistical analysis

was performed by Krishna Veeramah.

1.6.3. Chapter 4

All FMO2 g.23238C>T SNP typing was performed by Krishna Veeramah. All statistical

analysis was performed by Krishna Veeramah, except for the Logistic Regression, which

was carried out by Professor Nancy R. Mendell, and the interpolation step of the genetic

boundary analysis, which was carried out by Dr Mark Thomas.

Chapter 2:

Sex-Specific Genetic Data Support One Of Two Alternative Versions Of The Foundation Of The Ruling Dynasty Of The Nso´ In Cameroon


2.1. Introduction

2.1.1. The geography, history and sociology of the Nso´

The history of western Cameroon is dominated by rival polities which by the eighteenth

and nineteenth centuries had formed city states (also called kingdoms or fon-doms). Their

rivalries were a feature of the pre-colonial history of the Grassfields, the highlands which

form the West and North West Provinces of present day Cameroon. Although they are not

the only groups living in the region, these centralised polities have dominated regional

politics for centuries. This chapter discusses the early history of one of the most

celebrated kingdoms, the fon-dom of Nso´, using novel genetic data that throw new light

on a long-standing controversy among Nso´ historians.

Nso´ is one of the Grassfields states (see Figure 2.1 for geographic location) whose royal

family claims Tikar descent. This chapter will not attempt to revisit a complex topic

which has been much discussed in the literature (see Jeffreys 1964; Chilver & Kaberry

1971; Price 1979; Fowler & Zeitlyn 1996). For the purposes of this study it is sufficient to

note that, like many but not all royal families in the region, the Nso´ royal family traces

its origins to the royal family of the Tikar of the Tikar Plain, from near present-day

Bankim (there may be significant political advantages for a group to claim Royal Tikar

descent).

By the nineteenth century the Nso´ state had expanded, so that it was in effect a small

empire holding sway over surrounding ethnic groups, over whose control wars were

fought with rival states such as the Bamum state centred on Foumban (Kaberry 1962b;

Tardits 1980). Concerning the period before the establishment of the larger state there are

oral history accounts of uncertain antiquity which deal with the ethnogenesis of the Nso´

people. The most common account tells of a Princess Ngonnso´ travelling with followers

from the Tikar region, approximately 100 km to the east, separating from her brothers

(who founded neighbouring settlements) on the journey and encountering a small

indigenous group of hunter-gatherers (the Visale) amongst whom she settled (Mzeka

1990). Opinion differs between Mzeka (1978), who states that she was accompanied by a

husband, and most members of the Nso´ Historical Society1 who claim that she settled

without an accompanying husband (Tatah Humprey Mbuy, senior member of the Nso´

Historical Society, personal communication). However, both sides of the debate agree

that her son became the first fon2 of the Nso´ and it is from him that the current fon is

directly descended along the paternal line.

Nso´ is unusual among the Grassfields kingdoms in having a system of named descent-

based social classes with varying rules of affiliation and transmission. They were first

described by Kaberry (1952; 1959; 1962a) and Chilver and Kaberry (1960), and more

recently by Goheen (1996) and Chem-Langhëë and Fanso (1997). The four groups are 1)

the won nto´ (descendants of a fon down to the third or fourth generation (see below)), 2)

the duy (descendants of a fon who ruled more than three or four generations ago (see

below), together, according to Chem-Langhëë and Fanso (1997), with some members of

commoner lineages whose heads are descendants of princesses, and associated patriclans

or clan segments providing state counsellors, allegedly founded by immigrant royals), 3)

the nshiylav (subjects born or recruited3 into Palace service (patrilineally inherited)) and 4)

the mtaar (commoners (patrilineally inherited)). Although the majority of the Nso´ are

self-identifying Christians of the Roman Catholic denomination, the fon has, through the

generations, maintained a polygynous household, which in 2005 numbered over 70

1 The Nso´ Historical Society is open to all Nso´ people as well as non-Nso´ individuals undertaking

research on Nso´ history and traditions. Address: Nso´ History Society, Tourist Home, P.O. Box 33, Kumbo

Nso´, North West Province, Cameroon. Telephone number: 00237 348 17 65.

2 A fon can be thought of as a traditional ruler or leader of a group or village though the actual degree of

power held by a fon is variable from group to group. While the term is somewhat specific to the Grassfields

of Cameroon, a fon is analogous to a traditional tribal chief.

3 Members of the nshiylav may also be recruited from the other categories (for example, from the mtaar) by

the fon and given a special (high) status (such as that of personal page).

women4. Access to the fon's wives has traditionally been strictly controlled, with illicit

unions subject to capital punishment (Chilver & Kaberry 1968). While paternal descent

from a fon is a necessary precondition for enthronement, the new fon's mother must, in

the same tradition, be a mtaar commoner (Mzeka 1978).

Figure 2.1: Map showing towns in Cameroon where samples were

collected.

The membership rules, as commonly stated in abbreviated form, do not cover all possible

cases, particularly where the change in status from won nto´ to duy is concerned. Since

this has implications for the distribution of sex-specific genetic markers in the wider

population, David Zeitlyn5 and Verkijika G. Fanso6, who the author of this thesis worked

4 This information came from the late Emmanuel Nkem Mbinglo, a paternal brother of the fon.

5 David Zeitlyn is the Professor of Anthropology at the University of Kent who works primarily on the

Mambila population in North-West Cameroon.

closely with during this investigation, conducted some field research in an attempt to

clarify the position.

The problem is that Chem-Langhëë and Fanso (1997) define the rule differently from

Kaberry (1959). The former state that individuals are won nto´ members if they are

'descendants of any fon of Nso´ to the fourth generation through agnatic lines [strictly

patrilineal descent] and to the third generation through uterine connexions [cognatic or

strictly matrilineal descent]' (Chem-Langhëë & Fanso 1997) (this is named Royal Social

Status Rule A; see Figure 2.2). They then go on to state that individuals descended from

one or more won nto´ are duy if either they are descendants of a fon more than four

generations ago along agnatic lines or are descendants of a fon more than three

generations ago along uterine connexions (and in both cases not a descendant of a more

recent fon). Kaberry (1959) simply says 'Descendants of a Fon down to the third or fourth

generation are described as wonto'. There is consequently uncertainty about the status of

members of the fourth generation from a fon. To explore this a fictional family tree was

drawn, designed to fit on a single side of paper, which was used as the basis of a set of

interviews with some knowledgeable Nso´ (five males) conducted by David Zeitlyn and

Verkijika G. Fanso in April 2007. The conclusion drawn (David Zeitlyn personal

communication) was that there really is a degree of uncertainty when there are some female

links in any particular descent line (informants varied on whether the son of either a

second or third generation female descendant of a fon is won nto´ or duy). In practice this

can be exploited tactically as part of Nso´ politics. The 'Kaberry' formulation of the rule

can be developed to remove uncertainty in a variety of ways, for example as 'a person is a

member of won nto´ (down to the fourth generation (if a man) and third generation (if a

woman)) if she or he is both a child of a won nto´ member and a descendant of a fon (this

is named Royal Social Status Rule B; see Figure 2.3)'. The interviews ruled out any

interpretation that won nto´ status is inherited solely along paternal lines. However, it

should be emphasised that uncertainty about group membership has not, to date, been

seen as a problem in Nso´. There are few circumstances when an individual has to state

their category membership and the five Nso´ informants agreed it might be possible for

someone to be accepted in some circumstances as a won nto´ member but in others as a

6 Verkijika G. Fanso teaches in the Department of History at the University of Yaoundé 1, Cameroon, and

is of Nso´ ethnicity.

member of duy. There are formal criteria, critically whether a father has the right to

bestow his daughter in marriage, in which case he is duy, or must offer her to the Palace to

bestow, in which case he is a member of won nto´7. Opinions varied also about whether

the Ŋgwerong8 society could arbitrate when membership was disputed. However, at the

boundary, the question of membership does not appear to be controversial and none of the

informants could recall disputes about category membership arising. These enquiries

were viewed as abstract academic questions, concerned with the overall system, and

were not taken very seriously. The discrepancy between Royal Social Status Rules A and

B sets limits for genetic modelling and historical reconstruction, which are addressed in

the text below.
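The difference between the two formulations can be made concrete with a small sketch. The Python below is illustrative only and is not part of the original analysis: it encodes one reading of Royal Social Status Rules A and B, representing a line of descent as a string of sexes (a hypothetical encoding) running from the fon's child down to the individual being classified. It assumes that a line counts as agnatic only when every linking ancestor is male, and that any line containing a female link is treated as a uterine connexion; the function names are invented for this example.

```python
def is_won_nto_rule_a(lineage: str) -> bool:
    """Rule A (Chem-Langhëë and Fanso 1997), as read here.

    `lineage` lists the sex ("M" or "F") of each person from the fon's
    child down to the individual, so its length is the generation number.
    """
    generation = len(lineage)
    linking_ancestors = lineage[:-1]  # everyone between the fon and the individual
    # Assumption: a line is agnatic only if all linking ancestors are male.
    agnatic = all(sex == "M" for sex in linking_ancestors)
    limit = 4 if agnatic else 3  # fourth generation (agnatic) or third (uterine)
    return 1 <= generation <= limit


def is_won_nto_rule_b(lineage: str) -> bool:
    """Rule B (the developed 'Kaberry' formulation), as read here.

    Every person along the line must themselves be won nto': men down to
    the fourth generation and women down to the third.
    """
    if not lineage:
        return False
    for generation, sex in enumerate(lineage, start=1):
        limit = 4 if sex == "M" else 3
        if generation > limit:
            return False
    return True


if __name__ == "__main__":
    # Hypothetical lines of descent, printed as (lineage, Rule A, Rule B).
    for lineage in ("MMMM", "MFMM", "FFM", "MMMMM"):
        print(lineage, is_won_nto_rule_a(lineage), is_won_nto_rule_b(lineage))
```

Under this reading the two rules differ only for fourth-generation descendants: for example, a fourth-generation male descended through at least one woman ("MFMM") is not won nto´ under Rule A but is won nto´ under Rule B.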

2.1.2. Expectations of sex-specific genetic variation in the Nso´

Analysis of sex-specific9 genetic systems (the non-recombining portion of the paternally

inherited Y-chromosome (NRY) and the maternally inherited mitochondrial DNA

(mtDNA)) has proved useful in elucidating the history of diverse ethnic groups where

well-defined alternative scenarios can be identified (see e.g. Thomas et al. 2000; Tambets

et al. 2004). There was a possibility that genetic data would be consistent with only one of

the two variants of the oral history regarding the origin of the father of the first fon of

Nso´ i.e. that Princess Ngonnso´ was already married to a man of Tikar origin when she

encountered the indigenous hunter-gatherers (Visale) or that after arrival in the

Grassfields she took a Visale husband who consequently fathered her child. To be able to

make such a distinction two conditions must be satisfied. These are a) the distribution of

NRY and mtDNA variation in the Nso´ is consistent with the expectations arising from

the group's declared social practices and b) the NRY profile in the Visale is distinct

from that of migrants from the Tikar plain.

7 As Chem-Langhëë and Fanso (1997) make clear there are controversial distinctions within the category of

won nto´ (not all patrilineally descended males are eligible to be fon, some say that it is only sons conceived

while a man is fon who are eligible to succeed to the Fonship, not those born before his selection). There

are also distinctions within the category of duy: the duy shiŋgwaŋ (duy of the salt) and the duy nsaansa’

(general duy), but these are beyond the scope of this chapter.

8 The Ŋgwerong are the Nso´ regulatory society that is in charge of law and policy enforcement.

9 Strictly these systems are not sex-specific but are inherited in a sex-specific manner. The term is used

throughout this thesis as a matter of convenience.

Figure 2.2: Lineage tree showing the relationship of won nto´ individuals and the transition of won nto´ to duy under Royal

Social Status Rule A (M = male offspring, F = female offspring, * = individual inherits the same NRY type as a fon). Won

nto´ are shown in black and duy in red.

Figure 2.3: Lineage tree showing the relationship of won nto´ individuals and the transition of won nto´ to duy under Royal

Social Status Rule B (M = male offspring, F = female offspring, * = individual inherits the same NRY type as a fon). Won

nto´ are shown in black and duy in red.

2.1.2.1. Expectations arising from the group's declared social practices

To undertake the analysis it was first examined whether it is possible to conclude that

either Royal Social Status Rule A or Royal Social Status Rule B has been followed. If Royal Social

Status Rule A has been followed it would be expected that 33.4% - 46.7% of won nto´

Nso´ males sampled would share identical NRY types while this same NRY type would

be expected asymptotically to approach a frequency of 12.5% in won nto´-descended-duy,

depending on the number of generations since the original fon. However, if Royal Social

Status Rule B has been followed, it would be expected that only 1.0% - 24.1% of sampled

won nto´ males would share the same NRY type. This same NRY type would be expected

to be at a frequency of 12.5% in the won nto´-descended-duy, irrespective of the number

of generations of descent from a fon. These expectations make it possible to establish

whether a) Royal Social Status Rule A, b) Royal Social Status Rule B or c) neither A nor

B has been followed. Generating the expectations given above (33.4% - 46.7% for won

nto´ under Royal Social Status Rule A, 1.0% - 24.1% for won nto´ under Royal Social

Status Rule B and 12.5% for duy under both rules) involves a very detailed process that

would disrupt the narrative of the chapter somewhat and so is fully described at the end of

the chapter in Supplementary Section 2S.1. However, a more concise

description of the process is given below.

Royal Social Status Rules A and B are described above and represented in Figures 2.2

and 2.3 respectively and the expectations of the percentage of individuals expected to

carry a fon's NRY type are derived from these rules. Assumptions when generating these

expectations include: (a) that the Y-chromosome of fons and their patrilineal descendants

can be distinguished from those of non-patrilineal descendants; (b) that the numbers of

males and females throughout each generation are equal; (c) that generations are discrete;

(d) that the number of children a fon has (excluding his heir) is '2n' ('n' of whom are

males); and (e) that the number of children a non-fon won nto´ has is '2y' ('y' of whom are

males). Simple expectations based upon these rules can be generated given these

assumptions. However in this study such expectations are complicated by the sampling

strategy utilised (which is performed as a matter of routine within the TCGA laboratory)

where sampling of individuals who are the brother, father, son or paternal line cousin of

an individual from whom a buccal swab has already been collected is not permitted. In

addition, given the range of ages of samples collected (data not shown but available on

request), it must also be assumed that sampling is from both the most recent and second

most recent generation of adult won nto´ males (for simplicity possible sampling from the

third most recent generation is ignored).

As previously described the key to acquiring won nto´ status is descent from a fon no more than

three and, in certain circumstances, four generations ago. Under Royal Social Status

Rules A and B any particular generation of won nto´ should contain individuals who

descend from the previous four fons, though the contribution of the fourth most recent fon

differs between the two rules and it is the won nto´ that descend from this fourth fon that

lead to different expectations under the two rules.

Any one won nto´ individual can be categorised based on how many generations ago they

descend from a fon (not including the fon from their own generation) and along what

lineage type/path they trace this ancestry (for example the third most recent fon along a

purely patrilineal path, or from the second most recent fon where their mother was a won

nto´ whose own father was a fon). Therefore it is possible to calculate, using the 'n' and

'y' notation introduced earlier for the number of children a fon or won nto´ has (as well as

the other assumptions), the relative contributions to the total won nto´ of any particular

generation for each specific category (based on fon of descent and lineage path) of won

nto´ individual. From these relative contributions it is also possible to use probability

theory to calculate the probability of actually sampling each of these categories given the

sampling strategy utilised and that sampling is from two consecutive but discrete

generations of won nto´. Some of these categories of won nto´ are descended from a fon

down a strictly patrilineal lineage and so will possess the fon NRY type. Summing the

probabilities of sampling these fon NRY type categories and dividing this by the sum of

the probabilities for all possible categories of won nto´ will give the expected ratio of

individuals expected to be sampled that will possess the fon NRY type in the won nto´.

Algebraic simplifications of these ratios for Royal Social Status Rules A and B are

shown below.

Royal Social Status Rule A:

$$\frac{\frac{1}{y+1}\left(n(1+y)^{2}+1\right)+1}{\frac{1}{y+1}\left(3n(1+y)^{2}+1\right)+1}$$

Royal Social Status Rule B:

$$\frac{\frac{1}{y+1}\left(n(1+y)^{2}+1\right)+1}{\frac{1}{y+1}\left(4ny^{3}+10ny^{2}+9ny+3n+1\right)+1}$$

The actual values of these ratios will vary depending on which values of 'n' and 'y' are

used. In order to take account of uncertainty concerning the real values of 'n' and 'y' a

range of different combinations of 'n' and 'y' extending from 1 to 25 were substituted

into the above expressions. Under Royal Social Status Rule A 33.4% - 46.7% of won nto´

males sampled should possess a fon's NRY type, while 1.0% - 24.1% should do so

under Royal Social Status Rule B.
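As a check on the arithmetic, the two simplified ratios can be evaluated over the stated range of 'n' and 'y'. The following Python sketch is not the author's original calculation; it assumes that 'n' and 'y' take every integer value from 1 to 25 (the text does not state the step size), and the function names are hypothetical. Under those assumptions the minimum and maximum over the grid round to the percentage ranges quoted above.

```python
from itertools import product


def _shared_term(n: int, y: int) -> float:
    """Numerator common to both rules: won nto' males carrying the fon NRY type."""
    return (n * (1 + y) ** 2 + 1) / (y + 1) + 1


def rule_a_ratio(n: int, y: int) -> float:
    """Expected share of sampled won nto' males with the fon NRY type, Rule A."""
    total = (3 * n * (1 + y) ** 2 + 1) / (y + 1) + 1
    return _shared_term(n, y) / total


def rule_b_ratio(n: int, y: int) -> float:
    """Expected share of sampled won nto' males with the fon NRY type, Rule B."""
    total = (4 * n * y**3 + 10 * n * y**2 + 9 * n * y + 3 * n + 1) / (y + 1) + 1
    return _shared_term(n, y) / total


def ratio_range(ratio, values=range(1, 26)):
    """Return (min, max) of `ratio` over all (n, y) pairs, as percentages to 1 d.p."""
    results = [ratio(n, y) for n, y in product(values, repeat=2)]
    return round(100 * min(results), 1), round(100 * max(results), 1)


if __name__ == "__main__":
    print("Rule A range (%):", ratio_range(rule_a_ratio))
    print("Rule B range (%):", ratio_range(rule_b_ratio))
```

In both cases the largest value occurs at n = y = 1 and the smallest at n = y = 25, because each ratio falls as either parameter grows.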

As duy status is inherited patrilineally the expected frequency of fon NRY types within

this social category is much simpler to predict. One additional assumption is that

duy status acquired by methods other than descent from a fon is ignored. Ideally every fon

since the first fon of Nso´ except for the most recent three fons will contribute duy to the

current generation. Though the numbers of duy descended from any fon should increase

with every passing generation, the actual proportion that possess the fon NRY type will,

as shown in Figures 2.2 and 2.3, always be 12.5%. Under Royal Social Status Rule A the

proportion never actually reaches 12.5% but asymptotically approaches it depending on

the number of generations since the first fon. This is because for the fourth most recent

fon the lineage possessing the fon NRY type does not become duy until the subsequent

generation. However the effect of this solitary fon should be negligible.

The overall pattern of a shared NRY type in the won nto´ and duy should be most evident

with respect to a battery of rapidly evolving microsatellites, a finding which could

demonstrate that male line continuity of fons has been maintained for at least the past four

generations. The NRY of the won nto´ would be expected to be significantly less diverse

than those of the other social classes. Furthermore, if the rules governing selection of a

fon had been strictly adhered to and there had been no false paternity in the line of fons

since the foundation of the Nso´ then this homogenous NRY type would be that possessed

by the first fon of Nso´ and his father. In addition, given a) the requirement for the mother

of a fon to be a commoner, b) women move more freely among social categories in the

patrilineal Nso´ society and c) extreme polygyny is practised by fons, it would be

expected that the distribution of mtDNA types among all four social classes would be similar.
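One common way to quantify the expectation that the won nto´ NRY pool is less diverse than that of the other social classes is Nei's unbiased gene diversity, h = (n / (n - 1)) (1 - Σ p_i²), where p_i is the frequency of haplotype i in a sample of n chromosomes. The Python sketch below assumes that statistic, which is not named in this section, and uses invented six-microsatellite haplotypes purely for illustration; none of the values are data from this study.

```python
from collections import Counter


def gene_diversity(haplotypes) -> float:
    """Nei's unbiased gene diversity for a list of hashable haplotypes."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two chromosomes are needed")
    counts = Counter(haplotypes)
    sum_sq_freq = sum((count / n) ** 2 for count in counts.values())
    return n / (n - 1) * (1 - sum_sq_freq)


# Hypothetical repeat counts at six NRY microsatellites (illustration only).
WON_NTO_EXAMPLE = [(15, 12, 21, 10, 11, 13)] * 4 + [(16, 12, 21, 10, 11, 13)]
MTAAR_EXAMPLE = [
    (15, 12, 21, 10, 11, 13),
    (16, 12, 21, 10, 11, 13),
    (15, 12, 22, 10, 11, 14),
    (17, 12, 21, 11, 11, 13),
    (15, 13, 21, 10, 11, 15),
]

if __name__ == "__main__":
    print("won nto' example:", gene_diversity(WON_NTO_EXAMPLE))
    print("mtaar example:", gene_diversity(MTAAR_EXAMPLE))
```

A sample dominated by one shared haplotype gives a value well below that of a sample in which every haplotype is different, which is the qualitative pattern the expectation describes.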

2.1.2.2. Expectations arising from a possible distinct Visale NRY profile

If the above expectations were met then current knowledge of NRY variation in sub-

Saharan Africa could be used to explore the oral history of the Nso´. In a previous

publication Underhill et al. (2001) suggested that previously common NRY lineages may,

throughout sub-Saharan Africa, have been replaced by a lineage associated with the

expansion of the Bantu-speaking peoples (EBSP)10. Underhill et al. (2001) and Scozzari

et al. (1999) have identified the modal NRY of the EBSP to be E3a (using the

nomenclature of the Y-chromosome consortium (2002)). It would be expected that the

putative replaced NRY lineages would be observed at low frequencies with a patchy

distribution across sub-Saharan Africa. These NRY types would be remnants of past

hunter-gatherer populations that have been overwhelmed by the E3a NRY type and

become isolated from each other for a significant time period. This would be reflected by

high genetic distances at the microsatellite haplotype level among geographically

separated groups possessing the same SNP defined low frequency NRY lineages

(Underhill et al. 2001). However, these replaced NRY lineages may be found at high

local frequencies in existing populations that pre-date the EBSP.

It would therefore be reasonable to assume that the hunter-gatherer Visale may have

possessed one of these putative replaced NRY lineages at a significant frequency in

comparison to the Tikar (who speak a Bantoid language, so are connected to the EBSP).

If the signature NRY type found in the won nto´ was shown to be one of the pre-EBSP

lineages and was also not found in neighbouring Tikar populations (as well as other

nearby ethnic groups that may have experienced some contact with Tikar in the past) this

would favour the scenario whereby the immigrant princess married an indigenous Visale.

Conversely, the presence of a homogenous E3a lineage would favour Princess Ngonnso´

travelling with a husband of Tikar origin who then fathered the first fon of Nso´.

10 The considerable simplification implicit in this statement is noted. Gene flow is not necessarily associated

with language dispersion but there is enough hard data to suggest a close correlation for this to be sufficient

for the present chapter which is not primarily about the Bantu expansion. Some of the complexity has been

discussed by MacEachern (2000), Zeitlyn and Connell (2003) and Vansina (1995).

2.2. Materials and Methods

2.2.1. Sample Collection Procedure

Buccal swabs were collected from males over eighteen years old in the Cameroonian

town of Kumbo (n=151). In addition, buccal swabs were collected in four other western

Grassfields towns (Bafut (n=103), Foumban (n=117), Nkambe (n=82) and Wum

(n=116)), seven towns or villages on the Tikar plain (Atta (n=29), Bankim (n=73), Magba

(n=96), Nyamboya (n=98), Sabongari (n=94), Somie (n=100) and Songkolong (n=43))

and one town north of the Tikar plain (Mayo Darle (n=111)) (see Figure 2.1). All samples

were collected anonymously with informed consent. Individuals were recruited in a

fashion blinded to social class and a local resident assisted in ensuring that only one in

each of the following sets participated: a) brothers, b) father and sons, c) grandfather and

paternal line grandsons. The practice of not sampling individuals who are brothers,

fathers, sons or paternal line cousins of participants was adopted for ethical reasons and to

ensure consistency with other DNA sample collections at The Centre for Genetic

Anthropology (TCGA). Sociological data were also collected from each individual

including age, current residence, birthplace, self-declared cultural identity (and Nso´

social class) and religion for the individual and the individual's father, mother, paternal

grandfather and maternal grandmother.

Standard phenol-chloroform DNA extractions were performed on all samples (see

Appendix C).

2.2.2. Y-chromosome typing

Standard TCGA kits were used to characterise six microsatellites (DYS19, DYS388,

DYS390, DYS391, DYS392, DYS393) and eleven biallelic Unique Event Polymorphism

(UEP) markers (92R7, M9, M13, M17, M20, SRY+465, SRY4064, SRY10831, sY81,

Tat, YAP), as described by Thomas et al. (1999). Microsatellite repeat sizes were

assigned according to the nomenclature of Kayser et al. (1997). Where necessary an

additional marker, p12f2, was typed as described by Rosser et al. (2000). NRY

Haplogroups were defined by the twelve UEP markers according to the nomenclature

proposed by the Y-chromosome Consortium (2002) (see Figure 2.4).

Figure 2.4: Genealogical relationships of UEP markers used to define NRY

haplogroups

These multiplex UEP/ microsatellite kits have already been shown to be reliable under a

wide range of conditions, consistently giving similar signal intensities across all UEPs

and microsatellites within each kit (Thomas, Bradman & Flinn 1999). Therefore any

multiplex runs that showed at least one UEP or microsatellite peak of substantially low

intensity were repeated. Any samples that gave UEP-1 and UEP-2 results that were

incompatible with the known phylogenetic tree for the NRY were also retyped for both kits.

Microsatellite results were also analysed for outliers and homoplasy amongst UEP

haplogroups and retyped for confirmation.

The UEPs used in this study were chosen primarily on the basis of the prior development and

standard application of multiplex UEP typing kits in the TCGA laboratory. It was

recognised beforehand that the use of these UEPs leads to a relatively crude resolution of

NRY haplogroups, with only a few markers likely to be relevant to investigating sub-

Saharan African individuals (which tend to fall within haplogroups A, B and E).

However the use of microsatellites in this study should aid in further resolving the fine-

scale phylogenetic relationship of samples (though further SNP typing would still be

preferred; see Chapter 5 for further discussion). It should be noted that only six

microsatellites were typed in this study, which limits the effectiveness of elucidating

these relationships somewhat and further typing would also have been preferable (over 50

NRY microsatellites have currently been identified), especially with regard to the

estimation of TMRCA dates. Unfortunately, given economic restrictions and the time

available, the development and typing of further UEPs and microsatellites was not

possible, though this is certainly a priority for future work (see Chapter 5).

To characterise NRY lineages potentially associated with populations replaced by the

EBSP as proposed by Underhill et al. (2001) the samples described above were analysed

(given a group label of ‘Grassfields of Cameroon’ (n=1213)) along with unpublished

data (n=8072) held in The Centre for Genetic Anthropology database consisting of

sample sets collected from populations in sub-Saharan Africa: northern Cameroon

(n=778), southern Cameroon (n=174), Ethiopia (43 different locations covering most of

the country) (n=3368), north eastern Ghana (n=258), north western Ghana (n=471), south

eastern Ghana (n=161), south western Ghana (n=206), central Malawi (n=207), northern

Malawi (n=56), Mozambique (n=86), Cross River region-Nigeria (n=1247), southern

Senegal (n=95), western Senegal (n=90), Pretoria-South Africa (n=96), Sudan (n=647),

Tanzania (n=45), Uganda (n=36) and Zimbabwe (n=51).

2.2.3. mtDNA typing

The mtDNA HVS-1 region of all samples collected from Kumbo was sequenced as

described by Thomas et al. (2002) except that primers conL1-mod, conL2 and conH3

were replaced by conL849 (CTA TCT CCC TAA TTG AAA ACA AAA TA), conL884

(TGT CCT TGT AGT ATA A) and conHmt3 (CCA GAT GTC GGA TAC AGT TC)

respectively. HVS-1 Variable Site Only (VSO) haplotypes were determined for all

samples with sequence data covering a minimum of nucleotides 16020-16400 by

comparison to the Cambridge Reference Sequence (Anderson et al. 1981), with

haplotypes consisting of the nucleotide positions where substitutions, insertions or

deletions occurred as well as the actual base change.

Each sample’s chromatogram was manually inspected for generally high levels of

background noise across its whole length of sequence. The 5ʹ and 3ʹ ends of raw

chromatograms were trimmed until at least 10 out of 15 bases at these ends had

confidence scores above 25%. The ends were then trimmed further by manually

inspecting the sequence. For each 96 sample sequencing run each position with a

proposed SNP, insertion, deletion or ambiguous position was examined manually. All

samples with any ambiguous sites after manual curation were sequenced again. In

addition sequencing of samples was repeated when the forward and reverse sequences did

not match.

2.2.4. Statistical and Population Genetic Analysis

The Pearson's chi-square goodness of fit test was performed within the R programming

environment. Genetic diversity, h, (the probability of randomly sampling two different

haplotypes in a population) and its standard error were estimated from the unbiased formulae

of Nei (1987). Genetic differences between pairs of populations when individuals in

populations were described by mtDNA HVS-1 VSO haplotypes were assessed using an

Exact Test of Pairwise Population Differentiation (ETPD) with 10,000 Markov steps

(Raymond & Rousset 1995; Goudet et al. 1996). This test is analogous to a Fisher’s

Exact test (Lee et al. 2004) but the size of the contingency table is extended to the number

of populations being compared (two in a pairwise population comparison, two or greater

in a global test) by the total number of different haplotypes present. Due to the

complexity introduced by the sheer number of extra rows and columns a null distribution

of tables to test against the observed data is generated using a random walk via a Markov

chain rather than comparison to some predefined distribution such as the hypergeometric

distribution.
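
The logic of the exact test can be illustrated with a short sketch. The Python below is an illustration only and is not the Arlequin implementation: it replaces the Markov chain random walk with plain random permutation of individuals between two populations, which samples tables with the same fixed margins. The function names, the number of permutations and the seed are assumptions made for the example.

```python
import math
import random
from collections import Counter

def table_stat(labels, haplotypes):
    """Sum of log(n_ij!) over the population-by-haplotype table.

    With both margins fixed, the probability of a table is proportional to
    1 / prod(n_ij!), so a larger value means a less probable table.
    """
    counts = Counter(zip(labels, haplotypes))
    return sum(math.lgamma(n + 1) for n in counts.values())

def exact_test_mc(pop_a, pop_b, n_perm=2000, seed=1):
    """Monte Carlo P-value for differentiation between two samples."""
    rng = random.Random(seed)
    haplotypes = list(pop_a) + list(pop_b)
    labels = ["A"] * len(pop_a) + ["B"] * len(pop_b)
    observed = table_stat(labels, haplotypes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # reassign individuals to populations at random
        # Count tables that are as probable as, or less probable than, the observed one.
        if table_stat(labels, haplotypes) >= observed - 1e-9:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)
```

Two samples with identical haplotype counts should give a P-value close to 1, while two samples sharing no haplotypes should give a very small one.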

Population Genetic Structure was estimated using Hierarchical Analysis of Molecular

Variance (AMOVA) (Excoffier, Smouse & Quattro 1992) based on a particular mutation

model (which allowed the evolutionary distance between pairs of haplotypes to be taken

into account) to generate a single Fixation Index statistic, FST, when a simple structure of

populations within a single group was defined. The significance of Fixation Indices was

assessed by randomly permuting individuals (given that only haploid systems were

considered) among populations or groups of populations, depending on the Fixation Index

being tested. After each of the 10,000 rounds of permutation performed, Fixation Indices

were recalculated to create a null distribution. Population pairwise genetic

distances were estimated from Analysis of Molecular Variance φST values (Excoffier,

Smouse & Quattro 1992). The genetic distances used were a) FST (Reynolds, Weir &

Cockerham 1983) (when individuals in populations were described by UEP haplogroups)

and b) RST (Slatkin 1995) (when NRYs of a particular haplogroup were characterised by

the six microsatellites). Significance of genetic distances was assessed by permutation of

individuals as described above for testing significance of Fixation Indices. All the above

was performed using Arlequin software (Schneider, Roessli & Excoffier 2000). AMOVA

is analogous to a traditional analysis of variance (ANOVA) (Sokal & Rohlf 1994) except

that it takes into account the degree of difference between haplotypes. In addition all

hypotheses are tested using permutation analysis and so no assumption of a normal

distribution is required. However assumptions of AMOVA include that all samples are

independent and randomly chosen, that mate choice is random and that inbreeding does

not occur within the populations.

Principal Coordinates Analysis (PCO) (Gower 1966) was performed using the ‘R’

statistical package (www.R-project.org) by implementing the ‘cmdscale’ function found

in the ‘mva’ package on pairwise FST matrices and visualised using MS Excel.
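
For readers without access to R, the classical scaling performed by cmdscale can be sketched in plain Python. This is a minimal illustration and not the routine used in this study: the function name, the power-iteration solver and the example distance matrix are assumptions, and the sketch presumes that the leading eigenvalues are positive and distinct.

```python
import math

def principal_coordinates(dist, k=2, iters=1000):
    """Classical scaling (principal coordinates) of a symmetric distance matrix.

    Returns (coords, eigenvalues); coords[i] holds the first k coordinates
    of object i.
    """
    n = len(dist)
    # Gower's transformation: double-centre the matrix of -0.5 * d^2.
    a = [[-0.5 * dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(r) / n for r in a]
    grand = sum(row_mean) / n
    b = [[a[i][j] - row_mean[i] - row_mean[j] + grand for j in range(n)]
         for i in range(n)]
    coords = [[0.0] * k for _ in range(n)]
    eigenvalues = []
    for axis in range(k):
        v = [math.cos(i + axis + 1.0) for i in range(n)]  # arbitrary start vector
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break  # no variation left to extract
            v = [x / norm for x in w]
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        vv = sum(x * x for x in v)
        lam = sum(v[i] * w[i] for i in range(n)) / vv  # Rayleigh quotient
        eigenvalues.append(lam)
        scale = math.sqrt(lam / vv) if lam > 1e-12 else 0.0
        for i in range(n):
            coords[i][axis] = v[i] * scale
        # Deflate so that the next pass finds the next axis.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j] / vv
    return coords, eigenvalues

# Made-up distances between three objects lying on a line at 0, 1 and 3.
coords, eig = principal_coordinates([[0, 1, 3], [1, 0, 2], [3, 2, 0]])
```

The proportion of variation explained by each axis can then be taken as its eigenvalue divided by the sum of the positive eigenvalues.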

2.2.5. Dating of the Y*(xBR,A3b2) clade

Y-time software (Behar et al. 2003) (URL:

http://www.ucl.ac.uk/tcga/software/index.html) was used to estimate the TMRCA, as well

as its associated confidence intervals, of the Y*(xBR,A3b2) NRYs identified in the Nso´

under three schemes:

(Scheme A): the TMRCA for all duy sampled who possess Y*(xBR,A3b2); (Scheme B):

the TMRCA for all nshiylav and mtaar sampled who possess Y*(xBR,A3b2); and

(Scheme C): the TMRCA for all won nto´ and duy sampled who possess Y*(xBR,A3b2).

The analysis utilised six microsatellites, DYS19, DYS388, DYS390, DYS391, DYS392

and DYS393. Due to uncertainty in its mutation behaviour (it may not be mutating in a

consistent stepwise manner as it displays a bimodal distribution within haplogroup

P*(xR1a) (Thomas et al. 2000)) all analysis was also repeated without DYS388.

It should be noted that, as all samples collected were unrelated at the paternal grandfather

level, all TMRCA estimates were effectively those of the samples’ paternal grandfathers.

Consequently, after all TMRCA point estimates and confidence intervals were calculated

they were increased by two generations (40 years) to allow for the effect of the sampling

strategy utilised in this study.

2.2.5.1. Y-time Parameters

The Y-time parameters used, with their corresponding Y-time code given in parenthesis,

are listed below:

Ancestral haplotype (anc) = ‘14 20 11 14 13’ or ‘14 12 20 11 14 13’, number of

chromosomes (n) = Various (see below), number of microsatellite loci (nloci) = 5 or 6,

mutation rate per generation under Simple Stepwise Mutation Model (mua) =

0.001925752 (Behar et al. 2003), mua under Linear Length Dependent Stepwise Mutation

Model (mua) = -0.004758677 (see Y-time user guide), mub under Linear Length

Dependent Stepwise Mutation Model (mub) = 4.46E-04 (see Y-time user guide), lower

and upper boundary for equal-tailed 95% confidence limits (q) = 0.025-0.975, upper

boundary for one-tailed 95% confidence limits (q) = 0.05 or 0.95, the number of

simulations to perform at each value of T (MCruns) = 1000 and population growth model

(Rgrowth) =Various (see below).

The ancestral haplotype was chosen based on its status as the modal haplotype in

Schemes A, B and C. This analysis does not take into account error in the choice of

ancestral haplotype.

2.2.5.2. Mutation Rate and Mutation Models

As there is limited data available for individual loci, the mutation rate used in this chapter

of 0.00193 is an average value of numerous pedigree-based estimates of tri and tetra NRY

microsatellites (Heyer et al. 1997; Bianchi et al. 1998; Forster et al. 1998; Kayser et al.

2000) as utilised by Behar et al. (2003). It should be noted that more refined estimates are

now available such as from Gusmao et al. (2005) but the average from this study is only

around 10% lower so should not greatly impact the results presented here, increasing any

date estimates by approximately 11%. While pedigree-based estimates tend to agree with

those estimated from sperm-based analysis (Holtkemper et al. 2001), those based on

unrelated population data (which involves counting the number of mutations in a

phylogenetic network for a population and calibrating against a known event in that

population) are almost 10-fold lower (Caglia et al. 1997; Forster et al. 2000), which

would increase any date estimates by 900%. While there is no clear consensus of what

methodologies of mutation rate estimation are most reliable, the number of studies

utilising pedigree-based analysis far outweigh that of the population-based approach and

there are also a number of assumptions applied by the population-based method (e.g. the

date used for calibration of mutation rate, only considering mutations that change by one

step), that may be leading to an underestimation of the mutation rate (Zhivotovsky et al.

2004).
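
The percentage changes quoted above follow from simple arithmetic if a date estimate is assumed to scale inversely with the mutation rate. That proportionality is an assumption of this sketch, as is the function name:

```python
def date_change_percent(rate_ratio):
    """Percentage change in a date estimate when the mutation rate is
    multiplied by rate_ratio, assuming the date scales as 1 / rate."""
    return (1.0 / rate_ratio - 1.0) * 100.0

print(round(date_change_percent(0.9), 1))  # rate 10% lower
print(round(date_change_percent(0.1), 1))  # rate 10-fold lower
```

A rate that is 10% lower lengthens dates by roughly 11%, whereas a 10-fold lower rate lengthens them by 900%.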

Under a Simple Stepwise Mutation Model the mutation rate is independent of the number

of repeats and when a mutation occurs the repeat length will change by one repeat, with

an increase or decrease being equally likely (e.g. a locus with 12 repeats is equally likely to

mutate to 13 or 11 repeats). This concept can be extended to the Linear Length Dependent

Stepwise Mutation Model, a more realistic representation of the mutation process. Under

this model increases and decreases by one repeat size are again equally likely. However

the rate at which these mutations occur increases as a linear function of microsatellite

length (i.e. the greater the number of repeats, the more likely a mutation is to occur). This

mechanism is based on the principle that if mutations are occurring because of replication

slippage and can occur between any two adjacent repeat units with equal probability, the

more repeat units that are available the more likely a mutation will take place. The

changing mutation rate can be represented by the equation µ = a + bL, where µ is the

mutation rate, L is the repeat length at a particular time and a and b are constants.
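
The two mutation models can be sketched as follows, using the parameter values listed in Section 2.2.5.1. This is an illustration rather than the Y-time code: the function names are hypothetical, and clamping the rate at zero is an assumption made here because a + bL is negative for short alleles with these constants.

```python
import random

# Y-time parameter values quoted in Section 2.2.5.1.
SSM_RATE = 0.001925752
LLD_A = -0.004758677
LLD_B = 4.46e-04

def mutation_rate(repeats, model="LLD"):
    """Per-generation mutation rate for a locus with the given repeat count."""
    if model == "SSM":
        return SSM_RATE  # independent of repeat length
    # Clamping at zero is an assumption: a + b*L is negative for short alleles.
    return max(0.0, LLD_A + LLD_B * repeats)

def mutate(haplotype, model="LLD", rng=random):
    """Advance a haplotype (a list of repeat counts) by one generation."""
    new = []
    for repeats in haplotype:
        if rng.random() < mutation_rate(repeats, model):
            repeats += rng.choice((-1, 1))  # one-step gain or loss, equally likely
        new.append(repeats)
    return new
```

Under these constants a locus with 14 repeats has a rate of about 0.0015 per generation, while a locus with 20 repeats has a rate of about 0.0042.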

2.2.5.3. Population Growth Models

Various population growth models (Rgrowth) were tested for Schemes A, B and C.

Below is a description of the rationale for selecting the various growth models.

2.2.5.3.1. Star Genealogy

When a simple microsatellite network of all Y*(xBR, A3b2) Y-chromosomes in the Nso′

is drawn, the network strongly resembles a Star genealogy. However, the likely

genealogies of the samples used in Schemes A, B and C will probably be highly

correlated, resulting in an underestimation of confidence intervals.

2.2.5.3.2. Rgrowth=0

This setting results in an assumed genealogy of constant size, which, except in the case of

an extreme bottleneck, should take into account any uncertainty in the genealogy with

respect to the level of tree correlation. A consequence of this is that confidence intervals

are likely to be overestimated so this approach is conservative.

2.2.5.3.3. Other Rgrowth

When not assigned ‘STAR’ or ‘0’, Rgrowth is determined by two other independent

parameters, N and r, which are related by the following equation:

Rgrowth = N * r

where N = the current effective population size and r = instantaneous growth rate¹¹.

Separate values for N and r were considered (including r = 0.05; see below) for Schemes A,

B and C and are discussed below.

r = 0.05

A value of r = 0.05 was adopted as a rough lower boundary for the growth rate in a

sub-Saharan African population, having regard to calculations of population sizes in

sub-Saharan Africa during the period 400BC-1970AD (see Table 2.1.1 from

Cavalli-Sforza, Menozzi & Piazza 1994).

Scheme A:

An effective population size (N) for Nso′ duy individuals with a Y*(xBR, A3b2) NRY

was estimated on the basis that (a) all males with Y*(xBR, A3b2) Y-chromosomes in the

Nso´ duy are paternal line descendants of the first fon, and no males with other Y-

chromosomes are descendants of the first fon, (b) there are 200,000 Nso′ (according to the

latest census (Second general census of population and housing of Cameroon. Volume

3:preliminary analysis 1987)) and half of the Nso′ are male, (c) 51/132 of Nso′ males are

duy (estimated from Nso′ DNA sample survey), (d) effective population size is typically

taken as 1/10th of the census population size, (e) samples were collected randomly from

members of the four classes and (f) nine chromosomes out of 51 duy were Y*(xBR,

A3b2). Therefore the effective population size (N) was calculated as:

N = (200,000 * 0.5 * 9) / (132 * 10) = 682

¹¹ The instantaneous growth rate assumes overlapping generations and a constant breeding period.

r = 0.252 and 0.706

Two other estimates for the instantaneous growth rate were calculated using the

continuous population growth model based on features of the oral history:

lnx = lnx0 + rt

where x = current actual population size, x0 = initial actual population size, t = time in

generations and r = instantaneous growth rate.

As the interest here is in the TMRCA from the first fon, x0 = 1, while x = 200,000 * 0.5 *

(9/132) (as above) = 6818. Two different estimates of r were calculated using upper and

lower boundaries for the date of origin taken from alternative accounts of the oral

tradition (Mzeka 1990).

The earliest time of origin of the Nso´ from oral history is 700 years ago, or

35 generations at 20 years per generation. The lower boundary for the instantaneous

growth rate using the oral history is therefore:

r(lower) = (ln(6818) - ln(1))/35 = 0.252

The most recent time of origin of the Nso´ from oral history is 250 years ago, or

12.5 generations at 20 years per generation. The upper boundary for the instantaneous

growth rate using the oral history is therefore:

r(upper) = (ln(6818) - ln(1))/12.5 = 0.706

Scheme B:

An effective population size (N) for Nso′ nshiylav and mtaar individuals with a Y*(xBR,

A3b2) NRY was estimated on the basis that (a) all males with Y*(xBR, A3b2) Y-

chromosomes in the Nso´ nshiylav and mtaar are paternal line descendants of the Visale,

and no males with other Y-chromosomes are descendants of the Visale, (b) there are

200,000 Nso′ (according to the latest census (Second general census of population and

housing of Cameroon. Volume 3:preliminary analysis 1987)) and half of the Nso′ are

male, (c) 63/132 of Nso′ males are either nshiylav or mtaar (estimated from the Nso′

DNA sample survey), (d) effective population size is typically taken as 1/10th of the

census population size, (e) samples were collected randomly from members of the four

classes and (f) eleven chromosomes out of 63 nshiylav and mtaar were Y*(xBR, A3b2).

Therefore the effective population size (N) was calculated as:

N = (200,000 * 0.5 * 11) / (132 * 10) = 833

r = 0.161 and 0.450

lnx = lnx0 + rt

where x = current actual population size, x0 = initial actual population size, t = time in

generations and r = instantaneous growth rate.

According to oral tradition, there were 30 Visale males when Princess Ngonnso′

encountered the Visale (x0 = 30) while x = 200,000 * 0.5 * (11/132) (as above) = 8333.

Two different estimates of r were calculated using upper and lower boundaries for the

date of origin taken from alternative accounts of the oral tradition (Mzeka 1990).

r(lower) = (ln(8333) - ln(30))/35 = 0.161

r(upper) = (ln(8333) - ln(30))/12.5 = 0.450

Scheme C:

N=1439

An effective population size (N) for Nso′ won nto´ and duy individuals with a Y*(xBR,

A3b2) NRY was estimated on the basis that (a) all males with Y*(xBR, A3b2) Y-

chromosomes in the Nso´ won nto´ and duy are paternal line descendants of the first fon,

and no males with other Y-chromosomes are descendants of the first fon, (b) there are

200,000 Nso′ (according to the latest census (Second general census of population and

housing of Cameroon. Volume 3:preliminary analysis 1987)) and half of the Nso′ are

male, (c) 69/132 of Nso′ males are either won nto´ or duy (estimated from the Nso′ DNA

sample survey), (d) effective population size is typically taken as 1/10th of the census

population size, (e) samples were collected randomly from members of the four classes

and (f) 19 chromosomes out of 69 won nto´ and duy were Y*(xBR, A3b2). Therefore the

effective population size (N) was calculated as:

N = (200,000 * 0.5 * 19) / (132 * 10) = 1439

r = 0.274 and 0.766

lnx = lnx0 + rt

where x = current actual population size, x0 = initial actual population size, t = time in

generations and r = instantaneous growth rate.

As the interest here is in the TMRCA from the first fon, x0 = 1, while x = 200,000 * 0.5 *

(19/132) (as above) = 14394. Two different estimates of r were calculated using upper

and lower boundaries for the date of origin taken from alternative accounts of the oral

tradition (Mzeka 1990).

r(lower) = (ln(14394) - ln(1))/35 = 0.274

r(upper) = (ln(14394) - ln(1))/12.5 = 0.766
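
The calculations for Schemes A, B and C share the same form and can be checked with a short script. The sketch below uses only the figures quoted above; the function and constant names are illustrative, and x is left unrounded, which does not alter the results at the precision reported.

```python
import math

CENSUS = 200_000       # Nso' census size quoted in the text
MALE_FRACTION = 0.5
SURVEYED_MALES = 132   # males in the Nso' DNA sample survey
NE_RATIO = 10          # effective size taken as 1/10th of census size
T_LOWER_R = 35.0       # generations: 700 years at 20 years per generation
T_UPPER_R = 12.5       # generations: 250 years at 20 years per generation

def scheme_parameters(carriers, founders):
    """Return (N, x, r_lower, r_upper) for one scheme."""
    x = CENSUS * MALE_FRACTION * carriers / SURVEYED_MALES
    n_eff = x / NE_RATIO
    # Continuous growth model: ln(x) = ln(x0) + r * t, solved for r.
    log_growth = math.log(x) - math.log(founders)
    return n_eff, x, log_growth / T_LOWER_R, log_growth / T_UPPER_R

# (scheme, Y*(xBR,A3b2) carriers sampled, initial population size x0)
for name, carriers, founders in (("A", 9, 1), ("B", 11, 30), ("C", 19, 1)):
    n_eff, x, r_lo, r_hi = scheme_parameters(carriers, founders)
    print(f"Scheme {name}: N={n_eff:.0f} x={x:.0f} r={r_lo:.3f}, {r_hi:.3f}")
```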

2.2.6. Comparison of duy vs nshiylav and mtaar genealogy depths

In order to establish whether the nshiylav and mtaar Y*(xBR,A3b2) genealogy was

deeper than that of the duy, the probability of obtaining a difference equal to or more

extreme than that observed between a) duy and b) nshiylav and mtaar was estimated,

assuming the two groups were drawn from the same genealogy. This methodology is

described below.

The duy and nshiylav and mtaar Y*(xBR,A3b2) NRYs were grouped together (n=20) and

the Average Squared Distance for these samples calculated (ASD=0.0667 (0.06 without

using DYS388)). Trees were then simulated under this ASD value and the two mutation

and four demographic criteria described below.

For each simulated tree the 20 samples at the tips of the tree were randomly assigned to

either a group of final size n=9 (representing the duy) or a group of final size n=11

(representing the nshiylav and mtaar). The ASD was then calculated for each group. If the

ASD of the group with n=9 was equal to 0.0 (the ASD of the original duy) the pair of

ASD results were recorded. If the ASD of the group with n=9 was greater than 0.0 the

results were discarded. This process was repeated until 10,000 pairs of ASD values were

recorded where the group with n=9 had an ASD of 0.0.

A P-value was estimated by calculating the proportion of pairs of ASD values where the

difference between the two values was equal to or greater than 0.1212 (the ASD of the

original nshiylav and mtaar (0.1091 without using DYS388)), with P<0.05 taken as the

level of significance.
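
The resampling procedure described above can be outlined in code. The sketch below is not the adapted Y-time routine: it assumes ASD is the squared repeat difference from the ancestral haplotype averaged over chromosomes and loci, and it takes the tree simulator as a hypothetical callable, simulate_tips, which is not implemented here.

```python
import random

def asd(haplotypes, ancestral):
    """Average squared distance from the ancestral haplotype,
    averaged over both chromosomes and loci."""
    total = sum((allele - anc) ** 2
                for hap in haplotypes
                for allele, anc in zip(hap, ancestral))
    return total / (len(haplotypes) * len(ancestral))

def split_asd(tips, ancestral, n_first=9, rng=random):
    """Randomly split simulated tips into a group of n_first and the rest,
    returning the ASD of each group."""
    shuffled = list(tips)
    rng.shuffle(shuffled)
    return asd(shuffled[:n_first], ancestral), asd(shuffled[n_first:], ancestral)

def conditional_p_value(simulate_tips, ancestral, observed_diff,
                        n_keep=10_000, rng=random):
    """Proportion of retained splits whose ASD difference reaches observed_diff.

    simulate_tips is a hypothetical callable returning the 20 tip haplotypes
    of one simulated tree; only splits whose first group has ASD 0 are kept.
    """
    kept = extreme = 0
    while kept < n_keep:
        first, second = split_asd(simulate_tips(), ancestral, rng=rng)
        if first == 0.0:
            kept += 1
            if second - first >= observed_diff:
                extreme += 1
    return extreme / n_keep
```

Under this definition, eleven chromosomes whose squared differences sum to 8 across six loci give 8/66 ≈ 0.1212, and pooling them with nine unmutated chromosomes gives 8/120 ≈ 0.0667. These values agree with the figures quoted above, although the haplotypes themselves are not reproduced here.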

This analysis was performed for four demographic models and two mutation models, a

Simple Stepwise Mutation Model and a Linear Length Dependent Stepwise Mutation

Model.

The four demographic models used were:

a) ‘Star’

b) Rgrowth=0

c) Rgrowth= 754.5041

d) Rgrowth=10,000,000

a) and b) were used as these are the two most extreme demographic scenarios

possible.

c) was used as this is a more realistic demographic model and was calculated in a similar

manner to the parameters described in section 2.2.5. Here N=1515 and r=0.498.

d) was used as it is an unfeasibly large growth model that is still not as extreme as a

‘Star’ demography.
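
The value of Rgrowth used for model c) can be reproduced from the quantities given in section 2.2.5. The inputs in the sketch below are an inference rather than something stated in the text: it pools the 9 duy and 11 nshiylav and mtaar carriers, and assumes an initial population of 30 and a time of 12.5 generations.

```python
import math

census_males = 200_000 * 0.5
carriers, surveyed = 9 + 11, 132   # duy plus nshiylav and mtaar carriers
x = census_males * carriers / surveyed
n_eff = x / 10                     # effective size as 1/10th of census size
# Assumed inputs (not stated in the text): 30 founders, 12.5 generations.
r = (math.log(x) - math.log(30)) / 12.5
rgrowth = n_eff * r                # Rgrowth = N * r
print(round(n_eff), round(r, 3), round(rgrowth, 2))
```

With these inputs N is about 1515, r about 0.498 and Rgrowth about 754.50, in line with the values quoted for model c).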

All genealogy comparisons were performed using adapted Y-time routines recoded in

Python (Code available on request from Krishna Veeramah).

2.3. Results and Discussion

2.3.1. The NRY and mtDNA distribution in the Nso΄

The modal NRY haplogroup in the won nto´ was Y*(xBR,A3b2) with a frequency of

55.6% (see Table 2.1). This haplogroup was also found at a frequency of 17.6% in the

duy. Furthermore, all of these Y*(xBR,A3b2) chromosomes had the same microsatellite

haplotype (14-12-20-11-14-14) (see Supplementary Table 2S.1 for all relevant NRY

data). For convenience only this NRY type and the associated microsatellite haplotype is

referred to as the won nto´ Modal haplotype (WMH). The modal NRY haplogroup in the

non-won nto´ social classes was E3a with a diverse range of NRY types at the

microsatellite haplotype level (h= 0.94 ± 0.01). Y*(xBR,A3b2) NRYs were found in the

other non-royal social classes but these included microsatellite haplotypes that were 1-3

mutation steps different from the WMH, suggesting that they had originated in the won

nto´ or paternal ancestors of a founder of the won nto´ some time ago and subsequently

diverged from the WMH. This accords with Nso´ rules of class inheritance.
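
The gene diversity, h, quoted above is defined in Section 2.2.4. A minimal sketch of the point estimate is given below; the function name is illustrative and the standard error is not computed.

```python
from collections import Counter

def gene_diversity(haplotypes):
    """Nei's unbiased gene diversity: n/(n-1) * (1 - sum(p_i^2))."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two chromosomes are needed")
    freqs = [count / n for count in Counter(haplotypes).values()]
    return n / (n - 1) * (1.0 - sum(p * p for p in freqs))
```

A sample in which every chromosome carries the same haplotype gives h = 0, while a sample in which every haplotype is different gives h = 1.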

Table 2.1: Distribution of NRY haplogroups (NRY at UEP level) in the four

Nso´ social classes.

Assigned NRY      Sample Cultural Identity
haplogroup        won nto´      duy           mtaar         nshiylav      Total
                  (n=18)        (n=51)        (n=21)        (n=42)        (n=132)
P*(xR1a)          0 (0.000)     1 (0.020)     0 (0.000)     0 (0.000)     1 (0.008)
BR*(xDE,JR)       2 (0.111)     0 (0.000)     0 (0.000)     1 (0.024)     3 (0.023)
E*(xE3a)          0 (0.000)     2 (0.039)     0 (0.000)     1 (0.024)     3 (0.023)
Y*(xBR,A3b2)      10 (0.556)    9 (0.176)     3 (0.143)     8 (0.190)     30 (0.227)
E3a               6 (0.333)     39 (0.765)    18 (0.857)    32 (0.762)    95 (0.720)

Note. Figures indicate the number of NRY characterised while relative frequencies are

shown in brackets. Haplogroup nomenclature is that proposed by the Y-chromosome

Consortium (2002).

With regard to expectations inferred from the declared social practices of the Nso´, the frequency

(at approximately 56%) and extreme homogeneity of Y*(xBR,A3b2) observed in the won

nto´ made the WMH the likely candidate to be the NRY type possessed by Nso´ fons and

the knowledge that a high status man generally considered to be a paternal descendant of

a recent fon possessed the WMH confirmed this. A Pearson's chi-square goodness of fit

test was performed to test the deviation of the observed WMH frequency in the won nto´

from the expectation of the proportion of individuals who possess the fon's NRY type

from both Royal Social Status Rule A and Royal Social Status Rule B respectively. That

Royal Social Status Rule A has been followed could not be rejected at the 1% level (and

only barely at the 5% level) (Chi-square test against upper limit of expected frequency of

46.7%: P = 0.45, χ² = 0.567, df = 1, and lower limit of 33.4%: P = 0.046, χ² = 3.97, df = 1),

in contrast to Royal Social Status Rule B for which non-compliance with the rule is

statistically significant (Chi-square test against upper limit of expected frequency of

24.1%: P = 0.001, χ² = 9.73, df = 1, and lower limit of 1.0%: P < 0.0001, χ² = 541.15, df = 1).

While these tests are dependent on twin assumptions of random sampling and equal

reproductive success of non-fon won nto´ males, both of which may not hold exactly, the

size of the difference in P-values is strongly indicative of a real effect. This support for

Royal Social Status Rule A is notable given that the WMH types appear in non- won nto´

males at a low frequency and therefore fon NRY types could enter the won nto´ through

non-won nto´ males resulting in the prior expectation being an underestimate. These data

support male line continuity of Nso´ fons up to at least the fourth generation and the

WMH can thus be considered a likely candidate for the NRY type passed down from the

first fon of Nso´. Also, there was insufficient statistical support to reject the hypothesis

that the frequency of Y*(xBR,A3b2) found in the duy is in accordance with expectations

based on declared social practices (Chi-square test against expected frequency of 12.5%:

P = 0.26, χ² = 1.23, df = 1). There was no statistical difference in the frequency of mtDNA

types (see Supplementary Table 2S.2) in combined or pairwise comparisons among the

four Nso´ classes (Global ETPD P-value=0.82±0.05, pairwise ETPD P-values > 0.25).

Therefore the pattern of both NRY and mtDNA variation in the Nso´ was in concordance

with expectations based on Royal Social Status Rule A.
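
The chi-square statistics reported in this section can be recomputed from the counts in Table 2.1. The analyses in this study were run in R; the Python sketch below is only a cross-check, with an illustrative function name. It assumes no continuity correction, which is the form that agrees with the quoted statistics to within rounding.

```python
import math

def chi_square_gof(observed, n, expected_freq):
    """Pearson chi-square goodness-of-fit test for a two-category count (df = 1)."""
    expected = (n * expected_freq, n * (1.0 - expected_freq))
    counts = (observed, n - observed)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
    # For df = 1 the upper-tail probability is erfc(sqrt(chi2 / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))

# (label, carriers observed, sample size, expected frequency) as quoted in the text
CASES = (
    ("won nto', Rule A upper limit", 10, 18, 0.467),
    ("won nto', Rule A lower limit", 10, 18, 0.334),
    ("won nto', Rule B upper limit", 10, 18, 0.241),
    ("won nto', Rule B lower limit", 10, 18, 0.010),
    ("duy", 9, 51, 0.125),
)
for label, observed, n, freq in CASES:
    chi2, p = chi_square_gof(observed, n, freq)
    print(f"{label}: chi2 = {chi2:.2f}, P = {p:.4f}")
```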

2.3.2. Association of the Y*(xBR,A3b2) lineage with the indigenous hunter-gatherer

Visale

The merits of the two versions of the oral history were then examined. This first required

the investigation of the likely origins of Y*(xBR,A3b2). To see whether it is credible that

Y*(xBR,A3b2) is one of the NRY lineages replaced by the EBSP as proposed by

Underhill et al. (2001), samples were analysed from the Cameroon Grassfields, including

the Nso´, (total n=1213) alongside unreported data held in TCGA database consisting of

sample sets collected from across sub-Saharan Africa, including from the region of the

EBSP (n=8072). The frequencies of E3a and Y*(xBR,A3b2) NRYs (again characterised

by a battery of twelve UEPs and six microsatellites) were compared. Consistent with the

suggestion of Underhill et al. (2001), E3a was the most common haplogroup within each

population (lowest population frequency = 46.3%, mean = 80.2%, standard

deviation = 14.9%), except in Ethiopia, Sudan and the Lake Chad region of northern

Cameroon where the EBSP is not believed to have had a major impact, while the

frequency of Y*(xBR,A3b2) never exceeded 14%. In eight of the populations examined

(northern Cameroon, north eastern Ghana, Mozambique, western Senegal, Sudan,

Tanzania, Uganda and Zimbabwe) Y*(xBR,A3b2) was not represented. Y*(xBR,A3b2)

was represented, however, in eleven other, widely distributed, populations (southern

Cameroon*, Grassfields of Cameroon*, Ethiopia*, north western Ghana, south eastern

Ghana, south western Ghana*, central Malawi, northern Malawi, Pretoria-South Africa,

Cross River region-Nigeria and southern Senegal*). In the five populations in which the

Y*(xBR,A3b2) count was greater than 10 (indicated by an asterisk in the list above) the

modal haplogroup E3a had an among-group variance, assessed using AMOVA

(Excoffier, Smouse & Quattro 1992; Michalakis & Excoffier 1996), of 1.97%, a low

figure (relative to other haplogroups) which is consistent with either a recent common

origin or high inter-group gene flow and low effective population size. The putative

replaced Y*(xBR,A3b2), on the other hand, had a high among-group variance of 87.31%

which is consistent with inter-group isolation¹². In previous publications Y*(xBR,A3b2)

(or haplogroups of relative equivalence) has been reported at 20-45% (Hammer et al.

1998; Scozzari et al. 1999; Underhill et al. 2001) in Khoisan groups, which have origins

that pre-date the EBSP. This distribution therefore suggests that Y*(xBR,A3b2) could be

common in hunter-gatherer populations that predate the EBSP.

To establish that the WMH was not common in other groups inhabiting the Grassfields or

the land to the east, including the Tikar plain, the NRYs of males from 10 other

neighbouring ethnic groups (n=780) (Table 2.2) were analysed. Only one self-declared

non-Nso´ had a Y*(xBR,A3b2) chromosome and this individual was born in Kumbo, the

Nso´ capital.

A PCO plot (Figure 2.5) based on a pairwise FST distance matrix calculated using NRY

haplogroup frequencies clearly distanced the won nto´ from both the other Nso´ social

classes and the other ethnic groups, demonstrating that a high frequency of

Y*(xBR,A3b2) is not typical of the Grassfields and Tikar plain NRY profiles.

Accordingly, as Y*(xBR,A3b2) is typical of a hunter-gatherer population and WMH is the

most likely candidate to be the NRY type of the father of the first fon of Nso´, the NRY

data favour the oral tradition of the Princess marrying an indigenous Visale from which

all subsequent fons descend.

¹² It should be noted that the AMOVA among-group variance is used following the approach of Di

Giacomo et al. (2004) as a convenient statistic for comparing the distribution of haplotypes within a single

haplogroup where the haplogroup is present in multiple ethnic groups (in this case haplogroups E3a and

Y*(xBR,A3b2)). In doing so E3a and Y*(xBR,A3b2) are treated as separate haploid populations from

which samples have been selected at random. No inferences are drawn other than that low among-group

variance is consistent with a recent common origin or gene flow between the members of the haplogroup

and high among-group variance is consistent with isolation of the separate collections of representatives of

the haplogroup.

Table 2.2: Distribution of NRY haplogroups in the peoples of the western Grassfields and Tikar plain.

Assigned NRY haplogroup / Cultural identityᵃ
(n=99) (n=66) (n=30) (n=20) (n=152) (n=75) (n=154) (n=81) (n=56) (n=47) (n=780)
P*(xR1a) 0 (0.000) (0.013) (0.019) (0.012) (0.000) (0.006)
BR*(xDE,JR) 5 (0.051) (0.000) (0.066) (0.013) (0.099) (0.036) (0.021) (0.037)
E*(xE3a) 0 (0.000) (0.039) (0.013) (0.045) (0.025) (0.000) (0.021) (0.022)
A3b2 0 (0.000) (0.052) (0.000) (0.010)
Y*(xBR,A3b2) 0 (0.000) (0.018) (0.000) (0.001)
E3a 94 (0.949) (1.000) (0.895) (0.960) (0.870) (0.864) (0.946) (0.957) (0.923)

NOTE.-Figures indicate the number of NRY characterised while relative frequencies are shown in brackets. Haplogroup nomenclature is that

proposed by the Y-chromosome Consortium (2002). ᵃ A = Aghem speakers, located in Wum; B = Bafut speakers, located in Bafut and did not

declare a Tikar ethnic identity; BT = Bafut speakers, located in Bafut and declared a Tikar ethnic identity; Bl = Bamileke speakers, located

throughout the western Grassfields and Tikar plain after, it is claimed, being displaced from their homeland of Mbam living on the Tikar plain;

Bm = Bamun speakers, located in Foumban; K = Kwandja speakers, located in towns in the north-eastern region of the Tikar plain such as

Nyamboya; M = Mambila speakers located in towns near the Nigerian border on the Tikar plain, such as Atta, Somie and Songkolong as well as

Mayo Darle; T = Tikar speakers, located in towns on the Tikar plain such as Magba, Sabongari and Bankim; W = Wimbum speakers, located in

Nkambe; Y = Yamba speakers, located in towns throughout the Tikar plain, such as Sabongari, Magba, Bankim, Somie, Songkolong and Atta as

well as Mayo Darle.

Figure 2.5: PCO plot of UEP-based population pairwise FST values. The PCO plot is constructed using pairwise genetic

distances, FST, between the four Nso´ classes (labelled by name) and other populations of the western Grassfields and

Tikar Plain (labelled using abbreviations as defined in Table 2.2). PCO1 and PCO2 explain 97.91% and 1.92% of the

variation respectively.

2.3.3. Dating of the Y*(xBR,A3b2) lineage in the Nso´

Given the close match between the distribution of NRY types and one of the two main

versions of the foundation story of the ruling dynasty and the potentially large number of

offspring likely to be descended from the founder of the line, the time to the most recent

common ancestor (TMRCA) of the randomly collected Y*(xBR,A3b2) NRYs observed

in different social classes was estimated to investigate specific aspects of Nso´ history.

Oral history suggests a time since the first fon of some 250-700 years before present

(Mzeka 1990).

Considering the rules of social class inheritance and the previous assertion that the first

ever fon of Nso´ carried a WMH it was reasonable to postulate that all sampled duy with

a Y*(xBR,A3b2) chromosome were male line descendants of the WMH carrying first

fon. Consequently, in order to see how similar an estimate of the TMRCA for all duy with

a Y*(xBR,A3b2) chromosome was to the period suggested by oral history, the method of

Behar et al. (2003) was applied using both a Simple Stepwise Mutation Model and a

Linear Length Dependent Mutation Model as well as utilising a variety of demographic

models to compute associated confidence intervals (see Methods and Materials Section

2.2.5. for a full explanation). As all individuals analysed had the same microsatellite

haplotype (the WMH) the actual point estimate of the TMRCA was non-informative

(Average Squared Distance (ASD) = 0.0 with and without DYS388) but the upper limit of

the 95% one-tailed confidence interval (CI) under realistic demographic models was 1035

years (assuming an intergeneration time of 20 years; see footnote 13) (1112 years without using

DYS388) (see Supplementary Table 2S.7 for associated confidence intervals). Therefore

these data are consistent with the oral history that suggests that the founding of the Nso´

Royal family was a recent event that occurred within the last 1000 years. Even under a

more conservative demographic model of constant population size, which is known not to

be the case for the general population in recent centuries and is unlikely for a paternally

inherited genetic system possessed by an agnatically defined elite social group practicing

male polygamy, the upper limit was 1497 years (1771 years without using DYS388). This

analysis was also repeated with the addition of the Y*(xBR,A3b2) NRYs found in

the won nto´, who also all descend from the first fon, but the results are not reported here

(though they can be found in Supplementary Table 2S.7). This was because the won

nto´ may enjoy a reproductive advantage due to their elevated social position (at least so

far as a reigning fon is concerned), which is likely to inflate the contribution made by

their recent shared ancestry to the TMRCA calculation, and thus adversely affect the

confidence intervals for the TMRCA estimates on these samples (in this case by reducing

them).

Footnote 13: An intergeneration time of 20 years was applied after consultation with David Zeitlyn, who

specialises in working within the Grassfields region.

Analysis of the Y*(xBR,A3b2) NRYs in the nshiylav and the mtaar gave an ASD of

0.1212 (0.1091 without DYS388). This gave a TMRCA point estimate of 1299 years

[64.94 generations] (1173 years without using DYS388 [58.65 generations]) under the

Simple Stepwise Mutation Model and 1672 years [83.606 generations] (1351 years

without using DYS388 [67.57 generations]) under the Linear Length Dependent Stepwise

Mutation Model (combined two-tailed 95 % CIs for both estimates: 176-7119 years

(116-6293 years without using DYS388)). The upper limit of the CI for the nshiylav and

the mtaar TMRCA estimate is much older than that for the duy. However, as both CIs

overlap it was not possible, from this analysis alone, to distinguish between the depths of

the genealogies of these two groups.
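As a rough illustration of the quantities reported above, the following sketch computes an ASD relative to an ancestral haplotype and converts a TMRCA from generations to years using the 20-year intergeneration time. The repeat counts are invented for illustration only, and averaging over both individuals and loci is an assumption about the ASD definition; the method actually applied is that of Behar et al. (2003), described in Section 2.2.5.

```python
def average_squared_distance(haplotypes, ancestral):
    """Mean squared repeat-count difference from the ancestral haplotype,
    averaged over individuals and loci (assumed definition)."""
    total = sum((allele - anc) ** 2
                for hap in haplotypes
                for allele, anc in zip(hap, ancestral))
    return total / (len(haplotypes) * len(ancestral))


def generations_to_years(generations, years_per_generation=20):
    # 20 years per generation is the value used in the text.
    return generations * years_per_generation


# Hypothetical repeat counts at three loci (not data from this study).
ancestral = (14, 12, 23)
identical = [ancestral] * 9
mixed = [(14, 12, 23), (15, 12, 23), (14, 12, 21)]

print(average_squared_distance(identical, ancestral))
print(average_squared_distance(mixed, ancestral))
print(round(generations_to_years(64.94)))
```

When every sampled chromosome matches the ancestral haplotype, as for the duy, the ASD is zero and the point estimate carries no information, which is why only the upper confidence limit is interpreted for that group.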

Given that the ancestral haplotype predicted for a) the duy Y*(xBR,A3b2) NRYs, and b)

the nshiylav and mtaar Y*(xBR,A3b2) NRYs is identical (the WMH), if the depths of the

genealogies of the two groups, a) and b), were the same, it is likely that they would share

the same MRCA at the root of a common genealogy. As a consequence the expectation

would be that the Average Square Distance (ASD) of a) duy and b) nshiylav and mtaar

Y*(xBR,A3b2) NRYs combined would be similar. If, however, the nshiylav and mtaar

had an older genealogy than did the duy then the expectation would be that the ASD of

the nshiylav and mtaar combined would be greater than that of the duy. Numerous

genealogies of final generation size n = 20 (representing the total number of duy (n=9),

nshiylav and mtaar (n=11) Y*(xBR,A3b2) NRYs) were simulated under the ASD value

of all duy, nshiylav and mtaar Y*(xBR,A3b2) NRYs combined (0.067 with DYS388,

0.060 without DYS388) under various demographic and mutation models. Individuals

were then randomly assorted at the tips of the simulated trees to one of two groups of size

n=9 (representing the original duy) and n=11 (representing the original nshiylav and

mtaar) and the ASD of the two groups calculated to estimate the probability of observing

a duy / nshiylav and mtaar ASD difference equal to or more extreme than that calculated

in the survey if the two groups, a) and b), shared a common MRCA (see Methods and

Materials Section 2.2.6. for full explanation). The survey-based data were for the duy

(ASD = 0.0) and for the nshiylav and mtaar (ASD with DYS388 = 0.1212, ASD without

DYS388 = 0.1091). Under all demographic and mutation models significantly low P-

values (P<0.05) were obtained, except under a star genealogy where the P-value was

approaching significance (P=0.093) (Table 2.3). As a star genealogy is not a particularly

realistic demographic model in this case it is reasonable to reject the hypothesis that a)

and b) share the same MRCA and therefore assert that the nshiylav and the mtaar have a

significantly older genealogy than the duy.
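The random-assortment step can be sketched in a much simplified form. The analysis described above simulated whole genealogies under demographic and mutation models; the sketch below swaps in a plain label-shuffling test on a fixed set of per-individual squared distances, so it illustrates only the final comparison of group ASD values. All numbers are hypothetical, and the function names are illustrative.

```python
import random


def group_asd(squared_distances):
    # Mean per-individual squared distance from the ancestral haplotype.
    return sum(squared_distances) / len(squared_distances)


def assortment_p_value(values, size_a, observed_diff, n_reps=10000, seed=1):
    """Fraction of random splits whose ASD difference (group b minus
    group a) is at least as large as the observed difference."""
    rng = random.Random(seed)
    pool = list(values)
    hits = 0
    for _ in range(n_reps):
        rng.shuffle(pool)
        diff = group_asd(pool[size_a:]) - group_asd(pool[:size_a])
        if diff >= observed_diff - 1e-12:
            hits += 1
    return hits / n_reps


# Hypothetical per-individual values for 20 tips (not the study data):
# nine tips identical to the ancestral type, eleven carrying some distance.
group_a = [0.0] * 9
group_b = [0.0] * 7 + [0.5, 0.5, 1.0, 1.0]
observed = group_asd(group_b) - group_asd(group_a)
p = assortment_p_value(group_a + group_b, len(group_a), observed)
print(round(observed, 4), p)
```

A small P-value here means that a split as uneven as the observed one rarely arises when group labels are assigned at random, which is the logic behind rejecting a shared MRCA.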

Table 2.3: Comparison of the depth of two genealogies. The probability of observing results equal to or more extreme than the difference between the Average Square Distance values of a) the duy and b) the nshiylav and mtaar combined. (Three independent run simulations for each set of criteria)

Demographic Model        P-value            P-value minus DYS388
                         SSM      L-SMM     SSM      L-SMM
Star genealogy
  Run 1                  0.090    NA        0.061    NA
  Run 2                  0.091    NA        0.058    NA
  Run 3                  0.093    NA        0.062    NA
Rgrowth=0
  Run 1                  0.003    0.003     0.002    0.002
  Run 2                  0.003    0.004     0.003    0.003
  Run 3                  0.003    0.003     0.002    0.003
Rgrowth=894.21
  Run 1                  0.018    0.021     0.016    0.017
  Run 2                  0.019    0.023     0.016    0.017
  Run 3                  0.017    0.022     0.016    0.017
Rgrowth=10,000,000
  Run 1                  0.047    0.052     0.032    0.038
  Run 2                  0.044    0.054     0.034    0.036
  Run 3                  0.043    0.049     0.029    0.034

NOTE.-SSM = Single Stepwise Mutation Model. L-SMM = Linear Length Dependent Stepwise Mutation Model.

This finding suggests that the Y*(xBR,A3b2) NRYs in the nshiylav and the mtaar

descend not just from individuals of the Royal social class but also from those Visale

individuals that were not made part of the royal family when the Princess arrived and

instead were made commoners, as the hunter-gatherer Visale would be expected to have a

much older TMRCA than the duy. This is consistent with the previously held belief that

the indigenous Visale accepted the rule of the Princess and her heir and became a mtaar

lineage (there are believed to be approximately 20 existing mtaar lineages (Chilver &

Kaberry 1968)) with the condition that all future fons must have a mother that is of mtaar

social class (Mzeka 1978).

2.3.4. The possible evolution of a relaxed patrilineal system of descent for the won nto´

When testing whether the observed distribution of Nso´ NRY types met expectations of

social practice, though Royal Social Status Rule A could not be rejected, the observed

frequency of the WMH at 55.6% was somewhat above the expected range. A higher than

expected frequency is not a problem in the subsequent analysis since male line continuity

of fons has been clearly demonstrated, permitting the definition of a putative NRY type

for the first fon of Nso´. Examination of the sociological data collected along with the

DNA of Nso´ males (see Methods and Materials) shows that 15 of 18 males (83.3%)

inherited won nto´ status through their father and paternal grandfather (one further won

nto´ male had a won nto´ father but no won nto´ grandfather) while their mothers and

paternal grandmothers were of other social classes (see Table 2.4). The two remaining

won nto´ males appear to have inherited their won nto´ status through the matrilineal line.

Given that the sampling strategy utilised in this study (described in full in the Methods

and Materials section) under-records the proportion of fon NRY types in the actual

population, the elevated frequency of the WMH is striking as is the extremely large

number of individuals claiming won nto´ membership through paternal inheritance

compared to those with affiliations through a uterine connection. One possible

explanation is that the Nso´ royal family may have evolved or is evolving into a more

patrilineally defined group. An almost strictly patrilineal model of won nto´ status

inheritance, where non-patrilineally inherited membership of the won nto´ is restricted to

children of a fon's daughter, was named Royal Social Status Rule C (see Supplementary

Section 2S.1). This rule generates an expected range of 50.7%-93.1% for fon NRY types

given the sampling strategy used (estimated using a methodology similar to that used for

the expectations for Royal Social Status Rules A and B). While the lower limit of 50.7%

appears a reasonable fit (P = 0.68, χ² = 0.17, df = 1) to our observed data, the upper limit

(93.1%) shows a significant deviation (P < 0.0001, χ² = 39.49, df = 1). A more relaxed

model, Royal Social Status Rule D (see Supplementary Section 2S.1), where won nto´

membership is restricted to paternal line inheritance plus inheritance through a line of

three generations containing only one female, generates an expected range of 40.7%-

53.9% for the percentage of males with a fon's NRY that may be expected to be sampled.

In this case neither the upper nor lower limit can be rejected using a Pearson's Chi Square

test (Upper limit: P = 0.89, χ² = 0.02, df = 1; Lower limit: P = 0.20, χ² = 1.65, df = 1).
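The goodness-of-fit comparisons above can be approximated with a short sketch. It assumes that the observed 55.6% corresponds to 10 WMH carriers among 18 sampled won nto´ males, and it uses the rounded range limits quoted in the text, so the statistics may differ from the reported ones in the second decimal place. The function name is illustrative.

```python
import math


def chi_square_gof(observed_hits, n, expected_prop):
    """Pearson chi-square (df = 1) for a two-category goodness-of-fit test."""
    observed = (observed_hits, n - observed_hits)
    expected = (n * expected_prop, n * (1 - expected_prop))
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Assumption: 10 of 18 sampled won nto' males carry the WMH (55.6%).
for label, prop in [("Rule C lower", 0.507), ("Rule C upper", 0.931),
                    ("Rule D lower", 0.407), ("Rule D upper", 0.539)]:
    chi2, p = chi_square_gof(10, 18, prop)
    print(f"{label}: chi2 = {chi2:.2f}, P = {p:.4f}")
```

With only 18 observations some expected counts are small (for example about 1.2 non-carriers at the Rule C upper limit), so the chi-square approximation is rough at the extremes.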

Given the above it is possible that the Nso´ royal inheritance system may in practice be

more patrilineal than previously described. Clearly further field work may establish

whether this is in fact the case. The limited exploration of the rules of won nto´ affiliation

undertaken with Nso´ elders described above suggests that continuing development of the

rules of won nto´ membership is a possibility.

2.4. Conclusion

It is frequently difficult to establish in what manner, where and when events that are the

subject of oral history occurred, even in accounts in which categorical assertions are

made. Nevertheless, such narratives can prove valuable sources from which information

can be extracted. Confidence in conclusions reached from the analysis of oral tradition is

increased when they are supported by data from other sources, for example linguistics and

archaeological excavation. This study has shown that the distribution of NRY and

mtDNA is consistent with an oral history that describes a) fusion of an indigenous hunter-

gatherer group with later migrants and b) paternal descent of the ruling dynasty from the

indigenous inhabitants of the land over the period covered by the oral history.

The frequency of the won nto´ Modal Haplotype (WMH) in the won nto´ social class

accords very well with what one would predict from population genetic theory and the

sampling strategy utilised in this study and illustrates the power of genetic anthropology

to confirm the genetic consequences of social practices and labels. Notably support has

been provided for one description of the social system put forward by local researchers as

opposed to that advanced by western-based scholars. In this study it has also been

illustrated how, in the investigation of the histories of groups living in sub-Saharan

Africa, genetic analysis may prove a valuable additional tool in the armoury of scholars

seeking to elucidate, on a fine-scale, the pre-histories of sub-Saharan African populations.

Table 2.4: Cultural identity of won nto´ males sampled in the study as well as the cultural identity of each sample's father, mother, father's father and mother's mother.

Sample      Self-declared  Father's   Father's father's  Mother's   Mother's mother's
Identifier  cultural       cultural   cultural identity  cultural   cultural identity
            identity       identity                      identity
NSO-01      won nto´       won nto´   won nto´           duy        Nso´
NSO-02      won nto´       won nto´   won nto´           nshiylav   Bamun
NSO-03      won nto´       won nto´   won nto´           mtaar      mtaar
NSO-04      won nto´       won nto´   won nto´           duy        Nso´
NSO-05      won nto´       won nto´   won nto´           nshiylav   nshiylav
NSO-06      won nto´       won nto´   won nto´           nshiylav   nshiylav
NSO-07      won nto´       won nto´   won nto´           duy        nshiylav
NSO-08      won nto´       won nto´   won nto´           duy        duy
NSO-09      won nto´       won nto´   won nto´           won nto´   mtaar
NSO-10      won nto´       won nto´   won nto´           mtaar      nshiylav
NSO-11      won nto´       won nto´   won nto´           Nsungli    Wimbum
NSO-12      won nto´       won nto´   won nto´           Nso        nshiylav
NSO-13      won nto´       won nto´   won nto´           mtaar      mtaar
NSO-14      won nto´       won nto´   won nto´           nshiylav   duy
NSO-15      won nto´       won nto´   won nto´           duy        mtaar
NSO-16      won nto´       won nto´   Nooni              nshiylav   duy
NSO-17      won nto´       Duy        duy                won nto´   won nto´
NSO-18      won nto´       Nshiylav   nshiylav           won nto´   won nto´

2.5. Supplementary Section for Chapter 2

Because of their large size, for Supplementary Tables 2S.1, 2S.2 and 2S.7 please see

attached CD-ROM.

2.5.1. Supplementary Section 2S.1: The expectation of NRY type frequencies in the won

nto´ and duy of the Nso´.

Within the main text it is stated that:

If Royal Social Status Rule A has been followed it would be expected that 33.4% - 46.7%

of won nto´ Nso´ males sampled would share identical NRY types while this same NRY

type would be expected asymptotically to approach a frequency of 12.5% in won nto´-

descended-duy, depending on the number of generations since the original fon. However,

if Royal Social Status Rule B has been followed, it would be expected that only 1.0% -

24.1% of sampled won nto´ males would share the same NRY type. This same NRY type

would be expected to be at a frequency of 12.5% in the won nto´-descended-duy,

irrespective of the number of generations of descent from a fon.

Below is a description of how the above conclusions were elucidated.

Assumptions:

For the expected frequencies in the won nto´:

(a) All fons with paternal descent from the first fon will share the same NRY

(b) Male and female births and survival rates are similar.

(c) Generations do not overlap.

(d) Female won nto´ marry males who are not patrilineal descendants of the

first fon.

2.5.1.1. Royal Social Status Rule A

Royal Social Status Rule A can be described as: individuals are assigned won nto´ status

if either for up to four generations they are descendants of a fon along agnatic lines

(interpreted to mean, when expressed unambiguously, 'an exclusively paternal line of

inheritance') or for up to three generations they are descendants of a fon along uterine

connexions (mixed gender or strictly matrilineal lineages).

Ignoring duy status acquired by other means, individuals are duy if either they are

descendants of a fon along agnatic lines of not less than five generations or are

descendants of a fon for not less than four generations along lines with uterine

connexions. Inheritance of duy status is thereafter patrilineal.

The genealogy of the won nto´ descendants of a fon over four generations is illustrated

below (Supplementary Figure 2S.1).

Supplementary Figure 2S.1: Lineage tree showing the relationship of won

nto´ individuals under Royal Social Status Rule A. (M = male offspring, F =

female offspring, * = this individual inherits the same NRY type as a fon).

won nto´ males of the current generation are a summation of the following:

a) sons of the current fon.

b) grandsons of the previous fon.

c) great grandsons of the second previous fon.

d) great great grandsons of the third previous fon through exclusively paternal descent.

If every fon has an equal number of sons and daughters (the total of which can be any

even number) as well as one extra son who becomes the next fon, while all other

individuals have only one son and one daughter, then, as can be seen in Supplementary

Figure 2S.2, the relative proportions of a)-d) individuals in the current generation of male

won nto´ would be: a) 0.125, b) 0.25, c) 0.5 and d) 0.125.

Supplementary Figure 2S.2: Diagram showing the relative contributions of

different won nto´ lineages to the won nto´ under Royal Social Status Rule A.

The percentage of males in a)-d) who would share the same NRY type (*) of a fon would

be: a) 100%, b) 50%, c) 25% and d) 100%.

Therefore the relative proportions of a)-d) individuals in the current generation of male

won nto´ who possess the NRY* type would be:

a) 100% * 0.125 = 0.125

b) 50% * 0.25 = 0.125

c) 25% * 0.5 = 0.125

d) 100% * 0.125 = 0.125

The proportion of males in the current generation who would share the NRY* type is the

sum of the above i.e. 0.5. Therefore in any one generation it would be expected that half

of won nto´ males would share the same NRY type. However, since it is not reasonable to

assume that won nto´ have only two children, and since sampling of individuals who are the

brother, father, son or paternal line cousin of an individual from whom a buccal swab has

already been collected was not permitted, the expectation is different (as now described).
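The arithmetic of this simplified two-children case can be checked with exact fractions. The sketch below only restates the proportions given in the text for groups a) to d); the variable names are illustrative.

```python
from fractions import Fraction

# Relative weights of groups a)-d) among current won nto' males under
# Rule A, as stated in the text (0.125, 0.25, 0.5, 0.125).
weights = [Fraction(1, 8), Fraction(1, 4), Fraction(1, 2), Fraction(1, 8)]
# Share of each group carrying the fon's NRY type (100%, 50%, 25%, 100%).
nry_share = [Fraction(1), Fraction(1, 2), Fraction(1, 4), Fraction(1)]

contributions = [w * s for w, s in zip(weights, nry_share)]
total = sum(contributions)
print(contributions, total)
```

Each group contributes one eighth, so the four groups together give the expected one half.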

If it is assumed a) that sampling is from both the most recent and second most recent

generation of adult won nto´ males and that the third most recent generation are not

sampled, b) that individuals are not sampled who are the brother, father, son or paternal

line cousin of another subject, c) the number of children a fon has (excluding his heir) is

'2n' ('n' of whom are males) and d) that the number of children a non-fon won nto´ has is

'2y' ('y' of whom are males), the proportion of males who possess a fon's NRY type that

are expected to be sampled if rule A has been followed is:

\[
\frac{\dfrac{1}{y+1}\left(n(1+y)^{2}+1\right)+1}{\dfrac{1}{y+1}\left(3n(1+y)^{2}+1\right)+1}
\]

The above expression assesses the probability of sampling the different individual

lineages included in Supplementary Figure 2S.3 given the sampling strategy

utilised in this study (see Supplementary Table 2S.3 for probabilities). To assist

understanding of the approach adopted a description of how the proportions of individuals

belonging to a specific lineage were calculated is given below.

The proportion of individuals of lineage representative (LR) 2 in Supplementary Figure 2S.3 from whom buccal swabs are taken is calculated as follows: calculate the probability of sampling individuals of LR 2 in the population rather than their sons (LR 3). This is a function of the relative contributions of LRs 2 and 3 combined to the total population. The contribution of LR 2 will be, starting from Fon4 and working down the lineage, 'n' * 'y' * 'y' or ny². Similarly the contribution of LR 3 will be ny³. Therefore the probability of sampling from LR 2 rather than LR 3 will be the contribution of LR 2 divided by the contribution of the sum of LR 2 and LR 3 i.e. ny² / (ny² + ny³). As expressed in b) above only one individual of LR 2 is sampled among all that share the same father and paternal grandfather. Consequently the maximum number of LR 2 that can be sampled is 'n'. The number of males that can be sampled of LR 2 is therefore 'n' multiplied by ny² / (ny² + ny³), as shown in the second row of Supplementary Table 2S.3, while the number for LR 3 is 'ny' multiplied by ny³ / (ny² + ny³) as shown in the third row of Supplementary Table 2S.3.

Supplementary Figure 2S.3: Lineage tree showing the relationship of won nto´ individuals under Royal Social Status Rule A for the two most recent adult generations of won nto´ males. (M = male offspring, F = female offspring, * = this individual inherits the same NRY type as a fon. Numbers refer to specific Lineage Representatives).
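The LR 2 and LR 3 calculation above can be written out as a small sketch using exact fractions. The family sizes passed in are illustrative only, since the real values of 'n' and 'y' are unknown, and the function name is hypothetical.

```python
from fractions import Fraction


def sampled_lr2_lr3(n, y):
    """Expected numbers of LR 2 and LR 3 males sampled, following the text."""
    lr2_contrib = n * y ** 2  # contribution of LR 2 to the population
    lr3_contrib = n * y ** 3  # contribution of LR 3 (the sons of LR 2)
    p_lr2 = Fraction(lr2_contrib, lr2_contrib + lr3_contrib)
    p_lr3 = 1 - p_lr2
    # At most n males of LR 2 and ny males of LR 3 can be sampled.
    return n * p_lr2, n * y * p_lr3


# Illustrative family sizes only.
print(sampled_lr2_lr3(n=3, y=2))
```

Both terms simplify algebraically, to n / (1 + y) for LR 2 and ny² / (1 + y) for LR 3, which is the form that feeds into the summed expression.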

Supplementary Table 2S.3: Probability of sampling Nso´ Y-chromosomes given Royal Social Status Rule A.

Column 1: Lineage Representative.
Column 2: Total number of males of this lineage type that can be sampled conditional on the probability of sampling this individual in terms of 'n' and 'y'.
Column 3: Proportion of the lineage possessing a fon's NRY type.

(1)         (2)                              (3)
1           ny ( ny³ / ( ny⁴ + ny³ ) )       1
2           n ( ny² / ( ny³ + ny² ) )        1
3           ny ( ny³ / ( ny³ + ny² ) )       1
4           ny ( ny² / ( ny³ + ny² ) )       0
5           ny / ( ny² + ny
6           n ( ny² / ( ny² + ny
7           ny ( ny² / ( ny² + ny
8,9,10      1                                1
11          n ( ny / ( ny
12          n ( ny / ( ny² + ny
13          n ( ny² / ( ny² + ny
14          ny ( ny² / ( ny² + ny
15          n ( ny² / ( ny³ + ny² ) )        0
16          ny ( ny² / ( ny³ + ny² ) )       0

NOTE.-'2n' is the number of children a fon has ('n' of which are male). '2y' is the number of children a won nto´ has ('y' of which are male).

The process is repeated for each lineage in Supplementary Figure 2S.3 and summed to

yield the total number of won nto´ sampled. Some of the lineages are patrilineal with an

origin in a fon and will therefore have the same NRY type as the fon. These lineages have

a probability of having a fon NRY of ‘1’ as shown in column 3 of Supplementary Table

2S.3. Summing all the lineages yields the total number of individuals who share the fon’s

NRY type expected to be sampled. This figure is divided by the total number of samples

(described above) to calculate the proportion of won nto´ expected to share the fon’s

NRY. The expression above is an algebraic simplification performed following

summation.

Given a range of different combinations of 'n' and 'y' extending from 1 to 25 (to take

account of uncertainty concerning the real values of 'n' and 'y') and applying the above

expression it is observed that, under Royal Social Status Rule A and the sampling strategy

utilised, 33.4% - 46.7% of won nto´ males tested will possess a fon's NRY type.
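The quoted range can be reproduced by evaluating the Rule A expression over a grid. The sketch assumes that 'n' and 'y' take every integer value from 1 to 25; the function name is illustrative.

```python
def rule_a_expected(n, y):
    """Expected proportion of sampled won nto' males with a fon's NRY type
    under Royal Social Status Rule A."""
    numerator = (n * (1 + y) ** 2 + 1) / (y + 1) + 1
    denominator = (3 * n * (1 + y) ** 2 + 1) / (y + 1) + 1
    return numerator / denominator


# Assumption: integer values of n and y from 1 to 25 inclusive.
values = [rule_a_expected(n, y) for n in range(1, 26) for y in range(1, 26)]
print(f"{min(values):.1%} - {max(values):.1%}")
```

The maximum occurs at the smallest family sizes (n = 1, y = 1), and the proportion falls towards one third as either value grows.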

For the expected frequencies in the duy:

All the assumptions above apply as well as:

(a) Only duy who descend from the first fon are considered; those individuals who

acquire duy status in other ways are not considered e.g. because of claimed

royal descent originating in other ethnic groups incorporated into the Nso´

empire.

(b) duy status once acquired is inherited in a strictly patrilineal manner.

(c) duy do not marry won nto´ that have paternal line descent from a fon.

Supplementary Figure 2S.4 illustrates the transition of a fon's descendants from won nto´

to duy.


Supplementary Figure 2S.4: Lineage tree showing the transition of won nto´

to duy under Royal Social Status Rule A. (M = male offspring, F = female

offspring, * = this individual inherits the same NRY type as a fon). Duy are

shown in red.

In Supplementary Figure 2S.4 there are eight male duy of the present generation, only one

of whom possesses a fon's NRY. All things being equal, every fon should contribute the

same number of male duy individuals, 12.5% of whom possess a fon's NRY. However,

when sampling from the most recent generation the present and three previous fons will

not have had sufficient descendant generations to contribute any duy while the fon of four

generations ago will have contributed seven of the eight males. Nevertheless he would not

have produced the one duy with a fon's NRY. If sampling from the most recent

generation the frequency of male duy with a fon's NRY will approach 12.5% but never

reach it. The more fons there have been since the first fon, the closer the proportion

approaches 0.125. Note that 12.5% is independent of both 'n' and 'y' and the sampling

strategy.

2.5.1.2. Royal Social Status Rule B

Royal Social Status Rule B can be described as: a person is a member of won nto´ (down

to the fourth generation (if a man) and third generation (if a woman)) if she or he is both a

child of a won nto´ and a descendant of a fon.

The genealogy of the won nto´ descendants of a fon over four generations is illustrated

below (Supplementary Figure 2S.5).

Supplementary Figure 2S.5: Lineage tree showing the relationship of won

nto´ individuals under Royal Social Status Rule B. (M = male offspring, F =

female offspring, * = this individual inherits the same NRY type as a fon).

won nto´ males of the current generation are a summation of the following:

a) sons of the current fon.

b) grandsons of the previous fon.

c) great grandsons of the second previous fon.

d) great great grandsons of the third previous fon.

If every fon has an equal number of sons and daughters (the total of whom can be any

even number) as well as one extra son who becomes the next fon, and all other

individuals have only one son and one daughter, then, as can be seen in

Supplementary Figure 2S.6, the relative proportions of a)-d) individuals in the current

population of male won nto´ would be: a) 0.067, b) 0.133, c) 0.267 and d) 0.533.

Supplementary Figure 2S.6: Diagram showing the relative contributions of

different won nto´ lineages to the won nto´ under Royal Social Status Rule B.

The percentage of males in a)-d) who would share the same NRY type (*) of a fon would

be: a) 100%, b) 50%, c) 25% and d) 12.5%.

Therefore the relative proportions of a)-d) individuals in the current population of male

won nto´ who possess the NRY* type would be:

a) 100% * 0.067 = 0.067

b) 50% * 0.133 = 0.067

c) 25% * 0.267 = 0.067

d) 12.5% * 0.533 = 0.067

The proportion of males in the current population who would share the NRY* type is the

sum of the above i.e. 0.27. Therefore in any one generation it would be expected that just

over one quarter of won nto´ males would share the same NRY type. However, since it is

not reasonable to assume that won nto´ have only two children, and since sampling individuals

who are the brother, father, son or paternal line cousin of an individual from whom a

buccal swab has already been collected was not permitted, the expectation is different (as

now described).
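The Rule B proportions in this simplified case are consistent with groups a) to d) standing in the ratio 1 : 2 : 4 : 8. The sketch below takes that ratio as an assumption and recovers the quoted figures with exact fractions.

```python
from fractions import Fraction

# Assumed ratio of groups a)-d), consistent with the stated proportions
# 0.067, 0.133, 0.267 and 0.533.
ratio = [1, 2, 4, 8]
weights = [Fraction(r, sum(ratio)) for r in ratio]
# Share carrying the fon's NRY type halves with each generation of descent:
# 100%, 50%, 25%, 12.5%.
nry_share = [Fraction(1, 2 ** g) for g in range(4)]

total = sum(w * s for w, s in zip(weights, nry_share))
print([round(float(w), 3) for w in weights], float(total))
```

Every group contributes 1/15, giving 4/15 in total, which is the 0.27 quoted above.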

If it is assumed a) that sampling is from both the most recent and second most recent

generation of adult won nto´ males and that the third most recent generation are not

sampled, b) that individuals are not sampled who are the brother, father, son or paternal

line cousin of another subject, c) the number of children a fon has (excluding his heir) is

'2n' ('n' of whom are males) and d) that the number of children a non-fon won nto´ has is

'2y' ('y' of whom are males), the proportion of males who possess a fon's NRY type that

are expected to be sampled if rule B has been followed is:

\[
\frac{\dfrac{1}{y+1}\left(n(1+y)^{2}+1\right)+1}{\dfrac{1}{y+1}\left(4ny^{3}+10ny^{2}+9ny+3n+1\right)+1}
\]

The above expression assesses the probability of sampling the different individual

lineages included in Supplementary Figure 2S.7 given the sampling strategy utilised in

this study (see Supplementary Table 2S.4 for probabilities and Royal Social Status Rule

A for how these probabilities are calculated). Given a range of different combinations of

'n' and 'y' extending from 1 to 25 (to take account of uncertainty concerning the real

values of 'n' and 'y') and applying the above expression it is observed that, under Royal

Social Status Rule B and the sampling strategy utilised in this study, 1.0% - 24.1% of won

nto´ males tested will possess a fon's NRY type.
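As with Rule A, the range can be checked numerically. The sketch evaluates the Rule B expression for integer 'n' and 'y' from 1 to 25 (the integer grid is an assumption) and reports where the extremes fall; the function name is illustrative.

```python
def rule_b_expected(n, y):
    """Expected proportion of sampled won nto' males with a fon's NRY type
    under Royal Social Status Rule B."""
    numerator = (n * (1 + y) ** 2 + 1) / (y + 1) + 1
    cubic = 4 * n * y ** 3 + 10 * n * y ** 2 + 9 * n * y + 3 * n + 1
    denominator = cubic / (y + 1) + 1
    return numerator / denominator


# Assumption: integer values of n and y from 1 to 25 inclusive.
grid = {(n, y): rule_b_expected(n, y)
        for n in range(1, 26) for y in range(1, 26)}
lowest = min(grid, key=grid.get)
highest = max(grid, key=grid.get)
print(lowest, f"{grid[lowest]:.1%}", highest, f"{grid[highest]:.1%}")
```

Because the denominator grows with the cube of 'y' while the numerator grows only with its square, the expected proportion drops quickly as won nto´ family size increases.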

For the expected frequencies in the duy:

All the assumptions above apply as well as:

(a) Only duy who descend from the first fon are considered; those individuals who

acquire duy status in other ways are not considered e.g. because of claimed

royal descent originating in other ethnic groups incorporated into the Nso´

empire.

(b) duy status once acquired is inherited in a strictly patrilineal manner.

(c) duy do not marry won nto´ that have paternal line descent from a fon.

Supplementary Figure 2S.7: Lineage tree showing the relationship of won nto´ individuals under Royal Social Status Rule

B for the two most recent adult generations of won nto´ males. (M = male offspring, F = female offspring, * = this individual

inherits the same NRY type as a fon. Numbers refer to specific Lineage Representatives).

Supplementary Table 2S.4: Probability of sampling Nso´ Y-chromosomes given Royal Social Status Rule B.

Column 1: Lineage Representative.
Column 2: Total number of males of this lineage type that can be sampled conditional on the probability of sampling this individual in terms of 'n' and 'y'.
Column 3: Proportion of the lineage possessing a fon's NRY type.

(1)         (2)                              (3)
1           ny ( ny³ / ( ny⁴ + ny³ ) )       1
2           ny² ( ny³ / ( ny⁴ + ny³ ) )      0
3           ny ( ny³ / ( ny⁴ + ny³ ) )       0
4           ny² ( ny³ / ( ny⁴ + ny³ ) )      0
5           n ( ny² / ( ny³ + ny² ) )        1
6           ny ( ny³ / ( ny³ + ny² ) )       1
7           ny² ( ny³ / ( ny³ + ny² ) )      0
8           ny ( ny² / ( ny³ + ny² ) )       0
9           ny ( ny³ / ( ny³
10          ny² ( ny³ / ( ny³ + ny² ) )      0
11          ny / ( ny² + ny
12          n ( ny² / ( ny² + ny
13          ny ( ny² / ( ny² + ny
14,15,16    1                                1
17          n ( ny / ( ny
18          n ( ny / ( ny² + ny
19          n ( ny² / ( ny² + ny
20          ny ( ny² / ( ny² + ny
21          n ( ny² / ( ny³ + ny² ) )        0
22          ny ( ny³ / ( ny³ + ny² ) )       0
23          ny² ( ny³ / ( ny³ + ny² ) )      0
24          ny ( ny² / ( ny³ + ny² ) )       0
25          ny ( ny³ / ( ny³ + ny² ) )       0
26          ny² ( ny³ / ( ny³ + ny² ) )      0
27          ny ( ny³ / ( ny⁴ + ny³ ) )       0
28          ny² ( ny³ / ( ny⁴ + ny³ ) )      0
29          ny ( ny³ / ( ny⁴ + ny³ ) )       0
30          ny² ( ny³ / ( ny⁴ + ny³ ) )      0

Supplementary Figure 2S.8 illustrates the transition of a fon's descendants from won nto´

to duy.

Supplementary Figure 2S.8: Lineage tree showing the transition of won nto´

to duy under Royal Social Status Rule B. (M = male offspring, F = female

offspring, * = this individual inherits the same NRY type as a fon). Duy are

shown in red.

In Supplementary Figure 2S.8 there are eight male duy of the present generation, only one

of whom possesses a fon's NRY. All things being equal, every fon should contribute the

same number of male duy individuals, 12.5% of whom possess a fon's NRY. When

sampling from the present generation, the overall frequency of male duy with a fon's

NRY will be 12.5% (unlike Royal Social Status Rule A which will approach but never

reach 12.5%). Note that 12.5% is independent of both 'n' and 'y' and the sampling

strategy.

2.5.1.3. Royal Social Status Rule C and D

The expectations for the proposed Royal Social Status Rules C and D are generated in a

similar manner to Royal Social Status Rules A and B by assessing the probabilities of

sampling the individual lineages included in Supplementary Figures 2S.9 and 2S.10 (see

Supplementary Tables 2S.5 and 2S.6 for probabilities).

2.5.1.4. The implications of won nto´ women marrying a) won nto´ men and b) non- won

nto´ men carrying the fon’s NRY type

Table 2.4 shows one case of a marriage between a won nto´ man and a won nto´ woman.

In evaluating the effect of the sampling strategy utilised it was assumed that exclusively

won nto´ marriages do not occur. Such marriages could affect the expectation since each

lineage might no longer be discrete. Correcting for this assumption is complicated given

the ways lineages may interact. It has not been done since approximate calculations

indicate that at realistic levels of family size the effect would be small and most probably

increase the proportion of fon NRY types in the won nto´. Furthermore any such small

increase in the expectation for the incidence of fon NRY types should not affect the

conclusions set out above.

However the relatively simple correction to take account of fon NRY types entering the

won nto´ class as a consequence of won nto´ women marrying non-won nto´ men carrying

the fon NRY type has been made. The correction requires assumptions concerning (a) the

expected proportion of marriages of won nto´ women with men of a different class or

ethnicity and (b) the proportion of men in other groups carrying the fon NRY type. (a) is

assumed based on the proportion of men sampled of each class/ethnicity included in the

survey (the correction applies the survey proportions after allowing for unions with non-

Nso´ men; see Table 2.4 which contains one case out of 18 of a marriage between a non-

Nso´ and a member of the won nto´). Therefore the proportion of marriages to each of the

Nso´ classes is scaled by 17/18 to take into account non-Nso´ males marrying won nto´

females, and it is assumed that none of the non-Nso´ males carry the fon NRY type. Since

(b) was estimated by later typing the correction is post hoc and consequently is not

included in the principal text of the chapter.

Supplementary Figure 2S.9: Lineage tree showing the relationship of won nto´ individuals under Royal Social Status

Rule C for the two most recent adult generations of won nto´ males. (M = male offspring, F = female offspring, * = this

individual inherits the same NRY type as a fon. Numbers refer to specific Lineage Representatives).

Supplementary Figure 2S.10: Lineage tree showing the relationship of won nto´ individuals under Royal Social Status

Rule D for the two most recent adult generations of won nto´ males. (M = male offspring, F = female offspring, * = this

individual inherits the same NRY type as a fon. Numbers refer to specific Lineage Representatives).

given Royal Social Status Rule C.

Lineage Representative | Total number of males of this lineage type that can be sampled conditional on the probability of | Proportion of the lineage possessing a fon's NRY
1 | ny ( ny^3 / ( ny^4 … |
2 | n ( ny^2 / ( ny^3 … |
3 | ny ( ny^3 / ( ny^3 … |
4 | ny / ( ny^2 + ny … |
5 | n ( ny^2 / ( ny^2 + ny ) ) | 1
6, 7, 8 | 1 | 1
9 | n ( ny / ( ny … |
10 | n ( ny / ( ny^2 + ny … |
11 | n ( ny^2 / ( ny^2 + ny ) ) | 0
12 | n ( ny^2 / ( ny^3 … |

given Royal Social Status Rule D.

Lineage Representative | Total number of males of this lineage type that can be sampled conditional on the probability of | Proportion of the lineage possessing a fon's NRY
1 | ny ( ny^3 / ( ny^4 … |
2 | n ( ny^2 / ( ny^3 … |
3 | ny ( ny^3 / ( ny^3 … |
4 | ny ( ny^2 / ( ny^3 … |
5 | ny / ( ny^2 + ny … |
6 | n ( ny^2 / ( ny^2 + ny ) ) | 1
7 | ny ( ny^2 / ( ny^2 + ny ) ) | 0
8, 9, 10 | 1 | 1
11 | n ( ny / ( ny … |
12 | n ( ny / ( ny^2 + ny … |
13 | n ( ny^2 / ( ny^2 + ny ) ) | 0
14 | n ( ny^2 / ( ny^3 + ny^2 ) ) | 0

If a1 is the adjusted expected number of sampled fon NRY types, a0 is the initial expected

number of sampled fon NRY and xi are the three non-won nto´ classes and non-Nso´

ethnicities, the correction is:

a1 = a0 + [(1-a0)*∑((proportion of fon NRY types in non-won nto´ group xi * (17/18)) * proportion of sampled xi males in Nso´)].
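
The correction can be sketched in a few lines of Python. This is only an illustration of the formula as written, not the code used for the thesis: the function name, argument names and example inputs are hypothetical, and a0 is treated as a proportion between 0 and 1, as the (1-a0) term implies.

```python
def adjust_expectation(a0, groups, nso_share=17 / 18):
    """Return a1 = a0 + (1 - a0) * sum(p_i * nso_share * s_i).

    a0        -- initial expected proportion of fon NRY types (0..1)
    groups    -- iterable of (p_i, s_i) pairs: proportion of fon NRY types in
                 non-won nto' group x_i, and proportion of sampled x_i males
    nso_share -- scaling for unions with non-Nso' men (17/18 in the text)
    """
    inflow = sum(p * nso_share * s for p, s in groups)
    return a0 + (1.0 - a0) * inflow


# Hypothetical inputs, for illustration only (not values from the thesis).
example_groups = [(0.20, 0.50), (0.10, 0.30), (0.00, 0.20)]
print(round(adjust_expectation(0.40, example_groups), 4))
```

Because the added term is weighted by (1-a0), the adjustment can only raise the expectation, and it shrinks as the initial expectation approaches 1.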

Applying this correction for Royal Social Status Rule A, a Pearson’s Chi Square test

demonstrates greater support for the rule, with neither the new upper limit (upper limit

expectation of 52.8%: P = 0.815, X2 = 0.055, df = 1) nor lower limit (lower limit

expectation of 41.0%: P = 0.209, X2 = 1.58, df = 1) significantly different from the

observed data. For Royal Social Status Rule B both the adjusted upper limit (upper limit

expectation of 32.8%: P = 0.040, X2 = 4.23, df = 1) and lower limit (lower limit

expectation of 12.3%: P < 0.0001, X2 = 31.22, df = 1) expectations deviated significantly

from the observed data. Therefore Royal Social Status Rule A still appears the more

likely scenario for won nto´ social practice. The conclusions drawn from analysis of

Royal Social Status Rules C and D are unchanged despite applying the correction

described above (Royal Social Status Rule C [upper limit expectation of 59.1%: P =

0.760, X2 = 0.09, df = 1; lower limit expectation of 47.5%: P = 0.494, X2 = 0.47, df = 1],

Royal Social Status Rule D [upper limit expectation of 93.9%: P < 0.0001, X2 = 46.20,

df=1; lower limit expectation of 56.4%: P = 0.942, X2 = 0.005, df = 1]).
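
The P values quoted above can be checked against the reported X2 statistics with a short Python sketch. It assumes a two-class Pearson test (fon NRY type versus other) with one degree of freedom. The function names are illustrative, and the observed counts are not reproduced here, so only the step from X2 to P is checked.

```python
import math


def pearson_chi2_two_class(observed_hits, n, expected_prop):
    """Pearson chi-square statistic (df = 1) for observed_hits out of n
    against an expected proportion expected_prop."""
    expected_hits = n * expected_prop
    expected_misses = n - expected_hits
    observed_misses = n - observed_hits
    return ((observed_hits - expected_hits) ** 2 / expected_hits
            + (observed_misses - expected_misses) ** 2 / expected_misses)


def chi2_p_value_df1(chi2):
    """Upper-tail P value of a chi-square statistic with one degree of freedom.

    For df = 1 the statistic is the square of a standard normal deviate,
    so P = erfc(sqrt(chi2 / 2)).
    """
    return math.erfc(math.sqrt(chi2 / 2.0))


if __name__ == "__main__":
    # X2 values reported in the text for Rules A and B.
    for chi2 in (0.055, 1.58, 4.23):
        print(f"X2 = {chi2}: P = {chi2_p_value_df1(chi2):.3f}")
```

Using `math.erfc` keeps the sketch free of third-party dependencies; a statistics library would be the usual choice for tests with more than one degree of freedom.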

Chapter 3:

It All Depends On The Scale: Little Sex-Specific Genetic Variation In The Presence Of Substantial Language Variation In Peoples Of The Cross River Region Of Nigeria Assessed Within The Wider Context Of West Central Africa.

3.1. Introduction

There have been many studies seeking to compare genetic and language differences

among peoples. Some have demonstrated genetic continuity across linguistic boundaries

(Rosser et al. 2000; Zegura et al. 2004), while others have concluded that language

boundaries are associated with increased genetic distances (Karafet et al. 2002; Wood et

al. 2005). Studies of possible associations between languages and sex-specific genetic

systems in sub-Saharan Africa however are few in number and limited in scale. This

study attempts to address this gap by examining genetic variation in the paternally

inherited non-recombining portion of the Y-chromosome (NRY) and the maternally

inherited mitochondrial DNA (mtDNA) in multiple groups from West Central Africa at

various levels of identity (clan, self-declared ethnic identity, first language affiliation) and

geographic separation (Cross River region of Nigeria, Grassfields of Cameroon and

Ghana).

While the sex-specific genetic systems represent what are in effect just two loci, each

comprised of linked markers, they have a considerable advantage in population studies

because of their smaller effective population size (Jobling, Hurles & Tyler-Smith 2004, p.

134), leading to increased rates of genetic drift and thus population differentiation. This is

useful when seeking to identify evidence of isolation among communities. The frequencies

of well characterised NRY and mtDNA types are unlikely both to be statistically similar

unless the groups share a recent common origin or there has been substantial gene flow

among them. This study examines a large number of groups that speak clearly

distinguishable languages, with estimated times since separation ranging from 500

(Connell & Maison 1994) to several thousand years, and asks whether the language

separation has taken place, or at least has been maintained, in the presence of

substantial male and/or female gene flow, as evidenced by sex-specific genetic systems.

In the course of doing so possible associations between geographic and genetic distance

are also examined. Finally, since one of the groups (the Efik) has a long-standing claim of

an ancient origin in the Palestine of antiquity, the sex-specific genetic systems are

examined for evidence supporting this claim.

First the regions, peoples and languages included in this study are described.

3.1.1. A brief description of the Peoples and Languages of the Cross River region

The Cross River region (named after the river that passes through it) is situated in the

extreme southeast of Nigeria and adjacent parts of Cameroon.

The physical geography is varied with mountains, rainforest and an alluvial plain at its

estuary at the Atlantic.

Linguistically and culturally, it is one of the most diverse regions of the world (as

assessed by the number of languages given the size of the region). It is home to more than

60 distinct languages. Early European missionaries reported that every village had its own

language. These languages can be classified into a number of distinct language groups.

The most notable, though not the only, groupings are ‘Cross River’ and ‘Bantoid’, both of

which include many subgroups.

The land to the north east of the Cross River region (Figure 3.1) is now generally

accepted as the area from which the expansion of the Bantu-speaking peoples began

approximately three to five thousand years ago (Greenberg 1955; Vansina 1990; Blench

2006). Bantu languages are now spoken throughout most of sub-Saharan Africa south of

the equator.

Figure 3.1: Map showing the locations in West Central Africa from which samples were

collected. Political borders are shown by black lines. Colour bar

indicates elevation in metres.

The Cross River region was a major source of slaves during the Atlantic slave trade, with

Calabar, at the confluence of the Cross and Calabar Rivers, becoming both the region’s

principal urban centre and one of the trade’s most active ports. The Efik, the most

numerous group in the town, played a significant role in the trade, often as intermediaries

between Europeans on the coast and inland groups.

The resident peoples have been characterised as ‘Syncretic Christian’; that is to say,

nominally Christian but retaining aspects of traditional animist worship. In general their

social structures are ‘acephalous’ (lacking a fixed, centralised political structure),

although the Efik do have a ‘king’, or paramount ruler, the Obong. Nevertheless the

power and role of the Obong are not equivalent to, say, either those of the fon in Grassfields

societies or the Oba in former African kingdoms such as those of Benin or of the Yoruba.

Interestingly the more centralised system of the Efik developed during the rise of Calabar

and is a direct result of close contact with their British trading partners (Latham 1973;

Noah 1980).

Brief details of the cultural practices of the Cross River peoples included in this study are

given in Table 3.1 (information on Lower Cross groups: Anaang, Efik, Ibibio, and Oro is

from Forde and Jones (1950), Udo (1983) and Uya (1984); Efut: from Connell

(1983); Ejagham: Talbot (1912); Igbo: Basden (1966) and Forde and Jones (1950)). In all

cases information has been supplemented from Connell’s unpublished field notes (1983).

Because of its linguistic and cultural diversity, proximity to the Bantu homeland and role

in the slave trade, the peoples of the Cross River are of considerable interest to linguists

(especially those concerned with historical linguistics and the nature of language contact),

historians and other researchers interested in the mechanisms and consequences of

population movements. They also provide an opportunity to examine possible

associations of language and genetic difference on a fine scale. In this study data from

1113 residents of the Cross River region speaking six languages as their mother tongue

and drawn from 24 clans and 20 locations were analysed.

Table 3.1: Summary of cultural practices of Cross River ethnic groups utilised in this study.

Ethnic group | Marriage practice | Patrilocal/Matrilocal | Patrilineal/Matrilineal | Religion | Ruling structure
Anaang | exogamous | patrilocal | patrilineal | syncretic Christian | acephalous
Efik | exogamous | patrilocal | patrilineal | syncretic Christian | centralised, paramount ruler
Efut | exogamous | patrilocal | patrilineal | syncretic |
Ejagham | exogamous | patrilocal | patrilineal | syncretic |
Ibibio | exogamous | patrilocal | patrilineal | syncretic |
Igbo | exogamous | patrilocal | patrilineal | syncretic |
Oron | exogamous | patrilocal | patrilineal | syncretic |

In the past most of the extensive variety of languages in the region were categorised as

“Semi-Bantu”, a linguistic designation considered obsolete since the work of Greenberg

(1963). Currently the accepted classification (subject to some dispute [14]) identifies

‘Bantoid’ and ‘Cross River’ as the two most important groups of languages found in the

Cross River region. As branches of Benue-Congo (one of the main families within the

large and diverse Niger-Congo phylum) they share a common parent language (though one

dating back at least 6000 years) (Figure 3.2).

Figure 3.2: Broad relationships of the differing language groups used or

described in this chapter based on Williamson and Blench (2000). Branch

lengths are not informative.

[14] See Connell (1994), Connell (1998) and Williamson and Blench (2000) for further details. Williamson &

Blench (2000) argue that Cross River and Bantoid are sufficiently similar to be grouped together while still

falling under Benue-Congo.

Cross River is divided into Bendi and Delta Cross, with the latter comprised of four

subgroups: Central Delta, Ogoni, Upper Cross and Lower Cross. The best studied of

these, from a comparative perspective, is Lower Cross. Lower Cross itself is comprised of

some twenty languages (Connell 1994; Connell & Maison 1994) including Anaang, Efik,

Ibibio and Oron; and is spoken over most of the lower region of the Cross River basin –

the alluvial plain of its geography – and consequently includes Calabar. Dialect variation

exists within some of the Lower Cross languages, particularly Ibibio and Anaang, and this

variation has sometimes been claimed to correspond to clan groupings.

Details of the relationships among the four Delta Cross subgroups are not fully

understood; indeed, further work may lead to a reassessment of this grouping. Similarly,

solid evidence to unite Bendi with Delta Cross is at present lacking and some scholars,

most recently Blench (2001), are more comfortable placing it within Bantoid.

Evidence from comparative linguistics, oral tradition (Connell 1994; Connell & Maison

1994) and documentary material (Ardener 1968; Latham 1973) indicates that the Lower

Cross languages together with the people that speak them are in the process of separating

and spatially dispersing. Connell & Maison (1994) suggest the major dispersal, with

perhaps one or two earlier exceptions, began approximately 500-600 years ago. It appears

to have consisted of a general movement towards the coast from an inland-situated

homeland. Some of the available oral traditions speak of these migrations (see below),

explaining them as a response to the arrival of Europeans and a search for increased trade

opportunities. Latham (1973), citing reports of Europeans from this period, concludes that

the site that has since become Calabar was only settled after the first contact with

Europeans.

The component groupings within Bantoid, on a broad sweep, are shown in Figure 3.2.

The primary branching is of North and South Bantoid. North Bantoid is comprised of

Mambiloid, and more controversially Dakoid and Tikar [15]. South Bantoid comprises

numerous subgroups, including Bantu (made up of several hundred languages). Those in

proximity to the Cross River region include Tivoid, Grassfields, Beboid, Nyang and

Ekoid. Of Bantu itself, the Northwest group of languages (also known as ‘A’ Bantu in

Guthrie’s (1967) alpha-numeric nomenclature) is found in and adjacent to the Cross River

region.

[15] Boyd (1996) questions the inclusion of Dakoid, while Connell (2000) suggests the existence of the

division itself is questionable.

A further refinement of the linguistic classification, now widely accepted, divides Benue-

Congo into East and West branches (EB-C and WB-C). Cross River and Bantoid are both

part of EB-C. Another language grouping found partly in, but primarily to the west of the

Cross River region, is Igboid, which consists mainly of a range of Igbo dialects. Despite

the geographical proximity of Igboland to the Cross River basin, Igboid languages are

classified as WB-C, which reflects the considerable time (some thousands of years) since

the existence of a common parent (viz. Proto Benue-Congo) of Igbo on one hand and

Cross River and Bantoid on the other.

The oral traditions of the different Lower Cross groups have been examined in some

detail in Connell & Maison (1994). While movements of peoples in various directions

are indicated, they, in general, relate an expansionary movement southward in search of

trade opportunities with the newly arrived Europeans. A village or region named ‘Ibom’

is often suggested as a point of origin. There is today a village called Ibom near the Igbo

town of Arochukwu which is situated in the northwest of the Cross River area, in the

border region between Igbo and Ibibio territories. An alternative account suggests

dispersal from the Ibom Arochukwu area was a response to conflict with the expanding

Igbo speaking population. It should also be noted that most Lower Cross traditions deal

with the relatively recent past in comparison to the oral traditions discussed.

Several of the Lower Cross groups also have diverging traditions, for example of having

migrated from Cameroon. The Efik, in particular, have a variety of conflicting traditions,

which are summarised in Noah (1980). Among them is a claim of origin in, and migration

from, ancient Palestine (Akak 1986). This story tells of a migration from the Middle East

via Sudan, Chad and Benin, with stops among the Igbo and then Ibibio, and the founding

of Calabar. However some versions of the account claim no more than that the Efik are of

Igbo origin. Most of the Efik traditions have as a common thread a final stop among the

Ibibio, specifically in the Uruan area. The Hart Commission (Hart 1964) investigated

various Efik claims and essentially concluded that they were without foundation. The

report concluded:

“The last tribes among whom the Efiks might have lived were the Ibibio. If

they had lived among the Ibo [sic.] and were in fact Ibo [sic.] in origin,

there is no means ready to hand to determine the truth or falsity of this

claim of origin.”

The Efut, another group found within the boundaries of Calabar, claim an origin in a

Bantu-speaking area to the east of the Cross River estuary in Cameroon. It is claimed by

some that their language was once Londo, a Northwest Bantu language (A11 according to

the nomenclature of Guthrie (1967)) which is still spoken in Cameroon (Connell 1983;

Thompson 1983), but they have since adopted Efik as their primary tongue.

The oral traditions of the Ejagham, also known as Ekoi, are less well documented. The

main body of the Ejagham population is to be found in the Upper Cross River basin and

extends southward. One Ejagham subgroup (also known as ‘Qua’ or ‘Ekin’) occupies a

part of Calabar, and claims to have arrived there before the Efik, having migrated

southward from the main Ejagham area (Noah 1980). This claim is supported by the

practice, continued to the present day, in which the Efik pay tribute to the Qua (Hart

1964).

The Igbo constitute the third largest ethnic group in Nigeria, numbering approximately

18,000,000 (Ethnologue 2005). They occupy much of the southeast of the country,

forming an arc around the Cross River region. The Igbo are well known as traders and

merchants and are found in every major urban area in Nigeria, including a sizeable

population in Calabar. Many Igbo were brought to Calabar during the era of the slave

trade. In more recent times, many others have settled and established businesses.

The Igbo are a relatively diverse group and from the linguistic standpoint comprise over

20 different lects (Manfredi 1989). Their oral traditions broadly speak of a north to south

expansion (Forde & Jones 1950). This expansion may still be in progress since only in

relatively recent times has a sizeable Igbo population settled in the coastal areas of the

Niger Delta.

3.1.2. Genetics and Language

Exploration of correlations between differences among the languages of peoples and

variation in their sex-specific genetic systems has been encouraged by the representation

of both languages and genetic systems by bifurcating and multi-furcating trees. Inferences

drawn from such trees, notwithstanding that the trees themselves are frequently gross

approximations to the actual demographic processes involved, have provided interesting

insights into human history and social behaviour.

Most studies to date of the correlation between genetics and language have concentrated

on the relationship over a broad canvas, often at a continental or intercontinental scale,

with considerable emphasis on any link between long-range language dispersals and the

spread of early farmers. Rosser et al. (2000) for example found the distribution of NRY in

Europe to be associated primarily with geography rather than language and suggested that

the current European genetic landscape has been greatly influenced by the expansion of

farmers from the Near East during the Neolithic. In contrast Wood et al. (2005) found a

correlation in Africa between genetic and linguistic distances when analysing NRY, and

to a lesser extent mtDNA, with differences between Bantu-speaking and non-Bantu-

speaking groups having an especially large influence on the correlation. Other studies on

the peoples of the Americas (Zegura et al. 2004), Pacific Islands (Hurles et al. 2002) and

Siberia (Karafet et al. 2002) have also had varying degrees of success in attempting to

establish a linguistic/genetics link. More recent work has begun to examine, and find,

relationships between linguistics and DNA at a finer scale. (See for example the study of

Lansing et al. (2007) on the Sumba populations of eastern Indonesia. This found a

correlation between NRY frequencies and the level of influence of incoming farmers on

the languages of different islands.)

An advantage of analysing NRY and mtDNA in fine-scale studies where peoples are in

close geographic proximity is that both systems, being effectively single loci and of

smaller effective population size than the autosomal system, are more prone to drift.

Although it has not yet been conclusively demonstrated, given a sufficient battery of

markers (say for the NRY six microsatellites and for mtDNA 350 nucleotides of the

HVR-I region) and sufficiently large sample sizes (~50), in the absence of inter-group

gene flow or recent common origin it is likely that two groups will have significantly

different distributions of either NRY types, mtDNA types or both (Nasidze et al. 2004;

Thomas et al. 2007; Trovoada et al. 2007; Chaubey et al. 2007; Cox 2007). The NRY

frequently demonstrates a greater degree of population structuring than do other systems,

which is likely due to the practice of patrilocality (Seielstad, Minch & Cavalli-Sforza

1998). It is of course important to appreciate that failure to detect dissimilarity is not to

have established identity. Other studies (see for example Chapter 2 in this thesis) have

revealed that susceptibility to drift can lead to substantial differences in the distribution of

NRY types even among classes and caste-like clans of the same ethnic identity.

Prior to this study the variation in ethnic identities, cultural practices, oral histories and

languages of the peoples of the Cross River was well known with many tongues believed

to have separated hundreds, and in some cases thousands, of years ago. It is interesting

therefore to examine whether patterns of distribution of differences in sex-specific genetic

systems among the groups are similar to those suggested by the linguistic data. The

absence of detectable differences would on the other hand suggest either that the

relationships postulated by the linguistic analysis do not reflect reality or,

alternatively, that languages, cultural practices and oral histories have all been maintained in the

face of extensive gene flow.

3.1.3. Expectations of the distribution of NRY and mtDNA variation in the Cross River

region

In this study the NRY and mtDNA in multiple well characterised groups in the

linguistically diverse Cross River region were surveyed in what is the most densely

sampled and well defined sub-Saharan African dataset collected to date from a localised

geographic area. Groups speaking six different Benue-Congo languages known to be

predominant in the Cross River region were included: Anaang, Efik, Ejagham, Igbo,

Ibibio, and Oron, and samples were collected at multiple locations and at various levels of

ethnic identity (Table 3.2).

The principal aim was to establish whether there had been substantial inter-language

group gene flow in the Cross River region, an analysis for which this particular dataset was

well suited. Crude expectations of the level of gene flow between different language

groups were generated based on sociological data that were collected from each

individual who would subsequently be analysed for NRY and mtDNA genetic markers as

part of this study.

Of the 1113 males analysed in this study, 918 had fathers that spoke as their first

language one of the six languages described in the paragraph above. Of these 918, 88.2%

had mothers who spoke the same language as their first language. In the same manner 887

of the Cross River samples had mothers that spoke one of the six languages as their first

language, 89.4% of whom had fathers that spoke the same language as their first language

(see Table 3.3). While in sociological-anthropological analysis it may appear that

language is a strong factor in mate choice, in the context of population genetic theory

these figures equate to a high migration rate among language groups (treating each

language group as a distinct population). Under a very crude Wright Island model with

‘islands’ of at least 1000 individuals this migration rate of 10% would, given sufficient

time, give a Fixation Index of at most 0.002, a very low value that suggests a substantial

amount of gene flow between ‘islands’.
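
The order of magnitude of this figure can be illustrated with the classic equilibrium approximation for Wright's island model, F_ST ≈ 1 / (4Nm + 1). The text does not state which form of the model was used, so this formula and the function below are assumptions made for illustration, not the thesis's actual calculation.

```python
def island_model_fst(n_individuals, migration_rate):
    """Approximate equilibrium F_ST under Wright's island model.

    Uses the classic approximation F_ST ~ 1 / (4 * N * m + 1), where N is the
    size of each island and m the per-generation migration rate. Treating
    these as the relevant parameters is an assumption of this sketch.
    """
    return 1.0 / (4.0 * n_individuals * migration_rate + 1.0)


if __name__ == "__main__":
    # Islands of 1000 individuals with a 10% migration rate, as in the text.
    print(round(island_model_fst(1000, 0.10), 4))
    # For comparison, a much lower migration rate of 0.1%.
    print(round(island_model_fst(1000, 0.001), 4))
```

Under this approximation a 10% migration rate between islands of 1000 individuals gives a value of roughly 0.0025, the same order of magnitude as the figure quoted above, and larger islands give still lower values.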

However the sociological information on inter-group gene flow is based on data from

only the last two generations before present (samples were collected from adult males of a

wide range of ages) while the Fixation Index referred to is based on a model that assumes

a substantially longer time period. If substantial genetic structuring was observed among

Cross River language groups this would suggest that the practice of high male and/or

female gene flow is a recent phenomenon while an overall homogeneous NRY and/or

mtDNA distribution would suggest that gene flow has been maintained over a long period

despite some apparently very important cultural differences among peoples of the region.

The Cross River dataset also allowed the investigation, in a more limited way and without

any preconceived expectations, of whether, in the small geographical area of the Cross

River region, differences at other, varying, levels of grouping could be observed.

Specifically these questions were posed: a) are clan communities collected from different

locations distinguishable? b) are clans of the same language group collected from the

same location distinguishable? c) are different language groups collected from the same

location distinguishable? d) are representatives of the same language group collected

from different locations distinguishable?

The analysis was then extended to interpret the results within the broader geographical

context of West Central Africa by analysing NRY and mtDNA from groups resident in

Cameroon and Ghana (see Figure 3.1 and Table 3.2). Examination of the sociological

data showed that there were no instances where an individual from the Cross River,

Ghanaian or Cameroonian datasets had one parent from one of the three groups and

another parent from a different member of the three groups. Under the same Wright

Island model as previously, even if allowing for one migrant every generation (0.1%), a

Fixation Index of around 0.2 would be expected, a value that is consistent with substantial

inter-group isolation. Therefore observable differences among the NRY and mtDNA

profiles of these three regions would be expected.

Table 3.2: Nigerian Cross River sample collection details.

Code | Language | Place collected / Clan or secondary affiliation | Latitude | Longitude | Total n
SOUTH EAST NIGERIA
AN-EA | Annang | Afaha Esang, Ikot Ubom / Ediene Abak | 5.050 | 7.717 | 26
AN-AO | Annang | Afaha Esang, Ikot Ubom / Afaha Obong | 5.050 | 7.717 | 37
AN-IO | Annang | Abak, Ikot Obioma, Ikot Ekpene, Ukanafun | 4.992 | 7.758 | 47
EF-EE | Efik | Eniong, Atan Ono Yom / Efut | 5.167 | 7.983 | 50
EF-INE | Efik | Ikot Nakanda, Ikot Ene / Efut | 4.908 | 8.442 | 48
EF-OEU | Efik | Oyo Efam, Ikot Abasi Obori / Uwanse | 4.950 | 8.317 | 50
EK-CA | Ejagham | Calabar Akampka | 4.950 | 8.317 | 18
EK-CC | Ejagham | Calabar Calabar | 4.950 | 8.317 | 29
EK-CI | Ejagham | Calabar Ikom | 4.950 | 8.317 | 40
EK-NA | Ejagham | Netim Akampka | 5.350 | 8.350 | 51
IB-ANMWN | Ibibio | Afaha Nsit, Mbiokporo / Western Nsit | 4.833 | 7.900 | 38
IB-EAEEUAE | Ibibio | Etebe Afaha Eket, Ekpene Ukpa / Afaha Eket | 4.717 | 7.867 | 50
IB-EUE | Ibibio | Ette Ukpom Ette | 4.620 | 7.650 | 50
IB-IAAUA | Ibibio | Ikot Akpan, Afaha Ubiom / Awa | 4.690 | 7.815 | 28
IB-IEINOI | Ibibio | Ikot Essien, Ikot Ntu / Oku-Iboku | 5.133 | 7.933 | 50
IB-IMIEI | Ibibio | Ikot Mbonde, Ikot Ekang / Itam | 5.042 | 7.842 | 50
IB-IOINO | Ibibio | Ikot Oku, Ikot Ntuenoku / Oku | 5.100 | 7.967 | 50
IB-MNENN | Ibibio | Mkpok Ndon Eyo Nnung Ndem | 4.633 | 7.850 | 50
IB-NEI | Ibibio | Ndiya Edienne Ikono | 4.783 | 7.883 | 50
IB-OII | Ibibio | Obong Itam Itam | 5.133 | 7.967 | 50
IB-ONMNI | Ibibio | Onoh, Ntan Mbat Ntan Ibiono | 5.233 | 7.933 | 50
IG-C | Igbo | Calabar | 4.950 | 8.317 | 100
OR-AO | Oron | Oron Afaha Okpo | 4.833 | 8.233 | 28
OR-ENEEAU | Oron | Eyo Nsik, Eyo Ekpe / Afaha Ukwong | 4.750 | 8.250 | 73
IG-E | Igbo | Enugu | 6.433 | 7.483 | 57
IG-N | Igbo | Nenwe | 6.117 | 7.517 | 52
CAMEROON
CA-BT | Tikar | Bankim | 6.083 | 11.500 | 34
CA-FB | Bamun | Foumban | 5.717 | 10.917 | 117
CA-WA | Aghem | Wum | 6.383 | 10.067 | 118
GHANA
GH-AEW | Akan | Enchi | 5.817 | -2.817 | 21
GH-AKE | Akan | Kibi | 6.167 | -0.550 | 51
GH-ASWW | Akan | Sefwi-Wiawso | 6.333 | -2.267 | 22
GH-FEWR | Akan | Enchi | 5.817 | -2.817 | 61
GH-EHVR | Ewe | Ho | 6.600 | 0.467 | 88

Table 3.3: First languages of parents of Cross River region samples utilised in this study.

(a) Samples grouped by father's first language (the 6 Cross River languages analysed in this study)

Father's first language | Mother's first language of same samples | Number of samples
Annang | Annang | 57
Annang | Ibibio | 8
Annang | Efik | 3
Annang | Bekwara | 1
Annang Total | | 69
Efik | Efik | 101
Efik | Ibibio | 5
Efik | Annang | 1
Efik | Bekwara | 1
Efik | English | 1
Efik | Igbo | 1
Efik | Oron | 1
Efik | Tiv | 1
Efik | Ugep | 1
Efik | Umon | 1
Efik | Yoruba | 1
Efik Total | | 115
Ejagham | Ejagham | 115
Ejagham | Efik | 6
Ejagham | Ibibio | 5
Ejagham | English | 2
Ejagham | Mbembe | 2
Ejagham | Igbo | 1
Ejagham | Umon | 1
Ejagham Total | | 132
Ibibio | Ibibio | 402
Ibibio | Efik | 21
Ibibio | Igbo | 15
Ibibio | Eket | 6
Ibibio | Ijaw | 5
Ibibio | Annang | 3
Ibibio | Yoruba | 3
Ibibio | Hausa | 1
Ibibio | Nembe | 1
Ibibio | Pidgin | 1
Ibibio Total | | 458
Igbo | Igbo | 85
Igbo | Efik | 1
Igbo | Ibibio | 1
Igbo Total | | 87
Oron | Oron | 17
Oron | Yoruba | 2
Oron Total | | 19
Grand Total | | 880
Proportion of samples where both parents speak the same language fixed on father's first language type

(b) Samples grouped by mother's first language (the 6 Cross River languages analysed in this study)

Mother's first language | Father's first language of same samples | Number of samples
Annang | Annang | 57
Annang | Ibibio | 3
Annang | Efik | 1
Annang Total | | 61
Efik | Efik | 101
Efik | Ibibio | 21
Efik | Ejagham | 6
Efik | Annang | 3
Efik | Abakpa | 1
Efik | Boki | 1
Efik | English | 1
Efik | Igbo | 1
Efik Total | | 135
Ejagham | Ejagham | 115
Ejagham | Nde | 1
Ejagham Total | | 116
Ibibio | Ibibio | 402
Ibibio | Annang | 8
Ibibio | Eket | 8
Ibibio | Efik | 5
Ibibio | Ejagham | 5
Ibibio | English | 4
Ibibio | Pidgin | 3
Ibibio | Igbo | 1
Ibibio Total | | 436
Igbo | Igbo | 85
Igbo | Ibibio | 15
Igbo | Efik | 1
Igbo | Ejagham | 1
Igbo | English | 1
Igbo Total | | 103
Oron | Oron | 17
Oron | Efik | 1
Oron Total | | 18
Grand Total | | 869
Proportion of samples where both parents speak the same language fixed on father's first language type

Finally it was examined whether the NRY and mtDNA genetic data drawn from the Efik

Uwanse sample provided support for an origin in the Palestine of antiquity by comparing

this group to a possible source population (Israeli Arabs/Palestinians) as well as possible

contributing populations that the Efik Uwanse may have met along their proposed route

of migration (Ethiopians, Sudanese, a population from Lake Chad, Igbo speakers and

Ibibio speakers).

3.2.1. Sample Collection Procedure.

Buccal swabs were collected from males over eighteen years old unrelated at the paternal

grandfather level from locations in South East Nigeria as shown in Table 3.2. All buccal

swabs were collected anonymously with informed consent. Sociological data were also

collected from each individual including age, current residence, birthplace, self-declared

cultural identity, first language, second language and (when available) clan affiliation for

the individual as well as similar information on the individual’s father, mother, paternal

grandfather and maternal grandmother. The samples were classified into groups primarily

by first language spoken, then by place of collection and thirdly, when available, by clan

or some other subsidiary criterion. Where collections from a particular group were made

in more than one location (for example the Ediene Abak were collected from two

neighbouring villages: Afaha Esang and Ikot Ubom) and co-ordinate data are available

for both sites, locations are represented by averages.

Buccal swabs and similar sociological data as described above were also collected from

males eighteen years or older unrelated at the paternal grandfather level from the

following groups:

LC-AFαβ: Afade speakers from Lake Chad, Cameroon (n=48)
CA-BTαβ: Tikar speakers from Bankim, Cameroon (n=34)
CA-FBαβ: Bamoun speakers from Foumban, Cameroon (n=117)
CA-WAαβ: Aghem speakers from Wum, Cameroon (n=118)
ET-AAαβ: Amharic speakers from Addis Ababa, Ethiopia (n=72)
GH-AEWαβ: Twi speakers from Enchi, Ghana (n=21)
GH-AKEαβ: Twi speakers from Kibi, Ghana (n=51)
GH-ASWWαβ: speakers from Sefwi Wiawso, Ghana (n=22)
GH-EHVRαβ: Ewe speakers from Ho, Ghana (n=88)
GH-FEWRαβ: Fante speakers from Enchi (n=61)
SU-KHα: Arabic speakers from Khartoum, Sudan (n=75)
SU-KAβ: Sudanese from Kassala (n=75)
IPAα: Israeli Arabs/Palestinians (n=143)

Standard phenol-chloroform DNA extractions were performed on all samples (see

Appendix C).

3.2.2. Y-chromosome typing

The NRY of all South East Nigerian samples as well as those samples in groups with the α

notation were typed in the following manner. Standard TCGA kits were used to

characterise six microsatellites (DYS19, DYS388, DYS390, DYS391, DYS392,

DYS393) and eleven biallelic Unique Event Polymorphism (UEP) markers (92R7, M9,

M13, M17, M20, SRY+465, SRY4064, SRY10831, sY81, Tat, YAP), as described by

Thomas et al. (1999). Microsatellite repeat sizes were assigned according to the

nomenclature of Kayser et al. (1997). Where necessary an additional marker, p12f2, was

typed as described by Rosser et al. (2000). NRY Haplogroups were defined by the twelve

UEP markers according to the nomenclature proposed by the Y-chromosome Consortium

(2002) (see Figure 2.4). See Chapter 2 for a discussion on the choice of UEP and

microsatellite markers used.

These multiplex UEP/ microsatellite kits have already been shown to be reliable under a

wide range of conditions, consistently giving similar signal intensities across all UEPs

and microsatellites within each kit (Thomas, Bradman & Flinn 1999). Therefore any

multiplex runs that showed at least one UEP or microsatellite peak of substantially low

intensity were repeated. Any samples that gave UEP-1 and UEP-2 results that were

incompatible with the known phylogenetic tree for the NRY were also retyped for both kits.

Microsatellite results were also analysed for outliers and homoplasy amongst UEP

haplogroups and retyped for confirmation.

3.2.3. mtDNA typing

The mtDNA HVS-1 region of all South East Nigerian samples as well as those samples in

groups with the β notation was sequenced as described by Thomas et al. (2002) except

that primers conL1-mod, conL2 and conH3 were replaced by conL849 (CTA TCT CCC

TAA TTG AAA ACA AAA TA), conL884 (TGT CCT TGT AGT ATA A) and conHmt3

(CCA GAT GTC GGA TAC AGT TC) respectively. HVS-1 Variable Site Only (VSO)

haplotypes were determined for all samples from South East Nigeria by comparing

sequence data covering nucleotides 16020-16400 with the Cambridge Reference

Sequence (Anderson et al. 1981). Haplotypes were defined by base changes and

nucleotide positions where substitutions, insertions or deletions occurred. Tentative

mtDNA Africa-specific haplogroup classification was based on the scheme of Salas et al.

(2004). HVS-1 Variable Site Only (VSO) haplotypes were also determined for all

samples from groups with the β notation with sequence data covering nucleotides 16023-

16380. South East Nigerian HVS-1 coverage was reduced to this range during

comparisons with these groups. In addition the IPA2β (Israeli Arabs/Palestinians) mtDNA

dataset was taken from Richards et al. (2000).

Each sample’s chromatogram was manually inspected for generally high levels of

background noise across its whole length of sequence. The 5ʹ and 3ʹ ends of raw

chromatograms were trimmed until at least 10 out of 15 bases at these ends had

confidence scores above 25%. The ends were then trimmed further by manually

inspecting the sequence. For each 96-sample sequencing run each position with a

proposed SNP, insert, deletion or ambiguous position was examined manually. All

samples with any ambiguous sites after manual curation were sequenced again. In

addition sequencing of samples was repeated when the forward and reverse sequences did

not match.
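The end-trimming rule described above can be sketched as follows. This is a minimal illustration only, assuming per-base confidence scores are available as a list of numbers on the same scale as the threshold; the function name `trim_ends` and its parameters are hypothetical and are not taken from the sequencing software actually used.

```python
def trim_ends(scores, window=15, min_good=10, threshold=25):
    """Return (start, end) so that scores[start:end] is the retained region.

    Bases are dropped one at a time from each end until the `window` bases
    at that end contain at least `min_good` scores above `threshold`.
    """
    def good(block):
        return sum(s > threshold for s in block) >= min_good

    start, end = 0, len(scores)
    # Trim the 5' end.
    while end - start >= window and not good(scores[start:start + window]):
        start += 1
    # Trim the 3' end.
    while end - start >= window and not good(scores[end - window:end]):
        end -= 1
    if end - start < window:
        return (0, 0)  # nothing of acceptable quality remains
    return (start, end)
```

In this sketch a read shorter than one window is rejected outright, which is a simplifying choice; the subsequent manual inspection described in the text is not modelled.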

Genetic diversity, h, (the probability of randomly sampling two different haplotypes in a

population) and its standard error were estimated using the unbiased formulae of Nei (1987).
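As an illustration of the diversity estimate, the sketch below computes h = n/(n-1) × (1 - Σp²) from haplotype counts. The sampling-variance expression is the form commonly quoted for haploid data and should be checked against Nei (1987) before use; the function name `gene_diversity` is hypothetical. The example counts (5 and 17 of 22) are an assumption inferred from the GH-ASWW proportions of 0.23 and 0.77 in Table 3.5.

```python
from math import sqrt

def gene_diversity(counts):
    """Return (h, standard error) for a list of haplotype counts."""
    n = sum(counts)
    p = [c / n for c in counts]
    s2 = sum(x ** 2 for x in p)  # sum of squared frequencies
    s3 = sum(x ** 3 for x in p)  # sum of cubed frequencies
    h = n / (n - 1) * (1 - s2)
    # Sampling variance of h (assumed form, see lead-in).
    var = 2 / (n * (n - 1)) * (2 * (n - 2) * (s3 - s2 ** 2) + s2 - s2 ** 2)
    return h, sqrt(max(var, 0.0))

# Assumed counts for a sample of 22 with two haplogroups.
print(round(gene_diversity([5, 17])[0], 3))  # 0.368
```

The value of 0.368 agrees with the upper end of the UEP gene diversity range reported for the Ghanaian groups.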

Genetic differences between pairs of populations when individuals in populations were

described by a) NRY UEP haplogroups, b) combined NRY UEP haplogroup and six

microsatellite haplotypes (UEP+MS) or c) mtDNA HVS-1 VSO haplotypes were

assessed using an Exact Test of Pairwise Population Differentiation (ETPD) with 10,000

Markov steps (Raymond & Rousset 1995; Goudet et al. 1996). This test is analogous to a

Fisher’s Exact test (Lee et al. 2004) but the size of the contingency table is extended to

the number of populations being compared (two in a pairwise population comparison, two

or greater in a global test) by the total number of different haplotypes present. Due to the

complexity introduced by the sheer number of extra rows and columns a null distribution

of tables to test against the observed data is generated using a random walk via a Markov

chain rather than comparison to some predefined distribution such as the hypergeometric

distribution.
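The logic of testing an observed populations-by-haplotypes table against a null distribution of tables can be illustrated with a simpler stand-in. The sketch below is not the Markov chain random walk of Raymond & Rousset (1995); instead it draws tables with fixed margins by shuffling haplotypes among individuals (a plain Monte Carlo approximation) and reports the proportion of tables that are no more probable than the observed one. All function names are hypothetical.

```python
import random
from collections import Counter
from math import lgamma

def table_log_weight(haps, pops):
    """Log of 1 / prod(n_ij!), proportional to the table probability
    when row and column totals are fixed."""
    cells = Counter(zip(pops, haps))
    return -sum(lgamma(c + 1) for c in cells.values())

def differentiation_p_value(pop_a, pop_b, n_perm=10000, seed=1):
    """Monte Carlo P-value for differentiation between two samples,
    each given as a list of haplotype labels."""
    rng = random.Random(seed)
    haps = list(pop_a) + list(pop_b)
    pops = [0] * len(pop_a) + [1] * len(pop_b)
    observed = table_log_weight(haps, pops)
    as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(haps)  # reassign haplotypes, keeping sample sizes fixed
        if table_log_weight(haps, pops) <= observed + 1e-9:
            as_extreme += 1
    return as_extreme / n_perm
```

Two samples with identical haplotype composition give a P-value near 1, whereas two samples sharing no haplotypes give a P-value near 0.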

Population Genetic Structure was estimated using Hierarchical Analysis of Molecular

Variance (AMOVA) (Excoffier, Smouse & Quattro 1992) based on a particular mutation

model (which allowed the evolutionary distance between pairs of haplotypes to be taken

into account) to generate a single Fixation Index statistic, FST, when a simple structure of

populations within a single group was defined, or three Fixation Indices, FST (the within-

population Fixation Index), FSC (the among-populations within-group Fixation Index) and

FCT (the among-group Fixation Index), when a more complex structure of populations

within multiple groups was defined. Significance of Fixation Indices was assessed by

randomly permuting individuals (given that only haploid systems are considered) among

populations or groups of populations, depending on the Fixation Index being tested. After

every round of permutations, of which 10,000 were performed, Fixation Indices were

recalculated to create a null distribution.
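For the simple case of populations within a single group, the calculation and its permutation test can be sketched as below. This is a minimal illustration of the sums-of-squared-deviations approach of Excoffier, Smouse & Quattro (1992), assuming a matrix of squared distances between individuals is already available; it is not the Arlequin implementation, and the function names and the P-value convention are assumptions.

```python
import random

def phi_st(dist, labels):
    """Single-level AMOVA fixation index from squared distances
    between individuals and their population labels."""
    n = len(labels)
    groups = {}
    for i, lab in enumerate(labels):
        groups.setdefault(lab, []).append(i)
    k = len(groups)
    ssd_total = sum(dist[i][j] for i in range(n) for j in range(n)) / (2 * n)
    ssd_within = sum(
        sum(dist[i][j] for i in idx for j in idx) / (2 * len(idx))
        for idx in groups.values()
    )
    msd_among = (ssd_total - ssd_within) / (k - 1)
    msd_within = ssd_within / (n - k)
    # Average sample size corrected for unequal population sizes.
    n0 = (n - sum(len(idx) ** 2 for idx in groups.values()) / n) / (k - 1)
    var_among = (msd_among - msd_within) / n0
    total = var_among + msd_within
    return var_among / total if total > 0 else 0.0

def permutation_p(dist, labels, n_perm=10000, seed=1):
    """Permute individuals among populations to build a null distribution."""
    rng = random.Random(seed)
    observed = phi_st(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if phi_st(dist, shuffled) >= observed - 1e-12:
            hits += 1
    return observed, hits / n_perm
```

With a distance of 0 between identical haplotypes and 1 otherwise, this reduces to a haplotype-frequency-based index; negative values, as seen in Table 3.6, arise when populations are more similar than expected by chance.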

Population pairwise genetic distances were estimated from Analysis of Molecular

Variance φST values (Excoffier, Smouse & Quattro 1992). The genetic distances used

were a) FST (Reynolds, Weir & Cockerham 1983) (when individuals in populations were

described by UEP haplogroups, UEP+MS haplotypes and mtDNA HVS-1 VSO

haplotypes), b) RST (Slatkin 1995) (when NRY were characterised by the six

microsatellites) and c) the Kimura 2-parameter model (which allows different transition

and transversion rates) with a gamma shape parameter of 0.47 (K2) (Kimura 1980) (when

mtDNA was characterised by HVS-1 sequences with gaps removed). Significance of

genetic distances was assessed by permutation of individuals as described above for

testing significance of Fixation Indices. All the above was performed using Arlequin

software (Schneider, Roessli & Excoffier 2000). AMOVA is analogous to a traditional

analysis of variance (ANOVA) (Sokal & Rohlf 1994) except that it takes into account the

degree of difference between haplotypes. In addition all hypotheses are tested using

permutation analysis and so no assumption of a normal distribution is required. However

assumptions of AMOVA include that all samples are independent and randomly chosen,

that mate choice is random and that inbreeding does not occur within the populations.
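The Kimura 2-parameter distance between a pair of sequences, with and without a gamma correction for rate variation among sites, can be sketched as follows. P and Q are the proportions of sites showing transitions and transversions respectively. The gamma-corrected expression is the standard form for this model with shape parameter a; the function name `k2_distance` is hypothetical, and sites carrying gaps or ambiguity codes are simply skipped, in line with the removal of gaps described above.

```python
from math import log, sqrt

TRANSITIONS = {frozenset("AG"), frozenset("CT")}

def k2_distance(seq1, seq2, gamma_a=None):
    """Kimura 2-parameter distance; pass gamma_a (e.g. 0.47) for the
    gamma-corrected version."""
    sites = [(a, b) for a, b in zip(seq1, seq2) if a in "ACGT" and b in "ACGT"]
    n = len(sites)
    diffs = [frozenset(pair) for pair in sites if pair[0] != pair[1]]
    n_transitions = sum(d in TRANSITIONS for d in diffs)
    p = n_transitions / n                 # proportion of transitions
    q = (len(diffs) - n_transitions) / n  # proportion of transversions
    w1, w2 = 1 - 2 * p - q, 1 - 2 * q
    if gamma_a is None:
        return -0.5 * log(w1 * sqrt(w2))
    return gamma_a / 2 * (w1 ** (-1 / gamma_a) + 0.5 * w2 ** (-1 / gamma_a) - 1.5)
```

For a given pair of sequences the gamma-corrected distance is larger than the uncorrected one, because sites evolving quickly are assumed to hide more multiple substitutions.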

It should be noted that on occasion, in instances when individuals were described by

UEP+MS haplotypes, where populations were significantly different at the 1% level

using the ETPD the most frequently observed haplotype from each population (as long as

this haplotype was not the modal haplotype in either or both populations) that was not

shared with the other population was removed to establish whether, given the overall

similarity of Cross River populations, the observed significant difference was capable of

being caused by overrepresentation of just one particular haplotype in each group.

The TMRCA and confidence intervals for the NRY were estimated using Y-time software

(Behar et al. 2003) (URL: http://www.ucl.ac.uk/tcga/software/index.html).

Principal Coordinates Analysis (PCO) (Gower 1966) was performed using the ‘R’

statistical package (www.R-project.org) by implementing the ‘cmdscale’ function found

in the ‘mva’ package on pairwise FST matrices and visualised using MS Excel.
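The classical scaling performed by ‘cmdscale’ can be sketched in a few lines. The version below is an illustrative re-implementation rather than the R code: it double-centres the squared distance matrix and extracts the leading axes by power iteration, which assumes that the dominant eigenvalues are positive and that the fixed starting vector is not orthogonal to the axis being sought. The function name `pco` is hypothetical.

```python
from math import sqrt

def pco(dist, n_axes=2, n_iter=500):
    """Return one list of n_axes coordinates per population."""
    n = len(dist)
    sq = [[-0.5 * dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]
    grand = sum(row) / n
    # Double-centre the matrix (Gower 1966).
    b = [[sq[i][j] - row[i] - row[j] + grand for j in range(n)] for i in range(n)]
    axes = []
    for _ in range(n_axes):
        v = [float(i + 1) for i in range(n)]  # arbitrary fixed start vector
        value = 0.0
        for _ in range(n_iter):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break  # no variation left on this axis
            v = [x / norm for x in w]
            value = sum(v[i] * sum(b[i][j] * v[j] for j in range(n))
                        for i in range(n))
        axes.append([x * sqrt(max(value, 0.0)) for x in v])
        # Deflate so that the next pass finds the next axis.
        b = [[b[i][j] - value * v[i] * v[j] for j in range(n)] for i in range(n)]
    return [list(point) for point in zip(*axes)]
```

For three populations lying on a line at distances 0, 1 and 2, the first axis recovers coordinates of -1, 0 and 1 (up to sign) and the second axis is zero.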

3.2.4.1. Phylogenetic Analysis

NRY UEP+MS haplotype and mtDNA HVS-1 haplotype FST distance matrices were

constructed using Phylip 3.67 package Gendist. The FST genetic distance used in Gendist

was that of Reynolds, Weir, and Cockerham (1983). In addition 1,000 bootstrap replicates

of the observed data were constructed by randomly sampling haplotypes with

replacement within each separate population to generate 1,000 new datasets (source code

available on request from Krishna Veeramah). FST distance matrices for these 1,000

bootstrapped replicates were generated as for the original observed data. Phylogenetic

analysis was performed on these 1,001 distance matrices using the Phylip 3.67 packages

Neighbour and Consense to create a consensus tree with internal node confidence values.
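The within-population resampling step can be illustrated as below. This is not the author's source code, which is available only on request; it is a minimal sketch assuming each population is held as a list of haplotype labels, with the function name `bootstrap_replicates` chosen for illustration.

```python
import random

def bootstrap_replicates(populations, n_rep=1000, seed=1):
    """Yield datasets in which every population is resampled with
    replacement to its original sample size."""
    rng = random.Random(seed)
    for _ in range(n_rep):
        yield {
            name: [rng.choice(haps) for _ in haps]
            for name, haps in populations.items()
        }
```

Each replicate would then be passed through the same distance calculation as the observed data before tree building.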

NRY microsatellite phylogenetic analysis was performed using POPTREE software

written by N. Takezaki. These Neighbour Joining trees were constructed using genetic

distance matrices based on Goldstein et al.’s (1995) (δμ)² pairwise distance measure and

1,000 bootstrap datasets were created for internal node confidence values.
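The (δμ)² measure is the squared difference between the mean repeat numbers of two populations, averaged across loci. A minimal sketch, assuming each haplotype is a tuple of repeat counts in a fixed locus order (the function name is hypothetical and the code is independent of POPTREE):

```python
def delta_mu_squared(pop_a, pop_b):
    """Average over loci of the squared difference in mean repeat number."""
    n_loci = len(pop_a[0])
    total = 0.0
    for locus in range(n_loci):
        mean_a = sum(h[locus] for h in pop_a) / len(pop_a)
        mean_b = sum(h[locus] for h in pop_b) / len(pop_b)
        total += (mean_a - mean_b) ** 2
    return total / n_loci
```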

In order to generate mtDNA K2-based trees using HVS-1 sequence data a genetic

distance matrix was constructed using the average net number of substitutions measure of

Nei (1987) based on a K2 mutation model (all positions with insertions and deletions in

comparison with the reference sequence were removed from the sequence alignment prior

to distance calculations). In addition 100 bootstrap replicates of the observed data were

constructed by randomly sampling entire sequences with replacement from each separate

population and the K2-based genetic distance matrix was recalculated for each replicate

(source code available on request from Krishna Veeramah). Phylogenetic analysis was

performed on these 101 distance matrices using the Phylip 3.67 packages Neighbour and

Consense to create a consensus tree with internal node confidence values.

All trees were visualised using Treeview software (Page 1996).

3.2.4.2. Mantel and Partial Mantel Tests

Mantel and Partial Mantel tests (Sokal & Rohlf 1994) were performed between genetic

distance and both geographic and linguistic distance using the ‘R’ package ‘Vegan’ which

uses the Pearson product-moment method. Significance was assessed by permuting the

rows and columns of the matrices 1,000 times.
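The simple Mantel test can be sketched as below: the Pearson correlation is computed over the lower triangles of the two matrices, and the rows and columns of one matrix are permuted together to build the null distribution. This is an illustrative stand-in for the package routine, not its code; the partial Mantel test is not covered, and the function names and the (hits + 1)/(permutations + 1) P-value convention are assumptions.

```python
import random
from math import sqrt

def _pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def mantel(d1, d2, n_perm=1000, seed=1):
    """Return (r, one-tailed P) for two square distance matrices."""
    rng = random.Random(seed)
    n = len(d1)
    pairs = [(i, j) for i in range(n) for j in range(i)]
    x = [d1[i][j] for i, j in pairs]
    observed = _pearson(x, [d2[i][j] for i, j in pairs])
    order = list(range(n))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(order)  # permute rows and columns of d2 together
        y = [d2[order[i]][order[j]] for i, j in pairs]
        if _pearson(x, y) >= observed - 1e-12:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```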

Geographic distances were Great Circle distances estimated from latitude and longitude

data. Linguistic distances were constructed using the method described below.
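A Great Circle distance can be obtained from two latitude and longitude pairs with the haversine formula, as sketched below. The text does not state which formula or Earth radius was used, so both the haversine form and the mean radius of 6371 km are assumptions, as is the function name.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed value

def great_circle_km(lat1, lon1, lat2, lon2):
    """Great Circle distance in km between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    # min() guards against rounding slightly above 1 for antipodal points.
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(a)))
```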

Lexicostatistic similarity percentages shown in Table 3.4a were compiled using the

following sources: the pairwise values for the Lower Cross languages (Anaang, Efik,

Ibibio and Oron) were taken from Connell & Maison (1994). No lexicostatistic similarity

percentages were available for Ejagham languages in comparison with the other five

Cross River region languages. Therefore data for three other Bantoid languages, Tunen and

Mambila (which represent different branches of Bantoid spoken near the Cameroon-Nigeria

borderland) and Bobangi (a Southern Bantu language spoken in the Democratic

Republic of Congo) were used as surrogates for Ejagham as lexicostatistic similarity

percentages had been calculated for these languages in comparisons between Efik and

Igbo as well as each other by Schadeberg (1986). The pairwise value between Akan and

Ewe is from Schadeberg (1986), with Asante, a particular dialect, representing Akan. The

pairwise value between Ekoid (the Ekoid language in question being Nkim, not

Ejagham) and Mambila is from Piron (1995b). The pairwise value between Aghem and

Tikar is from Piron (1995a). The pairwise comparisons between Tikar and Mambila and

Tikar and Tunen are from Piron (1998). No suitable lexicostatistics were available for

Foumban so its similarity to Aghem (both are Narrow Grassfields groups) was estimated

on the assumption that the similarity is larger than the average similarity between the

three Southern Bantoid languages Tunen, Tikar and Bobangi but smaller than the average

similarity between Oron and the three Lower Cross languages.

An incomplete lexicostatistic distance matrix was then calculated for the six Cross River,

three Cameroonian and two Ghanaian languages used in this study by subtracting the

lexicostatistic similarity percentages from 100% as performed by Weng and Sokal (1995),

with cells containing Ejagham pairwise comparisons found by taking the average

lexicostatistic dissimilarity for the appropriate Tunen, Mambila, Bobangi and Nkim

pairwise comparisons. Missing data in the distance matrix shown in Table 3.4a (indicated

by a question mark) were then estimated using the weighted least-squares approach of

Makarenkov and Lapointe (2004) via the T-Rex software package

(http://www.labunix.uqam.ca/~makarenv/trex.html) to give the linguistic distance matrix

shown in Table 3.4b. The neighbour joining tree generated by this distance matrix (see

Figure 3.3) is of similar structure to that proposed by other sources such as the

Ethnologue (2005).
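The first two steps of this procedure, converting similarity percentages to dissimilarities and averaging over the Ejagham surrogates, can be sketched as follows. The function names are hypothetical and the least-squares estimation of missing cells carried out in T-Rex is not reproduced. The example uses the Efik values for Tunen, Mambila and Bobangi from Table 3.4a, with Nkim unavailable.

```python
def dissimilarity(similarity_percent):
    """Convert a lexicostatistic similarity percentage to a distance."""
    return 100.0 - similarity_percent

def surrogate_distance(similarities):
    """Average dissimilarity over the surrogate languages with data.

    `similarities` maps surrogate language name to a similarity
    percentage, or None where no value is available ('?').
    """
    known = [dissimilarity(s) for s in similarities.values() if s is not None]
    return sum(known) / len(known) if known else None

# Efik row of Table 3.4a for the Ejagham surrogates.
efik_vs_surrogates = {"Tunen": 29, "Mambila": 21, "Bobangi": 25, "Nkim": None}
print(surrogate_distance(efik_vs_surrogates))  # 75.0
```

The corresponding Efik to Ejagham entry in Table 3.4b is 74.7, so the final matrix differs slightly from this direct average.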

3.3. Results

3.3.1. The distribution of NRY variation

3.3.1.1. Cross River region

The twelve typed UEP markers define 14 distinct NRY haplogroups, of which eight were

observed in the Cross River dataset (n=1081). The modal haplogroup was E3a (87%)

using the nomenclature of the Y-chromosome Consortium (The Y Chromosome

Consortium 2002) (see Table 3.5). Gene diversity based on UEP haplogroups for the

entire region was 0.231±0.017 and for the individual clans ranged from 0.067 to 0.378

with a mean of 0.23 and a variance of 0.007; for individual locations it ranged from 0.117

to 0.378 with a mean of 0.25 and a variance of 0.006 and for individual language groups

it ranged from 0.188 to 0.265 with a mean of 0.229 and a variance of 0.0006. In all clans

the E3a haplogroup was modal (mean: 0.87, variance: 0.003, range: 0.77-0.97). There

were seven pairwise differences between clans (assessed using a Pairwise ETPD) at 5%

significance and none at 1% significance (see Supplementary Table 3S.1 for all ETPD

results tables). Furthermore of the seven significant pairwise comparisons none were

significant even at 5% significance when haplotypes were defined by UEP+MS.

Table 3.4a: Lexicostatistic similarity percentages for various Niger-Congo languages. ‘?’ indicates no available data.

         Anaang  Efik    Ibibio  Oron    Tunen   Mambila Bobangi Aghem   Bamun   Nkim    Tikar   Igbo    Asante  Ewe
Anaang   --
Efik     83      --
Ibibio   90      90      --
Oron     70      73      71      --
Tunen    ?       29      ?       ?       --
Mambila  ?       21      ?       ?       34      --
Bobangi  ?       25      ?       ?       40      31      --
Aghem    ?       ?       ?       ?       ?       ?       ?       --
Bamun    ?       ?       ?       ?       ?       ?       ?               --
Nkim     ?       ?       ?       ?       ?       ?       ?       ?       ?
Tikar    ?       ?       ?       ?       20      20      ?       32      ?       34      --
Igbo     ?       24      ?       ?       29      20      23      ?       ?       ?       ?       --
Asante   ?       17      ?       ?       22      19      21      ?       ?       ?       ?       24      --
Ewe      ?       17      ?       ?       18      17      20      ?       ?       ?       ?       26      26      --

Table 3.4b: Lexicostatistic dissimilarity matrix for 6 Cross River languages, 3 Cameroon Grassfields languages and 2 Ghanaian languages.

         Anaang  Ibibio  Efik    Oron    Ejagham Aghem   Bamun   Tikar   Igbo    Akan    Ewe
Anaang   0.0
Ibibio   10.0    0.0
Efik     14.2    12.8    0.0
Oron     29.5    28.1    28.4    0.0
Ejagham  75.9    74.5    74.7    75.0    0.0
Aghem    75.9    74.4    74.7    75.0    68.0    0.0
Bamun    75.9    74.4    74.7    75.0    68.0    49.0    0.0
Tikar    75.9    74.4    74.7    75.0    66.0    68.0    68.0    0.0
Igbo     77.8    76.3    76.6    76.9    75.2    75.2    75.2    75.2    0.0
Akan     83.2    81.7    82.0    82.3    80.6    80.6    80.6    80.6    74.7    0.0
Ewe      83.9    82.4    82.7    82.9    81.3    81.2    81.2    81.2    75.3    74.0    0.0

Figure 3.3: Language network based on distance matrix inferred from

partial lexicostatistic matrix (Table 3.4b).

Gene diversity based on UEP+MS haplotypes for the entire region was 0.937±0.005 and for the

individual clans ranged from 0.882 to 0.966 with a mean of 0.93 and a variance of

0.0005, for locations from 0.913 to 0.966 with a mean of 0.94 and a variance of 0.0002

and for language groups 0.919 to 0.949 with a mean of 0.94 and a variance of 0.0001. As

expected (see Materials and Methods), of the three cases where inter-clan differences

were observed using UEP+MS haplotypes at 1% significance none were maintained even

at the 5% threshold when the most frequently observed unshared haplotype from each

group in a pairwise comparison was removed. Interestingly in all clans but one the

UEP+MS modal haplotype was E3a-15-12-21-10-11-13 (mean: 0.21, variance: 0.003,

range: 0.13-0.32) (see Supplementary Table 3S.2 for all NRY data), which has been

identified as a possible signature type for the expansion of the Bantu-speaking peoples

(Thomas et al. 2000). The one clan in which it was not modal (the Ejagham Akampka

from Calabar (EK-CA)) comprised only 18 samples and its frequency was 0.11 ± SE

0.07. In pairwise comparisons using RST (see Supplementary Table 3S.3 for all genetic

distance results tables), pairwise genetic distances were not significantly different in any

clan comparisons (P>0.01). The AMOVA-based Fixation Indices at UEP, UEP+MS and

RST levels for all clans were not significant (P-value: >0.131; see Table 3.6 for all

AMOVA results).

3.3.1.2. Cameroon

Six haplogroups were found in the Cameroon Grassfields dataset (n=266, number of

subgroups=3) where the modal type was again E3a (90%). Gene diversity based on UEP

haplogroups for the pooled dataset was 0.189±0.032 and for the three individual groups

ranged from 0.083-0.280 with a mean of 0.20 and a variance of 0.012. In all groups the

E3a haplogroup was modal (mean: 0.89, variance: 0.003, range: 0.85-0.96). There were

two pairwise differences between groups at the 5% significance level and none at the 1%

level. However differences at the UEP+MS level in all three population pairwise

comparisons were highly significant (P<0.0001) as were all pairwise RST (P<0.0001).

Gene diversity based on UEP+MS haplotypes for the entire region was 0.946±0.005 and

for the individual clans ranged from 0.887 to 0.958 with a mean of 0.92 and a variance of

0.001. The UEP+MS modal haplotype was different in each of the three groups (CA-BT:

E3a-16-10-21-10-11-13 (Freq=0.18), CA-FB: E3a-16-12-21-10-11-16 (Freq=0.26),

CA-WA: E3a-15-12-21-10-11-15 (Freq=0.23)) while the frequency of the putative Bantu

Expansion haplotype ranged from 0.026 to 0.090 among the three groups. The AMOVA-based Fixation Index for

Table 3.5: Haplogroup proportions in Cross River, Cameroonian Grassfield

and Ghanaian groups.

NRY UEP Haplogroup (according to the nomenclature of the Y-chromosome Consortium (2002))

AN-AO 0.00 0.03 0.00 0.00 0.00 0.00 0.00 0.97 0.00

AN-EA 0.00 0.08 0.08 0.00 0.00 0.00 0.00 0.85 0.00

AN-IO 0.00 0.17 0.02 0.00 0.00 0.00 0.00 0.81 0.00

EF-EE 0.00 0.06 0.04 0.00 0.00 0.00 0.00 0.90 0.00

EF-INE 0.02 0.19 0.02 0.00 0.00 0.00 0.00 0.77 0.00

EF-OEU 0.00 0.10 0.00 0.00 0.02 0.00 0.00 0.88 0.00

EK-CA 0.06 0.06 0.00 0.00 0.00 0.00 0.00 0.89 0.00

EK-CC 0.00 0.07 0.00 0.00 0.00 0.00 0.00 0.93 0.00

EK-CI 0.03 0.05 0.05 0.00 0.00 0.00 0.00 0.86 0.00

EK-NA 0.02 0.08 0.04 0.00 0.00 0.00 0.00 0.86 0.00

IB-ANMWN 0.00 0.06 0.03 0.00 0.00 0.00 0.00 0.92 0.00

IB-EAEEUAE 0.00 0.13 0.04 0.00 0.00 0.00 0.00 0.83 0.00

IB-EUE 0.00 0.08 0.04 0.00 0.00 0.02 0.00 0.86 0.00

IB-IAAUA 0.00 0.18 0.04 0.00 0.00 0.00 0.00 0.79 0.00

IB-IEINOI 0.00 0.12 0.00 0.00 0.00 0.00 0.00 0.88 0.00

IB-IMIEI 0.00 0.04 0.00 0.00 0.02 0.00 0.00 0.94 0.00

IB-IOINO 0.00 0.04 0.06 0.00 0.00 0.00 0.00 0.90 0.00

IB-MNENN 0.00 0.02 0.02 0.00 0.00 0.02 0.00 0.94 0.00

IB-NEI 0.00 0.13 0.06 0.00 0.00 0.00 0.00 0.81 0.00

IB-OII 0.02 0.08 0.04 0.02 0.00 0.02 0.00 0.80 0.02

IB-ONMNI 0.00 0.08 0.02 0.00 0.00 0.00 0.00 0.90 0.00

IG-C 0.00 0.07 0.01 0.00 0.01 0.01 0.00 0.90 0.00

OR-AO 0.00 0.11 0.00 0.00 0.00 0.00 0.00 0.89 0.00

OR-ENEEAU 0.00 0.04 0.05 0.00 0.01 0.01 0.00 0.88 0.00

Cross River Grand Total 0.00 0.08 0.03 0.00 0.00 0.00 0.00 0.87 0.00

IG-E 0.00 0.06 0.02 0.00 0.00 0.02 0.00 0.91 0.00

IG-N 0.00 0.04 0.00 0.00 0.00 0.00 0.00 0.96 0.00

CA-BT 0.03 0.06 0.06 0.00 0.00 0.00 0.00 0.85 0.00

CA-FB 0.00 0.09 0.04 0.00 0.01 0.00 0.01 0.85 0.00

CA-WA 0.00 0.04 0.00 0.00 0.00 0.00 0.00 0.96 0.00

GH-AEW 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00

GH-AKE 0.00 0.00 0.08 0.00 0.00 0.00 0.00 0.92 0.00

GH-ASWW 0.00 0.00 0.00 0.00 0.23 0.00 0.00 0.77 0.00

GH-EHVR 0.02 0.02 0.05 0.00 0.00 0.00 0.00 0.91 0.00

GH-FEWR 0.03 0.00 0.02 0.00 0.02 0.00 0.00 0.93 0.00

All Populations Total 0.01 0.07 0.03 0.00 0.01 0.00 0.00 0.89 0.00

Table 3.6: Hierarchical AMOVA results of Cross River, Cameroonian and Ghanaian groups at various molecular levels. Colour indicates significance level of Fixation Index P-values: Yellow = 0.01<P<0.05, Orange = 0.001<P<0.01, Red = P<0.001. Each grouping is followed, indicated by ‘n’, by the number of groups and, if applicable, the number of individual populations analysed. Each cell gives the Fixation Index followed by its P-value in parentheses.

Groupings and the Fixation Index reported for each:
(1) Cross River region (n=24): FST
(2) Cameroon Grassfields: FST
(3) Ghana (n=5): FST
(4) Ibibio (n=11): FST
(5) Cross River pooled groups of language speakers (n=6): FST
(6) Cross River clans grouped by language (n=6,24): FCT
(7) Cross River clans grouped by language with 2 Igbo populations (n=6,26): FCT
(8) Cross River region + Ghana + Cameroon Grassfields (n=3,32): FCT

Grouping  NRY UEP FST     NRY UEP+MS FST    NRY MS RST        mtDNA HVS-1 VSO FST  mtDNA HVS-1 K2
(1)       0.000 (0.474)   0.000 (0.562)     0.004 (0.131)     0.000 (0.242)        -0.001 (0.663)
(2)       0.020 (0.034)   0.139 (<0.0001)   0.004 (<0.0001)   0.010 (0.000)        0.001 (0.351)
(3)       0.041 (0.005)   0.003 (0.212)     0.008 (0.175)     0.000 (0.374)        0.001 (0.368)
(4)       0.002 (0.355)   -0.002 (0.831)    0.004 (0.189)     0.001 (0.138)        0.000 (0.498)
(5)       -0.003 (0.876)  -0.001 (0.800)    -0.001 (0.613)    0.001 (0.100)        0.001 (0.191)
(6)       -0.004 (0.919)  -0.001 (0.778)    -0.003 (0.886)    0.001 (0.130)        0.002 (0.105)
(7)       -0.001 (0.597)  0.000 (0.475)     -0.002 (0.772)    0.000 (0.202)        0.002 (0.086)
(8)       0.006 (0.049)   0.015 (0.000)     -0.025 (0.025)    0.005 (<0.001)       0.016 (<0.001)

all three groups at the UEP level was significant at the 5% threshold, while at the

UEP+MS and RST levels the Fixation index was highly significant (P-value: < 0.0001).

3.3.1.3. Ghana

Five haplogroups were found in the Ghanaian dataset (n=242, number of sub-groups=5)

where the modal type was E3a (91%). Gene diversity based on UEP haplogroups for the

pooled dataset was 0.164±0.032 and for the five individual groups ranged from 0.000-

0.368 with a mean of 0.16 and a variance of 0.018. In all groups the E3a haplogroup was

modal (mean: 0.91, variance: 0.007, range: 0.77-1.00). There were four pairwise

differences between clans at 5% significance and three at 1% significance. Of the four

significant pairwise comparisons none were significant even at the 5% significance level

when haplotypes were defined by UEP+MS. Gene diversity based on UEP+MS

haplotypes for the pooled dataset was 0.958±0.006 and for the five individual groups

ranged from 0.933 to 0.954 with a mean of 0.94 and a variance of 0.0001. For the one

case where an inter-ethnic group pairwise difference using UEP+MS haplotypes was

observed at 1% significance (GH-EHVR:GH-FEWR) it was not maintained even at 5%

significance when the most frequently observed unshared haplotype from each group was

removed (P=0.232). The putative Bantu Expansion signature haplotype E3a-15-12-21-10-

11-13 was the UEP+MS modal haplotype in three of the five Ghanaian groups (mean:

0.16, variance: 0.0005, range: 0.14-0.18). In GH-AEW it was the co-modal haplotype

along with E3a-17-12-21-10-11-15 (Freq=0.19) while in GH-AKE it was the second most

frequently observed haplotype (Freq=0.137) following its one-step neighbour E3a-15-12-

21-10-11-14 (Freq=0.178). In pairwise comparisons using RST, genetic distances were not

significantly different in any ethnic group comparisons (P>0.01). The AMOVA-based

Fixation Index at the UEP level for all groups was significant at 1% but at the UEP+MS

and RST levels the structuring was not considered statistically significant at the 1%

threshold.

3.3.1.4. Igboland

Four haplogroups were found in the Igboland dataset (n=109, number of subgroups =2),

where the modal type was E3a (93%). Gene diversity based on UEP haplogroups for the

pooled dataset was 0.130±0.044 and for the two individual groups ranged from 0.080-

0.176. In both groups the E3a haplogroup was modal (range: 0.91-0.96). No significant

difference was found between the two groups based on UEP frequencies at 5%

significance (P=0.526). In addition no significant difference was found using UEP+MS

haplotypes at the 5% level, though the P-value was close to 0.05 (P=0.058). Gene

diversity based on UEP+MS haplotypes for the pooled dataset was 0.925±0.011 and for

the two individual groups ranged from 0.915 to 0.928. The putative Bantu Expansion

signature haplotype was the UEP+MS modal haplotype in IG-N (Freq=0.16) and was the

joint third most frequent haplotype in IG-E (Freq=0.13), where E3a-15-12-21-10-11-14

and E3a-17-12-21-10-11-14 were the co-modal haplotypes (Freq=0.15). In pairwise

comparisons using RST, the one pairwise genetic distance between IG-N and IG-E was

significantly different at the 5% threshold but not at 1% (P=0.048).

3.3.2. The distribution of mtDNA variation

Tentative mtDNA haplogroup classifications according to the nomenclature of Salas et al.

(2004) have been reported for the interest of the reader. However, because of the

difficulty of correctly predicting mtDNA haplogroups through HVS-1 sequence data

alone as described by Torrino et al. (2000) no statistical analysis has been performed at

this level and any conclusions based on this classification are at best tentative. The

subject of interest in this chapter lies in exploring the similarity and dissimilarity of

existing populations for which a predetermined phylogenetic relationship of mtDNA

types is not required.

3.3.2.1. Cross River region

363 distinct mtDNA HVS-1 haplotypes were observed in the Cross River dataset

(n=1088) (see Supplementary Table 3S.4 for all mtDNA data). Gene diversity based on

mtDNA HVS-1 haplotypes for the entire region was 0.991±0.001 and for the individual

clans ranged from 0.978 to 1.000 with a mean of 0.991 and a variance of 0.00002, for

individual locations it ranged from 0.978 to 0.997 with a mean of 0.990 and a variance of

0.00002 and for individual language groups it ranged from 0.986 to 0.992 with a mean of

0.990 and a variance of 0.00001. Of the 24 Cross River clans there were 44 haplotypes

that were modal or co-modal amongst the groups, which would be expected given the

high mtDNA haplotype diversity. However one particular haplotype, 126C-187T-189C-

223T-264T-270T-278T-293G-311C, was modal or co-modal in ten populations and its

overall frequency was the highest observed in the Cross River region (Freq=0.043). The

closest to this haplotype in frequency was 129A-209C-223T-292T-295T-311C

(Freq=0.030), which was modal or co-modal in seven populations. Four other haplotypes

were co-modal in three populations, all of which had a frequency of 0.22% or less. 205 of

the 363 haplotypes were observed only once in the dataset (4.4%, 5.4% and 1% of which

were found at varying frequencies in the Cameroonian, Ghanaian and Igboland datasets

respectively). The mean number of pairwise differences per pair of sequences per

population ranged from 8.02 to 10.99 with a mean of 9.75 and a variance of 0.492. There

were twelve population pairwise differences between clans (assessed using a Pairwise

ETPD) at 5% significance and one (IB-MNENN versus IB-IOINO) at 1%. In pairwise

comparisons using the K2 model three pairwise genetic distances were significant

(0.01>P>0.001) but the IB-MNENN/IB-IOINO comparison was not one of them. The

AMOVA-based Global Fixation Indices at the mtDNA VSO haplotype FST and mtDNA

K2 levels for all clans considered were not significant (P>0.242).

3.3.2.2. Cameroon

In all 133 distinct mtDNA HVS-1 haplotypes were observed in the Cameroonian

Grassfield dataset (n=256). Gene diversity based on mtDNA HVS-1 haplotypes for the

entire region was 0.991±0.001 and for three groups ranged from 0.968 to 0.990 with a

mean of 0.981 and a variance of 0.0001. Each of the three groups possessed a different

modal haplotype ranging in frequency from 0.056-0.147. 78 of the 113 haplotypes were

observed only once in the dataset (19.2%, 10.3% and 6% of which were found at varying

frequencies in the Cross River, Ghanaian and Igboland datasets respectively). The mean

number of pairwise differences per pair of sequences per population ranged from 8.93 to

9.31 with a mean of 9.15 and a variance of 0.039. Two of the three population pairwise

comparisons showed highly significant differences between populations (P<0.001) using

pairwise ETPD but no significantly large genetic differences were found even at 5%

significance using K2-based genetic distances. The AMOVA-based Global Fixation

Index at the mtDNA VSO haplotype FST level was highly significant (P<0.0001) but was

not significant using a K2 model (P=0.351).

3.3.2.3. Ghana

There were 144 distinct mtDNA HVS-1 haplotypes observed in the Ghanaian dataset

(n=238). Gene diversity based on mtDNA HVS-1 haplotypes for the entire region was

0.988±0.003 and for the five groups ranged from 0.985 to 0.995 with a mean of 0.989 and

a variance of 0.00001. The 223T-278T-294T-309G-390A haplotype was modal in two of

the five groups (Freq 0.06-0.08) and was co-modal in a further one. Different modal

haplotypes were found in the other two groups. 108 of the 144 haplotypes were observed

only once in the dataset (21.3%, 7.4% and 11% of which were found at varying

frequencies in the Cross River, Cameroonian and Igboland datasets respectively). The

mean number of pairwise differences per pair of sequences per population ranged from

6.84 to 8.14 with a mean of 7.26 and a variance of 0.415. There was one population

pairwise difference between groups at 5% significance but it was not significant at a

1% threshold. In pairwise comparisons using the K2 model no pairwise genetic distances

were significantly different in any population comparison (P>0.05). The AMOVA-based

Global Fixation Indices at mtDNA VSO haplotype FST and mtDNA K2 levels were not

significant (P-value> 0.368).

3.3.2.4. Igboland

74 distinct mtDNA HVS-1 haplotypes were observed in the Igboland dataset (n=105).

Gene diversity based on mtDNA HVS-1 haplotypes for the entire region was 0.988±0.004

and for the two groups ranged from 0.982 to 0.991. The 172C-183C-189C-223T-320T

haplotype was modal in IG-E (Freq=0.07) and 126C-187T-189C-223T-264T-270T-278T-

293G-311C was modal in IG-N (Freq=0.12). 56 of the 74 haplotypes were observed only

once in the dataset (41.1%, 17.9% and 19.6% of which were found at varying frequencies

in the Cross River, Cameroonian and Ghanaian datasets respectively). The mean number

of pairwise differences per pair of sequences per population ranged from 9.54 to 10.00.

There was no pairwise difference between IG-N and IG-E at 5% significance while the

K2 pairwise genetic distance was also not significant (P>0.05).

A series of questions posed in the introduction that examine the Cross River data at

various levels of grouping (clan, location and language) are now addressed.

3.3.3. Are clan communities collected from different locations distinguishable?

In two cases datasets consisting of the same clan or secondary affiliation were collected

from more than one location: i) the Ejagham of Akampka from a) Calabar (EK-CC) and

b) Netim (EK-NA) and ii) the Efik Efut from a) Eniong and Atan Ono Yom (EF-EE) and

b) Ikot Nakanda and Ikot Ene (EF-INE). The Ejagham of Akampka showed no significant

differences in the NRY (UEP and UEP+MS level assessed using the ETPD and MS level

assessed using AMOVA RST-based genetic distance) or mtDNA comparisons (HVS-1

haplotype level assessed using ETPD and HVS-1 sequence level assessed using

AMOVA-based K2 genetic distance). However the Efut did show a significant difference

between clans at the UEP+MS level (P<0.01) though, as stated earlier, this significant

difference was lost even at 5% significance when the most frequently observed unshared

haplotype from each clan was removed. No significant differences were found at any

other NRY or mtDNA level for the Efut.

3.3.4. Are different clans of the same language group collected from the same location

distinguishable?

In two cases datasets consisting of different clans (or other parallel secondary affiliations)

of the same language group were collected from the same location: from i) Afaha

Esang/Ikot Ubom the a) Annang Ediene Abak (AN-EA) and b) the Annang Afaha Obong

(AN-AO) and from ii) Calabar the a) Ejagham of Akampka (EK-CA), b) the Ejagham of

Ikom (EK-CI) and c) the Ejagham of Calabar (EK-CC). The two clans from Afaha

Esang/Ikot Ubom showed no significant differences at any NRY or mtDNA level.

However a highly significant difference was found in Calabar between the Ejagham of

Akampka and the Ejagham of Calabar (P<0.01) at the UEP+MS level though once again

this difference was lost, even at 5% significance, when the most frequently observed

unshared haplotype was removed from each group. In addition significant differences

were found between the Ejagham of Akampka and the Ejagham of Ikom at 5%

significance at the UEP+MS and RST levels but not at 1% significance. No significant

pairwise differences were found at any mtDNA level.

3.3.5. Are different language groups collected from the same location distinguishable?

There was one dataset where two different language groups were collected from the same

location: from i) Calabar a) Ejagham speakers and b) Igbo speakers. There were no

significant differences between these two language groups at any NRY level (P-value:

UEP = 0.60, UEP+MS = 0.47, RST= 0.48). An ETPD was significant at the mtDNA

haplotype level at 5% significance but not at 1% (P=0.048) and the K2 genetic distance

was not significant (P=0.16).

3.3.6. Are the same language groups collected from different locations distinguishable?

In the Cross River dataset there are five language groups where samples were collected

from the same language speakers in two or more locations: the Annang (two locations),

the Efik (three locations), the Ejagham (two locations), the Ibibio (eleven locations) and

the Oron (two locations). No significant differences were observed at any NRY or

mtDNA levels for the Annang, Ejagham and Oron while the one difference in Efik

pairwise comparisons was the same one as previously described in testing for differences

between clans at different locations when examining the two Efut clans. For the 11 Ibibio

groups there were a total of 55 pairwise comparisons at each of the different levels of

analysis. At all NRY levels there were no significant differences at the 1% threshold. At

the mtDNA level there was one significant difference (P<0.01) found using pairwise

ETPD while K2-based genetic distances revealed one significant pairwise genetic

distance (P<0.01). Because of the large number of pairwise comparisons the Ibibio were

also additionally analysed using hierarchical AMOVA. The AMOVA-based Fixation

Index for the individual Ibibio clans/locations was not significant at any NRY or mtDNA

level (P-value > 0.138).
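
The figure of 55 pairwise comparisons for the 11 Ibibio groups follows from n(n-1)/2. The sketch below reproduces that count and, as an aid to interpretation only, the number of comparisons that would be expected to reach a given threshold by chance if every null hypothesis were true. The function names are hypothetical.

```python
from math import comb

def pairwise_comparisons(n_groups):
    """Number of distinct unordered pairs among n_groups populations."""
    return comb(n_groups, 2)

def expected_chance_hits(n_groups, alpha):
    """Expected count of nominally significant pairs when all nulls hold.

    By linearity of expectation this holds even though the pairwise
    tests are not independent of one another.
    """
    return pairwise_comparisons(n_groups) * alpha

print(pairwise_comparisons(11))          # 55
print(expected_chance_hits(11, 0.01))    # about 0.55
```

On this reasoning roughly one comparison in two such sets of 55 would be expected to pass the 1% threshold by chance alone, which is one reason a hierarchical AMOVA is a useful complement to the pairwise tests.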

3.3.7. Are speakers of the six Cross River languages distinguishable?

This section addresses the principal question posed in the introduction: do the six Cross

River language group datasets indicate any sex-specific genetic system structuring or has

gene flow among them been sufficient to prevent differences developing? Using pooled

datasets of language speakers in the Cross River region (where clans were pooled based

on their principal language) the hierarchical AMOVA-based Fixation indexes were not

significant at any NRY or mtDNA level. There were also no pairwise significant

differences between language groups at any NRY level. However at the mtDNA level two

pairwise significant differences (P<0.01) were observed using an ETPD (between the

Ejagham and Ibibio and between the Ejagham and Efik). The Ejagham and Ibibio

pairwise comparison also gave a significant (P<0.01) AMOVA-based K2 genetic

distance.

To take into account any differences among language groups due to differences within

language groups each clan was analysed separately but within a framework where clans

were also grouped by their language spoken.

The AMOVA-based Fixation Indices for among-group differences (with populations (in

this case clans) grouped by language spoken; FCT is the Among-Group Fixation Index)

were not significant at any NRY or mtDNA level of analysis (P-value>0.105).

Though the Fixation Indices above indicate a lack of among-group structure between

different language-speaking clans, significant individual pairwise differences were

observed at every NRY and mtDNA level. Below is a description of the distribution of these

differences for each language group. The percentage of pairwise comparisons of

clans within each language group with clans of all other language groups that were

significantly different at the 5% threshold at each NRY and mtDNA level of analysis is

also reported. While there are possibly issues of multiple testing because of the large

number of non-independent pairwise comparisons these figures do provide a useful report

of the distribution of pairwise differences and can indicate potentially interesting patterns

and candidate outliers.
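
For readers who wish to gauge how many of the nominally significant comparisons would survive a formal correction, the sketch below applies the Holm-Bonferroni step-down procedure, which controls the family-wise error rate without assuming that the tests are independent. This correction was not applied in the analyses reported here; the function name and the example P-values are hypothetical.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans, True where the null is rejected.

    P-values are visited from smallest to largest; the k-th smallest
    (k = 0, 1, ...) is compared with alpha / (m - k), and testing stops
    at the first failure.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # all larger P-values are also retained
    return reject

# Hypothetical P-values from four pairwise comparisons.
print(holm_bonferroni([0.001, 0.04, 0.03, 0.20]))
```

In this toy example only the first comparison remains significant, although three of the four fall below an uncorrected 5% threshold.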

At the UEP haplogroup level five significant pairwise comparisons were observed

between clans of different language groups at the 5% level, four of which involved EF-

INE, while none were found at 1% significance. The percentage of pairwise comparisons

involving each language group with all others that were significant at the 5% threshold

was: Annang: 1.6%, Efik: 6.3%, Ejagham: 0.0%, Ibibio: 2.8%, Igbo: 0.0%, Oron: 2.3%.

At the UEP+MS level differences were found in eight pairwise comparisons at 5%

significance, five of which involved EF-EE. One difference was found at 1%

significance, that between AN-AO and IB-EAEEUAE, which was not significant even at

5% when the most frequently observed unshared haplotype was removed from each

group. The percentage of pairwise comparisons involving each language group with all

others that were significant at the 5% threshold was: Annang: 3.2%, Efik: 9.5%,

Ejagham: 3.8%, Ibibio: 4.9%, Igbo: 0.0%, Oron: 0.0%.

Twelve significant genetic distances were observed using RST at the 5% level, five of

which involved EF-INE, while none were observed at 1%. The percentage of pairwise

genetic distances involving each language group with all others that were significant at

5% was: Annang: 4.8%, Efik: 7.9%, Ejagham: 6.3%, Ibibio: 6.3%, Igbo: 4.3%, Oron:

At the mtDNA haplotype level seven pairwise comparisons were significantly different at

the 5% level and none at 1%. The percentage of pairwise comparisons involving each

language group with all others that were significant at the 5% threshold was: Annang:

6.3%, Efik: 0.0%, Ejagham: 1.3%, Ibibio: 4.2%, Igbo: 0.0%, Oron: 6.8%.

Thirteen significant genetic distances were observed using the K2 distance model at the

5% level, seven of which involved AN-EA, while two were observed at the 1% threshold,

both of which involved IB-ANMWN. The percentage of pairwise genetic distances

involving each language group with all others that were significant at the 5% threshold

was: Annang: 12.7%, Efik: 4.8%, Ejagham: 7.5%, Ibibio: 7.7%, Igbo: 4.3%, Oron: 2.3%.

3.3.8. Are speakers of the six Cross River languages distinguishable when two groups

from Igboland are added to the analysis?

In the Cross River region Calabar is considered a particularly cosmopolitan city where

different ethnicities reside together at an unusually high frequency for the region as a

whole. Of the six language groups considered here the Igbo would be expected to be the

most genetically distinct. Given that samples were collected from only one group of Igbo

speakers from Calabar two groups from Igboland to the west of the Cross River region

(IG-E and IG-N) were added to the inter-language group analysis to take into account the

potentially unusually high levels of inter-ethnic admixture (in comparison to the region as

a whole) that may have taken place involving Igbo from Calabar.

The AMOVA-based FCT values were slightly higher at all NRY but not mtDNA levels

when the IG-N and IG-E were grouped with the Igbo speaking group from Calabar (all

other language group structures were the same) but none of these FCT values were

significant (P-value>0.086).

In comparisons between the two Igboland groups and the Igbo from Calabar no

significant differences were found at the 5% level using UEP frequencies. Significant

differences were found between the Igbo from Calabar and both Igboland groups using

UEP+MS haplotypes at the 5% level but not at 1% (P>0.025). However when using RST

genetic distances these two pairwise comparisons were not significantly different even at

the 5% threshold (P>0.382). No significant differences were found using mtDNA

haplotype frequencies (P>0.055) but the K2-based genetic distance between the Igbo

from Calabar and IG-N was significant at the 5% level (P=0.024).

In comparisons with the non-Igbo Cross River region groups, at the UEP level IG-E

showed no significant pairwise differences even at the 5% level with Cross River clans

while IG-N showed three significant differences at the 5% level and none at the 1% level.

At the UEP+MS level IG-E showed significant differences with four populations at 5%

significance and six populations at 1% significance while IG-N showed differences with

ten populations at 5% significance and seven populations at 1% significance. At the

microsatellite level there were no significant RST genetic distances between IG-E and any

other Cross River population, even at 5% significance, while IG-N showed four

significant genetic distances at the 5% level and one at 1%.

At the mtDNA haplotype level IG-E showed two significant pairwise differences with

Cross River populations at 5% significance and none at 1%, while IG-N showed no

significant differences at 5% and one at 1%. mtDNA K2 genetic distances were not

significant in any comparisons involving IG-E while for IG-N there were four significant

K2 genetic distances at 5% and three significant genetic distances at 1%, in which all

five other language groups apart from the Oron were involved on some occasion.

As expected from AMOVA results, phylogenetic analyses of Cross River clans through

consensus neighbour joining trees at various NRY and mtDNA (Figure 3.4) levels

showed no consistent language groupings with very low internal node bootstrap values

across the trees, suggesting the various branches for each tree are somewhat

interchangeable, even in the presence of the two Igboland populations.

3.3.9. Can differences between the Cross River region and Cameroonian and Ghanaian

groups be established?

Using three pooled datasets consisting of the 24 Cross River region clans, five Ghanaian

groups and three Cameroonian groups respectively, pairwise ETPD showed significant

differences at the 1% threshold between all three datasets at all NRY and mtDNA levels

except at the UEP level where there was no significant difference between the Cross

River region and Cameroon while NRY RST and mtDNA K2 genetic distances were also

significant at the 1% threshold (see Table 3.7).

Figure 3.4: Consensus neighbour joining trees for Cross River population using various methods of genetic distance for

both NRY and mtDNA. Only individual node bootstrap values over 30% are shown on tree.

Table 3.7a: ETPD P-values (upper triangle) at various NRY and mtDNA levels for pooled Cameroonian, Ghanaian and

Nigerian datasets. Colour code is same as Table 3.6.

          NRY UEP level                 NRY UEP+ms level              mtDNA VSO level
          Cameroon  Ghana     Nigeria   Cameroon  Ghana     Nigeria   Cameroon  Ghana     Nigeria
Cameroon
Ghana     0.001                         0.000                         0.000
Nigeria   0.620     0.000               0.000     0.000               0.000     0.000

Table 3.7b: Genetic Distances (lower triangle) and P-values (upper triangle) at various NRY and mtDNA levels for pooled

Cameroonian, Ghanaian and Nigerian datasets. Colour code is same as Table 3.6.

          NRY UEP-based FST             NRY UEP+ms-based FST          NRY Microsatellite-based RST
          Cameroon  Ghana     Nigeria   Cameroon  Ghana     Nigeria   Cameroon  Ghana     Nigeria
Cameroon  *         0.045     0.312     *         0.000     0.000     *         0.000     0.000
Ghana     0.007     *         0.002     0.021     *         0.002     0.062     *         0.010
Nigeria   0.000     0.016     *         0.028     0.004     *         0.042     0.006     *

          mtDNA VSO-based FST           mtDNA VSO-based K2
          Cameroon  Ghana     Nigeria   Cameroon  Ghana     Nigeria
Cameroon  *         0.000     0.000     *         0.000     0.000
Ghana     0.007     *         0.000     0.029     *         0.000
Nigeria   0.005     0.004     *         0.014     0.015     *

To account for possible within-region differentiation the Cross River clans and Ghanaian

and Cameroonian groups were compared on a population-by-population basis but within

a framework where populations were also grouped by their country of origin.

The AMOVA-based Fixation Indices for among-group differences, FCT, (with populations

grouped by one of the three countries) were significant at the 5% threshold using NRY

UEP defined haplogroups and RST and were significant at the 1% level using UEP+MS

haplotypes and at both levels of mtDNA analysis (P-value<0.001).

The percentage of pairwise comparisons of Cameroon Grassfields populations with Cross

River clans that were significantly different at the 5% level was: for NRY UEP-based

pairwise ETPD=16.7%, for NRY UEP+MS-based pairwise ETPD =86.1% (the majority

are highly significant), for RST genetic distance=88.9% (again the majority are highly

significant), for mtDNA haplotype-based ETPD=97.2% (again the majority are highly

significant), for mtDNA K2 genetic distance= 36.1%.

The percentage of pairwise comparisons of Ghanaian populations with Cross River clans

that were significantly different at the 5% level was: for NRY UEP-based pairwise

ETPD (an analogue of Fisher's exact test) = 40.8%, for NRY UEP+MS-based pairwise ETPD =

16.7%, for RST genetic distance =17.5%, for mtDNA haplotype-based ETPD = 44.2%, for

mtDNA K2 genetic distance = 30.8%.
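
The percentages in the two paragraphs above can be related back to numbers of comparisons. With 24 Cross River clans, three Cameroonian groups and five Ghanaian groups there are 3 x 24 = 72 and 5 x 24 = 120 between-region pairs respectively. The sketch below shows the arithmetic; the counts it returns are inferred from the rounded percentages and are not stated in the text, and the function names are hypothetical.

```python
def between_region_comparisons(n_groups_a, n_groups_b):
    """Number of pairs formed by one group from each of two regions."""
    return n_groups_a * n_groups_b

def implied_count(percentage, n_comparisons):
    """Nearest whole number of comparisons implied by a reported percentage."""
    return round(percentage / 100 * n_comparisons)

cameroon_pairs = between_region_comparisons(3, 24)   # 72
ghana_pairs = between_region_comparisons(5, 24)      # 120

# Inferred, not reported: 16.7% of 72 and 40.8% of 120 (UEP-level ETPD).
print(implied_count(16.7, cameroon_pairs))   # 12
print(implied_count(40.8, ghana_pairs))      # 49
```

Because the denominators differ, the same percentage represents a different number of significant comparisons in the Cameroonian and Ghanaian cases.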

Supplementary Tables 3S.1 and 3S.3 also show that at the UEP+MS, RST and mtDNA

haplotype levels (and to some extent mtDNA K2 levels) pairwise comparisons between

Ghanaian and Cameroonian populations indicated highly significant differences.

PCO plots of NRY and mtDNA genetic distances at various levels of resolution showed a

general pattern (see Figure 3.5) at all levels where the Cross River datasets clustered

together, with the Cameroonian and Ghanaian populations tending to lie on the periphery

(though when examining each Cameroonian and Ghanaian population individually some

populations were observed deep within the Cross River cluster while others are distinct

outliers).

Figure 3.5: Various PCO plots at different NRY and mtDNA analysis levels

for populations from the Cross River region, the Cameroon Grassfields and

Ghana.

3.3.9.1. Estimation of the TMRCA of individuals possessing the E3a haplogroup in the 3

West Central African regions

A crude estimate of the TMRCA of the E3a clade of all samples analysed from the Cross

River region, the Cameroon Grassfields and Ghana was obtained using Y-time, assuming a) a star

genealogy, b) a mutation rate per generation of 0.00193 (Behar et al. 2003), c) a Simple

Stepwise Mutation Model and d) that the ancestral haplotype was the Bantu signature

haplotype. The estimate was 279 generations before present, or 5580 years assuming an

inter-generation time of 20 years (95% Confidence Interval (CI) = 268 to 291 generations,

or 5360 to 5820 years, before present).
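
The logic behind such an estimate can be sketched as follows. Under a star genealogy and a single-step stepwise mutation model, the expected average squared difference (ASD) in repeat number between sampled chromosomes and the ancestral haplotype is the mutation rate multiplied by the time in generations, so a point estimate is ASD divided by the mutation rate. It is assumed here, rather than confirmed, that the point estimate above was obtained in this way; the confidence interval is not reproduced, and the function names and toy haplotypes are hypothetical.

```python
def asd_to_ancestor(haplotypes, ancestral):
    """Average squared difference in repeat count from the ancestral
    haplotype, averaged over chromosomes and loci."""
    n_loci = len(ancestral)
    total = 0
    for hap in haplotypes:
        total += sum((h - a) ** 2 for h, a in zip(hap, ancestral))
    return total / (len(haplotypes) * n_loci)

def tmrca_generations(haplotypes, ancestral, mutation_rate):
    """Point estimate of the TMRCA in generations: ASD / mutation rate."""
    return asd_to_ancestor(haplotypes, ancestral) / mutation_rate

def generations_to_years(generations, years_per_generation=20):
    """Convert generations to years for a given inter-generation time."""
    return generations * years_per_generation

# Conversion of the reported estimate and its interval bounds.
print(generations_to_years(279))   # 5580
print(generations_to_years(268))   # 5360
print(generations_to_years(291))   # 5820
```

The conversion to years scales linearly with the assumed inter-generation time, so a 25-year generation would move the same 279 generations to 6975 years.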

3.3.10. Are there correlations of genetic distances and geographic and linguistic

distances?

A Mantel test of correlation between genetic and linguistic distance for the Cross River

clans showed no correlation at any NRY or mtDNA level (P> 0.085) (see Table 3.8 for all

Mantel and Partial Mantel test results). A test for correlation between genetic and

linguistic distance while holding geographic distance constant resulted in a P-value using

UEP+MS haplotype-based FSTs of 0.058 (r=0.305), while genetic distances at all other

NRY and mtDNA levels of analysis gave very non-significant P-values (P-value> 0.264).

In addition no correlation was found between genetic and geographic distance at any level

(P>0.386), even when holding linguistic distance constant (P>0.091).
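
A simple Mantel test of the kind reported here can be sketched in a few lines. The statistic is the Pearson correlation between the off-diagonal elements of two distance matrices, and its significance is assessed by repeatedly permuting the rows and columns of one matrix together. The code below is an illustration under those assumptions, with a one-tailed P-value and hypothetical function names; it is not the software used for the analyses in this chapter.

```python
import math
import random

def _upper_triangle(matrix):
    """Off-diagonal elements above the diagonal, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def _pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mantel_test(dist_a, dist_b, permutations=999, seed=0):
    """Return (r, one-tailed P) for two square distance matrices."""
    rng = random.Random(seed)
    n = len(dist_a)
    flat_b = _upper_triangle(dist_b)
    observed = _pearson(_upper_triangle(dist_a), flat_b)
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        order = list(range(n))
        rng.shuffle(order)
        # Permute population labels: rows and columns move together.
        permuted = [[dist_a[order[i]][order[j]] for j in range(n)]
                    for i in range(n)]
        if _pearson(_upper_triangle(permuted), flat_b) >= observed:
            hits += 1
    return observed, hits / (permutations + 1)
```

Permuting whole rows and columns, rather than individual cells, preserves the dependence among distances that share a population, which is what distinguishes the Mantel test from an ordinary correlation test.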

Performing these Mantel and Partial Mantel tests for correlations between genetic,

linguistic and geographic distances but restricting the dataset to only Lower Cross

languages (therefore excluding Ejagham and Igbo) suggests no correlation at any NRY or

mtDNA level with very high P-values in all cases (P>0.271).

However expanding the Cross River dataset to include the Igboland populations does

reveal a highly significant correlation between NRY UEP+MS FSTs and linguistic

distance using a normal Mantel test (P=0.004, r=0.354) (but there is no correlation at any

other NRY or mtDNA level of analysis (P>0.157)). The correlation between NRY

UEP+MS FST and geographic distances using this same Igboland included dataset is also

close to significance (P=0.074, r=0.253). A partial Mantel test of the correlation between

NRY UEP+MS genetic and geographic distance while holding linguistic distance

constant was not significant (P>0.159, r=0.103). However a significant correlation

between NRY UEP+MS genetic and linguistic distance is still apparent when holding

geographic distance constant (P=0.034, r=0.275), though the correlation is less

pronounced.
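
The statistic in a partial Mantel test is a first-order partial correlation: the correlation between two distance matrices after the linear effect of a third has been removed from both. The sketch below gives only that formula, with hypothetical input values; significance would still need to be assessed by permutation, which is not shown, and the function name is hypothetical.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between x and y with z held constant.

    r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
    """
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical values: x = genetic, y = linguistic, z = geographic distance.
print(partial_correlation(0.6, 0.5, 0.5))
```

When the controlling variable is itself correlated with both of the others, as linguistic and geographic distance often are, the partial coefficient is smaller than the simple one, which matches the pattern of a correlation that persists but is less pronounced.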

Table 3.8: Results of Mantel and Partial Mantel tests at different levels of

NRY and mtDNA analysis using various distance matrices. Colour code is

same as Table 3.6.

Genetic distance matrix type calculated:
                                    NRY UEP-based     NRY UEP+ms-based  Microsatellite-   mtDNA VSO-based   mtDNA VSO-based
                                    FST               FST               based RST         FST               K2
Groups utilised                     r        P-value  r        P-value  r        P-value  r        P-value  r        P-value

Genetic vs geographic distance
Cross River + Cameroon + Ghana      0.458    0.001    0.123    0.212    0.235    0.060    0.300    0.035    0.432    0.001
Nigeria (includes IG-N and IG-E)    0.107    0.196    0.253    0.074    0.078    0.263    0.098    0.223    0.142    0.182
Cross River                         -0.012   0.524    -0.078   0.838    0.008    0.446    -0.033   0.656    0.018    0.386
Lower Cross                         -0.006   0.495    -0.087   0.797    -0.046   0.664    -0.141   0.910    -0.052   0.561

Genetic vs linguistic distance
Cross River + Cameroon + Ghana      0.166    0.073    0.364    0.001    0.317    0.002    0.372    0.000    0.347    0.001
Nigeria (includes IG-N and IG-E)    -0.101   0.820    0.354    0.004    0.111    0.165    0.113    0.149    0.080    0.263
Cross River                         -0.2243  0.980    0.270    0.085    0.067    0.298    0.074    0.261    -0.015   0.509
Lower Cross                         -0.137   0.835    -0.164   0.887    -0.281   0.972    -0.062   0.647    -0.145   0.906

Genetic vs geographic distance (linguistic distance held constant)
Cross River + Cameroon + Ghana      0.455    0.001    -0.131   0.819    0.057    0.321    0.103    0.201    0.298    0.005
Nigeria (includes IG-N and IG-E)    0.176    0.091    0.103    0.159    0.029    0.382    0.051    0.353    0.119    0.201
Cross River                         0.054    0.287    -0.167   0.987    -0.011   0.534    -0.056   0.761    0.023    0.365
Lower Cross                         0.045    0.310    -0.032   0.651    0.060    0.264    -0.128   0.892    -0.135   0.893

Genetic vs linguistic distance (geographic distance held constant)
Cross River + Cameroon + Ghana      -0.156   0.945    0.366    0.004    0.227    0.036    0.251    0.015    0.120    0.123
Nigeria (includes IG-N and IG-E)    -0.173   0.964    0.275    0.034    0.084    0.227    0.076    0.246    0.014    0.398
Cross River                         -0.230   0.980    0.305    0.058    0.067    0.314    0.087    0.264    -0.021   0.546
Lower Cross                         -0.144   0.872    -0.143   0.845    -0.284   0.972    -0.013   0.537    -0.001   0.429

When the 24 Cross River region populations were considered with the five Ghanaian and

three Cameroonian groups highly significant correlations were found between genetic and

linguistic distance (P<0.01) at all NRY and mtDNA levels apart from at the UEP

haplogroup level, which was close to significance at the 5% threshold (P>0.073). When

geographic distance was held constant the UEP-based correlation was still not significant

while the correlation using RST was only significant at the 5% threshold (P=0.036).

However the correlation using UEP+MS FSTs was still highly significant (P=0.004,

r=0.366). The correlation between mtDNA K2-based genetic distances and linguistic

distance was no longer significant (P>0.123, r=0.123) but the correlation using mtDNA

haplotype FST was still significant but to a lesser degree than previously (P=0.015,

r=0.251). Highly significant correlations were also found between genetic and geographic

distance using this geographically widespread dataset at the UEP haplogroup FST and

mtDNA K2 levels (P<0.001) while using the mtDNA FST distance produced a significant

correlation at 5% significance (P=0.035, r=0.300). The RST-based distance was almost

significant (P>0.06, r=0.235). However the significant correlation using mtDNA-based

FST was lost when linguistic distance was held constant (P>0.201, r=0.103).

3.3.11. The Origins of the Efik

Initial examination of the distribution of NRY and inferred mtDNA haplogroups (Table

3.9) revealed an extremely high frequency of African-specific types in the Efik Uwanse,

which immediately reduced the likelihood that they had a Middle Eastern origin. To

investigate this further, given the expectations set out in the introduction, the Efik

Uwanse (EF-OUE) were compared to the Ibibio and Igbo as well as the following non-

Nigerian populations: Arabe speakers from Lake Chad (LC-AF), Amharic speakers from

Ethiopia (ET-AA), Israeli and Palestinian Arabs (IPA for NRY data, IPA2 for mtDNA

data) and Sudanese (SU-KH for NRY data, SU-KA for mtDNA data) (see Supplementary

Tables 3S.5 and 3S.6 for NRY and mtDNA data). It should be noted that this is the best

comparative dataset currently available and it is not claimed that each group completely

represents its area of origin. However they are likely to possess the major genetic

signatures that the Efik might have acquired from origin or admixture in the past. The two

Efik Efut populations (EF-EE and EF-INE) who claim a separate Cameroonian origin and

have recently adopted the Efik language were, for comparison, also separately included in

the pairwise analysis.

PCO plots of pairwise genetic distances (see Figure 3.6) at all NRY levels showed the

EF-OUE to be firmly clustered with the Ibibio and Igbo populations and considerably

differentiated from the four non-Nigerian comparison populations. The genetic distances

on which these PCO plots were based showed highly significant differences between the

EF-OUE and the four non-Nigerian populations (see Supplementary Table 3S.7).

Table 3.9: NRY and mtDNA haplogroup frequencies in the Efik Uwanse.

NRY UEP Haplogroup (according to the nomenclature of the Y-chromosome consortium (2002))

BR*(xDE,JR)     5
Y*(xBR,A3b2)    1
E3a             44
Total           50

Inferred mtDNA Haplogroup (according to the nomenclature of Salas et al. (2002))

L0a1            2
L1*             2
L1b             8
L1c1            1
L1c2            5
L2a             10
L3* M* N*       1
L3e1*           4
L3e2*           3
L3e2b           2
L3e3            1
L3e4            1
Total           48

The general level of genetic differentiation was less pronounced at the mtDNA level,

especially when using the K2 mutation model, but PCO plots showed the EF-OUE to still

be clustered amongst the Ibibio and Igbo populations rather than the non-Nigerian

populations. All genetic distances between the EF-OUE and the four non-Nigerian

populations at both levels of mtDNA analysis were significant at the 1% threshold except

between it and the Lake Chad dataset using the K2 distance where the genetic distance

was not significant even at the 5% threshold. The two Efik Efut groups were also

indistinguishable from the main Ibibio/Igbo cluster at all levels.

Figure 3.6: Various PCO plots at different NRY and mtDNA analysis levels

for populations from the Efik Uwanse and comparison populations.

3.4. Discussion

The main finding of this study is that the Cross River region can be genetically

differentiated, at least by the sex-specific genetic systems, from other geographically

separated regions in West Central Africa but the different ethnic groups found within the

region, which all speak different languages, cannot be distinguished in the main from

each other. This appears to fit the prior expectation that gene flow is more restricted

between geographically distant populations in comparison to populations that lie within a

common region despite the presence of significant cultural and linguistic differences.

However, despite the overall homogeneity displayed in the Cross River region,

differences were found among groups at various levels. Therefore, while at a macro-scale

differences among groups would be predicted, as the scale is reduced the populations

should homogenise, though random, unpredictable differences may arise.

3.4.1. General observations regarding NRY and mtDNA variation

The NRY and mtDNA types found in the populations included in this study are fairly

typical of those observed in West Central Africa and sub-Saharan Africa (excluding East

Africa) as a whole. As would be expected E3a (which has previously been found at high

frequencies across sub-Saharan Africa (Wood et al. 2005)) is by far the predominant UEP

haplogroup in all Cross River, Cameroonian and Ghanaian populations, suggesting a

recent common paternal ancestry of most sub-Saharan African males (again disregarding

East Africa). Much of this common ancestry appears to have been driven by expanding

Bantu-speaking farmers spreading the E3a NRY type across the continent (Underhill et

al. 2001) as evidenced by the presence of the proposed Bantu signature haplotype as the

modal type as far away as South Africa (Thomas et al. 2000). However given that none of

the groups studied here actually speak a Bantu language the effect of the Bantu expansion

on this region is likely to have been limited.

The putative Bantu signature E3a UEP+MS haplotype is found at relatively high

frequencies in the Cross River region, which, given its proximity to the proposed Bantu

homeland, suggests that the proposed signature haplotype is likely to have been well

established at high frequency in Western Central Africa prior to the start of the Bantu

expansion, while its significant presence in the Ghanaian region, where Bantu languages

are not spoken, is likely due to some other movement of peoples that either brought the

haplotype into or, as implied by Rosa et al. (2007), from West Africa. Interestingly the

Bantu signature haplotype is not the modal NRY type in any of the Cameroon Grassfields

groups, another region very close to the proposed Bantu homeland. This suggests that this

part of Cameroon was somewhat isolated from the farmers that initiated the expansion of

Bantu-speaking peoples and may have retained much of its prior genetic diversity.

The distribution of mtDNA variation was, as expected, much more diverse than for the

NRY and was also very similar to that which has previously been reported for sub-

Saharan Africa, with almost all HVS-1 haplotypes observed in the Cross River,

Cameroonian and Ghanaian datasets able to be placed, albeit tentatively, in a number of

'L' haplogroups. The major haplogroups found in 'Central' and 'West' Africa by Salas et

al. (2002) (L1a, L1b, L1c, L2a and L3e in Central Africa, L1b, L2a, L3b/d and

L3e in West Africa) all appear to be represented at appreciable frequencies amongst the

three datasets included in this study. The extremely high h values for the HVS-1 region

for all three groups were comparable to those previously observed in West Africa by

Salas et al. (2002) (mean h = 0.99) and slightly higher than those found across all of sub-

Saharan Africa (mean h = 0.97, excluding pygmies and Khoisan speakers) while the

average number of pairwise differences values were also similar to those found across

sub-Saharan Africa by Salas et al. (2002) (mean = 7.92).

3.4.2. The Cross River region as a genetically homogenous region

The results of this study showed very little sex-specific genetic differentiation at any

NRY or mtDNA level amongst the different groups of peoples living in the Cross River

region of Nigeria. As mentioned previously the vast majority of members of the different

groups of the region are likely to have shared a recent common paternal ancestor as

evidenced by the high E3a frequency observed in all populations. The main reason for the

homogeneity is likely to be that gene flow has been substantial over a long period (as is

supported for recent times by the sociological data). It is notable that the level of gene

flow mediated by men and women appears similarly high, though the data assembled here

are not directly comparable. (In fact while neither genetic system showed significant

genetic structuring, FST and FCT P-values were much closer to significance at the mtDNA

level, the opposite to what would be expected given that all the language groups

considered here are considered to comprise patrilocal communities). A major

consequence of this gene flow appears to be that there is no genetic differentiation among

the six different language groups studied, even in comparisons with the Igbo speakers of

Calabar, a group which is believed to have separated from the Lower Cross groups some

thousands of years ago. This demonstrates that major language differences can be

maintained in the presence of substantial gene flow, a finding that will be of considerable

interest to linguists working on aspects of language contact and suggests that a)

demographic history and language spoken can, in West Central Africa at least, be

independent and b) oral histories may relate more to the extant group as a cultural

construct than as an entity defined by biological ancestry.

Given the lack of genetic differentiation among the Cross River region populations it is

unsurprising that no correlation was found between either a) genetic and linguistic

distance or b) genetic and geographic distance, suggesting that gene flow has been

multi-directional. However at the UEP+MS level the addition of the two Igboland groups

did appear to result in a significant correlation between genetic and linguistic distance

even when controlling for geographic distance. In addition a number of significant

differences were found between these two Igboland groups (especially IG-N) and the

other Cross River region clans at various NRY and mtDNA levels. These were notable

when compared with the number of pairwise significant differences observed just among

the Cross River groups themselves. Differences were even identified when comparing the

Igboland groups to the Igbo from Calabar. These differences, coupled with the general

lack of differentiation within the region, appear to support the idea of the Cross River

region being a distinct genetic region, though how far this region extends, and therefore

how far the same level of gene flow extends, is unclear. As the correlation between

genetic and linguistic distance was almost significant for the Cross River region clans at

the NRY UEP+MS level but was significant when the neighbouring Igbo groups were

added to the analyses, this suggests that the groups present in the Cross River region

experience a level of male-mediated gene flow that is close to the permitted limit if

linguistic difference is to be maintained.

One factor that may have contributed to the Cross River region being particularly

homogenous was its position as a major slave post, which may have led to extensive

mixing of members of different ethnic groups that would normally have had somewhat

less contact with each other as a consequence of geographic separation. This process,

which may have occurred for as long as 200 years, could have significantly increased

gene flow among speakers of different languages. This may go some way to explaining

the very high levels of both male and female mediated gene flow among primarily

patrilineal groups. Intriguingly some Y chromosome haplogroups that are possibly

indicative of European ancestry (P*(xR1a), J) are found at very low frequencies (less than

1%) amongst the Cross River samples. It is possible that these may have entered the

Cross River gene pool as a consequence of male introgression of slave traders. However,

neither of the two haplogroups described above are unequivocally European and further

UEP delineation would be required to truly test for the presence of this process (for

example, Haplogroup P*(xR1a) contains Haplogroup R1b, which is found amongst

Western Europeans (Zalloua et al. 2008), as well as Haplogroup R2, which is found

amongst South Asians (Kivisild et al. 2003)). A few mtDNA lineages may also

potentially demonstrate recent European ancestry but it is impossible to truly establish this

based only on HVS-1 data and female European introgression would be unlikely, at least

with regard to the impact of the slave trade.

However, in spite of a general level of genetic homogeneity, significant differences were

observed at different resolutions of groupings, such as between clans from different

locations (Efik Efut) and between language groups from different locations. Conversely,

significant differences were not observed in other comparisons of the same type (Ejagham

of Akampa from Calabar and Netim). There appears to be no obvious pattern and these

differences are either an artefact, a consequence of multiple pairwise comparisons, or

local transient differences that gene flow will eventually extinguish (although some, at

least, may represent emerging differences which might be determined by further

anthropological fieldwork). Often the significant differences appeared at the UEP+MS

level of analysis and were lost when only one haplotype was removed from each group,

showing that at such a fine-scale of analysis differential reproductive success of just one

man in each group can potentially cause significant differentiation.

Therefore the data presented here appear to be consistent with the following demographic

model of male and female mediated gene flow in the Cross River region:

Culturally defined demes (be they defined by language or clan) are experiencing

substantial multidirectional gene flow, independent of geographic location, that has led to

a highly homogenous meta-population (that encompasses all demes). At times genetic

differentiation may occur within the meta-population among a subset of demes as a

consequence of reproductive success of one or more individuals within one or more

demes of this subset. However within a relatively few generations the high level of

among-deme gene flow that characterises the region will cause the distribution of the sex-

specific genetic systems of the culturally distinct demes to be statistically similar.

It would be interesting in the future to estimate, via computer simulations, the range of

values that parameters such as migration rate, migratory distance, deme population size,

generation time, deme and meta-population growth rate and reproductive success can

have that is consistent with the patterns of genetic variation presented here.

However, as discussed in more detail in Chapter 5, the resolution of the majority of Y

chromosomes in this study is limited to haplogroup E3a. Further delineation of E3a

with other UEP markers has the potential to reveal more genetic structuring amongst the

Cross River region (even despite the microsatellites also not demonstrating any

structuring) and therefore the demographic model described above should be treated with

some caution until further work is performed in the future.

3.4.3. Cross River, Ghana and Cameroon as genetically distinct regions

The results presented here clearly show genetic differentiation of both NRY and mtDNA

systems among the three geographically separated regions included in this study: the

Cross River region, the Grassfields of Cameroon and Ghana. This was expected as a

consequence of the large geographic distances involved, which would substantially

reduce gene flow and thus promote regional heterogeneity. The underlying genetic relationship

of the three regional groupings is evidenced by the high E3a haplogroup frequencies in all

three regions. This similarity at the UEP haplogroup level is an important contributor to

the AMOVA-based Fixation Indices being significant only at the 5% threshold. The

estimate of 5580 years before present of the TMRCA of all E3a-possessing individuals in

the three West Central African regions appears to support the scenario of E3a being

established in West Africa and expanding towards the Cameroon-Nigeria border in an

event that occurred prior to the expansion of the Bantu-speaking peoples since it does not

lie in the 3000-5000 year range previously attributed to the start of the latter migration.

Despite the crude assumptions used to generate this estimate (see footnote 16), if an inter-generation time

of 25 years had been used the TMRCA would have been 6975 years before present, which is

considerably older than the 4,839 years before present given by Thomas et al. (2000) for

the TMRCA of their E3a South African Bantu Y-chromosomes using the same method

(including a 25-year inter-generation time).
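
The rescaling quoted above is simple arithmetic and can be sketched as follows. The 20-year inter-generation time used for the original estimate is an assumption inferred from the ratio of the two quoted figures (6975/5580 = 1.25); it is not stated in this section.

```python
def rescale_tmrca(tmrca_years, old_generation_years, new_generation_years):
    """Convert a TMRCA in years from one inter-generation time to another."""
    # The number of generations is what the method actually estimates.
    generations = tmrca_years / old_generation_years
    return generations * new_generation_years


# Assumed: the 5580-year estimate was produced with a 20-year generation time.
rescaled = rescale_tmrca(5580, 20, 25)
```

Under that assumption the estimate corresponds to 279 generations, which gives the 6975-year figure when multiplied by 25.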

The three Cameroon Grassfields groups appear to be more genetically differentiated from

the Cross River region than the five Ghanaian groups. Yet the Cameroonian groups are

closer geographically and linguistically to the Cross River region. In addition, amongst

themselves the Cameroon Grassfields groups are more genetically heterogeneous than are

the Ghanaian populations, which are relatively homogenous despite being more

geographically disparate. The Grassfields, despite its name, is a largely highland area

made up of valleys broken up by hills and mountains (Mount Oku is located in the

Grassfields and is the second highest mountain in West Central Africa). Therefore the

greater differentiation of the three Cameroonian populations both from each other and

from the Cross River region groups is perhaps unsurprising given that the topography of the

region may have presented major physical barriers to gene flow among populations.

All three regions appear to broadly share the same NRY and mtDNA haplogroup types

and therefore share some recent common ancestry. As a consequence the genetic

differentiation between the Cameroon groups and the other two groups is not as

pronounced in UEP and mtDNA K2 genetic distance-based calculations (where the older

evolutionary relationships of NRY and mtDNA types have greater influence in the

generation of genetic distances) than in UEP+MS, MS and mtDNA FST genetic distance-

based calculations (where differences of recent origin are given equal or greater weight

than are those due to evolutionary older differences).

Footnote 16: The 95% CI of 5360-5820 years for this TMRCA estimate is likely a substantial underestimate because

of the crude assumptions applied in the method. Therefore, while presented for the interest of the reader as a

matter of routine, it is recommended that conclusions are not drawn from these CIs with regard to

population history.

A significant correlation was observed between geographic and genetic distance at the

UEP and mtDNA K2 levels, but also between linguistic and genetic distance at the

UEP+MS, MS and mtDNA VSO levels. The correlations in all cases are likely driven

primarily by the many pairs of small genetic and geographic distances resulting from

pairwise comparisons between the numerous Cross River groups and larger genetic and

geographic distances in pairwise comparisons of Cross River groups and non-Cross River

groups. Whether the correlation is ultimately best explained by geographic or linguistic

distance is substantially driven by the level of genetic differentiation of the Cameroonian

groups from the Cross River groups. Figure 3.3 shows that the linguistic distance matrix

records only a slightly greater distance between the Cross River languages and Ghanaian

languages than between the Cross River languages and the Cameroonian Grassfield

languages. The geographic matrix however records a much greater geographic distance

between the Cross River region and Ghana than between the Cross River region and

Cameroon. The level of Cameroonian and Cross River genetic differentiation is lowest in

the UEP- and K2-based calculations and hence genetic distances in this analysis show a

better fit with the geographic matrix, while the level of Cameroonian and Cross River

genetic differentiation is highest at the UEP+MS, MS and mtDNA VSO levels and thus

the genetic distances show a better fit with the linguistic distance matrix.

While these results show that both geographical location and language spoken are likely

to have impacted on the pattern of genetic diversity observed among and within the three

West Central African regions, the high level of heterogeneity in Cameroon demonstrates

the additional major influence of topography on this diversity. Therefore attempting to

interpret the significant correlations observed between genetic distance and linguistic

distance in some cases, and geographic distance in others, is probably an over-simplistic

approach that will lead to explanations that do not truly take into account the complex

demographic processes involved at such a fine geographical scale and thus will be of little

value. Both factors may, of course, be involved and applying other analytical techniques

such as multiple regression analysis (Lichstein 2007) may allow, in the future, the relative

contributions (Freckleton 2002) of geographic and language separation to sex-specific

differentiation among the datasets in this study to be established. Indeed the analyses

undertaken to date may be interpreted as indicating that over longer timescales

geographic distance has played a larger part in genetic isolation while languages currently

spoken have, to the extent that there is genetic differentiation, played their part in more

recent times. Ease of movement over the landscape, e.g. travel time on foot, is clearly a

better measure than raw distance. (Such an approach would be similar to that used on a

larger scale by Prugnolle et al. (2005).) However at this time such measurements are not

available and further work will be required to generate them. Better definition of the

linguistic matrix may also improve the analyses somewhat. (The matrix used here has a

fair measure of uncertainty attached to it, not least because some distances have been

estimated based on approximate lexicostatistics derived from figures for

phylolinguistically similar languages while others were inferred by other crude means.)

3.4.4. No genetic evidence that the Efik Uwanse have an origin in ancient Palestine

Analysis of the NRY and mtDNA profiles of the Efik Uwanse and comparisons to other

groups showed little or no evidence of a Palestinian origin for the group. PCO plots of

genetic distance clearly showed the Efik to be genetically similar, at least with regard to

the sex-specific systems, to the Ibibio and possibly the Igbo, though it was difficult to

differentiate between the contributions of these two Nigerian groups. However the almost

complete lack of sharing of NRY and mtDNA types found in the Efik Uwanse with

possible founder and source populations from Palestine, and along the proposed route to

present day Calabar, argues against admixture with any of these populations. This is

consistent with the findings of the Hart report (Hart 1964). Given the homogeneity of the

region it is not clear that the data support an Ibibio origin but such an origin is likely

given that all oral traditions, even those that claim an original eastern origin, record that

in the past the Efik lived amongst the Ibibio.

3.5. Conclusion

This study demonstrates the value of having dense sampling strategies and DNA of

known and detailed provenance, when at all possible, in studies of the distribution of

human genetic diversity in sub-Saharan Africa. There has, unfortunately, been a tendency

in some studies to use a limited number of sample sets, often of small size and undeclared

origin and relationships. This study has utilised a large, sociologically well defined,

dataset of a total of 1113 males collected from the Cross River region of Nigeria, an area

of some 7,000km2. A recent Y-chromosome study of Guinea-Bissau by Rosa et al. (2007)

analysed 282 samples to represent different ethnic groups from across the country, which

has a total area of some 35,000km2. That study stated that its sample set “extends

significantly the Y-chromosomal coverage of West African populations … both in size

and number of surveyed ethnic groups”. Whilst the findings of this particular paper are

not disputed here, given the paucity of information regarding genetic variation at a fine-

scale across sub-Saharan Africa as a whole there is no reason to believe that sample sizes of

the magnitude previously used are large and varied enough to permit genetic analysis to

make a significant contribution to answering the many complex questions likely to be

encountered in the course of unravelling demographic histories of specific African

ethnicities. Similarly, one must be careful about extrapolating to the rest of sub-Saharan

Africa or even to West Central Africa as a whole the conclusions drawn from this study

of the Cross River region. The Cross River region is in close proximity to the proposed

place of origin of the expansion of the Bantu-speaking peoples but is not part of it.

Therefore it may contain genetic characteristics that are atypical when viewed in a wider

geographical context.

In summation it has been shown that major cultural and language differences among

individuals and groups in West Central Africa can be maintained even in the presence of

substantial male and female mediated gene flow. Gene flow was inferred to be reduced as

geographic and linguistic distance among populations was increased, resulting in genetic

differentiation among neighbouring regions in sub-Saharan Africa. However it is likely

that much more complex processes are at work in these regions than are revealed by the

somewhat simplistic population genetic models used in this study. The value of well

defined datasets collected at a fine geographic scale as previously called for by

anthropologists and linguists (MacEachern 2000) working in Africa has been

demonstrated. Given the interesting similarities and differences observed among

culturally distinct groups living in close proximity revealed in this study, the undertaking

of further genetic surveys elsewhere in sub-Saharan Africa utilising in depth sampling

strategies and more advanced analysis should be encouraged.

3.6. Supplementary Section for Chapter 3

Because of their large size, Supplementary Tables 3S.1, 3S.2, 3S.3, 3S.4, 3S.5, 3S.6

and 3S.7 are provided on the attached CD-ROM.

4. The potentially deleterious functional variant FMO2*1 is at high frequency throughout sub-Saharan Africa

4.1. Introduction

Flavin-containing Monooxygenases (FMOs, EC1.14.13.8) catalyze the NADPH-dependent

oxidative metabolism of a variety of foreign chemicals that contain, as their site of

oxidation, a soft nucleophilic heteroatom, such as nitrogen, phosphorus, sulphur or

selenium (Cashman 2000; Krueger & Williams 2005). Substrates include therapeutic

drugs, dietary-derived compounds and environmental pollutants.

Humans possess five functional FMO genes, designated FMO1 to FMO5 (Lawton et al.

1994; Phillips et al. 1995; Hernandez et al. 2004). All but the FMO5 gene are present

within a 220-kb cluster on chromosome 1q24.3 (Hernandez et al. 2004). FMO5 is located

~26Mb closer to the centromere at 1q21.1 (Hernandez et al. 2004). A sixth gene, FMO6,

present within the cluster, does not produce a correctly spliced mRNA and thus appears to

be a pseudogene (Hines et al. 2002). A second FMO gene cluster, containing five

pseudogenes, FMO7P to FMO11P, is located ~4Mb centromeric of the FMO gene cluster

(Hernandez et al. 2004).

4.1.1. Previous work on Flavin-containing Monooxygenase 2

In most mammals, including non-human primates, FMO2 is the major isoform expressed

in the lung (Phillips et al. 1995; Yueh, Krueger & Williams 1997; Dolphin et al. 1998;

Krueger et al. 2001; Janmohamed et al. 2004). In humans, a single-nucleotide polymorphism (SNP)

in exon 9 (g.23238C>T, dbSNP #rs6661174) converts a glutamine codon at position

472 to a stop codon (Q472X), resulting in the production of a truncated polypeptide that

is functionally inactive (Dolphin et al. 1998). In

populations of European (n=79) and Asian (n=118) origin all individuals tested have been

found to be homozygous for this allele (FMO2*2A) (Dolphin et al. 1998; Whetstine et al.

2000). However, an allele, FMO2*1, that has previously been shown to encode a full-

length, functionally active protein (Dolphin et al. 1998; Krueger et al. 2002) has been

found in African-Americans (26%, n=180) (Dolphin et al. 1998; Whetstine et al. 2000;

Furnes et al. 2003) and Hispanics (see footnote 17)

(2-7%, n=280 and 327) (Krueger et al. 2004).

Substrates of human FMO2 include thioether-containing organophosphate pesticides,

such as phorate and disulfoton (Henderson et al. 2004a). In this case, products of the

FMO2-catalyzed reaction are substantially less toxic than the parent compounds (Neal &

Halpert 1982) and thus the enzyme has a protective role. However, FMO2 has also been

shown to catalyze S-oxygenation of thiourea and some of its derivatives (Henderson et al.

2004b), producing sulfenic and/or sulfinic acid metabolites, which are more toxic than the

parent compound (Neal & Halpert 1982). Sulfenic acid derivatives of thioureas can

deplete glutathione, leading to oxidative stress (Krieter et al. 1984); they can also bind to

sulphydryl groups on proteins and thus may directly perturb cell function (Onderwater et

al. 1999). Thus, if exposed to thiourea or its derivatives, individuals who possess an

FMO2*1 allele are predicted to be at increased risk of pulmonary toxicity. With an

estimated global production of 10,000 tonnes (CICADA2003), thioureas are present in a

wide range of industrial, household and medical products and, consequently, exposure to

these chemicals is widespread.

FMOs are also involved in the metabolism of therapeutic drugs, including several that are

used to treat multidrug-resistant tuberculosis (Vannelli, Dykman & Ortiz de Montellano

2002; Fraaije et al. 2004; Qian & Ortiz de Montellano 2006), which is a major health

problem in Africa, with an estimated 544,000 deaths in 2005

[http://www.who.int/mediacentre/factsheets/fs104/en/]. There is evidence that at least one

of these drugs, ethionamide (ETA), is a substrate for human FMO2 (Krueger & Williams

2005), but it is not known whether metabolism of the drug by FMO2 will increase or

decrease its efficacy or toxicity.

Footnote 17: According to the authors of the study, 'Hispanic' in this case referred to individuals of Mexican or Puerto

Rican descent.

4.1.2. The rationale for studying FMO2 in Africans

It has been shown previously that most African Americans have a significant European

contribution to their ancestry (~4 to ~30% (Reed 1969; Parra et al. 1998; Destro-Bisol et

al. 1999)) so it is likely that a functional FMO2 will be found at an even higher incidence

in sub-Saharan Africans than in African Americans. Since this may be important in regard

to drug efficacy and public safety the distribution of the FMO2*1 and FMO2*2A alleles

in multiple populations across Africa was assessed. Samples from the Middle East

(Turkey and Yemen) were also characterised to determine whether the FMO2*1 allele

was present at appreciable frequencies in populations outside but close to Africa.

In addition, the Long-Range Haplotype test (Sabeti et al. 2002b), which examines the

level of allele-specific haplotype linkage disequilibrium (LD), was used to analyse data

from the International HapMap project for evidence of positive selection at the

g.23238C>T SNP and sequence data were used from the NIEHS SNP program to a)

examine the haplotype backgrounds of the two g.23238C>T alleles and b) estimate the

time of origin of the FMO2*2A allele. This will help provide preliminary insights into the

evolutionary history and future of the FMO2 enzyme.

4.2.1. Sample Collection

DNA samples were prepared from buccal swabs from a sample of males over eighteen

years old unrelated at the paternal grandfather level from the following locations in and

around Africa: Algeria-Mostaganem (n=43), Algeria-Port Say (n=118), Cameroon-Mayo

Darle (n=119), Cameroon-Lake Chad (n=76), Ethiopia-Gambella (n=106), Ethiopia-

Addis Ababa (n=24), Ethiopia-Borena (and surrounding area) Wollo (n=36), Ethiopia-

Dessie (and surrounding area) Wollo (n=26), Ghana-Sandema (n=90), Ghana-Navrongo

(n=45), Malawi-Lilongwe (n=144), Malawi-Mangochi (n=60), Malawi-Mzuzu (n=56),

Morocco-Ifrane (n=70), Mozambique-Sena (n=84), Nigeria-Calabar (n=88), Senegal-

southern region (n=94), Senegal-Dakar (n=95), South Africa-Pretoria (n=41), Sudan-

northern region (n=136), Sudan-southern region (n=126), Tanzania-Kilimanjaro (n=50),

Turkey-East Anatolia (n=31), Turkey-West Anatolia (n=28), Uganda-Ssese Islands

(n=39), Yemen-Sena (n=34), Yemen-Hadramaut region (n=83), Zimbabwe-Mposi

(n=34). All samples were collected anonymously with informed consent. Sociological

data, including age, current residence, birthplace, self-declared cultural identity and

religion of the individual and of the individual's father, mother, paternal grandfather and

maternal grandmother were also collected. In addition, the African populations sampled

were grouped into four geographic regions (North Africa-NA, West Africa-WA, Central

East Africa-CEA, South East Africa-SEA), as delineated in Table 4.1. The two Anatolian

Turkish samples were considered to be from a single region (TU), as were the two

Yemeni samples (YE).

4.2.2. g.23238C>T typing

A 68-bp region containing the g.23238C>T SNP was amplified by PCR using the primers

FMO2-1414-UM (5ʹ-TGG CTG TGA GAC TCT ATT TCG GAC CCT GCA ACT CCG

A-3ʹ) and FMO2-1414-LM (5ʹ-CCA TTG CCC AGG CCC AAC CAG GCG ATA TT-

3ʹ). Each primer contained a single mismatch to its target sequence at the 3ʹ-end

penultimate nucleotide (underlined). The design of the primers was such that the

amplification product would contain recognition sites for the restriction endonucleases

MboI (GATC), if the target sequence contained a C at position 23238, and MseI (TTAA),

if the target sequence contained a T at position 23238.

DNA was amplified in 10-µl reaction volumes containing 0.4 µM of each primer, 0.13

units Taq DNA polymerase (HT Biotech, Cambridge, UK), 9.3 nM TaqStart™

monoclonal antibody (BD Biosciences Clontech, Oxford, UK), 200 µM dNTPs and

reaction buffer supplied with the Taq polymerase. The cycling parameters were: 5 min of

pre-incubation at 93°C, followed by 37 cycles of 93°C for 1 min, 55°C for 1 min and 72°C

for 1 min.

The resultant PCR product was used for two independent, complementary restriction

endonuclease (RE) digestions that each targeted one of the two introduced RE sites (see

Figure 4.1). RE digestions were carried out in 10-µl volumes containing 4 µl of PCR

product, 0.7 units RE (MboI or MseI), BSA and reaction buffer according to the

supplier's recommendations (New England Biolabs, Hitchin, UK). All reactions were

incubated overnight at 37°C. After RE digestion DNA fragments were resolved by

electrophoresis through a 3.5% agarose gel. When full-length PCR product is digested

with MboI, FMO2*1 alleles are cleaved, resulting in two fragments of length 35bp and

33bp, respectively. When full-length PCR product is digested with MseI, FMO2*2A

alleles are cleaved, resulting in two fragments of length 38bp and 30bp, respectively. The

gel-banding patterns observed for the two assays and associated genotypes are shown in

Figure 4.2. Samples whose genotype had already been determined by another

laboratory (Professor Ian Phillips, School of Biological and Chemical Sciences, Queen Mary,

University of London) using alternative methodologies (sequencing and allele-specific PCR)

were used to validate the assay described above.
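
The logic for reading a genotype from the two complementary digests can be sketched as below. This is an illustrative sketch only: the function name, the genotype labels and the representation of a gel lane as a set of fragment lengths are assumptions, and in practice fragments of such similar size (for example 35 bp and 33 bp) may not resolve as separate bands.

```python
AMPLICON_BP = 68  # full-length PCR product described in the text


def call_genotype(mboi_bands, msei_bands):
    """Call the g.23238C>T genotype from the two complementary digests.

    Each argument is the set of fragment lengths (bp) seen in that lane.
    MboI cuts FMO2*1 (C) products into 35 + 33 bp;
    MseI cuts FMO2*2A (T) products into 38 + 30 bp.
    """
    mboi, msei = set(mboi_bands), set(msei_bands)
    has_c = {35, 33} <= mboi              # some product was cut by MboI
    has_t = {38, 30} <= msei              # some product was cut by MseI
    uncut_in_mboi = AMPLICON_BP in mboi   # product resistant to MboI (T allele)
    uncut_in_msei = AMPLICON_BP in msei   # product resistant to MseI (C allele)

    if has_c and uncut_in_msei and not has_t and not uncut_in_mboi:
        return "FMO2*1/FMO2*1"
    if has_t and uncut_in_mboi and not has_c and not uncut_in_msei:
        return "FMO2*2A/FMO2*2A"
    if has_c and has_t and uncut_in_mboi and uncut_in_msei:
        return "FMO2*1/FMO2*2A"
    return "inconsistent"  # the two digests disagree; the assay should be repeated
```

Requiring the two digests to agree is what makes the assays complementary: a failed digestion in one lane shows up as an inconsistent call rather than as a false homozygote.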

Figure 4.1: Diagrammatic representation of 23238C>T SNP restriction

enzyme assay.

Figure 4.2: #rs6661174 MboI/MseI Complementary Restriction Enzyme

Digest Banding Patterns.

Tests for departure of observed genotype frequencies from those expected under Hardy-

Weinberg equilibrium (Guo & Thompson 1992) were performed using Arlequin software

(Schneider, Roessli & Excoffier 2000). Pairwise FST values were estimated from

AMOVA ΦST values (Reynolds, Weir & Cockerham 1983).

Logistic Regression analysis was performed to evaluate the differences in the FMO2*1

allele frequency among subgroups within regions and among regions in which the

subgroups had similar allele frequencies. This was undertaken by first testing for fit of the

subgroup frequencies to a model which allowed only for regional differences in the

FMO2 allele frequencies. Pearson chi square tests were subsequently performed to test

for overall heterogeneity within individual regions. If significant heterogeneity was found

in a region, further pairwise comparisons of the subgroups within the region were made

by Fisher Exact tests. For logistic regression analysis and post hoc region and subgroup

comparisons, individuals were categorised into two groups on the basis of whether or not

(Y =0,1) they possessed at least one FMO2*1 allele (in this way the sample size equalled

the number of individuals studied, n, rather than the number of chromosomes, 2n, thus

ensuring that the observations were truly independent). This analysis assumes that

individuals in populations are unrelated.
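
The carrier coding and the within-region heterogeneity statistic described above can be sketched as follows. The function names and the allele labels "*1" and "*2A" are illustrative assumptions; only the Pearson statistic itself is computed here, and the logistic regression and Fisher exact steps are not reproduced.

```python
def carrier_indicator(genotype):
    """Return Y = 1 if the individual carries at least one FMO2*1 allele, else 0.

    `genotype` is a pair of allele labels, e.g. ("*1", "*2A").
    """
    return 1 if "*1" in genotype else 0


def pearson_chi_square(counts):
    """Pearson chi-square statistic for a k x 2 table.

    `counts` holds one (carriers, non-carriers) row per subgroup.
    Every row and column total must be non-zero.
    """
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    total = sum(row_totals)
    stat = 0.0
    for row, r_tot in zip(counts, row_totals):
        for observed, c_tot in zip(row, col_totals):
            expected = r_tot * c_tot / total
            stat += (observed - expected) ** 2 / expected
    return stat
```

Coding each individual once, rather than each chromosome, keeps the table total equal to n, so the two chromosomes of one person are never counted as independent observations.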

Principal coordinates analysis (PCO) was performed, using GENSTAT5 software, on

pairwise similarity matrices. Here similarity was quantified as being equal to the value of

the genetic distance subtracted from 1.0 (1-FST). Values along the main diagonal,

representing the similarity of each population sample to itself, were calculated from the

estimated genetic distance between two copies of the same sample. For AMOVA-based

FST distances, the resulting similarity of a sample to itself simplifies to n/(n–1).
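
A minimal sketch of how such a similarity matrix might be assembled is given below. It is not the GENSTAT5 procedure; the function name is hypothetical, and n is assumed here to be the sample size of each population sample.

```python
def similarity_matrix(fst, sample_sizes):
    """Build a PCO input matrix: 1 - FST off the diagonal, n/(n-1) on it.

    `fst` is a symmetric list of lists of pairwise FST values and
    `sample_sizes[i]` is n for population sample i (an assumption about
    what n denotes in the text).
    """
    k = len(fst)
    sim = [[1.0 - fst[i][j] for j in range(k)] for i in range(k)]
    for i, n in enumerate(sample_sizes):
        sim[i][i] = n / (n - 1)  # self-similarity slightly above 1
    return sim
```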

A Mantel test for the correlation between a matrix of pairwise FST values and a

corresponding matrix of pairwise geographic distances was performed within the R-

programming environment, using routines found in the APE package.
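
For readers unfamiliar with the Mantel test, the permutation logic can be sketched in plain Python as below. This is not the APE routine used in the study; the function names, the one-sided alternative and the default of 999 permutations are all assumptions made for illustration.

```python
import random


def _upper(m):
    """Flatten the strict upper triangle of a square matrix."""
    k = len(m)
    return [m[i][j] for i in range(k) for j in range(i + 1, k)]


def _pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mantel(gen, geo, permutations=999, seed=0):
    """One-sided Mantel test; returns (r, p) for a positive correlation."""
    rng = random.Random(seed)
    k = len(gen)
    geo_flat = _upper(geo)
    r_obs = _pearson(_upper(gen), geo_flat)
    hits = 0
    for _ in range(permutations):
        # Permute population labels: rows and columns are shuffled together.
        order = list(range(k))
        rng.shuffle(order)
        permuted = [[gen[order[i]][order[j]] for j in range(k)] for i in range(k)]
        if _pearson(_upper(permuted), geo_flat) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)
```

Rows and columns are permuted together because the entries of a distance matrix are not independent, which is why an ordinary correlation test cannot be applied to them directly.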

Spatial autocorrelation analysis was performed using AIDA software. Ten distance

classes were set, with limits chosen to make the number of pairwise

comparisons of chromosomes as similar as possible across classes, and II and cc indices (Bertorelle &

Barbujani 1995) were calculated (analogous to Moran's I and Geary's c, respectively)

within each class. Graphs were plotted using Microsoft Excel.
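
One way of obtaining distance classes with similar numbers of pairwise comparisons is to place the class limits at quantiles of the pairwise distances, as sketched below. How AIDA itself sets its classes is not described here, so the quantile rule and the function name are assumptions.

```python
def equal_count_class_limits(distances, n_classes=10):
    """Upper limits of distance classes holding roughly equal numbers of pairs.

    `distances` is the list of all pairwise geographic distances and must
    contain at least `n_classes` values.
    """
    ordered = sorted(distances)
    n = len(ordered)
    if n < n_classes:
        raise ValueError("need at least as many distances as classes")
    # The k-th limit is the largest distance among the first k/n_classes of pairs.
    return [ordered[(k * n) // n_classes - 1] for k in range(1, n_classes + 1)]
```

Tied distances can make the class counts unequal, so the result should be checked against the actual number of comparisons falling in each class.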

Genetic boundary analysis was performed as described by Barbujani et al. (1989). Surface

interpolation was performed on observed FMO2*1 allele frequencies using SURFACE,

part of the Generic Mapping Tools package (Wessel & Smith 1998), to create a grid

(surface) of estimated allele frequencies every 0.5° latitude and 0.5° longitude over a

region covering Central East Africa (Latitude: 6-17° N, Longitude: 27-41° E). A vector

consisting of the measures Average Value of Absolute Magnitude (AVMA) and Average

Direction (AD) was calculated from the centre of all 0.5° longitude by 0.5° latitude

regions (termed pixels) across the entire surface. Major genetic boundaries were found

using 'criterion 2' of Barbujani et al. (1989). The highest decile described by Barbujani et

al. (1989) was replaced with the highest 5% of AVMAs and the second highest decile was

replaced with the next highest 5% of AVMAs. This analysis was performed using

software developed at TCGA (Python code available on request from Krishna Veeramah).

A contour map (tension factor = 0.25) of estimated FMO2*1 allele frequencies for this

region was created using Generic Mapping Tools software (Wessel & Smith 1998) and

proposed boundaries plotted onto this map.
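
The ranking step of the boundary analysis, in which pixels are split into the highest 5% and the next highest 5% of AVMA values, can be sketched as follows. This covers only the ranking; it is not the software developed for the study, the remaining steps of criterion 2 are not reproduced, and the function name and the dictionary representation of pixels are assumptions.

```python
def boundary_candidates(avma_by_pixel, fraction=0.05):
    """Return (top, next) lists of pixel keys ranked by AVMA.

    `avma_by_pixel` maps a pixel identifier, e.g. a (lat, lon) tuple,
    to its AVMA value. `top` holds the highest `fraction` of pixels and
    `next` the following `fraction`.
    """
    ranked = sorted(avma_by_pixel, key=avma_by_pixel.get, reverse=True)
    cut = max(1, round(len(ranked) * fraction))
    return ranked[:cut], ranked[cut:2 * cut]
```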

4.2.3.1. The Long-Range Haplotype Test

The Long-Range Haplotype test (Sabeti et al. 2002b) involves calculating the extended haplotype

homozygosity (EHH) statistic at some pre-defined distance from a core region.

When the core region is a binary SNP, the EHH for an allele x at the core SNP is the

probability that two randomly chosen samples from a population of individuals with x

have the same SNP-based haplotype extending from x to a SNP at some pre-defined

distance. EHH therefore is a measure of haplotype conservation or linkage disequilibrium

from the allele x and is on a scale of 0-1.
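
The definition above translates directly into a count of identical pairs, as in the following sketch. The function name and the representation of each haplotype as a string of SNP alleles are illustrative assumptions.

```python
from collections import Counter
from math import comb


def ehh(haplotypes):
    """Extended haplotype homozygosity for one core allele.

    `haplotypes` holds, for each chromosome carrying the core allele, the
    string of SNP alleles from the core out to the chosen distance.
    Returns the probability that two chromosomes drawn without
    replacement carry identical strings.
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("EHH needs at least two chromosomes")
    # Count the pairs that fall within the same haplotype class.
    same_pairs = sum(comb(c, 2) for c in Counter(haplotypes).values())
    return same_pairs / comb(n, 2)
```

For four chromosomes split evenly between two haplotypes, two of the six possible pairs are identical, giving an EHH of one third.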

The Long-Range Haplotype test was performed in this chapter using two different

International HapMap Project dataset releases [http://www.hapmap.org] from four

different populations.

HapMap Phase I encompasses the following: the rel#16c.1 YRI build, consisting of

1,076,451 SNPs genotyped in 30 parent-offspring trios from the Yoruba in Ibadan, Nigeria, the

rel#16c.1 CEU build, consisting of 1,105,072 SNPs genotyped in 30 parent-offspring

trios from the Centre d'Etude du Polymorphisme Humain (CEPH-Utah residents with

ancestry from northern and western Europe) panel, the rel#16c.1 CHB build, consisting of

1,088,689 SNPs genotyped in 45 unrelated Han Chinese from Beijing, China and the

rel#16c.1 JPT build, consisting of 1,088,426 SNPs genotyped in 45 unrelated Japanese in

Tokyo, Japan. In each dataset approximately 1 SNP is genotyped every 5kb across the

human genome.

HapMap Phase II encompasses the following: the rel#21 YRI build, consisting of

3,241,616 SNPs genotyped in 30 parent-offspring trios from the Yoruba in Ibadan,

Nigeria, the rel#21 CEU build, consisting of 1,105,072 SNPs genotyped in 30 parent-

offspring trios from the CEPH panel and the rel#21 CHB+JPT build, consisting of

3,305,784 SNPs genotyped in the 45 CHB+JPT panel. Because of their high genetic

similarity it is accepted practice to pool the CHB and JPT datasets. In each dataset

approximately 1 SNP is genotyped every 2kb across the human genome.

Haplotype phase inference for these data was performed by the HapMap consortium

using Phase 2.0 software. Recombination rate data are based on averaged recombination

rates across all four HapMap populations (Hudson 2001; Myers et al. 2005): YRI, CEU,

CHB and JPT.

The iHS method of Voight et al. (2006) was applied to the g.23238C>T SNP in the

HapMap Phase I YRI dataset and to the FMO2 gene in the HapMap Phase I YRI, CEU

datasets and a pooled JPT+CHB dataset, using the web-based tool Haplotter

[http://pritch.bsd.uchicago.edu/data.html].

A similar Long-Range Haplotype test method for detecting selection using HapMap

Phase II data was developed specifically during this study. This method is described

below.

4.2.3.1.1. Calculating EHH at the 23238C>T locus

SNP haplotypes were extracted from the HapMap dataset for each of the three populations

(either the 60 unrelated YRI parents, the 60 unrelated CEU parents or all 90 unrelated

CHB+JPT individuals) over a region extending 2.0cM either side of the core SNP

(23238C>T). To enable a more direct comparison with EHH values generated for other

core SNPs, the overall haplotype SNP density was controlled at approximately one SNP

every 0.05cM. Individual haplotypes were then placed into one of two groups based on

which allele they possessed at the core SNP (or just one group if the SNP was

monomorphic, as is the case for 23238C>T in the CEU and CHB+JPT populations), and

EHH values calculated for both alleles (if applicable) at twenty pre-defined genetic

distances (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8,

1.9 and 2.0cM) either side of the core SNP, as described by Sabeti et al. (2002b) and

Mueller and Andreoli (2004).
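The EHH calculation described above can be illustrated with a minimal Python sketch. It assumes the Sabeti et al. (2002b) definition of EHH as the proportion of pairs of core-allele-carrying chromosomes that are identical from the core SNP out to the chosen distance. The function names, the representation of haplotypes as strings and the toy data are all hypothetical and are not taken from the software used in this study.

```python
from collections import Counter
from math import comb


def ehh(extended_haplotypes):
    """EHH for chromosomes that all carry the same core allele.

    extended_haplotypes: one string per chromosome, giving the alleles
    from the core SNP out to the distance of interest.
    """
    n = len(extended_haplotypes)
    if n < 2:
        raise ValueError("EHH needs at least two chromosomes")
    counts = Counter(extended_haplotypes)
    # Number of chromosome pairs that share an identical extended haplotype
    identical_pairs = sum(comb(c, 2) for c in counts.values())
    return identical_pairs / comb(n, 2)


def ehh_by_core_allele(haplotypes, core, end):
    """Group haplotypes by the allele at index `core`, then compute EHH
    over the SNPs lying between `core` and `end` (inclusive)."""
    lo, hi = min(core, end), max(core, end)
    groups = {}
    for hap in haplotypes:
        groups.setdefault(hap[core], []).append(hap[lo:hi + 1])
    return {allele: ehh(g) for allele, g in groups.items() if len(g) >= 2}


# Toy data (illustrative only): six chromosomes, core SNP at index 0
toy = ["CAAG", "CAAG", "CAAT", "TAAG", "TAAG", "TAAG"]
result = ehh_by_core_allele(toy, core=0, end=3)
```

In this toy example the three T-carrying chromosomes are identical, so their EHH is 1, whereas only one of the three possible pairs of C-carrying chromosomes is identical. A monomorphic core SNP simply yields a single group, as for 23238C>T in the CEU and CHB+JPT datasets.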

4.2.3.1.2. Creating an empirical null distribution using genetic and physical distances

To test whether any of the EHH values generated from analysis of 23238C>T indicated

positive selection in the YRI dataset, 10,000 SNPs of similar homozygosity (See Table

1.10.1 of Cavalli-Sforza, Menozzi & Piazza 1994) (±2.5%) to 23238C>T, and with a

minimum haplotype density of one SNP every 0.05cM for up to 2cM either side of the

core SNP, were randomly chosen from across the genome (X and Y-chromosomes

excluded) from within the same HapMap population dataset. EHH values were then

calculated for these SNPs, as described for the 23238C>T SNP above. This resulted in

two empirical distributions of 10,000 EHH values at each of the twenty pre-defined

genetic distances, one for lower frequency alleles and one for higher frequency alleles.

When testing for significance of EHH values for the only allele present in the CEU and

CHB+JPT samples, FMO2*2A, all random core SNPs were required to have a

homozygosity value of one. Because in these populations the SNPs were monomorphic

there was only one empirically observed distribution at each genetic distance.

EHH values for the 23238C>T alleles could then be compared with the relevant

distribution, via a one-tail test, to establish whether the level of LD extending over a

particular distance from the core SNP was unusual in comparison with other alleles of

similar frequency found across the genome. EHH values were considered outliers if they

lay in the upper 5% tail of the relevant distribution. All cM genetic distances were

estimated using the recombination rates determined by HapMap. This results in

evaluation of EHH values for each allele at 20 predefined genetic distances. If positive

selection had taken place EHH values within the upper 5% tail of the relevant distribution

would be expected over a relatively continuous region (i.e. a number of consecutive cM

intervals, significance of which could be assessed via a Runs Test (Sokal & Rohlf 1994)).
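The comparison of an observed EHH value with its empirical null distribution can be sketched as follows. The P-value is taken as the proportion of null values equal to or greater than the observed value (a one-tail test), and a helper counts the longest run of consecutive distances at which the observed value falls in the upper 5% tail. Both function names and the example numbers are hypothetical, and the formal Runs Test is not implemented here.

```python
def empirical_p(observed, null_values):
    """One-tail empirical P-value: proportion of null EHH values that are
    equal to or greater than the observed EHH value."""
    if not null_values:
        raise ValueError("null distribution is empty")
    n_extreme = sum(1 for v in null_values if v >= observed)
    return n_extreme / len(null_values)


def longest_significant_run(p_values, alpha=0.05):
    """Length of the longest run of consecutive distances with P < alpha.

    p_values must be ordered by distance from the core SNP.
    """
    longest = current = 0
    for p in p_values:
        current = current + 1 if p < alpha else 0
        longest = max(longest, current)
    return longest


# Illustrative values only
p_example = empirical_p(0.9, [0.1, 0.5, 0.9, 1.0])
run_example = longest_significant_run([0.20, 0.01, 0.03, 0.50, 0.04])
```

A single isolated distance with P < 0.05 gives a run of length one, which would not meet the criterion of elevated EHH over a relatively continuous region.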

The LRH test was repeated using physical rather than genetic distances. EHH was

estimated at 20 pre-defined physical distances (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9,

1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9 and 2.0Mb) either side of the core SNP and a

haplotype SNP density of approximately one SNP every 10kb. The null distribution was

based on only 1,000 data points, with the intention that this could be increased to 10,000

if a particular run showed indications of selection. The rationale for repeating this

analysis using physical distances was that both the estimated genetic distances and EHH

values from core SNPs are derived from the same HapMap data. If genetic distances are

utilised and the SNP is biallelic, the recombination rate is estimated from the SNP and

will be an average for both allele backgrounds. In this case, extreme EHH outliers from

the average rate can be detected and are signatures of selection (though this analysis may

be considered conservative). However, if the SNP being tested is monomorphic the

recombination rate estimated from this point will be directly correlated to the single EHH

value estimated and so no potential selection will be detected. By using physical

distances, EHH acts as a proxy for LD and allows for the identification of possible positive

selection if the recombination rate for a region extending from a particular locus for a pre-

defined distance is exceptionally low. However, this method is not ideal as it will not

control for local variation in recombination rates (e.g., recombination hotspots), which is

why the use of genetic distances is preferable when applicable (non-monomorphic data).

Interestingly, comparison of the Illumina pedigree-based genetic map

[http://www.illumina.com/pages.ilmn?ID=191] with the HapMap recombination map, in

the region in which the g.23238C>T (rs6661174) SNP (which is not included in the

Illumina map) lies, showed there to be a greater genetic distance between two SNPs that

are included in both maps (rs913257 and rs7877) in the Illumina map (1.26cM) than in

the HapMap map (0.99cM). On the other hand, the distance between the first (rs884080)

and last (rs2027432) SNPs in the Illumina map for chromosome 1 is shorter (273.14cM)

than in the HapMap map (280.391cM). This raises the possibility that the FMO2 gene in

HapMap has a lower than expected LD in comparison with the rest of the chromosome,

resulting in extreme EHH values for a particular allele at the 23238C>T locus being even

stronger evidence of selection. This analysis was performed using software designed at

TCGA (Python code available on request from Krishna Veeramah).

In comparison to the method of Voight et al. (2006), the approach described above has

the advantage of controlling for haplotype SNP density, which may severely impact

actual EHH estimates (see Figure 4.3 for an example of two distributions using different

haplotype densities), whereas the Voight et al. (2006) approach is likely to be a more

sensitive test because the level of haplotype homozygosity surrounding a SNP is more

finely described by the standardised iHS statistic. However, the underlying principle of

looking for unusually high levels of haplotype homozygosity in large sets of empirical

distributions is similar and the method presented here effectively complements that of

Voight et al. (2006).

4.2.3.2. Estimating the age of the g.23238C>T mutation

Individuals of various ethnicities, including a subset of HapMap samples, have been

sequenced for all exons of environmental response genes, including FMO2, as part of the

National Institute of Environmental Health Sciences (NIEHS) SNP program (NIEHS

SNPs. NIEHS Environmental Genome Project, University of Washington, Seattle, WA

[http://egp.gs.washington.edu] (June, 2007)). FMO2 sequencing data from the

NIEHS SNP program were utilised to a) examine the distribution of genetic variants in

individuals of recent African descent as well as b) to estimate the time when the

g.23238C>T mutation occurred.

All FMO2 exon variants for all 95 NIEHS Panel 2 samples were extracted and haplotypes

were inferred using PHASE version 2.1 (Stephens, Smith & Donnelly 2001; Stephens &

Donnelly 2003). Haplotypes for a subset of NIEHS variants (those variants with a

relatively large minor allele frequency with few or no undetermined genotypes) had

already been inferred by the NIEHS and this information was utilised when inferring

phase of the remaining variants. The pairs of haplotypes of the 12 Yoruba and 15 African-

American NIEHS samples were then examined for the presence and distribution of

synonymous mutations, non-synonymous mutations, insertions and deletions.

To estimate the time of origin of the g.23238C>T SNP all biallelic SNPs, insertions and

deletions found within FMO2-defined exons, introns and untranslated regions by the

NIEHS SNP program for all 95 NIEHS Panel 2 samples were extracted and haplotypes

inferred using fastPhase version 1.2 (Scheet & Stephens 2006) (Phase was not used on

this occasion because of the large number of variants present. With regard to the

haplotype inference for only the coding region variants that was performed using Phase

2.1, only phased haplotypes of two individuals were found to differ in comparison to the

fastPHASE 1.2 output. One individual differed by a single variant that was only found

once in the entire dataset, and therefore is irresolvable by either method, and the other

individual differed at only two SNPs, but was homozygous T for the g.23238C>T SNP,

therefore making the impact on any further analysis minimal). Because the g.23238C

allele was present only in Yoruba and African-American NIEHS Panel 2 samples and the

African-American samples are of uncertain demographic origin, only the NIEHS Yoruba

samples were used in subsequent analysis.

Figure 4.3: Distribution of 10,000 EHH values calculated 0.4cM from core alleles with A) SNP haplotype density not

controlled and B) SNP haplotype density controlled at 1 SNP per 0.05cM (8 SNP extended haplotypes).

Within the SNAP workbench (Price & Carbone 2005; Aylor, Price & Carbone 2006),

RecPars software (Hein 1990) showed that recombination was likely to have occurred a

number of times within the sequences found in the Yoruba samples. Therefore the

coalescent-based approach of Griffiths and Marjoram (1996) (using the program

recomb58) was applied, which takes recombination into account when calculating

maximum likelihood estimates of mutation rate and recombination rate as well as

estimating the time to the most recent common ancestor and the time since the origin of

every variant nucleotide within a collection of observed sequences. However, the

approach of Griffiths and Marjoram (1996) will fail with very large numbers of

recombination events. Therefore LDhat version 2.0 was used to identify a region of

sequence surrounding the g.23238C>T SNP in the Yoruba that had sufficiently low

numbers of recombination events to enable recomb58 to be used. This was done by

iteratively removing 500bp of sequence from the 5ʹ end of the gene and running LDhat to

estimate the recombination rate (ρ) (the number of recombination events per gene

(sequence) per generation) (sequence from the 3ʹ end was not removed because the

g.23238C>T SNP is located close to this end of the gene). A region 8500bp in length was

eventually found (extending 6280bp upstream and 2219bp downstream of g.23238C>T

respectively), which contained 38 segregating sites (inclusive of g.23238C>T), with a ρ

value of approximately 2, which was deemed reasonable to be used in recom58. LDhat

also gave Watterson's estimate of θ (Watterson 1975) (mutation rate – the number of

mutations per gene (sequence) per generation) for this region, of approximately 10. The

ancestral state of each segregating site was found by comparison with the chimpanzee and

macaque FMO2 genome assembly sequences found within Ensembl Genome Database.

Recom58 was run on observed data (derived from the sequence of the 8500bp region of

FMO2 that contained g.23238C>T, from 24 phased NIEHS Yoruba chromosomes) for

5,000,000 iterations, with initial generating parameters of θ = 10 and ρ=2, over a

likelihood surface for θ ranging from 0.25 to 20.00, with 0.25 increments, and for ρ

ranging from 0.1 to 4.0, with 0.1 increments. From this first run, maximum-likelihood

estimates of θ and ρ were 10.50 and 0.80, respectively. Recom58 was then rerun on the

same observed data using these new estimated parameter values of θ and ρ (again running

for 5,000,000 iterations over the same likelihood surface range), to investigate

characteristics of the ancestral distributions, including the time of origin of the

g.23238C>T SNP. The values of θ and ρ estimated on this occasion were 11.25 and 0.70,

respectively, similar to the previous run. The mean of the expected time since mutation

for the g.23238C>T SNP from this run, in coalescent units (T), was 1.1962 (standard

deviation = 0.2127). T can be converted into time t, in years, using the expression

t = 2 * Ne * T * (generation time in years), where Ne is the effective population size. An Ne figure

of 7500 was used. This value was estimated using linkage disequilibrium data from the

Yoruba HapMap samples (Tenesa et al. 2007). Given T, Ne and an inter-generation time

for humans of 28 years (Fenner 2005), an estimated time since mutation for g.23238C>T

in years (t) can be calculated.
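The conversion from coalescent units to years is simple arithmetic and can be written out as a short Python sketch using the values quoted above (T = 1.1962, Ne = 7500, generation time = 28 years). The function name is hypothetical.

```python
def coalescent_to_years(T, Ne, generation_years):
    """Convert a time in coalescent units (units of 2*Ne generations)
    into years: t = 2 * Ne * T * generation time."""
    return 2 * Ne * T * generation_years


# Values quoted in the text
t_years = coalescent_to_years(T=1.1962, Ne=7500, generation_years=28)
# t_years is approximately 502,404 years
```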

Lower and upper boundaries were also calculated for the estimated age of the mutation. A

gamma distribution (α = 31.616, θ = 0.037) was estimated from the mean and standard

deviation of the coalescent time estimate from recomb58 in order to calculate 95%

confidence intervals (2.5% = 0.816, 97.5% = 1.648) in the R programming environment.

Upper (8751) and lower (4889) estimates of Ne were taken from maximum and minimum

values of Ne calculated for the Yoruba HapMap sample for each individual chromosome

(disregarding the acrocentric chromosomes 21 and 22 and X, which all demonstrate

substantially different Ne estimates to the other chromosomes and thus may be behaving

somewhat abnormally), adjusted here to take into account the 18% underestimate from

HapMap data due to ascertainment bias (Tenesa et al. 2007), and upper and lower

generation times were taken as 19.4 years (the mean female age of first birth in hunter-

gatherer societies) and 36.1 years (the mean female age of last birth in nation states) (Fenner

2005). Given T and the above values of Ne and inter-generation time for humans, a lower

and upper boundary for the time since mutation for g.23238C>T in years (t) can be

calculated.
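A minimal sketch of the boundary calculation is given below. It assumes that the gamma distribution was fitted by matching moments (shape = mean²/sd², scale = sd²/mean), which gives values close to those quoted above; the exact fitting procedure used in R is not stated. The 2.5% and 97.5% quantiles are taken directly from the text rather than recomputed, and the Ne bounds are assumed to be already adjusted for ascertainment bias. Combining the lower values of all three quantities (and likewise the upper values) is one plausible reading of the procedure.

```python
def gamma_from_moments(mean, sd):
    """Method-of-moments gamma parameters (shape, scale)."""
    shape = (mean / sd) ** 2
    scale = sd ** 2 / mean
    return shape, scale


def coalescent_to_years(T, Ne, generation_years):
    """t = 2 * Ne * T * generation time in years."""
    return 2 * Ne * T * generation_years


# Mean and standard deviation of T quoted in the text
shape, scale = gamma_from_moments(mean=1.1962, sd=0.2127)

# Quantiles of T, Ne bounds and generation times quoted in the text
lower_years = coalescent_to_years(T=0.816, Ne=4889, generation_years=19.4)
upper_years = coalescent_to_years(T=1.648, Ne=8751, generation_years=36.1)
```

Under these inputs the lower boundary is roughly 1.5 x 10^5 years and the upper boundary roughly 1.0 x 10^6 years, bracketing the point estimate obtained with T = 1.1962, Ne = 7500 and a 28-year generation time.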

4.3. Results

4.3.1. The distribution of 23238C>T in Africa

The g.23238C>T allele frequencies and geographic locations for populations typed in this

study are shown in Table 4.1 and Figure 4.4. The overall FMO2*1 allele frequency for all

samples from Africa (n=1800) was 0.153, with 28.3% of individuals having at least one

FMO2*1 allele. Across all 24 populations in Africa the observed percentage of

individuals who have at least one FMO2*1 allele ranged from 4.3% to 49.1%. For samples from

sub-Saharan Africa (n=1569) the overall FMO2*1 allele frequency was 0.170, with

31.4% of individuals having at least one FMO2*1 allele and across these 21 populations

the observed range of frequencies of FMO2*1-carrying individuals was 17.8-49.1%. The

Yemen sample (n=117) had an overall FMO2*1 allele frequency of 0.047, with 8.5% of

individuals having at least one FMO2*1 allele. The FMO2*1 allele was not observed in

the Anatolian Turkish sample (n=59). No population deviated significantly from Hardy-

Weinberg equilibrium (P>0.12).
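The summary statistics reported for each population follow directly from the genotype counts. The Python sketch below computes the FMO2*1 allele frequency, the proportion of individuals carrying at least one FMO2*1 allele, and a Hardy-Weinberg test. The use of a chi-square goodness-of-fit test with one degree of freedom is an assumption, since the test actually used is not specified here, and the function name is hypothetical. The example uses the Mayo Darle genotype counts from Table 4.1.

```python
from math import erfc, sqrt


def summarise_genotypes(n11, n12, n22):
    """Summarise counts of FMO2*1/*1, *1/*2A and *2A/*2A genotypes."""
    n = n11 + n12 + n22
    p = (2 * n11 + n12) / (2 * n)        # FMO2*1 allele frequency
    carriers = (n11 + n12) / n           # at least one FMO2*1 allele
    expected = (n * p * p, 2 * n * p * (1 - p), n * (1 - p) ** 2)
    observed = (n11, n12, n22)
    # Skip classes with zero expectation (monomorphic samples)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper-tail probability of a chi-square variable with 1 df
    p_hwe = erfc(sqrt(chi2 / 2))
    return {"n": n, "freq": p, "carriers": carriers, "chi2": chi2, "p_hwe": p_hwe}


# Mayo Darle, Cameroon (Table 4.1): 2, 39 and 78 individuals
mayo_darle = summarise_genotypes(2, 39, 78)
```

For Mayo Darle this reproduces the tabulated FMO2*1 frequency of 0.181 and carrier proportion of 34.45%.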

Using Logistic Regression on the proportion of individuals with at least one FMO2*1

allele, significant differences were found both among regions (P<0.0001, df=5) and

among populations within regions (P<0.04, df=23). The major factor contributing to

among-region differences is likely to be the noticeably lower FMO2*1 frequencies

observed in non-sub-Saharan African populations in comparison to sub-Saharan African

populations. Pearson's Chi Square tests were performed to explore within-region

differences (see Table 4.2). The only statistically heterogeneous region was Central East

Africa (CEA) (P<0.003, df=5). Exclusion of CEA from the Logistic Regression analysis

resulted in no significant differences (P=0.25, df=15) among populations within the

remaining regions.

In order to make pairwise comparisons of regions using Fisher's Exact test, populations

within each region were pooled, except in the case of CEA, which had previously been

identified as having statistically significant heterogeneity and therefore was excluded

from this analysis. From these pairwise comparisons (see Table 4.3) the following

arrangement of regions based on frequencies of individuals with at least one FMO2*1

allele could be discerned:

TU<(YE=NA)<(SEA=WA)
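A pairwise comparison of this kind can be sketched as a two-sided Fisher's Exact test on a 2x2 table of carriers and non-carriers. The implementation below sums the hypergeometric probabilities of all tables that are no more probable than the observed one, which is a common definition of the two-sided test; whether the software used here applied exactly this definition is an assumption. The example pools the North African and Yemeni samples using the genotype counts in Table 4.1.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's Exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


# Carriers vs non-carriers of FMO2*1, pooled from Table 4.1
# North Africa: 18 carriers of 231; Yemen: 10 carriers of 117
p_na_vs_ye = fisher_exact_two_sided(18, 213, 10, 107)
```

The carrier proportions in these two pooled samples (7.8% and 8.5%) are very similar, so the test gives no indication of a difference, in line with the grouping YE=NA above.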

Further examination of populations in CEA, with pairwise Fisher's Exact tests (see Table

4.4), showed the populations in this region to be roughly split into two main groups, one

consisting of north Sudan and the four Amharic populations and the other consisting of

the Anuak and south Sudan. A PCO plot of pairwise FST values for all populations in

Africa (Figure 4.5) showed the Anuak of Gambella and south Sudan to be genetically

close to each other with respect to the g.23238C>T SNP, probably due to them both

Table 4.1: 23238C>T Genotype and Allele frequencies.

Genotype columns give numbers of individuals; ≥1 FMO2*1 = percentage of individuals with at least one FMO2*1 allele.

Country       Location       Cultural identity  *1/*1  *1/*2A  *2A/*2A     n  FMO2*1  FMO2*2A  ≥1 FMO2*1  Latitude  Longitude

West Africa (WA)
Cameroon      Mayo Darle     Various                2      39       78   119   0.181    0.819     34.45%      6.54      11.45
Cameroon      Lake Chad      Various                2      25       49    76   0.191    0.809     35.53%     12.28      14.75
Ghana         Sandema        Bulsa                  3      21       66    90   0.150    0.850     26.67%     10.73      -1.28
Ghana         Navrongo       Kasena                 1       7       37    45   0.100    0.900     17.78%     10.88      -1.09
Nigeria       Calabar        Igbo                   0      22       66    88   0.125    0.875     25.00%      4.96       8.31
Senegal       South          Manj                   3      25       66    94   0.165    0.835     29.79%     12.99     -15.88
Senegal       Dakar          Wolof                  1      24       70    95   0.137    0.863     26.32%     14.69     -17.45

Central East Africa (CEA)
Ethiopia      Gambella       Anuak                  5      47       54   106   0.269    0.731     49.06%      8.25      34.58
Ethiopia      Addis Ababa    Amharic                1       7       16    24   0.188    0.813     33.33%      9.01      38.85
Ethiopia      Borena, Wollo  Amharic                0       7       29    36   0.097    0.903     19.44%     10.75      38.77
Ethiopia      Dessie, Wollo  Amharic                1       6       19    26   0.154    0.846     26.92%     11.23      39.53
Sudan         North          Various                2      37       97   136   0.151    0.849     28.68%     15.21      33.04
Sudan         South          Various               10      43       73   126   0.250    0.750     42.06%     10.85      29.77

South East Africa (SEA)
Malawi        Lilongwe       Various                4      35      105   144   0.149    0.851     27.08%    -13.98      33.77
Malawi        Mangochi       Various                1      16       43    60   0.150    0.850     28.33%    -14.47      35.27
Malawi        Mzuzu          Various                0      17       39    56   0.152    0.848     30.36%    -11.47      34.02
Mozambique    Sena           Sena                   2      22       60    84   0.155    0.845     28.57%    -17.44     35.027
South Africa  Pretoria       Bantu                  2      15       24    41   0.232    0.768     41.46%    -25.75      28.30
Tanzania      Kilimanjaro    Chagga                 2      18       30    50   0.220    0.780     40.00%     -5.38      38.05
Uganda        Ssese          Bantu                  0      10       29    39   0.128    0.872     25.64%     -0.45      32.56
Zimbabwe      Mposi          Shona                  0       7       27    34   0.103    0.897     20.59%    -17.31     31.328

North Africa (NA)
Algeria       Mostaganem     Unspecified            0       5       38    43   0.058    0.942     11.63%     35.94       0.09
Algeria       Port Say       Unspecified            0      10      108   118   0.042    0.958      8.47%     35.08      -2.18
Morocco       Ifrane         Berbers                0       3       67    70   0.021    0.979      4.29%     33.59      -5.17

Turkey (TU)
Turkey        East Anatolia  Anatolian Turks        0       0       31    31   0.000    1.000      0.00%     40.28      33.25
Turkey        West Anatolia  Anatolian Turks        0       0       28    28   0.000    1.000      0.00%     39.68      31.21

Yemen (YE)
Yemen         Sena           Unspecified            0       4       30    34   0.059    0.941     11.76%     15.41      44.24
Yemen         Hadramaut      Unspecified            1       5       77    83   0.042    0.958      7.23%     16.81      49.94

Total                                              43     477     1456  1976   0.142    0.858     26.32%

Figure 4.4: Map showing the percentage of individuals with at least one

FMO2*1 allele in Africa and two nearby countries.

Table 4.2: Pearson's Chi Square Test on individual regions.

Region df Chi square P-value

CEA 5 18.12 0.003*

NA 2 2.15 0.34

SEA 7 7.51 0.37

WA 6 7.34 0.29

Yemen 1 0.64 0.43

NOTE.- df = degrees of freedom. * indicates P-value is less than 0.05.

Table 4.3: Fisher's Exact tests between regions.

WA SEA NA TU

SEA 0.7915 \ \ \

NA 0.0001* 0.0001* \ \

TU 0.0001* 0.0001* 0.0294* \

YE 0.0001* 0.0001* 0.8361 0.0321*

NOTE.- * indicates P-value is less than 0.05.

Table 4.4: Fisher's Exact tests between CEA populations.

                Gambella  Addis Ababa  Borena, Wollo  Dessie, Wollo  Sudan-North
Addis Ababa     0.1812
Borena, Wollo   0.0018*   0.2418
Dessie, Wollo   0.0492*   0.7598       0.5475
Sudan-North     0.0013*   0.6340       0.2982         1.0000
Sudan-South     0.2933    0.5008       0.0180*        0.1883         0.0277*

NOTE.- * indicates P-value is less than 0.05.

possessing slightly elevated FMO2*1 frequencies in comparison with the other African

populations surveyed here. Addis Ababa appears to be somewhat separated from all

populations, but this may be a stochastic effect due to its low sample size. A Pearson's

Chi Squared test comparing the frequencies of individuals with at least one FMO2*1

allele in all populations in sub-Saharan Africa was significant (P<0.003), but removing

only the Anuak and south Sudan populations resulted in non-significance (P=0.526),

emphasising that these two populations are outliers from the overall allele distribution

observed across sub-Saharan Africa. Genetic Boundary analysis on this region also

revealed that, despite their geographical proximity, the Anuak in Gambella are separated

from other Ethiopian groups by a sharp allele frequency gradient (Figure 4.6).

Figure 4.5: PCO plot of 23238C>T-based population FST values.

Figure 4.6: Contour map based on FMO2*1 allele frequencies in Central

East African populations with areas of rapid allele frequency change shown

with blue circles.

A significant correlation between matrices of pairwise genetic distances (FST) and

geographic distances (km) was found using the Mantel test when all populations typed in

this study (P<0.001) and only African populations (P< 0.003) were considered, but not

when only sub-Saharan African populations were analysed (P=0.741). In addition,

autocorrelation indices I and c for sub-Saharan African populations showed no apparent

correlation with geographic distance (see Figure 4.7), confirming the generally similar

distribution of g.23238C>T alleles across sub-Saharan Africa.
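The logic of the Mantel test can be illustrated with a small permutation sketch in Python. It correlates the off-diagonal elements of two distance matrices and builds a null distribution by jointly permuting the rows and columns of one matrix. The one-sided form, the number of permutations, the function names and the toy matrices are all assumptions made for illustration and do not reproduce the software or settings used in this study.

```python
import random
from math import sqrt


def _pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mantel(dist_a, dist_b, permutations=999, seed=0):
    """One-sided Mantel test for a positive correlation between two
    symmetric distance matrices (lists of lists)."""
    n = len(dist_a)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    a = [dist_a[i][j] for i, j in pairs]
    r_obs = _pearson(a, [dist_b[i][j] for i, j in pairs])
    rng = random.Random(seed)
    order = list(range(n))
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        rng.shuffle(order)
        b_perm = [dist_b[order[i]][order[j]] for i, j in pairs]
        if _pearson(a, b_perm) >= r_obs:
            hits += 1
    return r_obs, hits / (permutations + 1)


# Toy example: six populations on a line, genetic distance proportional
# to geographic distance (illustrative values only)
coords = [0, 1, 2, 3, 4, 5]
geo = [[abs(x - y) for y in coords] for x in coords]
gen = [[2.0 * d for d in row] for row in geo]
r, p = mantel(geo, gen)
```

A non-significant result, as found for sub-Saharan African populations alone, corresponds to an observed correlation that is not unusual among the permuted values.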

When samples were grouped by self-declared ethnic identity (they were included as a

separate group if there were 15 samples or more with the same self-declared ethnic

identity (see Table 4.5)), no significant differences were found between the same ethnic

group living in multiple locations (Fisher's Exact, P>0.24) (see Table 4.6), for example,

the Amharic speakers who were sampled in three locations (Pearson's Chi Square,

P=0.47, df=2), or among different ethnic groups collected at the same location (Fisher's

Exact, P >0.09).

4.3.2. Examining FMO2 for evidence of Natural Selection

Typing of the g.23238C>T SNP and many neighbouring SNPs by the International

HapMap project allowed the investigation, using the Long-Range Haplotype test (Sabeti

et al. 2002b), of whether a signal suggestive of positive selection of either allele at this

locus could be detected. The FMO2*1 allele frequency in the YRI dataset is 0.175, which

is similar to that observed in sub-Saharan Africa. In contrast, FMO2*1 was absent in the

CEU and CHB+JPT datasets, consistent with previous studies (Dolphin et al. 1998;

Whetstine et al. 2000).

The method of Voight et al. (2006), which uses a derivative of the EHH statistic,

standardised iHS, allows direct comparisons of SNPs of different frequencies and

provides a measure of haplotype conservation around the target SNP in comparison to the

rest of the genome. The web-based tool Haplotter, which applies the method of Voight et

al. (2006) on HapMap Phase I data, was used to look for evidence of recent positive

selection at the g.23238C>T locus in the YRI dataset.

The standardised iHS for this locus is 0.992, a value which lies in the 84th percentile on a

standard normal curve. This indicates that the increased level of haplotype homozygosity

on the derived T allele (as iHS is positive) is not significantly different (P>0.05, two

tailed test) from that expected from the genome as a whole and therefore provides no

evidence of recent positive selection for either allele.
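The percentile and two-tailed P-value implied by a standardised iHS value can be checked with the standard normal distribution. The short Python sketch below assumes only that standardised iHS is approximately standard normal under neutrality, as in Voight et al. (2006); the function name is hypothetical.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1 + erf(z / sqrt(2)))


ihs = 0.992                                   # value reported by Haplotter
percentile = normal_cdf(ihs)                  # about 0.84
p_two_tailed = 2 * (1 - normal_cdf(abs(ihs)))  # well above 0.05
```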

Figure 4.7: Spatial Autocorrelation Analysis of 23238C>T allele frequency

data using (A) Moran’s I and (B) Geary’s c.

Table 4.5: Table of ethnic identities found in the various populations

examined in this chapter.

Country       Location         Self-declared group    CC   CT   TT     n

Cameroon      Mayo Darle       Fulbe                   0   19   33    52
                               Haousa                  0    8   15    23
                               Mambila                 0    6   13    19
                               Other                   2    6   17    25
                               Total                   2   39   78   119
Cameroon      Lake Chad        Kotoko                  1   12   24    37
                               Other                   1   13   25    39
                               Total                   2   25   49    76
Ghana         Sandema          Bulsa                   3   21   66    90
Ghana         Navrongo         Kasena                  1    7   37    45
Nigeria       Calabar          Igbo                    0   22   66    88
Senegal       South            Manj                    3   25   66    94
Senegal       Dakar            Wolof                   1   24   70    95
Ethiopia      Gambella         Anuak                   5   47   54   106
Ethiopia      Addis Ababa      Amharic                 1    7   16    24
Ethiopia      Borena, Wollo    Amharic                 0    7   29    36
Ethiopia      Dessie, Wollo    Amharic                 1    6   19    26
Sudan         North            Ga'ali                  1   10   33    44
                               Shaigi                  0    4   14    18
                               Other                   1   23   50    74
                               Total                   2   37   97   136
Sudan         South            Dinka                   1   13   28    42
                               Nuer                    0    2   14    16
                               Other                   9   28   31    68
                               Total                  10   43   73   126
Malawi        Mangochi         Yao                     0    7   20    27
                               Chewa                   1    5    9    15
                               Other                   0    4   14    18
                               Total                   1   16   43    60
Malawi        Mzuzu            Tumbuka                 0   13   29    42
                               Other                   0    4   10    14
                               Total                   0   17   39    56
Malawi        Lilongwe         Chewa                   1   17   51    69
                               Yao                     1    7   20    28
                               Tumbuka                 1    3   14    18
                               Other                   1    8   20    29
                               Total                   4   35  105   144
Mozambique    Sena             Sena                    1    2   16    19
                               Tembo                   0    8   11    19
                               Other                   1   12   33    46
                               Total                   2   22   60    84
Tanzania      Kilimanjaro      Chagga                  2   18   30    50
Uganda        Ssese            Bantu                   0   10   29    39
Zimbabwe      Mposi            Shona                   0    7   27    34
South Africa  Pretoria         Bantu                   2   15   24    41
Algeria       Mostaganem       Undeclared              0    5   35    40
                               Other                   0    0    3     3
                               Total                   0    5   38    43
Algeria       Port Say         Undeclared              0    5   77    82
                               Other                   0    5   31    36
                               Total                   0   10  108   118
Morocco       Ifrane           Berbers                 0    3   67    70
Yemen         Yemen-Sena       Other                   0    4   30    34
Yemen         Yemen-Hadramaut  Other                   1    5   77    83
Turkey        East Anatolia    Anatolian Turks         0    0   31    31
Turkey        West Anatolia    Anatolian Turks         0    0   28    28

Note.- Declared identities below 15 per individual label and undeclared identities at

each location have been grouped under the term 'Other'.

Table 4.6: Various Population Pairwise Fisher’s Exact Tests.

Comparison types: same geographic location; same ethnic group; same country.

Cameroon
                         Mayo Darle  Mayo Darle  Mayo Darle
                         Fulbe       Haousa      Mambila
Mayo Darle     Haousa    1.0000      \
Mayo Darle     Mambila   0.7842      1.0000      \
Lake Chad      Kotoko    1.0000      1.0000      1.0000

Ghana
                         Kasena
               Bulsa     0.2895

Senegal
                         Wolof
               Manj      0.6297

Ethiopia
                         Addis Ababa  Borena, Wollo
                         Amharic      Amharic
Borena, Wollo  Amharic   0.2418       \
Dessie, Wollo  Amharic   0.7598       0.5475

Sudan
                         South Sudan  South Sudan  North Sudan
                         Dinka        Nuer         Ga'ali
South Sudan    Nuer      0.1884       \
North Sudan    Ga'ali    0.4786       0.4814       \
North Sudan    Shaigi    0.5417       0.6602       1.0000

Malawi
                         Mangochi  Mzuzu    Lilongwe  Lilongwe
                         Yao       Tumbuka  Chewa     Yao
Mzuzu          Tumbuka   0.7877    \
Lilongwe       Chewa     1.0000    0.6640   \
Lilongwe       Yao       1.0000    1.0000   0.8047    \
Lilongwe       Tumbuka   1.0000    0.5501   1.0000    0.7393

Mozambique
                         Tembo
               Sena      0.0902

This particular analysis was not possible using the CEU or JPT+CHB datasets because

the g.23238C>T SNP is monomorphic in these populations. However, examination of the

whole FMO2 gene, which involves examining the proportion of SNPs in the gene that

have extreme iHS values in comparison to other genes (see Voight et al. (2006)), again

using Haplotter, showed no evidence of selection in any population (P-values: CEU=0.351631,

YRI=0.999955, JPT+CHB=0.99954).

The analysis described above was complemented by an alternative form of the Long-Range

Haplotype test developed as part of this study (see Methods and Materials). This method

uses HapMap Phase II data and, unlike the method of Voight et al. (2006), controls for

haplotype SNP density, which may heavily influence estimated EHH values. Using this

method, no EHH values significantly different from their corresponding null distributions

were found for either allele in the YRI dataset or for the FMO2*2A allele in the CEU and

CHB+JPT datasets (the FMO2*1 allele could not be evaluated because it was absent from

these two datasets), using either genetic or physical distances from the core SNP (Table

4.7), except at 0.2cM upstream of the FMO2*2A allele in the CHB+JPT dataset. This

elevated EHH value could be regarded as

a stochastic effect because (a) there is only significance at one point, therefore not

reaching the criterion of elevated EHH values over a continuous region and (b) it was

detected using a genetic rather than a physical map, which, as discussed in the methods

section, is unreliable for monomorphic data. Therefore, consistent with results using the

method of Voight et al. (2006), no evidence was found that suggests that either the

FMO2*1 allele or the FMO2*2A allele has been favoured by recent positive selection

inside or outside of Africa using the alternative Long-Range-Haplotype method

developed in this thesis.

4.3.3. Analysis of NIEHS FMO2 re-sequencing data

The NIEHS SNP program identified, from whole gene sequencing, 19 FMO2 coding-

region variants among the Panel 2 samples (see Table 4.8), 14 of which were previously

reported (Furnes et al. 2003) in African Americans. Four mutations were synonymous,

nine were non-synonymous, one was found in the 3ʹ untranslated region, two were

insertions (one of which was found in the 3ʹ untranslated region), one was a deletion and

two were premature stop codons (including 23238C>T).

Table 4.7: P-values for EHH values calculated at various (a) genetic and (b) physical

distances from alleles present at the 23238C>T locus in the upstream (-) and downstream

(+) directions in the YRI, CEU and CHB+JPT datasets, with a SNP haplotype density of one

SNP per 0.05cM in (a) and one SNP per 10kb in (b).

Columns, left to right: (a) distance from core SNP (cM) followed by EHH P-values for YRI

(two columns), CEU and CHB+JPT; (b) distance from core SNP (Mb) followed by EHH P-values

for YRI (two columns), CEU and CHB+JPT.

-2 0.202 1.000 0.566 0.763 -2 0.157 1.000 0.100 1.000

-1.9 0.221 1.000 0.658 0.83 -1.9 0.517 1.000 0.564 1.000

-1.8 0.243 0.902 0.756 0.74 -1.8 0.538 1.000 0.578 1.000

-1.7 0.275 0.929 0.837 0.819 -1.7 0.553 1.000 0.594 1.000

-1.6 0.315 0.952 0.890 0.839 -1.6 0.571 0.853 0.614 1.000

-1.5 0.364 0.789 0.930 0.832 -1.5 0.592 0.713 0.627 1.000

-1.4 0.169 0.871 0.820 0.826 -1.4 0.608 0.740 0.646 0.616

-1.3 0.227 0.931 0.870 0.909 -1.3 0.626 0.757 0.669 0.632

-1.2 0.178 0.648 0.780 0.727 -1.2 0.647 0.779 0.690 0.652

-1.1 0.264 0.622 0.870 0.582 -1.1 0.669 0.742 0.720 0.691

-1 0.378 0.612 0.903 0.464 -1 0.690 0.714 0.743 0.661

-0.9 0.297 0.618 0.831 0.489 -0.9 0.561 0.694 0.687 0.637

-0.8 0.148 0.578 0.654 0.242 -0.8 0.595 0.718 0.698 0.614

-0.7 0.273 0.489 0.549 0.353 -0.7 0.555 0.743 0.714 0.645

-0.6 0.349 0.456 0.492 0.289 -0.6 0.590 0.769 0.693 0.707

-0.5 0.282 0.453 0.478 0.338 -0.5 0.560 0.706 0.710 0.670

-0.4 0.155 0.220 0.307 0.159 -0.4 0.603 0.767 0.746 0.736

-0.3 0.064 0.123 0.119 0.095 -0.3 0.589 0.842 0.817 0.824

-0.2 0.189 0.306 0.068 0.023 -0.2 0.494 0.798 0.766 0.820

-0.1 0.537 0.303 0.108 0.139 -0.1 0.490 0.791 0.672 0.738

0.1 0.475 0.400 0.602 0.919 0.1 0.388 0.217 0.526 0.551

0.2 0.155 0.098 0.225 0.554 0.2 0.288 0.079 0.356 0.356

0.3 0.372 0.400 0.290 0.311 0.3 0.232 0.046 0.210 0.237

0.4 0.407 0.397 0.165 0.206 0.4 0.342 0.131 0.255 0.171

0.5 0.282 0.256 0.145 0.161 0.5 0.322 0.138 0.178 0.124

0.6 0.134 0.113 0.131 0.285 0.6 0.244 0.092 0.128 0.075

0.7 0.578 0.266 0.243 0.412 0.7 0.453 0.253 0.262 0.193

0.8 0.457 0.283 0.148 0.577 0.8 0.392 0.279 0.258 0.158

0.9 0.402 0.254 0.127 0.478 0.9 0.365 0.309 0.319 0.223

1 0.773 0.298 0.170 0.535 1 0.345 0.391 0.303 0.295

1.1 1.000 0.225 0.219 0.534 1.1 0.305 0.371 0.356 0.333

1.2 1.000 0.445 0.396 0.680 1.2 0.276 0.354 0.309 0.296

1.3 1.000 0.662 0.545 0.617 1.3 0.296 0.335 0.363 0.272

1.4 1.000 0.587 0.610 0.491 1.4 0.363 0.345 0.342 0.241

1.5 1.000 0.444 0.563 0.371 1.5 0.341 0.315 0.319 0.255

1.6 1.000 0.413 0.789 0.397 1.6 0.319 0.306 0.347 0.315

1.7 1.000 0.300 0.697 0.292 1.7 0.294 0.294 0.323 0.293

1.8 1.000 0.212 0.902 0.254 1.8 0.286 0.272 0.359 0.270

1.9 1.000 0.145 0.860 0.189 1.9 0.263 0.259 0.340 0.243

2 1.000 0.580 0.800 0.262 2 0.157 0.596 0.396 1.000

NOTE.- P-values are the proportion of values in the relevant distribution of 10,000 values (or

1,000 for physical distances) that are equal to or more extreme than the 1414C>T-based value.

Table 4.8: Table showing inferred haplotypes for FMO2 genomic variants from NIEHS sequencing data.

Position

Ethnic Identity of NIEHS samples

Anc type → A T - T + T C G G C A A G G T A C A - AA YR AS EU HI

1 A A 2 3 0 0 0 5

2 A A A 2 1 0 0 0 3

3 A 1 0 0 0 0 1

4 A C A 2 0 0 0 0 2

5 C A G T G 1 0 0 0 0 1

6 C G T G 1 4 1 2 3 11

7 C G G T + 0 0 0 1 0 1

8 A G G T 0 1 0 0 0 1

9 A G T 9 0 22 23 13 67

10 A G T G 0 0 0 2 2 4

11 A G G T 0 0 0 0 1 1

12 A G G T + 0 0 0 1 0 1

13 A T G G T 0 1 0 0 0 1

14 G T G 1 0 0 0 0 1

15 G G T + 0 0 0 1 0 1

16 T G G T 0 0 0 1 0 1

17 T G T 2 3 0 7 4 16

18 T G T + 0 4 0 0 0 4

19 T G T G 0 0 5 3 6 14

20 T G G T + 1 0 12 0 9 22

21 + C - T G G T 6 5 0 0 1 12

22 + C - T G T 2 0 0 0 0 2

23 + C - T G T G 0 0 0 1 0 1

24 G A T G T 0 1 0 0 2 3

25 G T G T 0 1 8 1 1 11

26 G T G G T + 0 0 0 1 2 3

Total 30 24 48 44 44 190

NOTE.- Anc Type = Ancestral type from Chimpanzee and Macaque, FS = Frame shift mutation, UTR = Untranslated Region mutation, AA = African

American, YR = Yoruban, AS = Asian, EU = European, HI = Hispanic. * indicates that variant was found by Furnes et al. (2003). g.23238C>T SNP is shown

in bold type

After haplotype inference of the 12 NIEHS Yoruba individuals (24 chromosomes), four

chromosomes were shown to possess the 23238C allele (see Table 4.8). Three of these

chromosomes had identical haplotypes (haplotype 1), with two synonymous changes

(g.13733G>A, g.22027G>A) in comparison with an ancestral reference sequence

(elucidated from chimpanzee and macaque data), one of which was found only on a

23238C background (g.22027G>A). The fourth chromosome had an additional, non-

synonymous, mutation (g.18237G>A (R238Q)) that was only found on a 23238C

background (haplotype 2).

Addition of the 15 phased NIEHS African-American samples (30 chromosomes) showed

that a further seven chromosomes possessed the 23238C SNP. Six of the seven had the

two synonymous mutations while the other lacked the g.13733G>A variant (haplotype 3).

The R238Q variant was also found in two 23238C African-American individuals while an

additional non-synonymous mutation (g.19910G>C (R391T)) was found in a further two

23238C chromosomes (haplotype 4).

The 23238T-possessing chromosomes found in the Yoruba, African-American, European,

Hispanic and Asian NIEHS samples possessed a number of variants including non-

synonymous and synonymous mutations as well as insertions and deletions, often in

combination. For example, the g.7702_7703insGAC insertion is found on the same

background as a deletion (g.10951delG), a stop codon (g.23238T) and two non-

synonymous mutations (g.7731T>C (F81S) and g.13732C>T (S195L)) (n = 15,

haplotypes 21, 22 and 23).

Utilizing phased FMO2 genomic data for the Yoruba NIEHS samples produced an

estimate of the time of occurrence of the 23238C>T mutation of 502,404 years before

present (lower boundary: 2 * 4889 * 0.816 * 19.4 = 154,790 years before present, upper

boundary: 2 * 8751 * 1.648 * 36.1 = 1,041,243 years before present), using the

coalescent-based method described by Griffiths and Marjoram (1996).
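
The boundary figures quoted above can be reproduced arithmetically. The sketch below simply multiplies the four factors given in the text. Reading them as a factor of 2, an effective population size, a coalescent time scaled in units of 2Ne generations and a generation length in years is an assumption made here for illustration, and the function and parameter names are hypothetical.

```python
def years_before_present(effective_size: float,
                         scaled_time: float,
                         generation_years: float) -> float:
    """Convert a scaled coalescent time into years.

    Assumption: time is scaled in units of 2 * Ne generations, so that
    years = 2 * Ne * scaled time * years per generation.
    """
    return 2 * effective_size * scaled_time * generation_years


# Factors exactly as quoted in the text for the two boundaries.
lower_bound = years_before_present(4889, 0.816, 19.4)
upper_bound = years_before_present(8751, 1.648, 36.1)

# Gap between the lower boundary and a 65,000-year out-of-Africa date.
gap_to_exodus = lower_bound - 65_000

print(round(lower_bound), round(upper_bound), round(gap_to_exodus))
```

Rounded to the nearest year, the two products agree with the 154,790 and 1,041,243 years quoted above, and the lower boundary falls roughly 90,000 years before a dispersal dated to 65,000 years ago.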

4.4. Discussion

4.4.1. Functional FMO2 is found at high frequency throughout sub-Saharan Africa

The g.23238C>T SNP allele distribution reported in this study is consistent with the

expectation based on the proportion of FMO2*1 in African-American and Hispanic

individuals. The ancestral allele of g.23238C>T is present at even higher frequencies in

most sub-Saharan populations than in the admixed populations of the Americas, with

approximately one third of individuals throughout the sub-continent possessing this

variant.

The results in this chapter suggest that frequencies of g.23238C>T alleles are fairly

similar throughout most of sub-Saharan Africa. However, there are two groupings, the

Anuak and south Sudan, which display significantly higher frequencies of the ancestral

allele than was found elsewhere in this survey. In Ethiopia the Anuak from Gambella

display a marked difference in g.23238C>T allele frequency compared with all three

Amharic Ethiopian groups. The distribution of the g.23238C>T polymorphism in the

population from southern Sudan is also significantly different from that in the northern

Sudanese. If these two populations were not included in this survey, CEA would have

been similar to both WA and SEA, emphasising the overall similarity throughout sub-

Saharan Africa. It should also be noted that the Anuak in Ethiopia are thought to be an

immigrant population associated with a larger group of Anuak, who reside in south-

eastern Sudan (personal correspondence Ambaye Ogato). This may go some way to

explaining the similar allele frequencies observed in the southern Sudanese group and the

Anuak.

The data presented here are somewhat similar to the observed distribution of Y-

chromosome variation in Africa, with a great deal of similarity among Niger-Congo

speaking populations, a large part of which is likely to be a consequence of the expansion

of the Bantu-speaking peoples, and more genetic differentiation among populations

speaking the tongues of other language families, such as Afro-Asiatic and Nilo-Saharan

(Wood et al. 2005).

The substantial difference in FMO2*1 allele frequencies between northern-African and

sub-Saharan African populations is consistent with other genetic studies, using classical

markers, and more recent studies, using the non-recombining portion of the Y-

chromosome (NRY) (Cruciani et al. 2002) and mitochondrial DNA (Salas et al. 2002)

data, which show large genetic differences between the two regions, with the Sahara

desert acting as a major barrier to gene flow. The presence of the FMO2*1 allele at a low

frequency in the Maghreb as well as in the Yemen could be due to the transfer of

indigenous sub-Saharan Africans to northern Africa and the Arabian Peninsula in the

course of the Arab slave trade during the 8th to 19th centuries (Fisher 2001; Richards et al.

2003). The absence of FMO2*1 from the Turkish datasets is in agreement with previous

work, which has shown that the FMO2*1 allele is not present in populations that are not

of recent-African descent (Dolphin et al. 1998; Whetstine et al. 2000).

Although the dataset used in this study was sufficient to explore the general distribution

of the g.23238C>T variant across Africa, more localised sampling will be needed to

answer other potentially important questions. For example, despite the absence of many

statistically significant inter-group differences among the sub-Saharan populations typed

in this study, the range in frequency of individuals possessing at least one FMO2*1 allele

was wide, at 31.3% (17.8-49.1%). If the FMO2*1 variant is shown to be of medical

importance then fine-scale surveys involving greater numbers of subjects will be needed

to identify local groups with particularly high frequencies.

4.4.2. The possible consequences of FMO2 functionality in Africans

Given the observed similarity in the distribution of the g.23238C>T polymorphism across

sub-Saharan Africa, it is possible to extrapolate from the data reported here to estimate

the number of people in sub-Saharan Africa as a whole who have at least one FMO2*1

allele. Based on a study of Hispanic-Americans of Puerto-Rican and Mexican origin

(Krueger et al. 2005), which found that three mutations known to decrease enzyme

function segregated with the truncation mutation, it is currently reasonable to assume that

the FMO2*1 allele found in Africans results in a fully functional FMO2 enzyme

(however, other, unidentified, mutations may render the FMO2 enzyme less catalytically

active or even completely inactive). Given that the total population of sub-Saharan Africa

is 726 million (725,800,000 – 2004 World Bank estimate [http://www.worldbank.org]),

226 million individuals may possess at least one allele that encodes functional FMO2.
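
The extrapolation above amounts to multiplying a population total by a carrier proportion. The sketch below back-calculates the proportion implied by the two figures quoted in the text and, purely for illustration, applies the extremes of the observed range (17.8-49.1%) to the same population total. Applying single-population extremes to the whole region is an assumption made here, not an estimate from this study, and all names are hypothetical.

```python
SUB_SAHARAN_POPULATION = 725_800_000  # 2004 World Bank estimate quoted in the text
QUOTED_CARRIERS = 226_000_000         # individuals with at least one FMO2*1 allele


def expected_carriers(population: int, carrier_fraction: float) -> float:
    """Expected number of individuals carrying at least one copy of the allele."""
    return population * carrier_fraction


# Carrier proportion implied by the two quoted figures.
implied_fraction = QUOTED_CARRIERS / SUB_SAHARAN_POPULATION

# Illustrative only: extremes of the observed inter-population range.
low_estimate = expected_carriers(SUB_SAHARAN_POPULATION, 0.178)
high_estimate = expected_carriers(SUB_SAHARAN_POPULATION, 0.491)

print(f"implied carrier fraction: {implied_fraction:.3f}")
print(f"illustrative range: {low_estimate / 1e6:.0f}-{high_estimate / 1e6:.0f} million")
```

The width of the illustrative range shows how sensitive such an extrapolation is to the carrier proportion assumed, which is one reason fine-scale surveys are needed.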

Sequence data from the NIEHS SNP programme for Yoruba and African-American

samples support the suggestion that the FMO2*1 allele results in functionally active

FMO2. While no statistical support is offered, because of uncertainty in regard to certain

aspects of the NIEHS data (i.e., there is possible error in haplotype inference because of

the presence of very rare variants and there are large regions where successful sequencing

coverage in all samples has not been achieved), it would appear that a large majority of

variants that may affect the functional activity of the enzyme lie on an FMO2*2A

background. This suggests that chromosomes possessing this allele are in mutational free

fall (the evolutionary pressure to conserve sequence identity has been relaxed) because of

the loss of function caused by the g.23238C>T mutation, whereas chromosomes with

FMO2*1 may have been evolutionarily conserved as they still retain enzymatic activity.

However, given the small size of the NIEHS Yoruba dataset (n=12 individuals, only four of

whose 24 chromosomes carry g.23238C) it is necessary to be cautious in drawing conclusions about FMO2

activity in Africa from these data alone. With such a considerable number of individuals

potentially at risk of thiourea toxicity, however, the effect of FMO2 expression in humans

on the metabolism of this family of chemicals (as well as of other chemical families that

may also act as substrates of FMO2) requires further investigation. If the action of the

enzyme is shown to be detrimental then the risk of future exposure to offending substrates

will need to be considered very carefully by regulatory authorities.

Drugs that are primarily metabolised by FMOs may, in general, have certain advantages

over those metabolised by cytochrome P450 enzymes (CYPs), because FMOs are not as

readily inhibited or induced, thus reducing the risk of drug-drug interactions (Cashman

2005). If, however, FMO2 is involved in the metabolic pathway of drugs used to treat

common diseases in Africa and if products of enzymatic activity have a toxic effect then

great caution should be applied in the distribution and use of such drugs. Given the large

numbers potentially at risk it is important that the activity of the enzyme encoded by the

FMO2*1 allele in African populations is established as quickly as possible, not least

because of the current widespread use of ETA in the treatment of tuberculosis.

Knowledge of local allele frequencies of important drug-metabolizing enzyme variants

that are easy to type, such as g.23238C>T, could prove useful in predicting drug response

in Africa. This is because a) compiling individual profiles (Johnson 2003; Weinshilboum

2003; Evans 2003) of the activity of drug-metabolizing enzymes is unlikely to be feasible

for the foreseeable future, due to economic constraints and a lack of appropriate

infrastructure, and b) variation in individual drug response may well be geographically

and ethnically structured (Wilson et al. 2001). In addition, small, isolated populations in

which genetic drift may lead to significant changes in allele frequencies, as may have

been observed in the Anuak, could well benefit from the collection of such data. It may

also be prudent, as genetic characterisation becomes more common in the developed

world, for individuals with a significant sub-Saharan African ancestry to be typed for the

g.23238C>T SNP.

4.4.3. The Evolution of FMO2

The Long-Range Haplotype test revealed no evidence for positive selection on either

allele at the g.23238C>T SNP in any of three HapMap populations (YRI, CEU and

CHB+JPT), so the high frequency of the derived FMO2*2A allele cannot readily be

explained by it having a recent selective advantage. As a consequence of this and the

presence of FMO2*1 throughout sub-Saharan Africa at roughly similar frequencies it is

suggested that the most likely explanation for the absence of the FMO2*1 allele

outside Africa is that it was lost in a bottleneck when anatomically modern humans

migrated out of Africa sometime after 65,000 years ago (see Mellars 2006 and Chapter 1)

and that therefore the g.23238C>T SNP must have a sub-Saharan African origin prior to

this event. However, Sabeti et al. (2002b) have indicated that the EHH statistic is unable

to detect positive selection that occurred more than 30,000 years ago, so the possibility

cannot be dismissed that a strong selective pressure existed before this date and

resulted in the increase in FMO2*2A allele frequency and the complete loss of the

FMO2*1 allele outside of Africa. Another explanation is that selection acted only on the

populations migrating out of Africa, and since the allele went to fixation the signal is not

visible via the iHS test. However, under that scenario extended LD would be expected

around the FMO2 gene in non-African HapMap populations, which is not found.

Interestingly there is evidence that selection has acted on one member of the FMO family,

FMO3, which has been the subject of balancing selection (Allerston et al. 2007).

Dating of when the g.23238C>T SNP arose, through the use of NIEHS sequencing data,

appears, notwithstanding the need to apply somewhat crude assumptions, to support the

ancient origin of this SNP with a time of 502,404 years before present, well before any

estimates of the first exodus of modern humans from Africa into the rest of the world.

Even the lower boundary yields a time some 90,000 years before this event. It will be

interesting to observe the frequency of FMO2*1 in isolated traditional hunter-gatherer

groups such as the Khoisan (which, like any pygmy populations, were not available as

they are currently not part of the TCGA African DNA sample database) that are thought

to be one of the earliest diverging human populations. Similarly, analysis of linked

microsatellites may prove useful in understanding more about the mechanism of its

dispersal.

4.5. Conclusion

The peoples of sub-Saharan Africa demonstrate one of the fastest population growth rates

in the world, while the region itself is widely accepted as the place of origin of

anatomically modern humans. However, in comparison with other regions, studies

investigating the distribution of human genetic variation at the molecular level have been

sparse. Those that have been performed have often been limited in scope to a few

populations and small sample sizes. This study has contributed to redressing this

imbalance and shown that a gene previously considered to be of little interest, but now

thought to encode an enzymatic variant that may be important in human healthcare, is

present at relatively high frequencies in multiple populations throughout the continent.

Surveys such as this are not only of benefit to the indigenous populations of Africa, but

are also of increasing importance in the planning of healthcare in the developed world,

where the number of individuals of recent African descent is growing and, in some areas,

such as the Americas and Europe, is already substantial. Sub-Saharan Africa is thought to

possess more human genetic diversity than the rest of the world combined. However, it is

not yet clear how this diversity is distributed and indeed what part of that diversity is not

present outside the continent. It is obvious that variation not recognised cannot be studied

in vivo. Paucity of such knowledge can lead to inappropriate therapeutic, prophylactic

and diagnostic intervention and increase the risk of an adverse drug reaction. There is a

need for more studies on human genetic diversity in Africa; research from which all

people of recent African descent, wherever living, should benefit.

Chapter 5:

Conclusion

5. Conclusion

This chapter discusses implications of the findings from the three case studies described

in this thesis for genetic studies to elucidate a) local histories, b) the structure and extent

of genetic diversity in the presence of cultural diversity and c) potentially medically

relevant variation in sub-Saharan Africa. It also describes how the methodologies used

can be developed to increase their utility. Finally, further research relating both to the

questions addressed in this thesis and to related matters is suggested.

It is clear from the three case studies that even rather rudimentary molecular techniques

and relatively conventional statistical analysis applied at a fine-scale level of

discrimination can produce results that are of interest over a wide range of disciplines. In

each case the question addressed should be well defined with expectations or

hypotheses clearly stated. Sampling strategies and methods must be carefully planned.

Samples collected or selected for each study must be appropriate for testing prior

hypotheses, of sufficient number and with a known provenance. Studies that do not

meet these criteria are of limited value.

5.1. Implications for investigating human history and behaviour

The majority of studies investigating genetic diversity in sub-Saharan Africa have

covered a large geographic area and utilised samples already available. Sometimes

samples from multiple ethnic groups are pooled with little justification (for example see

Watson et al. (1996) and Hammer et al. (1997)) while phylogeographical approaches that

seek to fit genetic data to known demographic or prospective selective events are applied

in a somewhat ad hoc manner (for example see Underhill et al. (2001) and Salas et al.

(2002)). At best these approaches offer starting points for future investigation since a

single genetic outcome can usually be explained by multiple demographic scenarios, and

multiple genetic outcomes can result from the same demographic scenario as a result of

evolutionary variance (i.e., drift effects).

Academics in the social sciences such as anthropology and linguistics have previously

appealed for fine-scale studies in sub-Saharan Africa with dense sampling strategies

(MacEachern 2000) and Chapters 2 and 3 have demonstrated the value of such research.

The survey of FMO2 variation, while suited to addressing the questions posed in the

study, is of limited use in elucidating issues relating to human history and language

evolution. Though one plausible demographic explanation for the present distribution of

FMO2*1 has been suggested, many other scenarios are also possible.

Sample selection should follow the formulation of testable hypotheses. To ensure that

sample collection is appropriate geneticists should collaborate closely with linguists,

anthropologists, historians (including local historians) and archaeologists, all of whom

can contribute to understanding the complex processes and events that may have

occurred, or are still in progress. The very precise expectations formulated with respect to

royal social status inheritance described in Chapter 2 illustrate the advantages of this

approach.

The structure of language trees can be the subject of fierce debate among linguists and,

often, these differences of opinion are insufficiently understood by geneticists when

seeking correlations between 'genetics' and 'language' (see criticism from outside the

genetics community e.g. O'Grady et al. (1989), Bolnick et al. (2004), Campbell et al.

(2006)). Some studies have sought to account for this uncertainty by varying branch

lengths between languages but still fail to take into account different interpretations of the

underlying shape of the language tree. To address these issues in the study described in

Chapter 3, the author of this thesis worked closely with Dr Bruce Connell, a linguist

specialising in South Nigerian languages. Such cross-disciplinary collaboration is

particularly necessary in the formulation of the questions to be addressed when particular

aspects of language practices of the ethnic groups being studied might easily be

overlooked by a non-specialist.

It is interesting to note in Chapter 2 the congruence of oral history with the genetic data in

regard to the ethnogenesis of the Nso΄ and in Chapter 3 the lack of congruence in regard

to the origins of the Efik Uwanse. Traditional historians have tended to be rather sceptical

about the utility of oral histories (Blench 2006) even though they can be potentially

valuable sources for understanding the past, especially in sub-Saharan Africa where

written records are somewhat recent (Ki-Zerbo 1989). At the source of this difficulty is

deciding which oral histories, or parts of an oral history, record real events and which do not.

Chapters 2 and 3 have shown the potential of genetic studies to provide supporting

evidence for one or more alternative accounts. This finding should be of particular

interest to anthropologists and local historians. They also show the value of appropriate

DNA sampling methodologies and the necessity in analysis to take full account of the

sampling strategy adopted. The Nso΄ study made use of the characteristics of

agriculturists and hunter-gatherers, both of which appear to have left distinct genetic

signatures (at least for the NRY) that can be detected across sub-Saharan Africa

(Underhill et al. 2001), and the group's well defined hierarchical social system. Oral

histories incorporating both of these tendencies are likely to be particularly amenable to

in-depth genetic investigation.

Within the discipline of genetic history (the elucidation of past events through the

interpretations of genetic data) the sex-specific systems analysed in this thesis have

proved particularly useful. Though studies using NRY and mtDNA have been popular

over the past 10-15 years because of their relative ease of characterisation, recent

advances in haplotype inference (Niu 2004; Li et al. 2005; see Browning & Browning

2007) and sequencing technologies (Mitnik et al. 2001; see Mitchelson 2003) have

increased the availability of useable autosomal data. Nevertheless, in appropriate

circumstances the increased susceptibility to drift of NRY and mtDNA combined with

sex-specific demographic events recorded in many accounts of local histories, and

cultural evolution, ensure that NRY and mtDNA are frequently the genetic systems of

choice. For example, autosomal markers are unlikely to have been particularly useful in

elucidating the history of the Nso΄.

However this is not to suggest that analysis of autosomal data will not be of any use.

Much of the sex-specific genetic variation in sub-Saharan Africa is likely to have been

shaped by the expansion of the Bantu-speaking peoples. As the autosomes a) are less

prone to genetic drift because their effective population size is four-fold that of the NRY and mtDNA (ignoring the

effects of reproductive variance (see Jobling, Hurles & Tyler-Smith 2004 page 134 Box

5.1)) and b) possess more genetic material to analyse, evidence of demographic events and

origins may be preserved. New large scale sequencing technologies (Schuster 2008) such

as 454 (Margulies et al. 2005), Solexa (Bentley 2006) and SOLiD (Shendure et al. 2005)

sequencing and development of more realistic (and presumably complex) models of

human evolution combined with developments in analysis of large datasets should enable

parts at least of this archive to be interpreted.
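
The four-fold difference in effective population size mentioned above can be illustrated with the standard textbook expressions for an idealised population. The sketch below assumes random mating and Poisson-distributed reproductive success, so it ignores exactly the reproductive variance that the text sets aside; the function name and the example numbers of breeding males and females are hypothetical.

```python
from typing import Dict


def effective_sizes(breeding_males: float, breeding_females: float) -> Dict[str, float]:
    """Idealised effective population sizes for four genetic systems.

    Standard expressions for a randomly mating population with Poisson
    reproductive variance (an assumption; real populations deviate).
    """
    m, f = breeding_males, breeding_females
    return {
        "autosomes": 4 * m * f / (m + f),
        "X chromosome": 9 * m * f / (4 * m + 2 * f),
        "NRY": m / 2,
        "mtDNA": f / 2,
    }


# Hypothetical population with equal numbers of breeding males and females.
sizes = effective_sizes(500, 500)
for system, size in sizes.items():
    print(f"{system}: Ne = {size:.0f}")
```

With an equal sex ratio the autosomal, X-chromosome, NRY and mtDNA values stand in the ratio 4:3:1:1. Unequal numbers of breeding males and females, or high variance in male reproductive success, shift these ratios.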

5.2. Implications for investigating medically relevant genetic variation

It is obvious that finding genetic variation in sub-Saharan Africa that is absent elsewhere

should be of potential medical benefit. However it is the approach to achieving this

objective that is of most interest in this discussion. Ideally one would collect large

samples from every ethnic group and sequence entire genomes. However this is currently

impractical. Are there approaches that quickly and cheaply identify important variants of

relatively immediate and widespread practical relevance given the economic constraints

of working in sub-Saharan Africa? Chapter 4 illustrates one good, albeit simple, approach when

seeking pharmacogenetically relevant variants.

The immediate objective is to identify potentially important variants i.e. genetic variation

of therapeutic, diagnostic or prophylactic importance present at significant frequencies in

one or more ethnic or geographic groupings ('significant frequency' in this situation is to

be assessed in the context of medical cost/benefit assessments, which can vary from

group to group). Meeting this criterion should ensure that knowledge of the variation can

be used to benefit peoples of sub-Saharan Africa. One target is genetic variation in genes

coding for drug metabolising enzymes (especially those involved in the metabolism of

drugs used to treat diseases prevalent in sub-Saharan Africa). Often there will be reports

of their existence in African Americans (e.g. Whetstine et al. 2000; Hirunsatit et al. 2007;

Gong et al. 2007).

The distribution of genetic variants can then be assessed in sample sets of populations

across sub-Saharan Africa as in Chapter 4. This should determine geographic structuring

and the likelihood of local variation at a continent wide level. In particular, based on

existing genotype/phenotype association studies, the significance of variation can be

assessed and the likely number of individuals affected determined. Establishing the above

would enable researchers to more efficiently focus on variation that is likely to be of

benefit to the greatest number of individuals possible, as seen in Chapter 4 with the

finding that the 23238C allele is likely present in over 200 million sub-Saharan Africans, a

variant that may possibly have a substantial effect on how these individuals respond to

treatment for tuberculosis.

The next step is to determine whether a variant of potential functional significance based

on observations made outside Africa has the same functional association within African

populations. This might not be so since, inter alia, redundancy within drug metabolising

enzyme systems might prevent the expression of a phenotypic effect. This determination will be

achieved by functional expression studies, focusing especially on the effects of the variant

on the metabolism of drugs used to treat diseases prevalent in sub-Saharan Africa. Such

work requires close collaboration between genetics and biochemistry laboratories.

Having established that a genetic variant has a sufficiently important functional

consequence within sub-Saharan African populations, information on its consequences

and distribution should be provided to health care providers and an economic cost-benefit

analysis undertaken on which to base future policy.

In the absence of individualised profiling it is anticipated that first choice therapeutic,

diagnostic and prophylactic intervention will be based on information about geographical

and inter-ethnic group distributions of variation (Tishkoff & Kidd 2004; Vizirianakis

2004; Reinbold 2007). Because characterisation of each and every group is impractical

and even though genetic drift shaped by demographic history may cause considerable

variation in small isolated populations, knowledge of genetic variation at a higher but still

local geographic scale combined with knowledge of relationships informed by

anthropology and linguistics may permit useful predictions of the pharmacogenetic

profiles of uncharacterised groups within a region. For example Chapter 3 showed

differences among three neighbouring regions in West Central Africa as a result of

differential gene flow. A better understanding of the factors that have caused these

differences and development of appropriate models of relationships between genetic

variation and demographic history could aid in the prediction of pharmacogenetic profiles

of individuals in these regions. Relatively small scale sampling and typing of variants

combined with the knowledge of population sizes, social structures and practice might

make important contributions to the improvement of efficacy and the reduction of adverse

events in healthcare. Fundamental to this approach is the greater use of fine-scale surveys

to generate data which can be used to construct more sensitive models.

Of course not all, or perhaps even most, variation in drug efficacy and safety is due to

genetic variation. Environmental influences, drug-drug interactions and poor compliance

with medical advice all make a contribution to therapeutic outcomes. Nevertheless

medical intervention based on a better prediction of genetic control and metabolic

pathways has the potential to benefit people in sub-Saharan Africa, a region in which

medically relevant genetic variation is likely to be greater than elsewhere in the world.

What is more, there could be benefits in the relatively near future, while the notion of

individualised pharmacogenetic targeting is unlikely to have applications in sub-Saharan

Africa in the foreseeable future. Technology is approaching the point when, in the near

future, entire genomes will be routinely sequenced in a matter of days or even hours.

Theoretical methods that can handle such large masses of data will also have to be

developed but the greatest challenge, in Africa, may be the economic cost of

implementation. If knowledge of human genetic variation is to be harnessed in the pursuit

of better healthcare investigators will need access to DNA biobanks with well

provenanced samples. Given the current poor infrastructure to support such collections

(Tishkoff & Williams 2002) it is important that when investigators do have the

opportunity to collect samples they work with anthropologists and other social scientists

to select appropriate targets.

5.3. Future Work

Each of the three case studies described in this thesis has been performed within the

time frame and using the resources available. However each study has also revealed scope

for additional research, both to evaluate more thoroughly the findings of Chapters 2-4 and

also to gain insight into aspects not addressed in the projects. Below is a description of

potential further work not discussed in the case studies themselves that might be

undertaken.

5.3.1. Future work derived from Chapter 2 (Sex-Specific Genetic Data Support One Of

Two Alternative Versions Of The Foundation Of The Ruling Dynasty Of The Nso΄ In

Cameroon)

The potential to infer the history of the Nso΄ from genetic data depended heavily on the

ability to determine whether sex-specific genetic profiles in the won nto´ and duy

conformed to prior expectations given previously reported alternative rules concerning

inheritance of royal status (Royal Social Status Rules A and B). These expectations are

based on a range of assumptions, including that all fons ('n') and all won nto´ and duy

('y') have in each case an equal number of offspring. Combinations of 'n' and 'y' yield a

range of probabilities for the proportion of fon NRY types present in the won nto´ and

duy. This may be considered by some a somewhat unsatisfactory approach. Rather than

generating discrete probabilities an alternative approach could be to generate estimates of

proportions by simulating won nto´ and duy genealogies from an initial single fon under a

given set of rules in silico. This would permit variation in reproductive success among

individuals. Such an approach could allow for the effects of same social class marriages

and acquisition of duy status other than by descent from won nto´. At the end of Chapter 2

it was suggested that the won nto´ descent system may be evolving and further

anthropological work to investigate the possibility that a somewhat more patrilineal

system is emerging was proposed. The effect of this possibility and others could be

incorporated into such simulations. In addition given that Nso΄ fons are known to have

many more children than other won nto´ it is possible to explore associations of higher

male status and reproductive success.
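
A minimal sketch of the kind of in silico genealogy simulation suggested above is given below. It is not an implementation of Royal Social Status Rules A and B, which are defined in Chapter 2. Instead it uses a deliberately simple placeholder rule, and every name, parameter value and assumption in it is hypothetical: descent starts from a single fon, the number of children per individual is Poisson distributed, spouses are assumed to come from outside the royal lineage, and the switch `through_females` decides whether royal descent is also counted through daughters.

```python
import math
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Person:
    male: bool
    fon_nry: bool  # carries the founding fon's NRY type (only possible for males)
    depth: int     # generations removed from the founding fon


def poisson(rng: random.Random, mean: float) -> int:
    """Draw a Poisson-distributed count (Knuth's multiplication method)."""
    limit = math.exp(-mean)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= limit:
            return count
        count += 1


def simulate(generations: int, mean_children: float,
             through_females: bool, seed: int) -> Dict[int, List[Person]]:
    """Simulate descendants of a single fon, grouped by generational depth."""
    rng = random.Random(seed)
    current = [Person(male=True, fon_nry=True, depth=0)]
    by_depth: Dict[int, List[Person]] = {}
    for depth in range(1, generations + 1):
        children: List[Person] = []
        for parent in current:
            # Placeholder rule: daughters pass on royal descent only if allowed.
            if not parent.male and not through_females:
                continue
            for _ in range(poisson(rng, mean_children)):
                male = rng.random() < 0.5
                # The NRY is inherited from the father only; husbands of
                # royal women are assumed not to carry the fon type.
                fon_nry = male and parent.male and parent.fon_nry
                children.append(Person(male, fon_nry, depth))
        by_depth[depth] = children
        current = children
    return by_depth


def fon_nry_proportion(people: List[Person]) -> Optional[float]:
    """Proportion of males carrying the fon NRY type, or None if there are no males."""
    males = [p for p in people if p.male]
    if not males:
        return None
    return sum(p.fon_nry for p in males) / len(males)


for depth, people in simulate(6, 3.0, True, seed=1).items():
    print(depth, len(people), fon_nry_proportion(people))
```

Repeating such runs over many seeds would give a distribution of fon NRY proportions at each generational depth rather than a discrete probability. Mapping depths onto the won nto´ and duy classes, allowing same-class marriages and giving fons a higher mean number of children would all require the actual rules and parameter values, which are not encoded here.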

Another aspect in which this case study can be refined is in the dating of the

Y*(xBR,A3b2) clade in the won nto´ and duy. As stated in Chapter 2 the point estimate

was non-informative because the Y*(xBR,A3b2) clade in the won nto´ and duy was

homogeneous, meaning that the associated upper confidence limit for the most recent

common ancestor is zero years, which is obviously nonsense. It should be possible to

tighten confidence intervals by typing of further NRY microsatellites in Y*(xBR,A3b2)

individuals, collecting a new larger sample set or both.

5.3.2. Future work derived from Chapter 3 (It All Depends On The Scale: Little Sex-

Specific Genetic Variation In The Presence Of Substantial Language Variation In Peoples

Of The Cross River Region Of Nigeria Assessed Within The Wider Context Of West

Africa)

Given the high level of homogeneity of sex-specific systems observed in the peoples of

the Cross River region it would be interesting to see whether the populations can be

discriminated using additional markers (NRY) and sequences (mtDNA) and, if so, at what

point such discrimination becomes possible. For the NRY, haplogroup E3a, because of its

high frequency, is the primary candidate for further resolution. According to the

nomenclature of the Y-chromosome Consortium (2002) E3a can be further characterised

at one additional level of UEP markers (Haplogroups E3a*, E3a1-E3a6). The results from

NRY microsatellite analysis suggest that there may not be observable genetic structure

even at this level of genealogical resolution. However extra UEP typing can on occasion

reveal fine-scale population structure that microsatellites cannot and this is a particularly

plausible scenario in Chapter 3 where the number of microsatellites typed (six) is quite

low. It would therefore be appropriate to type a subset of the Cross River samples to

assess whether or not further typing would be informative. In addition some further

insight concerning relationships among groups might be generated by more detailed

characterisation of haplogroup BR*(xDE,JR) samples (Haplogroup B is Africa-specific

while Haplogroup R is mostly found in Eurasia (Underhill et al. 2001)). Ultimately, any

study that considers the genealogical relationships of a non-recombining system will be

biased by the specific lineage delineators typed. While the UEP markers used here were

not chosen with any particular global region in mind, biases in their ascertainment will

obviously affect the conclusions drawn from the genetic studies described in this thesis

(see Jobling & Tyler-Smith 2003; Wilder et al. 2004).

Also, with the reduction in cost of whole mtDNA typing, it is now practical to envisage

complete sequencing of a subset of samples from Cross River groups, which would allow

more accurate and reliable definition of mtDNA haplogroups. The assumption that

genetic drift would be the major evolutionary force in causing genetic structuring within

the Cross River region and that the effect of novel mutations would be negligible is a

reasonable one given the time periods during which the various languages separated.

However it is possible that signature mtDNA haplogroups may exist that lend clues to

more ancient demographic features of the Cross River region.

The findings of Chapter 3 suggest many questions that would require further sampling to

address. For example though samples from speakers of six of the most prominent

languages of the Cross River region were analysed there are numerous other groups

speaking their own languages that have not been sampled. It would be interesting to see if

groups speaking less common languages (such as Efai or Ibuno, each of which has

fewer than 10,000 speakers (Ethnologue 2005)) have experienced similar levels of

male and female mediated gene flow as the larger groups. More data on mating patterns

would also be interesting in these cases since it may be that in a linguistically diverse

region smaller populations must avoid inter-language unions if their language is to

survive.

The addition of the Igboland groups to the study indicated that genetic differentiation may

be greater outside the Cross River region. Further characterisation of populations at

various distances from the region may establish if this is the case. The two Igboland

groups did indicate some male-specific inter-group differentiation. This leads to two

further questions: a) can different populations in Igboland be differentiated and if so by

what criteria, e.g. geography or dialect? (Given the Igbo's prominent role in Nigeria

(there are almost 20 million Igbo speakers (Ethnologue 2005)) and their diverse range of

dialects a detailed study is clearly appropriate) and b) is the level of sex-specific genetic

homogeneity observed in the Cross River region common in South East Nigeria (Carrying

out a similar study to that pursued in Chapter 3 in other regions, especially ones likely to

be less influenced by the slave trade, would be informative)?

Comparison of the peoples of the Cross River region with other West Central African

populations discriminated between these and Ghanaian and Cameroon Grassfields

populations. At the same time it was shown that these two other regions also

demonstrated very different patterns of sex-specific genetic diversity. The Grassfields

populations were more heterogeneous than the Ghanaian populations despite covering a

smaller geographical area. Fine-scale investigation of these two regions and the reasons

for their different patterns of genetic diversity could uncover the underlying causes. In the

case of the Grassfields it is possible that topographic variation has been a major factor.

5.3.3. Future work derived from Chapter 4 (The potentially deleterious functional variant

FMO2*1 is at high frequency throughout sub-Saharan Africa)

This study examined the distribution of the FMO2 g.23238C>T SNP in sub-Saharan

Africa in a broad geographical context. While the frequency is generally homogenous

across sub-Saharan Africa there was nevertheless a large range. East Africa seems

particularly variable and further analysis of populations in southern Sudan and Ethiopia is

called for.

Of immediate importance is to establish a) whether the FMO2*1 allele does code a

functional FMO2 enzyme in Africans and not just in African Americans and b) what the

medical impact of functionality is. Early indications from work in the laboratories of

Professor Ian Phillips at Queen Mary, University of London and Professor Elizabeth

Shephard at University College London (unpublished data) suggest that African FMO2*1

alleles are in fact catalytically active. If it does code for a functionally active enzyme it is

important to establish how this functionality affects the metabolism of thiourea-based

drugs such as Ethionamide (e.g. are there dosage-specific effects?) and if any other drug

metabolizing enzymes may interact with or compensate for a) functional FMO2 and b)

non-functional FMO2 (either directly or indirectly).

The NIEHS data revealed many SNPs on the 23238C background that may affect protein

structure. Sequencing of FMO2 exons as well as possible promoters, enhancers and splice

sites in a large cohort of Africans may identify further variants, some of which may be

population- or region-specific, and again it will be important to establish the effect of

these variants on enzyme activity.

Though it was not the primary focus of the study tests for recent positive selection were

performed using the Long Range Haplotype (LRH) test (Sabeti et al. 2002b). This test

was chosen since the International HapMap project has made necessary SNP data readily

available. No evidence of positive selection at either 23238C>T allele was detected.

However, as discussed in Chapter 4, the LRH test has power to detect only very recent

selective sweeps (<30,000 years before present (Sabeti et al. 2006)) because

recombination will extinguish evidence of earlier events. Given the age of the SNP

(~502,404 years before present, lower boundary: 154,790 years, upper boundary:

1,041,243 years) other methods are necessary to explore the possibility that the fixation of

the non-functional 23238T allele in Europeans and Asians was a result of positive

selection acting outside the range amenable to LRH testing. It will be necessary to re-

sequence the FMO2 gene in African and Eurasian chromosomes. This would allow us, in

a manner similar to that utilised by Xue et al. (2006), to conduct tests of neutrality using

Tajima's D (Tajima 1989), Fu and Li's D and F (Fu & Li 1993) and Fay and Wu's H (Fay

& Wu 2000) as well as to examine whether the level of haplotype diversity for either

allele is that expected under neutrality. Such methods might allow us to detect not only

signatures of possible positive selection but also balancing selection. Though NIEHS

sequence data are available for these populations they are probably not of the required

quality to perform such analyses accurately (there are large regions where successful

sequencing coverage in all samples has not been achieved, which may affect the ability to

detect selection as analyses such as Tajima's D depend on correctly identifying singleton

variation (Filatov 2002)). Only once this re-sequencing has been performed will it be

possible to attempt to assess evidence for selection as a factor in the present day

distribution of 23238C>T alleles.
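As an illustration of the first of the neutrality tests named above, Tajima's D compares the mean number of pairwise differences with the number of segregating sites scaled by a harmonic sum. The sketch below follows the published formula of Tajima (1989). The function name is hypothetical, and the code assumes aligned sequences of equal length with no missing data, which is precisely the condition the NIEHS data may not meet.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(sequences):
    """Tajima's D for aligned, equal-length sequences without missing data.

    Returns None when the statistic is undefined (fewer than four sequences
    or no segregating sites).
    """
    n = len(sequences)
    # Number of segregating sites: alignment columns with more than one state.
    s = sum(1 for column in zip(*sequences) if len(set(column)) > 1)
    if n < 4 or s == 0:
        return None
    # Mean number of pairwise differences between sequences.
    pairs = list(combinations(sequences, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    # Constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))
```

An excess of singletons pulls D below zero, while an excess of intermediate-frequency variants pushes it above zero. This is why miscalled or missing singletons in low-coverage regions can distort the test.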

5.4. Final Comments

The three case studies presented in this thesis have revealed important findings to

elucidate local human history, demographic behaviour and medically relevant variation in

sub-Saharan Africa. They demonstrate how examining the distribution of human genetic

diversity can generate useful insights in many diverse areas. Throughout this thesis these

studies have emphasised the importance of adopting appropriate sampling methodologies

and utilizing the expertise of collaborators working in other disciplines such as

anthropology and linguistics. With the rapid advance of relevant genotyping and

sequencing technologies and rapidly reducing costs, scope for such work is increasing. It

is to be hoped that these advances will be matched by increasing attention to fieldwork

and the raw material for such studies; i.e. the choice of individuals from whom samples

are taken. Only then can peoples of sub-Saharan Africa start to reap the benefits, in

practical ways, that this research can generate.

Appendix A: Criteria for and problems

associated with collecting African samples for

The Centre for Genetic Anthropology

(TCGA) DNA bank.

1. Ethical and Legal Consents

No collection is made unless it is permitted by national and local law and

appropriate ethical consent has been obtained in the country in which the collection

is made. No collection is made unless to do so will not breach local custom. All

collections are made with the consent of local communal leaders and in ways

acceptable to local custom.

Problems

It is not always possible to establish what the relevant law is and is not always

possible to identify a suitable body from which ethical consent should be obtained.

It is sometimes difficult to identify local officials, the consent of whom should be

sought.

2. No Coercion

No collections are made under arrangements in which donors are instructed to

participate.

3. Random collection

To the greatest extent possible, samples are collected from donors randomly.

Where there is a preset number of samples to be collected they are collected using a

'first come first served' approach. There are three approaches to collection: a)

establishing presence in a public place e.g. a weekly market and waiting for persons

to offer to provide mouth swabs, b) prearranged gatherings e.g. in a village or town

hall, advertised previously by a local representative and c) visits to small hamlets.

4. Informed Consent

The purposes of the study are explained in simple terms to all donors.

Problem

Frequently the explanation has to be given through a local interpreter speaking in a

local language.

5. Thank you gift

Donors are rewarded for providing a sample by being given a Polaroid photograph.

Problem

There is a risk that a donor may provide false information in order to qualify to give

a mouth swab or attempt to give a mouth swab on more than one occasion in order

to get more than one photograph.

6. Donors

Samples are only collected from males of 18 years of age or older that do not have a

common paternal grandfather.

Problems

It is necessary to be careful to ensure that persons under 18 do not give false

information about their age in order to obtain a photograph and that persons

sharing the same paternal grandfather do not give false information for the same

reason. A local adviser (normally an interpreter) is recruited to ensure that these

activities do not take place. When collections are made at a market or in a village

or town hall, whenever possible, they are completed in a single day to minimise the

risk of breaching these rules.

Collecting only from males can sometimes cause females in the same location to feel

discriminated against. It would be preferable to ensure that individuals, in addition

to not sharing a common paternal grandfather, do not share a common maternal

grandmother. In fact it would be preferable if individuals did not share any

grandparent, either maternal or paternal. In practice however this objective cannot

be achieved while collecting random and anonymous samples in rural African

locations. To attempt to do so would require a level of questioning and record

keeping that is not practical or consistent with collecting samples anonymously.

Given frequent occurrences of polygamy (formal and informal) introducing a

criterion of not sharing a common maternal grandmother as well as not sharing a

common paternal grandfather is not practical. In addition, since it was not a

requirement of early TCGA collections made for Y-chromosome studies, not

introducing this criterion ensures consistency.

At markets, in particular, crowds can become animated making it difficult to keep

order.

Complying with the rule not to collect from individuals sharing the same paternal

grandfather is necessary to ensure that cases of false paternity are not identified.

A further purpose is to ensure consistency across TCGA collections which were

originally compiled for Y-chromosome studies. Complying with this restriction

does introduce an element of bias preventing a collection being entirely random. At

an extreme it is possible that in a village consisting of only one or two clans,

perhaps only one sample can be collected from each clan.

In collections made in markets, obtaining the information for datasheets can be

difficult and it is necessary to have a sufficiently large team of recorders and local

interpreters to ensure the task is performed satisfactorily. The possibility of

inaccurate information being provided by interpreters must be recognised.

Collectors need to ensure that answers are provided by the donor and are not

imposed on the donor by an interpreter.

7. Anonymity

All samples are collected anonymously.

8. Donor Targets

Prior to collection commencing a target figure for the number of samples to be

collected is defined and is normally set at 100.

Before collection starts a decision is made as to whether to collect persons attending

a particular location randomly or to restrict the collection to persons born or living

at a particular location, in a particular region, possessing a particular self defined

identity, speaking a particular first or second language or defined by some other

stated criterion.

Problem

It is possible that potential donors will provide inaccurate information in order to

obtain a photograph.

9. Form of Collection

Only mouth swabs are collected.

Problem

The yield of DNA is far lower than if blood is taken.

Appendix B: An example sociological data

sheet used during DNA sample collection

Appendix C: Extraction of DNA from Buccal

For collection of samples in the field buccal swabs are rubbed along both cheeks on the

inside of the mouth for approximately 20 seconds to collect cheek cells. This is usually

performed by the collector but occasionally by the individual being sampled himself (for

example a high ranking individual such as a chief of a village may not be allowed contact

with other individuals). The buccal swab is then placed within a 1.5ml tube so that the

swab end makes contact with a 1ml 0.05M Ethylenediaminetetraacetic acid (EDTA),

0.5% Sodium Dodecyl Sulfate (SDS) preservative solution. The following

Phenol/Chloroform DNA extraction protocol is then performed for each sample.

1. 40 µl of 10 mg ml⁻¹ proteinase K is added to 20ml of sterile distilled water.

2. 0.8ml of the water/proteinase K solution described in step 1 is added to the 1.5ml

tube containing the buccal swab immersed in EDTA/SDS solution.

3. The mixture from step 2 is then incubated at 56˚C for between 1 and 3 hours.

4. 0.8ml of the mixture from step 3 is added to a microfuge tube containing 0.6ml of

phenol/chloroform (1:1) mix.

5. The sample from step 4 is mixed and then centrifuged for 10 minutes at maximum

speed.

6. The resultant aqueous (upper) phase (layer) in the microfuge tube is transferred to

a microfuge tube containing 0.6ml of chloroform and 30µl of 5M NaCl using a

standard Gilson pipette.

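7. The sample from step 6 is mixed and then centrifuged […] at maximum

speed.

8. The resultant aqueous (upper) phase (layer) in the microfuge tube is transferred to

a microfuge tube containing 0.7ml of chloroform using a standard Gilson pipette.

9. The sample from step 8 is mixed and then centrifuged […] at maximum

speed.

10. The resultant aqueous (upper) phase (layer) in the microfuge tube is transferred to

a screw-top microfuge tube (which is used for long term storage of the DNA)

containing 0.7ml of isopropanol using a standard Gilson pipette.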

11. The sample from step 10 is mixed and then centrifuged for 13 minutes at

maximum speed.

12. The resultant supernatant is carefully (to avoid dislodging the DNA from the walls

of the tube) discarded and the tube is inverted at a 45˚ angle for one minute in order to

drain off any remaining supernatant.

13. 0.8ml of 70% Ethanol is then added to the screw-top microfuge tube.

14. The sample from step 13 is then centrifuged for 10 minutes at maximum speed.

15. The resultant supernatant is carefully discarded and the tube is inverted at a 45˚ angle

for 20 minutes in order to drain off any remaining supernatant.

16. 200 µl of TE (pH 9.0) is then added to the microfuge tube.

17. The mixture from step 16 is then incubated at 56˚C for 10 min mixing

occasionally.

18. The resulting DNA with TE mixture is then stored upright in a freezer at -20˚C, ready for use.

The above protocol is performed in batches of samples to increase throughput. Steps

1-3 are performed in batches of 48, steps 4-10 in batches of 24 (the maximum capacity of

the microfuge centrifuge) and steps 11-18 in batches of 96. Custom TCGA DNA

extraction sheets and appropriate labelling are used throughout to prevent mixing up of

samples.

Appendix D: Legends of figures and tables

found on the attached CD.

Chapter 2

Supplementary Table 2S.1: Distribution of NRY types, defined by UEP

haplogroups and microsatellite haplotypes, in the four Nso´ social classes

and people of the western Grassfields and Tikar Plain.

Supplementary Table 2S.2: Distribution of mtDNA types, defined by VSO

haplotypes, in the four Nso′ social classes.

Supplementary Table 2S.7: Confidence intervals for TMRCA calculations in

the duy, the nshiylav and mtaar, and the won nto´ and duy, using two

mutation models.

Chapter 3

Supplementary Tables 3S.1: Pairwise ETPD P-values for various levels of

NRY and mtDNA analysis for Cross River samples, Cameroon and Ghana.

Level of analysis shown in top left cell of matrix. Colour code is same as

Table 3.6.

Supplementary Table 3S.2: Distribution of NRY types, defined by UEP

haplogroups and microsatellite haplotypes, in the Cross River region,

Cameroon and Nigeria.

Supplementary Table 3S.3: Pairwise genetic distances and associated P-

values for various levels of NRY and mtDNA analysis. Level of analysis

shown in top left cell of matrix. Colour code is same as Table 3.6.

Supplementary Table 3S.4: Distribution of mtDNA types, defined by HVS-1

mtDNA haplogroups and VSO haplotypes, in the Cross River region,

Cameroon and Nigeria.

Supplementary Table 3S.5: Distribution of NRY types, defined by UEP

haplogroups and microsatellite haplotypes, in Ethiopia, Israeli and

Palestinian Arabs, Lake Chad and Sudan.

Supplementary Table 3S.6: Distribution of mtDNA types, defined by HVS-1

mtDNA haplogroups and VSO haplotypes, in Ethiopia, Israeli and

Palestinian Arabs, Lake Chad and Sudan.

Supplementary Table 3S.7: Pairwise genetic distances and associated P-

values for various levels of NRY and mtDNA analysis for Efik Uwanse

comparisons. Colour code is same as Table 3.6.

Appendix E: LRH test Source Code

The original Python source code written by myself that was used to perform the version

of the LRH haplotype test developed at the TCGA described in section 4.2.3.1.1 is

available on the CD that accompanies this thesis (SNPsig-v35-phase2-build36-cm.py).

This version uses only SNPs for the core region but a further version that can use

haplotype core regions has also been developed as part of another study and is available

from the author on request. This code uses build35 of the HapMap dataset and requires

the downloading of HapMap files in the following structure (though this structure is

easily editable within the code):

C:\Hapmap-build35\Allelefrequencies\(All unzipped allele frequency files)

C:\Hapmap-build35\Phasedata\(All unzipped phased data files)

C:\Hapmap-build35\Recombrates\(All unzipped genetic map data)

The code was written in Python version 2.4 so should be compatible with this and any

future versions of Python. Back compatibility has not been tested but the Python

programming environment is freely available at www.python.org. It also requires the

downloading and installation of the Python package 'numarray'. The REHH values given

by the programme should NOT be used as it does not yet account for EHH values of 0 in

the calculation of REHH. This version is also quite computer memory intensive as it

requires a lot of re-accessing of HapMap data files. The author is currently working on a

quicker method which may be available in the near future. Contact

Krishna.veeramah@ucl.ac.uk for any further enquiries.
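For readers without access to the CD, the quantities involved can be illustrated with a short, self-contained sketch. This is not the SNPsig program and does not read HapMap files. It assumes phased haplotypes supplied as equal-length strings and a biallelic SNP core, takes EHH as the probability that two randomly chosen chromosomes carrying the core allele are identical over the extended interval (Sabeti et al. 2002b), and simplifies REHH to the ratio of EHH on the tested allele to EHH on the other allele. The function names and the choice to return None when the denominator is zero are assumptions made for this example.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core_index, core_allele, end_index):
    """EHH for chromosomes carrying core_allele, from the core SNP to end_index."""
    lo, hi = sorted((core_index, end_index))
    carriers = [h[lo:hi + 1] for h in haplotypes if h[core_index] == core_allele]
    if len(carriers) < 2:
        return None  # homozygosity is undefined for fewer than two chromosomes
    counts = Counter(carriers)
    # Proportion of chromosome pairs that are identical over the interval.
    return sum(comb(c, 2) for c in counts.values()) / comb(len(carriers), 2)


def rehh(haplotypes, core_index, tested_allele, other_allele, end_index):
    """Relative EHH for a biallelic core; None when the ratio is undefined."""
    tested = ehh(haplotypes, core_index, tested_allele, end_index)
    other = ehh(haplotypes, core_index, other_allele, end_index)
    if tested is None or not other:
        return None  # guards against a missing or zero EHH in the denominator
    return tested / other
```

Returning None rather than dividing by zero is one simple way of handling the case noted above, in which EHH on the comparison allele has decayed to 0.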

References

1. 1987. Second general census of population and housing of Cameroon. Volume

3:preliminary analysis.: SUPECAM, Yaounde.

2. The Concise International Chemical Assessment Document 49 (CICADA 49).

2003. 20 Avenue Appia, 1211 Geneva 27, Switzerland, UN Environment

Programme, the International Labour Organization and the World Health

Organization.

3. 2005. Ethnologue: Languages of the World. Dallas, Texas: SIL International.

4. Adelaar, A. 1995. Asian roots of the Malagasy: a linguistic perspective. Bijdragen

tot de Taal-Land en Volkenkunde, 151: 325-356.

5. Agrawal, S. & Khan, F. 2007. Human genetic variation and personalized

medicine. Indian J.Physiol Pharmacol., 51 (1): 7-28.

6. Aidoo, M. et al 2002. Protective effects of the sickle cell gene against malaria

morbidity and mortality. Lancet, 359 (9314): 1311-1312.

7. Akak, E. O. 1986. The Palestine Origin of the Efiks. Calabar: Akak and Sons.

8. Aklillu, E. et al 2003. Genetic polymorphism of CYP1A2 in Ethiopians affecting

induction and expression: characterization of novel haplotypes with single-

nucleotide polymorphisms in intron 1. Mol.Pharmacol., 64 (3): 659-669.

9. Aklillu, E. et al 2002. Functional analysis of six different polymorphic CYP1B1

enzyme variants found in an Ethiopian population. Mol.Pharmacol., 61 (3): 586-

10. Aklillu, E. et al 1996. Frequent distribution of ultrarapid metabolizers of

debrisoquine in an Ethiopian population carrying duplicated and multiduplicated

functional CYP2D6 alleles. J.Pharmacol.Exp.Ther., 278 (1): 441-446.

11. Allabi, A. C. et al 2003. Genetic polymorphisms of CYP2C9 and CYP2C19 in the

Beninese and Belgian populations. Br.J.Clin.Pharmacol., 56 (6): 653-657.

12. Allabi, A. C. et al 2005. Single nucleotide polymorphisms of ABCB1 (MDR1)

gene and distinct haplotype profile in a West Black African population.

Eur.J.Clin.Pharmacol., 61 (2): 97-102.

13. Allerston, C. K. et al 2007. Molecular evolution and balancing selection in the

flavin-containing monooxygenase 3 gene (FMO3). Pharmacogenet.Genomics, 17

(10): 827-839.

14. Alves, C. et al 2005. STR allelic frequencies for an African population sample

(Equatorial Guinea) using AmpFlSTR Identifiler and Powerplex 16 kits. Forensic

Sci.Int., 148 (2-3): 239-242.

15. Amos, W. & Manica, A. 2006. Global genetic positioning: evidence for early

human population centers in coastal habitats. Proc.Natl.Acad.Sci.U.S.A, 103 (3):

820-824.

16. Anderson, S. et al 1981. Sequence and organization of the human mitochondrial

genome. Nature, 290 (5806): 457-465.

17. Ardener, E. 1968. Documentary and linguistic evidence for the rise of the trading

polities between Rio del Rey and Cameroons. In: I. M. Lewis, ed., History and

Social Anthropology. London: 1500-1650.

18. Armour, J. A. et al 1996. Minisatellite diversity supports a recent African origin

for modern humans. Nat.Genet., 13 (2): 154-160.

19. Aylor, D. L., Price, E. W. & Carbone, I. 2006. SNAP: Combine and Map modules

for multilocus population genetic analysis. Bioinformatics., 22 (11): 1399-1401.

20. Bahuchet, S. 1992. Dans la Forêt d'Afrique Centrale:Les Pygmées Parmi le

Peuples d'Afrique Centrale., Histoire d'une civilisation forestière I. Paris: Peeters-

SELAF.

21. Bahuchet, S. 1993. La Rencontre des Agriculteurs:Les Pygmées Parmi le Peuples.,

Histoire d'une civilisation forestière II. Paris: Peeters-SELAF.

22. Bandelt, H. J. et al 2001. Phylogeography of the human mitochondrial haplogroup

L3e: a snapshot of African prehistory and Atlantic slave trade. Ann.Hum.Genet.,

65 (Pt 6): 549-563.

23. Bapiro, T. E. et al 2002. The molecular and enzyme kinetic basis for the

diminished activity of the cytochrome P450 2D6.17 (CYP2D6.17) variant.

Potential implications for CYP2D6 phenotyping studies and the clinical use of

CYP2D6 substrate drugs in some African populations. Biochem.Pharmacol., 64

(9): 1387-1398.

24. Barbujani, G., Oden, N. L. & Sokal, R. R. 1989. Detecting Regions of Abrupt

Change in Maps of Biological Variables. Systematic Zoology, 38 (4): 376-389.

25. Basden, G. T. 1966. Among the Ibos of Nigeria. London: Frank Cass.

26. Bathum, L. et al 1999. Phenotypes and genotypes for CYP2D6 and CYP2C19 in a

black Tanzanian population. Br.J.Clin.Pharmacol., 48 (3): 395-401.

27. Batini, C. et al 2007. Phylogeography of the human mitochondrial L1c

haplogroup: genetic signatures of the prehistory of Central Africa.

Mol.Phylogenet.Evol., 43 (2): 635-644.

28. Behar, D. M. et al 2003. Multiple origins of Ashkenazi Levites: Y chromosome

evidence for both Near Eastern and European ancestries. Am.J.Hum.Genet., 73

(4): 768-779.

29. Beleza, S. et al 2005. The genetic legacy of western Bantu migrations.

Hum.Genet., 117 (4): 366-375.

30. Bender, M. L. 1997. Upside-down Afrasian. Afrikanistische Arbeitspapiere, 50: 19-

31. Bentley, D. R. 2006. Whole-genome re-sequencing. Curr.Opin.Genet.Dev., 16

(6): 545-552.

32. Berniell-Lee, G. et al 2006. Y-chromosome diversity in Bantu and Pygmy

populations from Central Africa. International Congress Series, 1288: 234-236.

33. Bertorelle, G. & Barbujani, G. 1995. Analysis of DNA diversity by spatial

autocorrelation. Genetics, 140 (2): 811-819.

34. Bianchi, N. O. et al 1998. Characterization of ancestral and derived Y-

chromosome haplotypes of New World native populations. Am.J.Hum.Genet., 63

(6): 1862-1871.

35. Blench, R. 1995. Is Niger-Congo Simply a Branch of Nilo-Saharan. In: R. Nicolai

& F. Rottland, eds., Proceedings of the Fifth Nilo-Saharan Linguistics Colloquium,

Nice, 1992. Cologne, Germany: Rudiger Koppe. 68-118.

36. Blench, R. 1999a. The Languages of Africa: Macrophyla Proposals and

Implications for Archaeological Interpretation. In: R. Blench & M. Spriggs, eds.

IV edn. London: Routledge. 29-47.

37. Blench, R. 1999b. The Westward Wanderings of Cushitic Pastoralists. In: C.

Baroin & J. Boutrais, eds., L'Homme et l'Animale Dans le Bassin du Lac Tchad.

Paris: IRD.

38. Blench, R. The Bendi languages: More lost Bantu languages. 2001.

39. Blench, R. 2006. Archaeology, Language, and the African Past. Lanham:

AltaMira Press.

40. Bolnick, D. A. et al 2004. Problematic use of Greenberg's linguistic classification

of the Americas in studies of Native American genetic variation.

Am.J.Hum.Genet., 75 (3): 519-522.

41. Bowcock, A. M. et al 1994. High resolution of human evolutionary trees with

polymorphic microsatellites. Nature, 368 (6470): 455-457.

42. Boyd, R. 1996. Chamba Daka and Bantoid: A further look at Chamba Daka

classification. Journal of West African Languages, 26 (2): 29-43.

43. Brandstatter, A. et al 2004. Mitochondrial DNA control region sequences from

Nairobi (Kenya): inferring phylogenetic parameters for the establishment of a

forensic database. Int.J.Legal Med., 118 (5): 294-306.

44. Brauer, G., Collard, M. & Stringer, C. 2004. On the reliability of recent tests of

the Out of Africa hypothesis for modern human origins. Anat.Rec.A

Discov.Mol.Cell Evol.Biol., 279 (2): 701-707.

45. Browning, S. R. & Browning, B. L. 2007. Rapid and accurate haplotype phasing

and missing-data inference for whole-genome association studies by use of

localized haplotype clustering. Am.J.Hum.Genet., 81 (5): 1084-1097.

46. Caglia, A. et al 1997. Y-chromosome STR loci in Sardinia and continental Italy

reveal islander-specific haplotypes. Eur.J.Hum.Genet., 5 (5): 288-292.

47. Campbell, L. Languages and Gene in Collaboration: some Practical Matters.

48. Cann, R. L., Stoneking, M. & Wilson, A. C. 1987. Mitochondrial DNA and

human evolution. Nature, 325 (6099): 31-36.

49. Cashman, J. R. 2000. Human flavin-containing monooxygenase: substrate

specificity and role in drug metabolism. Curr.Drug Metab, 1 (2): 181-191.

50. Cashman, J. R. 2005. Some distinctions between flavin-containing and

cytochrome P450 monooxygenases. Biochem.Biophys.Res.Commun., 338 (1):

599-604.

51. Cavaco, I. et al 2003. CYP3A4*1B and NAT2*14 alleles in a native African

population. Clin.Chem.Lab Med., 41 (4): 606-609.

52. Cavalli-Sforza, L. L. 1986. African Pygmies. Orlando, Florida: Academic Press.

53. Cavalli-Sforza, L. L., Menozzi, P. & Piazza, A. 1994. The History and Geography

of Human Genes. New Jersey: Princeton University Press.

54. Cerny, V. et al 2006. MtDNA of Fulani nomads and their genetic relationships to

neighboring sedentary populations. Hum.Biol., 78 (1): 9-27.

55. Cerny, V. et al 2004. mtDNA sequences of Chadic-speaking populations from

northern Cameroon suggest their affinities with eastern Africa. Ann.Hum.Biol., 31

(5): 554-569.

56. Cerny, V. et al 2007. A bidirectional corridor in the Sahel-Sudan belt and the

distinctive features of the Chad Basin populations: a history revealed by the

mitochondrial DNA genome. Ann.Hum.Genet., 71 (Pt 4): 433-452.

57. Chaubey, G. et al 2007. Peopling of South Asia: investigating the caste-tribe

continuum in India. Bioessays, 29 (1): 91-100.

58. Chelule, P. K. et al 2003. MDR1 and CYP3A4 polymorphisms among African,

Indian, and white populations in KwaZulu-Natal, South Africa.

Clin.Pharmacol.Ther., 74 (2): 195-196.

59. Chem-Langhee, B. & Fanso, V. G. 1997. Social categories, local politics and the

uses of oral tradition in Nso'. Paideuma, 43: 313-327.

60. Chen, Y. S. et al 2000. mtDNA variation in the South African Kung and Khwe-

and their genetic relationships to other African populations. Am.J.Hum.Genet., 66

(4): 1362-1383.

61. Chen, Y. S. et al 1995. Analysis of mtDNA variation in African populations

reveals the most ancient of all human continent-specific haplogroups.

Am.J.Hum.Genet., 57 (1): 133-149.

62. Chilver, E. M. & Kaberry, P. M. 1960. From Tribute to Tax in a Tikar Chiefdom.

Africa, 30 (1): 1-19.

63. Chilver, E. M. & Kaberry, P. M. 1968. Traditional Bamenda; The Pre-colonial

History and Ethnography of the Bamenda Grassfields.: Ministry of Primary

Education and Social Welfare and West Cameroon Antiquities Commission.

64. Chilver, E. M. & Kaberry, P. M. 1971. The Tikar problem: a non-problem.

Journal of African Languages, 10 (2): 13-14.

65. Coia, V. et al 2004. Binary and microsatellite polymorphisms of the Y-

chromosome in the Mbenzele pygmies from the Central African Republic.

Am.J.Hum.Biol., 16 (1): 57-67.

66. Coia, V. et al 2005. Brief communication: mtDNA variation in North Cameroon:

lack of Asian lineages and implications for back migration from Asia to sub-

Saharan Africa. Am.J.Phys.Anthropol., 128 (3): 678-681.

67. Collins-Schramm, H. E. et al 2002. Markers that discriminate between European

and African ancestry show limited variation within Africa. Hum.Genet., 111 (6):

566-569.

68. Connell, B. Unpublished fieldnotes. 1983.

69. Connell, B. 1994. The Lower Cross languages: a prolegomena to the classification

of the Cross River languages. Journal of West African Languages, XXIV (1): 3-

70. Connell, B. 1998. Classifying Cross River. vol. 2. Lawrenceville, NJ: Africa

World Press. 17-25.

71. Connell, B. 2000. The Integrity of Mambiloid. In: H. E. Wolff & O. Gensler, eds.,

Proceedings from the 2nd World Congress of African Linguistics. Leipzig.

Cologne: Rüdiger Köppe Verlag. 197-213.

72. Connell, B. & Maison, K. B. 1994. A Cameroun homeland for the Lower Cross

languages? Sprache und Geschichte in Afrika, 15: 47-90.

73. Cox, M. 2007. Extreme patterns of variance in small populations: placing limits

on human Y-chromosome diversity through time in the Vanuatu Archipelago.

Ann.Hum.Genet., 71 (Pt 3): 390-406.

74. Cramon-Taubadel, N. & Lycett, S. J. 2007. Human cranial variation fits iterative

founder effect model with African origin. Am.J.Phys.Anthropol.

75. Cruciani, F. et al 2002. A back migration from Asia to sub-Saharan Africa is

supported by high-resolution analysis of human Y-chromosome haplotypes.

Am.J.Hum.Genet., 70 (5): 1197-1214.

76. Crystal, D. 1997. The Cambridge Encyclopedia of Language., 2nd edn.

Cambridge: Cambridge University Press.

77. Dandara, C. et al 2003. Arylamine N-acetyltransferase (NAT2) genotypes in

Africans: the identification of a new allele with nucleotide changes 481C>T and

590G>A. Pharmacogenetics, 13 (1): 55-58.

78. Dandara, C. et al 2001. Genetic polymorphism of CYP2D6 and CYP2C19 in east-

and southern African populations including psychiatric patients.

79. Dandara, C. et al 2002. Genetic polymorphism of cytochrome P450 1A1

(Cyp1A1) and glutathione transferases (M1, T1 and P1) among Africans.

Clin.Chem.Lab Med., 40 (9): 952-957.

80. Darlu, P. & Tassy, P. 1987. Disputed African origin of human populations.

Nature, 329 (6135): 111-112.

81. Denbow, J. R. 1986. A new look at the later prehistory of the Kalahari. Journal of

African History, 27: 3-28.

82. Denbow, J. R. 1990. Congo to Kalahari: Data and hypotheses about the political

economy of the western stream of the Early Iron Age. African Archaeological

Review, 8: 139-176.

83. Destro-Bisol, G. et al 2000. Microsatellite variation in Central Africa: an analysis

of intrapopulational and interpopulational genetic diversity. Am.J.Phys.Anthropol.,

112 (3): 319-337.

84. Destro-Bisol, G. et al 2004. The analysis of variation of mtDNA hypervariable

region 1 suggests that Eastern and Western Pygmies diverged before the Bantu

expansion. Am.Nat., 163 (2): 212-226.

85. Destro-Bisol, G. et al 1999. Estimating European admixture in African Americans

by using microsatellites and a microsatellite haplotype (CD4/Alu). Hum.Genet.,

104 (2): 149-157.

86. Di Giacomo, F. et al 2004. Y chromosomal haplogroup J as a signature of the

post-neolithic colonization of Europe. Hum.Genet., 115 (5): 357-371.

87. Dolphin, C. T. et al 1998. The flavin-containing monooxygenase 2 gene (FMO2)

of humans, but not of other primates, encodes a truncated, nonfunctional protein.

J.Biol.Chem., 273 (46): 30599-30607.

88. Donaldson, I. J. et al 2002. Unique TCR beta-subunit variable gene haplotypes in

Africans. Immunogenetics, 53 (10-11): 884-893.

89. Ehret, C. 2002. Languages Family Expansion: Broadening Our Understanding of

Cause from an African Perspective. In: P. Bellwood & C. Renfrew, eds.,

Examining the Farming/Language Dispersal Hypothesis. Cambridge: McDonald

Institute for Archaeological Research. 163-176.

90. Eswaran, V. 2002. A Diffusion Wave out of Africa: The Mechanism of the

Modern Human Revolution. Current Anthropology, 49: 1-18.

91. Eswaran, V., Harpending, H. & Rogers, A. R. 2005. Genomics refutes an

exclusively African origin of humans. J.Hum.Evol., 49 (1): 1-18.

92. Evans, P. D. et al 2006. Evidence that the adaptive allele of the brain size gene

microcephalin introgressed into Homo sapiens from an archaic Homo lineage.

Proc.Natl.Acad.Sci.U.S.A, 103 (48): 18178-18183.

93. Evans, W. E. 2003. Pharmacogenomics: marshalling the human genome to

individualise drug therapy. Gut, 52 Suppl 2: ii10-ii18.

94. Excoffier, L. 2002. Human demographic history: refining the recent African

origin model. Curr.Opin.Genet.Dev., 12 (6): 675-682.

95. Excoffier, L. & Langaney, A. 1989. Origin and differentiation of human

mitochondrial DNA. Am.J.Hum.Genet., 44 (1): 73-85.

96. Excoffier, L., Smouse, P. E. & Quattro, J. M. 1992. Analysis of molecular

variance inferred from metric distances among DNA haplotypes: application to

human mitochondrial DNA restriction data. Genetics, 131 (2): 479-491.

97. Fay, J. C. & Wu, C. I. 2000. Hitchhiking under positive Darwinian selection.

Genetics, 155 (3): 1405-1413.

98. Fenner, J. N. 2005. Cross-cultural estimation of the human generation interval for

use in genetics-based population divergence studies. Am.J.Phys.Anthropol., 128

(2): 415-423.

99. Filatov, D. A. 2002. proseq: A software for preparation and evolutionary analysis

of DNA sequence data sets. Molecular Ecology Notes, 2 (4): 621-624.

100. Fisher, H. J. 2001. Slavery in the History of Muslim Black Africa., 1st edn.

London: C.Hurst & Co. Ltd.

101. Flores, C. et al 2001. Y-chromosome differentiation in Northwest Africa.

Hum.Biol., 73 (4): 513-524.

102. Forde, D. & Jones, G. I. 1950. The Ibo and Ibibio-speaking Peoples of South-

eastern Nigeria. London: Oxford University Press.

103. Forster, P. et al 1998. Phylogenetic resolution of complex mutational features at

Y-STR DYS390 in aboriginal Australians and Papuans. Mol.Biol.Evol., 15 (9):

1108-1114.

104. Forster, P. et al 2000. A short tandem repeat-based phylogeny for the human Y

chromosome. Am.J.Hum.Genet., 67 (1): 182-196.

105. Fowler, I. & Zeitlyn, D. 1996. Introductory Essay: the Grassfields and the Tikar.

Oxford: Berghahn.

106. Fraaije, M. W. et al 2004. The prodrug activator EtaA from Mycobacterium

tuberculosis is a Baeyer-Villiger monooxygenase. J.Biol.Chem., 279 (5): 3354-

107. Freckleton, R. P. 2002. On the misuse of residuals in ecology: regression of

residuals vs. multiple regression. Journal of Animal Ecology, 71: 542-545.

108. Fu, Y. X. & Li, W. H. 1993. Statistical tests of neutrality of mutations. Genetics,

133 (3): 693-709.

109. Furnes, B. et al 2003. Identification of novel variants of the flavin-containing

monooxygenase gene family in African Americans. Drug Metab Dispos., 31 (2):

187-193.

110. Garrigan, D. & Hammer, M. F. 2006. Reconstructing human origins in the

genomic era. Nat.Rev.Genet., 7 (9): 669-680.

111. Garsa, A. A., McLeod, H. L. & Marsh, S. 2005. CYP3A4 and CYP3A5

genotyping by Pyrosequencing. BMC.Med.Genet., 6: 19.

112. Gene, M. et al 2001. The Bubi population of Equatorial Guinea characterised by

HUMTH01, HUMVWA31A, HUMCSF1PO, HUMTPOX, D3S1358, D8S1179,

D18S51 and D19S253 STR polymorphisms. Int.J.Legal Med., 114 (4-5): 298-300.

113. Goheen, M. 1996. Men Own the Fields, Women Own the Crops; Gender and

Power in the Cameroon Grassfields., 1st edn.: The University of Wisconsin Press.

114. Goldstein, D. B. et al 1995. Genetic absolute dating based on microsatellites and

the origin of modern humans. Proc.Natl.Acad.Sci.U.S.A, 92 (15): 6723-6727.

115. Goncalves, R., Spinola, H. & Brehm, A. 2007. Y-chromosome lineages in Sao

Tome e Principe islands: evidence of European influence. Am.J.Hum.Biol., 19 (3):

422-428.

116. Gonder, M. K. et al 2007. Whole-mtDNA genome sequence analysis of ancient

African lineages. Mol.Biol.Evol., 24 (3): 757-768.

117. Gong, Y. et al 2007. Single nucleotide polymorphism discovery and haplotype

analysis of Ca2+-dependent K+ channel beta-1 subunit.

Pharmacogenet.Genomics, 17 (4): 267-275.

118. Gonzalez, A. M. et al 2007. Mitochondrial lineage M1 traces an early human

backflow to Africa. BMC.Genomics, 8: 223.

119. Goudet, J. et al 1996. Testing differentiation in diploid populations. Genetics, 144

(4): 1933-1940.

120. Gower, J. C. 1966. Some distance properties of latent root and vector methods

used in multivariate analysis. Biometrika, 53: 325-328.

121. Green, R. E. et al 2006. Analysis of one million base pairs of Neanderthal DNA.

Nature, 444 (7117): 330-336.

122. Greenberg, J. H. 1955. Studies in African Linguistic Classification. Branford:

Compass.

123. Greenberg, J. H. 1963. The Languages of Africa. Bloomington: The Hague:

Mouton.

124. Gregersen, E. A. 1972. Kongo-Saharan. Journal of African Languages, 11 (1): 69-

125. Griese, E. U. et al 1999. Analysis of the CYP2D6 gene mutations and their

consequences for enzyme function in a West African population.

Pharmacogenetics, 9 (6): 715-723.

126. Griffiths, R. C. & Marjoram, P. 1996. Ancestral inference from samples of DNA

sequences with recombination. J.Comput.Biol., 3 (4): 479-502.

127. Gudschinsky, S. 1956. The ABC's of lexicostatistics. Word, 12: 175-210.

128. Güldemann, T. & Voßen, R. 2000. Khoesan. In: B. Heine & D. Nurse, eds.,

African Languages: An Introduction. Cambridge: Cambridge University Press. 99-

129. Guo, S. W. & Thompson, E. A. 1992. Performing the exact test of Hardy-

Weinberg proportion for multiple alleles. Biometrics, 48 (2): 361-372.

130. Gusmao, L. et al 2001. STR data from S. Tome e Principe (Gulf of Guinea, West

Africa). Forensic Sci.Int., 116 (1): 53-54.

131. Gusmao, L. et al 2005. Mutation rates at Y chromosome specific microsatellites.

Hum.Mutat., 26 (6): 520-528.

132. Guthrie, M. 1967. Comparative Bantu: an introduction to the comparative

linguistics and prehistory of the Bantu languages. Farnborough: Gregg Press.

133. Hall, I. P. & Sayers, I. 2007. Pharmacogenetics and asthma: false hope or new

dawn? Eur.Respir.J., 29 (6): 1239-1245.

134. Hamblin, M. T. & Di Rienzo, A. 2000. Detection of the signature of natural

selection in humans: evidence from the Duffy blood group locus.

Am.J.Hum.Genet., 66 (5): 1669-1679.

135. Hamblin, M. T., Thompson, E. E. & Di Rienzo, A. 2002. Complex signatures of

natural selection at the Duffy blood group locus. Am.J.Hum.Genet., 70 (2): 369-

136. Hammer, M. F. et al 1998. Out of Africa and back again: nested cladistic analysis

of human Y chromosome variation. Mol.Biol.Evol., 15 (4): 427-441.

137. Hammer, M. F. et al 2001. Hierarchical patterns of global human Y-chromosome

diversity. Mol.Biol.Evol., 18 (7): 1189-1203.

138. Hammer, M. F. et al 1997. The geographic distribution of human Y chromosome

variation. Genetics, 145 (3): 787-805.

139. Hanchard, N. et al 2007. Classical sickle beta-globin haplotypes exhibit a high

degree of long-range haplotype similarity in African and Afro-Caribbean

populations. BMC.Genet., 8: 52.

140. Hanchard, N. A. et al 2006. Screening for recently selected alleles by analysis of

human haplotype similarity. Am.J.Hum.Genet., 78 (1): 153-159.

141. Handley, L. J. et al 2007. Going the distance: human population genetics in a

clinal world. Trends Genet., 23 (9): 432-439.

142. Harding, R. M. et al 1997. Archaic African and Asian lineages in the genetic

ancestry of modern humans. Am.J.Hum.Genet., 60 (4): 772-789.

143. Harpending, H. & Rogers, A. 2000. Genetic perspectives on human origins and

differentiation. Annu.Rev.Genomics Hum.Genet., 1: 361-385.

144. Harpending, H. C. 1993. The genetic structure of ancient human populations.

Current Anthropology, 34: 483-496.

145. Hart, A. K. 1964. Report of the Enquiry into the Dispute over the Obongship of

Calabar. Enugu: Government Printer.

146. Hasegawa, M. & Horai, S. 1991. Time of the deepest root for polymorphism in

human mitochondrial DNA. J.Mol.Evol., 32 (1): 37-42.

147. Hawks, J. et al 2008. A genetic legacy from archaic Homo. Trends Genet., 24 (1):

19-23.

148. Hawks, J. et al 2000. Population bottlenecks and Pleistocene human evolution.

Mol.Biol.Evol., 17 (1): 2-22.

149. Hein, J. 1990. Reconstructing evolution of sequences subject to recombination

using parsimony. Math.Biosci., 98 (2): 185-200.

150. Henderson, M. C. et al 2004a. S-oxygenation of the thioether organophosphate

insecticides phorate and disulfoton by human lung flavin-containing

monooxygenase 2. Biochem.Pharmacol., 68 (5): 959-967.

151. Henderson, M. C. et al 2004b. Human flavin-containing monooxygenase form 2

S-oxygenation: sulfenic acid formation from thioureas and oxidation of

glutathione. Chem.Res.Toxicol., 17 (5): 633-640.

152. Hernandez, D. et al 2004. Organization and evolution of the flavin-containing

monooxygenase genes of human and mouse: identification of novel gene and

pseudogene clusters. Pharmacogenetics, 14 (2): 117-130.

153. Heyer, E. et al 1997. Estimating Y chromosome specific microsatellite mutation

frequencies using deep rooting pedigrees. Hum.Mol.Genet., 6 (5): 799-803.

154. Hines, R. N. et al 2002. Alternative processing of the human FMO6 gene renders

transcripts incapable of encoding a functional flavin-containing monooxygenase.

Mol.Pharmacol., 62 (2): 320-325.

155. Hirunsatit, R. et al 2007. Sequence variation and linkage disequilibrium in the

GABA transporter-1 gene (SLC6A1) in five populations: implications for

pharmacogenetic research. BMC.Genet., 8: 71.

156. Holtkemper, U. et al 2001. Mutation rates at two human Y-chromosomal

microsatellite loci using small pool PCR techniques. Hum.Mol.Genet., 10 (6):

629-633.

157. Horai, S. 1995. Evolution and the origins of man: clues from complete sequences

of hominoid mitochondrial DNA. Southeast Asian J.Trop.Med.Public Health, 26

Suppl 1: 146-154.

158. Horai, S. et al 1995. Recent African origin of modern humans revealed by

complete sequences of hominoid mitochondrial DNAs. Proc.Natl.Acad.Sci.U.S.A,

92 (2): 532-536.

159. Howells, W. W. 1976. Explaining modern man: evolutionists versus

migrationists. Journal of Human Evolution, 5 (5): 477-495.

160. Hudson, R. R. 2001. Two-locus sampling distributions and their application.

Genetics, 159 (4): 1805-1817.

161. Huffman, T. N. 1998. The antiquity of Lobola. South African Archaeological

Bulletin, 53: 57-62.

162. Hurles, M. E. et al 2002. Y chromosomal evidence for the origins of oceanic-

speaking peoples. Genetics, 160 (1): 289-303.

163. Ingman, M. & Gyllensten, U. 2001. Analysis of the complete human mtDNA

genome: methodology and inferences for human evolution. J.Hered., 92 (6): 454-

164. Jackson, B. A. et al 2005. Mitochondrial DNA genetic diversity among four ethnic

groups in Sierra Leone. Am.J.Phys.Anthropol., 128 (1): 156-163.

165. Janmohamed, A. et al 2004. Cell-, tissue-, sex- and developmental stage-specific

expression of mouse flavin-containing monooxygenases (Fmos).

Biochem.Pharmacol., 68 (1): 73-83.

166. Jeffreys, M. D. W. 1964. Who are the Tikar? African Studies, 23 (3/4): 141-153.

167. Jobling, M., Hurles, M. E. & Tyler-Smith, C. 2004. Human Evolutionary

Genetics: Origins, People and Disease. Abingdon: Garland Science.

168. Jobling, M. A. & Tyler-Smith, C. 2003. The human Y chromosome: an

evolutionary marker comes of age. Nat.Rev.Genet., 4 (8): 598-612.

169. John, P. R. et al 2003. DNA polymorphism and selection at the melanocortin-1

receptor gene in normally pigmented southern African individuals.

Ann.N.Y.Acad.Sci., 994: 299-306.

170. Johnson, J. A. 2003. Pharmacogenetics: potential for individualized drug therapy

through genetics. Trends Genet., 19 (11): 660-666.

171. Jorde, L. B. et al 1995. Origins and affinities of modern humans: a comparison of

mitochondrial and nuclear genetic data. Am.J.Hum.Genet., 57 (3): 523-538.

172. Jorde, L. B. et al 1997. Microsatellite diversity and the demographic history of

modern humans. Proc.Natl.Acad.Sci.U.S.A, 94 (7): 3100-3103.

173. Jorde, L. B. et al 2000. The distribution of human genetic diversity: a comparison

of mitochondrial, autosomal, and Y-chromosome data. Am.J.Hum.Genet., 66 (3):

979-988.

174. Kaberry, P. M. 1952. Women of the Grassfields. London: HMSO.

175. Kaberry, P. M. 1959. Traditional Politics in Nsaw. Africa, 24 (4): 370.

176. Kaberry, P. M. 1962a. Retainers and Royal Households in the Cameroon

Grasslands. Cahiers D'Etudes Africaines, 3 (10): 282-298.

177. Kaberry, P. M. 1962b. The Date of the Bamun-Banso War 1885-1889. Man, 62

(s220): 140.

178. Kaessmann, H. et al 1999. DNA sequence variation in a non-coding region of low

recombination on the human X chromosome. Nat.Genet., 22 (1): 78-81.

179. Karafet, T. M. et al 2002. High levels of Y-chromosome differentiation among

native Siberian populations and the genetic signature of a boreal hunter-gatherer

way of life. Hum.Biol., 74 (6): 761-789.

180. Kayser, M. et al 1997. Evaluation of Y-chromosomal STRs: a multicenter study.

Int.J.Legal Med., 110 (3): 125-129.

181. Kayser, M. et al 2000. Characteristics and frequency of germline mutations at

microsatellite loci from the human Y chromosome, as revealed by direct

observation in father/son pairs. Am.J.Hum.Genet., 66 (5): 1580-1588.

182. Kayser, S. R. 2007. Pharmacogenomics and the potential for personalized

therapeutics in cardiovascular disease. Prog.Cardiovasc.Nurs., 22 (2): 104-107.

183. Ki-Zerbo, J. 1989. General History of Africa: Methodology and African

Prehistory.: James Currey Ltd.

184. Kimura, M. 1980. A simple method for estimating evolutionary rates of base

substitutions through comparative studies of nucleotide sequences. J.Mol.Evol., 16

(2): 111-120.

185. Kivisild, T. et al 2004. Ethiopian mitochondrial DNA heritage: tracking gene flow

across and around the gate of tears. Am.J.Hum.Genet., 75 (5): 752-770.

186. Kivisild, T. et al 2003. The genetic heritage of the earliest settlers persists both in

Indian tribal and caste populations. Am.J.Hum.Genet., 72 (2): 313-332.

187. Knight, A. et al 2003. African Y chromosome and mtDNA divergence provides

insight into the history of click languages. Curr.Biol., 13 (6): 464-473.

188. Krieter, P. A. et al 1984. Increased biliary GSSG efflux from rat livers perfused

with thiocarbamide substrates for the flavin-containing monooxygenase.

Mol.Pharmacol., 26 (1): 122-127.

189. Krings, M. et al 1999. mtDNA analysis of Nile River Valley populations: A

genetic corridor or a barrier to migration? Am.J.Hum.Genet., 64 (4): 1166-1176.

190. Krueger, S. K. et al 2002. Identification of active flavin-containing

monooxygenase isoform 2 in human lung and characterization of expressed

protein. Drug Metab Dispos., 30 (1): 34-41.

191. Krueger, S. K. et al 2005. Haplotype and functional analysis of four flavin-

containing monooxygenase isoform 2 (FMO2) polymorphisms in Hispanics.

Pharmacogenet.Genomics, 15 (4): 245-256.

192. Krueger, S. K. et al 2004. Differences in FMO2*1 allelic frequency between

Hispanics of Puerto Rican and Mexican descent. Drug Metab Dispos., 32 (12):

1337-1340.

193. Krueger, S. K. & Williams, D. E. 2005. Mammalian flavin-containing

monooxygenases: structure/function, genetic polymorphisms and role in drug

metabolism. Pharmacol.Ther., 106 (3): 357-387.

194. Krueger, S. K. et al 2001. Characterization of expressed full-length and truncated

FMO2 from rhesus monkey. Drug Metab Dispos., 29 (5): 693-700.

195. Lane, A. B. et al 2002. Genetic substructure in South African Bantu-speakers:

evidence from autosomal DNA and Y-chromosome studies.

Am.J.Phys.Anthropol., 119 (2): 175-185.

196. Lanfear, D. E. & McLeod, H. L. 2007. Pharmacogenetics: using DNA to optimize

drug therapy. Am.Fam.Physician, 76 (8): 1179-1182.

197. Lansing, J. S. et al 2007. Coevolution of languages and genes on the island of

Sumba, eastern Indonesia. Proc.Natl.Acad.Sci.U.S.A, 104 (41): 16022-16026.

198. Latham, A. J. H. 1973. Old Calabar, 1600-1891: The impact of the international economy

upon a traditional society. Oxford: Clarendon Press.

199. Lawton, M. P. et al 1994. A nomenclature for the mammalian flavin-containing

monooxygenase gene family based on amino acid sequence identities.

Arch.Biochem.Biophys., 308 (1): 254-257.

200. Lecerf, M. et al 2007. Allele frequencies and haplotypes of eight Y-short tandem

repeats in Bantu population living in Central Africa. Forensic Sci.Int., 171 (2-3):

212-215.

201. Lee, A. C. et al 2004. Molecular evidence for absence of Y-linkage of the Hairy

Ears trait. Eur.J.Hum.Genet., 12 (12): 1077-1079.

202. Li, J. et al 2005. [Analysis and application of SNP and haplotype in the human

genome]. Yi.Chuan Xue.Bao., 32 (8): 879-889.

203. Lichstein, J. 2007. Multiple regression on distance matrices: a multivariate spatial

analysis tool. Vegetatio, 188 (2): 117-131.

204. Liu, H. et al 2006. A geographically explicit genetic model of worldwide human-

settlement history. Am.J.Hum.Genet., 79 (2): 230-237.

205. Livingstone, F. B. 1984. The Duffy blood groups, vivax malaria, and malaria

selection in human populations: a review. Hum.Biol., 56 (3): 413-425.

206. Loktionov, A. et al 2002. Differences in N-acetylation genotypes between

Caucasians and Black South Africans: implications for cancer prevention. Cancer

Detect.Prev., 26 (1): 15-22.

207. Lovell, A. et al 2005. Ethiopia: between Sub-Saharan Africa and western Eurasia.

Ann.Hum.Genet., 69 (Pt 3): 275-287.

208. Lucotte, G. et al 1994. Reduced variability in Y-chromosome-specific haplotypes

for some Central African populations. Hum.Biol., 66 (3): 519-526.

209. Luis, J. R. et al 2004. The Levant versus the Horn of Africa: evidence for

bidirectional corridors of human migrations. Am.J.Hum.Genet., 74 (3): 532-544.

210. MacEachern, S. 2000. Genes, tribes, and African history. Current Anthropology,

41 (3): 357-384.

211. Macfarlane, C. & Simmonds, P. 2004. Allelic variation of HERV-K(HML-2)

endogenous retroviral elements in human populations. J.Mol.Evol., 59 (5): 642-

212. Maddison, D. R. 1991. African Origin of Human Mitochondrial DNA

Reexamined. Systematic Zoology, 40 (3): 355-363.

213. Makarenkov, V. & Lapointe, F. J. 2004. A weighted least-squares approach for

inferring phylogenies from incomplete distance matrices. Bioinformatics., 20 (13):

2113-2121.

214. Manfredi, V. 1989. Igboid. In: J. Bendor-Samuel & V. Lanham, eds. University

Press of America. 337-358.

215. Margulies, M. et al 2005. Genome sequencing in microfabricated high-density

picolitre reactors. Nature, 437 (7057): 376-380.

216. Masimirembwa, C. et al 1995. Phenotyping and genotyping of S-mephenytoin

hydroxylase (cytochrome P450 2C19) in a Shona population of Zimbabwe.

Clin.Pharmacol.Ther., 57 (6): 656-661.

217. Masimirembwa, C. et al 1996a. Phenotype and genotype analysis of debrisoquine

hydroxylase (CYP2D6) in a black Zimbabwean population. Reduced enzyme

activity and evaluation of metabolic correlation of CYP2D6 probe drugs.

218. Masimirembwa, C. et al 1996b. A novel mutant variant of the CYP2D6 gene

(CYP2D6*17) common in a black African population: association with

diminished debrisoquine hydroxylase activity. Br.J.Clin.Pharmacol., 42 (6): 713-

219. Masimirembwa, C. M. et al 1993. Genetic polymorphism of cytochrome P450

CYP2D6 in Zimbabwean population. Pharmacogenetics, 3 (6): 275-280.

220. Mateu, E. et al 1997. A tale of two islands: population history and mitochondrial

DNA sequence variation of Bioko and Sao Tome, Gulf of Guinea.

Ann.Hum.Genet., 61 (Pt 6): 507-518.

221. Mehlotra, R. K. et al 2006. Prevalence of CYP2B6 alleles in malaria-endemic

populations of West Africa and Papua New Guinea. Eur.J.Clin.Pharmacol., 62

(4): 267-275.

222. Mellars, P. 2006. Why did modern human populations disperse from Africa ca.

60,000 years ago? A new model. Proc.Natl.Acad.Sci.U.S.A, 103 (25): 9381-9386.

223. Michalakis, Y. & Excoffier, L. 1996. A generic estimation of population

subdivision using distances between alleles with special reference for

microsatellite loci. Genetics, 142 (3): 1061-1064.

224. Migliano, A. B., Vinicius, L. & Lahr, M. M. 2007. Life history trade-offs explain

the evolution of human pygmies. Proc.Natl.Acad.Sci.U.S.A, 104 (51): 20216-

20219.

225. Mirghani, R. A. et al 2006. CYP3A5 genotype has significant effect on quinine 3-

hydroxylation in Tanzanians, who have lower total CYP3A activity than a

Swedish population. Pharmacogenet.Genomics, 16 (9): 637-645.

226. Mitchelson, K. R. 2003. The use of capillary electrophoresis for DNA

polymorphism analysis. Mol.Biotechnol., 24 (1): 41-68.

227. Mitnik, L. et al 2001. Recent advances in DNA sequencing by capillary and

microdevice electrophoresis. Electrophoresis, 22 (19): 4104-4117.

228. Modiano, D. et al 2001. HLA class I in three West African ethnic groups: genetic

distances from sub-Saharan and Caucasoid populations. Tissue Antigens, 57 (2):

128-137.

229. Mueller, J. C. & Andreoli, C. 2004. Plotting haplotype-specific linkage

disequilibrium patterns by extended haplotype homozygosity. Bioinformatics., 20

(5): 786-787.

230. Myers, S. et al 2005. A fine-scale map of recombination rates and hotspots across

the human genome. Science, 310 (5746): 321-324.

231. Mzeka, N. P. 1978. The Core Culture of Nso'. Agawam, Ma.: Jerome Radin Co.

232. Mzeka, N. P. 1990. Four Fons of Nso': Nineteenth and Early Twentieth Century

Kingship in the Western Grassfields of Cameroon. Bamenda Cameroon: The

Spider Publishing Enterprise.

233. Nasidze, I. et al 2004. Mitochondrial DNA and Y-chromosome variation in the

caucasus. Ann.Hum.Genet., 68 (Pt 3): 205-221.

234. Neal, R. A. & Halpert, J. 1982. Toxicology of thiono-sulfur compounds.

Annu.Rev.Pharmacol.Toxicol., 22: 321-339.

235. Nei, M. 1987. Molecular Evolutionary Genetics.: Columbia University Press.

236. Nei, M. & Ota, T. 1991. Evolutionary relationships of human populations at the

molecular level. In: S. Osowa & T. Honjo, eds., Evolutions of life. Tokyo:

Springer. 415-428.

237. Nei, M. & Roychoudhury, A. K. 1993. Evolutionary relationships of human

populations on a global scale. Mol.Biol.Evol., 10 (5): 927-943.

238. Neumann, K., Kalheber, S. & Uebel, D. 1998. Remains of woody plants from

Saouga, a medieval west African village. Vegetation History and Archaeobotany,

7: 57-77.

239. Niu, T. 2004. Algorithms for inferring haplotypes. Genet.Epidemiol., 27 (4): 334-

240. Noah, M. E. 1980. Old Calabar: The City States and the Europeans. Calabar:

Scholars' Press.

241. O'Grady, R. T. et al 1989. Genes and tongues. Science, 243 (4899): 1651.

242. Olerup, O. et al 1991. HLA-DR and -DQ gene polymorphism in West Africans is

twice as extensive as in north European Caucasians: evolutionary implications.

Proc.Natl.Acad.Sci.U.S.A, 88 (19): 8480-8484.

243. Olivieri, A. et al 2006. The mtDNA legacy of the Levantine early Upper

Palaeolithic in Africa. Science, 314 (5806): 1767-1770.

244. Onderwater, R. C. et al 1999. Activation of microsomal glutathione S-transferase

and inhibition of cytochrome P450 1A1 activity as a model system for detecting

protein alkylation by thiourea-containing compounds in rat liver microsomes.

Chem.Res.Toxicol., 12 (5): 396-402.

245. Oscarson, M. et al 1997. A combination of mutations in the CYP2D6*17

(CYP2D6Z) allele causes alterations in enzyme function. Mol.Pharmacol., 52 (6):

1034-1040.

246. Page, R. D. 1996. TreeView: an application to display phylogenetic trees on

personal computers. Comput.Appl.Biosci., 12 (4): 357-358.

247. Panserat, S. et al 1999. CYP2D6 polymorphism in a Gabonese population:

contribution of the CYP2D6*2 and CYP2D6*17 alleles to the high prevalence of

the intermediate metabolic phenotype. Br.J.Clin.Pharmacol., 47 (1): 121-124.

248. Parfitt, T. 1997. Journey to the vanished city. London: Phoenix.

249. Parra, E. J. et al 1998. Estimating African American admixture proportions by use

of population-specific alleles. Am.J.Hum.Genet., 63 (6): 1839-1851.

250. Passarino, G. et al 1998. Different genetic components in the Ethiopian

population, identified by mtDNA and Y-chromosome polymorphisms.

Am.J.Hum.Genet., 62 (2): 420-434.

251. Patin, E. et al 2006. Sub-Saharan African coding sequence variation and haplotype

diversity at the NAT2 gene. Hum.Mutat., 27 (7): 720.

252. Penzak, S. R. et al 2007. Cytochrome P450 2B6 (CYP2B6) G516T influences

nevirapine plasma concentrations in HIV-infected patients in Uganda. HIV.Med.,

8 (2): 86-91.

253. Pereira, L. et al 2002. Bantu and European Y-lineages in Sub-Saharan Africa.

Ann.Hum.Genet., 66 (Pt 5-6): 369-378.

254. Pereira, L. et al 2001. Prehistoric and historic traces in the mtDNA of

Mozambique: insights into the Bantu expansions and the slave trade.

Ann.Hum.Genet., 65 (Pt 5): 439-458.

255. Persson, I. et al 1996. S-mephenytoin hydroxylation phenotype and CYP2C19

genotype among Ethiopians. Pharmacogenetics, 6 (6): 521-526.

256. Pesole, G. et al 1992. The evolution of the mitochondrial D-loop region and the

origin of modern man. Mol.Biol.Evol., 9 (4): 587-598.

257. Phillips, I. R. et al 1995. The molecular biology of the flavin-containing

monooxygenases of man. Chem.Biol.Interact., 96 (1): 17-32.

258. Piron, P. 1995a. Classification interne du groupe bantoïde. Université Libre de

Bruxelles.

259. Piron, P. 1995b. Identification lexicostatistique des groupes bantoïdes stables.

Journal of West African Languages, 25 (2): 3-39.

260. Piron, P. 1998. Internal classification of the Bantoid language group, with special

focus on the relation between Narrow Bantu, Southern Bantoid and Northern

Bantoid. In: Language History and Linguistic Description in Africa. Trenton, N.J.:

Africa World Press. 65-74.

261. Plaza, S. et al 2004. Insights into the western Bantu dispersal: mtDNA lineage

analysis in Angola. Hum.Genet., 115 (5): 439-447.

262. Poloni, E. S. et al 1997. Human genetic affinities for Y-chromosome P49a,f/TaqI

haplotypes show strong correspondence with linguistics. Am.J.Hum.Genet., 61

(5): 1015-1035.

263. Price, D. 1979. Who are the Tikar now? Paideuma, 25: 89-98.

264. Price, E. W. & Carbone, I. 2005. SNAP: workbench management tool for

evolutionary population genetic analysis. Bioinformatics., 21 (3): 402-404.

265. Prugnolle, F., Manica, A. & Balloux, F. 2005. Geography predicts neutral genetic

diversity of human populations. Curr.Biol., 15 (5): R159-R160.

266. Qian, L. & Ortiz de Montellano, P. R. 2006. Oxidative activation of thiacetazone

by the Mycobacterium tuberculosis flavin monooxygenase EtaA and human

FMO1 and FMO3. Chem.Res.Toxicol., 19 (3): 443-449.

267. Quaranta, S. et al 2006. Ethnic differences in the distribution of CYP3A5 gene

polymorphisms. Xenobiotica, 36 (12): 1191-1200.

268. Quintana-Murci, L. et al 1999. Genetic evidence of an early exit of Homo sapiens

sapiens from Africa through eastern Africa. Nat.Genet., 23 (4): 437-441.

269. Ramsay, M. & Jenkins, T. 1988. Alpha-globin gene cluster haplotypes in the

Kalahari San and southern African Bantu-speaking blacks. Am.J.Hum.Genet., 43

(4): 527-533.

270. Rando, J. C. et al 1998. Mitochondrial DNA analysis of northwest African

populations reveals genetic exchanges with European, near-eastern, and sub-

Saharan populations. Ann.Hum.Genet., 62 (Pt 6): 531-550.

271. Ray, N. et al 2005. Recovering the geographic origin of early modern humans by

realistic and spatially explicit simulations. Genome Research, 15 (8): 1161-1167.

272. Raymond, M. & Rousset, F. 1995. An Exact Test for Population Differentiation.

Evolution, 49 (6): 1280-1283.

273. Reed, F. A. & Tishkoff, S. A. 2006. African human diversity, origins and

migrations. Curr.Opin.Genet.Dev., 16 (6): 597-605.

274. Reed, T. E. 1969. Caucasian genes in American Negroes. Science, 165 (895): 762-

275. Reich, D. E. & Goldstein, D. B. 1998. Genetic evidence for a Paleolithic human

population expansion in Africa. Proc.Natl.Acad.Sci.U.S.A, 95 (14): 8119-8123.

276. Reinbold, H. 2007. [Ethnic background related pharmacological differences].

MMW.Fortschr.Med., 149 (42): 34, 36.

277. Relethford, J. H. & Harpending, H. C. 1994. Craniometric variation, genetic

theory, and modern human origins. Am.J.Phys.Anthropol., 95 (3): 249-270.

278. Relethford, J. H. & Jorde, L. B. 1999. Genetic evidence for larger African

population size during recent human evolution. Am.J.Phys.Anthropol., 108 (3):

251-260.

279. Renfrew, C. 1992. Archaeology, genetic and linguistic diversity. Man, 27 (3):

445-478.

280. Renfrew, C., McMahon, A. & Trask, L. 2000. Time Depth in Historical

Linguistics. Cambridge, England: The McDonald Institute for Archaeological

Research.

281. Renquin, J. et al 2001. HLA class II polymorphism in Aka Pygmies and Bantu

Congolese and a reassessment of HLA-DRB1 African diversity. Tissue Antigens,

58 (4): 211-222.

282. Reynolds, J., Weir, B. S. & Cockerham, C. C. 1983. Estimation Of The

Coancestry Coefficient: Basis For A Short-Term Genetic Distance. Genetics, 105

(3): 767-779.

283. Richards, M. et al 2000. Tracing European founder lineages in the Near Eastern

mtDNA pool. Am.J.Hum.Genet., 67 (5): 1251-1276.

284. Richards, M. et al 2003. Extensive female-mediated gene flow from sub-Saharan

Africa into near eastern Arab populations. Am.J.Hum.Genet., 72 (4): 1058-1064.

285. Rosa, A. et al 2004. MtDNA profile of West Africa Guineans: towards a better

understanding of the Senegambia region. Ann.Hum.Genet., 68 (Pt 4): 340-352.

286. Rosa, A. et al 2007. Y-chromosomal diversity in the population of Guinea-Bissau:

a multiethnic perspective. BMC.Evol.Biol., 7: 124.

287. Rosser, Z. H. et al 2000. Y-chromosomal diversity in Europe is clinal and

influenced primarily by geography, rather than by language. Am.J.Hum.Genet., 67

(6): 1526-1543.

288. Rower, S. et al 2005. Short communication: high prevalence of the cytochrome

P450 2C8*2 mutation in Northern Ghana. Trop.Med.Int.Health, 10 (12): 1271-

289. Sabeti, P. et al 2002a. CD40L association with protection from severe malaria.

Genes Immun., 3 (5): 286-291.

290. Sabeti, P. C. et al 2002b. Detecting recent positive selection in the human genome

from haplotype structure. Nature, 419 (6909): 832-837.

291. Sabeti, P. C. et al 2006. Positive natural selection in the human lineage. Science,

312 (5780): 1614-1620.

292. Sabeti, P. C. et al 2007. Genome-wide detection and characterization of positive

selection in human populations. Nature, 449 (7164): 913-918.

293. Salas, A. et al 2002. The making of the African mtDNA landscape.

Am.J.Hum.Genet., 71 (5): 1082-1111.

294. Salas, A. et al 2004. The African diaspora: mitochondrial DNA and the Atlantic

slave trade. Am.J.Hum.Genet., 74 (3): 454-465.

295. Sanchez, J. J. et al 2005. High frequencies of Y chromosome lineages

characterized by E3b1, DYS19-11, DYS392-12 in Somali males.

Eur.J.Hum.Genet., 13 (7): 856-866.

296. Sanchez-Mazas, A. 2001. African diversity from the HLA point of view: influence

of genetic drift, geography, linguistics, and natural selection. Hum.Immunol., 62

(9): 937-948.

297. Sands, B. 1998. Eastern and Southern African Khoesan: Evaluating Claims of a

Distant Linguistic Relationship. Quellen zur Khoesan-Forschung 14. Cologne,

Germany: Rüdiger Köppe.

298. Saunders, M. A., Hammer, M. F. & Nachman, M. W. 2002. Nucleotide variability

at G6pd and the signature of malarial selection in humans. Genetics, 162 (4):

1849-1861.

299. Saunders, M. A. et al 2005. The extent of linkage disequilibrium caused by

selection on G6PD in humans. Genetics, 171 (3): 1219-1229.

300. Schadeberg, T. C. 1986. The lexicostatistic base of Bennett & Sterk's

reclassification of Niger-Congo with particular reference to the cohesion of Bantu.

Studies in African Linguistics, 17: 69-83.

301. Schaeffeler, E. et al 2001. Frequency of C3435T polymorphism of MDR1 gene in

African people. Lancet, 358 (9279): 383-384.

302. Scheet, P. & Stephens, M. 2006. A fast and flexible statistical model for large-

scale population genotype data: applications to inferring missing genotypes and

haplotypic phase. Am.J.Hum.Genet., 78 (4): 629-644.

303. Schneider, S., Roessli, D. & Excoffier, L. Arlequin: A software for population

genetics data analysis. [Ver 2.000]. 2000. Genetics and Biometry Lab, Dept. of

Anthropology, University of Geneva.

304. Schuster, S. C. 2008. Next-generation sequencing transforms today's biology.

Nat.Methods, 5 (1): 16-18.

305. Scozzari, R. et al 1999. Combined use of biallelic and microsatellite Y-

chromosome polymorphisms to infer affinities among African populations.

Am.J.Hum.Genet., 65 (3): 829-846.

306. Scozzari, R. et al 1994. Genetic studies in Cameroon: mitochondrial DNA

polymorphisms in Bamileke. Hum.Biol., 66 (1): 1-12.

307. Scozzari, R. et al 1988. Genetic studies on the Senegal population. I.

Mitochondrial DNA polymorphisms. Am.J.Hum.Genet., 43 (4): 534-544.

308. Seielstad, M. et al 1999. A view of modern human origins from Y chromosome

microsatellite variation. Genome Research, 9 (6): 558-567.

309. Seielstad, M. T., Minch, E. & Cavalli-Sforza, L. L. 1998. Genetic evidence for a

higher female migration rate in humans. Nat.Genet., 20 (3): 278-280.

310. Semino, O. et al 2002. Ethiopians and Khoisan share the deepest clades of the

human Y-chromosome phylogeny. Am.J.Hum.Genet., 70 (1): 265-268.

311. Shendure, J. et al 2005. Accurate multiplex polony sequencing of an evolved

bacterial genome. Science, 309 (5741): 1728-1732.

312. Sim, S. C. et al 2006. A common novel CYP2C19 gene variant causes ultrarapid

drug metabolism relevant for the drug response to proton pump inhibitors and

antidepressants. Clin.Pharmacol.Ther., 79 (1): 103-113.

313. Slatkin, M. 1995. A measure of population subdivision based on microsatellite

allele frequencies. Genetics, 139 (1): 457-462.

314. Smith, F. H. 1985. Continuity and change in the origin of modern Homo sapiens.

Z.Morphol.Anthropol., 75 (2): 197-222.

315. Sokal, R. R. & Rohlf, F. J. 1994. Biometry, 3rd edn. New York: W. H. Freeman

and Co.

316. Soodyall, H. et al 1996. mtDNA control-region sequence variation suggests

multiple independent origins of an "Asian-specific" 9-bp deletion in sub-Saharan

Africans. Am.J.Hum.Genet., 58 (3): 595-608.

317. Soranzo, N. et al 2005. Positive selection on a high-sensitivity allele of the human

bitter-taste receptor TAS2R16. Curr.Biol., 15 (14): 1257-1265.

318. Spurdle, A. B. & Jenkins, T. 1996. The origins of the Lemba "Black Jews" of

southern Africa: evidence from p12F2 and other Y-chromosome markers.

Am.J.Hum.Genet., 59 (5): 1126-1133.

319. Steinlechner, M. et al 2002. Gabon black population data on the ten short tandem

repeat loci D3S1358, VWA, D16S539, D2S1338, D8S1179, D21S11, D18S51,

D19S433, TH01 and FGA. Int.J.Legal Med., 116 (3): 176-178.

320. Stephens, M. & Donnelly, P. 2003. A comparison of bayesian methods for

haplotype reconstruction from population genotype data. Am.J.Hum.Genet., 73

(5): 1162-1169.

321. Stephens, M., Smith, N. J. & Donnelly, P. 2001. A new statistical method for

haplotype reconstruction from population data. Am.J.Hum.Genet., 68 (4): 978-

322. Stoneking, M. et al 1997. Alu insertion polymorphisms and human evolution:

evidence for a larger population size in Africa. Genome Research, 7 (11): 1061-

323. Stringer, C. 2002. Modern human origins: progress and prospects.

Philos.Trans.R.Soc.Lond B Biol.Sci., 357 (1420): 563-579.

324. Swadesh, M. 1952. Lexico-statistic dating of prehistoric ethnic contacts.

Proceedings of the American Philosophical Society, 96: 453-463.

325. Swadesh, M. 1955. Towards greater accuracy in lexicostatistic dating.

International Journal of American Linguistics, 21: 121-137.

326. Swen, J. J. et al 2007. Translating pharmacogenomics: challenges on the road to

the clinic. PLoS.Med., 4 (8): e209.

327. Tajima, F. 1989. Statistical method for testing the neutral mutation hypothesis by

DNA polymorphism. Genetics, 123 (3): 585-595.

328. Takahata, N., Lee, S. H. & Satta, Y. 2001. Testing multiregionality of modern

human origins. Mol.Biol.Evol., 18 (2): 172-183.

329. Talbot, P. A. 1912. In the Shadow of the Bush. London: Heinemann.

330. Tambets, K. et al 2004. The western and eastern roots of the Saami--the story of

genetic "outliers" told by mitochondrial DNA and Y chromosomes.

Am.J.Hum.Genet., 74 (4): 661-682.

331. Tang, K. et al 2004. Genomic evidence for recent positive selection at the human

MDR1 gene locus. Hum.Mol.Genet., 13 (8): 783-797.

332. Tardits, C. 1980. Le Royaume Bamoun. Paris: Librairie Armand Colin.

333. Tayeb, M. T. et al 2000. CYP3A4 promoter variant in Saudi, Ghanaian and

Scottish Caucasian populations. Pharmacogenetics, 10 (8): 753-756.

334. Templeton, A. 2002. Out of Africa again and again. Nature, 416 (6876): 45-51.

335. Templeton, A. R. 1997. Out of Africa? What do genes tell us?

Curr.Opin.Genet.Dev., 7 (6): 841-847.

336. Templeton, A. R. 2005. Haplotype trees and modern human origins.

Am.J.Phys.Anthropol., Suppl 41: 33-59.

337. Templeton, A. R. 2007. Genetics and recent human evolution. Evolution

Int.J.Org.Evolution, 61 (7): 1507-1519.

338. Tenesa, A. et al 2007. Recent human effective population size estimated from

linkage disequilibrium. Genome Research, 17 (4): 520-526.

339. Terreros, M. C., Martinez, L. & Herrera, R. J. 2005. Polymorphic Alu insertions

and genetic diversity among African populations. Hum.Biol., 77 (5): 675-704.

340. The Y Chromosome Consortium 2002. A Nomenclature System for the Tree of

Human Y-Chromosomal Binary Haplogroups. Genome Research, 12 (2): 339-

341. Thomas, M. G. et al 2007. New genetic evidence supports isolation and drift in the

Ladin communities of the South Tyrolean Alps but not an ancient origin in the

Middle East. Eur.J.Hum.Genet.

342. Thomas, M. G., Bradman, N. & Flinn, H. M. 1999. High throughput analysis of

10 microsatellite and 11 diallelic polymorphisms on the human Y-chromosome.

Hum.Genet., 105 (6): 577-581.

343. Thomas, M. G. et al 2000. Y chromosomes traveling south: the Cohen modal

haplotype and the origins of the Lemba--the "Black Jews of Southern Africa".

Am.J.Hum.Genet., 66 (2): 674-686.

344. Thomas, M. G. et al 2002. Founding mothers of Jewish communities:

geographically separated Jewish groups were independently founded by very few

female ancestors. Am.J.Hum.Genet., 70 (6): 1411-1420.

345. Thompson, R. F. 1983. Flash of the Spirit: African & Afro-American art &

philosophy. New York: Vintage.

346. Thorne, A. G. & Wolpoff, M. H. 1981. Regional continuity in Australasian

Pleistocene hominid evolution. Am.J.Phys.Anthropol., 55 (3): 337-349.

347. Tishkoff, S. A. et al 1996. Global patterns of linkage disequilibrium at the CD4

locus and modern human origins. Science, 271 (5254): 1380-1387.

348. Tishkoff, S. A. & Kidd, K. K. 2004. Implications of biogeography of human

populations for 'race' and medicine. Nat.Genet., 36 (11 Suppl): S21-S27.

349. Tishkoff, S. A. et al 2007. Convergent adaptation of human lactase persistence in

Africa and Europe. Nat.Genet., 39 (1): 31-40.

350. Tishkoff, S. A. et al 2001. Haplotype diversity and linkage disequilibrium at

human G6PD: recent origin of alleles that confer malarial resistance. Science, 293

(5529): 455-462.

351. Tishkoff, S. A. & Williams, S. M. 2002. Genetic analysis of African populations:

human evolution and complex disease. Nat.Rev.Genet., 3 (8): 611-621.

352. Tofanelli, S. et al 2003. Variation at 16 STR loci in Rwandans (Hutu) and

implications on profile frequency estimation in Bantu-speakers. Int.J.Legal Med.,

117 (2): 121-126.

353. Tomas, G. et al 2002. The peopling of Sao Tome (Gulf of Guinea): origins of

slave settlers and admixture with the Portuguese. Hum.Biol., 74 (3): 397-411.

354. Torroni, A. et al 2006. Harvesting the fruit of the human mtDNA tree. Trends

Genet., 22 (6): 339-345.

355. Torroni, A. et al 2000. mtDNA haplogroups and frequency patterns in Europe.

Am.J.Hum.Genet., 66 (3): 1173-1177.

356. Trask, L. 1997. The History of Basque. London: Routledge.

357. Trovoada, M. J. et al 2001. Evidence for population sub-structuring in Sao Tome e

Principe as inferred from Y-chromosome STR analysis. Ann.Hum.Genet., 65 (Pt

3): 271-283.

358. Trovoada, M. J. et al 2004. Pattern of mtDNA variation in three populations from

Sao Tome e Principe. Ann.Hum.Genet., 68 (Pt 1): 40-54.

359. Trovoada, M. J. et al 2007. Dissecting the genetic history of Sao Tome e Principe:

a new window from Y-chromosome biallelic markers. Ann.Hum.Genet., 71 (Pt 1):

77-85.

360. Udo, E. A. 1983. Who are the Ibibio? Onitsha: Africana-FEP Publishers.

361. Underhill, P. A. et al 2001. The phylogeography of Y chromosome binary

haplotypes and the origins of modern human populations. Ann.Hum.Genet., 65 (Pt

1): 43-62.

362. Underhill, P. A. et al 2000. Y chromosome sequence variation and the history of

human populations. Nat.Genet., 26 (3): 358-361.

363. Uya, O. E. 1984. A History of the Oron People. Oron: Manson.

364. Vannelli, T. A., Dykman, A. & Ortiz de Montellano, P. R. 2002. The

antituberculosis drug ethionamide is activated by a flavoprotein monooxygenase.

J.Biol.Chem., 277 (15): 12824-12829.

365. Vansina, J. 1990. Paths in the Rainforests: Toward a History of Political Tradition

in Equatorial Africa. The University of Wisconsin Press.

366. Vansina, J. 1995. New Linguistic Evidence and the Bantu Expansion. Journal of

African History, 36 (2): 173-195.

367. Verrelli, B. C. & Tishkoff, S. A. 2004. Signatures of selection and gene

conversion associated with human color vision variation. Am.J.Hum.Genet., 75

(3): 363-375.

368. Vigilant, L. et al 1991. African populations and the evolution of human

mitochondrial DNA. Science, 253 (5027): 1503-1507.

369. Vizirianakis, I. S. 2004. Challenges in current drug delivery from the potential

application of pharmacogenomics and personalized medicine in clinical practice.

Curr.Drug Deliv., 1 (1): 73-80.

370. Voight, B. F. et al 2006. A map of recent positive selection in the human genome.

PLoS.Biol., 4 (3): e72.

371. Wainscoat, J. S. et al 1986. Evolutionary relationships of human populations from

an analysis of nuclear DNA polymorphisms. Nature, 319 (6053): 491-493.

372. Wall, J. D. & Hammer, M. F. 2006. Archaic admixture in the human genome.

Curr.Opin.Genet.Dev., 16 (6): 606-610.

373. Walsh, E. C. et al 2006. Searching for signals of evolutionary selection in 168

genes related to immune function. Hum.Genet., 119 (1-2): 92-102.

374. Watkins, W. S. et al 2001. Patterns of ancestral human diversity: an analysis of

Alu-insertion and restriction-site polymorphisms. Am.J.Hum.Genet., 68 (3): 738-

375. Watson, E. et al 1996. mtDNA sequence diversity in Africa. Am.J.Hum.Genet., 59

(2): 437-444.

376. Watson, E. et al 1997. Mitochondrial footprints of human expansions in Africa.

Am.J.Hum.Genet., 61 (3): 691-704.

377. Watterson, G. A. 1975. On the number of segregating sites in genetical models

without recombination. Theor.Popul.Biol., 7 (2): 256-276.

378. Weidenreich, F. 1946. Apes, Giants and Men. Chicago: University of Chicago

Press.

379. Weinshilboum, R. 2003. Inheritance and drug response. N.Engl.J.Med., 348 (6):

529-537.

380. Weng, Z. & Sokal, R. R. 1995. Origins of Indo-Europeans and the spread of

agriculture in Europe: comparison of lexicostatistical and genetic evidence.

Hum.Biol., 67 (4): 577-594.

381. Wennerholm, A. et al 2002. The African-specific CYP2D6*17 allele encodes an

enzyme with changed substrate specificity. Clin.Pharmacol.Ther., 71 (1): 77-88.

382. Wennerholm, A. et al 2001. Characterization of the CYP2D6*29 allele commonly

present in a black Tanzanian population causing reduced catalytic activity.

383. Wennerholm, A. et al 1999. Decreased capacity for debrisoquine metabolism

among black Tanzanians: analyses of the CYP2D6 genotype and phenotype.

384. Wessel, P. & Smith, W. 1998. New, improved version of Generic Mapping Tools

released. EOS Transactions, 79 (47): 579.

385. Whetstine, J. R. et al 2000. Ethnic differences in human flavin-containing

monooxygenase 2 (FMO2) polymorphisms: detection of expressed protein in

African-Americans. Toxicol.Appl.Pharmacol., 168 (3): 216-224.

386. Wilder, J. A. et al 2004. Global patterns of human mitochondrial DNA and Y-

chromosome structure are not influenced by higher migration rates of females

versus males. Nat.Genet., 36 (10): 1122-1125.

387. Wilke, R. A. et al 2007. Identifying genetic risk factors for serious adverse drug

reactions: current progress and challenges. Nat.Rev.Drug Discov., 6 (11): 904-

388. Williamson, K. & Blench, R. 2000. Niger-Congo. Cambridge: Cambridge

University Press. 11-42.

389. Wills, C. 1992. Human origins. Nature, 356 (6368): 389-390.

390. Wilson, J. F. et al 2001. Population genetic structure of variable drug response.

Nat.Genet., 29 (3): 265-269.

391. Witherspoon, D. J. et al 2006. Human population genetic structure and diversity

inferred from polymorphic L1(LINE-1) and Alu insertions. Hum.Hered., 62 (1):

30-46.

392. Wojnowski, L. et al 2004. Increased levels of aflatoxin-albumin adducts are

associated with CYP3A5 polymorphisms in The Gambia, West Africa.

393. Wood, E. T. et al 2005. Contrasting patterns of Y chromosome and mtDNA

variation in Africa: evidence for sex-biased demographic processes.

Eur.J.Hum.Genet., 13 (7): 867-876.

394. Xue, Y. et al 2006. Spread of an inactive form of caspase-12 in humans is due to

recent positive selection. Am.J.Hum.Genet., 78 (4): 659-670.

395. Yu, N. et al 2002. Larger genetic differences within Africans than between

Africans and Eurasians. Genetics, 161 (1): 269-274.

396. Yu, N. et al 2001. Global patterns of human DNA sequence variation in a 10-kb

region on chromosome 1. Mol.Biol.Evol., 18 (2): 214-222.

397. Yueh, M. F., Krueger, S. K. & Williams, D. E. 1997. Pulmonary flavin-containing

monooxygenase (FMO) in rhesus macaque: expression of FMO2 protein, mRNA

and analysis of the cDNA. Biochim.Biophys.Acta, 1350 (3): 267-271.

398. Zalloua, P. A. et al 2008. Y-chromosomal diversity in Lebanon is structured by

recent historical events. Am.J.Hum.Genet., 82 (4): 873-882.

399. Zegura, S. L. et al 2004. High-resolution SNPs and microsatellite haplotypes point

to a single, recent entry of Native American Y chromosomes into the Americas.

Mol.Biol.Evol., 21 (1): 164-175.

400. Zeigler-Johnson, C. M. et al 2002. Ethnic differences in the frequency of prostate

cancer susceptibility alleles at SRD5A2 and CYP3A4. Hum.Hered., 54 (1): 13-21.

401. Zeitlyn, D. & Connell, B. 2003. Ethnogenesis and Fractal History on the African

Frontier: Mambila-Njerep-Mandulu. Journal of African History, 44 (1): 117-138.

402. Zekraoui, L. et al 1997. High frequency of the apolipoprotein E *4 allele in

African pygmies and most of the African populations in sub-Saharan Africa.

Hum.Biol., 69 (4): 575-581.

403. Zhao, Z. et al 2000. Worldwide DNA sequence variation in a 10-kilobase

noncoding region on human chromosome 22. Proc.Natl.Acad.Sci.U.S.A, 97 (21):

11354-11358.

404. Zhivotovsky, L. A. et al 2004. The effective mutation rate at Y chromosome short

tandem repeats, with application to human population-divergence time.

Am.J.Hum.Genet., 74 (1): 50-61.
