Post on 07-Mar-2018
Prediction of Functional Quantitative Traits in Plants using a Whole-Genome Population-
Based ApproachTatiana Tatarinova
What is the SNP
• SINGLE • NUCLEOTIDE • POLIMORPHISM
Human genome has about 3 billion nucleotides Two randomly selected individuals differ by
<0.5% of a genome (<15 mil)
Why do we care about SNPs?
• History • Personalized medicine • Clinical trials • Forensics • Agriculture and breeding • Monitoring rare species
DNA
…TCAGGTCACAGTCT…
…TCAGGTCACAGTCT……TCAGGCCACAGTCT……TCAGGCCACAGTCT…
Individual 1
Individual 2
Individual 3
DNA
Reference sequence
Katarzyna Bryc, 2013
Single Nucleotide Polymorphism (SNP)
…TCAGGTCACAGTCT……TCAGGCCACAGTCT……TCAGGCCACAGTCT…
SNP A.k.a. allele, locus, marker, variant
Katarzyna Bryc, 2013
Pigmentation example - SLC45A2
ALFRED: The ALlele FREquency Database
http://alfred.med.yale.edu/alfred/Katarzyna Bryc, 2013
Genotyping Sequencing
Potentially, every position on a genome is studied However, quality is variable, lower than for the SNP chip (0.1% error is achieved for 75% of bases) Some areas require read depths of 100 or more.
Large, but limited number of high-quality calls. 1 million SNPs can be genotyped for $100 Error rate (wrong calls) <1% (reported by 23 and me)
Geographic Population Structure prediction approach
Infer position of the origin of an individual from whole-genome collection of ancestry-informative markers
National Genographic Project
• Goal: Assemble world-wide collection of reference genotypes • How is it done?
Discoveries
• Peruvians from Lima were not typical Peruvians • Many self-identified Keto were not Keto
–Median Keto component in self-identified Keto is 71%
–Only 13 of 67 have over 90% of Keto ancestry (pure Keto)
• Why is it so difficult?
How to collect data
• Approach village nurse, elders, shamans • Request to select unrelated people of “pure” origin, who remember
their grandparents • Distribute a survey requesting information about place of birth and
ethnicity of the individual, his/her parents and grandparents
All >75% <=75%
• In pairs
– Obtain informed consent • Find out:
– Parents’ names – Parents’ dates of birth – Parents’ place of birth – Parents’ ethnicities
• Then:
– Grandparents’ names – Grandparents’ dates of birth – Grandparents’ place of birth – Grandparents’ ethnicities
Exercise
How to find Ancestry Informative Markers (AIMs)?
• How to find Ancestry Informative Markers? • For each position on the genome, we have information from K
populations. Each locus can have up to N alleles. • In - informativeness for assignment. For a given set of populations,
the minimal of In=0 occurs when all alleles have equal frequencies in all populations. The maximal value, log(K), occurs when N > K and no allele is found in more than one population
Example: K=2 and N=2
How to find Ancestry Informative Markers (AIMs)?
Example: K=2 and N=2. Let’s assume that we have A or C in the selected position. 1. Informative marker p1A=1, p1C=0, p2A=0, p2C=1
PA=PC=0.5
Informativeness=(-0.5log2(0.5)+0-0.5log2(0.5)+0)=1=log2(K)
1. Non-Informative marker p1A=0.5, p1C=0.5, p2A=0.5, p2C=0.5
PA=PC=0.5
Informativeness=(-0.5log2(0.5)-0.5log2(0.5) +4*(0.5*log2(0.5)/2)) =1-1=0
To infer population structure from genotype data, it is necessary to first reduce the dimensionality of the dataset due to the thousands of SNPs it encompasses.
From SNPs to Admixture
Thousands of ancestry informative SNPs
North East Asian Mediterranian South African
South West Asian
Native American Oceanian
South East Asian
Northern European
Sub-Saharan African
HGDP00985 0.5253 0.0202 0 0.2222 0.0404 0.0101 0.0101 0.1717 0
HGDP01094 0.04 0.04 0 0.03 0.83 0 0.01 0.05 0
HGDP00982 0.0102 0.1531 0.0306 0.0714 0.0408 0 0.0102 0.2041 0.4796
ADMIXTURE
Admixture proportions in geographically adjacent populations, such as Italian and Greeks, and populations sharing similar history, like British and Germans, are similar.
36
Selection of K
Identify the best K value as judged by prediction of systematically withheld data points
Input: genetics
Samples of known origin
40
SAMPLE ID
NORTH EASTASIAN
MEDITERRANIAN
SOUTH AFRICA
SOUTH WEST ASIAN
NATIVE AMERICAN OCEANIAN
SOUTH EAST ASIA
NORTHERNEUROPEAN
SUB-SAHARANAFRICA
Chinese 1 0.718826 0.000419 0.00001 0.00001 0.00001 0.00001 0.280695 0.00001 0.00001Chinese 2 0.734967 0.00001 0.00001 0.00001 0.001061 0.00001 0.263912 0.00001 0.00001Chinese 3 0.74693 0.00001 0.00001 0.00001 0.010271 0.003244 0.239505 0.00001 0.00001Chinese 4 0.671209 0.00001 0.00001 0.00001 0.00001 0.00001 0.328721 0.00001 0.00001Chinese 5 0.725614 0.00001 0.00001 0.00001 0.00001 0.00001 0.274316 0.00001 0.00001Chinese 6 0.72071 0.00001 0.00001 0.001098 0.01665 0.00001 0.261492 0.00001 0.00001Chinese 7 0.695701 0.00001 0.00001 0.00001 0.00001 0.00001 0.304229 0.00001 0.00001Chinese 8 0.709767 0.00001 0.00001 0.00001 0.00001 0.00001 0.290163 0.00001 0.00001Chinese 9 0.715808 0.01056 0.00001 0.00001 0.00001 0.00001 0.273572 0.00001 0.00001Chinese 10 0.732043 0.00001 0.00001 0.00001 0.012694 0.00001 0.255203 0.00001 0.00001Chinese 11 0.655995 0.00001 0.00001 0.00001 0.00001 0.00001 0.343935 0.00001 0.00001Chinese 12 0.712607 0.00001 0.00001 0.00001 0.00001 0.00001 0.287323 0.00001 0.00001
For every reference population, calculate mean admixture vectors
NORTH EAST ASIA
MEDI-TERRANIA
SOUTH AFRICA
SOUTH WEST ASIA
NATIVE AMERICA OCEANIA
SOUTH EAST ASIA
NORTHERN EUROPE
SUB-SAHARANAFRICA
Chinese 0.711681 0.000923 1.00E-05 0.000101 0.003396 0.00028 0.283589 1.00E-05 1.00E-05
Russian 0.068867 0.265222 0.001241 0.224659 0.035011 0.008622 0.031844 0.363107 0.001419
Tatar 0.15794 0.209897 1.00E-05 0.210957 0.011902 0.002605 0.005703 0.400975 1.00E-05
41
Input: geography
• For every reference population find the corresponding coordinates.
Latitude Longitude
Chinese 39.55 116.2
Russian 55.75 37.62
Tatar 55.55 50.93
Moscow
42
Prediction of provenance
• Knowing relationship between geographic and genetic distances, is it possible to find a geographic origin of a person of known genotype?
• We decided to try a simple approach
43
A B
X
Dealing with individuals of unknown origin
NORTH EAST ASIA
MEDI-TERRANIA
SOUTH AFRICA
SOUTH WEST ASIA
NATIVE AMERICA OCEANIA
SOUTH EAST ASIA
NORTHERN EUROPE
SUB-SAHARANAFRICA
Unknown 0.711681 0.000923 1.00E-05 0.000101 0.003396 0.00028 0.283589 1.00E-05 1.00E-05
• Find distances between the Unknown vector and all reference vectors
• Sort reference populations by distance from smallest to largest
44
GPS accurately assigned: • ~100% of all individuals to their continental regions • 80% of all individuals to their country of origin • 60% of all individuals to their inner-country region
46
.
GPS achieved high assignment accuracy for Southeast Asian and Oceanian populations (87.5%) and subpopulations (77%).
We traced the origins of Remote Oceanian populations to Near Oceania.
Reich et al. (2011). Denisova admixture and the first modern human dispersals into Southeast Asia and Oceania, AJHG
243 Southeast Asian and Oceanian samples
48
Prediction of bio-origin
• Knowing relationship between geographic and genetic distances, is it possible to find a geographic origin of a person of known genotype?
• We decided to try a simple approach
50
A B
X
Application to Arabidopsis
51
•Percent of populations that are perfectly mapped: 60% •Average distance to correct population: 200 km
■Legumes (fabaceae) plant family Ability to establish nodule symbiosis High synteny to legume crops
■Small diploid genome (500Mb) sequenced Tang et al., 2014 (BMC Genomics)
■Numerous genomic resources • RIL populations • Mutant collections • Large EST and RNASEQ databases • HAPMAP resources Branca et al., 2011
(PNAS) • 288 genomes already sequenced
■Biotechnology tools for validation ■Large biodiversity Gentzbittel et al., 2015 (Front. Pl. Sci)
Medicago truncatula
M. truncatula - a model legume plant
01/08/2017
Grain Legumes Forage Legumes
• Alfalfa • Clover
• Soybean • Chickpea • Bean
High agricultural
value
• Pea • Peanut • Lentil
Grain Legumes are very rich in proteins (Pea: 3x more than cereals)
Photo: B.Julier
Legumes comprises major agricultural species for human and animal nutrition
HapMap data for M. truncatulaCurrent state: 288 sequenced accessions
# SNPs Chr1 : 1.508.346 Chr2 : 1.964.419 Chr3 : 2.472.365 Chr4 : 2.225.537 Chr5 : 3.251.290 Chr6 : 1.172.980 Chr7 : 2.262.094 Chr8 : 1.659.690
Total : 16.516.721 ~ 40% of SNPs in coding regions
18 sister taxa
Here are the data available for the 288 Medicago accessions that have been already sequenced leading to a total number of more than 16.5 million SNP. Among those, 18 accessions which highly diverged from others have been re-classified to some sister taxa despite perfect botanic similarities
56
M. truncatula is spontaneously occurring all around the Mediterranean basin
Several collections are available: DZ, TN, FR , AU, US, …. for thousands of accessions
Photo: A.Abdelguerfi, INA, Alger, AlgeriaPhoto: J-M Prospéri, INRA Montpellier, France
Photo: J-M Prospéri, INRA Montpellier, France Photo: E.Aouani, CBBC, Tunisia Photo: L. Gentzbittel, ENSAT, Toulouse, France
Photo: L. Gentzbittel, ENSAT, Toulouse, France
57
NSF HapMap project
Current state: 288 sequenced accessions Total: 16.516.721 SNPs 30 lines at > 20XThe rest at least 5X
Branca et al., 2011 (PNAS)
58
PCA identifies two Medicago sub-populations among 262 accessions
Agrees with results of Bonhomme et al. 2014
40,000 randomly chosen SNPs
A refined population structure for Medicago species using GPS algorithm
Medicago is likely to be structured in 8 ancestral genomesElhaik, E., Tatarinova, T., et al. 2014. Geographic population structure analysis of worldwide human populations infers their biogeographical origins. Nat Commun 5, article 313
Fig. 2: Predicted distance from true origin for each M. truncatula accession using t h e l e a v e - o n e - o u t procedure, for increasing values of K.
'leave-one-out' test for assignation of accessions to actual populations, using inferred ancestral genomes proportions.
840K randomly selected SNPs (~100,000 per chromosome)
Admixture analysis
Choice of # putative ancestral populations
Fig. 1: BIC as a function of increasing values of K, using Discriminant Analysis of Principal Components (DAPC) applied on the 8 4 0 K S N P d a t a s e t . (Jombart et al 2010)
K=8
Examples of GPS analysis with various levels of accuracy
'Leave-one Out' procedure accuracy is <300 kilometers
Linear relationship between geographical and genetic distances when d < 950 km (R=0.8, p-value=10-4, Mantel's test)
A refined population structure for Medicago species using GPS algorithm
Mtr is likely to be structured in 8 ancestral genomes
Elhaik, E., Tatarinova, T., et al. 2014. Geographic population structure analysis of worldwide human populations infers their biogeographical origins. Nat Commun 5, article 313
'leave-one-out' test for assignation of accessions to actual populations, using inferred ancestral genomes proportions.
840K randomly selected SNPs
Admixture analysis
Choice of # putative ancestral populations Rare ancestral genome
Complex admixture pattern
Very low Admixture pattern
Frequent ancestral genome
A putative 'ancestral genome‘ is characterized by a specific pattern of allele frequencies
Gentzbittel et al., under review
Is this population structure computed from genetic diversity data consistent with: -geographical distribution? -Bioclimatic variables? -Other biological covariates?
Geographic distribution of the putative eight ancestral genomes of M. truncatula
Spanish coastalAlgiers
North Tunisian coastal Atlas
French
Greek Spanish-Moroccan Inland
South Tunisian coastal
Gentzbittel et al., under review
5 out of 8 populations are present in Mahgreb
Algiers (K1)
Spanish coastal (K2)
North Tunisian coastal (K3)Atlas (K4)
South Tunisian coastal (K5)
French (K6)
Greek (K7)
Spanish-Moroccan Inland (K8)
Maghreb as a putative center of diversity for M. truncatula species
Some ancestral genomes may be adapted to local bio-climatic variables
We looked for putative association with the 19 bio-climatic variables, as defined by WorldClim (http://www.worldclim.org).
The “South Tunisian Coastal" genome is positively correlated to annual mean T°C (BIO1) and Mean T°C of colder quarter (BIO11) and negat ive ly cor re la ted to annua l precipitation (BIO12).
The “Spanish coastal" genome is
negatively correlated to T°C seasonality (BIO4), and T°C annual range (BIO7).
The “Greek" genome is positively correlated to precipitation of the wettest month (BIO13), precipitation seasonality (BIO15), precipitation of the wettest quarter (BIO16) and precipitation of coldest quarter (BIO19).
Spanish coastal
Greek
BIO1: r=0.26*** BIO11: r=-0.24** BIO12: r=--0.28***
BIO4: r=-0.42*** BIO7: r=-0.31***
BIO13: r=0.39*** BIO15: r=0.45*** BIO16: r=0.33*** BIO19: r=0.29***
The “North Tunisian Coastal" genome is not significantly correlated to any bioclimatic covariables.
Climate and geography explain less <50% of the variation in genome admixture component proportions
Other putative explanatory variables?
-soil structure and composition?
-Biotic interactions?
Gentzbittel et al., under review
Verticillium sp. are soil-borne fungal pathogens for Legumes
V. alfalfae symptoms on susceptible M.truncatula
7 dpi
12 dpi
A17 (R) F83005.5 (S)
GFP -expressing V31-2 strain (Ben et al. 2013 ; Toueni et al. in preparation.)
Root pathogens enter the plant through root central cylinder
GFP Rs
GUS staining
Ralstonia solanacearum infection
xy
cwhy
* *
cor
xyhy
Verticillium albo-atrum infectionVailleau et al. 2006 Ben et al. 2012
Ancestral genomes may present different levels of resistance to Verticillium wilt
Many eudicot species and cultivars are resistant to the disease and all monocots, gymnosperms and ferns are immune.
Ancestral genomes present different levels of resistance to Verticillium wilt
Linear Model (242 Mt HAPMAP accessions) MSSCorLS ~ K2 + K5 + K7 + K8
r²=0.31, P-value < 2.2 x 10-16
Can phenotype be predicted using GPS?
• Use plant model (Medicago truncatula) to test applicability of admixture framework to predict phenotypes.
• M. truncatula was infected with a fungal pathogen V. alfalfae
• MSS (Maximum symptom score) ranges from 0 (healthy) to 4 (dead), 2 threshold.
GPS was used to predict provenance of 59 unknown samples, and MSS was inferred based on the similarity of admixture profiles to the training set accessions Plants were infected in the lab and symptoms recorded 21/29 samples that are predicted to be resistant were resistant 24/30 samples predicted to be susceptible were susceptible
Experimental validation of predicted resistance levels
Chi-square p-value = 1.5.10-4
We thank Jean-Marie Prospéri (INRA Montpellier, France) for providing seeds of the accessions
WhoGem models can be significant predictors of quantitative functional traits in plants
Phenotypic data from: Mazurier et al, in preparation Stanton-Geddes et al., 2013 Bonhomme et al., 2013Gentzbittel et al., pending submission
Comparisons of accuracy of prediction for several quantitative traits of adaptive relevance in M. truncatula using f ive Genomic Selection algorithms and our WhoGEM. Accuracy is computed using 50 runs of five-fold cross-validation for all methods. Phenotypes: (a) Plant height at harvest (b) Number of leaves (c) Nodule number below 5 cm of root growth (d) Nodule number above 5 cm of root (e) Total number of nodules (f) Partial resistance to Aphanomyces euteiches, estimated by root-rot index.
What is missing?Distance between two points should not be just geometric distance. Add: 1. Mode of propagation (e.g. Medicago seeds stick to goats) 2. Soil 3. Climate 4. Geography
This methodology may serve as a basis for analyses in other plant species and for other functional quantitative traits of interest.
Conclusions
• Large proportion of phenotypic variation between individuals may be best explained by population admixture.
• Variation in genome admixture proportion explains most of phenotypic variation for quantitative functional traits.
• We experimentally confirm the prediction of differences in quantitative disease resistance levels in the wild model legume Medicago truncatula.
• Admixture components were found to be significantly related to climate and geography, also positive selection at the species level might not explain current adaptation.
• Phenotypes can be predicted using genome-wide patterns of admixture, when incorporating covariates such as individuals' provenance.
• This insight contributes to the understanding of adaptation, and can accelerate plant and animal breeding, and biomedical research programs.
MAX correlati
on
MIN correlati
on Positively correlated
withNegatively correlated
with
K1 0.42 -0.31 Isothermality Precipitation of Wettest
Quarter
K2 0.42 -0.46
Precipitation in February
Mean Diurnal Range (Mean of monthly (max
temp - min temp)) K3 0.40 -0.25 Precipitation in August Precipitation in March
K4 0.67 -0.40Precipitation in May /
June Min. temperature in
November
K5 0.34 -0.31Temperature Annual
Range Precipitation in March
K6 0.74 -0.31 LatitudePrecipitation of Wettest
Quarter
K7 0.59 -0.51
Temperature Seasonality/Max
Temperature in July Isothermality
K8 0.63 -0.59
Precipitation in December/ January /
Coldest Quarter Longitude
Application of GPS to aDNA (Bronze Age)
30 out of 100 Bronze Age samples (Allentoft et al 2015) had over 500 of ancestry informative markers. We applied GPS algorithm to find the closest modern population.
• ReAdmix developed to treat individuals of mixed origin and represents an individual as a linear combination of admixture vectors of reference populations
30%British+10%Russian+60%Chinese P=a1rs1 + a2rs2 +... + aprsp, subject to constraints
a1 + a2 +... + ap=1 and ai≥0
79
More complex cases? reAdmix
How it works
• We assume right away that the given ancient proportions contain error
• Start with a guess population • Add/remove populations to achieve optimal fit • Conditional optimization (such as “I know that there
was a Jewish ancestor somewhere in my pedigree”)
80
31.07.2017
reAdmix approachAim: to find the smallest subset of modern populations whose combined admixture components are similar to those of the individual within a small tolerance margin.
The algorithm consists of three phases:
1. Iteratively build the first candidate solution (LARS) and improve it.2. Generate the predefined number M of additional candidate solutions randomly and apply the Differential Evolution (DEEP).
3. Identify the populations that have stable membership in the solution across the set, that is, are part of solution in at least 75% of cases.
Let R={ri}i=1..I be the set of modern populations where
ri=(ri,1, ..., ri,K) and K is the dimension (K=9).We seek two sets S=(s1,...,sp) and A=(a1,...,ap) where
si are the indices of modern populations ai are the coefficients of modern populationsin the approximation
each
of the test vector T
31.07.2017
Phase 1. Build and improve the initial solution set
S - individual’s ancestry
T – test vector
N - desired size of the solution
R – the set of modern populations
dim(S) < N
j=arg max[F(rj,T)] S = S ∪ {j}
aj=max[α:α·rj<T+ε]×β T = T − ajrj S0 = (S \ si) U {rk}
S = S0
i=0..dim(S) k=0..dim(R)
f(S0,T) < f(S,T)Affinity score
minimizes the loss function
where
Find the population vector with the highest affinity score Append this population to the solution set. Calculate the weight of the population vector
For all populations x in the current solution and for all y outside the solution, replace x with y, if it reduces the error.
31.07.2017
Phase 2. Optimize the solution by global stochastic and local search
2. Local search over all populations close to the preliminary solution
This step selects between related populations (e.g. Belorussian, Russian, and Ukrainian) that could have been misplaced in previous steps
S1,S2,...,SM
are generated randomly or from population groups
Objective function
is the approximation of Twhere
1. Differential Evolution Gmax number of iterations
DE: optimization method used for multidimensional real valued functions. Good for Treatment of noisy problems (Storn and Price, 1997)
31.07.2017
Phase 3. Averaging
S1:(a1,r1)1,(a2,r2)1,..,(ap-1,rp-1)1,(ap,rp)1
S2:(a1,r1)2,(a2,r2)2,..,(ap-1,rp-1)2,(ap,rp)2
SM:(a1,r1)M,(a2,r2)M,..,(ap-1,rp-1)M,(ap,rp)M
...
r belongs to final solution if: r=r1,1=r2,M-1=...=rp,M — is the same modern population
and
is present in L > 75% solutions
and
a=(1/L)(a1,1+a2,M-1+..+ap,M)
Reliable estimate: - the populations that are part of solution in at least 75% of cases,
- their averaged estimates of admixture proportion.
SM-1:(a1,r1)M-1,(a2,r2)M-1,..,(ap-1,rp-1)M-1,(ap,rp)M-1
S1, S2, ... , SM-1, SM — the set of candidate solutions. Final solution:
S=(s1,...,sp)A=(a1,...,ap)
Benchmark-1
Take sample that were validated by GPS1
Then we used ReAdmix in (a) unconditional and (b) conditional with incorrect guess mode.
(a) 94% of the samples were mapped to their reported ethnic group, and average distance to the correct location was 54±13 miles.
(b) with randomly chosen incorrect guess, 92% of samples was mapped to the reported ethnic group, with average distance to the correct location equal to 65±17 miles.
In all cases ReAdmix correctly identified the cases as un-mixed
86
Benchmark-2: simulated 500 50:50 mixtures
87
Testing mode Percent of at least one correctly predicted
origin
Percent of completely
correct predictions
Average distance to correct population, miles
Unconditional 78% 56% 324±46
Conditional on the equal weights, populations unknown
91% 74% 176±33
Conditional on one of the populations, weights unknown
NA 71% 104±16
Benchmark-2: simulated 500 50:25:25 mixtures
88
Testing mode Percent of at least one correctly predicted
origin
Percent of completely correct
predictions
Average distance to correct population,
miles
Unconditional 88% 21% 346±34Conditional on the major population, weights unknown
73% 27% 222±26
Conditional on one of the minor populations, weights unknown
78% 33% 234±30
Sohn et (2012) al benchmark• 2 components • 4 components
4-dim space: European, African, Native American and East Asian
Color coding: red-European, green-African, yellow- Native American, blue-East Asian, and white- unassigned