FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)
-
Upload
dag-endresen -
Category
Technology
-
view
5.434 -
download
0
description
Transcript of FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)
Focused Identification of Germplasm Strategy (FIGS)
for wild relatives of the cultivated plants
Dag Endresen and Abdallah Bari PGR Secure (EU 7th Framework Program)
Workshop 9-13 January 2012, Madrid, Spain
TOPICS:
Wheat at Alnarp, June 2010 2
• Trait mining with FIGS
– Predictive link between climate data and trait data
– Case studies:• Morphological traits, Nordic barley• Net blotch on barley• Stem rust on wheat • Stem rust, Ug99 on bread wheat
landraces
Domestication and cultivated plants:Utilizing genetic potential from the wild
corn, maize
wild tomato
tomato
teosinte3
cultivation
A NEEDLE IN A HAY STACK
• Scientists and plant breeders want a few hundred germplasm accessions to evaluate for a particular trait.
• How does the scientist select a small subset likely to have the useful trait?
4
Challenges for utilization of plant genetic resources
* Large gene bank collections* Limited screening capacity
5
Origin of Concept:Boron toxicity of wheat and barley example of late 1980s
What is Focused Identification of Germplasm Strategy
Slide made byM.C. Mackay, 1995
South AustraliaMediterranean Sea
er
y
a
n
is
F I G SOCUSED DENTIFICATION OF ERM PLASM TRATEGY
Data la
yers sie
ve acce
ssions
ba
sed on
latitud
e &
lon
gitud
e
Illustration by Mackay (1995)
FIGS:Origin of FIGS: Michael Mackay (1986, 1990, 1995)
7
Origin of FIGS: Michael Mackay (1986, 1990, 1995)
8
– Identification of plant germplasm with a higher likelihood of having desired genetic diversity for a target trait property.
– Using climate data for prediction of crop traits a priori BEFORE the field trials.
OBJECTIVES OF FIGS
9
Bread wheat at Nöbbelöv in Lund
TRAIT MINING
• Focused Identification of Germplasm Strategy (FIGS).
• Identify new and useful genetic diversity for crop improvement.
• Based on eco-geographic data analysis using climate data.
European mountain ash (Sorbus aucuparia L.) at Alnarp, July 2004 10
FOCUSED IDENTIFICATION OF GERMPLASM STRATEGY
Climate layers from the ICARDA ecoclimatic database (De Pauw, 2003)
11
ASSUMPTION:The climate at the original source location, where the plant germplasm was developed is correlated to the trait property.
AIM: To build a computer model explaining the crop trait score from the climate data.
12
High cost data
Low cost data
Genebank accessions
(landraces & CWR)
Trait data
Climate data
Field trials (€€€)
Focu
sed I
denti
ficati
on of
Germ
plasm
Stra
tegy
Geo-referencing of
collecting places13
CLIMATE EFFECT DURING THE CULTIVATION PROCESS
Wild relatives are shaped by the environment
Primitive cultivated crops are shaped by local climate and humans
Traditional cultivated crops (landraces) are shaped by climate and humans
Modern cultivated crops are mostly shaped by humans (plant breeders)
Perhaps future crops are shaped in the molecular laboratory…?
14
PREDICTIVE LINK BETWEEN ECO-GEOGRAPHY AND TRAITS
It is possible that the human mediated selection of landraces will contribute to the link between ecogeography and traits.
During traditional cultivation the farmer will select for and introduce germplasm for improved suitability of the landrace to the local conditions.
15
CLIMATE DATA – WORLDCLIMThe climate data can be extracted from the WorldClim dataset.http://www.worldclim.org/ (Hijmans et al., 2005)
Data from weather stations worldwide are combined to a continuous surface layer.
Climate data for each landrace is extracted from this surface layer.
Precipitation: 20 590 stations
Temperature: 7 280 stations
16
CLIMATE DATALayers used in these early FIGS studies:
• Precipitation (rainfall)• Maximum temperatures • Minimum temperatures
Some of the other layers available:
• Potential evapotranspiration (water-loss)• Agro-climatic Zone (UNESCO classification)• Soil classification (FAO Soil map)• Aridity (dryness)
(mean values for month and year)Eddy De Pauw (ICARDA, 2008)
17
LIMITATIONS OF FIGS
• Landraces and wild relatives– The link between climate data and the trait data is
required for trait mining with FIGS. Modern cultivars are not expected to show this predictive link (complex pedigree).
• Georeferenced accessions– Trait mining with FIGS is based on multivariate
models using climate data from the source location of the germplasm. To extract climate data the accessions need to be accurately georeferenced.
18
MORPHOLOGICAL TRAITS IN NORDIC BARLEY LANDRACES
Field observations by Agnese Kolodinska Brantestam (2002-2003)
Multi-way N-PLS data analysis, Dag Endresen (2009-2010)
Priekuli (LVA) Bjørke (NOR) Landskrona (SWE) 19
MULTI-WAY N-PLS RESULTSNORDIC BARLEY LANDRACES
ExperimentSite Year
Heading days
Ripening days
Length of plant
Harvest index
Volumetric weight
Thousand grain weight
LVA 20021 n.s. n.s. n.s. n.s. *** n.s.
LVA 2003 *** n.s. ** ** *** n.s.NOR 2002 - * ** *** ** n.s.NOR 2003 ** *** *** * * n.s.
SWE 2002 ** *** n.s. ** * n.s.SWE 20032 n.s. ** n.s. n.s. ** n.s.
*** Significant at the 0.001 level (p-value)** Significant at the 0.01 level
* Significant at the 0.05 leveln.s. Not significant (at the above levels)
1 LVA 2002 Germination on spikes (very wet June)2 SWE 2003 Incomplete grain filling (very dry June)
Endresen, D.T.F. (2010). Predictive association between trait data and ecogeographic data for Nordic barley landraces. Crop Science 50: 2418-2430. DOI: 10.2135/cropsci2010.03.0174
20
CLASSIFICATION PERFORMANCE
• Positive predictive value (PPV)• PPV = True positives / (True positives + False positives)• Classification performance for the identification of
resistant samples (positives)
• Positive diagnostic likelihood ratio (LR+)• LR+ = sensitivity / (1 – specificity)• Less sensitive to prevalence than PPV
21
NET BLOTCH ON BARLEY LANDRACES
Green dots indicate collecting sites for resistant wheat landraces and red dots collecting sites for susceptible landraces.
USDA GRIN, trait data online: http://www.ars-grin.gov/cgi-bin/npgs/html/desc.pl?1041
Field experiments made in Minnesota, North Dakota and Georgia in the USA
22
MULTIVARIATE SIMCA RESULTSFOR NET BLOTCH ON BARLEY
Dataset (unit) PPV LR+ Estimated gainNet blotch (accession) 0.54 (0.48-0.60) 1.75 (1.42-2.17) 1.35 (1.19-1.50)
Random 0.40 (0.35-0.45) 0.99 (0.84-1.17) 0.99 (0.87-1.12)(40 % resistant samples)
Endresen, D.T.F., K. Street, M. Mackay, A. Bari, E. De Pauw (2011). Predictive association between biotic stress traits and ecogeographic data for wheat and barley landraces. Crop Science 51: 2036-2055. DOI: 10.2135/cropsci2010.12.0717
PPV = Positive Predictive Value; LR+ = Positive Diagnostic Likelihood Ratio
23
STEM RUST ON WHEAT LANDRACES
USDA GRIN, trait data online: http://www.ars-grin.gov/cgi-bin/npgs/html/desc.pl?65049
Green dots indicate collecting sites for resistant wheat landraces and red dots collecting sites for susceptible landraces.
Field experiments made in Minnesota by Don McVey
24
MULTIVARIATE SIMCA RESULTSFOR STEM RUST ON WHEAT
Dataset (unit) PPV LR+ Estimated gainStem rust (accession) 0.54 (0.50-0.59) 3.07 (2.66-3.54) 1.95 (1.79-2.09)
Random 0.29 (0.26-0.33) 1.04 (0.90-1.20) 1.03 (0.91-1.16)(28 % resistant samples)
Stem rust (site) 0.50 (0.40-0.60) 4.00 (2.85-5.66) 2.51 (2.02-2.98)Random 0.19 (0.13-0.26) 0.94 (0.63-1.39) 0.95 (0.66-1.33)(20 % resistant samples)
Endresen, D.T.F., K. Street, M. Mackay, A. Bari, E. De Pauw (2011). Predictive association between biotic stress traits and ecogeographic data for wheat and barley landraces. Crop Science 51: 2036-2055. DOI: 10.2135/cropsci2010.12.0717
PPV = Positive Predictive Value; LR+ = Positive Diagnostic Likelihood Ratio
25
MULTIVARIATE ANALYSISSTEM RUST ON WHEAT
Classifier method AUC Cohen’s KappaPrincipal Component Regression (PCR)
0.69 (0.68-0.70) 0.40 (0.37-0.42)
Partial Least Squares (PLS) 0.69 (0.68-0.70) 0.41 (0.39-0.43)Random Forest (RF) 0.70 (0.69-0.71) 0.42 (0.40-0.44)Support Vector Machines (SVM) 0.71 (0.70-0.72) 0.44 (0.42-0.45)Artificial Neural Networks (ANN) 0.71 (0.70-0.72) 0.44 (0.42-0.46)
Bari, A., K. Street, , M. Mackay, D.T.F. Endresen, E. De Pauw, and A. Amri (2011). Focused Identification of Germplasm Strategy (FIGS) detects wheat stem rust resistance linked to environment variables. Genetic Resources and Crop Evolution [online first]. doi:10.1007/s10722-011-9775-5; Published online 3 Dec 2011.
Abdallah Bari (ICARDA)
AUC = Area Under the ROC Curve (ROC, Receiver Operating Curve)
26
MULTIVARIATE SIMCA RESULTSSTEM RUST (UG99) ON WHEAT
Ug99 set with 4563 wheat landraces screened for Ug99 in Yemen 2007, 10.2 % resistant accessions. The true trait scores for 20% of the accessions (825 samples) were revealed. We used trait mining with SIMCA to select 500 accessions more likely to be resistant from 3728 accession with true scores hidden (to the person making the analysis). The FIGS set was observed to hold 25.8 % resistant samples and thus 2.3 times higher than expected by chance.
27
MULTIVARIATE ANALYSIS RESULTSFOR STEM RUST (UG99) ON WHEAT
Classifier method PPV LR+ Estimated gainkNN (pre-study) 0.29 (0.13-0.53) 5.61 (2.21-14.28) 4.14 (1.86-7.57)
SIMCA 0.28 (0.14-0.48) 5.26 (2.51-11.01) 4.00 (2.00-6.86)
Ensemble classifier 0.33 (0.12-0.65) 8.09 (2.23-29.42) 6.47 (2.05-11.06)
Random 0.06 (0.01-0.27) 0.95 (0.13-6.73) 0.97 (0.16-4.35)(pre-study, 550 + 275 accessions)
Ensemble 0.26 (0.22-0.30) 2.78 (2.34-3.31) 2.32 (2.00-2.68)Random 0.11 (0.09-0.15) 1.02 (0.77-1.36) 0.95 (0.77-1.32)(blind study, 825 + 3738 accessions)
Endresen, D.T.F., K. Street, M. Mackay, A. Bari, E. De Pauw, K. Nazari, and A. Yahyaoui (2012). Sources of Resistance to Stem Rust (Ug99) in Bread Wheat and Durum Wheat Identified Using Focused Identification of Germplasm Strategy (FIGS). Crop Science [online first]. doi: 10.2135/cropsci2011.08.0427; Published online 8 Dec 2011.
PPV = Positive Predictive Value; LR+ = Positive Diagnostic Likelihood Ratio
28
A LIFEBOAT TO THE GENE POOL
• PDF available from: – http://db.tt/lZMpwgJ
• Available from Libris (Sweden)– ISBN: 978-91-628-8268-6
29
SURFING THE GENEPOOL
Michael Clemet Mackay (2011)
• PDF available from: – http://pub.epsilon.slu.se/8439/
• Available from Libris (Sweden)– ISBN: 978-91-576-7634-4
30
31
Collecting the seeds for some of the most important populations for the wild relatives of the cultivated plants, to maintain them at the ex situ genebanks, would provide improved access to these valuable plant genetic resources.
... but how to identify the most important CWR populations?
GAP ANALYSIS FOR MANAGEMENT OF GENEBANK COLLECTIONS
• Advice the planning of new collecting/gathering expeditions
– Identification of relevant areas were the crop species is predicted to be present (using GBIF data and niche models).
– Focus on areas least well represented in the genebank collection (maximize diversity).
– Focus on areas with a higher likelihood for a desired target trait (FIGS).
See http://gisweb.ciat.cgiar.org/GapAnalysis/ for more information.32
WORMWOOD (ARTEMISIA ABSINTHIUM L.)NORDGEN STUDY: JUNE 2010
Species distribution model(7 364 records)
Using the Maxent desktop software.
33
Wormwood (Artemisia absinthium L.)
The compatibility of data standards between PGR and biodiversity collections made it possible to integrate the worldwide germplasm collections into the biodiversity community (TDWG, GBIF).
POTENTIAL OF THE GBIF TECHNOLOGY
http://data.gbif.org/datasets/network/2
34
Using GBIF/TDWG technology (and contributing to its development), the PGR community can more easily establish specific PGR networks without duplicating GBIF's work.
Data analysis methods for using the FIGS approach with wild relatives of the
cultivated plantsDag Endresen and Abdallah Bari
PGR Secure (EU 7th Framework Program) Workshop 9-13 January 2012, Madrid, Spain
STEPS TO FOLLOW:
36
• Data collection and preparation• Geo-referencing of collecting locations• Initial data exploration• Pre-processing of dataset• Choose modeling method• Calibration of model• Validation of model• Validation of prediction results
GEO-REFERENCING FOR THE SOURCE COLLECTING SITE
Example of georeferencing for NGB9529, a barley landrace reported as originating from
Lyderupgaard using KRAK.dk and maps.google.com
37
EXPLORE THE DATASET (FOR OUTLIERS)
The influence plot (residuals against leverage) shows sample NGB6300 (FRO) observed at Priekuli in 2003 (replicate 2) with a very high leverage - well separated from the “data cloud”.
After looking into the raw data, this observation point was removed as outlier (set to NaN).
38
PRE-PROCESSING
39
Here: Across mode 2 (traits)
Mean centering removes the absolute intensity to avoid the model to focus on the variables with the highest numerical values (intensity).
Scaling makes the relative distribution of values (range spread) more equal between variables. Auto-scaling is a combination of mean centering and variance scaling. After auto-scaling all variables have a mean of zero and a standard deviation of one. The objective is to help the model to separate the relevant information from the noise.
No preprocessing Mean centering Auto scale
Priekuli
Bjørke
Landskrona
SOME MORE PRE-PROCESSING EXAMPLES
40
A MODEL OF THE REAL WORLD
• Validation of the model– No model can ever be absolutely correct– A simulation model can only be an
approximation– A model is always created for a specific
purpose• Apply the model
– The simulation model is applied to make predictions based on new fresh data
– Be aware to avoid extrapolation problems41
DATA FOR THE TRAIT MINING MODEL
• Training set– For the initial calibration or training step.
• Calibration set– Further calibration, tuning step– Often cross-validation on the training set
is used to reduce the consumption of raw data.
• Test set– For the model validation or goodness of
fit testing.– External data, not used in the model
calibration.
42
12 months (mode 2)
14 s
ampl
es (m
ode
1)
12 months (mode 2)
14 s
ampl
es (m
ode
1)(m
ode 3)3 cl
imate
varia
bles
Precipitation
Max temp
Min temp
Min. temperature14
sam
ples Jan, Feb, Mar, …
(mode 2)
Max. temperature Precipitation
36 variables
3rd level for mode 32nd level for mode 31st level for mode 3
mod
e 1
Jan, Feb, Mar, …(mode 2)
Jan, Feb, Mar, …(mode 2)
MULTI-WAY DATA STRUCTURE (N-PLS)
43
SPLIT-HALF MODEL VALIDATION
The two PARAFAC models each calibrated from two independent split-half subsets, both converge to a very similar solution as the model calibrated from the complete dataset.
The PARAFAC model is thus a general and stable model for the scope of Scandinavia.
Example used here is the Trait data model (mode 1) from Endresen (2010).
44
SPLIT-HALF MODEL VALIDATION
45
PARAFAC climate data (3-w)
RESIDUALS (VALIDATE MODEL FIT)
46Be aware of over-fitting! NB! Model validation!
The distance between the model (predictions) and the reference values (validation) is the residuals.
Example of a bad model calibration
Cross-validation indicates the appropriate model complexity.
Calibration step
MODELING METHODS
47
DATA MODELING METHODS• Parallel Factor Analysis (PARAFAC) (Multi-way)• Multi-linear Partial Least Squares (N-PLS) (Multi-way)• Soft Independent Modeling of Class Analogy (SIMCA)• k-Nearest Neighbor (kNN) • Partial Least Squares Discriminant Analysis (PLS-DA)• Linear Discriminant Analysis (LDA)• Principal component logistic regression (PCLR)• Generalized Partial Least Squares (GPLS)• Random Forests (RF)• Neural Networks (NN)• Support Vector Machines (SVM)
These methods above are the modeling methods used by Endresen (2010), Endresen et al (2011, 2012), and Bari et al (2012).
• Boosted Regression Trees (BRT)• Multivariate Regression Trees (MRT)• Bayesian Regression Trees
48
Principal component 1
Princip
al co
mponent 2
Prin
cipa
l com
pone
nt 3
3 PCs
1 PC
2 PCs
SIMCA ANALYSIS (PCA MODEL FOR EACH CLASS)
Illustration modified from Wise et al., 2006:201 (PLS Toolbox software manual)
Resistant samples
Intermediate
Susceptible
*
Example from the stem rust set:
49
VALIDATION OF RESULTS
• Pearson product-moment correlation (R) (-1 to 1)• Coefficient of determination (R2) (0 to 1)• Cohen’s Kappa (K) (-1 to 1)• Proportion observed agreement (PO) (0 to 1)• Proportion positive agreement (PA) (0 to 1)• Positive predictive value (PPV) (0 to 1)• Positive diagnostic likelihood ratio (LR+) (from 0)• Sensitivity and specificity• Area under the curve (AUC)
– Receiver operating characteristics (ROC)
• Root mean square error (RMSE)– RMSE of calibration (RMSEC)– RMSE of cross-validation (RMSECV)– RMSE of prediction (RMSEP)
• Predicted residual sum of squares (PRESS)
50
CONFUSION MATRIX
PredictedResistant (positive) Susceptible (negative)
Observed (Actual)
Resistant (positive) True positive (TP) False negative (FN)
Susceptible (negative) False positive (FP) True negative (TN)
51
Proportion observed agreement (PO) = (TP + TN) / NProportion positive agreement (PA) = (2 * TP) / (2 * TP + FP + FN)
Positive predictive value (PPV) = TP / (TP + FP)
Sensitivity = TP / (TP + FN)Specificity = TN / (TN + FP)
Positive diagnostic likelihood ratio (LR+) = Sensitivity / (1 – specificity) Positive diagnostic likelihood ratio (LR+) = (TP / [TP + FN]) / (FP / [FP + TN])
Odds ratio (OR) = (TP * TN) / (FN * FP)
Yule’s Q = (OR - 1) / (OR + 1)
Positive predictive gain (Gain) = PPV/prevalence
CORRELATION COEFFICIENT (R2)Predictions for the cross-validated (leave-one-out) samples for a N-PLS model
Endresen (2010) for trait 5 (volumetric weight) observations from Priekuli, 2002 (mean of the replications) and 6 principal components.
Correlation coefficient:
r2 = 0.96
Sum squared residuals:
RMSECV = 2.86
RMSECV
52Actual trait scores
Pred
icte
d tr
ait s
core
s
RMSEC
CORRELATION COEFFICIENT (R2)Predictions for the cross-validated (leave-one-out) samples with a N-PLS model.
Endresen (2010) for trait 5 (volumetric weight) observations from Bjørke, 2002 (mean of the replications) and 4 principal components.
Correlation coefficient:
r2 = 0.84
Sum squared residuals:
RMSECV = 3.66
RMSECV
53
Pred
icte
d tr
ait s
core
s
Actual trait scores
RMSEC
SIGNIFICANCE LEVELS
• Often the critical levels (a) for the p-value significance is set as 0.05, 0.01 and 0.001 (5 %, 1 %, 0.1 %).
• The significance level is often marked with one star (*) for the 0.05 level, two stars (**) for the 0.01 level and three stars (***) for the 0.001 level.
– 5% (even a random effect when an experiment is repeated 20 times is likely to be observed one time)
– 1% (if an experiment is repeated 100 times a random effect is likely to be observed one time)
– 0.1% (if an experiment is repeated 1000 times a random effect is likely to be observed one time) 54
PGR Secure (EU 7th Framework)Workshop FIGS approach9-13 Jan 2012, Madrid
Abdallah Bari (ICARDA)[email protected]
55