Post on 01-Apr-2015
The significance of the structure of data on PLS predictions of protein involving both natural and human experimental design
Åsmund Rinnan, Lars Munck
Three datasets of barley
B + C: The major substances (protein, starch, cellulose, beta-glucan, fat and water) are weighted to represent biological composition.
Dataset   A         B           C
Type      Natural   Simulated   DoE
Samples   31        31          54
All measured on an NIR 6500 from 1100 to 2498 nm at 2 nm intervals.
Rinnan: Dataset, Preprocessing, PCA, iPCA, PLS, Biology, PLS - again, Summary
Munck: Permutation, Mutants, Diff spec, Data structure, Genetics, Conclusion
Legend: normal barley, protein mutants, carbohydrate mutants
Pre-processing of spectra
Moving-window SNV (standard normal variate) with a 130 nm window.
The 1580-2498 nm spectral region shows the smallest differences between the three datasets.
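The pre-processing step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the window is centred on each wavelength, shrinks at the spectrum edges, and uses the population standard deviation. The function name `moving_window_snv` and the toy spectrum are hypothetical.

```python
from statistics import mean, pstdev

def moving_window_snv(spectrum, window_points):
    """Centre and scale each point with the mean and standard deviation of
    the points in a window centred on it (the window shrinks at the edges)."""
    half = window_points // 2
    corrected = []
    for i, value in enumerate(spectrum):
        lo = max(0, i - half)
        hi = min(len(spectrum), i + half + 1)
        local = spectrum[lo:hi]
        sd = pstdev(local)
        # A flat window carries no variation, so return 0 instead of dividing by 0
        corrected.append((value - mean(local)) / sd if sd > 0 else 0.0)
    return corrected

# Assumed conversion: a 130 nm window at 2 nm spacing spans 65 points
WINDOW_NM = 130
STEP_NM = 2
window_points = WINDOW_NM // STEP_NM

wavelengths = list(range(1100, 2500, STEP_NM))  # 1100, 1102, ..., 2498 nm
# Toy spectrum (illustrative only): sloping baseline plus one broad band
spectrum = [0.2 + 0.0003 * (w - 1100) + (0.1 if 1900 <= w <= 1960 else 0.0)
            for w in wavelengths]
corrected = moving_window_snv(spectrum, window_points)
```

Unlike ordinary SNV, which uses one mean and one standard deviation per spectrum, the local window removes baseline and scatter effects that vary along the wavelength axis.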
PCA 1100-2500nm
Interval PCA selects 1804-2060 nm, giving the least differences between datasets.
Predicting protein using the three datasets
            Nat    Sim    DoE
RMSE        0.71   1.08   0.69
r2          0.9    0.84   0.96
nLV         5      2      5
intercept   1.09   2.12   0.48
slope       0.93   0.86   0.97
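For readers who want to reproduce this kind of summary, the sketch below computes RMSE, r2, intercept and slope from measured and predicted protein values. It assumes that r2 is the squared Pearson correlation and that intercept and slope come from regressing predicted on measured values; the slides do not state this. The function name `prediction_summary` and the numbers are hypothetical. nLV (the number of latent variables) is a choice made when fitting the PLS model and is not computed here.

```python
from math import sqrt
from statistics import mean

def prediction_summary(measured, predicted):
    """Summary statistics for a predicted-versus-measured plot."""
    n = len(measured)
    rmse = sqrt(sum((p - m) ** 2 for m, p in zip(measured, predicted)) / n)
    m_bar, p_bar = mean(measured), mean(predicted)
    sxx = sum((m - m_bar) ** 2 for m in measured)
    syy = sum((p - p_bar) ** 2 for p in predicted)
    sxy = sum((m - m_bar) * (p - p_bar) for m, p in zip(measured, predicted))
    slope = sxy / sxx                  # least-squares line: predicted on measured
    intercept = p_bar - slope * m_bar
    r2 = sxy ** 2 / (sxx * syy)        # squared Pearson correlation (assumed definition)
    return {"RMSE": rmse, "r2": r2, "intercept": intercept, "slope": slope}

# Illustrative protein values in percent, not taken from the barley data
measured = [9.8, 11.2, 12.5, 13.1, 14.9, 16.0]
predicted = [10.3, 11.0, 12.9, 12.8, 14.5, 15.7]
summary = prediction_summary(measured, predicted)
```

A slope near 1 together with an intercept near 0 indicates that predictions follow the measured values without systematic shrinkage, which is how the DoE column reads in the table.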
Regression coefficients
PLS diagnostics (to protein)
A. Simple correlation coefficients: wavelength absorption to protein content.
B. PLS regression coefficients.
Panels: Natural, Simulated, DoE
Isolating the chemical and biological components of the data-sets.
A (Natural, 31): Chemistry + SimBiology + RestBiology
B (Simulated, 31): Chemistry + SimBiology
C (DoE, 54): Chemistry
SimBiology = B – C
RestBiology = (A – C) – (B – C)
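The two subtractions can be illustrated on mean spectra. The slides do not say how samples from datasets of different size (31, 31 and 54) are matched before subtracting, so working on the mean spectrum of each dataset is purely an assumption made here, and all names and numbers are hypothetical.

```python
def mean_spectrum(spectra):
    """Column-wise mean of a list of equally long spectra."""
    n = len(spectra)
    return [sum(column) / n for column in zip(*spectra)]

def subtract(x, y):
    """Element-wise difference of two spectra."""
    return [a - b for a, b in zip(x, y)]

# Toy data with three wavelengths per spectrum (illustrative numbers only)
natural = [[0.52, 0.61, 0.43], [0.50, 0.65, 0.41]]                    # dataset A
simulated = [[0.48, 0.58, 0.40], [0.46, 0.60, 0.42]]                  # dataset B
doe = [[0.40, 0.50, 0.35], [0.42, 0.52, 0.33], [0.41, 0.51, 0.34]]    # dataset C

mean_a = mean_spectrum(natural)
mean_b = mean_spectrum(simulated)
mean_c = mean_spectrum(doe)

chemistry = mean_c                                   # C is taken as chemistry alone
sim_biology = subtract(mean_b, mean_c)               # SimBiology = B - C
rest_biology = subtract(subtract(mean_a, mean_c),    # RestBiology = (A - C) - (B - C)
                        subtract(mean_b, mean_c))
```

With plain subtraction, (A – C) – (B – C) reduces to A – B, so under this assumption the three components add back up to the natural mean spectrum.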
Predicting protein by PLS: chemistry and non-simulated (rest) biology show high contributions, while that of simulated biology is low.
            Chemistry   SimBio   RestBio
RMSE        0.94        2.53     1.31
R2          0.87        0.13     0.76
nLV         3           1        3
intercept   1.58        12.9     3.15
slope       0.90        0.17     0.80
Normalized regression coefficients
Back to data, selected wavelengths
Full PLS, Correlation-PLS, Wavelengths abs to protein
Assignment PLS, Phil Williams
Quick comparison
Results: Summary
Interpretation: We are working by ”permutation science”:
• 1. Mathematical validation of models: permutation of data in chemometrics, i.e. cross-validation.
• 2. Design of Experiments (DoE): permutation of data through experiments by human design.
• 3. Natural design: permutation by selection of unique natural states where nature reveals its principles in data.
Question: In chemometrics, why not combine them all rather than focusing on mathematical permutation alone?
All three permutation approaches are at the heart of chemometric validation of models. Why not use them together, as we have done here? They are complementary.
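The first, mathematical kind of permutation can be sketched as segmented cross-validation: each sample is left out once, the model is refitted on the remaining samples, and the left-out samples are predicted. To keep the example self-contained, a one-variable least-squares line stands in for the PLS model used in the talk; the interleaved segment layout, the function names and the data are all assumptions.

```python
from math import sqrt

def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def cross_validated_rmse(x, y, n_segments):
    """RMSE of cross-validation with interleaved segments (0, k, 2k, ...)."""
    n = len(x)
    squared_errors = []
    for segment in range(n_segments):
        left_out = set(range(segment, n, n_segments))
        train_x = [x[i] for i in range(n) if i not in left_out]
        train_y = [y[i] for i in range(n) if i not in left_out]
        slope, intercept = fit_line(train_x, train_y)
        for i in left_out:
            squared_errors.append((y[i] - (intercept + slope * x[i])) ** 2)
    return sqrt(sum(squared_errors) / n)

# Illustrative data: absorbance at one wavelength versus protein in percent
absorbance = [0.41, 0.44, 0.47, 0.50, 0.52, 0.56, 0.58, 0.61]
protein = [9.9, 10.8, 11.1, 12.3, 12.6, 13.9, 14.1, 15.2]
rmsecv = cross_validated_rmse(absorbance, protein, n_segments=4)
```

This only reshuffles the samples that already exist. It cannot add the variation contributed by designed experiments or by naturally occurring mutants, which is the point of the question above.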
Principles of natural processes are reflected in data
• The solar eclipse reveals solar eruptions *)
• The NIR barley endosperm mutant model, developed since 1965 with expression control of genetics and environment. Two types of mutants: regulative protein mutants (P) and carbohydrate (starch) mutants (C); normal barley is N.
*) http://science.nationalgeographic.com/science/enlarge/solar-eclipse-moon.html
Figure: Mutant 5.f and Bomi control.
J. Chemometrics 24: 481-495 (2010)
How were the mutants found? By a bivariate plot of % protein against mmol DBC (dye-binding capacity by acilane orange).
The dye-binding capacity (DBC) instrument for basic amino acids (lysine).
Background: Development of screening methods for improving lysine and nutritional quality in barley.
LM at the nutritional laboratory of the Swedish Seed Association, Svalöf, in 1967.
Figure: DBC plotted against % protein. Groups: high-lysine mutation, mutation recombinants, normal recombinants.
Selecting endosperm mutants
J. Chemometrics 24: 481-495 (2010)
Figure panels: Vitamin E profile; A/P vs. β-glucan (A/P = amide nitrogen to protein).
Legend: high β-glucan, normal, high lysine, no data.
Labelled samples: 3a, 3a_piggy, 3b, 3c, 3m, 4d, 5f, 5g, 16, 449, w1, w2; N = normal barley.
Conclusion: Each mutant produces a unique chemical fingerprint for each individual gene in a controlled genetic background (Bomi). The fingerprint is summarized at the level of chemical bonds by NIR spectroscopy. Cellular computation is soft, like a PCA.
Any chemical (bi-)plot can select any mutant.
There are deterministic differential NIR spectra for each mutant relative to the genetic background Bomi, revealing a spectral absorption reproducibility as high as 10^-5 MSC log 1/R for the P mutant lys3.a (blue) and the C mutant lys5.g (brown).
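A differential spectrum of this kind can be sketched as multiplicative scatter correction (MSC) followed by subtraction of the control. The sketch assumes the textbook form of MSC, in which each spectrum is regressed on a reference spectrum and then corrected by the fitted offset and gain, and it uses the Bomi spectrum as the reference. The slides do not state which reference was used, and the spectra below are invented.

```python
from statistics import mean

def msc(spectra, reference=None):
    """Multiplicative scatter correction: fit s = a + b * reference for each
    spectrum s by least squares and return (s - a) / b."""
    if reference is None:
        reference = [mean(column) for column in zip(*spectra)]
    r_bar = mean(reference)
    srr = sum((r - r_bar) ** 2 for r in reference)
    corrected = []
    for s in spectra:
        s_bar = mean(s)
        b = sum((r - r_bar) * (v - s_bar) for r, v in zip(reference, s)) / srr
        a = s_bar - b * r_bar
        corrected.append([(v - a) / b for v in s])
    return corrected

def difference_spectrum(sample, control):
    """Sample minus control, wavelength by wavelength."""
    return [s - c for s, c in zip(sample, control)]

# Hypothetical log 1/R values at five wavelengths
bomi = [0.30, 0.45, 0.60, 0.50, 0.35]
# Hypothetical mutant: the control with a scatter offset and gain plus one changed band
mutant = [0.05 + 1.1 * v for v in bomi]
mutant[2] += 0.02

corrected = msc([bomi, mutant], reference=bomi)
diff = difference_spectrum(corrected[1], corrected[0])
```

Because the offset and gain are fitted over the whole spectrum, part of a genuine band change is absorbed into the correction, which is one reason to inspect the corrected spectra directly rather than relying on the model output alone.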
Data structure is super-ordinate to chemometric analysis
Figure: PCA score plot (PC1 vs. PC2) with the groups C, N and P.
Labelled samples: 3a, 3b, 3c, 3m, 4d, 5f, 5g, 16, 95, 449, w1, w2, Bomi, CAII, Minerva, Nordal, Triumph, Lysiba, Lysimax.
Annotations: BG = 12.3; BG = 3.7; 3.2; 3c; 3a.
The 3a and 3c P mutants are differentiated in this PCA.
However, spectral differences in the 2450-2500 nm region represent a much more finely tuned and informative change in β-glucan, from 3.1% in 3a to 6.4% in 3c.
How is the chemical composition of the cell decided?
Through soft modeling of intercellular dynamics of the whole cell by quantum and chemical cross-talk, as revealed by the movements of chromosomes at mitosis.
Cell emergence is like music, as directed by the whole chemical orchestra of the cell.
Conclusion
• Biological macro data are basically deterministic, calculated in situ by “set probability” controlled by the whole cell.
• Holistic analysis is limited by uncertainty, specified as irreducibility “top down” and indeterminacy “bottom up”.
• The structure of data is the king that rules mathematical modeling by data inspection.
• Because of the determinism demonstrated here, the development of gentle data models (such as MSC) and data inspection software is of essential importance in avoiding a reduction of information.
• Chemometrics is excellent for overviews, but the results have to be checked by data inspection.