Supplementary Material (SM) NMR shifts in aluminosilicate ... · In the electronic form of...

16
Supplementary Material (SM) NMR shifts in aluminosilicate glasses via Machine Learning Authors : Ziyad Chaker, *a Mathieu Salanne, b Jean-Marc Delaye, c and Thibault Charpentier *a Affiliations : a NIMBE, CEA, CNRS, Université Paris-Saclay, CEA-Saclay, F-91191 Gif-sur-Yvette cedex, FRANCE; E-mails: [email protected] ([email protected]) and [email protected] b Maison de la Simulation, USR 3441, CEA, CNRS, INRIA, Université Paris-Sud, Université de Versailles, F-91191 Gif-sur-Yvette, France. c CEA, DEN, Service d’études de vitrification et procédés hautes températures, 30207 Bagnols-sur-Cèze, France. Corresponding Authors : *Thibault Charpentier: [email protected] *Ziyad Chaker: [email protected] (Contact for help on the codes and data usage) Data provided in electronic supplementary materials: In the electronic form of supplementary materials, we provide all our input data in the form of ISOCAR files (DFT- GIPAW σ iso data in the folder "Diso_ISOCAR_DATA") and the corresponding VASP POSCAR input structure files (folder "Glasses_Structures_DATA"). We also give the codes (folder "Descriptors_Codes") necessary to compute SOAP, BPSF and ARDF descriptors (direct input to the ML codes) since the full descriptors files’ sizes are too large for uploading in the PCCP platform. The raw data files (folders "Descriptors_BPSF_ARDF_SOAP_LRR_SiO2_Learning_Curves", "Algo- rithms_LRR_LKRR_GKRR_SOAP_SiO2_LearningCurves_Construction" and "Cross-validation_8_ML_Algorithms_RawData") are provided for constructing the learning curves for the different descriptors (figure 1) and algorithms considered (figure 5) as well as for the algorithms comparison in Table S3 (raw data and python codes to generate the plots). We also provide all the raw data and a python code (folder "SOAP_Descriptors_CharacterizationSurfaces") for the production of all SOAP descriptors characterization figures (plots and surfaces of figures 3, 6, 9, S5, S8, S12). Finally, we provide the main ML code constructed and used in our work (MACLAREN.sh and its two corresponding python routines : MALL.py and ML_functions.py in the folder "Main_MACLAREN_MLCODE"). Since no documentation is available, yet, for these home-made codes, please contact the corresponding authors (Ziyad Chaker or Thibault Charpentier) of the article for further information on the usage of these codes ([email protected], [email protected]). Notes : The learning curves (figures 1 and 5) have been computed using the following combinations of structures provided in supplementary materials: 0KSiO 2 -(1 to 10) for (x=10); 0KSiO 2 -(1 to 10) and 0KSiO 2 -n(1 to 10 for (x=20); 0KSiO 2 -(1 to 10) and 0KSiO 2 -n(1 to 10 for (x=20) and 300KSiO 2 -(1 to 10 for (x=30); and so on until including the 97 SiO 2 structures in the training/validation set. The test set is composed, for all these computations, of the two structures: 2000KSiO 2 -n9 and 2000KSiO 2 -n10. Note that due to DFT-GIPAW calculations convergence issues, two structures from the 501 used in this work are redundant: NAS3-1 at 0K is the same than NAS3-2 at 0K and 30Na-1 at 1500K is the same than 30Na-2 at 1500K. These two systems can be safely ignored to reproduce the results of our work. Nevertheless one can also use them (as we have done) within the very same ML set (training or validation) with no impact on the results obtained. This will just result in increasing the weight of these structures during the training or validation process. 1–16 | 1 Electronic Supplementary Material (ESI) for Physical Chemistry Chemical Physics. This journal is © the Owner Societies 2019

Transcript of Supplementary Material (SM) NMR shifts in aluminosilicate ... · In the electronic form of...

Page 1: Supplementary Material (SM) NMR shifts in aluminosilicate ... · In the electronic form of supplementary materials, we provide all our input data in the form of ISOCAR files (DFT-GIPAW

Supplementary Material (SM)

NMR shifts in aluminosilicate glasses via Machine LearningAuthors:

Ziyad Chaker,∗a Mathieu Salanne,b Jean-Marc Delaye,c and Thibault Charpentier∗a

Affiliations:a NIMBE, CEA, CNRS, Université Paris-Saclay, CEA-Saclay, F-91191 Gif-sur-Yvette cedex, FRANCE;

E-mails: [email protected] ([email protected]) and [email protected] Maison de la Simulation, USR 3441, CEA, CNRS, INRIA, Université Paris-Sud, Université de Versailles, F-91191 Gif-sur-Yvette, France.

c CEA, DEN, Service d’études de vitrification et procédés hautes températures, 30207 Bagnols-sur-Cèze, France.Corresponding Authors:

*Thibault Charpentier: [email protected]*Ziyad Chaker: [email protected] (Contact for help on the codes and data usage)

Data provided in electronic supplementary materials:

In the electronic form of supplementary materials, we provide all our input data in the form of ISOCAR files (DFT-GIPAW σiso data in the folder "Diso_ISOCAR_DATA") and the corresponding VASP POSCAR input structure files (folder"Glasses_Structures_DATA"). We also give the codes (folder "Descriptors_Codes") necessary to compute SOAP, BPSFand ARDF descriptors (direct input to the ML codes) since the full descriptors files’ sizes are too large for uploadingin the PCCP platform. The raw data files (folders "Descriptors_BPSF_ARDF_SOAP_LRR_SiO2_Learning_Curves", "Algo-rithms_LRR_LKRR_GKRR_SOAP_SiO2_LearningCurves_Construction" and "Cross-validation_8_ML_Algorithms_RawData") areprovided for constructing the learning curves for the different descriptors (figure 1) and algorithms considered (figure 5) as well as forthe algorithms comparison in Table S3 (raw data and python codes to generate the plots). We also provide all the raw data and a pythoncode (folder "SOAP_Descriptors_CharacterizationSurfaces") for the production of all SOAP descriptors characterization figures (plotsand surfaces of figures 3, 6, 9, S5, S8, S12). Finally, we provide the main ML code constructed and used in our work (MACLAREN.shand its two corresponding python routines : MALL.py and ML_functions.py in the folder "Main_MACLAREN_MLCODE"). Since nodocumentation is available, yet, for these home-made codes, please contact the corresponding authors (Ziyad Chaker or ThibaultCharpentier) of the article for further information on the usage of these codes ([email protected], [email protected]).

Notes:

The learning curves (figures 1 and 5) have been computed using the following combinations of structures provided in supplementarymaterials: 0KSiO2-(1 to 10) for (x=10); 0KSiO2-(1 to 10) and 0KSiO2-n(1 to 10 for (x=20); 0KSiO2-(1 to 10) and 0KSiO2-n(1 to 10for (x=20) and 300KSiO2-(1 to 10 for (x=30); and so on until including the 97 SiO2 structures in the training/validation set. The testset is composed, for all these computations, of the two structures: 2000KSiO2-n9 and 2000KSiO2-n10. Note that due to DFT-GIPAWcalculations convergence issues, two structures from the 501 used in this work are redundant: NAS3-1 at 0K is the same than NAS3-2at 0K and 30Na-1 at 1500K is the same than 30Na-2 at 1500K. These two systems can be safely ignored to reproduce the results of ourwork. Nevertheless one can also use them (as we have done) within the very same ML set (training or validation) with no impact onthe results obtained. This will just result in increasing the weight of these structures during the training or validation process.

Journal Name, [year], [vol.], 1–16 | 1

Electronic Supplementary Material (ESI) for Physical Chemistry Chemical Physics.This journal is © the Owner Societies 2019

Page 2: Supplementary Material (SM) NMR shifts in aluminosilicate ... · In the electronic form of supplementary materials, we provide all our input data in the form of ISOCAR files (DFT-GIPAW

List of TablesS1 Compositions and the cubic cell dimensions the glasses considered in this work: Vitreous silica (SiO2)120; Sodosilicates

x={10 to 50}NS corresponding to (Na2O)x(SiO2)100−x; Sodo-aluminosilicates NAS3, NAS4 and NAS5 corresponding to(Al2O3)25(Na2O)25(SiO2)50, (Al2O3)17.5(Na2O)17.5(SiO2)70 and (Al2O3)12.5(Na2O)12.5(SiO2)75, respectively. The column"Total" refers to the total number of structures considered for each glass composition . . . . . . . . . . . . . . . . . . . . 5

S2 Sizes of the vectors of SOAP descriptors for each oxide glass considered for different numbers nmax (= Lmax) of basis func-tions used for the descriptor construction. With Ns, the number of species in the system, the SOAP vector of descriptorsis calculated as: (1/4)·Lmax·(Lmax+1)2·Ns·(Ns+1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

S3 ML errors (∆σ - the FWHM of the ML absolute errors distribution, RMSE - root mean square error and MAE - mean abso-lute error) obtained for relaxed (0K) and room-temperature (300K) SiO2 glasses (39 structures in the reference set) NMRσiso predictions. The results are shown for both nuclei for the different algorithms described in the computational part:linear ridge regression (LRR)63, kernel ridge regression with a linear kernel (LKRR) and a Gaussian kernel (GKRR)62,elastic net regression (ENR), random forest regression (RFR), Bayesian ridge regression (BRR), k-nearest neighbors (k-NN) and artificial neural networks (ANN)63,66. All calculations are performed with the SOAP descriptor. These estimationare averaged over a 10-fold CV for LRR, BRR, k-NN and ANN while a 3-fold CV is used for ENR, RFR, LKRR and GKRR.The values given between brackets correspond to the resulting CV standard deviations . . . . . . . . . . . . . . . . . . . 5

2 | 1–16Journal Name, [year], [vol.],

Page 3: Supplementary Material (SM) NMR shifts in aluminosilicate ... · In the electronic form of supplementary materials, we provide all our input data in the form of ISOCAR files (DFT-GIPAW

List of FiguresS1 LRR σiso predictions CV (cross-validation) standard deviations (STD), obtained from the 10-fold CV process applied to

obtain the learning curves reported in figure 1, as a function of the training/validation set size. The horizontal grey lineindicates the 1 ppm standard deviation value. The LRR error reported (∆σ) is the FWHM of the distribution of absoluteLRR deviations from DFT-GIPAW calculated data. The vertical arrows indicate the temperatures of the systems appendedto the training/validation set for increasing values of the reference set size (x). The test set is composed of two SiO2

structures of the most challenging 2000K ones . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

S2 LRR-NMR isotropic magnetic shielding predictions for 0K and 300K SiO2 systems (39 structures) for each of the threedescriptors considered and the two nuclei involved in these structures. The oblique grey line indicates the exact matchingbetween LRR predictions and DFT estimations. The LRR error reported (∆σ) is the FWHM of the distribution of absoluteLRR deviations from DFT-GIPAW calculated data. The corresponding root-mean square errors (RMSE) and mean absoluteerrors (MAE) are also reported in each case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

S3 LRR-SOAP test set error distributions (grey distributions) superimposed with the experimental NMR spectra (coloredsolid lines) of a typical SiO2 glass. The LRR results are shown for the three different descriptors considered in this work.The LRR error reported (∆σ) is the FWHM of the distribution of absolute LRR deviations from DFT-GIPAW calculated data 7

S4 LRR-NMR σiso predictions as a function of the absolute (number of elements) descriptor vectors sizes (left panel) andcutoff radius (right panel) for the ARDF, BPSF and SOAP descriptors considered in the case of SiO2 system. The refer-ence set includes only the relaxed SiO2 systems (20 structures) and the LRR error reported (∆σ) is the FWHM of thedistribution of absolute LRR deviations from DFT-GIPAW calculated data . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

S5 LRR NMR σiso predictions errors (Left: ∆σ ; Right: mean absolute error) as function of the SOAP descriptors sizes (Lmax)and cutoff radius (Rc). The reference set considered (19 structures) is composed of all SiO2 systems at room temperature(300K). ∆σ error is the FWHM of the distribution of absolute LRR deviations from DFT-GIPAW calculated data and MAEis the mean absolute error. The vertical arrows indicate the SOAP parameters choice used in most of our work (Lmax = 5and Rc = 5.5 Å). The resulting LRR errors ∆σ , RMSE and MAE for the points indicated by the arrows are, respectively,0.7, 0.7 and 0.5 ppm for 29Si; 1.4, 1.4 and 1.1 ppm for 17O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

S6 ML-SOAP vs DFT NMR isotropic magnetic shieldings results for the 300K SiO2 systems (19 structures) using LRR (leftpanels), GKRR (central panels) and ENR (right panels) for both species. The ML error reported (∆σ) is the FWHM ofthe distribution of absolute ML predictions deviations from DFT-GIPAW calculated data. The corresponding root-meansquare errors (RMSE) and mean absolute errors (MAE) are also reported in each case . . . . . . . . . . . . . . . . . . . . 9

S7 The ML σiso predictions standard deviations, obtained from a 10-fold CV for LRR and 3-fold CV for GKRR and LKRR usedto obtain the learning curves reported in figure 5, as function of the reference set size. The test ML error reported (∆σ) isthe FWHM of the distribution of absolute LRR deviations from DFT-GIPAW calculated data. The vertical arrows indicatethe temperatures of the systems appended to the training/validation set for increasing values of the reference set size(x). The test set is composed of two SiO2 structures of the most challenging 2000K ones . . . . . . . . . . . . . . . . . . 9

S8 LRR NMR σiso predictions errors (Left: ∆σ ; Right: mean absolute error) as a function of the SOAP descriptors sizes(Lmax) and cutoff radius (Rc). The reference set considered (50 structures) is composed of all NS systems at roomtemperature (300K). ∆σ error is the FWHM of the distribution of absolute LRR deviations from DFT-GIPAW calculateddata. The vertical arrows indicate the SOAP parameters choice used in most of our work (Lmax = 5 and Rc = 5.5 Å). Theresulting LRR errors ∆σ , RMSE and MAE for the points indicated by the arrows are, respectively, 1.0, 1.0 and 0.8 ppmfor 29Si; 2.6, 2.6 and 1.9 ppm for 17O; 1.5, 1.5 and 1.2 ppm for 23Na . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

S9 (a,b,c) Learning curves of LRR-SOAP σiso predictions for each NS composition ({10 to 50}NS) at all temperatures (0K,300K, 1000K, 1500K and 2000K) and 2 structures at 2000K as test set. The results are shown for each nucleus: 29Si (a),17O (b) and 23Na (c). (d) Learning Curves for all NS compositions mixed (test set is 5 structures 50NS at 300K). The LRRerror reported (∆σ) is the FWHM of the distribution of absolute LRR deviations from DFT-GIPAW calculated data. Thevertical arrows indicate the temperatures of the systems appended to the training/validation set for increasing values ofthe reference set size (x) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

S10 LRR-SOAP NMR isotropic magnetic shieldings predictions for 0K and 300K NS systems (100 structures of all compositionsfrom 10 to 50 NS mixed in the reference set). The grey oblique line indicates the exact matching between LRR and DFT-GIPAW estimations. The LRR error reported (∆σ) is the FWHM of the distribution of absolute LRR deviations fromDFT-GIPAW calculated data. The corresponding root-mean square errors (RMSE) and mean absolute errors (MAE) arealso reported in each case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

S11 LRR-SOAP errors distributions (grey Gaussians) of σiso test set predictions superimposed with the experimental NMRspectra (colored solid lines) of a typical (Na2O)23-(SiO2)77 glass. The LRR error reported (∆σ) is the FWHM of thedistribution of absolute LRR deviations from DFT-GIPAW calculated data . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

Journal Name, [year], [vol.], 1–16 | 3

Page 4: Supplementary Material (SM) NMR shifts in aluminosilicate ... · In the electronic form of supplementary materials, we provide all our input data in the form of ISOCAR files (DFT-GIPAW

S12 LRR-SOAP NMR σiso predictions errors (Left: ∆σ ; Right: mean absolute error) as a function of the SOAP descriptorssizes (Lmax) and cutoff radius (Rc). The reference set (30 structures) considered is composed of all NAS systems at roomtemperature (300K). ∆σ error is the FWHM of the distribution of absolute LRR deviations from DFT-GIPAW calculateddata. The resulting LRR errors ∆σ , RMSE and MAE for the points indicated by the arrows region are, respectively, 1.3,1.3 and 1.0 ppm for 29Si; 2.4, 2.4 and 1.8 ppm for 17O; 1.6, 1.6 and 1.1 ppm for 23Na; 1.6, 1.6 and 1.2 ppm for 27Al . . . 13

S13 Learning curves of LRR-SOAP σiso predictions in the case of sodo-aluminosilicates (each NAS system separately) for 29Si(a), 17O (b), 23Na (c) and 27Al (d), for a training/validation set including successively 0K, 300K, 1000K, 1500K and2000K. The test set is composed of 2 structures at 2000K. The LRR error reported (∆σ) is the FWHM of the distributionof absolute LRR deviations from DFT-GIPAW calculated data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

S14 LRR-SOAP σiso results for 0K and 300K NAS systems (58 structures of all compositions, NAS3, NAS4 and NAS5 mixedin the reference set). The grey oblique line indicates the exact matching between LRR and DFT-GIPAW estimations andthe bar plots show the LRR-SOAP test error distributions fitted by a Gaussian function (solid colored lines). The LRRerror reported (∆σ) is the FWHM of the distribution of absolute LRR deviations from DFT-GIPAW calculated data. Thecorresponding root-mean square errors (RMSE) and mean absolute errors (MAE) are also reported in each case . . . . . 15

S15 LRR-SOAP σiso predictions for 0K and 300K NAS systems (62 structures at all compositions) with a specific test set: fourlarger structures (two NAS3-L and two NAS4-L of 700 atoms) together with their small size counterpart (two NAS3 andtwo NAS4 of 350 atoms). The train/validation set is composed of the remaining 54 structures of all NAS compositions.The grey oblique line indicates the exact matching between LRR and DFT-GIPAW estimations and the bar plots show theLRR test error distributions fitted by a Gaussian function (solid colored lines). The LRR error reported (∆σ) is the FWHMof the distribution of absolute LRR deviations from DFT-GIPAW calculated data. The corresponding root-mean squareerrors (RMSE) and mean absolute errors (MAE) are also reported in each case . . . . . . . . . . . . . . . . . . . . . . . . 16

4 | 1–16Journal Name, [year], [vol.],

Page 5: Supplementary Material (SM) NMR shifts in aluminosilicate ... · In the electronic form of supplementary materials, we provide all our input data in the form of ISOCAR files (DFT-GIPAW

Table S1 Compositions and the cubic cell dimensions the glasses considered in this work: Vitreous silica (SiO2)120; Sodosilicates x={10to 50}NS corresponding to (Na2O)x(SiO2)100−x; Sodo-aluminosilicates NAS3, NAS4 and NAS5 corresponding to (Al2O3)25(Na2O)25(SiO2)50,(Al2O3)17.5(Na2O)17.5(SiO2)70 and (Al2O3)12.5(Na2O)12.5(SiO2)75, respectively. The column "Total" refers to the total number of structures consideredfor each glass composition

Glasses Compositions Number of atomic structures Edge29Si 17O 23Na 27Al 0K 300K 1000K 1500K 2000K Total length (Å)

SiO2 120 240 - - 20 19 20 20 20 99 17.589610NS 90 190 20 - 10 10 10 10 10 50 16.352220NS 80 180 40 - 10 10 10 10 10 50 16.151230NS 70 170 60 - 10 10 10 10 10 50 15.984640NS 60 160 80 - 10 10 10 10 10 50 15.861050NS 50 150 100 - 10 10 10 10 10 50 15.8193NAS3 50 200 50 50 10 10 10 10 10 50 16.7762NAS4 70 210 35 35 8 10 10 10 10 48 16.9511NAS5 75 200 25 25 10 10 10 10 10 50 16.6251

NAS3-L 100 400 100 100 - 2 - - - 2 21.1367NAS4-L 100 400 100 100 - 2 - - - 2 21.3570

Table S2 Sizes of the vectors of SOAP descriptors for each oxide glass considered for different numbers nmax (= Lmax) of basis functions used for thedescriptor construction. With Ns, the number of species in the system, the SOAP vector of descriptors is calculated as: (1/4)·Lmax·(Lmax+1)2·Ns·(Ns+1)

SOAP Number of elements in the vector of SOAP descriptorsparameter SiO2 Na2O-SiO2 (NS) Al2O3-Na2O-SiO2 (NAS)Lmax = 2 27 54 90Lmax = 3 72 144 240Lmax = 4 150 300 500Lmax = 5 270 540 900Lmax = 6 441 882 1470Lmax = 7 672 1344 2240Lmax = 8 972 1944 3240Lmax = 9 1350 2700 4500

Table S3 ML errors (∆σ - the FWHM of the ML absolute errors distribution, RMSE - root mean square error and MAE - mean absolute error) obtainedfor relaxed (0K) and room-temperature (300K) SiO2 glasses (39 structures in the reference set) NMR σiso predictions. The results are shown for bothnuclei for the different algorithms described in the computational part: linear ridge regression (LRR)63, kernel ridge regression with a linear kernel(LKRR) and a Gaussian kernel (GKRR)62, elastic net regression (ENR), random forest regression (RFR), Bayesian ridge regression (BRR), k-nearestneighbors (k-NN) and artificial neural networks (ANN)63,66. All calculations are performed with the SOAP descriptor. These estimation are averagedover a 10-fold CV for LRR, BRR, k-NN and ANN while a 3-fold CV is used for ENR, RFR, LKRR and GKRR. The values given between bracketscorrespond to the resulting CV standard deviations

29Si nucleus in SiO2 system

ML algorithmTrain set Validation set Test set

∆σ RMSE MAE ∆σ RMSE MAE ∆σ RMSE MAE

LRR 0.75(0.18) 0.75(0.18) 0.58(0.13) 0.88(0.27) 0.88(0.27) 0.63(0.14) 0.84(0.15) 0.84(0.15) 0.65(0.12)LKRR 0.82(0.18) 0.82(0.18) 0.63(0.13) 0.90(0.16) 0.90(0.16) 0.67(0.12) 0.85(0.16) 0.85(0.16) 0.66(0.13)GKRR 0.44(0.18) 0.44(0.18) 0.34(0.15) 0.91(0.34) 0.91(0.34) 0.58(0.20) 1.07(0.17) 1.07(0.17) 0.81(0.13)ENR 1.01(0.06) 1.01(0.06) 0.77(0.04) 1.06(0.09) 1.06(0.09) 0.80(0.05) 1.01(0.04) 1.01(0.04) 0.79(0.04)RFR 1.28(0.24) 1.28(0.24) 0.89(0.20) 3.01(0.88) 3.05(0.85) 2.25(0.75) 3.61(0.07) 3.64(0.05) 2.86(0.04)BRR 2.41(0.12) 2.41(0.12) 1.81(0.05) 2.47(0.17) 2.49(0.19) 1.89(0.08) 2.35(0.03) 2.37(0.05) 1.83(0.03)k-NN 0.00(0.00) 0.00(0.00) 0.00(0.00) 3.80(1.23) 3.92(1.27) 2.31(1.27) 5.30(0.12) 5.37(0.19) 4.29(0.17)ANN 2.95(0.34) 2.95(0.34) 2.22(0.22) 2.99(0.33) 3.02(0.36) 2.29(0.25) 2.92(0.21) 2.94(0.22) 2.28(0.17)

17O nucleus in SiO2 system

ML algorithmTrain set Validation set Test set

∆σ RMSE MAE ∆σ RMSE MAE ∆σ RMSE MAE

LRR 1.38(0.09) 1.38(0.09) 1.06(0.06) 1.49(0.11) 1.49(0.11) 1.13(0.08) 1.58(0.09) 1.58(0.09) 1.19(0.06)LKRR 1.39(0.02) 1.39(0.02) 1.07(0.01) 1.48(0.04) 1.49(0.04) 1.13(0.02) 1.50(0.01) 1.51(0.01) 1.15(0.01)GKRR 0.72(0.26) 0.72(0.26) 0.55(0.20) 1.23(0.20) 1.23(0.20) 0.86(0.21) 1.64(0.16) 1.64(0.16) 1.23(0.11)ENR 1.62(0.03) 1.62(0.03) 1.24(0.02) 1.70(0.04) 1.70(0.04) 1.30(0.02) 1.71(0.01) 1.71(0.01) 1.34(0.00)RFR 1.52(0.16) 1.52(0.16) 1.08(0.18) 3.73(0.79) 3.73(0.79) 2.65(0.69) 4.65(0.07) 4.65(0.07) 3.55(0.06)BRR 3.97(0.17) 3.97(0.17) 2.97(0.13) 4.09(0.27) 4.10(0.27) 3.09(0.20) 4.03(0.11) 4.03(0.11) 3.04(0.08)k-NN 0.00(0.00) 0.00(0.00) 0.00(0.00) 5.12(1.37) 5.14(1.38) 2.97(1.50) 7.34(0.13) 7.35(0.13) 5.77(0.11)ANN 3.03(0.11) 3.03(0.11) 2.29(0.07) 3.16(0.17) 3.16(0.17) 2.37(0.13) 3.25(0.17) 3.25(0.17) 2.37(0.07)

Journal Name, [year], [vol.], 1–16 | 5

Page 6: Supplementary Material (SM) NMR shifts in aluminosilicate ... · In the electronic form of supplementary materials, we provide all our input data in the form of ISOCAR files (DFT-GIPAW

CV

ST

D o

f σ i

so L

RR

pre

dic

tion

s (p

pm

)

Number of SiO2 structures in the reference set (x)

0K 300K 1000K 1500K 2000K 0K 300K 1000K 1500K 2000K

ARDF SOAP BPSF

29Si 17O

20 40 60 80 100 20 40 60 80 10010-2

100 100

10-1

101101

10-1

10-2

Fig. S1 LRR σiso predictions CV (cross-validation) standard deviations (STD), obtained from the 10-fold CV process applied to obtain the learningcurves reported in figure 1, as a function of the training/validation set size. The horizontal grey line indicates the 1 ppm standard deviation value. TheLRR error reported (∆σ ) is the FWHM of the distribution of absolute LRR deviations from DFT-GIPAW calculated data. The vertical arrows indicate thetemperatures of the systems appended to the training/validation set for increasing values of the reference set size (x). The test set is composed of twoSiO2 structures of the most challenging 2000K ones

Fig. S2 LRR-NMR isotropic magnetic shielding predictions for 0K and 300K SiO2 systems (39 structures) for each of the three descriptors consideredand the two nuclei involved in these structures. The oblique grey line indicates the exact matching between LRR predictions and DFT estimations. TheLRR error reported (∆σ ) is the FWHM of the distribution of absolute LRR deviations from DFT-GIPAW calculated data. The corresponding root-meansquare errors (RMSE) and mean absolute errors (MAE) are also reported in each case

6 | 1–16Journal Name, [year], [vol.],

Page 7: Supplementary Material (SM) NMR shifts in aluminosilicate ... · In the electronic form of supplementary materials, we provide all our input data in the form of ISOCAR files (DFT-GIPAW

29Si

17O

BPSF ARDF SOAP

-60 -70 -80 -90 -100 -110 -120 -13029Si NMR shift (ppm)

Exp. SiO2

Δσ = 1.3 ppm

-60 -70 -80 -90 -100 -110 -120 -13029Si NMR shift (ppm)

Exp. SiO2

Δσ = 1.9 ppm

-60 -70 -80 -90 -100 -110 -120 -13029Si NMR shift (ppm)

Exp. SiO2

Δσ = 0.7 ppm

120 100 80 60 40 20 0 -6017O NMR shift (ppm)

-20 -40 -80

Exp. SiO2

Δσ = 2.7 ppm

120 100 80 60 40 20 0 -6017O NMR shift (ppm)

-20 -40 -80

Exp. SiO2

Δσ = 3.9 ppm

29Si

17O

29Si

120 100 80 60 40 20 0 -6017O NMR shift (ppm)

-20 -40 -80

Exp. SiO2

Δσ = 1.5 ppm

17O

LRRerror

LRRerror

LRRerror

LRRerror

LRRerror

LRRerror

Fig. S3 LRR-SOAP test set error distributions (grey distributions) superimposed with the experimental NMR spectra (colored solid lines) of a typicalSiO2 glass. The LRR results are shown for the three different descriptors considered in this work. The LRR error reported (∆σ ) is the FWHM of thedistribution of absolute LRR deviations from DFT-GIPAW calculated data

NM

R σ

iso

LR

R e

rror

Δσ

(pp

m)

Number of elements in descriptors vectors

6

5

4

3

2

200 400 800 1000

ARDF LRR

BPSF LRR

SOAP LRR

0 600 1200

8

7 6

5

4

3

2

7

6 8 12 14 184 10 16Cutoff radius Rc of descriptors (A)

17O 17O

29Si

NM

R σ

iso

LR

R e

rror

Δσ

(pp

m)

29Si4

6

5

3

2

1

1400

200 400 800 1000 14000 600 1200Number of elements in descriptors vectors

NM

R σ

iso

LR

R e

rror

Δσ

(pp

m)

ARDF LRR

BPSF LRR

SOAP LRR

6 8 12 14 184 10 16Cutoff radius Rc of descriptors (A)

ARDF LRR (Size = 60 elements)

BPSF LRR (Size = 70 elements)

SOAP LRR (Size = 270 elements )

ARDF LRR (Size = 60 elements)

BPSF LRR (Size = 70 elements)

SOAP LRR (Size = 270 elements )

(Rc = 6 A)

(Rc = 6 A)

4

6

5

3

2

1

NM

R σ

iso

LR

R e

rror

Δσ

(pp

m)

Fig. S4 LRR-NMR σiso predictions as a function of the absolute (number of elements) descriptor vectors sizes (left panel) and cutoff radius (rightpanel) for the ARDF, BPSF and SOAP descriptors considered in the case of SiO2 system. The reference set includes only the relaxed SiO2 systems(20 structures) and the LRR error reported (∆σ ) is the FWHM of the distribution of absolute LRR deviations from DFT-GIPAW calculated data

Journal Name, [year], [vol.], 1–16 | 7

Page 8: Supplementary Material (SM) NMR shifts in aluminosilicate ... · In the electronic form of supplementary materials, we provide all our input data in the form of ISOCAR files (DFT-GIPAW

NM

R σ

iso

LR

R Δ

σ E

rror

(p

pm

)

SiO2 - Test Set- SOAP

29Si

Descriptor vector size (Lmax)

Cutoff

radiu

s (A)

Descriptor vector size (Lmax)

Descriptor vector size (Lmax)

NM

R σ

iso

LR

R Δ

σ E

rror

(p

pm

)

NM

R σ

iso

LR

R M

AE

Err

or (

pp

m)

NM

R σ

iso

LR

R M

AE

Err

or (

pp

m)

Descriptor vector size (Lmax)

Cutoff

radiu

s (A)

Cutoff

radiu

s (A)

Cutoff

radiu

s (A)

29Si

17O17O

Fig. S5 LRR NMR σiso predictions errors (Left: ∆σ ; Right: mean absolute error) as function of the SOAP descriptors sizes (Lmax) and cutoff radius (Rc).The reference set considered (19 structures) is composed of all SiO2 systems at room temperature (300K). ∆σ error is the FWHM of the distributionof absolute LRR deviations from DFT-GIPAW calculated data and MAE is the mean absolute error. The vertical arrows indicate the SOAP parameterschoice used in most of our work (Lmax = 5 and Rc = 5.5 Å). The resulting LRR errors ∆σ , RMSE and MAE for the points indicated by the arrows are,respectively, 0.7, 0.7 and 0.5 ppm for 29Si; 1.4, 1.4 and 1.1 ppm for 17O

8 | 1–16Journal Name, [year], [vol.],

Page 9: Supplementary Material (SM) NMR shifts in aluminosilicate ... · In the electronic form of supplementary materials, we provide all our input data in the form of ISOCAR files (DFT-GIPAW

17O

220200180170 240190 210 230

ENR-SOAP

Test set

Δσ = 1.6 ppm

17O

220200180170 240190 210 230

GKRR-SOAP

Test set

Δσ = 1.5 ppm

465460455450445440435430

29SiENR-SOAP

Test set

Δσ = 1.1 ppm

LRR-SOAP

Test set

Δσ = 1.4 ppm

NM

R σ

iso

ML

pre

dic

tion

s (p

pm

)

DFT-GIPAW NMR σiso evaluation (ppm)

29SiGKRR-SOAP

465460455450445440435430

Test set

Δσ = 0.9 ppmΔσ = 0.8 ppm

460

455

450

445

440

435

430

465

465460455450445440435430

Test set

29SiLRR-SOAP

220

200

180

170

240

190

210

230

220200180170 240190 210 230

RMSE = 0.8 ppm

MAE = 0.6 ppm

RMSE = 0.9 ppm

MAE = 0.7 ppm

RMSE = 1.1 ppm

MAE = 0.8 ppm

RMSE = 1.4 ppm

MAE = 1.1 ppm

RMSE = 1.5 ppm

MAE = 1.2 ppm

RMSE = 1.6 ppm

MAE = 1.2 ppm

Fig. S6 ML-SOAP vs DFT NMR isotropic magnetic shieldings results for the 300K SiO2 systems (19 structures) using LRR (left panels), GKRR (centralpanels) and ENR (right panels) for both species. The ML error reported (∆σ ) is the FWHM of the distribution of absolute ML predictions deviations fromDFT-GIPAW calculated data. The corresponding root-mean square errors (RMSE) and mean absolute errors (MAE) are also reported in each case

CV

ST

D o

f σ i

so M

L p

red

icti

ons

(p

pm

)

Number of SiO2 structures in the reference set (x)

0K 300K 1000K 1500K 2000K

LRR GKRR LKRR

29Si 17O

20 40 60 80 100 20 40 60 80 100

0K 300K 1000K 1500K 2000K

100

100

10-1

10-1

Fig. S7 The ML σiso predictions standard deviations, obtained from a 10-fold CV for LRR and 3-fold CV for GKRR and LKRR used to obtain the learningcurves reported in figure 5, as function of the reference set size. The test ML error reported (∆σ ) is the FWHM of the distribution of absolute LRRdeviations from DFT-GIPAW calculated data. The vertical arrows indicate the temperatures of the systems appended to the training/validation set forincreasing values of the reference set size (x). The test set is composed of two SiO2 structures of the most challenging 2000K ones

Journal Name, [year], [vol.], 1–16 | 9

Page 10: Supplementary Material (SM) NMR shifts in aluminosilicate ... · In the electronic form of supplementary materials, we provide all our input data in the form of ISOCAR files (DFT-GIPAW

NM

R σ

iso

LR

R Δ

σ E

rror

(p

pm

)NS - Test Set- SOAP

29Si

Descriptor vector size (Lmax)

Cutoff

radiu

s (A)

Descriptor vector size (Lmax)

Descriptor vector size (Lmax)

NM

R σ

iso

LR

R Δ

σ E

rror

(p

pm

)

NM

R σ

iso

LR

R M

AE

Err

or (

pp

m)

NM

R σ

iso

LR

R M

AE

Err

or (

pp

m)

Descriptor vector size (Lmax)

Cutoff

radiu

s (A)

Cutoff

radiu

s (A)

Cutoff

radiu

s (A)

29Si

17O

Descriptor vector size (Lmax)

Descriptor vector size (Lmax)

NM

R σ

iso

LR

R Δ

σ E

rror

(p

pm

)

NM

R σ

iso

LR

R M

AE

Err

or (

pp

m)

Cutoff

radiu

s (A)

Cutoff

radiu

s (A)

23Na

17O

23Na

Fig. S8 LRR NMR σiso predictions errors (Left: ∆σ ; Right: mean absolute error) as a function of the SOAP descriptors sizes (Lmax) and cutoff radius(Rc). The reference set considered (50 structures) is composed of all NS systems at room temperature (300K). ∆σ error is the FWHM of the distributionof absolute LRR deviations from DFT-GIPAW calculated data. The vertical arrows indicate the SOAP parameters choice used in most of our work (Lmax= 5 and Rc = 5.5 Å). The resulting LRR errors ∆σ , RMSE and MAE for the points indicated by the arrows are, respectively, 1.0, 1.0 and 0.8 ppm for29Si; 2.6, 2.6 and 1.9 ppm for 17O; 1.5, 1.5 and 1.2 ppm for 23Na

10 | 1–16Journal Name, [year], [vol.],

Page 11: Supplementary Material (SM) NMR shifts in aluminosilicate ... · In the electronic form of supplementary materials, we provide all our input data in the form of ISOCAR files (DFT-GIPAW

50 75 100 125 150 175 200 225 2501

2

3

4

5

6

7

8

Size of the Train+Validation set

NM

R σ

iso L

RR

Err

or (

pp

m)

All NS compositions

29Si17O23Na

10NSx & 20NSx 30NSx 40NSx 50NSxdc

ba

10 15 20 25 30 35 40 45 50

101

102

0K & 300K 1000K 1500K 2000K

NM

R σ

iso

LR

R E

rror

(p

pm

)

29Si 17O

23Na

10NS20NS30NS40NS50NS

Size of the Train+Validation set

101

2.101

0K & 300K 1000K 1500K 2000K

10 15 20 25 30 35 40 45 50

10NS20NS

30NS40NS50NS

Size of the Train+Validation set

NM

R σ

iso

LR

R E

rror

(p

pm

)

10 15 20 25 30 35 40 45 50

NM

R σ

iso

LR

R E

rror

(p

pm

) 103

102

101

Size of the Train+Validation set

0K & 300K 1000K 1500K 2000K

10NS20NS30NS40NS50NS

Fig. S9 (a,b,c) Learning curves of LRR-SOAP σiso predictions for each NS composition ({10 to 50}NS) at all temperatures (0K, 300K, 1000K, 1500Kand 2000K) and 2 structures at 2000K as test set. The results are shown for each nucleus: 29Si (a), 17O (b) and 23Na (c). (d) Learning Curves for all NScompositions mixed (test set is 5 structures 50NS at 300K). The LRR error reported (∆σ ) is the FWHM of the distribution of absolute LRR deviationsfrom DFT-GIPAW calculated data. The vertical arrows indicate the temperatures of the systems appended to the training/validation set for increasingvalues of the reference set size (x)

Journal Name, [year], [vol.], 1–16 | 11

Page 12: Supplementary Material (SM) NMR shifts in aluminosilicate ... · In the electronic form of supplementary materials, we provide all our input data in the form of ISOCAR files (DFT-GIPAW

Fig. S10 LRR-SOAP NMR isotropic magnetic shieldings predictions for 0K and 300K NS systems (100 structures of all compositions from 10 to 50NS mixed in the reference set). The grey oblique line indicates the exact matching between LRR and DFT-GIPAW estimations. The LRR error reported(∆σ ) is the FWHM of the distribution of absolute LRR deviations from DFT-GIPAW calculated data. The corresponding root-mean square errors (RMSE)and mean absolute errors (MAE) are also reported in each case

-60 -70 -80 -90 -100 -110 -120 -13029Si NMR shift (ppm)

Exp. (Na2O)23-(SiO2)77

Δσ = 1.0 ppm

120 100 80 60 40 20 0 -6017O NMR shift (ppm)

-20 -40 -80

Δσ = 2.7 ppm

Exp. (Na2O)23-(SiO2)77

40 0 -80-4023Na NMR shift (ppm)

Exp. (Na2O)23-(SiO2)77

Δσ = 1.4 ppm

LRRerror

LRRerror

LRRerror

Fig. S11 LRR-SOAP errors distributions (grey Gaussians) of σiso test set predictions superimposed with the experimental NMR spectra (colored solidlines) of a typical (Na2O)23-(SiO2)77 glass. The LRR error reported (∆σ ) is the FWHM of the distribution of absolute LRR deviations from DFT-GIPAWcalculated data

12 | 1–16Journal Name, [year], [vol.],

Page 13: Supplementary Material (SM) NMR shifts in aluminosilicate ... · In the electronic form of supplementary materials, we provide all our input data in the form of ISOCAR files (DFT-GIPAW

NAS - Test Set- SOAP

29Si

Descriptor vector size (Lmax)

NM

R σ

iso

LR

R M

AE

Err

or (

pp

m)

Descriptor vector size (Lmax)

Cutoff

radiu

s (A)

NM

R σ

iso

LR

R Δ

σ E

rror

(p

pm

)29Si

Descriptor vector size (Lmax)

Descriptor vector size (Lmax)

NM

R σ

iso

LR

R Δ

σ E

rror

(p

pm

)

NM

R σ

iso

LR

R M

AE

Err

or (

pp

m)

Cutoff

radiu

s (A)

17O17O

Descriptor vector size (Lmax)

Descriptor vector size (Lmax)

NM

R σ

iso

LR

R Δ

σ E

rror

(p

pm

)

NM

R σ

iso

LR

R M

AE

Err

or (

pp

m)

Cutoff

radiu

s (A)

Cutoff

radiu

s (A)

23Na23NaCuto

ff ra

dius (

A)

Cutoff

radiu

s (A)

27Al

Descriptor vector size (Lmax)

Descriptor vector size (Lmax)

NM

R σ

iso

LR

R Δ

σ E

rror

(p

pm

)

NM

R σ

iso

LR

R M

AE

Err

or (

pp

m)

Cutoff

radiu

s (A)

Cutoff

radiu

s (A)

27Al

Fig. S12 LRR-SOAP NMR σiso predictions errors (Left: ∆σ ; Right: mean absolute error) as a function of the SOAP descriptors sizes (Lmax) and cutoffradius (Rc). The reference set (30 structures) considered is composed of all NAS systems at room temperature (300K). ∆σ error is the FWHM of thedistribution of absolute LRR deviations from DFT-GIPAW calculated data. The resulting LRR errors ∆σ , RMSE and MAE for the points indicated by thearrows region are, respectively, 1.3, 1.3 and 1.0 ppm for 29Si; 2.4, 2.4 and 1.8 ppm for 17O; 1.6, 1.6 and 1.1 ppm for 23Na; 1.6, 1.6 and 1.2 ppm for 27Al

Journal Name, [year], [vol.], 1–16 | 13

Page 14: Supplementary Material (SM) NMR shifts in aluminosilicate ... · In the electronic form of supplementary materials, we provide all our input data in the form of ISOCAR files (DFT-GIPAW

dc

ba

10 15 20 25 30 35 40 45 50

1000K 1500K 2000K0K & 300K

NM

R σ

iso

LR

R E

rror

(p

pm

) 29Si 17O

23Na 27Al

NAS3NAS4NAS5

Size of the Train+Validation set

1000K 1500K 2000K0K & 300K

10 15 20 25 30 35 40 45 50

NAS3NAS4NAS5

Size of the Train+Validation set

NM

R σ

iso

LR

R E

rror

(p

pm

)

10 15 20 25 30 35 40 45 50

NM

R σ

iso

LR

R E

rror

(p

pm

)

Size of the Train+Validation set

1000K 1500K 2000K0K & 300K

NAS3NAS4NAS5

NM

R σ

iso

LR

R E

rror

(p

pm

)

Size of the Train+Validation set

1000K 1500K 2000K

NAS3NAS4NAS5

10 15 20 25 30 35 40 45 50

6.100

5.100

4.100

0K & 300K

2.101

101

101

6.100

4.100

3.100

4.100

6.100

Fig. S13 Learning curves of LRR-SOAP σiso predictions in the case of sodo-aluminosilicates (each NAS system separately) for 29Si (a), 17O (b), 23Na(c) and 27Al (d), for a training/validation set including successively 0K, 300K, 1000K, 1500K and 2000K. The test set is composed of 2 structures at2000K. The LRR error reported (∆σ ) is the FWHM of the distribution of absolute LRR deviations from DFT-GIPAW calculated data

14 | 1–16Journal Name, [year], [vol.],

Page 15: Supplementary Material (SM) NMR shifts in aluminosilicate ... · In the electronic form of supplementary materials, we provide all our input data in the form of ISOCAR files (DFT-GIPAW

Fig. S14 LRR-SOAP σiso results for 0K and 300K NAS systems (58 structures of all compositions, NAS3, NAS4 and NAS5 mixed in the referenceset). The grey oblique line indicates the exact matching between LRR and DFT-GIPAW estimations and the bar plots show the LRR-SOAP test errordistributions fitted by a Gaussian function (solid colored lines). The LRR error reported (∆σ ) is the FWHM of the distribution of absolute LRR deviationsfrom DFT-GIPAW calculated data. The corresponding root-mean square errors (RMSE) and mean absolute errors (MAE) are also reported in eachcase

Journal Name, [year], [vol.], 1–16 | 15

Page 16: Supplementary Material (SM) NMR shifts in aluminosilicate ... · In the electronic form of supplementary materials, we provide all our input data in the form of ISOCAR files (DFT-GIPAW

Fig. S15 LRR-SOAP σiso predictions for 0K and 300K NAS systems (62 structures at all compositions) with a specific test set: four larger structures(two NAS3-L and two NAS4-L of 700 atoms) together with their small size counterpart (two NAS3 and two NAS4 of 350 atoms). The train/validation setis composed of the remaining 54 structures of all NAS compositions. The grey oblique line indicates the exact matching between LRR and DFT-GIPAWestimations and the bar plots show the LRR test error distributions fitted by a Gaussian function (solid colored lines). The LRR error reported (∆σ ) isthe FWHM of the distribution of absolute LRR deviations from DFT-GIPAW calculated data. The corresponding root-mean square errors (RMSE) andmean absolute errors (MAE) are also reported in each case

16 | 1–16Journal Name, [year], [vol.],