Big-Data Analytics in Materials Science
Luca M. Ghiringhelli, Fritz Haber Institute
Hands-on workshop "Density-functional theory and beyond: First-principles simulations of molecules and materials"
Berlin, Germany, July 13 - July 23, 2015
Data, data, data: big data
Big-data challenge, the four V's:
  Volume (amount of data)
  Variety (heterogeneity of form and meaning of data)
  Veracity (uncertainty of data quality)
  Velocity
High-throughput screening: query and read out what was stored
Shouldn't we do more?
Analysis
  Identify (so far) hidden correlations
  Identify which materials should be studied next as most promising candidates
  Identify anomalies
We have a dream
From the periodic table of the elements to a chart of materials:
Organize materials according to their properties and functions, e.g.
figure of merit of thermoelectrics (as function of T )
turnover frequency of catalytic materials (as function of T and p)
efficiency of photovoltaic systems
Big Data Analysis
Training set: Calculate properties and functions P_i for many materials i (density-functional theory).
Descriptor: Find the appropriate descriptor d_i, build a table: | i | d_i | P_i |
Learning: Find the function P_SL(d) for the table; do cross validation (statistical learning).
Fast prediction: Calculate properties and functions for new values of d (new materials).
Learning → Discovery
Suppose we know the trajectories of all planets in the solar system, either from accurate observations (experiment) or by numerically integrating the equations of general relativity (calculations at the highest level of theory).
(Orbital period)² = C (orbit's major axis)³
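As a minimal sketch of how such a law could be "learned" from tabulated data, the block below fits a power law T = k * a^n to approximate semi-major axes and orbital periods of six planets. The numerical values are rounded textbook figures and the log-log least-squares fit is an illustrative choice, not a method taken from the slides.

```python
import numpy as np

# Approximate semi-major axes (AU) and orbital periods (years),
# Mercury to Saturn; rounded textbook values used for illustration only.
a = np.array([0.387, 0.723, 1.000, 1.524, 5.203, 9.537])
T = np.array([0.241, 0.615, 1.000, 1.881, 11.862, 29.457])

# Assume a power law T = k * a**n; it becomes a straight line in log-log space.
n, log_k = np.polyfit(np.log(a), np.log(T), 1)
k = np.exp(log_k)

print(f"fitted exponent n = {n:.3f}, prefactor k = {k:.3f}")
```

An exponent close to 3/2 corresponds to (orbital period)² being proportional to (major axis)³.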
Ingredients of a Big-Data project
http://www.bbdc.berlin
Databases, platforms
“Just Databases”
  ICSD: Inorganic Crystal Structure DB, http://icsd.fizkarlsruhe.de
  COD: Crystallography Open DB, http://www.crystallography.net/
  ESP: Electronic Structure Project, http://gurka.fysik.uu.se/ESP/
  CCCBDB: Comp. Chemistry Comparison and Benchmark DB, http://cccbdb.nist.gov/
Databases + analytic tools
  Materials Project, http://www.materialsproject.org
  AFLOW: Automatic Flow for Materials Discovery, http://aflowlib.org
  AiiDA: Automated interactive infrastructure and DB for Atomistic Simulations, http://www.aiida.net
  OQMD: Open Quantum Materials DB, http://oqmd.org/
  NoMaD: Novel Materials Discovery, http://nomadrepository.eu, http://nomadcoe.eu
Data from many different codes → need for a conversion layer
Outline
Descriptors and fingerprints
Back to machine learning: Automatic descriptor search
Linear and nonlinear dimensionality reduction
Feature selection
Some words on causal descriptor-property relationship
Descriptors
Can we predict an optimal material for a complex process (e.g. heterogeneous catalysis)
by looking at a simple (set of) descriptor(s)?
Many parameters enter into the kinetics of a given reaction: The energy of all intermediates and of the transition states separating them
Even for simple reactions, this easily gives 10–20 energy variables for a specific catalytic reaction.
Are they all uncorrelated?
Descriptor for adsorption energies
[Figure panels: CH, CH2, CH3]
Abild-Pedersen, …, and J. K. Nørskov, PRL 99, 016105 (2007)
Nørskov, Abild-Pedersen, Studt, Bligaard, PNAS 108, 937 (2011)
CO + 3H2 → CH4 + H2O
(dissociative adsorption of CO)
Descriptor for turnover frequency (TOF)
C → CHx
O → OH
C and O → CO(ads)
CO(ads) → CO(TS)
C → CHx(TS)
O → OH(TS)
Stepped 221 surfaces
Descriptor-based search for a new catalyst
Selective hydrogenation of acetylene (C2H2) in the presence of an excess of ethylene (C2H4) on 111 surfaces.
Unwanted: ethane, C2H6
(Genetic-like) fingerprint
Gaussian Kernel Ridge Regression
Data: 175 linear, 4-block periodic polymers. 7 building blocks: CH2, SiF2, SiCl2, GeF2, GeCl2, SnF2, SnCl2
Descriptor: 20 dimensions [# building blocks of type i, of i-i pairs, of i-i-i triplets]
Pilania, Wang, …, and Ramprasad, Scientific Reports 3, 2810 (2013). DOI: 10.1038/srep02810
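A minimal sketch of a count-based fingerprint for a periodic chain of building blocks, reading the bracketed description literally: for each block type i, the fraction of sites of type i, of adjacent i-i pairs, and of i-i-i triplets. The function name `fingerprint`, the normalization by chain length, and the ordering of the entries are assumptions; this literal reading gives 21 numbers for 7 block types, so it is not claimed to reproduce the exact 20-dimensional descriptor of the cited paper.

```python
BLOCKS = ["CH2", "SiF2", "SiCl2", "GeF2", "GeCl2", "SnF2", "SnCl2"]

def fingerprint(chain):
    """Count-based fingerprint of a periodic chain of building blocks.

    For every block type i, returns the fraction of sites occupied by i,
    the fraction of adjacent i-i pairs, and the fraction of i-i-i triplets.
    Periodic boundary conditions are applied with the modulo operator.
    """
    n = len(chain)
    singles, pairs, triplets = [], [], []
    for b in BLOCKS:
        singles.append(sum(chain[k] == b for k in range(n)) / n)
        pairs.append(sum(chain[k] == b and chain[(k + 1) % n] == b
                         for k in range(n)) / n)
        triplets.append(sum(chain[k] == b and chain[(k + 1) % n] == b
                            and chain[(k + 2) % n] == b
                            for k in range(n)) / n)
    return singles + pairs + triplets

# Hypothetical 4-block repeat unit.
fp = fingerprint(["CH2", "CH2", "SiF2", "CH2"])
```

Two chains that differ only by a cyclic shift of the repeat unit get the same fingerprint, which is the kind of invariance one would want from such a descriptor.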
Isayev, …, and Curtarolo, Chemistry of Materials 27, 735 (2015)
Machine learning: (yet) a(nother) classification
Supervised learning: d → P mapping
  Support vector machines
  Neural networks
  Decision trees
  Genetic programming (symbolic regression)
  Kernel ridge regression
  Compressed sensing
Unsupervised learning: d → d', find patterns / trends
  Principal-component analysis
  Nonlinear dim. reduction, sketch map
  Clustering
Principal component analysis
Pearson, K. "On Lines and Planes of Closest Fit to Systems of Points in Space". Philosophical Magazine 2, 559 (1901)
Orthonormal transformation of coordinates that converts a set of (possibly) linearly correlated coordinates into a new set of linearly uncorrelated components (called principal or normal components), such that the first component has the largest variance and each subsequent component has the largest variance subject to being orthogonal to all the preceding components.
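The definition above can be sketched as an eigendecomposition of the covariance matrix. The function name `pca` and the synthetic three-column data set are illustrative assumptions, not data from the cited works.

```python
import numpy as np

def pca(X):
    """Return (components, variances, scores) for data X of shape (n_samples, n_features)."""
    Xc = X - X.mean(axis=0)                  # center each coordinate
    cov = np.cov(Xc, rowvar=False)           # covariance matrix of the coordinates
    eigval, eigvec = np.linalg.eigh(cov)     # symmetric eigenproblem, ascending eigenvalues
    order = np.argsort(eigval)[::-1]         # sort by decreasing variance
    eigval, eigvec = eigval[order], eigvec[:, order]
    scores = Xc @ eigvec                     # data expressed in the principal-component basis
    return eigvec.T, eigval, scores

# Synthetic data: the second column is almost a multiple of the first.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2.0 * t + 0.1 * rng.normal(size=200), rng.normal(size=200)])
components, variances, scores = pca(X)
```

Each row of `components` is a linear combination of (possibly all) the initial coordinates, and the columns of `scores` are mutually uncorrelated.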
From 8: r_s, r_p, E_s/Z_v, E_p/Z_v, for A and B
(Linear) dimensionality reduction: principal components
Saad, …, Chelikowsky, and Andreoni, PRB 85, 104104 (2012)
[Figure: principal components 1, 2, 3 on an arbitrary (linear) scale]
What's on the axes?
Linear combination of (possibly all) the initial dimensions
(Nonlinear) dimensionality reduction
Proximity matching
Sketch-map algorithm
Minimization of the stress function (for a set of landmark points)
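A minimal sketch of the quantity being minimized, assuming the generic form of a stress function: the sum over landmark pairs of the squared mismatch between (transformed) distances in the original space and in the low-dimensional map. With the identity `transform` this is the plain metric multidimensional-scaling stress; sketch-map is usually described as applying sigmoid-like transforms to both sets of distances, but their exact form and any pair weights are not given here, so `transform` is left as a hypothetical plug-in.

```python
import numpy as np

def pairwise_distances(Y):
    """Euclidean distance matrix between the rows of Y."""
    diff = Y[:, None, :] - Y[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress(X_high, x_low, transform=lambda r: r):
    """Sum over landmark pairs of [f(R_ij) - f(r_ij)]^2.

    R_ij are distances in the original space, r_ij in the low-dimensional map.
    """
    R = pairwise_distances(X_high)
    r = pairwise_distances(x_low)
    iu = np.triu_indices(len(X_high), k=1)   # count each pair once
    return float(((transform(R[iu]) - transform(r[iu])) ** 2).sum())

# Four landmarks lying in a plane embedded in 3D.
X_high = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]])
exact = X_high[:, :2]      # 2D map that preserves all distances
squashed = X_high[:, :1]   # 1D map that collapses one direction
```

The 2D map has zero stress, while the 1D map does not; an optimizer would move the low-dimensional points to reduce this value.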
Sketch map of folded landscape of Ala12
Is pointwise (relative) free energy invariant upon dimensionality reduction?
No, only for regions:
From clusters to defects in bulk
Dimensionality reduction with meaningful axes?
What's on the axes?
What about a dimensionality reduction (call it feature selection)
in which the (best) low-dimensional representation is selected
among (many, many) given candidates?
It is time for: compressed sensing
Reference: L. M. Ghiringhelli, J. Vybiral, S. V. Levchenko, C. Draxl, and M. Scheffler, Phys. Rev. Lett. 114, 105503 (2015)
Don't overlook the Supplementary Information!
A famous example: classification of octet binaries
The chemical space
E_h depends on the first-nearest-neighbour distance.
C depends on the electrical conductivity.
E_h² + C² is related to the (average) band gap.
Both are properties of the binary compound.
Note: more than classification, the distance from the dividing line has a meaning.
Mathematical formulation of the problem
Figure of merit to be optimized: (linear) ridge regression
Regularization (prefer “lower complexity” in the solution)
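The formula itself did not survive in this transcript. A standard way to write the ridge-regression figure of merit is given below, assuming P is the vector of target properties, D the matrix whose columns are the features, c the vector of coefficients, and λ the regularization strength (the names P and c are assumptions):

```latex
\[
\hat{\mathbf{c}} \;=\; \operatorname*{argmin}_{\mathbf{c}}
\;\lVert \mathbf{P} - \mathbf{D}\,\mathbf{c} \rVert_2^2
\;+\; \lambda\,\lVert \mathbf{c} \rVert_2^2
\]
```

The first term measures the fit to the data; the second penalizes large coefficients, which is one way to prefer "lower complexity" in the solution.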
Inadequacy of the “natural” descriptor
λ = 1E4; σ = 0.1
Gaussian Kernel Ridge Regression
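A minimal sketch of Gaussian kernel ridge regression, assuming the common kernel convention exp(-|d - d'|² / (2σ²)) and the closed-form solution (K + λI)α = P. The function names, the toy 1D data, and the value of λ used here are illustrative assumptions and are not taken from the slides.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """K_ij = exp(-|a_i - b_j|^2 / (2 sigma^2)) for rows a_i of A and b_j of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def krr_fit(D, P, lam, sigma):
    """Solve (K + lam * I) alpha = P for the regression weights alpha."""
    K = gaussian_kernel(D, D, sigma)
    return np.linalg.solve(K + lam * np.eye(len(D)), P)

def krr_predict(D_train, alpha, D_new, sigma):
    """P(d) = sum_i alpha_i k(d, d_i)."""
    return gaussian_kernel(D_new, D_train, sigma) @ alpha

# Toy data: a smooth 1D property sampled at 20 descriptor values.
d = np.linspace(0.0, 1.0, 20)[:, None]
P = np.sin(2.0 * np.pi * d[:, 0])
alpha = krr_fit(d, P, lam=1e-6, sigma=0.1)
P_fit = krr_predict(d, alpha, d, sigma=0.1)
P_new = krr_predict(d, alpha, np.array([[0.25]]), sigma=0.1)
```

The two hyperparameters play different roles: σ sets the length scale over which two descriptor values are considered similar, and λ controls how strongly the fit is regularized.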
Wish list for a descriptor
Mathematical formulation of the problem: sparsity
A more complex regularization: NP-hard!
LASSO: convex problem, equivalent to the NP-hard problem if the features (columns of D) are uncorrelated
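A minimal sketch of LASSO, assuming the objective 0.5 * ||P - D c||² + λ ||c||₁ and solving it by cyclic coordinate descent with soft thresholding. The scaling of the objective, the function names, and the synthetic data are assumptions made for illustration; production code would normally rely on a tested library solver.

```python
import numpy as np

def soft_threshold(x, t):
    """Shrink x toward zero by t; values with |x| <= t become exactly zero."""
    return np.sign(x) * max(abs(x) - t, 0.0)

def lasso_cd(D, P, lam, n_sweeps=200):
    """Minimize 0.5 * ||P - D c||_2^2 + lam * ||c||_1 by cyclic coordinate descent."""
    n_features = D.shape[1]
    c = np.zeros(n_features)
    col_sq = (D ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(n_features):
            residual = P - D @ c + D[:, j] * c[j]   # residual ignoring feature j
            rho = D[:, j] @ residual
            c[j] = soft_threshold(rho, lam) / col_sq[j]
    return c

# Synthetic test: only features 2 and 7 enter the property.
rng = np.random.default_rng(0)
D = rng.normal(size=(50, 10))
c_true = np.zeros(10)
c_true[[2, 7]] = [1.5, -2.0]
P = D @ c_true + 0.01 * rng.normal(size=50)
c_hat = lasso_cd(D, P, lam=5.0)
selected = np.flatnonzero(np.abs(c_hat) > 1e-8)
```

Because the columns of this synthetic D are nearly uncorrelated, the ℓ1 penalty drives the irrelevant coefficients exactly to zero, at the price of shrinking the surviving ones slightly.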
LASSO, compressed/ive sensing in Materials Science
23 primary features
Find a descriptor AND an accurate evaluation of the difference in energy between the RS and ZB crystal structures for all (82) AB octet semiconductors.
ΔE = ΔE(d)
Possibly identify a 2D descriptor which gives a “nice” representation of the materials in a plane
The task
Primary (atomic) features
Example: Sn (tin)
[Figure: KS levels [eV] (valence s, valence p (HOMO), LUMO) and radial probability densities vs. radius [Å] for valence s and valence p, marking the radius at maximum, the average radius, and the turning point]
LASSO
23 primary features
Only linear combinations of primary features: not enough.
“Best” performance: 21 selected features (RMSE = 0.10 eV).
Systematic construction of the feature space
[Diagram: operator trees built on the primary features (Radius 1, Radius 2, KS level 1, KS level 2), using the operations +, | x - y |, /, exp(x), (x)^n]
In practice: formalism borrowed from symbolic regression
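A minimal sketch of how such a feature space could be generated, assuming sums and absolute differences are only formed within a group of like quantities (radii with radii, levels with levels), followed by ratios, exponentials, and squares. The function name `build_features`, the feature names, the numerical values, and the restriction to one ratio per pair are all illustrative assumptions.

```python
from itertools import combinations
import math

def build_features(groups):
    """Combine primary features into sums, |differences|, ratios, exp and squares."""
    features = {}
    for group in groups.values():
        features.update(group)
        # Sums and absolute differences only within a group of like quantities.
        for (na, a), (nb, b) in combinations(group.items(), 2):
            features[f"({na}+{nb})"] = a + b
            features[f"|{na}-{nb}|"] = abs(a - b)
    base = list(features.items())
    # Ratios between any two of the features built so far (one order per pair).
    for (na, a), (nb, b) in combinations(base, 2):
        if b != 0:
            features[f"{na}/{nb}"] = a / b
    # Unary operators applied to the same base set.
    for name, x in base:
        features[f"exp({name})"] = math.exp(x)
        features[f"({name})^2"] = x ** 2
    return features

# Hypothetical values for one material.
groups = {
    "radius": {"r1": 1.0, "r2": 2.0},
    "ks_level": {"e1": -3.0, "e2": -4.0},
}
candidates = build_features(groups)
```

Even this small recipe turns 4 primary features into a few dozen candidates, which illustrates how quickly the candidate space grows when operators are nested.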
Systematic construction of the feature space: EUREQA
Descriptor (candidates: 242)
  a. The largest distance between a H atom and its nearest Si neighbor
  b. The shortest distance between a Si atom and its sixth-nearest Si neighbor
  c. The maximum bond valence sum on a Si atom
  d. The smallest value for the fifth-smallest relative bond length around a Si atom
  e. The fourth-shortest distance between a Si atom and its eighth-nearest neighbor
  f. The second-shortest distance between a Si atom and its fifth-nearest neighbor
  g. The third-shortest distance between a Si atom and its sixth-nearest neighbor
  h. The H-Si nearest-neighbor distance for the hydrogen atom with the fourth-smallest difference between the distances to the two Si atoms nearest to a H atom
T. Müller et al., PRB 89, 115202 (2014)
Data: ~1000 amorphous structures of 216 Si atoms (saturated)
Property: hole trap depth
EUREQA: genetic programming software. Global optimization (genetic algorithm).
Schmidt M., Lipson H., Science, Vol. 324, No. 5923 (2009)
Finding the descriptor
Systematic construction of the feature space: from the primary features to a total of ~10000 features
“Extended” LASSO: the features are correlated, so the first 25-30 features selected by LASSO when scanning λ from large to small values are retained; then all single features, all pairs, all triplets, ... among them are tested separately via linear regression (the NP-hard problem, but with only 25-30 features). This gives the 1D, 2D, 3D, ... descriptors.
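A minimal sketch of the second stage, the exhaustive search over small subsets of pre-screened features by ordinary least squares. The function name `best_subset`, the intercept term, and the synthetic data are assumptions; in the scheme described above the `candidates` would be the 25-30 features retained from the LASSO scan rather than all columns.

```python
from itertools import combinations
import numpy as np

def best_subset(D, P, candidates, dim):
    """Exhaustive least-squares search over all `dim`-tuples of candidate columns."""
    best = (np.inf, None)
    for subset in combinations(candidates, dim):
        A = np.column_stack([D[:, list(subset)], np.ones(len(P))])  # features + intercept
        coef, *_ = np.linalg.lstsq(A, P, rcond=None)
        rmse = np.sqrt(np.mean((P - A @ coef) ** 2))
        if rmse < best[0]:
            best = (rmse, subset)
    return best

# Synthetic data: the property depends on columns 4 and 11 only.
rng = np.random.default_rng(1)
D = rng.normal(size=(82, 30))
P = 0.8 * D[:, 4] - 1.2 * D[:, 11] + 0.3
rmse, pair = best_subset(D, P, candidates=range(30), dim=2)
```

With 30 candidates there are only 435 pairs and 4060 triplets to test, which is why the pre-screening step makes the otherwise intractable search feasible.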
Performance of the descriptors: accuracy, validation
[Figure: training error and validation error vs. “complexity”]
Leave-10%-out cross validation
Errors are energies, in eV
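A minimal sketch of leave-10%-out cross validation for a linear model: repeatedly hold out a random 10% of the data, fit on the rest, and average the training and validation errors. The function name, the number of repeats, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def leave_fraction_out_cv(D, P, fraction=0.1, n_repeats=50, seed=0):
    """Average training and validation RMSE of a linear fit over random splits."""
    rng = np.random.default_rng(seed)
    n = len(P)
    n_val = max(1, int(round(fraction * n)))
    A = np.column_stack([D, np.ones(n)])            # linear model with intercept
    train_err, val_err = [], []
    for _ in range(n_repeats):
        perm = rng.permutation(n)
        val, train = perm[:n_val], perm[n_val:]     # held-out and fitting subsets
        coef, *_ = np.linalg.lstsq(A[train], P[train], rcond=None)
        train_err.append(np.sqrt(np.mean((P[train] - A[train] @ coef) ** 2)))
        val_err.append(np.sqrt(np.mean((P[val] - A[val] @ coef) ** 2)))
    return float(np.mean(train_err)), float(np.mean(val_err))

# Synthetic data: a 2D descriptor plus noise of standard deviation 0.1.
rng = np.random.default_rng(2)
D = rng.normal(size=(82, 2))
P = D @ np.array([1.0, -0.5]) + 0.1 * rng.normal(size=82)
train_rmse, val_rmse = leave_fraction_out_cv(D, P)
```

Plotting both errors against model complexity is what reveals overfitting: the training error keeps decreasing while the validation error eventually stops improving.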
Performance of the descriptors: accuracy
Convergence with dimensionality of the descriptor (max absolute error)
[Figure panels: 1D, 2D, 3D, 5D]
Two-dimensional descriptor
[Figure: color scale with ticks at -0.2 eV, 0, 0.2 eV, 0.45 eV, 1.0 eV]
A few words on causality
There are four possibilities (types of causality relationship) behind P(d):
1. d → P : P “listens” to d
2. P → d : d “listens” to P
3. A → d and A → P : There is no direct connection between d and P, but d and P both “listen” to a third “actuator”
4. There is no direct connection between d and P, but they have a common effect that listens to both and screams: “I occurred” [Judea Pearl] (Berkson paradox)
[If the admission criteria to a certain graduate school call for either high grades as an undergraduate or special musical talents, then these two attributes will be found to be correlated (negatively) in the student population of that school, even if these attributes are uncorrelated in the population at large (selection bias). Indeed, students with low grades are likely to be exceptionally gifted in music, which explains their admission to graduate school.]
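The admission example can be checked with a short simulation. This is a minimal sketch assuming both attributes are independent standard-normal scores and that admission requires either score to exceed 1; the threshold and sample size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
grades = rng.normal(size=n)   # independent attributes in the population at large
music = rng.normal(size=n)

# Admission if either criterion is met (the common effect of both attributes).
admitted = (grades > 1.0) | (music > 1.0)

corr_population = np.corrcoef(grades, music)[0, 1]
corr_admitted = np.corrcoef(grades[admitted], music[admitted])[0, 1]

print(f"population: {corr_population:+.3f}, admitted: {corr_admitted:+.3f}")
```

The correlation is essentially zero in the full population and clearly negative among the admitted students, although neither attribute influences the other.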
We are not able to write down a scientific law that connects the descriptor directly with the total-energy difference between RS and ZB structures.
However, Z_A and Z_B determine these descriptors, and Z_A and Z_B determine the many-body Hamiltonians and the total-energy difference.
Quantitative analysis: effect of noise
Features multiplied by Gaussian noise
The same 2D descriptor is found:
Property perturbed by adding uniform noise
The same 2D descriptor is found:
The effect of noise
Summary
Big data for Materials Science: infrastructures
Descriptors and fingerprints
Automatic descriptor search:
  Linear and nonlinear dimensionality reduction
    Principal component analysis
    Sketch map
  Feature selection
    LASSO (compressed sensing)
    Symbolic regression
Some words on causal descriptor-property relationship
Cross-validation and stability analysis