Finding patterns, correlations, and descriptors in ...
Transcript of Finding patterns, correlations, and descriptors in ...
Finding patterns, correlations, and descriptors in materials data using subgroup discovery and compressed sensing
Bryan R. GoldsmithUniversity of Michigan, Ann Arbor
Department of Chemical Engineering
Christopher J. Bartel and Charles MusgraveCU Boulder, Department of Chemical and Biological Engineering
Chris Sutton, Runhai Ouyang, Luca M. Ghiringhelli, Matthias SchefflerFritz Haber Institute of the Max Planck Society, Theory Department
Mario Boley and Jilles VreekenMax Planck Institute for Informatics
Predicting advanced materials requires understanding the mechanisms underlying their function
[1] K. Kang et al. Science 311, 977 (2006)
Battery materials
Catalysts
[1]
Identifying physically meaningful descriptors can aid materials discovery
Screen new catalystsIncrease understandingDescriptor = function(atomic or material features)
Descriptor β Property
Ma, Xianfeng and HongliangXin PRL 118.3 (2017) 036101.
NΓΈrskov, Jens K., et al. PNAS108.3 (2011) 937-943.
Identifying physically meaningful descriptors can aid materials discovery
Screen new catalystsIncrease understandingDescriptor = function(atomic or material features)
Descriptor β Property
Ma, Xianfeng and HongliangXin PRL 118.3 (2017) 036101.
NΓΈrskov, Jens K., et al. PNAS108.3 (2011) 937-943.
Machine learning tools can acceleratethe discovery of descriptors
This talk focuses on two data-analytics tools to find descriptors of materials
1. Compressed sensing to find low-dimensional descriptors- Perovskite oxides and halides
2. Subgroup discovery to find local patterns and their descriptions- Gold clusters in the gas phase (sizes 5-14 atoms)- Octet binary (AB) semiconductors
3. Future work: Compressed sensing and subgroup discovery for catalysis
Compressed sensing allows the construction of sparse models with high accuracy
Original image Sparse in the basis set
Recovered with 10% measurements
Emmanuel Candès, Terence Tao, and David Donoho
Part 1. Compressed sensing to find interpretable descriptors
min π½π½ 0 subject to π¦π¦ = π«π«π½π½ coefficients
Matrix of the materialsβ features
Targetproperty
Compressed sensing allows construction of sparse models with high accuracy
ππππ-norm: total # ofnon-zero coefficients
Ideally use ππ0-norm minimization
Emmanuel Candès, Terence Tao, and David Donoho
ππ0-norm minimization is too expensive to perform for large feature matrix D!
οΏ½ΜοΏ½π½πΏπΏπΏπΏπΏπΏπΏπΏπΏπΏ(Ξ») = argminπ½π½
12 π¦π¦ β π«π«π½π½ 2
2 + Ξ» π½π½ 1
Instead often minimize ππ1-norm (LASSO)as approximation of ππ0-norm
ππππ-norm:Sum of absolute value of coefficients
Root mean squared error
Regularizationparameter
Example of LASSO+ππ0 : Find a descriptor that predicts the crystal structure energy differences between the 82 octet AB compounds
LASSO+ππ0: L. M. Ghiringhelli, J. Vybiral, S. V. Levchenko, C. Draxl, M. Scheffler, PRL 114, 105503 (2015)
Unfortunately LASSO has stability issues for a huge feature space of correlated features
This issue has been solved recently using theSure Independence Screening Sparsifying Operator (SISSO) algorithm
R. Ouyang et al., βSISSO: a compressed-sensing method for systematically identifying efficient physical models of materials propertiesβ arxiv (2017)
Runhai Ouyang
Managing high dimensional and correlated feature spaces by combining screening and compressed sensing
SISSO overviewStep 1. Systematically construct a huge feature space
Primary features
Feature transformation
Huge feature space (billions)operator set
οΏ½π π = {+,β,Γ,Γ·, exp, ππππππ,β1 ,2 ,3 , π π π π π π π π , β }
Step 2. Select top ranked features using Sure Independent Screening (SIS)[1]
Sure independent screening Select the N largest components of π«π«π»π»y
π¦π¦: target propertyπ«π«: matrix of the features for each material
[1] J. Fan and J. Lv, J. R. Statist. Soc. B 70, 849 (2008)
Results in a subspace of N features that are most correlated with the property of interest
Step 3. Iteratively apply sure independent screeningwith a sparse approximation algorithm
(billions)Huge feature space π«π«
β¦πΏπΏ1 πΏπΏ1 οΏ½πΏπΏ2
Minimize ππ0
πΏπΏππ οΏ½ππ=1
ππβ1πΏπΏππ
Minimize ππ01π·π· descriptor 2π·π· descriptor πππ·π· descriptor
SISπ¦π¦ (π«π«) SISR1(π«π«) SISRππβ1(π«π«)
Minimize ππ0
SIS = sure independent screening
Si = feature subspace
π¦π¦ = target property
1012
~3000
Ri = Residual of target property usingthe previous iterations least squares prediction
SISSO: R. Ouyang et al. arxiv (2017) Also can use domain overlap for classification problems
Perovskites β promising functional materials
Perovskites are a class of ABX3 materials A typically group 1, 2, or lanthanide B typically transition metal X typically chalcogen or halogen
Special Issue: Perovskites, Science 2017
SISSO applied to perovskites to find a descriptor for their stability
Goldschmidtβs tolerance factor (t) to predict stability
π π =rA + rX
2(rB + rX)
If cubic, a = 2 rB + rX = 2 rA + rX ; π π = 1
Viktor Goldschmidt (1926)
Can we find a better descriptor using SISSO?
Correa-Baena et al., Science 2017
Dataset of experimentally characterized ABX3
576 ABX3 with experimental XRD 313 perovskites and 263 nonperovskites 75 different cations (A, B) 5 different anions (X)
Experimental results compiled from:H. Zhang, N. Li, K. Li, D. Xue, Acta Cryst. B 2007C. Li, X. Lu, W. Ding, L. Feng, Y. Gao, Z. Guo, Acta Cryst. B 2008W. Travis, E. Glover, H. Bronstein, D. Scanlon, R. Palgrave, Chem. Sci. 2016
t is often insufficient, especially for halides π π =π π π΄π΄ + π π ππ
2(π π π΅π΅ + π π ππ)576 ABX3 with experimental XRD 313 perovskites and 263 nonperovskites 75 different cations (A, B) 5 different anions (X) Only 74% accuracy
on experimental set
t is often insufficient, especially for halides π π =π π π΄π΄ + π π ππ
2(π π π΅π΅ + π π ππ)
Only 74% accuracy on experimental set
576 ABX3 with experimental XRD 313 perovskites and 263 nonperovskites 75 different cations (A, B) 5 different anions (X)
t ~ guessing for heavier halides
New tolerance factor discovered with SISSO (compressed sensing)
ππ = πππΏπΏπππ©π©β πππ¨π¨ πππ¨π¨ β
πππ¨π¨/πππ©π©ππππ πππ¨π¨/πππ©π©
92% accuracy
requires the same inputs as for the calculation of t
from ~3,000,000,000 potential descriptors
Ο is general for oxides and halides
t ~ uniform performance 92% accuracy
Comparing the forms of t and Ο
ππ = πππππππ΅π΅β πππ΄π΄ πππ΄π΄ β
πππ΄π΄/πππ΅π΅ππππ πππ΄π΄/πππ΅π΅
π π =π π π΄π΄ + π π ππ
2(π π π΅π΅ + π π ππ)
1 2 3 1 2
0.825 < t < 1.059 β perovskite Ο < 4.18 β perovskite 92% accuracy74% accuracy
Decision tree outputs only (-1, 1)
1J. Platt, Advances In large margin classifiers, 1999
Mapping decision tree outputs to logistic regression1 yields β(Ο)
Monotonic perovskite probabilities β β(Ο)
Monotonic perovskite probabilities β β(Ο)
1J. Platt, Advances In large margin classifiers, 1999
Decision tree outputs only (-1, 1)
Mapping decision tree outputs to logistic regression1 yields β(Ο)
β(Ο) compares well with DFT-GGA ΞHdec
ΞHdec > 0 β stable in cubic structure
88% agreement
β correlates with ΞHdec
Ο can be more powerful (CaZrO3, CaHfO3)
Decomposition enthalpies from:X.-G. Zhao, D. Yang, Y. Sun, T. Li, L. Zhang, L. Yu, A. Zunger, JACS 2017Q. Sun, W.-J. Yin, JACS 2017
Double perovskites for emerging solar absorbers
A2BBβX6
9 recently synthesized double perovskite
halides β all predicted to be perovskite by Ο
2016
2016
2017
2017
β¦ and more every month
Lower triangle βCs2BBβCl6
Upper triangle β(CH3NH3)2BBβBr6
Ο applied to 259,296 A2BBβX6 compounds
SISSO to find new descriptor, Ο,which improves upon Goldschmidtβs
C. Bartel et al., Under review (https://arxiv.org/abs/1801.07700) github.com/CJBartel/perovskite-stability
ππ = rXrBβ nA nA β
rA/rBln rA/rB
New tolerance factor for perovskite stability using SISSO
SISSO to find new descriptor, Ο,which improves upon Goldschmidtβs
Ο yields meaningful probabilities
New tolerance factor for perovskite stability using SISSO
C. Bartel et al., Under review (https://arxiv.org/abs/1801.07700) github.com/CJBartel/perovskite-stability
New tolerance factor for perovskite stability using SISSO
SISSO to find new descriptor, Ο,which improves upon Goldschmidtβs
Ο yields meaningful probabilities
Stability elucidated as β(A, B, X)
C. Bartel et al., Under review (https://arxiv.org/abs/1801.07700)
SISSO should be applicableto catalysis problems
Typically one focuses on creating a global prediction model for some property of interest (e.g., SISSO, Kernel Ridge Regression, Neural Networks)
Underlying mechanisms canchange across materials
Relations between subsets of data may be important
material property 1 (y1)
mat
eria
l pro
pert
y 2
(y2)
Part 2. Subgroup discovery to find local patterns and their descriptions
material property 1 (y1)
mat
eria
l pro
pert
y 2
(y2)
The periodic table has subgroups
Subgroup discovery: find meaningful local descriptors of a target property in materials-science data
Review: M. Atzmueller, WIREs Data Min. Knowl. Disc. 5 (2015)B. R. Goldsmith, M. Boley, J. Vreeken, M. Scheffler, L. M. Ghiringhelli, New J. Phys. 19, (2017)M. Boley, B. R. Goldsmith, L. M. Ghiringhelli, J. Vreeken, Data Min. Knowl. Disc. 1391, (2017)
Subgroup discovery focuses on local observations
Decision trees Subgroup discovery
Descriptive features, ππ1, β¦ ,ππππ β πΏπΏ
e.g., d-band center, coordination number, atomic radii
Target features π¦π¦1, β¦ ,π¦π¦ππ β ππ
e.g., adsorbate binding energy
Basic selectors, ππ1, β¦ , ππππ β πΆπΆ β {false, true}
e.g., Is the d-band center low?; Is the atom coordination number < 3?
Find selector ππ = ππ1 β β§ β―β§ ππππ β
that maximizes quality π π = ext ππππ
πΌπΌπ’π’(ππππ)1βπΌπΌ
ext ππππ
is the coverage of points where ππ is true
π’π’ ππππ is the utility function (optimization criteria)
Subgroup discovery: how it works
1. Gas Phase Gold Clusters (Au5-Au14)
2. Classification and description of 82 Octet Binary Semiconductors
Two tutorial applications of subgroup discovery presented
24,400 gold cluster configurations in total
Display interesting catalytic properties
Rocksalt Zincblende
vs.
24,400 gold cluster configurations (Au5-Au14) in the gas phase
HOM
O-L
UM
O e
nerg
y ga
p (e
V)
(relative) total energy of gold cluster (eV)
Three main subgroups identified HOM
O-LU
MO
energy gap (eV)Probability
ππππππππ(π΅π΅) β§ π΅π΅ β₯ ππ
π΅π΅ = ππ
π¨π¨π¨π¨π¨π¨(π΅π΅)
Rediscover simple insight about HOMO-LUMO gap
Choose target propertyHOMO-LUMO energy gap
Choose variation reduction utility functionπ’π’ ππβ² = (std ππ β std ππβ² )/std(ππ)
Find descriptors that predict crystal structuresfor the 82 octet AB-type materials
Target property sign(Erocksalt β Ezincblende)
Input candidate descriptors into subgroup discovery from DFT calculationsβ’ Radii of s, p, d orbitals of free atoms β’ Electron affinityβ’ Ionization potentialβ¦.and othersB. R. Goldsmith et. al., New J. Phys. 19, (2017)
M. Boley et. al., Data Min. Knowl. Disc. 1391, (2017)
Rocksalt Zincblende
vs.
πππ©π©ππ β πππ©π©ππ β₯ ππ.ππππ Γ and πππ¬π¬ππ β₯ ππ.ππππ Γ
πππ©π©ππ β πππ©π©ππ β€ ππ.ππππ Γ and πππ¬π¬ππ β€ ππ.ππππ Γ
π π pA β π π pB
π π sA
Subgroup discovery classifies 79 of the 82 compoundsusing a two-dimensional descriptor
zincblende subgroup
Rocksaltsubgroup
MgTe AgI
CuF
Computationalscreening
First-principles microkinetic modeling
Catalyst designCatalyst characterization
Computationalspectroscopy
First-principlesthermochemistry &kinetic parameters
Reaction mechanism
Mechanistic understanding
Active sites
Looking ahead: Can we use data analytics to extract catalyst descriptors?
DescriptorData analyticsSISSOSubgroup Discoveryβ¦.
Machine learninge.g., use subgroup discovery and compressed sensing relate features to Figures of merit
HeterogeneousCatalyst Feature Space Figures of Merit
atomicproperties
electronic structurephysical
properties activity
stability
selectivity
(1)
(2)
Catalyst Space
support
composition
size
Machine learning for catalyst design and discovery
Acknowledgements
Mario BoleyRunhai OuyangChristopher SuttonJilles Vreeken
Matthias SchefflerLuca M. GhiringhelliCharles Musgrave
Subgroup discoveryand SISSO tutorials are onlinehttps://www.nomad-coe.eu/
Christopher Bartel