
Page 1: Finding patterns, correlations, and descriptors in ...

Finding patterns, correlations, and descriptors in materials data using subgroup discovery and compressed sensing

Bryan R. Goldsmith
University of Michigan, Ann Arbor, Department of Chemical Engineering

Christopher J. Bartel and Charles Musgrave
CU Boulder, Department of Chemical and Biological Engineering

Chris Sutton, Runhai Ouyang, Luca M. Ghiringhelli, Matthias Scheffler
Fritz Haber Institute of the Max Planck Society, Theory Department

Mario Boley and Jilles Vreeken
Max Planck Institute for Informatics

Page 2: Finding patterns, correlations, and descriptors in ...

Predicting advanced materials requires understanding the mechanisms underlying their function

Battery materials [1]

Catalysts

[1] K. Kang et al., Science 311, 977 (2006)

Page 3: Finding patterns, correlations, and descriptors in ...

Identifying physically meaningful descriptors can aid materials discovery

Screen new catalysts
Increase understanding

Descriptor = function(atomic or material features)

Descriptor → Property

Ma, Xianfeng and Hongliang Xin, PRL 118.3 (2017) 036101.

Nørskov, Jens K., et al., PNAS 108.3 (2011) 937-943.

Page 4: Finding patterns, correlations, and descriptors in ...


Machine learning tools can accelerate the discovery of descriptors

Page 5: Finding patterns, correlations, and descriptors in ...

This talk focuses on two data-analytics tools to find descriptors of materials

1. Compressed sensing to find low-dimensional descriptors
   - Perovskite oxides and halides

2. Subgroup discovery to find local patterns and their descriptions
   - Gold clusters in the gas phase (sizes 5-14 atoms)
   - Octet binary (AB) semiconductors

3. Future work: Compressed sensing and subgroup discovery for catalysis

Page 6: Finding patterns, correlations, and descriptors in ...

Compressed sensing allows the construction of sparse models with high accuracy

Original image Sparse in the basis set

Recovered with 10% measurements

Emmanuel Candès, Terence Tao, and David Donoho

Part 1. Compressed sensing to find interpretable descriptors

Page 7: Finding patterns, correlations, and descriptors in ...

Compressed sensing allows construction of sparse models with high accuracy

Ideally use $\ell_0$-norm minimization:

$\min_{\beta} \|\beta\|_0 \quad \text{subject to} \quad y = D\beta$

$y$: target property
$D$: matrix of the materials’ features
$\beta$: coefficients
$\ell_0$-norm: total number of non-zero coefficients

Emmanuel Candès, Terence Tao, and David Donoho
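To get a feel for what exact $\ell_0$-norm minimization costs, one can count the candidate models: a brute-force search has to fit one least-squares model per subset of features. The sketch below is only that count. The helper name `count_l0_candidates` and the feature counts are illustrative assumptions, not taken from the talk.

```python
from math import comb


def count_l0_candidates(n_features: int, max_dim: int) -> int:
    """Number of feature subsets of size 1..max_dim that a brute-force
    l0 search would have to fit by least squares (hypothetical helper)."""
    return sum(comb(n_features, k) for k in range(1, max_dim + 1))


# Illustrative feature counts; the number of candidates grows roughly as n^max_dim.
for n_features in (100, 1_000, 10_000):
    print(n_features, count_l0_candidates(n_features, max_dim=3))
```

The count grows polynomially in the number of features with an exponent equal to the descriptor dimension, which is why a screening step or a convex relaxation is needed before any exhaustive search.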

Page 8: Finding patterns, correlations, and descriptors in ...

$\ell_0$-norm minimization is too expensive to perform for a large feature matrix $D$!

Instead, one often minimizes the $\ell_1$-norm (LASSO) as an approximation of the $\ell_0$-norm:

$\hat{\beta}_{\mathrm{LASSO}}(\lambda) = \arg\min_{\beta} \tfrac{1}{2}\|y - D\beta\|_2^2 + \lambda\|\beta\|_1$

$\tfrac{1}{2}\|y - D\beta\|_2^2$: squared-error (data-fit) term
$\lambda$: regularization parameter
$\ell_1$-norm: sum of the absolute values of the coefficients
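The LASSO objective above can be minimized in several ways. The following is a minimal sketch using cyclic coordinate descent with soft-thresholding, which is one standard solver and not necessarily the one used in the work cited here. The function names and the toy data are assumptions for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(columns, y, lam, sweeps=200):
    """Minimize 0.5 * ||y - D beta||_2^2 + lam * ||beta||_1,
    where `columns` holds the columns of D (assumed non-zero)."""
    beta = [0.0] * len(columns)
    residual = list(y)  # y - D beta for the current beta
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            norm2 = dot(col, col)
            # Correlation of column j with the residual that excludes feature j.
            rho = dot(col, residual) + beta[j] * norm2
            new_bj = soft_threshold(rho, lam) / norm2
            delta = new_bj - beta[j]
            residual = [r - delta * c for r, c in zip(residual, col)]
            beta[j] = new_bj
    return beta


# Toy example with orthonormal columns, where the solution is known in
# closed form: each coefficient is the soft-thresholded projection of y.
columns = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = [3.0, 0.5, -2.0]
beta = lasso_coordinate_descent(columns, y, lam=1.0)
print(beta)
```

In this toy case the small middle coefficient is set exactly to zero, which is the sparsity-inducing behaviour that makes the $\ell_1$-norm a usable stand-in for the $\ell_0$-norm.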

Page 9: Finding patterns, correlations, and descriptors in ...

Example of LASSO+$\ell_0$: find a descriptor that predicts the crystal-structure energy differences between the 82 octet AB compounds

LASSO+$\ell_0$: L. M. Ghiringhelli, J. Vybiral, S. V. Levchenko, C. Draxl, M. Scheffler, PRL 114, 105503 (2015)

Page 10: Finding patterns, correlations, and descriptors in ...

Unfortunately, LASSO becomes unstable when the feature space is huge and its features are correlated

This issue has been solved recently using the Sure Independence Screening and Sparsifying Operator (SISSO) algorithm, which manages high-dimensional and correlated feature spaces by combining screening and compressed sensing

R. Ouyang et al., “SISSO: a compressed-sensing method for systematically identifying efficient physical models of materials properties”, arXiv (2017)

Runhai Ouyang

Page 11: Finding patterns, correlations, and descriptors in ...

SISSO overview

Step 1. Systematically construct a huge feature space

Primary features → feature transformation with an operator set → huge feature space (billions)

Operator set: $\hat{R} = \{+, -, \times, \div, \exp, \log, {}^{-1}, {}^{2}, {}^{3}, \sqrt{\ }, |-|\}$
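As a rough illustration of Step 1, the sketch below applies each operator in the set once to a couple of primary features. It is a simplified assumption of how such an expansion could look: the names `UNARY`, `BINARY`, and `expand` and the numerical values are hypothetical, and checks that a real implementation would likely need (for example, only combining features with compatible units, or removing duplicates) are omitted.

```python
import math
from itertools import combinations

# Operator set from the slide, split into unary and binary operators.
UNARY = {
    "exp({})": math.exp,
    "log({})": math.log,
    "({})^-1": lambda v: 1.0 / v,
    "({})^2": lambda v: v ** 2,
    "({})^3": lambda v: v ** 3,
    "sqrt({})": math.sqrt,
}
BINARY = {
    "({}+{})": lambda a, b: a + b,
    "({}-{})": lambda a, b: a - b,
    "({}*{})": lambda a, b: a * b,
    "({}/{})": lambda a, b: a / b,
    "|{}-{}|": lambda a, b: abs(a - b),
}


def expand(features):
    """Apply every operator once; features that leave an operator's domain are skipped."""
    new = dict(features)
    for name, col in features.items():
        for template, op in UNARY.items():
            try:
                new[template.format(name)] = [op(v) for v in col]
            except (ValueError, ZeroDivisionError, OverflowError):
                pass
    for (na, ca), (nb, cb) in combinations(features.items(), 2):
        for template, op in BINARY.items():
            try:
                new[template.format(na, nb)] = [op(a, b) for a, b in zip(ca, cb)]
            except ZeroDivisionError:
                pass
    return new


# Hypothetical primary features (one value per material).
primary = {"r_A": [1.44, 1.61, 1.88], "r_B": [0.605, 0.72, 0.53]}
first = expand(primary)
second = expand(first)
print(len(primary), len(first), len(second))
```

Even with two primary features the count grows quickly with each round of expansion, which is how a handful of primary features can turn into billions of candidates.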

Page 12: Finding patterns, correlations, and descriptors in ...

Step 2. Select the top-ranked features using sure independence screening (SIS) [1]

SIS: select the N largest components (by magnitude) of $D^{T}y$

$y$: target property
$D$: matrix of the features for each material

This yields a subspace of the N features that are most correlated with the property of interest

[1] J. Fan and J. Lv, J. R. Statist. Soc. B 70, 849 (2008)
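The screening step can be sketched in a few lines. The version below standardizes each feature and the target, so that the components of $D^{T}y$ are proportional to correlation coefficients, and keeps the N features with the largest magnitude. The function names and toy data are assumptions; constant features (zero standard deviation) are assumed to have been removed beforehand.

```python
from statistics import mean, pstdev


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def sure_independence_screening(features, y, n_keep):
    """Return the names of the n_keep features with the largest |D^T y|."""
    y_std = standardize(y)
    scores = {
        name: abs(sum(a * b for a, b in zip(standardize(col), y_std)))
        for name, col in features.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:n_keep]


# Toy data: y is exactly 2 * x1, x2 is partly correlated, x3 is uncorrelated.
features = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.0, 1.0, 4.0, 3.0, 5.0],
    "x3": [1.0, -1.0, 1.0, -1.0, 1.0],
}
y = [2.0, 4.0, 6.0, 8.0, 10.0]
selected = sure_independence_screening(features, y, n_keep=2)
print(selected)
```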

Page 13: Finding patterns, correlations, and descriptors in ...

Step 3. Iteratively apply sure independence screening with a sparse approximation algorithm

Huge feature space $D$ (billions)

1D descriptor: $S_1 = \mathrm{SIS}_{y}(D)$, then minimize $\ell_0$ over $S_1$
2D descriptor: $S_2 = \mathrm{SIS}_{R_1}(D)$, then minimize $\ell_0$ over $S_1 \cup S_2$
...
$n$D descriptor: $S_n = \mathrm{SIS}_{R_{n-1}}(D)$, then minimize $\ell_0$ over $S_n \cup \bigcup_{i=1}^{n-1} S_i$

SIS = sure independence screening
$S_i$ = feature subspace
$y$ = target property
$R_i$ = residual of the target property using the previous iteration's least-squares prediction

Sizes annotated on the diagram: $10^{12}$ and ~3000

Also can use domain overlap for classification problems

SISSO: R. Ouyang et al., arXiv (2017)
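The iteration in Step 3 can be sketched as a loop that alternates screening against the current residual with an exhaustive least-squares search over the accumulated subspace. This is a simplified toy under stated assumptions, not the authors' implementation: features are assumed to be comparably scaled and linearly independent, no intercept is fitted, and the names `sis`, `least_squares`, and `sisso` are hypothetical.

```python
from itertools import combinations


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def least_squares(cols, y):
    """Solve the normal equations by Gauss-Jordan elimination; return (coef, residual)."""
    d = len(cols)
    aug = [[dot(cols[i], cols[j]) for j in range(d)] + [dot(cols[i], y)] for i in range(d)]
    for i in range(d):
        pivot = max(range(i, d), key=lambda r: abs(aug[r][i]))
        aug[i], aug[pivot] = aug[pivot], aug[i]
        for r in range(d):
            if r != i:
                factor = aug[r][i] / aug[i][i]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[i])]
    coef = [aug[i][d] / aug[i][i] for i in range(d)]
    resid = [yk - sum(c * col[k] for c, col in zip(coef, cols)) for k, yk in enumerate(y)]
    return coef, resid


def sis(features, target, k, exclude):
    """Indices of the k not-yet-selected features with the largest |f . target|."""
    scored = [(abs(dot(f, target)), j) for j, f in enumerate(features) if j not in exclude]
    scored.sort(key=lambda s: -s[0])
    return [j for _, j in scored[:k]]


def sisso(features, y, max_dim, k):
    union, residual = [], list(y)
    best_subset, best_coef, best_rss = (), [], float("inf")
    for dim in range(1, max_dim + 1):
        union += sis(features, residual, k, set(union))  # S_dim from SIS on R_(dim-1)
        best_rss = float("inf")
        for subset in combinations(union, dim):  # exhaustive l0 step
            coef, resid = least_squares([features[j] for j in subset], y)
            rss = dot(resid, resid)
            if rss < best_rss:
                best_subset, best_coef, best_rss, residual = subset, coef, rss, resid
    return best_subset, best_coef, best_rss


# Toy data: y = 2*f0 + 3*f1 exactly; f2 is a correlated distractor.
features = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
y = [2.0, 3.0, 0.0, 0.0]
subset, coef, rss = sisso(features, y, max_dim=2, k=2)
print(subset, coef, rss)
```

In the toy run the distractor f2 passes the first screening, but f0 is only picked up when screening against the residual of the 1D model, which is the point of iterating on $R_i$ rather than screening once against $y$.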

Page 14: Finding patterns, correlations, and descriptors in ...

Perovskites – promising functional materials

Perovskites are a class of ABX3 materials
A: typically group 1, group 2, or lanthanide
B: typically transition metal
X: typically chalcogen or halogen

Special Issue: Perovskites, Science 2017

SISSO applied to perovskites to find a descriptor for their stability

Page 15: Finding patterns, correlations, and descriptors in ...

Goldschmidt’s tolerance factor (t) to predict stability

$t = \dfrac{r_A + r_X}{\sqrt{2}\,(r_B + r_X)}$

If cubic, $a = 2(r_B + r_X) = \sqrt{2}\,(r_A + r_X)$; $t = 1$

Viktor Goldschmidt (1926)

Can we find a better descriptor using SISSO?

Correa-Baena et al., Science 2017
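As a quick check of the cubic condition, the sketch below computes Goldschmidt's t for radii chosen so that both expressions for the lattice constant agree. The radii are made-up numbers picked only to satisfy that geometry, not values for a real compound, and the function name is an assumption.

```python
import math


def goldschmidt_t(r_a: float, r_b: float, r_x: float) -> float:
    """Goldschmidt tolerance factor t = (r_A + r_X) / (sqrt(2) * (r_B + r_X))."""
    return (r_a + r_x) / (math.sqrt(2) * (r_b + r_x))


# Hypothetical radii (in Angstrom) constructed to fulfil the ideal cubic geometry.
r_b, r_x = 0.6, 1.4
a = 2 * (r_b + r_x)            # lattice constant from the B-X contact
r_a = a / math.sqrt(2) - r_x   # r_A that makes sqrt(2) * (r_A + r_X) equal to a
t = goldschmidt_t(r_a, r_b, r_x)
print(round(t, 6))
```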

Page 16: Finding patterns, correlations, and descriptors in ...

Dataset of experimentally characterized ABX3

576 ABX3 with experimental XRD
313 perovskites and 263 nonperovskites
75 different cations (A, B)
5 different anions (X)

Experimental results compiled from:
H. Zhang, N. Li, K. Li, D. Xue, Acta Cryst. B 2007
C. Li, X. Lu, W. Ding, L. Feng, Y. Gao, Z. Guo, Acta Cryst. B 2008
W. Travis, E. Glover, H. Bronstein, D. Scanlon, R. Palgrave, Chem. Sci. 2016

Page 17: Finding patterns, correlations, and descriptors in ...

t is often insufficient, especially for halides

$t = \dfrac{r_A + r_X}{\sqrt{2}\,(r_B + r_X)}$

576 ABX3 with experimental XRD
313 perovskites and 263 nonperovskites
75 different cations (A, B)
5 different anions (X)

Only 74% accuracy on experimental set

Page 18: Finding patterns, correlations, and descriptors in ...


t ~ guessing for heavier halides

Page 19: Finding patterns, correlations, and descriptors in ...

New tolerance factor discovered with SISSO (compressed sensing)

$\tau = \dfrac{r_X}{r_B} - n_A\left(n_A - \dfrac{r_A/r_B}{\ln(r_A/r_B)}\right)$

92% accuracy

Requires the same inputs as the calculation of t

Selected from ~3,000,000,000 potential descriptors

Page 20: Finding patterns, correlations, and descriptors in ...

τ is general for oxides and halides

τ: ~ uniform performance, 92% accuracy

Page 21: Finding patterns, correlations, and descriptors in ...

Comparing the forms of t and τ

$t = \dfrac{r_A + r_X}{\sqrt{2}\,(r_B + r_X)}$

$\tau = \dfrac{r_X}{r_B} - n_A\left(n_A - \dfrac{r_A/r_B}{\ln(r_A/r_B)}\right)$

0.825 < t < 1.059 → perovskite (74% accuracy)
τ < 4.18 → perovskite (92% accuracy)
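The two decision rules are easy to evaluate side by side. The sketch below implements both formulas and thresholds as stated on the slide. The input radii and oxidation state are illustrative assumptions (a divalent A cation with r_A larger than r_B), not data from the talk, and the function names are hypothetical. Note that τ is undefined when r_A equals r_B, because ln(r_A/r_B) is then zero.

```python
import math


def goldschmidt_t(r_a, r_b, r_x):
    return (r_a + r_x) / (math.sqrt(2) * (r_b + r_x))


def tau(r_a, r_b, r_x, n_a):
    """tau = r_X/r_B - n_A * (n_A - (r_A/r_B) / ln(r_A/r_B)); requires r_A != r_B."""
    ratio = r_a / r_b
    return r_x / r_b - n_a * (n_a - ratio / math.log(ratio))


def perovskite_by_t(t_value):
    return 0.825 < t_value < 1.059


def perovskite_by_tau(tau_value):
    return tau_value < 4.18


# Illustrative inputs (assumed): radii in Angstrom, n_A = oxidation state of A.
r_a, r_b, r_x, n_a = 1.44, 0.605, 1.40, 2
t_value = goldschmidt_t(r_a, r_b, r_x)
tau_value = tau(r_a, r_b, r_x, n_a)
print(round(t_value, 3), perovskite_by_t(t_value))
print(round(tau_value, 3), perovskite_by_tau(tau_value))
```

Both descriptors take the same ionic radii as input; τ additionally uses the oxidation state of A, which follows from the same information needed to look up the radii.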

Page 22: Finding patterns, correlations, and descriptors in ...

Monotonic perovskite probabilities – ℘(τ)

Decision tree outputs only (-1, 1)

Mapping decision tree outputs to logistic regression [1] yields ℘(τ)

[1] J. Platt, Advances in Large Margin Classifiers, 1999
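The idea of turning hard (-1, 1) outputs into a smooth probability can be sketched by fitting a sigmoid in τ to the class labels. The example below is a simplified stand-in, assuming plain gradient descent on the logistic loss and made-up τ values; it is not the calibration procedure or the data used in the talk.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, labels, lr=0.5, steps=2000):
    """Fit P(label = 1 | x) = sigmoid(w * x + c) by gradient descent."""
    w, c = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_c = 0.0
        for x, label in zip(xs, labels):
            err = sigmoid(w * x + c) - label
            grad_w += err * x
            grad_c += err
        w -= lr * grad_w / n
        c -= lr * grad_c / n
    return w, c


# Made-up tau values; the hard outputs follow the rule tau < 4.18 -> perovskite (+1).
taus = [3.2, 3.6, 4.0, 4.1, 4.3, 4.5, 4.9, 5.4]
tree_outputs = [1 if t < 4.18 else -1 for t in taus]
labels = [1 if out == 1 else 0 for out in tree_outputs]

# Center tau at the threshold so the fitted offset stays small.
w, c = fit_logistic([t - 4.18 for t in taus], labels)
probs = [sigmoid(w * (t - 4.18) + c) for t in taus]
for t, p in zip(taus, probs):
    print(t, round(p, 3))
```

Because the fitted curve is a sigmoid in a single variable, the resulting probability is monotonic in τ by construction.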


Page 24: Finding patterns, correlations, and descriptors in ...

℘(τ) compares well with DFT-GGA ΔHdec

ΔHdec > 0 → stable in cubic structure

℘ correlates with ΔHdec

88% agreement

τ can be more powerful (CaZrO3, CaHfO3)

Decomposition enthalpies from:
X.-G. Zhao, D. Yang, Y. Sun, T. Li, L. Zhang, L. Yu, A. Zunger, JACS 2017
Q. Sun, W.-J. Yin, JACS 2017

Page 25: Finding patterns, correlations, and descriptors in ...

Double perovskites for emerging solar absorbers

A2BB’X6

9 recently synthesized double perovskite halides (reported 2016-2017) – all predicted to be perovskite by τ

… and more every month

Page 26: Finding patterns, correlations, and descriptors in ...

τ applied to 259,296 A2BB’X6 compounds

Lower triangle – Cs2BB’Cl6

Upper triangle – (CH3NH3)2BB’Br6

Page 27: Finding patterns, correlations, and descriptors in ...

New tolerance factor for perovskite stability using SISSO

SISSO to find new descriptor, τ, which improves upon Goldschmidt’s

$\tau = \dfrac{r_X}{r_B} - n_A\left(n_A - \dfrac{r_A/r_B}{\ln(r_A/r_B)}\right)$

C. Bartel et al., under review (https://arxiv.org/abs/1801.07700)
github.com/CJBartel/perovskite-stability

Page 28: Finding patterns, correlations, and descriptors in ...

τ yields meaningful probabilities

Page 29: Finding patterns, correlations, and descriptors in ...

Stability elucidated as ℘(A, B, X)

SISSO should be applicable to catalysis problems

Page 30: Finding patterns, correlations, and descriptors in ...

Typically one focuses on creating a global prediction model for some property of interest (e.g., SISSO, Kernel Ridge Regression, Neural Networks)

Underlying mechanisms can change across materials

Relations between subsets of data may be important

[Figure: material property 2 (y2) plotted against material property 1 (y1)]

Part 2. Subgroup discovery to find local patterns and their descriptions

Page 31: Finding patterns, correlations, and descriptors in ...

[Figure: material property 2 (y2) plotted against material property 1 (y1)]

The periodic table has subgroups

Subgroup discovery: find meaningful local descriptors of a target property in materials-science data

Review: M. Atzmueller, WIREs Data Min. Knowl. Disc. 5 (2015)
B. R. Goldsmith, M. Boley, J. Vreeken, M. Scheffler, L. M. Ghiringhelli, New J. Phys. 19, (2017)
M. Boley, B. R. Goldsmith, L. M. Ghiringhelli, J. Vreeken, Data Min. Knowl. Disc. 1391, (2017)

Page 32: Finding patterns, correlations, and descriptors in ...

Subgroup discovery focuses on local observations

Decision trees Subgroup discovery

Page 33: Finding patterns, correlations, and descriptors in ...

Subgroup discovery: how it works

Descriptive features $a_1, \ldots, a_m \in A$
e.g., d-band center, coordination number, atomic radii

Target features $y_1, \ldots, y_n \in Y$
e.g., adsorbate binding energy

Basic selectors $c_1, \ldots, c_k \in C$, each mapping a data point to {false, true}
e.g., Is the d-band center low? Is the atom coordination number < 3?

Find the selector $\sigma(\cdot) = c_1(\cdot) \wedge \cdots \wedge c_l(\cdot)$ that maximizes the quality

$q(\sigma) = \left(\dfrac{|\mathrm{ext}(\sigma)|}{|P|}\right)^{\alpha} u(Y_\sigma)^{1-\alpha}$

$|\mathrm{ext}(\sigma)|/|P|$ is the coverage of points where $\sigma$ is true
$u(Y_\sigma)$ is the utility function (optimization criterion)
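The search described above can be sketched as an exhaustive loop over conjunctions of basic selectors, scoring each by coverage and utility. Everything concrete in the example is an assumption for illustration: the toy data, the two selectors, the value of α, and the utility (here, the share of covered points with a binding energy below -1 eV). Practical subgroup discovery uses smarter search strategies than brute force.

```python
from itertools import combinations


def quality(selector, population, targets, utility, alpha):
    """coverage^alpha * utility^(1 - alpha) for the points where the selector is true."""
    covered = [y for x, y in zip(population, targets) if selector(x)]
    if not covered:
        return 0.0
    coverage = len(covered) / len(population)
    return coverage ** alpha * max(utility(covered), 0.0) ** (1 - alpha)


def best_conjunction(basic, population, targets, utility, alpha, max_len=2):
    """Brute-force search over conjunctions of up to max_len basic selectors."""
    best_names, best_q = None, -1.0
    for length in range(1, max_len + 1):
        for names in combinations(basic, length):
            def selector(x, names=names):
                return all(basic[n](x) for n in names)
            q = quality(selector, population, targets, utility, alpha)
            if q > best_q:
                best_names, best_q = names, q
    return best_names, best_q


# Hypothetical sites: d-band center (eV) and coordination number, with binding energies (eV).
population = [
    {"d_band": -3.0, "cn": 2}, {"d_band": -3.0, "cn": 2}, {"d_band": -3.0, "cn": 5},
    {"d_band": -1.0, "cn": 2}, {"d_band": -1.0, "cn": 6}, {"d_band": -1.0, "cn": 7},
]
targets = [-2.0, -1.8, -0.5, -0.6, -0.3, -0.2]
basic = {
    "d_band_low": lambda x: x["d_band"] < -2.0,
    "cn_below_3": lambda x: x["cn"] < 3,
}


def strong_binding_share(ys):
    """Assumed utility: fraction of covered points with binding energy below -1 eV."""
    return sum(y < -1.0 for y in ys) / len(ys)


best_names, best_q = best_conjunction(basic, population, targets, strong_binding_share, alpha=0.3)
print(best_names, round(best_q, 4))
```

The parameter α sets the trade-off: a larger α favours subgroups that cover many points, a smaller α favours subgroups with a very high utility even if they are small.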

Page 34: Finding patterns, correlations, and descriptors in ...

Two tutorial applications of subgroup discovery presented

1. Gas-phase gold clusters (Au5-Au14)
   24,400 gold cluster configurations in total
   Display interesting catalytic properties

2. Classification and description of 82 octet binary semiconductors
   Rocksalt vs. zincblende

Page 35: Finding patterns, correlations, and descriptors in ...

24,400 gold cluster configurations (Au5-Au14) in the gas phase

Choose target property: HOMO-LUMO energy gap

Choose variation reduction utility function: $u(Y') = (\mathrm{std}(Y) - \mathrm{std}(Y'))/\mathrm{std}(Y)$

[Figure: HOMO-LUMO energy gap (eV) vs. (relative) total energy of gold cluster (eV); probability distribution of the HOMO-LUMO energy gap (eV)]

Three main subgroups identified:
even(N) ∧ N ≥ 7
N = 6
odd(N)

Rediscover simple insight about HOMO-LUMO gap
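The variation reduction utility rewards subgroups whose target values are much less spread out than those of the full data set. A minimal sketch follows, assuming the population standard deviation (the slide does not say which estimator is meant) and invented gap values that merely mimic an odd/even alternation.

```python
from statistics import pstdev


def variation_reduction(all_targets, subgroup_targets):
    """u(Y') = (std(Y) - std(Y')) / std(Y); 1 means no spread left, 0 means no reduction."""
    total = pstdev(all_targets)
    return (total - pstdev(subgroup_targets)) / total


# Invented HOMO-LUMO gaps (eV) for cluster sizes N = 5..10, for illustration only.
sizes = [5, 6, 7, 8, 9, 10]
gaps = [0.40, 2.00, 0.50, 1.60, 0.45, 1.50]

odd_gaps = [g for n, g in zip(sizes, gaps) if n % 2 == 1]
u_all = variation_reduction(gaps, gaps)
u_odd = variation_reduction(gaps, odd_gaps)
print(round(u_all, 3), round(u_odd, 3))
```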

Page 36: Finding patterns, correlations, and descriptors in ...

Find descriptors that predict crystal structures for the 82 octet AB-type materials

Target property: sign(Erocksalt – Ezincblende)

Rocksalt vs. zincblende

Input candidate descriptors into subgroup discovery from DFT calculations
• Radii of s, p, d orbitals of free atoms
• Electron affinity
• Ionization potential
… and others

B. R. Goldsmith et al., New J. Phys. 19, (2017)
M. Boley et al., Data Min. Knowl. Disc. 1391, (2017)

Page 37: Finding patterns, correlations, and descriptors in ...

Subgroup discovery classifies 79 of the 82 compounds using a two-dimensional descriptor

$r_p^A - r_p^B \geq 0.91$ Å and $r_s^A \geq 1.22$ Å

$r_p^A - r_p^B \leq 1.16$ Å and $r_s^A \leq 1.27$ Å

[Figure: axes $r_p^A - r_p^B$ and $r_s^A$, showing the zincblende subgroup and the rocksalt subgroup; labeled compounds: MgTe, AgI, CuF]

Page 38: Finding patterns, correlations, and descriptors in ...

Looking ahead: Can we use data analytics to extract catalyst descriptors?

Catalyst design
Catalyst characterization
Computational screening
Computational spectroscopy
First-principles microkinetic modeling
First-principles thermochemistry & kinetic parameters
Reaction mechanism
Mechanistic understanding
Active sites

Descriptor
Data analytics: SISSO, subgroup discovery, …

Page 39: Finding patterns, correlations, and descriptors in ...

Machine learning for catalyst design and discovery

Catalyst space: support, composition, size

Heterogeneous catalyst feature space: atomic properties, electronic structure, physical properties

Figures of merit: activity, stability, selectivity

Machine learning, e.g., use subgroup discovery and compressed sensing to relate features to figures of merit

Page 40: Finding patterns, correlations, and descriptors in ...

Acknowledgements

Mario Boley
Runhai Ouyang
Christopher Sutton
Jilles Vreeken
Matthias Scheffler
Luca M. Ghiringhelli
Charles Musgrave
Christopher Bartel

Subgroup discovery and SISSO tutorials are online: https://www.nomad-coe.eu/