Feature engineering for microstructure-property mapping in ...

15
Feature engineering for microstructure-property mapping in organic photovoltaics Sepideh Hashemi a , Baskar Ganapathysubramanian b , Stephen Casey c , Ji Su c , Surya R. Kalidindi a,d,* a George W. WoodruSchool of Mechanical Engineering, Georgia Institute of Technology, , Atlanta, 30332, GA, USA b Department of Mechanical Engineering, Iowa State University, , Ames, 50011, IA, USA c NASA Langley Research Center, , Hampton, 23681, VA, USA d School of Computational Science and Engineering, Georgia Institute of Technology, , Atlanta, 30332, GA, USA Abstract Linking the highly complex morphology of organic photovoltaic (OPV) thin films to their charge transport properties is critical for achieving high performance material system that serves as a cost-ecient approach for energy harvesting. In this paper, a novel unsupervised feature engi- neering framework is developed and used to establish reduced-order structure-property linkages for OPV films. This framework takes advantage of digital image processing algorithms to iden- tify the salient material features of OPVs undergoing the charge transport phenomenon. These material states are then used to obtain a low-dimensional representation of OPV microstructures via 2-point spatial correlations and principal component analysis. It is found that in addition to the material PC scores, two distance-based metrics are required to complete the microstructure quantification of complex OPVs. A localized version of the Gaussian process (laGP) is then used to link the material PC scores as well as the two distance-based metrics to the short-circuit cur- rent of OPVs. It is demonstrated that the unsupervised feature engineering framework presented in this paper in conjunction with the laGP can lead to high-fidelity and accurate data-driven structure-property linkages for OPV films. Keywords: Unsupervised feature engineering, Reduced-order models, Structure–property linkages, Organic photovoltaics, Charge transport, Gaussian processes 1. Introduction Flexible, light-weight, and wearable solar cells oer a promising solution to cheap energy harvesting for consumer products as well as residential applications. Over the past decade, rapid developments in synthetic chemistry have resulted in organic photovoltaic systems that have pushed single-junction organic photovoltaic (OPV) eciencies over 16%. These novel materials – electron-donors and electron-acceptors – provide tremendous opportunities for improved per- formance, reaching the performance of silicon based photovoltaics. In conjunction with synthesis advances, a large body of work has demonstrated that the microstructure in the active layer is key * Corresponding author at: George W. WoodruSchool of Mechanical Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA. E-mail address: [email protected] (S.R. Kalidindi). Preprint submitted to Acta Materialia November 4, 2021 arXiv:2111.01897v1 [cond-mat.mtrl-sci] 2 Nov 2021

Transcript of Feature engineering for microstructure-property mapping in ...

Page 1: Feature engineering for microstructure-property mapping in ...

Feature engineering for microstructure-property mapping inorganic photovoltaics

Sepideh Hashemia, Baskar Ganapathysubramanianb, Stephen Caseyc, Ji Suc, Surya R.Kalidindia,d,∗

aGeorge W. Woodruff School of Mechanical Engineering, Georgia Institute of Technology, , Atlanta, 30332, GA, USAbDepartment of Mechanical Engineering, Iowa State University, , Ames, 50011, IA, USA

cNASA Langley Research Center, , Hampton, 23681, VA, USAdSchool of Computational Science and Engineering, Georgia Institute of Technology, , Atlanta, 30332, GA, USA

Abstract

Linking the highly complex morphology of organic photovoltaic (OPV) thin films to their chargetransport properties is critical for achieving high performance material system that serves as acost-efficient approach for energy harvesting. In this paper, a novel unsupervised feature engi-neering framework is developed and used to establish reduced-order structure-property linkagesfor OPV films. This framework takes advantage of digital image processing algorithms to iden-tify the salient material features of OPVs undergoing the charge transport phenomenon. Thesematerial states are then used to obtain a low-dimensional representation of OPV microstructuresvia 2-point spatial correlations and principal component analysis. It is found that in addition tothe material PC scores, two distance-based metrics are required to complete the microstructurequantification of complex OPVs. A localized version of the Gaussian process (laGP) is then usedto link the material PC scores as well as the two distance-based metrics to the short-circuit cur-rent of OPVs. It is demonstrated that the unsupervised feature engineering framework presentedin this paper in conjunction with the laGP can lead to high-fidelity and accurate data-drivenstructure-property linkages for OPV films.

Keywords: Unsupervised feature engineering, Reduced-order models, Structure–propertylinkages, Organic photovoltaics, Charge transport, Gaussian processes

1. Introduction

Flexible, light-weight, and wearable solar cells offer a promising solution to cheap energyharvesting for consumer products as well as residential applications. Over the past decade, rapiddevelopments in synthetic chemistry have resulted in organic photovoltaic systems that havepushed single-junction organic photovoltaic (OPV) efficiencies over 16%. These novel materials– electron-donors and electron-acceptors – provide tremendous opportunities for improved per-formance, reaching the performance of silicon based photovoltaics. In conjunction with synthesisadvances, a large body of work has demonstrated that the microstructure in the active layer is key

∗Corresponding author at: George W. Woodruff School of Mechanical Engineering, Georgia Institute of Technology,Atlanta, GA 30332, USA. E-mail address: [email protected] (S.R. Kalidindi).Preprint submitted to Acta Materialia November 4, 2021

arX

iv:2

111.

0189

7v1

[co

nd-m

at.m

trl-

sci]

2 N

ov 2

021

Page 2: Feature engineering for microstructure-property mapping in ...

to high performance devices. Thus, tailoring the morphology in the active layer of OPVs con-tinues to be crucial for maximizing performance. More importantly, advances in self-assemblysuggests the possibility of remarkable control of the active layer morphology.

Despite the importance of morphology to OPV device performance, it remains a challengeto comprehensively and rapidly map morphologies to performance. The availability of reliableand fast structure-property models could enable domain scientists to (a) explore, identify anddesign ”ideal” morphologies that maximize performance, (b) identify microstructure featuresthat positively (or negatively) impact performance, and (c) quantify how perturbations to themorphology (due to oxidation, annealing or ageing) degrade performance.

Past approaches of investigating structure-property linkages relied on full-physics simula-tors — either discrete (kinetic) monte carlo models, or continuum drift-diffusion models. Thesemodels are typically expensive to deploy, and sequential deployment for exploration or optimiza-tion has been shown to be prohibitively expensive. Similarly, rapid design exploration using suchfull-physics simulators is typically not possible, even with access to high performance computingresources.

Recent approaches overcome this challenge by first creating a diverse dataset of annotatedmorphologies and their performances, and then utilizing data-driven tools on this dataset to con-struct low-computational cost surrogate structure-property models. Such a strategy amortizes thecost of creating a large annotated dataset across multiple studies. Additionally, property annota-tion on this dataset using the full-physics simulators are embarrassingly parallel, thus, optimallyutilizing HPC resources.

Such structure-property surrogate models – especially in the context of OPV – have been suc-cessfully constructed and deployed for design optimization, process-structure-property linkages,sensitivity analysis, and other studies. However, most of these studies have:

• either relied on manual ’featurization’ of the morphologies based on knowledge of thephotophysics [1–3]. While very useful, such approaches are non-trivial and generally time-consuming. Additionally, manual featurization carries the risk of overlooking or neglectingimportant features,

• or utilized the full raw morphology data to construct structure-property linkages [4]. How-ever, these approaches need massive datasets to train good surrogate models due to thelarge input dimensionality (of the morphology image). Additionally, the resultant surro-gates are complex and usually not interpretable.

In this work, we bridge these two extremes by using a principled approach of unsupervisedfeaturization of the morphologies. These low dimensional set of features are then used to trainan accurate structure-property surrogate model. Specifically, the recently developed MaterialKnowledge System (MKS) framework [5–9] offers a data-driven framework for unsupervisedfeature engineering of material microstructures. This framework employs a voxelized repre-sentation of microstructures to efficiently compute the 2-point spatial correlations [10–12] andperform principal component analysis (PCA) [13, 14] on them to identify a sufficiently smallnumber of features representing the complex material microstructure. The feature engineeringdeveloped in the MKS framework is unsupervised in that the microstructure feature selection iscompletely uninfluenced by the output variables targeted by the surrogate model. Although alarge number of options exist for building the surrogate models of interest, recent work in theMKS framework [8, 9, 15–20] has demonstrated that Gaussian process regression (GPR) [14, 21]

2

Page 3: Feature engineering for microstructure-property mapping in ...

offers advantages because of its ability to formulate non-parametric models while allowing for arigorous consideration of the prediction uncertainty.

Certain extensions are needed to the current MKS framework in order to apply it successfullyto the present problem. First, a large number of pixel-scale (local) material states need to beconsidered, which is expected to be significantly larger than those encountered in prior casestudies. This is because of the need to consider not only the donor and acceptor pixels, but alsothe different types of the donor-acceptor interfaces present in the microstructure. Second, thesmall thickness of the films requires additional considerations in the feature engineering. This isbecause the distances of the different types of the donor-acceptor interfaces from the top and thebottom surfaces of the films are known to control their effective properties [22].

This paper describes the extensions made to the MKS framework so that it can be applied suc-cessfully to establish data-driven microstructure-property linkages for OPV films. More specifi-cally, this paper develops and demonstrates novel approaches to feature engineering the complexmicrostructures in blend films by combining digital image processing techniques with the previ-ously established MKS feature selection methods. The employment of digital image processingtechniques allows for computationally efficient pixel-scale labeling of the different material statesin the polymer blend films. When these protocols are followed by 2-point spatial correlations andPCA, they offer novel unsupervised feature engineering of the complex material structure of OPVfilms. The tremendous utility of such feature engineering protocols is demonstrated in this pa-per by building a surrogate model for the prediction of the short circuit current of photovoltaicpolymer blend films using a localized version of GPR.

2. Background

2.1. Microstructure and photovoltaic property dataset

We utilize a curated dataset of microstructure images created by solving the Cahn-Hilliardequation [23] with varying initial conditions. The Cahn-Hilliard equation [23] describes phaseseparation occurring in a binary mixture, and has been shown to be a good representation ofmorphology evolution during fabrication of organic blend thin films [24–26] that are the typicalactive layer in OPV’s. The image data arising from these simulations provide a rich datasetfor constructing structure-property surrogate models [4]. The dataset is a collection of 33, 552microstructure images of 101 × 101 pixels in resolution. Each image is grayscale, with the valueof each pixel ranging between 0 to 1.

Each microstructure is virtually interrogated to extract its current-voltage characteristics, bysolving a morphology aware (i.e., spatially heterogeneous) photophysics device model. We de-ploy a validated, in-house software that uses a finite element based solution strategy for solvingthe photophysics device model [27–29]. The photophysics model is described by the steady stateexcitonic drift diffusion (XDD) equations. The XDD equations are a set of four tightly coupledpartial differential equations that model the optoelectronic physics of energy harvesting in or-ganic photovoltaic devices. The photophysics consists of the following stages (also illustrated inFigure 1):

• Incident solar radiation causes the generation of energetically active electron-hole pairs,called excitons (denoted by X), in the donor regions of the microstructure. These excitonsdiffuse across the microstructure and have a finite lifetime before becoming ground stateelectron-hole pairs;

3

Page 4: Feature engineering for microstructure-property mapping in ...

Figure 1: Schematic illustrating the various stages of the photophysics process (see main text for detailed description).

• Excitons that diffuse and reach the donor-acceptor interface undergo dissociation into elec-trons (denoted by n) and holes (denoted by p) at the donor-acceptor interface. The disso-ciation mechanism is material and field dependent (denoted by D);

• These generated charges (n,p) traverse the microstructure and reach their correspondingelectrodes (cathode and anode) to produce a current. Two mechanisms are responsible fordriving carrier transport or current flow. First, the drift, which is caused by the presenceof an electric field (denoted as the gradient of the potential, ∇ϕ, and second, the diffusion,which is caused by a spatial gradient of electron or hole concentration;

• The distribution of electrons and holes in the microstructure interacts with the applied volt-age and influences the electrostatic potential ϕ across the microstructure. Finally, electronsand holes can recombine (denoted by R) to create excitons

The photophysics described above is encoded using the exciton drift diffusion (XDD) equa-tions [27]. In prior work, these XDD equations were solved to get the performance of the OPVdevice, which is characterized by the short-circuit current Jsc. XDD simulation results for eachof the 34672 microstructures generated earlier provide us the photophysics properties (Jsc).

2.2. Feature engineering using MKS frameworkIn the MKS framework, the uniformly discretized (i.e., voxelated) representative volume

elements (RVEs) of the material microstructures are denoted by an array, mhs , whose elements

denote the volume fractions of the material state h found at voxel s. Microstructural domainswhere each voxel is occupied fully by a specific material state leads to microstructure arrayswhere the value of mh

s is either 0 or 1. Although it may be tempting to use mhs directly as the

feature set, it should be recognized that it lacks translational invariance. The MKS frameworkemploys the framework of 2-point spatial correlations [10–12], which are essentially auto- andcross-correlations of material state maps of the microstructure. Mathematically, the discretizedset of 2-point spatial correlations, denoted as f hh

r , are computed as

f hh′

r =1

S r

∑s

mhsmh

s+r (1)

where h and h′

index all of the material states present in the studied material system, r indexes aset of discretized vectors arising from the voxelization used to define mh

s , and S r denotes the total4

Page 5: Feature engineering for microstructure-property mapping in ...

number of pixels that allow for placement of vectors r within the microstructural domain. Thecomputations implied in Eq. (1) can be efficiently carried out using the fast Fourier transform(FFT) algorithm [30, 31].

The complete set of 2-point spatial correlations computed using Eq.(1) produces a large un-wieldy set of features. In the MKS framework, a smaller set of salient features is identified (i.e.,feature engineering) by performing principal component analysis (PCA) [13, 14], which (rota-tionally) transforms the data into a new space where the axes are organized by their ability toaccount for the variance in the dataset. The new orthogonal axes and the new coordinates ob-tained from the PCA are then referred to as PC scores and PC basis, respectively. Prior studieshave often shown a drastic dimensionality reduction going from ∼105 − 106 original microstruc-tural features to less than ∼10 − 15 PCs [6, 8, 15, 32, 33].

2.3. Gaussian process regression models

Although many surrogate model building approaches can be used for building structure-property linkages, prior work has shown the benefits of using Gaussian process regression (GPR)in combination with the MKS feature engineering described earlier [8, 9, 15–20]. GPR is par-ticularly powerful when building surrogate models for complex nonlinear systems/phenomena,where the parametric model forms are not yet established. The other main advantage of GPR liesin the quantification of the uncertainty associated with the model predictions.

In the GPR-MKS framework, the reduced-order structure-property linkage of interest can bedecomposed into a linear mean function m and an error function ε often modeled as a zero-meanGaussian process. Mathematically, the desired model is expressed as [21]

p = m(γ) + ε (2)

m(γ) = β0 +

R∑i=1

βiγi (3)

ε ∼ GP(0, k(γ,γ′

)) (4)

where p is the target property (i.e., output), γ is the input feature vector consisting of R PCs, βare coefficients of the linear model, and k(γ,γ

) is the GP’s covariance function. The automaticrelevance determination squared exponential (ARD-SE) kernel [21] has often been used to definethe GP’s covariance. The ARD-SE kernel is mathematically expressed as

k(γ, γ′

) = σ2f exp

−12

R∑l=1

(γl − γ

l

)2

σ2l

+ σ2nδγγ′ (5)

where the scaling factorσ f , length scaleσl, and noise factorσn are hyperparameters of the kernelfunction, and δγγ′ is the Kronecker delta. The hyperparameter σn determines the homoscedasticnoise in the target predictions. The hyper parameter σ f controls the amplitude of the variance inthe output. The length scale σl automatically determines the relevance of input features on thepredictions. Higher values of σl results in smoother predictions, indicating minimal influenceon the output prediction. The values of hyperparameters need to be optimized during the modelbuilding process to obtain the best model.

The joint distribution of the observed training data (X) and the unobserved test data (X∗) isgiven by [21]

5

Page 6: Feature engineering for microstructure-property mapping in ...

[pp∗

]∼ N

(0 ,

[K(X, X) K∗(X, X∗)

K†∗ (X, X∗) K∗∗(X∗, X∗)

])(6)

The predictive posterior is obtained from conditioning the joint distribution fully defined by itsmean and covariance [21]:

µ∗ = K†∗K−1 p

Σ∗ = K∗∗ − K†∗K−1K∗

(7)

The main computationally intensive operation in GP formulation is the inversion of the kernelmatrix which scales as O(N3). Although this is a one-time computation, in case of large ensembleof training data, the computation and storage of K−1 present significant challenges. Prior studieshave addressed these challenges using methods such as low-rank approximations to GPs [21, 34],treed GPs [35, 36] and local approximate GP (laGP) [37, 38]. Recent research has demonstratedthat low-rank approximations and treed GPs tend to over-smooth the data, might impose an upperlimit on the data size and typically take longer to compute [39]. The recently developed laGPmodel is particularly attractive as it scales well with the data size, allows for non-stationaritymodelling, and is highly parallelizable. The laGP model employs a local subset of the data totrain separate GPs for each target point. The subset of data can be chosen as n nearest neighborsof the target point. However, this simple criterion does not yield the optimum predictions. In-stead, the laGP approach utilized in this work employs the active learning Cohn (ALC) method[38, 40] to sequentially update the chosen subset of the training points. The ALC sequentiallyidentifies points whose addition to the local subset maximizes the expected information gain bymaximizing the reduction in the prediction variance.

3. Microstructure-Property models for photovoltaic polymers

The workflow used in this paper for building the surrogate microstructure-property modelsfor OPVs will involve two main steps: (i) unsupervised feature engineering of the microstructureusing the MKS framework, and (ii) establishing the laGP models using the engineered features.Further details of these steps are described next.

3.1. Material states in OPV microstructuresThe gray-scale OPV microstructures (with each pixel value ranging between zero and one)

obtained from solving the Cahn-Hilliard equation (summarized in section 2.1) are thresholdedinto binary microstructures consisting of donor (D) and acceptor (A) phases (i.e., material statebinerization). In this study, a threshold of 0.5 was used to convert the gray-scale microstructuresinto binary microstructures. As mentioned earlier, in order for the (binarized) OPV microstruc-tures to have efficient charge transport, the donor and the acceptor regions should be directlyconnected to the corresponding electrodes positioned at top and bottom surfaces of the thin films,respectively. In other words, the donor/acceptor pixels connected/unconnected to their respectiveelectrodes are expected to play very different roles in the performance of the OPVs. Therefore,it was decided to define four different material states for labelling the individual pixels in themicrostructures: (i) DΛ - donor pixels connected to the top surface, (ii) D◦ - donor pixels uncon-nected to the top surface, (iii) A∨ - acceptor pixels connected to the bottom surface, and (iv) A◦ -acceptor pixels unconnected to the bottom surface.

6

Page 7: Feature engineering for microstructure-property mapping in ...

In addition, the different types of the donor-acceptor interfaces present in the microstructureaffect the charge transport in very different ways. As the charge transport occurs mainly in theconnected regions, any interface between two connected regions, I1 = (DΛ, A∨), is most effective.It can also be seen that any interface between two unconnected regions, I2 = (D◦, A◦), is leasteffective. The other two types of interfaces, I3 = (DΛ, A◦) and I4 = (D◦, A∨), are consideredsemi-effective.

As a final consideration, the charges created in OPV microstructures typically move throughthe donor and acceptor regions that are directly connected to the top and bottom electrodes (DΛ

and A∨), respectively. In addition, if unconnected donor/acceptor regions (D◦ and A◦) are con-siderably close to their respective electrodes, they also play a role in the charge transport [22].This consideration is especially important for microstructures that comprised only unconnecteddonor/acceptor regions. The charge transport of such microstructures is inversely related to thedistance of the closest D◦ and A◦ from their relevant electrodes.

Leveraging the insights described above, we devised and implemented a 3-step procedure toassign material local states to each pixel in each OPV microstructure. In the first step, we assignone of the four material local states described above to each voxel in the OPV microstructure:DΛ, D◦, A∨, and A◦ (see Figure 2a). This was achieved by first considering the donor phaseas the foreground (i.e., assigning values of one to donor pixels and zero to acceptor pixels) andusing a cluster labeling algorithm [41] to identify uniquely the connected sets of the donor pixels(i.e., donor clusters). The pixels in the donor clusters connected to the top surface were allassigned the material state DΛ, while the rest of the donor pixels were assigned the material stateD◦. A similar procedure was performed to assign the material states A∨ and A◦. Note that theassignment of these four material states is mutually exclusive. In other words, every pixel in themicrostructure is assigned only one of the four material states mentioned above.

In the second step, we have defined an additional material local state identifying the differ-ent types of interfaces between the donor and acceptor pixels. This additional material state isassigned only to the interface pixels. As already described, a total of four different interfaces arepossible: (DΛ, Av), (DΛ, A◦), (D◦, Av), and (D◦, A◦) (see the microstructure shown in Figure 2b).In this work, we adopted a 2-pixel interfacial region that included the first pixel on either sideof the interface. The interface pixels are identified using a computational strategy developed inprior work for 2-phase microstructures [8]. This computation is performed using the convolutionkernel shown in Figure 2b on selected foreground material states, which produces an integer ci ateach pixel i. The values of ci ∈ [1 : 4] identify interfacial pixels (i.e., the interior pixels within theforeground and background would have values zero and five, respectively). In this work, specialconsiderations were made to account for the non-periodicity of the microstructures. Specifically,this challenge was addressed using suitable zero-padding schemes [31]. For non-periodic mi-crostructures, the sets of edge pixels and corner pixels were identified separately; edge pixelswith ci ∈ [1 : 3] and corner pixels with ci ∈ [1, 2] denote interfacial pixels. By applying theprocedure described above to each phase (i.e., treating each phase as foreground one at a time),each interface pixel can be mapped uniquely to one the aforementioned four types of interfaces.Figure 2b shows the labelling of the interface pixels for the example microstructure shown inFigure 2a.

In the last step of the unsupervised feature identification procedure employed in this study,we identify a third material state descriptor for the top/bottom rows of pixels connecting to theelectrodes. This feature is designed to capture the effect arising from the shortest distance ofdonor/acceptor pixels from their respective electrodes. As already mentioned, this feature is es-pecially important for microstructures where the donor/acceptor pixels are not in direct contact

7

Page 8: Feature engineering for microstructure-property mapping in ...

(a) (b) (c)

Figure 2: Labelling of the material local states to each pixel of a selected OPV microstructure. (a) Each pixel is assignedone of the four material local states corresponding to connected/unconnected donor/acceptor pixels. Connectivity in thiscontext refers to whether the donor/acceptor pixels are connected to their corresponding electrodes at the top/bottomsurfaces. (b) Each interface pixel is assigned one of the four interfaces. The interfacial region is considered to be 2-pixelthick, comprising both pixels on either side of the interface. The convolution kernel used to identify the interfaces isshown on the right. (c) A third material state is assigned to the top and bottom rows of pixels based on the shortestdistance, d, of the donor (acceptor) pixels from the top (bottom) surface. The plot shows distribution of d for the selectedmicrostructure.

with their corresponding electrodes. Figure 2c presents a histogram of the shortest vertical dis-tance of the donor (acceptor) pixel to the top (bottom) surface, d, for the example microstructureshown in Figure 2a. It was decided to use exp(−d/λ) as the feature value for each top/bottompixel, with λ = 10 nm reflecting the expected diffusion length for exciton transport [22, 42].Consequently, the feature value is one when the pixels are in direct contact and exponentiallydecreases when there is a gap.

After labelling the material local states, the next step involves the computation of the im-portant microstructure statistics. The central challenge comes from the large number of spatialstatistics that could be computed. In the present case, since there are a total of eight materiallocal states (four acceptor/donor states and four interface states), one can potentially define atotal of 82 = 64 sets of spatial correlations (including auto-correlations and cross-correlations).Since each set of spatial correlations has a total of 101 × 101 = 10, 201 features, the full setof features becomes unwieldy for establishing surrogate models. In prior work [8] on correlat-ing the effective permeability of a porous solid to its pore structure, it was observed that theauto-correlations of the material local states (including interface states) were adequate for pro-ducing high fidelity structure-property linkages. Utilizing the insights from that work, we haveincluded only the following sets of spatial correlations in establishing the surrogate models pre-sented in this work: i) 2-point spatial auto-correlations for each of the four main material localstates

{f DΛDΛ

r , f D◦D◦r , f A∨A∨

r , f A◦A◦r

}, and ii) 2-point spatial auto-correlations for each of the four

interfacial local states{f I1I1r , f I2I2

r , f I3I3r , f I4I4

r}. Even using only this subset of spatial correlations

produces a total of 8 × 10, 201 = 81, 608 features. As already described in Section 2.2, PCA isapplied to obtain a small number of features (i.e., PC scores) as inputs to the surrogate structure-property models. Prior to application of PCA, each of the eight sets of spatial correlations arescaled to exhibit the same variance across the entire dataset. This is necessary due to the fact thatPCA aims to capture the variance in the dataset in the smallest number of terms. Therefore, scal-ing the different sets of spatial correlations ensures that each set of spatial correlations is equallyweighted in the PC representations. In this work, for reasons already explained, the averagedvalues of exp(−d/10) for both electrodes, denoted as

{δΛ, δ∨

}, are used as additional features

8

Page 9: Feature engineering for microstructure-property mapping in ...

(i.e., these are appended to the selected PC scores representing the microstructure statistics asadditional features).

3.2. Local Gaussian process surrogate models for OPVsThe microstructure PC scores as well as the two distance-based features are used as inputs to

train a local Gaussian process (laGP) surrogate model to predict the short circuit current of OPVmicrostructures. Each input is scaled to exhibit the same variance across the entire ensembleof the dataset. This is needed because laGP models identify local subsets of the training datausing suitable distance measures. For each target point, the first n0 closest neighboring pointsare chosen as the initial training set for building the initial GP. Subsequently, the ALC criterionis used to sequentially update the training data to maximize the expected information gain. Asthe training subset is sequentially updated, one expects to see a systematic decrease in the im-provement to the model performance. Consequently, one would naturally reach a point wherefurther updating the training set would only minimally improve the laGP model performance.In this study, the sequential update of the laGP model was continued until the reduction in theprediction variance was smaller than 10−6. In the protocol described above, the final size of thelocal training set is denoted as nd = n0 + nALC , where nALC denotes the number of training pointsselected using the ALC criterion. The performance of the trained laGP models produced in thiswork was quantified using multiple error measures, including normalized mean absolute error(nMAE), normalized median absolute deviation (nMAD) and R2. These are defined as

nMAE =

1N

∑Ni=1 |J

(i)sc − J(i)

sc |

Jsc(8)

nMAD =

median(|J(1)

sc − J(1)sc |, |J

(2)sc − J(2)

sc |, · · · , |J(N)sc − J(N)

sc |

)Jsc

(9)

R2 = 1 −∑N

i=1(J(i)sc − J(i)

sc )2∑Ni=1(J(i)

sc − Jsc)2(10)

where J(i)sc and J(i)

sc are the actual (ground truth) and the predicted short circuit current of the ith

target point, and N is the number of test points. Jsc denotes the mean value of the Jsc values. R2

serves as an indicator of how much of the variation in the output is explained by the inputs. Thevalue of R2 for a perfect model is expected to be one. Likewise, for a flat line model that alwayspredicts the mean, the value of R2 will be zero.

4. Results and discussions

In the present study, an ensemble of 33,552 distinct OPV microstructures was generated toestablish the desired data-driven microstructure-property linkage for OPV films. The short circuitcurrent Jsc associated with each microstructure was obtained by solving the XXD equationsdiscussed in Section 2.1. The unsupervised feature engineering framework described in Section3.1 was employed on each microstructure.

Figure 3 depicts the eight sets of spatial auto-correlations computed for the example mi-crostructure shown in Figure 2a. The top and bottom rows in this figure present spatial auto-correlations of the four main material states and the four interface states, respectively. Note that

9

Page 10: Feature engineering for microstructure-property mapping in ...

Figure 3: The 8 sets of 2-point spatial auto-correlations corresponding to main material states and interfaces of theexample microstructure shown in Figure 2 is shown. The center value of these statistical maps is volume fraction of thecorresponding material state.

the auto-correlations exhibit centro-symmetry, because the values of the statistics for r and −rare the same. Therefore, half the information in these maps is redundant and could be eliminatedbefore performing the PCA. The central peak value in each auto-correlation map, correspondingto r = 0, reflects the volume fraction of the specific material state. For the interface states, thisvalue corresponds to the volume fraction occupied by the 2-voxel wide interface regions definedin this work. The auto-correlation maps implicitly capture a significant amount of statistical in-formation on the shape, size, and spacing distributions of the material states in the microstructure.For instance, the bands in the f DΛDΛ

r map capture important features related to the size, shape,orientation, and spacing of the DΛ regions in the microstructure (compare the auto-correlationmap with the actual microstructure in Figure 2a). Similarly, f A◦A◦

r captures the details of the morecompact and isolated positioning of the A◦ regions in this microstructure. In contrast, the auto-correlation maps for D◦ and A∨ indicate that these regions are more broadly distributed in themicrostructure. Similar observations can be made for the auto-correlation maps of the interfacestates.

In order to efficiently compute the PCA of the large data matrix of size 33, 552 × 81, 608assembled in this work, we took advantage of the randomized SVD algorithm implemented inDASK package in Python programming language [43]. It was decided to truncate the PC repre-sentations obtained from this protocol to 10 PCs, because there was no appreciable improvementin the variance captured beyond this truncation level. This represents a significant reduction inthe dimensionality of the microstructure representation, where we started with 81,608 spatialcorrelations and ended up with only 10 PC scores. The representation of all 33,552 microstruc-tures in the first three PCs is presented in Figure 4. In this figure, each data point corresponds tothe first three PC scores of the microstructure statistics and is colored using its value of Jsc. Al-though the three PC scores represent only a subset of the regressors we intend to use in this work(a total of ten PC scores and two distance-based metrics will be used), it is very encouraging tosee the patterns in Figure 4 suggesting a strong dependence of the target on these regressors.

10

Page 11: Feature engineering for microstructure-property mapping in ...

Figure 4: The low-dimensional representation of the entire data ensemble of OPV microstructures in the first 3 PC basisis depicted. The PC representations are truncated after the first 10 PCs. The unsupervised PCA is powerful in capturingthe microstructural differences as well as the variance in the values of Jsc.

A direct interpretation of the PC scores is currently not possible. Essentially, each PC ba-sis represents a linearly weighted collection of 81,608 spatial correlations. The PC score ofeach OPV microstructure represents the projection (i.e., dot product) of its set of 81,608 spatialcorrelations on the corresponding PC basis. The high dimensionality of the PC basis makes itimpractical to seek the precise physical meaning of the PC scores. However, it was found thatthe first PC score is highly correlated to the volume fractions of the four main material localstates as well as the I1 and I3 interface states (interfaces of DΛ with acceptor material states).The second PC score was found to be highly correlated to volume fractions of I2 and I4 interfacestates (interfaces of A∨ with donor material states). In addition to the information on the volumefractions, PC scores contain rich information on other morphological aspects of microstructuressuch as shape, size, orientation and spacing of material features within OPV microstructures.For instance, as seen from Figure 4, the microstructures comprising coarser regions of A∨ and/orD◦ have higher PC1 values, while the microstructures with coarser regions of A◦ and/or DΛ havesmaller PC1 values. Several other similar qualitative observations can be made by inspecting Fig-ure 4 closely. As another example, it can been seen that microstructures comprised mainly/onlyfrom coarser unconnected donor/acceptor regions are completely separated from the microstruc-tures with connected donor/acceptor finer regions in the low-dimensional PC representation.

The 10 microstructure PC scores and the two averaged distance-based metrics, δΛ and δ∨,are used as inputs to train the surrogate laGP models using the R package language [44]. Asalready noted, each input feature is scaled to exhibit the same variance across the entire datasetfor this model building strategy. A laGP model is produced for each test point using a set ofn0 = 35 closest neighbors in the input domain. The ALC criterion is employed to sequentiallyadd points to the design space such that their addition maximizes the expected information gain.The training size for the 33,552 laGP models produced in this study was in the range [36, 346].

11

Page 12: Feature engineering for microstructure-property mapping in ...

The distribution of the training sizes is shown in Figure 5a. It is seen that more than 99% of thelaGP models built in this study needed less than 100 local training data points. This small sizeof the local training data set significantly reduces the computational cost involved in building thedesired laGP models. Figures 5b and 5c present the parity plot comparing the Jsc predictionsfrom the laGP models with their corresponding ground-truth values as well as the uncertainty as-sociated with the model predictions (i.e., one standard deviation from the mean prediction shownas error bars) and the distribution of the relative mean absolute errors, respectively. The standarddeviation in 99.7% of the trained models is within 5% of the Jsc. Those few models that exhibithigher uncertainties correspond to the microstructures that fall on the boundary of the input PCdomain. This is to be expected as laGP performs better in the interior of the input domain, com-pared to the edges of the input domain (there is limited availability of training points in theseregions). The normalized absolute error was higher than 0.05 in only 8% of the trained laGPmodels. Considering the entire set of trained models, the normalized mean absolute error nMAEand the normalized mean median absolute deviation nMAD were 2.16% and 1.10%, respectively.Moreover, a high value of R2 = 0.99 was calculated for the trained laGP models, which demon-strates that a high proportion of the variance in the target is being captured well by the modelinputs. This clearly demonstrates the efficacy of the novel feature engineering framework pre-sented in this work in establishing high-fidelity data-driven microstructure-property mappings inorganic photovoltaics.

5. Conclusions

A novel unsupervised feature engineering framework for data-driven mappings of microstruc-tures to their photovoltaic properties has been successfully developed. A computationally effi-cient labeling of two sets of salient material states (four bulk material states and four interfacestates) served as critical features, and were extracted via digital image processing techniques. Inorder to take into account the non-periodicity of the OPV microstructures, suitable zero-paddingschemes were utilized. It was found that only 2-point spatial auto-correlations of the eight sets oflabeled material states were sufficient to extract reasonably accurate structure-property linkages.The low-dimensional representation of this rich large set of material statistics was then obtainedfrom principal component analysis by taking advantage of a scalable randomized SVD algorithm.In addition to material PC scores, it was found that two additional expert-defined distance-basedmetrics further improved the accuracy of the data driven structure-property linkages. A localized-version of the Gaussian process (laGP) was employed to extract these data-driven reduced orderstructure-property linkages. The laGP model took advantage of active learning to detect a smallsubset of the training data used to build separate models for each target data point. It was shownthat with only a small subset of the training dataset one can build accurate laGP models. Thissignificantly reduced the computational cost involved in building of the desired laGP model. Theuncertainty associated with the model predictions were quantified by considering one standarddeviation from the mean prediction. It was found that the uncertainty of only 0.3% of the trainedmodels is higher than 5% of the mean value of the target property. The high-fidelity accuratestructure-property linkages extracted in this study attest to the tremendous efficacy of the pro-posed novel feature engineering framework for complex organic photovoltaic microstructures.

12

Page 13: Feature engineering for microstructure-property mapping in ...

(a)

(b)

(c)

Figure 5: Depiction of the performance of the data-driven structure-property linkages trained for OPV microstructuresis presented. The total design size of each laGP model is determined using ALC criterion. The distribution of the finaldesign size is shown in (a). The parity plot comparing the predictions and actual values of Jsc as well as the relative meanabsolute error are presented in (b) and (c), respectively.The established high-fidelity microstructure-property linkagesdemonstrate the utility of the developed feature engineering framework for organic photovoltaics.

13

Page 14: Feature engineering for microstructure-property mapping in ...

6. Acknowledgments

The authors acknowledge funding from NASA Langley Research Center. BG acknowledgespartial support from NSF 1906194 and ONR Award N00014-19-1-2453. SK acknowledges par-tial support from Vannevar Bush Fellowship through ONR Award N00014-18-1-2879.

References

[1] O. Wodo, J. D. Roehling, A. J. Moule, B. Ganapathysubramanian, Quantifying organic solar cell morphology: acomputational study of three-dimensional maps, Energy & Environmental Science 6 (2013) 3060–3070.

[2] O. Wodo, S. Tirthapura, S. Chaudhary, B. Ganapathysubramanian, Computational characterization of bulk hetero-junction nanomorphology, Journal of Applied Physics 112 (2012) 064316.

[3] O. Wodo, J. Zola, B. S. S. Pokuri, P. Du, B. Ganapathysubramanian, Automated, high throughput exploration ofprocess–structure–property relationships using the mapreduce paradigm, Materials discovery 1 (2015) 21–28.

[4] B. S. S. Pokuri, S. Ghosal, A. Kokate, S. Sarkar, B. Ganapathysubramanian, Interpretable deep learning for guidedmicrostructure-property explorations in photovoltaics, npj Computational Materials 5 (2019) 1–11.

[5] S. R. Kalidindi, Hierarchical materials informatics: novel analytics for materials data, Elsevier, 2015.[6] A. Iskakov, Y. C. Yabansu, S. Rajagopalan, A. Kapustina, S. R. Kalidindi, Application of spherical indentation

and the materials knowledge system framework to establishing microstructure-yield strength linkages from carbonsteel scoops excised from high-temperature exposed components, Acta Materialia 144 (2018) 758–767.

[7] M. I. Latypov, L. S. Toth, S. R. Kalidindi, Materials knowledge system for nonlinear composites, ComputerMethods in Applied Mechanics and Engineering 346 (2019) 180–196.

[8] Y. C. Yabansu, P. Altschuh, J. Hotzer, M. Selzer, B. Nestler, S. R. Kalidindi, A digital workflow for learningthe reduced-order structure-property linkages for permeability of porous membranes, Acta Materialia 195 (2020)668–680.

[9] S. Hashemi, S. R. Kalidindi, A machine learning framework for the temporal evolution of microstructure dur-ing static recrystallization of polycrystalline materials simulated by cellular automaton, Computational MaterialsScience 188 (2021) 110132.

[10] S. Torquato, Random heterogeneous materials: microstructure and macroscopic properties, volume 16, SpringerScience and Business Media, 2002.

[11] S. R. Niezgoda, Y. C. Yabansu, S. R. Kalidindi, Understanding and visualizing microstructure and microstructurevariance as a stochastic process, Acta Materialia 59 (2011) 6387–6400.

[12] S. R. Niezgoda, A. K. Kanjarla, S. R. Kalidindi, Novel microstructure quantification framework for databasing,visualization, and analysis of microstructure data, Integrating Materials and Manufacturing Innovation 2 (2013) 3.

[13] T. Hastie, R. Tibshirani, J. Friedman, J. Franklin, The elements of statistical learning: data mining, inference andprediction, The Mathematical Intelligencer 27 (2005) 83–85.

[14] C. M. Bishop, Pattern recognition and machine learning, springer, 2006.[15] P. Fernandez-Zelaia, Y. C. Yabansu, S. R. Kalidindi, A comparative study of the efficacy of local/global and

parametric/nonparametric machine learning methods for establishing structure–property linkages in high-contrast3d elastic composites, Integrating Materials and Manufacturing Innovation 8 (2019) 67–81.

[16] A. E. Tallman, K. S. Stopka, L. P. Swiler, Y. Wang, S. R. Kalidindi, D. L. McDowell, Gaussian-process-drivenadaptive sampling for reduced-order modeling of texture effects in polycrystalline alpha-ti, JOM 71 (2019) 2646–2656.

[17] Y. C. Yabansu, A. Iskakov, A. Kapustina, S. Rajagopalan, S. R. Kalidindi, Application of gaussian process re-gression models for capturing the evolution of microstructure statistics in aging of nickel-based superalloys, ActaMaterialia 178 (2019) 45–58.

[18] Y. C. Yabansu, V. Rehn, J. Hotzer, B. Nestler, S. R. Kalidindi, Application of gaussian process autoregressivemodels for capturing the time evolution of microstructure statistics from phase-field simulations for sintering ofpolycrystalline ceramics, Modelling and Simulation in Materials Science and Engineering 27 (2019) 084006.

[19] S. Parvinian, Y. C. Yabansu, A. Khosravani, H. Garmestani, S. R. Kalidindi, High-throughput exploration of theprocess space in 18protocols and gaussian process models, Integrating Materials and Manufacturing Innovation 9(2020) 199–212.

[20] A. Marshall, S. R. Kalidindi, Autonomous development of a machine-learning model for the plastic response oftwo-phase composites from micromechanical finite element models, JOM (2021) 1–11.

[21] C. K. Williams, C. E. Rasmussen, Gaussian processes for machine learning, volume 2, MIT Press Cambridge, MA,2006.

14

Page 15: Feature engineering for microstructure-property mapping in ...

[22] O. Wodo, S. Tirthapura, S. Chaudhary, B. Ganapathysubramanian, A graph-based formulation for computationalcharacterization of bulk heterojunction morphology, Organic Electronics 13 (2012) 1105–1113.

[23] J. W. Cahn, J. E. Hilliard, Free energy of a nonuniform system. I. interfacial free energy, The Journal of chemicalphysics 28 (1958) 258–267.

[24] O. Wodo, B. Ganapathysubramanian, Modeling morphology evolution during solvent-based fabrication of organicsolar cells, Computational Materials Science 55 (2012) 113–126.

[25] O. Wodo, B. Ganapathysubramanian, How do evaporating thin films evolve? unravelling phase-separation mecha-nisms during solvent-based fabrication of polymer blends, Applied Physics Letters 105 (2014) 153104.

[26] K. Zhao, O. Wodo, D. Ren, H. U. Khan, M. R. Niazi, H. Hu, M. Abdelsamie, R. Li, E. Q. Li, L. Yu, et al., Verticalphase separation in small molecule: polymer blend organic thin film transistors can be dynamically controlled,Advanced Functional Materials 26 (2016) 1737–1746.

[27] H. K. Kodali, B. Ganapathysubramanian, Computer simulation of heterogeneous polymer photovoltaic devices,Modelling and Simulation in Materials Science and Engineering 20 (2012) 035015.

[28] H. K. Kodali, B. Ganapathysubramanian, A computational framework to investigate charge transport in hetero-geneous organic photovoltaic devices, Computer Methods in Applied Mechanics and Engineering 247 (2012)113–129.

[29] S. Pfeifer, B. S. S. Pokuri, P. Du, B. Ganapathysubramanian, Process optimization for microstructure-dependentproperties in thin film organic electronics, Materials Discovery 11 (2018) 6–13.

[30] D. T. Fullwood, S. R. Niezgoda, B. L. Adams, S. R. Kalidindi, Microstructure sensitive design for performanceoptimization, Progress in Materials Science 55 (2010) 477–562.

[31] A. Cecen, T. Fast, S. R. Kalidindi, Versatile algorithms for the computation of 2-point spatial correlations inquantifying material structure, Integrating Materials and Manufacturing Innovation 5 (2016) 1.

[32] A. Khosravani, A. Cecen, S. R. Kalidindi, Development of high throughput assays for establishing process-structure-property linkages in multiphase polycrystalline metals: Application to dual-phase steels, Acta Materialia123 (2017) 55–69.

[33] N. H. Paulson, M. W. Priddy, D. L. McDowell, S. R. Kalidindi, Reduced-order structure-property linkages forpolycrystalline microstructures based on 2-point statistics, Acta Materialia 129 (2017) 428–438.

[34] A. Wilson, H. Nickisch, Kernel interpolation for scalable structured gaussian processes (kiss-gp), in: InternationalConference on Machine Learning, PMLR, 2015, pp. 1775–1784.

[35] T. D. Bui, R. E. Turner, Tree-structured gaussian process approximations, in: Advances in Neural InformationProcessing Systems, 2014, pp. 2213–2221.

[36] B.-J. Lee, J. Lee, K.-E. Kim, Hierarchically-partitioned gaussian process approximation, in: Artificial Intelligenceand Statistics, PMLR, 2017, pp. 822–831.

[37] R. B. Gramacy, D. W. Apley, Local gaussian process approximation for large computer experiments, Journal ofComputational and Graphical Statistics 24 (2015) 561–578.

[38] R. B. Gramacy, lagp: Large-scale spatial modeling via local approximate gaussian processes in r, Journal ofStatistical Software 72 (2016) 1–46.

[39] M. J. Heaton, A. Datta, A. O. Finley, R. Furrer, J. Guinness, R. Guhaniyogi, F. Gerber, R. B. Gramacy, D. Ham-merling, M. Katzfuss, A case study competition among methods for analyzing large spatial data, Journal ofAgricultural, Biological and Environmental Statistics 24 (2019) 398–425.

[40] D. A. Cohn, Neural network exploration using optimal experiment design, Neural networks 9 (1996) 1071–1083.[41] R. M. Haralock, L. G. Shapiro, Computer and robot vision, Addison-Wesley Longman Publishing Co., Inc., 1991.[42] P. E. Shaw, A. Ruseckas, I. D. Samuel, Exciton diffusion measurements in poly (3-hexylthiophene), Advanced

Materials 20 (2008) 3516–3520.[43] M. Rocklin, Dask: Parallel computation with blocked algorithms and task scheduling, in: Proceedings of the 14th

python in science conference, volume 130, Citeseer, 2015, p. 136.[44] R. B. Gramacy, lagp: large-scale spatial modeling via local approximate gaussian processes in r, Journal of

Statistical Software 72 (2016) 1–46.

15