arXiv:2010.06175v1 [stat.ML] 13 Oct 2020


Neural Gaussian Mirror for Controlled Feature Selection in Neural Networks

Xin Xing∗1, Yu Gui†2, Chenguang Dai3, and Jun S. Liu3

1 Department of Statistics, Virginia Tech; 2 Department of Statistics, University of Chicago; 3 Department of Statistics, Harvard University

Abstract

Deep neural networks (DNNs) have become increasingly popular and achieved outstanding performance in predictive tasks. However, the DNN framework itself cannot inform the user which features are more or less relevant for making the prediction, which limits its applicability in many scientific fields. We introduce neural Gaussian mirrors (NGMs), in which mirrored features are created, via a structured perturbation based on a kernel-based conditional dependence measure, to help evaluate feature importance. We design two modifications of the DNN architecture for incorporating mirrored features and providing mirror statistics to measure feature importance. As shown in simulated and real data examples, the proposed method controls the feature selection error rate at a predefined level and maintains a high selection power even in the presence of highly correlated features.

1 Introduction

Recent advances in deep neural networks (DNNs) have significantly improved the state-of-the-art prediction methods in many fields such as image processing, healthcare, genomics, and finance. However,

∗,† The first two authors contributed equally to this paper.

the complex structures of DNNs often lead to their lack of interpretability. In many fields of science, model interpretation is essential for understanding the underlying scientific mechanisms and for rational decision-making. At the very least, it is important to know which features are more relevant to the prediction outcome and which are not, i.e., to conduct feature selection. Without properly assessing and controlling the reliability of such feature selection efforts, the reproducibility and generalizability of the discoveries are likely unstable. For example, in genomics, current developments lead to good predictions of complex traits using millions of genetic and phenotypic predictors. If one trait is predicted as "abnormal," it is important for the doctor to know which predictors are active in such a prediction, whether they make medical and biological sense, and whether there exist targeted treatment strategies.

In classical statistical inference, p-value based methods such as the Benjamini-Hochberg [5] and Benjamini-Yekutieli [6] procedures have been widely employed for controlling the false discovery rate (FDR). However, in nonlinear and complex models, it is often difficult to obtain valid p-values since the analytical form of the distribution of a relevant test statistic is largely unknown. Approximating p-values by bootstrapping and other resampling methods is a valid strategy, but it is also often infeasible due to model non-identifiability (i.e., the existence of multiple equivalent models, such as a DNN with permuted input or internal nodes), complex data and model structures, and high computational costs. In an effort to cope with these difficulties, [2] proposed the knockoff filter, which constructs a "fake" copy of the feature matrix that is independent of the response but retains the covariance structure of the original feature matrix (also called the "design matrix"). Important features can be selected after comparing inferential results for the true features with those for the knockoffs. To extend the idea to higher-dimensional and more complex models, [9] further proposed the Model-X knockoff, which considers a random design with a known joint distribution (Gaussian) of the input features. Recently, [34] developed the PCS inference framework to investigate the stability of the data by introducing perturbations to the data, the model, and the algorithm. Dropout [29] and Gaussian dropout can be viewed as examples of model perturbations, which are efficient for reducing instability and elevating out-of-sample performance.

For enhancing interpretability, feature selection in neural networks has attracted some recent attention. To this end, gradients with respect to each input are usually taken as an effective measurement of feature importance in convolutional and multi-layer neural networks, as shown in [18]. [27] selects features in DNNs based on changes in the cross-validation (CV) classification error rate caused by the removal of an individual feature. However, how to control the error of such a feature selection process in DNNs is unclear. A recent work by [22] (DeepPINK) utilizes the Model-X knockoff method to construct a pairwise-connected input layer to replace the original one, and the latent "competition" within each pair separates important features from unimportant ones as measured by the knockoff statistics. However, the Model-X method requires that the distribution of X be either known exactly, or known to lie in a nice distribution family that possesses simple sufficient statistics [20]. Also, the construction procedure for the knockoffs in [22] is based on a Gaussian assumption on the input features, which is not satisfied in many applications such as genome-wide association studies with discrete input data. In addition, DeepPINK does not exactly fit the purpose of network interpretation, since the input layer is not fully connected to the hidden layers, and thus some of the connections in the "original" structure may be lost in this specific pairwise-input structure.

In this paper, we introduce the neural Gaussian mirror (NGM) strategy for feature selection with controlled false selection error rates. NGM does not require any knowledge about, or assumption on, the joint distribution of the input features. Each NGM is created by perturbing an input feature explicitly. For example, for the j-th feature, we create two mirrored features as X_j^+ = X_j + c_j Z_j and X_j^- = X_j − c_j Z_j, where Z_j is a vector of i.i.d. standard Gaussian random variables. The scalar c_j is chosen so as to minimize the dependence between X_j^+ and X_j^- conditional on the remaining variables, X_{−j}. To cope with nonlinear dependence, we introduce a kernel-based conditional independence measure and obtain c_j by solving an optimization problem. Then, we construct two modifications of the network architecture based on the mirrored design. The proposed mirror statistics tend to take large positive values for important features and are symmetric around zero for null features, which enables us to control the selection error rate.

2 Background

2.1 Controlled feature selection

In many modern applications, we are not only interested in fitting a model (with potentially many predictors) that can achieve high prediction accuracy, but are also keen on selecting relevant features with a controlled false selection error rate. Mathematically, we consider a general model with an unknown link function F describing the connection between the response variable Y and p predictor variables X ∈ R^p: Y = F(X) + ε, where ε denotes the random noise.

We define S_0 to be the set of "null" features: for every j ∈ S_0, we have Y ⊥⊥ X_j | X_{−j}, where X_{−j} = {X_1, ..., X_p} \ {X_j}, and define S_1 = {1, ..., p} \ S_0 as the set of relevant features. Feature selection is equivalent to recovering S_1 based on observations. When an estimated set Ŝ_1 is produced, the false discoveries are the elements of Ŝ_1 ∩ S_1^c, where S_1 is the true set of important features. The false discovery proportion (FDP) is defined as

FDP = |Ŝ_1 ∩ S_1^c| / |Ŝ_1|,   (2.1)

and its expectation is called the false discovery rate (FDR), i.e., FDR = E[FDP].

Technically, controlled feature selection methods are different from conventional feature selection methods such as the Lasso in that the former require additional effort in assessing the selection uncertainty by means of estimating the FDP, which becomes more challenging in DNNs.

2.2 Gaussian mirror design for linear models

To discern relevant features from irrelevant ones, we intentionally perturb the input features via the following Gaussian mirror design: for each feature X_j, we construct a mirrored pair (X_j^+, X_j^-) = (X_j + c_j Z_j, X_j − c_j Z_j), with Z_j ∼ N(0, 1) being independent of all the X's. We call (X_j^+, X_j^-, X_{−j}) the j-th mirror design. Regressing Y on the j-th mirror design, we obtain coefficients via the ordinary least squares (OLS) method. Let β̂_j^+ and β̂_j^- be the estimated coefficients for X_j^+ and X_j^-, respectively. The mirror statistic for the j-th feature is defined as

M_j = |β̂_j^+ + β̂_j^-| − |β̂_j^+ − β̂_j^-|.   (2.2)
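To make this construction concrete in the linear-model setting, the sketch below builds the j-th mirror design for a user-supplied c_j and computes the mirror statistic (2.2) via OLS. The function name and the use of numpy's least-squares routine are our own illustration choices, not part of the paper.

```python
import numpy as np

def gaussian_mirror_statistic(X, y, j, c_j, rng=None):
    """Mirror statistic M_j for feature j in a linear model fitted by OLS.

    X : (n, p) design matrix, y : (n,) response, c_j : perturbation scale.
    """
    rng = np.random.default_rng(rng)
    n, p = X.shape
    z = rng.standard_normal(n)                       # Z_j ~ N(0, I_n)
    x_plus, x_minus = X[:, j] + c_j * z, X[:, j] - c_j * z
    X_rest = np.delete(X, j, axis=1)                 # X_{-j}
    design = np.column_stack([x_plus, x_minus, X_rest])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)    # OLS fit
    b_plus, b_minus = beta[0], beta[1]
    return abs(b_plus + b_minus) - abs(b_plus - b_minus) # Eq. (2.2)
```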

For important features j ∈ S_1, β̂_j^+ tends to be similar to β̂_j^-, which helps cancel out the perturbation part, leading to a large positive value of M_j. Viewed another way, a large value of M_j indicates that the importance of the j-th feature is stable under the perturbation, i.e., β̂_j^+ ≈ β̂_j^-. For null features j ∈ S_0, the mirror statistics can be made symmetric about zero by choosing a proper c_j, as detailed in Theorem 2.1.

Figure 1: Histogram of mirror statistics: red represents important variables and blue stands for null variables (annotations in the plot: M_j ≥ t for nonnull features; M_j ≤ −t and M_j ≥ t for null features; threshold t).

Theorem 2.1 Assume Y = Xβ + ε, where ε is Gaussian white noise. Let X_j^+ = X_j + cZ_j, X_j^- = X_j − cZ_j, and I_j^L(c) = ρ_{X_j^+, X_j^- | X_{−j}}, i.e., the partial correlation of X_j^+ and X_j^- given X_{−j}. For any j ∈ S_0, if we set

c_j = argmin_c [I_j^L(c)]^2,   (2.3)

then we have P(M_j < −t) = P(M_j > t) for any t > 0.

A detailed proof of the theorem can be found in [31]. It shows that the symmetry property of the mirror statistic M_j for null features holds if we choose the c_j that minimizes the magnitude of the partial correlation I_j^L(c), which is equivalent to solving I_j^L(c) = 0 in linear models. In other words, the perturbation c_j Z_j makes X_j^+ and X_j^- partially uncorrelated given X_{−j}.

We note that in linear models with OLS fitting, the above construction is equivalent to generating an independent noise feature c_j Z and regressing Y on (X, c_j Z). The mirror statistic is simply the difference between the magnitudes of the estimated coefficient for X_j and that for c_j Z. However, this simpler construction is no longer equivalent to the original design for nonlinear models or for high-dimensional linear models where OLS is not used for fitting. Also, the conditional-independence motivation for choosing c_j is lost in this simpler formulation.

Figure 1 shows the distribution of the M_j's. It is seen that relevant features can be separated from the null ones fairly well, and a consistent estimate of the false discovery proportion can be obtained based on the symmetry property.

For a threshold t, we select features as Ŝ_1 = {j : M_j ≥ t}. The symmetry property of M_j for j ∈ S_0 implies that E[#{j ∈ S_0 : M_j ≥ t}] = E[#{j ∈ S_0 : M_j ≤ −t}]. Thus we obtain an estimate of the FDP as

FDP̂(t) = #{j : M_j ≤ −t} / (#{j : M_j ≥ t} ∨ 1).   (2.4)

In practice, we choose a data-adaptive threshold

τ_q = min{t > 0 : FDP̂(t) ≤ q}   (2.5)

to control the FDR at a predefined level q.
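Given the mirror statistics M_1, ..., M_p, the estimate (2.4) and the data-adaptive threshold (2.5) can be computed by scanning the observed values of |M_j| as candidate thresholds. The sketch below is one such implementation; the function name and the candidate-threshold grid are our own choices.

```python
import numpy as np

def select_features(M, q):
    """Select features whose mirror statistic exceeds the data-adaptive
    threshold tau_q of Eq. (2.5), targeting FDR level q."""
    M = np.asarray(M, dtype=float)
    # Candidate thresholds: the nonzero values taken by |M_j|.
    candidates = np.sort(np.unique(np.abs(M[M != 0])))
    for t in candidates:                       # smallest t with FDP_hat <= q
        fdp_hat = np.sum(M <= -t) / max(np.sum(M >= t), 1)   # Eq. (2.4)
        if fdp_hat <= q:
            return t, np.nonzero(M >= t)[0]
    return np.inf, np.array([], dtype=int)     # nothing selected
```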

As shown in Theorem 4 of [31], under a weak dependence assumption on the M_j's, we have

E[FDP(τ_q)] < q

as p goes to infinity, i.e., the FDR is asymptotically controlled at the predefined level q.

3 Neural Gaussian mirror

We here describe a model-free mirror design for controlling variable selection errors in DNNs. In linear models, a key step of our mirror design is to choose a proper perturbation level c_j so as to annihilate the partial correlation between the mirror variables X_j^+ and X_j^-. Since a zero partial correlation between two random variables implies their conditional independence under a joint Gaussian assumption, for more complex models such as DNNs we consider a general kernel-based measure of conditional dependence between X_j^+ and X_j^- and choose c_j by minimizing this measure. Then, we can construct the mirror statistics and establish a data-adaptive threshold to control the selection error rate in a similar way as in linear models. We call this whole procedure the neural Gaussian mirror (NGM).

3.1 Model-free mirror design

Without any specific model assumption, we construct the mirrored pair as

(X_j^+, X_j^-) = (X_j + c_j Z_j, X_j − c_j Z_j),

where Z_j ∼ N(0, I_n), and c_j is chosen as

c_j = argmin_c [I_j^K(c)]^2.   (3.1)

Here [I_j^K(c)]^2 is a kernel-based conditional independence measure between X_j^+ and X_j^- given X_{−j}. We give a detailed expression for [I_j^K(c)]^2 in Sections 3.2 and 3.3. Compared with (2.3), the kernel-based mirror design can incorporate nonlinear dependence effectively.

3.2 Decomposition of the log-density function

For notational simplicity, we write (U, V, W) = (X_j + cZ, X_j − cZ, X_{−j}) and denote their log-transformed joint density function by η, which belongs to a tensor-product reproducing kernel Hilbert space (RKHS) H [21]. We define H = H^U ⊗ H^V ⊗ H^W, where H^U, H^V and H^W are marginal RKHSs and "⊗" denotes the tensor product of two vector spaces. We decompose η as

η(u, v, w) = η_U(u) + η_V(v) + η_W(w) + η_{U,W}(u, w) + η_{V,W}(v, w) + η_{U,V}(u, v) + η_{U,V,W}(u, v, w).   (3.2)


The uniqueness of the decomposition is guaranteed by the probabilistic decomposition of H.

For simplicity, we use Euclidean space as an example to illustrate the basic idea of the tensor-sum decomposition, which is also known as the ANOVA decomposition in linear models. For the d-dimensional Euclidean space, let f be a vector and let f(x) be its x-th entry, x = 1, ..., d. Suppose A is the averaging operator defined as Af(x) = ⟨δ, f⟩, where δ = (δ_1, ..., δ_d) is a probability vector (i.e., δ_i ≥ 0 and Σ_i δ_i = 1). The tensor-sum decomposition of the Euclidean space R^d is

R^d = R^d_0 ⊕ R^d_1 := span{δ} ⊕ {f ∈ R^d : Σ_{x=1}^d δ_x f(x) = 0},

where the first space is called the grand mean and the second space is called the main effect. Then, we construct the kernels for R^d_0 and R^d_1 as in Lemma A.2 in the Supplementary Material (SM).

However, in an RKHS of infinite dimension, the grand mean is not a single vector. Here, we define the averaging operator A^l by A^l : f ↦ E_x f(x) = E_x⟨K^l_x, f⟩_{H^l} = ⟨E_x K^l_x, f⟩_{H^l}, where K^l is the kernel function of H^l and the first equality is due to the reproducing property. E_x K^l_x plays the same role as δ in Euclidean space. Then we have the tensor-sum decomposition of the marginal RKHS H^l:

H^l = H^l_0 ⊕ H^l_1 := span{E_x K^l_x} ⊕ {f ∈ H^l : A^l f = 0}.   (3.3)

Following the same fashion, we call H^l_0 the grand-mean space and H^l_1 the main-effect space. Note that E_x K^l_x is also known as the kernel mean embedding, which is well established in the statistics literature [7]. We give the kernel functions for H^l_0 and H^l_1 in Lemma A.3 (see the SM for details).

Following [17], we apply the distributive law to obtain the decomposition of H:

H = (H^U_0 ⊕ H^U_1) ⊗ (H^V_0 ⊕ H^V_1) ⊗ (H^W_0 ⊕ H^W_1)
  ≡ H_{000} ⊕ H_{100} ⊕ H_{010} ⊕ H_{001} ⊕ H_{110} ⊕ H_{101} ⊕ H_{011} ⊕ H_{111},   (3.4)

where H_{ijk} = H^U_i ⊗ H^V_j ⊗ H^W_k. We give the kernel functions for each subspace in Lemma A.4 (see the SM). Each component in (3.2) is the projection of η onto the corresponding subspace in (3.4). Thus, the decomposition of the log-density function in (3.2) is unique.

We introduce the following lemma, which establishes a necessary and sufficient condition for the conditional independence of U and V given W based on the decomposition in (3.2).

Lemma 3.1 Assume that the log-joint-density function η of (U, V, W) belongs to a tensor-product RKHS. Then U and V are conditionally independent given W if and only if η_{U,V} + η_{U,V,W} = 0.

Here η_C denotes the component that is a function only of the variables in the set C. The proof of this lemma is given in the SM. Lemma 3.1 implies that the following two hypothesis testing problems are equivalent:

H_0 : U ⊥⊥ V | W   vs.   H_1 : U ⊥̸⊥ V | W,   (3.5)

and

H_0 : η ∈ H_0   vs.   H_1 : η ∈ H \ H_0,   (3.6)

where H_0 := H_{000} ⊕ H_{100} ⊕ H_{010} ⊕ H_{001} ⊕ H_{011} ⊕ H_{101}. Compared to (3.5), a key advantage of (3.6) is that we are able to specify the function space of the log-transformed density under both the null and the alternative.

3.3 Kernel-based Conditional Dependence Measure

Let t_i = (U_i, V_i, W_i), i = 1, ..., n, be i.i.d. observations generated from the distribution of T = (U, V, W). The log-likelihood-ratio functional is

LR_n(η) = ℓ_n(η) − ℓ_n(P_{H_0} η) = −(1/n) Σ_{i=1}^n [η(t_i) − P_{H_0} η(t_i)],   η ∈ H,   (3.7)

where P_{H_0} is the projection operator from H onto H_0. Using the reproducing property, we rewrite (3.7) as

LR_n(η) = −(1/n) Σ_{i=1}^n [⟨K^H_{t_i}, η⟩_H − ⟨K^{H_0}_{t_i}, η⟩_H],   (3.8)

where K^H is the kernel of H and K^{H_0} is the kernel of H_0. Then, we calculate the Fréchet derivative of the likelihood-ratio functional as

D LR_n(η) Δη = ⟨ (1/n) Σ_{i=1}^n (K^H_{t_i} − K^{H_0}_{t_i}), Δη ⟩_H = ⟨ (1/n) Σ_{i=1}^n K^1_{t_i}, Δη ⟩_H,   (3.9)

where K^1 is the kernel of H_{110} ⊕ H_{111}. The score function (1/n) Σ_{i=1}^n K^1_{t_i} is the first-order approximation of the likelihood-ratio functional. We define our kernel-based conditional dependence measure as the squared norm of this score function:

[I_j^K(c)]^2 = ‖ (1/n) Σ_{i=1}^n K^1_{t_i} ‖^2_H,   (3.10)

which, by the reproducing property, can be expanded as

[I_j^K(c)]^2 = (1/n^2) Σ_{i=1}^n Σ_{j=1}^n K^1(t_i, t_j).   (3.11)

The construction of [I_j^K(c)]^2 is related to the kernel conditional independence test of [13], which generalizes the conditional covariance matrix to a conditional covariance operator in the RKHS. However, calculating the norm of the conditional covariance operator involves inverting an n × n matrix, which is expensive when the sample size n is large. We will show that our proposed score test statistic only involves matrix multiplications.

Figure 2: Individual neural Gaussian mirror in a DNN.

We introduce a matrix form of the squared norm of the score function to facilitate computation. In (3.11), [I_j^K(c)]^2 is determined by the kernel on H^U_1 ⊗ H^V_1 ⊗ H^W_0 ⊕ H^U_1 ⊗ H^V_1 ⊗ H^W_1. Thus, by Lemma A.2 and Lemma A.3, we can rewrite (3.11) as

[I_j^K(c)]^2 = (1/n^2) [(H K^U H) ∘ (H K^V H) ∘ K^W]_{++},   (3.12)

where H = I_n − (1/n) 1_n 1_n^⊤, I_n is the n × n identity matrix, 1_n is the n × 1 vector of ones, ∘ denotes the elementwise (Hadamard) product, and [A]_{++} = Σ_{i=1}^n Σ_{j=1}^n A_{ij}. The most popular kernel choices are Gaussian and polynomial kernels. With parallel computing, the computational complexity for using either is approximately linear in n.
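The matrix form (3.12) is straightforward to evaluate: center the Gram matrices of X_j^+ and X_j^-, take their elementwise product with the Gram matrix of X_{−j}, and sum all entries; c_j is then found by a one-dimensional search, as in (3.1). The sketch below assumes a Gaussian kernel with a fixed bandwidth gamma and uses scipy's bounded scalar minimizer; those choices, the search interval, and all names are ours.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def rbf_kernel(A, gamma=1.0):
    """Gaussian (RBF) kernel matrix for the rows of a 2-D array A."""
    sq = np.sum(A ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * A @ A.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kernel_cond_measure(u, v, W, gamma=1.0):
    """Squared kernel-based conditional dependence measure, Eq. (3.12):
    (1/n^2) [(H K_U H) o (H K_V H) o K_W]_{++}."""
    n = len(u)
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    KU = rbf_kernel(u[:, None], gamma)
    KV = rbf_kernel(v[:, None], gamma)
    KW = rbf_kernel(W, gamma)
    return np.sum((H @ KU @ H) * (H @ KV @ H) * KW) / n ** 2

def choose_c(xj, z, W, gamma=1.0, c_max=10.0):
    """Pick c_j minimising the measure for the pair (x_j + c z, x_j - c z)."""
    obj = lambda c: kernel_cond_measure(xj + c * z, xj - c * z, W, gamma)
    res = minimize_scalar(obj, bounds=(1e-6, c_max), method="bounded")
    return res.x
```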

3.4 Mirror Statistics in DNNs

In order to construct mirror statistics in DNNs, we introduce an importance measure for each input feature as a generalization of the connection weights method proposed by [10]. We consider a fully connected multi-layer perceptron (MLP) with k hidden layers, denoted as (H^{(t)}_1, ..., H^{(t)}_{n_t}) for t = 1, ..., k. The connection between the t-th and (t+1)-th layers can be expressed as

H^{(t+1)}_j = g_{t+1}( Σ_{i=1}^{n_t} w^{(t)}_{ij} H^{(t)}_i ),   for j = 1, ..., n_{t+1},

where g_{t+1} is the activation function. We let t = 0 and t = k + 1 denote the input and output layers, respectively.

Figure 3: Simultaneous neural Gaussian mirror in a DNN.

For a path l = (X_j, H^{(1)}_{i_1}, ..., H^{(k)}_{i_k}, y) through the network, we define its accumulated weight as

δ(l) = ω^{(0)}_{j i_1} ω^{(k)}_{i_k 1} ∏_{t=1}^{k−1} ω^{(t)}_{i_t i_{t+1}}.

Let Ω^{(0)}_j = (ω^{(0)}_{j1}, ..., ω^{(0)}_{j n_1}) be the weight vector connecting X_j with the first layer. We define the feature importance of X_j as

L(X_j) = Σ_{l ∈ P_j} δ(l) = ⟨C, Ω^{(0)}_j⟩,   (3.13)

where P_j is the set consisting of all paths connecting X_j to y with one node in each layer, and C = ∏_{t=1}^{k} Ω^{(t)}, where Ω^{(t)} = (ω^{(t)}_{ij}) ∈ R^{n_t × n_{t+1}} for i = 1, ..., n_t and j = 1, ..., n_{t+1}. The gradient with respect to each input feature, ∂y/∂X_j, used in [18] is a modified version of L(X_j) (see the SM for details).
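The importance measure (3.13) reduces to a product of the layer weight matrices. The sketch below computes L(X_j) for all j from a list of weight matrices; the list convention (input-to-first-hidden matrix first, each of shape (n_t, n_{t+1}), scalar output) is our assumption about how the trained MLP's weights are stored.

```python
import numpy as np

def feature_importance(weights):
    """Connection-weight importance L(X_j) = <C, Omega_j^(0)> of Eq. (3.13).

    weights : list [W0, W1, ..., Wk] of layer weight matrices, where
              W0 has shape (p, n_1) and the last matrix maps to the output.
    Returns an array of length p with L(X_j) for each input feature.
    """
    W0 = weights[0]
    C = weights[1]
    for Wt in weights[2:]:
        C = C @ Wt                 # C = prod_{t=1}^{k} Omega^(t)
    C = C.reshape(-1)              # (n_1,) vector for a scalar output
    return W0 @ C                  # row j gives <C, Omega_j^(0)>
```

For instance, with scikit-learn's MLPRegressor one could pass net.coefs_ directly, since its coefs_ attribute stores the weight matrices in exactly this layout.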

Based on (3.13), we define the mirror statistic as

M_j = |L(X_j^+) + L(X_j^-)| − |L(X_j^+) − L(X_j^-)|,   (3.14)

and use (2.4) to estimate the FDP. For a predefined error rate q, a data-adaptive threshold is given by τ_q = min{t > 0 : FDP̂(t) ≤ q}, and Ŝ_1 = {j : M_j ≥ τ_q} is the set of selected features.

4 Implementations

4.1 Two forms of neural Gaussian mirror

The NGM can be realized in two ways: individually or simultaneously. As shown in Figure 2, the individual neural Gaussian mirror (INGM) is constructed by treating one feature at a time. More specifically, for feature X_j, we create X_j^+ and X_j^- with c_j minimizing [I_j^K(c)]^2. Then, we set the input layer to (X_1, ..., X_j^+, X_j^-, ..., X_p) and fit the MLP as shown in Figure 2. For any predefined error rate q, we select features by Algorithm 1.

Algorithm 1 Individual NGM: feature selection in deep neural networks

Input: Fixed FDR level q; (x_i, y_i), i = 1, ..., n, with x_i ∈ R^p, y_i ∈ R
Output: Ŝ_1
1: for j = 1 to p do
2:   Generate Z_j ∼ N(0, I_n)
3:   Compute c_j = argmin_c [I_j^K(c)]^2
4:   Create the mirrored pair (X_j^+, X_j^-) = (X_j + c_j Z_j, X_j − c_j Z_j)
5:   Train the MLP with X^{(j)} = (X_j^+, X_j^-, X_{−j}) as the input layer
6:   Compute the mirror statistic M_j by (3.14)
7: end for
8: Calculate T = min{t > 0 : FDP̂(t) ≤ q}
9: Return Ŝ_1 ← {j : M_j ≥ T}

To increase efficiency, we construct the simultaneous neural Gaussian mirror (SNGM) by mirroring all p features at the same time, as shown in Figure 3. The input layer includes all mirrored pairs, i.e., (X_1^+, X_1^-, ..., X_p^+, X_p^-). The detailed algorithm is given in Algorithm 2.


Algorithm 2 SNGM: feature selection with simultaneously mirrored pairs

Input: Fixed FDR level q; (x_i, y_i), i = 1, ..., n, with x_i ∈ R^p, y_i ∈ R
Output: Ŝ_1
1: for j = 1 to p do
2:   Generate Z_j ∼ N(0, I_n)
3:   Compute c_j = argmin_c [I_j^K(c)]^2
4:   Create the mirrored pair (X_j^+, X_j^-) = (X_j + c_j Z_j, X_j − c_j Z_j)
5: end for
6: Train the MLP with input layer (X_1^+, X_1^-, ..., X_p^+, X_p^-)
7: Calculate M_j by (3.14) for j = 1, ..., p
8: Calculate T = min{t > 0 : FDP̂(t) ≤ q}
9: Output Ŝ_1 ← {j : M_j ≥ T}
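As an end-to-end illustration of Algorithm 2, the sketch below mirrors every feature, trains a single MLP on the 2p mirrored inputs, computes the connection-weight importances and mirror statistics (3.14), and applies the threshold (2.5). scikit-learn's MLPRegressor, the hidden-layer sizes, the iteration budget, and the default c_j = 1 fallback are our substitutions for concreteness; the paper's own experiments use an MLP with 20 log(p) and 10 log(p) hidden nodes and the kernel-based choice of c_j.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def sngm_select(X, y, q=0.1, choose_c=None, rng=0):
    """Simultaneous NGM (Algorithm 2), sketched with scikit-learn's MLP.

    `choose_c` is a callable returning c_j (e.g. the kernel-based chooser
    sketched in Section 3.3); if None, c_j = 1 is used as a placeholder.
    """
    rng = np.random.default_rng(rng)
    n, p = X.shape
    cols = []
    for j in range(p):                                   # mirror every feature
        z = rng.standard_normal(n)
        X_rest = np.delete(X, j, axis=1)
        c_j = 1.0 if choose_c is None else choose_c(X[:, j], z, X_rest)
        cols += [X[:, j] + c_j * z, X[:, j] - c_j * z]   # (X_j^+, X_j^-)
    X_mirror = np.column_stack(cols)                     # (n, 2p) input layer
    net = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000,
                       random_state=0).fit(X_mirror, y)
    # Connection-weight importance for each of the 2p mirrored inputs.
    C = net.coefs_[1]
    for Wt in net.coefs_[2:]:
        C = C @ Wt
    L = net.coefs_[0] @ C.reshape(-1)
    M = np.abs(L[0::2] + L[1::2]) - np.abs(L[0::2] - L[1::2])  # Eq. (3.14)
    # Data-adaptive threshold of Eq. (2.5).
    for t in np.sort(np.unique(np.abs(M[M != 0]))):
        if np.sum(M <= -t) / max(np.sum(M >= t), 1) <= q:
            return np.nonzero(M >= t)[0]
    return np.array([], dtype=int)
```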

4.2 Screening

To further reduce the computational cost, we propose a screening step based on the rank of the feature importance measure defined in (3.13). The screening procedure is inspired by the RANK method proposed by [11, 22], which uses part of the data for estimating the precision matrix and for subset selection, and leaves the remaining data for controlled feature selection.

In the screening procedure, we first randomly select [n/3] samples (X^{(1)}, y^{(1)}) to train a neural network. We calculate the feature importance measure L(X^{(1)}_j) for j = 1, 2, ..., p and rank their absolute values from large to small as |L|_{(1)} ≥ |L|_{(2)} ≥ ··· ≥ |L|_{(p)}. We denote S_m = {j : |L(X^{(1)}_j)| ≥ |L|_{(m)}} as the screened set. Empirically, we can set m as [n/2] or [2n/log(n)]. The screening can be implemented before the INGM and SNGM algorithms to save computational cost. We call the INGM and the SNGM with screening S-INGM and S-SNGM, respectively.
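A minimal sketch of the screening step, under the same connection-weight importance and scikit-learn MLP assumptions as above (the subsample size n/3 and the default m = [n/2] follow the text; everything else is our choice):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def screen_features(X, y, m=None, rng=0):
    """Screening: rank features by |L(X_j)| computed on a random [n/3]
    subsample and keep the m largest (default m = [n/2])."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    idx = rng.choice(n, size=n // 3, replace=False)       # (X^(1), y^(1))
    net = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000,
                       random_state=0).fit(X[idx], y[idx])
    C = net.coefs_[1]
    for Wt in net.coefs_[2:]:
        C = C @ Wt
    L = np.abs(net.coefs_[0] @ C.reshape(-1))              # |L(X_j^(1))|
    m = n // 2 if m is None else m
    return np.argsort(L)[::-1][:m]                         # screened set S_m
```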

5 Numerical simulations

We choose an MLP structure with two hidden layers, with N_1 = 20 log(p) and N_2 = 10 log(p) hidden nodes, respectively. We compare the selection power and FDR of NGMs with those of DeepPINK, and conduct experiments in both simulated and real-data settings with 20 repetitions. In the simulation studies, we consider data from both linear and nonlinear models. In each setting we consider two covariance structures for X: the Toeplitz partial correlation structure and the constant partial correlation structure. The Toeplitz partial correlation structure has precision matrix (i.e., inverse covariance matrix) Ω = (ρ^{|i−j|}). The constant partial correlation structure has precision matrix Ω = (1 − ρ)I_p + ρJ_p, where J_p = (1/p) 1_p 1_p^⊤. We set ρ = 0.5.

5.1 Linear models

First, we examine the performance of all methods in linear models: y_i = β^⊤ x_i + ε_i, with ε_i i.i.d. ∼ N(0, 1), for i = 1, ..., n. We randomly set k = 30 elements of β to be nonzero and generate them from N(0, (20 √(log(p)/n))^2) to mimic various signal strengths in real applications.
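For reference, a simulation of this setup might look as follows; drawing X from N(0, Ω^{-1}) with the stated precision matrices and β with 30 nonzero entries matches the description above, while the helper names, the random-seed handling, and the (1/p) normalization of J_p (read from the formula as printed) are our own.

```python
import numpy as np

def simulate_linear_setting(n, p, rho=0.5, structure="toeplitz", k=30, rng=0):
    """Generate (X, y, beta) following the simulation setup of Section 5.1."""
    rng = np.random.default_rng(rng)
    if structure == "toeplitz":                       # Omega_ij = rho^|i-j|
        Omega = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    else:                                             # constant partial correlation
        Omega = (1 - rho) * np.eye(p) + (rho / p) * np.ones((p, p))
    X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Omega), size=n)
    beta = np.zeros(p)
    support = rng.choice(p, size=k, replace=False)    # k = 30 nonzero coefficients
    beta[support] = rng.normal(0.0, 20 * np.sqrt(np.log(p) / n), size=k)
    y = X @ beta + rng.standard_normal(n)             # epsilon ~ N(0, 1)
    return X, y, beta
```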

As shown in Figure 4, NGMs control the FDR at q = 0.1 and have higher power than DeepPINK. In the constant partial correlation setting with partial correlation ρ = 0.5, DeepPINK shows a power loss when p is larger than 1500, since the minimum eigenvalue of the correlation matrix is 1/(1 + (p − 1)ρ), which approaches 0 as p → ∞ and makes important features highly correlated with their knockoff counterparts. By introducing a random perturbation, which reduces the correlation between mirror statistics, NGMs control the FDR at the level q = 0.1 and maintain power around 0.8 under both correlation structures.

Figure 4: Performance comparison of NGMs and DeepPINK with predefined FDR level q = 0.1 in linear models with the two partial correlation structures (left: Toeplitz partial correlation; right: constant partial correlation; FDP/power versus the dimension p for DeepPink, SNGM, S-INGM, and S-SNGM).

In high-dimensional cases with p ≥ 2000, the NGMs, especially SNGM and S-SNGM, are computationally more efficient than DeepPINK. In addition, we note that the difference in time consumption between DeepPINK and SNGM mainly lies in the construction of the input variables. As shown in Figure 6, the computational time for constructing knockoff variables in DeepPINK increases rapidly as the number of features grows. In contrast, for SNGM and S-SNGM, the construction of the mirrored input is parallel across features, which makes the computation time much less affected by the growth of the dimension p.

Table 1: Varying single-index link functions with two PC (partial correlation) structures (q = 0.1)

Setting (n = 1000)      Link function         S-SNGM         SNGM           S-INGM         DeepPink
                                              FDR    Power   FDR    Power   FDR    Power   FDR    Power
Toeplitz PC, p = 500    f1(t) = t + sin(t)    0.071  0.867   0.045  0.857   0.057  0.900   0.085  0.900
                        f2(t) = 0.5 t^3       0.092  0.873   0.080  0.843   0.061  0.830   0.155  0.413
                        f3(t) = 0.1 t^5       0.059  0.865   0.071  0.847   0.132  0.807   0.148  0.333
Toeplitz PC, p = 2000   f1(t) = t + sin(t)    0.085  0.843   0.081  0.820   0.084  0.837   0.126  0.817
                        f2(t) = 0.5 t^3       0.076  0.788   0.082  0.587   0.081  0.810   0.149  0.320
                        f3(t) = 0.1 t^5       0.093  0.737   0.108  0.530   0.185  0.620   0.477  0.053
Constant PC, p = 500    f1(t) = t + sin(t)    0.057  0.880   0.067  0.877   0.070  0.900   0.136  0.86
                        f2(t) = 0.5 t^3       0.084  0.870   0.047  0.867   0.057  0.847   0.084  0.587
                        f3(t) = 0.1 t^5       0.066  0.867   0.066  0.847   0.027  0.802   0.143  0.46
Constant PC, p = 2000   f1(t) = t + sin(t)    0.055  0.857   0.012  0.833   0.057  0.813   0.094  0.800
                        f2(t) = 0.5 t^3       0.064  0.787   0.016  0.613   0.104  0.803   0.174  0.313
                        f3(t) = 0.1 t^5       0.080  0.743   0.020  0.548   0.111  0.607   0.412  0.147

5.2 Single-index models

We further test the performance of the proposed methods and DeepPINK in nonlinear cases such as single-index models. In our experiment, we choose three nonlinear link functions: f1(t) = t + sin(t), f2(t) = 0.5 t^3, and f3(t) = 0.1 t^5. We try both high-dimensional cases with p = 2000 and low-dimensional cases with p = 500. As shown in Table 1, NGMs and DeepPINK all have desirable performance, similar to that in linear models, for f1(t) = t + sin(t), since this link function is dominated by its linear term. However, in the latter two cases, where f is a polynomial, Table 1 shows that DeepPINK suffers a loss of power. In all three cases, NGMs are capable of maintaining power at a higher level than DeepPINK.

Figure 5: Performance comparison of NGM and DeepPINK with predefined FDR level q = 0.1 under the single-index model with link function f2, plotted against the number of nonzero coefficients. Triangle dots denote power and round dots denote FDR. (a) p = 500; (b) p = 2000.

Figure 6: Computational time for constructing input variables: Mirrors vs. Model-X knockoffs (average time in seconds versus p).

To study the influence of sparsity, we further consider an experiment with constant partial correlation ρ = 0.5 and vary the number of important features from 10 to 50 in steps of 10. For these varying sparsity levels, as shown in Figure 5, NGMs maintain high power with a controlled error rate in both the low-dimensional case with p = 500 and the high-dimensional setting with p = 2000, whereas DeepPINK tends to lose power when the sparsity level is high.

5.3 Real data design: tomato dataset

We consider a panel of 292 tomato accessions from [4]. The panel includes breeding materials that are characterized by over 11,000 SNPs. Each SNP is coded as 0, 1, or 2 to denote the homozygous (major), heterozygous, and the other homozygous (minor) genotypes, respectively. To evaluate the performance of NGMs and DeepPINK under non-Gaussian designs, we randomly select p SNPs, with p ranging from 100 to 1000, to form our design matrix, and generate the response y_i in the same way as in the simulated examples. This simulation is replicated 20 times independently, and each method under consideration is applied to every replicate.

Table 2: Performance comparison with the tomato design (q = 0.2, k = 30)

n = 292      S-INGM          S-SNGM          DeepPink
             FDR    Power    FDR    Power    FDR    Power
p = 100      0.171  0.893    0.131  0.690    0.182  0.607
p = 500      0.156  0.964    0.129  0.586    0.209  0.403
p = 1000     0.229  0.942    0.281  0.520    0.145  0.490
p = 2000     0.296  0.864    0.347  0.577    0.244  0.453

We set the predefined FDR level at q = 0.2 and compare the empirical FDR and power of NGMs with those of DeepPINK. As shown in Table 2, the S-INGM method shows the lowest FDR and the highest power when p = 100 and p = 500. When p = 1000 and p = 2000, S-INGM is more powerful than DeepPINK, but its FDR inflates slightly. S-SNGM is computationally more efficient than S-INGM, but with slightly inferior performance.

To provide a better visualization of the trade-off between the error rate and power, we plot the ROC curves for the three methods. Specifically, the ROC curve plots the true positive rate

TPR = TP / (TP + FN)   (5.1)

against the false positive rate

FPR = FP / (FP + TN).   (5.2)

A method with a larger area under the curve (AUC) is regarded as achieving a better trade-off between power and FDR. In both the low-dimensional and high-dimensional settings, Figure 7 shows that NGMs have a larger AUC than DeepPINK.

Figure 7: ROC curves for performance on the tomato dataset (left: p = 100; right: p = 1000).

6 Discussion

In this paper, we propose NGMs for feature importance assessment and controlled feature selection in neural network models, which is an important aspect of model interpretation. Even in situations with no need for explicit feature selection, having a good ranking of the predictors according to their influence on the output can be very helpful for practitioners to prioritize their follow-up work. We emphasize that our method does not require any distributional assumption on X and is thus widely applicable to a broad class of neural network models, including those with discrete or categorical features. In addition, the mirror design can be generalized to convolutional neural networks (CNNs) to measure the stability of the filters. For example, for each original slice in the input tensor of a CNN, we can create its corresponding mirrored slice by adding and subtracting a slice of random perturbations, and contrast the influences of the original and mirrored slices. A detailed study in this direction is deferred to future work. The data used in the real example are available at ftp.solgenomics.net/manuscripts/Bauchet_2016/.


Supplementary materials

A Test statistics for conditional independence

To simplify the notation, let (U, V, W) = (X_j^+, X_j^-, X_{−j}).

Lemma A.1 U and V are conditionally independent given W if and only if η_{U,V}(u, v) + η_{U,V,W}(u, v, w) = 0.

Proof. Suppose f is the joint density function of (U, V, W) = (X_j^+, X_j^-, X_{−j}); then it can be decomposed into 2^3 = 8 factors:

f(u, v, w) = exp( η_U(u) + η_V(v) + η_W(w) + η_{U,W}(u, w) + η_{V,W}(v, w) + η_{U,V}(u, v) + η_{U,V,W}(u, v, w) ),   (A.1)

where η represents the log-transformed density and η_A(α) denotes the component that is a function only of α. Then

η(u, v, w) = η_U(u) + η_V(v) + η_W(w) + η_{U,W}(u, w) + η_{V,W}(v, w) + η_{U,V}(u, v) + η_{U,V,W}(u, v, w).   (A.2)

To assess the dependence between (U, V) conditioned on W, we derive the conditional density as

f_{(U,V)|W}(u, v | w) = e^{η(u,v,w)} / ∫_U ∫_V e^{η(u,v,w)} du dv
                     = C(w) · e^{η_U(u) + η_{U,W}(u,w)} · e^{η_V(v) + η_{V,W}(v,w)} · e^{η_{U,V}(u,v) + η_{U,V,W}(u,v,w)},   (A.3)

where C(w) collects the factors that depend only on w (the denominator being the marginal density of W). Therefore U ⊥⊥ V | W if and only if every interaction term involving both U and V is constant, i.e., η_{U,V} + η_{U,V,W} = 0.

B Importance measurement in DNNs

In order to construct mirror statistics in DNNs, we first adopt an importance measure for each input feature.

We consider a fully connected multi-layer perceptron (MLP) with k hidden layers, denoted as (H^{(t)}_1, ..., H^{(t)}_{n_t}) for t = 1, ..., k. The output of the (t+1)-th layer can be described as

H^{(t+1)}_j = g_{t+1}( Σ_{i=1}^{n_t} w^{(t)}_{ij} H^{(t)}_i ),   for j = 1, ..., n_{t+1},

where g_{t+1} is the activation function. Note that t = 0 denotes the input layer and t = k + 1 denotes the output layer.

We define

δ( l(X_j, H^{(1)}_{i_1}, ..., H^{(k)}_{i_k}, y) ) = ω^{(0)}_{j i_1} ω^{(k)}_{i_k 1} ∏_{t=1}^{k−1} ω^{(t)}_{i_t i_{t+1}}

as the accumulated weight of the path l.

Let Ω^{(0)}_j = (ω^{(0)}_{j1}, ..., ω^{(0)}_{j n_1}) be the weight vector connecting X_j with the first layer. We define the importance of the j-th feature as

L(X_j) = Σ_{l ∈ P_j} δ(l) = ⟨C, Ω^{(0)}_j⟩,   (B.1)

where P_j is the set consisting of all paths connecting X_j to y with one node in each layer, and C = ∏_{t=1}^{k} Ω^{(t)}, where Ω^{(t)} = (ω^{(t)}_{ij}) ∈ R^{n_t × n_{t+1}} for i = 1, ..., n_t and j = 1, ..., n_{t+1}.

For t = 1, ..., k, define the diagonal matrices

G^{(t)} = diag( (∂g_t/∂h)( Σ_{i=1}^{n_{t−1}} w^{(t−1)}_{i1} H^{(t−1)}_i ), ..., (∂g_t/∂h)( Σ_{i=1}^{n_{t−1}} w^{(t−1)}_{i n_t} H^{(t−1)}_i ) ).   (B.2)

Then, the gradient with respect to X_j can be written as

∂y/∂X_j = Ω^{(0)}_j ∏_{t=1}^{k} ( G^{(t)} Ω^{(t)} ),   (B.3)

which is a weighted version of L(X_j).
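The gradient formula (B.3) can be evaluated directly from the stored weights and the pre-activations of a forward pass. The sketch below assumes tanh hidden activations and a linear scalar output; both are illustrative assumptions, since the paper leaves g_t generic.

```python
import numpy as np

def input_gradient(weights, x, act=np.tanh, act_grad=lambda a: 1.0 - np.tanh(a) ** 2):
    """Gradient dy/dX following Eq. (B.3):
    dy/dX = Omega^(0) prod_{t=1}^{k} (G^(t) Omega^(t)).

    weights : list [W0, W1, ..., Wk], W_t of shape (n_t, n_{t+1}); the last
              layer is assumed linear with a single output unit.
    x       : input vector of length p.
    """
    # Forward pass: store the pre-activations a^(t) of each hidden layer.
    h, pre_acts = np.asarray(x, dtype=float), []
    for Wt in weights[:-1]:
        a = h @ Wt
        pre_acts.append(a)
        h = act(a)
    # Accumulate Omega^(0) * prod_t G^(t) Omega^(t) from left to right.
    grad = weights[0]                               # shape (p, n_1)
    for Wt_next, a in zip(weights[1:], pre_acts):
        grad = (grad * act_grad(a)) @ Wt_next       # multiply by G^(t), then Omega^(t)
    return grad.reshape(-1) if grad.shape[1] == 1 else grad
```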


C Perturbation with linear kernel

Suppose n observations (x_i, y_i), i = 1, 2, ..., n, are drawn from the model

y = φ(x)^⊤ β + ε,

where φ(·) is a feature map and the kernel function k(·, ·) is defined as

k(x, z) = ⟨φ(x), φ(z)⟩.

In the classical linear model y = x^⊤ β + ε, φ is chosen to be the identity map with a constant intercept, and thus k(·, ·) = ⟨·, ·⟩ + const.

To specify the test statistic when adopting the linear kernel, we have

[I_j^K(c)]^2 = (1/n^2) [(H K^U H) ∘ (H K^V H) ∘ K^W]_{++}
             = (1/n^2) [((HU)(HU)^⊤) ∘ ((HV)(HV)^⊤) ∘ (WW^⊤)]_{++},   (C.1)

where (U, V, W) = (X_j^+, X_j^-, X_{−j}) and H = I_n − (1/n) 1_n 1_n^⊤. To simplify notation, denote U = HU and V = HV, with which

G(c) := n^2 [I_j^K(c)]^2 = [ ((X_j + cZ_j)(X_j + cZ_j)^⊤) ∘ ((X_j − cZ_j)(X_j − cZ_j)^⊤) ∘ (WW^⊤) ]_{++}.   (C.2)

We take the gradient of G(c) with respect to c; due to the commutativity of the operators,

∇_c G(c) = (∂/∂c) [ ((X_j + cZ_j)(X_j + cZ_j)^⊤) ∘ ((X_j − cZ_j)(X_j − cZ_j)^⊤) ∘ (WW^⊤) ]_{++}
         = [ ∇_c((X_j + cZ_j)(X_j + cZ_j)^⊤) ∘ ((X_j − cZ_j)(X_j − cZ_j)^⊤) ∘ (WW^⊤) ]_{++}
           + [ ((X_j + cZ_j)(X_j + cZ_j)^⊤) ∘ ∇_c((X_j − cZ_j)(X_j − cZ_j)^⊤) ∘ (WW^⊤) ]_{++}
         = c [ ( 2c^2 (Z_j Z_j^⊤) ∘ (Z_j Z_j^⊤) + 2 (X_j X_j^⊤) ∘ (Z_j Z_j^⊤) − (X_j Z_j^⊤) ∘ (X_j Z_j^⊤)
               − (Z_j X_j^⊤) ∘ (Z_j X_j^⊤) − 2 (X_j Z_j^⊤) ∘ (Z_j X_j^⊤) ) ∘ (WW^⊤) ]_{++}.   (C.3)

Therefore, ∇_c G(c) = 0 leads to the following solution:

c_j^* = sqrt( [ A(X_j, Z_j) ∘ (WW^⊤) ]_{++} / ( 2 [ (Z_j Z_j^⊤) ∘ (Z_j Z_j^⊤) ∘ (WW^⊤) ]_{++} ) ),   (C.4)

with A(X_j, Z_j) = (Z_j X_j^⊤) ∘ (Z_j X_j^⊤) + (X_j Z_j^⊤) ∘ (X_j Z_j^⊤). One can verify that c_j^* is the minimum point by computing the second-order derivative with respect to c.

For the space span{W}, let P_W be the projection operator onto it; then the c_j used in the Gaussian mirror can be written as

c_j^{GM} = sqrt( ‖(I − P_W) X_j‖^2 / ‖(I − P_W) Z_j‖^2 ).   (C.5)

We should note that, without the term WW^⊤, [ (Z_j Z_j^⊤) ∘ (Z_j Z_j^⊤) ]_{++} = ‖Z_j‖^4 and [ (Z_j X_j^⊤) ∘ (Z_j X_j^⊤) ]_{++} = [ (X_j Z_j^⊤) ∘ (X_j Z_j^⊤) ]_{++} = ‖X_j‖^2 ‖Z_j‖^2. Therefore, comparing the forms of (C.4) and (C.5), c_j^* and c_j^{GM} are of the same scale, which is validated in our simulations, and they are equivalent under orthogonal designs.
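Equation (C.4) gives c_j^* in closed form, which avoids the one-dimensional search when the linear kernel is used. A direct, O(n^2)-memory transcription (helper name ours):

```python
import numpy as np

def c_star_linear_kernel(xj, z, W):
    """Closed-form minimiser c_j^* of Eq. (C.4) for the linear kernel.

    xj : (n,) feature column, z : (n,) Gaussian perturbation,
    W  : (n, p-1) matrix of the remaining features X_{-j}.
    """
    WWt = W @ W.T                                  # W W^T
    xz = np.outer(xj, z)                           # X_j Z_j^T
    zx = np.outer(z, xj)                           # Z_j X_j^T
    zz = np.outer(z, z)                            # Z_j Z_j^T
    A = zx * zx + xz * xz                          # A(X_j, Z_j), Hadamard products
    num = np.sum(A * WWt)                          # [A o (W W^T)]_{++}
    den = 2.0 * np.sum(zz * zz * WWt)              # 2 [(ZZ^T) o (ZZ^T) o (WW^T)]_{++}
    return np.sqrt(num / den)
```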


Table 3: Linear model with varying constant partial correlation (q = 0.1)

n = 1000   partial corr.   S-SNGM         SNGM           S-INGM         DeepPink
                           FDR    Power   FDR    Power   FDR    Power   FDR    Power
p = 500    ρ = 0           0.064  0.910   0.083  0.907   0.060  0.890   0.135  0.907
           ρ = 0.2         0.065  0.903   0.071  0.905   0.096  0.887   0.092  0.900
           ρ = 0.4         0.068  0.902   0.012  0.815   0.080  0.898   0.094  0.913
           ρ = 0.6         0.071  0.893   0.024  0.870   0.067  0.900   0.169  0.927
           ρ = 0.8         0.076  0.880   0.008  0.838   0.103  0.882   0.168  0.907
p = 2000   ρ = 0           0.076  0.798   0.017  0.842   0.053  0.860   0.146  0.787
           ρ = 0.2         0.053  0.808   0.058  0.873   0.062  0.850   0.058  0.740
           ρ = 0.4         0.072  0.798   0.014  0.810   0.044  0.840   0.096  0.800
           ρ = 0.6         0.083  0.805   0.048  0.812   0.063  0.810   0.125  0.767
           ρ = 0.8         0.063  0.798   0.330  0.815   0.073  0.737   0.177  0.740

D Additional simulation results with varying correlation

Consider the linear model y_i = β^⊤ x_i + ε_i, with ε_i i.i.d. ∼ N(0, 1), for i = 1, ..., n. We randomly set k = 30 elements of β to be nonzero and generate them from N(0, (20 √(log(p)/n))^2) to mimic various signal strengths in real applications. We set n = 1000 and study both the low-dimensional case with p = 500 and the high-dimensional case with p = 2000.

As shown in Table 3, S-SNGM and S-INGM control the FDR under q = 0.1 and meanwhile have higher power than DeepPINK. Besides, SNGM controls the FDR under q = 0.1 except for the case with p = 2000 and ρ = 0.8, which is difficult for the mirror construction. In the high-dimensional setting with high partial correlation, DeepPINK undergoes an obvious power loss, since the minimum eigenvalue of the correlation matrix is 1/(1 + (p − 1)ρ) ≈ 0, which imposes great obstacles for the knockoff construction; the NGMs with the screening procedure are more stable in maintaining power.

References

[1] Reza Abbasi-Asl and Bin Yu. Interpreting convolutional neural networks through compression. arXiv preprint arXiv:1711.02329, 2017.

[2] Rina Foygel Barber, Emmanuel J Candes, et al. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5):2055–2085, 2015.

[3] Rina Foygel Barber, Emmanuel J Candes, et al. A knockoff filter for high-dimensional selective inference. The Annals of Statistics, 47(5):2504–2537, 2019.

[4] Guillaume Bauchet, Stephane Grenier, Nicolas Samson, Julien Bonnet, Laurent Grivet, and Mathilde Causse. Use of modern tomato breeding germplasm for deciphering the genetic control of agronomical traits by genome wide association study. Theoretical and Applied Genetics, 130(5):875–889, 2017.

[5] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300, 1995.


[6] Yoav Benjamini, Daniel Yekutieli, et al. The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics, 29(4):1165–1188, 2001.

[7] Alain Berlinet and Christine Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011.

[8] Alexander Binder, Gregoire Montavon, Sebastian Lapuschkin, Klaus-Robert Muller, and Wojciech Samek. Layer-wise relevance propagation for neural networks with local renormalization layers. In International Conference on Artificial Neural Networks, pages 63–71. Springer, 2016.

[9] Emmanuel Candes, Yingying Fan, Lucas Janson, and Jinchi Lv. Panning for gold: 'model-X' knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(3):551–577, 2018.

[10] Juan De Ona and Concepcion Garrido. Extracting the contribution of independent variables in neural network models: A new approach to handle instability. Neural Comput. Appl., 25(3–4):859–869, September 2014.

[11] Yingying Fan, Emre Demirkaya, Gaorong Li, and Jinchi Lv. RANK: large-scale inference with graphical nonlinear knockoffs. Journal of the American Statistical Association, pages 1–43, 2019.

[12] Yingying Fan, Jinchi Lv, et al. Innovated scalable efficient estimation in ultra-large Gaussian graphical models. The Annals of Statistics, 44(5):2098–2126, 2016.

[13] Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, and Bernhard Scholkopf. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems, pages 489–496, 2008.

[14] Amirata Ghorbani, Abubakar Abid, and James Zou. Interpretation of neural networks is fragile. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3681–3688, 2019.

[15] Amir Globerson and Sam Roweis. Nightmare at test time: robust learning by feature deletion. In Proceedings of the 23rd International Conference on Machine Learning, pages 353–360. ACM, 2006.

[16] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Scholkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.

[17] Chong Gu. Smoothing spline ANOVA models, volume 297. Springer Science & Business Media, 2013.

[18] Yotam Hechtlinger. Interpretation of prediction models using the input gradient. arXiv preprint arXiv:1611.07634, 2016.

[19] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[20] Dongming Huang and Lucas Janson. Relaxing the assumptions of knockoffs by conditioning. arXiv preprint arXiv:1903.02806, 2019.

[21] Yi Lin et al. Tensor product space ANOVA models. The Annals of Statistics, 28(3):734–755, 2000.

[22] Yang Lu, Yingying Fan, Jinchi Lv, and William Stafford Noble. DeepPINK: reproducible feature selection in deep neural networks. In Advances in Neural Information Processing Systems, pages 8676–8686, 2018.

[23] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3145–3153. JMLR.org, 2017.

[24] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[25] John D Storey, Jonathan E Taylor, and David Siegmund. Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(1):187–205, 2004.

[26] Ryan Turner. A model explanation system. In 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6. IEEE, 2016.

[27] Antanas Verikas and Marija Bacauskiene. Feature selection with neural networks. Pattern Recognition Letters, 23(11):1323–1335, 2002.

[28] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103. ACM, 2008.

[29] Stefan Wager, Sida Wang, and Percy S Liang. Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems, pages 351–359, 2013.

[30] Sida Wang and Christopher Manning. Fast dropout training. In International Conference on Machine Learning, pages 118–126, 2013.

[31] Xin Xing, Zhigen Zhao, and Jun S. Liu. Controlling false discovery rate using Gaussian mirrors. Technical report, 2019.

[32] Meziane Yacoub and Younes Bennani. HVS: A heuristic for variable selection in multilayer artificial neural network classifier. 1997.

[33] Bin Yu et al. Stability. Bernoulli, 19(4):1484–1500, 2013.

[34] Bin Yu and Karl Kumbier. Three principles of data science: predictability, computability, and stability (PCS). arXiv preprint arXiv:1901.08152, 2019.
