
“Why Should I Trust You?” Explaining the Predictions of Any Classifier

Marco Tulio Ribeiro
University of Washington
Seattle, WA 98105, USA

[email protected]

Sameer Singh
University of Washington
Seattle, WA 98105, USA

[email protected]

Carlos Guestrin
University of Washington
Seattle, WA 98105, USA

[email protected]

Abstract

Despite widespread adoption, machine learning models remain mostly black boxes. Understanding the reasons behind predictions is, however, quite important in assessing trust in a model. Trust is fundamental if one plans to take action based on a prediction, or when choosing whether or not to deploy a new model. Such understanding further provides insights into the model, which can be used to turn an untrustworthy model or prediction into a trustworthy one.

In this work, we propose LIME, a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction. We further propose a method to explain models by presenting representative individual predictions and their explanations in a non-redundant way, framing the task as a submodular optimization problem. We demonstrate the flexibility of these methods by explaining different models for text (e.g. random forests) and image classification (e.g. neural networks). The usefulness of explanations is shown via novel experiments, both simulated and with human subjects. Our explanations empower users in various scenarios that require trust: deciding if one should trust a prediction, choosing between models, improving an untrustworthy classifier, and detecting why a classifier should not be trusted.

1 Introduction

Machine learning is at the core of many recent advances in science and technology. Unfortunately, the important role of humans is an oft-overlooked aspect in the field. Whether humans are directly using machine learning classifiers as tools, or are deploying models into products that need to be shipped, a vital concern remains: if the users do not trust a model or a prediction, they will not use it. It is important to differentiate between two different (but related) definitions of trust: (1) trusting a prediction, i.e. whether a user trusts an individual prediction sufficiently to take some action based on it, and (2) trusting a model, i.e. whether the user trusts a model to behave in reasonable ways if deployed. Both are directly impacted by how much the human understands a model’s behaviour, as opposed to seeing it as a black box.

Determining trust in individual predictions is an important problem when the model is used for real-world actions. When using machine learning for medical diagnosis [6] or terrorism detection, for example, predictions cannot be acted upon on blind faith, as the consequences may be catastrophic. Apart from trusting individual predictions, there is also a need to evaluate the model as a whole before deploying it “in the wild”. To make this decision, users need to be confident that the model will perform well on real-world data, according to the metrics of interest. Currently, models are evaluated using metrics such as accuracy on an available validation dataset. However, real-world data is often significantly different, and further, the evaluation metric may not be indicative of the product’s goal. Inspecting individual predictions and their explanations can be a solution to this problem, in addition to such metrics. In this case, it is important to guide users by suggesting which instances to inspect, especially for larger datasets.



In this paper, we propose providing explanations for individual predictions as a solution to the “trusting a prediction” problem, and selecting multiple such predictions (and explanations) as a solution to the “trusting the model” problem. Our main contributions are summarized as follows.

• LIME, an algorithm that can explain the predictions of any classifier or regressor in a faithful way, by approximating it locally with an interpretable model.

• SP-LIME, a method that selects a set of representative instances with explanations to address the “trusting the model” problem, via submodular optimization.

• Comprehensive evaluation with simulated and human subjects, where we measure the impact of explanations on trust and associated tasks. In our experiments, non-experts using LIME are able to pick which classifier from a pair generalizes better in the real world. Further, they are able to greatly improve an untrustworthy classifier trained on 20 newsgroups, by doing feature engineering using LIME. We also show how understanding the predictions of a neural network on images helps practitioners know when and why they should not trust a model.

2 The Case for Explanations

By “explaining a prediction”, we mean presenting textual or visual artifacts that provide qualitative understanding of the relationship between the instance’s components (e.g. words in text, patches in an image) and the model’s prediction. We argue that explaining predictions is an important aspect in getting humans to trust and use machine learning effectively, provided the explanations are faithful and intelligible.

The process of explaining individual predictions is illustrated in Figure 1. It is clear that a doctor is much better positioned to make a decision with the help of a model if intelligible explanations are provided. In this case, explanations are a small list of symptoms with relative weights - symptoms that either contribute towards the prediction (in green) or are evidence against it (in red). In this, and other examples where humans make decisions with the help of predictions, trust is of fundamental concern. Even when stakes are lower, as in product or movie recommendations, the user needs to trust the prediction enough to spend money or time on it. Humans usually have prior knowledge about the application domain, which they can use to accept (trust) or reject a prediction if they understand the reasoning behind it. It has been observed, for example, that providing an explanation can increase the acceptance of computer-generated movie recommendations [12] and other automated systems [7].

Every machine learning application requires a certain measure of trust in the model. Development and evaluation of a classification model often consists of collecting annotated data, followed by learning parameters on a subset and evaluating using automatically computed metrics on the remaining data. Although this is a useful pipeline for many applications, it has become evident that evaluation on validation data often may not correspond to performance “in the wild” due to a number of reasons - and thus trust cannot rely solely on it. Looking at examples is a basic human strategy for comprehension [20], and for deciding if they are trustworthy - especially if the examples are explained. We thus propose explaining several representative individual predictions of a model as a way to provide a global understanding of the model. This global perspective is useful to machine learning practitioners in deciding between different models, or configurations of a model.

There are several ways a model can go wrong, and practitioners are known to overestimate the accuracy of their models based on cross validation [21]. Data leakage, for example, defined as the unintentional leakage of signal into the training (and validation) data that would not appear in the wild [14], potentially increases accuracy. A challenging example cited by Kaufman et al. [14] is one where the patient ID was found to be heavily correlated with the target class in the training and validation data. This issue would be incredibly challenging to identify just by observing the predictions and the raw data, but much easier if explanations such as the one in Figure 1 are provided, as patient ID would be listed as an explanation for predictions. Another particularly hard to detect problem is dataset shift [5], where training data is different than test data (we give an example in the famous 20 newsgroups dataset later on). The insights given by explanations (if the explanations


[Figure 1 diagram: Model → Data and Prediction → Explainer (LIME) → Explanation → Human makes decision; example features include sneeze, headache, no fatigue, weight, age, active.]

Figure 1: Explaining individual predictions. A model predicts that a patient has the flu, and LIME highlights which symptoms in the patient’s history led to the prediction. Sneeze and headache are portrayed as contributing to the “flu” prediction, while “no fatigue” is evidence against it. With these, a doctor can make an informed decision about the model’s prediction.

correspond to what the model is actually doing) are particularly helpful in identifying what must be done to turn an untrustworthy model into a trustworthy one - for example, removing leaked data or changing the training data to avoid dataset shift.

Machine learning practitioners often have to select a model from a number of alternatives, requiring them to assess the relative trust between two or more models. In Figure 2, we show how individual prediction explanations can be used to select between models, in conjunction with accuracy. In this case, the algorithm with higher accuracy on the validation set is actually much worse, a fact that is easy to see when explanations are provided (again, due to human prior knowledge), but hard otherwise. Further, there is frequently a mismatch between the metrics that we can compute and optimize (e.g. accuracy) and the actual metrics of interest such as user engagement and retention. While we may not be able to measure such metrics, we have knowledge about how certain model behaviors can influence them. Therefore, a practitioner may wish to choose a less accurate model for content recommendation that does not place high importance in features related to “clickbait” articles (which may hurt user retention), even if exploiting such features increases the accuracy of the model in cross validation. We note that explanations are particularly useful in these (and other) scenarios if a method can produce them for any model, so that a variety of models can be compared.

Desired Characteristics for Explainers

We have argued thus far that explaining individual predictions of classifiers (or regressors) is a significant component for assessing trust in predictions or models. We now outline a number of desired characteristics from explanation methods:

Figure 2: Explaining individual predictions of competing classifiers trying to determine if a document is about “Christianity” or “Atheism”. The bar chart represents the importance given to the most relevant words, also highlighted in the text. Color indicates which class the word contributes to (green for “Christianity”, magenta for “Atheism”). Whole text not shown for space reasons.

An essential criterion for explanations is that they must be interpretable, i.e., provide qualitative understanding between joint values of input variables and the resulting predicted response value [11]. We note that interpretability must take into account human limitations. Thus, a linear model [24], a gradient vector [2] or an additive model [6] may or may not be interpretable. If hundreds or thousands of features significantly contribute to a prediction, it is not reasonable to expect users to comprehend why the prediction was made, even if they can inspect individual weights. This requirement further implies that explanations should be easy to understand - which is not necessarily true for features used by the model.


Thus, the “input variables” in the explanations may be different than the features used by the model.

Another essential criterion is local fidelity. Although it is often impossible for an explanation to be completely faithful unless it is the complete description of the model itself, for an explanation to be meaningful it must at least be locally faithful - i.e. it must correspond to how the model behaves in the vicinity of the instance being predicted. We note that local fidelity does not imply global fidelity: features that are globally important may not be important in the local context, and vice versa. While global fidelity would imply local fidelity, presenting globally faithful explanations that are interpretable remains a challenge for complex models.

While there are models that are inherently interpretable [6, 17, 26, 27], an explainer must be able to explain any model, and thus be model-agnostic (i.e. treating the original model as a black box). Apart from the fact that many state-of-the-art classifiers are not currently interpretable, this also provides flexibility to explain future classifiers.

In addition to explaining predictions, providing a global perspective is important to ascertain trust in the model. As mentioned before, accuracy may often not be sufficient to evaluate the model, and thus we want to explain the model. Building upon the explanations for individual predictions, we select a few explanations to present to the user, such that they are representative of the model.

3 Local Interpretable Model-Agnostic Explanations

We now present Local Interpretable Model-agnostic Explanations (LIME). The overall goal of LIME is to identify an interpretable model over the interpretable representation that is locally faithful to the classifier.

3.1 Interpretable Data Representations

Before we present the explanation system, it is important to distinguish between features and interpretable data representations. As mentioned before, interpretable explanations need to use a representation that is understandable to humans, regardless of the actual features used by the model. For example, a possible interpretable representation for text classification is a binary vector indicating the presence or absence of a word, even though the classifier may use more complex (and incomprehensible) features such as word embeddings. Likewise for image classification, an interpretable representation may be a binary vector indicating the “presence” or “absence” of a contiguous patch of similar pixels (a super-pixel), while the classifier may represent the image as a tensor with three color channels per pixel. We denote by x ∈ R^d the original representation of an instance being explained, and use x′ ∈ {0, 1}^{d′} to denote a binary vector for its interpretable representation.
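The mapping is easiest to see in code. The following is a minimal sketch (our own illustration with a hypothetical five-word vocabulary, not the authors’ implementation) of turning a text instance into the binary interpretable representation x′, independently of whatever features the underlying classifier uses:

```python
import numpy as np

vocabulary = ["sneeze", "headache", "fatigue", "flu", "weight"]  # hypothetical

def interpretable_representation(text, vocabulary):
    """Binary vector x' in {0,1}^{d'}: 1 if the word occurs in the text."""
    tokens = set(text.lower().split())
    return np.array([1 if word in tokens else 0 for word in vocabulary])

x_prime = interpretable_representation("Patient reports sneeze and headache", vocabulary)
print(x_prime)  # [1 1 0 0 0]
```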

3.2 Fidelity-Interpretability Trade-off

Formally, we define an explanation as a model g ∈ G, where G is a class of potentially interpretable models, such as linear models, decision trees, or falling rule lists [27]. The assumption is that given a model g ∈ G, we can readily present it to the user with visual or textual artifacts. Note that the domain of g is {0, 1}^{d′}, i.e. g acts over absence/presence of the interpretable components. As noted before, not every g ∈ G is simple enough to be interpretable - thus we let Ω(g) be a measure of complexity (as opposed to interpretability) of the explanation g ∈ G. For example, for decision trees Ω(g) may be the depth of the tree, while for linear models, Ω(g) may be the number of non-zero weights.
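As a small illustration of the two examples of Ω above (our own toy code, under the assumption that explanations are represented as weight vectors or fitted scikit-learn decision trees):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def omega_linear(weights):
    # Complexity of a linear explanation: number of non-zero weights.
    return np.count_nonzero(weights)

def omega_tree(tree: DecisionTreeClassifier):
    # Complexity of a decision-tree explanation: the depth of the (fitted) tree.
    return tree.get_depth()
```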

Let the model being explained be denoted f : R^d → R. In classification, f(x) is the probability (or a binary indicator) that x belongs to a certain class¹. We further use Π_x(z) as a proximity measure between an instance z and x, so as to define locality around x. Finally, let L(f, g, Π_x) be a measure of how unfaithful g is in approximating f in the locality defined by Π_x. In order to ensure both interpretability and local fidelity, we must minimize L(f, g, Π_x) while having Ω(g) be low enough to be interpretable by humans. The explanation produced by LIME is obtained by the following:

ξ(x) = argmin_{g∈G} L(f, g, Π_x) + Ω(g)    (1)

This formulation can be used with different explanation families G, fidelity functions L, and complexity measures Ω. Here we focus on sparse linear models as explanations, and on performing the search using perturbations.

¹ For multiple classes, we explain each class separately, thus f(x) is the prediction of the relevant class.

3.3 Sampling for Local Exploration

We want to minimize the expected locally-aware loss L(f, g, Π_x) without making any assumptions about f, since we want the explainer to be model-agnostic. Thus, in order to learn the local behaviour of f as the interpretable inputs vary, we approximate L(f, g, Π_x) by drawing samples, weighted by Π_x. We sample instances around x′ by drawing nonzero elements of x′ uniformly at random (where the number of such draws is also uniformly sampled). Given a perturbed sample z′ ∈ {0, 1}^{d′} (which contains a fraction of the nonzero elements of x′), we recover the sample in the original representation z ∈ R^d and obtain f(z), which is used as a label for the explanation model. Given this dataset Z of perturbed samples with the associated labels, we optimize Eq. (1) to get an explanation ξ(x). The primary intuition behind LIME is presented in Figure 3, where we sample instances both in the vicinity of x (which have a high weight due to Π_x) and far away from x (low weight from Π_x). Even though the original model may be too complex to explain globally, LIME presents an explanation that is locally faithful (linear in this case), where the locality is captured by Π_x. It is worth noting that our method is fairly robust to sampling noise since the samples are weighted by Π_x in Eq. (1). We now present a concrete instance of this general framework.
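A minimal sketch of this sampling step, under our reading of the description above (each sample z′ keeps a uniformly sized, uniformly chosen subset of the nonzero entries of x′); every z′ would then be mapped back to the original representation z in order to query f:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_around(x_prime, num_samples):
    """Draw binary perturbations z' of the interpretable instance x'."""
    nonzero = np.flatnonzero(x_prime)
    samples = []
    for _ in range(num_samples):
        # Number of nonzero entries to keep, drawn uniformly at random.
        k = rng.integers(0, len(nonzero) + 1)
        keep = rng.choice(nonzero, size=k, replace=False)
        z_prime = np.zeros_like(x_prime)
        z_prime[keep] = 1
        samples.append(z_prime)
    return np.array(samples)

x_prime = np.array([1, 1, 0, 1, 0, 1])          # toy interpretable instance
Z_prime = sample_around(x_prime, num_samples=5)  # 5 perturbed samples z'
print(Z_prime)
```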

3.4 Sparse Linear Explanations

For the rest of this paper, we let G be the class of linear models, such that g(z′) = w_g · z′. We use the locally weighted square loss as L, as defined in Eq. (2), where we let Π_x(z) = exp(−D(x, z)²/σ²) be an exponential kernel defined on some distance function D (e.g. cosine distance for text, L2 distance for images) with width σ.

L(f, g, Π_x) = Σ_{z,z′∈Z} Π_x(z) (f(z) − g(z′))²    (2)

Figure 3: Toy example to present intuition for LIME. The black-box model’s complex decision function f (unknown to LIME) is represented by the blue/pink background, which cannot be approximated well by a linear model. The bright bold red cross is the instance being explained. LIME samples instances, gets predictions using f, and weighs them by the proximity to the instance being explained (represented here by size). The dashed line is the learned explanation that is locally (but not globally) faithful.

For text classification, we ensure that the explanation is interpretable by letting the interpretable representation be a bag of words, and by setting a limit K on the number of words included, i.e. Ω(g) = ∞·1[‖w_g‖₀ > K]. We use the same Ω for image classification, using “super-pixels” (computed using any standard algorithm) instead of words, such that the interpretable representation of an image is a binary vector where 1 indicates the original super-pixel and 0 indicates a grayed out super-pixel. This particular choice of Ω makes directly solving Eq. (1) intractable, but we approximate it by first selecting K features with Lasso (using the regularization path [8]) and then learning the weights via least squares (a procedure we call K-LASSO in Algorithm 1). We note that in Algorithm 1, the time required to produce an explanation is dominated by the complexity of the black box model f(z_i). To give a rough idea of running time, explaining predictions from random forests with 1000 trees using scikit-learn² on a laptop with N = 5000 takes around 3 seconds. Explaining each prediction of the Inception network [25] for image classification takes around 10 minutes.
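The following sketch shows one way these pieces could fit together in Python: an exponential kernel on cosine distance for the proximity weights, lars_path from scikit-learn for the Lasso regularization path, and a weighted least-squares refit on the K selected features. The function names, the value of σ, and the small ridge penalty in the refit are our own illustrative choices, not the reference implementation:

```python
import numpy as np
from sklearn.linear_model import lars_path, Ridge
from sklearn.metrics.pairwise import cosine_distances

def exponential_kernel(x_prime, Z_prime, sigma=0.25):
    """Pi_x(z) = exp(-D(x', z')^2 / sigma^2), with D the cosine distance."""
    d = cosine_distances(Z_prime, x_prime.reshape(1, -1)).ravel()
    return np.exp(-(d ** 2) / sigma ** 2)

def k_lasso(Z_prime, f_z, weights, K):
    """Select K features on the Lasso path, then refit them by weighted least squares."""
    # Weight the data (as in weighted least squares) before running the Lasso path.
    sw = np.sqrt(weights)
    Xw = Z_prime * sw[:, None]
    yw = f_z * sw
    _, _, coefs = lars_path(Xw, yw, method="lasso")
    # Walk the regularization path from least to most regularized until at most
    # K features are active.
    active = np.array([], dtype=int)
    for col in range(coefs.shape[1] - 1, -1, -1):
        active = np.flatnonzero(coefs[:, col])
        if len(active) <= K:
            break
    if len(active) == 0:
        return {}
    # Refit the weights of the selected features (tiny ridge penalty for stability).
    model = Ridge(alpha=1e-3)
    model.fit(Z_prime[:, active], f_z, sample_weight=weights)
    return dict(zip(active.tolist(), model.coef_.tolist()))
```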

3.5 Example 1: Text classification with SVMs

In Figure 2 (right side), we explain the predictions of a support vector machine with RBF kernel trained on unigrams to differentiate “Christianity” from “Atheism” (on a subset of the 20 newsgroup dataset). Although this classifier achieves 94% held-out accuracy,

² http://scikit-learn.org


(a) Original Image   (b) Explaining Electric guitar   (c) Explaining Acoustic guitar   (d) Explaining Labrador

Figure 4: Explaining an image classification prediction made by Google’s Inception network, highlighting positive pixels. The top 3 classes predicted are “Electric Guitar” (p = 0.32), “Acoustic guitar” (p = 0.24) and “Labrador” (p = 0.21).

Algorithm 1 LIME for Sparse Linear Explanations
Require: Classifier f, Number of samples N
Require: Instance x, and its interpretable version x′
Require: Similarity kernel Π_x, Length of explanation K
  Z ← {}
  for i ∈ {1, 2, 3, ..., N} do
    z′_i ← sample_around(x′)
    Z ← Z ∪ ⟨z′_i, f(z_i), Π_x(z_i)⟩
  end for
  w ← K-Lasso(Z, K)    ▷ with z′_i as features, f(z) as target
  return w

and one would be tempted to trust it based on this, the explanation for an instance shows that predictions are made for quite arbitrary reasons (the words “Posting”, “Host” and “Re” have no connection to either Christianity or Atheism). The word “Posting” appears in 22% of examples in the training set, 99% of them in the class “Atheism”. Even if headers are removed, proper names of prolific posters (such as “Keith”) in the original newsgroups are selected by the classifier, which would also not generalize.

After getting such insights from explanations, it is clear that this dataset has serious issues (which are not evident just by studying the raw data or predictions), and that this classifier, or held-out evaluation, cannot be trusted. It is also clear what the problems are, and the steps that can be taken to fix these issues and train a more trustworthy classifier.

3.6 Example 2: Deep networks for images

We learn a linear model with positive and negative weights for each super-pixel in an image. For the purpose of visualization, one may wish to just highlight the super-pixels with positive weight towards a specific class, as they give intuition as to why the model would think that class may be present. We explain the prediction of Google’s pre-trained Inception neural network [25] in this fashion on an arbitrary image (Figure 4a). Figures 4b, 4c, 4d show the super-pixel explanations for the top 3 predicted classes (with the rest of the image grayed out), having set K = 10. What the neural network picks up on for each of the classes is quite natural to humans - Figure 4b in particular provides insight as to why acoustic guitar was predicted to be electric: due to the fretboard. This kind of explanation enhances trust in the classifier (even if the top predicted class is wrong), as it shows that it is not acting in an unreasonable manner.
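A rough sketch of this procedure, under our own assumptions: skimage’s SLIC as the “standard algorithm” for super-pixels, a hypothetical predict_proba(images) function standing in for the Inception network, mean-color “graying out”, and a crude proximity weight in place of the exponential kernel of Section 3.4:

```python
import numpy as np
from skimage.segmentation import slic
from sklearn.linear_model import Ridge

def explain_image(image, predict_proba, class_idx, num_samples=200, K=10, seed=0):
    rng = np.random.default_rng(seed)
    segments = slic(image, n_segments=50)        # super-pixel label for every pixel
    seg_ids = np.unique(segments)
    d = len(seg_ids)

    # "Grayed out" version: each super-pixel replaced by its mean color.
    fudged = image.copy()
    for s in seg_ids:
        fudged[segments == s] = image[segments == s].mean(axis=0)

    # Binary interpretable samples z' and the corresponding perturbed images.
    Z_prime = rng.integers(0, 2, size=(num_samples, d))
    Z_prime[0, :] = 1                            # keep the unperturbed image too
    batch = []
    for z in Z_prime:
        temp = image.copy()
        for s, on in zip(seg_ids, z):
            if not on:
                temp[segments == s] = fudged[segments == s]
        batch.append(temp)
    probs = predict_proba(np.array(batch))[:, class_idx]

    # Locally weighted linear model; this proximity weight (based on the fraction
    # of super-pixels removed) is only a stand-in for the paper's kernel.
    weights = np.exp(-((1.0 - Z_prime.mean(axis=1)) ** 2) / 0.25)
    model = Ridge(alpha=1.0)
    model.fit(Z_prime, probs, sample_weight=weights)
    top = np.argsort(model.coef_)[::-1][:K]      # K largest (most positive) weights
    return seg_ids[top], model.coef_[top]
```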

4 Submodular Pick for Explaining Models

Although an explanation of a single prediction provides some understanding into the reliability of the classifier to the user, it is not sufficient to evaluate and assess trust in the model as a whole. We propose to give a global understanding of the model by explaining a set of individual instances. This approach is still model agnostic, and is complementary to computing summary statistics such as held-out accuracy.

Even though explanations of multiple instances can be insightful, these instances need to be selected judiciously, since users may not have the time to examine a large number of explanations. We represent the time and patience that humans have by a budget B that denotes the number of explanations they are willing to look at in order to understand a model. Given a set of instances X, we define the pick step as the task of selecting B instances for the user to inspect.

The pick step is not dependent on the existence of explanations - one of the main purposes of tools like Modeltracker [1] and others [10] is to assist users in selecting instances themselves, and examining the raw data and predictions. However, as we have argued that looking at raw data is not enough to understand predictions and get insights, it is intuitive that a method for the pick step should take into account the explanations that accompany each prediction. Moreover, this method should pick a diverse, representative set of explanations to show the user – i.e. non-redundant explanations that represent how the model behaves globally.

Given all of the explanations for a set of instances X, we construct an n × d′ explanation matrix W that represents the local importance of the interpretable components for each instance. When using linear models as explanations, for an instance x_i and explanation g_i = ξ(x_i), we set W_{ij} = |w_{g_i j}|. Further, for each component j in W, we let I_j denote the global importance, or representativeness, of that component in the explanation space. Intuitively, we want I such that features that explain many different instances have higher importance scores. Concretely for the text applications, we set I_j = √(Σ_{i=1}^{n} W_{ij}). For images, I must measure something that is comparable across the super-pixels in different images, such as color histograms or other features of super-pixels; we leave further exploration of these ideas for future work. In Figure 5, we show a toy example W, with n = d′ = 5, where W is binary (for simplicity). The importance function I should score feature f2 higher than feature f1, i.e. I_2 > I_1, since feature f2 is used to explain more instances.

While we want to pick instances that cover the important components, the set of explanations must not be redundant in the components they show the users, i.e. avoid selecting instances with similar explanations. In Figure 5, after the second row is picked, the third row adds no value, as the user has already seen features f2 and f3 - while the last row exposes the user to completely new features.

[Figure 5 diagram: binary matrix with columns f1–f5; the selected rows are marked as covered features.]

Figure 5: Toy example W. Rows represent instances (documents) and columns represent features (words). Feature f2 (dotted blue) has the highest importance. Rows 2 and 5 (in red) would be selected by the pick procedure, covering all but feature f1.

Algorithm 2 Submodular pick (SP) algorithm
Require: Instances X, Budget B
  for all x_i ∈ X do
    W_i ← explain(x_i, x′_i)    ▷ Using Algorithm 1
  end for
  for j ∈ {1 . . . d′} do
    I_j ← √(Σ_{i=1}^{n} |W_{ij}|)    ▷ Compute feature importances
  end for
  V ← {}
  while |V| < B do    ▷ Greedy optimization of Eq. (4)
    V ← V ∪ {argmax_i c(V ∪ {i}, W, I)}
  end while
  return V

Selecting the second and last row results in the coverage of almost all the features. We formalize this non-redundant coverage intuition in Eq. (3), where we define coverage as the set function c, given W and I, which computes the total importance of the features that appear in at least one instance in a set V.

c(V, W, I) = Σ_{j=1}^{d′} 1[∃i ∈ V : W_{ij} > 0] I_j    (3)

The pick problem is defined in Eq. (4), and it consists of finding the set V, |V| ≤ B, that achieves the highest coverage.

Pick(W, I) = argmax_{V, |V|≤B} c(V, W, I)    (4)

The problem in Eq. (4) is maximizing a weighted coverage function, and is NP-hard [9]. Let c(V ∪ {i}, W, I) − c(V, W, I) be the marginal coverage gain of adding an instance i to a set V. Due to submodularity, a greedy algorithm that iteratively adds the instance with the highest marginal coverage gain to the solution offers a constant-factor approximation guarantee of 1 − 1/e to the optimum [15]. We outline this approximation for the pick step in Algorithm 2, and call it submodular pick.
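A minimal sketch of the coverage function and the greedy pick (our own toy code mirroring Eq. (3), Eq. (4) and Algorithm 2; the toy matrix is our guess at the values behind Figure 5):

```python
import numpy as np

def coverage(V, W, I):
    """Total importance of features covered by at least one instance in V (Eq. 3)."""
    if not V:
        return 0.0
    covered = (W[list(V)] > 0).any(axis=0)
    return float(I[covered].sum())

def submodular_pick(W, B):
    """Greedy approximation of Eq. (4): add the instance with the largest marginal gain."""
    I = np.sqrt(np.abs(W).sum(axis=0))   # global feature importances I_j
    V = []
    for _ in range(B):
        remaining = [i for i in range(W.shape[0]) if i not in V]
        gains = [coverage(V + [i], W, I) - coverage(V, W, I) for i in remaining]
        V.append(remaining[int(np.argmax(gains))])
    return V

# Toy matrix in the spirit of Figure 5: rows are instances, columns are features.
W = np.array([[1, 1, 0, 0, 0],
              [0, 1, 1, 0, 0],
              [0, 1, 1, 0, 0],
              [0, 1, 0, 0, 0],
              [0, 0, 0, 1, 1]], dtype=float)
print(submodular_pick(W, B=2))   # -> [1, 4], i.e. rows 2 and 5 in the figure's 1-indexing
```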

5 Simulated User Experiments

In this section, we present simulated user experiments to evaluate the usefulness of explanations in trust-related tasks. In particular, we address the following questions: (1) Are the explanations faithful to the model, (2) Can the explanations aid users in ascertaining trust in predictions, and (3) Are the explanations useful for evaluating the model as a whole.

5.1 Experiment Setup

We use two sentiment analysis datasets (books and DVDs, 2000 instances each) where the task is to classify product reviews as positive or negative [4]. The results on two other datasets (electronics, and kitchen) are similar, thus we omit them due to space. We train decision trees (DT), logistic regression with L2 regularization (LR), nearest neighbors (NN), and support vector machines with RBF kernel (SVM), all using bag of words as features. We also include random forests (with 1000 trees) trained with the average word2vec embedding [19] (RF), a model that is impossible to interpret. We use the implementations and default parameters of scikit-learn, unless noted otherwise. We divide each dataset into train (1600 instances) and test (400 instances). Code for replicating our experiments is available online³.

To explain individual predictions, we compare our proposed approach (LIME) with parzen [2], for which we take the K features with the highest absolute gradients as explanations. We set the hyper-parameters for parzen and LIME using cross validation, and set N = 15,000. We also compare against a greedy procedure (similar to Martens and Provost [18]) in which we greedily remove features that contribute the most to the predicted class until the prediction changes (or we reach the maximum of K features), and a random procedure that randomly

³ https://github.com/marcotcr/lime-experiments

Figure 6: Recall on truly important features for two interpretable classifiers on the books dataset. (a) Sparse LR: random 17.4%, parzen 72.8%, greedy 64.3%, LIME 92.1%. (b) Decision Tree: random 20.6%, parzen 78.9%, greedy 37.0%, LIME 97.0%.

picks K features as an explanation. We set K to 10 for our experiments.

For experiments where the pick procedure applies, we either do random selection (random pick, RP) or the procedure described in Section 4 (submodular pick, SP). We refer to pick-explainer combinations by adding RP or SP as a prefix.

5.2 Are explanations faithful to the model?

We measure faithfulness of explanations on classifiers that are by themselves interpretable (sparse logistic regression and decision trees). In particular, we train both classifiers such that the maximum number of features they use for any instance is 10. For such models, we know the set of truly important features. For each prediction on the test set, we generate explanations and compute the fraction of truly important features that are recovered by the explanations. We report this recall averaged over all the test instances in Figures 6 and 7. We observe that the greedy approach is comparable to parzen on logistic regression, but is substantially worse on decision trees since changing a single feature at a time often does not have an effect on the prediction. However, text is a particularly hard case for the parzen explainer, due to the difficulty in approximating the original classifier in high dimensions, thus the overall recall by parzen is low. LIME consistently provides > 90% recall for both logistic regression and decision trees on both datasets, demonstrating that LIME explanations are quite faithful to the model.
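The faithfulness metric itself is simple; a sketch in our own formulation, where the “gold” features are those the interpretable classifier actually uses for the instance (indices below are toy values):

```python
def explanation_recall(gold_features, explained_features):
    """Fraction of truly important features recovered by the explanation."""
    gold, found = set(gold_features), set(explained_features)
    return len(gold & found) / max(len(gold), 1)

gold = [3, 17, 42, 58]            # nonzero-weight features of a sparse model (toy)
lime_features = [17, 42, 58, 99]  # feature indices returned by an explainer (toy)
print(explanation_recall(gold, lime_features))   # 0.75
```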

5.3 Should I trust this prediction?

In order to simulate trust in individual predictions, we first randomly select 25% of the features to be “untrustworthy”, and assume that the users can


Figure 7: Recall on truly important features for two interpretable classifiers on the DVDs dataset. (a) Sparse LR: random 19.2%, parzen 60.8%, greedy 63.4%, LIME 90.2%. (b) Decision Tree: random 17.4%, parzen 80.6%, greedy 47.6%, LIME 97.8%.

identify and would not want to trust these features (such as the headers in 20 newsgroups, leaked data, etc). We thus develop oracle “trustworthiness” by labeling test set predictions from a black box classifier as “untrustworthy” if the prediction changes when untrustworthy features are removed from the instance, and “trustworthy” otherwise. In order to simulate users, we assume that users deem predictions untrustworthy from LIME and parzen explanations if the prediction from the linear approximation changes when all untrustworthy features that appear in the explanations are removed (the simulated human “discounts” the effect of untrustworthy features). For greedy and random, the prediction is mistrusted if any untrustworthy features are present in the explanation, since these methods do not provide a notion of the contribution of each feature to the prediction. Thus for each test set prediction, we can evaluate whether the simulated user trusts it using each explanation method, and compare it to the trustworthiness oracle.
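For concreteness, a sketch in our own formulation of the oracle and of the simulated user for a linear (LIME-style) explanation; black_box_predict is a hypothetical classifier function, and the 0.5 threshold assumes the local model approximates a binary class probability:

```python
import numpy as np

def oracle_trust(black_box_predict, x, untrustworthy):
    """Trustworthy iff the black-box prediction is unchanged after zeroing the
    untrustworthy features of the instance."""
    x_clean = x.copy()
    x_clean[list(untrustworthy)] = 0
    return black_box_predict(x) == black_box_predict(x_clean)

def simulated_user_trust(explanation, intercept, untrustworthy):
    """explanation: dict feature_index -> weight from the local linear model.
    The simulated user discounts untrustworthy features in the explanation and
    checks whether the approximation's prediction flips."""
    full = intercept + sum(explanation.values())
    kept = intercept + sum(w for j, w in explanation.items() if j not in untrustworthy)
    return (full > 0.5) == (kept > 0.5)
```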

Using this setup, we report the F1 on the trustworthy predictions for each explanation method, averaged over 100 runs, in Table 1. The results indicate that LIME dominates the others (all results are significant at p = 0.01) on both datasets, and for all of the black box models. The other methods either achieve a lower recall (i.e. they mistrust predictions more than they should) or lower precision (i.e. they trust too many predictions), while LIME maintains both high precision and high recall. Even though we artificially select which features are untrustworthy, these results indicate that LIME is helpful in assessing trust in individual predictions.

Table 1: Average F1 of trustworthiness for different explainers on a collection of classifiers and datasets.

                  Books                          DVDs
          LR     NN     RF     SVM       LR     NN     RF     SVM
Random    14.6   14.8   14.7   14.7      14.2   14.3   14.5   14.4
Parzen    84.0   87.6   94.3   92.3      87.0   81.7   94.2   87.3
Greedy    53.7   47.4   45.0   53.3      52.4   58.1   46.6   55.1
LIME      96.6   94.5   96.2   96.7      96.6   91.8   96.1   95.6

Figure 8: Choosing between two classifiers, as the number of instances shown to a simulated user is varied. Averages and standard errors from 800 runs. (a) Books dataset, (b) DVDs dataset; each panel plots % correct choice against the number of instances seen by the user, for SP-LIME, RP-LIME, SP-greedy, and RP-greedy.

5.4 Can I trust this model?

In the final simulated user experiment, we evaluate whether the explanations can be used for model selection, simulating the case where a human has to decide between two competing models with similar accuracy on validation data. For this purpose, we add 10 artificially “noisy” features. Specifically, on the training and validation sets (80/20 split of the original training data), each artificial feature appears in 10% of the examples in one class, and 20% of the other, while on the test instances, each artificial feature appears in 10% of the examples in each class. This recreates the situation where the models use not only features that are informative in the real world, but also ones that are noisy and introduce spurious correlations. We create pairs of competing classifiers by repeatedly training pairs of random forests with 30 trees until their validation accuracy is within 0.1% of each other, but their test accuracy differs by at least 5%. Thus, it is not possible to identify the better classifier (the one with higher test accuracy) from the accuracy on the validation data.
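A sketch of how such artificial features could be generated, under our reading of the percentages above (function and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def add_artificial_features(y, is_test, num_features=10):
    """Binary noisy features: class-correlated on train/validation, not on test."""
    n = len(y)
    cols = []
    for _ in range(num_features):
        if is_test:
            p = np.full(n, 0.10)                    # 10% of examples in each class
        else:
            p = np.where(y == 1, 0.10, 0.20)         # 10% of one class, 20% of the other
        cols.append(rng.random(n) < p)
    return np.column_stack(cols).astype(int)

y_train = rng.integers(0, 2, size=1000)
X_noise = add_artificial_features(y_train, is_test=False)
print(X_noise.mean(axis=0))   # roughly 0.15 per feature when the classes are balanced
```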


Figure 9: Average accuracy of human subjects (with standard errors) in choosing between two classifiers, for greedy and LIME explanations under random pick (RP) and submodular pick (SP). Reported accuracies are 68.0% (RP-greedy), 89.0% (SP-LIME), with RP-LIME and SP-greedy falling in between at 75.0% and 80.0%.

The goal of this experiment is to evaluate whether a user can identify the better classifier based on the explanations of B instances from the validation set. The simulated human marks the set of artificial features that appear in the B explanations as untrustworthy, following which we evaluate how many total predictions in the validation set should be trusted (as in the previous section, treating only marked features as untrustworthy). Then, we select the classifier with fewer untrustworthy predictions, and compare this choice to the classifier with higher held-out test set accuracy.

We present the accuracy of picking the correct classifier as B varies, averaged over 800 runs, in Figure 8. We omit SP-parzen and RP-parzen from the figure since they did not produce useful explanations for this task, performing only slightly better than random. We see that LIME is consistently better than greedy, irrespective of the pick method. Further, combining submodular pick with LIME outperforms all other methods; in particular, it is much better than using RP-LIME when only a few examples are shown to the users. These results demonstrate that the trust assessments provided by SP-selected LIME explanations are good indicators of generalization, which we validate with human experiments in the next section.

6 Evaluation with Human Subjects

In this section, we recreate three scenarios in machine learning that require trust and understanding of predictions and models. In particular, we evaluate LIME and SP-LIME in the following settings: (1) Can users choose from two classifiers the one that generalizes better (Section 6.2), (2) based on the explanations, can users perform feature engineering to improve the model (Section 6.3), and (3) are users able to identify and describe classifier irregularities by looking at explanations (Section 6.4).

6.1 Experimental setup

For experiments in sections 6.2 and 6.3, we use the subset of 20 newsgroups mentioned beforehand, where the task is to distinguish between “Christianity” and “Atheism” documents. This dataset is quite problematic since it contains features that do not generalize well (e.g. very informative header information and author names), and thus validation accuracy considerably overestimates real-world performance.

In order to estimate the real world performance, we create a new religion dataset for evaluation. We download Atheism and Christianity websites from the DMOZ directory⁴ and human curated lists, yielding 819 webpages in each class (more details and data available online⁵). High accuracy on the religion dataset by a classifier trained on 20 newsgroups indicates that the classifier is generalizing using semantic content, instead of placing importance on the data specific issues outlined above.

Unless noted otherwise, we use an SVM with RBF kernel, trained on the 20 newsgroups data with hyper-parameters tuned via cross-validation. This classifier obtains 94% accuracy on the original 20 newsgroups train-test split.

6.2 Can users select the best classifier?

In this section, we want to evaluate whether explanations can help users decide which classifier generalizes better - that is, which classifier the user trusts more “in the wild”. Specifically, users have to decide between two classifiers: SVM trained on the original 20 newsgroups dataset, and a version of the same classifier trained on a “cleaned” dataset where many of the features that do not generalize are manually removed using regular expressions. The original classifier achieves an accuracy score of 57.3% on the religion dataset, while the “cleaned” classifier achieves a score of 69.0%. In contrast, the test accuracy on the original train/test split for 20 newsgroups is 94.0% and 88.6%, respectively - suggesting that the worse classifier would be selected if accuracy alone is used as a measure of trust.

⁴ https://www.dmoz.org/
⁵ https://github.com/marcotcr/lime-experiments


Figure 10: Feature engineering experiment. Real-world accuracy (on the religion dataset) over rounds of interaction (0 to 3) for SP-LIME, RP-LIME, and no cleaning. Each shaded line represents the average accuracy of subjects in a path starting from one of the initial 10 subjects. Each solid line represents the average across all paths per round of interaction.

We recruit human subjects on Amazon Mechanical Turk – by no means machine learning experts, but instead people with basic knowledge about religion. We measure their ability to choose the better algorithm by seeing side-by-side explanations with the associated raw data (as shown in Figure 2). We restrict both the number of words in each explanation (K) and the number of documents that each person inspects (B) to 6. The position of each algorithm and the order of the instances seen are randomized between subjects. After examining the explanations, users are asked to select which algorithm will perform best in the real world, and to explain why. The explanations are produced by either greedy (chosen as a baseline due to its performance in the simulated user experiment) or LIME, and the instances are selected either by random (RP) or submodular pick (SP). We modify the greedy step in Algorithm 2 slightly so it alternates between explanations of the two classifiers. For each setting, we repeat the experiment with 100 users.

The results are presented in Figure 9. The first thing to note is that all of the methods are good at identifying the better classifier, demonstrating that the explanations are useful in determining which classifier to trust, while using test set accuracy would result in the selection of the wrong classifier. Further, we see that the submodular pick (SP) greatly improves the user’s ability to select the best classifier when compared to random pick (RP), with LIME outperforming greedy in both cases. While a few users got confused and selected a classifier for arbitrary reasons, most indicated that the fact that one of the classifiers clearly utilized more semantically meaningful words was critical to their selection.

6.3 Can non-experts improve a classifier?

If one notes a classifier is untrustworthy, a common task in machine learning is feature engineering, i.e. modifying the set of features and retraining in order to improve generalization and make the classifier trustworthy. Explanations can aid in this process by presenting the important features, especially for removing features that the users feel do not generalize.

We use the 20 newsgroups data here as well, and ask Amazon Mechanical Turk users to identify which words from the explanations should be removed from subsequent training, in order to improve the worse classifier from the previous section. At each round of interaction, the subject marks words for deletion while seeing B = 10 instances with K = 10 words in each explanation (an interface similar to Figure 2, but with a single algorithm). As a reminder, the users here are not experts in machine learning and are unfamiliar with feature engineering, thus are only identifying words based on their semantic content. Further, users do not have any access to the religion dataset - they do not even know of its existence. We start the experiment with 10 subjects. After they mark words for deletion, we train 10 different classifiers, one for each subject (with the corresponding words removed). The explanations for each classifier are then presented to a set of 5 users in a new round of interaction, which results in 50 new classifiers. We do a final round, after which we have 250 classifiers, each with a path of interaction tracing back to the first 10 subjects.

The explanations and instances shown to each user are produced by SP-LIME or RP-LIME. We show the average accuracy on the religion dataset at each interaction round for the paths originating from each of the original 10 subjects (shaded lines), and the average across all paths (solid lines), in Figure 10. It is clear from the figure that the crowd workers are able to improve the model by removing features they deem unimportant for the task. Further, SP-LIME outperforms RP-LIME, indicating that selection of the instances is crucial for efficient feature engineering.


(a) Husky classified as wolf (b) Explanation

Figure 11: Raw data and explanation of a bad model’s prediction in the “Husky vs Wolf” task.

                                    Before     After
Trusted the bad model               10 / 27    3 / 27
Snow as a potential feature         12 / 27    25 / 27

Table 2: “Husky vs Wolf” experiment results.

It is also interesting to observe that paths where the initial users do a relatively worse job in selecting features are later fixed by the subsequent users.

Each subject took an average of 3.6 minutes per round of cleaning, resulting in just under 11 minutes to produce a classifier that generalizes much better to real world data. Each path had on average 200 words removed with SP, and 157 with RP, indicating that incorporating coverage of important features is useful for feature engineering. Further, out of an average of 200 words selected with SP, 174 were selected by at least half of the users, while 68 by all the users. Along with the fact that the variance in the accuracy decreases across rounds, this high agreement demonstrates that the users are converging to similar correct models. This evaluation is an example of how explanations make it easy to improve an untrustworthy classifier – in this case easy enough that machine learning knowledge is not required.

6.4 Do explanations lead to insights?

Often artifacts of data collection can induce undesirable correlations that the classifiers pick up during training. These issues can be very difficult to identify just by looking at the raw data and predictions.

In an effort to reproduce such a setting, we take the task of distinguishing between photos of Wolves and Eskimo Dogs (huskies). We train a logistic regression classifier on a training set of 20 images, hand selected such that all pictures of wolves had snow in the background, while pictures of huskies did not. As the features for the images, we use the first max-pooling layer of Google’s pre-trained Inception neural network [25]. On a collection of additional 60 images, the classifier predicts “Wolf” if there is snow (or light background at the bottom), and “Husky” otherwise, regardless of animal color, position, pose, etc. We trained this bad classifier intentionally, to evaluate whether subjects are able to detect it.

The experiment proceeds as follows: we first present a balanced set of 10 test predictions (without explanations), where one wolf is not in a snowy background (and thus the prediction is “Husky”) and one husky is (and is thus predicted as “Wolf”). We show the “Husky” mistake in Figure 11a. The other 8 examples are classified correctly. We then ask the subject three questions: (1) Do they trust this algorithm to work well in the real world, (2) why, and (3) how do they think the algorithm is able to distinguish between these photos of wolves and huskies. After getting these responses, we show the same images with the associated explanations, such as in Figure 11b, and ask the same questions.

Since this task requires some familiarity with the notion of spurious correlations and generalization, the subjects for this experiment were graduate students and professors in machine learning and its applications (NLP, Vision, etc.). After gathering the responses, we had 3 independent evaluators read their reasoning and determine if each subject mentioned snow, background, or equivalent as a potential feature the model may be using. We pick the majority as an indication of whether the subject was correct about the insight, and report these numbers before and after showing the explanations in Table 2.

Before observing the explanations, more than a third trusted the classifier, a somewhat low number since we presented only 10 examples. They did speculate as to what the neural network was picking up on, and a little less than half mentioned the snow pattern as a possible cause. After examining the explanations, however, almost all of the subjects identified the correct insight, with much more certainty that it was a determining factor. Further, the trust in the classifier also dropped substantially. Although our sample size is small, this experiment demonstrates the utility of explaining individual predictions for getting insights into classifiers, knowing when not to trust them and why. Figuring out the best interfaces and doing further experiments in this area (in particular with real machine learning based services) is an exciting direction for future research.

7 Related Work

The problems with relying on validation set accuracy as the primary measure of trust have been well studied. Practitioners consistently overestimate their model’s accuracy [21], propagate feedback loops [23], or fail to notice data leaks [14]. In order to address these issues, researchers have proposed tools like Gestalt [22] and Modeltracker [1], which help users navigate individual instances. These tools are complementary to LIME in terms of explaining models, since they do not address the problem of explaining individual predictions - instead they let the user browse raw data or features. Further, our submodular pick procedure can be incorporated in such tools to aid users in navigating larger datasets.

Some recent work aims to anticipate failures in machine learning, specifically for vision tasks [3, 29]. Letting users know when the systems are likely to fail can lead to an increase in trust, by avoiding “silly mistakes” [7]. These solutions either require additional annotations and feature engineering that is specific to vision tasks or do not provide insight into why a decision should not be trusted. Furthermore, they assume that the current evaluation metrics are reliable, which may not be the case if problems such as data leakage are present. Other recent work [10] focuses on exposing users to different kinds of mistakes (our pick step). Interestingly, the subjects in their study did not notice the serious problems in the 20 newsgroups data even after looking at many mistakes, suggesting that examining raw data is not sufficient. Note that Groce et al. [10] are not alone in this regard, many researchers in the field have unwittingly published classifiers that would not generalize for this task. Using LIME, we show that even non-experts are able to identify these irregularities when explanations are present. Further, LIME can complement these existing systems, and allow users to assess trust even when a prediction seems “correct” but is made for the wrong reasons.

Recognizing the utility of explanations in assessing trust, many have proposed using interpretable models [27], especially for the medical domain [6, 17, 26]. While such models may be appropriate for some domains, they may not apply equally well to others (e.g. a supersparse linear model [26] with 5-10 features is unsuitable for text applications). Interpretability, in these cases, comes at the cost of flexibility, accuracy, or efficiency. For text, EluciDebug [16] is a full human-in-the-loop system that shares many of our goals (interpretability, faithfulness, etc). However, they focus on an already interpretable model (Naive Bayes). In computer vision, systems that rely on object detection to produce candidate alignments [13] or attention [28] are able to produce explanations for their predictions. These are, however, constrained to specific neural network architectures or incapable of detecting “non object” parts of the images. Here we focus on general, model-agnostic explanations that can be applied to any classifier or regressor that is appropriate for the domain - even ones that are yet to be proposed.

A common approach to model-agnostic explanation is learning a potentially interpretable model on the predictions of the original model [2]. Having the explanation be a gradient vector captures a similar locality intuition to that of LIME. However, interpreting the coefficients on the gradient is difficult, particularly for confident predictions (where the gradient is near zero). Further, the model that produces the gradient is trained to approximate the original model globally. When the number of dimensions is high, maintaining local fidelity for such models becomes increasingly hard, as our experiments demonstrate. In contrast, LIME solves the much more feasible task of finding a model that approximates the original model locally. The idea of perturbing inputs for explanations has been explored before [24], where the authors focus on learning a specific contribution model, as opposed to our general framework. None of these approaches explicitly take cognitive limitations into account, and thus may produce non-interpretable explanations, such as gradients or linear models with thousands of non-zero weights. The problem becomes worse if the original features are nonsensical to humans (e.g. word embeddings). In contrast, LIME incorporates interpretability both in the optimization and in our notion of interpretable representation, such that domain- and task-specific interpretability criteria can be accommodated.
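To make the contrast with global surrogates concrete, the following is a minimal Python sketch of the local-approximation idea discussed above: perturb the instance, weight the perturbations by proximity, and fit a sparse linear model to the black-box predictions. The function name, the Gaussian perturbation, and the exponential kernel are illustrative assumptions, not the reference LIME implementation (which samples in an interpretable representation).

import numpy as np
from sklearn.linear_model import Lasso

def local_surrogate(predict_fn, x, num_samples=5000, kernel_width=0.75,
                    noise_scale=0.1, alpha=0.01, seed=0):
    """Fit a locally weighted sparse linear approximation of predict_fn around x."""
    rng = np.random.RandomState(seed)
    # Perturb the instance with Gaussian noise (a simple stand-in for
    # sampling in an interpretable representation for text or images).
    samples = x + rng.normal(scale=noise_scale, size=(num_samples, x.shape[0]))
    preds = predict_fn(samples)                                 # black-box outputs
    distances = np.linalg.norm(samples - x, axis=1)
    weights = np.exp(-(distances ** 2) / (kernel_width ** 2))   # proximity kernel
    surrogate = Lasso(alpha=alpha)                              # sparsity via L1
    surrogate.fit(samples, preds, sample_weight=weights)
    return surrogate.coef_                                      # local feature effects

# Hypothetical usage: explain clf's positive-class probability at a point x0.
# explanation = local_surrogate(lambda z: clf.predict_proba(z)[:, 1], x0)

Because the surrogate is fit only on proximity-weighted samples, its coefficients describe the model’s behaviour near x rather than attempting a single global approximation.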


8 Conclusion and Future Work

In this paper, we argued that trust is crucial for effective human interaction with machine learning systems, and that explaining individual predictions is important in assessing trust. We proposed LIME, a modular and extensible approach to faithfully explain the predictions of any model in an interpretable manner. We also introduced SP-LIME, a method to select representative and non-redundant predictions, providing a global view of the model to users. Our experiments demonstrated that explanations are useful for trust-related tasks: deciding between models, assessing trust, improving untrustworthy models, and getting insights into predictions.
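As a rough illustration of the pick idea, the sketch below greedily selects explanations so that, taken together, they touch the globally important features. It assumes each explanation is already available as a row of absolute feature weights W; the coverage notion and the names (greedy_pick, budget) are illustrative simplifications rather than the exact SP-LIME objective.

import numpy as np

def greedy_pick(W, budget):
    """Greedily pick up to `budget` explanation indices whose non-zero
    features cover as much global feature importance as possible."""
    W = np.abs(W)                              # |weights|, one row per instance
    importance = np.sqrt(W.sum(axis=0))        # global importance of each feature
    covered = np.zeros(W.shape[1], dtype=bool)
    picked = []
    for _ in range(budget):
        # Marginal gain: importance of features newly covered by each candidate.
        gains = [-1.0 if i in picked else importance[~covered & (W[i] > 0)].sum()
                 for i in range(W.shape[0])]
        best = int(np.argmax(gains))
        if gains[best] <= 0:
            break                              # nothing new left to cover
        picked.append(best)
        covered |= W[best] > 0                 # mark that explanation's features
    return picked

Since the coverage function is monotone and submodular, this standard greedy heuristic gives a constant-factor approximation to the best possible selection.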

There are a number of avenues of future work that we would like to explore. Although we describe only sparse linear models as explanations, our framework supports the exploration of a variety of explanation families, such as decision trees; it would be interesting to see a comparative study on these with real users. One issue that we do not address in this work is how to perform the pick step for images, and we would like to address this limitation in the future. The domain and model agnosticism enables us to explore a variety of applications, and we would like to investigate potential uses in speech, video, and medical domains. Finally, we would like to explore theoretical properties (such as the appropriate number of samples) and computational optimizations (such as using parallelization and GPU processing), in order to provide the accurate, real-time explanations that are critical for any human-in-the-loop machine learning system.

References

[1] S. Amershi, M. Chickering, S. M. Drucker, B. Lee, P. Simard, and J. Suh. Modeltracker: Redesigning performance analysis tools for machine learning. In Human Factors in Computing Systems (CHI), 2015.

[2] D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K.-R. Müller. How to explain individual classification decisions. Journal of Machine Learning Research, 11, 2010.

[3] A. Bansal, A. Farhadi, and D. Parikh. Towards transparent systems: Semantic characterization of failure modes. In European Conference on Computer Vision (ECCV), 2014.

[4] J. Blitzer, M. Dredze, and F. Pereira. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Association for Computational Linguistics (ACL), 2007.

[5] J. Q. Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence. Dataset Shift in Machine Learning. MIT, 2009.

[6] R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Knowledge Discovery and Data Mining (KDD), 2015.

[7] M. T. Dzindolet, S. A. Peterson, R. A. Pomranky, L. G. Pierce, and H. P. Beck. The role of trust in automation reliance. Int. J. Hum.-Comput. Stud., 58(6), 2003.

[8] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32:407–499, 2004.

[9] U. Feige. A threshold of ln n for approximating set cover. J. ACM, 45(4), July 1998.

[10] A. Groce, T. Kulesza, C. Zhang, S. Shamasunder, M. Burnett, W.-K. Wong, S. Stumpf, S. Das, A. Shinsel, F. Bice, and K. McIntosh. You are the only possible oracle: Effective test selection for end users of interactive machine learning systems. IEEE Trans. Softw. Eng., 40(3), 2014.

[11] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer New York Inc., 2001.

[12] J. L. Herlocker, J. A. Konstan, and J. Riedl. Explaining collaborative filtering recommendations. In Conference on Computer Supported Cooperative Work (CSCW), 2000.

[13] A. Karpathy and F. Li. Deep visual-semantic alignments for generating image descriptions. In Computer Vision and Pattern Recognition (CVPR), 2015.

[14] S. Kaufman, S. Rosset, and C. Perlich. Leakage in data mining: Formulation, detection, and avoidance. In Knowledge Discovery and Data Mining (KDD), 2011.


[15] A. Krause and D. Golovin. Submodular function maximization. In Tractability: Practical Approaches to Hard Problems. Cambridge University Press, February 2014.

[16] T. Kulesza, M. Burnett, W.-K. Wong, and S. Stumpf. Principles of explanatory debugging to personalize interactive machine learning. In Intelligent User Interfaces (IUI), 2015.

[17] B. Letham, C. Rudin, T. H. McCormick, and D. Madigan. Interpretable classifiers using rules and bayesian analysis: Building a better stroke prediction model. Annals of Applied Statistics, 2015.

[18] D. Martens and F. Provost. Explaining data-driven document classifications. MIS Q., 38(1), 2014.

[19] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Neural Information Processing Systems (NIPS), 2013.

[20] A. Newell. Human Problem Solving. Prentice-Hall, Inc., 1972.

[21] K. Patel, J. Fogarty, J. A. Landay, and B. Harrison. Investigating statistical machine learning as a tool for software development. In Human Factors in Computing Systems (CHI), 2008.

[22] K. Patel, N. Bancroft, S. M. Drucker, J. Fogarty, A. J. Ko, and J. Landay. Gestalt: Integrated support for implementation and analysis in machine learning. In User Interface Software and Technology (UIST), 2010.

[23] D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, M. Young, and J.-F. Crespo. Hidden technical debt in machine learning systems. In Neural Information Processing Systems (NIPS), 2015.

[24] E. Strumbelj and I. Kononenko. An efficient explanation of individual classifications using game theory. Journal of Machine Learning Research, 11, 2010.

[25] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), 2015.

[26] B. Ustun and C. Rudin. Supersparse linear integer models for optimized medical scoring systems. Machine Learning, 2015.

[27] F. Wang and C. Rudin. Falling rule lists. In Artificial Intelligence and Statistics (AISTATS), 2015.

[28] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning (ICML), 2015.

[29] P. Zhang, J. Wang, A. Farhadi, M. Hebert, and D. Parikh. Predicting failures of vision systems. In Computer Vision and Pattern Recognition (CVPR), 2014.