Download - Interpreting Deep Neural Networks Through Backpropagation › downloadFile › ...exploramos um metodo generico para criar explicac¸´ oes para as decis˜ oes de qualquer rede neuronal.˜

Interpreting Deep Neural Networks ThroughBackpropagation

Miguel Coimbra Vera

Thesis to obtain the Master of Science Degree in

Information Systems and Computer Engineering

Supervisors: Prof. Manuel Fernando Cabido Peres LopesProf. Francisco António Chaves Saraiva de Melo

Examination Committee

Chairperson: Prof. Alberto Manuel Rodrigues da SilvaSupervisor: Prof. Manuel Fernando Cabido Peres Lopes

Members of the Committee: Prof. Rui Miguel Carrasqueiro Henriques

May 2019

Abstract

Deep Learning is one of the fastest growing technologies of the last five years, which has lead to its

introduction to a wide array of industries. From translation to image recognition, Deep Learning is being

used as a one-size-fits all solution for a broad spectrum of problems. This technology has one crucial

issue: although its performance surpasses all previously existing techniques, the internal mechanisms

of Neural Networks are a black-box. In other words, despite the fact that we can train the networks

and observe their behaviour, we can’t understand the reasoning behind their decisions or control how

they will behave when faced with new data. While this is not a problem in some fields where these

networks are being applied to domains where mistakes can be afforded, the consequences can be

catastrophic when applied to high-risk domains such as healthcare, finance or others. In this thesis we

explore a generic method for creating explanations for the decisions of any neural network. We utilize

backpropagation, an internal algorithm common across all Neural Network architectures, to understand

the correlation between the input to a network and the network’s output. We experiment with our method

in both the healthcare and industrial robotics domains. Using our methodology, allows us to analyse

these previously opaque algorithms and compare their interpretations of the data with more interpretable

models such as decision trees. Finally, we release DeepdIVE (Deriving Importance VEctors) an open-

source framework for generating local explanations of any neural network programmed in Pytorch.

Keywords

Deep Learning; Interpretability; Machine Learning; Artificial Inteligence; Neural Networks;

i

Resumo

A Aprendizagem Profunda, uma das tecnologias que apresentou maior crescimento nos últimos anos

tem sido introduzida num grande leque de industrias. Desde a tradução ao reconhecimento de ima-

gens, a Aprendizagem Profunda tem sido utilizada como uma solução para todo o tipo de problemas.

No entanto, esta tecnologia tem um problema fundamental, apesar do seu desempenho ultrapassar

aquele das tecnologias existentes anteriormente, os mecanismos internos das Redes Neuronais são

uma caixa negra. Noutras palavras, apesar de conseguirmos treinar estas redes e testar o seu fun-

cionamento, não conseguimos perceber as razões por detrás de cada decisão individual ou controlar

como estas se irão comportar quando confrontadas com novos dados. Apesar de não ser um problema

urgente num contexto de investigação onde estas redes são aplicadas a problemas de investigação,

as consequências podem ser catastroficas quando estão em causa cenários da vida real. Nesta tese

exploramos um método generico para criar explicações para as decisões de qualquer rede neuronal.

Utilizamos backpropagation, um algoritmo interno utilizado na grande maioria das Redes Neuronais,

para perceber a relação de causalidade entre os dados recebidos por uma rede e as suas decisões.

Realizámos experiências para testar a nossa metodologia em redes aplicadas tanto num contexto de

saúde como de indústria. Utilizando o nosso método, é possı́vel analizar estas redes previamente

opacas e verificar as suas representações internas dos dados tendo conhecimento do domı́nio. Final-

mente, publicámos o DeepdIVE (Deriving Importance VEctors) uma ferramenta open-source para gerar

explicações de decisões de qualquer Rede Neuronal implementada em Pytorch.

Palavras Chave

Interpretabilidade; Aprendizagem de Maquina; Inteligência Artificial; Redes Neuronais;

iii

Contents

1 Introduction 1

1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2 Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.3 Hypothesis and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 Background 7

2.1 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

3 Related Work 13

3.1 Interpretability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.2 Interpretable Models as Explanations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.3 Inner Model Visualizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

4 Backpropagating Feature Importance 27

5 Results 35

5.1 Medical Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

5.1.1 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

5.1.2 Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

5.1.3 Network Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

5.1.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

5.2 Industrial Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

5.2.1 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

5.2.2 Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

5.2.3 Network Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

5.2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

5.2.4.A Using Gradients for Visualization . . . . . . . . . . . . . . . . . . . . . . . 50

5.2.4.B Gradients as Explanation . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

5.3 DeepDIVE: Deriving Importance VEctors . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

5.3.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

5.3.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

v

6 Conclusion and Future Work 59

6.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

6.2 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

A Appendix A 67

vi

List of Figures

2.1 On the left a 2-layer Neural network, with one hidden layer and one output layer. On the

right, a 3-layer (deep) neural network with two hidden layers and one output layer. Note

that there are connections between neurons across layers, never within a layer. Image

sourced from CS231n [1]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.1 This chart represents the traditional view of the trade-off between interpretability and per-

formance of a model. Although there is definitely truth to this statement, it is merelly an

heuristic. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.2 An example for LIME. The black-box model’s decision function f is represented by the

pink/blue background. A linear model is used to create a local explanation. The bold

red plus sign is the instance being explained. LIME samples instances (red plus signs

and blue circles), gets predictions using f and weights them by proximity to the original

instance. The dashed line is the learned explanation, it is locally (but not globally) faithful.

Image sourced from [2] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.3 Example of a depth 4 decision tree obtained from distilling a NN trained on MNIST. The

images at the inner nodes are filters learned from images. The ones at the leaves are

visualizations of the learned probability distribution over classes. The most likely classifi-

cations at each node are annotated in blue. Image sourced from Frosst et al. [3] . . . . . 21

3.4 The maps were extracted using a single back-propagation pass through a classification

ConvNet. These maps represent the top predicted class in ILSVRC-2013 test images.

Image sourced from Simonyan et al. [4] . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.5 Both perturbation based approaches like LIME and gradient based approaches fail to

model saturation. Here illustrated is a simple network that experiences saturation from its

inputs. When both i1 = 1 and i2 = 1, perturbing either of them to 0 will not produce a

change in the output. Similarly, the gradient of the output w.r.t the inputs is also zero when

i1 + i2 > 1. Image sourced from Shrikumar et al. [5] . . . . . . . . . . . . . . . . . . . . . 26

vii

4.1 Explanation of individual classifications of a simple Neural Network trained on a toy prob-

lem. On the top we have the dataset and the learned decision curve, left and right respec-

tively. On the bottom, the visualization of the local explanation for points equally spaced

throughout the dataset. On the left, the real values of the importance of each feature and

on the right a normalized version. We further explain this image in the text below. . . . . . 31

5.1 Data for this problem was extremely skewed, with the least imbalanced class (Risk of

Suicide) having close to 5:1 ratio and the most imbalanced class having around 10:1 ratio

between Not Very High and Very High risk. . . . . . . . . . . . . . . . . . . . . . . . . . . 40

5.2 The architecture of ReLUSigNET. Dropout indicates dropout layers and LL indicates fully

connected linear layers. We trained the network with an 80/20 train test split chosen

randomly. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

5.3 The figure on the left demonstrates the robot arm simulated with equation 5.1. The arm

has a central hinge and one more joint, controlled by θ1 and θ2 respectively. Both of this

joints can rotate in 360◦. This can be visualized on the figure on the right which shows

ten thousand randomly selected data points from the dataset. . . . . . . . . . . . . . . . . 48

5.4 The architecture of the ReLUTan network. Since the training and test set had exactly the

same distribution (an uncommon trait for machine learning problems) regularization didn’t

play a role as important as in the medical questionnaire problem. . . . . . . . . . . . . . . 49

5.5 Representation on output space of bias towards on of the angles w.r.t the x axis on the

left and w.r.t the y axis on the right. These graphs were created using a threshold of 1.5

for importance bias. In other words, a point is marked as biased if the importance of one

angle is at least 1.5x the importance of the other. . . . . . . . . . . . . . . . . . . . . . . . 51

5.6 Comparison of dx/dθ1 and dx/dθ2 between the theoretical values and ReLUTan. Although

the accuracy of ReLUTan is extremely high we can see that the fact that it uses ReLUs

makes the gradients much harsher than the original gradients of equation 5.1. . . . . . . . 52

5.7 Local explanations retrieved from ReLUTan. Each arrow color represents an explanation

for one of the axis of movement. The magnitude of the arrow represents the amount of

movement predicted. Arrows on the borders of the input space are generally bigger due

to the network’s inability to predict correctly outside this range. . . . . . . . . . . . . . . . 54

5.8 This small example shows how simple it is to retrieve feature importance vectors from any

neural network using deepDIVE. Here, we are retrieving angle importance for the x axis

in ReLUTan. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

viii

A.1 Comparison of dx/dtθ1 and dx/dtθ2 between several different models. Different activation

functions create very different types of surfaces. SigTan is a ReLUTan with sigmoid ac-

tivations. Here the difference between sigmoids and ReLUs is especially noticeable as

the former creates very smooth curves and the latter very harsh and noticeable decision

boundaries. It is also insightful to look at SimpleTan to see how a network with a much

lower representational capacity tries to mimic the objective function. . . . . . . . . . . . . 68

ix

List of Tables

5.1 Although both systems achieve high accuracies, the decision trees have a higher preci-

sion. It should be noted that we could not measure recall for the trees since we did not

have access to the final trained trees. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

5.2 As the number of features that we perturb increases, the amount of datapoints whose

classification is changed increases. As we reach 4 features we can clearly start seeing

diminishing returns, implying that these datapoints that weren’t changed are extremely far

from the decision boundary of the network. . . . . . . . . . . . . . . . . . . . . . . . . . . 44

5.3 From the biased data retrieved by filtering explanations where the importance ratio be-

tween angles was threshold we can see that movement on both axis is extremely faithful

to the explanations. Most of the datapoints where the movement of the arm didn’t corre-

spond to the explanation are located in the edges of the dataset. . . . . . . . . . . . . . . 55

xi

1Introduction

Contents

1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2 Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.3 Hypothesis and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1

1.1 Introduction

Machine learning is not a new field, the concept that we could teach machines how to ”learn” the solution

to a problem by themselves has been around for a very long time. However, limitations in techniques and

computational power held it back for decades. Due to this limitation, the algorithms used in the early days

of the field were considerably simpler than the ones used today. As such, it was easy to interpreted the

algorithms used for these regressions and deduce why they had made a certain decision. Models trained

with algorithms such as Linear Regressions and Decision trees are easy to interpret. These models,

however, often have a severe limitation in representation power, that is, their ability to learn complex

patterns displayed by data. When compared with other less interpretable models like the extremely

popular Neural Networks, these model’s performance is usually severely lacking.

Neural Networks are not a new concept, they were first envisioned after Hebbian Theory, a theory of

which we are commonly reminded of with the adage ”Neurons that fire together, wire together”. Shortly

after this, in 1958 Frank Rosenblatt developed the Perceptron, this would later become the basis of what

we now call an ”artifical neuron” [6]. Altough the perceptron enjoyed limited success in the research

community, the ability to join several of these together was still far off. However, in the last few years,

partly due to enormous advances in computational power and partly due to new optimization techniques

being developed, Deep Neural Networks have become the de facto standard for tasks such as Natural

Language Processing, Computer Vision and many, many others. These Networks have achieved state-

of-the-art performance in many of these fields and are experiencing a period of widespread use [7]. With

this increase in adoption, researchers have started looking at ways to expand the use these networks to

areas such as finance, medicine or industrial processes.

These networks, albeit responsible for achieving state-of-the-art performance in many fields, have

one major drawback. They’re extremely hard to interpret. They are composed of several interconnected

layers of sometimes thousands of ”neurons”, each of which applies a mathematical transformation to

received data. We can observe the data we feed into the networks, its output, the individual changes

to each neuron and understand all the singular pieces of their inner workings. Yet, due to the non-

linear transformations these networks apply throughout their functionality it is impossible to grasp its

end-to-end behaviour by looking at it’s raw behaviour.

Their fantastic generalization capabilities are related to the fact that they use distributed represen-

tations in their hidden layers, that is, their ”knowledge” is distributed unevenly by a set of neurons. [8]

These distributed representations are defined by the set of all activations in any given layer. And, be-

cause the minor effects of any particular ”neuron” rely on the effects of the activations of every other

unit in that layer, it is extremely hard to retrieve any meaningful information about their role. As such,

it is very difficult to discern what internal mechanisms are responsible for any specific phenomenon

(although some research has been done in that topic). [9]

3

All of this is further aggravated by another characteristic of Deep Neural Networks. The fact that

they can model an enormous number of weak statistical regularities. They model these weak regular-

ities between the inputs and outputs of the training data. [8] However, there is no mechanism that can

distinguish the regularities that are properties of the data from false ones created by peculiarities of the

training set. To sum up, not only is it extremely hard to correlate unit activations with any specific feature.

It is also very difficult to verify which units actually mater for the result and aren’t just activating due to

some relationship learned from random training set characteristics. This means that although we know

precisely how these networks work, if we want to unearth the reasons behind a specific decision even

experts can only make educated guesses.

This ”black box” nature of neural networks has become an enormous barrier of adoption in areas

where decisions have irreversible consequences. In areas such as NLP, the consequences for failure

are fairly minor, as a consequence, the focus on interpretability as remained secondary to performance.

On the other hand, in areas like medicine, where a wrong choice might prove deadly, it is of utmost

importance to understand why the network has made a decision.

To illustrate the importance of interpretability, Caruana et al. [10] detail a model trained to predict

probability of death from pneumonia that learned to assign less risk to pneumonia patients if they had

asthma. These patients, when arriving at the hospital, received more aggressive treatment due to their

condition. As a result, the model concluded that asthma was predictive of lower risk of death. If this

model were to be deployed to aid in triage, these patients would receive less aggressive treatment,

invalidating the model. Possibly resulting in a higher amount of deaths amongst asthma patients. How

can Doctors act upon predictions without knowing the reasoning behind the result? In scenarios such

as this, acting in blind faith could prove to be catastrophic.

In light of these facts, model interpretability has become of paramount importance for widespread

adoption in high-risk areas. However, even what exactly constitutes interpretability is often discussed,

as different authors strive for solutions to different, albeit often similar, problems. Fundamentally, we

consider interpretability as a source of trust in the model. For the purpose of this work, we will consider

trust as human confidence on the reasons behind the network’s prediction. This confidence relies on

an observable causality between inputs to the network and its outputs. As such, we can say that an

interpretation of a decision would be an explanation on how the input values were at the genesis of that

decision.

1.2 Problem

Although these networks achieve outstanding performance when evaluated through traditional measure-

ments like Accuracy, Precision, etc. [7] these measurements can only tell us so much about what the

4

network is doing and why it is doing it.

As we have previously hinted towards with the example by Caruana et al., considering the reasons

behind a decision is of the utmost importance [10]. Interpretable predictions lead to appropriate trust

and can also provide intuitions on how to improve the model. Understanding why a model made a

certain prediction is essential in many fields. If they understand the reasoning behind a prediction, they

can use their knowledge to accept or reject it. Besides providing much safer and controlled usage,

studies have shown that providing explanations can increase the acceptance of automated systems

recommendations e.g. entertainment recommendations [11,12].

Furthermore, there are a considerable number of ways a model could be trained wrongly. A good

example, studied by Kaufman et al. is data leakage [13]. Data leakage is the unintentional creation of

additional information into the training (and validation) data, potentially increasing accuracy. The author

cites an example where the patient ID was profoundly correlated with the target class in the training

and validation data, thus leading the model to have an artificially high performance in these controlled

environments. Another example, mentioned by Ribeiro et al. [2] is artifacts of data collection influencing

the model. The author describes an experiment where a model was designed to classify between a

Husky and a Wolf. Although the model had an extremely high accuracy, when rudimentary interpretability

methods were applied to it, it was found that the model was in fact looking at the background of the image

and using that to make a decision. A lighter, whiter background would be classified as a wolf, as most

photos of wolves had been taken in the wild in snowy backgrounds. While darker backgrounds would be

classified as huskies, since most of the pictures used in training had been taken inside a house or in a

modern city. Both of these issues would be extremely challenging to detect by looking at the predictions

and raw data.

Determining the causality between inputs and outputs would provide a much clearer picture into the

inner workings of these networks. Not only could this be used to implement them safely into new fields

but also as a tool to help existing machine learning practitioners analyse and assess new models before

deploying them.

In order to interpret these networks, various approaches have appeared in the research community.

These approaches have varied from attempts to create generic explanation methods for any machine

learning model to distilling more complex models into simpler easier-to-interpret ones. One that has

gained increased attention in the last few years is using the backpropagation algorithm to calculate a

gradient of the output w.r.t the input. Yet, this and its related approaches have only been researched in

the context of computer vision and images as input. Backpropagation is the algorithm responsible for

creating these inner distributed representations. As such, we wonder if its utility in interpreting Neural

Networks couldn’t be exploited in other domains.

5

Is it possible to use network gradients to interpret the output of any deep Neural Network?

1.3 Hypothesis and Contributions

The Hypothesis we put forth with this thesis is the following:

Network gradients after any particular decision can be used to establish a causal relationship betweeninput and output.

In this work, we propose a variation of backpropagation based techniques to interpret any deep

neural network, not only those using images as input. By calculating the gradients at decision time

we can create a causal relationship between input features and the network’s output. With it, we can

accurately attribute decisions to a specific cause or causes. This provides a generic method for Neural

Network Intrepretability. Contrary to previous works, which were solely based on image classification

networks, our method can be applied to a multitude of domains. Although capable of interpreting any

network, our main focus was to test this method in areas where it can have real world implications. As

such, focus was been given to the medical and industrial domains.

In the context of this work we implemented several different Neural Network architectures to solve

two distinct problems, one for each of our focused domains. The first challenge was to develop a clas-

sifier network responsible for predicting the severity of a patient in triage for suspected mental illness.

We explore how this network compares to a traditional decision tree algorithm and how we can use our

interpretability technique to retrieve insights from the network’s decisions. Secondly, we developed a

network responsible for moving a two-dimensional robot arm with two degrees of freedom. This arm is

equivalent to a human arm with shoulder and elbow but stuck to one plane of motion. Since this move-

ment can be described theoretically we use this example to explore the internal network representations

of the concepts. By comparing explanations of the network’s results at different arm angles with the

theoretical results we can explore how the network actually learns and accurately measure the reliability

of our explanation method.

The rest of this work details previous work, our approach, our methodology and results. But first, as

an attempt to introduce unfamiliar readers to the core concepts behind this work we make an endeavor

to explain Neural Networks and backpropagation as plainly as possible.

6

2Background

Contents

2.1 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

7

This section assumes a modest understanding of the basics of Machine Learning and Linear Alge-

bra. Here, we will explore and explain the basics of Neural Networks. We hope to supply the reader

with sufficient knowledge to understand subsequent chapters. We provide an overview of what these

networks are, how their fundamental mechanisms work and how they are used.

2.1 Neural Networks

Neural networks come in many shapes and forms. The designation neural networks comes from the

fact that these models are loosely inspired by the biological neural networks that form the human brain.

Putting it very simply, the basic computational unit of the brain is the neuron. There are approximately 86

billion neurons in our nervous system. They connect to other neurons through synapses, a single neuron

can have thousands of synapses. It’s through synapses that a neuron receives input signals. These are

then combined and used to produce output signals along its single axon, the ”branch” that extends it

to the proximity of other neurons. This axon eventually branches out and its synapses connect to other

neurons. When a neuron receives signals from others, it sums all of them. If the sum of all signals is

above a certain threshold, the neuron will send information along its axon to the next.

The computational model of this, the artificial neural networks follows much of the same concepts.

These networks are also comprised of numerous connected units, commonly referred to as ”artifical

neurons”. Contrarily to the biological model, the signals x that travel to another neuron interact through

multiplicationwxwith the other neuron, depending on how strong the synapse (or connection)w between

them is. The main idea behind the computational representation of the brain is that the strength of this

connection can be controlled and ”learned”. It should also be mentioned that this connection can have

positive or negative w, allowing a much broader influence on the next neuron. Similarly to the biological

model’s sum of signals before forwarding the information, artificial neurons have an activation function

f . Instead of acting as a threshold that controls the timings of the neuron, in this case it acts as a

representation of the frequency at which the neuron sends information to the next. To sum up, the basic

structure of a neuron can be represented as follows:

f(∑i

wixi + b)

In other words, a neuron performs a dot product with the inputs and their weights, adds a bias b,

applies a non-linearity f (activation function) and sends the result to the next neurons. Typically these

neurons are organized in layers with each layer performing different kinds of transformations on their

inputs. Signals travel from the first (input) layer to the last (output) layer. The layers of neurons between

the first and last are called hidden layers. Simply put, a neural network is considered a deep neural

network if it has two or more hidden layers.

The simplest of these are called feedforward neural networks, or multilayer perceptrons. The objec-

9

Figure 2.1: On the left a 2-layer Neural network, with one hidden layer and one output layer. On the right, a 3-layer(deep) neural network with two hidden layers and one output layer. Note that there are connectionsbetween neurons across layers, never within a layer. Image sourced from CS231n [1].

tive of a feedforward neural network is to approximate a certain function f∗ that belongs to a class of

functions F . Each of the layers in a neural network can be seen as a function. This means that (most)

neural networks are merely a chain of functions f(x) = f2(f1(x)). In this case, f1 would be the first

layer of the network and f2 the second. We can then think of information being propagated throughout

the network as a sequence of functions being applied on input data.

For a neural network to ”learn”, we need a way to codify what is a correct or incorrect output and

how ”far” an incorrect output is from the correct one. For that, we use a cost function C. A cost function

(or loss function) is a function so that for the optimal f∗, C(f∗) ≤ C(f)∀f�F . It serves as a measure

of how far away a particular solution is from an optimal solution. Learning algorithms, such as Gradient

Descent, will search through the solution space trying to find the function with the smallest possible cost.

Gradient Descent is an iterative algorithm that functions by correcting the values of the weights w and

biases b after each network mistake. The way it does this is by calculating the gradient of the function

at the current point and moving towards the negative gradient, in order to reach a minimum of the cost

function. This way, the network is constantly changing and moving closer to f∗.

In order to apply this method to a deep neural network the backpropagation algorithm is essential.

Backpropagation is the method used to calculate the gradient of the loss function with respect to the

weights of the connection between the units of the network. After calculating these gradients, they are

used to update the weights one layer at a time, from the output layer all the way back to the input layer.

And with this, Gradient Descent is put into action.

Neural networks can be applied to many types of problems and have many different architectures

in order to accomplish different goals. Networks used for Computer Vision tasks have a very different

architecture to the ones used in Natural Language Processing for example. Each of them exploits

particular characteristics of certain types of layers or connections between these layers to tweak the

networks to perform better in their intended purpose. Even feedforward networks used in ”simple” tasks

10

can vary wildly in architecture. In the remainder of our work, in an attempt to restrict our exploration

space, we will focus on feedforward networks of varying shapes and sizes.

11

3Related Work

Contents

3.1 Interpretability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.2 Interpretable Models as Explanations . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.3 Inner Model Visualizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

13

To understand the approach taken in our proposal, it is essential to review what has been done

before. Throughout the next section, we will revise a collection of previous work in interpretability, carried

out by many from across the research community. Analyzing this work is essential to give us guidance

on what does or does not work and what has been experimented before. This review of previous work

is also responsible for the, often overlooked, crucial task of homogenizing how certain terms are used

throughout the scientific community. One of these terms happens to be Interpretability itself.

Interpretability recently jumped into the limelight, but different papers seem to have different interpre-

tations of it. What makes an algorithm interpretable? What is the true purpose of interpretablity? We

will answer these questions and many others. In order to search for the optimal way to interpret neural

networks, we must understand what this means and what it is we are looking for.

After examining interpretability, we will look into several different approaches to demystifying neural

networks. First we will examine methods which use interpretable models to deliver explanations. In

these, neural networks are treated as black boxes with other, more interpretable, models being used

to withdraw information from the networks without any internal information. By training models we can

easily understand around the neural networks, researchers hoped to mirror some of their behaviour.

Subsequently they use the newly trained models to draw conclusions about the original networks. After

that, we will analyze another set of methods which has a completely different approach. These, make

an attempt to understand neural networks utilizing their internal mechanisms. These methods try to

slightly modify either forward propagation or backpropagation to provide useful information to explain

the network or one of its decisions.

3.1 Interpretability

Due to the enormous growth in it usage, Machine Learning has naturally started seeping into critical

areas. Areas like medicine, machinery control, financial markets, terrorism detection, among others.

When tackling these kinds of issues, the fact that humans are incapable of understanding advanced

models in machine learning takes a whole other dimension [14]. Machine Learning model Interpretability

came into the spotlight as a way to tackle these issues.

It should be noted however, that historically, the machine learning community, despite increasing

claims of using interpretability as motivation, remains notably isolated from other industries and policy

issues in general [15]. A prime example of this is the extensively discussed provision of the European

Union General Data Privacy Regulation (GDPR) that is often labeled as a right to explanation, something

that is highly connected to the understanding of model’s decisions. While it is true that the meaning and

enforceability of the law continue to be under debate, the machine learning community gazed upon it

extremely briefly in via just one paper [16]. This single paper, was the only one presenting information

15

about this law for a machine learning audience. This problem is noticeable on other areas of the machine

learning community related to interpretability like algorithmic fairness. In this area, a single paper about

gender-crossing is considered the uncontested authority [15]. Not because of its content but due to the

larger problem of a general disregard for policy in machine learning circles.

However, recent concerns about fairness, racial or gender bias, and other related issues have been

drawing attention of experts. Interpretability in particular as been increasingly looked at as a solution

to many of these problems. The reasoning is that, if we can accurately pinpoint how the model is

making a decision, we can take actionable steps to correct its behaviour. Despite this, not many people

have tried defining what Interpretability is. Counter-intuitively papers often make allegations about the

interpretability of numerous models.

Although slow, in the last year some progress has been made in this area, with some attempts to

define interpretability and the characteristics that make a model interpretable. A study by Lipton [17]

shows that both the motivations for interpretability and the technical definition of interpretable models

are disparate, sometimes even contrary. That is, depending on the work, the word interpretability refers

to more than one concept.

Figure 3.1: This chart represents the traditional view of the trade-off between interpretability and performance of amodel. Although there is definitely truth to this statement, it is merelly an heuristic.

However, it should be noted that, as we mentioned previously, not all machine learning models are

hard to interpret. Most machine learning practitioners would categorize interpretable models as shown

in Figure 3.1. Nonetheless, this is merely an heuristic but one that has been propagated throughout

16

research materials. In the research community, researchers usually adopt a definition for interpretability

that makes intuitive sense to them. This has the unwanted effect of the definition for interpretability

changing considerably from paper to paper. The study by Lipton [17] compiled a list of the different

descriptions of interpretability found throughout the spectrum of machine learning literature. Some re-

searchers attempt to describe interpretability as ”simulatability”, whether a person can contemplate the

entire model at once. Or in other words, whether one can examine the entirety of the model and un-

derstand what it is doing and how. This notion is also adopted by Ribeiro et al. which suggests that an

interpretable model is one that ”can be readily presented to the user with visual or textual artifacts” [2].

Other authors suggest different traits to judge interpretability such as decomposability or algorithmic

transparency. Decomposability suggests that a model is interpretable if each part of the model (each

input, parameter and calculation) has an intuitive explanation. It’s worth noting that although it is true that

we can take each singular ”part” of a neural network apart and explain what it is doing, it is often close

to impossible to decipher how the several different parts fit into the whole that is the network’s decision.

In other words, we know exactly what it is doing and how it is doing it but in the large majority of cases

don’t have the faintest idea of why it is doing it. On the other hand, algorithmic transparency argues that

we can only truly interpret and grasp a model if we understand the shape of its error surface. While this

is easily done in the case of linear models, we don’t really understand how the heuristic optimization

procedures for neural networks work. Besides, error surfaces for neural networks are more often than

not, dependant on thousands of weights. This creates a surface of very high dimensionality, impossible

for humans to imagine or visualize.

By none of these metrics are neural networks considered interpretable. But it can also be argued that

the different metrics have different goals or properties that they strive for. Not all models that are typically

considered interpretable would fit even two of those metrics. We believe that this is because there can

be no discussion about interpretability without considering, not only what makes a model interpretable,

but what it is we are striving for with interpretability. Just as with the different traits used to classify

models, researchers have sought after very different goals with their research in interpretability.

Luckily, Lipton [17], when compiling the different metrics used to classify an interpretable model, also

studied and collected the different properties interpretability research strives for. These properties are:

Trust, Casuality, Transferability, Informativeness and Fair Decision-Making. In our work we will mainly

focus on interpretability as a source of trust and information.

Trust is a common motivation for interpretability [2,14,18]. But what is trust? Is it trust in the model’s

performance? If so, a model with an adequate accuracy would be a highly trustworthy model and we’d

have no need for interpretability. Trust can be defined subjectively, e.g someone can be more com-

fortable working with a well-known model even if that knowledge gives them no advantage whatsoever

in interpreting the model’s decisions. However, it can also be defined in more practical ways. Trust in

17

a prediction can be assessed by whether a human would trust the prediction enough to make a de-

cision based on it [2]. Establishing trust in these individual predictions is of paramount importance.

Humans create a trust relationship with the tools they use to inform their decision-making. Studies have

shown that they are much more likely to act on a tool’s suggestions if they’re provided an explanation for

them [11]. This becomes especially important when the tool is not a traditional algorithm but something

considered a black-box, such as a machine learning model. Furthermore, in fields such as medical diag-

nosis, predictions cannot be acted upon without a certain degree of confidence [10]. The consequences

might be catastrophic.

On the other hand, interpretability can also help withdraw additional information from the model.

While the scientific goal of machine learning is to reduce the error of some function, in the real world its

goal is to produce helpful information. Traditionally, we only consider the information conveyed through

the model’s outputs. Nonetheless, through interpretability techniques it is be feasible to convey extra

information to a human decision maker. This has been explored as a paradigm where supervised

models are employed simply to provide information to humans [14, 19] instead of training the models

to act upon their predictions. However, we would like to explore the same approach as Ribeiro et al.,

deciphering the model’s decision making rationale and using that as an extra source of information [2].

An interpretation might prove valuable even if it does not provide any intuition or explanation into the

model’s inner mechanisms. For example, a model responsible for medical diagnosis can be better suited

for assisting a human make a decision if, instead of solely providing a diagnosis, it can explain which of

the symptoms contributed to that forecast. Then, the human can combine that knowledge with his own

domain specific expertise, likely resulting in a better decision. As mentioned before, it has been shown

that showing these types of information to humans makes them significantly more likely to accept, and

act upon, the original model prediction [11,12,20].

3.2 Interpretable Models as Explanations

When considering ways to provide explanations for the behaviour of deep neural networks, one of the

most common approaches is a delegation of responsibility. Instead of looking at the model’s internal

mechanisms for an explanation, the task of explaining the model’s behaviour is outsourced to an exter-

nal, simpler and interpretable model. Intuitively speaking, when looking at Figure 3.1, these approaches

are based on the use of the models on the top left (easily interpretable) to explain the ones on the bottom

right (black box nature). In this section we will explore some of the most recent proposals in this area.

Ribeiro et al. propose an approach capable of explaining any model, including neural networks, by

treating the model as a black-box. The authors argue that one of the most neglected but important

aspects of machine learning is the role of humans. They believe, in accordance with our work, that it

18

is necessary for a model to be trusted by humans for it to be used in the real world. Furthermore, they

defend that the best way for a human to trust a model is for it to explain the rationale behind a decision.

They suggest that two things are essential for an explanation of a machine learning model: It needs

to be interpretable and it needs to be locally faithful [2]. They claim that it is often impossible for an

explanation to be entirely faithful unless it is the complete description of the model itself. They argue that

explanations don’t need to represent the entirety of the model’s behaviour. They only need to be locally

faithful i.e accurately describe how the model behaves in the proximity of the instance being predicted,

or in other words, accurately describe the behaviour for that particular decision.

Taking this into account, the authors developed a framework by the name Locally Interpretable Model-

Agnostic Explanations (LIME). Its goal is to use an interpretable model to provide an interpretable repre-

sentation locally faithful to a prediction of another classifier. Their work tries to define a balance between

the complexity and faithfulness of an explanation. The goal was to define a general framework that could

continue being used in the future, even if the method of obtaining the explanations changes. However,

in the context of this first work, they focused on using sparse linear models to provide explanations.

While using perturbations of the input as a way to explore the model to be explained. An example is

provided in figure 3.2 It should be noted that although they only describe sparse linear models as ex-

planations in this work, LIME supports the exploration of an array of function families for explanations,

for example, decision trees. Hence, their base formula can be used with different explanation families,

fidelity functions and complexity measures. On the other hand, they argue that all interpretable models

and choices of representation will have some drawbacks when explaining certain behaviours. Although

the underlying model is treated as a black-box, certain representations won’t be adequate for certain

models. For example, it would be extremely hard to explain a model that predicts filters in images with

the importance of each pixel to the prediction.

Although an explanation of a single prediction contributes to understanding the reliability of the clas-

sifier, the authors argue that it isn’t enough to assess trust in the model as a whole. As a result of that,

they propose giving a global understanding of the model by explaining a batch of individual predictions.

However, they explain that this batch must be carefully picked, since users might not have the time to

examine many explanations and the explanations should be not only diverse but representative of the

model’s behaviour. For that purpose they devised an algorithm called submodular pick. They suggest

that the explanation is for all intents a submodular set function. That is, a set function that has a property

that states that: the difference in the value of the function that a single element makes when added to

the set, decreases as the size of the set increases. In other words, the more elements the set has, the

less of a difference each new element makes. They show that, due to submodularity, a greedy algorithm

that adds the instances with highest marginal coverage of the target model offers a guarantee of a good

approximation (1−1/e of the optimum) of the function [21]. With this, they created SP-LIME (Submodular

19

Figure 3.2: An example for LIME. The black-box model’s decision function f is represented by the pink/blue back-ground. A linear model is used to create a local explanation. The bold red plus sign is the instancebeing explained. LIME samples instances (red plus signs and blue circles), gets predictions using fand weights them by proximity to the original instance. The dashed line is the learned explanation, it islocally (but not globally) faithful. Image sourced from [2]

pick LIME) a framework that uses LIME to explain entire models, one explanation at a time.

Another similar approach was developed by Frosst et al. Instead of trying to understand how a

deep neural network makes its decisions, they use the network to train a decision tree that mimics the

function discovered by the network but works in a completely different way [3]. They experimented using

a technique labeled distillation [22] to extract the knowledge from the tree to a type of decision tree that

makes soft decisions [23]. They use soft binary decision trees trained by choosing the depth of the tree

and then using mini-batch gradient descent to update all the parameters of the tree simultaneously. The

model is equivalent to a hierarchical mixture of experts. An example of such a tree trained on the MNIST

can be seen on figure 3.3.

With this technique, the authors can easily provide an explanation for any prediction. They only have

to follow the path with the greatest probability from the root to the leaf. An explanation is the list of both

all the filters along the path and the binary activation decisions on every node. Although very easy to

create, these explanations might prove hard to interpret depending on the format of the data. As we

can see in the example, it isn’t easy to interpret the visual filters used by the decision tree. It is easy to

think of more complex examples where these explanations would quickly become too bothersome and

20

complicated to interpret.

Figure 3.3: Example of a depth 4 decision tree obtained from distilling a NN trained on MNIST. The images at theinner nodes are filters learned from images. The ones at the leaves are visualizations of the learnedprobability distribution over classes. The most likely classifications at each node are annotated in blue.Image sourced from Frosst et al. [3]

The authors also provide an interesting example of the utility of an explainable model. While exam-

ining the learned filters of a soft decision tree trained from a network that had learned to play Connect4,

they discovered an insight about the game. Inspecting the first filter of the tree revealed that any game of

connect4 can be divided into two separate groups: Games where the players started by placing pieces

on the edges of the board and ones where the first pieces were placed on the center of the board. By

distilling the original network into the decision tree they discovered that these two groups played out in

ways so different to one another that the decision tree used that condition as its root node [3].

However, this method has a problem. Models such as decision trees normally do not generalize as

well as deep neural networks. Contrary to the hidden units in a neural network, a node at a lower level

of a decision tree is only used in tiny part of the training data. This usually causes the lower levels of the

decision tree to overfit. Because of this, the number of parameters at which the soft decision trees start

to overfit is less than compared to the number at which a deep neural network starts to overfit. This is

noticeable when comparing performance in MNIST.

With a regular soft decision tree of depth 8 trained in the real dataset they were able to achieve a

test accuracy of at most 94.45%. They compare this with the performance of a neural network with

two convolutional hidden layers and a penultimate fully connected layer, this network managed to obtain

a much better test accuracy of 99.21%. Afterward, they experimented training soft decision tree with

labels that were a conglomerate of true labels and predictions of the neural network. The soft decision

21

tree trained in this way attained a test accuracy of 96.76%, which is roughly in the middle between the

neural network and the soft decision tree trained directly on the data. On another dataset, the Connect4

dataset, a soft decision tree distilled from a neural network achieved 80.60% test accuracy compared to

78.63% from one trained with real data [24].

All in all, the authors provide a way to reap some of the benefits of deep neural networks while using

a model that can be easily interpreted. In cases likes the ones described in the previous chapter, where

being able to explain a prediction is essential, this technique can be used to improve the performance of

a traditional interpretable model.

We believe that none of the aforementioned approaches is a good solution to our problem. LIME,

while an extremely interesting approach that can be applied to any model, has some glaring problems.

First, it tries to explain changes in a model’s behaviour by first order changes in a discrete interpretation

of the input space. In other words, by measuring the changes with permutations, there’s no guarantee

that the behaviour being modeled is actually from the space surrounding the original input in the target

model’s space. Secondly, for each decision of the model we intend to explain, we must train a new

explainable model. This adds a significant overhead to providing explanations to more complex models

with large runtimes, as the model needs to be run several times in order to generate an explanation.

Distilling a neural network to a soft decision tree, on the other hand, is also a far from ideal solution. It

does not explain neural networks directly, only by proxy through another model. There is no guarantee

that this other model will learn the same distribution as the original network. Furthermore, since the

decision trees can only attain a performance significantly lower than the networks, we must sacrifice

performance of the model for an explanation.

Overall, they both pose difficult questions for adoption. Both of them, although in different ways,

are problematic in terms of both reliability and feasibility of explanations. As interesting as the idea of

utilizing an interpretable model to explain neural networks might sound, it is extremely hard to surpass

the inherent limitations of these models. We believe that, since they do not use any of the unique

mechanisms possessed by neural networks, these explanations are held back by the expressiveness of

the models learning the explanations.

3.3 Inner Model Visualizations

Another type of approach to deciphering neural networks, in contrast to the ones in the previous section,

are based on using the inner mechanisms of the model to get information on how they are making the

prediction. It should be noted however, that before the recent rise in popularity of Deep Learning, there

had already been some attempts to interpret the networks’ inner workings [25, 26]. However, in recent

years, fueled by neural network’s new-found popularity, several new approaches have been proposed.

22

In 2009, Erhan et al. discovered that, counter-intuitively, it is possible (and simple) to interpret deep

architectures at the unit level [27]. Before his work, the only qualitative method to compare and explore

features representations by deep networks was by examining the ”filters” learned by the network [28,29].

That is, the linear weights in the input-to-first layer weight matrix represented in input space. Or in other

words, the pixels and shapes that have the highest weighted connections in first layer. However, this

a poor representation of the overall network because we’re only looking at its first layer. Erhan et al.’s

work was the first to develop methodologies to visualize representations of arbitrary layers of a deep

architecture, in this case, Deep Belief Networks [30]. They discovered a curious phenomenon: the

responses of the internal units to images, as a function in image space, are unimodal. Knowing this,

by finding the dominant mode, we can create a good characterization of how the unit is processing

the information. With their methodologies, they were able to verify the intuition that the first layers of a

network are trying to find generic features while the later layers build on these to create more complicated

concepts. This work opened up the pathway to exploring and visualizing the internal representations of

networks. Be that as it may, although these representations are fascinating in their own right and provide

a good insight into the internal reasoning of the networks, they are of little help in interpreting individual

decisions and bolstering end-user trust in these decisions. With recent works such as Olah et al. [31]

pushing this medium forward, trying to combine both visualization and attribution to create rich interface

for humans, the gap between feature visualization and feature attribution is closing. It is however, still

too wide to be considered an effective solution to our problem.

Meanwhile on the other side of this gap, more or less simultaneously to Erhan et al’s work, Baehrens

et al proposed a novel approach for interpreting Bayesian classifiers [32]. Their goal was similar to our

own, to peek into the black-box of modern classifiers and provide an answer to why a certain decision

was made by the classifier. Their approach uses the fact that the bayesian classifiers are optimized using

numerical algorithms such as gradient descent. With that in mind, they use class probability gradients

in order to create local explanations of the classifier. These class probability gradients are in essence

the derivative of the probability of the classifier being wrong. This creates a vector that ”points” away

from the chosen class, in essence, providing a weighted vector of the features which influenced the

classifier’s decision. This can easily convey the relevant features for a single prediction, both the ones

that contributed positively and negatively to the final class. At the time, this approach was completely

different from other available techniques. It provided much need improvements in reliability and quality

of explanations. It is especially significant because it allows measuring changes in any direction but

most of all, because by using local gradients of the model as the explanation vectors, it always reflects

the model and its assumptions.

This type of approach made its way into neural network research when Simonyan et al. [4] based on

Erhan et al.’s and Baehren et al.’s works, proposed using the gradient of the output with regards to pixels

23

of an input image to create a saliency map (Figure 3.4) as a way to interpret deep convolutional neural

networks [33]. It mixed the ideas from previous works [27, 32], gradients as explanations and creating

explanations using input space. The authors propose that the gradient of the output class with regards

to the pixels of the input image can create a weighted vector of contributions from the input pixels. Or

in other other words, how much each pixel of the input image, contributed to the final classification

output. At the time when this was proposed, the most common approach to interpreting convolutional

networks was a specific network architecture called DeconvNet [34]. This approach required training an

additional network with a specific architecture on top of an existing convolutional neural network. It is

unpractical to be required to train specific networks on top of existing networks in order to interpret the

latter. Simonyan et al.’s approach was especially significant because they proved that the reconstruction

of a layer by DeconvNet was in practice equivalent to computing the gradients w.r.t the input of that layer

(with a small difference on ReLU activation functions). With this, they managed to unify and generalize

a wide spectrum of interpretability research in one work.

Figure 3.4: The maps were extracted using a single back-propagation pass through a classification ConvNet. Thesemaps represent the top predicted class in ILSVRC-2013 test images. Image sourced from Simonyan etal. [4]

Since the introduction of this method, several other authors have proposed different gradient based

interpretability methodologies. Springenberg et al. unsatisfied with the different ways default gradi-

ents and DeconvNets treated ReLUs, suggested further homogenizing both approaches and proposed

guided backpropagation [35]. This approach is mostly equivalent to computing gradients except that any

gradients that become negative during the backward pass are discarded. The issue with the original

approach of doing a backpropagation of the importance using gradients is that the gradient coming into

a ReLU during the backward pass is zero’d if the input to the ReLU had been negative on the forward

pass. By contrast, in DeconvNet the gradient is only zero’d if it is negative with no regards to the sign of

24

the input to the ReLU during the forward pass. Springbenberg et al. combined both and zero’s out the

signal if either the input in the forward pass or the gradient in the backward pass are negative. Although

this method achieved valuable results in image segmentation, due to the exclusion of negative gradients,

it fails to highlighting inputs that contribute negatively to the output if they pass through a ReLU unit at

any point in the network.

Other approaches to solving this problem include Layerwise Relevance Propagation (LRP) [36], an

approach for calculating importance scores in a layer-by-layer approximation of backpropagation. This

method was later demonstrated to be equivalent within a scaling factor to an elementwise product be-

tween the saliency maps of Simonyan et al. and the input [37, 38]. These approaches have the advan-

tage of leveraging the sign and strength of the input. However, they are very difficult to apply outside an

image classification domain since other inputs such as categorical aren’t as standardized or numerically

well behaved. Sundarman et al. suggested that instead of computing the gradients only with the cur-

rent value of the input, we could integrate the gradients as inputs are scaled up from a certain starting

value [18]. Although this addresses a fundamental limitation of gradients which is neuron saturation,

it adds a significant computational overhead to obtain high-quality integrals. Furthermore, other works

have shown that this approach can still provide highly misleading results [5].

Finally, Shrikumar et al. recently proposed DeepLIFT (Deep Learning Important FeaTures), an al-

gorithm used for assigning an importance score (weights) to the inputs of a certain network output [5].

Although it is still a gradient based method, contrary to others it doesn’t rely solely on one gradient

calculation (or integrating up to one). DeepLIFT calculates feature contribution by comparing the output

in question to a ”reference” output obtained from a ”reference” input. This ”reference” input illustrates

some default or what the authors describe as ”neutral” input. This ”reference” is chosen according to the

problem at hand and can vary wildly from model to model. Then, just like other approaches, DeepLIFT

backpropagates the contributions of every hidden unit (to the output) in the network to each feature of

the input, effectively creating a representation on how each feature contributed to that prediction. It then

obtains the final explanation by identifying the differences of this backpropagated values to the ones on

the ”reference” input.

The explanations that DeepLift produces can be thought of as a difference-from-reference, the bigger

the difference the more that feature contributed to that output. Using this difference-from-reference,

DeepLift side steps one of the fundamental limitation of gradient based methods, model saturation,

which is illustrated in figure 3.5. Even when the derivative of a neuron with respect to a neuron in a

previous layer is zero it might still be signaling relevant information. While the gradient would be zero,

the difference-from-reference might not be.

To experiment on how DeepLIFT fares against other explanation methods that used backpropagation

the authors trained a convolutional neural network on MNIST to perform digit classification [39]. To

25

Figure 3.5: Both perturbation based approaches like LIME and gradient based approaches fail to model saturation.Here illustrated is a simple network that experiences saturation from its inputs. When both i1 = 1 andi2 = 1, perturbing either of them to 0 will not produce a change in the output. Similarly, the gradient ofthe output w.r.t the inputs is also zero when i1 + i2 > 1. Image sourced from Shrikumar et al. [5]

compare the different importance scores obtained by different methods, they try to identify which pixels

to erase to convert the image to another target class. They pick up to 157 (20%) of the image’s pixels

for each explanatory method. They are chosen in order of their importance scores as determined by

the different methods. Finally they evaluate the change in the log-odds score between the original and

target classes for the original image and one where the pixels have been erased. In the context of these

images, they defined the reference input as all zeros and it outperformed all other backpropagation-

based methods.

Overall, gradient based methods have been showing a lot of promise in the field of deep neural

network interpretability. By utilizing the internal mechanisms of the networks to create explanations, we

bypass one of the most serious deterrents of the approaches we discussed in the previous chapter,

the limitations of other models. Using backpropagation we can create explanations that are completely

faithful to the original model. On the other hand, almost all the existing research in these methodologies

has been focused on networks that work in the image domain. This means that although very promising,

the real world application of these approaches had been lagging behind research.

26

4Backpropagating Feature Importance

27

When we began working on the issue of unwrapping the black-box that are modern deep neural

networks, we had a clear goal: to be able to provide instance-based explanations. In order for these

explanations to be useful for a human with domain knowledge they should reveal the causality relation-

ship between the input received by the network and its output. In other words, provide a clear picture of

the reasons behind a decision. The original approach we proposed to solve this problem was to try to

create a linear representation of the local behaviour of the model. Our proposal would start by creating a

new dataset by permuting the original input data point and using the network to label our newly acquired

data points. Then, we would train a linear regression with this new dataset. For this, we proposed using

a Least Absolute Shrinkage and Selection Operator, or in short, LASSO regression. With this new lin-

earization around the input point, we hoped to provide an accurate local representation of the networks

behaviour. We note, again, that local fidelity does not imply global fidelity and features that are globally

important may not be important in a local context, and vice versa.

However, despite our initial conviction in this methodology we quickly realized that it had some glar-

ing flaws. When analyzing a network, we don’t know it’s function surface, because of that, we can’t

be sure how the output varies with changes to the input. So by using permutations to create the new

dataset from where local behaviour will be learned, we risk creating a dataset of points that are distant

and not representative of the original prediction. Also, training this linear regression would require sep-

arate executions of the network on each perturbation. This meant that the complexity of calculating an

explanation would be dependant on the time it took to run the network and the number of perturbations

made to the data. Taking all of this into account, we decided to change our approach to the linearization

of the network and the approximation of its local behaviour.

A feedforward artificial Neural Network is organized in layers with each layer performing different

transformations on its inputs. The goal of these networks is to approximate a certain function f∗. Each

of its layers can be seen as a function. This in turn means that a network with n layers can be seen as

a chain of functions:

f(x) = fn((...)f2(f1(x)))

In order to train these networks, a backpropagation algorithm is used. Backpropagation is used to

calculate the gradient of the loss function w.r.t weights of the connection between the layers of the

network. This is employed by the gradient descent optimization algorithm to adjust the weight of neurons

with the previously calculated gradient of the loss function. Assuming x as a vector of inputs and θ as

the parameters of the network, the gradient at time t can be calculated as:

∇ = ∂Loss(x, θt)

∂θ

However, backpropagation is merely a special case of a more general technique called automatic

29

differentiation. In essence, automatic differentiation, iteratively computes gradients for each layer using

the chain rule. Although this is commonly used as a way to train a network such that it can learn appro-

priate internal representations that allow it to approximate f∗, as we saw in the previous chapter other

uses, such as network interpretation, have been proposed for gradients. Here, we propose generalizing

the method by Simonyan et al. [4] to be used in a wider spectrum of networks. His original proposal

and most other subsequent works [18, 35, 38, 40], apply this methodology solely to image classification

networks with convolutional layers. With this work we intend to explore a generalization of this approach

that can be applied to any network architecture. We hope to provide an easy to use way for deep

learning practitioners to interpret their networks and extract relevant information from their predictions.

Specifically, consider a system that classifies some input into a class from a set C. Given an input x,

many networks will compute a class activation function Sc for each class c ∈ C. The final result will be

determined by which class has the highest score.

class(x) = argmaxc∈CSc(x)

We define an explanation vector of a data point x0 as the the gradient of the output w.r.t the input, x at

x = x0. With that, we estimate a weighted vector of feature importances. If Sc is piecewise differentiable

w.r.t x, for all classes c and over the entire input space, one can construct a feature importance vector

Ic(x) by computing:

Ic(x0) =∂Sc(x)

∂x

∣∣∣∣x=x0

(4.1)

This explanation vector is a local n-dimensional representation of the behaviour surrounding the

decision point on the surface of the function learned by the network. By having a representation of the

behaviour around that point of the manifold we can measure the sensitivity to change of the function

value with respect to changes in its arguments.

In a more intuitive manner, Ic represents how much difference a change in each feature of x would

cause in the classification score of class c. Thus, entries in Ic with large absolute values highlight

features that will influence the label of x0. On the other hand, the sign of the individual entries expresses

whether the prediction would increase or decrease when the corresponding feature is increased locally.

These explanations can also be motivated with a simple example. Consider a linear model for the

class c:

Sc(x) = wTc x+ bc

assuming x is represented in a vectorised form, and wc and bc are respectively the weight vector and

the bias of the model. In this case, we can easily see that the magnitude and sign of the elements of w

will define the importance of the corresponding features of x for the class c. In the case of deep Neural

30

Networks, the class score Sc(x) is a highly non-linear function of x. Because of this, it is extremely hard

to apply this reasoning. However, given an input x0, we can approximate Sc(x) with a linear function in

the neighbourhood of x0 by computing the first order Taylor expansion.

Sc(x) ≈ dT + b

where d is the derivative of Sc w.r.t the input x at the point x0, as defined in 4.1. By definition, the

gradient of a function f(x) is the linear approximation of that function around x. This linear approximation

can also be thought of as a Taylor series approximation with a Taylor polynomial of degree 1. As such,

the gradient can be seen as an approximation of the linearization of the function and, in essence, of

feature attribution. We provide a simple example in the following figure:

Figure 4.1: Explanation of individual classifications of a simple Neural Network trained on a toy problem. On thetop we have the dataset and the learned decision curve, left and right respectively. On the bottom, thevisualization of the local explanation for points equally spaced throughout the dataset. On the left, thereal values of the importance of each feature and on the right a normalized version. We further explainthis image in the text below.

31

On the top left corner of Figure 4.1 we can see the training data for the simple classification task. On

the top right is the model learned using a simple network which outputs the probability of the point be-

longing to the yellow points. On the bottom left corner we can see the probability gradient (background)

together with local explanation vectors generated by calculating the gradient of the output w.r.t the input.

We can see that along the borders of the triangle the explanations point overwhelmingly towards the

triangle class. Furthermore it is noticeable that along the hypotenuse and at the corners of the trian-

gle both features are considered important for the explanation while along the edges the importance is

dominated by one of the two features. In the bottom right we can see the direction of explanation vectors

represented without regard for their magnitude.

This method of creating a linear representation of the network’s behaviour has several improvements

over our initial solution. The most prominent one is the fact that since we use local gradients of the trained

model as explanatory vectors, they will always follow the model and its assumptions, providing a perfectly

faithful explanation. Furthermore, while perturbations would need to be handled in a myriad of ways

depending on the type of data being analyzed (categorical, continuous, sequential, etc), this approach

can consider any type of feature without any predefined structure. Lastly, by capturing the gradient of the

model we can experiment and measure the influence of changes in any direction, including weighted

combinations of variables. Perturbations would require us to analyse and experiment with different

individual features, omissions of sets of features or various combinations. Each of these would involve

running the model with the perturbed data, increasing the complexity of finding an explanation. With

the aforementioned approach we have perfect freedom to measure the impact of variables without an

increase in the complexity of finding the explanation.

However, there are some inherent problems in our approach that should be discussed. The first

is something intrinsic to any method of providing a local explanation. Visualizing a single point in an

n-dimensional space provides limited view into the workings of a model. Because of the nature of such

a space, we can’t guarantee that the behaviour of the net is maintained throughout its function surface.

As a result, extremely similar data points might have completely different explanations from the network.

This means that humans must look at all explanations produced, merely looking at some representative

examples and imagining that we’ve understood the behaviour might be a very costly error. Finally, this

interpretability method demonstrates another problem that we previously discussed, neural saturation

(Figure 3.5). This is an issue to which different researchers assign varying degrees of importance. On

one hand, in an example such as the one in the aforementioned figure (3.5) gradient based methods

will fail to assign any importance to any of the input features. On the other hand, it is extremely hard to

imagine an example where a wider network could have most neurons saturated at the same time. Unless

that is the case, the remaining neurons and weights of the network should be enough to measure feature

importance.

32

Overall, we believe that our approach is one of the most reliable and efficient ways of interpreting

deep Neural Networks. By using backpropagation to explain the behaviour of the net we not only manage

to do it in a single backwards pass we also provide explanations that are completely faithful to the model’s

behaviour.

33

5Results

Contents

5.1 Medical Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

5.2 Industrial Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

5.3 DeepDIVE: Deriving Importance VEctors . . . . . . . . . . . . . . . . . . . . . . . . . . 55

35

In order to test the viability of our approach, we need to test it’s performance in real networks. We

decided to carry out these tests on datasets that are in some way representative of the key areas where

interpretability is needed the most. With these experiments, we intend to put to the test the feasibility of

utilizing our approach in real world applications. Our method must prove to be fast, reliable and capable

of providing relevant information to a domain knowledgeable human. However, there are some problems

with the current state of interpretability research.

Contrary to other popular areas in machine learning, such as natural language or computer vision,

there are no standardized tests for interpretability. Each new paper or approach that is suggested

defines new evaluation metrics and experiments. This creates a fragmented research landscape as

approaches are evaluated in different ways, allowing authors to choose the light than want to show

them in. Although we will not suggest a standardized test for interpretability in the context of this work,

we try our best to follow best practices and experiments laid out by other recognized authors in the

field [5, 31, 40]. Nonetheless, there is another intrinsic problem to evaluating explanations of neural

network decisions, testing faithfulness to the model. Most neural networks are trained to solve problems

in an n-dimensional space which is extremely hard for us to comprehend. Because of this, when we are

looking at an explanation, how can we verify if the explanation is faithful to the underlying model? We

can see if it makes sense to someone with domain knowledge, but how can we make sure that those

are the features the model is actually using?

We suggest a straightforward way to test the faithfulness and fidelity of the explanations to the un-

derlying model. We outline a simple, theoretically solved problem and train a network to solve it. Having

the theoretical solution to the problem we can then analyze the error surface of the model and compare

it to the theoretical solution to make sure it is learning the correct pattern. If it is, when we create expla-

nations for its error surface we can verify if these explanations are correct by comparing them with the

feature attribution of the theoretical solution.

We chose two different areas to focus our experiments on, both of them are areas where inter-

pretability can have massive influence in deep learning adoption. One of the areas where interpretability

research is of paramount importance is machine learning in the medical domain. Any decision related

to human health is one that should be taken responsibly and with ethical concerns in mind. Decisions

need to be evaluated individually and pondered on by a human who will take responsibility for them.

This creates an extremely difficult problem for AI adoption in the field, especially in the current situation

of neural networks being black boxes. With this in mind, we decided it would be a splendid domain to

test our interpretability approach.

The other domain we decided to tackle is one that is attracting increasing attention of AI research,

Machine Automation for industry processes. Automation has been playing a pivotal role in the twenty-

first century’s race for factory efficiency. More and more factories are automating a majority of their

37

processes with both software and hardware. New types of ”Lights out” (Fully automated and require

no human to be present, hence they can work with the lights off) factories are being introduced yearly.

Until recently, most of the software controlling robotic hardware in factories was tailor made. In recent

years however, advances in reinforcement learning and deep learning have proved to be extremely more

efficient at teaching machines complex behaviour. These advances could be used to further improve

factory automation and even introduce robots in environments where human defined programming is

not adequate. But once again, these efforts are slowed down by the black box nature of deep learning.

Because although not as effortlessly, it is easy to image a wide array of situations where unexpected

behaviour from an automatic robot might have severe economic and dangerous consequences.

5.1 Medical Domain

5.1.1 Context

Machine learning applied to healthcare is booming, every year several new approaches are suggested

to solve pattern-detecting problems such as detecting pneumonia or tuberculosis in X-rays [41, 42] and

new surveys analyse the impact this technology might have in the way healthcare works. But although

some of these systems already achieve state-of-the-art results and have much better performance than

human experts, adoption has been slow [43]. Interpretability, or lack thereof, has been one of the main

reasons for the reluctant adoption of Deep learning in healthcare. As with anything related to Human

health and security, there is an enormous amount of uneasiness in letting a system we do not fully

comprehend make decisions that will directly influence our health. Furthermore, it is easy to imagine

examples such as the one by Caruana et al. [10] where seemingly high performing models are making

a mistake and invalidating their own results. In healthcare, this cannot happen.

With this in mind, we set out to find a dataset we could use that would be representative of the real

world circumstances of a machine learning problem in the medical domain. Luckily, concurrently to our

work, another student from our home institution (Instituto Superior Técnico) was pursuing a thesis in this

field [44]. Their work was based on a set of medical questionnaires from the biggest pediatric hospital

in Portugal, Hospital D.Estefânia. These were filled by doctors to assess risk factors in children being

evaluated for mental health issues. The work of our colleagues was to digitize, analyse and propose

changes to these questionnaires. They tried to identify redundant questions by training a Decision Tree

on the data, minimizing the amount of inner nodes (while maintaining performance) and pinpointing the

questions whose nodes were removed from the tree. This created a valuable parallel to our own work.

Not only did they create a dataset that could illustrate a real healthcare issue, they also analysed the

data and identified