Interpreting Deep Neural Networks Through Backpropagation
Miguel Coimbra Vera
Thesis to obtain the Master of Science Degree in
Information Systems and Computer Engineering
Supervisors: Prof. Manuel Fernando Cabido Peres Lopes, Prof. Francisco António Chaves Saraiva de Melo
Examination Committee
Chairperson: Prof. Alberto Manuel Rodrigues da Silva
Supervisor: Prof. Manuel Fernando Cabido Peres Lopes
Members of the Committee: Prof. Rui Miguel Carrasqueiro Henriques
May 2019
Abstract
Deep Learning is one of the fastest growing technologies of the last five years, which has led to its
introduction to a wide array of industries. From translation to image recognition, Deep Learning is being
used as a one-size-fits-all solution for a broad spectrum of problems. This technology has one crucial
issue: although its performance often surpasses that of previously existing techniques, the internal
mechanisms of Neural Networks are a black box. In other words, despite the fact that we can train the
networks and observe their behaviour, we cannot understand the reasoning behind their decisions or
control how they will behave when faced with new data. While this is not a problem in domains where
mistakes can be afforded, the consequences can be catastrophic in high-risk domains such as healthcare
or finance. In this thesis we explore a generic method for creating explanations for the decisions of any
neural network. We utilize backpropagation, an internal algorithm common across all Neural Network
architectures, to understand the correlation between the input to a network and the network’s output.
We experiment with our method in both the healthcare and industrial robotics domains. Our methodology
allows us to analyse these previously opaque algorithms and compare their interpretations of the data
with those of more interpretable models such as decision trees. Finally, we release DeepdIVE (Deriving
Importance VEctors), an open-source framework for generating local explanations of any neural network
programmed in PyTorch.
Keywords
Deep Learning; Interpretability; Machine Learning; Artificial Intelligence; Neural Networks
Summary
Deep Learning, one of the technologies that has grown the most in recent years, has been introduced
into a wide range of industries. From translation to image recognition, Deep Learning has been used as
a solution for all kinds of problems. However, this technology has a fundamental problem: although its
performance surpasses that of previously existing technologies, the internal mechanisms of Neural
Networks are a black box. In other words, although we can train these networks and test their behaviour,
we cannot understand the reasons behind each individual decision or control how they will behave when
faced with new data. Although this is not an urgent problem in a research context, where these networks
are applied to research problems, the consequences can be catastrophic when real-life scenarios are at
stake. In this thesis we explore a generic method for creating explanations for the decisions of any
neural network. We use backpropagation, an internal algorithm used in the vast majority of Neural
Networks, to understand the causal relationship between the data received by a network and its decisions.
We ran experiments to test our methodology on networks applied in both healthcare and industrial
contexts. Using our method, it is possible to analyse these previously opaque networks and to inspect
their internal representations of the data given knowledge of the domain. Finally, we released DeepdIVE
(Deriving Importance VEctors), an open-source tool for generating explanations of the decisions of any
Neural Network implemented in PyTorch.
Keywords
Interpretability; Machine Learning; Artificial Intelligence; Neural Networks
Contents
1 Introduction
1.1 Introduction
1.2 Problem
1.3 Hypothesis and Contributions
2 Background
2.1 Neural Networks
3 Related Work
3.1 Interpretability
3.2 Interpretable Models as Explanations
3.3 Inner Model Visualizations
4 Backpropagating Feature Importance
5 Results
5.1 Medical Domain
5.1.1 Context
5.1.2 Data Analysis
5.1.3 Network Architectures
5.1.4 Results
5.2 Industrial Domain
5.2.1 Context
5.2.2 Data Analysis
5.2.3 Network Architectures
5.2.4 Results
5.2.4.A Using Gradients for Visualization
5.2.4.B Gradients as Explanation
5.3 DeepDIVE: Deriving Importance VEctors
5.3.1 Architecture
5.3.2 Examples
6 Conclusion and Future Work
6.1 Future Work
6.2 Conclusion
A Appendix A
List of Figures
2.1 On the left, a 2-layer neural network with one hidden layer and one output layer. On the
right, a 3-layer (deep) neural network with two hidden layers and one output layer. Note
that there are connections between neurons across layers, never within a layer. Image
sourced from CS231n [1].
3.1 This chart represents the traditional view of the trade-off between interpretability and
performance of a model. Although there is definitely some truth to this view, it is merely a
heuristic.
3.2 An example for LIME. The black-box model’s decision function f is represented by the
pink/blue background. A linear model is used to create a local explanation. The bold
red plus sign is the instance being explained. LIME samples instances (red plus signs
and blue circles), gets predictions using f and weights them by proximity to the original
instance. The dashed line is the learned explanation, which is locally (but not globally) faithful.
Image sourced from [2].
3.3 Example of a depth 4 decision tree obtained from distilling a NN trained on MNIST. The
images at the inner nodes are filters learned from images. The ones at the leaves are
visualizations of the learned probability distribution over classes. The most likely
classifications at each node are annotated in blue. Image sourced from Frosst et al. [3].
3.4 The maps were extracted using a single back-propagation pass through a classification
ConvNet. These maps represent the top predicted class in ILSVRC-2013 test images.
Image sourced from Simonyan et al. [4].
3.5 Both perturbation-based approaches like LIME and gradient-based approaches fail to
model saturation. Illustrated here is a simple network that experiences saturation from its
inputs. When both i1 = 1 and i2 = 1, perturbing either of them to 0 will not produce a
change in the output. Similarly, the gradient of the output w.r.t. the inputs is also zero when
i1 + i2 > 1. Image sourced from Shrikumar et al. [5].
4.1 Explanation of individual classifications of a simple Neural Network trained on a toy
problem. On the top we have the dataset and the learned decision curve, left and right
respectively. On the bottom, the visualization of the local explanation for points equally spaced
throughout the dataset. On the left, the real values of the importance of each feature and
on the right a normalized version. We further explain this image in the text below.
5.1 Data for this problem was extremely skewed, with the least imbalanced class (Risk of
Suicide) having a ratio close to 5:1 and the most imbalanced class having a ratio of around 10:1
between Not Very High and Very High risk.
5.2 The architecture of ReLUSigNET. Dropout indicates dropout layers and LL indicates fully
connected linear layers. We trained the network with an 80/20 train/test split chosen
randomly.
5.3 The figure on the left shows the robot arm simulated with equation 5.1. The arm
has a central hinge and one more joint, controlled by θ1 and θ2 respectively. Both of these
joints can rotate through 360°. This can be visualized in the figure on the right, which shows
ten thousand randomly selected data points from the dataset.
5.4 The architecture of the ReLUTan network. Since the training and test sets had exactly the
same distribution (an uncommon trait for machine learning problems), regularization didn’t
play as important a role as in the medical questionnaire problem.
5.5 Representation in output space of the bias towards one of the angles w.r.t. the x axis on the
left and w.r.t. the y axis on the right. These graphs were created using a threshold of 1.5
for importance bias. In other words, a point is marked as biased if the importance of one
angle is at least 1.5x the importance of the other.
5.6 Comparison of dx/dθ1 and dx/dθ2 between the theoretical values and ReLUTan. Although
the accuracy of ReLUTan is extremely high, we can see that the fact that it uses ReLUs
makes the gradients much harsher than the original gradients of equation 5.1.
5.7 Local explanations retrieved from ReLUTan. Each arrow color represents an explanation
for one of the axes of movement. The magnitude of the arrow represents the amount of
movement predicted. Arrows on the borders of the input space are generally bigger due
to the network’s inability to predict correctly outside this range.
5.8 This small example shows how simple it is to retrieve feature importance vectors from any
neural network using deepDIVE. Here, we are retrieving angle importance for the x axis
in ReLUTan.
A.1 Comparison of dx/dθ1 and dx/dθ2 between several different models. Different activation
functions create very different types of surfaces. SigTan is a ReLUTan with sigmoid
activations. Here the difference between sigmoids and ReLUs is especially noticeable, as
the former create very smooth curves and the latter very harsh and noticeable decision
boundaries. It is also insightful to look at SimpleTan to see how a network with a much
lower representational capacity tries to mimic the objective function.
List of Tables
5.1 Although both systems achieve high accuracies, the decision trees have a higher
precision. It should be noted that we could not measure recall for the trees since we did not
have access to the final trained trees.
5.2 As the number of features that we perturb increases, the number of datapoints whose
classification is changed increases. As we reach 4 features we can clearly start seeing
diminishing returns, implying that the datapoints that weren’t changed are extremely far
from the decision boundary of the network.
5.3 From the biased data retrieved by filtering explanations where the importance ratio
between angles was above the threshold, we can see that movement on both axes is extremely faithful
to the explanations. Most of the datapoints where the movement of the arm didn’t
correspond to the explanation are located at the edges of the dataset.
1 Introduction
1.1 Introduction
Machine learning is not a new field: the idea that we could teach machines to “learn” the solution
to a problem by themselves has been around for a very long time. However, limitations in techniques and
computational power held it back for decades. Due to these limitations, the algorithms used in the early
days of the field were considerably simpler than the ones used today. As such, it was easy to interpret
these algorithms and deduce why they had made a certain decision. Models trained
with algorithms such as Linear Regression and Decision Trees are easy to interpret. These models,
however, often have severely limited representational power, that is, a limited ability to learn the complex
patterns displayed by data. When compared with less interpretable models like the extremely
popular Neural Networks, these models’ performance is usually lacking.
Neural Networks are not a new concept either. They were first envisioned after Hebbian Theory, a theory
commonly summarized by the adage “Neurons that fire together, wire together”. Shortly
after this, in 1958, Frank Rosenblatt developed the Perceptron, which would later become the basis of what
we now call an “artificial neuron” [6]. Although the perceptron enjoyed limited success in the research
community, the ability to join several of these together was still far off. However, in the last few years,
partly due to enormous advances in computational power and partly due to new optimization techniques,
Deep Neural Networks have become the de facto standard for tasks such as Natural
Language Processing and Computer Vision, among many others. These networks have achieved
state-of-the-art performance in many of these fields and are experiencing a period of widespread use [7]. With
this increase in adoption, researchers have started looking at ways to expand the use of these networks to
areas such as finance, medicine or industrial processes.
These networks, albeit responsible for achieving state-of-the-art performance in many fields, have
one major drawback: they are extremely hard to interpret. They are composed of several interconnected
layers of sometimes thousands of “neurons”, each of which applies a mathematical transformation to
the data it receives. We can observe the data we feed into a network, its output and the individual changes
to each neuron, and we can understand every single piece of its inner workings. Yet, due to the
non-linear transformations these networks apply at every layer, it is practically impossible to grasp their
end-to-end behaviour by looking at these raw values.
Their remarkable generalization capabilities are related to the fact that they use distributed
representations in their hidden layers, that is, their “knowledge” is distributed unevenly across a set of neurons [8].
These distributed representations are defined by the set of all activations in any given layer. Because
the effect of any particular “neuron” depends on the activations of every other
unit in that layer, it is extremely hard to retrieve any meaningful information about its role. As such,
it is very difficult to discern which internal mechanisms are responsible for any specific phenomenon
(although some research has been done on that topic) [9].
All of this is further aggravated by another characteristic of Deep Neural Networks: they can model an
enormous number of weak statistical regularities between the inputs and outputs of the training
data [8]. However, there is no mechanism that can
distinguish the regularities that are true properties of the data from false ones created by peculiarities of the
training set. To sum up, not only is it extremely hard to correlate unit activations with any specific feature,
it is also very difficult to verify which units actually matter for the result and are not just activating due to
some relationship learned from random training set characteristics. This means that although we know
precisely how these networks work, if we want to unearth the reasons behind a specific decision, even
experts can only make educated guesses.
This “black box” nature of neural networks has become an enormous barrier to adoption in areas
where decisions have irreversible consequences. In areas such as NLP, the consequences of failure
are fairly minor and, as a result, interpretability has remained secondary to performance.
On the other hand, in areas like medicine, where a wrong choice might prove deadly, it is of the utmost
importance to understand why the network has made a decision.
To illustrate the importance of interpretability, Caruana et al. [10] detail a model trained to predict
the probability of death from pneumonia that learned to assign less risk to pneumonia patients if they had
asthma. These patients, when arriving at the hospital, received more aggressive treatment due to their
condition. As a result, the model concluded that asthma was predictive of a lower risk of death. If this
model were to be deployed to aid in triage, these patients would receive less aggressive treatment,
invalidating the model and possibly resulting in a higher number of deaths amongst asthma patients. How
can doctors act upon predictions without knowing the reasoning behind the result? In scenarios such
as this, acting on blind faith could prove catastrophic.
In light of these facts, model interpretability has become of paramount importance for widespread
adoption in high-risk areas. However, even what exactly constitutes interpretability is still debated,
as different authors strive for solutions to different, albeit often similar, problems. Fundamentally, we
consider interpretability a source of trust in the model. For the purpose of this work, we will consider
trust to be human confidence in the reasons behind the network’s prediction. This confidence relies on
an observable causality between the inputs to the network and its outputs. As such, we can say that an
interpretation of a decision is an explanation of how the input values gave rise to that
decision.
1.2 Problem
Although these networks achieve outstanding performance when evaluated through traditional
measurements like Accuracy, Precision, etc. [7], these measurements can only tell us so much about what the
network is doing and why it is doing it.
As we have previously hinted with the example by Caruana et al., considering the reasons
behind a decision is of the utmost importance [10]. Interpretable predictions lead to appropriate trust
and can also provide intuitions on how to improve the model. Understanding why a model made a
certain prediction is essential in many fields. If domain experts understand the reasoning behind a
prediction, they can use their knowledge to accept or reject it. Besides enabling much safer and more
controlled usage, explanations have been shown to increase the acceptance of automated systems’
recommendations, e.g. entertainment recommendations [11,12].
Furthermore, there are a considerable number of ways in which a model can be trained wrongly. A good
example, studied by Kaufman et al., is data leakage [13]. Data leakage is the unintentional introduction of
additional information into the training (and validation) data, potentially inflating accuracy. The authors
cite an example where the patient ID was strongly correlated with the target class in the training
and validation data, thus leading the model to have an artificially high performance in these controlled
environments. Another example, mentioned by Ribeiro et al. [2], is artifacts of data collection influencing
the model. The authors describe an experiment where a model was designed to distinguish between a
Husky and a Wolf. Although the model had an extremely high accuracy, when rudimentary interpretability
methods were applied to it, it was found that the model was in fact looking at the background of the image
and using that to make a decision. A lighter, whiter background would be classified as a wolf, as most
photos of wolves had been taken in the wild against snowy backgrounds, while darker backgrounds would be
classified as huskies, since most of the husky pictures used in training had been taken inside a house or in a
modern city. Both of these issues would be extremely challenging to detect by looking only at the predictions
and raw data.
Determining the causality between inputs and outputs would provide a much clearer picture of the
inner workings of these networks. This could be used not only to deploy them safely in new fields
but also as a tool to help machine learning practitioners analyse and assess new models before
deploying them.
In order to interpret these networks, various approaches have appeared in the research community.
These approaches range from attempts to create generic explanation methods for any machine
learning model to distilling more complex models into simpler, easier-to-interpret ones. One that has
gained increased attention in the last few years is using the backpropagation algorithm to calculate the
gradient of the output w.r.t. the input. Yet, this and its related approaches have only been researched in
the context of computer vision, with images as input. Backpropagation is the algorithm responsible for
creating these inner distributed representations. As such, we wonder whether its utility in interpreting Neural
Networks could be exploited in other domains.
Is it possible to use network gradients to interpret the output of any deep Neural Network?
1.3 Hypothesis and Contributions
The Hypothesis we put forth with this thesis is the following:
Network gradients after any particular decision can be used to establish a causal relationship between input and output.
In this work, we propose a variation of backpropagation-based techniques to interpret any deep
neural network, not only those using images as input. By calculating the gradients at decision time
we can establish a causal relationship between input features and the network’s output. With it, we can
accurately attribute decisions to a specific cause or causes. This provides a generic method for Neural
Network Interpretability. Contrary to previous works, which were solely based on image classification
networks, our method can be applied to a multitude of domains. Although the method is capable of
interpreting any network, our main focus was to test it in areas where it can have real-world implications. As
such, focus was given to the medical and industrial domains.
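As a rough illustration of the idea above, the following minimal sketch computes the gradient of a network's output with respect to its inputs and reads it as a local feature importance vector. The tiny network, its weights (W1, B1, W2, B2) and the function names are hypothetical and were chosen so the numbers can be checked by hand. It is not the implementation used in this work, only a plain Python sketch assuming one ReLU hidden layer and a linear output.

```python
# Hypothetical 2-input network: one hidden layer of 2 ReLU units, 1 linear output.
W1 = [[2.0, 0.0], [0.5, -1.0]]  # assumed hidden weights, one row per hidden unit
B1 = [0.0, 0.0]                 # assumed hidden biases
W2 = [1.0, 3.0]                 # assumed output weights
B2 = 0.0                        # assumed output bias


def relu(z):
    return max(0.0, z)


def forward(x):
    """Return the hidden pre-activations and the scalar output for input x."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, B1)]
    hidden = [relu(z) for z in pre]
    out = sum(w * h for w, h in zip(W2, hidden)) + B2
    return pre, out


def input_gradient(x):
    """Backpropagate from the output to the inputs (chain rule, layer by layer)."""
    pre, _ = forward(x)
    # d(out)/d(pre_j): the output weight, gated by whether the ReLU is active.
    d_pre = [W2[j] * (1.0 if pre[j] > 0 else 0.0) for j in range(len(pre))]
    # d(out)/d(x_i) = sum_j d(out)/d(pre_j) * W1[j][i]
    return [sum(d_pre[j] * W1[j][i] for j in range(len(pre))) for i in range(len(x))]


x = [1.0, -1.0]
importance = input_gradient(x)
print(importance)
```

For the input [1.0, -1.0] both hidden units are active, so both features receive a non-zero gradient. A different input such as [1.0, 2.0] switches off the second hidden unit and yields a different vector, which is why explanations of this kind are local to the decision being explained.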
In the context of this work we implemented several different Neural Network architectures to solve
two distinct problems, one for each of our focus domains. The first challenge was to develop a classifier
network responsible for predicting the severity of a patient’s condition in triage for suspected mental illness.
We explore how this network compares to a traditional decision tree algorithm and how we can use our
interpretability technique to retrieve insights from the network’s decisions. Secondly, we developed a
network responsible for moving a two-dimensional robot arm with two degrees of freedom. This arm is
equivalent to a human arm with a shoulder and an elbow, but restricted to one plane of motion. Since this
movement can be described theoretically, we use this example to explore the network’s internal representations
of the concepts. By comparing explanations of the network’s results at different arm angles with the
theoretical results, we can explore how the network actually learns and accurately measure the reliability
of our explanation method.
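Because the arm's motion has a closed form, the gradients that an explanation should recover can be written down directly. The sketch below assumes the textbook forward kinematics of a planar two-link arm with unit link lengths and θ2 measured relative to the first link. The arm equation used later in this work may be parameterized differently, so the formula, the constants L1 and L2 and the function names should all be read as assumptions.

```python
import math

L1 = 1.0  # assumed length of the first link (shoulder to elbow)
L2 = 1.0  # assumed length of the second link (elbow to tip)


def arm_position(theta1, theta2):
    """Tip position (x, y) of the assumed planar two-link arm."""
    x = L1 * math.cos(theta1) + L2 * math.cos(theta1 + theta2)
    y = L1 * math.sin(theta1) + L2 * math.sin(theta1 + theta2)
    return x, y


def dx_dtheta(theta1, theta2):
    """Analytic partial derivatives of x with respect to each joint angle."""
    d1 = -L1 * math.sin(theta1) - L2 * math.sin(theta1 + theta2)
    d2 = -L2 * math.sin(theta1 + theta2)
    return d1, d2


def numeric_dx_dtheta(theta1, theta2, eps=1e-6):
    """Central finite differences, used only to sanity-check the formulas."""
    d1 = (arm_position(theta1 + eps, theta2)[0]
          - arm_position(theta1 - eps, theta2)[0]) / (2 * eps)
    d2 = (arm_position(theta1, theta2 + eps)[0]
          - arm_position(theta1, theta2 - eps)[0]) / (2 * eps)
    return d1, d2


theta1, theta2 = 0.3, 1.1
analytic = dx_dtheta(theta1, theta2)
numeric = numeric_dx_dtheta(theta1, theta2)
print(analytic, numeric)
```

Under these assumptions, comparing the two partial derivatives at a given pose indicates which joint angle the x coordinate is locally more sensitive to, which is the kind of reference value a network's input gradients can be compared against.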
The rest of this work details previous work, our approach, our methodology and results. But first, as
an attempt to introduce unfamiliar readers to the core concepts behind this work we make an endeavor
to explain Neural Networks and backpropagation as plainly as possible.
2 Background
This section assumes a modest understanding of the basics of Machine Learning and Linear Alge-
bra. Here, we will explore and explain the basics of Neural Networks. We hope to supply the reader
with sufficient knowledge to understand subsequent chapters. We provide an overview of what these
networks are, how their fundamental mechanisms work and how they are used.
2.1 Neural Networks
Neural networks come in many shapes and forms. The designation neural networks comes from the
fact that these models are loosely inspired by the biological neural networks that form the human brain.
Putting it very simply, the basic computational unit of the brain is the neuron. There are approximately 86
billion neurons in our nervous system. They connect to other neurons through synapses, and a single
neuron can have thousands of synapses. It is through synapses that a neuron receives input signals. These are
then combined and used to produce output signals along its single axon, the “branch” that extends
to the proximity of other neurons. This axon eventually branches out and its synapses connect to other
neurons. When a neuron receives signals from others, it sums all of them. If the sum of all signals is
above a certain threshold, the neuron will send information along its axon to the next neurons.
The computational model of this, the artificial neural network, follows many of the same concepts.
These networks are also comprised of numerous connected units, commonly referred to as “artificial
neurons”. Unlike in the biological model, the signals x that travel to another neuron interact
multiplicatively (wx) with that neuron, depending on how strong the synapse (or connection) w between
them is. The main idea behind the computational representation of the brain is that the strength of this
connection can be controlled and “learned”. It should also be mentioned that w can be
positive or negative, allowing a much broader influence on the next neuron. Similarly to the biological
model’s sum of signals before forwarding the information, artificial neurons have an activation function
f. Instead of acting as a threshold that controls the timing of the neuron, in this case it acts as a
representation of the frequency at which the neuron sends information to the next. To sum up, the basic
structure of a neuron can be represented as follows:
\[ f\Big(\sum_i w_i x_i + b\Big) \]
In other words, a neuron performs a dot product with the inputs and their weights, adds a bias b,
applies a non-linearity f (activation function) and sends the result to the next neurons. Typically these
neurons are organized in layers with each layer performing different kinds of transformations on their
inputs. Signals travel from the first (input) layer to the last (output) layer. The layers of neurons between
the first and last are called hidden layers. Simply put, a neural network is considered a deep neural
network if it has two or more hidden layers.
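The neuron computation described above can be sketched in a few lines of plain Python. The sigmoid activation, the function names and the example weights are illustrative assumptions, not values taken from this work.

```python
import math


def sigmoid(z):
    """Logistic activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias, activation=sigmoid):
    """Compute f(sum_i w_i * x_i + b) for a single artificial neuron."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


def layer(inputs, weight_rows, biases, activation=sigmoid):
    """A layer is a list of neurons that all read the same inputs."""
    return [neuron(inputs, w, b, activation) for w, b in zip(weight_rows, biases)]


x = [1.0, 2.0]
# Weighted sum: 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0, and sigmoid(0.0) = 0.5
out = neuron(x, weights=[0.5, -0.25], bias=0.0)
hidden = layer(x, weight_rows=[[0.5, -0.25], [1.0, 1.0]], biases=[0.0, -3.0])
print(out, hidden)
```

Stacking several calls to layer, each consuming the previous one's output, gives the layered structure described in the paragraph above.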
Figure 2.1: On the left, a 2-layer neural network with one hidden layer and one output layer. On the right, a 3-layer
(deep) neural network with two hidden layers and one output layer. Note that there are connections
between neurons across layers, never within a layer. Image sourced from CS231n [1].
The simplest of these are called feedforward neural networks, or multilayer perceptrons. The objective
of a feedforward neural network is to approximate a certain function f∗ that belongs to a class of
functions F. Each of the layers in a neural network can be seen as a function. This means that (most)
neural networks are merely a chain of functions, f(x) = f2(f1(x)). In this case, f1 would be the first
layer of the network and f2 the second. We can then think of information being propagated through
the network as a sequence of functions being applied to the input data.
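The chain-of-functions view can be illustrated with a short sketch. The helper names dense and compose and all the weights are hypothetical. The two layers were picked so that the network computes |x1 - x2|, which makes the result easy to verify by hand.

```python
def relu(z):
    return max(0.0, z)


def identity(z):
    return z


def dense(weight_rows, biases, activation):
    """Build a layer function that maps a list of inputs to a list of outputs."""
    def apply(inputs):
        return [
            activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weight_rows, biases)
        ]
    return apply


def compose(*layers):
    """Chain layers so that compose(f1, f2)(x) == f2(f1(x))."""
    def network(inputs):
        for layer_fn in layers:
            inputs = layer_fn(inputs)
        return inputs
    return network


# f1: hidden layer with two ReLU units computing relu(x1 - x2) and relu(x2 - x1)
f1 = dense([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0], relu)
# f2: linear output layer that adds the two hidden activations
f2 = dense([[1.0, 1.0]], [0.0], identity)
f = compose(f1, f2)

print(f([3.0, 1.0]), f([1.0, 4.0]))
```

Writing the network as a composition also makes it clear why the chain rule applies: the derivative of f is built from the derivatives of f1 and f2.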
For a neural network to “learn”, we need a way to codify what is a correct or incorrect output and
how “far” an incorrect output is from the correct one. For that, we use a cost function C. A cost function
(or loss function) is a function such that, for the optimal f∗, C(f∗) ≤ C(f) ∀f ∈ F. It serves as a measure
of how far away a particular solution is from an optimal solution. Learning algorithms, such as Gradient
Descent, search through the solution space trying to find the function with the smallest possible cost.
Gradient Descent is an iterative algorithm that works by correcting the values of the weights w and
biases b after each network mistake. It does this by calculating the gradient of the cost function
at the current point and moving in the direction of the negative gradient, in order to reach a minimum of the
cost function. This way, the network is constantly changing and moving closer to f∗.
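A minimal sketch of Gradient Descent, assuming a single linear unit, a mean squared error cost and a fixed learning rate. The data, the learning rate and the number of steps are illustrative choices, not settings used in this work.

```python
def cost(w, b, xs, ys):
    """Mean squared error of the linear model w * x + b on the data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient(w, b, xs, ys):
    """Partial derivatives of the cost with respect to w and b."""
    n = len(xs)
    dw = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return dw, db


def gradient_descent(xs, ys, learning_rate=0.1, steps=1000):
    """Repeatedly step against the gradient, starting from w = b = 0."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        dw, db = gradient(w, b, xs, ys)
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * x + 1.0 for x in xs]  # targets generated by the line y = 2x + 1
w, b = gradient_descent(xs, ys)
print(w, b, cost(w, b, xs, ys))
```

On this toy problem the cost is convex, so the iterates approach the generating line. For a deep network the same update rule is used, but the cost surface generally has many local minima and the gradient has to be obtained through backpropagation.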
In order to apply this method to a deep neural network, the backpropagation algorithm is essential.
Backpropagation is the method used to calculate the gradient of the loss function with respect to the
weights of the connections between the units of the network. These gradients are calculated one layer
at a time, from the output layer all the way back to the input layer, and are then used to update the weights.
And with this, Gradient Descent is put into action.
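The backward pass can be sketched on the smallest possible network: one input, one sigmoid hidden unit and one linear output, with a squared error loss. The parameter values and function names are hypothetical. The finite-difference check at the end is only there to confirm that the hand-derived chain rule is consistent.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Forward pass: hidden activation h and prediction y_hat."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)
    y_hat = w2 * h + b2
    return h, y_hat


def loss(params, x, y):
    _, y_hat = forward(params, x)
    return 0.5 * (y_hat - y) ** 2


def backprop(params, x, y):
    """Apply the chain rule from the output layer back to the input layer."""
    w1, b1, w2, b2 = params
    h, y_hat = forward(params, x)
    d_yhat = y_hat - y           # dL/dy_hat
    d_w2 = d_yhat * h            # output layer weight
    d_b2 = d_yhat                # output layer bias
    d_h = d_yhat * w2            # gradient flowing into the hidden unit
    d_z = d_h * h * (1.0 - h)    # sigmoid'(z) = h * (1 - h)
    d_w1 = d_z * x               # hidden layer weight
    d_b1 = d_z                   # hidden layer bias
    return [d_w1, d_b1, d_w2, d_b2]


def numerical_gradient(params, x, y, eps=1e-6):
    """Central finite differences over each parameter, for checking only."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, x, y) - loss(down, x, y)) / (2 * eps))
    return grads


params = [0.5, -0.3, 0.8, 0.1]  # hypothetical values for w1, b1, w2, b2
analytic = backprop(params, x=1.5, y=1.0)
numeric = numerical_gradient(params, x=1.5, y=1.0)
print(analytic)
print(numeric)
```

Note that the intermediate quantity d_h is reused by every gradient in the hidden layer. This reuse of gradients from later layers is what makes backpropagation efficient compared with differentiating each weight independently.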
Neural networks can be applied to many types of problems and have many different architectures
in order to accomplish different goals. Networks used for Computer Vision tasks, for example, have a very
different architecture from the ones used in Natural Language Processing. Each of them exploits
particular characteristics of certain types of layers, or of the connections between these layers, to tweak the
network to perform better at its intended purpose. Even feedforward networks used in “simple” tasks
can vary wildly in architecture. In the remainder of our work, in an attempt to restrict our exploration
space, we will focus on feedforward networks of varying shapes and sizes.
3 Related Work
Contents
3.1 Interpretability
3.2 Interpretable Models as Explanations
3.3 Inner Model Visualizations
To understand the approach taken in our proposal, it is essential to review what has been done
before. Throughout this chapter, we will review a collection of previous work in interpretability, carried
out by many from across the research community. Analyzing this work is essential to give us guidance
on what does or does not work and what has been tried before. This review of previous work
is also responsible for the, often overlooked, crucial task of homogenizing how certain terms are used
throughout the scientific community. One of these terms happens to be Interpretability itself.
Interpretability recently jumped into the limelight, but different papers seem to have different interpretations
of it. What makes an algorithm interpretable? What is the true purpose of interpretability? We
will answer these questions and many others. In order to search for the optimal way to interpret neural
networks, we must understand what this means and what it is we are looking for.
After examining interpretability, we will look into several different approaches to demystifying neural
networks. First we will examine methods which use interpretable models to deliver explanations. In
these, neural networks are treated as black boxes, with other, more interpretable, models being used
to extract information from the networks without any access to their internals. By training easily
understood models around the neural networks, researchers hoped to mirror some of their behaviour.
Subsequently they use the newly trained models to draw conclusions about the original networks. After
that, we will analyze another set of methods which takes a completely different approach. These
attempt to understand neural networks utilizing their internal mechanisms. These methods try to
slightly modify either forward propagation or backpropagation to provide useful information to explain
the network or one of its decisions.
3.1 Interpretability
Due to the enormous growth in its usage, Machine Learning has naturally started seeping into critical
areas such as medicine, machinery control, financial markets and terrorism detection, among others.
When tackling these kinds of issues, the fact that humans are incapable of understanding advanced
models in machine learning takes on a whole other dimension [14]. Machine Learning model Interpretability
came into the spotlight as a way to tackle these issues.
It should be noted however, that historically, the machine learning community, despite increasing
claims of using interpretability as motivation, remains notably isolated from other industries and policy
issues in general [15]. A prime example of this is the extensively discussed provision of the European
Union General Data Protection Regulation (GDPR) that is often labeled as a right to explanation, something
that is highly connected to the understanding of a model’s decisions. While it is true that the meaning and
enforceability of the law continue to be under debate, the machine learning community gazed upon it
extremely briefly, via just one paper [16]. This single paper was the only one presenting information
about this law for a machine learning audience. This problem is noticeable in other areas of the machine
learning community related to interpretability like algorithmic fairness. In this area, a single paper about
gender-crossing is considered the uncontested authority [15]. Not because of its content but due to the
larger problem of a general disregard for policy in machine learning circles.
However, recent concerns about fairness, racial or gender bias, and other related issues have been
drawing the attention of experts. Interpretability in particular has been increasingly looked at as a solution
to many of these problems. The reasoning is that, if we can accurately pinpoint how the model is
making a decision, we can take actionable steps to correct its behaviour. Despite this, not many people
have tried defining what Interpretability is. Counter-intuitively, papers often make claims about the
interpretability of numerous models without ever defining the term.
Although slow, in the last year some progress has been made in this area, with some attempts to
define interpretability and the characteristics that make a model interpretable. A study by Lipton [17]
shows that both the motivations for interpretability and the technical definition of interpretable models
are disparate, sometimes even contrary. That is, depending on the work, the word interpretability refers
to more than one concept.
Figure 3.1: This chart represents the traditional view of the trade-off between interpretability and performance of a model. Although there is definitely truth to this statement, it is merely a heuristic.
However, it should be noted that, as we mentioned previously, not all machine learning models are
hard to interpret. Most machine learning practitioners would categorize interpretable models as shown
in Figure 3.1. Nonetheless, this is merely a heuristic, but one that has been propagated throughout
research materials. In the research community, researchers usually adopt a definition for interpretability
that makes intuitive sense to them. This has the unwanted effect of the definition for interpretability
changing considerably from paper to paper. The study by Lipton [17] compiled a list of the different
descriptions of interpretability found throughout the spectrum of machine learning literature. Some
researchers attempt to describe interpretability as “simulatability”: whether a person can contemplate the
entire model at once. In other words, whether one can examine the entirety of the model and
understand what it is doing and how. This notion is also adopted by Ribeiro et al., who suggest that an
interpretable model is one that “can be readily presented to the user with visual or textual artifacts” [2].
Other authors suggest different traits to judge interpretability such as decomposability or algorithmic
transparency. Decomposability suggests that a model is interpretable if each part of the model (each
input, parameter and calculation) has an intuitive explanation. It’s worth noting that although it is true that
we can take each singular ”part” of a neural network apart and explain what it is doing, it is often close
to impossible to decipher how the several different parts fit into the whole that is the network’s decision.
In other words, we know exactly what it is doing and how it is doing it but in the large majority of cases
don’t have the faintest idea of why it is doing it. On the other hand, algorithmic transparency argues that
we can only truly interpret and grasp a model if we understand the shape of its error surface. While this
is easily done in the case of linear models, we don’t really understand how the heuristic optimization
procedures for neural networks work. Besides, error surfaces for neural networks are, more often than
not, dependent on thousands of weights. This creates a surface of very high dimensionality, impossible
for humans to imagine or visualize.
By none of these metrics are neural networks considered interpretable. But it can also be argued that
the different metrics have different goals or properties that they strive for. Not all models that are typically
considered interpretable would fit even two of those metrics. We believe that this is because there can
be no discussion about interpretability without considering, not only what makes a model interpretable,
but what it is we are striving for with interpretability. Just as with the different traits used to classify
models, researchers have sought after very different goals with their research in interpretability.
Luckily, Lipton [17], when compiling the different metrics used to classify an interpretable model, also
studied and collected the different properties interpretability research strives for. These properties are:
Trust, Causality, Transferability, Informativeness and Fair Decision-Making. In our work we will mainly
focus on interpretability as a source of trust and information.
Trust is a common motivation for interpretability [2,14,18]. But what is trust? Is it trust in the model’s
performance? If so, a model with an adequate accuracy would be a highly trustworthy model and we’d
have no need for interpretability. Trust can be defined subjectively, e.g., someone can be more comfortable
working with a well-known model even if that knowledge gives them no advantage whatsoever
in interpreting the model’s decisions. However, it can also be defined in more practical ways. Trust in
a prediction can be assessed by whether a human would trust the prediction enough to make a de-
cision based on it [2]. Establishing trust in these individual predictions is of paramount importance.
Humans create a trust relationship with the tools they use to inform their decision-making. Studies have
shown that they are much more likely to act on a tool’s suggestions if they’re provided an explanation for
them [11]. This becomes especially important when the tool is not a traditional algorithm but something
considered a black-box, such as a machine learning model. Furthermore, in fields such as medical diag-
nosis, predictions cannot be acted upon without a certain degree of confidence [10]. The consequences
might be catastrophic.
On the other hand, interpretability can also help extract additional information from the model.
While the scientific goal of machine learning is to reduce the error of some function, in the real world its
goal is to produce helpful information. Traditionally, we only consider the information conveyed through
the model’s outputs. Nonetheless, through interpretability techniques it is feasible to convey extra
information to a human decision maker. This has been explored as a paradigm where supervised
models are employed simply to provide information to humans [14, 19] instead of training the models
to act upon their predictions. However, we would like to explore the same approach as Ribeiro et al.,
deciphering the model’s decision making rationale and using that as an extra source of information [2].
An interpretation might prove valuable even if it does not provide any intuition or explanation into the
model’s inner mechanisms. For example, a model responsible for medical diagnosis can be better suited
for assisting a human in making a decision if, instead of solely providing a diagnosis, it can explain which of
the symptoms contributed to that forecast. Then, the human can combine that knowledge with their own
domain specific expertise, likely resulting in a better decision. As mentioned before, it has been shown
that showing these types of information to humans makes them significantly more likely to accept, and
act upon, the original model prediction [11,12,20].
3.2 Interpretable Models as Explanations
When considering ways to provide explanations for the behaviour of deep neural networks, one of the
most common approaches is a delegation of responsibility. Instead of looking at the model’s internal
mechanisms for an explanation, the task of explaining the model’s behaviour is outsourced to an exter-
nal, simpler and interpretable model. Intuitively speaking, when looking at Figure 3.1, these approaches
are based on the use of the models on the top left (easily interpretable) to explain the ones on the bottom
right (black box nature). In this section we will explore some of the most recent proposals in this area.
Ribeiro et al. propose an approach capable of explaining any model, including neural networks, by
treating the model as a black-box. The authors argue that one of the most neglected but important
aspects of machine learning is the role of humans. They believe, in accordance with our work, that it
is necessary for a model to be trusted by humans for it to be used in the real world. Furthermore, they
argue that the best way for a human to trust a model is for it to explain the rationale behind a decision.
They suggest that two things are essential for an explanation of a machine learning model: it needs
to be interpretable and it needs to be locally faithful [2]. They claim that it is often impossible for an
explanation to be entirely faithful unless it is the complete description of the model itself. They argue that
explanations don’t need to represent the entirety of the model’s behaviour. They only need to be locally
faithful, i.e., accurately describe how the model behaves in the proximity of the instance being predicted,
or in other words, accurately describe the behaviour for that particular decision.
Taking this into account, the authors developed a framework by the name Local Interpretable Model-agnostic
Explanations (LIME). Its goal is to use an interpretable model to provide an interpretable representation
that is locally faithful to a prediction of another classifier. Their work tries to define a balance between
the complexity and faithfulness of an explanation. The goal was to define a general framework that could
continue being used in the future, even if the method of obtaining the explanations changes. However,
in the context of this first work, they focused on using sparse linear models to provide explanations,
while using perturbations of the input as a way to explore the model to be explained. An example is
provided in Figure 3.2. It should be noted that although they only describe sparse linear models as
explanations in this work, LIME supports the exploration of an array of function families for explanations,
for example, decision trees. Hence, their base formula can be used with different explanation families,
fidelity functions and complexity measures. On the other hand, they argue that all interpretable models
and choices of representation will have some drawbacks when explaining certain behaviours. Although
the underlying model is treated as a black-box, certain representations won’t be adequate for certain
models. For example, it would be extremely hard to explain a model that predicts filters in images with
the importance of each pixel to the prediction.
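The idea of a locally faithful linear surrogate can be sketched as follows. This is not the LIME implementation: it replaces the sparse linear model with plain weighted least squares, works directly on two continuous features, and uses a hypothetical `black_box` function, a Gaussian perturbation scheme and an exponential proximity kernel, all of which are assumptions made for illustration.

```python
import math
import random


def black_box(x):
    """Stand-in for an opaque model (an assumption made for this sketch)."""
    return 1.0 / (1.0 + math.exp(-(x[0] ** 2 - 2.0 * x[1])))


def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):
        w[r] = (M[r][n] - sum(M[r][c] * w[c] for c in range(r + 1, n))) / M[r][r]
    return w


def local_linear_explanation(f, x, n_samples=500, sigma=0.3, kernel_width=0.5, seed=0):
    """Fit a proximity-weighted linear surrogate to f around the instance x."""
    rng = random.Random(seed)
    d = len(x)
    A = [[0.0] * (d + 1) for _ in range(d + 1)]
    b = [0.0] * (d + 1)
    for _ in range(n_samples):
        z = [xi + rng.gauss(0.0, sigma) for xi in x]       # perturbed sample
        dist2 = sum((zi - xi) ** 2 for zi, xi in zip(z, x))
        weight = math.exp(-dist2 / kernel_width ** 2)      # proximity kernel
        features = [1.0] + [zi - xi for zi, xi in zip(z, x)]
        y = f(z)                                           # query the black box
        for i in range(d + 1):                             # weighted normal equations
            b[i] += weight * features[i] * y
            for j in range(d + 1):
                A[i][j] += weight * features[i] * features[j]
    intercept, *coefficients = solve(A, b)
    return intercept, coefficients


intercept, coefficients = local_linear_explanation(black_box, [1.0, 0.5])
```

The returned coefficients play the role of the explanation: their signs and magnitudes describe how each feature moves the prediction near the chosen instance, and say nothing about the model far away from it.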
Although an explanation of a single prediction contributes to understanding the reliability of the clas-
sifier, the authors argue that it isn’t enough to assess trust in the model as a whole. As a result of that,
they propose giving a global understanding of the model by explaining a batch of individual predictions.
However, they explain that this batch must be carefully picked, since users might not have the time to
examine many explanations and the explanations should be not only diverse but representative of the
model’s behaviour. For that purpose they devised an algorithm called submodular pick. They note
that the coverage achieved by a set of explanations is a submodular set function. That is, a set function with the property
that the difference in the value of the function that a single element makes when added to
the set, decreases as the size of the set increases. In other words, the more elements the set has, the
less of a difference each new element makes. They show that, due to submodularity, a greedy algorithm
that adds the instances with highest marginal coverage of the target model offers a guarantee of a good
approximation (1−1/e of the optimum) of the function [21]. With this, they created SP-LIME (Submodular
pick LIME), a framework that uses LIME to explain entire models, one explanation at a time.
Figure 3.2: An example for LIME. The black-box model’s decision function f is represented by the pink/blue background. A linear model is used to create a local explanation. The bold red plus sign is the instance being explained. LIME samples instances (red plus signs and blue circles), gets predictions using f and weights them by proximity to the original instance. The dashed line is the learned explanation; it is locally (but not globally) faithful. Image sourced from [2]
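The greedy selection behind submodular pick, as described above, can be sketched with a simple weighted coverage function. The data structures, the feature importances and the function names below are hypothetical; the original work derives the importances from the explanations themselves.

```python
def coverage(selected, explanations, importance):
    """Total importance of the features touched by the selected explanations."""
    covered = set()
    for i in selected:
        covered |= explanations[i]
    return sum(importance[f] for f in covered)


def greedy_pick(explanations, importance, budget):
    """Greedily add the instance with the highest marginal coverage gain."""
    selected = []
    for _ in range(min(budget, len(explanations))):
        remaining = [i for i in range(len(explanations)) if i not in selected]
        best = max(
            remaining,
            key=lambda i: coverage(selected + [i], explanations, importance),
        )
        selected.append(best)
    return selected


# Hypothetical data: each explanation is the set of features it uses.
explanations = [{"a", "b"}, {"a"}, {"c"}, {"b", "c"}]
importance = {"a": 3.0, "b": 2.0, "c": 1.0}
picked = greedy_pick(explanations, importance, budget=2)
```

Once the first explanation covers features a and b, an explanation that only repeats feature a adds nothing, which is the diminishing-returns behaviour that submodularity formalizes.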
Another similar approach was developed by Frosst et al. Instead of trying to understand how a
deep neural network makes its decisions, they use the network to train a decision tree that mimics the
function discovered by the network but works in a completely different way [3]. They experimented using
a technique labeled distillation [22] to transfer the knowledge from the network to a type of decision tree that
makes soft decisions [23]. They use soft binary decision trees trained by choosing the depth of the tree
and then using mini-batch gradient descent to update all the parameters of the tree simultaneously. The
model is equivalent to a hierarchical mixture of experts. An example of such a tree trained on MNIST
can be seen in Figure 3.3.
With this technique, the authors can easily provide an explanation for any prediction. They only have
to follow the path with the greatest probability from the root to the leaf. An explanation is the list of both
all the filters along the path and the binary activation decisions on every node. Although very easy to
create, these explanations might prove hard to interpret depending on the format of the data. As we
can see in the example, it isn’t easy to interpret the visual filters used by the decision tree. It is easy to
think of more complex examples where these explanations would quickly become too bothersome and
complicated to interpret.
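The path-following explanation described above can be sketched on a hand-built tree. The sketch assumes that every inner node applies a sigmoid to a learned linear filter to obtain the probability of taking the right branch; the tree, its parameters and the class names are hypothetical and no training is performed.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class Leaf:
    def __init__(self, distribution):
        self.distribution = distribution          # learned class probabilities


class Inner:
    def __init__(self, weights, bias, left, right):
        self.weights, self.bias = weights, bias   # the node's learned "filter"
        self.left, self.right = left, right

    def prob_right(self, x):
        z = sum(w * xi for w, xi in zip(self.weights, x)) + self.bias
        return sigmoid(z)


def explain(node, x):
    """Follow the most probable branch at each node and record the decisions."""
    path = []
    while isinstance(node, Inner):
        p = node.prob_right(x)
        go_right = p >= 0.5
        path.append(("right" if go_right else "left", p))
        node = node.right if go_right else node.left
    return path, node.distribution


# Hypothetical depth-2 tree with hand-set parameters.
tree = Inner([1.0, -1.0], 0.0,
             Inner([0.0, 1.0], 0.0, Leaf([0.9, 0.1]), Leaf([0.6, 0.4])),
             Leaf([0.2, 0.8]))
path, distribution = explain(tree, [2.0, 0.5])
```

The explanation is the recorded list of branch decisions together with the filters of the visited nodes, so its readability depends entirely on how readable those filters are.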
Figure 3.3: Example of a depth 4 decision tree obtained from distilling a NN trained on MNIST. The images at the inner nodes are filters learned from images. The ones at the leaves are visualizations of the learned probability distribution over classes. The most likely classifications at each node are annotated in blue. Image sourced from Frosst et al. [3]
The authors also provide an interesting example of the utility of an explainable model. While exam-
ining the learned filters of a soft decision tree trained from a network that had learned to play Connect4,
they discovered an insight about the game. Inspecting the first filter of the tree revealed that games of
Connect4 can be divided into two separate groups: games where the players started by placing pieces
on the edges of the board and ones where the first pieces were placed in the center of the board. By
distilling the original network into the decision tree they discovered that these two groups played out in
ways so different to one another that the decision tree used that condition as its root node [3].
However, this method has a problem. Models such as decision trees normally do not generalize as
well as deep neural networks. Contrary to the hidden units in a neural network, a node at a lower level
of a decision tree is only used for a tiny part of the training data. This usually causes the lower levels of the
decision tree to overfit. Because of this, the number of parameters at which the soft decision trees start
to overfit is lower than the number at which a deep neural network starts to overfit. This is
noticeable when comparing performance on MNIST.
With a regular soft decision tree of depth 8 trained directly on the real dataset they were able to achieve a
test accuracy of at most 94.45%. They compare this with the performance of a neural network with
two convolutional hidden layers and a penultimate fully connected layer; this network managed to obtain
a much better test accuracy of 99.21%. Afterward, they experimented with training a soft decision tree with
labels that were a conglomerate of true labels and predictions of the neural network. The soft decision
tree trained in this way attained a test accuracy of 96.76%, which is roughly in the middle between the
neural network and the soft decision tree trained directly on the data. On another dataset, the Connect4
dataset, a soft decision tree distilled from a neural network achieved 80.60% test accuracy compared to
78.63% from one trained with real data [24].
All in all, the authors provide a way to reap some of the benefits of deep neural networks while using
a model that can be easily interpreted. In cases like the ones described in the previous chapter, where
being able to explain a prediction is essential, this technique can be used to improve the performance of
a traditional interpretable model.
We believe that none of the aforementioned approaches is a good solution to our problem. LIME,
while an extremely interesting approach that can be applied to any model, has some glaring problems.
First, it tries to explain changes in a model’s behaviour by first order changes in a discrete interpretation
of the input space. In other words, by measuring the changes with perturbations, there’s no guarantee
that the behaviour being modeled is actually from the space surrounding the original input in the target
model’s space. Secondly, for each decision of the model we intend to explain, we must train a new
explainable model. This adds a significant overhead to providing explanations to more complex models
with large runtimes, as the model needs to be run several times in order to generate an explanation.
Distilling a neural network to a soft decision tree, on the other hand, is also a far from ideal solution. It
does not explain neural networks directly, only by proxy through another model. There is no guarantee
that this other model will learn the same distribution as the original network. Furthermore, since the
decision trees can only attain a performance significantly lower than the networks, we must sacrifice
performance of the model for an explanation.
Overall, they both pose difficult questions for adoption. Both of them, although in different ways,
are problematic in terms of both reliability and feasibility of explanations. As interesting as the idea of
utilizing an interpretable model to explain neural networks might sound, it is extremely hard to surpass
the inherent limitations of these models. We believe that, since they do not use any of the unique
mechanisms possessed by neural networks, these explanations are held back by the expressiveness of
the models learning the explanations.
3.3 Inner Model Visualizations
Another type of approach to deciphering neural networks, in contrast to the ones in the previous section,
is based on using the inner mechanisms of the model to get information on how it is making the
prediction. It should be noted, however, that before the recent rise in popularity of Deep Learning, there
had already been some attempts to interpret the networks’ inner workings [25, 26]. In recent
years, fueled by neural networks’ new-found popularity, several new approaches have been proposed.
In 2009, Erhan et al. discovered that, counter-intuitively, it is possible (and simple) to interpret deep
architectures at the unit level [27]. Before their work, the only qualitative method to compare and explore
feature representations learned by deep networks was by examining the “filters” learned by the network [28,29].
That is, the linear weights in the input-to-first layer weight matrix represented in input space. Or in other
words, the pixels and shapes that have the highest weighted connections in the first layer. However, this is
a poor representation of the overall network because we’re only looking at its first layer. Erhan et al.’s
work was the first to develop methodologies to visualize representations of arbitrary layers of a deep
architecture, in this case, Deep Belief Networks [30]. They discovered a curious phenomenon: the
responses of the internal units to images, as a function in image space, are unimodal. Knowing this,
by finding the dominant mode, we can create a good characterization of how the unit is processing
the information. With their methodologies, they were able to verify the intuition that the first layers of a
network are trying to find generic features while the later layers build on these to create more complicated
concepts. This work opened up the pathway to exploring and visualizing the internal representations of
networks. Be that as it may, although these representations are fascinating in their own right and provide
a good insight into the internal reasoning of the networks, they are of little help in interpreting individual
decisions and bolstering end-user trust in these decisions. With recent works such as Olah et al. [31]
pushing this medium forward, trying to combine both visualization and attribution to create rich interfaces
for humans, the gap between feature visualization and feature attribution is closing. It is, however, still
too wide to be considered an effective solution to our problem.
Meanwhile, on the other side of this gap, more or less simultaneously with Erhan et al.’s work, Baehrens
et al. proposed a novel approach for interpreting Bayesian classifiers [32]. Their goal was similar to our
own: to peek into the black-box of modern classifiers and provide an answer to why a certain decision
was made by the classifier. Their approach uses the fact that Bayesian classifiers are optimized using
numerical algorithms such as gradient descent. With that in mind, they use class probability gradients
in order to create local explanations of the classifier. These class probability gradients are in essence
the derivative, with respect to the input features, of the probability of the classifier being wrong. This
creates a vector that “points” away from the chosen class, in essence, providing a weighted vector of the features which influenced the
classifier’s decision. This can easily convey the relevant features for a single prediction, both the ones
that contributed positively and negatively to the final class. At the time, this approach was completely
different from other available techniques. It provided much-needed improvements in reliability and quality
of explanations. It is especially significant because it allows measuring changes in any direction but
most of all, because by using local gradients of the model as the explanation vectors, it always reflects
the model and its assumptions.
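A class probability gradient of this kind can be sketched for the simplest probabilistic classifier. The snippet below uses a binary logistic-regression model as a stand-in (an assumption made for illustration, with hypothetical weights) and differentiates the probability that the label differs from the predicted class with respect to the input.

```python
import math


def prob_class_one(x, w, b):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def explanation_vector(x, w, b):
    """Gradient w.r.t. x of the probability that the label differs from the prediction."""
    p1 = prob_class_one(x, w, b)
    predicted = 1 if p1 >= 0.5 else 0
    slope = p1 * (1.0 - p1)                  # derivative of the sigmoid
    # d/dx P(y = 1 | x) = slope * w; the other class has the opposite gradient.
    sign = -1.0 if predicted == 1 else 1.0
    return predicted, [sign * slope * wi for wi in w]


# Hypothetical classifier and instance.
w, b = [2.0, -1.0, 0.0], 0.0
x = [1.0, 0.5, 3.0]
predicted, vector = explanation_vector(x, w, b)
```

Increasing the first feature lowers the probability of an error, so it supports the predicted class; the second feature works against it, and the third has no local influence at all.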
This type of approach made its way into neural network research when Simonyan et al. [4], based on
Erhan et al.’s and Baehrens et al.’s works, proposed using the gradient of the output with regard to the pixels
of an input image to create a saliency map (Figure 3.4) as a way to interpret deep convolutional neural
networks [33]. It combined two ideas from previous works [27, 32]: using gradients as explanations and creating
explanations in input space. The authors propose that the gradient of the output class with regard
to the pixels of the input image can create a weighted vector of contributions from the input pixels. In
other words, it indicates how much each pixel of the input image contributed to the final classification
output. At the time when this was proposed, the most common approach to interpreting convolutional
networks was a specific network architecture called DeconvNet [34]. This approach required training an
additional network with a specific architecture on top of an existing convolutional neural network. It is
impractical to be required to train specific networks on top of existing networks in order to interpret the
latter. Simonyan et al.’s approach was especially significant because they showed that the reconstruction
of a layer by DeconvNet was in practice equivalent to computing the gradients w.r.t. the input of that layer
(with a small difference on ReLU activation functions). With this, they managed to unify and generalize
a wide spectrum of interpretability research in one work.
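The gradient-based saliency idea can be sketched on a network small enough to differentiate by hand. The example assumes a one-hidden-layer ReLU network with hypothetical weights and a three-"pixel" input; it computes the gradient of a class score with respect to the input and uses its absolute value as the saliency.

```python
def relu(z):
    return z if z > 0.0 else 0.0


def class_score(x, W1, w2):
    """Score of one class for a tiny one-hidden-layer ReLU network."""
    pre = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    return sum(w * relu(p) for w, p in zip(w2, pre)), pre


def saliency(x, W1, w2):
    """Absolute gradient of the class score w.r.t. every input "pixel"."""
    _, pre = class_score(x, W1, w2)
    # Backward pass: a ReLU passes the gradient only where its input was positive.
    d_pre = [w if p > 0.0 else 0.0 for w, p in zip(w2, pre)]
    grad = [sum(d * row[i] for d, row in zip(d_pre, W1)) for i in range(len(x))]
    return [abs(g) for g in grad], grad


# Hypothetical weights and input.
W1 = [[1.0, 0.0, 0.5], [-1.0, 1.0, 0.0], [0.0, -1.0, 1.0]]
w2 = [1.0, 2.0, -1.0]
x = [1.0, 2.0, -1.0]
saliency_map, grad = saliency(x, W1, w2)
```

A single backward pass yields the whole map, which is what makes this approach so much cheaper than training an auxiliary network.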
Figure 3.4: The maps were extracted using a single back-propagation pass through a classification ConvNet. These maps represent the top predicted class in ILSVRC-2013 test images. Image sourced from Simonyan et al. [4]
Since the introduction of this method, several other authors have proposed different gradient based
interpretability methodologies. Springenberg et al., unsatisfied with the different ways default gradients
and DeconvNets treated ReLUs, suggested further homogenizing both approaches and proposed
guided backpropagation [35]. This approach is mostly equivalent to computing gradients except that any
gradients that become negative during the backward pass are discarded. The issue with the original
approach of doing a backpropagation of the importance using gradients is that the gradient coming into
a ReLU during the backward pass is zeroed if the input to the ReLU had been negative on the forward
pass. By contrast, in DeconvNet the gradient is only zeroed if it is negative, with no regard to the sign of
the input to the ReLU during the forward pass. Springenberg et al. combined both and zero out the
signal if either the input in the forward pass or the gradient in the backward pass is negative. Although
this method achieved valuable results in image segmentation, due to the exclusion of negative gradients,
it fails to highlight inputs that contribute negatively to the output if they pass through a ReLU unit at
any point in the network.
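The three ways of treating a ReLU during the backward pass, as described above, differ only in the condition under which the incoming signal is kept. The following sketch makes that explicit for a single unit; the function and rule names are illustrative assumptions.

```python
def relu_backward(forward_input, incoming_gradient, rule):
    """Backward signal through a single ReLU under three different rules."""
    if rule == "gradient":          # plain backpropagation
        keep = forward_input > 0.0
    elif rule == "deconvnet":       # looks only at the backward signal
        keep = incoming_gradient > 0.0
    elif rule == "guided":          # guided backpropagation: both conditions
        keep = forward_input > 0.0 and incoming_gradient > 0.0
    else:
        raise ValueError(f"unknown rule: {rule}")
    return incoming_gradient if keep else 0.0


# A unit that was active on the forward pass but receives a negative gradient.
active_negative = {
    rule: relu_backward(1.0, -2.0, rule)
    for rule in ("gradient", "deconvnet", "guided")
}
```

Only plain backpropagation lets the negative signal through in this case, which is exactly the information guided backpropagation gives up.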
Other approaches to solving this problem include Layerwise Relevance Propagation (LRP) [36], an
approach for calculating importance scores in a layer-by-layer approximation of backpropagation. This
method was later demonstrated to be equivalent, within a scaling factor, to an elementwise product between
the saliency maps of Simonyan et al. and the input [37, 38]. These approaches have the advantage
of leveraging the sign and strength of the input. However, they are very difficult to apply outside an
image classification domain since other inputs, such as categorical features, aren’t as standardized or numerically
well behaved. Sundararajan et al. suggested that instead of computing the gradients only at the current
value of the input, we could integrate the gradients as inputs are scaled up from a certain starting
value [18]. Although this addresses a fundamental limitation of gradients, which is neuron saturation,
it adds a significant computational overhead to obtain high-quality integrals. Furthermore, other works
have shown that this approach can still provide highly misleading results [5].
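Both the gradient-times-input product and the integration of gradients along a path can be sketched on a small saturating model. The model, the all-zeros starting value and the number of integration steps below are hypothetical choices; the integral is approximated with a midpoint Riemann sum, which also shows where the computational overhead comes from.

```python
import math


def model(x):
    """Hypothetical saturating model: a sigmoid over a weighted sum."""
    return 1.0 / (1.0 + math.exp(-(4.0 * x[0] + 1.0 * x[1])))


def gradient(x):
    y = model(x)
    slope = y * (1.0 - y)
    return [4.0 * slope, 1.0 * slope]


def gradient_times_input(x):
    return [g * xi for g, xi in zip(gradient(x), x)]


def integrated_gradients(x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of the path integral of the gradients."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point)):   # one gradient evaluation per step
            totals[i] += g / steps
    return [(xi - b) * t for xi, b, t in zip(x, baseline, totals)]


x, baseline = [3.0, 2.0], [0.0, 0.0]
attributions = integrated_gradients(x, baseline)
```

At this input the sigmoid is saturated, so the gradient-times-input scores are almost zero, while the integrated attributions add up, to within the approximation error, to the difference between the output at the input and at the starting value.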
Finally, Shrikumar et al. recently proposed DeepLIFT (Deep Learning Important FeaTures), an algorithm
used for assigning importance scores (weights) to the inputs of a network for a given output [5].
Although it is still a gradient based method, contrary to others it doesn’t rely solely on a single gradient
calculation (or on an integral of gradients). DeepLIFT calculates feature contribution by comparing the output
in question to a “reference” output obtained from a “reference” input. This “reference” input represents
some default or what the authors describe as “neutral” input. This “reference” is chosen according to the
problem at hand and can vary wildly from model to model. Then, just like other approaches, DeepLIFT
backpropagates the contributions of every hidden unit (to the output) in the network to each feature of
the input, effectively creating a representation of how each feature contributed to that prediction. It then
obtains the final explanation by identifying the differences of these backpropagated values to the ones on
the “reference” input.
The explanations that DeepLIFT produces can be thought of as a difference-from-reference: the bigger
the difference, the more that feature contributed to that output. Using this difference-from-reference,
DeepLIFT sidesteps one of the fundamental limitations of gradient based methods, model saturation,
which is illustrated in Figure 3.5. Even when the derivative of a neuron with respect to a neuron in a
previous layer is zero it might still be signaling relevant information. While the gradient would be zero,
the difference-from-reference might not be.
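The contrast between a zero gradient and a non-zero difference-from-reference can be sketched on a two-input network that saturates once i1 + i2 exceeds 1. The network below (h = max(0, 1 − i1 − i2), y = 1 − h) is an assumption chosen to reproduce that behaviour, and the ratio-of-differences multiplier is a simplified illustration in the spirit of DeepLIFT, not the authors' implementation.

```python
def relu(z):
    return z if z > 0.0 else 0.0


def network(i1, i2):
    """Returns the pre-activation, the hidden activation and the output."""
    net = 1.0 - i1 - i2
    h = relu(net)
    return net, h, 1.0 - h


def gradient(i1, i2):
    net, _, _ = network(i1, i2)
    # dy/dh = -1, dh/dnet = 1 if net > 0 else 0, dnet/di = -1
    d = 1.0 if net > 0.0 else 0.0
    return [d, d]


def difference_from_reference(inputs, reference):
    net, h, y = network(*inputs)
    net_ref, h_ref, y_ref = network(*reference)
    delta_net = net - net_ref
    # A ratio of differences replaces the local derivative of the ReLU.
    multiplier = (h - h_ref) / delta_net if delta_net != 0.0 else 0.0
    # dnet/di = -1 and dy/dh = -1, so each input contributes
    # (-1) * multiplier * (-1) * (input - reference).
    contributions = [multiplier * (x - r) for x, r in zip(inputs, reference)]
    return contributions, y - y_ref


saturated_gradient = gradient(1.0, 1.0)
contributions, delta_y = difference_from_reference([1.0, 1.0], [0.0, 0.0])
```

With an all-zeros reference, each input is credited with half of the change in the output even though the gradient at that point is zero for both.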
To experiment on how DeepLIFT fares against other explanation methods that use backpropagation,
the authors trained a convolutional neural network on MNIST to perform digit classification [39].
Figure 3.5: Both perturbation based approaches like LIME and gradient based approaches fail to model saturation. Here illustrated is a simple network that experiences saturation from its inputs. When both i1 = 1 and i2 = 1, perturbing either of them to 0 will not produce a change in the output. Similarly, the gradient of the output w.r.t. the inputs is also zero when i1 + i2 > 1. Image sourced from Shrikumar et al. [5]
To compare the different importance scores obtained by different methods, they try to identify which pixels
to erase to convert the image to another target class. They pick up to 157 (20%) of the image’s pixels
for each explanatory method. They are chosen in order of their importance scores as determined by
the different methods. Finally, they evaluate the change in the log-odds score between the original and
target classes, comparing the original image with the one where the pixels have been erased. For these
images, they defined the reference input as all zeros, and DeepLIFT outperformed all other
backpropagation-based methods.
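The evaluation protocol described above can be sketched as follows. Everything concrete here is an assumption made for illustration: a random linear softmax classifier stands in for the convolutional network, a random vector stands in for an MNIST digit, and the importance scores are each pixel's contribution to the log-odds against an all-zeros reference (which is exact only because the stand-in model is linear). It shows the mechanics of the test, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

n_pixels, n_classes = 784, 10           # MNIST-sized input, stand-in model
W = rng.normal(size=(n_classes, n_pixels))
b = rng.normal(size=n_classes)
image = rng.uniform(size=n_pixels)      # stand-in for a real digit

def logits(x):
    return W @ x + b

def log_odds(x, original, target):
    # For a softmax classifier, log(p_original / p_target) is the logit difference.
    z = logits(x)
    return z[original] - z[target]

original = int(np.argmax(logits(image)))
target = (original + 1) % n_classes     # arbitrary target class

# Stand-in importance scores: each pixel's contribution to the log-odds,
# measured against an all-zeros reference (exact for this linear model).
scores = (W[original] - W[target]) * image

# Erase (set to zero) up to 157 pixels, most important first,
# keeping only pixels that actually favour the original class.
k = 157
ranked = np.argsort(-scores)[:k]
chosen = ranked[scores[ranked] > 0]
erased = image.copy()
erased[chosen] = 0.0

change = log_odds(image, original, target) - log_odds(erased, original, target)
print("pixels erased:", chosen.size, "log-odds change:", round(float(change), 3))
```

A method whose scores rank pixels well produces a larger drop in log-odds for the same pixel budget, which is how the different explanation methods are compared.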
Overall, gradient-based methods have shown a lot of promise in the field of deep neural
network interpretability. By utilizing the internal mechanisms of the networks to create explanations, we
bypass one of the most serious deterrents of the approaches we discussed in the previous chapter:
the limitations of other models. Using backpropagation we can create explanations that are completely
faithful to the original model. On the other hand, almost all the existing research on these methodologies
has been focused on networks that work in the image domain. This means that, although very promising,
the real-world application of these approaches has been lagging behind research.
4 Backpropagating Feature Importance
When we began working on the issue of unwrapping the black box that is the modern deep neural
network, we had a clear goal: to be able to provide instance-based explanations. For these
explanations to be useful to a human with domain knowledge, they should reveal the causal relationship
between the input received by the network and its output. In other words, they should provide a clear picture of
the reasons behind a decision. The original approach we proposed to solve this problem was to try to
create a linear representation of the local behaviour of the model. Our proposal would start by creating a
new dataset by perturbing the original input data point and using the network to label the newly acquired
data points. Then, we would train a linear regression on this new dataset. For this, we proposed using
a Least Absolute Shrinkage and Selection Operator regression, or LASSO for short. With this
linearization around the input point, we hoped to provide an accurate local representation of the network’s
behaviour. We note, again, that local fidelity does not imply global fidelity, and features that are globally
important may not be important in a local context, and vice versa.
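A minimal sketch of this initial proposal is given below, under stated assumptions: `black_box` is a hypothetical stand-in for a trained network, the perturbations are Gaussian with an arbitrarily chosen scale, and ordinary least squares is used in place of LASSO to keep the example dependency-free. Note that the model has to be evaluated once per perturbed point.

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box(X):
    # Hypothetical stand-in for a trained network: a smooth non-linear score.
    return np.tanh(2.0 * X[:, 0] - X[:, 1] ** 2)

x0 = np.array([0.2, 0.5])   # the input whose prediction we want to explain

# 1. Build a new dataset by perturbing the input point (scale is an assumption).
n_samples = 500
X_pert = x0 + rng.normal(scale=0.05, size=(n_samples, x0.size))

# 2. Label every perturbed point with the model: one model call per sample.
y_pert = black_box(X_pert)

# 3. Fit a linear surrogate around x0.
#    Ordinary least squares is used here in place of LASSO.
A = np.hstack([X_pert - x0, np.ones((n_samples, 1))])
coef, *_ = np.linalg.lstsq(A, y_pert, rcond=None)
weights, intercept = coef[:-1], coef[-1]

print("surrogate weights:", weights)
print("surrogate intercept:", intercept)
```

The surrogate weights play the role of local feature importances; their quality depends on how the perturbation scale and sample count are chosen, neither of which is known in advance for a given network.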
However, despite our initial conviction in this methodology, we quickly realized that it had some glaring
flaws. When analyzing a network, we don’t know its function surface; because of that, we can’t
be sure how the output varies with changes to the input. So by using perturbations to create the new
dataset from which local behaviour will be learned, we risk creating a dataset of points that are distant
from and not representative of the original prediction. Also, training this linear regression would require a
separate execution of the network on each perturbation. This meant that the cost of calculating an
explanation would depend on the time it took to run the network and the number of perturbations
made to the data. Taking all of this into account, we decided to change our approach to the linearization
of the network and the approximation of its local behaviour.
A feedforward artificial Neural Network is organized in layers with each layer performing different
transformations on its inputs. The goal of these networks is to approximate a certain function f∗. Each
of its layers can be seen as a function. This in turn means that a network with n layers can be seen as
a chain of functions:
$$f(x) = f_n(\ldots f_2(f_1(x)))$$
In order to train these networks, the backpropagation algorithm is used. Backpropagation
calculates the gradient of the loss function w.r.t. the weights of the connections between the layers of the
network. This gradient is employed by the gradient descent optimization algorithm to adjust the weights
of the network. Assuming x as a vector of inputs and θ as
the parameters of the network, the gradient at time t can be calculated as:
$$\nabla = \frac{\partial \mathrm{Loss}(x, \theta_t)}{\partial \theta}$$
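As an illustration of this gradient computation, the sketch below backpropagates through a hypothetical two-layer network f(x) = f2(f1(x)). The layer sizes, the tanh activation, the squared-error loss and the learning rate are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer network with parameters theta = (W1, W2).
W1 = rng.normal(size=(3, 2))
W2 = rng.normal(size=(1, 3))
x = np.array([0.5, -1.0])
y = np.array([1.0])

def loss(W1, W2):
    h = np.tanh(W1 @ x)           # f1
    out = W2 @ h                  # f2
    return 0.5 * np.sum((out - y) ** 2)

# Forward pass, keeping intermediate values for the backward pass.
h = np.tanh(W1 @ x)
out = W2 @ h

# Backward pass: apply the chain rule layer by layer, from the loss to the weights.
d_out = out - y                   # dLoss/d(out)
grad_W2 = np.outer(d_out, h)      # dLoss/dW2
d_h = W2.T @ d_out                # dLoss/dh
d_pre = d_h * (1.0 - h ** 2)      # through the tanh non-linearity
grad_W1 = np.outer(d_pre, x)      # dLoss/dW1

# One gradient-descent update with an assumed learning rate.
lr = 0.01
W1_new = W1 - lr * grad_W1
W2_new = W2 - lr * grad_W2

print("dLoss/dW1:\n", grad_W1)
print("dLoss/dW2:\n", grad_W2)
```

The same backward pass can be continued one step further, from the first layer to the input itself, which is the quantity used for explanations later in this chapter.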
However, backpropagation is merely a special case of a more general technique called automatic
differentiation. In essence, automatic differentiation iteratively computes gradients for each layer using
the chain rule. Although this is commonly used as a way to train a network such that it can learn appro-
priate internal representations that allow it to approximate f∗, as we saw in the previous chapter other
uses, such as network interpretation, have been proposed for gradients. Here, we propose generalizing
the method by Simonyan et al. [4] to be used in a wider spectrum of networks. Their original proposal,
and most subsequent works [18, 35, 38, 40], apply this methodology solely to image classification
networks with convolutional layers. With this work we intend to explore a generalization of this approach
that can be applied to any network architecture. We hope to provide an easy-to-use way for deep
learning practitioners to interpret their networks and extract relevant information from their predictions.
Specifically, consider a system that classifies some input into a class from a set C. Given an input x,
many networks will compute a class activation function Sc for each class c ∈ C. The final result will be
determined by which class has the highest score.
$$\mathrm{class}(x) = \arg\max_{c \in C} S_c(x)$$
We define an explanation vector of a data point x0 as the gradient of the output w.r.t. the input x at
x = x0. With that, we estimate a weighted vector of feature importances. If Sc is piecewise differentiable
w.r.t. x, for all classes c and over the entire input space, one can construct a feature importance vector
Ic(x0) by computing:
$$I_c(x_0) = \left. \frac{\partial S_c(x)}{\partial x} \right|_{x = x_0} \tag{4.1}$$
This explanation vector is a local n-dimensional representation of the behaviour surrounding the
decision point on the surface of the function learned by the network. By having a representation of the
behaviour around that point of the manifold we can measure the sensitivity to change of the function
value with respect to changes in its arguments.
In a more intuitive manner, Ic represents how much difference a change in each feature of x would
cause in the classification score of class c. Thus, entries in Ic with large absolute values highlight
features that will influence the label of x0. On the other hand, the sign of the individual entries expresses
whether the prediction would increase or decrease when the corresponding feature is increased locally.
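A minimal sketch of Equation 4.1 on a small fully connected network follows. The weights are random placeholders for a trained model, and the architecture (4 features, 5 tanh hidden units, 3 classes) is an assumption; the point is only that a single backward pass from the class score to the input yields one signed importance per feature.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical "trained" network: 4 input features, 5 hidden units, 3 classes.
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)

def class_scores(x):
    # S_c(x) for every class c.
    return W2 @ np.tanh(W1 @ x + b1) + b2

def explanation_vector(x0, c):
    # I_c(x0) = dS_c/dx evaluated at x = x0, obtained with one backward pass.
    h = np.tanh(W1 @ x0 + b1)
    d_h = W2[c]                      # dS_c/dh
    d_pre = d_h * (1.0 - h ** 2)     # through the tanh non-linearity
    return W1.T @ d_pre              # dS_c/dx

x0 = np.array([0.3, -0.7, 1.2, 0.1])
c = int(np.argmax(class_scores(x0)))   # predicted class
I_c = explanation_vector(x0, c)

# Rank features by absolute importance and report the sign of their influence.
for j in np.argsort(-np.abs(I_c)):
    direction = "raises" if I_c[j] > 0 else "lowers"
    print(f"feature {j}: {I_c[j]:+.3f} (increasing it {direction} the score of class {c})")
```

In a deep learning framework the manual backward pass would be replaced by the framework's automatic differentiation, with the input rather than the weights marked as the variable to differentiate against.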
These explanations can also be motivated with a simple example. Consider a linear model for the
class c:
$$S_c(x) = w_c^T x + b_c$$
assuming x is represented in a vectorised form, and wc and bc are respectively the weight vector and
the bias of the model. In this case, we can easily see that the magnitude and sign of the elements of wc
define the importance of the corresponding features of x for the class c. In the case of deep Neural
Networks, the class score Sc(x) is a highly non-linear function of x. Because of this, it is extremely hard
to apply this reasoning directly. However, given an input x0, we can approximate Sc(x) with a linear function in
the neighbourhood of x0 by computing the first-order Taylor expansion:
$$S_c(x) \approx d^T x + b$$
where d is the derivative of Sc w.r.t. the input x at the point x0, as defined in Equation 4.1, and b collects
the constant terms of the expansion. The gradient of a function at a point provides the coefficients of the
best linear approximation of that function around that point, that is, of its Taylor polynomial of degree 1. As such,
the gradient can be seen as a linearization of the function and, in essence, as a
feature attribution. We provide a simple example in the following figure:
Figure 4.1: Explanation of individual classifications of a simple Neural Network trained on a toy problem. On the top we have the dataset and the learned decision curve, left and right respectively. On the bottom, the visualization of the local explanation for points equally spaced throughout the dataset. On the left, the real values of the importance of each feature and on the right a normalized version. We further explain this image in the text below.
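For reference, the linearization behind these explanation vectors can be written out explicitly. This is the standard first-order Taylor expansion of the class score around x0, with d denoting the gradient of Equation 4.1; the explicit form of the constant term is our reading and is not spelled out in the text.

```latex
\begin{align*}
S_c(x) &\approx S_c(x_0) + d^{T}(x - x_0),
  \qquad d = \left.\frac{\partial S_c(x)}{\partial x}\right|_{x = x_0} \\
       &= d^{T}x + \underbrace{S_c(x_0) - d^{T}x_0}_{b}
\end{align*}
```

For the linear model S_c(x) = w_c^T x + b_c the expansion is exact, with d = w_c and b = b_c, which recovers the reading of the weights as feature importances.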
On the top left corner of Figure 4.1 we can see the training data for the simple classification task. On
the top right is the model learned using a simple network which outputs the probability of the point
belonging to the class of yellow points. On the bottom left corner we can see the probability gradient (background)
together with local explanation vectors generated by calculating the gradient of the output w.r.t. the input.
We can see that along the borders of the triangle the explanations point overwhelmingly towards the
triangle class. Furthermore, it is noticeable that along the hypotenuse and at the corners of the triangle
both features are considered important for the explanation, while along the edges the importance is
dominated by one of the two features. In the bottom right we can see the direction of the explanation vectors
represented without regard for their magnitude.
This method of creating a linear representation of the network’s behaviour has several improvements
over our initial solution. The most prominent one is the fact that since we use local gradients of the trained
model as explanatory vectors, they will always follow the model and its assumptions, providing a perfectly
faithful explanation. Furthermore, while perturbations would need to be handled in a myriad of ways
depending on the type of data being analyzed (categorical, continuous, sequential, etc), this approach
can consider any type of feature without any predefined structure. Lastly, by capturing the gradient of the
model we can experiment and measure the influence of changes in any direction, including weighted
combinations of variables. Perturbations would require us to analyse and experiment with different
individual features, omissions of sets of features or various combinations. Each of these would involve
running the model with the perturbed data, increasing the complexity of finding an explanation. With
the aforementioned approach we have perfect freedom to measure the impact of variables without an
increase in the complexity of finding the explanation.
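The claim about measuring influence in any direction can be illustrated with directional derivatives. In the sketch below `score` is a hypothetical smooth class score and `score_gradient` is its analytic gradient, standing in for what backpropagation would return; the directions are arbitrary examples. The gradient is computed once and reused for every direction, with no further model evaluations.

```python
import numpy as np

def score(x):
    # Hypothetical smooth class score standing in for S_c.
    return np.tanh(x[0] - 2.0 * x[1]) + 0.5 * x[2] ** 2

def score_gradient(x):
    # Analytic gradient of the stand-in score (what backpropagation would return).
    s = 1.0 - np.tanh(x[0] - 2.0 * x[1]) ** 2
    return np.array([s, -2.0 * s, x[2]])

x0 = np.array([0.1, 0.2, -0.4])
grad = score_gradient(x0)            # computed once

# Any direction, including weighted combinations of features, reuses the same gradient.
directions = {
    "feature 0 alone": np.array([1.0, 0.0, 0.0]),
    "features 0 and 1 together": np.array([1.0, 1.0, 0.0]),
    "0.5*f0 - 0.25*f2": np.array([0.5, 0.0, -0.25]),
}
for name, v in directions.items():
    print(f"{name}: predicted rate of change {grad @ v:+.4f}")
```

These rates are first-order estimates, so they describe small changes around x0; a perturbation-based approach would instead need a new model evaluation for every direction examined.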
However, there are some inherent problems in our approach that should be discussed. The first
is something intrinsic to any method of providing a local explanation. Visualizing a single point in an
n-dimensional space provides a limited view into the workings of a model. Because of the nature of such
a space, we can’t guarantee that the behaviour of the net is maintained throughout its function surface.
As a result, extremely similar data points might have completely different explanations from the network.
This means that humans must look at all explanations produced; merely looking at some representative
examples and assuming that we’ve understood the behaviour might be a very costly error. Finally, this
interpretability method exhibits another problem that we previously discussed, neural saturation
(Figure 3.5). This is an issue to which different researchers assign varying degrees of importance. On
one hand, in an example such as the one in the aforementioned figure, gradient-based methods
will fail to assign any importance to any of the input features. On the other hand, it is extremely hard to
imagine an example where a wider network could have most neurons saturated at the same time. Unless
that is the case, the remaining neurons and weights of the network should be enough to measure feature
importance.
Overall, we believe that our approach is one of the most reliable and efficient ways of interpreting
deep Neural Networks. By using backpropagation to explain the behaviour of the net, we not only manage
to do it in a single backward pass, but we also provide explanations that are completely faithful to the model’s
behaviour.
5 Results
Contents
5.1 Medical Domain
5.2 Industrial Domain
5.3 DeepDIVE: Deriving Importance VEctors
In order to test the viability of our approach, we need to test its performance on real networks. We
decided to carry out these tests on datasets that are in some way representative of the key areas where
interpretability is needed the most. With these experiments, we intend to put to the test the feasibility of
utilizing our approach in real-world applications. Our method must prove to be fast, reliable and capable
of providing relevant information to a human with domain knowledge. However, there are some problems
with the current state of interpretability research.
Contrary to other popular areas in machine learning, such as natural language processing or computer vision,
there are no standardized tests for interpretability. Each new paper or approach that is suggested
defines new evaluation metrics and experiments. This creates a fragmented research landscape, as
approaches are evaluated in different ways, allowing authors to choose the light they want to show
them in. Although we will not suggest a standardized test for interpretability in the context of this work,
we try our best to follow best practices and experiments laid out by other recognized authors in the
field [5, 31, 40]. Nonetheless, there is another problem intrinsic to evaluating explanations of neural
network decisions: testing faithfulness to the model. Most neural networks are trained to solve problems
in an n-dimensional space, which is extremely hard for us to comprehend. Because of this, when we are
looking at an explanation, how can we verify that the explanation is faithful to the underlying model? We
can see if it makes sense to someone with domain knowledge, but how can we make sure that those
are the features the model is actually using?
We suggest a straightforward way to test the faithfulness and fidelity of the explanations to the un-
derlying model. We outline a simple, theoretically solved problem and train a network to solve it. Having
the theoretical solution to the problem we can then analyze the error surface of the model and compare
it to the theoretical solution to make sure it is learning the correct pattern. If it is, when we create expla-
nations for its error surface we can verify if these explanations are correct by comparing them with the
feature attribution of the theoretical solution.
We chose two different areas to focus our experiments on, both of which are areas where
interpretability can have a massive influence on deep learning adoption. One of the areas where interpretability
research is of paramount importance is machine learning in the medical domain. Any decision related
to human health is one that should be taken responsibly and with ethical concerns in mind. Decisions
need to be evaluated individually and pondered on by a human who will take responsibility for them.
This creates an extremely difficult problem for AI adoption in the field, especially in the current situation
of neural networks being black boxes. With this in mind, we decided it would be a splendid domain to
test our interpretability approach.
The other domain we decided to tackle is one that is attracting increasing attention from AI research:
machine automation for industrial processes. Automation has been playing a pivotal role in the twenty-first
century’s race for factory efficiency. More and more factories are automating a majority of their
processes with both software and hardware. New “lights out” factories (fully automated factories that require
no human to be present, and can hence work with the lights off) are being introduced yearly.
Until recently, most of the software controlling robotic hardware in factories was tailor-made. In recent
years, however, advances in reinforcement learning and deep learning have proved to be far more
efficient at teaching machines complex behaviour. These advances could be used to further improve
factory automation and even introduce robots in environments where human-defined programming is
not adequate. But once again, these efforts are slowed down by the black-box nature of deep learning.
Although perhaps less immediately than in the medical domain, it is easy to imagine a wide array of situations where unexpected
behaviour from an autonomous robot might have severe economic and safety consequences.
5.1 Medical Domain
5.1.1 Context
Machine learning applied to healthcare is booming. Every year several new approaches are suggested
to solve pattern-detection problems such as detecting pneumonia or tuberculosis in X-rays [41, 42], and
new surveys analyse the impact this technology might have on the way healthcare works. But although
some of these systems already achieve state-of-the-art results and have much better performance than
human experts, adoption has been slow [43]. Interpretability, or lack thereof, has been one of the main
reasons for the reluctant adoption of deep learning in healthcare. As with anything related to human
health and security, there is an enormous amount of uneasiness in letting a system we do not fully
comprehend make decisions that will directly influence our health. Furthermore, it is easy to imagine
examples such as the one by Caruana et al. [10], where seemingly high-performing models are making
a mistake and invalidating their own results. In healthcare, this cannot happen.
With this in mind, we set out to find a dataset we could use that would be representative of the real-world
circumstances of a machine learning problem in the medical domain. Luckily, concurrently with our
work, another student from our home institution (Instituto Superior Técnico) was pursuing a thesis in this
field [44]. Their work was based on a set of medical questionnaires from the biggest pediatric hospital
in Portugal, Hospital D. Estefânia. These were filled in by doctors to assess risk factors in children being
evaluated for mental health issues. The work of our colleagues was to digitize, analyse and propose
changes to these questionnaires. They tried to identify redundant questions by training a Decision Tree
on the data, minimizing the number of inner nodes (while maintaining performance) and pinpointing the
questions whose nodes were removed from the tree. This created a valuable parallel to our own work.
Not only did they create a dataset that could illustrate a real healthcare issue, they also analysed the
data and identified