
  • Interpreting Deep Neural Networks Through Backpropagation

    Miguel Coimbra Vera

    Thesis to obtain the Master of Science Degree in

    Information Systems and Computer Engineering

    Supervisors: Prof. Manuel Fernando Cabido Peres Lopes, Prof. Francisco António Chaves Saraiva de Melo

    Examination Committee

    Chairperson: Prof. Alberto Manuel Rodrigues da Silva; Supervisor: Prof. Manuel Fernando Cabido Peres Lopes

    Members of the Committee: Prof. Rui Miguel Carrasqueiro Henriques

    May 2019

  • Abstract

    Deep Learning is one of the fastest growing technologies of the last five years, which has led to its

    introduction to a wide array of industries. From translation to image recognition, Deep Learning is being

    used as a one-size-fits-all solution for a broad spectrum of problems. This technology has one crucial

    issue: although its performance surpasses all previously existing techniques, the internal mechanisms

    of Neural Networks are a black-box. In other words, despite the fact that we can train the networks

    and observe their behaviour, we can’t understand the reasoning behind their decisions or control how

    they will behave when faced with new data. While this is not a problem in domains where mistakes can be afforded, the consequences can be catastrophic in high-risk domains such as healthcare and finance. In this thesis we

    explore a generic method for creating explanations for the decisions of any neural network. We utilize

    backpropagation, an internal algorithm common across all Neural Network architectures, to understand

    the correlation between the input to a network and the network’s output. We experiment with our method

    in both the healthcare and industrial robotics domains. Our methodology allows us to analyse

    these previously opaque algorithms and compare their interpretations of the data with more interpretable

    models such as decision trees. Finally, we release DeepdIVE (Deriving Importance VEctors), an open-

    source framework for generating local explanations of any neural network programmed in Pytorch.

    Keywords

    Deep Learning; Interpretability; Machine Learning; Artificial Intelligence; Neural Networks;

  • Resumo

    Deep Learning, one of the technologies that has grown the most in recent years, has been introduced into a wide range of industries. From translation to image recognition, Deep Learning has been used as a solution for all kinds of problems. However, this technology has a fundamental problem: although its performance surpasses that of previously existing techniques, the internal mechanisms of Neural Networks are a black box. In other words, although we can train these networks and test their behaviour, we cannot understand the reasons behind each individual decision or control how they will behave when confronted with new data. While this is not an urgent problem in a research context, where these networks are applied to research problems, the consequences can be catastrophic when real-life scenarios are at stake. In this thesis we explore a generic method for creating explanations for the decisions of any neural network. We use backpropagation, an internal algorithm used by the vast majority of Neural Networks, to understand the causal relationship between the data received by a network and its decisions. We carried out experiments to test our methodology on networks applied both in a healthcare and in an industrial context. Using our method, it is possible to analyse these previously opaque networks and check their internal representations of the data against domain knowledge. Finally, we released DeepdIVE (Deriving Importance VEctors), an open-source tool for generating explanations of the decisions of any Neural Network implemented in Pytorch.

    Keywords

    Interpretability; Machine Learning; Artificial Intelligence; Neural Networks;

  • Contents

    1 Introduction
    1.1 Introduction
    1.2 Problem
    1.3 Hypothesis and Contributions
    2 Background
    2.1 Neural Networks
    3 Related Work
    3.1 Interpretability
    3.2 Interpretable Models as Explanations
    3.3 Inner Model Visualizations
    4 Backpropagating Feature Importance
    5 Results
    5.1 Medical Domain
    5.1.1 Context
    5.1.2 Data Analysis
    5.1.3 Network Architectures
    5.1.4 Results
    5.2 Industrial Domain
    5.2.1 Context
    5.2.2 Data Analysis
    5.2.3 Network Architectures
    5.2.4 Results
    5.2.4.A Using Gradients for Visualization
    5.2.4.B Gradients as Explanation
    5.3 DeepDIVE: Deriving Importance VEctors
    5.3.1 Architecture
    5.3.2 Examples

  • 6 Conclusion and Future Work
    6.1 Future Work
    6.2 Conclusion
    A Appendix A

  • List of Figures

    2.1 On the left, a 2-layer Neural network, with one hidden layer and one output layer. On the right, a 3-layer (deep) neural network with two hidden layers and one output layer. Note that there are connections between neurons across layers, never within a layer. Image sourced from CS231n [1].

    3.1 This chart represents the traditional view of the trade-off between interpretability and performance of a model. Although there is definitely truth to this statement, it is merely a heuristic.

    3.2 An example for LIME. The black-box model’s decision function f is represented by the pink/blue background. A linear model is used to create a local explanation. The bold red plus sign is the instance being explained. LIME samples instances (red plus signs and blue circles), gets predictions using f and weights them by proximity to the original instance. The dashed line is the learned explanation; it is locally (but not globally) faithful. Image sourced from [2]

    3.3 Example of a depth 4 decision tree obtained from distilling a NN trained on MNIST. The images at the inner nodes are filters learned from images. The ones at the leaves are visualizations of the learned probability distribution over classes. The most likely classifications at each node are annotated in blue. Image sourced from Frosst et al. [3]

    3.4 The maps were extracted using a single back-propagation pass through a classification ConvNet. These maps represent the top predicted class in ILSVRC-2013 test images. Image sourced from Simonyan et al. [4]

    3.5 Both perturbation-based approaches like LIME and gradient-based approaches fail to model saturation. Here illustrated is a simple network that experiences saturation from its inputs. When both i1 = 1 and i2 = 1, perturbing either of them to 0 will not produce a change in the output. Similarly, the gradient of the output w.r.t the inputs is also zero when i1 + i2 > 1. Image sourced from Shrikumar et al. [5]

  • 4.1 Explanation of individual classifications of a simple Neural Network trained on a toy problem. On the top we have the dataset and the learned decision curve, left and right respectively. On the bottom, the visualization of the local explanation for points equally spaced throughout the dataset. On the left, the real values of the importance of each feature and on the right a normalized version. We further explain this image in the text below.

    5.1 Data for this problem was extremely skewed, with the least imbalanced class (Risk of Suicide) having close to a 5:1 ratio and the most imbalanced class having around a 10:1 ratio between Not Very High and Very High risk.

    5.2 The architecture of ReLUSigNET. Dropout indicates dropout layers and LL indicates fully connected linear layers. We trained the network with an 80/20 train/test split chosen randomly.

    5.3 The figure on the left demonstrates the robot arm simulated with equation 5.1. The arm has a central hinge and one more joint, controlled by θ1 and θ2 respectively. Both of these joints can rotate 360◦. This can be visualized in the figure on the right, which shows ten thousand randomly selected data points from the dataset.

    5.4 The architecture of the ReLUTan network. Since the training and test set had exactly the same distribution (an uncommon trait for machine learning problems) regularization didn’t play a role as important as in the medical questionnaire problem.

    5.5 Representation in output space of bias towards one of the angles, w.r.t the x axis on the left and w.r.t the y axis on the right. These graphs were created using a threshold of 1.5 for importance bias. In other words, a point is marked as biased if the importance of one angle is at least 1.5x the importance of the other.

    5.6 Comparison of dx/dθ1 and dx/dθ2 between the theoretical values and ReLUTan. Although the accuracy of ReLUTan is extremely high, we can see that the fact that it uses ReLUs makes the gradients much harsher than the original gradients of equation 5.1.

    5.7 Local explanations retrieved from ReLUTan. Each arrow color represents an explanation for one of the axes of movement. The magnitude of the arrow represents the amount of movement predicted. Arrows on the borders of the input space are generally bigger due to the network’s inability to predict correctly outside this range.

    5.8 This small example shows how simple it is to retrieve feature importance vectors from any neural network using deepDIVE. Here, we are retrieving angle importance for the x axis in ReLUTan.

  • A.1 Comparison of dx/dθ1 and dx/dθ2 between several different models. Different activation functions create very different types of surfaces. SigTan is a ReLUTan with sigmoid activations. Here the difference between sigmoids and ReLUs is especially noticeable, as the former creates very smooth curves and the latter very harsh and noticeable decision boundaries. It is also insightful to look at SimpleTan to see how a network with a much lower representational capacity tries to mimic the objective function.

  • List of Tables

    5.1 Although both systems achieve high accuracies, the decision trees have a higher precision. It should be noted that we could not measure recall for the trees since we did not have access to the final trained trees.

    5.2 As the number of features that we perturb increases, the number of datapoints whose classification is changed increases. As we reach 4 features we can clearly start seeing diminishing returns, implying that the datapoints that weren’t changed are extremely far from the decision boundary of the network.

    5.3 From the biased data retrieved by filtering explanations where the importance ratio between angles exceeded the threshold, we can see that movement on both axes is extremely faithful to the explanations. Most of the datapoints where the movement of the arm didn’t correspond to the explanation are located at the edges of the dataset.

  • 1 Introduction

    Contents

    1.1 Introduction
    1.2 Problem
    1.3 Hypothesis and Contributions


  • 1.1 Introduction

    Machine learning is not a new field; the concept that we could teach machines how to ”learn” the solution

    to a problem by themselves has been around for a very long time. However, limitations in techniques and

    computational power held it back for decades. Due to this limitation, the algorithms used in the early days

    of the field were considerably simpler than the ones used today. As such, it was easy to interpret the algorithms used for these tasks and deduce why they had made a certain decision. Models trained

    with algorithms such as Linear Regressions and Decision trees are easy to interpret. These models,

    however, often have a severe limitation in representation power, that is, their ability to learn complex

    patterns displayed by data. When compared with other less interpretable models like the extremely

    popular Neural Networks, these models’ performance is usually severely lacking.

    Neural Networks are not a new concept; they were first envisioned after Hebbian Theory, a theory we are commonly reminded of by the adage ”Neurons that fire together, wire together”. Shortly after this, in 1958, Frank Rosenblatt developed the Perceptron, which would later become the basis of what we now call an ”artificial neuron” [6]. Although the perceptron enjoyed limited success in the research

    community, the ability to join several of these together was still far off. However, in the last few years,

    partly due to enormous advances in computational power and partly due to new optimization techniques

    being developed, Deep Neural Networks have become the de facto standard for tasks such as Natural

    Language Processing, Computer Vision and many, many others. These Networks have achieved state-

    of-the-art performance in many of these fields and are experiencing a period of widespread use [7]. With

    this increase in adoption, researchers have started looking at ways to expand the use of these networks to

    areas such as finance, medicine or industrial processes.

    These networks, albeit responsible for achieving state-of-the-art performance in many fields, have one major drawback: they are extremely hard to interpret. They are composed of several interconnected

    layers of sometimes thousands of ”neurons”, each of which applies a mathematical transformation to

    received data. We can observe the data we feed into the networks, their outputs, and the individual changes to each neuron, and understand all the singular pieces of their inner workings. Yet, due to the non-linear transformations these networks apply, it is impossible to grasp their end-to-end behaviour by looking at these raw components.

    Their fantastic generalization capabilities are related to the fact that they use distributed representations in their hidden layers, that is, their ”knowledge” is distributed unevenly across a set of neurons [8].

    These distributed representations are defined by the set of all activations in any given layer. And, be-

    cause the minor effects of any particular ”neuron” rely on the effects of the activations of every other

    unit in that layer, it is extremely hard to retrieve any meaningful information about their role. As such,

    it is very difficult to discern what internal mechanisms are responsible for any specific phenomenon

    (although some research has been done on this topic [9]).


  • All of this is further aggravated by another characteristic of Deep Neural Networks: the fact that

    they can model an enormous number of weak statistical regularities. They model these weak regular-

    ities between the inputs and outputs of the training data. [8] However, there is no mechanism that can

    distinguish the regularities that are properties of the data from false ones created by peculiarities of the

    training set. To sum up, not only is it extremely hard to correlate unit activations with any specific feature, it is also very difficult to verify which units actually matter for the result and aren’t just activating due to

    some relationship learned from random training set characteristics. This means that although we know

    precisely how these networks work, if we want to unearth the reasons behind a specific decision even

    experts can only make educated guesses.

    This ”black box” nature of neural networks has become an enormous barrier to adoption in areas where decisions have irreversible consequences. In areas such as NLP, the consequences of failure are fairly minor; as a result, the focus on interpretability has remained secondary to performance.

    On the other hand, in areas like medicine, where a wrong choice might prove deadly, it is of utmost

    importance to understand why the network has made a decision.

    To illustrate the importance of interpretability, Caruana et al. [10] detail a model trained to predict

    probability of death from pneumonia that learned to assign less risk to pneumonia patients if they had

    asthma. These patients, when arriving at the hospital, received more aggressive treatment due to their

    condition. As a result, the model concluded that asthma was predictive of lower risk of death. If this

    model were to be deployed to aid in triage, these patients would receive less aggressive treatment,

    invalidating the model and possibly resulting in a higher number of deaths amongst asthma patients. How can doctors act upon predictions without knowing the reasoning behind the result? In scenarios such

    as this, acting in blind faith could prove to be catastrophic.

    In light of these facts, model interpretability has become of paramount importance for widespread

    adoption in high-risk areas. However, even what exactly constitutes interpretability is often discussed,

    as different authors strive for solutions to different, albeit often similar, problems. Fundamentally, we

    consider interpretability as a source of trust in the model. For the purpose of this work, we will consider

    trust as human confidence in the reasons behind the network’s prediction. This confidence relies on an observable causality between inputs to the network and its outputs. As such, we can say that an interpretation of a decision would be an explanation of how the input values were at the genesis of that

    decision.

    1.2 Problem

    Although these networks achieve outstanding performance when evaluated through traditional measure-

    ments like Accuracy and Precision [7], these measurements can only tell us so much about what the


  • network is doing and why it is doing it.

    As we previously hinted at with the example by Caruana et al., considering the reasons

    behind a decision is of the utmost importance [10]. Interpretable predictions lead to appropriate trust

    and can also provide intuitions on how to improve the model. Understanding why a model made a

    certain prediction is essential in many fields. If users understand the reasoning behind a prediction, they can use their own knowledge to accept or reject it. Besides providing much safer and more controlled usage, studies have shown that providing explanations can increase the acceptance of automated systems’ recommendations, e.g. entertainment recommendations [11, 12].

    Furthermore, there are a considerable number of ways a model could be trained wrongly. A good

    example, studied by Kaufman et al., is data leakage [13]. Data leakage is the unintentional introduction of additional information into the training (and validation) data, artificially inflating measured accuracy. The author

    cites an example where the patient ID was profoundly correlated with the target class in the training

    and validation data, thus leading the model to have an artificially high performance in these controlled

    environments. Another example, mentioned by Ribeiro et al. [2] is artifacts of data collection influencing

    the model. The author describes an experiment where a model was designed to classify between a

    Husky and a Wolf. Although the model had an extremely high accuracy, when rudimentary interpretability

    methods were applied to it, it was found that the model was in fact looking at the background of the image

    and using that to make a decision. A lighter, whiter background would be classified as a wolf, as most

    photos of wolves had been taken in the wild against snowy backgrounds. Darker backgrounds, in turn, would be

    classified as huskies, since most of the pictures used in training had been taken inside a house or in a

    modern city. Both of these issues would be extremely challenging to detect by looking at the predictions

    and raw data.

    Determining the causality between inputs and outputs would provide a much clearer picture into the

    inner workings of these networks. Not only could this be used to implement them safely into new fields

    but also as a tool to help existing machine learning practitioners analyse and assess new models before

    deploying them.

    In order to interpret these networks, various approaches have appeared in the research community.

    These approaches have varied from attempts to create generic explanation methods for any machine

    learning model to distilling more complex models into simpler easier-to-interpret ones. One that has

    gained increased attention in the last few years is using the backpropagation algorithm to calculate a

    gradient of the output w.r.t the input. Yet, this and its related approaches have only been researched in

    the context of computer vision and images as input. Backpropagation is the algorithm responsible for

    creating these inner distributed representations. As such, we wonder if its utility in interpreting Neural

    Networks couldn’t be exploited in other domains.


  • Is it possible to use network gradients to interpret the output of any deep Neural Network?

    1.3 Hypothesis and Contributions

    The Hypothesis we put forth with this thesis is the following:

    Network gradients after any particular decision can be used to establish a causal relationship between input and output.

    In this work, we propose a variation of backpropagation-based techniques to interpret any deep

    neural network, not only those using images as input. By calculating the gradients at decision time

    we can create a causal relationship between input features and the network’s output. With it, we can

    accurately attribute decisions to a specific cause or causes. This provides a generic method for Neural

    Network Interpretability. Contrary to previous works, which were solely based on image classification

    networks, our method can be applied to a multitude of domains. Although capable of interpreting any

    network, our main focus was to test this method in areas where it can have real-world implications. As such, focus has been given to the medical and industrial domains.
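    To make the idea concrete, the following is a minimal sketch of computing such a decision-time gradient in PyTorch; it is an illustration only, not the thesis' DeepdIVE code, and the model, input sizes and names are assumptions made for the example.

        import torch

        # Minimal sketch: backpropagate a chosen output of a trained PyTorch model
        # back to its input, yielding a per-feature importance (gradient) vector.
        def input_gradient(model, x, target_index):
            model.eval()
            x = x.clone().detach().requires_grad_(True)  # track gradients w.r.t. the input
            output = model(x)
            output[target_index].backward()              # d(output[target_index]) / d(x)
            return x.grad.detach()

        # Hypothetical usage with a small feedforward classifier:
        # model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 3))
        # importance = input_gradient(model, torch.randn(10), target_index=2)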

    In the context of this work we implemented several different Neural Network architectures to solve

    two distinct problems, one for each of our focused domains. The first challenge was to develop a clas-

    sifier network responsible for predicting the severity of a patient in triage for suspected mental illness.

    We explore how this network compares to a traditional decision tree algorithm and how we can use our

    interpretability technique to retrieve insights from the network’s decisions. Secondly, we developed a

    network responsible for moving a two-dimensional robot arm with two degrees of freedom. This arm is

    equivalent to a human arm with shoulder and elbow but stuck to one plane of motion. Since this move-

    ment can be described theoretically, we use this example to explore the internal network representations of the concepts. By comparing explanations of the network’s results at different arm angles with the theoretical results, we can explore how the network actually learns and accurately measure the reliability

    of our explanation method.

    The rest of this work details previous work, our approach, our methodology and results. But first, as

    an attempt to introduce unfamiliar readers to the core concepts behind this work, we endeavor to explain Neural Networks and backpropagation as plainly as possible.


  • 2 Background

    Contents

    2.1 Neural Networks


  • This section assumes a modest understanding of the basics of Machine Learning and Linear Alge-

    bra. Here, we will explore and explain the basics of Neural Networks. We hope to supply the reader

    with sufficient knowledge to understand subsequent chapters. We provide an overview of what these

    networks are, how their fundamental mechanisms work and how they are used.

    2.1 Neural Networks

    Neural networks come in many shapes and forms. The designation neural networks comes from the

    fact that these models are loosely inspired by the biological neural networks that form the human brain.

    Putting it very simply, the basic computational unit of the brain is the neuron. There are approximately 86

    billion neurons in our nervous system. They connect to other neurons through synapses; a single neuron

    can have thousands of synapses. It’s through synapses that a neuron receives input signals. These are

    then combined and used to produce output signals along its single axon, the ”branch” that extends it

    to the proximity of other neurons. This axon eventually branches out and its synapses connect to other

    neurons. When a neuron receives signals from others, it sums all of them. If the sum of all signals is

    above a certain threshold, the neuron will send information along its axon to the next.

    The computational model of this, the artificial neural network, follows many of the same concepts. These networks are also composed of numerous connected units, commonly referred to as ”artificial neurons”. Contrary to the biological model, the signals x that travel to another neuron interact through multiplication wx with the other neuron, depending on how strong the synapse (or connection) w between

    them is. The main idea behind the computational representation of the brain is that the strength of this

    connection can be controlled and ”learned”. It should also be mentioned that this connection can have

    positive or negative w, allowing a much broader influence on the next neuron. Similarly to the biological

    model’s sum of signals before forwarding the information, artificial neurons have an activation function

    f . Instead of acting as a threshold that controls the timings of the neuron, in this case it acts as a

    representation of the frequency at which the neuron sends information to the next. To sum up, the basic

    structure of a neuron can be represented as follows:

    f(∑i wi xi + b)

    In other words, a neuron performs a dot product with the inputs and their weights, adds a bias b,

    applies a non-linearity f (activation function) and sends the result to the next neurons. Typically these

    neurons are organized in layers with each layer performing different kinds of transformations on their

    inputs. Signals travel from the first (input) layer to the last (output) layer. The layers of neurons between

    the first and last are called hidden layers. Simply put, a neural network is considered a deep neural

    network if it has two or more hidden layers.
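    As a concrete illustration (not code from this thesis), the following minimal PyTorch sketch computes the output of a single artificial neuron, f(∑i wi xi + b), using a ReLU as the activation function f; the weights, inputs and bias are arbitrary example values.

        import torch

        # A single artificial neuron: a weighted sum of the inputs plus a bias,
        # passed through a non-linear activation function f (here, ReLU).
        x = torch.tensor([0.5, -1.0, 2.0])    # input signals
        w = torch.tensor([0.8, 0.1, -0.4])    # synapse strengths (weights)
        b = torch.tensor(0.2)                 # bias

        pre_activation = torch.dot(w, x) + b  # sum_i w_i * x_i + b
        output = torch.relu(pre_activation)   # f(sum_i w_i * x_i + b)
        print(output)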

  • Figure 2.1: On the left, a 2-layer Neural network, with one hidden layer and one output layer. On the right, a 3-layer (deep) neural network with two hidden layers and one output layer. Note that there are connections between neurons across layers, never within a layer. Image sourced from CS231n [1].

    The simplest of these are called feedforward neural networks, or multilayer perceptrons. The objective of a feedforward neural network is to approximate a certain function f∗ that belongs to a class of

    functions F . Each of the layers in a neural network can be seen as a function. This means that (most)

    neural networks are merely a chain of functions f(x) = f2(f1(x)). In this case, f1 would be the first

    layer of the network and f2 the second. We can then think of information being propagated throughout

    the network as a sequence of functions being applied on input data.
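    The same idea can be written directly as code. The sketch below (an illustration with assumed layer sizes, not an architecture from this work) builds a small feedforward network in PyTorch as a composition f(x) = f2(f1(x)).

        import torch
        import torch.nn as nn

        # A deep feedforward network is just a chain of functions: f1 is a linear
        # layer followed by a ReLU, f2 is the output layer, and the whole network
        # computes f(x) = f2(f1(x)).
        f1 = nn.Sequential(nn.Linear(4, 8), nn.ReLU())
        f2 = nn.Linear(8, 2)

        x = torch.randn(4)   # an arbitrary input vector
        y = f2(f1(x))        # information propagating through the chain of layers
        print(y)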

    For a neural network to ”learn”, we need a way to codify what is a correct or incorrect output and

    how ”far” an incorrect output is from the correct one. For that, we use a cost function C. A cost function

    (or loss function) is a function so that, for the optimal f∗, C(f∗) ≤ C(f) ∀f ∈ F. It serves as a measure

    of how far away a particular solution is from an optimal solution. Learning algorithms, such as Gradient

    Descent, will search through the solution space trying to find the function with the smallest possible cost.

    Gradient Descent is an iterative algorithm that functions by correcting the values of the weights w and

    biases b after each network mistake. It does this by calculating the gradient of the cost function at the current point and moving in the direction of the negative gradient, in order to reach a minimum of the cost

    function. This way, the network is constantly changing and moving closer to f∗.

    In order to apply this method to a deep neural network, the backpropagation algorithm is essential. Backpropagation is the method used to calculate the gradient of the loss function with respect to the weights of the connections between the units of the network. Once these gradients are calculated, they are

    used to update the weights one layer at a time, from the output layer all the way back to the input layer.

    And with this, Gradient Descent is put into action.
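    Put together, one step of this procedure looks roughly as follows in PyTorch. This is a minimal sketch (the network, data and learning rate are arbitrary assumptions): the loss is backpropagated through the layers and every weight is then moved against its gradient.

        import torch
        import torch.nn as nn

        # A toy network, a toy regression target and a mean squared error cost C.
        net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
        cost = nn.MSELoss()
        x, target = torch.randn(16, 4), torch.randn(16, 1)

        learning_rate = 0.01
        loss = cost(net(x), target)   # how far the current function is from the target
        loss.backward()               # backpropagation: gradient of the loss w.r.t. every weight

        with torch.no_grad():         # one Gradient Descent step: move against the gradient
            for p in net.parameters():
                p -= learning_rate * p.grad
                p.grad.zero_()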

    Neural networks can be applied to many types of problems and have many different architectures

    in order to accomplish different goals. Networks used for Computer Vision tasks have a very different

    architecture from the ones used in Natural Language Processing, for example. Each of them exploits particular characteristics of certain types of layers or connections between these layers to tweak the networks to perform better at their intended purpose. Even feedforward networks used in ”simple” tasks


  • can vary wildly in architecture. In the remainder of our work, in an attempt to restrict our exploration

    space, we will focus on feedforward networks of varying shapes and sizes.


  • 3 Related Work

    Contents

    3.1 Interpretability
    3.2 Interpretable Models as Explanations
    3.3 Inner Model Visualizations


  • To understand the approach taken in our proposal, it is essential to review what has been done

    before. Throughout the next section, we will review a collection of previous work in interpretability, carried out by researchers from across the community. Analyzing this work is essential to give us guidance on what does or does not work and what has been tried before. This review of previous work also serves the often overlooked but crucial task of homogenizing how certain terms are used

    throughout the scientific community. One of these terms happens to be Interpretability itself.

    Interpretability recently jumped into the limelight, but different papers seem to have different interpre-

    tations of it. What makes an algorithm interpretable? What is the true purpose of interpretability? We

    will answer these questions and many others. In order to search for the optimal way to interpret neural

    networks, we must understand what this means and what it is we are looking for.

    After examining interpretability, we will look into several different approaches to demystifying neural

    networks. First we will examine methods which use interpretable models to deliver explanations. In

    these, neural networks are treated as black boxes, with other, more interpretable, models being used to extract information from the networks without access to any internal information. By training models that we can easily understand around the neural networks, researchers hoped to mirror some of their behaviour. Subsequently, they use the newly trained models to draw conclusions about the original networks. After that, we will analyze another set of methods which takes a completely different approach. These attempt to understand neural networks by utilizing their internal mechanisms. These methods try to

    slightly modify either forward propagation or backpropagation to provide useful information to explain

    the network or one of its decisions.

    3.1 Interpretability

    Due to the enormous growth in its usage, Machine Learning has naturally started seeping into critical areas, such as medicine, machinery control, financial markets and terrorism detection.

    When tackling these kinds of issues, the fact that humans are incapable of understanding advanced

    models in machine learning takes a whole other dimension [14]. Machine Learning model Interpretability

    came into the spotlight as a way to tackle these issues.

    It should be noted however, that historically, the machine learning community, despite increasing

    claims of using interpretability as motivation, remains notably isolated from other industries and policy

    issues in general [15]. A prime example of this is the extensively discussed provision of the European

    Union General Data Privacy Regulation (GDPR) that is often labeled as a right to explanation, something

    that is highly connected to the understanding of model’s decisions. While it is true that the meaning and

    enforceability of the law continue to be under debate, the machine learning community gazed upon it

    extremely briefly, via just one paper [16]. This single paper was the only one presenting information


    about this law for a machine learning audience. This problem is noticeable in other areas of the machine learning community related to interpretability, like algorithmic fairness. In this area, a single paper about gender-crossing is considered the uncontested authority [15], not because of its content but due to the

    larger problem of a general disregard for policy in machine learning circles.

    However, recent concerns about fairness, racial or gender bias, and other related issues have been

    drawing the attention of experts. Interpretability in particular has been increasingly looked at as a solution

    to many of these problems. The reasoning is that, if we can accurately pinpoint how the model is

    making a decision, we can take actionable steps to correct its behaviour. Despite this, not many people

    have tried defining what Interpretability is. Counter-intuitively, papers often make claims about the

    interpretability of numerous models.

    Although slow, in the last year some progress has been made in this area, with some attempts to

    define interpretability and the characteristics that make a model interpretable. A study by Lipton [17]

    shows that both the motivations for interpretability and the technical definition of interpretable models

    are disparate, sometimes even contrary. That is, depending on the work, the word interpretability refers

    to more than one concept.

    Figure 3.1: This chart represents the traditional view of the trade-off between interpretability and performance of a model. Although there is definitely truth to this statement, it is merely a heuristic.

    However, it should be noted that, as we mentioned previously, not all machine learning models are

    hard to interpret. Most machine learning practitioners would categorize interpretable models as shown

    in Figure 3.1. Nonetheless, this is merely a heuristic, but one that has been propagated throughout


  • research materials. In the research community, researchers usually adopt a definition for interpretability

    that makes intuitive sense to them. This has the unwanted effect of the definition for interpretability

    changing considerably from paper to paper. The study by Lipton [17] compiled a list of the different

    descriptions of interpretability found throughout the spectrum of machine learning literature. Some re-

    searchers attempt to describe interpretability as ”simulatability”, whether a person can contemplate the

    entire model at once. Or in other words, whether one can examine the entirety of the model and un-

    derstand what it is doing and how. This notion is also adopted by Ribeiro et al., who suggest that an

    interpretable model is one that ”can be readily presented to the user with visual or textual artifacts” [2].

    Other authors suggest different traits to judge interpretability such as decomposability or algorithmic

    transparency. Decomposability suggests that a model is interpretable if each part of the model (each

    input, parameter and calculation) has an intuitive explanation. It’s worth noting that although it is true that

    we can take each singular ”part” of a neural network apart and explain what it is doing, it is often close

    to impossible to decipher how the several different parts fit into the whole that is the network’s decision.

    In other words, we know exactly what it is doing and how it is doing it but in the large majority of cases

    don’t have the faintest idea of why it is doing it. On the other hand, algorithmic transparency argues that

    we can only truly interpret and grasp a model if we understand the shape of its error surface. While this

    is easily done in the case of linear models, we don’t really understand how the heuristic optimization

    procedures for neural networks work. Besides, error surfaces for neural networks are more often than

    not, dependant on thousands of weights. This creates a surface of very high dimensionality, impossible

    for humans to imagine or visualize.

    By none of these metrics are neural networks considered interpretable. But it can also be argued that

    the different metrics have different goals or properties that they strive for. Not all models that are typically

    considered interpretable would fit even two of those metrics. We believe that this is because there can

    be no discussion about interpretability without considering, not only what makes a model interpretable,

    but what it is we are striving for with interpretability. Just as with the different traits used to classify

    models, researchers have sought after very different goals with their research in interpretability.

    Luckily, Lipton [17], when compiling the different metrics used to classify an interpretable model, also

    studied and collected the different properties interpretability research strives for. These properties are:

    Trust, Causality, Transferability, Informativeness and Fair Decision-Making. In our work we will mainly

    focus on interpretability as a source of trust and information.

    Trust is a common motivation for interpretability [2,14,18]. But what is trust? Is it trust in the model’s

    performance? If so, a model with an adequate accuracy would be a highly trustworthy model and we’d

    have no need for interpretability. Trust can be defined subjectively, e.g. someone can be more com-

    fortable working with a well-known model even if that knowledge gives them no advantage whatsoever

    in interpreting the model’s decisions. However, it can also be defined in more practical ways. Trust in


  • a prediction can be assessed by whether a human would trust the prediction enough to make a de-

    cision based on it [2]. Establishing trust in these individual predictions is of paramount importance.

    Humans create a trust relationship with the tools they use to inform their decision-making. Studies have

    shown that they are much more likely to act on a tool’s suggestions if they’re provided an explanation for

    them [11]. This becomes especially important when the tool is not a traditional algorithm but something

    considered a black-box, such as a machine learning model. Furthermore, in fields such as medical diag-

    nosis, predictions cannot be acted upon without a certain degree of confidence [10]. The consequences

    might be catastrophic.

    On the other hand, interpretability can also help extract additional information from the model.

    While the scientific goal of machine learning is to reduce the error of some function, in the real world its

    goal is to produce helpful information. Traditionally, we only consider the information conveyed through

    the model’s outputs. Nonetheless, through interpretability techniques it is feasible to convey extra

    information to a human decision maker. This has been explored as a paradigm where supervised

    models are employed simply to provide information to humans [14, 19] instead of training the models

    to act upon their predictions. However, we would like to explore the same approach as Ribeiro et al.,

    deciphering the model’s decision making rationale and using that as an extra source of information [2].

    An interpretation might prove valuable even if it does not provide any intuition or explanation into the

    model’s inner mechanisms. For example, a model responsible for medical diagnosis can be better suited

    for assisting a human in making a decision if, instead of solely providing a diagnosis, it can explain which of the symptoms contributed to that forecast. Then, the human can combine that knowledge with their own domain-specific expertise, likely resulting in a better decision. As mentioned before, it has been shown that showing this type of information to humans makes them significantly more likely to accept, and

    act upon, the original model prediction [11,12,20].

    3.2 Interpretable Models as Explanations

    When considering ways to provide explanations for the behaviour of deep neural networks, one of the

    most common approaches is a delegation of responsibility. Instead of looking at the model’s internal

    mechanisms for an explanation, the task of explaining the model’s behaviour is outsourced to an exter-

    nal, simpler and interpretable model. Intuitively speaking, when looking at Figure 3.1, these approaches

    are based on the use of the models on the top left (easily interpretable) to explain the ones on the bottom

    right (black-box in nature). In this section we will explore some of the most recent proposals in this area.

    Ribeiro et al. propose an approach capable of explaining any model, including neural networks, by

    treating the model as a black-box. The authors argue that one of the most neglected but important

    aspects of machine learning is the role of humans. They believe, in accordance with our work, that it


  • is necessary for a model to be trusted by humans for it to be used in the real world. Furthermore, they

    argue that the best way for a human to trust a model is for it to explain the rationale behind a decision. They suggest that two things are essential for an explanation of a machine learning model: it needs

    to be interpretable and it needs to be locally faithful [2]. They claim that it is often impossible for an

    explanation to be entirely faithful unless it is the complete description of the model itself. They argue that

    explanations don’t need to represent the entirety of the model’s behaviour. They only need to be locally

    faithful, i.e. accurately describe how the model behaves in the proximity of the instance being predicted,

    or in other words, accurately describe the behaviour for that particular decision.

    Taking this into account, the authors developed a framework by the name of Local Interpretable Model-

    Agnostic Explanations (LIME). Its goal is to use an interpretable model to provide an interpretable repre-

    sentation locally faithful to a prediction of another classifier. Their work tries to define a balance between

    the complexity and faithfulness of an explanation. The goal was to define a general framework that could

    continue being used in the future, even if the method of obtaining the explanations changes. However,

    in the context of this first work, they focused on using sparse linear models to provide explanations, while using perturbations of the input as a way to explore the model to be explained. An example is provided in figure 3.2. It should be noted that although they only describe sparse linear models as ex-

    planations in this work, LIME supports the exploration of an array of function families for explanations,

    for example, decision trees. Hence, their base formula can be used with different explanation families,

    fidelity functions and complexity measures. On the other hand, they argue that all interpretable models

    and choices of representation will have some drawbacks when explaining certain behaviours. Although

    the underlying model is treated as a black-box, certain representations won’t be adequate for certain

    models. For example, it would be extremely hard to explain a model that predicts filters in images with

    the importance of each pixel to the prediction.
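    As a rough sketch of this idea (this is not the LIME library, and the kernel width, sampling scheme and plain least-squares fit are simplifying assumptions; LIME proper also encourages sparsity), a locally faithful linear explanation can be obtained by perturbing the instance, querying the black-box model, weighting the samples by proximity and fitting a weighted linear model:

        import numpy as np

        # LIME-style local surrogate sketch: `predict` is any black-box function
        # mapping a batch of feature vectors to a score; `x` is the instance to explain.
        def local_linear_explanation(predict, x, num_samples=500, sigma=1.0):
            rng = np.random.default_rng(0)
            samples = x + rng.normal(scale=0.5, size=(num_samples, x.shape[0]))  # perturb around x
            preds = predict(samples)                                             # black-box predictions
            dists = np.linalg.norm(samples - x, axis=1)
            weights = np.exp(-(dists ** 2) / (sigma ** 2))                       # proximity kernel
            X = np.hstack([samples, np.ones((num_samples, 1))])                  # add an intercept column
            W = np.diag(weights)
            coef = np.linalg.pinv(X.T @ W @ X) @ X.T @ W @ preds                 # weighted least squares
            return coef[:-1]                                                     # per-feature importances

        # Hypothetical usage: explain a toy black-box model around x = [1.0, -2.0].
        # imp = local_linear_explanation(
        #     lambda s: 1 / (1 + np.exp(-(s @ np.array([2.0, -1.0])))), np.array([1.0, -2.0]))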

    Although an explanation of a single prediction contributes to understanding the reliability of the clas-

    sifier, the authors argue that it isn’t enough to assess trust in the model as a whole. As a result of that,

    they propose giving a global understanding of the model by explaining a batch of individual predictions.

    However, they explain that this batch must be carefully picked, since users might not have the time to

    examine many explanations and the explanations should be not only diverse but representative of the

    model’s behaviour. For that purpose they devised an algorithm called submodular pick. They suggest

    that the explanation is, for all intents and purposes, a submodular set function. That is, a set function with the property that the difference a single element makes to the value of the function when added to the set decreases as the size of the set increases. In other words, the more elements the set has, the

    less of a difference each new element makes. They show that, due to submodularity, a greedy algorithm

    that adds the instances with highest marginal coverage of the target model offers a guarantee of a good

    approximation (1−1/e of the optimum) of the function [21]. With this, they created SP-LIME (Submodular pick LIME), a framework that uses LIME to explain entire models, one explanation at a time.

  • Figure 3.2: An example for LIME. The black-box model’s decision function f is represented by the pink/blue background. A linear model is used to create a local explanation. The bold red plus sign is the instance being explained. LIME samples instances (red plus signs and blue circles), gets predictions using f and weights them by proximity to the original instance. The dashed line is the learned explanation; it is locally (but not globally) faithful. Image sourced from [2]
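    The greedy procedure described above can be sketched as follows; this is an illustration in the spirit of SP-LIME rather than the original implementation, and the importance matrix W (instances by features) is an assumed input.

        import numpy as np

        # Greedy submodular pick sketch: W[i, j] is the importance of feature j in the
        # explanation of instance i; we greedily pick `budget` instances whose
        # explanations together cover the globally important features.
        def submodular_pick(W, budget):
            importance = np.sqrt(np.abs(W).sum(axis=0))    # global importance of each feature
            chosen = []
            for _ in range(budget):
                def coverage(candidate):
                    rows = np.abs(W[chosen + [candidate]]) > 0   # features touched so far
                    return (rows.any(axis=0) * importance).sum()
                best = max((i for i in range(W.shape[0]) if i not in chosen), key=coverage)
                chosen.append(best)
            return chosen

        # Hypothetical usage with a toy 4-instance, 3-feature importance matrix:
        # picks = submodular_pick(np.array([[1., 0., 0.], [0., 2., 0.], [0., 0., 3.], [1., 1., 0.]]), 2)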

    Another similar approach was developed by Frosst et al. Instead of trying to understand how a

    deep neural network makes its decisions, they use the network to train a decision tree that mimics the

    function discovered by the network but works in a completely different way [3]. They experimented using

    a technique labeled distillation [22] to extract the knowledge from the network into a type of decision tree that

    makes soft decisions [23]. They use soft binary decision trees trained by choosing the depth of the tree

    and then using mini-batch gradient descent to update all the parameters of the tree simultaneously. The

    model is equivalent to a hierarchical mixture of experts. An example of such a tree trained on MNIST can be seen in figure 3.3.
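    The distillation step itself can be sketched as follows (a minimal illustration, not Frosst et al.'s soft decision tree code; for simplicity the student here is a small network rather than a soft tree, and all sizes are assumptions): a simpler student model is trained on the teacher network's soft predictions instead of the hard labels.

        import torch
        import torch.nn as nn

        # Distillation sketch: a small "student" is trained to match the soft
        # predictions of a larger, already trained "teacher" network.
        teacher = nn.Sequential(nn.Linear(20, 128), nn.ReLU(), nn.Linear(128, 10))  # assumed trained
        student = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 10))    # simpler model

        optimizer = torch.optim.SGD(student.parameters(), lr=0.1)
        inputs = torch.randn(64, 20)                              # a batch of training inputs

        with torch.no_grad():
            soft_targets = torch.softmax(teacher(inputs), dim=1)  # the teacher's soft predictions

        log_probs = torch.log_softmax(student(inputs), dim=1)
        loss = nn.KLDivLoss(reduction="batchmean")(log_probs, soft_targets)
        loss.backward()
        optimizer.step()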

    With this technique, the authors can easily provide an explanation for any prediction. They only have

    to follow the path with the greatest probability from the root to the leaf. An explanation is then the list of all the filters along the path together with the binary activation decisions at every node. Although very easy to

    create, these explanations might prove hard to interpret depending on the format of the data. As we

    can see in the example, it isn’t easy to interpret the visual filters used by the decision tree. It is easy to

    think of more complex examples where these explanations would quickly become too bothersome and


  • complicated to interpret.

    Figure 3.3: Example of a depth 4 decision tree obtained from distilling a NN trained on MNIST. The images at the inner nodes are filters learned from images. The ones at the leaves are visualizations of the learned probability distribution over classes. The most likely classifications at each node are annotated in blue. Image sourced from Frosst et al. [3]

    The authors also provide an interesting example of the utility of an explainable model. While exam-

    ining the learned filters of a soft decision tree trained from a network that had learned to play Connect4,

    they discovered an insight about the game. Inspecting the first filter of the tree revealed that any game of

    Connect4 can be divided into two separate groups: games where the players started by placing pieces

    on the edges of the board and ones where the first pieces were placed on the center of the board. By

    distilling the original network into the decision tree they discovered that these two groups played out in

    ways so different to one another that the decision tree used that condition as its root node [3].

    However, this method has a problem. Models such as decision trees normally do not generalize as

    well as deep neural networks. Contrary to the hidden units in a neural network, a node at a lower level

    of a decision tree is only used on a tiny part of the training data. This usually causes the lower levels of the decision tree to overfit. Because of this, the number of parameters at which the soft decision trees start to overfit is lower than the number at which a deep neural network starts to overfit. This is

    noticeable when comparing performance in MNIST.

    With a regular soft decision tree of depth 8 trained on the real dataset, they were able to achieve a test accuracy of at most 94.45%. They compare this with the performance of a neural network with two convolutional hidden layers and a penultimate fully connected layer; this network managed to obtain a much better test accuracy of 99.21%. Afterward, they experimented with training a soft decision tree with labels that were a combination of true labels and predictions of the neural network. The soft decision


  • tree trained in this way attained a test accuracy of 96.76%, which is roughly in the middle between the

    neural network and the soft decision tree trained directly on the data. On another dataset, the Connect4

    dataset, a soft decision tree distilled from a neural network achieved 80.60% test accuracy compared to

    78.63% from one trained with real data [24].

    All in all, the authors provide a way to reap some of the benefits of deep neural networks while using

    a model that can be easily interpreted. In cases like the ones described in the previous chapter, where

    being able to explain a prediction is essential, this technique can be used to improve the performance of

    a traditional interpretable model.

    We believe that none of the aforementioned approaches is a good solution to our problem. LIME,

    while an extremely interesting approach that can be applied to any model, has some glaring problems.

    First, it tries to explain changes in a model’s behaviour by first order changes in a discrete interpretation

    of the input space. In other words, by measuring the changes with perturbations, there’s no guarantee

    that the behaviour being modeled is actually from the space surrounding the original input in the target

    model’s space. Secondly, for each decision of the model we intend to explain, we must train a new

    explainable model. This adds a significant overhead to providing explanations to more complex models

    with large runtimes, as the model needs to be run several times in order to generate an explanation.

    Distilling a neural network to a soft decision tree, on the other hand, is also a far from ideal solution. It

    does not explain neural networks directly, only by proxy through another model. There is no guarantee

    that this other model will learn the same distribution as the original network. Furthermore, since the

    decision trees can only attain a performance significantly lower than the networks, we must sacrifice

    performance of the model for an explanation.

    Overall, they both pose difficult questions for adoption. Both of them, although in different ways,

    are problematic in terms of both reliability and feasibility of explanations. As interesting as the idea of

    utilizing an interpretable model to explain neural networks might sound, it is extremely hard to surpass

    the inherent limitations of these models. We believe that, since they do not use any of the unique

    mechanisms possessed by neural networks, these explanations are held back by the expressiveness of

    the models learning the explanations.

    3.3 Inner Model Visualizations

Another type of approach to deciphering neural networks, in contrast to the ones in the previous section, is based on using the inner mechanisms of the model to obtain information on how it makes its predictions. It should be noted, however, that even before the recent rise in popularity of Deep Learning there had already been some attempts to interpret the networks' inner workings [25, 26]. In recent years, fueled by the new-found popularity of neural networks, several new approaches have been proposed.


In 2009, Erhan et al. discovered that, counter-intuitively, it is possible (and simple) to interpret deep

architectures at the unit level [27]. Before their work, the only qualitative method to compare and explore feature representations learned by deep networks was to examine the "filters" learned by the network [28, 29]. That is, the linear weights of the input-to-first-layer weight matrix represented in input space. Or, in other words, the pixels and shapes that have the highest-weighted connections to the first layer. However, this is a poor representation of the overall network because we are only looking at its first layer. Erhan et al.'s

    work was the first to develop methodologies to visualize representations of arbitrary layers of a deep

    architecture, in this case, Deep Belief Networks [30]. They discovered a curious phenomenon: the

    responses of the internal units to images, as a function in image space, are unimodal. Knowing this,

    by finding the dominant mode, we can create a good characterization of how the unit is processing

    the information. With their methodologies, they were able to verify the intuition that the first layers of a

    network are trying to find generic features while the later layers build on these to create more complicated

    concepts. This work opened up the pathway to exploring and visualizing the internal representations of

    networks. Be that as it may, although these representations are fascinating in their own right and provide

    a good insight into the internal reasoning of the networks, they are of little help in interpreting individual

    decisions and bolstering end-user trust in these decisions. With recent works such as Olah et al. [31]

pushing this medium forward, trying to combine both visualization and attribution to create rich interfaces for humans, the gap between feature visualization and feature attribution is closing. It is, however, still too wide to be considered an effective solution to our problem.
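For context, a minimal sketch of unit-level visualization in the spirit of this line of work (our own illustration, not Erhan et al.'s code, and omitting the input-norm constraints used in the original work): gradient ascent on the input to find a pattern that strongly activates a chosen hidden unit. The network, unit index and optimization schedule are placeholders.

import torch
import torch.nn as nn

# Placeholder network; 'features' exposes the hidden layer whose unit we characterize.
features = nn.Sequential(nn.Linear(10, 32), nn.ReLU())
unit = 5                                      # arbitrary hidden unit to visualize

x = torch.randn(1, 10, requires_grad=True)    # random starting input
opt = torch.optim.Adam([x], lr=0.05)
for _ in range(200):                          # gradient ascent on the unit's activation
    opt.zero_grad()
    (-features(x)[0, unit]).backward()        # minimize the negative activation
    opt.step()
# x now approximates an input that strongly activates the chosen unit.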

Meanwhile, on the other side of this gap and more or less simultaneously with Erhan et al.'s work, Baehrens et al. proposed a novel approach for interpreting Bayesian classifiers [32]. Their goal was similar to our own: to peek into the black box of modern classifiers and provide an answer to why a certain decision was made. Their approach exploits the fact that Bayesian classifiers are optimized using numerical algorithms such as gradient descent. With that in mind, they use class probability gradients

    in order to create local explanations of the classifier. These class probability gradients are in essence

the derivative of the probability of the classifier being wrong. This creates a vector that "points" away from the chosen class, providing a weighted vector of the features which influenced the

    classifier’s decision. This can easily convey the relevant features for a single prediction, both the ones

    that contributed positively and negatively to the final class. At the time, this approach was completely

    different from other available techniques. It provided much need improvements in reliability and quality

    of explanations. It is especially significant because it allows measuring changes in any direction but

    most of all, because by using local gradients of the model as the explanation vectors, it always reflects

    the model and its assumptions.

This type of approach made its way into neural network research when Simonyan et al. [4], building on Erhan et al.'s and Baehrens et al.'s works, proposed using the gradient of the output with respect to the pixels of an input image to create a saliency map (Figure 3.4) as a way to interpret deep convolutional neural networks [33]. It combined the ideas from previous works [27, 32]: gradients as explanations and creating

    explanations using input space. The authors propose that the gradient of the output class with regards

to the pixels of the input image can create a weighted vector of contributions from the input pixels. Or, in other words, how much each pixel of the input image contributed to the final classification

    output. At the time when this was proposed, the most common approach to interpreting convolutional

    networks was a specific network architecture called DeconvNet [34]. This approach required training an

    additional network with a specific architecture on top of an existing convolutional neural network. It is

impractical to be required to train specific networks on top of existing networks in order to interpret the

    latter. Simonyan et al.’s approach was especially significant because they proved that the reconstruction

    of a layer by DeconvNet was in practice equivalent to computing the gradients w.r.t the input of that layer

    (with a small difference on ReLU activation functions). With this, they managed to unify and generalize

    a wide spectrum of interpretability research in one work.

Figure 3.4: The maps were extracted using a single back-propagation pass through a classification ConvNet. These maps represent the top predicted class in ILSVRC-2013 test images. Image sourced from Simonyan et al. [4]

    Since the introduction of this method, several other authors have proposed different gradient based

interpretability methodologies. Springenberg et al., unsatisfied with the different ways default gradients and DeconvNets treated ReLUs, suggested further homogenizing both approaches and proposed guided backpropagation [35]. This approach is mostly equivalent to computing gradients, except that any gradients that become negative during the backward pass are discarded. The issue with the original approach of backpropagating importance using gradients is that the gradient coming into a ReLU during the backward pass is zeroed if the input to the ReLU had been negative on the forward pass. By contrast, in DeconvNet the gradient is only zeroed if it is negative, with no regard for the sign of the input to the ReLU during the forward pass. Springenberg et al. combined both and zero out the signal if either the input in the forward pass or the gradient in the backward pass is negative. Although this method achieved valuable results in image segmentation, due to the exclusion of negative gradients it fails to highlight inputs that contribute negatively to the output if they pass through a ReLU unit at any point in the network.
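To make the modification concrete, the following is a minimal PyTorch sketch of how guided backpropagation could be implemented with a backward hook on ReLU modules; the small network, input and chosen class are placeholders of our own, not code from the original paper.

import torch
import torch.nn as nn

# Hypothetical small classifier; any network containing nn.ReLU modules works.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))

def guided_relu_hook(module, grad_input, grad_output):
    # PyTorch's ReLU backward already zeroes positions whose forward input was
    # negative; on top of that, discard (clamp to zero) negative incoming gradients.
    return (torch.clamp(grad_input[0], min=0.0),)

hooks = [m.register_full_backward_hook(guided_relu_hook)
         for m in model.modules() if isinstance(m, nn.ReLU)]

x = torch.randn(1, 10, requires_grad=True)   # toy input
model(x)[0, 0].backward()                    # backward pass for one class score
guided_saliency = x.grad                     # guided-backpropagation importance scores

for h in hooks:
    h.remove()

After the backward pass, x.grad holds only the positive evidence for the chosen class, which is exactly why negatively contributing inputs disappear from the resulting maps.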

    Other approaches to solving this problem include Layerwise Relevance Propagation (LRP) [36], an

    approach for calculating importance scores in a layer-by-layer approximation of backpropagation. This

    method was later demonstrated to be equivalent within a scaling factor to an elementwise product be-

    tween the saliency maps of Simonyan et al. and the input [37, 38]. These approaches have the advan-

tage of leveraging the sign and strength of the input. However, they are very difficult to apply outside the image classification domain, since other inputs, such as categorical features, are not as standardized or numerically well behaved. Sundararajan et al. suggested that instead of computing the gradients only at the current value of the input, we could integrate the gradients as inputs are scaled up from a certain starting value [18]. Although this addresses a fundamental limitation of gradients, namely neuron saturation,

    it adds a significant computational overhead to obtain high-quality integrals. Furthermore, other works

    have shown that this approach can still provide highly misleading results [5].
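As a rough illustration of the integration described above (not the authors' reference implementation), the path integral can be approximated with a simple Riemann sum over inputs interpolated between a chosen baseline and the actual input; the model, baseline and number of steps below are illustrative assumptions.

import torch

def integrated_gradients(model, x, baseline, target_class, steps=50):
    # Interpolate between the baseline and the input and average the gradients
    # of the class score along that path (a crude Riemann-sum approximation).
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
    path = (baseline + alphas * (x - baseline)).detach().requires_grad_(True)
    model(path)[:, target_class].sum().backward()
    avg_grad = path.grad.mean(dim=0)
    return (x - baseline) * avg_grad   # one attribution per input feature

For example, with an all-zeros baseline one would call integrated_gradients(model, x0, torch.zeros_like(x0), target_class=0); the extra work over all interpolation steps is precisely the computational overhead mentioned above.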

    Finally, Shrikumar et al. recently proposed DeepLIFT (Deep Learning Important FeaTures), an al-

    gorithm used for assigning an importance score (weights) to the inputs of a certain network output [5].

    Although it is still a gradient based method, contrary to others it doesn’t rely solely on one gradient

    calculation (or integrating up to one). DeepLIFT calculates feature contribution by comparing the output

    in question to a ”reference” output obtained from a ”reference” input. This ”reference” input illustrates

    some default or what the authors describe as ”neutral” input. This ”reference” is chosen according to the

    problem at hand and can vary wildly from model to model. Then, just like other approaches, DeepLIFT

    backpropagates the contributions of every hidden unit (to the output) in the network to each feature of

the input, effectively creating a representation of how each feature contributed to that prediction. It then obtains the final explanation by identifying the differences between these backpropagated values and those obtained for the "reference" input.

The explanations that DeepLIFT produces can be thought of as a difference-from-reference: the bigger the difference, the more that feature contributed to the output. Using this difference-from-reference, DeepLIFT sidesteps one of the fundamental limitations of gradient-based methods, model saturation,

    which is illustrated in figure 3.5. Even when the derivative of a neuron with respect to a neuron in a

    previous layer is zero it might still be signaling relevant information. While the gradient would be zero,

    the difference-from-reference might not be.

    To experiment on how DeepLIFT fares against other explanation methods that used backpropagation

    the authors trained a convolutional neural network on MNIST to perform digit classification [39]. To


Figure 3.5: Both perturbation-based approaches like LIME and gradient-based approaches fail to model saturation. Illustrated here is a simple network that experiences saturation from its inputs. When both i1 = 1 and i2 = 1, perturbing either of them to 0 will not produce a change in the output. Similarly, the gradient of the output w.r.t. the inputs is also zero when i1 + i2 > 1. Image sourced from Shrikumar et al. [5]

    compare the different importance scores obtained by different methods, they try to identify which pixels

    to erase to convert the image to another target class. They pick up to 157 (20%) of the image’s pixels

    for each explanatory method. They are chosen in order of their importance scores as determined by

    the different methods. Finally they evaluate the change in the log-odds score between the original and

    target classes for the original image and one where the pixels have been erased. In the context of these

images, they defined the reference input as all zeros, and under this setup DeepLIFT outperformed all other backpropagation-based methods.

    Overall, gradient based methods have been showing a lot of promise in the field of deep neural

    network interpretability. By utilizing the internal mechanisms of the networks to create explanations, we

    bypass one of the most serious deterrents of the approaches we discussed in the previous chapter,

    the limitations of other models. Using backpropagation we can create explanations that are completely

    faithful to the original model. On the other hand, almost all the existing research in these methodologies

has been focused on networks that work in the image domain. This means that, although very promising, the real-world application of these approaches has been lagging behind the research.


4 Backpropagating Feature Importance



When we began working on the issue of unwrapping the black box that modern deep neural networks are, we had a clear goal: to be able to provide instance-based explanations. In order for these

explanations to be useful to a human with domain knowledge, they should reveal the causal relationship between the input received by the network and its output. In other words, they should provide a clear picture of

    the reasons behind a decision. The original approach we proposed to solve this problem was to try to

    create a linear representation of the local behaviour of the model. Our proposal would start by creating a

new dataset by perturbing the original input data point and using the network to label the newly acquired

    data points. Then, we would train a linear regression with this new dataset. For this, we proposed using

    a Least Absolute Shrinkage and Selection Operator, or in short, LASSO regression. With this new lin-

earization around the input point, we hoped to provide an accurate local representation of the network's behaviour. We note, again, that local fidelity does not imply global fidelity: features that are globally

    important may not be important in a local context, and vice versa.
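For reference, a minimal sketch of this initial (and, as discussed next, discarded) procedure; the Gaussian noise scale, sample count and regularization strength are arbitrary illustrative choices, and the network is assumed to output a single score.

import numpy as np
import torch
from sklearn.linear_model import Lasso

def local_lasso_surrogate(model, x0, n_samples=500, scale=0.1, alpha=0.01):
    # Perturb the original data point, label the perturbations with the
    # network's output and fit a sparse linear model on this local dataset.
    x0 = np.asarray(x0, dtype=np.float32)
    noise = np.random.normal(0.0, scale, size=(n_samples, x0.size)).astype(np.float32)
    perturbed = x0 + noise
    with torch.no_grad():
        scores = model(torch.from_numpy(perturbed)).squeeze(-1).numpy()
    surrogate = Lasso(alpha=alpha).fit(perturbed, scores)
    return surrogate.coef_   # local linear weights around x0

Note that every call still requires n_samples forward passes of the network, which is one of the drawbacks discussed next.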

    However, despite our initial conviction in this methodology we quickly realized that it had some glar-

ing flaws. When analyzing a network, we do not know its function surface and, because of that, we cannot be sure how the output varies with changes to the input. So, by using perturbations to create the new dataset from which the local behaviour would be learned, we risk creating a dataset of points that are distant from and not representative of the original prediction. Also, training this linear regression would require separate executions of the network on each perturbation. This meant that the complexity of calculating an

explanation would be dependent on the time it took to run the network and the number of perturbations

    made to the data. Taking all of this into account, we decided to change our approach to the linearization

    of the network and the approximation of its local behaviour.

    A feedforward artificial Neural Network is organized in layers with each layer performing different

    transformations on its inputs. The goal of these networks is to approximate a certain function f∗. Each

    of its layers can be seen as a function. This in turn means that a network with n layers can be seen as

    a chain of functions:

\[ f(x) = f_n(\dots f_2(f_1(x)) \dots) \]

    In order to train these networks, a backpropagation algorithm is used. Backpropagation is used to

calculate the gradient of the loss function w.r.t. the weights of the connections between the layers of the network. This gradient is then employed by the gradient descent optimization algorithm to adjust the weights of the neurons

    with the previously calculated gradient of the loss function. Assuming x as a vector of inputs and θ as

    the parameters of the network, the gradient at time t can be calculated as:

\[ \nabla = \frac{\partial \mathrm{Loss}(x, \theta_t)}{\partial \theta} \]
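Concretely, in PyTorch (a toy architecture and loss chosen purely for illustration), the chain of layer functions and the gradient computation above look as follows:

import torch
import torch.nn as nn

# f(x) = f3(f2(f1(x))): a feedforward network as a chain of layer functions.
f = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 3))

x = torch.randn(8, 4)                      # a batch of inputs
targets = torch.randint(0, 3, (8,))        # toy labels
loss = nn.CrossEntropyLoss()(f(x), targets)

loss.backward()                            # backpropagation fills p.grad with dLoss/dtheta
for name, p in f.named_parameters():
    print(name, tuple(p.grad.shape))       # one gradient tensor per parameter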

    However, backpropagation is merely a special case of a more general technique called automatic

differentiation. In essence, automatic differentiation iteratively computes gradients for each layer using the chain rule. Although this is commonly used as a way to train a network such that it can learn appro-

    priate internal representations that allow it to approximate f∗, as we saw in the previous chapter other

    uses, such as network interpretation, have been proposed for gradients. Here, we propose generalizing

the method by Simonyan et al. [4] so it can be used in a wider spectrum of networks. Their original proposal, and most subsequent works [18, 35, 38, 40], apply this methodology solely to image classification

    networks with convolutional layers. With this work we intend to explore a generalization of this approach

that can be applied to any network architecture. We hope to provide an easy-to-use way for deep

    learning practitioners to interpret their networks and extract relevant information from their predictions.

    Specifically, consider a system that classifies some input into a class from a set C. Given an input x,

    many networks will compute a class activation function Sc for each class c ∈ C. The final result will be

    determined by which class has the highest score.

\[ \mathrm{class}(x) = \operatorname*{arg\,max}_{c \in C} S_c(x) \]

We define an explanation vector of a data point x0 as the gradient of the output w.r.t. the input x, evaluated at x = x0. With it, we estimate a weighted vector of feature importances. If Sc is piecewise differentiable

    w.r.t x, for all classes c and over the entire input space, one can construct a feature importance vector

    Ic(x) by computing:

\[ I_c(x_0) = \left. \frac{\partial S_c(x)}{\partial x} \right|_{x = x_0} \tag{4.1} \]

    This explanation vector is a local n-dimensional representation of the behaviour surrounding the

    decision point on the surface of the function learned by the network. By having a representation of the

    behaviour around that point of the manifold we can measure the sensitivity to change of the function

    value with respect to changes in its arguments.

    In a more intuitive manner, Ic represents how much difference a change in each feature of x would

    cause in the classification score of class c. Thus, entries in Ic with large absolute values highlight

    features that will influence the label of x0. On the other hand, the sign of the individual entries expresses

    whether the prediction would increase or decrease when the corresponding feature is increased locally.
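In practice, computing Ic(x0) amounts to a single backward pass through the trained network. A minimal PyTorch sketch (the model and class index are placeholders of our own):

import torch

def explanation_vector(model, x0, target_class):
    # I_c(x0): gradient of the class score S_c w.r.t. the input, evaluated at x0.
    x = x0.clone().detach().requires_grad_(True)
    score = model(x.unsqueeze(0))[0, target_class]   # S_c(x0)
    score.backward()                                 # single backward pass
    return x.grad                                    # one importance value per feature

Because only one backward pass is needed, the cost of an explanation is comparable to the cost of a single prediction.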

    These explanations can also be motivated with a simple example. Consider a linear model for the

    class c:

\[ S_c(x) = w_c^T x + b_c \]

    assuming x is represented in a vectorised form, and wc and bc are respectively the weight vector and

the bias of the model. In this case, we can easily see that the magnitude and sign of the elements of wc

    will define the importance of the corresponding features of x for the class c. In the case of deep Neural


Networks, the class score Sc(x) is a highly non-linear function of x. Because of this, it is extremely hard

    to apply this reasoning. However, given an input x0, we can approximate Sc(x) with a linear function in

    the neighbourhood of x0 by computing the first order Taylor expansion.

\[ S_c(x) \approx d^T x + b \]

where d is the derivative of Sc w.r.t. the input x at the point x0, as defined in 4.1, and b collects the terms that do not depend on x (namely Sc(x0) − dTx0). The gradient of a function defines its best linear approximation around a point, which can also be thought of as a Taylor series approximation with a Taylor polynomial of degree 1. As such, the gradient can be seen as an approximation of the linearization of the function and, in essence, of

    feature attribution. We provide a simple example in the following figure:

Figure 4.1: Explanation of individual classifications of a simple Neural Network trained on a toy problem. On the top we have the dataset and the learned decision curve, left and right respectively. On the bottom, the visualization of the local explanation for points equally spaced throughout the dataset. On the left, the real values of the importance of each feature and, on the right, a normalized version. We further explain this image in the text below.


On the top left corner of Figure 4.1 we can see the training data for the simple classification task. On

the top right is the model learned using a simple network, which outputs the probability of a point belonging to the yellow class. On the bottom left corner we can see the probability gradient (background)

    together with local explanation vectors generated by calculating the gradient of the output w.r.t the input.

    We can see that along the borders of the triangle the explanations point overwhelmingly towards the

triangle class. Furthermore, it is noticeable that along the hypotenuse and at the corners of the triangle both features are considered important for the explanation, while along the edges the importance is dominated by one of the two features. In the bottom right we can see the direction of the explanation vectors, represented without regard for their magnitude.

    This method of creating a linear representation of the network’s behaviour has several improvements

    over our initial solution. The most prominent one is the fact that since we use local gradients of the trained

    model as explanatory vectors, they will always follow the model and its assumptions, providing a perfectly

    faithful explanation. Furthermore, while perturbations would need to be handled in a myriad of ways

    depending on the type of data being analyzed (categorical, continuous, sequential, etc), this approach

    can consider any type of feature without any predefined structure. Lastly, by capturing the gradient of the

    model we can experiment and measure the influence of changes in any direction, including weighted

    combinations of variables. Perturbations would require us to analyse and experiment with different

    individual features, omissions of sets of features or various combinations. Each of these would involve

    running the model with the perturbed data, increasing the complexity of finding an explanation. With

    the aforementioned approach we have perfect freedom to measure the impact of variables without an

    increase in the complexity of finding the explanation.
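To make the last point concrete: once the gradient is available, the sensitivity of the score to any direction of change, including a weighted combination of variables, is just a dot product and requires no additional runs of the network. The small network and the chosen direction below are arbitrary placeholders.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 2))   # placeholder network
x = torch.randn(4, requires_grad=True)
model(x.unsqueeze(0))[0, 0].backward()                               # explanation vector now in x.grad

direction = torch.tensor([1.0, -0.5, 0.0, 0.0])                      # a weighted combination of features
sensitivity = torch.dot(x.grad, direction)                           # directional derivative, no extra forward passes
print(sensitivity)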

    However, there are some inherent problems in our approach that should be discussed. The first

    is something intrinsic to any method of providing a local explanation. Visualizing a single point in an

n-dimensional space provides a limited view into the workings of a model. Because of the nature of such

    a space, we can’t guarantee that the behaviour of the net is maintained throughout its function surface.

    As a result, extremely similar data points might have completely different explanations from the network.

This means that humans must look at all the explanations produced; merely looking at some representative examples and assuming that we have understood the behaviour might be a very costly error. Finally, this interpretability method exhibits another problem that we previously discussed, neuron saturation

    (Figure 3.5). This is an issue to which different researchers assign varying degrees of importance. On

    one hand, in an example such as the one in the aforementioned figure (3.5) gradient based methods

    will fail to assign any importance to any of the input features. On the other hand, it is extremely hard to

    imagine an example where a wider network could have most neurons saturated at the same time. Unless

    that is the case, the remaining neurons and weights of the network should be enough to measure feature

    importance.


Overall, we believe that our approach is one of the most reliable and efficient ways of interpreting deep Neural Networks. By using backpropagation to explain the behaviour of the network, we not only manage to do so in a single backward pass, but we also provide explanations that are completely faithful to the model's behaviour.



5 Results

    Contents

5.1 Medical Domain

5.2 Industrial Domain

5.3 DeepDIVE: Deriving Importance VEctors



In order to test the viability of our approach, we need to test its performance on real networks. We

    decided to carry out these tests on datasets that are in some way representative of the key areas where

    interpretability is needed the most. With these experiments, we intend to put to the test the feasibility of

    utilizing our approach in real world applications. Our method must prove to be fast, reliable and capable

of providing relevant information to a human with domain knowledge. However, there are some problems

    with the current state of interpretability research.

    Contrary to other popular areas in machine learning, such as natural language or computer vision,

    there are no standardized tests for interpretability. Each new paper or approach that is suggested

    defines new evaluation metrics and experiments. This creates a fragmented research landscape as

    approaches are evaluated in different ways, allowing authors to choose the light than want to show

    them in. Although we will not suggest a standardized test for interpretability in the context of this work,

    we try our best to follow best practices and experiments laid out by other recognized authors in the

field [5, 31, 40]. Nonetheless, there is another problem intrinsic to evaluating explanations of neural network decisions: testing faithfulness to the model. Most neural networks are trained to solve problems

    in an n-dimensional space which is extremely hard for us to comprehend. Because of this, when we are

    looking at an explanation, how can we verify if the explanation is faithful to the underlying model? We

    can see if it makes sense to someone with domain knowledge, but how can we make sure that those

    are the features the model is actually using?

    We suggest a straightforward way to test the faithfulness and fidelity of the explanations to the un-

    derlying model. We outline a simple, theoretically solved problem and train a network to solve it. Having

    the theoretical solution to the problem we can then analyze the error surface of the model and compare

    it to the theoretical solution to make sure it is learning the correct pattern. If it is, when we create expla-

    nations for its error surface we can verify if these explanations are correct by comparing them with the

    feature attribution of the theoretical solution.
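As an illustration of this kind of sanity check, the following sketch uses an assumed linear ground truth and an arbitrary small network; every concrete choice below (weights, architecture, training schedule) is ours and purely illustrative.

import torch
import torch.nn as nn

torch.manual_seed(0)
true_w = torch.tensor([2.0, -1.0, 0.0, 0.5])       # known, theoretical feature attribution
X = torch.randn(2048, 4)
y = (X @ true_w > 0).long()                        # labels of the theoretically solved problem

net = nn.Sequential(nn.Linear(4, 16), nn.Tanh(), nn.Linear(16, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(300):                               # short training loop
    opt.zero_grad()
    nn.CrossEntropyLoss()(net(X), y).backward()
    opt.step()

# Explanation for one point: gradient of the positive-class score w.r.t. the input.
x0 = X[0].clone().requires_grad_(True)
net(x0.unsqueeze(0))[0, 1].backward()
print(x0.grad)     # should roughly align in direction with true_w
print(true_w)

If the explanation vectors consistently disagreed with the known attribution, that would indicate the explanations are not faithful to what the network actually learned.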

We chose two different areas to focus our experiments on; both are areas where interpretability can have a massive influence on deep learning adoption. One of the areas where interpretability

    research is of paramount importance is machine learning in the medical domain. Any decision related

    to human health is one that should be taken responsibly and with ethical concerns in mind. Decisions

    need to be evaluated individually and pondered on by a human who will take responsibility for them.

    This creates an extremely difficult problem for AI adoption in the field, especially in the current situation

    of neural networks being black boxes. With this in mind, we decided it would be a splendid domain to

    test our interpretability approach.

The other domain we decided to tackle is one that is attracting increasing attention from AI research: machine automation for industrial processes. Automation has been playing a pivotal role in the twenty-

first century's race for factory efficiency. More and more factories are automating a majority of their processes with both software and hardware. New types of "lights out" factories (fully automated, requiring no humans to be present, and therefore able to operate with the lights off) are being introduced yearly.

    Until recently, most of the software controlling robotic hardware in factories was tailor made. In recent

years, however, advances in reinforcement learning and deep learning have proved to be far more efficient at teaching machines complex behaviour. These advances could be used to further improve

    factory automation and even introduce robots in environments where human defined programming is

not adequate. But once again, these efforts are slowed down by the black-box nature of deep learning, because, although perhaps not as effortlessly, it is easy to imagine a wide array of situations where unexpected behaviour from an autonomous robot might have severe economic or even dangerous consequences.

    5.1 Medical Domain

    5.1.1 Context

Machine learning applied to healthcare is booming: every year, several new approaches are suggested to solve pattern-detection problems, such as detecting pneumonia or tuberculosis in X-rays [41, 42], and new surveys analyse the impact this technology might have on the way healthcare works. But although

    some of these systems already achieve state-of-the-art results and have much better performance than

human experts, adoption has been slow [43]. Interpretability, or the lack thereof, has been one of the main reasons for the reluctant adoption of Deep Learning in healthcare. As with anything related to human health and safety, there is an enormous amount of uneasiness in letting a system we do not fully

    comprehend make decisions that will directly influence our health. Furthermore, it is easy to imagine

    examples such as the one by Caruana et al. [10] where seemingly high performing models are making

    a mistake and invalidating their own results. In healthcare, this cannot happen.

    With this in mind, we set out to find a dataset we could use that would be representative of the real

world circumstances of a machine learning problem in the medical domain. Luckily, concurrently with our

    work, another student from our home institution (Instituto Superior Técnico) was pursuing a thesis in this

    field [44]. Their work was based on a set of medical questionnaires from the biggest pediatric hospital

in Portugal, Hospital D. Estefânia. These were filled in by doctors to assess risk factors in children being

    evaluated for mental health issues. The work of our colleagues was to digitize, analyse and propose

    changes to these questionnaires. They tried to identify redundant questions by training a Decision Tree

on the data, minimizing the number of inner nodes (while maintaining performance) and pinpointing the

    questions whose nodes were removed from the tree. This created a valuable parallel to our own work.

    Not only did they create a dataset that could illustrate a real healthcare issue, they also analysed the

    data and identified