
Page 1: Understanding RNN States with Predictive Semantic ...summit.sfu.ca/system/files/iritems1/19423/etd20450.pdf · Understanding RNN States with Predictive Semantic Encodings and Adaptive

Understanding RNN States with Predictive Semantic Encodings and Adaptive Representations

by

Lindsey Sawatzky

B.Sc., Thompson Rivers University, 2010

Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of

Master of Science

in the
School of Computing Science
Faculty of Computing Science

© Lindsey Sawatzky 2019
SIMON FRASER UNIVERSITY

Summer 2019

Copyright in this work rests with the author. Please ensure that any reproduction or re-use is done in accordance with the relevant national copyright legislation.


Approval

Name: Lindsey Sawatzky

Degree: Master of Science (Computing Science)

Title: Understanding RNN States with Predictive Semantic Encodings and Adaptive Representations

Examining Committee:

Chair: Parmit Chilana
Assistant Professor, Computing Science

Fred Popowich
Professor, Computing Science
Senior Supervisor

Steven Bergner
Research Associate, Computing Science
Co-Supervisor

Jiannan Wang
Assistant Professor, Computing Science
Examiner

Date Defended: July 26, 2019


Abstract

Recurrent Neural Networks are an effective and prevalent tool used to model sequential data such as natural language text. However, their deep nature and massive number of parameters pose a challenge for those intending to study precisely how they work. This is especially the case for researchers with the expertise to understand the mathematics behind these models at a macroscopic level, but who often lack the tools to expose the microscopic details of what information they internally represent. We present a combination of visualization and analysis techniques to show some of the inner workings of Recurrent Neural Networks and facilitate their study at a fine level of detail. Specifically, we use an auxiliary model to interpret the meaning of hidden states with respect to the task level outputs. A visual encoding is designed for this model that is quickly interpreted and relates to other elements of the visual design. We also introduce a consistent visual representation for vector data that is adaptive with respect to the available visual space. When combined, these techniques provide a unique insight into RNN behaviours, allowing for both architectural and detail views to be visualized in concert. These techniques are leveraged in a fully interactive visualization tool which is demonstrated to improve our understanding of common Natural Language Processing tasks.

Keywords: Data transformation and representation, machine learning, visualization system, dimensionality reduction


Acknowledgements

As those of you who know me are aware, I am brief with my words, often to a fault. However, I hope you understand that this brevity is not proportionate to my gratitude. I am immensely grateful to everyone who has helped me in any way, directly or indirectly, throughout this process; know that your support did not go unnoticed and you have my deepest appreciation.

I would like to say a special thank you to Dr. Fred Popowich for taking me on as a graduate student in the first place as well as supervising me through my entire degree. Having been away from school for a number of years, I wonder if he feels the same apprehension in taking me on as I did when I first returned. Nevertheless, Fred has mentored and encouraged me through this process tirelessly. His insightful feedback and timely responses are greatly appreciated, and I could not have succeeded without his direction and support.

I also want to thank Dr. Steven Bergner for his immeasurable guidance in this thesis, as well as for providing the platform from which this work could grow. These ideas started from a simple course project and although the work here ultimately evolved several times, its genesis was in his visualization course. More significantly, Steven has spent countless hours discussing these ideas and helping me refine them into their final form. I would also like to say a special thank you to him for suffering my many last-minute reviews. In all of this you have my most sincere gratitude.

Last but not least, I whole-heartedly thank my beautiful wife for her love and support through not only this thesis, but my degree as a whole. Her sharp eye, intelligent feedback, and immeasurable patience have played no small role in my small successes. Moreover, whenever my flame flickered or waned, she was always by my side with the encouragement, compassion, and understanding that sparked it back to life again. Thank you Sehar, from the bottom of my heart.


Table of Contents

Approval ii

Abstract iii

Acknowledgements iv

Table of Contents v

List of Tables vii

List of Figures viii

1 Problem Introduction 1
1.1 Challenges of Visualizing RNNs 2
1.2 Thesis Contribution 3
1.3 Research Findings 5
1.4 Thesis Outline 7

2 Background 8
2.1 Literature Review 8
2.2 Recurrent Neural Networks 13

3 Methods 17
3.1 Domain Problem and Data Characterization 17
3.2 Operation and Data Type Abstraction 19
3.3 Predictive Semantic Encodings 22
3.4 Adaptive Representations 24
3.5 Interactive Visualization Framework 27
3.6 Technical Design and Implementation 28

4 Applications 30
4.1 Analysis of Predictive Semantic Encodings 30

4.1.1 Research Question 30
4.1.2 Methodology 30


4.1.3 Analysis and Results 32
4.1.4 Conclusion and Discussion 36

4.2 Analysis of Adaptive Representations 37
4.2.1 Research Question 37
4.2.2 Methodology 37
4.2.3 Analysis and Results 38
4.2.4 Conclusion and Discussion 42

4.3 Case Study I: Exploring Information Flow 43
4.3.1 Research Question 43
4.3.2 Methodology 43
4.3.3 Analysis and Results 44
4.3.4 Conclusion and Discussion 48

4.4 Case Study II: Exploring Feature Representations 49
4.4.1 Research Question 49
4.4.2 Methodology 49
4.4.3 Analysis and Results 50
4.4.4 Conclusion and Discussion 60

5 Conclusion and Future Work 62

Bibliography 66

Appendix A Long Short-Term Memory 70

Appendix B Literature RNN Comparison Categorization 71

Appendix C Adaptive Representation Scale Function 73


List of Tables

Table 3.1 Various forms of informative comparisons, based on multiple axes of hidden state data. (1) The definition of a hidden state kind is unique per timestep. (2) The definition of Intra-Hidden State only observes a single hidden state vector. 21

Table 4.1 The applied colour map for the language model visualization. Colouring is based on coarse-grained part-of-speech tags. 44

Table 4.2 Candidate activations of the 1st layer Cell c^1_t for the quotation block latent feature representation, discovered visually by inspection of Figure 4.11. 57

Table 4.3 Results for the TFM query of the 1st layer Cell hidden state based on tolerance δ and parameters from Table 4.2 to find the quotation block pattern “ ... ”. The token _ denotes any word or symbol match that is not a start or end quotation mark. 58

Table 4.4 Tolerance Feature Matching activations for the quotation block feature, as represented in the 1st layer Cell of the trained LSTM. With a tolerance level δ = 0.25, this representation exclusively matches 613 quotation blocks. 59

Table B.1 Placement of existing literature into the developed RNN Comparison Categorization. 72


List of Figures

Figure 2.1 Schematic of two layer RNN architecture with Sequence-to-label and Sequence-to-one input output schemes. Notice, hidden states from each layer are fed into the input of the next layer as well as the next timestep, providing the context necessary to model long term sequential inputs. The techniques proposed in this paper apply to any form of RNN (including those this diagram does not capture). 15

Figure 3.1 1) Intuition of colour interpolation based on the top-2 prediction classes. 2) Visual representation for PSE, a fit colour rectangle and mini-bar chart. 24

Figure 3.2 1) The proposed scale function, defined in Appendix C, as compared to the logarithmic and linear scale functions when applied to the domain [0, 20] and range [0, 10]. 2) Adaptive Representations matrix-of-cells glyph for the vector {0, 0.1, 1, 2, 5, 8, 9, −10} using the various scale functions. Notice how the top and bottom three values are barely distinguishable between the linear and logarithmic scales, respectively. 25

Figure 3.3 Two deterministic dimensionality reduction techniques, 1) Fixed width buckets, and 2) Learned buckets, that can be applied to reduce h into v. In the case of the learned buckets, the learned mapping is dimensions h1,3 → v1 and h2,4 → v2. The mean squared error (MSE) shows the learned buckets portray this specific example more accurately. 27

Figure 3.4 Schematic of the client-server architecture implemented for the visualization tool. Various data generation and analysis tasks are relegated to a separate set of CLI tools. 29

Figure 4.1 Per hidden state test set perplexity of the fully trained PSE using a 0-layer projection. The y-axis representing perplexity is truncated at 400, despite some values exceeding this range. Notice the 2nd layer Output result is on par with that of MUI. 33


Figure 4.2 Per hidden state test set perplexity of the fully trained PSE using a 2-layer Feed Forward Neural Network. The scale in this figure is the same as that from Figure 4.1. 35

Figure 4.3 Average accuracy of Fixed Width Buckets and Learned Buckets dimensionality reductions against the Penn Treebank language model task for a 2 layer, 300 width LSTM. 39

Figure 4.4 Per hidden state accuracy of the Learned Buckets dimensionality reductions against the Penn Treebank Language Model task at P = 10. The y-axis shows a log scale. 41

Figure 4.5 Visual comparison of hidden states for two slightly different noun phrases. PSEs show general semantic agreement between the phrases, compared left to right. ARs show discernible differences in Output states h, giving an intuition of where the differences in data lead to different predictions y. 45

Figure 4.6 Architecture View for the 6th and 7th timestep of the input sequence “ we stand in solidarity , ” she emphasized .. The top shows the trend towards predicting the closing quotation ”, while the bottom shows the change in the language model to words that follow a quote. 47

Figure 4.7 View of the c^1_t hidden state for the input sequence “ we stand in solidarity , ” she emphasized .. Gradual changes in the information the state encodes stand out, such as the continual growth in the 1st row, 2nd of Activity Progression. 48

Figure 4.8 Detail View of the c^1_1 hidden state for the input sequence “ we stand in solidarity , ” she emphasized .. 1) The same low detail Adaptive Representation which was selected from the Architecture View to zoom into this view. 2) The high detail Adaptive Representation for the same hidden state as from (1). Notice, when the user hovers over any matrix cell a dual black bordering is established between the cell from the low detail AR and the corresponding cells in the high detail AR. The context of where in the input sequence (3) as well as which component of the RNN (4) is maintained within the view. 51

Figure 4.9 Relative changes in the found Cell dimensions {58, 223, 251} from the start of the sequence to its end. X-axis marks are intentionally excluded, as sequences vary in length. Monotonic and Non-monotonic lines represent dataset wide averages, while Minimum and Largest Drop show the value changes across a single sequential instance. 53


Figure 4.10 Detail View comparison of the Cell hidden states for two sequences exhibiting the quotation block feature. Sequences have been aligned in the view so that the end quotation mark ” representations c^1_7 and c^1_10 are compared. Histogram of the similarity measure between the two hidden states (1) shows them to be quite similar, despite being derived from different quotation block sequences. 55

Figure 4.11 Evolution from Figure 4.10, with Detail View focusing on the most similar activations between the selected Cell c^1_7 and c^1_10 hidden states. Only 99% similar values are shown, allowing users to view likely candidate dimensions for the quotation block latent feature representation. 56

Figure 4.12 One-off threshold perturbations on the quotation block feature discovered in Table 4.4 for the 1st layer Cell. Each column represents a perturbation (≤ or ≥) applied to a dimension. Results show thresholds only sometimes capture more features, while sometimes additionally matching extraneous features. 60


Chapter 1

Problem Introduction

Recent years have seen an explosion of success in applying machine learning techniques to solve problems represented with sequential data, such as natural language text and speech. In particular, Recurrent Neural Networks (RNNs) have been successful at these sequential data modelling tasks due to their ability to capture and retain long term information. These models have proved especially effective in the field of Natural Language Processing (NLP), continuously improving state of the art results in problems such as language modelling [23, 13], sentiment analysis [39], and machine translation [37, 4].

Despite their proven efficacy, RNNs and machine learning models can be difficult to interpret and understand. This is a result of the black box nature of these systems, which simply learn to model example input-output data pairs. Ultimately, as a result of this training process, latent features of the dataset are captured abstractly as a series of numeric calculations within the RNN. Therefore, when it comes to understanding how the model has come to answer an arbitrary query, the only explanation is a long list of calculations without attribution to the latent features they correspond to.

Moreover, recent studies have shown these models learn unintended biases implicit in the data they are trained against [7]. One such troubling example is that of learning sexism embedded within the corpora, such as that nurses are female while doctors are male [6]. These problems affect the degree to which these models can be deployed and trusted, in both innocuous and safety critical environments.

Visualizations have been proposed as one way of better understanding the complex decisions RNNs make [28, 18, 15]. By explaining complex data in a digestible form, visualizations provide a degree of accessibility that cannot be achieved by looking at raw data. Moreover, visualizations can be designed to specifically highlight and expose the latent features or other details represented by RNNs. In sum, visualization forms a natural avenue that facilitates the interpretation and understanding of these models in an intuitive and human-centric way [25, 29].

This research explores the domain of RNN visualization with the objective of using these techniques to better understand RNN behaviours and the latent information they capture.


To this end, we use visualization techniques both directly and indirectly to interpret and analyze some of the features these models learn to represent. Our research looks specifically at a trained model, at which point its parameters are fixed and it is used for decision making, such as in production environments.

1.1 Challenges of Visualizing RNNs

Although visualization techniques take an effective step towards explaining complex machine learning models, visualizing these models, and RNNs in particular, comes with its own set of challenges.

One such challenge is that of displaying hidden states, one of the key internal data representations in neural models. Hidden states are vectors which contain hundreds to thousands of dimensions of numeric values. This sheer scale of data is precisely one of the reasons visualizing hidden states is so challenging. This problem is usually addressed either by viewing only a subset of dimensions at once, or by viewing the data in some abstract way. Some examples of the latter are to project the data into lower dimensional spaces, or to view dimensions across time rather than within the hidden states themselves.
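As a toy illustration of these two coping strategies, the following sketch contrasts viewing a subset of one hidden state's dimensions with viewing a single dimension across timesteps; the helper names and hidden state values are invented for the example, not taken from the thesis's tooling.

```python
# Hypothetical sketch of two ways to cope with high-dimensional hidden states:
# view a subset of dimensions within one state, or one dimension across time.

def view_subset(hidden_state, dims):
    """View only a chosen subset of dimensions of a single hidden state."""
    return [hidden_state[d] for d in dims]

def view_across_time(hidden_states, dim):
    """View a single dimension across all timesteps instead of within a state."""
    return [h[dim] for h in hidden_states]

# Toy hidden states: 3 timesteps, 6 dimensions each (real states have hundreds).
states = [
    [0.1, -0.3, 0.5, 0.0, 0.2, -0.1],
    [0.2, -0.1, 0.4, 0.1, 0.3, -0.2],
    [0.4,  0.0, 0.1, 0.2, 0.5, -0.4],
]

print(view_subset(states[0], [0, 2]))  # dimensions 0 and 2 at timestep 0
print(view_across_time(states, 0))     # dimension 0 across all timesteps
```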

Although these techniques have provided meaningful interpretations of hidden state values [14, 15, 35], they take a step away from the underlying mathematical description of the model itself. By only looking at a subset of the data, or by observing it in some abstract way, the visualization elements are not immediately relatable in terms of the model architecture. This dissonance can put cognitive load on the user to understand the visualization and model while they reorient between the two.

Another challenge in the visualization of RNN hidden states is that of interpreting their meanings or semantics. Hidden states encode information and patterns in complex, distributed ways, and their interactions are highly non-linear. Slight perturbations to the value in a single dimension may effect great changes in downstream vectors and the RNN's final prediction. Moreover, the dimensions, even at an individual level, can be hard to interpret as they do not directly correspond to real world concepts.

A common approach to understanding word embeddings, a specific kind of hidden state, is to project them into 2-D representations whereby underlying relations and clusters are exposed. Although this technique has been shown to expose interesting patterns within the data, it is not suitable for representing the semantics of other hidden states. Non-word embedding hidden states do not have a direct correlation to real world concepts as word embeddings do to words. Furthermore, this technique is not suitable for comparing the meaning of different types of hidden states, which may not necessarily encode the same set of latent features.
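A minimal sketch of such a 2-D projection is given below, using PCA via the SVD on synthetic stand-in embeddings; a real analysis would use trained word embeddings and often a non-linear method such as t-SNE, so everything here is illustrative only.

```python
import numpy as np

# Hedged sketch: projecting embeddings to 2-D so relations and clusters become
# visible. The 8-dimensional vectors below are synthetic stand-ins for trained
# word embeddings.

def pca_2d(embeddings):
    """Project an (n, d) embedding matrix onto its top-2 principal components."""
    X = embeddings - embeddings.mean(axis=0)          # centre the data
    _, _, vt = np.linalg.svd(X, full_matrices=False)  # right singular vectors
    return X @ vt[:2].T                               # 2-D coordinates

rng = np.random.default_rng(0)
emb = rng.normal(size=(20, 8))   # 20 synthetic "word" embeddings
coords = pca_2d(emb)
print(coords.shape)              # one 2-D point per embedding, ready to plot
```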

The final challenge we mention in visualizing RNNs is that of relating the visualization to the underlying model. With their size, not only in number of parameters, but also in depth (layers) and span across time, RNNs are large and complex systems. In order to fully grasp how they behave, visualizations must tackle the problem of showing their internal components, the interaction between these components, and the general flow of information which coalesce into the final outcome.

This is especially the case given that the various components of RNN models have been designed with a specific role or intent in mind. If visualizations can be constructed to show these components at once, then their function within the RNN can be confirmed and examined in closer detail.

Visualizations tend to circumvent this challenge by focusing on only particular aspects of the RNN at a time. One idea that has been used successfully is to use the computational graph as a frame of reference by which components can be selected for visual inspection. Unfortunately, this only addresses framing the visualization in the underlying mathematical model, and does not show the interaction between components or the flow of information.

This challenge is particularly prohibitive in terms of communicating how these models work in instructional contexts. Although a niche situation, visualizing RNNs in simple, understandable ways is an important step in training machine learning practitioners who will go on to deploy and further research these models.

1.2 Thesis Contribution

This thesis aims to expose further insights into RNN behaviours in a number of ways. Primarily, we develop techniques that address some of the challenges described in the previous section. We use these techniques both directly in the form of visualization and indirectly at the level of model analysis to gain insights into RNN behaviours and the latent features they represent.

With respect to directly visualizing RNNs, we combine the proposed techniques into a general purpose visualization tool which can be applied to a number of sequential data modelling tasks. The tool is designed around the core interaction and architectural paradigms inherent to sequential data, allowing users to study and interact with the RNN in a natural way.

Specifically, our contributions are:

1. RNN Comparison Categorization (introduced in 3.2): An investigation of the various forms of comparison relevant to interpreting the internal details of RNN models. This investigation develops a categorization of these forms of comparison and places the existing literature into their relevant categories.

2. Predictive Semantic Encodings (introduced in 3.3): An interpretation framework for hidden states along with an easily understood visual encoding that facilitates comparison tasks. The visual encoding provides a novel view into the flow of information throughout the RNN.

3. Adaptive Representations (introduced in 3.4): A visual metaphor that represents hidden states in a consistent and precise way across varying levels of detail. This technique is shown to facilitate the discovery of latent features represented within the RNN.

4. Tolerance Feature Matching (introduced in 4.4.3): A novel approach to mapping hidden state details back to latent features of the sequential data. This technique is an extension of previous work that allows for more expressive queries to be formulated, and is shown to accurately reflect how latent features are represented within the RNN.

The RNN Comparison Categorization delves into the idea that comparisons are a useful abstraction for interpreting and understanding the internal details of Recurrent Neural Networks. We systematically explore the dimensions across which comparisons may be formulated, and the information these types of comparisons may generally reveal. This categorization system is then applied to the relevant literature to understand how well the system is currently covered, and where future research can be targeted. Many of the findings from this contribution lead to the later proposals of our research.

Predictive Semantic Encodings (PSE) are a method of interpretation which relates hidden states to the output predictions they are generally associated with. This is a specific form of what is sometimes referred to as an “auxiliary prediction task”. By interpreting hidden states under this formulation, users can intuitively interpret and understand hidden states despite their underlying complexity. Even though this is a relatively shallow interpretation, for example, not going into the level of latent sequential features, it is still shown to reveal interesting facets of the RNN model.

Moreover, we design a visual encoding that is well suited to this method of interpretation. The visual encoding is compact, facilitating observation of the flow of information throughout the RNN. Key parts of the internal RNN formulation can be seen as having particular effects on the model predictions. The PSEs enable users to quickly identify areas of interest within the model, drawing attention so that detailed analysis can be performed effectively, while also confirming the high level role of the model components.
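As a rough sketch of the auxiliary-prediction idea behind PSE, the following linear probe maps a hidden state to a distribution over output classes (in the spirit of the 0-layer projection of Figure 4.1). The weights are untrained random placeholders and the toy dimensions are invented; a real PSE is trained against the task outputs.

```python
import numpy as np

# Minimal sketch of an "auxiliary prediction task": a linear probe that
# interprets a hidden state as a distribution over the task's output classes.
# Weights here are random placeholders, not a trained model.

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def pse_probe(hidden, W, b):
    """Interpret a hidden state as a predicted output distribution."""
    return softmax(W @ hidden + b)

rng = np.random.default_rng(1)
n_classes, width = 5, 8          # toy output vocabulary size and hidden width
W = rng.normal(size=(n_classes, width))
b = np.zeros(n_classes)
h = rng.normal(size=width)       # one hidden state

p = pse_probe(h, W, b)
print(p.argmax())                # top prediction class, usable for colouring
```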

The Adaptive Representations (AR) act as a device which specifically mitigates the challenge of visualizing the vast quantity of data encompassed within hidden states. Not only does this representation handle hidden states with many dimensions, it also allows for visualizing many hidden states at once - an important factor in showing the complex and numerous interactions contained within the RNN.

By visualizing this data at a low level of detail, users can more easily see the flow of information across the scale of time within the model. This also allows for the visualization to be drawn in ways that mirror the model's mathematical description, reducing the cognitive burden of understanding the visualization. When drawn at high levels of detail, the visual encoding can be used to inspect the numerous dimensions of hidden state vectors. By representing hidden states in the same form across varying levels of detail, ARs further ease the burden of understanding the scale of data encoded within the model.

Moreover, this visual encoding is shown to suit a particularly useful level of comparison. This comparison is used to reveal latent sequential features represented within the RNN.

Tolerance Feature Matching (TFM) is a model through which the latent features RNNs encode can be represented. This interpretation of the data hidden states encode was inspired by use of the contributed comparison categorizations and the developed visualization tool. This technique uses tolerance levels, rather than the thresholds which have been used previously, to construct latent feature representations that reflect RNN details.
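The precise TFM formulation is developed in Section 4.4.3; one plausible reading of the core idea, sketched below, is that a hidden state matches a latent-feature representation when every chosen dimension lies within a tolerance δ of a reference activation, rather than crossing a one-sided threshold. The dimension indices and activation values are invented for illustration.

```python
# Hedged sketch of tolerance-based feature matching. A state matches if every
# chosen dimension is within +/- delta of its reference activation, in contrast
# to the one-sided thresholds used in prior work.

def tfm_match(hidden, reference, delta):
    """reference maps dimension index -> reference activation."""
    return all(abs(hidden[d] - v) <= delta for d, v in reference.items())

reference = {2: 0.9, 5: -0.8}    # invented candidate latent-feature dimensions
state_a = [0.1, 0.0, 0.85, 0.3, 0.2, -0.75]
state_b = [0.1, 0.0, 0.40, 0.3, 0.2, -0.75]

print(tfm_match(state_a, reference, delta=0.25))  # True: both dims close enough
print(tfm_match(state_b, reference, delta=0.25))  # False: dim 2 is too far off
```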

1.3 Research Findings

We find these contributions to be an effective lens through which to view the behaviour of RNNs, and apply the techniques specifically to the well known Long Short-Term Memory RNN architecture. Our research focuses on the NLP task of language modelling; however, these techniques are not limited to this task. In fact, given the complexity of the latent feature space of this task, as well as its large number of potential output classes, the task of language modelling is a good benchmark by which to evaluate the versatility of the proposed contributions.

Firstly, we verify the fundamental accuracy of the proposed Predictive Semantic Encodings and Adaptive Representations for the language modelling dataset. This result serves not only to validate that these techniques work on practical applications, but also as a basis on which further analysis can be performed. Although outside the scope of visualization, this analysis is shown to provide explanations and insights into RNN behaviours.

Specifically, through this analysis we find that architecture choices for the PSE have a large impact on its accuracy. These results suggest that individual models, in some cases with various architectures, should be trained for each kind of hidden state (Section 4.1.3).

Furthermore, the PSE analysis reveals an interesting property around the types of information encoded in the layers of the RNN. Generally speaking, the hidden states from the 2nd layer out of two are better indicators for the final task classification - a result that is consistent with the model architecture, as the actual RNN uses a hidden state from the 2nd layer to make its final prediction. However, this finding is inverted for the components of the RNN which capture long term memory, with the 1st layer hidden states significantly outperforming those from the 2nd layer. This suggests that the memory units of earlier layers encode information directly useful with respect to performing the task predictions. The memory units of later layers, on the other hand, encode far more abstract information that must necessarily be switched in order to make the final task classification. This implies possible architectural optimization, forming a basis for future research in this area.

Also in the analysis external to the visualization task, we discover that the AR technique can be used to effectively reduce the number of elements that must be displayed for any given hidden state, while still maintaining a high degree of accuracy and comparability. Our experiments show the technique can effectively reduce hidden states by a factor of 30, reducing 300 dimensions down to 10 (Section 4.2.3). This achievement is significant in that it enables the development of visualizations which show a manageable number of hidden state dimensions, as well as potentially many hidden states at once.

Moreover, an analysis of specific kinds of hidden states using this technique further reveals insights into the types and complexity of information they represent. We find that, generally speaking, the 2nd layer of a 2-layer RNN encodes more variant information, while the information from the 1st layer is much less varied. This analysis also discovers that the information captured by a few specific kinds of hidden states is quite regular. These findings may serve as a basis for future research in optimization of the RNN architecture.

Finally, we use the proposed techniques to visualize and understand RNNs directly. The combination of visual encodings built around a user interface that supports various levels of comparison is shown to explain RNN behaviours in an intuitive and relatable manner. We specifically perform two case studies using the built visualization tool.

The first case study highlights the value of observing RNN architectures as a whole, allowing for users to understand the high level flow of information. This information flow is observed through time across the recurrence of the RNN, as well as through its depth of layers. We show the value behind studying an RNN at this level of detail and highlight how this perspective of study is currently missing in the field (Section 4.3.3).

The second case study uses the same proposed visual metaphors, but at a much more focused level of detail. Specifically, ARs are used to study the latent features represented by the internal representations of the RNN. We successfully discover two such latent feature representations to showcase the effectiveness of the visualization in performing this common understanding task (Section 4.4.3).

Through interaction with the visualization in this final case study, we also realize a refined paradigm for searching and verifying these latent feature representations. This motivates the development of a tolerance based match algorithm, contrary to previous work using thresholds. We call this Tolerance Feature Matching and show that it can accurately reflect how RNNs encode latent features by discovering how quotation blocks are modelled within the RNN. Moreover, through a perturbation based analysis of this technique, we show one extent to which RNNs encode information in multi-faceted and complex ways.


1.4 Thesis Outline

Chapter 1: Introduces the topic of study and motivates some challenges faced with visualizing RNNs. The chapter concludes by describing a high level overview of our contributions and their results.

Chapter 2: Begins by exploring related work in the field of visualization. There is a particular focus on works which visualize RNNs or hidden states and techniques applicable to this context. This chapter ends by reviewing the notation necessary for studying RNNs, with a particular focus on the canonical Long Short-Term Memory form of RNN.

Chapter 3: Details the context in which this work is applicable and the specific methods proposed. It first describes the users and their goals with respect to visualizing RNN details. The chapter then goes on to describe the relevant operations useful for this context of visualization, developing a novel categorization of hidden state comparisons. We end the chapter by describing in detail the proposed visualization techniques and tools we introduce through the progression of this work.

Chapter 4: Describes the application of these techniques and tools. We begin by studying the accuracy of the core visualization techniques while using these results to reveal insights about the RNN architecture itself. The chapter is completed with two case studies exploring RNN details when visualized under the proposed techniques. Our final case study reveals the necessity of a novel paradigm by which the latent features of RNNs can be interpreted.

Chapter 5: Reviews the findings of our study and ends with several proposals for future research in this area.


Chapter 2

Background

2.1 Literature Review

Advances in machine learning techniques, and Neural Network models in particular, have brought forward an impetus within the community to better understand these models. At a high level, this understanding has been derived both through visualizations and through non-visual analysis techniques. Additionally, practitioners tend to approach understanding these models from two view points.

The first view point is that of understanding the model by interpreting its internal representations. The idea behind this approach is that by interpreting specifically how the model represents data and latent features, we can understand how its decisions are made. This approach can be seen as that of decoding the model itself, so that the ways it encodes data are mapped to terms humans understand. Descriptions of our understanding of the model from this perspective are necessarily formulated in the language of the model itself.

The second view point is that of understanding the model by interpreting its high level decisions. This approach maintains the perspective of the model as a black box, and rather seeks to understand what factors drive the model decisions. In this way, a model is understood in abstract terms with no direct correlation to the model's internal representations. Descriptions of our understanding of the model from this perspective speak in terms of its inputs, outputs, and the abstract decisions it makes.

One primary example of this second view of understanding is developed by Ribeiro et al. [28]. They propose Local Interpretable Model-agnostic Explanations (LIME), which is a technique for explaining why a model makes decisions based off "interpretable representations". Rather than seeking to describe the model's internal representations, interpretable representations describe the model in simple terms, such as by the presence and absence of a few key symptoms. LIME then uses these simpler terms to describe a subset of decisions made by the model, so that the description is optimal for the subset, although not necessarily for the dataset as a whole.


Although this second view point has been shown to provide powerful, human-centric explanations of these complex models [36, 3], our work takes the first approach of understanding through interpreting the internal model details. Much work exists in this area as well, which we now discuss in more detail.

The domain of Computer Vision has many natural avenues for visualization of the internal model details, as the inputs for these tasks are images. Therefore, most visualization work in this area has used techniques to relate the internal information represented by the model back to the input image space.

One such example is that of computing the internal salience captured by the model with respect to its inputs. This salience measure can then be visualized as a heatmap in image form, showing which portions of the input image were most salient with respect to the model's decision [32]. Other work in this domain uses a myriad of similar techniques, such as image reconstruction [21], latent feature deconvolution [40], and image perturbation [38], to relate and visualize aspects of the model in the same space as that of the input images themselves.

On the other hand, visualizing Recurrent Neural Networks tends to be more difficult than visualizing Computer Vision models. In particular, the data of these tasks are necessarily variable in length, so looking back to the inputs is not as straightforward as for other modelling tasks. Moreover, the inputs to RNN tasks are typically categorical in nature (e.g. word and character tokens in the case of NLP tasks), and thus do not have direct and scalable visual parallels like in the case of the image pixels in Computer Vision.

Interpretation and Understanding

One of the key aspects to visualizing machine learning models such as RNNs is that of interpreting and understanding their internal representations. Interpreting these internal representations, often referred to as hidden states, in the domain of sequential data often takes the form of relating them back to the task inputs and outputs.

For example, Li et al. [18] use salience, as in the case of Computer Vision, to show which inputs the internal model representations pay attention to in making an output prediction. This salience is also visualized as a heatmap with respect to the input word tokens across the sequential inputs in a grid format.

In a similar fashion, heat-grids have been used to show the amount of "attention" a model pays to its sequential inputs at different timesteps within the recurrent function [4, 17]. Here the idea is fundamentally the same as that of visualizing salience, except what is visualized is not a calculated feature, but rather an internal representation from the RNN itself. Although this idea applies only to certain kinds of internal details of these recurrent models, this technique has seen strong adoption in the community as its interpretation is straightforward.


Li et al. [19] also use heatmaps as a salience visualization technique. However, rather than relating internal representations with respect to task inputs, they study the importance of specific aspects of the internal representation with respect to the task outputs. They specifically evaluate the importance of word embedding dimensions by erasing the dimensional values and testing the resulting model performance.

Instead of interpreting and visualizing model details with respect to the task inputs and outputs, another approach has been to relate hidden states to high level patterns of the sequential data. Karpathy et al. [15] show the magnitude of a single dimension within a hidden state vector overlaid on top of the natural language input as a colour intensity highlight. When applied in a directed fashion, this technique shows that certain vector dimensions actually track long term language features such as line lengths and quotation blocks.

In similar fashion, Strobelt et al. [35] use line plots to visualize the magnitude and direction of vector dimensions over time. Looking at the data in this way has further revealed high order features encoded in the RNN, such as noun-phrases, subject-verb-object agreement, and parenthesis nesting depth, just to name a few. Their tool, LSTMVis, uses a querying technique to further corroborate these findings. They search through the dataset to find sequences that elicit similar magnitude and direction of hidden state values based off spans of thresholds. These results help to relate specific hidden state dimensions back to the latent features of the data they represent.

Rather than focusing on the values of hidden state dimensions across time, Ming et al. [24] find clusters of dimensions that behave similarly. These clusters are drawn together as a memory chip (matrix of cells). The various relations these clusters of dimensions form with the task inputs and outputs are drawn as a bipartite graph, where line width and line colour represent the relation strength and the positive/negative correlation, respectively.

In the case of sentiment analysis tasks, they also use a sequence level glyph to represent the effect each input word has on the final decision. This effect is measured by looking at the dimension clusters from before and determining the sum of their positive and negative values, corresponding to a positive and negative impact on the classification decision.

Implicit in most of the aforementioned work is the idea of interpretation through comparison. Whether it be the comparison of hidden state dimensions as they change over time, or the comparison of how clusters of hidden state dimensions affect the outcome of the model decision, comparisons form a valuable tool for gaining a deeper understanding of these systems.

Kahng et al. [14] also use comparisons by looking at the hidden state values on custom subsets of dimensions. Their tool, ACTIVIS, allows users to compare these values for specific test case instances to debug and explain certain mis-classifications. They show the values of these dimensions using colour intensity, arranged in a grid where rows indicate test instances and the columns form specific hidden state dimensions.


Word embeddings, a specific type of hidden state, are often compared indirectly by projecting them into a 2-D space and plotting their relative positions. This technique has been used to reveal interesting facets of what these word embeddings represent [18, 27, 33], such as the near ubiquitous "king - man + woman = queen". These findings are exciting as they show these models learn high level word meanings, despite never being explicitly exposed to these semantics.

Although useful, all of these comparison based techniques only allow for the contrast of the same types of hidden states. They do not, for example, allow users to compare word embeddings with the hidden states that represent memory in an RNN. Moreover, these techniques focus either on visualizing subsets of the hidden states, or on their indirect observation.

Interpretation by Auxiliary Prediction

An emerging trend has been to interpret meaning from the internal representations of RNNs through a secondary model [5, 2]. These secondary models come under various names, but generally follow a set pattern of construction and usage. The fundamental Model Under Interpretation (MUI) is trained on its task, at which point its parameters are fixed. Then an auxiliary or secondary classification model is trained to learn some property relevant to the context of the data by using the hidden state instances from the MUI as inputs. Finally, the performance of the auxiliary model is analyzed, from which inferences can be made about the quality of the MUI.
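This pattern can be sketched in a few lines. The example below is entirely synthetic: the "hidden states" are random vectors, the probed property is fabricated from one of their dimensions, and a plain logistic-regression classifier stands in for whatever auxiliary model a particular study might use.

```python
import numpy as np

rng = np.random.default_rng(5)
num_states, dims = 200, 16
# Stand-in for hidden state instances collected from a frozen MUI.
hidden_states = rng.normal(size=(num_states, dims))
# Synthetic property to probe for (e.g. a part-of-speech distinction).
labels = (hidden_states[:, 0] > 0).astype(float)

# Auxiliary classifier: logistic regression trained by plain gradient descent.
w = np.zeros(dims)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-hidden_states @ w))        # predicted probabilities
    w -= 0.1 * hidden_states.T @ (p - labels) / num_states

predictions = (1.0 / (1.0 + np.exp(-hidden_states @ w))) > 0.5
accuracy = np.mean(predictions == (labels > 0.5))
# High accuracy suggests the probed property is decodable from the hidden states.
assert accuracy > 0.85
```

The inference runs in one direction only: high auxiliary accuracy indicates the property is encoded in the states, while low accuracy may reflect either its absence or a too-weak auxiliary model.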

Shi et al. [31] use this idea to analyze the level of syntax captured at various layers of the encoder in a neural machine translation model. They perform this analysis on several datasets, using a 2-layer Long Short-Term Memory RNN architecture. Their work finds that the MUI generally captures syntactic information such as verb tense and part of speech tags, but falls short when it comes to grasping deeper syntactic structure. Additionally, they find that certain layers of the encoder are better at predicting particular syntactical features, suggesting what kinds of roles these layers play in the model. The 1st layer is found to encode direct features represented by the data, while the 2nd layer of the RNN captures higher order latent sequential features.

In a similar vein, Hupkes et al. [12] use what they term a "diagnostic classifier" to interpret the quality of the MUI. Their work uses RNNs, as well as other neural architectures, to model an arithmetic computation task. Through use of their auxiliary model, they conclude that the RNN generally approximates the data as expected by accumulating values over the arithmetic expression. However, their analysis also shows particular cases where the accumulation strategy is not employed, and raises further questions about the specific details represented by the MUI.

The previously mentioned LIME proposed by Ribeiro et al. [28] may also be considered a form of auxiliary model. In their work, the secondary model is used to formulate a simple explanation for the MUI's final outcome. This provides a human interpretable explanation for model behaviour; however, it does so in terms independent of the model itself. Therefore, this approach stands separate from our work, which seeks to understand the model's internal representations in detail.

Although these auxiliary prediction tasks have been shown to reveal valuable insights about the behaviours of RNNs, this work remains at the level of instance based and statistical analysis. To our knowledge, the usage of these secondary models directly as a form of visualization has not yet been explored.

Visualizing Architecture and Pipeline

Another important aspect to visualizing RNNs is showing their behaviours within the context of the higher level architecture and task pipeline. With the correct framing, it is immediately clear what is being observed and to which aspects of a high level pipeline the visualization applies. This framing acts as an important bridge between written materials, such as the model definition or implementation, and the visualization itself, orienting the user and reducing cognitive load.

TensorBoard is a visualization toolkit that ships with the machine learning package TensorFlow [1]. Although it supports many aspects of visualizing a machine learning model, it specifically allows users to view the model architecture as a computational graph in node-link diagram form. Nodes may be filtered to remove unnecessary details, or zoomed into to explore specific aspects of the computational graph.

The previously mentioned ACTIVIS also uses a node-link diagram to show the computational graph. They allow for the selection of specific hidden states from this diagram in order to narrow the focus of the comparison aspect of the visualization.

Looking instead at the model pipeline as a whole, Strobelt et al. [34] visualize core aspects of a machine translation task. They show the components of this pipeline at once, allowing for users to inspect and debug any aspect on demand. Specifically, the encoder-attention portion of the model is displayed as a bipartite graph, where line width represents the attention value between the nodes. The beam search component of the pipeline is shown with a diverging tree, illustrating the various paths of sequences explored by the system. Finally, they also show hidden states in relation to each other by projecting them into 2-dimensional x-y plots. These plots give the user an intuition about other input sequences that produce similar hidden states. All aspects of the visualization tool allow for relevant forms of interaction, enabling the user to debug any portion of the machine translation pipeline.

Liu et al. [20] also visualize portions of the system pipeline for a natural language inference task. They explicitly show the three stages of the pipeline as a flow chart, and explain which portions of the pipeline change in order to correct mis-classified test instances.


Their tool also incorporates other techniques previously discussed, such as displaying the attention mechanism using heatmaps.

Although these studies focus on the importance of viewing model architectures and systems at a high level, they tend to avoid the visualization of hidden states directly. Instead, hidden states are viewed indirectly via abstraction, such as projection into lower dimensional spaces, or through some subset of their dimensions. Moreover, despite framing the visualization within the context of the model architecture, these studies only show a single type of hidden state at once.

2.2 Recurrent Neural Networks

This section outlines the fundamental concepts of Neural Networks and Recurrent Neural Networks to serve as a basis for the rest of the discussions.

As a quick introduction, a Neural Network is a machine learning technique whereby an arbitrary function is modelled using a series of mathematical operations and learned parameters. Typically, these operations are defined in terms of vector and matrix operations, in combination with some non-linear activation function such as Sigmoid or Tanh. In its simplest form, this looks like Equation 2.1, where a, b, and c are all vectors of size N and W is a square matrix N × N.

c = σ(Wa + b) (2.1)

where the Sigmoid function σ(x) = 1 / (1 + e^(-x)) is performed element-wise over the input vector. Using Neural Network terminology, the input a produces the output c by some function of weights W and bias b. The specific values of W and b are learned by the following training procedure.
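For concreteness, Equation 2.1 can be computed in a few lines of NumPy. This is only an illustrative sketch with arbitrary, untrained parameter values.

```python
import numpy as np

def sigmoid(x):
    # Element-wise Sigmoid activation: 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def layer(a, W, b):
    # Equation 2.1: c = sigma(W a + b)
    return sigmoid(W @ a + b)

N = 4
rng = np.random.default_rng(0)
W = rng.normal(size=(N, N))   # weights (learned in practice, random here)
b = rng.normal(size=N)        # bias vector
a = rng.normal(size=N)        # input vector

c = layer(a, W, b)
# The Sigmoid keeps every output dimension strictly between 0 and 1.
assert c.shape == (N,) and np.all((c > 0) & (c < 1))
```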

Values are first initialized arbitrarily and then incrementally updated until they can accurately model the expected behaviour. The expected behaviour is defined by many thousands to millions of instances of input-output pairs of training data. Each incremental update to these parameters is performed by an algorithm called "Back Propagation", which changes the values such that they minimize some loss function based off this training data.

The final accuracy of the Neural Network is judged by a testing dataset, which models the same form of data as the training dataset, but that the Neural Network was not exposed to during training. In this way, a fair judgment can be made about how well the model has learned to represent the arbitrary function.

A Recurrent Neural Network is an extension of this Neural Network model that is adapted to handle sequential data. Specifically, it maps a sequence of inputs {q_1, .., q_T}, q ∈ R^A, to a sequence of internal "hidden states" {h_1, .., h_T}, h ∈ R^N. Each h_t is updated recursively by the non-linear function h_t ← RNN(q_t, h_{t-1}). This recurrent nature allows the previous hidden state h_{t-1} to capture the context necessary to compute the next RNN update.
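The recurrence can be sketched as follows, using a simple Tanh update as a stand-in for the RNN(., .) function (the LSTM form this thesis focuses on is described later in the chapter).

```python
import numpy as np

def rnn_step(q_t, h_prev, W_q, W_h, b):
    # One recurrent update: h_t = tanh(W_q q_t + W_h h_{t-1} + b)
    return np.tanh(W_q @ q_t + W_h @ h_prev + b)

A, N, T = 3, 5, 7                       # input size, hidden size, sequence length
rng = np.random.default_rng(1)
W_q = rng.normal(size=(N, A))
W_h = rng.normal(size=(N, N))
b = rng.normal(size=N)

h = np.zeros(N)                         # initial previous context
states = []
for t in range(T):
    q_t = rng.normal(size=A)            # input at timestep t
    h = rnn_step(q_t, h, W_q, W_h, b)   # h_t depends on q_t and h_{t-1}
    states.append(h)                    # the hidden state sequence {h_1, .., h_T}

assert len(states) == T and states[-1].shape == (N,)
```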

Multiple {RNN^1, .., RNN^U} layers may be stacked together to form deep recurrent relations. This is done by assigning the output state from a lower order layer as the input state to the next layer, q^u_t = h^{u-1}_t, which can be done without loss of generality¹ by setting A = N. With this equivalence established, we can reformulate the original RNN update function without q as follows.

h^u_t ← RNN(h^{u-1}_t, h^u_{t-1})  (2.2)

That is, the current hidden state is a function of the current input and the previous context, respectively. Then, the base case to this recurrent relation is satisfied by setting h^0_t to the input of the task, and the initial previous context to a fixed value such as the zero vector h^*_0 ← 0.
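Equation 2.2 for a two-layer stack can be sketched as below, again with a Tanh update standing in for the recurrent function.

```python
import numpy as np

def rnn_step(x, h_prev, W_x, W_h, b):
    # Stand-in for RNN(h^{u-1}_t, h^u_{t-1})
    return np.tanh(W_x @ x + W_h @ h_prev + b)

N, T, U = 4, 6, 2                      # hidden size (with A = N), timesteps, layers
rng = np.random.default_rng(2)
params = [(rng.normal(size=(N, N)), rng.normal(size=(N, N)), rng.normal(size=N))
          for _ in range(U)]

h = [np.zeros(N) for _ in range(U)]    # h^u_0 is the zero vector for every layer
for t in range(T):
    x = rng.normal(size=N)             # h^0_t: the task input at timestep t
    for u in range(U):
        h[u] = rnn_step(x, h[u], *params[u])  # Equation 2.2
        x = h[u]                       # layer u's output feeds layer u + 1

assert all(state.shape == (N,) for state in h)
```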

Depending on the specific task against which the RNN is deployed, the final layer's hidden state is fed into a function F to produce an output y. In the case of a regression problem y ∈ R, while y ∈ [0, 1]^K for a classification task of K labels. Depending on the problem application, a different number of outputs may be required, which are outlined by the following input-output schemes.

1. Sequence-to-one: This scheme takes exactly T inputs and produces exactly 1 output. An example of this is the task of sentiment analysis.

2. Sequence-to-label: This scheme takes exactly T inputs and produces exactly T outputs. An example of this is the task of language modelling.

3. Sequence-to-sequence: This scheme takes exactly T inputs and produces [1, T′] outputs. An example of this is the task of language translation, and is typically implemented using what is called an "Encoder-Decoder" architecture.

In the context of the NLP tasks specifically, the inputs to the RNN are words from a vocabulary V. Words are represented as one-of-K encodings x ∈ {0, 1}^V where only one dimension of x is set to 1, and the rest are 0. These encodings are transformed into individual word embeddings e ∈ R^M. Since it may be that N ≠ M, the word embeddings are projected into the hidden state space to produce the input to the 1st recurrent layer h^0_t ← W_e e_t.

For this research specifically, F models a probability distribution over a set of output labels which correspond to words from a vocabulary. The background on RNNs up to this point is summarized in Figure 2.1.

¹ When A ≠ N, a linear projection W can be used to allow for the assignment W q^u_t = h^{u-1}_t.
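The one-of-K encoding and projection described above amount to a column lookup followed by a matrix product, as the following sketch shows (random, untrained matrices and made-up sizes).

```python
import numpy as np

V, M, N = 10, 8, 5                 # vocabulary, embedding, and hidden sizes
rng = np.random.default_rng(3)
E = rng.normal(size=(M, V))        # embedding matrix, one column per word
W_e = rng.normal(size=(N, M))      # projection into the hidden state space

word_id = 7
x = np.zeros(V)
x[word_id] = 1.0                   # one-of-K encoding of the word
e = E @ x                          # word embedding e in R^M
h0_t = W_e @ e                     # input to the 1st recurrent layer

assert np.allclose(e, E[:, word_id])   # the one-hot product is a column lookup
assert h0_t.shape == (N,)
```

In practice the matrix product E @ x is replaced by a direct index into E, since multiplying by a one-hot vector selects a single column.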


RNN Architectural Overview

Figure 2.1: Schematic of a two layer RNN architecture with Sequence-to-label and Sequence-to-one input-output schemes. Notice, hidden states from each layer are fed into the input of the next layer as well as the next timestep, providing the context necessary to model long term sequential inputs. The techniques proposed in this paper apply to any form of RNN (including those this diagram does not capture).

The internal operations of the RNN function are typically written as a series of "hidden state" vector definitions, such that each hidden state vector is composed as a result of other vectors and operations, much like what is seen in Equation 2.1. We consider each of these definitions as their own "kind" of hidden state, formulating specific components within the RNN. This distinction is important as different kinds of hidden states may perform different roles within the recurrent function.

Moreover, in the case of multiple RNN layers, each component of the recurrent function occurs as a separate instance in each stacked layer. Although derived from the same vector definition, these instances may also be considered unique, especially for the purposes of analyzing the significance they play in the RNN (e.g. different layers may model and capture different levels of information). Therefore, we use the term "kind" henceforth to refer to the concept of a distinct pairing of hidden state vector definition and layer within the RNN as a whole.

Many different forms of the RNN function have been developed, such as the Vanilla RNN, Long Short-Term Memory (LSTM) [11], and Gated Recurrent Unit (GRU) [8]. Although any of these can be visualized by the techniques we propose, we focus on the LSTM specifically for its complexity and natural inclination to be described using "memory"-like metaphors. The detailed mathematical notation of the recurrent function of the LSTM is laid out in Appendix A, but for the purpose of later discussions we describe the core concepts and their terminology here.

The core concept behind an LSTM is element-wise multiplication between various hidden states to "gate" the flow of information, where gates g are hidden states with values restricted between zero and one, g ⊂ h, g ∈ [0, 1]^N. The gates are computed as a function of the current input and previous context, which allows them to control the flow of information based off long term memories.

There are three kinds of gates, the so-called "Remember Gate", "Forget Gate", and "Output Gate". Additionally, the LSTM defines five other kinds of hidden states: "Cell Input", "Short-term Memory", "Long-term Memory", "Cell", and "Output".

The Remember Gate controls the flow of information from the Cell Input to the Short-term Memory. On the other hand, the Forget Gate prohibits the flow of information from the Cell of the previous timestep to the Long-term Memory. Finally, the Output Gate controls the flow of information from the Cell of the current timestep to the Output.

The Cell of the LSTM is what is often considered its memory, and is produced as an element-wise sum of the Short-term and Long-term Memory. Roughly speaking, the Long-term Memory represents information that has been passed along from the past, while the Short-term Memory captures the information from the current input that will be added to the memory for the future. Information from both of these may flow not only to the Cell, but also to produce the Output. This ends the description of the LSTM recurrent function, with its Output transformed into a probability distribution y via Softmax.
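The gate and memory terminology above can be collected into a single update step. The sketch below is a common LSTM formulation consistent with that description; the variable names are ours, and the exact notation used in this thesis is given in Appendix A.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, P):
    z = np.concatenate([x, h_prev])          # current input + previous context
    r = sigmoid(P["Wr"] @ z + P["br"])       # Remember Gate, values in (0, 1)
    f = sigmoid(P["Wf"] @ z + P["bf"])       # Forget Gate
    o = sigmoid(P["Wo"] @ z + P["bo"])       # Output Gate
    g = np.tanh(P["Wg"] @ z + P["bg"])       # Cell Input
    short_term = r * g                       # Short-term Memory
    long_term = f * c_prev                   # Long-term Memory
    c = short_term + long_term               # Cell: element-wise sum of the two
    h = o * np.tanh(c)                       # Output, gated by the Output Gate
    return h, c

N = 4
rng = np.random.default_rng(4)
P = {k: rng.normal(size=(N, 2 * N)) for k in ("Wr", "Wf", "Wo", "Wg")}
P.update({k: rng.normal(size=N) for k in ("br", "bf", "bo", "bg")})

h, c = lstm_step(rng.normal(size=N), np.zeros(N), np.zeros(N), P)
assert h.shape == (N,) and c.shape == (N,)
```

Each of r, f, o, g, short_term, long_term, c, and h corresponds to one of the eight kinds of hidden states named above, which is what makes per-kind analysis of the LSTM possible.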


Chapter 3

Methods

We set out to design a visualization tool which will help researchers better understand the behaviours of RNNs at a fine level of detail. Our study follows the nested model of visual design and validation described by Munzner [26], which lays out four layers, each predicated on the one before.

The first layer is that of characterizing the problems and data of the domain under visualization. This ensures that the design will address a real problem, as well as contextualizing and scoping the work appropriately. The next layer takes this characterization and transforms it so that the problem is addressed through a set of abstract operations and data types. Where the problems and data from the first layer are described in domain level terms, the second layer seeks to describe these in a generic way suitable for interpretation by computer scientists. The third layer seeks to map the abstractions from the previous layer to specific visual encodings and interactions. Here, specific visualization techniques are discussed and compared to suit the abstractions from the second layer. The final layer lays out the computational techniques necessary for creating the visualization. This includes detailed mathematical descriptions as well as comparative analysis of any potential techniques.

We directly map the first two layers of this nested model onto the first two sections of this chapter. However, since we propose a few techniques for visual encoding and interaction, the later sections individually capture the final two layers of the nested model.

3.1 Domain Problem and Data Characterization

With recent advancements and low deployment costs, RNNs have become a prevalent tool in solving sequential data tasks and are used by many individuals to different ends. Therefore, we preface the problem and data characterization by first describing the specific user groups interested in studying RNNs at a microscopic level of detail.

Strobelt et al. [35] develop three high level categories of user groups interested in RNNs: Architects, Trainers, and End Users. We are specifically interested in users from the various groups with some level of pre-existing understanding of Neural Networks, primarily Architects and sometimes Trainers. This includes users who at the very least have a basic notion of the mathematics and notation used to describe these systems. These users want to understand the detailed behaviours RNNs learn to model, whether that understanding may be used to drive development of architectural improvements or to gain confidence that their results generalize beyond training data sets. Moreover, these users want an intuition for what the RNN represents in order to better grasp how it functions.

We also make note of a subset of this group that may be approaching the detailed study of RNNs for the first time, having only an initial surface level of understanding. This instructional context forms a unique situation where the simplicity of visualization must be balanced with clarity and completeness so as to effectively communicate how the model works. Users from this group want to achieve understanding for the same purposes as already mentioned, but may have difficulty transitioning from the learning materials used to describe RNNs to a deep enough understanding that they can begin to make improvements or to interpret model behaviours.

To reiterate, the users interested in this level of study all share in common:

• A mathematical perspective of describing the models. These users want to see the detailed notation which describes the RNN, and already have an intuition for the effects of the various mathematical components.

• An interest in inspecting the model based on various inputs. At a high level, this is simply asking the question “what happens if...?”, but more constructively these details are used to drive comparative analysis. Inspecting differing instances in this way affords a deep understanding of how changes to inputs affect the model's internal representation and output.

• Studying a trained model, as opposed to a model undergoing the process of training, in order to observe its learned behaviours. Indeed, part of the user's reason for studying detailed RNN behaviours may be to inform changes around training procedures. However, we focus only on the aspect of this study relating to an already trained model.

From this group of users we elucidate a series of questions which they are interested in asking. Although some of these questions may have partial answers that already exist in the literature, they remain open research topics and lack general visualization support.

• Question 1: How does the information flow and change from the inputs through to the outputs?

• Question 2: Where are changes more and less prominent? Do certain components have more influence than others?

• Question 3: Does our intuition about the role of the mathematical components match that of the actual RNN behaviour?

• Question 4: What information is captured in RNN hidden states? How is this information represented among the dimensions of these hidden state vectors?

At a high level, these questions all seek to understand how the model behaves from a mathematical/computational perspective. Users want to be able to describe various aspects of the RNN in terms of the task they model. In Natural Language Processing tasks, for example, users want to find if and where the system models syntax and grammar. Another question is: how does negation affect the model state? As a final example, does the difference between a gendered pronoun (ex: “he” vs. “she”) change more or less in the model than we expect?

To summarize, the problem and data we seek to visualize is that of understanding the specific behaviours and patterns RNNs learn to represent. This understanding should be framed from the perspective of the task which the model represents, so that the Architect and Trainer user groups can explore mathematical details and drive further research.

3.2 Operation and Data Type Abstraction

With the questions from the previous layer in mind, we ascertain several abstract operations and data types which can later be mapped to visualization techniques. To begin this process, we look first at a key concept from the previous section: that of the mathematical description of the RNN.

Without going deep into the details already outlined in Chapter 2, RNNs are described as a series of vector and matrix operations, with various element-wise non-linear transformations. In particular, a vector is defined as a function of some other vector(s), and these vectors are colloquially referred to as hidden states. These vector definitions are sometimes the result of pure vector operations (ex: Hadamard product), while at other times they are the result of a combination of matrix multiplication and an activation function such as Sigmoid or Tanh. The matrices of the RNN describe the parameters which are learned through the training exercise.
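To make this description concrete, the recurrence can be sketched as a vanilla RNN step in NumPy; the weight names (W_xh, W_hh, b_h) and the toy dimensions are illustrative assumptions for this sketch, not the thesis's notation.

```python
import numpy as np

def rnn_step(x, h_prev, W_xh, W_hh, b_h):
    """One recurrent step: the new hidden state is an element-wise
    non-linearity applied to a combination of matrix multiplications."""
    return np.tanh(W_xh @ x + W_hh @ h_prev + b_h)

# Toy dimensions: 4-dimensional inputs, 3-dimensional hidden states.
rng = np.random.default_rng(0)
W_xh = rng.standard_normal((3, 4))  # learned parameter matrix
W_hh = rng.standard_normal((3, 3))  # learned parameter matrix
b_h = np.zeros(3)

h = np.zeros(3)  # initial hidden state
for x in rng.standard_normal((5, 4)):  # a 5-timestep input sequence
    h = rnn_step(x, h, W_xh, W_hh, b_h)

print(h.shape)  # the hidden state keeps its dimensionality across time
```

Each intermediate value of `h` along this loop is exactly the kind of hidden state vector the visualization focuses on.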

In this description, there are three high level concepts we can focus on studying: vectors, matrices, and activation functions. Vectors are the most natural concept to focus on for study, not simply because there is precedent in previous works [19, 14, 35], but mainly because the RNN description uses vectors as the core element of notation. Focusing on vectors ties directly to how these users naturally describe these models, ensuring visualization accessibility. Moreover, this piece of data comprises the key differences between various instances of input to the RNN. That is, changing the inputs to the model will effect changes on its state vectors, which users can leverage to perform comparative analysis.

Notice, within the concept of visualizing hidden state vectors is implicit the idea of visualizing their dimensional values. These are typically called activations, not to be confused with activation functions. Activations encode the actual information hidden states represent, so focusing on hidden states implies a focus on them as well.

The two remaining candidates from this description are matrices and activation functions. Although these details can be mapped to visualization metaphors, we choose not to study them for a number of reasons. Matrices can be conceptualized as very similar to vectors; however, they carry more size and complexity, which may become difficult to visualize. Without adequate techniques to visualize vectors, we need not focus on studying their more complex sibling, the matrix. Moreover, the values of matrices are static once model training is complete, so they do not lend themselves as naturally to comparative analysis. Activation functions similarly do not change across input instances.

On the other hand, it may be more fruitful to visualize matrices and activation functions in conjunction with hidden states, with a particular focus on the effects they have. Indeed, by focusing on the multiple hidden states within the RNN, this is exactly what is being visualized. That is, the effects of the matrix multiplication and activation functions can be observed indirectly by viewing the hidden states before and after these mathematical transformations.

With this data type in mind, we explore the options for visualization operations that may be performed on hidden state vectors. Given that users would like to see the details of what is happening within the RNN, inspecting activation values is a core operation to visualizing this data type. Moreover, to interpret and relate activations to each other, we want to facilitate their comparison.

We also want to interpret hidden states as a whole, rather than looking at their individual dimensions, so as to relate back to the mathematics describing RNNs. We use the term “semantics” to capture the general notion of the underlying meaning of a hidden state. Various semantics may be formulated as interpretations of hidden states, but, irrespective of these individual formulations, a visual encoding should be designed that describes these semantics as well as facilitates their comparison.

Another aspect to consider with respect to contrasting hidden state data is the level at which comparisons may be formulated. Recall, we have noted that part of visualizing hidden states is to visualize their dimensional values (activations) themselves, so one of the first levels for comparison is within the hidden state itself: Intra-Hidden State. Conversely, we may consider arbitrary comparisons between hidden states to be at the Inter-Hidden State level. Within this level, we make the distinction for two further axes of comparison: Intra-Kind and Inter-Kind, which compare the same or different kinds of hidden states.

Also consider how time plays a role in hidden state comparisons. Hidden states may be observed at just a single point in time, as well as across multiple timesteps. Furthermore, we make the distinction between comparisons across timesteps but within a single input sequence, as opposed to comparisons across different input sequences. The reason for this is that comparisons within an instance have some notion of progression as the RNN builds out context along the time series. However, when observing hidden states across instances this is not necessarily the case, since the instances may be unrelated.

Table 3.1 outlines the forms of hidden state comparison, along the outlined axes, that our user groups are interested in. Notice that not all cells of the table represent valid forms of comparison.

RNN Comparison Categorization

|                        | Intra-Hidden State | Inter-Hidden State: Intra-Kind             | Inter-Hidden State: Inter-Kind   |
|------------------------|--------------------|--------------------------------------------|----------------------------------|
| Within a Timestep      | Relative Activity  | N/A (1)                                    | Semantic Development             |
| Across Timesteps       | N/A (2)            | Activity Progression, Semantic Progression | Semantic Progression-Development |
| Across Input Instances | N/A (2)            | Feature Activity, Feature Semantics        | Feature Semantic-Development     |

Table 3.1: Various forms of informative comparisons, based on multiple axes of hidden state data. (1) The definition of a hidden state kind is unique per timestep. (2) The definition of Intra-Hidden State only observes a single hidden state vector.

These forms of comparison are further elaborated as follows.

• Relative Activity: describes the magnitude and direction of hidden state activations in relation to each other. Shows which dimensions capture stronger or weaker values, which can be interpreted as the presence or absence of some abstract feature.

• Activity Progression: describes the change in hidden state activations over time. An indication of the evolution of values, which can be interpreted as the introduction or removal of some abstract feature.

• Feature Activity: describes the magnitude and direction of hidden state activations between input instances. Exposes commonalities between activity over instances, which can be used to interpret the latent feature space.

• Semantic Progression: describes the change in hidden state semantics over time. An indication of the evolution of semantics for a single kind of hidden state.

• Semantic Development: describes the change in hidden state semantics across the recurrent function itself. Shows the role of the different components of the RNN.

• Semantic Progression-Development: describes the change in hidden state semantics over time as well as across the recurrent function. Shows the role over time of the different components of the RNN.

• Feature Semantics: describes the difference in hidden state semantics between input instances. Exposes commonalities between hidden state instances, which can be used to interpret the abstract semantics of instances.

• Feature Semantic-Development: describes the difference in hidden state semantics between input instances as well as across the recurrent function. The value of this form of comparison is not apparent to the authors; however, it is included for completeness.

With this categorization of RNN detail comparisons established, an exhaustive placement of existing techniques into these categories is possible. Appendix B outlines the product of such an exercise and makes note of a few general trends and comparisons lacking in the existing literature.

From these abstract data types and comparison operations, we derive the following two visual encodings, which are each described in their own section. The chapter is then completed by a final section describing the design of an interactive visualization tool leveraging these encodings to allow for exploratory research.

• Predictive Semantic Encodings: A visual encoding which gives high level interpretation to hidden states as a whole. This interpretation affords macro-level reading of information flow by mimicking potential outputs of the RNN prediction function and applying the output colour mapping onto these outputs. Facilitates Semantic Progression, Semantic Development, Semantic Progression-Development, and Feature Semantics comparisons.

• Adaptive Representations: A visual encoding which allows for inspection of hidden state activations. This encoding is designed to preserve the details of high dimensional hidden states when adapted to low dimensional displays, while also affording value inspection and comparison. Facilitates Relative Activity, Activity Progression, and Feature Activity comparisons.

3.3 Predictive Semantic Encodings

From Section 3.2, we must find a visual encoding that gives an intuitive representation of the meaning behind the hidden state vectors as a whole. Whatever its form, this meaning should be easily interpreted and lend itself to making high level comparisons. Finally, in order to operate within the interaction paradigm that allows users to observe the visualization on arbitrary inputs, this semantic representation must give meaning to previously unseen hidden states on demand.

To address these objectives, we introduce the semantics of hidden states as a probability distribution over the task output labels (also introduced by Sawatzky et al. [30]). More formally, we consider the context free function G(γ, h) which produces a probability distribution over the outputs y, where γ denotes the specific kind of hidden state of h. G is context free in that it makes a prediction similar to that of the RNN, but without the previous and surrounding information captured in other hidden states. Indeed, this formulation is analogous to the classification task of the RNN itself, and the function F(h) can be seen as a specialization of G such that F(h) ≡ G(final-hidden-state, h). We denote this modelling function G as Predictive Semantic Encodings (PSE).
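As an illustration, G could be realized as a simple linear softmax probe per hidden state kind, fit after the RNN is trained; the linear form and all the names below are assumptions for this sketch, not the thesis's concrete formulation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class PSE:
    """Context-free predictor G(kind, h): maps a hidden state of a given
    kind to a probability distribution over the task's output labels.
    One linear probe (W, b) per kind is an illustrative choice only."""
    def __init__(self, probes):
        self.probes = probes  # kind -> (W, b)

    def encode(self, kind, h):
        W, b = self.probes[kind]
        return softmax(W @ h + b)

# Toy setup: 3-dimensional hidden states, 5 output labels.
rng = np.random.default_rng(1)
pse = PSE({"cell-state": (rng.standard_normal((5, 3)), np.zeros(5))})

dist = pse.encode("cell-state", rng.standard_normal(3))
print(dist.round(3))  # a probability distribution over the 5 labels
```

In this sketch, F(h) corresponds to calling `encode` with the final hidden state's kind, matching the specialization F(h) ≡ G(final-hidden-state, h).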

Although the task of the RNN is to specifically predict the output labels, the semantic we propose may similarly be used to encode hidden states with respect to the input labels they correspond to. It may also be used to predict the output labels at varying timesteps in Sequence-to-label or Sequence-to-sequence input-output schemes. These variations allow the PSE to express different layers of semantics of any particular hidden state. The choice of which semantic(s) to use will depend on the specific visualization task.

Notice, the notion of relating hidden states in some way to the task inputs or outputs is not a new one, as for example is done in RNNVis [24]. However, to our knowledge, other work has not drawn a direct parallel between hidden states encoding a probability distribution similar to that of the task itself.

The advantage of this semantic formulation is that hidden states can share the same visual encoding as that of the RNN output prediction. Moreover, this formulation describes hidden states in a way that is invariant to their internal representations. These attributes allow users to seamlessly transition between comparisons of hidden state semantics amongst the components of the RNN as well as the task itself.

With these semantics in place, we develop a simple visual encoding designed to represent probability distributions across typical outputs to RNNs. We use a mini-bar chart where bars represent class labels and the bar length represents the probability magnitude. Since there may be many class labels, only the top-k probabilities are shown in the bar chart, where k is selected depending on the visualization context. The class labels are shown in descending order based on the probability distribution, so that the mini-bar chart represents the most likely outcomes. Finally, each bar is coloured according to a user specified colour mapping, e.g. to give latent information about parts of speech. This allows designers to draw attention to class labels as applicable to the visualization task.

We augment the mini-bar chart with a redundant encoding facilitating a quick, high level comparison of this hidden state semantic. The redundant encoding uses a simple coloured rectangle, where the colour also comes from the visualization's colour mapping. The colour is produced by interpolating a point inversely proportional to the relative probabilities of the top-2 predictions. These aspects are explained in Figure 3.1, with (1) showing an example of the interpolation, and (2) showing the entire PSE visual element.

Figure 3.1: 1) Intuition of colour interpolation based on the top-2 prediction classes. 2) Visual representation for PSE: a fit colour rectangle and mini-bar chart.
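Under one reading of this interpolation, the rectangle's colour blends the colour-mapped colours of the top-2 classes, weighted by how close the runner-up probability is to the winner's; the function name and colour values below are hypothetical illustrations, not the thesis's exact formula.

```python
def top2_colour(dist, colours):
    """Blend the colours of the two most likely classes.
    dist: per-class probabilities; colours: per-class RGB tuples.
    The blend weight t is 0 when the top-1 dominates completely and
    0.5 when the top two classes are tied (our interpretation)."""
    order = sorted(range(len(dist)), key=lambda i: -dist[i])
    i1, i2 = order[0], order[1]
    p1, p2 = dist[i1], dist[i2]
    t = p2 / (p1 + p2)
    c1, c2 = colours[i1], colours[i2]
    return tuple((1 - t) * a + t * b for a, b in zip(c1, c2))

# Hypothetical colour mapping: class 0 red, class 1 blue, class 2 green.
colours = {0: (1.0, 0.0, 0.0), 1: (0.0, 0.0, 1.0), 2: (0.0, 1.0, 0.0)}
print(top2_colour([0.9, 0.1, 0.0], colours))  # mostly red, a little blue
```

A confident prediction thus reads as a saturated class colour, while an ambiguous one visibly mixes the two competing class colours.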

This visual encoding gives a high level interpretation of the meaning captured within hidden states as a whole. The colour rectangle shows the most salient outputs the hidden state leads towards, while the mini-bar chart gives a little more detail on these predictions in particular. Both of these visual elements facilitate comparison by juxtaposition, allowing for Semantic Progression, Semantic Development, Semantic Progression-Development, and Feature Semantics comparisons.

3.4 Adaptive Representations

We transform the hidden state vector and inspection/comparison abstractions from Section 3.2 into a visual encoding that renders the specific hidden state activations. Given that this vector data can be very large, on the order of hundreds to thousands of dimensions, it is important that the visual encoding can be rendered at various levels of detail. That is, the representation must adapt to the space it is accorded in the visualization, independent of the dimensionality of the underlying data. This adaptation should represent the true values of the hidden state activations as accurately as possible. The visual encoding must also facilitate varying levels of comparison so users may better understand the information represented by the vector dimensions.

With these ideas in mind, we develop a glyph based vector representation, a matrix of cells, leveraged for its ability to efficiently fit square cells into as little space as possible. The magnitude and direction of the dimensions of an arbitrary vector v are rendered in each cell of the matrix with a simple bar-glyph. Positive values are encoded with a leftwards zero-line and a bar extending to the right, while negative values use a rightwards zero-line and a bar extending to the left. This design decision is made as opposed to using a central zero-line, which would effectively cost twice as much visual space to fit the same number of values.
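The cell geometry of this bar-glyph can be sketched as a small helper (names hypothetical) mapping an already-scaled value onto a bar position and width within a unit-width cell:

```python
def bar_glyph(scaled, cell_width=1.0):
    """Return (x, width) of the bar within a cell, for a value already
    scaled into [-1, 1]. Positive values grow rightwards from a left
    zero-line; negative values grow leftwards from a right zero-line,
    so the full cell width is available to either sign."""
    width = abs(scaled) * cell_width
    if scaled >= 0:
        return (0.0, width)             # anchored at the left edge
    return (cell_width - width, width)  # anchored at the right edge

print(bar_glyph(0.5))    # → (0.0, 0.5)
print(bar_glyph(-0.25))  # → (0.75, 0.25)
```

Because each sign uses the whole cell rather than half of it, the glyph can show the same value range at twice the resolution of a central zero-line design.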

Figure 3.2: 1) The proposed scale function, defined in Appendix C, as compared to the logarithmic and linear scale functions when applied to the domain [0, 20] and range [0, 10]. 2) Adaptive Representations matrix of cells glyph for the vector {0, .1, 1, 2, 5, 8, 9, −10} using the various scale functions. Notice how the top and bottom three values are barely distinguishable between the linear and logarithmic scales, respectively.

With one of the key factors of this representation being the ability to distinguish relative magnitudes, we find the rendering scale to be an important design decision of the glyph. A linear scale diminishes many of the representation's values to the same very small bar glyph, so much so that the differences between values cannot be perceived. Similarly, a logarithmic scale magnifies the values to the same very large bar representation, also not allowing the relative ordering of the values to show through. We settle on a specialized form of the logarithmic scale which first scales the domain of values down into the [1, 10] segment of the log function, and then re-scales to the range of the bar glyph. The general form of a scale function and our proposed modifications are outlined in Appendix C. Figure 3.2 shows the intuition behind these scale functions and examples of the Adaptive Representations matrix of cells glyph they produce.
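A sketch of this specialized scale, under our reading of the description (the exact definition is given in Appendix C): compress the domain into [1, 10], apply log10 so the output lands in [0, 1], then stretch onto the glyph's range.

```python
import math

def compressed_log_scale(v, domain, out_range):
    """Scale v from `domain` into the [1, 10] segment of log10, then
    re-scale log10's [0, 1] output onto `out_range`. A sketch only;
    see Appendix C for the exact formulation."""
    d0, d1 = domain
    r0, r1 = out_range
    u = 1 + 9 * (v - d0) / (d1 - d0)  # domain -> [1, 10]
    t = math.log10(u)                 # [1, 10] -> [0, 1]
    return r0 + t * (r1 - r0)         # [0, 1] -> out_range

# Domain [0, 20] onto range [0, 10], matching Figure 3.2 (1).
for v in (0, 1, 10, 20):
    print(v, round(compressed_log_scale(v, (0, 20), (0, 10)), 2))
```

Unlike a raw logarithmic scale, this variant keeps small values away from the extremes of the range, so both small and large activations remain visually distinguishable.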

When the visual space can accommodate the hidden state vector, it is drawn in full by letting v ← h. In more constrained visual settings, the hidden state is reduced by a dimensionality reduction technique R such that v ← R(h, P), v ∈ R^P, P < N, where P is the size of the reduction.

Notice, R must be of the class of dimensionality reductions which are deterministic and continuous. That is, for any input h, R will always produce the same output v, and sufficiently small changes to h will result in arbitrarily small changes to v. This characteristic of R is important for two reasons. Firstly, it ensures that the reduction can be applied to any arbitrary hidden state, even those it has not previously encountered. Secondly, it means that the reductions v can be compared in a consistent and meaningful way. Reductions of this class will form into distinguishable “shapes” which can be drawn side by side to immediately spot similarities and differences. Indeed, these shapes are not necessarily distinct, by nature of the loss of information in any dimensionality reduction; however, they form into distinct groupings which stand out amongst each other.

Therefore, we consider the class of dimensionality reductions which map each dimension of h once into a “bucket” in v, where all buckets in v contain at least one mapped dimension from h. Then, we can simply define R(h, P) as a mapping b(n) → p, p ∈ [1, P], where n represents the dimension in h.

It should be noted that using some form of dimensionality reduction is common practice amongst the community for visualizing hidden states [18, 27, 33]. Our work differs from these approaches in that we seek to use these reductions to: 1) visualize arbitrary hidden states, rather than just the “embedding” hidden states, and 2) represent the data in low dimensional form, rather than use the reduction to relate or cluster hidden states to each other.

Specific reduction functions can be formulated to minimize the loss of information when transforming from higher to lower dimensional spaces. Moreover, different choices for R may be used depending not only on the level of detail being displayed, but also on which kind of hidden state is being visualized.

As a proof of concept of this technique, we explore two dimensionality reductions suited to this task: 1) fixed width buckets, and 2) learned buckets. In both cases, we determine a mapping of dimensions from the hidden state to “buckets” in the target reduced space. The values mapped into buckets are then averaged to produce the reduced value.²

1. Fixed width buckets simply map the dimensions of h into buckets of size N/P according to their sequential arrangement in h. Figure 3.3 (1) shows the intuition behind this method.

2. Learned buckets are the result of a Gaussian mixture model where the latent variables can be interpreted as the buckets. The model learns which dimensions of h to place into which buckets so as to minimize the error across a set of hidden states. We select hidden states by using a random sample of the training data after the RNN has been fully trained. Figure 3.3 (2) shows the intuition behind this method.
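Both reductions can be sketched as bucket-average functions. The example vector is the one from Figure 3.2; the “learned” mapping here is hand-picked for illustration rather than actually fit with a Gaussian mixture model.

```python
import numpy as np

def fixed_width_reduce(h, P):
    """Fixed width buckets: consecutive runs of N/P dimensions of h are
    averaged into each of the P buckets (assumes P divides N)."""
    return h.reshape(P, -1).mean(axis=1)

def bucket_reduce(h, mapping, P):
    """General bucket reduction: mapping[n] names the bucket for
    dimension n (e.g. a learned assignment); bucket values are averaged."""
    v = np.zeros(P)
    for p in range(P):
        members = [n for n, b in enumerate(mapping) if b == p]
        v[p] = h[members].mean()
    return v

h = np.array([0.0, 0.1, 1.0, 2.0, 5.0, 8.0, 9.0, -10.0])
print(fixed_width_reduce(h, 4))                       # sequential pairs averaged
print(bucket_reduce(h, [0, 0, 1, 1, 2, 2, 3, 3], 4))  # same mapping, made explicit
```

A learned assignment would replace the hand-picked `mapping` list, grouping dimensions that behave similarly across sampled hidden states so that less information is lost in the averaging.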

This technique allows for the same underlying data to be visualized in a consistent manner with varying degrees of precision. Moreover, lower precision representations still convey the data to a reasonable degree of accuracy, and give an intuition for the rough shape of the data. The matrix of cells idiom is well suited for multiple levels of comparison. Each matrix glyph gives a view into the Relative Activity of the hidden state, while placing hidden state glyphs side by side gives an indication of the Activity Progression. Moreover, by fitting the cells of two matrices in the space of one, as we will explain in Section 3.5, the visualization can leverage Feature Activity comparisons.

Figure 3.3: Two deterministic dimensionality reduction techniques, 1) fixed width buckets, and 2) learned buckets, that can be applied to reduce h into v. In the case of the learned buckets, the learned mapping is dimensions h1,3 → v1 and h2,4 → v2. The mean squared error (MSE) shows the learned buckets portray this specific example more accurately.

²We only explore the averaging function to reduce the values within a bucket; however, different functions may be explored in future work.

3.5 Interactive Visualization Framework

At a high level, the interactions described in Section 3.2 focus on viewing and comparing hidden states within the context of the RNN model as a whole. Comparisons in particular need to be facilitated across the various kinds of hidden states, which can be numerous, especially when recurrent layers are stacked for improved expressive power. Moreover, users want to interact with the visualization by supplying sequential inputs, observing the model behaviour, and switching levels of detail to take a closer look at points of interest. These interactions need to be quick not only from the perspective of responsiveness, but also so that the user experience readily affords experimentation and “what if” analysis.

To serve these requirements, we design and implement a visualization framework that grounds the interface in the sequential nature of the task and the mathematical notations of the RNN. These notations are further used as labels for hidden states, allowing the user to easily relate their observations back to the core mathematics of the model. Hidden states can be hidden or displayed on demand, allowing the user to focus on what is important to the specific research question they are trying to answer. Within this framework, different views can be adapted to focus on various aspects and interpretations of the RNN model.

The Architecture View, shown later in Figures 4.5 and 4.6, gives a broad overview of the model's inputs, outputs, and hidden states³. The internal components of the RNN are arranged in a natural left-to-right development, which follows that of the input-output scheme. The user can view these details across the top-to-down progression of time across the recurrence relation, visualizing several timesteps at once in this view using a commonplace monitor and resolution. Scrolling forms the interaction for looking through these timesteps, and input words can be arbitrarily changed to observe the model response. This view shows a high level overview of the RNN behaviour, but also allows for switches in level of detail to view hidden states up close.

Rather than showing many hidden states at once, the Detail View, seen later in Figures 4.8, 4.10, and 4.11, focuses instead on observing a microscopic level of detail of a single hidden state at a time. Despite this focused level of detail, the hidden state is still represented in the context of the task by drawing the input sequence with the currently selected input highlighted and centered, as well as by representing the hidden state location within the model architecture with an “inset map”. The layout of the inset map is the same as that in the Architecture View, and shows the averaged colour rectangle from the Predictive Semantic Encodings so that the general flow of information is still visible. Both these elements can be interacted with to perform navigation across the input sequence as well as among the hidden states, preventing unnecessary switches between views.

Also in the Detail View, the user may input a second sequence and compare the hidden state activations in a novel interaction. This comparison renders both hidden states in the same matrix space by placing the same dimension of the two hidden states in a single cell, top to bottom. This level of juxtaposition immediately shows the similarities and differences between the hidden states, a comparison which is further facilitated via a control allowing the user to hide similar or dissimilar values on demand, seen later in Figures 4.10 (2) and 4.11. At a glance, the relation of the count of similar and different hidden state dimensions is redundantly encoded in a histogram along the same axis as the control.

These two views serve as a powerful showcase of the proposed PSE and AR visual encodings. By allowing for various level of detail switches in a seamless manner, users can easily find and study interesting components of the complex RNN function and the data it models. Instance based exploration and comparison provide another avenue for fruitful and interesting research.

3.6 Technical Design and Implementation

The technical implementation of our visualization tool leverages a standard client-server architecture, as shown in Figure 3.4. The front-end client, a user's web browser, communicates with the server, which performs the actual data processing, over an HTTP REST interface. The back-end server component interacts with the RNN model through a well defined interface. This allows arbitrary input sequences to be given to the model, which returns the inference result as well as the internal hidden states elicited throughout the RNN.

Figure 3.4: Schematic of the client-server architecture implemented for the visualization tool. Various data generation and analysis tasks are relegated to a separate set of CLI tools.

³This view excludes gates, but the techniques developed in our work apply to visualizing them as well.

The back-end component also includes a database for querying pre-computed data. This is used in particular for search and analysis of the hidden state data across the training dataset.

With respect to the technologies we use, the server is built with Python3 using the requests library for HTTP REST communication. We build the machine learning models using Tensorflow [1], but as previously discussed the interaction between server and model is defined over an internal API, allowing for our work to be extended to other machine learning toolkits. A PostgreSQL database is used for long term data storage.

The front-end client uses HTML and D3 with functions that arbitrarily render the data served from the back-end. This allows the same client to render the visualization independent of its model parameters or task specific architecture.

We have also built out numerous data generation and analysis tools, such as those for building the PostgreSQL database from the training dataset. These tools can be run in any Python3 environment via command line (CLI), and are self-documented with a “help” option. Our full source code and documentation are available at https://github.com/sawatzkylindsey/rnn-sandbox.


Chapter 4

Applications

This chapter describes the application of these techniques against Recurrent Neural Networks so as to ascertain their performance and viability. We first experiment with the individual techniques, Adaptive Representations and Predictive Semantic Encodings, so as to set a baseline for their potential and show how indeed they may be an effective means for achieving their respective tasks. Then, we utilize the outlined visualization framework to explore RNNs at a fine level of detail.

4.1 Analysis of Predictive Semantic Encodings

4.1.1 Research Question

This experiment sets out to find a reasonable baseline for the task of modelling the semantics of RNN hidden states by the task level outputs they predict.

4.1.2 Methodology

To answer this question, we must first develop a dataset, metric, and set of conditions to evaluate against.

The dataset for this task is derived from the hidden states of a trained RNN. We also refer to the trained RNN as the Model Under Interpretation (MUI), to distinguish it from the task of modelling a PSE. Let the sequential data which the MUI is trained on be labelled ϕ and the sequential data which tests this model be labelled ψ. Then, datasets {ϕ′, ψ′} are generated for a different model G representing the PSE by invoking the MUI against {ϕ, ψ} and extracting the elicited hidden states. Notice, ϕ′ comes only from ϕ, while similarly ψ′ comes only from ψ. Keeping the data separate in this manner ensures that the reduction techniques cannot counteract any biases trained into the MUI. That is, if the LSTM clusters the testing data in some way, the reduction technique will not have access to this cluster and therefore cannot bias itself accordingly.

More specifically, {ϕ′, ψ′} are the result of pairing the observed hidden states with the output labels from {ϕ, ψ}. For each input sequence {x1, .., xT} and output sequence {y1, .., yT}4, where xt elicits the hidden states {ft, rt, .., ct, ht}, each hidden state is paired with an output label, producing a dataset {(ft, yt), (rt, yt), .., (ct, yt), (ht, yt)}. This process results in a derived dataset which is non-sequential in nature, as each hidden state is tasked with predicting the output classification irrespective of other context encoded within the MUI. In this way, the PSE can be viewed as a context-free approximation of the MUI.
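The pairing procedure can be sketched directly. The data layout below (one dict of named hidden states per timestep) is an assumption of this illustration, not our actual storage format:

```python
def derive_pse_dataset(sequences):
    """Pair every hidden state elicited at timestep t with the output label
    y_t, discarding sequential order, per the PSE dataset derivation."""
    pairs = []
    for hidden_states_per_step, labels in sequences:
        # hidden_states_per_step: list over timesteps of {state_kind: vector}
        for states_t, y_t in zip(hidden_states_per_step, labels):
            for kind, vector in states_t.items():
                pairs.append((kind, vector, y_t))
    return pairs

# Toy example: one 2-step sequence with two state kinds per timestep.
seq = (
    [{"f": [0.1], "h": [0.2]}, {"f": [0.3], "h": [0.4]}],  # elicited hidden states
    ["cat", "sat"],                                         # output labels
)
dataset = derive_pse_dataset([seq])
```

Each timestep contributes one training pair per hidden state kind, so a sequence of length T with K state kinds yields T × K non-sequential pairs.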

In order to draw parallels with the MUI, we train and test on the full dataset extracted from {ϕ, ψ}. Similarly, the PSE is evaluated in the same manner as the MUI, using the perplexity metric. Perplexity is the most common measure by which language models are evaluated [9, 23]. It is a measure of the degree to which the model actually reflects the “true” model of the data: the lower the perplexity score, the more accurate the model.

Perplexity(X) = \sqrt[T]{\prod_{t=1}^{T} \frac{1}{P(x_t \mid x_1 \ldots x_{t-1})}}    (4.1)

Where X represents the sentence input and P(x_t | x_1 ... x_{t-1}) is the model’s conditional probability of a word x_t given the context of previous words x_1 ... x_{t-1}. Conceptually, perplexity measures the average inverse probability of a sequence of words. Being normalized over the length T of the input sequence, perplexity is particularly suited for comparison between the MUI and PSE, as PSEs model non-sequential inputs. This normalization allows us to use perplexity to compare the accuracy of the language model vs. the PSE at the same scale.
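As a concrete sketch, Equation 4.1 can be computed in log space for numerical stability (a standard trick, not a detail specific to our implementation):

```python
import math

def perplexity(token_probs):
    """Perplexity of a sequence from the model's per-token conditional
    probabilities P(x_t | x_1 ... x_{t-1}), per Equation 4.1."""
    T = len(token_probs)
    # T-th root of the product of inverse probabilities, computed in log space.
    log_sum = sum(-math.log(p) for p in token_probs)
    return math.exp(log_sum / T)

# A model assigning probability 0.25 to every token has perplexity ~4,
# i.e. it is as uncertain as a uniform choice over 4 words.
print(perplexity([0.25, 0.25, 0.25]))
```

Because the measure is normalized by T, it applies equally to the sequential MUI and the non-sequential PSE pairs.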

We experiment using a Feed Forward Neural Network (FFNN) for the model G under various parameterizations. Specifically, we consider 0-layer and 2-layer FFNNs. The 0-layer parameterization is a simple projection5 from the hidden state to the output labels, making it equivalent to the final classifier function F of the trained RNN (as established in Section 3.3). This equivalence G(final-hidden-state, h) ≡ F(h) is important to note, because it acts as the baseline for the experiment. That is, we expect the results for G on Output2, the final hidden state, to produce roughly the same perplexity as the MUI itself. With this baseline in place, we establish how to compare the quality of G with respect to the various kinds of hidden states, as well as ensure that the semantic model is correct.
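A minimal sketch of the 0-layer parameterization follows: a single projection (matrix multiplication with no non-linearity) from hidden state to output logits, followed by a softmax over the labels. The shapes and names here are illustrative assumptions:

```python
import numpy as np

def pse_0_layer(h, W, b):
    """0-layer PSE: a linear projection from hidden state h to output label
    logits, followed by a softmax to form the predictive distribution."""
    logits = h @ W + b
    exp = np.exp(logits - logits.max())  # numerically stabilized softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
N, V = 300, 10000                  # hidden state width, label vocabulary size
h = rng.standard_normal(N)         # stand-in for a hidden state from the MUI
W = rng.standard_normal((N, V)) * 0.01
probs = pse_0_layer(h, W, np.zeros(V))
```

The 2-layer variant would simply interpose a hidden layer (of width N, matching the LSTM) with a non-linearity before this projection.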

The 2-layer FFNN uses a hidden layer width equivalent to the width of the LSTM hidden states themselves, chosen for its simplicity. Using a smaller hidden width may constrain the Neural Network too much, since the input hidden states already encode highly compressed data from the RNN. On the other hand, a larger hidden width may help the model perform better, but at the cost of more learned parameters. As we are attempting to establish a baseline, and in particular ensure parity with F, more detailed experimentation would be excessive for this initial experiment.

4Without loss of generality we only describe the Sequence-to-label RNN scheme. The Sequence-to-one scheme is the same, except that each input sequence only has a single output y.

5Matrix multiplication with no non-linearity.

Finally, we consider a single mode of configuration for the FFNN. Specifically, we use completely separate models for each kind of hidden state, rather than attempting to build a monolithic model that handles all hidden states. This consideration is made in light of the challenge presented by instead comparing both separate and monolithic modes of configuration. It would be difficult to ensure such a comparison is apples-to-apples, given that separate modes of configuration are advantaged by not needing to share model parameters.

PSE training is performed with stochastic gradient descent, where the learning rate is decreased when the model perplexity does not improve, as evaluated against a separate validation dataset. This procedure is repeated until the validation set perplexity cannot be consistently improved.
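This procedure can be sketched as a simple loop. The decay factor, stopping threshold, and function names below are assumptions of this illustration, not our actual hyperparameters:

```python
def train_pse(step_fn, eval_perplexity, lr=1.0, decay=0.5, min_lr=1e-4):
    """SGD-style loop: take training steps, then decay the learning rate
    whenever validation perplexity fails to improve; stop once the rate
    bottoms out without further improvement."""
    best = float("inf")
    while lr >= min_lr:
        step_fn(lr)               # one pass of stochastic gradient descent
        ppl = eval_perplexity()   # perplexity on the held-out validation set
        if ppl < best:
            best = ppl
        else:
            lr *= decay           # no improvement: anneal the learning rate
    return best

# Toy run: validation perplexity improves for a few epochs, then stalls.
history = iter([120.0, 95.0, 80.0] + [81.0] * 50)
best = train_pse(step_fn=lambda lr: None, eval_perplexity=lambda: next(history))
```

Once the stalled rate has been halved enough times to fall below `min_lr`, the loop exits with the best validation perplexity seen.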

4.1.3 Analysis and Results

We evaluate this experiment on the Penn Treebank [22] language model trained on a 300×2 LSTM, using a total of 20M training and 2.6M testing hidden states to train the PSE. As previously explained, this data is fully representative of the sequential data used to train and test the LSTM, which allows for comparisons to be drawn between the performance of the MUI vs. the PSE.

The training procedure for the PSE takes approximately 90 hours using a single Titan RTX 24GB GPU. Notice, the runtime of this procedure can be significantly reduced by only drawing a random sample from the full 20M training pairs. However, we use the full range of hidden states generated by the sequential data in order to ensure parity with the quality measure from the language model of the MUI.

Firstly, we ensure parity with the RNN by examining the separate 0-layer configuration’s 2nd layer Output hidden state performance, after training convergence. This hidden state achieves a test set perplexity of 70, the best of all the hidden states, and critically this value is on par with the RNN, which achieved a perplexity score of 66. This result establishes that the predictive semantic encodings are an accurate generalization of the MUI classification function F.

With this equivalence confirmed, we move on to analyzing the results in more detail. Figure 4.1 shows the accuracy of the 0-layer projection model for each hidden state. Notice that the axis of this figure is truncated at 400, obscuring the results of some of the findings.

The first observation that immediately stands out is the massive result for the Long-term Memory and Cell hidden states, but only in the second layer of the LSTM. These values are several orders of magnitude larger than any other hidden states. Despite this startling result, this finding actually explains an important aspect of the RNN architecture. Specifically, the Long-term Memory and Cell states represent information that spans through the LSTM across time, rather than only containing information relevant to the current timestep. As a result, these hidden states, especially at the final layer of the LSTM, are not reliable indicators of what final prediction will be made by the LSTM.

0-Layer PSE Accuracy

Figure 4.1: Per hidden state test set perplexity of the fully trained PSE using a 0-layer projection. The y-axis representing perplexity is truncated at 400, despite some values exceeding this range. Notice the 2nd layer Output result is on par with that of the MUI.

Another way to think about this is that the information from these hidden states is highly dependent on other pieces of context represented throughout the LSTM in making a final decision at any given timestep. Therefore, it is not surprising that the Long-term Memory and Cell behave so poorly for this task.

Continuing this line of reasoning, we observe that the best performing hidden state is the Output of the final layer. This is what we expect, since the model for this hidden state is equivalent to that of the actual RNN’s classifier F. Moreover, the only difference in the LSTM architecture between the Output and the Cell, which contrast the best and worst case results in the diagram, is the Output Gate. This fact highlights the important role this particular gating unit plays in the RNN.
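For reference, this relationship is visible in the standard LSTM formulation, where the Output h_t is exactly the Cell c_t passed through a tanh and gated elementwise by the Output Gate o_t:

```latex
h_t = o_t \odot \tanh(c_t)
```

The Output Gate o_t thus selects which portions of the Cell’s accumulated information are exposed for the current prediction, which is consistent with the gap between the Output and Cell results above.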

After the Long-term Memory and Cell states, the next highest perplexity results are from the three gates, as well as the word embedding. This is also consistent with how we expect the LSTM to operate, as gates provide some kind of informational switching capability, but do not necessarily encode much information themselves with respect to the task output. A similar line of reasoning can be made as earlier, in that the gates are highly dependent on other aspects represented within the RNN, holding little representational power in and of themselves.

Similarly, the word embeddings do not encode any long term contextual information at all. They are simply a 1-to-1 mapping from word token to hidden state representation. Therefore, these states are also poor general predictors of the next word, unable to model this task any better than using a bi-gram word distribution.

It is worth noting that although these several kinds of hidden states are poor estimators of the MUI itself, that does not mean they perform poorly as Predictive Semantic Encodings. Rather, this finding suggests they are working correctly, since the idea of a PSE is not to mimic the MUI, but instead to encode some sense of relatable semantics of hidden states as a whole. Since some hidden states naturally bear little standalone representational power, we expect their semantics to be highly ambiguous, as these findings indicate. With an appropriate visual encoding, this ambiguity will pop out in a visualization and help lead users to the same conclusion about the semantics of these hidden states.

From here, we move on to analyzing the effectiveness of the PSE for a 2-layer FFNN. The results from this model are seen in Figure 4.2.

Looking at this figure, it is interesting to note the stark contrast from the results of the 0-layer projection. Here, the 2-layer Feed Forward Neural Network is generally much better at this task, resulting in a worst case perplexity of 358 for the Long-term Memory. This pattern of improved results is fairly consistent across the various kinds of hidden states.


2-Layer PSE Accuracy

Figure 4.2: Per hidden state test set perplexity of the fully trained PSE using a 2-layer Feed Forward Neural Network. The scale in this figure is the same as that from Figure 4.1.


Notice, however, that these results are still consistent with the analysis of RNN behaviour from the 0-layer projection model.

Interestingly, the general gains found by using the 2-layer architecture are not consistent across all hidden states. Notable examples of this are the Cell Input, Short-term Memory, and 2nd layer Output states, which actually perform better in the 0-layer projection. What is intriguing about this set of hidden states when compared to the others is that they represent less long term information. These states have either only just developed from the initial inputs, or fully developed to a final prediction, but either way their representations are barely clouded by any of the context captured within the RNN.

Deeper analysis is required to fully understand the reasons for this finding. However, these results at the least have implications with respect to the modelling of a PSE. Specifically, they suggest that hidden states which are less dependent on the context of the RNN are best modelled with a 0-layer projection. On the other hand, those states that encode the longer term information of the RNN are conversely best modelled with a 2-layer Feed Forward Neural Network.

Finally, we also notice that in the results for both PSE model architectures, the accuracy from the Output of the 1st layer of the RNN is not much worse than that of the 2nd layer. This is intriguing because, among other things, it suggests the 1st layer encodes almost the same representative information with respect to the PSE formulation.

We see two potential implications from this finding. For one, this may suggest that a 2-layer RNN is more complex than necessary for this particular dataset. Instead, there may be potential to achieve roughly on-par results with a single layer LSTM. Another implication of this finding is that the Output from the first layer might be combined with that from the second, to further improve the accuracy of the RNN.

Of course, such an extension would only improve the underlying model performance if the information encoded between these layers is non-intersecting. That is, the first layer Output state must fundamentally represent something in addition to its second layer counterpart if the accuracy is to be improved in this way. In fact, there is some existing research in this area with the idea of residual connections in deep Neural Networks [10]. Kim et al. [16] successfully extend this idea to RNNs, showing indeed that such an extension can improve model performance.

4.1.4 Conclusion and Discussion

We have used Predictive Semantic Encodings to experiment and investigate the high level semantics of hidden states within an LSTM. By establishing an equivalence between PSEs and the MUI itself, we are able to first show that this methodology is a valid lens through which hidden state meaning can be derived. Then, we use this methodology to corroborate the general semantics behind various hidden states in the LSTM architecture. In particular, Long-term Memory and Cell states represent the most significant long term information captured amongst hidden states, while gates play a critical role in switching out the immediately salient portions of this information.

We also find that using a 0-layer projection is typically an inferior architecture when compared to a 2-layer Feed Forward Neural Network for the modelling of a PSE. However, the universal applicability of this result is not so clear, as some kinds of hidden states do not appear to follow the rule. These results help inform the choice of model architecture for a PSE formulation, suggesting a hybrid approach with specific architectures for specific kinds of hidden states will achieve the best results.

4.2 Analysis of Adaptive Representations

4.2.1 Research Question

In this experiment, we seek to understand the extent to which the high dimensional hidden states of an RNN can be reduced into low dimensional representations while still remaining an accurate portrayal of the underlying data.

4.2.2 Methodology

To answer this question, we first must develop a dataset, metric, and set of conditions to evaluate against.

As in the previous Section 4.1, we derive the datasets {ϕ′, ψ′} for this task by invoking the MUI against the sequential data {ϕ, ψ} and extracting the elicited hidden state instances. In order to keep the datasets reasonably sized, a uniform random sample of {ϕ, ψ} is used. Samples are drawn from the training and testing sequences, not their derived hidden states, which ensures an even distribution of hidden state kinds and layers as well as fair representation from the sequential data.

The techniques are evaluated using mean squared error (MSE), where each hidden state h is compared to its reduction v.

MSE(h) = \frac{1}{|N|} \sum_{n \in N} \left( h_n - v_{b(n)} \right)^2    (4.2)

where the nth dimension of h is compared to its mapped dimension b(n) in v. As a baseline, we draw a parallel to variance from statistics, where every value is compared against the global average µ. We call this baseline MSE-variance, and calculate it also by Equation 4.2, setting v ← \vec{\mu}. MSE-variance provides the initial baseline result for these dimensionality reduction techniques to be compared against.
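Equation 4.2 and the MSE-variance baseline can be sketched in a few lines; representing the mapping b as an index array is an assumption of this illustration:

```python
import numpy as np

def mse_reduction(h, v, b):
    """Equation 4.2: mean squared error between a hidden state h and its
    reduction v, where b maps each dimension n of h to a bucket b(n) of v."""
    return np.mean((h - v[b]) ** 2)

h = np.array([1.0, 3.0, 10.0, 12.0])
b = np.array([0, 0, 1, 1])     # dimensions 0,1 -> bucket 0; dimensions 2,3 -> bucket 1
v = np.array([2.0, 11.0])      # per-bucket representative values

err = mse_reduction(h, v, b)   # each dimension is off by exactly 1.0

# MSE-variance baseline: compare every dimension to the global average mu.
mu = np.full_like(v, h.mean())
baseline = mse_reduction(h, mu, b)
```

In this toy case the two-bucket reduction (MSE 1.0) is far better than collapsing everything to the global average (MSE 21.25), which is exactly the comparison the experiment performs at scale.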

We consider best and worst case scenarios under which to run this experiment, derived from the various contexts in which these reductions may be used. As a best case scenario, imagine that every kind of hidden state may learn its own individual dimensionality reduction. This situation is applicable to visualizations where hidden states are compared amongst their kinds (ex: Cell-to-Cell), but not across them (ex: Cell-to-Output). We extend this further to study dimensionality reductions specific to the combinations of hidden state kinds and layers, so that differences in patterns between layers may be further optimized. Clearly this will produce the lowest MSE, as each reduction can take advantage of any patterns specific to the data which the hidden state tends to encode, and need not worry about globalizing these patterns across all hidden states.

As a worst case scenario, consider visualization contexts where any kind of hidden state may be arbitrarily compared to any other. This can even be extended to comparisons across layers in the RNN. In this case, the dimensionality reduction must map the hidden states in the same way in order to make these visual comparisons meaningful. Since any generalizations the technique makes must be globally applicable, this scenario will measure the worst possible MSE.

Between these individual/best-case and global/worst-case scenarios lies a middle ground where various combinations of hidden state kinds and layers are made in order to take advantage of locally-global patterns in the data. We leave any such studies of these combinations out of our analysis, not only because we are already establishing lower and upper bounds, but also because any such relevant combinations will be highly application dependent.

Given these conditions, we use the two dimensionality reduction techniques described previously to produce different sizes of P reductions. The fixed width reduction technique needs no training data, and its best and worst case scenarios are the same, since the mapping it produces is always the same for all hidden states. On the other hand, the learned buckets technique uses the training data as mentioned to optimize the way in which high order dimensions are grouped together. To do so, first consider the training data X \in \mathbb{R}^{S \times N}, where rows S in the matrix represent the sample of hidden state instances and columns N represent the hidden state dimensions. Then, X^T is fed into a Gaussian Mixture Model so that the model observes all the values in each dimension and must learn to minimize the error of combining dimensions to a latent number of reduced dimensions P. The best case scenario for the learned buckets uses a separate learned mapping for each kind and layer of hidden state, while the worst case scenario only uses a choice set of hidden state kinds and layers to produce a single learned mapping.
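The learned buckets procedure amounts to clustering the rows of X^T (one row per hidden state dimension) and reading off the cluster assignments as the mapping b(n). For brevity this illustration substitutes a simple k-means for the Gaussian Mixture Model, with a naively deterministic initialization; all names are hypothetical:

```python
import numpy as np

def learn_buckets(X, P, iters=20):
    """Group the N columns (hidden state dimensions) of the S x N sample
    matrix X into P buckets by clustering the rows of X^T. A plain k-means
    stands in for the Gaussian Mixture Model here, for brevity."""
    profiles = X.T                     # one row per hidden state dimension
    centers = profiles[:P].copy()      # naive deterministic initialization
    for _ in range(iters):
        # Assign each dimension to its nearest bucket center.
        d = ((profiles[:, None, :] - centers[None]) ** 2).sum(-1)
        b = d.argmin(1)
        # Recompute each center from the dimensions assigned to it.
        for p in range(P):
            if (b == p).any():
                centers[p] = profiles[b == p].mean(0)
    return b                           # the learned mapping b(n) -> bucket

# Toy data: dimensions 0-1 behave alike across samples, as do dimensions 2-3.
X = np.array([[1.0, 1.1, 9.0, 9.2],
              [1.2, 1.0, 9.1, 8.9],
              [0.9, 1.2, 8.8, 9.1]])
b = learn_buckets(X, P=2)
```

Dimensions whose values co-vary across the sample end up sharing a bucket, which is precisely the redundancy the learned reduction exploits.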

4.2.3 Analysis and Results

We evaluate this experiment on the Penn Treebank [22] language model task trained on a 300 × 2 LSTM, using a total of 110,000 training and 690,000 testing hidden states for the evaluation of the dimensionality reduction. The training procedure for each target reduction P takes approximately 20 minutes using a standard machine CPU with sufficient memory (we do not leverage a GPU for this training task). The worst case scenario, learning a single global mapping for all hidden state kinds and layers, is trained on both layers’ Cells, the reason for which is explained later in this analysis.


Dimensionality Reduction Technique Error Rates

Figure 4.3: Average accuracy of Fixed Width Buckets and Learned Buckets dimensionality reductions against the Penn Treebank language model task for a 2 layer, 300 width LSTM.

Figure 4.3 shows these results, where the x-axis represents the target reduction P and the y-axis shows the resulting MSE averaged across the dataset.

As expected, at the reduction P = 1 all techniques and the baseline have the same MSE, as they are mathematically equivalent (every dimension of N is mapped into a single bucket). The fixed width technique steadily falls from this initial error rate of 3.9 only down to 2.6 as P increases to 100. This gradual improvement shows how the technique is no different from randomly guessing a grouping of dimensions, and barely better than the baseline of setting all values to the global average (the MSE-variance technique).

The learned buckets technique, on the other hand, quickly shows an immediate improvement from the baseline with a steep initial drop-off. This result is consistent between the individual and global scenarios, with the drop-off tapering out somewhere between P = [5, 10]. Although both scenarios show similarly declining error rates, the global scenario is noticeably not as effective as the individual learned buckets.

After the initial steep drop-off, the error rates gradually approach 0 at a slope less steep than that of the fixed width technique. This suggests that the primary generalizations of the data are made at the initial target reductions P, after which any improvements are not necessarily those of the learning method, but rather simply a matter of naturally fitting the data into more and more buckets. This is not to suggest that the learned bucket technique is ineffective, but rather that its effects come mainly from leveraging a few general trends in the hidden state data which cannot be further extended upon. This pattern helps to illustrate how RNNs tend to model information by distributing it evenly across the dimensions of hidden states.

Before examining the learned reductions in closer detail, it is worth relating their effectiveness in more understandable terms. Consider the individual learned buckets at P = 10, resulting in an MSE of 0.579. This error means that on average, each hidden state dimension is reduced to within \sqrt{0.579} = \pm 0.761 of its actual value. To relate this number to the underlying data, we investigate the test dataset’s statistical properties. On the whole, these hidden state values are found to be within the global domain of [−76, 70], with an average domain of [−15, 15] (the average of the minimum bound/maximum bound for each kind of hidden state). Therefore, the individual learned reduction at P = 10 is able to place each hidden state value into a bucket which on average represents the value within a window of

\frac{0.761 \times 2}{15 - (-15)} = \frac{0.761 \times 2}{30} = 5.1\%

across the entire domain. We can compare this to the fixed width and baseline techniques also at P = 10, which have MSEs of 3.807 and 3.923, resulting in accuracy windows of 13.0% and 13.2% respectively.
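These accuracy windows can be reproduced directly from the reported MSEs; the helper name here is ours:

```python
import math

def accuracy_window(mse, domain_width=30.0):
    """Express an MSE as the width of the window (fraction of the average
    domain [-15, 15], width 30) within which a reduced value represents the
    original hidden state value, on average."""
    return math.sqrt(mse) * 2 / domain_width

print(round(accuracy_window(0.579) * 100, 1))  # individual learned buckets at P = 10
print(round(accuracy_window(3.807) * 100, 1))  # fixed width at P = 10
print(round(accuracy_window(3.923) * 100, 1))  # MSE-variance baseline
```

This recovers the 5.1%, 13.0%, and 13.2% windows quoted above.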

We dig deeper into these results by examining the MSEs for each hidden state kind and layer using the individual learned buckets technique at P = 10. This data is shown in Figure 4.4 using a logarithmic scale.

Observing the data across layers of the RNN itself, we see a general trend that the second layer has a higher MSE than the first. This finding is almost consistent for all hidden states; however, it is most prominent in the Long-term Memory and Cell states, which have errors 4 to 6 times higher than their counterpart layer6. The fact that the second layer does not reduce as effectively as the first suggests the RNN is learning to encode more information, or spread the information more evenly, in its later layers.

Looking with respect to the specific kinds of hidden states, we observe several additional patterns. Firstly, the word embeddings produce the smallest MSE, indicating that they can be reduced the most effectively under this task. This seems to imply not only that this reduction technique is suitable for this kind of hidden state, but also that there are redundant dimensions in the word embedding representations. This may be a result of these hidden states sharing the same dimensionality as that of the RNN hidden states, M = N, suggesting they could just as effectively capture the same amount of information using fewer dimensions M < N, but more experimentation would be necessary to further test this hypothesis.

6Cell layer 2 vs. 1: 2.430/0.572 = 4.24, Long-term Memory layer 2 vs. 1: 2.064/0.344 = 6.0.


Dimensionality Reduction Error Rates per Hidden State

Figure 4.4: Per hidden state accuracy of the Learned Buckets dimensionality reductions against the Penn Treebank Language Model task at P = 10. The y-axis shows a log scale.


Similarly, the three kinds of gates all result in the next lowest set of errors, also indicating that these hidden states seem to generally encode less information. However, this finding should be tempered by the fact that the values of the gates are tightly bound to [0, 1], while this is not true for the word embeddings, for which this dataset observes a minimum/maximum domain of [−1.5, 1.8].

The remaining hidden states result in the highest MSE, with the Cell Input, Long-term Memory, and Cell topping the chart. This finding is consistent with the MSE-variance itself, which exhibits the greatest variance in these three hidden states. Again, this suggests that these particular hidden states encode a larger portion of information than the others. Also, this implies that dimensionality reductions aiming to improve our results should focus on these hidden states in particular.

Moreover, when constructing visualizations which must compare hidden state kinds, any learned dimensionality reductions will want to incorporate data points from these hidden states in order to better minimize the global error rate. It is for this reason that we chose to report the worst case scenario of global learned buckets based on the Cells. This establishes an optimal worst case scenario, although the optimal choice would likely need to include data from all of these more variant kinds of hidden states.

The implication of these results is that some hidden states can be reduced more effectively than others while still remaining perceptually accurate. This finding can be used to inform the design of RNN visualizations by reserving more visual space for those hidden states that reduce less effectively, or by placing emphasis appropriately on these more critical aspects.

4.2.4 Conclusion and Discussion

We have experimented under several conditions with dimensionality reductions suitable for Adaptive Representations on the LSTM architecture. In particular, we look at two scenarios suitable for different visualization contexts, as well as two techniques suitable for producing these reductions. We find that learning a mapping of hidden state dimensions to lower dimensional vectors is a reasonably accurate technique, especially at reduction levels that are suitable for human inspection. In fact, these reductions are effective at very low target dimensions, up to 30x smaller than the source hidden states.

Moreover, this finding holds even in the more difficult scenario of learning a global reduction suitable for comparing arbitrary hidden states. This can be achieved at the very least by including data from the Cell hidden states in the global reduction learning, and possibly from other key hidden states as well.

Dimensionality reductions of this kind allow for a detailed level of examination of the values captured within the RNN and their particular behaviours. By facilitating hidden state comparisons, similarities and differences can be more easily exposed, enabling a deeper yet also broad level of analysis than has previously been available.


4.3 Case Study I: Exploring Information Flow

4.3.1 Research Question

As discussed in Section 3.1, the user groups we target are interested in understanding the macroscopic flow of information through the recurrent model. This interest is both in terms of flow through time across the recurrent function, as well as flow within the recurrent layer itself so as to understand the role of the various internal components. In this case study, we set out to use the proposed visualization techniques to observe these attributes of the RNN.

4.3.2 Methodology

For this case study, we use a fully trained RNN and a crafted set of test sequences to visualize and explore. The visualization uses the results from the previous experiments to select the optimal parameters for the Predictive Semantic Encodings and Adaptive Representations, balanced against the objectives of the visualization task itself. The specifics of these decisions are discussed in the Analysis and Results section.

We visualize various inputs with a high level expectation of how the RNN models the data. That is, with an understanding of the task which the model was trained against, we have certain expectations from an input-output perspective. However, since the RNN acts as a black box, we are less sure about the effect and role of its various components. Therefore, the visualization is used in an exploratory manner so as to view the internal details of the model.

Despite the fact that these models act as black boxes, that does not mean their internal details are completely opaque. Quite the contrary: these models have been developed through meticulous research, and therefore their components have particular roles within the recurrent function itself. So as well as using the visualization to discover, we also use it to confirm the internal behaviours of the model. By confirming model behaviour, we can establish the effectiveness of the visualization as well as use the visualization for instructive purposes, one of the key objectives laid out in Section 3.1.

The data for this study can be considered the input sequences and the hidden states they elicit. Input sequences are chosen to exhibit some pattern that we expect the RNN to model. We make the distinction between sequences which exist in the training data and those that do not. This distinction is important, as training data can be expected to be modelled accurately by the visualization, since the PSE and AR have been derived solely from the training data. The test sequences, on the other hand, can be expected to elicit novel hidden states, forming an objective basis for evaluation. We further detail the test sequences by their various sub-sequences that exist in the training data, to give an accurate picture of just how unique the hidden states they elicit may be.


With the objective of observing the high level flow of information in mind, this case study focuses only on subsets of the Architecture View of the visualization tool. Viewing the hidden states at this level of detail shows their relation to each other within the context of the RNN. The expectation is that this view will expose details such as the intuition behind the information hidden states encode, as well as where and how that information changes.

4.3.3 Analysis and Results

We conduct this case study on the Penn Treebank [22] language model trained on a 300×2 LSTM to a test set perplexity of 66. As this is a language modelling task, and since the experiments focus on visualizing high level information flow, we choose a simple colour mapping focused on part of speech (POS) tags. Specifically, the colour mapping uses the coarse grained POS tag mapping described in Table 4.1, with colours courtesy of Color Brewer.

Language Model Colour Map

Nouns
Adjectives
Verbs
Adverbs
Other

Table 4.1: The applied colour map for the language model visualization. Colouring is based off coarse grained part of speech tags.

We start with a simple input sequence and language feature so as to progressively examine the visualization. We compare two simple noun phrases: many big institutions, which exists in the training data, and many family doctors, which does not7. Although the hidden states of these three timesteps fit comfortably in a 1920×1080 resolution monitor, Figure 4.5 shows just the Cells and Outputs from the final layer of the LSTM (c^2_t, h^2_t), as well as the predicted Softmax (y_t), so that we may focus the discussion on a few specific details.

First, we observe the Predictive Semantic Encodings of the figure. By design, these stand out when first looking at the visualization, with their dominant and contrasting colours.

The colours of the view immediately show the trend of information flow with respect to the Semantic Progression, Development, and Semantic-Development for both of the examples, which, as we elaborate, matches our expectations of a noun-phrase language pattern.

7 The similar phrases many family and family doctor (notice the singular form) do exist in the training data.


Flow of Information for Noun-Phrase

Figure 4.5: Visual comparison of hidden states for two slightly different noun-phrases. PSEs show general semantic agreement between the phrases, compared left to right. ARs show discernible differences in Output states h, giving an intuition of where the differences in data lead to different predictions y.

The first timestep for each phrase is exactly the same, and tends to be coloured a yellowish green. This signifies the input word many tending to predict of (yellow) or another noun such as people (green). In both phrases, the next input word is an adjective which develops towards predicting nouns, also green. Unlike the first timestep, the colouring of this timestep is solid, highlighting the fact that the RNN predominantly expects nouns to follow. The final input word of both phrases is a noun, completing the noun-phrase and indicating a switch towards verbs (purple) or other function words of the language model (and, ,, or --). The PSE colours of this timestep show degrees of purplish yellow, indicating that the RNN hidden states have more varied semantics at this point.

Looking more closely at the low detail Adaptive Representations, other interesting aspects stand out as well. Although Relative Activity can be seen at this level, we do not use this form of comparison in this particular analysis.

Instead, we focus particularly on the Activity Progression. Observing each example independently, we first inspect the h^2_t states, which are the final hidden states in the LSTM before the output prediction. Although these share many similarities, as seen by comparing their respective cells, which represent a grouping of reduced activations, there are also noticeable differences. One such example is with the many big institutions sequence in h^2_3, where the value at the 2nd column, 3rd row is significantly larger than its predecessor's.


Similar examples exist to varying degrees for the other h^2_t comparisons across both input sequences. These differences hint at which activations of the hidden state are most salient in the output prediction, tipping the RNN over from a particular set of decisions to others. Although we do not further explore the activations from this example, these low detail representations act as a guide for deeper investigation, as well as giving a high level intuition of the shape of the data in the hidden states.

By contrast, comparing the c^2_t states shows far fewer discernible differences, especially between the c^2_2 and c^2_3 states. This is interesting because, despite only minute differences, the outputs of these two timesteps are quite different. However, this peculiarity makes sense upon further consideration of the LSTM recurrent function. The only difference between the Cell and Output states is a single gate, unseen in the current figure. Clearly, this gate is controlling the flow of information so that only the relevant portions are factored into the final prediction. Moreover, the reason the Cells show little change between timesteps is that they are slowly accruing the long term information relevant to future timesteps in the language model. These changes are so slow, in fact, that they are barely visible at this level of detail.

Next, we turn to the test input sequence “ we stand in solidarity , ” she emphasized . to further explore the flow of information within the RNN. This example is chosen in particular to view the crossover from the quoted phrase to the rest of the sentence, where we suspect interesting changes to occur in the RNN. The test sequence is unseen in the training data, although common sub-phrases exist in the training data, such as quotation blocks and we stand. Both layers of the LSTM for the 6th and 7th timesteps of the sequence, which correspond to the comma and end quotation punctuation symbols, are shown in Figure 4.6.

As before, the PSE appears to effectively show the gist of information represented in the RNN at these timesteps. The 6th timestep indicates a strong tendency towards predicting the ending quotation mark ”, which can be expected given the language model has learned this common signal from the input word ,. Of course, the language model has learned to predict the ending quotation from the comma as a result of being in this “inside quotation” context. Otherwise, the model would most likely predict other words, such as but or and (we verified this fact by experimenting with the input sequence We stand in solidarity ,). Interestingly, despite their context-free nature, the Predictive Semantic Encodings are similarly able to make this distinction, generally tending to correctly predict ” over but or and. Moreover, the mini-bar chart visual encoding effectively conveys this information, both by ranking the expected term at the top as well as by showing a comparative magnitude of the top-3 predictions.

Another interesting aspect of these encodings is seen at the states c̃^1_6 and s^1_6. Here, the colour rectangles are markedly tinged green, especially for the later hidden state. The green factor of these components appears to come from pronouns, which makes sense for a language model. Only at c^1_6 does the rectangle colour turn solid yellow. This indicates that the c̃^1_6 and s^1_6 hidden states are more on the fence about predicting ”, and only at c^1_6, after mixing in the Long-term Memory, does the LSTM internally make the switch away from these pronouns.

Flow of Information Around End-Quote

Figure 4.6: Architecture View for the 6th and 7th timestep of the input sequence “ we stand in solidarity , ” she emphasized .. The top shows the trend towards predicting the closing quotation ”, while the bottom shows the change in the language model to words that follow a quote.

We also take note of the AR within this figure. Interestingly, very few noticeable differences stand out upon inspection of these visual encodings. In fact, the main discernible differences only occur in the first LSTM layer, in the Cell Input, Short-term Memory, and Output hidden states. Rather than belabouring the previous point that this is a strong indication the other hidden states are correctly representing slowly changing information, we instead turn to Figure 4.7, which shows all the Cells for this sequence from the first layer of the LSTM.

Despite the barely discernible differences in the AR across the two timesteps from Figure 4.6, this figure indeed shows an evolution of the data. The reduced activation values in the Cell from the first timestep (c^1_1) are particularly shallow, indicating that the memory of the RNN starts off in a relatively empty state. However, as these hidden states are read from left to right through time, we indeed see a gradual accumulation of information.

One of the more prominent cells exhibiting this change is the one in the 1st row, 2nd column (top-right), which gradually grows from almost empty to almost full. Its neighbour in the same row, however, shows no such trend, staying consistently empty across time. Other examples similar to these exist in the figure, and in summary form these give an indication of where information is changing more and less prominently across the RNN. As before, at this point we do not further explore the potential features these groupings of activations represent; deeper analysis behind the meaning of these features occurs in the next case study. However, we note that even by observing the hidden states in this low detail form, the user is presented with a manageable amount of information from which more targeted questions and analysis can be formulated.

LSTM Cell across Recurrence

Figure 4.7: View of the c^1_t hidden state for the input sequence “ we stand in solidarity , ” she emphasized .. Gradual changes in the information the state encodes stand out, such as the continual growth in the 1st row, 2nd column of Activity Progression.

We conclude this case study at this point, although further case studies draw from these examples to explore new details exposed through the visualization.

4.3.4 Conclusion and Discussion

We have used an architectural view of the RNN hidden states to observe the internal details the model encodes. Despite only visualizing coarse grained information, we are able to gain valuable insights into, and corroborations of, the flow of information and the role of various components within the LSTM (Questions 1, 2, and 3 from Section 3.1). These insights are primarily facilitated through various forms of comparison, as investigated in Section 3.2.

Specifically, we notice the PSE are an effective technique for showing the high level semantics captured within the RNN. This technique allows for straightforward comparisons of different kinds of hidden states, which otherwise cannot be compared due to the complex non-linear nature of the neural model. Comparisons can be formulated in terms of the Semantic Progression, Development, and Progression-Development to gain insights into the flow of information over the RNN. Moreover, the PSE can effectively represent totally novel hidden states, so that researchers may explore arbitrary inputs on demand.

Furthermore, the AR are shown to be an effective tool for representing many large hidden state vectors at once. Despite viewing hidden state activations at a significantly reduced level of 37.5x = 300/8 (300 dimensions are reduced down to 8), marked trends of increasing, decreasing, or stable activations stand out over time. The user is able to make use of Relative Activity as well as Activity Progression comparisons so as to gain an intuition of what is happening within the model. These intuitions lead to the development of refined questions about the meaning of specific hidden state activations.

By providing easily interpretable visual encodings for RNN hidden states, we show the representational strength of communicating model behaviours at an architectural level of view. The power of this level of observation lies in its simplicity, serving as an effective tool for instructive as well as research contexts. In particular, the instructional value of this view is exciting, as it is a topic generally lacking in existing RNN visualizations. Finally, the visualization brings to light deeper questions to be asked about the model, which we explore in more detail in the following case study.

4.4 Case Study II: Exploring Feature Representations

4.4.1 Research Question

In this case study, we use the visualization tool to inspect hidden state activations at a high level of detail. We seek out this level of study to discover specifically how the RNN models latent features.

4.4.2 Methodology

To perform this case study, we use a fully trained RNN and a crafted set of test sequences to visualize.

Since we want to understand the RNN's behaviour from the perspective of the high level features it captures, it is important to choose sequences that exhibit these features. As with the previous case study, we make the distinction between sequences that occur in the training data vs. those that exist in the test data. Differentiating the data in this way allows us to understand whether the model has been exposed to the sequence before, which may affect how well its latent features are encoded. Generally, we can expect training data to be represented by the model in some “optimal” way, based off the training procedure, while testing data may not have this advantage. This means that despite achieving reasonable predictions with respect to the test data, the internal representations in theory may not be as precise as their training counterparts.

We additionally leverage the comparison of multiple input sequences for this case study. In particular, we compare sequences with similar language features and patterns, with the intent that their similar activations stand out under this level of scrutiny. By contrasting several instances of the same language feature this way, we can formulate hypotheses about the higher order patterns these activations represent. In order to make this level of comparison, the visualization tool allows users to arbitrarily compare the hidden states of input sequences at any point.


With the visualization having facilitated the formulation of these hypotheses, they are then tested with a rigorous investigation into the activation data. We use several techniques outside of the proposed visualization techniques to confirm or disprove these hypotheses.

In general, a hypothesis is formulated in terms of an exhaustive search or analysis through the entire set of activations elicited by the training or testing datasets. If the observed pattern holds through this dataset, then this is a strong indication that the specific activations included in the hypothesis represent the high order pattern. On the other hand, if the expected pattern does not hold across the dataset, then the hypothesis is disproved and must be discarded or refined.

Where applicable, searches through the sequences of the dataset use relative timestep positioning, rather than any absolute timesteps. The reason for this should be clear: high level language features may be exhibited at different positions in different sequences.
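The relative positioning described above can be sketched as follows; the function name and the predicate interface are our own illustration, not the thesis' implementation.

```python
# Sketch of a dataset search using relative timestep positioning: a
# predicate is checked at a proportional position within each
# variable-length sequence rather than at a fixed absolute timestep.
def find_matches(sequences, predicate, rel_pos=0.5):
    """sequences: lists of per-timestep hidden state values.
    rel_pos in [0, 1] selects the timestep proportionally."""
    matches = []
    for seq in sequences:
        t = min(int(rel_pos * len(seq)), len(seq) - 1)
        if predicate(seq[t]):
            matches.append(seq)
    return matches
```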

4.4.3 Analysis and Results

We continue from the setup in the previous case study, with the same Penn Treebank [22] language model and architecture. This case study uses the Detail View to explore specific activation details, which the visualization tool exposes through an intuitive click-to-zoom interaction.

To begin, we resume study of the final test sequence “ we stand in solidarity , ” she emphasized . from the previous section. As noted earlier, one interesting aspect of this example can be observed by viewing the grouping of activations in the 1st row, 2nd column of the c^1_t hidden state over its progression through time. In the visualization we click on c^1_1 to explore the details of this hidden state, and the resulting view is shown in Figure 4.8.

This Detail View shows the low dimensional AR (1) as well as its full underlying 300 dimension hidden state (2). By using the same visual encodings as from the Architecture View, it is clear to the user that the same type of data is being represented, simply at a higher level of detail. The relationship between the two ARs is further confirmed when the user hovers over any matrix cell, by drawing a dual black bordering of the cells that form a dimensionality reduction group. Furthermore, this view also maintains the user's point of reference by showing the word token “ from the input sequence that is currently selected (3). Finally, we also see the currently selected hidden state c^1_1 by mirroring the RNN component layout from the Architecture View, where hidden states are represented just by the colour interpolation of the PSE (4). These contextualizing elements are all selectable, so that the user can seamlessly navigate around the RNN hidden states from this view.

Our case study continues by hovering over the 1st row, 2nd column cell from the low dimensional hidden state representation, as the figure shows. Here, we see that this cell represents three dimensions from the actual hidden state vector, all of which share a similar magnitude and direction. The comparison of these dimensions, as well as others, gives an insight into the degree of excitement between activations. This Relative Activity is used to formulate hypotheses about the significance of various dimensions as we proceed through the case study.

LSTM Cell Details

Figure 4.8: Detail View of the c^1_1 hidden state for the input sequence “ we stand in solidarity , ” she emphasized .. 1) The same low detail Adaptive Representation which was selected from the Architecture View to zoom into this view. 2) The high detail Adaptive Representation for the same hidden state as in (1). Notice, when the user hovers over any matrix cell, a dual black bordering is established between the cell from the low detail AR and the corresponding cells in the high detail AR. The context of where in the input sequence (3) as well as which component of the RNN (4) is maintained within the view.

Since we are investigating the Activity Progression of these three dimensions, the user begins to select the following input words from the contextualizing sequence in the top left, starting with the next word we. As each word is selected, the view updates with the corresponding hidden state c^1_t. By iterating through the sequence, we see that these three activations maintain an almost strictly increasing progression.

As this pattern is remarkable, we use the visualization tool to similarly experiment with several different input sequences. Interestingly, the observation of these three strictly increasing activations holds across various input sequences, leading us to formulate the hypothesis that these dimensions represent some kind of counter within the RNN. This seems reasonable given we are looking at a so-called memory Cell of the RNN, which likely needs to represent this information to make predictions such as the likelihood of a sequence-terminating word token.

We test this hypothesis by performing an exhaustive analysis of the dataset, recording the start and end values of these three activations for all training sequences. Sequences are categorized into two categories: those that “increase monotonically” and those that “do not increase monotonically”. The latter is defined as when any of the three dimensions decreases by over 25% when the value at a timestep t is compared to its predecessor t − 1, while the former categorization is the complement of this definition. Notice, this categorization is not strictly correct by the definition of a monotonically increasing series, but it is sufficient for the purposes of this analysis, which are to establish whether these dimensions represent continuously increasing values.
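The categorization rule can be sketched as below; the function name and the division-by-zero guard are our own assumptions, and we assume the three tracked dimensions are stacked into a (T, 3) array.

```python
import numpy as np

# Sketch of the 25%-drop categorization. cells holds the three tracked
# "counter" dimensions of the Cell state for one sequence, shape (T, 3).
def increases_monotonically(cells, drop_threshold=0.25):
    """False when any tracked dimension falls by more than 25% between
    consecutive timesteps; True otherwise (the complement category)."""
    prev, curr = cells[:-1], cells[1:]
    # relative drop from t-1 to t, guarding against division by zero
    drops = (prev - curr) / np.maximum(np.abs(prev), 1e-8)
    return not bool(np.any(drops > drop_threshold))
```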

The training and testing data resulted in only 2.95% and 3.15% of sequences, respectively, matching this definition of non-monotonically increasing. Figure 4.9 shows more details behind these results for the training data specifically. The monotonically and non-monotonically categorized start and end values are averaged to plot a single line for each pairing of category and dimension. The figure also includes two worst case examples where the increase from the start to end for these dimensions is minimal across the dataset. These worst case scenarios are defined as:

• Largest Drop: the sequential instance x with the largest non-monotonic change between timesteps, x = argmax_x max_{t ∈ [0, T−1]} (c_t − c_{t+1}).

• Minimum: the sequential instance x with the smallest change from the start to the end, x = argmin_x (c_T − c_0).

Counter Dimension Values across Sequential Data

Figure 4.9: Relative changes in the found Cell dimensions {58, 223, 251} from the start of the sequence to its end. X-axis marks are intentionally excluded, as sequences vary in length. Monotonic and Non-monotonic lines represent dataset wide averages, while Minimum and Largest Drop show the value changes across a single sequential instance.

As the figure indicates, these three dimensions consistently increase from their start to end values across the input sequences. This holds for both the monotonically and non-monotonically increasing categorizations, although the non-monotonic category unsurprisingly tends to result in slightly lower final values. Moreover, it is interesting to note that the lines from each dimension appear to form their own grouping, such that the monotonic and non-monotonic categories are drawn side by side without other results coming in between. This seems to indicate that each of these dimensions has a distinct rate of increase.

Looking now to the worst case Minimum instances, it becomes evident that these dimensions never dip below their starting levels, and indeed are indicators of some counting mechanism within the RNN. More specifically, the Minimum instances, which represent the smallest growth from start to end, are still sloped upwards. Since these instances are the minimum across the entire dataset, then by definition no instance exists which ends lower than it starts. Further evidence for this hypothesis is that these Minimum instances are derived from sequences of only T = 2; as T grows, so do the values in each of these dimensions.

It is also interesting to observe the sequential instances with the Largest Drop. Although these sit at the top of the figure, recall that their lines also represent a kind of minimum of the dataset. The reason they appear at the top is that the monotonic and non-monotonic lines below are the result of the average start and end over the dataset.

Nevertheless, the Largest Drop instances give an indication of just how complex and non-linear the RNN's counting mechanism is. In these cases, after a large non-monotonic drop, it generally appears that the counter maintains its position or grows much more slowly from that point onwards. Perhaps this signifies that the prediction the counter has helped the RNN to achieve has already been made, but further analysis is beyond the scope of this discussion.

We now move on to a different use case, where the visualization is used to find a specific feature representation within the RNN, rather than discovering one as was the case with the counter from the previous use case. Here, we attempt to reverse engineer how a higher order language pattern is explicitly modelled.

For this use case, we decide to search for the internal representation of a quotation block “ ... ”. Our understanding of a language modelling task expects the RNN to somehow capture this pattern so it can maintain grammar internal to the quotation block, as well as predict other syntax like the common transitions for introducing and closing quotations (ex: he said , “ ... ” and “ ... , ” she emphasized). We suspect the modelling of this feature must span through the recurrent memory of the RNN in order to track the quotation block through time.

With this goal in mind, we resume using the visualization from Figure 4.8 at the point of inspecting the Cell for a sequence with a quotation block. We now leverage the Feature Activity form of comparison enabled by the visualization tool by inputting a sequence from the training data for comparison that also exhibits the quotation block language feature: “ they seem to like these industrial parks , ” says .... A sequence from the training data is chosen so that we may inspect the internal representations in an optimal situation when compared with the already inputted testing sequence. We line up the sentences in the view so that a comparison of the states c^1_7 and c^1_10 is shown, corresponding to the terminal quotation mark ” of both sequences. Figure 4.10 shows the visualization at this point.

Quotation Block Cell Comparison

Figure 4.10: Detail View comparison of the Cell hidden states for two sequences exhibiting the quotation block feature. Sequences have been aligned in the view so that the end quotation mark ” representations c^1_7 and c^1_10 are compared. The histogram of the similarity measure between the two hidden states (1) shows them to be quite similar, despite being derived from different quotation block sequences.

The view has now become much more complex, since we are observing two hidden state instances in their various low dimensional, high dimensional, and semantic visual encodings. To make sense of the view, every respective visual element from the first sequence is denoted “A” and positioned above its corresponding visual element from the second sequence, labelled “B”. For the high dimensional hidden state representation located on the right of the figure, this top-to-bottom arrangement is made at the level of each dimension of the vector, allowing for Feature Activity comparisons which we explore shortly. For illustration, the figure annotates the top left cell, noting it to represent the 11th dimension for both hidden states.

Additionally, with the comparison enabled, a histogram is now shown in the center of the figure (1). The vertical axis of the histogram can be seen at (2), and it can additionally be controlled to limit the hidden state dimensions shown in the high dimensional representation to just those matching the controlled level of similarity or difference. We use this control shortly.


Quotation Block Cell Comparison - Most Similar

Figure 4.11: Evolution from Figure 4.10, with the Detail View focusing on the most similar activations between the selected Cell c^1_7 and c^1_10 hidden states. Only 99% similar values are shown, allowing users to view likely candidate dimensions for the quotation block latent feature representation.

As the histogram shows, these hidden states are largely similar, with most dimensions categorized as similar up to roughly 80%. This can be read conversely as the hidden states being only slightly different up to 20%. Even this coarse level representation gives a strong indication that our hypothesis is correct: at the point of closing the quotation marks, these two sequences are modelled similarly in the memory of the LSTM.

We now use the circular control under (2) to show only dimensions that are 99% similar, the result of which is seen in Figure 4.11. After dragging the control accordingly, only dimensions that match the specified level of similarity are rendered in the matrix representation, while the histogram count indicates that exactly 60 dimensions match this criterion. Notice that by dragging the control up to 100%, the count changes to 0 and no dimensions are displayed. This means that despite the vast similarity of these hidden states, no single dimension between the two is exactly the same.
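The per-dimension similarity filtering described above can be sketched roughly as follows. The 0-to-1 similarity measure and the fixed normalizing range are assumptions for illustration, since the exact formula behind the histogram control is not specified here:

```python
import numpy as np

def dimension_similarity(a, b, value_range=2.0):
    """Per-dimension similarity in [0, 1] between two hidden state vectors.
    `value_range` is an assumed normalizer (a typical activation span);
    the tool's exact formula is not specified here."""
    return 1.0 - np.abs(a - b) / value_range

def filter_similar_dimensions(a, b, level=0.99):
    """Indices of dimensions that are at least `level` similar."""
    return np.flatnonzero(dimension_similarity(a, b) >= level)

# Two hypothetical 300-dimensional cell states, mostly similar:
rng = np.random.default_rng(0)
c_a = rng.normal(size=300)
c_b = c_a + rng.normal(scale=0.05, size=300)
candidates = filter_similar_dimensions(c_a, c_b, level=0.99)
```

Dragging the circular control corresponds to raising `level`; at `level=1.0` only exactly equal dimensions survive, matching the observation that the count drops to 0.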

Reflecting on the 60 similar dimensions seen in Figure 4.11, we suspect that some subset of these are markers for the ending of a quotation block pattern. The reasoning behind this is that these dimensions contain very similar values, despite occurring in different sequential inputs. Since these values are so similar, it is possible they correspond to the high level pattern that both sequences exhibit.


As a proof of concept for this hypothesis, we arbitrarily select a few of the dimensions and formulate a query against the training data. Specifically, the values from dimensions {32, 105, 164} are selected as seen in the figure. We extract these dimensions and their respective values by hovering over the elements in the UI, whereby the visualization tool renders the details accordingly (unseen in the diagram).

Since we are interested in finding a quotation block, which spans from start to end quotation marks, the query must not only describe the target end pattern of activations as seen, but also the start pattern of activations. For simplicity, we choose the same set of dimensions to start the query as those already selected⁸. Therefore, we use the visualization to compare the same hidden state for the start quote “ of these two sequences, and find the corresponding values for the same set of dimensions.

The query is described such that it first looks for any timestep t_initial to match the initial set of dimensions and values, corresponding to the start of the quotation block, and then for some later timestep t_next to match the later set of dimensions and values, corresponding to the end of the quotation block. Table 4.2 describes the query in full.

Candidate Cell Activations for Feature Matching

Dimension   t_initial   t_next
32          −0.834       0.954
105         −0.885      −1.131
164          0.947      −0.732

Table 4.2: Candidate activations of the 1st layer Cell c^1_t for the quotation block latent feature representation, discovered visually by inspection of Figure 4.11.

Notice that the query values from this table do not describe any consistent pattern in terms of activation changes from t_initial to t_next. The value for dimension 32 increases, while those for dimensions 105 and 164 decrease. Moreover, dimensions 32 and 164 even describe changes in sign between the relative query points. This observation leads us to believe that a threshold based querying approach will not successfully capture the feature representation. Therefore, here we develop a novel technique, called Tolerance Feature Matching (TFM), which adjusts for this finding.

Past work [35] has shown that consistent feature patterns can be found by querying with a fixed threshold. This can be described as a search through the dataset where an activation value a is matched if it exceeds the query parameter threshold: a ≥ ζ.

However, since the potential query parameters found in Table 4.2 do not match any consistent pattern where a threshold would apply, we seek to develop a different querying strategy. Instead, we experiment with a query specification whereby activation values a are matched when they fit within some level of tolerance δ of the parameter ζ, that is, when |a − ζ| ≤ δ. This query strategy is particularly expedient, as we need not specify different search directions for the various query parameters. Moreover, given the observations from visualizing the data, we suspect a query with such a tolerance is an accurate interpretation model for how the RNN internally represents these high order features.

⁸It is not necessarily the case that high order features are modelled in the RNN in this way.
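To make the two query styles concrete, the following sketch contrasts the threshold query of past work with the tolerance query, and implements the two-timestep search over a sequence of cell state vectors. The function names and the list-of-vectors data layout are illustrative assumptions:

```python
import numpy as np

def threshold_match(a, zeta):
    """Prior-work style: an activation matches if it exceeds the threshold."""
    return a >= zeta

def tolerance_match(a, zeta, delta):
    """TFM style: an activation matches if it lies within delta of zeta."""
    return abs(a - zeta) <= delta

def tfm_find(cell_states, dims, initial_vals, next_vals, delta):
    """Find (t_initial, t_next) pairs where the selected dimensions match
    the initial values at t_initial and, at some later timestep t_next,
    match the next values, all within tolerance delta."""
    def matches(t, vals):
        return all(tolerance_match(cell_states[t][d], v, delta)
                   for d, v in zip(dims, vals))

    hits = []
    for ti in range(len(cell_states)):
        if matches(ti, initial_vals):
            hits.extend((ti, tn) for tn in range(ti + 1, len(cell_states))
                        if matches(tn, next_vals))
    return hits
```

Run with dims = [32, 105, 164] and the values of Table 4.2, such a search would enumerate candidate quotation block spans within one sequence's cell states.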

We invoke this query on the 1st layer Cell of the training dataset using varying choices of δ to test this hypothesis. Table 4.3 shows these results, where the token _ denotes a match of any word or symbol that is not a quotation mark, and ... denotes some unspecified subsequence.

Tolerance Feature Matching Results

Pattern     δ = 0.05   δ = 0.1   δ = 0.15
“ ... ”     29         129       293
“ ... _     2          16        49
_ ... ”     1          4         20
_ ... _     0          2         72

Table 4.3: Results for the TFM query of the 1st layer Cell hidden state based on tolerance δ and parameters from Table 4.2 to find the quotation block pattern “ ... ”. The token _ denotes any word or symbol match that is not a start or end quotation mark.

As the count of matches shows, these results are quite promising. This particular pattern of activations under the tolerance based query strategy generally seems to indicate the expected quotation block pattern. In particular, the query appears to predominantly find “ ... ” for the specified levels of δ.

Unsurprisingly, as δ is increased, not only are more and more quotation blocks matched, but so are more and more other patterns. These include one side of the quotation block, such as “ ... _, and even non-quotation block patterns, such as all ... this. Recall that since we are attempting to discover the quotation block feature, it is insufficient to find a representation that captures only one side of the quotation block.

Despite these unintended artifacts, this proof of concept has shown promising results. We have used a process where shared activations are first found through the visualization, and then these activations are used as query parameters to search the training data to see if the intended feature is represented by those parameters.

We painstakingly follow this process to see this use case through, finding how the RNN has internally represented quotation blocks. After slowly building out a list of parameter activations, cross-checked against other training sequences that also include the quotation block feature, we discover a reasonable set of parameters to describe the feature. These 9 parameter dimensions are illustrated in Table 4.4, and a tolerance query using δ = 0.25 exclusively matches 613 quotation blocks.

Quotation Block Latent Feature Representation

Dimension   t_initial   t_next
7            0.041       0.046
15          −0.980      −0.115
29           0.973       0.007
34           0.934      −0.015
42          −0.848      −0.277
47          −0.063       0.034
48          −0.334       0.061
112          0.918      −0.071
138         −0.126       0.903

Table 4.4: Tolerance Feature Matching activations for the quotation block feature, as represented in the 1st layer Cell of the trained LSTM. With a tolerance level δ = 0.25, this representation exclusively matches 613 quotation blocks.

As with those dimension values found for the proof of concept query, there still remains no obvious and simple way to formulate a query based off thresholds. Perhaps a query represented with specific thresholds for specific dimensions would correctly capture the intended latent feature, but it is not immediately obvious how to find such directional thresholds.

We explore this idea a little further by perturbing this baseline query to attempt to discover whether certain dimensions should employ a threshold based query strategy rather than a tolerance based one. To test this, we construct two new queries per dimension against the t_next value, resulting in a total of 18 queries. Each of these queries contains a single perturbation: the selected dimension applies a partial threshold in one direction (hence two queries per dimension), while the other direction remains tolerance based as per the original baseline. In this manner, each perturbed query can only potentially result in matching more quotation or other patterns than the baseline result of 613. The results of this experiment are shown in Figure 4.12.

Tolerance Feature Matching Results with Threshold Perturbations

Figure 4.12: One-off threshold perturbations on the quotation block feature discovered in Table 4.4 for the 1st layer Cell. Each column represents a perturbation (≤ or ≥) applied to a dimension. Results show that thresholds only sometimes capture more features, while sometimes additionally matching extraneous features.

As this data shows, there is no single thresholding strategy that consistently increases the number of quotation blocks captured. The two largest gains come from using ≤ on negative values (dimensions 15 and 42), but the next best improvement contrarily comes from using ≥ on a negative value (also dimension 15). Moreover, all of these perturbations result in extraneously matching non-quotation block patterns. Also, notice that some of these changes do not increase the count of matched quotation blocks at all, while unintentionally over-matching non-quotation blocks. For an example of this, see the 112th and 138th dimensions, whose desired match counts remain at the baseline level of 613.
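The 18 perturbed queries can be enumerated mechanically. The sketch below, with hypothetical helper functions, shows one way to relax a single dimension's test into a one-sided threshold while the remaining dimensions keep the symmetric tolerance test:

```python
def make_perturbed_queries(dims, next_vals, delta):
    """Enumerate the one-off perturbed queries: for each dimension, two
    variants where that dimension's t_next test becomes one-sided, while
    every other dimension keeps the symmetric tolerance test."""
    def tolerance_test(zeta):
        return lambda a: abs(a - zeta) <= delta

    def one_sided_test(zeta, direction):
        # Keep the tolerance bound on one side; open up the other side.
        if direction == "le":
            return lambda a: a <= zeta + delta
        return lambda a: a >= zeta - delta

    queries = []
    for i in range(len(dims)):
        for direction in ("le", "ge"):
            tests = [one_sided_test(v, direction) if j == i else tolerance_test(v)
                     for j, v in enumerate(next_vals)]
            queries.append((dims[i], direction, tests))
    return queries

# With the 9 dimensions of Table 4.4, this yields the 18 perturbed queries.
```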

These observations lead us to conclude that the latent feature representation within the RNN is complex and varied. A general tolerance based modelling seems to apply to the representations, but even this approach implies developing specific δ tolerance levels for each dimension to accurately capture the feature. Moreover, given the varied results of the threshold based model, perhaps the most effective process for finding feature representations is first to look at a simple tolerance based approach, and later expand the tolerance levels on an individual dimensional basis.

4.4.4 Conclusion and Discussion

In this case study, we use the Detail View to home in on some of the specific details the RNN captures. We are able to show that studying hidden states at this level of detail reveals higher order features encoded in the recurrent model (Question 4 from Section 3.1).

In particular, we delve into a question first exposed by viewing the low detail Architecture View in the previous case study. After observing the Activity Progression in detail over several examples, we hypothesize that they correspond to a counter within the RNN.


This hypothesis, formulated by using the visualization tool, is then confirmed by analysis of the dataset outside of the visualization.

We also use the Detail View to reverse engineer which dimensions of the Cell memory encode a long term sequential pattern. By leveraging Feature Activity comparisons, we are able to formulate a hypothesis that specific ranges of hidden state activity can be used to find high order features of the recurrent model. The visualization tool is used to develop a proof of concept strategy for finding these feature representations in the model, which we term Tolerance Feature Matching. We find this strategy to be a successful model for feature representation; however, we also discover a high degree of complexity in the ways in which the RNN captures latent features.


Chapter 5

Conclusion and Future Work

This research explores the topic of understanding the internal details captured by Recurrent Neural Networks. Our focus is specifically on trained RNNs in the context of the Natural Language Processing task of language modelling. We take the approach of understanding their internal representations primarily through the use of visualization, while also examining general trends of data representation.

First, our work outlines a set of target users and the questions they seek to answer with respect to the internal details of these models. From this, we develop a series of abstract data and operations that, when visualized, will help to explore the data and answer these questions. Through this investigation, we observe a particular lack of accessibility with respect to some forms of comparison helpful in understanding the internal details of these models. Our research into this topic yields an RNN Comparison Categorization, which explains the valid set of comparisons available when visualizing RNNs, as well as what information these comparisons have the potential to yield.

With these new insights in mind, we design two novel visual metaphors for the observation of different levels of detail within the RNN. These are specifically Predictive Semantic Encodings (PSE) and Adaptive Representations (AR), and each addresses a different level of data visualization. PSEs give a high level interpretation of the meanings encoded within hidden states as a whole. Significantly, the hidden state semantics they develop expose a previously unexplored set of comparisons and interpretations of the details captured within RNNs. On the other hand, the AR visual encoding affords a detailed representation of hidden state activations while reducing the sheer quantity of data to display. This metaphor allows for RNN details to be rendered at varying levels of detail in a consistent manner, a property useful for studying RNNs in numerous visualization contexts.

Our work then goes on to implement a tool that leverages these visual metaphors to explore a real world task. We observe the Penn Treebank language model dataset, a Sequence-to-label task with 15K unique classification labels. The model is trained using the LSTM architecture with a width of 300 and 2 stacked layers. Despite the complexity of this data and model, our proposed visualization serves the need of explaining the high level flow of information throughout the RNN. We demonstrate such an explanation is especially useful for instructive purposes by showing a simple yet precise picture of the model components and their various interactions.

Moreover, we also demonstrate how a view of the high level information flow is useful in the context of an experienced machine learning researcher. By providing low level details in an architectural view, a tactical approach can be leveraged to explore areas of interest in the model. The visualization then allows for this tactical exploration to become as focused as necessary to gain deep insights into model behaviours.

Our research concludes by showing how the visualization can reverse engineer specific latent features represented by the RNN. Through Tolerance Feature Matching (TFM), we refine a previous approach to feature representation verification. This approach is validated through the discovery of specific patterns of activations across the RNN. Although we find that RNNs model feature patterns in complex ways, the proposed tolerance based approach is found to be a reasonable starting point for developing a more precise model of these feature representations.

Although our work ends here, many new topics have opened up from this research. We now explore several areas where future developments may be built upon these contributions.

One area that leaves room for more study is the strength of the proposed techniques with respect to human perception. Both the proposed PSEs and ARs include perceptual aspects in their visualization. The PSEs in particular use a colour interpolation to indicate their quick-read semantics. The limitations of this concept should be better understood so as to strengthen the ability of this element to convey high level semantics in an immediately interpretable way.

In a similar vein, ARs rely heavily on using relative magnitude to encode vector values. As we have seen, in this rendering task it is challenging to avoid squashing together the values from the ends of the spectrum, reducing the user's ability to perceive small differences amongst this sub-range of values. Future work may explore this topic, especially taking a look at hybrid approaches where different scales are used for different kinds of hidden states.

The reduction techniques we explore for ARs also have certain perceptual impacts. Further research may be conducted to investigate whether more sophisticated reduction methods produce better results, especially with respect to observing changes in dimensionally reduced values. Two such possibilities stand out from our work, although other options may be explored as well.

1. The function whereby a grouping of dimensions is reduced to a single value may be worth investigating. For example, we may consider the effect of not averaging these grouped values, but rather taking their absolute minimum/maximum, or some other such reduction function. Further investigation into this topic may reveal more perceptually effective ways to convey this data in low detail.


2. The learning algorithm that determines which hidden state dimensions are grouped together may be further explored. As a proof of concept, we only consider a Gaussian Mixture Model, but other clustering algorithms that suit this task may be worth investigation. Furthermore, research may also consider the clustering mechanism's cost function so that the perceptual accuracy of this representation is improved. To motivate this idea, we note that some groups of dimensionally reduced values have a strong cancellation effect due to containing similar magnitude positive and negative values. A cost function which incorporates this problem, penalizing dimensional groups with this cancellation effect, may prove especially useful in improving the representational strength of ARs.
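Both suggestions can be prototyped directly. The reduction functions and the cancellation penalty below are illustrative sketches under assumed definitions, not the thesis's implementation:

```python
import numpy as np

def reduce_group(values, how="mean"):
    """Reduce a group of hidden state dimensions to one display value."""
    values = np.asarray(values)
    if how == "mean":
        return float(values.mean())
    if how == "absmax":  # value with the largest magnitude, sign preserved
        return float(values[np.argmax(np.abs(values))])
    raise ValueError(how)

def cancellation_penalty(values):
    """Penalty for groups whose positive and negative members cancel:
    large when the mean hides large opposing magnitudes."""
    values = np.asarray(values)
    return float(np.abs(values).mean() - np.abs(values.mean()))

# A group with strong opposing activations: the mean reduction hides
# them, while the absmax reduction and the penalty expose them.
group = [0.9, -0.85, 0.05]
```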

Moving on from perception, our work also exposes new questions in light of the analysis of latent feature representations. Although thresholds have been adopted in past work, we show that they do not always accurately model the actual feature pattern encoded in the RNN. The tolerance based approach we propose appears to be better suited to this task, but comes with its own set of challenges. In particular, it appears that such tolerance levels are quite complex, potentially requiring specification on an individual per dimension basis.

Despite its complexity, feature discovery through TFM may have a reasonable solution whereby features can be found by a combination of automatic analysis and a human in the loop. We briefly consider how this process may work.

1. A human begins latent feature discovery by specifying the feature to discover based off several example sequences. This specification includes the appropriate alignment of sequences, in addition to a selection of the kinds of hidden states (e.g. memory Cells) suspected to represent the feature within the model.

2. Through automatic means, the sequences and their hidden state representations are analyzed to discover the most likely candidate dimensions, values, and tolerance levels which encode the latent feature. This analysis is based on finding the intersection of hidden state dimensions with close activation values.

3. The automatically discovered candidate feature representation is used to query the dataset to find all the sequences it matches. These results are presented in a digestible form and allow the user to accept and reject specifically matched sequences.

4. The user's input of accepted and rejected sequences is fed back into the process, and these steps repeat as necessary until an acceptable representation has been discovered.
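A rough sketch of how these steps could be wired together follows; every function name here (`propose`, `query`, `review`) is a hypothetical placeholder for a component that would need to be built:

```python
def discover_feature(examples, hidden_state_kind, dataset,
                     propose, query, review, max_rounds=10):
    """Iterative human-in-the-loop latent feature discovery.

    propose(examples, kind)   -> candidate (dims, values, tolerances)
    query(dataset, candidate) -> matched sequences
    review(matches)           -> (accepted, rejected) per the human's input
    """
    candidate = propose(examples, hidden_state_kind)
    for _ in range(max_rounds):
        matches = query(dataset, candidate)
        accepted, rejected = review(matches)
        if not rejected:  # representation accepted as-is
            return candidate
        # Fold the feedback back in: treat accepted matches as further
        # examples and re-derive the candidate representation.
        candidate = propose(examples + accepted, hidden_state_kind)
    return candidate
```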

By leveraging a human in the loop, such a process should be able to rigorously discover accurate latent feature representations. This would be a significant improvement upon the proposed techniques, which have uncovered new insights into RNN behaviours but do not generally scale to the level of quickly finding feature representations.


We conclude by reflecting that the topic of understanding and visualizing RNNs, especially with respect to their internal details, is an open field for future research. Our hope is that this work has helped to uncover some of this open field, while also leading the way forward for new discoveries to be made.


Bibliography

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.

[2] Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. arXiv preprint arXiv:1608.04207, 2016.

[3] Marco Ancona, Cengiz Öztireli, and Markus Gross. Explaining deep neural networks with a polynomial time algorithm for Shapley values approximation. arXiv preprint arXiv:1903.10992, 2019.

[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[5] Yonatan Belinkov and James Glass. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72, 2019.

[6] Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems, pages 4349–4357, 2016.

[7] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186, 2017.

[8] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

[9] Joshua T. Goodman. A bit of progress in language modeling. Computer Speech & Language, 15(4):403–434, 2001.

[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[11] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.


[12] Dieuwke Hupkes, Sara Veldhoen, and Willem Zuidema. Visualisation and 'diagnostic classifiers' reveal how recurrent and recursive neural networks process hierarchical structure. Journal of Artificial Intelligence Research, 61:907–926, 2018.

[13] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.

[14] Minsuk Kahng, Pierre Y. Andrews, Aditya Kalro, and Duen Horng Polo Chau. ActiVis: Visual exploration of industry-scale deep neural network models. IEEE Transactions on Visualization and Computer Graphics, 24(1):88–97, 2018.

[15] Andrej Karpathy, Justin Johnson, and Li Fei-Fei. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078, 2015.

[16] Jaeyoung Kim, Mostafa El-Khamy, and Jungwon Lee. Residual LSTM: Design of a deep recurrent architecture for distant speech recognition. arXiv preprint arXiv:1701.03360, 2017.

[17] Omer Levy, Kenton Lee, Nicholas FitzGerald, and Luke Zettlemoyer. Long short-term memory as a dynamically computed element-wise weighted sum. arXiv preprint arXiv:1805.03716, 2018.

[18] Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. Visualizing and understanding neural models in NLP. arXiv preprint arXiv:1506.01066, 2015.

[19] Jiwei Li, Will Monroe, and Dan Jurafsky. Understanding neural networks through representation erasure. arXiv preprint arXiv:1612.08220, 2016.

[20] Shusen Liu, Zhimin Li, Tao Li, Vivek Srikumar, Valerio Pascucci, and Peer-Timo Bremer. NLIZE: A perturbation-driven visual interrogation tool for analyzing and interpreting natural language inference models. IEEE Transactions on Visualization and Computer Graphics, 25(1):651–660, 2019.

[21] Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations by inverting them. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5188–5196, 2015.

[22] Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. 1993.

[23] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010.

[24] Yao Ming, Shaozu Cao, Ruixiang Zhang, Zhen Li, Yuanzhe Chen, Yangqiu Song, and Huamin Qu. Understanding hidden memories of recurrent neural networks. In 2017 IEEE Conference on Visual Analytics Science and Technology (VAST), pages 13–24. IEEE, 2017.

[25] Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. Methods for interpreting and understanding deep neural networks. Digital Signal Processing, 73:1–15, 2018.


[26] Tamara Munzner. A nested model for visualization design and validation. IEEE Transactions on Visualization and Computer Graphics, 15(6):921–928, 2009.

[27] Paulo E. Rauber, Samuel G. Fadel, Alexandre X. Falcao, and Alexandru C. Telea. Visualizing the hidden activity of artificial neural networks. IEEE Transactions on Visualization and Computer Graphics, 23(1):101–110, 2017.

[28] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.

[29] Wojciech Samek, Thomas Wiegand, and Klaus-Robert Müller. Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models. arXiv preprint arXiv:1708.08296, 2017.

[30] Lindsey Sawatzky, Steven Bergner, and Fred Popowich. Visualizing RNN states with predictive semantic encodings. arXiv preprint arXiv:1908.00588, 2019. To appear in Proceedings of the 2019 IEEE Visualization short papers.

[31] Xing Shi, Inkit Padhi, and Kevin Knight. Does string-based neural MT learn source syntax? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1526–1534, 2016.

[32] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

[33] Daniel Smilkov, Nikhil Thorat, Charles Nicholson, Emily Reif, Fernanda B. Viégas, and Martin Wattenberg. Embedding Projector: Interactive visualization and interpretation of embeddings. arXiv preprint arXiv:1611.05469, 2016.

[34] Hendrik Strobelt, Sebastian Gehrmann, Michael Behrisch, Adam Perer, Hanspeter Pfister, and Alexander M. Rush. Seq2Seq-Vis: A visual debugging tool for sequence-to-sequence models. IEEE Transactions on Visualization and Computer Graphics, 25(1):353–363, 2019.

[35] Hendrik Strobelt, Sebastian Gehrmann, Hanspeter Pfister, and Alexander M. Rush. LSTMVis: A tool for visual analysis of hidden state dynamics in recurrent neural networks. IEEE Transactions on Visualization and Computer Graphics, 24(1):667–676, 2018.

[36] Erik Štrumbelj and Igor Kononenko. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems, 41(3):647–665, 2014.

[37] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.

[38] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.


[39] Duyu Tang, Bing Qin, and Ting Liu. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1422–1432, 2015.

[40] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.


Appendix A

Long Short-Term Memory

The recurrence function of the LSTM, described conceptually in Section , follows:

f^u_t = σ(W_f [h^{u−1}_t, h^u_{t−1}] + b_f)        Forget Gate
i^u_t = σ(W_i [h^{u−1}_t, h^u_{t−1}] + b_i)        Remember Gate
o^u_t = σ(W_o [h^{u−1}_t, h^u_{t−1}] + b_o)        Output Gate
c̃^u_t = tanh(W_c [h^{u−1}_t, h^u_{t−1}] + b_c)     Cell Input
l^u_t = f^u_t ◦ c^u_{t−1}                          Long-term Memory
s^u_t = i^u_t ◦ c̃^u_t                              Short-term Memory
c^u_t = l^u_t + s^u_t                              Cell State
h^u_t = o^u_t ◦ tanh(c^u_t)                        Output State

Where W_∗ represents a matrix of size N × 2N and b_∗ a bias vector of size N × 1. With respect to the NLP tasks we model, the following equations complete the mathematical description of these models.

e_t = W_e x_t                                      (Word) Embedding
y_t = softmax(W_y h^U_t + b_y)                     Softmax

With W_e, W_y, and b_y of size M × V, K × N, and K × 1 respectively. e_t (M × 1) is projected to N × 1 when M ≠ N.
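The recurrence above maps directly to code. This NumPy sketch of a single step for layer u follows the equations (writing c̃ for the Cell Input), assuming the inputs are concatenated as [h^{u−1}_t, h^u_{t−1}] and the per-gate weights are stored in dictionaries:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_below, h_prev, c_prev, W, b):
    """One LSTM step for layer u at time t.

    h_below: h^{u-1}_t, shape (N,), input from the layer below
    h_prev:  h^u_{t-1}, shape (N,), this layer's previous output state
    c_prev:  c^u_{t-1}, shape (N,), this layer's previous cell state
    W: dict of (N, 2N) matrices and b: dict of (N,) biases, keyed f/i/o/c.
    """
    x = np.concatenate([h_below, h_prev])   # [h^{u-1}_t, h^u_{t-1}]
    f = sigmoid(W["f"] @ x + b["f"])        # forget gate
    i = sigmoid(W["i"] @ x + b["i"])        # remember gate
    o = sigmoid(W["o"] @ x + b["o"])        # output gate
    c_in = np.tanh(W["c"] @ x + b["c"])     # cell input (c-tilde)
    l = f * c_prev                          # long-term memory
    s = i * c_in                            # short-term memory
    c = l + s                               # cell state
    h = o * np.tanh(c)                      # output state
    return h, c
```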


Appendix B

Literature RNN Comparison Categorization

In this appendix we outline the high level comparison categorizations used by the existing work in this area. Notice that only references which leverage some kind of applicable hidden state comparison are included in this categorization exercise. Table B.1 shows the results of this analysis.

A few general trends stand out from this table. The two most notable are the reliance on Heatmap-based techniques to observe Activity Progression, and on 2-D projection techniques to observe Feature Semantics. The applicability of both of these techniques, however, tends to be limited.

The heatmap Activity Progression noted here is that of viewing the "attention", or similar, of the RNN. In practice, this applies to only one kind of hidden state within the model. Some works in this categorization do apply heatmaps to non-attention hidden states, specifically gate units; however, this is done 1) by using an alteration of the LSTM architecture, and 2) by showing not the hidden state itself, but its L2 norm.

As for the 2-D projections representing Feature Semantics, these are typically done with respect to word embedding hidden states. This technique is challenging to apply to other hidden states, which do not directly relate to real-world concepts such as task inputs or outputs. Some works in this categorization list are able to overcome this challenge, for example by relating the hidden state to the expected output the test case should produce.

With this in mind, we note the area of semantic comparison (Semantic Progression, Semantic Development, Semantic Progression-Development, Feature Semantics, Feature Semantic-Development) to be quite lacking. This is particularly the case with Semantic Progression, Semantic Development, and Semantic Progression-Development, with the latter two not having any current visualization techniques.

We also note a general lack of work with respect to Feature Activity, although one impactful work exists in this area.



Literature by RNN Comparison Categorization

Category               Reference             Medium         Technique
Relative Activity      Karpathy et al. [15]  Visualization  Colour Intensity
Relative Activity      Ming et al. [24]      Visualization  Colour Intensity
Relative Activity      Kahng et al. [14]     Visualization  Colour Intensity
Relative Activity      Strobelt et al. [35]  Visualization  Line Graph

Activity Progression   Liu et al. [20]       Visualization  Heatmap
Activity Progression   Li et al. [18]        Visualization  Heatmap
Activity Progression   Li et al. [19]        Visualization  Heatmap
Activity Progression   Bahdanau et al. [4]   Visualization  Heatmap
Activity Progression   Levy et al. [17]      Visualization  Heatmap
Activity Progression   Karpathy et al. [15]  Visualization  Intensity Overlay
Activity Progression   Strobelt et al. [35]  Visualization  Line Graph
Feature Activity       Strobelt et al. [35]  Visualization  Line Graph

Semantic Progression   Ming et al. [24]      Visualization  Glyph
Semantic Progression   Strobelt et al. [34]  Visualization  2-D Projection
Semantic Progression   Hupkes et al. [12]    Analysis
Semantic Development   Shi et al. [31]       Analysis
Feature Semantics      Ming et al. [24]      Visualization  Glyph
Feature Semantics      Li et al. [18]        Visualization  2-D Projection
Feature Semantics      Kahng et al. [14]     Visualization  2-D Projection
Feature Semantics      Rauber et al. [27]    Visualization  2-D Projection
Feature Semantics      Smilkov et al. [33]   Visualization  2-D Projection
Feature Semantics      Abadi et al. [1]      Visualization  2-D Projection
Feature Semantics      Strobelt et al. [34]  Visualization  2-D Projection

Table B.1: Placement of existing literature into the developed RNN Comparison Categorization.



Appendix C

Adaptive Representation Scale Function

In general, a scale function $\nu$ maps the domain of values $x \in [\alpha^-, \alpha^+]$ onto a range $y \in [\beta^-, \beta^+]$ via some scale $\tau(x)$.⁹

\[
\nu(x, y, \tau) = \frac{(\beta^+ - \beta^-)\,\tau(x)}{\tau(\alpha^+) - \tau(\alpha^-)} - \frac{(\beta^+ - \beta^-)\,\tau(\alpha^-)}{\tau(\alpha^+) - \tau(\alpha^-)} + \beta^-
\]

Some examples of the scale $\tau(x)$ are the linear $\tau(x) = x$ and logarithmic $\tau(x) = \log(x + \epsilon)$ functions ($\epsilon$ is added to avoid taking the logarithm of 0). The scale function we propose for the Adaptive Representations can then be defined as $\nu(x, y, \xi)$ with:

\begin{align*}
\xi(x) &= \log(\nu(x, [0, 9], \lambda) + 1) \\
\lambda(x) &= x
\end{align*}

This can be written in full and simplified with the assumption that $y = [0, \beta^+]$ as follows. Notice that $\xi(\alpha^-) = 0$ and $\xi(\alpha^+) = 1$.

⁹ Without loss of generality, we only describe the scale function for values when the domain and range are always positive.



\begin{align*}
\nu(x, y, \xi) &= \frac{(\beta^+ - \beta^-)\,\xi(x)}{\xi(\alpha^+) - \xi(\alpha^-)} - \frac{(\beta^+ - \beta^-)\,\xi(\alpha^-)}{\xi(\alpha^+) - \xi(\alpha^-)} + \beta^- \\
&= \frac{\beta^+\,\xi(x)}{\xi(\alpha^+) - \xi(\alpha^-)} - \frac{\beta^+\,\xi(\alpha^-)}{\xi(\alpha^+) - \xi(\alpha^-)} \\
&= \frac{\beta^+\,\xi(x)}{1} - \frac{\beta^+ \cdot 0}{1} \\
&= \beta^+ \log(\nu(x, [0, 9], \lambda) + 1)
\end{align*}
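The scale function and its adaptive variant can be sketched in a few lines of Python. This is an illustrative sketch, not the thesis implementation; note it assumes the base-10 logarithm, which is what makes $\xi(\alpha^-) = 0$ and $\xi(\alpha^+) = 1$ hold, since $\nu(x, [0, 9], \lambda)$ maps the endpoints to 0 and 9 and $\log_{10}(1) = 0$, $\log_{10}(10) = 1$.

```python
import numpy as np

def nu(x, beta, tau, alpha):
    """Generic scale function: map x in [alpha_minus, alpha_plus]
    onto [beta_minus, beta_plus] through the scale tau."""
    a_minus, a_plus = alpha
    b_minus, b_plus = beta
    span = tau(a_plus) - tau(a_minus)
    return ((b_plus - b_minus) * tau(x) / span
            - (b_plus - b_minus) * tau(a_minus) / span
            + b_minus)

def adaptive_scale(x, beta_plus, alpha):
    """Adaptive Representation scale: beta_plus * log10(nu(x, [0, 9], lambda) + 1),
    where lambda is the identity (linear) scale.  Assumes log base 10 so that
    the endpoints of [alpha_minus, alpha_plus] map to 0 and beta_plus."""
    xi = np.log10(nu(x, (0.0, 9.0), lambda v: v, alpha) + 1.0)
    return beta_plus * xi
```

For example, with $[\alpha^-, \alpha^+] = [2, 50]$ and $\beta^+ = 5$, `adaptive_scale` sends 2 to 0 and 50 to 5, consistent with the simplified form $\beta^+ \log(\nu(x, [0, 9], \lambda) + 1)$ derived above.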
