QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM

By

PURNENDU MUKHERJEE

A THESIS PRESENTED TO THE GRADUATE SCHOOL

OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

UNIVERSITY OF FLORIDA

2018

© 2018 Purnendu Mukherjee

To my family, friends, and teachers


ACKNOWLEDGMENTS

I would like to thank my thesis advisor, Professor Andy Li, who has been a constant source of inspiration and encouragement for me to pursue my ideas. I am also thankful to him for providing all the resources necessary for my research efforts to succeed.

I am deeply thankful to Professor Kristy Boyer for her guidance and mentorship, as she motivated me to pursue my research interests in Natural Language Processing and Deep Learning.

I would like to express my heartfelt gratitude to Professor Jose Principe, who taught me the very fundamentals of learning systems and formally introduced me to Deep Learning.

I wish to extend my thanks to all the members of the CBL Lab, and especially to the NLP group, for their insightful remarks and overall support.

I am grateful to my close friend, roommate, and lab partner Yash Sinha, who supported and helped me throughout all aspects of my thesis work.

Finally, I must thank my parents for their unwavering support and dedication towards my well-being and growth.


TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF FIGURES

LIST OF ABBREVIATIONS

ABSTRACT

CHAPTER

1 INTRODUCTION

2 THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
Neural Networks
Convolutional Neural Network
Recurrent Neural Networks (RNN)
Word Embedding
Attention Mechanism
Memory Networks

3 LITERATURE REVIEW AND STATE OF THE ART
Machine Comprehension Using Match-LSTM and Answer Pointer
R-NET: Machine Reading Comprehension with Self-Matching Networks
Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
Summary

4 MULTI-ATTENTION QUESTION ANSWERING
Multi-Attention BiDAF Model
Chatbot Design Using a QA System
Online QA System and Attention Visualization

5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions
Conclusion

LIST OF REFERENCES

BIOGRAPHICAL SKETCH


LIST OF FIGURES

1-1 The task of Question Answering

2-1 Simple and Deep Learning Neural Networks

2-2 Convolutional Neural Network Architecture

2-3 Semantic relation between words in vector space

2-4 Attention Mechanism flow

2-5 QA example on Lord of the Rings using Memory Networks

3-1 Match-LSTM Model Architecture

3-2 The task of Question Answering

3-3 Bi-Directional Attention Flow Model Architecture

4-1 The modified BiDAF model with multilevel attention

4-2 Flight reservation chatbot's chat window

4-3 Chatbot within OneTask system

4-4 The Flow diagram of the Flight booking Chatbot system

4-5 QA system interface with attention highlight over candidate answers

5-1 An English language semantic parse tree


LIST OF ABBREVIATIONS

BiDAF Bi-Directional Attention Flow

CNN Convolutional Neural Network

GRU Gated Recurrent Units

LSTM Long Short Term Memory

NLP Natural Language Processing

NLU Natural Language Understanding

RNN Recurrent Neural Network


Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the

Requirements for the Degree of Master of Science

QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM

By

Purnendu Mukherjee

May 2018

Chair: Xiaolin Li
Major: Computer Science

Question Answering (QA) systems have grown rapidly over the last three years and are close to reaching human-level accuracy. One of the fundamental reasons for this growth has been the use of the attention mechanism along with other Deep Learning methods. But just as with other Deep Learning methods, some of the failure cases are so obvious that they convince us there is a lot to improve upon. In this work, we first review the state-of-the-art and fundamental models in QA systems. Next, we introduce an architecture which has shown improvement in the targeted area. We then introduce a general method to enable easy design of domain-specific chatbot applications and present a proof of concept built with that method. Finally, we present an easy-to-use Question Answering interface with attention visualization on the passage. We also propose a method to improve the current state of the art as part of our ongoing work.


CHAPTER 1 INTRODUCTION

Teaching machines to read and understand human natural language is a central and long-standing goal of Natural Language Processing and of Artificial Intelligence in general. The positive impact of machines being able to reason about and comprehend human language could be enormous. We are beginning to see how commercial systems like Alexa, Siri, Google Now, etc. are being widely used as Speech Recognition has improved to human levels. While speech recognition systems can transcribe speech to text, comprehension of that transcribed text is another task, one that is currently a major focus for both academia and industry because of its possible applications. Moreover, the amount of text available throughout the internet is a major reason why machine comprehension of text is such an important task.

With the growth of Deep Learning methods in the last few years, the field of Machine Comprehension, and Natural Language Processing (NLP) in general, has experienced a revolution. While traditional methods and practices are still prevalent and form the basis of our deep understanding of languages, Deep Learning methods have surpassed all traditional NLP and Machine Learning methods by a significant margin and are currently driving the growth of the field.

To be able to build a system that can understand human text, we first need to ask ourselves how we can evaluate a machine's comprehension ability. We had initially set a goal to build a chatbot for a specific domain and generalize to other topics as we went ahead. While developing the system, we discovered the necessity of reading comprehension and of a way to measure it. We finally found the answer in Question Answering systems.


Just as we human beings are tested for our ability to understand language with questions, we should ask machines similar questions about what they have just read. The performance of the system on such a question answering task lets us evaluate how well the machine is able to reason about what it just read [1]. Reading comprehension has been a topic of Natural Language Understanding since the 1970s. In 1977, Wendy Lehnert wrote in her doctoral thesis: "Only when we can ask a program to answer questions about what it reads will we be able to begin to access that program's comprehension" [2].

Figure 1-1 The task of Question Answering

To support this task, the NLP community has developed various datasets such as CNN/Daily Mail, WebQuestions, SQuAD, TriviaQA [3], etc. For our purpose we chose SQuAD, which stands for Stanford Question Answering Dataset [4]. SQuAD consists of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. With 100,000+ question-answer pairs on 500+ articles, SQuAD is significantly larger than previous reading comprehension datasets [5].

An example from the SQuAD dataset is as follows.

Passage: Tesla later approached Morgan to ask for more funds to build a more powerful transmitter. When asked where all the money had gone, Tesla responded by saying that he was affected by the Panic of 1901, which he (Morgan) had caused. Morgan was shocked by the reminder of his part in the stock market crash and by Tesla's breach of contract by asking for more funds. Tesla wrote another plea to Morgan, but it was also fruitless. Morgan still owed Tesla money on the original agreement, and Tesla had been facing foreclosure even before construction of the tower began.

Question: On what did Tesla blame for the loss of the initial money?

Answer: Panic of 1901

As we started exploring the QA task, we faced several challenges. Some of them we could solve with the help of other research, and some still exist in the domain:

• Out-of-vocabulary words

• Multi-sentence reasoning may be required

• There may exist several candidate answers

• Optimizing the Exact Match (EM) metric may fail when the answer boundary is fuzzy or too long, such as the answer to a "why" query [6]

• "One-hop" prediction may fail to fully understand the query [6]

• Models that only use an LSTM/GRU fail to fully capture the long-distance contextual interaction between parts of the context

• Current models are unable to capture the semantics of the passage

In the upcoming chapters we will first briefly review the basics necessary for understanding the models, then delve into the fundamental models that have shaped the current state of the art, then discuss our contributions in terms of architecture and applications, and finally conclude with future directions.


CHAPTER 2 THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM

To build a question answering system, one needs to be familiar with fundamental deep learning models such as Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), etc. In this chapter we give an overview of these techniques and see how they all connect to building a question answering system.

Neural Networks

What makes Deep Learning so intriguing is that it bears a close resemblance to the working of the mammalian brain, or at least draws inspiration from it. The same can be said for Artificial Neural Networks [7], which consist of a system of interconnected units called 'neurons' that take input from similar units and produce a single output.

Figure 2-1 Simple and Deep Learning Neural Networks [8]

The connection from one neuron to another is weighted based on the input data, which enables the network to tune itself to produce a certain output for a given input. This is the learning process, which is achieved through backpropagation: a procedure that propagates the error from the output layer back to the previous layers.
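As an illustration (not part of the original thesis work), the following NumPy sketch trains a one-hidden-layer network on a toy XOR problem, making the forward pass and the backpropagation of the output error explicit; the architecture, data, and learning rate are arbitrary choices.

```python
# Minimal sketch: a one-hidden-layer network trained by backpropagation on XOR.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # toy inputs
y = np.array([[0], [1], [1], [0]], dtype=float)              # toy targets

W1 = rng.normal(scale=0.5, size=(2, 8)); b1 = np.zeros(8)    # hidden layer
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)    # output layer
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for step in range(5000):
    # Forward pass: weighted connections between 'neurons'
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: propagate the output error to the previous layer
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient-descent weight updates
    W2 -= 0.5 * h.T @ d_out;  b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_h;    b1 -= 0.5 * d_h.sum(axis=0)

print(out.round(2).ravel())  # typically approaches [0, 1, 1, 0]
```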


Convolutional Neural Network

The first wave of deep learning's success was brought about by Convolutional Neural Networks (CNN) [9], the technique used by the winning team of the ImageNet competition in 2012. CNNs are deep artificial neural networks (ANN) that can be used to classify images, cluster them by similarity, and perform object recognition within scenes. They can be used to detect and identify faces, people, signs, or any other visual data.

Figure 2-2 Convolutional Neural Network Architecture [10]

There are primarily four operations in a standard CNN model, as shown in Figure 2-2 above (a minimal code sketch follows the list):

1. Convolution - The primary purpose of convolution in the ConvNet above is to extract features from the input image. The spatial relationships between pixels, i.e., the image features, are preserved and learned by the convolution using small squares of input data.

2. Non-Linearity (ReLU) - The Rectified Linear Unit (ReLU) is a non-linear operation applied element-wise on each pixel. It replaces the negative values in the feature map with zero.

3. Pooling or Sub-Sampling - Spatial pooling reduces the dimensionality of each feature map but retains the most important information. For max pooling, the largest value in the square window is kept and the rest are dropped. Other types of pooling are average, sum, etc.

4. Classification (Fully Connected Layer) - The fully connected layer is a traditional Multi-Layer Perceptron, as described before, that uses a softmax activation function in the output layer. The high-level features of the image are encoded by the convolutional and pooling layers and then fed to the fully connected layer, which uses these features to classify the input image into various classes based on the training dataset [10].
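The four operations above can be sketched in a few lines of PyTorch; the layer sizes and input shape below are illustrative assumptions, not the networks used in this thesis.

```python
# Minimal sketch of a CNN classifier: convolution, ReLU, pooling, fully connected.
import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 1. convolution
            nn.ReLU(),                                   # 2. non-linearity
            nn.MaxPool2d(2),                             # 3. spatial (max) pooling
        )
        self.classifier = nn.Linear(16 * 16 * 16, num_classes)  # 4. fully connected

    def forward(self, x):                    # x: (batch, 3, 32, 32)
        x = self.features(x)
        x = x.flatten(1)
        return self.classifier(x)            # class scores (logits)

logits = TinyConvNet()(torch.randn(4, 3, 32, 32))
probs = logits.softmax(dim=1)                # probability distribution over classes
```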


When a new image is fed into the CNN model, all the above-mentioned steps are carried out (forward propagation) and a probability distribution over the set of output classes is obtained. With a large enough training dataset, the network will learn and generalize well enough to classify new images into their correct classes.

Recurrent Neural Networks (RNN)

Whenever we want to predict or encode sequential data, the RNN [11] is our go-to method. RNNs perform the same task for every element of a sequence, where the output for each element depends on the previous computations, hence the recurrence. A basic RNN updates its hidden state as

h_t = tanh(W_h h_{t-1} + W_x x_t + b)    (2-1)

In practice, RNNs are unable to retain long-term dependencies and can look back only a few steps because of the vanishing gradient problem.

A solution to the dependency problem is to use gated cells such as the LSTM [12] or GRU [13]. These cells pass on important information to the next cells while ignoring non-important information. The gated units in a GRU block are:

• Update gate, computed from the current input and the previous hidden state:

z_t = sigmoid(W_z x_t + U_z h_{t-1})    (2-2)

• Reset gate, calculated similarly but with different weights:

r_t = sigmoid(W_r x_t + U_r h_{t-1})    (2-3)

• New memory content:

h~_t = tanh(W x_t + U (r_t ⊙ h_{t-1}))    (2-4)

where ⊙ denotes element-wise multiplication. If the reset gate is close to 0, the previous memory is ignored and only the new information is kept.

The final memory at the current time step combines the previous memory and the new content, weighted by the update gate:

h_t = z_t ⊙ h_{t-1} + (1 - z_t) ⊙ h~_t    (2-5)
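The following NumPy sketch implements one GRU step according to Equations 2-2 through 2-5; the weight matrices are random placeholders rather than trained parameters.

```python
# Minimal sketch of a single GRU step (Eqs. 2-2 to 2-5) with placeholder weights.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev)             # update gate (2-2)
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev)             # reset gate  (2-3)
    h_tilde = np.tanh(p["W"] @ x_t + p["U"] @ (r * h_prev))   # new memory  (2-4)
    return z * h_prev + (1.0 - z) * h_tilde                   # final memory (2-5)

d_in, d_h = 4, 3
rng = np.random.default_rng(0)
params = {k: rng.normal(size=(d_h, d_in if k.startswith("W") else d_h))
          for k in ["Wz", "Uz", "Wr", "Ur", "W", "U"]}
h = gru_step(rng.normal(size=d_in), np.zeros(d_h), params)
```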

While the GRU is computationally efficient, the LSTM is a more general case with three gates:

• Input gate - what new information to add to the current cell state.

• Forget gate - how much information from previous states should be kept.

• Output gate - how much information should be sent on to the next states.

Just as in the GRU, the current cell state is a sum of the previous cell state, weighted by the forget gate, and the new value, weighted by the input gate. Based on the cell state, the output gate regulates the final output.

Word Embedding

Computations and gradients can be applied to numbers, not to words or letters. So we first need to convert words into a corresponding numerical representation before feeding them into a deep learning model. In general there are two types of word embedding: frequency-based (which includes count vectors, tf-idf, and co-occurrence vectors) and prediction-based. With frequency-based embeddings, the order of the words is not preserved and the text is treated as a bag of words. With prediction-based models, on the other hand, the order or locality of words is taken into consideration to generate the numerical representation of a word. Within the prediction-based category there are two fundamental techniques, Continuous Bag of Words (CBOW) and the Skip-Gram model, which form the basis of word2vec [14] and GloVe [15].

The basic intuition behind word2vec is that if two different words have very similar "contexts" (that is, the words that are likely to appear around them), then the model will produce similar vectors for those words. Conversely, if two word vectors are similar, then the network will produce similar context predictions for those two words. For example, synonyms like "intelligent" and "smart" would have very similar contexts, and related words like "engine" and "transmission" would probably have similar contexts as well [16]. Plotting the word vectors learned by word2vec over a large corpus, we can find some very interesting relationships between words.

Figure 2-3 Semantic relation between words in vector space [17]
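As a toy illustration of this property, the sketch below uses invented 3-dimensional vectors (not real word2vec output) to show how cosine similarity recovers an analogy such as king - man + woman ≈ queen.

```python
# Toy sketch: cosine similarity over hypothetical word vectors.
import numpy as np

vecs = {                                   # invented embeddings, for illustration only
    "king":  np.array([0.8, 0.7, 0.1]),
    "queen": np.array([0.8, 0.1, 0.7]),
    "man":   np.array([0.2, 0.9, 0.1]),
    "woman": np.array([0.2, 0.1, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max((w for w in vecs if w not in ("king", "man", "woman")),
           key=lambda w: cosine(vecs[w], target))
print(best)   # "queen" for these toy vectors
```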

Attention Mechanism

We as humans pay attention to things that are important or relevant in a context. For example, when asked a question about a passage, we try to find the part of the passage most relevant to the question and then reason from our understanding of that part. The same idea applies to the attention mechanism in Deep Learning: it is used to identify the specific parts of a given context to which the current question is relevant.

Formally put, the technique takes n arguments y_1, ..., y_n (in our case, the vectors of the passage words) and a question representation, say q. It returns a vector z which is supposed to be the "summary" of the y_i, focusing on the information linked to the question q. More formally, it returns a weighted arithmetic mean of the y_i, where the weights are chosen according to the relevance of each y_i given the context c [18].

Figure 2-4 Attention Mechanism flow [18]
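A minimal sketch of this weighted-mean view of attention, with random placeholder vectors standing in for the passage and the question, is shown below.

```python
# Minimal sketch: attention as a softmax-weighted mean of passage vectors.
import numpy as np

def attend(Y, q):
    """Y: (n, d) passage word vectors; q: (d,) question vector."""
    scores = Y @ q                              # relevance of each y_i to q
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax attention weights
    return weights @ Y, weights                 # z = sum_i w_i * y_i

rng = np.random.default_rng(0)
Y, q = rng.normal(size=(6, 8)), rng.normal(size=8)
z, w = attend(Y, q)                             # z summarizes Y with respect to q
```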

Memory Networks

While Convolutional Neural Networks and Recurrent Neural Networks do capture how we form our visual and sequential memories, their memory (encoded by hidden states and weights) is typically too small and not compartmentalized enough to accurately remember facts from the past (knowledge is compressed into dense vectors) [19].

Deep Learning needed a methodology that preserves memories as they are, so that they are not lost in generalization and so that recalling exact words or sequences of events remains possible, something computers are already good at. This effort led to Memory Networks [19], published at ICLR 2015 by Facebook AI Research.

This paper provides a basic framework to store, augment, and retrieve memories while seamlessly working with a Recurrent Neural Network architecture. The memory network consists of a memory m (an array of objects indexed by m_i) and four (potentially learned) components I, G, O, and R, as follows.

I (input feature map): converts the incoming input to the internal feature representation, either a sparse or dense feature vector like that from word2vec or GloVe.

G (generalization): updates old memories given the new input. They call this generalization because there is an opportunity for the network to compress and generalize its memories at this stage for some intended future use, the analogy mentioned above.

O (output feature map): produces a new output (in the feature representation space) given the new input and the current memory state. This component is responsible for performing inference. In a question answering system, this part selects the candidate sentences (which might contain the answer) from the story (conversation) so far.

R (response): converts the output into the desired response format, for example a textual response or an action. In the QA system described, this component finds the desired answer and then converts it from the feature representation to the actual word.

This is a fully supervised model, meaning all the candidate sentences from which the answer can be found are marked during the training phase; this setup can also be termed 'hard attention'.
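The following is a schematic sketch, not the paper's implementation, of how the four components could fit together around an explicit memory; the toy components simply store tokenized sentences and answer with the best word-overlap match.

```python
# Schematic sketch of the I, G, O, R components around an explicit memory array.
from typing import Callable, List

class MemoryNetwork:
    def __init__(self, I: Callable, G: Callable, O: Callable, R: Callable):
        self.memory: List = []                   # m: explicit array of stored objects
        self.I, self.G, self.O, self.R = I, G, O, R

    def read(self, sentence: str):
        feat = self.I(sentence)                  # I: map input to feature representation
        self.memory = self.G(self.memory, feat)  # G: update (generalize) memories

    def answer(self, question: str) -> str:
        q = self.I(question)
        evidence = self.O(self.memory, q)        # O: select supporting memories (inference)
        return self.R(evidence, q)               # R: produce the response

# Toy components: store sentences verbatim, answer with the best-matching memory.
net = MemoryNetwork(
    I=lambda s: s.lower().split(),
    G=lambda mem, feat: mem + [feat],
    O=lambda mem, q: max(mem, key=lambda m: len(set(m) & set(q)), default=[]),
    R=lambda ev, q: " ".join(ev),
)
net.read("Frodo took the ring")
net.read("Sam followed Frodo")
print(net.answer("Who took the ring"))           # -> "frodo took the ring"
```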

The authors tested the QA system on various works of literature, including Lord of the Rings.


Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]


CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART

There has been rapid progress since the release of the SQuAD dataset, and currently the best ensemble models are close to human-level accuracy in machine comprehension. This is due to various ingenious methods that solve some of the problems of previous approaches: out-of-vocabulary tokens were handled by using character embeddings, long-term dependencies within the context passage were addressed using self-attention, and many other techniques such as contextualized vectors, history of words, attention flow, etc. were introduced. In this section we look at some of the most important models that were fundamental to the progress of Question Answering.

Machine Comprehension Using Match-LSTM and Answer Pointer

In this paper [20], the authors propose an end-to-end neural architecture for the QA task. The architecture is based on match-LSTM [21], a model they previously proposed for textual entailment, and Pointer Net [22], a sequence-to-sequence model proposed by Vinyals et al. (2015) to constrain the output tokens to be from the input sequences. The model consists of an LSTM preprocessing layer, a match-LSTM layer, and an Answer Pointer layer.

We are given a piece of text, which we refer to as a passage, and a question related to the passage. The passage is represented by a matrix P of size d x P, where P is the length (number of tokens) of the passage and d is the dimensionality of the word embeddings. Similarly, the question is represented by a matrix Q of size d x Q, where Q is the length of the question. Our goal is to identify a subsequence of the passage as the answer to the question.


Figure 3-1 Match-LSTM Model Architecture [20]

LSTM Preprocessing Layer: They use a standard one-directional LSTM (Hochreiter & Schmidhuber, 1997) to process the passage and the question separately.

Match-LSTM Layer: They applied the match-LSTM model proposed for textual entailment to their machine comprehension problem by treating the question as a premise and the passage as a hypothesis. The match-LSTM sequentially goes through the passage. At position i of the passage, it first uses the standard word-by-word attention mechanism to obtain the attention weight vector α_i as follows:

G_i = tanh(W^q H^q + (W^p h_i^p + W^r h_{i-1}^r + b^p) ⊗ e_Q),  α_i = softmax(w^T G_i + b ⊗ e_Q)    (3-1)

where W^q, W^p, W^r, b^p, w, and b are parameters to be learned, H^q is the question representation, h_i^p is the passage representation at position i, h_{i-1}^r is the previous match-LSTM hidden state, and ⊗ e_Q repeats a vector across the question length.

Answer Pointer Layer: The Answer Pointer (Ans-Ptr) layer is motivated by the Pointer Net introduced by Vinyals et al. (2015) [22]. The boundary model produces only the start token and the end token of the answer, and all the tokens between these two in the original passage are then considered to be the answer.
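A small sketch of the boundary idea follows: given start and end probability distributions over the passage tokens, the span with the highest joint score (and start before end) is returned. The toy probabilities below are invented for illustration.

```python
# Minimal sketch: pick the best (start, end) answer span from two distributions.
import numpy as np

def best_span(p_start, p_end, max_len=15):
    best, best_score = (0, 0), -1.0
    for i, ps in enumerate(p_start):
        for j in range(i, min(i + max_len, len(p_end))):
            score = ps * p_end[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

tokens = "he was affected by the Panic of 1901".split()
p_start = np.array([.01, .01, .02, .02, .04, .55, .15, .20])  # toy start probabilities
p_end   = np.array([.01, .01, .01, .02, .03, .12, .10, .70])  # toy end probabilities
s, e = best_span(p_start, p_end)
print(" ".join(tokens[s:e + 1]))   # -> "Panic of 1901"
```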

When this paper was released back in November 2016, the match-LSTM method was the state of the art in Question Answering and was at the top of the leaderboard for the SQuAD dataset.

R-NET: Machine Reading Comprehension with Self-Matching Networks

In this model [23], the question and passage are first processed separately by a bidirectional recurrent network (Mikolov et al., 2010). They then match the question and passage with gated attention-based recurrent networks, obtaining a question-aware representation of the passage. On top of that, they apply self-matching attention to aggregate evidence from the whole passage and refine the passage representation, which is then fed into the output layer to predict the boundary of the answer span.

Question and Passage Encoding: First, the words are converted to their respective word-level and character-level embeddings. The character-level embeddings are generated by taking the final hidden states of a bi-directional recurrent neural network (RNN) applied to the embeddings of the characters in the token. Such character-level embeddings have been shown to help with out-of-vocabulary (OOV) tokens.


They then use a bi-directional RNN to produce new representations of all the words in the question and the passage, respectively.

Figure 3-2 The task of Question Answering [23]

Gated Attention-Based Recurrent Networks: They use a variant of attention-based recurrent networks with an additional gate to determine the importance of information in the passage with regard to a question. Different from the gates in an LSTM or GRU, this additional gate is based on the current passage word and its attention-pooled vector of the question, which focuses on the relation between the question and the current passage word. The gate effectively models the phenomenon that only parts of the passage are relevant to the question in reading comprehension and question answering, and the resulting gated representation is utilized in subsequent calculations.
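A minimal sketch of this extra gate, with random placeholder weights, is shown below: the passage word vector and its attention-pooled question vector are concatenated, and a sigmoid gate rescales the result before it enters the recurrent cell.

```python
# Minimal sketch of the gated input used in gated attention-based recurrent networks.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_input(u_t, c_t, W_g):
    v = np.concatenate([u_t, c_t])   # [passage word ; attention-pooled question vector]
    g = sigmoid(W_g @ v)             # gate depends on the word and the question info
    return g * v                     # element-wise rescaling fed to the RNN cell

d = 4
rng = np.random.default_rng(0)
x_in = gated_input(rng.normal(size=d), rng.normal(size=d),
                   rng.normal(size=(2 * d, 2 * d)))
```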


Self-Matching Attention: The previous step generates a question-aware passage representation that highlights the important parts of the passage. One problem with such a representation is that it has very limited knowledge of context: an answer candidate is often oblivious to important cues in the passage outside its surrounding window. To address this problem, the authors propose directly matching the question-aware passage representation against itself. This dynamically collects evidence from the whole passage for each word in the passage and encodes the evidence relevant to the current passage word, together with its matching question information, into the passage representation.

Output Layer: They use the same method as Wang & Jiang (2016b) and use pointer networks (Vinyals et al., 2015) to predict the start and end positions of the answer. In addition, they use attention-pooling over the question representation to generate the initial hidden vector for the pointer network [23].

When the R-NET model first appeared on the leaderboard in March 2017, it was at the top with a 72.3 Exact Match and an 80.7 F1 score.

Bi-Directional Attention Flow (BiDAF) for Machine Comprehension

BiDAF [24] is a hierarchical multi-stage architecture for modeling the

representations of the context paragraph at different levels of granularity BIDAF

includes character-level word-level and contextual embeddings and uses bi-directional

attention flow to obtain a query-aware context representation Their attention layer is not

used to summarize the context paragraph into a fixed-size vector Instead the attention

is computed for every time step and the attended vector at each time step along with

the representations from previous layers can flow through to the subsequent modeling

layer This reduces the information loss caused by early summarization [24]


Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]

Their machine comprehension model is a hierarchical multi-stage process and

consists of six layers:

1 Character Embedding Layer maps each word to a vector space using character-level CNNs

2 Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model

3 Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embedding of the words. These first three layers are applied to both the query and the context. They use an LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words, placing an LSTM in each direction and concatenating the outputs of the two LSTMs.

4 Attention Flow Layer couples the query and context vectors and produces a set of query aware feature vectors for each word in the context

5 Modeling Layer employs a Recurrent Neural Network to scan the context. The input to the modeling layer is G, which encodes the query-aware representations of the context words. The output of the modeling layer captures the interaction among the context words conditioned on the query. They use two layers of bi-directional LSTM with an output size of d for each direction; hence a matrix is obtained, which is passed on to the output layer to predict the answer.

6 Output Layer provides an answer to the query. They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices under the predicted distributions, averaged over all examples [25] (a short code sketch of this loss follows the list).
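As referenced in layer 6, the output-layer loss can be sketched in PyTorch as follows; the tensors are random placeholders standing in for the model's start and end scores.

```python
# Minimal sketch: negative log probability of the true start and end indices.
import torch
import torch.nn.functional as F

batch, ctx_len = 4, 50
start_logits = torch.randn(batch, ctx_len, requires_grad=True)  # per-token start scores
end_logits = torch.randn(batch, ctx_len, requires_grad=True)    # per-token end scores
true_start = torch.randint(0, ctx_len, (batch,))
true_end = torch.randint(0, ctx_len, (batch,))

# cross_entropy takes the negative log softmax probability of the true index,
# averaged over the batch; summing the two terms gives the span loss.
loss = F.cross_entropy(start_logits, true_start) + F.cross_entropy(end_logits, true_end)
loss.backward()   # gradients would flow back through the whole model during training
```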

In a further variation of their above work, they add a self-attention layer after the bi-attention layer to further improve the results; the architecture of that model is described in [25].


Summary

In this chapter we reviewed the methods that are fundamental to the state of the

art in Machine Comprehension and for the task of Question Answering We have

reached closed to human level accuracy and this is due to incremental developments

over previous models As we saw Out of Vocabulary (OOV) tokens were handled by

using Character embedding Long term dependency within context passage were

solved using self-attention and many other techniques such as Contextualized vectors

Attention Flow etc were employed to get better results In the next chapter we will see

how we can build on these models and develop further


CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING

Although the advancements in Question Answering systems since the release of the SQuAD dataset have been impressive and the results are getting close to human-level accuracy, these are far from fool-proof systems. The models still make mistakes that would be obvious to a human. For example:

Passage: The Panthers used the San Jose State practice facility and stayed at the San Jose Marriott. The Broncos practiced at Florida State Facility and stayed at the Santa Clara Marriott.

Question: At what university's facility did the Panthers practice?

Actual Answer: San Jose State

Predicted Answer: Florida State Facility

To find out what was leading to the wrong predictions, we wanted to see the attention weights associated with such an example. We plotted the passage-question heat map, a 2D matrix where the intensity of each cell signifies the similarity between a passage word and a question word. For the above example we found that while certain words of the question are given high weight, other parts are not: the words 'At', 'facility', and 'practice' receive high attention, but 'Panthers' does not. If it had received high attention, the system would have predicted 'San Jose State' as the right answer. To solve this issue, we analyzed the base BiDAF model and proposed adding two things:

1. Bi-attention and self-attention over the query.

2. A second level of attention over the outputs of (bi-attention + self-attention) from both the context and the query.


Multi-Attention BiDAF Model

Our comprehension model is a hierarchical multi-stage process and consists of

the following layers

1. Embedding: Just as in all other models, we embed words using pretrained word vectors. We also embed the characters in each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to get character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training.

2. Pre-Process: A shared bi-directional GRU (Cho et al., 2014) is used to map the question and passage embeddings to context-aware embeddings.

3. Context Attention: The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al., 2016) is used to build a query-aware context representation. Let h_i be the vector for context word i, q_j be the vector for question word j, and n_q and n_c be the lengths of the question and context respectively. We compute attention between context word i and question word j as

a_ij = w1 · h_i + w2 · q_j + w3 · (h_i ⊙ q_j)    (4-1)

where w1, w2, and w3 are learned vectors and ⊙ is element-wise multiplication. We then compute an attended vector c_i for each context token as

p_ij = exp(a_ij) / Σ_j exp(a_ij),   c_i = Σ_j p_ij q_j    (4-2)

We also compute a query-to-context vector q_c:

m_i = max_j a_ij,   p_i = exp(m_i) / Σ_i exp(m_i),   q_c = Σ_i p_i h_i    (4-3)

The final vector computed for each token is built by concatenating h_i and c_i along with their element-wise interaction terms. In our model we subsequently pass the result through a linear layer with ReLU activations (a code sketch of this attention step follows this list).

4. Context Self-Attention: Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU. Then we apply the same attention mechanism, only now between the passage and itself. In this case we do not use query-to-context attention, and we set a_ij = -inf if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so its output is additionally summed with the input.

5. Query Attention: For this part we proceed the same way as in the context attention layer, but calculate the weighted sum of the context words for each query word; thus the result has length equal to the number of query words. We then calculate context-to-query attention, analogous to the query-to-context attention in the context attention layer.

Figure 4-1 The modified BiDAF model with multilevel attention

6. Query Self-Attention: This is done the same way as the context self-attention layer, but on the output of the query attention layer.


7. Context-Query Bi-Attention + Self-Attention: The outputs of the context self-attention and query self-attention layers are taken as input, and the same process of bi-attention and self-attention is applied to these inputs.

8. Prediction: In the last layer of our model, a bidirectional GRU is applied, followed by a linear layer that computes answer start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log likelihood of selecting the correct start and end tokens.
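As referenced in the context attention layer above, the following NumPy sketch illustrates Equations 4-1 to 4-3 with random placeholder embeddings; the final concatenation shown is one plausible combination of the attended vectors, not necessarily the exact one used in our model.

```python
# Sketch of the bi-directional attention step (Eqs. 4-1 to 4-3) with toy embeddings.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_c, n_q, d = 7, 4, 16
H = rng.normal(size=(n_c, d))                 # context word vectors h_i
Q = rng.normal(size=(n_q, d))                 # question word vectors q_j
w1, w2, w3 = rng.normal(size=(3, d))          # learned vectors (placeholders here)

# Eq. 4-1: a_ij = w1.h_i + w2.q_j + w3.(h_i * q_j)
A = (H @ w1)[:, None] + (Q @ w2)[None, :] + (H * w3) @ Q.T

C = softmax(A, axis=1) @ Q                    # Eq. 4-2: attended vector c_i per context token
q_c = softmax(A.max(axis=1)) @ H              # Eq. 4-3: query-to-context vector

# One plausible query-aware representation per context token (assumption).
G = np.concatenate([H, C, H * C, H * q_c], axis=1)
```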

Having carried out this modification, we were able to solve the wrong example we started with: the multilevel attention model gives the correct output, "San Jose State". We also achieved slightly better scores than the original model, with an F1 score of 85.44 on the SQuAD dev set.

Chatbot Design Using a QA System

Designing a perfect chatbot that passes the Turing test is a fundamental goal of Artificial Intelligence. Although we are many orders of magnitude away from achieving such a goal, domain-specific tasks can be solved with chatbots built from current technology. We had started with a similar objective in mind, i.e., to design a domain-specific chatbot and then generalize to other areas once the first domain-specific objective is achieved robustly. This led us to the fundamental problem of Machine Comprehension and subsequently to the task of Question Answering. Having achieved some degree of success with QA systems, we looked back at whether we could apply our newly acquired knowledge to the task of designing chatbots.

The chatbots built with today's technologies mostly rely on handcrafted techniques such as template matching, which requires anticipating all the possible ways a user may articulate his requirements and a conversation may unfold. This requires many man-hours for designing a domain-specific system and is still very error-prone. In this section we propose a general chatbot design that would make designing a domain-specific chatbot easy and robust at the same time.

Every domain-specific chatbot needs to obtain a set of information from the user and show some results based on the user-specific information obtained. Traditional chatbots use template matching and keyword lookup to determine whether the user has provided the required information. Our idea is to use the Question Answering system in the backend to extract the required information from whatever the user has typed up to this point of the conversation. The information to be extracted can be posed in the form of a set of questions, and the answers obtained from those questions can be used as the parameters to supply the relevant information to the user.

We chose flight reservation as our chatbot domain. Our goal was to extract the information required from the user to be able to show him the available flights as per his requirements.

Figure 4-2 Flight reservation chatbot's chat window


For a flight reservation task, the booking agent needs to know at minimum the origin city, destination city, and date of travel to be able to show the available flights. Optional information includes the number of tickets, the passenger's name, one-way or round trip, etc.

A minimal conversation with the user through the chat window would be as shown above. We had a platform called OneTask on which we wanted to implement our chatbot. The chat interface within the OneTask system looks as follows.

Figure 4-3 Chatbot within OneTask system

The working of the chat system is as follows:

1. Initiation: The user opens the chat window, which starts a session with the chatbot. The chatbot asks 'How may I help you?'

2. User Reply: The user may reply with none of the information required for flight booking, or may reply with several pieces of information in the same message.

3. User Reply Parsing: The conversation up to this point is treated as a passage, and the internal questions are run on this passage:

Where do you want to go?

From where do you want to leave?

When do you want to depart?

4. Parsed Responses from the QA Model: After running the questions on the QA system with the given conversation, answers are obtained for all of them along with their corresponding confidence values. Since answers will be returned even if the required question has not actually been answered up to this point in the conversation, the confidence values play a crucial role in determining the validity of an answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly; a confidence of 2-10 signifies that it may have been answered, but the chatbot should verify with the user for correctness; and any answer with confidence below 2 is discarded (a schematic sketch of this loop follows the list).

5. Asking Remaining Questions Iteratively: After the parsing, the chatbot checks whether any of the required questions are still unanswered. If so, it asks the remaining question, and steps 3-4 are carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request.
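A schematic sketch of this slot-filling loop is shown below; `qa_model` is a placeholder for the QA system, and the slot names and confidence thresholds mirror the description above but are otherwise assumptions.

```python
# Schematic sketch of the chatbot's QA-driven slot filling.
REQUIRED = {
    "destination": "Where do you want to go?",
    "origin": "From where do you want to leave?",
    "date": "When do you want to depart?",
}

def fill_slots(conversation: str, qa_model, slots: dict) -> dict:
    """Run the internal questions over the conversation-so-far as the passage."""
    for slot, question in REQUIRED.items():
        if slot in slots:
            continue
        answer, confidence = qa_model(question, conversation)
        if confidence > 10:                     # assumed: confidently answered
            slots[slot] = answer
        elif confidence > 2:                    # assumed: ask the user to confirm
            slots[slot + "_unconfirmed"] = answer
    return slots

def next_bot_question(slots: dict):
    missing = [q for s, q in REQUIRED.items() if s not in slots]
    return missing[0] if missing else None      # None -> all slots filled, show flights
```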

Figure 4-4 The Flow diagram of the Flight booking Chatbot system

Online QA System and Attention Visualization

To be able to test various examples, we set up an online demo of the BiDAF [24] model. One can either choose from the available examples in the drop-down menu or paste in their own passage and questions. While this is a useful and interesting way to test the model in a user-friendly manner, we created this system primarily to be able to focus on the wrong samples.


Figure 4-5 QA system interface with attention highlight over candidate answers

The candidate answers are shown with blue highlights according to their confidence values: the higher the confidence, the darker the highlight. The candidate with the highest confidence value is chosen as the predicted answer. We developed the system to show the attention spread over the candidate answers in order to understand what needs to be done to improve the system. This led us to realize the importance of including the query attention part as well as multilevel attention in the BiDAF model, as described in the first section of this chapter.
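A small sketch of how such confidence-weighted highlighting could be rendered is shown below; the markup and opacity scaling are assumptions for illustration, not the code of our interface.

```python
# Sketch: darker highlight for higher-confidence candidate answers.
def highlight(passage: str, candidates):
    """candidates: list of (answer_text, confidence) pairs with confidence in [0, 1]."""
    html = passage
    for answer, conf in sorted(candidates, key=lambda c: c[1]):
        span = (f'<span style="background: rgba(0, 0, 255, {0.15 + 0.6 * conf:.2f})">'
                f"{answer}</span>")
        html = html.replace(answer, span)
    return html

print(highlight("Tesla blamed the Panic of 1901 for the loss.",
                [("Panic of 1901", 0.82), ("the loss", 0.10)]))
```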


CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION

Future Directions

Having done a thorough analysis of the current methods and the state of the art in Machine Comprehension and the QA task, and having developed systems on top of them, we have gained a strong sense of what needs to be done to further improve QA models. After observing the wrong samples, we can see that the system is still unable to encode meaning and is picking answers based on statistical patterns of occurrence of answers in the training examples.

Judging from the state-of-the-art models and the ongoing research literature, it is easy to conclude that more features need to be embedded to encode the meaning of words, phrases, and sentences. The paper Reinforced Mnemonic Reader for Machine Comprehension [6] encoded POS and NER tags of words along with their word and character embeddings, which gave them better results.

Figure 5-1 An English language semantic parse tree [26]


We have developed a method to encode the syntax parse tree of a sentence that encodes not only the POS tags but also the relation of each word within its phrase and the relation of each phrase within the whole sentence, in a hierarchical manner.
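As an illustration of the kind of hierarchical structure involved (using NLTK, cf. [26]), the toy grammar below parses a short sentence into a phrase-structure tree; it shows the structure we aim to encode, not our encoding method itself.

```python
# Toy example: building a phrase-structure parse tree with NLTK.
import nltk

grammar = nltk.CFG.fromstring("""
    S   -> NP VP
    NP  -> Det N
    VP  -> V NP
    Det -> 'the'
    N   -> 'panic' | 'crash'
    V   -> 'caused'
""")
parser = nltk.ChartParser(grammar)
for tree in parser.parse("the panic caused the crash".split()):
    print(tree)  # (S (NP (Det the) (N panic)) (VP (V caused) (NP (Det the) (N crash))))
```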

Finally, data augmentation is another way to get better results. One definite way to reduce the errors would be to include in the training data samples similar to those the system falters on in the dev set; one could generate examples similar to the failure cases and include them in the training set to obtain better predictions. Another approach would be to train on similar, larger datasets. Our models were trained on the SQuAD dataset, but there are other datasets, such as TriviaQA, that pose a similar question answering task. We could augment the training set with TriviaQA alongside SQuAD to obtain a more robust system that generalizes better and thus has higher accuracy when predicting answer spans.

Conclusion

In this work we explored the most fundamental techniques that have shaped the current state of the art. We then proposed a minor architectural improvement over an existing model. Furthermore, we developed two applications that use the base model: first, we described how a chatbot application can be built using the QA system, and second, we created a web interface where the model can be run on any passage and question. This interface also shows the attention spread over the candidate answers. While our effort to push the state of the art forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will pay high dividends for society at large.


LIST OF REFERENCES

[1] Danqi Chen. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U

[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.

[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension." arXiv preprint arXiv:1705.03551 (2017).

[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ questions for machine comprehension of text." arXiv preprint arXiv:1606.05250 (2016).

[5] "The Stanford Question Answering Dataset." 2018. rajpurkar.github.io. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/

[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced mnemonic reader for machine comprehension." CoRR abs/1705.02798 (2017).

[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.

[8] Mehrotra, Neetesh. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai/

[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.

[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/

[11] Williams, Ronald J., and David Zipser. "A learning algorithm for continually running fully recurrent neural networks." Neural Computation 1, no. 2 (1989): 270-280.

[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9, no. 8 (1997): 1735-1780.

[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv preprint arXiv:1412.3555 (2014).

[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic regularities in continuous space word representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.

[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.

[16] "Word2Vec Tutorial - The Skip-Gram Model." 2016. McCormickML.com. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

[17] "[Paper Introduction] Bilingual Word Representations with Monolingual Quality in Mind." 2015. SlideShare.net. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind

[18] "Attention Mechanism." 2016. blog.heuritech.com. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism/

[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. 2014. "Memory Networks." arXiv preprint arXiv:1410.3916v11. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916

[20] Wang, Shuohang, and Jing Jiang. "Machine comprehension using match-LSTM and answer pointer." arXiv preprint arXiv:1608.07905 (2016).

[21] Wang, Shuohang, and Jing Jiang. "Learning natural language inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).

[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.

[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated self-matching networks for reading comprehension and question answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.

[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional attention flow for machine comprehension." arXiv preprint arXiv:1611.01603 (2016).

[25] Clark, Christopher, and Matt Gardner. "Simple and effective multi-paragraph reading comprehension." arXiv preprint arXiv:1710.10723 (2017).

[26] "8. Analyzing Sentence Structure." 2018. nltk.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html


BIOGRAPHICAL SKETCH

Purnendu Mukherjee grew up in Kolkata, India, and had a strong interest in computers from an early age. After high school, he completed his B.Sc. in computer science at Ramakrishna Mission Residential College, Narendrapur, followed by an M.Sc. in computer science at St. Xavier's College, Kolkata. He had a strong intuition about and interest in human-like learning systems and wanted to work in this area. He started working at TCS Innovation Labs, Pune, on the application of Natural Language Processing in educational applications. As he was deeply passionate about learning systems that mimic the human brain and learn like a human child does, he became increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science at the University of Florida, Gainesville. His academic interests have been focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.

Page 2: QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISMufdcimages.uflib.ufl.edu/UF/E0/05/23/73/00001/MUKHERJEE_P.pdf · QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM By PURNENDU

copy 2018 Purnendu Mukherjee

To my family friends and teachers

4

ACKNOWLEDGMENTS

I would like to thank my thesis advisor Professor Andy Li who has been a

constant source of inspiration and encouragement for me to pursue my ideas Also I

am thankful to him for providing all the necessary resources to succeed for my research

efforts

I am deeply thankful to Professor Kristy Boyer for her guidance and mentorship

as she motivated me to pursue my research interests in Natural Language Processing

and Deep Learning

I would like to express my heartfelt gratitude to Professor Jose Principe who had

taught me the very fundamentals of Learning systems and formally introduced me to

Deep Learning

I wish to extend my thanks to all the member of CBL Lab and especially to the

NLP group for their insightful remarks and overall support

I am grateful to my close friend roommate and lab partner Yash Sinha who had

supported and helped me throughout with all aspects of my thesis work

Finally I must thank my parents for their unwavering support and dedication

towards my well-being and growth

5

TABLE OF CONTENTS page

ACKNOWLEDGMENTS 4

LIST OF FIGURES 6

LIST OF ABBREVIATIONS 7

ABSTRACT 8

CHAPTER

1 INTRODUCTION 9

2 THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM 13

Neural Networks 13 Convolutional Neural Network 14 Recurrent Neural Networks (RNN) 15 Word Embedding 16 Attention Mechanism 17 Memory Networks 18

3 LITERATURE REVIEW AND STATE OF THE ART 21

Machine Comprehension Using Match-LSTM and Answer Pointer 21 R-NET Matching Reading Comprehension with Self-Matching Networks 23 Bi-Directional Attention Flow (BiDAF) for Machine Comprehension 25 Summary 28

4 MULTI-ATTENTION QUESTION ANSWERING 29

Multi-Attention BiDAF Model 30 Chatbot Design Using a QA System 32 Online QA System and Attention Visualization 35

5 FUTURE DIRECTIONS AND CONCLUSION 37

Future Directions 37 Conclusion 38

LIST OF REFERENCES 39

BIOGRAPHICAL SKETCH 42

6

LIST OF FIGURES

Figure page 1-1 The task of Question Answering 10

2-1 Simple and Deep Learning Neural Networks 13

2-2 Convolutional Neural Network Architecture 14

2-3 Semantic relation between words in vector space 17

2-4 Attention Mechanism flow 18

2-5 QA example on Lord of the Rings using Memory Networks 20

3-1 Match-LSTM Model Architecture 22

3-2 The task of Question Answering 24

3-3 Bi-Directional Attention Flow Model Architecture 26

4-1 The modified BiDAF model with multilevel attention 31

4-2 Flight reservation chatbotrsquos chat window 33

4-3 Chatbot within OneTask system 34

4-4 The Flow diagram of the Flight booking Chatbot system 35

4-5 QA system interface with attention highlight over candidate answers 36

5-1 An English language semantic parse tree 37

7

LIST OF ABBREVIATIONS

BiDAF Bi-Directional Attention Flow

CNN Convolutional Neural Network

GRU Gated Recurrent Units

LSTM Long Short Term Memory

NLP Natural Language Processing

NLU Natural Language Understanding

RNN Recurrent Neural Network

8

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the

Requirements for the Degree of Master of Science

QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM

By

Purnendu Mukherjee

May 2018

Chair Xiaolin Li Major Computer Science

Question Answering(QA) systems have had rapid growth since the last 3 years

and are close to reaching human level accuracy One of the fundamental reason for this

growth has been the use of attention mechanism along with other methods of Deep

Learning But just as with other Deep Learning methods some of the failure cases are

so obvious that it convinces us that there is a lot to improve upon In this work we first

did a literature review of the State of the Art and fundamental models in QA systems

Next we introduce an architecture which has shown improvement in the targeted area

We then introduce a general method to enable easy design of domain-specific Chatbot

applications and present a proof of the concept with the same method Finally we

present an easy to use Question Answering interface with attention visualization on the

passage We also propose a method to improve the current state of the art as a part of

our ongoing work

9

CHAPTER 1 INTRODUCTION

Teaching machines to read and understand human natural language is our

central and long standing goal for Natural Language Processing and Artificial

Intelligence in general The positive side of machines being able to reason and

comprehend human language could be enormous We are beginning to see how

commercial systems like Alexa Siri Google Now etc are being widely used as Speech

Recognition improved to human levels While speech recognition systems can

transcribe speech to text comprehension of that transcribed text is another task which

is currently a major focus for both academia and industry because of its possible

applications Moreover all the text information available throughout the internet is a

major reason why machine comprehension of text is such an important task

With the growth of Deep Learning methods in the last few years the field of

Machine Comprehension and Natural Language Processing(NLP) in general has

experienced a revolution While the traditional methods and practices are still prevalent

and forms the basis of our deep understanding of languages Deep Learning methods

have surpassed all traditional NLP and Machine Learning methods by a significant

margin and are currently driving the growth of the field

To be able to build a system that can understand human text we need to first

ask ourselves how can we evaluate the machinersquos comprehension ability We had

initially set a goal to build a Chabot for a specific domain and generalize to other topics

as we go ahead While developing the system we found out the necessity for reading

comprehension and how to measure it We finally found the answer with Questions

Answering systems

10

Just like how we human beings are tested for our ability of language

understanding with questions we should ask machines similar questions about what it

has just read The performance of the system on such question answering task will let

us evaluate how much the machine is able to reason about what it just read [1]

Reading comprehension has been a topic of Natural Language Understanding since the

1970s In 1977 Wendy Lehnert said in his doctoral thesis ndash ldquoOnly when we can ask a

program to answer questions about what it reads will we be able to begin to access that

programrsquos comprehensionrdquo [2]

Figure 1-1 The task of Question Answering

To achieve this task the NLP community has developed various datasets such as

CNN Daily Mail WebQuestions SQuAD TriviaQA [3] etc For our purpose we chose

SQuAD which stands for Stanford Question Answering Dataset [4] SQuAD consists of

questions posed by crowdworkers on a set of Wikipedia articles where the answer to

every question is a segment of text or span from the corresponding reading passage

With 100000+ question-answer pairs on 500+ articles SQuAD is significantly larger

than previous reading comprehension datasets [5]

An example from the SQuAD dataset is as follows

11

Passage Tesla later approached Morgan to ask for more funds to build a more

powerful transmitter When asked where all the money had gone Tesla responded by

saying that he was affected by the Panic of 1901 which he (Morgan) had caused

Morgan was shocked by the reminder of his part in the stock market crash and by

Teslarsquos breach of contract by asking for more funds Tesla wrote another plea to

Morgan but it was also fruitless Morgan still owed Tesla money on the original

agreement and Tesla had been facing foreclosure even before construction of the

tower began

Question On what did Tesla blame for the loss of the initial money

Answer Panic of 1901

As we started exploring the QA task we faced several challenges Some of them

we could solve with the help of other research and some of the challenges still exist in

the domain

bull Out of Vocabulary words

bull Multi-sentence reasoning may be required

bull There may exist several candidate answers

bull Optimizing the Exact Match (EM) metric may fail when the answer boundary is fuzzy or too long such as the answer of the ldquowhyrdquo query [6]

bull ldquoOne-hoprdquo prediction may fail to fully understand the query [6]

bull Fail to fully capture the long-distance contextual interaction between parts of the context by only using LSTMGRU

bull Current models are unable to capture semantics of the passage

In the upcoming chapters we will first briefly review the basics necessary for

understanding the models then we will delve deep into the fundamental models that

12

have shaped the current State of the Art models then we will discuss our contribution in

terms of architecture and applications and finally conclude with future directions


CHAPTER 2 THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM

To build a question answering system one needs to be familiar with fundamental deep learning models such as Recurrent Neural Networks (RNN), Long Short Term Memory (LSTM), etc. In this chapter we give an overview of these techniques and see how they all connect to building a question answering system.

Neural Networks

What makes Deep Learning so intriguing is that it closely resembles the working of the mammalian brain, or at least draws inspiration from it. The same can be said for Artificial Neural Networks [7], which consist of a system of interconnected units called 'neurons' that take input from similar units and produce a single output.

Figure 2-1 Simple and Deep Learning Neural Networks [8]

The connections from one neuron to another are weighted based on the input data, which enables the network to tune itself to produce a certain output based on the input. This is the learning process, which is achieved through backpropagation: a method of propagating the error from the output layer back to the previous layers.
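As a purely illustrative sketch (not code from this thesis), the forward pass and the backpropagation of the output error for a single hidden layer can be written in a few lines of NumPy; the sizes and learning rate are arbitrary assumptions:

import numpy as np

# Illustrative sketch: one hidden layer trained by backpropagation on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))                     # 8 examples, 4 input features
y = rng.integers(0, 2, size=(8, 1)).astype(float)

W1, W2 = rng.normal(size=(4, 16)), rng.normal(size=(16, 1))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for step in range(100):
    h = sigmoid(X @ W1)                         # forward pass through the hidden layer
    p = sigmoid(h @ W2)                         # network output
    grad_p = p - y                              # error at the output layer
    grad_h = grad_p @ W2.T * h * (1 - h)        # error propagated back to the hidden layer
    W2 -= 0.1 * h.T @ grad_p                    # weight updates
    W1 -= 0.1 * X.T @ grad_h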


Convolutional Neural Network

The first wave of deep learning's success was brought by Convolutional Neural Networks (CNN) [9], when this was the technique used by the winning team of the ImageNet competition in 2012. CNNs are deep artificial neural networks (ANN) that can be used to classify images, cluster them by similarity, and perform object recognition within scenes. They can be used to detect and identify faces, people, signs, or any other visual data.

Figure 2-2 Convolutional Neural Network Architecture [10]

There are primarily four operations in a standard CNN model (as shown in the figure above):

1 Convolution - The primary purpose of convolution in the ConvNet above is to extract features from the input image. The spatial relationships between pixels, i.e. the image features, are preserved and learned by the convolution using small squares of input data.

2 Non-Linearity (ReLU) - Rectified Linear Unit (ReLU) is a non-linear operation that carries out an element-wise operation on each pixel. This operation replaces the negative pixel values in the feature map by zero.

3 Pooling or Sub-Sampling - Spatial pooling reduces the dimensionality of each feature map but retains the most important information. For max pooling, the largest value in the square window is taken and the rest are dropped. Other types of pooling are Average, Sum, etc.

4 Classification (Fully Connected Layer) - The fully connected layer is a traditional Multi-Layer Perceptron, as described before, that uses a softmax activation function in the output layer. The high-level features of the image are encoded by the convolutional and pooling layers, and are then fed to the fully connected layer, which uses these features to classify the input image into various classes based on the training dataset [10].


When a new image is fed into the CNN model, all the above-mentioned steps are carried out (forward propagation) and a probability distribution is obtained over the set of output classes. With a large enough training dataset, the network will learn and generalize well enough to classify new images into their correct classes.
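A minimal PyTorch sketch of the four operations above is given below; the channel counts, kernel size and number of output classes are illustrative assumptions, not values used in any particular model:

import torch
import torch.nn as nn

# Sketch of the four standard CNN operations; layer sizes are illustrative assumptions.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 1. convolution extracts local features
    nn.ReLU(),                                   # 2. element-wise non-linearity
    nn.MaxPool2d(2),                             # 3. spatial pooling (keep max in each 2x2 window)
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10),                 # 4. fully connected classifier
)

image = torch.randn(1, 3, 32, 32)                # one 32x32 RGB image
probs = torch.softmax(model(image), dim=1)       # probability distribution over 10 classes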

Recurrent Neural Networks (RNN)

Whenever we want to predict or encode sequential data, an RNN [11] is our go-to method. RNNs perform the same task for every element of a sequence, where the output for each element depends on previous computations, hence the recurrence. In practice, RNNs are unable to retain long-term dependencies and can look back only a few steps because of the vanishing gradient problem:

$h_t = f(W^{(hh)} h_{t-1} + W^{(hx)} x_t)$ (2-1)

A solution to the dependency problem is to use gated cells such as the LSTM [12] or GRU [13]. These cells pass on important information to the next cells while ignoring non-important information. The gated units in a GRU block are:

• Update Gate - Computed based on the current input and hidden state:

$z_t = \sigma(W^{(z)} x_t + U^{(z)} h_{t-1})$ (2-2)

• Reset Gate - Calculated similarly but with different weights:

$r_t = \sigma(W^{(r)} x_t + U^{(r)} h_{t-1})$ (2-3)

• New memory content:

$\tilde{h}_t = \tanh(W x_t + r_t \odot U h_{t-1})$ (2-4)

If the reset gate unit is ~0, then the previous memory is ignored and only the new information is kept.

The final memory at the current time step combines the previous and current time steps:

$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$ (2-5)

While the GRU is computationally efficient, the LSTM is a more general case with three gates, as follows:

• Input Gate - What new information to add to the current cell state

• Forget Gate - How much information from previous states should be kept

• Output Gate - How much information should be sent to the next states

Just like in the GRU, the current cell state is a sum of the previous cell state (weighted by the forget gate) and the new candidate value (weighted by the input gate). Based on the cell state, the output gate regulates the final output.
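A minimal NumPy sketch of one GRU step following equations 2-2 through 2-5 is shown below; the weight matrices in params are assumed to be supplied with compatible shapes:

import numpy as np

def gru_step(x, h_prev, params):
    # One GRU step following equations 2-2 to 2-5; `params` holds assumed weight matrices.
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(params["Wz"] @ x + params["Uz"] @ h_prev)              # update gate (2-2)
    r = sigmoid(params["Wr"] @ x + params["Ur"] @ h_prev)              # reset gate (2-3)
    h_tilde = np.tanh(params["W"] @ x + params["U"] @ (r * h_prev))    # new memory content (2-4)
    return z * h_prev + (1.0 - z) * h_tilde                            # final memory (2-5)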

Word Embedding

Computations and gradients can be applied to numbers, not to words or letters. So we first need to convert words into a corresponding numerical representation before feeding them into a deep learning model. In general there are two types of word embedding: frequency based (which comprises count vectors, tf-idf and co-occurrence vectors) and prediction based. With frequency based embeddings the order of the words is not preserved, and they work as a bag-of-words model. With prediction based models, the order or locality of words is taken into consideration to generate the numerical representation of a word. Within the prediction based category there are two fundamental techniques, called Continuous Bag of Words (CBOW) and the Skip-Gram model, which form the basis for word2vec [14] and GloVe [15].

The basic intuition behind word2vec is that if two different words have very similar "contexts" (that is, the words that are likely to appear around them), then the model will produce similar vectors for those words. Conversely, if the two word vectors are similar, then the network will produce similar context predictions for those two words. For example, synonyms like "intelligent" and "smart" would have very similar contexts, and related words like "engine" and "transmission" would probably have similar contexts as well [16]. Plotting the word vectors learned by word2vec over a large corpus, we can find some very interesting relationships between words.
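The "similar contexts produce similar vectors" intuition can be checked directly with cosine similarity over pretrained vectors. The sketch below assumes a GloVe-style text file named glove.6B.100d.txt and that the queried words are present in it:

import numpy as np

# Sketch: cosine similarity between pretrained word vectors (GloVe text format assumed).
vectors = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        word, *values = line.split()
        vectors[word] = np.asarray(values, dtype=float)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors["intelligent"], vectors["smart"]))   # high: similar contexts
print(cosine(vectors["intelligent"], vectors["banana"]))  # low: unrelated contexts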

Figure 2-3 Semantic relation between words in vector space [17]

Attention Mechanism

We as humans pay attention to things that are important or relevant in a given context. For example, when asked a question about a passage, we try to find the part of the passage most relevant to the question and then reason from our understanding of that part of the passage. The same idea applies to the attention mechanism in Deep Learning: it is used to identify the specific parts of a given context to which the current question is relevant.

Formally put, the technique takes n arguments y_1, ..., y_n (in our case the passage word representations, say the hidden states h_i) and a question representation, say q. It returns a vector z which is supposed to be a "summary" of the y_i, focusing on information linked to the question q. More formally, it returns a weighted arithmetic mean of the y_i, where the weights are chosen according to the relevance of each y_i given the context c [18].
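A minimal NumPy sketch of this weighted-mean view of attention is given below; the dot-product relevance score is an illustrative choice, since many scoring functions are used in practice:

import numpy as np

def attend(Y, q):
    # Y: (n, d) passage word vectors y_1..y_n; q: (d,) question representation.
    scores = Y @ q                           # dot-product relevance of each y_i to q
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()        # softmax: weights sum to one
    return weights @ Y                       # z: weighted arithmetic mean of the y_i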

Figure 2-4 Attention Mechanism flow [18]

Memory Networks

While Convolutional Neural Networks and Recurrent Neural Networks do capture how we form our visual and sequential memories, their memory (encoded by hidden states and weights) is typically too small and is not compartmentalized enough to accurately remember facts from the past (knowledge is compressed into dense vectors) [19].

Deep Learning needed a methodology that preserves memories as they are, such that they are not lost in generalization, and recalling exact words or sequences of events remains possible, something computers are already good at. This effort led to Memory Networks [19], published at ICLR 2015 by Facebook AI Research.

This paper provides a basic framework to store, augment and retrieve memories while seamlessly working with a Recurrent Neural Network architecture. The memory network consists of a memory m (an array of objects indexed by m_i) and four (potentially learned) components I, G, O and R, as follows:

I (input feature map): converts the incoming input to the internal feature representation, either a sparse or dense feature vector like that from word2vec or GloVe.

G (generalization): updates old memories given the new input. The authors call this generalization as there is an opportunity for the network to compress and generalize its memories at this stage for some intended future use, the compression idea mentioned above.

O (output feature map): produces a new output (in the feature representation space) given the new input and the current memory state. This component is responsible for performing inference. In a question answering system this part selects the candidate sentences (which might contain the answer) from the story (conversation) so far.

R (response): converts the output into the desired response format, for example a textual response or an action. In the QA system described, this component finds the desired answer and then converts it from the feature representation to the actual word.

This is a fully supervised model, meaning all the candidate sentences from which the answer can be found are marked during the training phase; this can also be termed 'hard attention'.
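Structurally, the four components can be pictured as the skeleton below; the class and method names are illustrative assumptions and not the authors' code:

class MemoryNetwork:
    # Structural sketch of the four MemNN components; names and signatures are illustrative.
    def __init__(self):
        self.memory = []                      # m: an array of stored representations m_i

    def I(self, text):                        # input feature map: text -> feature vector
        raise NotImplementedError

    def G(self, feature):                     # generalization: store/update memories
        self.memory.append(feature)

    def O(self, feature):                     # output: retrieve supporting memories (inference)
        raise NotImplementedError

    def R(self, output_feature):              # response: map retrieved features to an answer word
        raise NotImplementedError

    def answer(self, question_text):
        q = self.I(question_text)
        return self.R(self.O(q))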

The authors tested the QA system on various literature, including Lord of the Rings.


Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]


CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART

There has been rapid progress since the release of the SQuAD dataset, and currently the best ensemble models are close to human level accuracy in machine comprehension. This is due to various ingenious methods which solve some of the problems of previous approaches: out-of-vocabulary tokens were handled by using character embeddings, long-term dependencies within the context passage were addressed using self-attention, and many other techniques such as contextualized vectors, history of words, attention flow, etc. were introduced. In this section we will look at some of the most important models that were fundamental to the progress of Question Answering.

Machine Comprehension Using Match-LSTM and Answer Pointer

In this paper [20] the authors propose an end-to-end neural architecture for the QA task. The architecture is based on match-LSTM [21], a model they previously proposed for textual entailment, and Pointer Net [22], a sequence-to-sequence model proposed by Vinyals et al. (2015) to constrain the output tokens to be from the input sequences. The model consists of an LSTM preprocessing layer, a match-LSTM layer and an Answer Pointer layer.

We are given a piece of text, which we refer to as a passage, and a question related to the passage. The passage is represented by a matrix $P \in \mathbb{R}^{d \times P}$, where P is the length (number of tokens) of the passage and d is the dimensionality of the word embeddings. Similarly, the question is represented by a matrix $Q \in \mathbb{R}^{d \times Q}$, where Q is the length of the question. Our goal is to identify a subsequence from the passage as the answer to the question.


Figure 3-1 Match-LSTM Model Architecture [20]

LSTM Preprocessing Layer: They use a standard one-directional LSTM (Hochreiter & Schmidhuber, 1997) to process the passage and the question separately, producing hidden representations $H^p = \overrightarrow{\text{LSTM}}(P)$ and $H^q = \overrightarrow{\text{LSTM}}(Q)$.

Match-LSTM Layer: They applied the match-LSTM model proposed for textual entailment to the machine comprehension problem by treating the question as a premise and the passage as a hypothesis. The match-LSTM sequentially goes through the passage. At position i of the passage, it first uses the standard word-by-word attention mechanism to obtain the attention weight vector $\vec{\alpha}_i$ as follows:

$\vec{G}_i = \tanh\!\left(W^q H^q + (W^p h_i^p + W^r h_{i-1}^r + b^p) \otimes e_Q\right), \qquad \vec{\alpha}_i = \text{softmax}\!\left(w^T \vec{G}_i + b \otimes e_Q\right)$ (3-1)

where $W^q$, $W^p$, $W^r$, $b^p$, $w$ and $b$ are parameters to be learned.

Answer Pointer Layer: The Answer Pointer (Ans-Ptr) layer is motivated by the Pointer Net introduced by Vinyals et al. (2015) [22]. The boundary model produces only the start token and the end token of the answer, and all the tokens between these two in the original passage are then considered to be the answer.
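A minimal sketch of how such a boundary prediction is decoded at test time is shown below: given start and end probability distributions over passage positions, we pick the pair (start, end) with start <= end that maximizes their product. The toy distributions here are placeholders:

import numpy as np

def best_span(p_start, p_end, max_len=15):
    # p_start, p_end: probabilities over passage positions for the answer boundary.
    best, best_score = (0, 0), 0.0
    for i, ps in enumerate(p_start):
        for j in range(i, min(i + max_len, len(p_end))):
            score = ps * p_end[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

tokens = "Tesla responded by saying he was affected by the Panic of 1901".split()
p_s = np.full(len(tokens), 0.01); p_s[9] = 0.8       # toy start distribution
p_e = np.full(len(tokens), 0.01); p_e[11] = 0.8      # toy end distribution
s, e = best_span(p_s, p_e)
print(" ".join(tokens[s:e + 1]))                     # -> "Panic of 1901"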

When this paper was released back in November 2016, the match-LSTM method was the state of the art in Question Answering and was at the top of the leaderboard for the SQuAD dataset.

R-NET Matching Reading Comprehension with Self-Matching Networks

In this model [23], the question and passage are first processed by a bidirectional recurrent network (Mikolov et al., 2010) separately. They then match the question and passage with gated attention-based recurrent networks, obtaining a question-aware representation of the passage. On top of that, they apply self-matching attention to aggregate evidence from the whole passage and refine the passage representation, which is then fed into the output layer to predict the boundary of the answer span.

Question and Passage Encoding: First the words are converted to their respective word-level embeddings and character-level embeddings. The character-level embeddings are generated by taking the final hidden states of a bi-directional recurrent neural network (RNN) applied to the embeddings of the characters in the token. Such character-level embeddings have been shown to be helpful in dealing with out-of-vocabulary (OOV) tokens. They then use a bi-directional RNN to produce new representations $u_t^Q$ and $u_t^P$ of all words in the question and passage respectively.
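A minimal PyTorch sketch of such character-derived word embeddings, taking the final hidden states of a bidirectional GRU over a word's characters, is shown below; the vocabulary size and dimensions are assumptions:

import torch
import torch.nn as nn

# Sketch: character-level word embedding from the final states of a bi-directional GRU.
char_emb = nn.Embedding(num_embeddings=128, embedding_dim=20)
char_gru = nn.GRU(input_size=20, hidden_size=50, bidirectional=True, batch_first=True)

word = "transmitter"
char_ids = torch.tensor([[min(ord(c), 127) for c in word]])   # (1, word_len)
_, h_n = char_gru(char_emb(char_ids))                         # h_n: (2, 1, 50), one per direction
word_char_vec = torch.cat([h_n[0], h_n[1]], dim=-1)           # (1, 100) character-derived embedding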

Figure 3-2 R-NET Model Architecture [23]

Gated Attention-Based Recurrent Networks: They use a variant of attention-based recurrent networks with an additional gate to determine the importance of information in the passage with regard to a question. Different from the gates in an LSTM or GRU, the additional gate is based on the current passage word and its attention-pooling vector of the question, which focuses on the relation between the question and the current passage word. The gate effectively models the phenomenon that only parts of the passage are relevant to the question in reading comprehension, and only the question-aware passage representation is utilized in subsequent calculations.
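A simplified sketch of one such gated step is shown below: the current passage encoding and the attention-pooled question vector are concatenated, scaled by a sigmoid gate, and fed to a recurrent cell. The dimensions and the use of a GRU cell are assumptions made for illustration:

import torch
import torch.nn as nn

d = 100                                           # illustrative hidden size
gate = nn.Linear(2 * d, 2 * d)                    # produces the additional gate
rnn_cell = nn.GRUCell(2 * d, d)

def gated_attention_step(u_p_t, c_t, v_prev):
    # u_p_t: current passage word encoding (1, d); c_t: attention-pooled question vector (1, d).
    x = torch.cat([u_p_t, c_t], dim=-1)           # [u_t, c_t]
    g = torch.sigmoid(gate(x))                    # gate based on the passage word and question pooling
    return rnn_cell(g * x, v_prev)                # question-aware passage state

v = gated_attention_step(torch.randn(1, d), torch.randn(1, d), torch.zeros(1, d))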


Self-Matching Attention: The previous step generates a question-aware passage representation that highlights the important parts of the passage. One problem with such a representation is that it has very limited knowledge of context: an answer candidate is often oblivious to important cues in the passage outside its surrounding window. To address this problem, the authors propose directly matching the question-aware passage representation against itself. It dynamically collects evidence from the whole passage for each word in the passage and encodes the evidence relevant to the current passage word, together with its matching question information, into the passage representation.

Output Layer: They use the same method as Wang & Jiang (2016b) and use pointer networks (Vinyals et al., 2015) to predict the start and end positions of the answer. In addition, they use attention-pooling over the question representation to generate the initial hidden vector for the pointer network [23].

When the R-NET model first appeared on the leaderboard in March 2017, it was at the top with a 72.3 Exact Match and 80.7 F1 score.

Bi-Directional Attention Flow (BiDAF) for Machine Comprehension

BiDAF [24] is a hierarchical multi-stage architecture for modeling the representations of the context paragraph at different levels of granularity. BiDAF includes character-level, word-level and contextual embeddings, and uses bi-directional attention flow to obtain a query-aware context representation. Their attention layer is not used to summarize the context paragraph into a fixed-size vector. Instead, the attention is computed for every time step, and the attended vector at each time step, along with the representations from previous layers, can flow through to the subsequent modeling layer. This reduces the information loss caused by early summarization [24].


Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]

Their machine comprehension model is a hierarchical multi-stage process and

consists of six layers

1 Character Embedding Layer: maps each word to a vector space using character-level CNNs.

2 Word Embedding Layer: maps each word to a vector space using a pre-trained word embedding model.

3 Contextual Embedding Layer: utilizes contextual cues from surrounding words to refine the embedding of the words. These first three layers are applied to both the query and the context. They use an LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words, placing an LSTM in both directions and concatenating the outputs of the two LSTMs.

4 Attention Flow Layer: couples the query and context vectors and produces a set of query-aware feature vectors for each word in the context.

5 Modeling Layer: employs a Recurrent Neural Network to scan the context. The input to the modeling layer is G, which encodes the query-aware representations of the context words. The output of the modeling layer captures the interaction among the context words conditioned on the query. They use two layers of bi-directional LSTM with an output size of d for each direction. Hence a matrix $M \in \mathbb{R}^{2d \times T}$ is obtained, which is passed on to the output layer to predict the answer.

6 Output Layer: provides an answer to the query. They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices under the predicted distributions, averaged over all examples [25] (a sketch of this loss is given below).
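A minimal PyTorch sketch of that output-layer loss, assuming start and end logits of shape (batch, passage_length), is:

import torch
import torch.nn.functional as F

def span_loss(start_logits, end_logits, true_start, true_end):
    # Negative log-likelihood of the true start and end indices, averaged over the batch.
    return F.cross_entropy(start_logits, true_start) + F.cross_entropy(end_logits, true_end)

loss = span_loss(torch.randn(4, 120), torch.randn(4, 120),
                 torch.randint(0, 120, (4,)), torch.randint(0, 120, (4,)))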

In a further variation of their above work, they add a self-attention layer after the bi-attention layer to further improve the results; the architecture of this model is described in [25].


Summary

In this chapter we reviewed the methods that are fundamental to the state of the art in Machine Comprehension and the task of Question Answering. We have come close to human level accuracy, and this is due to incremental developments over previous models. As we saw, out-of-vocabulary (OOV) tokens were handled by using character embeddings, long-term dependencies within the context passage were addressed using self-attention, and many other techniques such as contextualized vectors, attention flow, etc. were employed to get better results. In the next chapter we will see how we can build on these models and develop them further.


CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING

Although the advancements in Question Answering systems since the release of the SQuAD dataset have been impressive, and the results are getting close to human level accuracy, it is far from being a fool-proof system. The models still make mistakes which would be obvious to a human. For example:

Passage: The Panthers used the San Jose State practice facility and stayed at the San Jose Marriott. The Broncos practiced at Florida State Facility and stayed at the Santa Clara Marriott.

Question: At what university's facility did the Panthers practice?

Actual Answer: San Jose State

Predicted Answer: Florida State Facility

To find out what leads to the wrong predictions, we wanted to see the attention weights associated with such an example. We plotted a passage-question heat map, which is a 2D matrix where the intensity of each cell signifies the similarity between a passage word and a question word (a plotting sketch is given below). For the above example we found that while certain words of the question are given high weightage, other parts are not. The words 'At', 'facility' and 'practice' are given high attention, but 'Panthers' does not receive high attention. If it had received high attention, then the system would have predicted 'San Jose State' as the right answer. To solve this issue we analyzed the base BiDAF model and proposed adding two things:

1 Bi-Attention and Self-Attention over the Query

2 A second level of attention over the output of (Bi-Attention + Self-Attention) from both the Context and the Query
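The heat map mentioned above can be produced with a few lines of matplotlib; the similarity matrix below is a random placeholder standing in for the model's actual attention scores:

import numpy as np
import matplotlib.pyplot as plt

passage = "The Panthers used the San Jose State practice facility".split()
question = "At what university 's facility did the Panthers practice".split()

sim = np.random.rand(len(passage), len(question))    # placeholder passage-question similarity

plt.imshow(sim, cmap="viridis", aspect="auto")
plt.xticks(range(len(question)), question, rotation=45)
plt.yticks(range(len(passage)), passage)
plt.colorbar(label="similarity")
plt.tight_layout()
plt.show()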


Multi-Attention BiDAF Model

Our comprehension model is a hierarchical multi-stage process and consists of

the following layers

1 Embedding: Just as in other models, we embed words using pretrained word vectors. We also embed the characters in each word into size-20 vectors which are learned, and run a convolutional neural network followed by max-pooling to get character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training.

2 Pre-Process: A shared bi-directional GRU (Cho et al., 2014) is used to map the question and passage embeddings to context-aware embeddings.

3 Context Attention: The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al., 2016) is used to build a query-aware context representation (a code sketch of this attention mechanism follows the list below). Let $h_i$ be the vector for context word i, $q_j$ be the vector for question word j, and $n_q$ and $n_c$ be the lengths of the question and context respectively. We compute attention between context word i and question word j as

$a_{ij} = w_1 \cdot h_i + w_2 \cdot q_j + w_3 \cdot (h_i \odot q_j)$ (4-1)

where $w_1$, $w_2$ and $w_3$ are learned vectors and $\odot$ is element-wise multiplication. We then compute an attended vector $c_i$ for each context token as

$p_{ij} = \frac{\exp(a_{ij})}{\sum_{j'} \exp(a_{ij'})}, \qquad c_i = \sum_j q_j\, p_{ij}$ (4-2)

We also compute a query-to-context vector $q_c$:

$m_i = \max_j a_{ij}, \qquad p_i = \frac{\exp(m_i)}{\sum_{i'} \exp(m_{i'})}, \qquad q_c = \sum_i h_i\, p_i$ (4-3)

The final vector computed for each token is built by concatenating $h_i$, $c_i$, $h_i \odot c_i$ and $q_c \odot c_i$. In our model we subsequently pass the result through a linear layer with ReLU activations.

4 Context Self-Attention: Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU. Then we apply the same attention mechanism, only now between the passage and itself. In this case we do not use query-to-context attention, and we set $a_{ij} = -\infty$ if $i = j$. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so its output is additionally summed with its input.

5 Query Attention: Here we proceed the same way as in the context attention layer, but calculate the weighted sum of the context words for each query word, so the output length is the number of query words. We then calculate context-to-query attention in the same way as query-to-context attention in the context attention layer.

Figure 4-1 The modified BiDAF model with multilevel attention

6 Query Self-Attention: This is done the same way as the context self-attention layer, but on the output of the query attention layer.


7 Context-Query Bi-Attention + Self-Attention: The outputs of the context self-attention and the query self-attention layers are taken as input, and the same bi-attention and self-attention process is applied to these inputs.

8 Prediction: In the last layer of our model, a bidirectional GRU is applied, followed by a linear layer that computes answer start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log likelihood of selecting the correct start and end tokens.
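A sketch of the attention computations in equations 4-1 through 4-3, together with the masked self-attention of step 4, is given below. The dimensions and random inputs are placeholders, and the final concatenation shown is the assumed output format described above:

import torch
import torch.nn.functional as F

def bi_attention(h, q, w1, w2, w3):
    # h: (n_c, d) context vectors, q: (n_q, d) question vectors, w1/w2/w3: (d,) learned vectors.
    a = (h @ w1).unsqueeze(1) + (q @ w2).unsqueeze(0) + (h * w3) @ q.T   # eq. 4-1
    c = F.softmax(a, dim=1) @ q                                          # eq. 4-2: attended vector per context token
    q_c = F.softmax(a.max(dim=1).values, dim=0) @ h                      # eq. 4-3: query-to-context vector
    # Assumed output format: [h; c; h*c; q_c*c] for each context token.
    return torch.cat([h, c, h * c, q_c.unsqueeze(0) * c], dim=-1)

def self_attention_scores(h, w1, w2, w3):
    # Same similarity applied between the passage and itself, with the diagonal masked out.
    a = (h @ w1).unsqueeze(1) + (h @ w2).unsqueeze(0) + (h * w3) @ h.T
    a.fill_diagonal_(float("-inf"))                                      # a_ij = -inf if i == j
    return F.softmax(a, dim=1)

d = 100
h, q = torch.randn(30, d), torch.randn(8, d)
out = bi_attention(h, q, torch.randn(d), torch.randn(d), torch.randn(d))  # (30, 4*d)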

Having carried out this modification, we were able to solve the wrong example we started with: the multilevel attention model gives the correct output, "San Jose State". We also achieved slightly better scores than the original model, with an F1 score of 85.44 on the SQuAD dev dataset.

Chatbot Design Using a QA System

Designing a perfect chatbot that passes the Turing test is a fundamental goal of Artificial Intelligence. Although we are many orders of magnitude away from achieving such a goal, domain-specific tasks can be solved with chatbots made from current technology. We had started with a similar objective in mind, i.e. to design a domain-specific chatbot and then generalize to other areas once the first domain-specific objective is achieved robustly. This led us to the fundamental problem of Machine Comprehension and subsequently to the task of Question Answering. Having achieved some degree of success with QA systems, we looked back at whether we could apply our newly acquired knowledge to the task of designing chatbots.

The chatbots made with today's technologies mostly rely on handcrafted techniques such as template matching, which require anticipating all possible ways a user may articulate his requirements and a conversation may unfold. This requires many man-hours for designing a domain-specific system and is still very error prone. In this section we propose a general chatbot design that makes designing a domain-specific chatbot easy and robust at the same time.

Every domain-specific chatbot needs to obtain a set of information from the user and show some results based on the user-specific information obtained. Traditional chatbots use template matching and keyword lookup to determine if the user has provided the required information. Our idea is to use a Question Answering system in the backend to extract the required information from whatever the user has typed up to this point of the conversation. The information to be extracted can be posed in the form of a set of questions, and the answers obtained for those questions can be used as the parameters to supply the relevant information to the user.

We chose flight reservation as our chatbot domain. Our goal was to extract the required information from the user to be able to show him the available flights as per his requirements.

Figure 4-2 Flight reservation chatbot's chat window


For a flight reservation task, the booking agent needs to know at minimum the origin city, destination city and date of travel to be able to show the available flights. Optional information includes the number of tickets, the passenger's name, one way or round trip, etc.

A minimalistic conversation with the user through the chat window would be as shown above. We had a platform called OneTask on which we wanted to implement our chatbot; the chat interface within the OneTask system looks as follows.

Figure 4-3 Chatbot within OneTask system

The working of the chat system is as follows (a slot-filling sketch is given after the flow diagram below):

1 Initiation: The user opens the chat window, which starts a session with the chatbot. The chatbot asks 'How may I help you?'

2 User Reply: The user may reply with none of the information required for flight booking, or may provide multiple pieces of information in the same message.

3 User Reply Parsing: The conversation up to this point is treated as a passage and the internal questions are run on this passage. The questions that are run include:

Where do you want to go?

From where do you want to leave?

When do you want to depart?

4 Parsed Responses from the QA Model: After running the questions on the QA system with the given conversation, answers are obtained for all of them along with their corresponding confidence values. Since answers will be returned even if the required question has not actually been answered up to this point in the conversation, the confidence values play a crucial role in determining the validity of an answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly; a confidence of 2-10 signifies that it may have been answered, but we should verify with the user for correctness; and any confidence below 2 is discarded.

5 Asking Remaining Questions Iteratively: After the parsing, it is checked whether any of the required questions are still unanswered. If so, the chatbot asks the remaining question and the process from steps 3-4 is carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request.

Figure 4-4 The Flow diagram of the Flight booking Chatbot system
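A minimal sketch of this slot-filling loop is shown below. The qa_model callable, returning an (answer, confidence) pair, is an assumed interface rather than a specific library API, and the thresholds are the ones described in step 4:

REQUIRED_QUESTIONS = {
    "destination": "Where do you want to go?",
    "origin": "From where do you want to leave?",
    "date": "When do you want to depart?",
}

ACCEPT, VERIFY = 10.0, 2.0          # confidence thresholds described in step 4

def fill_slots(conversation, qa_model, slots):
    # qa_model(passage, question) -> (answer_text, confidence); assumed interface.
    for slot, question in REQUIRED_QUESTIONS.items():
        if slot in slots:
            continue
        answer, confidence = qa_model(conversation, question)
        if confidence >= ACCEPT:
            slots[slot] = answer                       # accept silently
        elif confidence >= VERIFY:
            slots[slot + "_unverified"] = answer       # ask the user to confirm
    missing = [q for s, q in REQUIRED_QUESTIONS.items() if s not in slots]
    return slots, missing                              # ask the next missing question, if any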

Online QA System and Attention Visualization

To be able to test out various examples, we used the BiDAF [24] model for an online demo. One can either choose from the available examples in the drop-down menu or paste in their own passage and questions. While this is a useful and interesting way to test the model in a user-friendly manner, we created this system primarily to be able to focus on the wrongly answered samples.


Figure 4-5 QA system interface with attention highlight over candidate answers

The candidate answers are shown with blue highlights scaled by their confidence values: the higher the confidence, the darker the highlight (a sketch of this highlighting is given below). The answer with the highest confidence value is chosen as the predicted answer. We developed the system to show the attention spread over the candidate answers in order to understand what needs to be done to improve the system. This led us to realize the importance of including the query attention part as well as multilevel attention in the BiDAF model, as described in the first section of this chapter.
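A minimal sketch of how such confidence-scaled highlighting can be rendered, here as plain HTML with the opacity of a blue background proportional to confidence, is:

def highlight(passage_tokens, candidates):
    # candidates: list of (start, end, confidence); darker blue for higher confidence.
    max_conf = max(c for _, _, c in candidates)
    shades = {}
    for start, end, conf in candidates:
        for i in range(start, end + 1):
            shades[i] = max(shades.get(i, 0.0), conf / max_conf)
    out = []
    for i, tok in enumerate(passage_tokens):
        if i in shades:
            out.append('<span style="background: rgba(0, 80, 255, %.2f)">%s</span>' % (shades[i], tok))
        else:
            out.append(tok)
    return " ".join(out)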


CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION

Future Directions

Having done a thorough analysis of the current methods and the state of the art in Machine Comprehension and the QA task, and having developed systems on top of them, we have gained a strong sense of what needs to be done to further improve QA models. After observing the wrong samples, we can see that the system is still unable to encode meaning and is picking answers based on statistical patterns of answer occurrence in the training examples.

From the state-of-the-art models and the ongoing research literature, it is easy to conclude that more features need to be embedded to encode the meaning of words, phrases and sentences. The paper Reinforced Mnemonic Reader for Machine Comprehension [6] encoded POS and NER tags of words along with their word and character embeddings, which gave them better results.

Figure 5-1 An English language semantic parse tree [26]


We have developed a method to encode the syntax parse tree of a sentence that encodes not only the POS tags but also the relation of each word within its phrase and the relation of each phrase within the whole sentence, in a hierarchical manner.
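As a small illustration of attaching such syntactic features, the sketch below uses NLTK to obtain POS tags for each token and map them to indices that could be fed to an additional embedding layer; the sentence is arbitrary and the standard NLTK data packages noted in the comment must be downloaded first:

import nltk

# Requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
sentence = "Tesla asked Morgan for more funds to build a more powerful transmitter."
tokens = nltk.word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)                     # [('Tesla', 'NNP'), ('asked', 'VBD'), ...]

tag_set = sorted({tag for _, tag in pos_tags})
tag_to_id = {tag: i for i, tag in enumerate(tag_set)}
tag_ids = [tag_to_id[tag] for _, tag in pos_tags]   # indices ready for an extra embedding layer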

Finally, data augmentation is another way to get better results. One definite way to reduce the errors would be to include in the training data samples similar to those the system is faltering on in the dev set. One could generate examples similar to the failure cases and include them in the training set for better prediction. Another approach would be to train on similar and bigger datasets. Our models were trained on the SQuAD dataset; there are other datasets, such as TriviaQA, that address a similar question answering task. We could augment the training set with TriviaQA along with SQuAD to have a more robust system that is able to generalize better and thus predict answer spans more accurately.

Conclusion

In this work we explored the most fundamental techniques that have shaped the current state of the art. We then proposed a minor architectural improvement over an existing model. Furthermore, we developed two applications that use the base model: first, we showed how a chatbot application can be made using the QA system, and second, we created a web interface where the model can be used for any passage and question. This interface also shows the attention spread over the candidate answers. While our effort to push the state of the art forward is ongoing, we strongly believe that surpassing human level accuracy on this task will pay high dividends for society at large.


LIST OF REFERENCES

[1] Danqi Chen From Reading Comprehension To Open-Domain Question Answering 2018 Youtube Accessed March 16 2018 https://www.youtube.com/watch?v=1RN88O9C13U

[2] Lehnert Wendy G The Process of Question Answering No RR-88 Yale Univ New Haven Conn Dept of computer science 1977

[3] Joshi Mandar Eunsol Choi Daniel S Weld and Luke Zettlemoyer Triviaqa A large scale distantly supervised challenge dataset for reading comprehension arXiv preprint arXiv:1705.03551 (2017)

[4] Rajpurkar Pranav Jian Zhang Konstantin Lopyrev and Percy Liang Squad 100000+ questions for machine comprehension of text arXiv preprint arXiv:1606.05250 (2016)

[5] The Stanford Question Answering Dataset 2018 Rajpurkar.Github.Io Accessed March 16 2018 https://rajpurkar.github.io/SQuAD-explorer/

[6] Hu Minghao Yuxing Peng and Xipeng Qiu Reinforced mnemonic reader for machine comprehension CoRR abs/1705.02798 (2017)

[7] Schalkoff Robert J Artificial neural networks Vol 1 New York McGraw-Hill 1997

[8] Me For The AI and Neetesh Mehrotra 2018 The Connect Between Deep Learning And AI - Open Source For You Open Source For You Accessed March 16 2018 http://opensourceforu.com/2018/01/connect-deep-learning-ai

[9] Krizhevsky Alex Ilya Sutskever and Geoffrey E Hinton Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems pp 1097-1105 2012

[10] An Intuitive Explanation Of Convolutional Neural Networks 2016 The Data Science Blog Accessed March 16 2018 https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/

[11] Williams Ronald J and David Zipser A learning algorithm for continually running fully recurrent neural networks Neural computation 1 no 2 (1989) 270-280

[12] Hochreiter Sepp and Jürgen Schmidhuber Long short-term memory Neural computation 9 no 8 (1997) 1735-1780

[13] Chung Junyoung Caglar Gulcehre KyungHyun Cho and Yoshua Bengio Empirical evaluation of gated recurrent neural networks on sequence modeling arXiv preprint arXiv:1412.3555 (2014)


[14] Mikolov Tomas Wen-tau Yih and Geoffrey Zweig Linguistic regularities in continuous space word representations In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics Human Language Technologies pp 746-751 2013

[15] Pennington Jeffrey Richard Socher and Christopher Manning Glove Global vectors for word representation In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) pp 1532-1543 2014

[16] Word2vec Tutorial - The Skip-Gram Model · Chris Mccormick 2016 MccormickmlCom Accessed March 16 2018 http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

[17] Group 2015 [Paper Introduction] Bilingual Word Representations With Monolingual … SlideshareNet Accessed March 16 2018 https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind

[18] Attention Mechanism 2016 BlogHeuritechCom Accessed March 16 2018 https://blog.heuritech.com/2016/01/20/attention-mechanism/

[19] Weston Jason Sumit Chopra and Antoine Bordes 2014 Memory Networks arXiv preprint arXiv:1410.3916v11 Accessed March 16 2018 https://arxiv.org/abs/1410.3916

[20] Wang Shuohang and Jing Jiang Machine comprehension using match-lstm and answer pointer arXiv preprint arXiv:1608.07905 (2016)

[21] Wang Shuohang and Jing Jiang Learning natural language inference with LSTM arXiv preprint arXiv:1512.08849 (2015)

[22] Vinyals Oriol Meire Fortunato and Navdeep Jaitly Pointer networks In Advances in Neural Information Processing Systems pp 2692-2700 2015

[23] Wang Wenhui Nan Yang Furu Wei Baobao Chang and Ming Zhou Gated self-matching networks for reading comprehension and question answering In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1 Long Papers) vol 1 pp 189-198 2017

[24] Seo Minjoon Aniruddha Kembhavi Ali Farhadi and Hannaneh Hajishirzi Bidirectional attention flow for machine comprehension arXiv preprint arXiv:1611.01603 (2016)

[25] Clark Christopher and Matt Gardner Simple and effective multi-paragraph reading comprehension arXiv preprint arXiv:1710.10723 (2017)


[26] 8 Analyzing Sentence Structure 2018 NltkOrg Accessed March 16 2018 http://www.nltk.org/book/ch08.html


BIOGRAPHICAL SKETCH

Purnendu Mukherjee grew up in Kolkata, India, and had a strong interest in computers from an early age. After high school he did his BSc in computer science at Ramakrishna Mission Residential College, Narendrapur, followed by an MSc in computer science from St. Xavier's College, Kolkata. He had a strong intuition and interest for human-like learning systems and wanted to work in this area. He started working at TCS Innovation Labs, Pune, on applications of Natural Language Processing in educational software. As he was deeply passionate about learning systems that mimic the human brain and learn like a human child does, he became increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science at the University of Florida, Gainesville. His academic interests have been focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.

Page 3: QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISMufdcimages.uflib.ufl.edu/UF/E0/05/23/73/00001/MUKHERJEE_P.pdf · QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM By PURNENDU

To my family friends and teachers

4

ACKNOWLEDGMENTS

I would like to thank my thesis advisor Professor Andy Li who has been a

constant source of inspiration and encouragement for me to pursue my ideas Also I

am thankful to him for providing all the necessary resources to succeed for my research

efforts

I am deeply thankful to Professor Kristy Boyer for her guidance and mentorship

as she motivated me to pursue my research interests in Natural Language Processing

and Deep Learning

I would like to express my heartfelt gratitude to Professor Jose Principe who had

taught me the very fundamentals of Learning systems and formally introduced me to

Deep Learning

I wish to extend my thanks to all the member of CBL Lab and especially to the

NLP group for their insightful remarks and overall support

I am grateful to my close friend roommate and lab partner Yash Sinha who had

supported and helped me throughout with all aspects of my thesis work

Finally I must thank my parents for their unwavering support and dedication

towards my well-being and growth

5

TABLE OF CONTENTS page

ACKNOWLEDGMENTS 4

LIST OF FIGURES 6

LIST OF ABBREVIATIONS 7

ABSTRACT 8

CHAPTER

1 INTRODUCTION 9

2 THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM 13

Neural Networks 13 Convolutional Neural Network 14 Recurrent Neural Networks (RNN) 15 Word Embedding 16 Attention Mechanism 17 Memory Networks 18

3 LITERATURE REVIEW AND STATE OF THE ART 21

Machine Comprehension Using Match-LSTM and Answer Pointer 21 R-NET Matching Reading Comprehension with Self-Matching Networks 23 Bi-Directional Attention Flow (BiDAF) for Machine Comprehension 25 Summary 28

4 MULTI-ATTENTION QUESTION ANSWERING 29

Multi-Attention BiDAF Model 30 Chatbot Design Using a QA System 32 Online QA System and Attention Visualization 35

5 FUTURE DIRECTIONS AND CONCLUSION 37

Future Directions 37 Conclusion 38

LIST OF REFERENCES 39

BIOGRAPHICAL SKETCH 42

6

LIST OF FIGURES

Figure page 1-1 The task of Question Answering 10

2-1 Simple and Deep Learning Neural Networks 13

2-2 Convolutional Neural Network Architecture 14

2-3 Semantic relation between words in vector space 17

2-4 Attention Mechanism flow 18

2-5 QA example on Lord of the Rings using Memory Networks 20

3-1 Match-LSTM Model Architecture 22

3-2 The task of Question Answering 24

3-3 Bi-Directional Attention Flow Model Architecture 26

4-1 The modified BiDAF model with multilevel attention 31

4-2 Flight reservation chatbotrsquos chat window 33

4-3 Chatbot within OneTask system 34

4-4 The Flow diagram of the Flight booking Chatbot system 35

4-5 QA system interface with attention highlight over candidate answers 36

5-1 An English language semantic parse tree 37

7

LIST OF ABBREVIATIONS

BiDAF Bi-Directional Attention Flow

CNN Convolutional Neural Network

GRU Gated Recurrent Units

LSTM Long Short Term Memory

NLP Natural Language Processing

NLU Natural Language Understanding

RNN Recurrent Neural Network

8

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the

Requirements for the Degree of Master of Science

QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM

By

Purnendu Mukherjee

May 2018

Chair Xiaolin Li Major Computer Science

Question Answering(QA) systems have had rapid growth since the last 3 years

and are close to reaching human level accuracy One of the fundamental reason for this

growth has been the use of attention mechanism along with other methods of Deep

Learning But just as with other Deep Learning methods some of the failure cases are

so obvious that it convinces us that there is a lot to improve upon In this work we first

did a literature review of the State of the Art and fundamental models in QA systems

Next we introduce an architecture which has shown improvement in the targeted area

We then introduce a general method to enable easy design of domain-specific Chatbot

applications and present a proof of the concept with the same method Finally we

present an easy to use Question Answering interface with attention visualization on the

passage We also propose a method to improve the current state of the art as a part of

our ongoing work

9

CHAPTER 1 INTRODUCTION

Teaching machines to read and understand human natural language is our

central and long standing goal for Natural Language Processing and Artificial

Intelligence in general The positive side of machines being able to reason and

comprehend human language could be enormous We are beginning to see how

commercial systems like Alexa Siri Google Now etc are being widely used as Speech

Recognition improved to human levels While speech recognition systems can

transcribe speech to text comprehension of that transcribed text is another task which

is currently a major focus for both academia and industry because of its possible

applications Moreover all the text information available throughout the internet is a

major reason why machine comprehension of text is such an important task

With the growth of Deep Learning methods in the last few years the field of

Machine Comprehension and Natural Language Processing(NLP) in general has

experienced a revolution While the traditional methods and practices are still prevalent

and forms the basis of our deep understanding of languages Deep Learning methods

have surpassed all traditional NLP and Machine Learning methods by a significant

margin and are currently driving the growth of the field

To be able to build a system that can understand human text we need to first

ask ourselves how can we evaluate the machinersquos comprehension ability We had

initially set a goal to build a Chabot for a specific domain and generalize to other topics

as we go ahead While developing the system we found out the necessity for reading

comprehension and how to measure it We finally found the answer with Questions

Answering systems

10

Just like how we human beings are tested for our ability of language

understanding with questions we should ask machines similar questions about what it

has just read The performance of the system on such question answering task will let

us evaluate how much the machine is able to reason about what it just read [1]

Reading comprehension has been a topic of Natural Language Understanding since the

1970s In 1977 Wendy Lehnert said in his doctoral thesis ndash ldquoOnly when we can ask a

program to answer questions about what it reads will we be able to begin to access that

programrsquos comprehensionrdquo [2]

Figure 1-1 The task of Question Answering

To achieve this task the NLP community has developed various datasets such as

CNN Daily Mail WebQuestions SQuAD TriviaQA [3] etc For our purpose we chose

SQuAD which stands for Stanford Question Answering Dataset [4] SQuAD consists of

questions posed by crowdworkers on a set of Wikipedia articles where the answer to

every question is a segment of text or span from the corresponding reading passage

With 100000+ question-answer pairs on 500+ articles SQuAD is significantly larger

than previous reading comprehension datasets [5]

An example from the SQuAD dataset is as follows

11

Passage Tesla later approached Morgan to ask for more funds to build a more

powerful transmitter When asked where all the money had gone Tesla responded by

saying that he was affected by the Panic of 1901 which he (Morgan) had caused

Morgan was shocked by the reminder of his part in the stock market crash and by

Teslarsquos breach of contract by asking for more funds Tesla wrote another plea to

Morgan but it was also fruitless Morgan still owed Tesla money on the original

agreement and Tesla had been facing foreclosure even before construction of the

tower began

Question On what did Tesla blame for the loss of the initial money

Answer Panic of 1901

As we started exploring the QA task we faced several challenges Some of them

we could solve with the help of other research and some of the challenges still exist in

the domain

bull Out of Vocabulary words

bull Multi-sentence reasoning may be required

bull There may exist several candidate answers

bull Optimizing the Exact Match (EM) metric may fail when the answer boundary is fuzzy or too long such as the answer of the ldquowhyrdquo query [6]

bull ldquoOne-hoprdquo prediction may fail to fully understand the query [6]

bull Fail to fully capture the long-distance contextual interaction between parts of the context by only using LSTMGRU

bull Current models are unable to capture semantics of the passage

In the upcoming chapters we will first briefly review the basics necessary for

understanding the models then we will delve deep into the fundamental models that

12

have shaped the current State of the Art models then we will discuss our contribution in

terms of architecture and applications and finally conclude with future directions

13

CHAPTER 2 THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM

To build a question answering system one needs to be familiar with the

fundamental deep learning models such as Recurrent Neural Networks (RNN) Long

Short Term Memory (LSTM) etc In this chapter we will have an overview on these

techniques and see how they all connect to building a question answering system

Neural Networks

What makes Deep Learning so intriguing is that it has close resemblance with

the working of the mammalian brain or at least draws inspiration from it The same can

be said for Artificial Neural networks [7] which consists of a system of interconnected

units called lsquoneuronsrsquo that take input from similar units and produces a single output

Figure 2-1 Simple and Deep Learning Neural Networks [8]

The connection from one neuron to another can be weighted based on the input data

which enables the network to tune itself to produce a certain output based on the input

This is the learning process which is achieved through backpropagation which is a

system of propagating the error from the output layer to the previous layers

14

Convolutional Neural Network

The first wave of deep learningrsquos success was brought by Convolutional Neural

Networks (CNN) [9] when this was the technique used by the winning team of ImageNet

competition in 2012 CNNs are deep artificial neural networks (ANN) that can be used to

classify images cluster them by similarity and perform object recognition within scenes

It can be used to detect and identify faces people signs or any other visual data

Figure 2-2 Convolutional Neural Network Architecture [10]

There are primarily four operations in a standard CNN model (as shown in Fig above)

1 Convolution - The primary purpose of Convolution in the ConvNet (above) is to extract features from the input image The spatial relationship between pixels ie the image features are preserved and learned by the convolution using small squares of input data

2 Non-Linearity (ReLU) ndash Rectified Linear Unit (ReLU) is a non-linear operation that carries out an element wise operation on each pixel This operation replaces the negative pixel values in the feature map by zero

3 Pooling or Sub Sampling - Spatial Pooling reduces the dimensionality of each feature map but retains the most important information For max pooling the largest value in the square window is taken and rest are dropped Other types of pooling are Average Sum etc

4 Classification (Fully Connected Layer) - The Fully Connected layer is a traditional Multi-Layer Perceptron as described before that uses a softmax activation function in the output layer The high-level features of the image are encoded by the convolutional and pooling layers which is then fed to the fully connected layer which then uses these features for classifying the input image into various classes based on the training dataset [10]

15

When a new image is fed into the CNN model all the above-mentioned steps are

carried out (forward propagation) and a probability distribution is achieved on the set of

output classes With a large enough training dataset the network will learn and

generalize well enough to classify new images into their correct classes

Recurrent Neural Networks (RNN)

Whenever we want to predict or encode sequential data RNN [11] is our go to

method RNNs perform the same task for every element of a sequence where the

output of each element depends on previous computations thus the recurrence In

practice RNNs are unable to retain long-term dependencies and can look back only a

few steps because of the vanishing gradient problem

(2-1)

A solution to the dependency problem is to use gated cells such as LSTM [11] or

GRU [13] These cells pass on important information to the next cells while ignoring

non-important ones The gated units in a GRU block are

bull Update Gate ndash Computed based on current input and hidden state

(2-2)

bull Reset Gate ndash Calculated similarly but with different weights

(2-3)

bull New memory content - (2-4)

If reset gate unit is ~0 then previous memory is ignored and only new

information is kept

16

Final memory at current time step combines previous and current time steps

(2-5)

While the GRU is computationally efficient the LSTM on the other hand is a

general case where there are three gates as follows

bull Input Gate ndash What new information to add to the current cell state

bull Forget Gate ndash How much information from previous states to be kept

bull Output gate ndash How much info should be sent to the next states

Just like GRU the current cell state is a sum of the previous cell state but

weighted by the forget gate and the new value is added which is weighted by the input

gate Based on the cell state the output gate regulates the final output

Word Embedding

Computation or gradients can be applied on numbers and not on words or letters

So first we need to convert words into their corresponding numerical formation before

feeding into a deep learning model In general there are two types of word embedding

Frequency based (which constitutes count vectors tf-idf and co-occurrence vectors)

and Prediction based With frequency based embedding the order of the words are not

preserved and works as a bag of words model Whereas with prediction based model

the order of words or locality of words are taken into consideration to generate the

numerical representation of the word Within this prediction based category there are

two fundamental techniques called Continuous Bag of Words (CBOW) and Skip Gram

Model which forms the basis for word2vec [14] and GloVe [15]

The basic intuition behind word2vec is that if two different words have very

similar ldquocontextsrdquo (that is what words are likely to appear around them) then the model

17

will produce similar vector for those words Conversely if the two word vectors are

similar then the network will produce similar context predictions for the same two words

For examples synonyms like ldquointelligentrdquo and ldquosmartrdquo would have very similar contexts

Or that words that are related like ldquoenginerdquo and ldquotransmissionrdquo would probably have

similar contexts as well [16] Plotting the word vectors learned by a word2vec over a

large corpus we could find some very interesting relationships between words

Figure 2-3 Semantic relation between words in vector space [17]

Attention Mechanism

We as humans put our attention to things are important or are relevant in a

context For example when asked a question from a passage we try to find the most

relevant part of the passage the question is relevant with and then reason from our

understanding of that part of the passage The same idea applies for attention

mechanism in Deep Learning It is used to identify the specific parts of a given context

to which the current question is relevant to

Formally put the techniques take n arguments y_1 y_n (in our case the

passage having words say y_i through h_i) and a question word say q It returns a

vector z which is supposed to be the laquo summary raquo of the y_i focusing on information

linked to the question q More formally it returns a weighted arithmetic mean of the y_i

18

and the weights are chosen according the relevance of each y_i given the context c

[18]

Figure 2-4 Attention Mechanism flow [18]

Memory Networks

Convolutional Neural Networks and Recurrent Neural Networks which does

capture how we form our visual and sequential memories their memory (encoded by

hidden states and weights) were typically too small and was not compartmentalized

enough to accurately remember facts from the past (knowledge is compressed into

dense vectors) [19]

Deep Learning needed to cultivate a methodology that preserved memories as

they are such that it wonrsquot be lost in generalization and recalling exact words or

sequence of events would be possible mdash something computers are already good at This

effort led us to Memory Networks [19] published at ICLR 2015 by Facebook AI

Research

This paper provides a basic framework to store augment and retrieve memories

while seamlessly working with a Recurrent Neural Network architecture The memory

19

network consists of a memory m (an array of objects 1 indexed by m i) and four

(potentially learned) components I G O and R as follows

I (input feature map) mdash converts the incoming input to the internal feature

representation either a sparse or dense feature vector like that from word2vec or

GloVe

G (generalization) mdash updates old memories given the new input They call this

generalization as there is an opportunity for the network to compress and generalize its

memories at this stage for some intended future use The analogy Irsquove been talking

before

O (output feature map) mdash produces a new output (in the feature representation

space) given the new input and the current memory state This component is

responsible for performing inference In a question answering system this part will

select the candidate sentences (which might contain the answer) from the story

(conversation) so far

R (response) mdash converts the output into the response format desired For

example a textual response or an action In the QA system described this component

finds the desired answer and then converts it from feature representation to the actual

word

This model is a fully supervised model meaning all the candidate sentences from

which the answer could be found are marked during training phase and can also be

termed as lsquohard attentionrsquo

The authors tested out the QA system on various literature including Lord of the

Rings

20

Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]


CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART

There has been rapid progress since the release of the SQuAD dataset, and currently the best ensemble models are close to human-level accuracy in machine comprehension. This is due to various ingenious methods that solve some of the problems of previous approaches: out-of-vocabulary tokens were handled using character embeddings, long-term dependencies within the context passage were addressed using self-attention, and many other techniques such as contextualized vectors, history of words and attention flow were introduced. In this section we will look at some of the most important models that were fundamental to the progress of Question Answering.

Machine Comprehension Using Match-LSTM and Answer Pointer

In this paper [20] the authors propose an end-to-end neural architecture for the QA task. The architecture is based on match-LSTM [21], a model they proposed for textual entailment, and Pointer Net [22], a sequence-to-sequence model proposed by Vinyals et al. (2015) to constrain the output tokens to be from the input sequences. The model consists of an LSTM preprocessing layer, a match-LSTM layer and an Answer Pointer layer.

We are given a piece of text, which we refer to as a passage, and a question related to the passage. The passage is represented by a matrix P of size d × P, where P is the length (number of tokens) of the passage and d is the dimensionality of the word embeddings. Similarly, the question is represented by a matrix Q of size d × Q, where Q is the length of the question. Our goal is to identify a subsequence from the passage as the answer to the question.


Figure 3-1 Match-LSTM Model Architecture [20]

LSTM Preprocessing Layer: They use a standard one-directional LSTM (Hochreiter & Schmidhuber, 1997) to process the passage and the question separately and obtain hidden representations of each.

Match-LSTM Layer: They applied the match-LSTM model proposed for textual entailment to their machine comprehension problem by treating the question as a premise and the passage as a hypothesis. The match-LSTM sequentially goes through the passage. At position i of the passage, it first uses the standard word-by-word attention mechanism to obtain the attention weight vector as follows:

(3-1)

where the weight matrices and bias vectors involved are parameters to be learned.

Answer Pointer Layer: The Answer Pointer (Ans-Ptr) layer is motivated by the Pointer Net introduced by Vinyals et al. (2015) [22]. The boundary model produces only the start token and the end token of the answer, and then all the tokens between these two in the original passage are considered to be the answer.
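As an illustration of the boundary idea (our own sketch, not the authors' implementation), the fragment below picks the most probable (start, end) pair, with the end no earlier than the start and within a maximum span length, from two score vectors over passage positions and returns the corresponding answer span.

import numpy as np

def decode_boundary(start_probs, end_probs, passage_tokens, max_len=15):
    """Pick the (start, end) pair maximizing start_probs[s] * end_probs[e],
    subject to s <= e < s + max_len, and return the answer tokens."""
    best, best_span = -1.0, (0, 0)
    n = len(passage_tokens)
    for s in range(n):
        for e in range(s, min(n, s + max_len)):
            p = start_probs[s] * end_probs[e]
            if p > best:
                best, best_span = p, (s, e)
    s, e = best_span
    return passage_tokens[s:e + 1]

# Toy usage with made-up probabilities over a 6-token passage.
tokens = "Tesla was affected by the Panic".split()
start = np.array([0.05, 0.05, 0.10, 0.05, 0.05, 0.70])
end   = np.array([0.05, 0.05, 0.05, 0.05, 0.10, 0.70])
print(decode_boundary(start, end, tokens))   # -> ['Panic']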

When this paper was released back in November 2016, the Match-LSTM method was the state of the art in Question Answering systems and was at the top of the leaderboard for the SQuAD dataset.

R-NET: Machine Reading Comprehension with Self-Matching Networks

In this model [23], first the question and passage are processed separately by a bidirectional recurrent network (Mikolov et al., 2010). They then match the question and passage with gated attention-based recurrent networks, obtaining a question-aware representation for the passage. On top of that, they apply self-matching attention to aggregate evidence from the whole passage and refine the passage representation, which is then fed into the output layer to predict the boundary of the answer span.

Question and Passage Encoding: First, the words are converted to their respective word-level embeddings and character-level embeddings. The character-level embeddings are generated by taking the final hidden states of a bi-directional recurrent neural network (RNN) applied to embeddings of the characters in the token. Such character-level embeddings have been shown to be helpful in dealing with out-of-vocabulary (OOV) tokens.


They then use a bi-directional RNN to produce new representations of all the words in the question and the passage, respectively.

Figure 3-2 The task of Question Answering [23]

Gated Attention-based Recurrent Networks: They use a variant of attention-based recurrent networks with an additional gate to determine the importance of information in the passage regarding a question. Different from the gates in LSTM or GRU, the additional gate is based on the current passage word and its attention-pooling vector of the question, which focuses on the relation between the question and the current passage word. The gate effectively models the phenomenon that only parts of the passage are relevant to the question in reading comprehension and question answering, and only such relevant information is utilized in subsequent calculations.
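A hedged sketch of this additional gate, based on our reading of the description above rather than the authors' code: the current passage encoding and its question attention-pooling vector are concatenated, squashed through a sigmoid, and used to rescale that same concatenation before it enters the recurrent cell.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_input(u_p, c_q, W_g):
    """u_p: current passage word encoding; c_q: its attention-pooled question
    vector; W_g: learned gate matrix (an assumption). Returns the gated RNN input."""
    x = np.concatenate([u_p, c_q])       # [passage word ; question context]
    g = sigmoid(W_g @ x)                 # gate based on word + question context
    return g * x                         # element-wise re-weighting

d = 8
rng = np.random.default_rng(0)
u_p, c_q = rng.normal(size=d), rng.normal(size=d)
W_g = rng.normal(size=(2 * d, 2 * d))
print(gated_input(u_p, c_q, W_g).shape)  # (16,)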


Self-Matching Attention: From the previous step, the question-aware passage representation is generated to highlight the important parts of the passage. One problem with such a representation is that it has very limited knowledge of context: one answer candidate is often oblivious to important cues in the passage outside its surrounding window. To address this problem, the authors propose directly matching the question-aware passage representation against itself. It dynamically collects evidence from the whole passage for the words in the passage and encodes the evidence relevant to the current passage word, together with its matching question information, into the passage representation.

Output Layer: They use the same method as Wang & Jiang (2016b) and use pointer networks (Vinyals et al., 2015) to predict the start and end positions of the answer. In addition, they use attention-pooling over the question representation to generate the initial hidden vector for the pointer network [23].

When the R-NET model first appeared on the leaderboard in March 2017, it was at the top with a 72.3 Exact Match and 80.7 F1 score.

Bi-Directional Attention Flow (BiDAF) for Machine Comprehension

BiDAF [24] is a hierarchical multi-stage architecture for modeling the representations of the context paragraph at different levels of granularity. BiDAF includes character-level, word-level and contextual embeddings, and uses bi-directional attention flow to obtain a query-aware context representation. Their attention layer is not used to summarize the context paragraph into a fixed-size vector. Instead, the attention is computed for every time step, and the attended vector at each time step, along with the representations from previous layers, can flow through to the subsequent modeling layer. This reduces the information loss caused by early summarization [24].


Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]

Their machine comprehension model is a hierarchical multi-stage process and consists of six layers:

1 Character Embedding Layer maps each word to a vector space using character-level CNNs

2 Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model

3 Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embedding of the words. These first three layers are applied to both the query and the context. They use an LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words, placing an LSTM in both directions and concatenating the outputs of the two LSTMs.

4 Attention Flow Layer couples the query and context vectors and produces a set of query-aware feature vectors for each word in the context (a sketch of this computation follows the list below).

5 Modeling Layer employs a Recurrent Neural Network to scan the context. The input to the modeling layer is G, which encodes the query-aware representations of the context words. The output of the modeling layer captures the interaction among the context words conditioned on the query. They use two layers of bi-directional LSTM with an output size of d for each direction; hence a matrix is obtained, which is passed on to the output layer to predict the answer.

6 Output Layer provides an answer to the query. They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices under the predicted distributions, averaged over all examples [25].
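The sketch below (our own simplification, assuming a trilinear similarity function of the form w·[h; u; h∘u]) illustrates the attention flow layer of step 4: a similarity matrix between context and query words drives both context-to-query and query-to-context attention, and the results are concatenated into the query-aware representation G.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_flow(H, U, w):
    """H: (T, d) context vectors, U: (J, d) query vectors,
    w: (3d,) weights of an assumed trilinear similarity function.
    Returns query-aware context vectors G of shape (T, 4d)."""
    T, d = H.shape
    J = U.shape[0]
    S = np.zeros((T, J))                 # S[t, j] = w . [h; u; h*u]
    for t in range(T):
        for j in range(J):
            S[t, j] = w @ np.concatenate([H[t], U[j], H[t] * U[j]])
    A = softmax(S, axis=1)               # context-to-query attention
    U_tilde = A @ U                      # attended query vector per context word
    b = softmax(S.max(axis=1))           # query-to-context attention over T
    h_tilde = b @ H                      # summary of the most relevant context words
    H_tilde = np.tile(h_tilde, (T, 1))
    return np.concatenate([H, U_tilde, H * U_tilde, H * H_tilde], axis=1)

T, J, d = 5, 3, 4
rng = np.random.default_rng(1)
G = attention_flow(rng.normal(size=(T, d)), rng.normal(size=(J, d)),
                   rng.normal(size=3 * d))
print(G.shape)                           # (5, 16)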

In a further variation of their above work, they add a self-attention layer after the bi-attention layer to further improve the results. The architecture of that model is shown in the figure below.

Figure 3-3 The task of Question Answering [25]


Summary

In this chapter we reviewed the methods that are fundamental to the state of the art in Machine Comprehension and the task of Question Answering. We have come close to human-level accuracy, and this is due to incremental developments over previous models. As we saw, out-of-vocabulary (OOV) tokens were handled by using character embeddings, long-term dependencies within the context passage were addressed using self-attention, and many other techniques such as contextualized vectors and attention flow were employed to get better results. In the next chapter we will see how we can build on these models and develop them further.


CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING

Although the advancements in Question Answering systems since the release of the SQuAD dataset have been impressive and the results are getting close to human-level accuracy, we are far from having a fool-proof system. The models still make mistakes which would be obvious to a human. For example:

Passage: The Panthers used the San Jose State practice facility and stayed at the San Jose Marriott. The Broncos practiced at Florida State Facility and stayed at the Santa Clara Marriott.

Question: At what university's facility did the Panthers practice?

Actual Answer: San Jose State

Predicted Answer: Florida State Facility

To find out what is leading to the wrong predictions, we wanted to see the attention weights associated with such an example. We plotted the passage and question heat map, which is a 2D matrix where the intensity of each cell signifies the similarity between a passage word and a question word. For the above example, we found that while certain words of the question are given high weightage, other parts are not: the words 'At', 'facility' and 'practice' are given high attention, but 'Panthers' does not receive high attention. If it had received high attention, then the system would have predicted 'San Jose State' as the right answer. To solve this issue, we analyzed the base BiDAF model and proposed adding two things:

1 Bi-Attention and Self-Attention over Query

2 Second level attention over the output of (Bi-Attention + Self-Attention) from both Context and Query


Multi-Attention BiDAF Model

Our comprehension model is a hierarchical multi-stage process and consists of the following layers:

1 Embedding Just as in all other models, we embed words using pretrained word vectors. We also embed the characters in each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to get character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training.

2 Pre-Process A shared bi-directional GRU (Cho et al 2014) is used to map the question and passage embeddings to context aware embeddings

3 Context Attention The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al., 2016) is used to build a query-aware context representation (a code sketch of steps 3 and 4 is given after this list). Let h_i be the vector for context word i, q_j be the vector for question word j, and n_q and n_c be the lengths of the question and context respectively. We compute attention between context word i and question word j as

(4-1)

where w1, w2 and w3 are learned vectors and ∘ is element-wise multiplication. We then compute an attended vector c_i for each context token as

(4-2)

We also compute a query-to-context vector q_c

(4-3)

The final vector computed for each token is built by concatenating these vectors. In our model, we subsequently pass the result through a linear layer with ReLU activations.

4 Context Self-Attention Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU. Then we apply the same attention mechanism, only now between the passage and itself; in this case we do not use query-to-context attention, and we set a_ij = -inf if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so this output is additionally summed with the input.

5 Query Attention For this part we proceed in the same way as for context attention, but calculate the weighted sum of the context words for each query word; thus the resulting sequence has the length of the query. Then we calculate context-to-query attention analogously to the query-to-context attention of the context attention layer.

Figure 4-1 The modified BiDAF model with multilevel attention

6 Query Self-Attention This part is done the same way as the context self-attention layer but from the output of the Query Attention layer


7 Context Query Bi-Attention + Self-Attention The outputs of the Context Self-Attention and the Query Self-Attention layers are taken as input, and the same process of bi-attention and self-attention is applied to these inputs.

8 Prediction In the last layer of our model, a bidirectional GRU is applied, followed by a linear layer that computes answer start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log likelihood of selecting correct start and end tokens.
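Because equations 4-1 to 4-3 appear here only by number, the sketch below spells out our reading of steps 3 and 4, consistent with the formulation in [25]: a trilinear attention score a_ij = w1·h_i + w2·q_j + w3·(h_i∘q_j) (our reconstruction of 4-1), the attended vectors of 4-2 and 4-3, and a self-attention pass whose diagonal is masked out and which is applied residually. It is illustrative rather than the exact training code.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def trilinear_scores(H, Q, w1, w2, w3):
    """a_ij = w1.h_i + w2.q_j + w3.(h_i * q_j), our reading of Eq. 4-1."""
    return (H @ w1)[:, None] + (Q @ w2)[None, :] + (H * w3) @ Q.T

def context_attention(H, Q, w1, w2, w3):
    A = trilinear_scores(H, Q, w1, w2, w3)
    C = softmax(A, axis=1) @ Q                    # attended vectors c_i (Eq. 4-2)
    q_c = softmax(A.max(axis=1)) @ H              # query-to-context vector (Eq. 4-3)
    return C, q_c

def self_attention(H, w1, w2, w3):
    """Same attention between the passage and itself, with a_ij = -inf when
    i = j, applied residually (step 4, simplified)."""
    A = trilinear_scores(H, H, w1, w2, w3)
    np.fill_diagonal(A, -np.inf)
    return H + softmax(A, axis=1) @ H             # residual connection

n_c, n_q, d = 6, 4, 8
rng = np.random.default_rng(2)
H, Q = rng.normal(size=(n_c, d)), rng.normal(size=(n_q, d))
w1, w2, w3 = rng.normal(size=(3, d))
C, q_c = context_attention(H, Q, w1, w2, w3)
print(C.shape, q_c.shape, self_attention(H, w1, w2, w3).shape)  # (6, 8) (8,) (6, 8)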

Having carried out this modification, we were able to solve the wrong example we started with: the multilevel attention model gives the correct output, "San Jose State". We also achieved slightly better scores than the original model, with an F1 score of 85.44 on the SQuAD dev dataset.

Chatbot Design Using a QA System

Designing a perfect chatbot that passes the Turing test is a fundamental goal of Artificial Intelligence. Although we are many orders of magnitude away from achieving such a goal, domain-specific tasks can be solved with chatbots made from current technology. We had started with a similar objective in mind, i.e., to design a domain-specific chatbot and then generalize to other areas once the first domain-specific objective is achieved robustly. This led us to the fundamental problem of Machine Comprehension and subsequently to the task of Question Answering. Having achieved some degree of success with QA systems, we looked back at whether we could apply our newly acquired knowledge to the task of designing chatbots.

The chatbots made with today's technologies mostly rely on handcrafted techniques such as template matching, which requires anticipating all possible ways a user may articulate his requirements and a conversation may unfold. This requires a lot of man-hours for designing a domain-specific system and is still very error prone. In this section we propose a general chatbot design that would make the design of a domain-specific chatbot easy and robust at the same time.

Every domain-specific chatbot needs to obtain a set of information from the user and show some results based on the user-specific information obtained. Traditional chatbots use template matching and keyword lookup to determine if the user has provided the required information. Our idea is to use the Question Answering system in the backend to extract the required information from whatever the user has typed up to this point of the conversation. The information to be extracted can be posed in the form of a set of questions, and the answers obtained from those questions can be used as the parameters to supply the relevant information to the user.

We had chosen the flight reservation system as our chatbot domain. Our goal was to extract the required information from the user to be able to show him the available flights as per the user's requirements.

Figure 4-2 Flight reservation chatbotrsquos chat window


For a flight reservation task, the booking agent needs to know at minimum the origin city, destination city and date of travel to be able to show the available flights. Optional information includes the number of tickets, the passenger's name, one way or round trip, etc.

The minimalistic conversation with the user through the chat window would be as shown above. We had a platform called OneTask on which we wanted to implement our chatbot. The chat interface within the OneTask system looks as follows.

Figure 4-3 Chatbot within OneTask system

The working of the chat system is as follows:

1 Initiation The user opens the chat window, which starts a session with the chatbot. The chatbot asks 'How may I help you?'

2 User Reply The user may reply with none of the required information for flight booking or may reply with multiple pieces of information in the same message.

3 User Reply Parsing The conversation up to this point is treated as a passage, and the internal questions are run on this passage. The four questions that are run are:

Where do you want to go?

From where do you want to leave?

When do you want to depart?

4 Parsed Responses from the QA Model After running the questions on the QA system with the given conversation, answers are obtained for all of them along with their corresponding confidence values. Since answers will be obtained even if the required question has not been answered up to this point in the conversation, the confidence values play a crucial role in determining the validity of the answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly, a confidence of 2 to 10 signifies that it may have been answered but the system should verify the answer with the user, and any confidence below 2 is discarded.

5 Asking Remaining Questions Iteratively After the parsing, it is checked whether any of the required questions are still unanswered. If so, the chatbot asks the remaining question, and the process of steps 3 to 4 is carried out iteratively (a sketch of this loop is given after this list). Once all the questions have been answered, the user is shown the available flight options as per his request.
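A minimal sketch of this loop, assuming a qa_model(passage, question) function that wraps the QA system and returns an (answer, confidence) pair; the function names and the way the thresholds are wired up are our own illustration of the description above, not production code.

# Hypothetical slot-filling loop around a QA backend.
REQUIRED_QUESTIONS = {
    "destination": "Where do you want to go?",
    "origin": "From where do you want to leave?",
    "date": "When do you want to depart?",
}
ACCEPT, VERIFY = 10.0, 2.0   # confidence thresholds from the description above

def fill_slots(conversation, qa_model, ask_user, confirm_user):
    """conversation: list of user utterances so far.
    qa_model(passage, question) -> (answer, confidence)   [assumed interface]
    ask_user(prompt) -> reply text; confirm_user(prompt) -> bool."""
    slots = {}
    while True:
        passage = " ".join(conversation)              # chat so far = passage
        for slot, question in REQUIRED_QUESTIONS.items():
            if slot in slots:
                continue
            answer, confidence = qa_model(passage, question)
            if confidence > ACCEPT:
                slots[slot] = answer                  # confidently answered
            elif confidence > VERIFY and confirm_user(
                    f"Did you mean '{answer}'? ({question})"):
                slots[slot] = answer                  # answered after verification
        missing = [q for s, q in REQUIRED_QUESTIONS.items() if s not in slots]
        if not missing:
            return slots                              # enough info to search flights
        conversation.append(ask_user(missing[0]))     # ask the next open question

# fill_slots() would be called with the real QA model and the chat UI callbacks.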

Figure 4-4 The Flow diagram of the Flight booking Chatbot system

Online QA System and Attention Visualization

To be able to test various examples, we used the BiDAF [24] model for an online demo. One can either choose from the available examples in the drop-down menu or paste their own passage and examples. While this is a useful and interesting system for testing the model in a user-friendly way, we created it primarily to be able to focus on the wrong samples.


Figure 4-5 QA system interface with attention highlight over candidate answers

The candidate answers are shown with blue highlights according to their confidence values: the higher the confidence, the darker the answer. The highest-confidence candidate is chosen as the predicted answer. We developed the system to show the attention spread over the candidate answers in order to understand what needs to be done to improve the system. This led us to realize the importance of including the query attention part as well as multilevel attention in the BiDAF model, as described in the first section of this chapter.


CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION

Future Directions

Having done a thorough analysis of the current methods and the state of the art in Machine Comprehension and the QA task, and having developed systems on top of them, we have gained a strong sense of what needs to be done to further improve QA models. After observing the wrong samples, we can see that the system is still unable to encode meaning and is picking answers based on statistical patterns of answer occurrence in the training examples.

Looking at the state-of-the-art models and the ongoing research literature, it is easy to conclude that more features need to be embedded to encode the meaning of words, phrases and sentences. A paper called Reinforced Mnemonic Reader for Machine Comprehension [6] encoded POS and NER tags of words along with their word and character embeddings, which gave them better results.

Figure 5-1 An English language semantic parse tree [26]


We have developed a method to encode the syntax parse tree of a sentence that encodes not only the POS tags but also the relation of each word within its phrase and the relation of the phrase within the whole sentence, in a hierarchical manner.
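As a hedged illustration of this direction (the vocabulary sizes and dimensions are placeholders, not our final implementation), the snippet below concatenates word, character-derived, POS and NER features into a single enriched vector per token.

import numpy as np

rng = np.random.default_rng(0)
d_word, d_char, d_pos, d_ner = 100, 20, 16, 16
word_table = rng.normal(size=(50000, d_word))   # pretrained word vectors (placeholder)
pos_table  = rng.normal(size=(50, d_pos))       # learned POS-tag embeddings
ner_table  = rng.normal(size=(20, d_ner))       # learned NER-tag embeddings

def embed_token(word_id, char_vec, pos_id, ner_id):
    """Concatenate word, char-derived, POS and NER features for one token."""
    return np.concatenate([word_table[word_id], char_vec,
                           pos_table[pos_id], ner_table[ner_id]])

token = embed_token(word_id=42, char_vec=rng.normal(size=d_char), pos_id=7, ner_id=3)
print(token.shape)   # (152,) enriched representation fed to the encoder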

Finally, data augmentation is another way to get better results. One definite way to reduce the errors would be to include in the training data samples similar to those the system is faltering on in the dev set: one could generate examples similar to the failure cases and include them in the training set to obtain better predictions. Another approach would be to train on similar and bigger datasets. Our models were trained on the SQuAD dataset; there are other datasets, such as TriviaQA, that pose a similar question answering task. We could augment the SQuAD training set with TriviaQA to build a more robust system that generalizes better and thus achieves higher accuracy when predicting answer spans.

Conclusion

In this work we have tried to explore the most fundamental techniques that have shaped the current state of the art. We then proposed a minor architectural improvement over an existing model. Furthermore, we developed two applications that use the base model: first, we discussed how a chatbot application can be built using the QA system, and second, we created a web interface where the model can be used on any passage and question. This interface also shows the attention spread over the candidate answers. While our effort to push the state of the art forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will pay high dividends for society at large.


LIST OF REFERENCES

[1] Chen, Danqi. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U

[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.

[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension." arXiv preprint arXiv:1705.03551 (2017).

[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ Questions for Machine Comprehension of Text." arXiv preprint arXiv:1606.05250 (2016).

[5] "The Stanford Question Answering Dataset." 2018. rajpurkar.github.io. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/

[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced Mnemonic Reader for Machine Comprehension." CoRR abs/1705.02798 (2017).

[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.

[8] Mehrotra, Neetesh. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai

[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.

[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/

[11] Williams, Ronald J., and David Zipser. "A Learning Algorithm for Continually Running Fully Recurrent Neural Networks." Neural Computation 1, no. 2 (1989): 270-280.

[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long Short-Term Memory." Neural Computation 9, no. 8 (1997): 1735-1780.

[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling." arXiv preprint arXiv:1412.3555 (2014).

[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic Regularities in Continuous Space Word Representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.

[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global Vectors for Word Representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.

[16] "Word2Vec Tutorial - The Skip-Gram Model." Chris McCormick. 2016. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

[17] NAIST MT Study Group. 2015. "[Paper Introduction] Bilingual Word Representations with Monolingual Quality in Mind." SlideShare. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind

[18] "Attention Mechanism." 2016. blog.heuritech.com. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism/

[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. "Memory Networks." arXiv preprint arXiv:1410.3916 (2014). Accessed March 16, 2018. https://arxiv.org/abs/1410.3916

[20] Wang, Shuohang, and Jing Jiang. "Machine Comprehension Using Match-LSTM and Answer Pointer." arXiv preprint arXiv:1608.07905 (2016).

[21] Wang, Shuohang, and Jing Jiang. "Learning Natural Language Inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).

[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer Networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.

[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated Self-Matching Networks for Reading Comprehension and Question Answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.

[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional Attention Flow for Machine Comprehension." arXiv preprint arXiv:1611.01603 (2016).

[25] Clark, Christopher, and Matt Gardner. "Simple and Effective Multi-Paragraph Reading Comprehension." arXiv preprint arXiv:1710.10723 (2017).

[26] "8. Analyzing Sentence Structure." 2018. nltk.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html

BIOGRAPHICAL SKETCH

Purnendu Mukherjee grew up in Kolkata, India, and had a strong interest in computers from an early age. After high school, he did his BSc in computer science at Ramakrishna Mission Residential College, Narendrapur, followed by an MSc in computer science at St. Xavier's College, Kolkata. He had a strong intuition and interest for human-like learning systems and wanted to work in this area. He started working at TCS Innovation Labs, Pune, on the application of Natural Language Processing in educational applications. As he was deeply passionate about learning systems that mimic the human brain and learn like a human child does, he became increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science at the University of Florida, Gainesville. His academic interests have been focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.

  • ACKNOWLEDGMENTS
  • LIST OF FIGURES
  • LIST OF ABBREVIATIONS
  • INTRODUCTION
  • THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
    • Neural Networks
    • Convolutional Neural Network
    • Recurrent Neural Networks (RNN)
    • Word Embedding
    • Attention Mechanism
    • Memory Networks
      • LITERATURE REVIEW AND STATE OF THE ART
        • Machine Comprehension Using Match-LSTM and Answer Pointer
        • R-NET Matching Reading Comprehension with Self-Matching Networks
        • Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
        • Summary
          • MULTI-ATTENTION QUESTION ANSWERING
            • Multi-Attention BiDAF Model
            • Chatbot Design Using a QA System
            • Online QA System and Attention Visualization
              • FUTURE DIRECTIONS AND CONCLUSION
                • Future Directions
                • Conclusion
                  • LIST OF REFERENCES
                  • BIOGRAPHICAL SKETCH
Page 4: QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISMufdcimages.uflib.ufl.edu/UF/E0/05/23/73/00001/MUKHERJEE_P.pdf · QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM By PURNENDU

4

ACKNOWLEDGMENTS

I would like to thank my thesis advisor Professor Andy Li who has been a

constant source of inspiration and encouragement for me to pursue my ideas Also I

am thankful to him for providing all the necessary resources to succeed for my research

efforts

I am deeply thankful to Professor Kristy Boyer for her guidance and mentorship

as she motivated me to pursue my research interests in Natural Language Processing

and Deep Learning

I would like to express my heartfelt gratitude to Professor Jose Principe who had

taught me the very fundamentals of Learning systems and formally introduced me to

Deep Learning

I wish to extend my thanks to all the member of CBL Lab and especially to the

NLP group for their insightful remarks and overall support

I am grateful to my close friend roommate and lab partner Yash Sinha who had

supported and helped me throughout with all aspects of my thesis work

Finally I must thank my parents for their unwavering support and dedication

towards my well-being and growth

5

TABLE OF CONTENTS page

ACKNOWLEDGMENTS 4

LIST OF FIGURES 6

LIST OF ABBREVIATIONS 7

ABSTRACT 8

CHAPTER

1 INTRODUCTION 9

2 THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM 13

Neural Networks 13 Convolutional Neural Network 14 Recurrent Neural Networks (RNN) 15 Word Embedding 16 Attention Mechanism 17 Memory Networks 18

3 LITERATURE REVIEW AND STATE OF THE ART 21

Machine Comprehension Using Match-LSTM and Answer Pointer 21 R-NET Matching Reading Comprehension with Self-Matching Networks 23 Bi-Directional Attention Flow (BiDAF) for Machine Comprehension 25 Summary 28

4 MULTI-ATTENTION QUESTION ANSWERING 29

Multi-Attention BiDAF Model 30 Chatbot Design Using a QA System 32 Online QA System and Attention Visualization 35

5 FUTURE DIRECTIONS AND CONCLUSION 37

Future Directions 37 Conclusion 38

LIST OF REFERENCES 39

BIOGRAPHICAL SKETCH 42

6

LIST OF FIGURES

Figure page 1-1 The task of Question Answering 10

2-1 Simple and Deep Learning Neural Networks 13

2-2 Convolutional Neural Network Architecture 14

2-3 Semantic relation between words in vector space 17

2-4 Attention Mechanism flow 18

2-5 QA example on Lord of the Rings using Memory Networks 20

3-1 Match-LSTM Model Architecture 22

3-2 The task of Question Answering 24

3-3 Bi-Directional Attention Flow Model Architecture 26

4-1 The modified BiDAF model with multilevel attention 31

4-2 Flight reservation chatbotrsquos chat window 33

4-3 Chatbot within OneTask system 34

4-4 The Flow diagram of the Flight booking Chatbot system 35

4-5 QA system interface with attention highlight over candidate answers 36

5-1 An English language semantic parse tree 37

7

LIST OF ABBREVIATIONS

BiDAF Bi-Directional Attention Flow

CNN Convolutional Neural Network

GRU Gated Recurrent Units

LSTM Long Short Term Memory

NLP Natural Language Processing

NLU Natural Language Understanding

RNN Recurrent Neural Network

8

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the

Requirements for the Degree of Master of Science

QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM

By

Purnendu Mukherjee

May 2018

Chair Xiaolin Li Major Computer Science

Question Answering(QA) systems have had rapid growth since the last 3 years

and are close to reaching human level accuracy One of the fundamental reason for this

growth has been the use of attention mechanism along with other methods of Deep

Learning But just as with other Deep Learning methods some of the failure cases are

so obvious that it convinces us that there is a lot to improve upon In this work we first

did a literature review of the State of the Art and fundamental models in QA systems

Next we introduce an architecture which has shown improvement in the targeted area

We then introduce a general method to enable easy design of domain-specific Chatbot

applications and present a proof of the concept with the same method Finally we

present an easy to use Question Answering interface with attention visualization on the

passage We also propose a method to improve the current state of the art as a part of

our ongoing work

9

CHAPTER 1 INTRODUCTION

Teaching machines to read and understand human natural language is our

central and long standing goal for Natural Language Processing and Artificial

Intelligence in general The positive side of machines being able to reason and

comprehend human language could be enormous We are beginning to see how

commercial systems like Alexa Siri Google Now etc are being widely used as Speech

Recognition improved to human levels While speech recognition systems can

transcribe speech to text comprehension of that transcribed text is another task which

is currently a major focus for both academia and industry because of its possible

applications Moreover all the text information available throughout the internet is a

major reason why machine comprehension of text is such an important task

With the growth of Deep Learning methods in the last few years the field of

Machine Comprehension and Natural Language Processing(NLP) in general has

experienced a revolution While the traditional methods and practices are still prevalent

and forms the basis of our deep understanding of languages Deep Learning methods

have surpassed all traditional NLP and Machine Learning methods by a significant

margin and are currently driving the growth of the field

To be able to build a system that can understand human text we need to first

ask ourselves how can we evaluate the machinersquos comprehension ability We had

initially set a goal to build a Chabot for a specific domain and generalize to other topics

as we go ahead While developing the system we found out the necessity for reading

comprehension and how to measure it We finally found the answer with Questions

Answering systems

10

Just like how we human beings are tested for our ability of language

understanding with questions we should ask machines similar questions about what it

has just read The performance of the system on such question answering task will let

us evaluate how much the machine is able to reason about what it just read [1]

Reading comprehension has been a topic of Natural Language Understanding since the

1970s In 1977 Wendy Lehnert said in his doctoral thesis ndash ldquoOnly when we can ask a

program to answer questions about what it reads will we be able to begin to access that

programrsquos comprehensionrdquo [2]

Figure 1-1 The task of Question Answering

To achieve this task the NLP community has developed various datasets such as

CNN Daily Mail WebQuestions SQuAD TriviaQA [3] etc For our purpose we chose

SQuAD which stands for Stanford Question Answering Dataset [4] SQuAD consists of

questions posed by crowdworkers on a set of Wikipedia articles where the answer to

every question is a segment of text or span from the corresponding reading passage

With 100000+ question-answer pairs on 500+ articles SQuAD is significantly larger

than previous reading comprehension datasets [5]

An example from the SQuAD dataset is as follows

11

Passage Tesla later approached Morgan to ask for more funds to build a more

powerful transmitter When asked where all the money had gone Tesla responded by

saying that he was affected by the Panic of 1901 which he (Morgan) had caused

Morgan was shocked by the reminder of his part in the stock market crash and by

Teslarsquos breach of contract by asking for more funds Tesla wrote another plea to

Morgan but it was also fruitless Morgan still owed Tesla money on the original

agreement and Tesla had been facing foreclosure even before construction of the

tower began

Question On what did Tesla blame for the loss of the initial money

Answer Panic of 1901

As we started exploring the QA task we faced several challenges Some of them

we could solve with the help of other research and some of the challenges still exist in

the domain

bull Out of Vocabulary words

bull Multi-sentence reasoning may be required

bull There may exist several candidate answers

bull Optimizing the Exact Match (EM) metric may fail when the answer boundary is fuzzy or too long such as the answer of the ldquowhyrdquo query [6]

bull ldquoOne-hoprdquo prediction may fail to fully understand the query [6]

bull Fail to fully capture the long-distance contextual interaction between parts of the context by only using LSTMGRU

bull Current models are unable to capture semantics of the passage

In the upcoming chapters we will first briefly review the basics necessary for

understanding the models then we will delve deep into the fundamental models that

12

have shaped the current State of the Art models then we will discuss our contribution in

terms of architecture and applications and finally conclude with future directions

13

CHAPTER 2 THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM

To build a question answering system one needs to be familiar with the

fundamental deep learning models such as Recurrent Neural Networks (RNN) Long

Short Term Memory (LSTM) etc In this chapter we will have an overview on these

techniques and see how they all connect to building a question answering system

Neural Networks

What makes Deep Learning so intriguing is that it has close resemblance with

the working of the mammalian brain or at least draws inspiration from it The same can

be said for Artificial Neural networks [7] which consists of a system of interconnected

units called lsquoneuronsrsquo that take input from similar units and produces a single output

Figure 2-1 Simple and Deep Learning Neural Networks [8]

The connection from one neuron to another can be weighted based on the input data

which enables the network to tune itself to produce a certain output based on the input

This is the learning process which is achieved through backpropagation which is a

system of propagating the error from the output layer to the previous layers

14

Convolutional Neural Network

The first wave of deep learningrsquos success was brought by Convolutional Neural

Networks (CNN) [9] when this was the technique used by the winning team of ImageNet

competition in 2012 CNNs are deep artificial neural networks (ANN) that can be used to

classify images cluster them by similarity and perform object recognition within scenes

It can be used to detect and identify faces people signs or any other visual data

Figure 2-2 Convolutional Neural Network Architecture [10]

There are primarily four operations in a standard CNN model (as shown in Fig above)

1 Convolution - The primary purpose of Convolution in the ConvNet (above) is to extract features from the input image The spatial relationship between pixels ie the image features are preserved and learned by the convolution using small squares of input data

2 Non-Linearity (ReLU) ndash Rectified Linear Unit (ReLU) is a non-linear operation that carries out an element wise operation on each pixel This operation replaces the negative pixel values in the feature map by zero

3 Pooling or Sub Sampling - Spatial Pooling reduces the dimensionality of each feature map but retains the most important information For max pooling the largest value in the square window is taken and rest are dropped Other types of pooling are Average Sum etc

4 Classification (Fully Connected Layer) - The Fully Connected layer is a traditional Multi-Layer Perceptron as described before that uses a softmax activation function in the output layer The high-level features of the image are encoded by the convolutional and pooling layers which is then fed to the fully connected layer which then uses these features for classifying the input image into various classes based on the training dataset [10]

15

When a new image is fed into the CNN model all the above-mentioned steps are

carried out (forward propagation) and a probability distribution is achieved on the set of

output classes With a large enough training dataset the network will learn and

generalize well enough to classify new images into their correct classes

Recurrent Neural Networks (RNN)

Whenever we want to predict or encode sequential data RNN [11] is our go to

method RNNs perform the same task for every element of a sequence where the

output of each element depends on previous computations thus the recurrence In

practice RNNs are unable to retain long-term dependencies and can look back only a

few steps because of the vanishing gradient problem

(2-1)

A solution to the dependency problem is to use gated cells such as LSTM [11] or

GRU [13] These cells pass on important information to the next cells while ignoring

non-important ones The gated units in a GRU block are

bull Update Gate ndash Computed based on current input and hidden state

(2-2)

bull Reset Gate ndash Calculated similarly but with different weights

(2-3)

bull New memory content - (2-4)

If reset gate unit is ~0 then previous memory is ignored and only new

information is kept

16

Final memory at current time step combines previous and current time steps

(2-5)

While the GRU is computationally efficient the LSTM on the other hand is a

general case where there are three gates as follows

bull Input Gate ndash What new information to add to the current cell state

bull Forget Gate ndash How much information from previous states to be kept

bull Output gate ndash How much info should be sent to the next states

Just like GRU the current cell state is a sum of the previous cell state but

weighted by the forget gate and the new value is added which is weighted by the input

gate Based on the cell state the output gate regulates the final output

Word Embedding

Computation or gradients can be applied on numbers and not on words or letters

So first we need to convert words into their corresponding numerical formation before

feeding into a deep learning model In general there are two types of word embedding

Frequency based (which constitutes count vectors tf-idf and co-occurrence vectors)

and Prediction based With frequency based embedding the order of the words are not

preserved and works as a bag of words model Whereas with prediction based model

the order of words or locality of words are taken into consideration to generate the

numerical representation of the word Within this prediction based category there are

two fundamental techniques called Continuous Bag of Words (CBOW) and Skip Gram

Model which forms the basis for word2vec [14] and GloVe [15]

The basic intuition behind word2vec is that if two different words have very

similar ldquocontextsrdquo (that is what words are likely to appear around them) then the model

17

will produce similar vector for those words Conversely if the two word vectors are

similar then the network will produce similar context predictions for the same two words

For examples synonyms like ldquointelligentrdquo and ldquosmartrdquo would have very similar contexts

Or that words that are related like ldquoenginerdquo and ldquotransmissionrdquo would probably have

similar contexts as well [16] Plotting the word vectors learned by a word2vec over a

large corpus we could find some very interesting relationships between words

Figure 2-3 Semantic relation between words in vector space [17]

Attention Mechanism

We as humans put our attention to things are important or are relevant in a

context For example when asked a question from a passage we try to find the most

relevant part of the passage the question is relevant with and then reason from our

understanding of that part of the passage The same idea applies for attention

mechanism in Deep Learning It is used to identify the specific parts of a given context

to which the current question is relevant to

Formally put the techniques take n arguments y_1 y_n (in our case the

passage having words say y_i through h_i) and a question word say q It returns a

vector z which is supposed to be the laquo summary raquo of the y_i focusing on information

linked to the question q More formally it returns a weighted arithmetic mean of the y_i

18

and the weights are chosen according the relevance of each y_i given the context c

[18]

Figure 2-4 Attention Mechanism flow [18]

Memory Networks

Convolutional Neural Networks and Recurrent Neural Networks which does

capture how we form our visual and sequential memories their memory (encoded by

hidden states and weights) were typically too small and was not compartmentalized

enough to accurately remember facts from the past (knowledge is compressed into

dense vectors) [19]

Deep Learning needed to cultivate a methodology that preserved memories as

they are such that it wonrsquot be lost in generalization and recalling exact words or

sequence of events would be possible mdash something computers are already good at This

effort led us to Memory Networks [19] published at ICLR 2015 by Facebook AI

Research

This paper provides a basic framework to store augment and retrieve memories

while seamlessly working with a Recurrent Neural Network architecture The memory

19

network consists of a memory m (an array of objects 1 indexed by m i) and four

(potentially learned) components I G O and R as follows

I (input feature map) mdash converts the incoming input to the internal feature

representation either a sparse or dense feature vector like that from word2vec or

GloVe

G (generalization) mdash updates old memories given the new input They call this

generalization as there is an opportunity for the network to compress and generalize its

memories at this stage for some intended future use The analogy Irsquove been talking

before

O (output feature map) mdash produces a new output (in the feature representation

space) given the new input and the current memory state This component is

responsible for performing inference In a question answering system this part will

select the candidate sentences (which might contain the answer) from the story

(conversation) so far

R (response) mdash converts the output into the response format desired For

example a textual response or an action In the QA system described this component

finds the desired answer and then converts it from feature representation to the actual

word

This model is a fully supervised model meaning all the candidate sentences from

which the answer could be found are marked during training phase and can also be

termed as lsquohard attentionrsquo

The authors tested out the QA system on various literature including Lord of the

Rings

20

Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]

21

CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART

There has been rapid progress since the release of the SQuAD dataset and

currently the best ensemble models are close to human level accuracy in machine

comprehension This is due to the various ingenious methods which solves some of the

problems with the previous methods Out of Vocabulary tokens were handled by using

Character embedding Long term dependency within context passage were solved

using self-attention And many other techniques such as Contextualized vectors History

of Words Attention Flow etc In this section we will have a look at the some of the most

important models that were fundamental to the progress of Questions Answering

Machine Comprehension Using Match-LSTM and Answer Pointer

In this paper [20] the authors propose an end-to-end neural architecture for the

QA task The architecture is based on match-LSTM [21] a model they proposed for

textual entailment and Pointer Net [22] a sequence-to-sequence model proposed by

Vinyals et al (2015) to constrain the output tokens to be from the input sequences

The model consists of an LSTM preprocessing layer a match-LSTM layer and an

Answer Pointer layer

We are given a piece of text which we refer to as a passage and a question

related to the passage The passage is represented by matrix P where P is the length

(number of tokens) of the passage and d is the dimensionality of word embeddings

Similarly the question is represented by matrix Q where Q is the length of the question

Our goal is to identify a subsequence from the passage as the answer to the question

22

Figure 3-1 Match-LSTM Model Architecture [20]

LSTM Preprocessing layer They use a standard one-directional LSTM

(Hochreiter amp Schmidhuber 1997) to process the passage and the question separately

as shown below

Match LSTM Layer They applied the match-LSTM model proposed for textual

entailment to their machine comprehension problem by treating the question as a

premise and the passage as a hypothesis The match-LSTM sequentially goes through

the passage At position i of the passage it first uses the standard word-by-word

attention mechanism to obtain attention weight vector as follows

(3-1)

23

where and are parameters to be learned

Answer Pointer Layer The Answer Pointer (Ans-Ptr) layer is motivated by the

Pointer Net introduced by Vinyals et al (2015) [22] The boundary model produces only

the start token and the end token of the answer and then all the tokens between these

two in the original passage are considered to be the answer

When this paper was released back in November 2016 Match-LSTM method

was the state of the art in Question Answering systems and was at the top of the

leaderboard for the SQuAD dataset

R-NET Matching Reading Comprehension with Self-Matching Networks

In this model [23] first the question and passage are processed by a

bidirectional recurrent network (Mikolov et al 2010) separately They then match the

question and passage with gated attention-based recurrent networks obtaining

question-aware representation for the passage On top of that they apply self-matching

attention to aggregate evidence from the whole passage and refine the passage

representation which is then fed into the output layer to predict the boundary of the

answer span

Question and passage encoding First the words are converted to their

respective word-level embeddings and character level embeddings The character-level

embeddings are generated by taking the final hidden states of a bi-directional recurrent

neural network (RNN) applied to embeddings of characters in the token Such

character-level embeddings have been shown to be helpful to deal with out-of-vocab

(OOV) tokens

24

They then use a bi-directional RNN to produce new representation and

of all words in the question and passage respectively

Figure 3-2 The task of Question Answering [23]

Gated Attention-based Recurrent Networks They use a variant of attention-

based recurrent networks with an additional gate to determine the importance of

information in the passage regarding a question Different from the gates in LSTM or

GRU the additional gate is based on the current passage word and its attention-pooling

vector of the question which focuses on the relation between the question and current

passage word The gate effectively model the phenomenon that only parts of the

passage are relevant to the question in reading comprehension and question answering

is utilized in subsequent calculations

25

Self-Matching Attention From the previous step the question aware passage

representation is generated to highlight the important parts of the passage One

problem with such representation is that it has very limited knowledge of context One

answer candidate is often oblivious to important cues in the passage outside its

surrounding window To address this problem the authors propose directly matching

the question-aware passage representation against itself It dynamically collects

evidence from the whole passage for words in passage and encodes the evidence

relevant to the current passage word and its matching question information into the

passage representation

Output Layer They use the same method as Wang amp Jiang (2016b) and use

pointer networks (Vinyals et al 2015) to predict the start and end position of the

answer In addition they use an attention-pooling over the question representation to

generate the initial hidden vector for the pointer network [23]

When the R-Net Model first appeared in the leaderboard in March 2017 it was at

the top with 723 Exact Match and 807 F1 score

Bi-Directional Attention Flow (BiDAF) for Machine Comprehension

BiDAF [24] is a hierarchical multi-stage architecture for modeling the

representations of the context paragraph at different levels of granularity BIDAF

includes character-level word-level and contextual embeddings and uses bi-directional

attention flow to obtain a query-aware context representation Their attention layer is not

used to summarize the context paragraph into a fixed-size vector Instead the attention

is computed for every time step and the attended vector at each time step along with

the representations from previous layers can flow through to the subsequent modeling

layer This reduces the information loss caused by early summarization [24]

26

Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]

Their machine comprehension model is a hierarchical multi-stage process and

consists of six layers

1 Character Embedding Layer maps each word to a vector space using character-level CNNs

2 Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model

3 Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embedding of the words These first three layers are applied to both the query and context They use a LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words We place an LSTM in both directions and concatenate the outputs of the two LSTMs

4 Attention Flow Layer couples the query and context vectors and produces a set of query aware feature vectors for each word in the context

5 Modeling Layer employs a Recurrent Neural Network to scan the context The input to the modeling layer is G which encodes the query-aware representations of context words The output of the modeling layer captures the interaction among the context words conditioned on the query They use two layers of bi-directional LSTM

27

with the output size of d for each direction Hence a matrix is obtained which is passed onto the output layer to predict the answer

6 Output Layer provides an answer to the query They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices by the predicted distributions averaged over all examples [25]

In a further variation of their above work they add a self-attention layer after the

Bi-attention layer to further improve the results The architecture of the model is as

Figure 3-3 The task of Question Answering [25]

28

Summary

In this chapter we reviewed the methods that are fundamental to the state of the

art in Machine Comprehension and for the task of Question Answering We have

reached closed to human level accuracy and this is due to incremental developments

over previous models As we saw Out of Vocabulary (OOV) tokens were handled by

using Character embedding Long term dependency within context passage were

solved using self-attention and many other techniques such as Contextualized vectors

Attention Flow etc were employed to get better results In the next chapter we will see

how we can build on these models and develop further

29

CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING

Although the advancements in Question Answering systems since the release of

the SQuAD dataset has been impressive and the results are getting close to human

level accuracy it is far from being a fool-proof system The models still make mistakes

which would be obvious to a human For example

Passage The Panthers used the San Jose State practice facility and stayed at

the San Jose Marriott The Broncos practiced at Florida State Facility and stayed at the

Santa Clara Marriott

Question At what universitys facility did the Panthers practice

Actual Answer San Jose State

Predicted Answer Florida State Facility

To find out what was leading to the wrong predictions, we wanted to see the attention weights associated with such an example. We plotted the passage-question heat map, which is a 2D matrix where the intensity of each cell signifies the similarity between a passage word and a question word (a minimal plotting sketch is given after the list below). For the above example, we found that while certain words of the question are given high weight, other parts are not: the words 'At', 'facility' and 'practice' are given high attention, but 'Panthers' does not receive high attention. If it had received high attention, then the system would have predicted 'San Jose State' as the right answer. To solve this issue we analyzed the base BiDAF model and proposed adding two things:

1. Bi-Attention and Self-Attention over the Query

2. A second level of attention over the output of (Bi-Attention + Self-Attention) from both Context and Query
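As an illustration of the heat-map analysis described above, here is a minimal matplotlib sketch, assuming the attention layer exposes a passage-by-question similarity matrix; the token lists and the random weights are placeholders, not values from our model.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_attention_heatmap(attn, passage_tokens, question_tokens):
    """Plot a passage x question similarity matrix as a heat map.

    attn: array of shape (len(passage_tokens), len(question_tokens)),
    e.g. the similarity matrix produced by the attention layer.
    """
    fig, ax = plt.subplots(figsize=(6, 8))
    im = ax.imshow(attn, aspect="auto", cmap="viridis")
    ax.set_xticks(range(len(question_tokens)))
    ax.set_xticklabels(question_tokens, rotation=90)
    ax.set_yticks(range(len(passage_tokens)))
    ax.set_yticklabels(passage_tokens)
    ax.set_xlabel("Question words")
    ax.set_ylabel("Passage words")
    fig.colorbar(im, ax=ax, label="attention weight")
    plt.tight_layout()
    plt.show()

# Example with random weights, just to exercise the function
passage = "The Panthers used the San Jose State practice facility".split()
question = "At what university 's facility did the Panthers practice ?".split()
plot_attention_heatmap(np.random.rand(len(passage), len(question)), passage, question)
```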


Multi-Attention BiDAF Model

Our comprehension model is a hierarchical multi-stage process and consists of the following layers:

1. Embedding: Just as in other models, we embed words using pretrained word vectors. We also embed the characters in each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to get character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training.

2. Pre-Process: A shared bi-directional GRU (Cho et al. 2014) is used to map the question and passage embeddings to context-aware embeddings.

3. Context Attention: The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al. 2016) is used to build a query-aware context representation. Let $h_i$ be the vector for context word $i$, $q_j$ be the vector for question word $j$, and $n_q$ and $n_c$ be the lengths of the question and context respectively. We compute attention between context word $i$ and question word $j$ as

$a_{ij} = w_1 \cdot h_i + w_2 \cdot q_j + w_3 \cdot (h_i \odot q_j)$ (4-1)

where $w_1$, $w_2$ and $w_3$ are learned vectors and $\odot$ is element-wise multiplication. We then compute an attended vector $c_i$ for each context token as

$p_{ij} = \frac{\exp(a_{ij})}{\sum_{j'=1}^{n_q} \exp(a_{ij'})}, \qquad c_i = \sum_{j=1}^{n_q} p_{ij}\, q_j$ (4-2)

We also compute a query-to-context vector $q_c$:

$m_i = \max_{1 \le j \le n_q} a_{ij}, \qquad p_i = \frac{\exp(m_i)}{\sum_{i'=1}^{n_c} \exp(m_{i'})}, \qquad q_c = \sum_{i=1}^{n_c} p_i\, h_i$ (4-3)

The final vector computed for each token is built by concatenating $h_i$, $c_i$, $h_i \odot c_i$ and $q_c \odot c_i$. In our model we subsequently pass the result through a linear layer with ReLU activations. (A small numpy sketch of Equations 4-1 to 4-3 is given after this list.)

4. Context Self-Attention: Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU. Then we apply the same attention mechanism, only now between the passage and itself. In this case we do not use query-to-context attention, and we set $a_{ij} = -\infty$ if $i = j$. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so its output is additionally summed with the input.

5. Query Attention: For this part we proceed the same way as in the context attention layer, but calculate the weighted sum of the context words for each query word; thus the output length is the number of query words. We then calculate context-to-query attention in the same way as query-to-context attention in the context attention layer.

Figure 4-1 The modified BiDAF model with multilevel attention

6. Query Self-Attention: This part is done the same way as the context self-attention layer, but on the output of the Query Attention layer.

7. Context Query Bi-Attention + Self-Attention: The outputs of the Context Self-Attention and the Query Self-Attention layers are taken as input, and the same process of bi-attention and self-attention is applied to these inputs.

8. Prediction: In the last layer of our model, a bidirectional GRU is applied, followed by a linear layer that computes answer start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log likelihood of selecting correct start and end tokens.
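To make the attention computation of Equations 4-1 to 4-3 concrete, here is a minimal numpy sketch, assuming h and q are the GRU outputs for the context and question and w1, w2, w3 are the learned weight vectors; the shapes and names are illustrative rather than the code used in our experiments.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bidaf_attention(h, q, w1, w2, w3):
    """Bi-directional attention of Eqs. 4-1 to 4-3 (illustrative shapes assumed).

    h: (n_c, d) context vectors, q: (n_q, d) question vectors,
    w1, w2, w3: learned weight vectors of size (d,).
    """
    # Eq. 4-1: a_ij = w1.h_i + w2.q_j + w3.(h_i * q_j)
    a = (h @ w1)[:, None] + (q @ w2)[None, :] + (h * w3) @ q.T   # (n_c, n_q)

    # Eq. 4-2: context-to-query attended vector c_i for every context token
    p = softmax(a, axis=1)                                       # over question words
    c = p @ q                                                    # (n_c, d)

    # Eq. 4-3: single query-to-context vector q_c
    m = a.max(axis=1)                                            # (n_c,)
    q_c = softmax(m) @ h                                         # (d,)

    # Final per-token representation: [h_i; c_i; h_i*c_i; q_c*c_i]
    q_c_tiled = np.broadcast_to(q_c, h.shape)
    return np.concatenate([h, c, h * c, q_c_tiled * c], axis=1)  # (n_c, 4d)
```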

Having carried out this modification, we were able to solve the wrong example we started with: the multilevel attention model gives the correct output, "San Jose State". We also achieved slightly better scores than the original model, with an F1 score of 85.44 on the SQuAD dev set.

Chatbot Design Using a QA System

Designing a perfect chatbot that passes the Turing test is a fundamental goal for Artificial Intelligence. Although we are many orders of magnitude away from achieving such a goal, domain-specific tasks can be solved with chatbots built from current technology. We had started with a similar objective in mind, i.e., to design a domain-specific chatbot and then generalize to other areas once the first domain-specific objective is achieved robustly. This led us to the fundamental problem of Machine Comprehension and subsequently to the task of Question Answering. Having achieved some degree of success with QA systems, we looked back to see if we could apply our newly acquired knowledge to the task of designing chatbots.

The chatbots made with today's technologies mostly rely on handcrafted techniques such as template matching, which requires anticipating all possible ways a user may articulate his or her requirements and a conversation may unfold. This takes a lot of man-hours when designing a domain-specific system and is still very error prone. In this section we propose a general chatbot design that makes designing a domain-specific chatbot easy and robust at the same time.

Every domain-specific chatbot needs to obtain a set of information from the user and show some results based on the user-specific information obtained. Traditional chatbots use template matching and keyword lookup to determine whether the user has provided the required information. Our idea is to use the Question Answering system in the backend to extract the required information from whatever the user has typed up to this point of the conversation. The information to be extracted can be posed in the form of a set of questions, and the answers obtained from those questions can be used as the parameters to supply the relevant information to the user.

We had chosen flight reservation as our chatbot domain. Our goal was to extract the required information from the user to be able to show him the available flights as per the user's requirements.

Figure 4-2 Flight reservation chatbot's chat window


For a flight reservation task, the booking agent needs to know at minimum the origin city, the destination city and the date of travel to be able to show the available flights. Optional information includes the number of tickets, the passenger's name, one way or round trip, etc.

The minimalistic conversation with the user through the chat window would be as shown above. We had a platform called OneTask on which we wanted to implement our chatbot. The chat interface within the OneTask system looks as follows.

Figure 4-3 Chatbot within OneTask system

The working of the chat system is as follows:

1. Initiation: The user opens the chat window, which starts a session with the chatbot. The chatbot asks 'How may I help you?'

2. User Reply: The user may reply with none of the required information for flight booking or may reply with multiple pieces of information in the same message.

3. User Reply Parsing: The conversation up to this point is treated as a passage, and the internal questions are run on this passage. The questions that are run include:

Where do you want to go?

From where do you want to leave?

When do you want to depart?

4. Parsed responses from the QA model: After running the questions on the QA system with the given conversation, answers are obtained for all of them along with their corresponding confidence values. Since answers will be returned even if the required question has not actually been answered up to this point in the conversation, the confidence values play a crucial role in determining the validity of an answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly; a confidence of 2 to 10 signifies that it may have been answered but the chatbot should verify with the user for correctness; and any confidence below 2 is discarded (a minimal sketch of this thresholding logic is given after this list).

5. Asking remaining questions iteratively: After the parsing, it is checked whether any of the required questions are still unanswered. If so, the chatbot asks the remaining question and the process from steps 3 and 4 is carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request.
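The following is a minimal sketch of the slot-filling loop in steps 3 and 4, assuming a qa_model callable that returns an answer together with a confidence value; the slot names, the function signature and the _tentative convention are assumptions for illustration, while the thresholds (10 and 2) follow the description above.

```python
# Hypothetical slot -> question mapping used by the backend QA model.
REQUIRED_SLOTS = {
    "destination": "Where do you want to go?",
    "origin": "From where do you want to leave?",
    "date": "When do you want to depart?",
}
ACCEPT, VERIFY = 10.0, 2.0

def update_slots(conversation, slots, qa_model):
    """Run the internal questions over the conversation-so-far and fill slots.

    qa_model(passage, question) is assumed to return (answer, confidence).
    Returns the slot names that still need to be asked or confirmed.
    """
    pending = []
    for slot, question in REQUIRED_SLOTS.items():
        if slot in slots:
            continue
        answer, confidence = qa_model(conversation, question)
        if confidence > ACCEPT:
            slots[slot] = answer                  # accept silently
        elif confidence > VERIFY:
            slots[slot + "_tentative"] = answer   # confirm with the user
            pending.append(slot)
        else:
            pending.append(slot)                  # discard, ask explicitly
    return pending
```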

Figure 4-4 The Flow diagram of the Flight booking Chatbot system

Online QA System and Attention Visualization

To be able to test out various examples, we used the BiDAF [24] model for an online demo. One can either choose from the available examples in the drop-down menu or paste in their own passage and question. While this is a useful and interesting system for testing the model in a user-friendly way, we created this system to be able to focus on the wrongly answered samples.


Figure 4-5 QA system interface with attention highlight over candidate answers

The candidate answers are shown with blue highlights according to their confidence values: the higher the confidence, the darker the highlight. The span with the highest confidence value is chosen as the predicted answer. We developed the system to show the attention spread over the candidate answers in order to understand what needs to be done to improve the system. This led us to realize the importance of including the query attention part as well as multilevel attention in the BiDAF model, as described in the first section of this chapter.
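A minimal sketch of the highlighting idea follows, assuming a list of candidate spans with confidences normalized to [0, 1]; the actual demo's markup and rendering code may differ.

```python
def highlight_html(passage_tokens, candidate_spans):
    """Render candidate answers as blue highlights whose opacity tracks confidence.

    candidate_spans: list of (start_idx, end_idx, confidence) with confidence in [0, 1].
    """
    alpha = [0.0] * len(passage_tokens)
    for start, end, conf in candidate_spans:
        for i in range(start, end + 1):
            alpha[i] = max(alpha[i], conf)        # darker means more confident
    pieces = []
    for tok, a in zip(passage_tokens, alpha):
        if a > 0:
            pieces.append(
                f'<span style="background: rgba(0, 0, 255, {a:.2f})">{tok}</span>')
        else:
            pieces.append(tok)
    return " ".join(pieces)
```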


CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION

Future Directions

Having done a thorough analysis of the current methods and the state of the art in Machine Comprehension and the QA task, and having developed systems on top of them, we have a strong sense of what needs to be done to further improve QA models. After observing the wrongly answered samples, we can see that the system is still unable to encode meaning and is picking answers based on statistical patterns in how answers occur in the training examples.

Going by the state-of-the-art models and the ongoing research literature, it is easy to conclude that more features need to be embedded to encode the meaning of words, phrases and sentences. The paper Reinforced Mnemonic Reader for Machine Comprehension [6] encoded POS and NER tags of words along with their word and character embeddings, which gave them better results.

Figure 5-1 An English language semantic parse tree [26]


We have developed a method to encode the syntax parse tree of a sentence that captures not only the POS tags but also the relation of each word within its phrase and the relation of each phrase within the whole sentence, in a hierarchical manner, as sketched below.
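The following is a minimal NLTK-based sketch of the idea, extracting for each word the chain of phrase labels from the root of the parse tree down to its POS tag; it is an illustration of hierarchical syntax features under assumed inputs, not the exact encoding we use.

```python
import nltk

def phrase_path_features(tree_str):
    """For each word, collect the chain of phrase labels from the root down to
    its POS tag, e.g. ('S', 'NP', 'DT') for a determiner inside a subject NP."""
    tree = nltk.Tree.fromstring(tree_str)
    features = []
    for i, word in enumerate(tree.leaves()):
        pos = tree.leaf_treeposition(i)          # tree position of the i-th leaf
        # labels of every subtree on the path from the root to the leaf's POS tag
        path = tuple(tree[pos[:k]].label() for k in range(len(pos)))
        features.append((word, path))
    return features

print(phrase_path_features("(S (NP (DT the) (NN dog)) (VP (VBD barked)))"))
# [('the', ('S', 'NP', 'DT')), ('dog', ('S', 'NP', 'NN')), ('barked', ('S', 'VP', 'VBD'))]
```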

Finally, data augmentation is another way to get better results. One definite way to reduce the errors would be to include in the training data samples similar to those the system is faltering on in the dev set: one could generate examples similar to the failure cases and include them in the training set to obtain better predictions. Another approach would be to train on similar and bigger datasets. Our models were trained on the SQuAD dataset; other datasets, such as TriviaQA, pose a similar question answering task. We could augment the training set of TriviaQA along with SQuAD to have a more robust system that generalizes better and thus has higher accuracy for predicting answer spans.

Conclusion

In this work we have explored the most fundamental techniques that have shaped the current state of the art. We then proposed a minor architectural improvement over an existing model. Furthermore, we developed two applications that use the base model: first, we discussed how a chatbot application can be built using the QA system, and second, we created a web interface where the model can be used for any passage and question. This interface also shows the attention spread over the candidate answers. While our effort to push the state of the art forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will pay high dividends for society at large.


LIST OF REFERENCES

[1] Danqi Chen. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U

[2] Lehnert, Wendy G. "The Process of Question Answering." No. RR-88, Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.

[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension." arXiv preprint arXiv:1705.03551 (2017).

[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ Questions for Machine Comprehension of Text." arXiv preprint arXiv:1606.05250 (2016).

[5] "The Stanford Question Answering Dataset." 2018. rajpurkar.github.io. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/

[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced Mnemonic Reader for Machine Comprehension." CoRR abs/1705.02798 (2017).

[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.

[8] Me For The AI, and Neetesh Mehrotra. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai/

[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.

[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/

[11] Williams, Ronald J., and David Zipser. "A Learning Algorithm for Continually Running Fully Recurrent Neural Networks." Neural Computation 1, no. 2 (1989): 270-280.

[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long Short-Term Memory." Neural Computation 9, no. 8 (1997): 1735-1780.

[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling." arXiv preprint arXiv:1412.3555 (2014).

[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic Regularities in Continuous Space Word Representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.

[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global Vectors for Word Representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.

[16] "Word2Vec Tutorial - The Skip-Gram Model." Chris McCormick. 2016. mccormickml.com. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

[17] Group. 2015. "[Paper Introduction] Bilingual Word Representations With Monolingual …" SlideShare.net. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind

[18] "Attention Mechanism." 2016. blog.heuritech.com. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism/

[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. 2014. "Memory Networks." arXiv preprint arXiv:1410.3916v11. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916

[20] Wang, Shuohang, and Jing Jiang. "Machine Comprehension Using Match-LSTM and Answer Pointer." arXiv preprint arXiv:1608.07905 (2016).

[21] Wang, Shuohang, and Jing Jiang. "Learning Natural Language Inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).

[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer Networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.

[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated Self-Matching Networks for Reading Comprehension and Question Answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.

[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional Attention Flow for Machine Comprehension." arXiv preprint arXiv:1611.01603 (2016).

[25] Clark, Christopher, and Matt Gardner. "Simple and Effective Multi-Paragraph Reading Comprehension." arXiv preprint arXiv:1710.10723 (2017).

[26] "8. Analyzing Sentence Structure." 2018. nltk.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html


BIOGRAPHICAL SKETCH

Purnendu Mukherjee grew up in Kolkata, India, and has had a strong interest in computers since an early age. After high school, he did his B.Sc. in computer science at Ramakrishna Mission Residential College, Narendrapur, followed by an M.Sc. in computer science at St. Xavier's College, Kolkata. He had a strong intuition for and interest in human-like learning systems and wanted to work in this area. He started working at TCS Innovation Labs, Pune, on the application of Natural Language Processing to educational applications. Being deeply passionate about learning systems that mimic the human brain and learn like a human child does, he became increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science at the University of Florida, Gainesville. His academic interests have been focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.

Page 5: QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISMufdcimages.uflib.ufl.edu/UF/E0/05/23/73/00001/MUKHERJEE_P.pdf · QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM By PURNENDU

5

TABLE OF CONTENTS page

ACKNOWLEDGMENTS 4

LIST OF FIGURES 6

LIST OF ABBREVIATIONS 7

ABSTRACT 8

CHAPTER

1 INTRODUCTION 9

2 THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM 13

Neural Networks 13 Convolutional Neural Network 14 Recurrent Neural Networks (RNN) 15 Word Embedding 16 Attention Mechanism 17 Memory Networks 18

3 LITERATURE REVIEW AND STATE OF THE ART 21

Machine Comprehension Using Match-LSTM and Answer Pointer 21 R-NET Matching Reading Comprehension with Self-Matching Networks 23 Bi-Directional Attention Flow (BiDAF) for Machine Comprehension 25 Summary 28

4 MULTI-ATTENTION QUESTION ANSWERING 29

Multi-Attention BiDAF Model 30 Chatbot Design Using a QA System 32 Online QA System and Attention Visualization 35

5 FUTURE DIRECTIONS AND CONCLUSION 37

Future Directions 37 Conclusion 38

LIST OF REFERENCES 39

BIOGRAPHICAL SKETCH 42

6

LIST OF FIGURES

Figure page 1-1 The task of Question Answering 10

2-1 Simple and Deep Learning Neural Networks 13

2-2 Convolutional Neural Network Architecture 14

2-3 Semantic relation between words in vector space 17

2-4 Attention Mechanism flow 18

2-5 QA example on Lord of the Rings using Memory Networks 20

3-1 Match-LSTM Model Architecture 22

3-2 The task of Question Answering 24

3-3 Bi-Directional Attention Flow Model Architecture 26

4-1 The modified BiDAF model with multilevel attention 31

4-2 Flight reservation chatbotrsquos chat window 33

4-3 Chatbot within OneTask system 34

4-4 The Flow diagram of the Flight booking Chatbot system 35

4-5 QA system interface with attention highlight over candidate answers 36

5-1 An English language semantic parse tree 37

7

LIST OF ABBREVIATIONS

BiDAF Bi-Directional Attention Flow

CNN Convolutional Neural Network

GRU Gated Recurrent Units

LSTM Long Short Term Memory

NLP Natural Language Processing

NLU Natural Language Understanding

RNN Recurrent Neural Network

8

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the

Requirements for the Degree of Master of Science

QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM

By

Purnendu Mukherjee

May 2018

Chair Xiaolin Li Major Computer Science

Question Answering(QA) systems have had rapid growth since the last 3 years

and are close to reaching human level accuracy One of the fundamental reason for this

growth has been the use of attention mechanism along with other methods of Deep

Learning But just as with other Deep Learning methods some of the failure cases are

so obvious that it convinces us that there is a lot to improve upon In this work we first

did a literature review of the State of the Art and fundamental models in QA systems

Next we introduce an architecture which has shown improvement in the targeted area

We then introduce a general method to enable easy design of domain-specific Chatbot

applications and present a proof of the concept with the same method Finally we

present an easy to use Question Answering interface with attention visualization on the

passage We also propose a method to improve the current state of the art as a part of

our ongoing work

9

CHAPTER 1 INTRODUCTION

Teaching machines to read and understand human natural language is our

central and long standing goal for Natural Language Processing and Artificial

Intelligence in general The positive side of machines being able to reason and

comprehend human language could be enormous We are beginning to see how

commercial systems like Alexa Siri Google Now etc are being widely used as Speech

Recognition improved to human levels While speech recognition systems can

transcribe speech to text comprehension of that transcribed text is another task which

is currently a major focus for both academia and industry because of its possible

applications Moreover all the text information available throughout the internet is a

major reason why machine comprehension of text is such an important task

With the growth of Deep Learning methods in the last few years the field of

Machine Comprehension and Natural Language Processing(NLP) in general has

experienced a revolution While the traditional methods and practices are still prevalent

and forms the basis of our deep understanding of languages Deep Learning methods

have surpassed all traditional NLP and Machine Learning methods by a significant

margin and are currently driving the growth of the field

To be able to build a system that can understand human text we need to first

ask ourselves how can we evaluate the machinersquos comprehension ability We had

initially set a goal to build a Chabot for a specific domain and generalize to other topics

as we go ahead While developing the system we found out the necessity for reading

comprehension and how to measure it We finally found the answer with Questions

Answering systems

10

Just like how we human beings are tested for our ability of language

understanding with questions we should ask machines similar questions about what it

has just read The performance of the system on such question answering task will let

us evaluate how much the machine is able to reason about what it just read [1]

Reading comprehension has been a topic of Natural Language Understanding since the

1970s In 1977 Wendy Lehnert said in his doctoral thesis ndash ldquoOnly when we can ask a

program to answer questions about what it reads will we be able to begin to access that

programrsquos comprehensionrdquo [2]

Figure 1-1 The task of Question Answering

To achieve this task the NLP community has developed various datasets such as

CNN Daily Mail WebQuestions SQuAD TriviaQA [3] etc For our purpose we chose

SQuAD which stands for Stanford Question Answering Dataset [4] SQuAD consists of

questions posed by crowdworkers on a set of Wikipedia articles where the answer to

every question is a segment of text or span from the corresponding reading passage

With 100000+ question-answer pairs on 500+ articles SQuAD is significantly larger

than previous reading comprehension datasets [5]

An example from the SQuAD dataset is as follows

11

Passage Tesla later approached Morgan to ask for more funds to build a more

powerful transmitter When asked where all the money had gone Tesla responded by

saying that he was affected by the Panic of 1901 which he (Morgan) had caused

Morgan was shocked by the reminder of his part in the stock market crash and by

Teslarsquos breach of contract by asking for more funds Tesla wrote another plea to

Morgan but it was also fruitless Morgan still owed Tesla money on the original

agreement and Tesla had been facing foreclosure even before construction of the

tower began

Question On what did Tesla blame for the loss of the initial money

Answer Panic of 1901

As we started exploring the QA task we faced several challenges Some of them

we could solve with the help of other research and some of the challenges still exist in

the domain

bull Out of Vocabulary words

bull Multi-sentence reasoning may be required

bull There may exist several candidate answers

bull Optimizing the Exact Match (EM) metric may fail when the answer boundary is fuzzy or too long such as the answer of the ldquowhyrdquo query [6]

bull ldquoOne-hoprdquo prediction may fail to fully understand the query [6]

bull Fail to fully capture the long-distance contextual interaction between parts of the context by only using LSTMGRU

bull Current models are unable to capture semantics of the passage

In the upcoming chapters we will first briefly review the basics necessary for

understanding the models then we will delve deep into the fundamental models that

12

have shaped the current State of the Art models then we will discuss our contribution in

terms of architecture and applications and finally conclude with future directions

13

CHAPTER 2 THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM

To build a question answering system one needs to be familiar with the

fundamental deep learning models such as Recurrent Neural Networks (RNN) Long

Short Term Memory (LSTM) etc In this chapter we will have an overview on these

techniques and see how they all connect to building a question answering system

Neural Networks

What makes Deep Learning so intriguing is that it has close resemblance with

the working of the mammalian brain or at least draws inspiration from it The same can

be said for Artificial Neural networks [7] which consists of a system of interconnected

units called lsquoneuronsrsquo that take input from similar units and produces a single output

Figure 2-1 Simple and Deep Learning Neural Networks [8]

The connection from one neuron to another can be weighted based on the input data

which enables the network to tune itself to produce a certain output based on the input

This is the learning process which is achieved through backpropagation which is a

system of propagating the error from the output layer to the previous layers

14

Convolutional Neural Network

The first wave of deep learningrsquos success was brought by Convolutional Neural

Networks (CNN) [9] when this was the technique used by the winning team of ImageNet

competition in 2012 CNNs are deep artificial neural networks (ANN) that can be used to

classify images cluster them by similarity and perform object recognition within scenes

It can be used to detect and identify faces people signs or any other visual data

Figure 2-2 Convolutional Neural Network Architecture [10]

There are primarily four operations in a standard CNN model (as shown in Fig above)

1 Convolution - The primary purpose of Convolution in the ConvNet (above) is to extract features from the input image The spatial relationship between pixels ie the image features are preserved and learned by the convolution using small squares of input data

2 Non-Linearity (ReLU) ndash Rectified Linear Unit (ReLU) is a non-linear operation that carries out an element wise operation on each pixel This operation replaces the negative pixel values in the feature map by zero

3 Pooling or Sub Sampling - Spatial Pooling reduces the dimensionality of each feature map but retains the most important information For max pooling the largest value in the square window is taken and rest are dropped Other types of pooling are Average Sum etc

4 Classification (Fully Connected Layer) - The Fully Connected layer is a traditional Multi-Layer Perceptron as described before that uses a softmax activation function in the output layer The high-level features of the image are encoded by the convolutional and pooling layers which is then fed to the fully connected layer which then uses these features for classifying the input image into various classes based on the training dataset [10]

15

When a new image is fed into the CNN model all the above-mentioned steps are

carried out (forward propagation) and a probability distribution is achieved on the set of

output classes With a large enough training dataset the network will learn and

generalize well enough to classify new images into their correct classes

Recurrent Neural Networks (RNN)

Whenever we want to predict or encode sequential data RNN [11] is our go to

method RNNs perform the same task for every element of a sequence where the

output of each element depends on previous computations thus the recurrence In

practice RNNs are unable to retain long-term dependencies and can look back only a

few steps because of the vanishing gradient problem

(2-1)

A solution to the dependency problem is to use gated cells such as LSTM [11] or

GRU [13] These cells pass on important information to the next cells while ignoring

non-important ones The gated units in a GRU block are

bull Update Gate ndash Computed based on current input and hidden state

(2-2)

bull Reset Gate ndash Calculated similarly but with different weights

(2-3)

bull New memory content - (2-4)

If reset gate unit is ~0 then previous memory is ignored and only new

information is kept

16

Final memory at current time step combines previous and current time steps

(2-5)

While the GRU is computationally efficient the LSTM on the other hand is a

general case where there are three gates as follows

bull Input Gate ndash What new information to add to the current cell state

bull Forget Gate ndash How much information from previous states to be kept

bull Output gate ndash How much info should be sent to the next states

Just like GRU the current cell state is a sum of the previous cell state but

weighted by the forget gate and the new value is added which is weighted by the input

gate Based on the cell state the output gate regulates the final output

Word Embedding

Computation or gradients can be applied on numbers and not on words or letters

So first we need to convert words into their corresponding numerical formation before

feeding into a deep learning model In general there are two types of word embedding

Frequency based (which constitutes count vectors tf-idf and co-occurrence vectors)

and Prediction based With frequency based embedding the order of the words are not

preserved and works as a bag of words model Whereas with prediction based model

the order of words or locality of words are taken into consideration to generate the

numerical representation of the word Within this prediction based category there are

two fundamental techniques called Continuous Bag of Words (CBOW) and Skip Gram

Model which forms the basis for word2vec [14] and GloVe [15]

The basic intuition behind word2vec is that if two different words have very

similar ldquocontextsrdquo (that is what words are likely to appear around them) then the model

17

will produce similar vector for those words Conversely if the two word vectors are

similar then the network will produce similar context predictions for the same two words

For examples synonyms like ldquointelligentrdquo and ldquosmartrdquo would have very similar contexts

Or that words that are related like ldquoenginerdquo and ldquotransmissionrdquo would probably have

similar contexts as well [16] Plotting the word vectors learned by a word2vec over a

large corpus we could find some very interesting relationships between words

Figure 2-3 Semantic relation between words in vector space [17]

Attention Mechanism

We as humans put our attention to things are important or are relevant in a

context For example when asked a question from a passage we try to find the most

relevant part of the passage the question is relevant with and then reason from our

understanding of that part of the passage The same idea applies for attention

mechanism in Deep Learning It is used to identify the specific parts of a given context

to which the current question is relevant to

Formally put the techniques take n arguments y_1 y_n (in our case the

passage having words say y_i through h_i) and a question word say q It returns a

vector z which is supposed to be the laquo summary raquo of the y_i focusing on information

linked to the question q More formally it returns a weighted arithmetic mean of the y_i

18

and the weights are chosen according the relevance of each y_i given the context c

[18]

Figure 2-4 Attention Mechanism flow [18]

Memory Networks

Convolutional Neural Networks and Recurrent Neural Networks which does

capture how we form our visual and sequential memories their memory (encoded by

hidden states and weights) were typically too small and was not compartmentalized

enough to accurately remember facts from the past (knowledge is compressed into

dense vectors) [19]

Deep Learning needed to cultivate a methodology that preserved memories as

they are such that it wonrsquot be lost in generalization and recalling exact words or

sequence of events would be possible mdash something computers are already good at This

effort led us to Memory Networks [19] published at ICLR 2015 by Facebook AI

Research

This paper provides a basic framework to store augment and retrieve memories

while seamlessly working with a Recurrent Neural Network architecture The memory

19

network consists of a memory m (an array of objects 1 indexed by m i) and four

(potentially learned) components I G O and R as follows

I (input feature map) mdash converts the incoming input to the internal feature

representation either a sparse or dense feature vector like that from word2vec or

GloVe

G (generalization) mdash updates old memories given the new input They call this

generalization as there is an opportunity for the network to compress and generalize its

memories at this stage for some intended future use The analogy Irsquove been talking

before

O (output feature map) mdash produces a new output (in the feature representation

space) given the new input and the current memory state This component is

responsible for performing inference In a question answering system this part will

select the candidate sentences (which might contain the answer) from the story

(conversation) so far

R (response) mdash converts the output into the response format desired For

example a textual response or an action In the QA system described this component

finds the desired answer and then converts it from feature representation to the actual

word

This model is a fully supervised model meaning all the candidate sentences from

which the answer could be found are marked during training phase and can also be

termed as lsquohard attentionrsquo

The authors tested out the QA system on various literature including Lord of the

Rings

20

Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]

21

CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART

There has been rapid progress since the release of the SQuAD dataset and

currently the best ensemble models are close to human level accuracy in machine

comprehension This is due to the various ingenious methods which solves some of the

problems with the previous methods Out of Vocabulary tokens were handled by using

Character embedding Long term dependency within context passage were solved

using self-attention And many other techniques such as Contextualized vectors History

of Words Attention Flow etc In this section we will have a look at the some of the most

important models that were fundamental to the progress of Questions Answering

Machine Comprehension Using Match-LSTM and Answer Pointer

In this paper [20] the authors propose an end-to-end neural architecture for the

QA task The architecture is based on match-LSTM [21] a model they proposed for

textual entailment and Pointer Net [22] a sequence-to-sequence model proposed by

Vinyals et al (2015) to constrain the output tokens to be from the input sequences

The model consists of an LSTM preprocessing layer a match-LSTM layer and an

Answer Pointer layer

We are given a piece of text which we refer to as a passage and a question

related to the passage The passage is represented by matrix P where P is the length

(number of tokens) of the passage and d is the dimensionality of word embeddings

Similarly the question is represented by matrix Q where Q is the length of the question

Our goal is to identify a subsequence from the passage as the answer to the question

22

Figure 3-1 Match-LSTM Model Architecture [20]

LSTM Preprocessing layer They use a standard one-directional LSTM

(Hochreiter amp Schmidhuber 1997) to process the passage and the question separately

as shown below

Match LSTM Layer They applied the match-LSTM model proposed for textual

entailment to their machine comprehension problem by treating the question as a

premise and the passage as a hypothesis The match-LSTM sequentially goes through

the passage At position i of the passage it first uses the standard word-by-word

attention mechanism to obtain attention weight vector as follows

(3-1)

23

where and are parameters to be learned

Answer Pointer Layer The Answer Pointer (Ans-Ptr) layer is motivated by the

Pointer Net introduced by Vinyals et al (2015) [22] The boundary model produces only

the start token and the end token of the answer and then all the tokens between these

two in the original passage are considered to be the answer

When this paper was released back in November 2016 Match-LSTM method

was the state of the art in Question Answering systems and was at the top of the

leaderboard for the SQuAD dataset

R-NET Matching Reading Comprehension with Self-Matching Networks

In this model [23] first the question and passage are processed by a

bidirectional recurrent network (Mikolov et al 2010) separately They then match the

question and passage with gated attention-based recurrent networks obtaining

question-aware representation for the passage On top of that they apply self-matching

attention to aggregate evidence from the whole passage and refine the passage

representation which is then fed into the output layer to predict the boundary of the

answer span

Question and passage encoding First the words are converted to their

respective word-level embeddings and character level embeddings The character-level

embeddings are generated by taking the final hidden states of a bi-directional recurrent

neural network (RNN) applied to embeddings of characters in the token Such

character-level embeddings have been shown to be helpful to deal with out-of-vocab

(OOV) tokens

24

They then use a bi-directional RNN to produce new representation and

of all words in the question and passage respectively

Figure 3-2 The task of Question Answering [23]

Gated Attention-based Recurrent Networks They use a variant of attention-

based recurrent networks with an additional gate to determine the importance of

information in the passage regarding a question Different from the gates in LSTM or

GRU the additional gate is based on the current passage word and its attention-pooling

vector of the question which focuses on the relation between the question and current

passage word The gate effectively model the phenomenon that only parts of the

passage are relevant to the question in reading comprehension and question answering

is utilized in subsequent calculations

25

Self-Matching Attention From the previous step the question aware passage

representation is generated to highlight the important parts of the passage One

problem with such representation is that it has very limited knowledge of context One

answer candidate is often oblivious to important cues in the passage outside its

surrounding window To address this problem the authors propose directly matching

the question-aware passage representation against itself It dynamically collects

evidence from the whole passage for words in passage and encodes the evidence

relevant to the current passage word and its matching question information into the

passage representation

Output Layer They use the same method as Wang amp Jiang (2016b) and use

pointer networks (Vinyals et al 2015) to predict the start and end position of the

answer In addition they use an attention-pooling over the question representation to

generate the initial hidden vector for the pointer network [23]

When the R-Net Model first appeared in the leaderboard in March 2017 it was at

the top with 723 Exact Match and 807 F1 score

Bi-Directional Attention Flow (BiDAF) for Machine Comprehension

BiDAF [24] is a hierarchical multi-stage architecture for modeling the

representations of the context paragraph at different levels of granularity BIDAF

includes character-level word-level and contextual embeddings and uses bi-directional

attention flow to obtain a query-aware context representation Their attention layer is not

used to summarize the context paragraph into a fixed-size vector Instead the attention

is computed for every time step and the attended vector at each time step along with

the representations from previous layers can flow through to the subsequent modeling

layer This reduces the information loss caused by early summarization [24]

26

Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]

Their machine comprehension model is a hierarchical multi-stage process and

consists of six layers

1 Character Embedding Layer maps each word to a vector space using character-level CNNs

2 Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model

3 Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embedding of the words These first three layers are applied to both the query and context They use a LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words We place an LSTM in both directions and concatenate the outputs of the two LSTMs

4 Attention Flow Layer couples the query and context vectors and produces a set of query aware feature vectors for each word in the context

5 Modeling Layer employs a Recurrent Neural Network to scan the context The input to the modeling layer is G which encodes the query-aware representations of context words The output of the modeling layer captures the interaction among the context words conditioned on the query They use two layers of bi-directional LSTM

27

with the output size of d for each direction Hence a matrix is obtained which is passed onto the output layer to predict the answer

6 Output Layer provides an answer to the query They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices by the predicted distributions averaged over all examples [25]

In a further variation of their above work they add a self-attention layer after the

Bi-attention layer to further improve the results The architecture of the model is as

Figure 3-3 The task of Question Answering [25]

28

Summary

In this chapter we reviewed the methods that are fundamental to the state of the

art in Machine Comprehension and for the task of Question Answering We have

reached closed to human level accuracy and this is due to incremental developments

over previous models As we saw Out of Vocabulary (OOV) tokens were handled by

using Character embedding Long term dependency within context passage were

solved using self-attention and many other techniques such as Contextualized vectors

Attention Flow etc were employed to get better results In the next chapter we will see

how we can build on these models and develop further

29

CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING

Although the advancements in Question Answering systems since the release of

the SQuAD dataset has been impressive and the results are getting close to human

level accuracy it is far from being a fool-proof system The models still make mistakes

which would be obvious to a human For example

Passage The Panthers used the San Jose State practice facility and stayed at

the San Jose Marriott The Broncos practiced at Florida State Facility and stayed at the

Santa Clara Marriott

Question At what universitys facility did the Panthers practice

Actual Answer San Jose State

Predicted Answer Florida State Facility

To find out what is leading to the wrong predictions we wanted to see the

attention weights associated with such an example We plotted the passage and

question heat map which is a 2D matrix where the intensity of each cell signifies the

similarity between a passage word and a question word For the above example we

found out that while certain words of the question are given high weightage other parts

are not The words lsquoAtrsquo lsquofacilityrsquo lsquopracticersquo are given high attention but lsquoPanthersrsquo does

not receive high attention If it had received high attention then the system would have

predicted lsquoSan Jose Statersquo as the right answer To solve this issue we analyzed the

base BiDAF model and proposed adding two things

1 Bi-Attention and Self-Attention over Query

2 Second level attention over the output of (Bi-Attention + Self-Attention) from both Context and Query

30

Multi-Attention BiDAF Model

Our comprehension model is a hierarchical multi-stage process and consists of

the following layers

1 Embedding Just as all other models we embed words using pretrained word vectors We also embed the characters in each word into size 20 vectors which are learned and run a convolution neural network followed by max-pooling to get character-derived embeddings for each word The character-level and word-level embeddings are then concatenated and passed to the next layer We do not update the word embeddings during training

2 Pre-Process A shared bi-directional GRU (Cho et al 2014) is used to map the question and passage embeddings to context aware embeddings

3 Context Attention The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al 2016) is used to build a query-aware context representation Let h_i be the vector for context word i q_j be the vector for question word j and n_q and n_c be the lengths of the question and context respectively We compute attention between context word i and question word j as

(4-1)

where w1 w2 and w3 are learned vectors and is element-wise multiplication We then compute an attended vector c_i for each context token as

(4-2)

We also compute a query-to-context vector q_c

(4-3)

31

The final vector computed for each token is built by concatenating

and In our model we subsequently pass the result through a linear layer with ReLU activations

4 Context Self-Attention Next we use a layer of residual self-attention The input is passed through another bi-directional GRU Then we apply the same attention mechanism only now between the passage and itself In this case we do not use query-to context attention and we set aij = 1048576inf if i = j As before we pass the concatenated output through a linear layer with ReLU activations This layer is applied residually so this output is additionally summed with the input

5 Query Attention For this part we do the same way as context attention but calculate the weighted sum of the context words for each query word Thus we get the length as number of query words Then we calculate context to query similar to query to context in context attention layer

Figure 4-1 The modified BiDAF model with multilevel attention

6 Query Self-Attention This part is done the same way as the context self-attention layer but from the output of the Query Attention layer

32

7 Context Query Bi-Attention + Self-Attention The output of the Context self-attention and the Query Self-Attention layers are taken as input and the same process for Bi-Attention and self-attention is applied on these inputs

8 Prediction In the last layer of our model a bidirectional GRU is applied followed by a linear layer that computes answer start scores for each token The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores The softmax operation is applied to the start and end scores to produce start and end probabilities and we optimize the negative log likelihood of selecting correct start and end tokens

Having carried out this modification we were able to solve the wrong example we

started with The multilevel attention model gives the correct output as ldquoSan Jose

Staterdquo Also we achieved slightly better scores than the original model with a F1 score

of 8544 on the SQuAD dev dataset

Chatbot Design Using a QA System

Designing a perfect chatbot that passes the Turing test is a fundamental goal for

Artificial Intelligence Although we are many order to magnitudes away from achieving

such a goal domain specific tasks can be solved with chatbots made from current

technology We had started our goal with a similar objective in mind ie to design a

domain specific chatbot and then generalize to other areas as it is able to achieve the

first domain specific objective robustly This led us to the fundamental problem of

Machine Comprehension and subsequently to the task of Question Answering Having

achieved some degree of success with QA systems we looked back if we could apply

our newly acquired knowledge in the task of designing Chatbots

The chatbots made with todayrsquos technologies are mostly handcrafted techniques

such as template matching that requires anticipating all possible ways a user may

articulate his requirements and a conversation may occur This requires a lot of man

hours for designing a domain specific system and is still very error prone In this section

33

we propose a general Chatbot design that would make the designing of a domain

specific chatbot very easy and robust at the same time

Every domain specific chatbot needs to obtain a set of information from the user

and show some results based on the user specific information obtained The traditional

chatbots use template matching and keywords lookup to determine if the user has

provided the required information Our idea is to use the Question Answering system in

the backend to extract out the required information from whatever the user has typed

until this point of the conversation The information to be extracted can be posed in the

form of a set of questions and the answers obtained from those questions can be used

as the parameters to supply the relevant information to the user

We had chosen our chatbot domain as the flight reservation system Our goal

was to extract the required information from the user to be able to show him the

available flights as per the userrsquos requirements

Figure 4-2 Flight reservation chatbotrsquos chat window

34

For a flight reservation task the booking agent needs to know the origin city

destination city and date of travel at minimum to be able to show the available flights

Optional information includes the number of tickets passengerrsquos name one way or

round trip etc

The minimalistic conversation with the user through the chat window would be as

shown above We had a platform called OneTask on which we wanted to implement our

chat bot The chat interface within the OneTask system looks as follows

Figure 4-3 Chatbot within OneTask system

The working of the chat system is as follows ndash

1 Initiation The user opens the chat window which starts a session with the chat bot The chat bot asks lsquoHow may I help yoursquo

2 User Reply The user may reply with none of the required information for flight booking or may reply with multiple information in the same message

3 User Reply parsing The conversation up to this point is treated as a passage and the internal questions are run on this passage So the four questions that are run are

Where do you want to go

35

From where do you want to leave

When do you want to depart

4 Parsed responses from QA model After running the questions on the QA system with the given conversation answers are obtained for all along with their corresponding confidence values Since answers will be obtained even if the required question was not answered up to this point in the conversation the confidence values plays a crucial role in determining the validity of the answer For this purpose we determined ranges of confidence values for validation A confidence value above 10 signifies the question has been answered correctly A confidence of 2 ndash 10 signifies that it may have been answered but should verify with the user for correctness and any confidence below 2 is discarded

5 Asking remaining questions iteratively After the parsing it is checked if any of the required questions are still unanswered If so the chatbot asks the remaining question and the process from 3 ndash 4 is carried out iteratively Once all the questions have been answered the user is shown with the available flight options as per his request

Figure 4-4 The Flow diagram of the Flight booking Chatbot system

Online QA System and Attention Visualization

To be able to test out various examples we used the BiDAF [23] model for an

online demo One can either choose from the available examples from the drop-down

menu or paste their own passage and examples While this is a useful and interesting

system to test the model in a user-friendly way we created this system to be able to

focus on the wrong samples

36

Figure 4-5 QA system interface with attention highlight over candidate answers

The candidate answers are shown with the blue highlights as per their

confidence values The higher the confidence the darker the answer The highest

confidence value is chosen as the predicted answer We developed the system to show

attention spread of the candidate answers to realize what needs to be done to improve

the system This led us to realize the importance of including the query attention part as

well as multilevel attention on the BiDAF model as described in the first section of this

chapter

37

CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION

Future Directions

Having done a thorough analysis of the current methods and state of the arts in

Machine Comprehension and for the QA task and developing systems on it we have

achieved a strong sense of what needs to be done to further improve the QA models

After observing the wrong samples we can see that the system is still unable to encode

meaning and is picking answers based on statistical patterns of occurrence of answers

on training examples

As per the State of the Art models and analyzing the ongoing research literature

it is easy to conclude that more features need to be embedded to encode the meaning

of words phrases and sentences A paper called Reinforced Mnemonic Reader for

Machine Comprehension encoded POS and NER tags of words along with their word

and character embedding This gave them better results

Figure 5-1 An English language semantic parse tree [26]

38

We have developed a method to encode the syntax parse tree of a sentence that

not only encodes the post tags but the relation of the word within the phrase and the

relation of the phrase within the whole sentence in a hierarchical manner

Finally, data augmentation is another way to get better results. One definite way to reduce errors would be to include, in the training data, samples similar to those on which the system falters in the dev set: one could generate examples similar to the failure cases and add them to the training set to obtain better predictions. Another approach would be to train on similar and bigger datasets. Our models were trained on the SQuAD dataset, but other datasets, such as TriviaQA, pose a similar question answering task. We could augment the SQuAD training set with TriviaQA to obtain a more robust system that generalizes better and thus predicts answer spans with higher accuracy. A minimal sketch of such a merge is shown below.
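As a minimal sketch of this augmentation, assuming TriviaQA has first been converted into SQuAD-style JSON (the file names below are hypothetical), the two training sets could simply be concatenated:

```python
# Sketch: merge two SQuAD-format training files into one augmented training set.
# Assumes both files follow the SQuAD JSON layout: {"version": ..., "data": [...]}.
import json

def merge_squad_style(path_a, path_b, out_path):
    with open(path_a) as f_a, open(path_b) as f_b:
        squad, other = json.load(f_a), json.load(f_b)
    squad["data"].extend(other["data"])          # concatenate the article lists
    with open(out_path, "w") as f_out:
        json.dump(squad, f_out)

# Hypothetical file names
merge_squad_style("train-v1.1.json", "triviaqa-squad-format.json", "train-augmented.json")
```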

Conclusion

In this work we have explored the most fundamental techniques that have shaped the current state of the art. We then proposed a minor architectural improvement over an existing model. Furthermore, we developed two applications that use the base model: first, we described how a chatbot application can be built using the QA system, and second, we created a web interface where the model can be run on any passage and question. This interface also shows the attention spread over the candidate answers. While our effort to push the state of the art forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will pay high dividends for society at large.


LIST OF REFERENCES

[1] Danqi Chen. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U

[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.

[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension." arXiv preprint arXiv:1705.03551 (2017).

[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ questions for machine comprehension of text." arXiv preprint arXiv:1606.05250 (2016).

[5] "The Stanford Question Answering Dataset." 2018. rajpurkar.github.io. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/

[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced mnemonic reader for machine comprehension." CoRR abs/1705.02798 (2017).

[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.

[8] Me For The AI, and Neetesh Mehrotra. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai/

[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.

[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/

[11] Williams, Ronald J., and David Zipser. "A learning algorithm for continually running fully recurrent neural networks." Neural Computation 1, no. 2 (1989): 270-280.

[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9, no. 8 (1997): 1735-1780.

[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv preprint arXiv:1412.3555 (2014).

[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic regularities in continuous space word representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.

[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.

[16] "Word2Vec Tutorial - The Skip-Gram Model." 2016. Chris McCormick. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

[17] Group. 2015. "[Paper Introduction] Bilingual Word Representations With Monolingual …." SlideShare.net. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind

[18] "Attention Mechanism." 2016. blog.heuritech.com. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism/

[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. 2014. "Memory Networks." arXiv preprint arXiv:1410.3916v11. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916

[20] Wang, Shuohang, and Jing Jiang. "Machine comprehension using match-LSTM and answer pointer." arXiv preprint arXiv:1608.07905 (2016).

[21] Wang, Shuohang, and Jing Jiang. "Learning natural language inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).

[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.

[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated self-matching networks for reading comprehension and question answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.

[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional attention flow for machine comprehension." arXiv preprint arXiv:1611.01603 (2016).

[25] Clark, Christopher, and Matt Gardner. "Simple and effective multi-paragraph reading comprehension." arXiv preprint arXiv:1710.10723 (2017).

[26] "8. Analyzing Sentence Structure." 2018. nltk.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html


BIOGRAPHICAL SKETCH

Purnendu Mukherjee grew up in Kolkata, India, and has had a strong interest in computers since an early age. After high school, he did his B.Sc. in computer science from Ramakrishna Mission Residential College, Narendrapur, followed by an M.Sc. in computer science from St. Xavier's College, Kolkata. He had a strong intuition about, and interest in, human-like learning systems and wanted to work in this area. He started working at TCS Innovation Labs, Pune, on the application of Natural Language Processing to educational applications. Deeply passionate about learning systems that mimic the human brain and learn the way a human child does, he became increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science from the University of Florida, Gainesville. His academic interests have been focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.

  • ACKNOWLEDGMENTS
  • LIST OF FIGURES
  • LIST OF ABBREVIATIONS
  • INTRODUCTION
  • THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
    • Neural Networks
    • Convolutional Neural Network
    • Recurrent Neural Networks (RNN)
    • Word Embedding
    • Attention Mechanism
    • Memory Networks
      • LITERATURE REVIEW AND STATE OF THE ART
        • Machine Comprehension Using Match-LSTM and Answer Pointer
        • R-NET Matching Reading Comprehension with Self-Matching Networks
        • Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
        • Summary
          • MULTI-ATTENTION QUESTION ANSWERING
            • Multi-Attention BiDAF Model
            • Chatbot Design Using a QA System
            • Online QA System and Attention Visualization
              • FUTURE DIRECTIONS AND CONCLUSION
                • Future Directions
                • Conclusion
                  • LIST OF REFERENCES
                  • BIOGRAPHICAL SKETCH
Page 6: QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISMufdcimages.uflib.ufl.edu/UF/E0/05/23/73/00001/MUKHERJEE_P.pdf · QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM By PURNENDU

6

LIST OF FIGURES

Figure page 1-1 The task of Question Answering 10

2-1 Simple and Deep Learning Neural Networks 13

2-2 Convolutional Neural Network Architecture 14

2-3 Semantic relation between words in vector space 17

2-4 Attention Mechanism flow 18

2-5 QA example on Lord of the Rings using Memory Networks 20

3-1 Match-LSTM Model Architecture 22

3-2 The task of Question Answering 24

3-3 Bi-Directional Attention Flow Model Architecture 26

4-1 The modified BiDAF model with multilevel attention 31

4-2 Flight reservation chatbotrsquos chat window 33

4-3 Chatbot within OneTask system 34

4-4 The Flow diagram of the Flight booking Chatbot system 35

4-5 QA system interface with attention highlight over candidate answers 36

5-1 An English language semantic parse tree 37

7

LIST OF ABBREVIATIONS

BiDAF Bi-Directional Attention Flow

CNN Convolutional Neural Network

GRU Gated Recurrent Units

LSTM Long Short Term Memory

NLP Natural Language Processing

NLU Natural Language Understanding

RNN Recurrent Neural Network

8

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the

Requirements for the Degree of Master of Science

QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM

By

Purnendu Mukherjee

May 2018

Chair Xiaolin Li Major Computer Science

Question Answering(QA) systems have had rapid growth since the last 3 years

and are close to reaching human level accuracy One of the fundamental reason for this

growth has been the use of attention mechanism along with other methods of Deep

Learning But just as with other Deep Learning methods some of the failure cases are

so obvious that it convinces us that there is a lot to improve upon In this work we first

did a literature review of the State of the Art and fundamental models in QA systems

Next we introduce an architecture which has shown improvement in the targeted area

We then introduce a general method to enable easy design of domain-specific Chatbot

applications and present a proof of the concept with the same method Finally we

present an easy to use Question Answering interface with attention visualization on the

passage We also propose a method to improve the current state of the art as a part of

our ongoing work

9

CHAPTER 1 INTRODUCTION

Teaching machines to read and understand human natural language is our

central and long standing goal for Natural Language Processing and Artificial

Intelligence in general The positive side of machines being able to reason and

comprehend human language could be enormous We are beginning to see how

commercial systems like Alexa Siri Google Now etc are being widely used as Speech

Recognition improved to human levels While speech recognition systems can

transcribe speech to text comprehension of that transcribed text is another task which

is currently a major focus for both academia and industry because of its possible

applications Moreover all the text information available throughout the internet is a

major reason why machine comprehension of text is such an important task

With the growth of Deep Learning methods in the last few years the field of

Machine Comprehension and Natural Language Processing(NLP) in general has

experienced a revolution While the traditional methods and practices are still prevalent

and forms the basis of our deep understanding of languages Deep Learning methods

have surpassed all traditional NLP and Machine Learning methods by a significant

margin and are currently driving the growth of the field

To be able to build a system that can understand human text we need to first

ask ourselves how can we evaluate the machinersquos comprehension ability We had

initially set a goal to build a Chabot for a specific domain and generalize to other topics

as we go ahead While developing the system we found out the necessity for reading

comprehension and how to measure it We finally found the answer with Questions

Answering systems

10

Just like how we human beings are tested for our ability of language

understanding with questions we should ask machines similar questions about what it

has just read The performance of the system on such question answering task will let

us evaluate how much the machine is able to reason about what it just read [1]

Reading comprehension has been a topic of Natural Language Understanding since the

1970s In 1977 Wendy Lehnert said in his doctoral thesis ndash ldquoOnly when we can ask a

program to answer questions about what it reads will we be able to begin to access that

programrsquos comprehensionrdquo [2]

Figure 1-1 The task of Question Answering

To achieve this task the NLP community has developed various datasets such as

CNN Daily Mail WebQuestions SQuAD TriviaQA [3] etc For our purpose we chose

SQuAD which stands for Stanford Question Answering Dataset [4] SQuAD consists of

questions posed by crowdworkers on a set of Wikipedia articles where the answer to

every question is a segment of text or span from the corresponding reading passage

With 100000+ question-answer pairs on 500+ articles SQuAD is significantly larger

than previous reading comprehension datasets [5]

An example from the SQuAD dataset is as follows

11

Passage Tesla later approached Morgan to ask for more funds to build a more

powerful transmitter When asked where all the money had gone Tesla responded by

saying that he was affected by the Panic of 1901 which he (Morgan) had caused

Morgan was shocked by the reminder of his part in the stock market crash and by

Teslarsquos breach of contract by asking for more funds Tesla wrote another plea to

Morgan but it was also fruitless Morgan still owed Tesla money on the original

agreement and Tesla had been facing foreclosure even before construction of the

tower began

Question On what did Tesla blame for the loss of the initial money

Answer Panic of 1901

As we started exploring the QA task we faced several challenges Some of them

we could solve with the help of other research and some of the challenges still exist in

the domain

bull Out of Vocabulary words

bull Multi-sentence reasoning may be required

bull There may exist several candidate answers

bull Optimizing the Exact Match (EM) metric may fail when the answer boundary is fuzzy or too long such as the answer of the ldquowhyrdquo query [6]

bull ldquoOne-hoprdquo prediction may fail to fully understand the query [6]

bull Fail to fully capture the long-distance contextual interaction between parts of the context by only using LSTMGRU

bull Current models are unable to capture semantics of the passage

In the upcoming chapters we will first briefly review the basics necessary for

understanding the models then we will delve deep into the fundamental models that

12

have shaped the current State of the Art models then we will discuss our contribution in

terms of architecture and applications and finally conclude with future directions

13

CHAPTER 2 THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM

To build a question answering system one needs to be familiar with the

fundamental deep learning models such as Recurrent Neural Networks (RNN) Long

Short Term Memory (LSTM) etc In this chapter we will have an overview on these

techniques and see how they all connect to building a question answering system

Neural Networks

What makes Deep Learning so intriguing is that it has close resemblance with

the working of the mammalian brain or at least draws inspiration from it The same can

be said for Artificial Neural networks [7] which consists of a system of interconnected

units called lsquoneuronsrsquo that take input from similar units and produces a single output

Figure 2-1 Simple and Deep Learning Neural Networks [8]

The connection from one neuron to another can be weighted based on the input data

which enables the network to tune itself to produce a certain output based on the input

This is the learning process which is achieved through backpropagation which is a

system of propagating the error from the output layer to the previous layers

14

Convolutional Neural Network

The first wave of deep learningrsquos success was brought by Convolutional Neural

Networks (CNN) [9] when this was the technique used by the winning team of ImageNet

competition in 2012 CNNs are deep artificial neural networks (ANN) that can be used to

classify images cluster them by similarity and perform object recognition within scenes

It can be used to detect and identify faces people signs or any other visual data

Figure 2-2 Convolutional Neural Network Architecture [10]

There are primarily four operations in a standard CNN model (as shown in Fig above)

1 Convolution - The primary purpose of Convolution in the ConvNet (above) is to extract features from the input image The spatial relationship between pixels ie the image features are preserved and learned by the convolution using small squares of input data

2 Non-Linearity (ReLU) ndash Rectified Linear Unit (ReLU) is a non-linear operation that carries out an element wise operation on each pixel This operation replaces the negative pixel values in the feature map by zero

3 Pooling or Sub Sampling - Spatial Pooling reduces the dimensionality of each feature map but retains the most important information For max pooling the largest value in the square window is taken and rest are dropped Other types of pooling are Average Sum etc

4 Classification (Fully Connected Layer) - The Fully Connected layer is a traditional Multi-Layer Perceptron as described before that uses a softmax activation function in the output layer The high-level features of the image are encoded by the convolutional and pooling layers which is then fed to the fully connected layer which then uses these features for classifying the input image into various classes based on the training dataset [10]

15

When a new image is fed into the CNN model all the above-mentioned steps are

carried out (forward propagation) and a probability distribution is achieved on the set of

output classes With a large enough training dataset the network will learn and

generalize well enough to classify new images into their correct classes

Recurrent Neural Networks (RNN)

Whenever we want to predict or encode sequential data RNN [11] is our go to

method RNNs perform the same task for every element of a sequence where the

output of each element depends on previous computations thus the recurrence In

practice RNNs are unable to retain long-term dependencies and can look back only a

few steps because of the vanishing gradient problem

(2-1)

A solution to the dependency problem is to use gated cells such as LSTM [11] or

GRU [13] These cells pass on important information to the next cells while ignoring

non-important ones The gated units in a GRU block are

bull Update Gate ndash Computed based on current input and hidden state

(2-2)

bull Reset Gate ndash Calculated similarly but with different weights

(2-3)

bull New memory content - (2-4)

If reset gate unit is ~0 then previous memory is ignored and only new

information is kept

16

Final memory at current time step combines previous and current time steps

(2-5)

While the GRU is computationally efficient the LSTM on the other hand is a

general case where there are three gates as follows

bull Input Gate ndash What new information to add to the current cell state

bull Forget Gate ndash How much information from previous states to be kept

bull Output gate ndash How much info should be sent to the next states

Just like GRU the current cell state is a sum of the previous cell state but

weighted by the forget gate and the new value is added which is weighted by the input

gate Based on the cell state the output gate regulates the final output

Word Embedding

Computation or gradients can be applied on numbers and not on words or letters

So first we need to convert words into their corresponding numerical formation before

feeding into a deep learning model In general there are two types of word embedding

Frequency based (which constitutes count vectors tf-idf and co-occurrence vectors)

and Prediction based With frequency based embedding the order of the words are not

preserved and works as a bag of words model Whereas with prediction based model

the order of words or locality of words are taken into consideration to generate the

numerical representation of the word Within this prediction based category there are

two fundamental techniques called Continuous Bag of Words (CBOW) and Skip Gram

Model which forms the basis for word2vec [14] and GloVe [15]

The basic intuition behind word2vec is that if two different words have very

similar ldquocontextsrdquo (that is what words are likely to appear around them) then the model

17

will produce similar vector for those words Conversely if the two word vectors are

similar then the network will produce similar context predictions for the same two words

For examples synonyms like ldquointelligentrdquo and ldquosmartrdquo would have very similar contexts

Or that words that are related like ldquoenginerdquo and ldquotransmissionrdquo would probably have

similar contexts as well [16] Plotting the word vectors learned by a word2vec over a

large corpus we could find some very interesting relationships between words

Figure 2-3 Semantic relation between words in vector space [17]

Attention Mechanism

We as humans put our attention to things are important or are relevant in a

context For example when asked a question from a passage we try to find the most

relevant part of the passage the question is relevant with and then reason from our

understanding of that part of the passage The same idea applies for attention

mechanism in Deep Learning It is used to identify the specific parts of a given context

to which the current question is relevant to

Formally put the techniques take n arguments y_1 y_n (in our case the

passage having words say y_i through h_i) and a question word say q It returns a

vector z which is supposed to be the laquo summary raquo of the y_i focusing on information

linked to the question q More formally it returns a weighted arithmetic mean of the y_i

18

and the weights are chosen according the relevance of each y_i given the context c

[18]

Figure 2-4 Attention Mechanism flow [18]

Memory Networks

Convolutional Neural Networks and Recurrent Neural Networks which does

capture how we form our visual and sequential memories their memory (encoded by

hidden states and weights) were typically too small and was not compartmentalized

enough to accurately remember facts from the past (knowledge is compressed into

dense vectors) [19]

Deep Learning needed to cultivate a methodology that preserved memories as

they are such that it wonrsquot be lost in generalization and recalling exact words or

sequence of events would be possible mdash something computers are already good at This

effort led us to Memory Networks [19] published at ICLR 2015 by Facebook AI

Research

This paper provides a basic framework to store augment and retrieve memories

while seamlessly working with a Recurrent Neural Network architecture The memory

19

network consists of a memory m (an array of objects 1 indexed by m i) and four

(potentially learned) components I G O and R as follows

I (input feature map) mdash converts the incoming input to the internal feature

representation either a sparse or dense feature vector like that from word2vec or

GloVe

G (generalization) mdash updates old memories given the new input They call this

generalization as there is an opportunity for the network to compress and generalize its

memories at this stage for some intended future use The analogy Irsquove been talking

before

O (output feature map) mdash produces a new output (in the feature representation

space) given the new input and the current memory state This component is

responsible for performing inference In a question answering system this part will

select the candidate sentences (which might contain the answer) from the story

(conversation) so far

R (response) mdash converts the output into the response format desired For

example a textual response or an action In the QA system described this component

finds the desired answer and then converts it from feature representation to the actual

word

This model is a fully supervised model meaning all the candidate sentences from

which the answer could be found are marked during training phase and can also be

termed as lsquohard attentionrsquo

The authors tested out the QA system on various literature including Lord of the

Rings

20

Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]

21

CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART

There has been rapid progress since the release of the SQuAD dataset and

currently the best ensemble models are close to human level accuracy in machine

comprehension This is due to the various ingenious methods which solves some of the

problems with the previous methods Out of Vocabulary tokens were handled by using

Character embedding Long term dependency within context passage were solved

using self-attention And many other techniques such as Contextualized vectors History

of Words Attention Flow etc In this section we will have a look at the some of the most

important models that were fundamental to the progress of Questions Answering

Machine Comprehension Using Match-LSTM and Answer Pointer

In this paper [20] the authors propose an end-to-end neural architecture for the

QA task The architecture is based on match-LSTM [21] a model they proposed for

textual entailment and Pointer Net [22] a sequence-to-sequence model proposed by

Vinyals et al (2015) to constrain the output tokens to be from the input sequences

The model consists of an LSTM preprocessing layer a match-LSTM layer and an

Answer Pointer layer

We are given a piece of text which we refer to as a passage and a question

related to the passage The passage is represented by matrix P where P is the length

(number of tokens) of the passage and d is the dimensionality of word embeddings

Similarly the question is represented by matrix Q where Q is the length of the question

Our goal is to identify a subsequence from the passage as the answer to the question

22

Figure 3-1 Match-LSTM Model Architecture [20]

LSTM Preprocessing layer They use a standard one-directional LSTM

(Hochreiter amp Schmidhuber 1997) to process the passage and the question separately

as shown below

Match LSTM Layer They applied the match-LSTM model proposed for textual

entailment to their machine comprehension problem by treating the question as a

premise and the passage as a hypothesis The match-LSTM sequentially goes through

the passage At position i of the passage it first uses the standard word-by-word

attention mechanism to obtain attention weight vector as follows

(3-1)

23

where and are parameters to be learned

Answer Pointer Layer The Answer Pointer (Ans-Ptr) layer is motivated by the

Pointer Net introduced by Vinyals et al (2015) [22] The boundary model produces only

the start token and the end token of the answer and then all the tokens between these

two in the original passage are considered to be the answer

When this paper was released back in November 2016 Match-LSTM method

was the state of the art in Question Answering systems and was at the top of the

leaderboard for the SQuAD dataset

R-NET Matching Reading Comprehension with Self-Matching Networks

In this model [23] first the question and passage are processed by a

bidirectional recurrent network (Mikolov et al 2010) separately They then match the

question and passage with gated attention-based recurrent networks obtaining

question-aware representation for the passage On top of that they apply self-matching

attention to aggregate evidence from the whole passage and refine the passage

representation which is then fed into the output layer to predict the boundary of the

answer span

Question and passage encoding First the words are converted to their

respective word-level embeddings and character level embeddings The character-level

embeddings are generated by taking the final hidden states of a bi-directional recurrent

neural network (RNN) applied to embeddings of characters in the token Such

character-level embeddings have been shown to be helpful to deal with out-of-vocab

(OOV) tokens

24

They then use a bi-directional RNN to produce new representation and

of all words in the question and passage respectively

Figure 3-2 The task of Question Answering [23]

Gated Attention-based Recurrent Networks They use a variant of attention-

based recurrent networks with an additional gate to determine the importance of

information in the passage regarding a question Different from the gates in LSTM or

GRU the additional gate is based on the current passage word and its attention-pooling

vector of the question which focuses on the relation between the question and current

passage word The gate effectively model the phenomenon that only parts of the

passage are relevant to the question in reading comprehension and question answering

is utilized in subsequent calculations

25

Self-Matching Attention From the previous step the question aware passage

representation is generated to highlight the important parts of the passage One

problem with such representation is that it has very limited knowledge of context One

answer candidate is often oblivious to important cues in the passage outside its

surrounding window To address this problem the authors propose directly matching

the question-aware passage representation against itself It dynamically collects

evidence from the whole passage for words in passage and encodes the evidence

relevant to the current passage word and its matching question information into the

passage representation

Output Layer They use the same method as Wang amp Jiang (2016b) and use

pointer networks (Vinyals et al 2015) to predict the start and end position of the

answer In addition they use an attention-pooling over the question representation to

generate the initial hidden vector for the pointer network [23]

When the R-Net Model first appeared in the leaderboard in March 2017 it was at

the top with 723 Exact Match and 807 F1 score

Bi-Directional Attention Flow (BiDAF) for Machine Comprehension

BiDAF [24] is a hierarchical multi-stage architecture for modeling the

representations of the context paragraph at different levels of granularity BIDAF

includes character-level word-level and contextual embeddings and uses bi-directional

attention flow to obtain a query-aware context representation Their attention layer is not

used to summarize the context paragraph into a fixed-size vector Instead the attention

is computed for every time step and the attended vector at each time step along with

the representations from previous layers can flow through to the subsequent modeling

layer This reduces the information loss caused by early summarization [24]

26

Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]

Their machine comprehension model is a hierarchical multi-stage process and

consists of six layers

1 Character Embedding Layer maps each word to a vector space using character-level CNNs

2 Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model

3 Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embedding of the words These first three layers are applied to both the query and context They use a LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words We place an LSTM in both directions and concatenate the outputs of the two LSTMs

4 Attention Flow Layer couples the query and context vectors and produces a set of query aware feature vectors for each word in the context

5 Modeling Layer employs a Recurrent Neural Network to scan the context The input to the modeling layer is G which encodes the query-aware representations of context words The output of the modeling layer captures the interaction among the context words conditioned on the query They use two layers of bi-directional LSTM

27

with the output size of d for each direction Hence a matrix is obtained which is passed onto the output layer to predict the answer

6 Output Layer provides an answer to the query They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices by the predicted distributions averaged over all examples [25]

In a further variation of their above work they add a self-attention layer after the

Bi-attention layer to further improve the results The architecture of the model is as

Figure 3-3 The task of Question Answering [25]

28

Summary

In this chapter we reviewed the methods that are fundamental to the state of the

art in Machine Comprehension and for the task of Question Answering We have

reached closed to human level accuracy and this is due to incremental developments

over previous models As we saw Out of Vocabulary (OOV) tokens were handled by

using Character embedding Long term dependency within context passage were

solved using self-attention and many other techniques such as Contextualized vectors

Attention Flow etc were employed to get better results In the next chapter we will see

how we can build on these models and develop further

29

CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING

Although the advancements in Question Answering systems since the release of

the SQuAD dataset has been impressive and the results are getting close to human

level accuracy it is far from being a fool-proof system The models still make mistakes

which would be obvious to a human For example

Passage The Panthers used the San Jose State practice facility and stayed at

the San Jose Marriott The Broncos practiced at Florida State Facility and stayed at the

Santa Clara Marriott

Question At what universitys facility did the Panthers practice

Actual Answer San Jose State

Predicted Answer Florida State Facility

To find out what is leading to the wrong predictions we wanted to see the

attention weights associated with such an example We plotted the passage and

question heat map which is a 2D matrix where the intensity of each cell signifies the

similarity between a passage word and a question word For the above example we

found out that while certain words of the question are given high weightage other parts

are not The words lsquoAtrsquo lsquofacilityrsquo lsquopracticersquo are given high attention but lsquoPanthersrsquo does

not receive high attention If it had received high attention then the system would have

predicted lsquoSan Jose Statersquo as the right answer To solve this issue we analyzed the

base BiDAF model and proposed adding two things

1 Bi-Attention and Self-Attention over Query

2 Second level attention over the output of (Bi-Attention + Self-Attention) from both Context and Query

30

Multi-Attention BiDAF Model

Our comprehension model is a hierarchical multi-stage process and consists of

the following layers

1 Embedding Just as all other models we embed words using pretrained word vectors We also embed the characters in each word into size 20 vectors which are learned and run a convolution neural network followed by max-pooling to get character-derived embeddings for each word The character-level and word-level embeddings are then concatenated and passed to the next layer We do not update the word embeddings during training

2 Pre-Process A shared bi-directional GRU (Cho et al 2014) is used to map the question and passage embeddings to context aware embeddings

3 Context Attention The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al 2016) is used to build a query-aware context representation Let h_i be the vector for context word i q_j be the vector for question word j and n_q and n_c be the lengths of the question and context respectively We compute attention between context word i and question word j as

(4-1)

where w1 w2 and w3 are learned vectors and is element-wise multiplication We then compute an attended vector c_i for each context token as

(4-2)

We also compute a query-to-context vector q_c

(4-3)

31

The final vector computed for each token is built by concatenating

and In our model we subsequently pass the result through a linear layer with ReLU activations

4 Context Self-Attention Next we use a layer of residual self-attention The input is passed through another bi-directional GRU Then we apply the same attention mechanism only now between the passage and itself In this case we do not use query-to context attention and we set aij = 1048576inf if i = j As before we pass the concatenated output through a linear layer with ReLU activations This layer is applied residually so this output is additionally summed with the input

5 Query Attention For this part we do the same way as context attention but calculate the weighted sum of the context words for each query word Thus we get the length as number of query words Then we calculate context to query similar to query to context in context attention layer

Figure 4-1 The modified BiDAF model with multilevel attention

6 Query Self-Attention This part is done the same way as the context self-attention layer but from the output of the Query Attention layer

32

7 Context Query Bi-Attention + Self-Attention The output of the Context self-attention and the Query Self-Attention layers are taken as input and the same process for Bi-Attention and self-attention is applied on these inputs

8 Prediction In the last layer of our model a bidirectional GRU is applied followed by a linear layer that computes answer start scores for each token The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores The softmax operation is applied to the start and end scores to produce start and end probabilities and we optimize the negative log likelihood of selecting correct start and end tokens

Having carried out this modification we were able to solve the wrong example we

started with The multilevel attention model gives the correct output as ldquoSan Jose

Staterdquo Also we achieved slightly better scores than the original model with a F1 score

of 8544 on the SQuAD dev dataset

Chatbot Design Using a QA System

Designing a perfect chatbot that passes the Turing test is a fundamental goal for

Artificial Intelligence Although we are many order to magnitudes away from achieving

such a goal domain specific tasks can be solved with chatbots made from current

technology We had started our goal with a similar objective in mind ie to design a

domain specific chatbot and then generalize to other areas as it is able to achieve the

first domain specific objective robustly This led us to the fundamental problem of

Machine Comprehension and subsequently to the task of Question Answering Having

achieved some degree of success with QA systems we looked back if we could apply

our newly acquired knowledge in the task of designing Chatbots

The chatbots made with todayrsquos technologies are mostly handcrafted techniques

such as template matching that requires anticipating all possible ways a user may

articulate his requirements and a conversation may occur This requires a lot of man

hours for designing a domain specific system and is still very error prone In this section

33

we propose a general Chatbot design that would make the designing of a domain

specific chatbot very easy and robust at the same time

Every domain specific chatbot needs to obtain a set of information from the user

and show some results based on the user specific information obtained The traditional

chatbots use template matching and keywords lookup to determine if the user has

provided the required information Our idea is to use the Question Answering system in

the backend to extract out the required information from whatever the user has typed

until this point of the conversation The information to be extracted can be posed in the

form of a set of questions and the answers obtained from those questions can be used

as the parameters to supply the relevant information to the user

We had chosen our chatbot domain as the flight reservation system Our goal

was to extract the required information from the user to be able to show him the

available flights as per the userrsquos requirements

Figure 4-2 Flight reservation chatbotrsquos chat window

34

For a flight reservation task the booking agent needs to know the origin city

destination city and date of travel at minimum to be able to show the available flights

Optional information includes the number of tickets passengerrsquos name one way or

round trip etc

The minimalistic conversation with the user through the chat window would be as

shown above We had a platform called OneTask on which we wanted to implement our

chat bot The chat interface within the OneTask system looks as follows

Figure 4-3 Chatbot within OneTask system

The working of the chat system is as follows ndash

1 Initiation The user opens the chat window which starts a session with the chat bot The chat bot asks lsquoHow may I help yoursquo

2 User Reply The user may reply with none of the required information for flight booking or may reply with multiple information in the same message

3 User Reply parsing The conversation up to this point is treated as a passage and the internal questions are run on this passage So the four questions that are run are

Where do you want to go

35

From where do you want to leave

When do you want to depart

4 Parsed responses from QA model After running the questions on the QA system with the given conversation answers are obtained for all along with their corresponding confidence values Since answers will be obtained even if the required question was not answered up to this point in the conversation the confidence values plays a crucial role in determining the validity of the answer For this purpose we determined ranges of confidence values for validation A confidence value above 10 signifies the question has been answered correctly A confidence of 2 ndash 10 signifies that it may have been answered but should verify with the user for correctness and any confidence below 2 is discarded

5 Asking remaining questions iteratively After the parsing it is checked if any of the required questions are still unanswered If so the chatbot asks the remaining question and the process from 3 ndash 4 is carried out iteratively Once all the questions have been answered the user is shown with the available flight options as per his request

Figure 4-4 The Flow diagram of the Flight booking Chatbot system

Online QA System and Attention Visualization

To be able to test out various examples we used the BiDAF [23] model for an

online demo One can either choose from the available examples from the drop-down

menu or paste their own passage and examples While this is a useful and interesting

system to test the model in a user-friendly way we created this system to be able to

focus on the wrong samples

36

Figure 4-5 QA system interface with attention highlight over candidate answers

The candidate answers are shown with the blue highlights as per their

confidence values The higher the confidence the darker the answer The highest

confidence value is chosen as the predicted answer We developed the system to show

attention spread of the candidate answers to realize what needs to be done to improve

the system This led us to realize the importance of including the query attention part as

well as multilevel attention on the BiDAF model as described in the first section of this

chapter

37

CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION

Future Directions

Having done a thorough analysis of the current methods and state of the arts in

Machine Comprehension and for the QA task and developing systems on it we have

achieved a strong sense of what needs to be done to further improve the QA models

After observing the wrong samples we can see that the system is still unable to encode

meaning and is picking answers based on statistical patterns of occurrence of answers

on training examples

As per the State of the Art models and analyzing the ongoing research literature

it is easy to conclude that more features need to be embedded to encode the meaning

of words phrases and sentences A paper called Reinforced Mnemonic Reader for

Machine Comprehension encoded POS and NER tags of words along with their word

and character embedding This gave them better results

Figure 5-1 An English language semantic parse tree [26]

38

We have developed a method to encode the syntax parse tree of a sentence that

not only encodes the post tags but the relation of the word within the phrase and the

relation of the phrase within the whole sentence in a hierarchical manner

Finally data augmentation is another solution to get better results One definite

way to reduce the errors would be to include similar samples in the training data which

the system is faltering in the dev set One could generate similar examples as the failure

cases and include them in the training set to have better prediction Another system

would be to train to similar and bigger datasets Our models were trained on the SQuAD

dataset There are other datasets too which does the similar question answering task

such as TriviaQA We could augment the training set of TriviaQA along with SQuAD to

have a more robust system that is able to generalize better and thus have higher

accuracy for predicting answer spans

Conclusion

In this work we have tried to explore the most fundamental techniques that have

shaped the current state of the art Then we proposed a minor improvement of

architecture over an existing model Furthermore we developed two applications that

uses the base model First we talked about how a chatbot application can be made

using the QA system and lastly we also created a web interface where the model can

be used for any Passage and Question This interface also shows the attention spread

on the candidate answers While our effort is ongoing to push the state of the art

forward we strongly believe that surpassing human level accuracy on this task will have

high dividends for the society at large

39

LIST OF REFERENCES

[1] Danqi Chen From Reading Comprehension To Open-Domain Question Answering 2018 Youtube Accessed March 16 2018 httpswwwyoutubecomwatchv=1RN88O9C13U

[2] Lehnert Wendy G The Process of Question Answering No RR-88 Yale Univ New Haven Conn Dept of computer science 1977

[3] Joshi Mandar Eunsol Choi Daniel S Weld and Luke Zettlemoyer Triviaqa A large scale distantly supervised challenge dataset for reading comprehension arXiv preprint arXiv170503551 (2017)

[4] Rajpurkar Pranav Jian Zhang Konstantin Lopyrev and Percy Liang Squad 100000+ questions for machine comprehension of text arXiv preprint arXiv160605250 (2016)

[5] The Stanford Question Answering Dataset 2018 RajpurkarGithubIo Accessed March 16 2018 httpsrajpurkargithubioSQuAD-explorer

[6] Hu Minghao Yuxing Peng and Xipeng Qiu Reinforced mnemonic reader for machine comprehension CoRR abs170502798 (2017)

[7] Schalkoff Robert J Artificial neural networks Vol 1 New York McGraw-Hill 1997

[8] Me For The AI and Neetesh Mehrotra 2018 The Connect Between Deep Learning And AI - Open Source For You Open Source For You Accessed March 16 2018 httpopensourceforucom201801connect-deep-learning-ai

[9] Krizhevsky Alex Ilya Sutskever and Geoffrey E Hinton Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems pp 1097-1105 2012

[10] An Intuitive Explanation Of Convolutional Neural Networks 2016 The Data Science Blog Accessed March 16 2018 httpsujjwalkarnme20160811intuitive-explanation-convnets

[11] Williams Ronald J and David Zipser A learning algorithm for continually running fully recurrent neural networks Neural computation 1 no 2 (1989) 270-280

[12] Hochreiter Sepp and Juumlrgen Schmidhuber Long short-term memory Neural computation 9 no 8 (1997) 1735-1780

[13] Chung Junyoung Caglar Gulcehre KyungHyun Cho and Yoshua Bengio Empirical evaluation of gated recurrent neural networks on sequence modeling arXiv preprint arXiv14123555 (2014)

40

[14] Mikolov Tomas Wen-tau Yih and Geoffrey Zweig Linguistic regularities in continuous space word representations In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics Human Language Technologies pp 746-751 2013

[15] Pennington Jeffrey Richard Socher and Christopher Manning Glove Global vectors for word representation In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) pp 1532-1543 2014

[16] Word2vec Tutorial - The Skip-Gram Model middot Chris Mccormick 2016 MccormickmlCom Accessed March 16 2018 httpmccormickmlcom20160419word2vec-tutorial-the-skip-gram-model

[17] Group 2015 [Paper Introduction] Bilingual Word Representations With Monolingual hellip SlideshareNet Accessed March 16 2018 httpswwwslidesharenetnaist-mtstudybilingual-word-representations-with-monolingual-quality-in-mind

[18] Attention Mechanism 2016 BlogHeuritechCom Accessed March 16 2018 httpsblogheuritechcom20160120attention-mechanism

[19] Weston Jason Sumit Chopra and Antoine Bordes 2014 Memory Networks arXiv preprint arXiv14103916v11 Accessed March 16 2018 httpsarxivorgabs14103916

[20] Wang Shuohang and Jing Jiang Machine comprehension using match-lstm and answer pointer arXiv preprint arXiv160807905 (2016)

[21] Wang Shuohang and Jing Jiang Learning natural language inference with LSTM arXiv preprint arXiv151208849 (2015)

[22] Vinyals Oriol Meire Fortunato and Navdeep Jaitly Pointer networks In Advances in Neural Information Processing Systems pp 2692-2700 2015

[23] Wang Wenhui Nan Yang Furu Wei Baobao Chang and Ming Zhou Gated self-matching networks for reading comprehension and question answering In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1 Long Papers) vol 1 pp 189-198 2017

[24] Seo Minjoon Aniruddha Kembhavi Ali Farhadi and Hannaneh Hajishirzi Bidirectional attention flow for machine comprehension arXiv preprint arXiv161101603 (2016)

[25] Clark Christopher and Matt Gardner Simple and effective multi-paragraph reading comprehension arXiv preprint arXiv171010723 (2017)

41

[26] 8 Analyzing Sentence Structure 2018 NltkOrg Accessed March 16 2018 httpwwwnltkorgbookch08html

42

BIOGRAPHICAL SKETCH

Purnendu Mukherjee grew up in Kolkata India and had a strong interest for

computers since an early age After his high school he did his BSc in computer

science from Ramakrishna Mission Residential College Narendrapur followed by MSc

in computer science from St Xavierrsquos College Kolkata He had a strong intuition and

interest for human like learning systems and wanted to work in this area He started

working at TCS Innovation Labs Pune for the application of Natural Language

Processing in Educational Applications As he was deeply passionate about learning

system that mimic the human brain and learn like a human child does he was

increasing interested about Deep Learning and its applications After working for a year

he went on to pursue a Master of Science degree in computer science from the

University of Florida Gainesville His academic interests have been focused on Deep

Learning and Natural Language Processing and he has been working on Machine

Reading Comprehension since summer of 2017

  • ACKNOWLEDGMENTS
  • LIST OF FIGURES
  • LIST OF ABBREVIATIONS
  • INTRODUCTION
  • THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
    • Neural Networks
    • Convolutional Neural Network
    • Recurrent Neural Networks (RNN)
    • Word Embedding
    • Attention Mechanism
    • Memory Networks
      • LITERATURE REVIEW AND STATE OF THE ART
        • Machine Comprehension Using Match-LSTM and Answer Pointer
        • R-NET Matching Reading Comprehension with Self-Matching Networks
        • Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
        • Summary
          • MULTI-ATTENTION QUESTION ANSWERING
            • Multi-Attention BiDAF Model
            • Chatbot Design Using a QA System
            • Online QA System and Attention Visualization
              • FUTURE DIRECTIONS AND CONCLUSION
                • Future Directions
                • Conclusion
                  • LIST OF REFERENCES
                  • BIOGRAPHICAL SKETCH
Page 7: QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISMufdcimages.uflib.ufl.edu/UF/E0/05/23/73/00001/MUKHERJEE_P.pdf · QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM By PURNENDU

7

LIST OF ABBREVIATIONS

BiDAF Bi-Directional Attention Flow

CNN Convolutional Neural Network

GRU Gated Recurrent Units

LSTM Long Short Term Memory

NLP Natural Language Processing

NLU Natural Language Understanding

RNN Recurrent Neural Network

8

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the

Requirements for the Degree of Master of Science

QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM

By

Purnendu Mukherjee

May 2018

Chair Xiaolin Li Major Computer Science

Question Answering(QA) systems have had rapid growth since the last 3 years

and are close to reaching human level accuracy One of the fundamental reason for this

growth has been the use of attention mechanism along with other methods of Deep

Learning But just as with other Deep Learning methods some of the failure cases are

so obvious that it convinces us that there is a lot to improve upon In this work we first

did a literature review of the State of the Art and fundamental models in QA systems

Next we introduce an architecture which has shown improvement in the targeted area

We then introduce a general method to enable easy design of domain-specific Chatbot

applications and present a proof of the concept with the same method Finally we

present an easy to use Question Answering interface with attention visualization on the

passage We also propose a method to improve the current state of the art as a part of

our ongoing work

9

CHAPTER 1 INTRODUCTION

Teaching machines to read and understand human natural language is our

central and long standing goal for Natural Language Processing and Artificial

Intelligence in general The positive side of machines being able to reason and

comprehend human language could be enormous We are beginning to see how

commercial systems like Alexa Siri Google Now etc are being widely used as Speech

Recognition improved to human levels While speech recognition systems can

transcribe speech to text comprehension of that transcribed text is another task which

is currently a major focus for both academia and industry because of its possible

applications Moreover all the text information available throughout the internet is a

major reason why machine comprehension of text is such an important task

With the growth of Deep Learning methods in the last few years the field of

Machine Comprehension and Natural Language Processing(NLP) in general has

experienced a revolution While the traditional methods and practices are still prevalent

and forms the basis of our deep understanding of languages Deep Learning methods

have surpassed all traditional NLP and Machine Learning methods by a significant

margin and are currently driving the growth of the field

To be able to build a system that can understand human text we need to first

ask ourselves how can we evaluate the machinersquos comprehension ability We had

initially set a goal to build a Chabot for a specific domain and generalize to other topics

as we go ahead While developing the system we found out the necessity for reading

comprehension and how to measure it We finally found the answer with Questions

Answering systems

10

Just like how we human beings are tested for our ability of language

understanding with questions we should ask machines similar questions about what it

has just read The performance of the system on such question answering task will let

us evaluate how much the machine is able to reason about what it just read [1]

Reading comprehension has been a topic of Natural Language Understanding since the

1970s In 1977 Wendy Lehnert said in his doctoral thesis ndash ldquoOnly when we can ask a

program to answer questions about what it reads will we be able to begin to access that

programrsquos comprehensionrdquo [2]

Figure 1-1 The task of Question Answering

To achieve this task the NLP community has developed various datasets such as

CNN Daily Mail WebQuestions SQuAD TriviaQA [3] etc For our purpose we chose

SQuAD which stands for Stanford Question Answering Dataset [4] SQuAD consists of

questions posed by crowdworkers on a set of Wikipedia articles where the answer to

every question is a segment of text or span from the corresponding reading passage

With 100000+ question-answer pairs on 500+ articles SQuAD is significantly larger

than previous reading comprehension datasets [5]

An example from the SQuAD dataset is as follows


Passage: Tesla later approached Morgan to ask for more funds to build a more

powerful transmitter When asked where all the money had gone Tesla responded by

saying that he was affected by the Panic of 1901 which he (Morgan) had caused

Morgan was shocked by the reminder of his part in the stock market crash and by

Tesla's breach of contract by asking for more funds. Tesla wrote another plea to

Morgan but it was also fruitless Morgan still owed Tesla money on the original

agreement and Tesla had been facing foreclosure even before construction of the

tower began

Question: On what did Tesla blame for the loss of the initial money?

Answer: Panic of 1901
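The structure of the dataset itself is easy to explore programmatically. The short Python sketch below walks the SQuAD v1.1 JSON file and prints one passage-question-answer triple (the local file name train-v1.1.json is only illustrative; any file in the SQuAD format works the same way):

    import json

    # Load a locally downloaded copy of SQuAD v1.1 (the path is illustrative).
    with open("train-v1.1.json", "r", encoding="utf-8") as f:
        squad = json.load(f)

    # SQuAD groups paragraphs by Wikipedia article; each paragraph carries a
    # context passage and a list of question-answer pairs.
    article = squad["data"][0]
    paragraph = article["paragraphs"][0]
    qa = paragraph["qas"][0]

    print("Title:   ", article["title"])
    print("Passage: ", paragraph["context"][:200], "...")
    print("Question:", qa["question"])
    # Each answer records its text and the character offset of its span.
    print("Answer:  ", qa["answers"][0]["text"],
          "(starting at character", str(qa["answers"][0]["answer_start"]) + ")")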

As we started exploring the QA task we faced several challenges Some of them

we could solve with the help of other research and some of the challenges still exist in

the domain

• Out of Vocabulary words

• Multi-sentence reasoning may be required

• There may exist several candidate answers

• Optimizing the Exact Match (EM) metric may fail when the answer boundary is fuzzy or too long, such as the answer to a "why" query [6]

• "One-hop" prediction may fail to fully understand the query [6]

• Failure to fully capture the long-distance contextual interaction between parts of the context when only using an LSTM/GRU

• Current models are unable to capture the semantics of the passage

In the upcoming chapters we will first briefly review the basics necessary for

understanding the models then we will delve deep into the fundamental models that


have shaped the current State of the Art models then we will discuss our contribution in

terms of architecture and applications and finally conclude with future directions


CHAPTER 2 THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM

To build a question answering system one needs to be familiar with the

fundamental deep learning models such as Recurrent Neural Networks (RNN) Long

Short Term Memory (LSTM) etc In this chapter we will have an overview on these

techniques and see how they all connect to building a question answering system

Neural Networks

What makes Deep Learning so intriguing is that it has a close resemblance to the working of the mammalian brain, or at least draws inspiration from it. The same can be said for Artificial Neural Networks [7], which consist of a system of interconnected units called 'neurons' that take input from similar units and produce a single output.

Figure 2-1 Simple and Deep Learning Neural Networks [8]

The connection from one neuron to another can be weighted based on the input data

which enables the network to tune itself to produce a certain output based on the input

This is the learning process which is achieved through backpropagation which is a

system of propagating the error from the output layer to the previous layers


Convolutional Neural Network

The first wave of deep learning's success was brought by Convolutional Neural Networks (CNN) [9], when this was the technique used by the winning team of the ImageNet competition in 2012. CNNs are deep artificial neural networks (ANN) that can be used to classify images, cluster them by similarity, and perform object recognition within scenes. They can be used to detect and identify faces, people, signs, or any other visual data.

Figure 2-2 Convolutional Neural Network Architecture [10]

There are primarily four operations in a standard CNN model (as shown in Fig above)

1 Convolution - The primary purpose of convolution in the ConvNet (above) is to extract features from the input image. The spatial relationships between pixels, i.e., the image features, are preserved and learned by the convolution using small squares of input data.

2 Non-Linearity (ReLU) – Rectified Linear Unit (ReLU) is a non-linear operation that carries out an element-wise operation on each pixel. This operation replaces the negative pixel values in the feature map with zero.

3 Pooling or Sub-Sampling - Spatial pooling reduces the dimensionality of each feature map but retains the most important information. For max pooling, the largest value in the square window is taken and the rest are dropped. Other types of pooling are average, sum, etc.

4 Classification (Fully Connected Layer) - The fully connected layer is a traditional Multi-Layer Perceptron, as described before, that uses a softmax activation function in the output layer. The high-level features of the image are encoded by the convolutional and pooling layers, which are then fed to the fully connected layer, which then uses these features for classifying the input image into various classes based on the training dataset [10]


When a new image is fed into the CNN model all the above-mentioned steps are

carried out (forward propagation) and a probability distribution is achieved on the set of

output classes With a large enough training dataset the network will learn and

generalize well enough to classify new images into their correct classes
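To make the four operations concrete, the sketch below stacks a convolution, ReLU, max-pooling, and a fully connected classifier in PyTorch; the layer sizes and the ten output classes are arbitrary choices for illustration and are not taken from any model discussed in this thesis:

    import torch
    import torch.nn as nn

    class TinyCNN(nn.Module):
        """Convolution -> ReLU -> Pooling -> Fully connected, as described above."""
        def __init__(self, num_classes=10):
            super().__init__()
            self.conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
            self.relu = nn.ReLU()                    # element-wise non-linearity
            self.pool = nn.MaxPool2d(kernel_size=2)  # spatial sub-sampling
            self.fc = nn.Linear(16 * 16 * 16, num_classes)

        def forward(self, x):
            x = self.pool(self.relu(self.conv(x)))   # feature extraction
            x = x.flatten(start_dim=1)               # prepare for the dense layer
            return self.fc(x)                        # class scores (softmax is applied in the loss)

    scores = TinyCNN()(torch.randn(4, 3, 32, 32))    # a batch of four 32x32 RGB images
    print(scores.shape)                               # torch.Size([4, 10])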

Recurrent Neural Networks (RNN)

Whenever we want to predict or encode sequential data, an RNN [11] is our go-to method. RNNs perform the same task for every element of a sequence, where the output for each element depends on previous computations, thus the recurrence. In practice, RNNs are unable to retain long-term dependencies and can look back only a few steps because of the vanishing gradient problem.

$h_t = f(W x_t + U h_{t-1})$   (2-1)

A solution to the dependency problem is to use gated cells such as the LSTM [12] or

GRU [13] These cells pass on important information to the next cells while ignoring

non-important ones The gated units in a GRU block are

• Update Gate – Computed based on the current input and hidden state:

$z_t = \sigma(W^{(z)} x_t + U^{(z)} h_{t-1})$   (2-2)

• Reset Gate – Calculated similarly but with different weights:

$r_t = \sigma(W^{(r)} x_t + U^{(r)} h_{t-1})$   (2-3)

• New memory content – $\tilde{h}_t = \tanh(W x_t + r_t \circ U h_{t-1})$   (2-4)

If reset gate unit is ~0 then previous memory is ignored and only new

information is kept


The final memory at the current time step combines the previous and current time steps:

$h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$   (2-5)

While the GRU is computationally efficient the LSTM on the other hand is a

general case where there are three gates as follows

• Input Gate – What new information to add to the current cell state

• Forget Gate – How much information from previous states should be kept

• Output Gate – How much information should be sent to the next states

Just like GRU the current cell state is a sum of the previous cell state but

weighted by the forget gate and the new value is added which is weighted by the input

gate Based on the cell state the output gate regulates the final output
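A minimal NumPy sketch of a single GRU step is given below; it follows the update-gate, reset-gate, and memory equations (2-2) through (2-5), with randomly initialized weight matrices standing in for learned parameters:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W_h, U_h):
        """One GRU time step: combine the new input with the previous hidden state."""
        z_t = sigmoid(W_z @ x_t + U_z @ h_prev)               # update gate (2-2)
        r_t = sigmoid(W_r @ x_t + U_r @ h_prev)               # reset gate (2-3)
        h_tilde = np.tanh(W_h @ x_t + r_t * (U_h @ h_prev))   # new memory content (2-4)
        return z_t * h_prev + (1.0 - z_t) * h_tilde           # final memory (2-5)

    d_in, d_hid = 8, 16
    rng = np.random.default_rng(0)
    params = [rng.normal(scale=0.1, size=(d_hid, d_in)) if i % 2 == 0
              else rng.normal(scale=0.1, size=(d_hid, d_hid)) for i in range(6)]
    h = np.zeros(d_hid)
    for x in rng.normal(size=(5, d_in)):   # run over a toy sequence of five inputs
        h = gru_step(x, h, *params)
    print(h.shape)                          # (16,)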

Word Embedding

Computations and gradients can be applied to numbers, not to words or letters. So first we need to convert words into their corresponding numerical representation before

feeding into a deep learning model In general there are two types of word embedding

Frequency based (which constitutes count vectors tf-idf and co-occurrence vectors)

and Prediction based With frequency based embedding the order of the words are not

preserved and works as a bag of words model Whereas with prediction based model

the order of words or locality of words are taken into consideration to generate the

numerical representation of the word Within this prediction based category there are

two fundamental techniques called Continuous Bag of Words (CBOW) and Skip Gram

Model which forms the basis for word2vec [14] and GloVe [15]

The basic intuition behind word2vec is that if two different words have very similar "contexts" (that is, the words that are likely to appear around them), then the model will produce similar vectors for those words. Conversely, if the two word vectors are similar, then the network will produce similar context predictions for the same two words. For example, synonyms like "intelligent" and "smart" would have very similar contexts, and words that are related, like "engine" and "transmission", would probably have similar contexts as well [16]. Plotting the word vectors learned by word2vec over a large corpus, we can find some very interesting relationships between words.

Figure 2-3 Semantic relation between words in vector space [17]
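The kind of relationship in Figure 2-3 can be probed with simple vector arithmetic. The toy sketch below uses made-up three-dimensional vectors purely for illustration; with real word2vec or GloVe vectors, the same cosine-similarity query recovers the well-known king - man + woman ≈ queen analogy:

    import numpy as np

    # Toy three-dimensional "word vectors" (illustrative values, not trained embeddings).
    vectors = {
        "king":  np.array([0.9, 0.8, 0.1]),
        "queen": np.array([0.9, 0.1, 0.8]),
        "man":   np.array([0.1, 0.9, 0.1]),
        "woman": np.array([0.1, 0.1, 0.9]),
    }

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # The classic analogy query: king - man + woman should land near queen.
    target = vectors["king"] - vectors["man"] + vectors["woman"]
    candidates = {w: cosine(target, v) for w, v in vectors.items()
                  if w not in ("king", "man", "woman")}
    print(max(candidates, key=candidates.get))  # queen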

Attention Mechanism

We as humans pay attention to things that are important or relevant in a given context. For example, when asked a question about a passage, we try to find the part of the passage the question is most relevant to and then reason from our understanding of that part of the passage. The same idea applies to the attention mechanism in Deep Learning. It is used to identify the specific parts of a given context to which the current question is relevant.

Formally put, the technique takes n arguments y_1, ..., y_n (in our case the passage words, encoded say as vectors h_i) and a question representation, say q. It returns a vector z which is supposed to be the "summary" of the y_i, focusing on information linked to the question q. More formally, it returns a weighted arithmetic mean of the y_i, where the weights are chosen according to the relevance of each y_i given the context c [18].

Figure 2-4 Attention Mechanism flow [18]
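In code, this weighted-summary view of attention takes only a few lines: score each y_i against the query, normalize the scores with a softmax, and return the weighted mean. A dot-product scoring function is assumed here purely for illustration; the models discussed later use learned scoring functions:

    import numpy as np

    def softmax(scores):
        e = np.exp(scores - scores.max())   # subtract the max for numerical stability
        return e / e.sum()

    def attend(Y, q):
        """Y: (n, d) context vectors y_1..y_n; q: (d,) query vector.
        Returns the attention weights and the summary vector z."""
        scores = Y @ q                 # relevance of each y_i to the query
        weights = softmax(scores)      # normalized attention weights
        z = weights @ Y                # weighted arithmetic mean of the y_i
        return weights, z

    rng = np.random.default_rng(1)
    Y = rng.normal(size=(6, 4))        # six context word vectors of dimension 4
    q = rng.normal(size=4)             # a question word vector
    weights, z = attend(Y, q)
    print(weights.round(3), z.shape)   # the weights sum to 1; z has shape (4,)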

Memory Networks

While Convolutional Neural Networks and Recurrent Neural Networks do capture how we form our visual and sequential memories, their memory (encoded by hidden states and weights) is typically too small and is not compartmentalized enough to accurately remember facts from the past (knowledge is compressed into dense vectors) [19].

Deep Learning needed to cultivate a methodology that preserves memories as they are, such that they are not lost in generalization and recalling exact words or sequences of events remains possible, something computers are already good at. This effort led to Memory Networks [19], published at ICLR 2015 by Facebook AI Research.

This paper provides a basic framework to store augment and retrieve memories

while seamlessly working with a Recurrent Neural Network architecture The memory


network consists of a memory m (an array of objects indexed by m_i) and four

(potentially learned) components I G O and R as follows

I (input feature map) – converts the incoming input to the internal feature representation, either a sparse or dense feature vector, like that from word2vec or GloVe.

G (generalization) – updates old memories given the new input. They call this generalization as there is an opportunity for the network to compress and generalize its memories at this stage for some intended future use; this is the analogy mentioned before.

O (output feature map) – produces a new output (in the feature representation space) given the new input and the current memory state. This component is responsible for performing inference. In a question answering system this part will select the candidate sentences (which might contain the answer) from the story (conversation) so far.

R (response) – converts the output into the desired response format, for example a textual response or an action. In the QA system described, this component finds the desired answer and then converts it from the feature representation to the actual word.

This model is a fully supervised model, meaning all the candidate sentences from which the answer could be found are marked during the training phase; this can also be termed 'hard attention'.

The authors tested out the QA system on various literature including Lord of the

Rings


Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]


CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART

There has been rapid progress since the release of the SQuAD dataset and

currently the best ensemble models are close to human level accuracy in machine

comprehension. This is due to various ingenious methods which solve some of the problems with the previous methods: Out of Vocabulary tokens were handled by using character embeddings, long-term dependency within the context passage was addressed using self-attention, and many other techniques such as contextualized vectors, history of words, attention flow, etc. were introduced. In this section we will have a look at some of the most important models that were fundamental to the progress of Question Answering.

Machine Comprehension Using Match-LSTM and Answer Pointer

In this paper [20] the authors propose an end-to-end neural architecture for the

QA task The architecture is based on match-LSTM [21] a model they proposed for

textual entailment and Pointer Net [22] a sequence-to-sequence model proposed by

Vinyals et al (2015) to constrain the output tokens to be from the input sequences

The model consists of an LSTM preprocessing layer a match-LSTM layer and an

Answer Pointer layer

We are given a piece of text which we refer to as a passage and a question

related to the passage. The passage is represented by a matrix $P \in \mathbb{R}^{d \times P}$, where P is the length (number of tokens) of the passage and d is the dimensionality of the word embeddings. Similarly, the question is represented by a matrix $Q \in \mathbb{R}^{d \times Q}$, where Q is the length of the question.

Our goal is to identify a subsequence from the passage as the answer to the question


Figure 3-1 Match-LSTM Model Architecture [20]

LSTM Preprocessing Layer They use a standard one-directional LSTM (Hochreiter & Schmidhuber 1997) to process the passage and the question separately.

Match LSTM Layer They applied the match-LSTM model proposed for textual

entailment to their machine comprehension problem by treating the question as a

premise and the passage as a hypothesis The match-LSTM sequentially goes through

the passage At position i of the passage it first uses the standard word-by-word

attention mechanism to obtain the attention weight vector as follows:

$\vec{G}_i = \tanh\big(W^q H^q + (W^p h_i^p + W^r \vec{h}_{i-1}^r + b^p) \otimes e_Q\big)$, $\quad \vec{\alpha}_i = \mathrm{softmax}(w^\top \vec{G}_i + b \otimes e_Q)$   (3-1)

where $W^q$, $W^p$, $W^r$, $b^p$, $w$ and $b$ are parameters to be learned

Answer Pointer Layer The Answer Pointer (Ans-Ptr) layer is motivated by the

Pointer Net introduced by Vinyals et al (2015) [22] The boundary model produces only

the start token and the end token of the answer and then all the tokens between these

two in the original passage are considered to be the answer
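The boundary idea is easy to state in code: given start and end probabilities over the passage tokens, pick the span (s, e) with s ≤ e that maximizes their product. The sketch below is a generic illustration of such boundary decoding, not the exact decoding procedure of the paper; the maximum answer length is an added assumption:

    import numpy as np

    def best_span(p_start, p_end, max_len=15):
        """Return (start, end) maximizing p_start[s] * p_end[e], with s <= e < s + max_len."""
        best, best_score = (0, 0), -1.0
        for s, ps in enumerate(p_start):
            for e in range(s, min(s + max_len, len(p_end))):
                score = ps * p_end[e]
                if score > best_score:
                    best, best_score = (s, e), score
        return best

    # Toy distributions over an eight-token passage.
    p_start = np.array([0.05, 0.6, 0.1, 0.05, 0.1, 0.05, 0.03, 0.02])
    p_end   = np.array([0.02, 0.1, 0.5, 0.2, 0.08, 0.05, 0.03, 0.02])
    print(best_span(p_start, p_end))   # (1, 2): tokens 1 through 2 form the answer span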

When this paper was released back in November 2016 Match-LSTM method

was the state of the art in Question Answering systems and was at the top of the

leaderboard for the SQuAD dataset

R-NET Matching Reading Comprehension with Self-Matching Networks

In this model [23] first the question and passage are processed by a

bidirectional recurrent network (Mikolov et al 2010) separately They then match the

question and passage with gated attention-based recurrent networks obtaining

question-aware representation for the passage On top of that they apply self-matching

attention to aggregate evidence from the whole passage and refine the passage

representation which is then fed into the output layer to predict the boundary of the

answer span

Question and passage encoding First the words are converted to their

respective word-level embeddings and character level embeddings The character-level

embeddings are generated by taking the final hidden states of a bi-directional recurrent

neural network (RNN) applied to embeddings of characters in the token Such

character-level embeddings have been shown to be helpful to deal with out-of-vocab

(OOV) tokens


They then use a bi-directional RNN to produce new representations $u^Q_t$ and $u^P_t$ of all words in the question and passage respectively.

Figure 3-2 The task of Question Answering [23]

Gated Attention-based Recurrent Networks They use a variant of attention-

based recurrent networks with an additional gate to determine the importance of

information in the passage regarding a question Different from the gates in LSTM or

GRU the additional gate is based on the current passage word and its attention-pooling

vector of the question which focuses on the relation between the question and current

passage word. The gate effectively models the phenomenon that only parts of the passage are relevant to the question in reading comprehension and question answering, and the gated, question-aware representation is utilized in subsequent calculations.


Self-Matching Attention From the previous step the question aware passage

representation is generated to highlight the important parts of the passage One

problem with such representation is that it has very limited knowledge of context One

answer candidate is often oblivious to important cues in the passage outside its

surrounding window To address this problem the authors propose directly matching

the question-aware passage representation against itself It dynamically collects

evidence from the whole passage for words in passage and encodes the evidence

relevant to the current passage word and its matching question information into the

passage representation

Output Layer They use the same method as Wang & Jiang (2016b) and use

pointer networks (Vinyals et al 2015) to predict the start and end position of the

answer In addition they use an attention-pooling over the question representation to

generate the initial hidden vector for the pointer network [23]

When the R-Net Model first appeared in the leaderboard in March 2017 it was at

the top with 72.3 Exact Match and 80.7 F1 score

Bi-Directional Attention Flow (BiDAF) for Machine Comprehension

BiDAF [24] is a hierarchical multi-stage architecture for modeling the

representations of the context paragraph at different levels of granularity BIDAF

includes character-level word-level and contextual embeddings and uses bi-directional

attention flow to obtain a query-aware context representation Their attention layer is not

used to summarize the context paragraph into a fixed-size vector Instead the attention

is computed for every time step and the attended vector at each time step along with

the representations from previous layers can flow through to the subsequent modeling

layer This reduces the information loss caused by early summarization [24]


Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]

Their machine comprehension model is a hierarchical multi-stage process and

consists of six layers

1 Character Embedding Layer maps each word to a vector space using character-level CNNs

2 Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model

3 Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embedding of the words These first three layers are applied to both the query and context They use a LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words We place an LSTM in both directions and concatenate the outputs of the two LSTMs

4 Attention Flow Layer couples the query and context vectors and produces a set of query aware feature vectors for each word in the context

5 Modeling Layer employs a Recurrent Neural Network to scan the context The input to the modeling layer is G which encodes the query-aware representations of context words The output of the modeling layer captures the interaction among the context words conditioned on the query They use two layers of bi-directional LSTM

with the output size of d for each direction. Hence a matrix $M \in \mathbb{R}^{2d \times T}$ is obtained, which is passed on to the output layer to predict the answer

6 Output Layer provides an answer to the query They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices by the predicted distributions averaged over all examples [25]

In a further variation of their above work they add a self-attention layer after the

bi-attention layer to further improve the results. The architecture of the model is as shown in the figure below.

Figure 3-3 The task of Question Answering [25]


Summary

In this chapter we reviewed the methods that are fundamental to the state of the

art in Machine Comprehension and for the task of Question Answering We have

reached close to human-level accuracy, and this is due to incremental developments

over previous models As we saw Out of Vocabulary (OOV) tokens were handled by

using Character embedding Long term dependency within context passage were

solved using self-attention and many other techniques such as Contextualized vectors

Attention Flow etc were employed to get better results In the next chapter we will see

how we can build on these models and develop further


CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING

Although the advancements in Question Answering systems since the release of the SQuAD dataset have been impressive and the results are getting close to human-level accuracy, we are far from having a fool-proof system. The models still make mistakes which would be obvious to a human. For example:

Passage: The Panthers used the San Jose State practice facility and stayed at

the San Jose Marriott The Broncos practiced at Florida State Facility and stayed at the

Santa Clara Marriott

Question: At what university's facility did the Panthers practice?

Actual Answer: San Jose State

Predicted Answer: Florida State Facility

To find out what is leading to the wrong predictions we wanted to see the

attention weights associated with such an example We plotted the passage and

question heat map which is a 2D matrix where the intensity of each cell signifies the

similarity between a passage word and a question word For the above example we

found out that while certain words of the question are given high weightage, other parts are not. The words 'At', 'facility', and 'practice' are given high attention, but 'Panthers' does not receive high attention. If it had received high attention, then the system would have predicted 'San Jose State' as the right answer. To solve this issue we analyzed the

base BiDAF model and proposed adding two things

1 Bi-Attention and Self-Attention over Query

2 Second level attention over the output of (Bi-Attention + Self-Attention) from both Context and Query


Multi-Attention BiDAF Model

Our comprehension model is a hierarchical multi-stage process and consists of

the following layers

1 Embedding Just as all other models we embed words using pretrained word vectors We also embed the characters in each word into size 20 vectors which are learned and run a convolution neural network followed by max-pooling to get character-derived embeddings for each word The character-level and word-level embeddings are then concatenated and passed to the next layer We do not update the word embeddings during training

2 Pre-Process A shared bi-directional GRU (Cho et al 2014) is used to map the question and passage embeddings to context aware embeddings

3 Context Attention The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al 2016) is used to build a query-aware context representation (a NumPy sketch of this attention computation is given after this list). Let h_i be the vector for context word i, q_j be the vector for question word j, and n_q and n_c be the lengths of the question and context respectively. We compute attention between context word i and question word j as

$a_{ij} = w_1 \cdot h_i + w_2 \cdot q_j + w_3 \cdot (h_i \odot q_j)$   (4-1)

where w1, w2, and w3 are learned vectors and $\odot$ is element-wise multiplication. We then compute an attended vector c_i for each context token as

$c_i = \sum_{j=1}^{n_q} \frac{e^{a_{ij}}}{\sum_{j'=1}^{n_q} e^{a_{ij'}}} \, q_j$   (4-2)

We also compute a query-to-context vector q_c:

$q_c = \sum_{i=1}^{n_c} \frac{e^{m_i}}{\sum_{i'=1}^{n_c} e^{m_{i'}}} \, h_i, \qquad m_i = \max_{1 \le j \le n_q} a_{ij}$   (4-3)

The final vector computed for each token is built by concatenating $h_i$, $c_i$, $h_i \odot c_i$, and $q_c \odot c_i$. In our model we subsequently pass the result through a linear layer with ReLU activations

4 Context Self-Attention Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU. Then we apply the same attention mechanism, only now between the passage and itself. In this case we do not use query-to-context attention, and we set $a_{ij} = -\infty$ if $i = j$. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so this output is additionally summed with the input

5 Query Attention For this part we proceed the same way as in the context attention layer, but calculate the weighted sum of the context words for each query word; thus the attended sequence has length equal to the number of query words. Then we calculate context-to-query attention similarly to the query-to-context attention in the context attention layer

Figure 4-1 The modified BiDAF model with multilevel attention

6 Query Self-Attention This part is done the same way as the context self-attention layer but from the output of the Query Attention layer


7 Context Query Bi-Attention + Self-Attention The output of the Context self-attention and the Query Self-Attention layers are taken as input and the same process for Bi-Attention and self-attention is applied on these inputs

8 Prediction In the last layer of our model a bidirectional GRU is applied followed by a linear layer that computes answer start scores for each token The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores The softmax operation is applied to the start and end scores to produce start and end probabilities and we optimize the negative log likelihood of selecting correct start and end tokens
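The context-attention computation of step 3 can be spelled out in NumPy as below. Randomly initialized vectors stand in for real context and question encodings; this is an illustration of equations (4-1) through (4-3) and of the concatenated output, not the actual training code:

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def bidaf_attention(H, Q, w1, w2, w3):
        """H: (n_c, d) context vectors; Q: (n_q, d) question vectors; w1, w2, w3: (d,) learned vectors."""
        # a_ij = w1.h_i + w2.q_j + w3.(h_i * q_j)            -- equation (4-1)
        a = (H @ w1)[:, None] + (Q @ w2)[None, :] + (H * w3) @ Q.T
        c = softmax(a, axis=1) @ Q                           # attended vector c_i per context token (4-2)
        q_c = softmax(a.max(axis=1)) @ H                     # query-to-context vector (4-3)
        return np.concatenate([H, c, H * c, q_c[None, :] * c], axis=1)

    d, n_c, n_q = 4, 7, 5
    rng = np.random.default_rng(2)
    H, Q = rng.normal(size=(n_c, d)), rng.normal(size=(n_q, d))
    out = bidaf_attention(H, Q, *rng.normal(size=(3, d)))
    print(out.shape)   # (7, 16): each context token now carries query-aware features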

Having carried out this modification, we were able to solve the wrong example we started with. The multilevel attention model gives the correct output, "San Jose State". Also, we achieved slightly better scores than the original model, with an F1 score of 85.44 on the SQuAD dev dataset.
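For reference, the Exact Match and F1 numbers quoted in this thesis compare normalized answer strings. The sketch below follows the spirit of the official SQuAD evaluation (lowercasing, stripping punctuation and articles, then measuring token overlap); it is a simplified re-implementation for illustration, not the official script:

    import re
    import string
    from collections import Counter

    def normalize(text):
        """Lowercase, drop punctuation and the articles a/an/the, and collapse whitespace."""
        text = text.lower()
        text = "".join(ch for ch in text if ch not in string.punctuation)
        text = re.sub(r"\b(a|an|the)\b", " ", text)
        return " ".join(text.split())

    def exact_match(prediction, truth):
        return float(normalize(prediction) == normalize(truth))

    def f1(prediction, truth):
        pred, gold = normalize(prediction).split(), normalize(truth).split()
        overlap = sum((Counter(pred) & Counter(gold)).values())
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(pred), overlap / len(gold)
        return 2 * precision * recall / (precision + recall)

    print(exact_match("the Panic of 1901", "Panic of 1901"))          # 1.0 after normalization
    print(round(f1("Florida State Facility", "San Jose State"), 2))   # 0.33: only one token overlaps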

Chatbot Design Using a QA System

Designing a perfect chatbot that passes the Turing test is a fundamental goal for Artificial Intelligence. Although we are many orders of magnitude away from achieving such a goal, domain-specific tasks can be solved with chatbots made from current technology. We had started with a similar objective in mind, i.e., to design a domain-specific chatbot and then generalize to other areas once it is able to achieve the

first domain specific objective robustly This led us to the fundamental problem of

Machine Comprehension and subsequently to the task of Question Answering Having

achieved some degree of success with QA systems we looked back if we could apply

our newly acquired knowledge in the task of designing Chatbots

The chatbots made with today's technologies mostly rely on handcrafted techniques such as template matching, which requires anticipating all possible ways a user may articulate his requirements and a conversation may unfold. This requires a lot of man-hours for designing a domain-specific system and is still very error prone. In this section we propose a general chatbot design that would make the design of a domain-specific chatbot very easy and robust at the same time.

Every domain specific chatbot needs to obtain a set of information from the user

and show some results based on the user specific information obtained The traditional

chatbots use template matching and keywords lookup to determine if the user has

provided the required information Our idea is to use the Question Answering system in

the backend to extract out the required information from whatever the user has typed

until this point of the conversation The information to be extracted can be posed in the

form of a set of questions and the answers obtained from those questions can be used

as the parameters to supply the relevant information to the user

We had chosen our chatbot domain as the flight reservation system Our goal

was to extract the required information from the user to be able to show him the

available flights as per the user's requirements

Figure 4-2 Flight reservation chatbot's chat window


For a flight reservation task, the booking agent needs to know the origin city, destination city, and date of travel at minimum to be able to show the available flights. Optional information includes the number of tickets, the passenger's name, one way or round trip, etc.

The minimalistic conversation with the user through the chat window would be as

shown above We had a platform called OneTask on which we wanted to implement our

chat bot The chat interface within the OneTask system looks as follows

Figure 4-3 Chatbot within OneTask system

The working of the chat system is as follows:

1 Initiation The user opens the chat window, which starts a session with the chatbot. The chatbot asks 'How may I help you?'

2 User Reply The user may reply with none of the required information for flight booking or may reply with multiple pieces of information in the same message

3 User Reply Parsing The conversation up to this point is treated as a passage and the internal questions are run on this passage. The questions that are run are:

Where do you want to go?

From where do you want to leave?

When do you want to depart?

4 Parsed responses from QA model After running the questions on the QA system with the given conversation, answers are obtained for all of them along with their corresponding confidence values. Since answers will be returned even if the required question has not actually been answered up to this point in the conversation, the confidence values play a crucial role in determining the validity of an answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly, a confidence of 2–10 signifies that it may have been answered but we should verify with the user for correctness, and any confidence below 2 is discarded (a sketch of this slot-filling loop is given after this list).

5 Asking remaining questions iteratively After the parsing, it is checked whether any of the required questions are still unanswered. If so, the chatbot asks the remaining question and the process from steps 3–4 is carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request.
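A skeleton of this slot-filling loop is sketched below. The toy_qa_model function is only a stand-in for the trained QA system described earlier (its keyword heuristic is purely for demonstration), and the confidence thresholds mirror the ranges listed in step 4:

    REQUIRED_SLOTS = {
        "destination": "Where do you want to go?",
        "origin": "From where do you want to leave?",
        "date": "When do you want to depart?",
    }

    def toy_qa_model(passage, question):
        """Toy stand-in for the trained QA model: returns (answer, confidence).
        A real system would run the reading-comprehension network here."""
        keywords = {"go": "Miami", "leave": "Gainesville", "depart": "tomorrow"}
        for key, answer in keywords.items():
            if key in question and answer.lower() in passage.lower():
                return answer, 12.0
        return "", 0.0

    def fill_slots(conversation, qa, slots=REQUIRED_SLOTS, confirm_low=2.0, accept=10.0):
        """Run the internal questions over the conversation so far and bucket the answers."""
        accepted, to_confirm, missing = {}, {}, []
        for slot, question in slots.items():
            answer, confidence = qa(conversation, question)
            if confidence > accept:
                accepted[slot] = answer        # treat the slot as answered
            elif confidence > confirm_low:
                to_confirm[slot] = answer      # answered, but verify with the user
            else:
                missing.append(slot)           # ask this question explicitly next turn
        return accepted, to_confirm, missing

    conversation = "User: I want to fly from Gainesville to Miami."
    accepted, to_confirm, missing = fill_slots(conversation, toy_qa_model)
    print(accepted)   # {'destination': 'Miami', 'origin': 'Gainesville'}
    print(missing)    # ['date']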

Figure 4-4 The Flow diagram of the Flight booking Chatbot system

Online QA System and Attention Visualization

To be able to test out various examples, we used the BiDAF [24] model for an online demo. One can either choose from the available examples in the drop-down menu or paste their own passage and questions. While this is a useful and interesting

system to test the model in a user-friendly way we created this system to be able to

focus on the wrong samples


Figure 4-5 QA system interface with attention highlight over candidate answers

The candidate answers are shown with the blue highlights as per their

confidence values The higher the confidence the darker the answer The highest

confidence value is chosen as the predicted answer We developed the system to show

attention spread of the candidate answers to realize what needs to be done to improve

the system This led us to realize the importance of including the query attention part as

well as multilevel attention on the BiDAF model as described in the first section of this

chapter


CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION

Future Directions

Having done a thorough analysis of the current methods and state of the arts in

Machine Comprehension and for the QA task and developing systems on it we have

achieved a strong sense of what needs to be done to further improve the QA models

After observing the wrong samples we can see that the system is still unable to encode

meaning and is picking answers based on statistical patterns of occurrence of answers

on training examples

As per the State of the Art models and analyzing the ongoing research literature

it is easy to conclude that more features need to be embedded to encode the meaning

of words, phrases, and sentences. A paper called Reinforced Mnemonic Reader for Machine Comprehension [6] encoded POS and NER tags of words along with their word and character embeddings. This gave them better results.

Figure 5-1 An English language semantic parse tree [26]


We have developed a method to encode the syntax parse tree of a sentence that not only encodes the POS tags but also the relation of the word within the phrase and the relation of the phrase within the whole sentence in a hierarchical manner.
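As a small illustration of the kind of structure shown in Figure 5-1, the sketch below parses a sentence with a hand-written toy grammar using NLTK; the grammar covers only this one sentence and is purely illustrative, and our own hierarchical encoding scheme is not shown here:

    import nltk

    # A toy context-free grammar that covers a single example sentence.
    grammar = nltk.CFG.fromstring("""
        S   -> NP VP
        NP  -> Det N | NP PP
        VP  -> V NP | VP PP
        PP  -> P NP
        Det -> 'the'
        N   -> 'dog' | 'park'
        V   -> 'saw'
        P   -> 'in'
    """)

    parser = nltk.ChartParser(grammar)
    sentence = "the dog saw the dog in the park".split()
    for tree in parser.parse(sentence):
        print(tree)        # the hierarchical phrase structure of the sentence
        break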

Finally, data augmentation is another solution to get better results. One definite way to reduce the errors would be to include in the training data samples similar to those the system is faltering on in the dev set. One could generate examples similar to the failure cases and include them in the training set to obtain better predictions. Another approach would be to train on similar and bigger datasets. Our models were trained on the SQuAD dataset. There are other datasets too which address a similar question answering task, such as TriviaQA [3]. We could augment the training set of TriviaQA along with SQuAD to

have a more robust system that is able to generalize better and thus have higher

accuracy for predicting answer spans

Conclusion

In this work we have tried to explore the most fundamental techniques that have

shaped the current state of the art Then we proposed a minor improvement of

architecture over an existing model. Furthermore, we developed two applications that use the base model. First we talked about how a chatbot application can be made

using the QA system and lastly we also created a web interface where the model can

be used for any Passage and Question This interface also shows the attention spread

on the candidate answers. While our effort to push the state of the art forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will pay high dividends for society at large.


LIST OF REFERENCES

[1] Danqi Chen From Reading Comprehension To Open-Domain Question Answering 2018 Youtube Accessed March 16 2018 https://www.youtube.com/watch?v=1RN88O9C13U

[2] Lehnert Wendy G The Process of Question Answering No RR-88 Yale Univ New Haven Conn Dept of computer science 1977

[3] Joshi Mandar Eunsol Choi Daniel S Weld and Luke Zettlemoyer Triviaqa A large scale distantly supervised challenge dataset for reading comprehension arXiv preprint arXiv170503551 (2017)

[4] Rajpurkar Pranav Jian Zhang Konstantin Lopyrev and Percy Liang Squad 100000+ questions for machine comprehension of text arXiv preprint arXiv160605250 (2016)

[5] The Stanford Question Answering Dataset 2018 rajpurkar.github.io Accessed March 16 2018 https://rajpurkar.github.io/SQuAD-explorer/

[6] Hu Minghao Yuxing Peng and Xipeng Qiu Reinforced mnemonic reader for machine comprehension CoRR abs170502798 (2017)

[7] Schalkoff Robert J Artificial neural networks Vol 1 New York McGraw-Hill 1997

[8] Me For The AI and Neetesh Mehrotra 2018 The Connect Between Deep Learning And AI - Open Source For You Open Source For You Accessed March 16 2018 httpopensourceforucom201801connect-deep-learning-ai

[9] Krizhevsky Alex Ilya Sutskever and Geoffrey E Hinton Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems pp 1097-1105 2012

[10] An Intuitive Explanation Of Convolutional Neural Networks 2016 The Data Science Blog Accessed March 16 2018 https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/

[11] Williams Ronald J and David Zipser A learning algorithm for continually running fully recurrent neural networks Neural computation 1 no 2 (1989) 270-280

[12] Hochreiter Sepp and Jürgen Schmidhuber Long short-term memory Neural computation 9 no 8 (1997) 1735-1780

[13] Chung Junyoung Caglar Gulcehre KyungHyun Cho and Yoshua Bengio Empirical evaluation of gated recurrent neural networks on sequence modeling arXiv preprint arXiv14123555 (2014)


[14] Mikolov Tomas Wen-tau Yih and Geoffrey Zweig Linguistic regularities in continuous space word representations In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics Human Language Technologies pp 746-751 2013

[15] Pennington Jeffrey Richard Socher and Christopher Manning Glove Global vectors for word representation In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) pp 1532-1543 2014

[16] Word2vec Tutorial - The Skip-Gram Model · Chris McCormick 2016 mccormickml.com Accessed March 16 2018 http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

[17] Group 2015 [Paper Introduction] Bilingual Word Representations With Monolingual … SlideShare Accessed March 16 2018 https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind

[18] Attention Mechanism 2016 blog.heuritech.com Accessed March 16 2018 https://blog.heuritech.com/2016/01/20/attention-mechanism/

[19] Weston Jason Sumit Chopra and Antoine Bordes 2014 Memory Networks arXiv preprint arXiv:1410.3916v11 Accessed March 16 2018 https://arxiv.org/abs/1410.3916

[20] Wang Shuohang and Jing Jiang Machine comprehension using match-lstm and answer pointer arXiv preprint arXiv160807905 (2016)

[21] Wang Shuohang and Jing Jiang Learning natural language inference with LSTM arXiv preprint arXiv151208849 (2015)

[22] Vinyals Oriol Meire Fortunato and Navdeep Jaitly Pointer networks In Advances in Neural Information Processing Systems pp 2692-2700 2015

[23] Wang Wenhui Nan Yang Furu Wei Baobao Chang and Ming Zhou Gated self-matching networks for reading comprehension and question answering In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1 Long Papers) vol 1 pp 189-198 2017

[24] Seo Minjoon Aniruddha Kembhavi Ali Farhadi and Hannaneh Hajishirzi Bidirectional attention flow for machine comprehension arXiv preprint arXiv161101603 (2016)

[25] Clark Christopher and Matt Gardner Simple and effective multi-paragraph reading comprehension arXiv preprint arXiv171010723 (2017)


[26] 8 Analyzing Sentence Structure 2018 nltk.org Accessed March 16 2018 http://www.nltk.org/book/ch08.html


BIOGRAPHICAL SKETCH

Purnendu Mukherjee grew up in Kolkata India and had a strong interest for

computers since an early age After his high school he did his BSc in computer

science from Ramakrishna Mission Residential College Narendrapur followed by MSc

in computer science from St Xavier's College, Kolkata. He had a strong intuition and interest for human-like learning systems and wanted to work in this area. He started working at TCS Innovation Labs, Pune, on the application of Natural Language Processing in Educational Applications. As he was deeply passionate about learning systems that mimic the human brain and learn like a human child does, he was increasingly interested in Deep Learning and its applications. After working for a year

he went on to pursue a Master of Science degree in computer science from the

University of Florida Gainesville His academic interests have been focused on Deep

Learning and Natural Language Processing and he has been working on Machine

Reading Comprehension since summer of 2017

  • ACKNOWLEDGMENTS
  • LIST OF FIGURES
  • LIST OF ABBREVIATIONS
  • INTRODUCTION
  • THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
    • Neural Networks
    • Convolutional Neural Network
    • Recurrent Neural Networks (RNN)
    • Word Embedding
    • Attention Mechanism
    • Memory Networks
      • LITERATURE REVIEW AND STATE OF THE ART
        • Machine Comprehension Using Match-LSTM and Answer Pointer
        • R-NET Matching Reading Comprehension with Self-Matching Networks
        • Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
        • Summary
          • MULTI-ATTENTION QUESTION ANSWERING
            • Multi-Attention BiDAF Model
            • Chatbot Design Using a QA System
            • Online QA System and Attention Visualization
              • FUTURE DIRECTIONS AND CONCLUSION
                • Future Directions
                • Conclusion
                  • LIST OF REFERENCES
                  • BIOGRAPHICAL SKETCH
Page 8: QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISMufdcimages.uflib.ufl.edu/UF/E0/05/23/73/00001/MUKHERJEE_P.pdf · QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM By PURNENDU

8

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the

Requirements for the Degree of Master of Science

QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM

By

Purnendu Mukherjee

May 2018

Chair Xiaolin Li Major Computer Science

Question Answering(QA) systems have had rapid growth since the last 3 years

and are close to reaching human level accuracy One of the fundamental reason for this

growth has been the use of attention mechanism along with other methods of Deep

Learning But just as with other Deep Learning methods some of the failure cases are

so obvious that it convinces us that there is a lot to improve upon In this work we first

did a literature review of the State of the Art and fundamental models in QA systems

Next we introduce an architecture which has shown improvement in the targeted area

We then introduce a general method to enable easy design of domain-specific Chatbot

applications and present a proof of the concept with the same method Finally we

present an easy to use Question Answering interface with attention visualization on the

passage We also propose a method to improve the current state of the art as a part of

our ongoing work

9

CHAPTER 1 INTRODUCTION

Teaching machines to read and understand human natural language is our

central and long standing goal for Natural Language Processing and Artificial

Intelligence in general The positive side of machines being able to reason and

comprehend human language could be enormous We are beginning to see how

commercial systems like Alexa Siri Google Now etc are being widely used as Speech

Recognition improved to human levels While speech recognition systems can

transcribe speech to text comprehension of that transcribed text is another task which

is currently a major focus for both academia and industry because of its possible

applications Moreover all the text information available throughout the internet is a

major reason why machine comprehension of text is such an important task

With the growth of Deep Learning methods in the last few years the field of

Machine Comprehension and Natural Language Processing(NLP) in general has

experienced a revolution While the traditional methods and practices are still prevalent

and forms the basis of our deep understanding of languages Deep Learning methods

have surpassed all traditional NLP and Machine Learning methods by a significant

margin and are currently driving the growth of the field

To be able to build a system that can understand human text we need to first

ask ourselves how can we evaluate the machinersquos comprehension ability We had

initially set a goal to build a Chabot for a specific domain and generalize to other topics

as we go ahead While developing the system we found out the necessity for reading

comprehension and how to measure it We finally found the answer with Questions

Answering systems

10

Just like how we human beings are tested for our ability of language

understanding with questions we should ask machines similar questions about what it

has just read The performance of the system on such question answering task will let

us evaluate how much the machine is able to reason about what it just read [1]

Reading comprehension has been a topic of Natural Language Understanding since the

1970s In 1977 Wendy Lehnert said in his doctoral thesis ndash ldquoOnly when we can ask a

program to answer questions about what it reads will we be able to begin to access that

programrsquos comprehensionrdquo [2]

Figure 1-1 The task of Question Answering

To achieve this task the NLP community has developed various datasets such as

CNN Daily Mail WebQuestions SQuAD TriviaQA [3] etc For our purpose we chose

SQuAD which stands for Stanford Question Answering Dataset [4] SQuAD consists of

questions posed by crowdworkers on a set of Wikipedia articles where the answer to

every question is a segment of text or span from the corresponding reading passage

With 100000+ question-answer pairs on 500+ articles SQuAD is significantly larger

than previous reading comprehension datasets [5]

An example from the SQuAD dataset is as follows

11

Passage Tesla later approached Morgan to ask for more funds to build a more

powerful transmitter When asked where all the money had gone Tesla responded by

saying that he was affected by the Panic of 1901 which he (Morgan) had caused

Morgan was shocked by the reminder of his part in the stock market crash and by

Teslarsquos breach of contract by asking for more funds Tesla wrote another plea to

Morgan but it was also fruitless Morgan still owed Tesla money on the original

agreement and Tesla had been facing foreclosure even before construction of the

tower began

Question On what did Tesla blame for the loss of the initial money

Answer Panic of 1901

As we started exploring the QA task we faced several challenges Some of them

we could solve with the help of other research and some of the challenges still exist in

the domain

bull Out of Vocabulary words

bull Multi-sentence reasoning may be required

bull There may exist several candidate answers

bull Optimizing the Exact Match (EM) metric may fail when the answer boundary is fuzzy or too long such as the answer of the ldquowhyrdquo query [6]

bull ldquoOne-hoprdquo prediction may fail to fully understand the query [6]

bull Fail to fully capture the long-distance contextual interaction between parts of the context by only using LSTMGRU

bull Current models are unable to capture semantics of the passage

In the upcoming chapters we will first briefly review the basics necessary for

understanding the models then we will delve deep into the fundamental models that

12

have shaped the current State of the Art models then we will discuss our contribution in

terms of architecture and applications and finally conclude with future directions

13

CHAPTER 2 THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM

To build a question answering system one needs to be familiar with the

fundamental deep learning models such as Recurrent Neural Networks (RNN) Long

Short Term Memory (LSTM) etc In this chapter we will have an overview on these

techniques and see how they all connect to building a question answering system

Neural Networks

What makes Deep Learning so intriguing is that it has close resemblance with

the working of the mammalian brain or at least draws inspiration from it The same can

be said for Artificial Neural networks [7] which consists of a system of interconnected

units called lsquoneuronsrsquo that take input from similar units and produces a single output

Figure 2-1 Simple and Deep Learning Neural Networks [8]

The connection from one neuron to another can be weighted based on the input data

which enables the network to tune itself to produce a certain output based on the input

This is the learning process which is achieved through backpropagation which is a

system of propagating the error from the output layer to the previous layers

14

Convolutional Neural Network

The first wave of deep learningrsquos success was brought by Convolutional Neural

Networks (CNN) [9] when this was the technique used by the winning team of ImageNet

competition in 2012 CNNs are deep artificial neural networks (ANN) that can be used to

classify images cluster them by similarity and perform object recognition within scenes

It can be used to detect and identify faces people signs or any other visual data

Figure 2-2 Convolutional Neural Network Architecture [10]

There are primarily four operations in a standard CNN model (as shown in Fig above)

1 Convolution - The primary purpose of Convolution in the ConvNet (above) is to extract features from the input image The spatial relationship between pixels ie the image features are preserved and learned by the convolution using small squares of input data

2 Non-Linearity (ReLU) ndash Rectified Linear Unit (ReLU) is a non-linear operation that carries out an element wise operation on each pixel This operation replaces the negative pixel values in the feature map by zero

3 Pooling or Sub Sampling - Spatial Pooling reduces the dimensionality of each feature map but retains the most important information For max pooling the largest value in the square window is taken and rest are dropped Other types of pooling are Average Sum etc

4 Classification (Fully Connected Layer) - The Fully Connected layer is a traditional Multi-Layer Perceptron as described before that uses a softmax activation function in the output layer The high-level features of the image are encoded by the convolutional and pooling layers which is then fed to the fully connected layer which then uses these features for classifying the input image into various classes based on the training dataset [10]

15

When a new image is fed into the CNN model all the above-mentioned steps are

carried out (forward propagation) and a probability distribution is achieved on the set of

output classes With a large enough training dataset the network will learn and

generalize well enough to classify new images into their correct classes

Recurrent Neural Networks (RNN)

Whenever we want to predict or encode sequential data RNN [11] is our go to

method RNNs perform the same task for every element of a sequence where the

output of each element depends on previous computations thus the recurrence In

practice RNNs are unable to retain long-term dependencies and can look back only a

few steps because of the vanishing gradient problem

(2-1)

A solution to the dependency problem is to use gated cells such as LSTM [11] or

GRU [13] These cells pass on important information to the next cells while ignoring

non-important ones The gated units in a GRU block are

bull Update Gate ndash Computed based on current input and hidden state

(2-2)

bull Reset Gate ndash Calculated similarly but with different weights

(2-3)

bull New memory content - (2-4)

If reset gate unit is ~0 then previous memory is ignored and only new

information is kept

16

Final memory at current time step combines previous and current time steps

(2-5)

While the GRU is computationally efficient the LSTM on the other hand is a

general case where there are three gates as follows

bull Input Gate ndash What new information to add to the current cell state

bull Forget Gate ndash How much information from previous states to be kept

bull Output gate ndash How much info should be sent to the next states

Just like GRU the current cell state is a sum of the previous cell state but

weighted by the forget gate and the new value is added which is weighted by the input

gate Based on the cell state the output gate regulates the final output

Word Embedding

Computation or gradients can be applied on numbers and not on words or letters

So first we need to convert words into their corresponding numerical formation before

feeding into a deep learning model In general there are two types of word embedding

Frequency based (which constitutes count vectors tf-idf and co-occurrence vectors)

and Prediction based With frequency based embedding the order of the words are not

preserved and works as a bag of words model Whereas with prediction based model

the order of words or locality of words are taken into consideration to generate the

numerical representation of the word Within this prediction based category there are

two fundamental techniques called Continuous Bag of Words (CBOW) and Skip Gram

Model which forms the basis for word2vec [14] and GloVe [15]

The basic intuition behind word2vec is that if two different words have very

similar ldquocontextsrdquo (that is what words are likely to appear around them) then the model

17

will produce similar vector for those words Conversely if the two word vectors are

similar then the network will produce similar context predictions for the same two words

For examples synonyms like ldquointelligentrdquo and ldquosmartrdquo would have very similar contexts

Or that words that are related like ldquoenginerdquo and ldquotransmissionrdquo would probably have

similar contexts as well [16] Plotting the word vectors learned by a word2vec over a

large corpus we could find some very interesting relationships between words

Figure 2-3 Semantic relation between words in vector space [17]

Attention Mechanism

We as humans put our attention to things are important or are relevant in a

context For example when asked a question from a passage we try to find the most

relevant part of the passage the question is relevant with and then reason from our

understanding of that part of the passage The same idea applies for attention

mechanism in Deep Learning It is used to identify the specific parts of a given context

to which the current question is relevant to

Formally put the techniques take n arguments y_1 y_n (in our case the

passage having words say y_i through h_i) and a question word say q It returns a

vector z which is supposed to be the laquo summary raquo of the y_i focusing on information

linked to the question q More formally it returns a weighted arithmetic mean of the y_i

18

and the weights are chosen according the relevance of each y_i given the context c

[18]

Figure 2-4 Attention Mechanism flow [18]

Memory Networks

Convolutional Neural Networks and Recurrent Neural Networks which does

capture how we form our visual and sequential memories their memory (encoded by

hidden states and weights) were typically too small and was not compartmentalized

enough to accurately remember facts from the past (knowledge is compressed into

dense vectors) [19]

Deep Learning needed to cultivate a methodology that preserved memories as

they are such that it wonrsquot be lost in generalization and recalling exact words or

sequence of events would be possible mdash something computers are already good at This

effort led us to Memory Networks [19] published at ICLR 2015 by Facebook AI

Research

This paper provides a basic framework to store augment and retrieve memories

while seamlessly working with a Recurrent Neural Network architecture The memory

19

network consists of a memory m (an array of objects 1 indexed by m i) and four

(potentially learned) components I G O and R as follows

I (input feature map) mdash converts the incoming input to the internal feature

representation either a sparse or dense feature vector like that from word2vec or

GloVe

G (generalization) mdash updates old memories given the new input They call this

generalization as there is an opportunity for the network to compress and generalize its

memories at this stage for some intended future use The analogy Irsquove been talking

before

O (output feature map) mdash produces a new output (in the feature representation

space) given the new input and the current memory state This component is

responsible for performing inference In a question answering system this part will

select the candidate sentences (which might contain the answer) from the story

(conversation) so far

R (response) mdash converts the output into the response format desired For

example a textual response or an action In the QA system described this component

finds the desired answer and then converts it from feature representation to the actual

word

This model is a fully supervised model meaning all the candidate sentences from

which the answer could be found are marked during training phase and can also be

termed as lsquohard attentionrsquo

The authors tested out the QA system on various literature including Lord of the

Rings

20

Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]

21

CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART

There has been rapid progress since the release of the SQuAD dataset, and currently the best ensemble models are close to human-level accuracy in machine comprehension. This is due to various ingenious methods that solve some of the problems of previous approaches: out-of-vocabulary tokens were handled using character embeddings, long-term dependencies within the context passage were addressed using self-attention, and many other techniques such as contextualized vectors, history of words, and attention flow were introduced. In this chapter we look at some of the most important models that were fundamental to the progress of Question Answering.

Machine Comprehension Using Match-LSTM and Answer Pointer

In this paper [20], the authors propose an end-to-end neural architecture for the QA task. The architecture is based on match-LSTM [21], a model they previously proposed for textual entailment, and Pointer Net [22], a sequence-to-sequence model proposed by Vinyals et al. (2015) that constrains the output tokens to be from the input sequences. The model consists of an LSTM preprocessing layer, a match-LSTM layer, and an Answer Pointer layer.

We are given a piece of text, which we refer to as a passage, and a question related to the passage. The passage is represented by a matrix P (of dimension d by P), where P is the length (number of tokens) of the passage and d is the dimensionality of the word embeddings. Similarly, the question is represented by a matrix Q (of dimension d by Q), where Q is the length of the question. Our goal is to identify a subsequence of the passage as the answer to the question.

Figure 3-1 Match-LSTM Model Architecture [20]

LSTM Preprocessing Layer: They use a standard one-directional LSTM (Hochreiter & Schmidhuber, 1997) to process the passage and the question separately.

Match-LSTM Layer: They applied the match-LSTM model proposed for textual entailment to the machine comprehension problem by treating the question as the premise and the passage as the hypothesis. The match-LSTM goes through the passage sequentially. At position i of the passage, it first uses the standard word-by-word attention mechanism to obtain an attention weight vector as follows:

(3-1)

where the weight matrices and bias terms in Equation 3-1 are parameters to be learned.

Answer Pointer Layer: The Answer Pointer (Ans-Ptr) layer is motivated by the Pointer Net introduced by Vinyals et al. (2015) [22]. The boundary model predicts only the start token and the end token of the answer; all the tokens between these two in the original passage are then considered to be the answer.
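To illustrate the decoding step of the boundary model (our own sketch, not code from [20]; the span-length cap max_len is a common practical constraint added here), the best answer span can be selected from the predicted start and end distributions as follows:

    import numpy as np

    def best_span(p_start, p_end, max_len=30):
        """Pick (i, j) maximizing p_start[i] * p_end[j] with i <= j <= i + max_len."""
        best, best_score = (0, 0), -1.0
        for i, ps in enumerate(p_start):
            for j in range(i, min(i + max_len + 1, len(p_end))):
                score = ps * p_end[j]
                if score > best_score:
                    best, best_score = (i, j), score
        return best  # token indices of the answer span, inclusive

    # toy usage with a 6-token passage
    p_start = np.array([0.05, 0.7, 0.1, 0.05, 0.05, 0.05])
    p_end   = np.array([0.05, 0.1, 0.6, 0.1, 0.1, 0.05])
    print(best_span(p_start, p_end))   # -> (1, 2)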

When this paper was released in November 2016, the Match-LSTM method was the state of the art in Question Answering systems and was at the top of the leaderboard for the SQuAD dataset.

R-NET: Machine Reading Comprehension with Self-Matching Networks

In this model [23], the question and passage are first processed separately by a bidirectional recurrent network (Mikolov et al., 2010). The question and passage are then matched with gated attention-based recurrent networks, obtaining a question-aware representation of the passage. On top of that, self-matching attention is applied to aggregate evidence from the whole passage and refine the passage representation, which is then fed into the output layer to predict the boundary of the answer span.

Question and Passage Encoding: First, the words are converted to their respective word-level and character-level embeddings. The character-level embeddings are generated by taking the final hidden states of a bi-directional recurrent neural network (RNN) applied to the embeddings of the characters in each token. Such character-level embeddings have been shown to help deal with out-of-vocabulary (OOV) tokens.

They then use a bi-directional RNN to produce new representations of all words in the question and the passage, respectively.

Figure 3-2 R-NET model architecture [23]

Gated Attention-Based Recurrent Networks: They use a variant of attention-based recurrent networks with an additional gate to determine the importance of information in the passage with respect to the question. Unlike the gates in an LSTM or GRU, the additional gate is based on the current passage word and its attention-pooling vector of the question, which focuses on the relation between the question and the current passage word. The gate effectively models the phenomenon that only parts of the passage are relevant to the question in reading comprehension and question answering, and this gated representation is what is used in subsequent calculations.
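A minimal sketch of the gating idea (our own illustration; a simplified bilinear scoring function is used here in place of R-NET's additive attention, and the parameter names are ours):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gated_recurrent_input(u_p, U_Q, W_att, W_g):
        """u_p: (d,) current passage word vector; U_Q: (m, d) question word vectors."""
        # attention-pooling vector of the question for this passage word
        scores = U_Q @ W_att @ u_p                 # (m,) relevance of each question word
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()
        c = alpha @ U_Q                            # (d,) attention-pooled question vector
        # additional gate over the concatenated pair [u_p; c]
        pair = np.concatenate([u_p, c])            # (2d,)
        g = sigmoid(W_g @ pair)                    # element-wise gate in [0, 1]
        return g * pair                            # gated input fed to the recurrent cell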

Self-Matching Attention: From the previous step, a question-aware passage representation is generated to highlight the important parts of the passage. One problem with such a representation is that it has very limited knowledge of context: one answer candidate is often oblivious to important cues in the passage outside its surrounding window. To address this problem, the authors propose directly matching the question-aware passage representation against itself. It dynamically collects evidence from the whole passage for each word in the passage and encodes the evidence relevant to the current passage word, together with its matching question information, into the passage representation.

Output Layer: They use the same method as Wang & Jiang (2016b) and use pointer networks (Vinyals et al., 2015) to predict the start and end positions of the answer. In addition, they use attention-pooling over the question representation to generate the initial hidden vector for the pointer network [23].

When the R-NET model first appeared on the leaderboard in March 2017, it was at the top with a 72.3 Exact Match and 80.7 F1 score.

Bi-Directional Attention Flow (BiDAF) for Machine Comprehension

BiDAF [24] is a hierarchical multi-stage architecture for modeling the representations of the context paragraph at different levels of granularity. BiDAF includes character-level, word-level, and contextual embeddings, and uses bi-directional attention flow to obtain a query-aware context representation. Their attention layer is not used to summarize the context paragraph into a fixed-size vector. Instead, the attention is computed for every time step, and the attended vector at each time step, along with the representations from previous layers, can flow through to the subsequent modeling layer. This reduces the information loss caused by early summarization [24].

Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]

Their machine comprehension model is a hierarchical multi-stage process and consists of six layers:

1 Character Embedding Layer maps each word to a vector space using character-level CNNs

2 Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model

3 Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embedding of the words. These first three layers are applied to both the query and the context. They use an LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words; an LSTM is run in both directions and the outputs of the two LSTMs are concatenated.

4 Attention Flow Layer couples the query and context vectors and produces a set of query aware feature vectors for each word in the context

5 Modeling Layer employs a Recurrent Neural Network to scan the context. The input to the modeling layer is G, which encodes the query-aware representations of the context words. The output of the modeling layer captures the interaction among the context words conditioned on the query. They use two layers of bi-directional LSTM with an output size of d for each direction. Hence a matrix is obtained, which is passed on to the output layer to predict the answer.

6 Output Layer provides an answer to the query. They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices by the predicted distributions, averaged over all examples [25] (a short sketch of this loss is given after this list).
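For one training example, that loss can be sketched as follows (our own illustration):

    import numpy as np

    def span_nll(p_start, p_end, true_start, true_end):
        """Negative log probability of the true start and end indices."""
        eps = 1e-12                               # avoid log(0)
        return -(np.log(p_start[true_start] + eps) + np.log(p_end[true_end] + eps))

    # averaged over all examples in a batch, this is the training loss to minimize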

In a further variation of the above work [25], a self-attention layer is added after the bi-attention layer to further improve the results. The architecture of this model is shown in Figure 3-3.

Figure 3-3 The task of Question Answering [25]


Summary

In this chapter we reviewed the methods that are fundamental to the state of the art in Machine Comprehension and the task of Question Answering. We have come close to human-level accuracy, and this is due to incremental developments over previous models. As we saw, out-of-vocabulary (OOV) tokens were handled using character embeddings, long-term dependencies within the context passage were addressed using self-attention, and many other techniques, such as contextualized vectors and attention flow, were employed to get better results. In the next chapter we will see how we can build on these models and develop them further.

CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING

Although the advancements in Question Answering systems since the release of the SQuAD dataset have been impressive and the results are getting close to human-level accuracy, these are far from fool-proof systems. The models still make mistakes that would be obvious to a human. For example:

Passage: The Panthers used the San Jose State practice facility and stayed at the San Jose Marriott. The Broncos practiced at Florida State Facility and stayed at the Santa Clara Marriott.

Question: At what university's facility did the Panthers practice?

Actual Answer: San Jose State

Predicted Answer: Florida State Facility

To find out what leads to the wrong predictions, we wanted to see the attention weights associated with such an example. We plotted the passage-question heat map, which is a 2D matrix where the intensity of each cell signifies the similarity between a passage word and a question word (a small plotting sketch is given after the list below). For the above example, we found that while certain words of the question are given high weight, other parts are not. The words 'At', 'facility', and 'practice' are given high attention, but 'Panthers' does not receive high attention. If it had received high attention, the system would have predicted 'San Jose State' as the right answer. To solve this issue, we analyzed the base BiDAF model and proposed adding two things:

1 Bi-Attention and Self-Attention over Query

2 Second level attention over the output of (Bi-Attention + Self-Attention) from both Context and Query
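For reference, an attention heat map of the kind described above can be produced with a few lines of matplotlib (our own sketch; att is assumed to be the context-by-question similarity matrix extracted from the model):

    import matplotlib.pyplot as plt

    def plot_attention(att, context_tokens, question_tokens):
        """att: 2D array of shape (len(context_tokens), len(question_tokens))."""
        fig, ax = plt.subplots()
        ax.imshow(att, cmap="Blues", aspect="auto")
        ax.set_xticks(range(len(question_tokens)))
        ax.set_xticklabels(question_tokens, rotation=90)
        ax.set_yticks(range(len(context_tokens)))
        ax.set_yticklabels(context_tokens)
        ax.set_xlabel("question words")
        ax.set_ylabel("passage words")
        plt.tight_layout()
        plt.show()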


Multi-Attention BiDAF Model

Our comprehension model is a hierarchical multi-stage process and consists of the following layers:

1 Embedding: Just as in all other models, we embed words using pretrained word vectors. We also embed the characters of each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to get character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training.

2 Pre-Process A shared bi-directional GRU (Cho et al 2014) is used to map the question and passage embeddings to context aware embeddings

3 Context Attention: The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al., 2016) is used to build a query-aware context representation. Let h_i be the vector for context word i, q_j be the vector for question word j, and n_q and n_c be the lengths of the question and context respectively. We compute attention between context word i and question word j as

(4-1)

where w1, w2, and w3 are learned vectors and ⊙ denotes element-wise multiplication (a code sketch of Equations 4-1 through 4-3 is given after this list). We then compute an attended vector c_i for each context token as

(4-2)

We also compute a query-to-context vector q_c

(4-3)

The final vector computed for each token is built by concatenating h_i, c_i, h_i ⊙ c_i, and h_i ⊙ q_c. In our model we subsequently pass the result through a linear layer with ReLU activations.

4 Context Self-Attention: Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU. Then we apply the same attention mechanism, only now between the passage and itself; in this case we do not use query-to-context attention, and we set a_ij = -inf if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so this output is additionally summed with the input.

5 Query Attention: For this part we proceed the same way as in the context attention layer, but calculate the weighted sum of the context words for each query word; thus the output length equals the number of query words. Then we calculate a context-to-query vector, similar to the query-to-context vector in the context attention layer.

Figure 4-1 The modified BiDAF model with multilevel attention

6 Query Self-Attention This part is done the same way as the context self-attention layer but from the output of the Query Attention layer


7 Context Query Bi-Attention + Self-Attention: The outputs of the context self-attention and the query self-attention layers are taken as input, and the same process of bi-attention and self-attention is applied to these inputs.

8 Prediction: In the last layer of our model, a bidirectional GRU is applied, followed by a linear layer that computes answer start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log likelihood of selecting correct start and end tokens.
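The core of the context attention layer (Equations 4-1 through 4-3) can be sketched in NumPy as follows. This is our own illustration of the trilinear scoring a_ij = w1.h_i + w2.q_j + w3.(h_i * q_j) and the attended vectors described above, not the training implementation:

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def bi_attention(H, Q, w1, w2, w3):
        """H: (n_c, d) context vectors, Q: (n_q, d) question vectors, w1/w2/w3: (d,)."""
        # Equation 4-1: a_ij = w1.h_i + w2.q_j + w3.(h_i * q_j)
        A = (H @ w1)[:, None] + (Q @ w2)[None, :] + (H * w3) @ Q.T   # (n_c, n_q)
        # Equation 4-2: attended question vector c_i for each context token
        C = softmax(A, axis=1) @ Q                                   # (n_c, d)
        # Equation 4-3: a single query-to-context vector q_c
        m = softmax(A.max(axis=1), axis=0)                           # (n_c,)
        q_c = m @ H                                                  # (d,)
        # final per-token vector: [h_i; c_i; h_i * c_i; h_i * q_c]
        return np.concatenate([H, C, H * C, H * q_c[None, :]], axis=1)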

Having carried out this modification, we were able to solve the wrong example we started with: the multilevel attention model gives the correct output, "San Jose State". We also achieved slightly better scores than the original model, with an F1 score of 85.44 on the SQuAD dev dataset.

Chatbot Design Using a QA System

Designing a perfect chatbot that passes the Turing test is a fundamental goal of Artificial Intelligence. Although we are many orders of magnitude away from achieving such a goal, domain-specific tasks can be solved with chatbots built from current technology. We had started with a similar objective in mind, i.e., to design a domain-specific chatbot and then generalize to other areas once the first domain-specific objective is achieved robustly. This led us to the fundamental problem of Machine Comprehension and subsequently to the task of Question Answering. Having achieved some degree of success with QA systems, we looked back at whether we could apply our newly acquired knowledge to the task of designing chatbots.

The chatbots made with today's technologies mostly rely on handcrafted techniques such as template matching, which requires anticipating all possible ways a user may articulate his requirements and a conversation may unfold. This requires a lot of man-hours to design a domain-specific system and is still very error-prone. In this section we propose a general chatbot design that makes designing a domain-specific chatbot easy and robust at the same time.

Every domain-specific chatbot needs to obtain a set of information from the user and show some results based on the user-specific information obtained. Traditional chatbots use template matching and keyword lookup to determine whether the user has provided the required information. Our idea is to use the Question Answering system in the backend to extract the required information from whatever the user has typed up to this point of the conversation. The information to be extracted can be posed as a set of questions, and the answers obtained for those questions can be used as the parameters to supply the relevant information to the user.

We chose flight reservation as our chatbot domain. Our goal was to extract the required information from the user to be able to show the available flights as per the user's requirements.

Figure 4-2 Flight reservation chatbotrsquos chat window

For a flight reservation task, the booking agent needs to know at minimum the origin city, destination city, and date of travel to be able to show the available flights. Optional information includes the number of tickets, the passenger's name, one-way or round trip, etc.

The minimal conversation with the user through the chat window would be as shown above. We had a platform called OneTask on which we wanted to implement our chatbot. The chat interface within the OneTask system is shown below.

Figure 4-3 Chatbot within OneTask system

The working of the chat system is as follows:

1 Initiation: The user opens the chat window, which starts a session with the chatbot. The chatbot asks 'How may I help you?'

2 User Reply: The user may reply with none of the required information for flight booking or may provide multiple pieces of information in the same message.

3 User Reply Parsing: The conversation up to this point is treated as a passage and the internal questions are run on this passage. The questions that are run include:

Where do you want to go?

From where do you want to leave?

When do you want to depart?

4 Parsed Responses from the QA Model: After running the questions on the QA system with the given conversation, answers are obtained for all of them along with their corresponding confidence values. Since an answer will be produced even if the required question has not actually been answered up to this point in the conversation, the confidence values play a crucial role in determining the validity of the answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly, a confidence of 2 to 10 signifies that it may have been answered but should be verified with the user for correctness, and any confidence below 2 is discarded (see the slot-filling sketch after this list).

5 Asking Remaining Questions Iteratively: After the parsing, it is checked whether any of the required questions are still unanswered. If so, the chatbot asks the remaining question and the process from steps 3 to 4 is carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request.
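A minimal sketch of this slot-filling loop is shown below (our own illustration; qa_model and its confidence scale are placeholders standing in for the deployed QA system):

    REQUIRED_SLOTS = {
        "destination": "Where do you want to go?",
        "origin":      "From where do you want to leave?",
        "date":        "When do you want to depart?",
    }

    def fill_slots(conversation, qa_model, accept=10.0, verify=2.0):
        """Run each slot question over the conversation so far and keep confident answers."""
        filled, to_verify, missing = {}, {}, []
        for slot, question in REQUIRED_SLOTS.items():
            answer, confidence = qa_model(passage=conversation, question=question)
            if confidence > accept:
                filled[slot] = answer            # treat as correctly answered
            elif confidence > verify:
                to_verify[slot] = answer         # ask the user to confirm this value
            else:
                missing.append(slot)             # ask this question explicitly next turn
        return filled, to_verify, missing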

Figure 4-4 The Flow diagram of the Flight booking Chatbot system

Online QA System and Attention Visualization

To be able to test various examples, we deployed the BiDAF [24] model as an online demo. One can either choose from the available examples in the drop-down menu or paste in one's own passage and questions. While this is a useful and interesting way to test the model in a user-friendly manner, we created this system primarily to be able to focus on the wrongly answered samples.

Figure 4-5 QA system interface with attention highlight over candidate answers

The candidate answers are shown with blue highlights according to their confidence values: the higher the confidence, the darker the highlight. The candidate with the highest confidence value is chosen as the predicted answer. We developed the system to show the attention spread over the candidate answers in order to understand what needs to be done to improve the system. This led us to realize the importance of including the query attention part as well as multilevel attention in the BiDAF model, as described in the first section of this chapter.
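The highlighting itself can be rendered very simply; one way (our own sketch, not the actual front-end code) is to map each candidate span's confidence to the opacity of a blue background:

    def highlight_html(passage_tokens, candidates):
        """candidates: list of (start, end, confidence) spans over passage_tokens."""
        max_conf = max(c for _, _, c in candidates) or 1.0
        strength = {i: 0.0 for i in range(len(passage_tokens))}
        for start, end, conf in candidates:
            for i in range(start, end + 1):
                strength[i] = max(strength[i], conf / max_conf)   # darker = more confident
        out = []
        for i, tok in enumerate(passage_tokens):
            if strength[i] > 0:
                out.append('<span style="background: rgba(0, 0, 255, %.2f)">%s</span>'
                           % (0.8 * strength[i], tok))
            else:
                out.append(tok)
        return " ".join(out)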


CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION

Future Directions

Having done a thorough analysis of the current methods and the state of the art in Machine Comprehension and the QA task, and having developed systems on top of them, we have a strong sense of what needs to be done to further improve QA models. After observing the wrongly answered samples, we can see that the system is still unable to encode meaning and picks answers based on statistical patterns of answer occurrence in the training examples.

Looking at the state-of-the-art models and the ongoing research literature, it is easy to conclude that more features need to be embedded to encode the meaning of words, phrases, and sentences. A paper called Reinforced Mnemonic Reader for Machine Comprehension encoded POS and NER tags of words along with their word and character embeddings, and this gave them better results.
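As an illustration of the kind of extra linguistic features this implies (our own sketch using spaCy; any POS/NER tagger would do, and the small English model must be installed separately), each token can be given a POS tag and an entity tag that are then embedded and concatenated with the word and character embeddings:

    import spacy

    nlp = spacy.load("en_core_web_sm")   # small English model with POS tagging and NER

    def linguistic_features(sentence):
        """Return (token, POS tag, entity type) triples to be embedded as extra features."""
        doc = nlp(sentence)
        return [(tok.text, tok.pos_, tok.ent_type_ or "O") for tok in doc]

    print(linguistic_features("Tesla asked Morgan for more funds in 1901."))
    # e.g. [('Tesla', 'PROPN', 'PERSON'), ('asked', 'VERB', 'O'), ...]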

Figure 5-1 An English language semantic parse tree [26]

We have developed a method to encode the syntactic parse tree of a sentence that encodes not only the POS tags but also the relation of each word within its phrase and the relation of each phrase within the whole sentence, in a hierarchical manner.

Finally, data augmentation is another way to get better results. One definite way to reduce errors would be to include in the training data samples similar to those the system falters on in the dev set: one could generate examples similar to the failure cases and add them to the training set for better prediction. Another approach is to train on similar and bigger datasets. Our models were trained on the SQuAD dataset, but there are other datasets for a similar question answering task, such as TriviaQA. We could add the TriviaQA training set to SQuAD to obtain a more robust system that generalizes better and thus has higher accuracy for predicting answer spans.

Conclusion

In this work we explored the most fundamental techniques that have shaped the current state of the art. We then proposed a minor architectural improvement over an existing model. Furthermore, we developed two applications that use the base model: first, we showed how a chatbot application can be built using the QA system, and second, we created a web interface where the model can be used with any passage and question. This interface also shows the attention spread over the candidate answers. While our effort to push the state of the art forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will pay high dividends for society at large.


LIST OF REFERENCES

[1] Danqi Chen From Reading Comprehension To Open-Domain Question Answering 2018 Youtube Accessed March 16 2018 httpswwwyoutubecomwatchv=1RN88O9C13U

[2] Lehnert Wendy G The Process of Question Answering No RR-88 Yale Univ New Haven Conn Dept of computer science 1977

[3] Joshi Mandar Eunsol Choi Daniel S Weld and Luke Zettlemoyer Triviaqa A large scale distantly supervised challenge dataset for reading comprehension arXiv preprint arXiv170503551 (2017)

[4] Rajpurkar Pranav Jian Zhang Konstantin Lopyrev and Percy Liang Squad 100000+ questions for machine comprehension of text arXiv preprint arXiv160605250 (2016)

[5] The Stanford Question Answering Dataset 2018 RajpurkarGithubIo Accessed March 16 2018 httpsrajpurkargithubioSQuAD-explorer

[6] Hu Minghao Yuxing Peng and Xipeng Qiu Reinforced mnemonic reader for machine comprehension CoRR abs170502798 (2017)

[7] Schalkoff Robert J Artificial neural networks Vol 1 New York McGraw-Hill 1997

[8] Me For The AI and Neetesh Mehrotra 2018 The Connect Between Deep Learning And AI - Open Source For You Open Source For You Accessed March 16 2018 httpopensourceforucom201801connect-deep-learning-ai

[9] Krizhevsky Alex Ilya Sutskever and Geoffrey E Hinton Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems pp 1097-1105 2012

[10] An Intuitive Explanation Of Convolutional Neural Networks 2016 The Data Science Blog Accessed March 16 2018 httpsujjwalkarnme20160811intuitive-explanation-convnets

[11] Williams Ronald J and David Zipser A learning algorithm for continually running fully recurrent neural networks Neural computation 1 no 2 (1989) 270-280

[12] Hochreiter Sepp and Jürgen Schmidhuber Long short-term memory Neural computation 9 no 8 (1997) 1735-1780

[13] Chung Junyoung Caglar Gulcehre KyungHyun Cho and Yoshua Bengio Empirical evaluation of gated recurrent neural networks on sequence modeling arXiv preprint arXiv14123555 (2014)


[14] Mikolov Tomas Wen-tau Yih and Geoffrey Zweig Linguistic regularities in continuous space word representations In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics Human Language Technologies pp 746-751 2013

[15] Pennington Jeffrey Richard Socher and Christopher Manning Glove Global vectors for word representation In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) pp 1532-1543 2014

[16] Word2vec Tutorial - The Skip-Gram Model middot Chris Mccormick 2016 MccormickmlCom Accessed March 16 2018 httpmccormickmlcom20160419word2vec-tutorial-the-skip-gram-model

[17] Group 2015 [Paper Introduction] Bilingual Word Representations With Monolingual ... SlideshareNet Accessed March 16 2018 httpswwwslidesharenetnaist-mtstudybilingual-word-representations-with-monolingual-quality-in-mind

[18] Attention Mechanism 2016 BlogHeuritechCom Accessed March 16 2018 httpsblogheuritechcom20160120attention-mechanism

[19] Weston Jason Sumit Chopra and Antoine Bordes 2014 Memory Networks arXiv preprint arXiv14103916v11 Accessed March 16 2018 httpsarxivorgabs14103916

[20] Wang Shuohang and Jing Jiang Machine comprehension using match-lstm and answer pointer arXiv preprint arXiv160807905 (2016)

[21] Wang Shuohang and Jing Jiang Learning natural language inference with LSTM arXiv preprint arXiv151208849 (2015)

[22] Vinyals Oriol Meire Fortunato and Navdeep Jaitly Pointer networks In Advances in Neural Information Processing Systems pp 2692-2700 2015

[23] Wang Wenhui Nan Yang Furu Wei Baobao Chang and Ming Zhou Gated self-matching networks for reading comprehension and question answering In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1 Long Papers) vol 1 pp 189-198 2017

[24] Seo Minjoon Aniruddha Kembhavi Ali Farhadi and Hannaneh Hajishirzi Bidirectional attention flow for machine comprehension arXiv preprint arXiv161101603 (2016)

[25] Clark Christopher and Matt Gardner Simple and effective multi-paragraph reading comprehension arXiv preprint arXiv171010723 (2017)


[26] 8 Analyzing Sentence Structure 2018 NltkOrg Accessed March 16 2018 httpwwwnltkorgbookch08html


BIOGRAPHICAL SKETCH

Purnendu Mukherjee grew up in Kolkata, India, and had a strong interest in computers from an early age. After high school, he completed his B.Sc. in computer science at Ramakrishna Mission Residential College, Narendrapur, followed by an M.Sc. in computer science at St. Xavier's College, Kolkata. He had a strong intuition and interest for human-like learning systems and wanted to work in this area. He started working at TCS Innovation Labs, Pune, on applications of Natural Language Processing in educational settings. As he was deeply passionate about learning systems that mimic the human brain and learn like a human child does, he became increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science at the University of Florida, Gainesville. His academic interests have been focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.

Page 9: QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISMufdcimages.uflib.ufl.edu/UF/E0/05/23/73/00001/MUKHERJEE_P.pdf · QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM By PURNENDU

9

CHAPTER 1 INTRODUCTION

Teaching machines to read and understand human natural language is our

central and long standing goal for Natural Language Processing and Artificial

Intelligence in general The positive side of machines being able to reason and

comprehend human language could be enormous We are beginning to see how

commercial systems like Alexa Siri Google Now etc are being widely used as Speech

Recognition improved to human levels While speech recognition systems can

transcribe speech to text comprehension of that transcribed text is another task which

is currently a major focus for both academia and industry because of its possible

applications Moreover all the text information available throughout the internet is a

major reason why machine comprehension of text is such an important task

With the growth of Deep Learning methods in the last few years the field of

Machine Comprehension and Natural Language Processing(NLP) in general has

experienced a revolution While the traditional methods and practices are still prevalent

and forms the basis of our deep understanding of languages Deep Learning methods

have surpassed all traditional NLP and Machine Learning methods by a significant

margin and are currently driving the growth of the field

To be able to build a system that can understand human text we need to first

ask ourselves how can we evaluate the machinersquos comprehension ability We had

initially set a goal to build a Chabot for a specific domain and generalize to other topics

as we go ahead While developing the system we found out the necessity for reading

comprehension and how to measure it We finally found the answer with Questions

Answering systems

10

Just like how we human beings are tested for our ability of language

understanding with questions we should ask machines similar questions about what it

has just read The performance of the system on such question answering task will let

us evaluate how much the machine is able to reason about what it just read [1]

Reading comprehension has been a topic of Natural Language Understanding since the

1970s In 1977 Wendy Lehnert said in his doctoral thesis ndash ldquoOnly when we can ask a

program to answer questions about what it reads will we be able to begin to access that

programrsquos comprehensionrdquo [2]

Figure 1-1 The task of Question Answering

To achieve this task the NLP community has developed various datasets such as

CNN Daily Mail WebQuestions SQuAD TriviaQA [3] etc For our purpose we chose

SQuAD which stands for Stanford Question Answering Dataset [4] SQuAD consists of

questions posed by crowdworkers on a set of Wikipedia articles where the answer to

every question is a segment of text or span from the corresponding reading passage

With 100000+ question-answer pairs on 500+ articles SQuAD is significantly larger

than previous reading comprehension datasets [5]

An example from the SQuAD dataset is as follows

11

Passage Tesla later approached Morgan to ask for more funds to build a more

powerful transmitter When asked where all the money had gone Tesla responded by

saying that he was affected by the Panic of 1901 which he (Morgan) had caused

Morgan was shocked by the reminder of his part in the stock market crash and by

Teslarsquos breach of contract by asking for more funds Tesla wrote another plea to

Morgan but it was also fruitless Morgan still owed Tesla money on the original

agreement and Tesla had been facing foreclosure even before construction of the

tower began

Question On what did Tesla blame for the loss of the initial money

Answer Panic of 1901

As we started exploring the QA task we faced several challenges Some of them

we could solve with the help of other research and some of the challenges still exist in

the domain

bull Out of Vocabulary words

bull Multi-sentence reasoning may be required

bull There may exist several candidate answers

bull Optimizing the Exact Match (EM) metric may fail when the answer boundary is fuzzy or too long such as the answer of the ldquowhyrdquo query [6]

bull ldquoOne-hoprdquo prediction may fail to fully understand the query [6]

bull Fail to fully capture the long-distance contextual interaction between parts of the context by only using LSTMGRU

bull Current models are unable to capture semantics of the passage

In the upcoming chapters we will first briefly review the basics necessary for

understanding the models then we will delve deep into the fundamental models that

12

have shaped the current State of the Art models then we will discuss our contribution in

terms of architecture and applications and finally conclude with future directions

13

CHAPTER 2 THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM

To build a question answering system one needs to be familiar with the

fundamental deep learning models such as Recurrent Neural Networks (RNN) Long

Short Term Memory (LSTM) etc In this chapter we will have an overview on these

techniques and see how they all connect to building a question answering system

Neural Networks

What makes Deep Learning so intriguing is that it has close resemblance with

the working of the mammalian brain or at least draws inspiration from it The same can

be said for Artificial Neural networks [7] which consists of a system of interconnected

units called lsquoneuronsrsquo that take input from similar units and produces a single output

Figure 2-1 Simple and Deep Learning Neural Networks [8]

The connection from one neuron to another can be weighted based on the input data

which enables the network to tune itself to produce a certain output based on the input

This is the learning process which is achieved through backpropagation which is a

system of propagating the error from the output layer to the previous layers

14

Convolutional Neural Network

The first wave of deep learningrsquos success was brought by Convolutional Neural

Networks (CNN) [9] when this was the technique used by the winning team of ImageNet

competition in 2012 CNNs are deep artificial neural networks (ANN) that can be used to

classify images cluster them by similarity and perform object recognition within scenes

It can be used to detect and identify faces people signs or any other visual data

Figure 2-2 Convolutional Neural Network Architecture [10]

There are primarily four operations in a standard CNN model (as shown in Fig above)

1 Convolution - The primary purpose of Convolution in the ConvNet (above) is to extract features from the input image The spatial relationship between pixels ie the image features are preserved and learned by the convolution using small squares of input data

2 Non-Linearity (ReLU) ndash Rectified Linear Unit (ReLU) is a non-linear operation that carries out an element wise operation on each pixel This operation replaces the negative pixel values in the feature map by zero

3 Pooling or Sub Sampling - Spatial Pooling reduces the dimensionality of each feature map but retains the most important information For max pooling the largest value in the square window is taken and rest are dropped Other types of pooling are Average Sum etc

4 Classification (Fully Connected Layer) - The Fully Connected layer is a traditional Multi-Layer Perceptron as described before that uses a softmax activation function in the output layer The high-level features of the image are encoded by the convolutional and pooling layers which is then fed to the fully connected layer which then uses these features for classifying the input image into various classes based on the training dataset [10]

15

When a new image is fed into the CNN model all the above-mentioned steps are

carried out (forward propagation) and a probability distribution is achieved on the set of

output classes With a large enough training dataset the network will learn and

generalize well enough to classify new images into their correct classes

Recurrent Neural Networks (RNN)

Whenever we want to predict or encode sequential data RNN [11] is our go to

method RNNs perform the same task for every element of a sequence where the

output of each element depends on previous computations thus the recurrence In

practice RNNs are unable to retain long-term dependencies and can look back only a

few steps because of the vanishing gradient problem

(2-1)

A solution to the dependency problem is to use gated cells such as LSTM [11] or

GRU [13] These cells pass on important information to the next cells while ignoring

non-important ones The gated units in a GRU block are

bull Update Gate ndash Computed based on current input and hidden state

(2-2)

bull Reset Gate ndash Calculated similarly but with different weights

(2-3)

bull New memory content - (2-4)

If reset gate unit is ~0 then previous memory is ignored and only new

information is kept

16

Final memory at current time step combines previous and current time steps

(2-5)

While the GRU is computationally efficient the LSTM on the other hand is a

general case where there are three gates as follows

bull Input Gate ndash What new information to add to the current cell state

bull Forget Gate ndash How much information from previous states to be kept

bull Output gate ndash How much info should be sent to the next states

Just like GRU the current cell state is a sum of the previous cell state but

weighted by the forget gate and the new value is added which is weighted by the input

gate Based on the cell state the output gate regulates the final output

Word Embedding

Computation or gradients can be applied on numbers and not on words or letters

So first we need to convert words into their corresponding numerical formation before

feeding into a deep learning model In general there are two types of word embedding

Frequency based (which constitutes count vectors tf-idf and co-occurrence vectors)

and Prediction based With frequency based embedding the order of the words are not

preserved and works as a bag of words model Whereas with prediction based model

the order of words or locality of words are taken into consideration to generate the

numerical representation of the word Within this prediction based category there are

two fundamental techniques called Continuous Bag of Words (CBOW) and Skip Gram

Model which forms the basis for word2vec [14] and GloVe [15]

The basic intuition behind word2vec is that if two different words have very

similar ldquocontextsrdquo (that is what words are likely to appear around them) then the model

17

will produce similar vector for those words Conversely if the two word vectors are

similar then the network will produce similar context predictions for the same two words

For examples synonyms like ldquointelligentrdquo and ldquosmartrdquo would have very similar contexts

Or that words that are related like ldquoenginerdquo and ldquotransmissionrdquo would probably have

similar contexts as well [16] Plotting the word vectors learned by a word2vec over a

large corpus we could find some very interesting relationships between words

Figure 2-3 Semantic relation between words in vector space [17]

Attention Mechanism

We as humans put our attention to things are important or are relevant in a

context For example when asked a question from a passage we try to find the most

relevant part of the passage the question is relevant with and then reason from our

understanding of that part of the passage The same idea applies for attention

mechanism in Deep Learning It is used to identify the specific parts of a given context

to which the current question is relevant to

Formally put the techniques take n arguments y_1 y_n (in our case the

passage having words say y_i through h_i) and a question word say q It returns a

vector z which is supposed to be the laquo summary raquo of the y_i focusing on information

linked to the question q More formally it returns a weighted arithmetic mean of the y_i

18

and the weights are chosen according the relevance of each y_i given the context c

[18]

Figure 2-4 Attention Mechanism flow [18]

Memory Networks

Convolutional Neural Networks and Recurrent Neural Networks which does

capture how we form our visual and sequential memories their memory (encoded by

hidden states and weights) were typically too small and was not compartmentalized

enough to accurately remember facts from the past (knowledge is compressed into

dense vectors) [19]

Deep Learning needed to cultivate a methodology that preserved memories as

they are such that it wonrsquot be lost in generalization and recalling exact words or

sequence of events would be possible mdash something computers are already good at This

effort led us to Memory Networks [19] published at ICLR 2015 by Facebook AI

Research

This paper provides a basic framework to store augment and retrieve memories

while seamlessly working with a Recurrent Neural Network architecture The memory

19

network consists of a memory m (an array of objects 1 indexed by m i) and four

(potentially learned) components I G O and R as follows

I (input feature map) mdash converts the incoming input to the internal feature

representation either a sparse or dense feature vector like that from word2vec or

GloVe

G (generalization) mdash updates old memories given the new input They call this

generalization as there is an opportunity for the network to compress and generalize its

memories at this stage for some intended future use The analogy Irsquove been talking

before

O (output feature map) mdash produces a new output (in the feature representation

space) given the new input and the current memory state This component is

responsible for performing inference In a question answering system this part will

select the candidate sentences (which might contain the answer) from the story

(conversation) so far

R (response) mdash converts the output into the response format desired For

example a textual response or an action In the QA system described this component

finds the desired answer and then converts it from feature representation to the actual

word

This model is a fully supervised model meaning all the candidate sentences from

which the answer could be found are marked during training phase and can also be

termed as lsquohard attentionrsquo

The authors tested out the QA system on various literature including Lord of the

Rings

20

Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]

21

CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART

There has been rapid progress since the release of the SQuAD dataset and

currently the best ensemble models are close to human level accuracy in machine

comprehension This is due to the various ingenious methods which solves some of the

problems with the previous methods Out of Vocabulary tokens were handled by using

Character embedding Long term dependency within context passage were solved

using self-attention And many other techniques such as Contextualized vectors History

of Words Attention Flow etc In this section we will have a look at the some of the most

important models that were fundamental to the progress of Questions Answering

Machine Comprehension Using Match-LSTM and Answer Pointer

In this paper [20] the authors propose an end-to-end neural architecture for the

QA task The architecture is based on match-LSTM [21] a model they proposed for

textual entailment and Pointer Net [22] a sequence-to-sequence model proposed by

Vinyals et al (2015) to constrain the output tokens to be from the input sequences

The model consists of an LSTM preprocessing layer a match-LSTM layer and an

Answer Pointer layer

We are given a piece of text which we refer to as a passage and a question

related to the passage The passage is represented by matrix P where P is the length

(number of tokens) of the passage and d is the dimensionality of word embeddings

Similarly the question is represented by matrix Q where Q is the length of the question

Our goal is to identify a subsequence from the passage as the answer to the question

22

Figure 3-1 Match-LSTM Model Architecture [20]

LSTM Preprocessing layer They use a standard one-directional LSTM

(Hochreiter amp Schmidhuber 1997) to process the passage and the question separately

as shown below

Match LSTM Layer They applied the match-LSTM model proposed for textual

entailment to their machine comprehension problem by treating the question as a

premise and the passage as a hypothesis The match-LSTM sequentially goes through

the passage At position i of the passage it first uses the standard word-by-word

attention mechanism to obtain attention weight vector as follows

(3-1)

23

where and are parameters to be learned

Answer Pointer Layer The Answer Pointer (Ans-Ptr) layer is motivated by the

Pointer Net introduced by Vinyals et al (2015) [22] The boundary model produces only

the start token and the end token of the answer and then all the tokens between these

two in the original passage are considered to be the answer

When this paper was released back in November 2016 Match-LSTM method

was the state of the art in Question Answering systems and was at the top of the

leaderboard for the SQuAD dataset

R-NET Matching Reading Comprehension with Self-Matching Networks

In this model [23] first the question and passage are processed by a

bidirectional recurrent network (Mikolov et al 2010) separately They then match the

question and passage with gated attention-based recurrent networks obtaining

question-aware representation for the passage On top of that they apply self-matching

attention to aggregate evidence from the whole passage and refine the passage

representation which is then fed into the output layer to predict the boundary of the

answer span

Question and passage encoding First the words are converted to their

respective word-level embeddings and character level embeddings The character-level

embeddings are generated by taking the final hidden states of a bi-directional recurrent

neural network (RNN) applied to embeddings of characters in the token Such

character-level embeddings have been shown to be helpful to deal with out-of-vocab

(OOV) tokens

24

They then use a bi-directional RNN to produce new representation and

of all words in the question and passage respectively

Figure 3-2 The task of Question Answering [23]

Gated Attention-based Recurrent Networks They use a variant of attention-

based recurrent networks with an additional gate to determine the importance of

information in the passage regarding a question Different from the gates in LSTM or

GRU the additional gate is based on the current passage word and its attention-pooling

vector of the question which focuses on the relation between the question and current

passage word The gate effectively model the phenomenon that only parts of the

passage are relevant to the question in reading comprehension and question answering

is utilized in subsequent calculations

25

Self-Matching Attention From the previous step the question aware passage

representation is generated to highlight the important parts of the passage One

problem with such representation is that it has very limited knowledge of context One

answer candidate is often oblivious to important cues in the passage outside its

surrounding window To address this problem the authors propose directly matching

the question-aware passage representation against itself It dynamically collects

evidence from the whole passage for words in passage and encodes the evidence

relevant to the current passage word and its matching question information into the

passage representation

Output Layer They use the same method as Wang amp Jiang (2016b) and use

pointer networks (Vinyals et al 2015) to predict the start and end position of the

answer In addition they use an attention-pooling over the question representation to

generate the initial hidden vector for the pointer network [23]

When the R-Net Model first appeared in the leaderboard in March 2017 it was at

the top with 723 Exact Match and 807 F1 score

Bi-Directional Attention Flow (BiDAF) for Machine Comprehension

BiDAF [24] is a hierarchical multi-stage architecture for modeling the

representations of the context paragraph at different levels of granularity BIDAF

includes character-level word-level and contextual embeddings and uses bi-directional

attention flow to obtain a query-aware context representation Their attention layer is not

used to summarize the context paragraph into a fixed-size vector Instead the attention

is computed for every time step and the attended vector at each time step along with

the representations from previous layers can flow through to the subsequent modeling

layer This reduces the information loss caused by early summarization [24]

26

Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]

Their machine comprehension model is a hierarchical multi-stage process and

consists of six layers

1 Character Embedding Layer maps each word to a vector space using character-level CNNs

2 Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model

3 Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embedding of the words These first three layers are applied to both the query and context They use a LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words We place an LSTM in both directions and concatenate the outputs of the two LSTMs

4 Attention Flow Layer couples the query and context vectors and produces a set of query aware feature vectors for each word in the context

5 Modeling Layer employs a Recurrent Neural Network to scan the context The input to the modeling layer is G which encodes the query-aware representations of context words The output of the modeling layer captures the interaction among the context words conditioned on the query They use two layers of bi-directional LSTM

27

with the output size of d for each direction Hence a matrix is obtained which is passed onto the output layer to predict the answer

6 Output Layer provides an answer to the query They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices by the predicted distributions averaged over all examples [25]

In a further variation of their above work they add a self-attention layer after the

Bi-attention layer to further improve the results The architecture of the model is as

Figure 3-3 The task of Question Answering [25]

28

Summary

In this chapter we reviewed the methods that are fundamental to the state of the

art in Machine Comprehension and for the task of Question Answering We have

reached closed to human level accuracy and this is due to incremental developments

over previous models As we saw Out of Vocabulary (OOV) tokens were handled by

using Character embedding Long term dependency within context passage were

solved using self-attention and many other techniques such as Contextualized vectors

Attention Flow etc were employed to get better results In the next chapter we will see

how we can build on these models and develop further

29

CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING

Although the advancements in Question Answering systems since the release of

the SQuAD dataset has been impressive and the results are getting close to human

level accuracy it is far from being a fool-proof system The models still make mistakes

which would be obvious to a human For example

Passage The Panthers used the San Jose State practice facility and stayed at

the San Jose Marriott The Broncos practiced at Florida State Facility and stayed at the

Santa Clara Marriott

Question At what universitys facility did the Panthers practice

Actual Answer San Jose State

Predicted Answer Florida State Facility

To find out what is leading to the wrong predictions we wanted to see the

attention weights associated with such an example We plotted the passage and

question heat map which is a 2D matrix where the intensity of each cell signifies the

similarity between a passage word and a question word For the above example we

found out that while certain words of the question are given high weightage other parts

are not The words lsquoAtrsquo lsquofacilityrsquo lsquopracticersquo are given high attention but lsquoPanthersrsquo does

not receive high attention If it had received high attention then the system would have

predicted lsquoSan Jose Statersquo as the right answer To solve this issue we analyzed the

base BiDAF model and proposed adding two things

1 Bi-Attention and Self-Attention over Query

2 Second level attention over the output of (Bi-Attention + Self-Attention) from both Context and Query

30

Multi-Attention BiDAF Model

Our comprehension model is a hierarchical multi-stage process and consists of

the following layers

1 Embedding Just as all other models we embed words using pretrained word vectors We also embed the characters in each word into size 20 vectors which are learned and run a convolution neural network followed by max-pooling to get character-derived embeddings for each word The character-level and word-level embeddings are then concatenated and passed to the next layer We do not update the word embeddings during training

2 Pre-Process A shared bi-directional GRU (Cho et al 2014) is used to map the question and passage embeddings to context aware embeddings

3 Context Attention The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al 2016) is used to build a query-aware context representation Let h_i be the vector for context word i q_j be the vector for question word j and n_q and n_c be the lengths of the question and context respectively We compute attention between context word i and question word j as

(4-1)

where w1 w2 and w3 are learned vectors and is element-wise multiplication We then compute an attended vector c_i for each context token as

(4-2)

We also compute a query-to-context vector q_c

(4-3)

31

The final vector computed for each token is built by concatenating

and In our model we subsequently pass the result through a linear layer with ReLU activations

4 Context Self-Attention Next we use a layer of residual self-attention The input is passed through another bi-directional GRU Then we apply the same attention mechanism only now between the passage and itself In this case we do not use query-to context attention and we set aij = 1048576inf if i = j As before we pass the concatenated output through a linear layer with ReLU activations This layer is applied residually so this output is additionally summed with the input

5 Query Attention For this part we do the same way as context attention but calculate the weighted sum of the context words for each query word Thus we get the length as number of query words Then we calculate context to query similar to query to context in context attention layer

Figure 4-1 The modified BiDAF model with multilevel attention

6 Query Self-Attention This part is done the same way as the context self-attention layer but from the output of the Query Attention layer

32

7 Context Query Bi-Attention + Self-Attention The output of the Context self-attention and the Query Self-Attention layers are taken as input and the same process for Bi-Attention and self-attention is applied on these inputs

8 Prediction In the last layer of our model a bidirectional GRU is applied followed by a linear layer that computes answer start scores for each token The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores The softmax operation is applied to the start and end scores to produce start and end probabilities and we optimize the negative log likelihood of selecting correct start and end tokens

Having carried out this modification we were able to solve the wrong example we

started with The multilevel attention model gives the correct output as ldquoSan Jose

Staterdquo Also we achieved slightly better scores than the original model with a F1 score

of 8544 on the SQuAD dev dataset

Chatbot Design Using a QA System

Designing a perfect chatbot that passes the Turing test is a fundamental goal for

Artificial Intelligence Although we are many order to magnitudes away from achieving

such a goal domain specific tasks can be solved with chatbots made from current

technology We had started our goal with a similar objective in mind ie to design a

domain specific chatbot and then generalize to other areas as it is able to achieve the

first domain specific objective robustly This led us to the fundamental problem of

Machine Comprehension and subsequently to the task of Question Answering Having

achieved some degree of success with QA systems we looked back if we could apply

our newly acquired knowledge in the task of designing Chatbots

The chatbots built with today's technologies mostly rely on handcrafted techniques such as template matching, which requires anticipating all the possible ways a user may articulate his requirements and a conversation may unfold. This takes a lot of man-hours for designing a domain-specific system and is still very error prone. In this section we propose a general chatbot design that makes designing a domain-specific chatbot easy and robust at the same time.

Every domain-specific chatbot needs to obtain a set of information from the user and show some results based on the user-specific information obtained. Traditional chatbots use template matching and keyword lookup to determine whether the user has provided the required information. Our idea is to use the Question Answering system in the backend to extract the required information from whatever the user has typed up to this point of the conversation. The information to be extracted can be posed as a set of questions, and the answers obtained for those questions can be used as the parameters to supply the relevant information to the user.

We chose flight reservation as our chatbot domain. Our goal was to extract the required information from the user to be able to show the available flights as per the user's requirements.

Figure 4-2 Flight reservation chatbot's chat window


For a flight reservation task, the booking agent needs to know at minimum the origin city, the destination city, and the date of travel to be able to show the available flights. Optional information includes the number of tickets, the passenger's name, one way or round trip, etc.

The minimalistic conversation with the user through the chat window would be as shown above. We had a platform called OneTask on which we wanted to implement our chatbot; the chat interface within the OneTask system looks as follows.

Figure 4-3 Chatbot within OneTask system

The working of the chat system is as follows:

1 Initiation The user opens the chat window, which starts a session with the chatbot. The chatbot asks 'How may I help you?'

2 User Reply The user may reply with none of the required information for flight booking, or may provide several pieces of information in the same message

3 User Reply Parsing The conversation up to this point is treated as a passage, and the internal questions are run on this passage. The questions that are run are:

Where do you want to go?

From where do you want to leave?

When do you want to depart?

4 Parsed responses from QA model After running the questions on the QA system over the conversation so far, answers are obtained for all of them along with their corresponding confidence values. Since an answer will be returned even if the required information has not actually been given up to this point in the conversation, the confidence values play a crucial role in determining the validity of the answers. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly, a confidence of 2-10 signifies that it may have been answered but should be verified with the user for correctness, and any confidence below 2 is discarded

5 Asking remaining questions iteratively After the parsing, it is checked whether any of the required questions are still unanswered. If so, the chatbot asks the remaining question and steps 3-4 are carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request. (A sketch of this loop is given below.)
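The loop above can be sketched as follows. Here qa_model is a placeholder for the QA system described earlier (assumed to return an answer string and a confidence value), and the slot names and message wording are illustrative; only the three required questions and the confidence thresholds of 10 and 2 come from the description above.

REQUIRED_QUESTIONS = {
    "destination": "Where do you want to go?",
    "origin": "From where do you want to leave?",
    "date": "When do you want to depart?",
}

def qa_model(passage, question):
    # Placeholder for the real QA model; it should return (answer_text, confidence).
    raise NotImplementedError

def parse_user_reply(conversation, slots):
    # Steps 3-4: run every still-unfilled question over the conversation so far.
    to_verify = {}
    for slot, question in REQUIRED_QUESTIONS.items():
        if slot in slots:
            continue
        answer, confidence = qa_model(conversation, question)
        if confidence > 10:        # accepted as answered correctly
            slots[slot] = answer
        elif confidence >= 2:      # possibly answered; verify with the user
            to_verify[slot] = answer
        # confidence below 2 is discarded and the question is asked again later
    return slots, to_verify

def next_bot_message(slots, to_verify):
    # Step 5: confirm uncertain answers, ask a remaining question, or show flights.
    if to_verify:
        slot, answer = next(iter(to_verify.items()))
        return f"Just to confirm, is '{answer}' your {slot}?"
    for slot, question in REQUIRED_QUESTIONS.items():
        if slot not in slots:
            return question
    return f"Searching flights from {slots['origin']} to {slots['destination']} on {slots['date']}..."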

Figure 4-4 The Flow diagram of the Flight booking Chatbot system

Online QA System and Attention Visualization

To be able to test out various examples, we used the BiDAF [24] model for an online demo. One can either choose from the available examples in the drop-down menu or paste in their own passage and questions. While this is a useful and interesting system for testing the model in a user-friendly way, we created it primarily to be able to focus on the wrong samples.


Figure 4-5 QA system interface with attention highlight over candidate answers

The candidate answers are shown with the blue highlights as per their

confidence values The higher the confidence the darker the answer The highest

confidence value is chosen as the predicted answer We developed the system to show

attention spread of the candidate answers to realize what needs to be done to improve

the system This led us to realize the importance of including the query attention part as

well as multilevel attention on the BiDAF model as described in the first section of this

chapter
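As an illustration of the idea behind this highlighting (not the actual code of the web interface), the snippet below shades each candidate answer span in blue with an opacity proportional to its normalized confidence, so that the most confident span appears darkest.

def highlight(tokens, candidates):
    # candidates: list of (start_index, end_index, confidence) spans over the tokens
    top = max(conf for _, _, conf in candidates)
    shade = {}
    for start, end, conf in candidates:
        for i in range(start, end + 1):
            shade[i] = max(shade.get(i, 0.0), conf / top)   # opacity in (0, 1]
    pieces = []
    for i, tok in enumerate(tokens):
        if i in shade:
            pieces.append(f'<span style="background: rgba(0, 90, 200, {shade[i]:.2f})">{tok}</span>')
        else:
            pieces.append(tok)
    return " ".join(pieces)

# toy usage: two candidate spans with different confidences
tokens = "The Broncos practiced at Florida State Facility".split()
print(highlight(tokens, [(4, 6, 9.1), (1, 1, 2.3)]))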


CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION

Future Directions

Having done a thorough analysis of the current methods and the state of the art in Machine Comprehension and the QA task, and having developed systems on top of them, we have a strong sense of what needs to be done to further improve QA models. After observing the wrong samples, we can see that the system is still unable to encode meaning and is picking answers based on statistical patterns of answer occurrence in the training examples.

Judging from the state-of-the-art models and the ongoing research literature, it is easy to conclude that more features need to be embedded to encode the meaning of words, phrases, and sentences. The Reinforced Mnemonic Reader for Machine Comprehension [6] encoded POS and NER tags of words along with their word and character embeddings, which gave them better results.

Figure 5-1 An English language semantic parse tree [26]


We have developed a method to encode the syntax parse tree of a sentence that encodes not only the POS tags but also the relation of each word within its phrase and the relation of each phrase within the whole sentence, in a hierarchical manner.
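As a rough illustration of what such a hierarchical encoding could look like (a simplified sketch, not necessarily the exact method developed in this work), each word of a constituency parse can be tagged with the chain of phrase labels above it using NLTK [26]:

from nltk.tree import Tree

def label_paths(tree, prefix=()):
    # Yield (word, path_of_labels) pairs for every leaf of an NLTK parse tree.
    path = prefix + (tree.label(),)
    for child in tree:
        if isinstance(child, Tree):
            yield from label_paths(child, path)
        else:                        # a leaf, i.e. the word itself
            yield child, list(path)

parse = Tree.fromstring("(S (NP (DT the) (NN dog)) (VP (VBD barked)))")
for word, path in label_paths(parse):
    print(word, path)
# the ['S', 'NP', 'DT']
# dog ['S', 'NP', 'NN']
# barked ['S', 'VP', 'VBD']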

Finally, data augmentation is another way to get better results. One definite way to reduce errors would be to include in the training data samples similar to those the system falters on in the dev set: one could generate examples similar to the failure cases and add them to the training set for better prediction. Another approach would be to train on similar and bigger datasets. Our models were trained on the SQuAD dataset, but there are other datasets for the same question answering task, such as TriviaQA. We could augment the SQuAD training set with TriviaQA to obtain a more robust system that generalizes better and thus achieves higher accuracy in predicting answer spans.

Conclusion

In this work we have explored the most fundamental techniques that have shaped the current state of the art. We then proposed a minor architectural improvement over an existing model. Furthermore, we developed two applications that use the base model: first, we described how a chatbot application can be built using the QA system, and second, we created a web interface where the model can be run on any passage and question. This interface also shows the attention spread over the candidate answers. While our effort to push the state of the art forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will pay high dividends for society at large.


LIST OF REFERENCES

[1] Danqi Chen. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U

[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.

[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension." arXiv preprint arXiv:1705.03551 (2017).

[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ Questions for Machine Comprehension of Text." arXiv preprint arXiv:1606.05250 (2016).

[5] "The Stanford Question Answering Dataset." 2018. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/

[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced Mnemonic Reader for Machine Comprehension." CoRR abs/1705.02798 (2017).

[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.

[8] Me For The AI, and Neetesh Mehrotra. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai/

[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.

[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/

[11] Williams, Ronald J., and David Zipser. "A Learning Algorithm for Continually Running Fully Recurrent Neural Networks." Neural Computation 1, no. 2 (1989): 270-280.

[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long Short-Term Memory." Neural Computation 9, no. 8 (1997): 1735-1780.

[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling." arXiv preprint arXiv:1412.3555 (2014).

[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic Regularities in Continuous Space Word Representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.

[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global Vectors for Word Representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.

[16] "Word2Vec Tutorial - The Skip-Gram Model." Chris McCormick. 2016. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

[17] Group. 2015. "[Paper Introduction] Bilingual Word Representations with Monolingual ..." SlideShare.net. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind

[18] "Attention Mechanism." 2016. Blog.Heuritech.com. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism/

[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. 2014. "Memory Networks." arXiv preprint arXiv:1410.3916. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916

[20] Wang, Shuohang, and Jing Jiang. "Machine Comprehension Using Match-LSTM and Answer Pointer." arXiv preprint arXiv:1608.07905 (2016).

[21] Wang, Shuohang, and Jing Jiang. "Learning Natural Language Inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).

[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer Networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.

[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated Self-Matching Networks for Reading Comprehension and Question Answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.

[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional Attention Flow for Machine Comprehension." arXiv preprint arXiv:1611.01603 (2016).

[25] Clark, Christopher, and Matt Gardner. "Simple and Effective Multi-Paragraph Reading Comprehension." arXiv preprint arXiv:1710.10723 (2017).

[26] "8. Analyzing Sentence Structure." 2018. NLTK.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html


BIOGRAPHICAL SKETCH

Purnendu Mukherjee grew up in Kolkata, India, and has had a strong interest in computers since an early age. After high school, he did his BSc in computer science from Ramakrishna Mission Residential College, Narendrapur, followed by an MSc in computer science from St. Xavier's College, Kolkata. He had a strong intuition and interest for human-like learning systems and wanted to work in this area. He started working at TCS Innovation Labs, Pune, on the application of Natural Language Processing in educational applications. As he was deeply passionate about learning systems that mimic the human brain and learn like a human child does, he became increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science from the University of Florida, Gainesville. His academic interests have been focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.

Page 10: QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISMufdcimages.uflib.ufl.edu/UF/E0/05/23/73/00001/MUKHERJEE_P.pdf · QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM By PURNENDU

10

Just like how we human beings are tested for our ability of language

understanding with questions we should ask machines similar questions about what it

has just read The performance of the system on such question answering task will let

us evaluate how much the machine is able to reason about what it just read [1]

Reading comprehension has been a topic of Natural Language Understanding since the

1970s In 1977 Wendy Lehnert said in his doctoral thesis ndash ldquoOnly when we can ask a

program to answer questions about what it reads will we be able to begin to access that

programrsquos comprehensionrdquo [2]

Figure 1-1 The task of Question Answering

To achieve this task the NLP community has developed various datasets such as

CNN Daily Mail WebQuestions SQuAD TriviaQA [3] etc For our purpose we chose

SQuAD which stands for Stanford Question Answering Dataset [4] SQuAD consists of

questions posed by crowdworkers on a set of Wikipedia articles where the answer to

every question is a segment of text or span from the corresponding reading passage

With 100000+ question-answer pairs on 500+ articles SQuAD is significantly larger

than previous reading comprehension datasets [5]

An example from the SQuAD dataset is as follows

11

Passage Tesla later approached Morgan to ask for more funds to build a more

powerful transmitter When asked where all the money had gone Tesla responded by

saying that he was affected by the Panic of 1901 which he (Morgan) had caused

Morgan was shocked by the reminder of his part in the stock market crash and by

Teslarsquos breach of contract by asking for more funds Tesla wrote another plea to

Morgan but it was also fruitless Morgan still owed Tesla money on the original

agreement and Tesla had been facing foreclosure even before construction of the

tower began

Question On what did Tesla blame for the loss of the initial money

Answer Panic of 1901

As we started exploring the QA task we faced several challenges Some of them

we could solve with the help of other research and some of the challenges still exist in

the domain

bull Out of Vocabulary words

bull Multi-sentence reasoning may be required

bull There may exist several candidate answers

bull Optimizing the Exact Match (EM) metric may fail when the answer boundary is fuzzy or too long such as the answer of the ldquowhyrdquo query [6]

bull ldquoOne-hoprdquo prediction may fail to fully understand the query [6]

bull Fail to fully capture the long-distance contextual interaction between parts of the context by only using LSTMGRU

bull Current models are unable to capture semantics of the passage

In the upcoming chapters we will first briefly review the basics necessary for

understanding the models then we will delve deep into the fundamental models that

12

have shaped the current State of the Art models then we will discuss our contribution in

terms of architecture and applications and finally conclude with future directions

13

CHAPTER 2 THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM

To build a question answering system one needs to be familiar with the

fundamental deep learning models such as Recurrent Neural Networks (RNN) Long

Short Term Memory (LSTM) etc In this chapter we will have an overview on these

techniques and see how they all connect to building a question answering system

Neural Networks

What makes Deep Learning so intriguing is that it has close resemblance with

the working of the mammalian brain or at least draws inspiration from it The same can

be said for Artificial Neural networks [7] which consists of a system of interconnected

units called lsquoneuronsrsquo that take input from similar units and produces a single output

Figure 2-1 Simple and Deep Learning Neural Networks [8]

The connection from one neuron to another can be weighted based on the input data

which enables the network to tune itself to produce a certain output based on the input

This is the learning process which is achieved through backpropagation which is a

system of propagating the error from the output layer to the previous layers

14

Convolutional Neural Network

The first wave of deep learningrsquos success was brought by Convolutional Neural

Networks (CNN) [9] when this was the technique used by the winning team of ImageNet

competition in 2012 CNNs are deep artificial neural networks (ANN) that can be used to

classify images cluster them by similarity and perform object recognition within scenes

It can be used to detect and identify faces people signs or any other visual data

Figure 2-2 Convolutional Neural Network Architecture [10]

There are primarily four operations in a standard CNN model (as shown in Fig above)

1 Convolution - The primary purpose of Convolution in the ConvNet (above) is to extract features from the input image The spatial relationship between pixels ie the image features are preserved and learned by the convolution using small squares of input data

2 Non-Linearity (ReLU) ndash Rectified Linear Unit (ReLU) is a non-linear operation that carries out an element wise operation on each pixel This operation replaces the negative pixel values in the feature map by zero

3 Pooling or Sub Sampling - Spatial Pooling reduces the dimensionality of each feature map but retains the most important information For max pooling the largest value in the square window is taken and rest are dropped Other types of pooling are Average Sum etc

4 Classification (Fully Connected Layer) - The Fully Connected layer is a traditional Multi-Layer Perceptron as described before that uses a softmax activation function in the output layer The high-level features of the image are encoded by the convolutional and pooling layers which is then fed to the fully connected layer which then uses these features for classifying the input image into various classes based on the training dataset [10]

15

When a new image is fed into the CNN model all the above-mentioned steps are

carried out (forward propagation) and a probability distribution is achieved on the set of

output classes With a large enough training dataset the network will learn and

generalize well enough to classify new images into their correct classes

Recurrent Neural Networks (RNN)

Whenever we want to predict or encode sequential data RNN [11] is our go to

method RNNs perform the same task for every element of a sequence where the

output of each element depends on previous computations thus the recurrence In

practice RNNs are unable to retain long-term dependencies and can look back only a

few steps because of the vanishing gradient problem

(2-1)

A solution to the dependency problem is to use gated cells such as LSTM [11] or

GRU [13] These cells pass on important information to the next cells while ignoring

non-important ones The gated units in a GRU block are

bull Update Gate ndash Computed based on current input and hidden state

(2-2)

bull Reset Gate ndash Calculated similarly but with different weights

(2-3)

bull New memory content - (2-4)

If reset gate unit is ~0 then previous memory is ignored and only new

information is kept

16

Final memory at current time step combines previous and current time steps

(2-5)

While the GRU is computationally efficient the LSTM on the other hand is a

general case where there are three gates as follows

bull Input Gate ndash What new information to add to the current cell state

bull Forget Gate ndash How much information from previous states to be kept

bull Output gate ndash How much info should be sent to the next states

Just like GRU the current cell state is a sum of the previous cell state but

weighted by the forget gate and the new value is added which is weighted by the input

gate Based on the cell state the output gate regulates the final output

Word Embedding

Computation or gradients can be applied on numbers and not on words or letters

So first we need to convert words into their corresponding numerical formation before

feeding into a deep learning model In general there are two types of word embedding

Frequency based (which constitutes count vectors tf-idf and co-occurrence vectors)

and Prediction based With frequency based embedding the order of the words are not

preserved and works as a bag of words model Whereas with prediction based model

the order of words or locality of words are taken into consideration to generate the

numerical representation of the word Within this prediction based category there are

two fundamental techniques called Continuous Bag of Words (CBOW) and Skip Gram

Model which forms the basis for word2vec [14] and GloVe [15]

The basic intuition behind word2vec is that if two different words have very

similar ldquocontextsrdquo (that is what words are likely to appear around them) then the model

17

will produce similar vector for those words Conversely if the two word vectors are

similar then the network will produce similar context predictions for the same two words

For examples synonyms like ldquointelligentrdquo and ldquosmartrdquo would have very similar contexts

Or that words that are related like ldquoenginerdquo and ldquotransmissionrdquo would probably have

similar contexts as well [16] Plotting the word vectors learned by a word2vec over a

large corpus we could find some very interesting relationships between words

Figure 2-3 Semantic relation between words in vector space [17]

Attention Mechanism

We as humans put our attention to things are important or are relevant in a

context For example when asked a question from a passage we try to find the most

relevant part of the passage the question is relevant with and then reason from our

understanding of that part of the passage The same idea applies for attention

mechanism in Deep Learning It is used to identify the specific parts of a given context

to which the current question is relevant to

Formally put the techniques take n arguments y_1 y_n (in our case the

passage having words say y_i through h_i) and a question word say q It returns a

vector z which is supposed to be the laquo summary raquo of the y_i focusing on information

linked to the question q More formally it returns a weighted arithmetic mean of the y_i

18

and the weights are chosen according the relevance of each y_i given the context c

[18]

Figure 2-4 Attention Mechanism flow [18]

Memory Networks

Convolutional Neural Networks and Recurrent Neural Networks which does

capture how we form our visual and sequential memories their memory (encoded by

hidden states and weights) were typically too small and was not compartmentalized

enough to accurately remember facts from the past (knowledge is compressed into

dense vectors) [19]

Deep Learning needed to cultivate a methodology that preserved memories as

they are such that it wonrsquot be lost in generalization and recalling exact words or

sequence of events would be possible mdash something computers are already good at This

effort led us to Memory Networks [19] published at ICLR 2015 by Facebook AI

Research

This paper provides a basic framework to store augment and retrieve memories

while seamlessly working with a Recurrent Neural Network architecture The memory

19

network consists of a memory m (an array of objects 1 indexed by m i) and four

(potentially learned) components I G O and R as follows

I (input feature map) mdash converts the incoming input to the internal feature

representation either a sparse or dense feature vector like that from word2vec or

GloVe

G (generalization) mdash updates old memories given the new input They call this

generalization as there is an opportunity for the network to compress and generalize its

memories at this stage for some intended future use The analogy Irsquove been talking

before

O (output feature map) mdash produces a new output (in the feature representation

space) given the new input and the current memory state This component is

responsible for performing inference In a question answering system this part will

select the candidate sentences (which might contain the answer) from the story

(conversation) so far

R (response) mdash converts the output into the response format desired For

example a textual response or an action In the QA system described this component

finds the desired answer and then converts it from feature representation to the actual

word

This model is a fully supervised model meaning all the candidate sentences from

which the answer could be found are marked during training phase and can also be

termed as lsquohard attentionrsquo

The authors tested out the QA system on various literature including Lord of the

Rings

20

Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]

21

CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART

There has been rapid progress since the release of the SQuAD dataset and

currently the best ensemble models are close to human level accuracy in machine

comprehension This is due to the various ingenious methods which solves some of the

problems with the previous methods Out of Vocabulary tokens were handled by using

Character embedding Long term dependency within context passage were solved

using self-attention And many other techniques such as Contextualized vectors History

of Words Attention Flow etc In this section we will have a look at the some of the most

important models that were fundamental to the progress of Questions Answering

Machine Comprehension Using Match-LSTM and Answer Pointer

In this paper [20] the authors propose an end-to-end neural architecture for the

QA task The architecture is based on match-LSTM [21] a model they proposed for

textual entailment and Pointer Net [22] a sequence-to-sequence model proposed by

Vinyals et al (2015) to constrain the output tokens to be from the input sequences

The model consists of an LSTM preprocessing layer a match-LSTM layer and an

Answer Pointer layer

We are given a piece of text which we refer to as a passage and a question

related to the passage The passage is represented by matrix P where P is the length

(number of tokens) of the passage and d is the dimensionality of word embeddings

Similarly the question is represented by matrix Q where Q is the length of the question

Our goal is to identify a subsequence from the passage as the answer to the question

22

Figure 3-1 Match-LSTM Model Architecture [20]

LSTM Preprocessing layer They use a standard one-directional LSTM

(Hochreiter amp Schmidhuber 1997) to process the passage and the question separately

as shown below

Match LSTM Layer They applied the match-LSTM model proposed for textual

entailment to their machine comprehension problem by treating the question as a

premise and the passage as a hypothesis The match-LSTM sequentially goes through

the passage At position i of the passage it first uses the standard word-by-word

attention mechanism to obtain attention weight vector as follows

(3-1)

23

where and are parameters to be learned

Answer Pointer Layer The Answer Pointer (Ans-Ptr) layer is motivated by the

Pointer Net introduced by Vinyals et al (2015) [22] The boundary model produces only

the start token and the end token of the answer and then all the tokens between these

two in the original passage are considered to be the answer

When this paper was released back in November 2016 Match-LSTM method

was the state of the art in Question Answering systems and was at the top of the

leaderboard for the SQuAD dataset

R-NET Matching Reading Comprehension with Self-Matching Networks

In this model [23] first the question and passage are processed by a

bidirectional recurrent network (Mikolov et al 2010) separately They then match the

question and passage with gated attention-based recurrent networks obtaining

question-aware representation for the passage On top of that they apply self-matching

attention to aggregate evidence from the whole passage and refine the passage

representation which is then fed into the output layer to predict the boundary of the

answer span

Question and passage encoding First the words are converted to their

respective word-level embeddings and character level embeddings The character-level

embeddings are generated by taking the final hidden states of a bi-directional recurrent

neural network (RNN) applied to embeddings of characters in the token Such

character-level embeddings have been shown to be helpful to deal with out-of-vocab

(OOV) tokens

24

They then use a bi-directional RNN to produce new representation and

of all words in the question and passage respectively

Figure 3-2 The task of Question Answering [23]

Gated Attention-based Recurrent Networks They use a variant of attention-

based recurrent networks with an additional gate to determine the importance of

information in the passage regarding a question Different from the gates in LSTM or

GRU the additional gate is based on the current passage word and its attention-pooling

vector of the question which focuses on the relation between the question and current

passage word The gate effectively model the phenomenon that only parts of the

passage are relevant to the question in reading comprehension and question answering

is utilized in subsequent calculations

25

Self-Matching Attention From the previous step the question aware passage

representation is generated to highlight the important parts of the passage One

problem with such representation is that it has very limited knowledge of context One

answer candidate is often oblivious to important cues in the passage outside its

surrounding window To address this problem the authors propose directly matching

the question-aware passage representation against itself It dynamically collects

evidence from the whole passage for words in passage and encodes the evidence

relevant to the current passage word and its matching question information into the

passage representation

Output Layer They use the same method as Wang amp Jiang (2016b) and use

pointer networks (Vinyals et al 2015) to predict the start and end position of the

answer In addition they use an attention-pooling over the question representation to

generate the initial hidden vector for the pointer network [23]

When the R-Net Model first appeared in the leaderboard in March 2017 it was at

the top with 723 Exact Match and 807 F1 score

Bi-Directional Attention Flow (BiDAF) for Machine Comprehension

BiDAF [24] is a hierarchical multi-stage architecture for modeling the

representations of the context paragraph at different levels of granularity BIDAF

includes character-level word-level and contextual embeddings and uses bi-directional

attention flow to obtain a query-aware context representation Their attention layer is not

used to summarize the context paragraph into a fixed-size vector Instead the attention

is computed for every time step and the attended vector at each time step along with

the representations from previous layers can flow through to the subsequent modeling

layer This reduces the information loss caused by early summarization [24]

26

Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]

Their machine comprehension model is a hierarchical multi-stage process and

consists of six layers

1 Character Embedding Layer maps each word to a vector space using character-level CNNs

2 Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model

3 Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embedding of the words These first three layers are applied to both the query and context They use a LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words We place an LSTM in both directions and concatenate the outputs of the two LSTMs

4 Attention Flow Layer couples the query and context vectors and produces a set of query aware feature vectors for each word in the context

5 Modeling Layer employs a Recurrent Neural Network to scan the context The input to the modeling layer is G which encodes the query-aware representations of context words The output of the modeling layer captures the interaction among the context words conditioned on the query They use two layers of bi-directional LSTM

27

with the output size of d for each direction Hence a matrix is obtained which is passed onto the output layer to predict the answer

6 Output Layer provides an answer to the query They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices by the predicted distributions averaged over all examples [25]

In a further variation of their above work they add a self-attention layer after the

Bi-attention layer to further improve the results The architecture of the model is as

Figure 3-3 The task of Question Answering [25]

28

Summary

In this chapter we reviewed the methods that are fundamental to the state of the

art in Machine Comprehension and for the task of Question Answering We have

reached closed to human level accuracy and this is due to incremental developments

over previous models As we saw Out of Vocabulary (OOV) tokens were handled by

using Character embedding Long term dependency within context passage were

solved using self-attention and many other techniques such as Contextualized vectors

Attention Flow etc were employed to get better results In the next chapter we will see

how we can build on these models and develop further

29

CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING

Although the advancements in Question Answering systems since the release of

the SQuAD dataset has been impressive and the results are getting close to human

level accuracy it is far from being a fool-proof system The models still make mistakes

which would be obvious to a human For example

Passage The Panthers used the San Jose State practice facility and stayed at

the San Jose Marriott The Broncos practiced at Florida State Facility and stayed at the

Santa Clara Marriott

Question At what universitys facility did the Panthers practice

Actual Answer San Jose State

Predicted Answer Florida State Facility

To find out what is leading to the wrong predictions we wanted to see the

attention weights associated with such an example We plotted the passage and

question heat map which is a 2D matrix where the intensity of each cell signifies the

similarity between a passage word and a question word For the above example we

found out that while certain words of the question are given high weightage other parts

are not The words lsquoAtrsquo lsquofacilityrsquo lsquopracticersquo are given high attention but lsquoPanthersrsquo does

not receive high attention If it had received high attention then the system would have

predicted lsquoSan Jose Statersquo as the right answer To solve this issue we analyzed the

base BiDAF model and proposed adding two things

1 Bi-Attention and Self-Attention over Query

2 Second level attention over the output of (Bi-Attention + Self-Attention) from both Context and Query

30

Multi-Attention BiDAF Model

Our comprehension model is a hierarchical multi-stage process and consists of

the following layers

1 Embedding Just as all other models we embed words using pretrained word vectors We also embed the characters in each word into size 20 vectors which are learned and run a convolution neural network followed by max-pooling to get character-derived embeddings for each word The character-level and word-level embeddings are then concatenated and passed to the next layer We do not update the word embeddings during training

2 Pre-Process A shared bi-directional GRU (Cho et al 2014) is used to map the question and passage embeddings to context aware embeddings

3 Context Attention The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al 2016) is used to build a query-aware context representation Let h_i be the vector for context word i q_j be the vector for question word j and n_q and n_c be the lengths of the question and context respectively We compute attention between context word i and question word j as

(4-1)

where w1 w2 and w3 are learned vectors and is element-wise multiplication We then compute an attended vector c_i for each context token as

(4-2)

We also compute a query-to-context vector q_c

(4-3)

31

The final vector computed for each token is built by concatenating

and In our model we subsequently pass the result through a linear layer with ReLU activations

4 Context Self-Attention Next we use a layer of residual self-attention The input is passed through another bi-directional GRU Then we apply the same attention mechanism only now between the passage and itself In this case we do not use query-to context attention and we set aij = 1048576inf if i = j As before we pass the concatenated output through a linear layer with ReLU activations This layer is applied residually so this output is additionally summed with the input

5 Query Attention For this part we do the same way as context attention but calculate the weighted sum of the context words for each query word Thus we get the length as number of query words Then we calculate context to query similar to query to context in context attention layer

Figure 4-1 The modified BiDAF model with multilevel attention

6 Query Self-Attention This part is done the same way as the context self-attention layer but from the output of the Query Attention layer

32

7 Context Query Bi-Attention + Self-Attention The output of the Context self-attention and the Query Self-Attention layers are taken as input and the same process for Bi-Attention and self-attention is applied on these inputs

8 Prediction In the last layer of our model a bidirectional GRU is applied followed by a linear layer that computes answer start scores for each token The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores The softmax operation is applied to the start and end scores to produce start and end probabilities and we optimize the negative log likelihood of selecting correct start and end tokens

Having carried out this modification we were able to solve the wrong example we

started with The multilevel attention model gives the correct output as ldquoSan Jose

Staterdquo Also we achieved slightly better scores than the original model with a F1 score

of 8544 on the SQuAD dev dataset

Chatbot Design Using a QA System

Designing a perfect chatbot that passes the Turing test is a fundamental goal for

Artificial Intelligence Although we are many order to magnitudes away from achieving

such a goal domain specific tasks can be solved with chatbots made from current

technology We had started our goal with a similar objective in mind ie to design a

domain specific chatbot and then generalize to other areas as it is able to achieve the

first domain specific objective robustly This led us to the fundamental problem of

Machine Comprehension and subsequently to the task of Question Answering Having

achieved some degree of success with QA systems we looked back if we could apply

our newly acquired knowledge in the task of designing Chatbots

The chatbots made with todayrsquos technologies are mostly handcrafted techniques

such as template matching that requires anticipating all possible ways a user may

articulate his requirements and a conversation may occur This requires a lot of man

hours for designing a domain specific system and is still very error prone In this section

33

we propose a general Chatbot design that would make the designing of a domain

specific chatbot very easy and robust at the same time

Every domain specific chatbot needs to obtain a set of information from the user

and show some results based on the user specific information obtained The traditional

chatbots use template matching and keywords lookup to determine if the user has

provided the required information Our idea is to use the Question Answering system in

the backend to extract out the required information from whatever the user has typed

until this point of the conversation The information to be extracted can be posed in the

form of a set of questions and the answers obtained from those questions can be used

as the parameters to supply the relevant information to the user

We had chosen our chatbot domain as the flight reservation system Our goal

was to extract the required information from the user to be able to show him the

available flights as per the userrsquos requirements

Figure 4-2 Flight reservation chatbotrsquos chat window

34

For a flight reservation task the booking agent needs to know the origin city

destination city and date of travel at minimum to be able to show the available flights

Optional information includes the number of tickets passengerrsquos name one way or

round trip etc

The minimalistic conversation with the user through the chat window would be as

shown above We had a platform called OneTask on which we wanted to implement our

chat bot The chat interface within the OneTask system looks as follows

Figure 4-3 Chatbot within OneTask system

The working of the chat system is as follows ndash

1 Initiation The user opens the chat window which starts a session with the chat bot The chat bot asks lsquoHow may I help yoursquo

2 User Reply The user may reply with none of the required information for flight booking or may reply with multiple information in the same message

3 User Reply parsing The conversation up to this point is treated as a passage and the internal questions are run on this passage So the four questions that are run are

Where do you want to go

35

From where do you want to leave

When do you want to depart

4 Parsed responses from QA model After running the questions on the QA system with the given conversation answers are obtained for all along with their corresponding confidence values Since answers will be obtained even if the required question was not answered up to this point in the conversation the confidence values plays a crucial role in determining the validity of the answer For this purpose we determined ranges of confidence values for validation A confidence value above 10 signifies the question has been answered correctly A confidence of 2 ndash 10 signifies that it may have been answered but should verify with the user for correctness and any confidence below 2 is discarded

5 Asking remaining questions iteratively After the parsing it is checked if any of the required questions are still unanswered If so the chatbot asks the remaining question and the process from 3 ndash 4 is carried out iteratively Once all the questions have been answered the user is shown with the available flight options as per his request

Figure 4-4 The Flow diagram of the Flight booking Chatbot system

Online QA System and Attention Visualization

To be able to test out various examples we used the BiDAF [23] model for an

online demo One can either choose from the available examples from the drop-down

menu or paste their own passage and examples While this is a useful and interesting

system to test the model in a user-friendly way we created this system to be able to

focus on the wrong samples

36

Figure 4-5 QA system interface with attention highlight over candidate answers

The candidate answers are shown with the blue highlights as per their

confidence values The higher the confidence the darker the answer The highest

confidence value is chosen as the predicted answer We developed the system to show

attention spread of the candidate answers to realize what needs to be done to improve

the system This led us to realize the importance of including the query attention part as

well as multilevel attention on the BiDAF model as described in the first section of this

chapter

37

CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION

Future Directions

Having done a thorough analysis of the current methods and state of the arts in

Machine Comprehension and for the QA task and developing systems on it we have

achieved a strong sense of what needs to be done to further improve the QA models

After observing the wrong samples we can see that the system is still unable to encode

meaning and is picking answers based on statistical patterns of occurrence of answers

on training examples

As per the State of the Art models and analyzing the ongoing research literature

it is easy to conclude that more features need to be embedded to encode the meaning

of words phrases and sentences A paper called Reinforced Mnemonic Reader for

Machine Comprehension encoded POS and NER tags of words along with their word

and character embedding This gave them better results

Figure 5-1 An English language semantic parse tree [26]

38

We have developed a method to encode the syntax parse tree of a sentence that

not only encodes the post tags but the relation of the word within the phrase and the

relation of the phrase within the whole sentence in a hierarchical manner

Finally data augmentation is another solution to get better results One definite

way to reduce the errors would be to include similar samples in the training data which

the system is faltering in the dev set One could generate similar examples as the failure

cases and include them in the training set to have better prediction Another system

would be to train to similar and bigger datasets Our models were trained on the SQuAD

dataset There are other datasets too which does the similar question answering task

such as TriviaQA We could augment the training set of TriviaQA along with SQuAD to

have a more robust system that is able to generalize better and thus have higher

accuracy for predicting answer spans

Conclusion

In this work we have tried to explore the most fundamental techniques that have

shaped the current state of the art Then we proposed a minor improvement of

architecture over an existing model Furthermore we developed two applications that

uses the base model First we talked about how a chatbot application can be made

using the QA system and lastly we also created a web interface where the model can

be used for any Passage and Question This interface also shows the attention spread

on the candidate answers While our effort is ongoing to push the state of the art

forward we strongly believe that surpassing human level accuracy on this task will have

high dividends for the society at large

39

LIST OF REFERENCES

[1] Danqi Chen From Reading Comprehension To Open-Domain Question Answering 2018 Youtube Accessed March 16 2018 httpswwwyoutubecomwatchv=1RN88O9C13U

[2] Lehnert Wendy G The Process of Question Answering No RR-88 Yale Univ New Haven Conn Dept of computer science 1977

[3] Joshi Mandar Eunsol Choi Daniel S Weld and Luke Zettlemoyer Triviaqa A large scale distantly supervised challenge dataset for reading comprehension arXiv preprint arXiv170503551 (2017)

[4] Rajpurkar Pranav Jian Zhang Konstantin Lopyrev and Percy Liang Squad 100000+ questions for machine comprehension of text arXiv preprint arXiv160605250 (2016)

[5] The Stanford Question Answering Dataset 2018 RajpurkarGithubIo Accessed March 16 2018 httpsrajpurkargithubioSQuAD-explorer

[6] Hu Minghao Yuxing Peng and Xipeng Qiu Reinforced mnemonic reader for machine comprehension CoRR abs170502798 (2017)

[7] Schalkoff Robert J Artificial neural networks Vol 1 New York McGraw-Hill 1997

[8] Me For The AI and Neetesh Mehrotra 2018 The Connect Between Deep Learning And AI - Open Source For You Open Source For You Accessed March 16 2018 httpopensourceforucom201801connect-deep-learning-ai

[9] Krizhevsky Alex Ilya Sutskever and Geoffrey E Hinton Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems pp 1097-1105 2012

[10] An Intuitive Explanation Of Convolutional Neural Networks 2016 The Data Science Blog Accessed March 16 2018 httpsujjwalkarnme20160811intuitive-explanation-convnets

[11] Williams Ronald J and David Zipser A learning algorithm for continually running fully recurrent neural networks Neural computation 1 no 2 (1989) 270-280

[12] Hochreiter Sepp and Juumlrgen Schmidhuber Long short-term memory Neural computation 9 no 8 (1997) 1735-1780

[13] Chung Junyoung Caglar Gulcehre KyungHyun Cho and Yoshua Bengio Empirical evaluation of gated recurrent neural networks on sequence modeling arXiv preprint arXiv14123555 (2014)

40

[14] Mikolov Tomas Wen-tau Yih and Geoffrey Zweig Linguistic regularities in continuous space word representations In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics Human Language Technologies pp 746-751 2013

[15] Pennington Jeffrey Richard Socher and Christopher Manning Glove Global vectors for word representation In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) pp 1532-1543 2014

[16] Word2vec Tutorial - The Skip-Gram Model middot Chris Mccormick 2016 MccormickmlCom Accessed March 16 2018 httpmccormickmlcom20160419word2vec-tutorial-the-skip-gram-model

[17] Group 2015 [Paper Introduction] Bilingual Word Representations With Monolingual hellip SlideshareNet Accessed March 16 2018 httpswwwslidesharenetnaist-mtstudybilingual-word-representations-with-monolingual-quality-in-mind

[18] Attention Mechanism 2016 BlogHeuritechCom Accessed March 16 2018 httpsblogheuritechcom20160120attention-mechanism

[19] Weston Jason Sumit Chopra and Antoine Bordes 2014 Memory Networks arXiv preprint arXiv14103916v11 Accessed March 16 2018 httpsarxivorgabs14103916

[20] Wang Shuohang and Jing Jiang Machine comprehension using match-lstm and answer pointer arXiv preprint arXiv160807905 (2016)

[21] Wang Shuohang and Jing Jiang Learning natural language inference with LSTM arXiv preprint arXiv151208849 (2015)

[22] Vinyals Oriol Meire Fortunato and Navdeep Jaitly Pointer networks In Advances in Neural Information Processing Systems pp 2692-2700 2015

[23] Wang Wenhui Nan Yang Furu Wei Baobao Chang and Ming Zhou Gated self-matching networks for reading comprehension and question answering In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1 Long Papers) vol 1 pp 189-198 2017

[24] Seo Minjoon Aniruddha Kembhavi Ali Farhadi and Hannaneh Hajishirzi Bidirectional attention flow for machine comprehension arXiv preprint arXiv161101603 (2016)

[25] Clark Christopher and Matt Gardner Simple and effective multi-paragraph reading comprehension arXiv preprint arXiv171010723 (2017)

41

[26] 8 Analyzing Sentence Structure 2018 NltkOrg Accessed March 16 2018 httpwwwnltkorgbookch08html

42

BIOGRAPHICAL SKETCH

Purnendu Mukherjee grew up in Kolkata India and had a strong interest for

computers since an early age After his high school he did his BSc in computer

science from Ramakrishna Mission Residential College Narendrapur followed by MSc

in computer science from St Xavierrsquos College Kolkata He had a strong intuition and

interest for human like learning systems and wanted to work in this area He started

working at TCS Innovation Labs Pune for the application of Natural Language

Processing in Educational Applications As he was deeply passionate about learning

system that mimic the human brain and learn like a human child does he was

increasing interested about Deep Learning and its applications After working for a year

he went on to pursue a Master of Science degree in computer science from the

University of Florida Gainesville His academic interests have been focused on Deep

Learning and Natural Language Processing and he has been working on Machine

Reading Comprehension since summer of 2017

  • ACKNOWLEDGMENTS
  • LIST OF FIGURES
  • LIST OF ABBREVIATIONS
  • INTRODUCTION
  • THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
    • Neural Networks
    • Convolutional Neural Network
    • Recurrent Neural Networks (RNN)
    • Word Embedding
    • Attention Mechanism
    • Memory Networks
      • LITERATURE REVIEW AND STATE OF THE ART
        • Machine Comprehension Using Match-LSTM and Answer Pointer
        • R-NET Matching Reading Comprehension with Self-Matching Networks
        • Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
        • Summary
          • MULTI-ATTENTION QUESTION ANSWERING
            • Multi-Attention BiDAF Model
            • Chatbot Design Using a QA System
            • Online QA System and Attention Visualization
              • FUTURE DIRECTIONS AND CONCLUSION
                • Future Directions
                • Conclusion
                  • LIST OF REFERENCES
                  • BIOGRAPHICAL SKETCH
Page 11: QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISMufdcimages.uflib.ufl.edu/UF/E0/05/23/73/00001/MUKHERJEE_P.pdf · QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM By PURNENDU

11

Passage Tesla later approached Morgan to ask for more funds to build a more

powerful transmitter When asked where all the money had gone Tesla responded by

saying that he was affected by the Panic of 1901 which he (Morgan) had caused

Morgan was shocked by the reminder of his part in the stock market crash and by

Teslarsquos breach of contract by asking for more funds Tesla wrote another plea to

Morgan but it was also fruitless Morgan still owed Tesla money on the original

agreement and Tesla had been facing foreclosure even before construction of the

tower began

Question On what did Tesla blame for the loss of the initial money

Answer Panic of 1901

As we started exploring the QA task we faced several challenges Some of them

we could solve with the help of other research and some of the challenges still exist in

the domain

bull Out of Vocabulary words

bull Multi-sentence reasoning may be required

bull There may exist several candidate answers

bull Optimizing the Exact Match (EM) metric may fail when the answer boundary is fuzzy or too long such as the answer of the ldquowhyrdquo query [6]

bull ldquoOne-hoprdquo prediction may fail to fully understand the query [6]

bull Fail to fully capture the long-distance contextual interaction between parts of the context by only using LSTMGRU

bull Current models are unable to capture semantics of the passage

In the upcoming chapters we will first briefly review the basics necessary for

understanding the models then we will delve deep into the fundamental models that

12

have shaped the current State of the Art models then we will discuss our contribution in

terms of architecture and applications and finally conclude with future directions

13

CHAPTER 2 THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM

To build a question answering system, one needs to be familiar with the fundamental deep learning models such as Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), etc. In this chapter we will give an overview of these techniques and see how they all connect to building a question answering system.

Neural Networks

What makes Deep Learning so intriguing is that it has a close resemblance to the working of the mammalian brain, or at least draws inspiration from it. The same can be said for Artificial Neural Networks [7], which consist of a system of interconnected units called 'neurons' that take input from similar units and produce a single output.

Figure 2-1 Simple and Deep Learning Neural Networks [8]

The connection from one neuron to another can be weighted based on the input data, which enables the network to tune itself to produce a certain output based on the input. This is the learning process, which is achieved through backpropagation, a procedure for propagating the error from the output layer back to the previous layers.
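To make this concrete, the following is a minimal sketch of forward propagation and backpropagation for a tiny two-layer network in NumPy; the layer sizes, learning rate, and toy data are illustrative assumptions and not values used elsewhere in this work.

import numpy as np

# Toy data: 4 samples with 3 input features and a binary target (XOR of the first two bits)
X = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(3, 8))   # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(8, 1))   # hidden -> output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(5000):
    # Forward pass: each layer takes the previous layer's output as input
    h = sigmoid(X @ W1)
    out = sigmoid(h @ W2)

    # Backward pass: propagate the output error back to the earlier layer
    err_out = (out - y) * out * (1 - out)
    err_h = (err_out @ W2.T) * h * (1 - h)

    # Gradient-descent updates tune the connection weights
    W2 -= 0.5 * h.T @ err_out
    W1 -= 0.5 * X.T @ err_h

print(out.round(2))   # predictions move toward the targets as training proceeds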


Convolutional Neural Network

The first wave of deep learning's success was brought by Convolutional Neural Networks (CNN) [9], when this was the technique used by the winning team of the ImageNet competition in 2012. CNNs are deep artificial neural networks (ANN) that can be used to classify images, cluster them by similarity, and perform object recognition within scenes. They can be used to detect and identify faces, people, signs, or any other visual data.

Figure 2-2 Convolutional Neural Network Architecture [10]

There are primarily four operations in a standard CNN model (as shown in the figure above); a minimal code sketch of such a network follows the list.

1. Convolution - The primary purpose of convolution in the ConvNet (above) is to extract features from the input image. The spatial relationships between pixels, i.e., the image features, are preserved and learned by the convolution using small squares of input data.

2. Non-Linearity (ReLU) - The Rectified Linear Unit (ReLU) is a non-linear operation that carries out an element-wise operation on each pixel. This operation replaces the negative pixel values in the feature map with zero.

3. Pooling or Sub-Sampling - Spatial pooling reduces the dimensionality of each feature map but retains the most important information. For max pooling, the largest value in the square window is taken and the rest are dropped. Other types of pooling are average, sum, etc.

4. Classification (Fully Connected Layer) - The fully connected layer is a traditional Multi-Layer Perceptron, as described before, that uses a softmax activation function in the output layer. The high-level features of the image are encoded by the convolutional and pooling layers and are then fed to the fully connected layer, which uses these features to classify the input image into various classes based on the training dataset [10].
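As referenced above, the following is a minimal PyTorch sketch of these four operations; the channel counts, kernel size, input resolution, and the 10-class output are illustrative assumptions rather than a model used in this thesis.

import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)   # 1. convolution
        self.relu = nn.ReLU()                                     # 2. non-linearity
        self.pool = nn.MaxPool2d(2)                               # 3. max pooling
        self.fc = nn.Linear(16 * 16 * 16, num_classes)            # 4. fully connected classifier

    def forward(self, x):                 # x: (batch, 3, 32, 32) images
        x = self.pool(self.relu(self.conv(x)))
        x = x.flatten(1)                  # flatten the feature maps for the classifier
        return self.fc(x)                 # class scores; softmax turns them into probabilities

logits = TinyCNN()(torch.randn(4, 3, 32, 32))
probs = torch.softmax(logits, dim=1)      # probability distribution over the output classes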

When a new image is fed into the CNN model, all of the above-mentioned steps are carried out (forward propagation) and a probability distribution is obtained over the set of output classes. With a large enough training dataset, the network will learn and generalize well enough to classify new images into their correct classes.

Recurrent Neural Networks (RNN)

Whenever we want to predict or encode sequential data, RNNs [11] are our go-to method. RNNs perform the same task for every element of a sequence, where the output for each element depends on the previous computations, hence the recurrence; a typical form of the hidden-state update is given in Equation 2-1. In practice, RNNs are unable to retain long-term dependencies and can look back only a few steps because of the vanishing gradient problem.

h_t = tanh(W_x x_t + W_h h_(t-1))   (2-1)

A solution to the dependency problem is to use gated cells such as LSTM [12] or GRU [13]. These cells pass on important information to the next cells while ignoring non-important information. The gated units in a GRU block are:

• Update Gate - Computed based on the current input and hidden state:

z_t = σ(W_z x_t + U_z h_(t-1))   (2-2)

• Reset Gate - Calculated similarly, but with different weights:

r_t = σ(W_r x_t + U_r h_(t-1))   (2-3)

• New memory content - A candidate state that combines the new input with the reset-gated previous state:

h~_t = tanh(W x_t + U (r_t ⊙ h_(t-1)))   (2-4)

If the reset gate is ~0, then the previous memory is ignored and only the new information is kept.

The final memory at the current time step combines the previous and the current time step:

h_t = (1 - z_t) ⊙ h_(t-1) + z_t ⊙ h~_t   (2-5)
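The sketch below implements one GRU step in NumPy, following the standard formulation of Equations 2-2 to 2-5; the input and hidden sizes and the random weights are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, params):
    # One GRU time step following Equations 2-2 to 2-5
    Wz, Uz, Wr, Ur, W, U = params
    z = sigmoid(Wz @ x_t + Uz @ h_prev)               # update gate (2-2)
    r = sigmoid(Wr @ x_t + Ur @ h_prev)               # reset gate (2-3)
    h_tilde = np.tanh(W @ x_t + U @ (r * h_prev))     # new memory content (2-4)
    return (1 - z) * h_prev + z * h_tilde             # final memory (2-5)

# Illustrative sizes: 5-dimensional inputs, 4-dimensional hidden state
rng = np.random.default_rng(0)
d_in, d_h = 5, 4
params = (rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)),   # W_z, U_z
          rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)),   # W_r, U_r
          rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)))   # W, U

h = np.zeros(d_h)
for x_t in rng.normal(size=(3, d_in)):    # a toy three-step input sequence
    h = gru_step(x_t, h, params)
print(h)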

While the GRU is computationally efficient, the LSTM is a more general case with three gates, as follows:

• Input Gate - What new information to add to the current cell state

• Forget Gate - How much information from previous states should be kept

• Output Gate - How much information should be sent to the next states

Just like in the GRU, the current cell state is a sum of the previous cell state, weighted by the forget gate, and the new value, weighted by the input gate. Based on the cell state, the output gate regulates the final output.

Word Embedding

Computations and gradients can be applied to numbers, not to words or letters, so first we need to convert words into a corresponding numerical representation before feeding them into a deep learning model. In general, there are two types of word embedding: frequency-based (which includes count vectors, tf-idf, and co-occurrence vectors) and prediction-based. With frequency-based embeddings the order of the words is not preserved, and the text is treated as a bag-of-words model, whereas with prediction-based models the order or locality of words is taken into consideration to generate the numerical representation of each word. Within the prediction-based category there are two fundamental techniques, called Continuous Bag of Words (CBOW) and the Skip-Gram model, which form the basis for word2vec [14] and GloVe [15].

The basic intuition behind word2vec is that if two different words have very similar "contexts" (that is, the words that are likely to appear around them), then the model will produce similar vectors for those words. Conversely, if two word vectors are similar, then the network will produce similar context predictions for those two words. For example, synonyms like "intelligent" and "smart" would have very similar contexts, and words that are related, like "engine" and "transmission", would probably have similar contexts as well [16]. Plotting the word vectors learned by word2vec over a large corpus, we can find some very interesting relationships between words.
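As a concrete illustration, the sketch below trains a skip-gram word2vec model with the gensim library (gensim is an assumption made for illustration only; this thesis does not depend on it) and queries the learned vectors; real models are trained on corpora with billions of tokens rather than this toy input.

from gensim.models import Word2Vec

sentences = [
    ["the", "engine", "drives", "the", "transmission"],
    ["the", "smart", "student", "answered", "the", "question"],
    ["the", "intelligent", "student", "answered", "the", "question"],
]

# sg=1 selects the skip-gram objective; sg=0 would select CBOW
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=200)

print(model.wv.most_similar("smart", topn=3))   # words with the most similar vectors
print(model.wv["engine"][:5])                    # first few components of a word vector
# Analogy queries such as king - man + woman are asked the same way:
# model.wv.most_similar(positive=["king", "woman"], negative=["man"])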

Figure 2-3 Semantic relation between words in vector space [17]

Attention Mechanism

We as humans put our attention on things that are important or relevant in a context. For example, when asked a question about a passage, we try to find the part of the passage most relevant to the question and then reason from our understanding of that part of the passage. The same idea applies to the attention mechanism in Deep Learning: it is used to identify the specific parts of a given context to which the current question is relevant.

Formally put, the technique takes n arguments y_1, ..., y_n (in our case, the passage with word representations y_1 through y_n) and a question representation, say q. It returns a vector z which is supposed to be the "summary" of the y_i, focusing on information linked to the question q. More formally, it returns a weighted arithmetic mean of the y_i, where the weights are chosen according to the relevance of each y_i given the question q [18].
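A minimal NumPy sketch of this weighted-mean attention follows; the vector dimensions and random values are illustrative assumptions.

import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def attend(Y, q):
    # Y: (n, d) passage word vectors y_1..y_n; q: (d,) question vector
    scores = Y @ q                  # relevance of each y_i to the question
    weights = softmax(scores)       # attention weights, summing to 1
    z = weights @ Y                 # weighted arithmetic mean of the y_i
    return z, weights

rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 4))         # a six-word "passage" with 4-dimensional vectors
q = rng.normal(size=4)
z, w = attend(Y, q)
print(w.round(2), z.round(2))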

Figure 2-4 Attention Mechanism flow [18]

Memory Networks

While Convolutional Neural Networks and Recurrent Neural Networks do capture how we form our visual and sequential memories, their memory (encoded by hidden states and weights) is typically too small and not compartmentalized enough to accurately remember facts from the past, since knowledge is compressed into dense vectors [19].

Deep Learning needed a methodology that preserves memories as they are, so that they are not lost in generalization and recalling exact words or sequences of events remains possible, something computers are already good at. This effort led to Memory Networks [19], published at ICLR 2015 by Facebook AI Research.

This paper provides a basic framework to store, augment, and retrieve memories while working seamlessly with a Recurrent Neural Network architecture. The memory network consists of a memory m (an array of objects indexed by m_i) and four (potentially learned) components I, G, O, and R, as follows:

I (input feature map): converts the incoming input to the internal feature representation, either a sparse or dense feature vector, like that from word2vec or GloVe.

G (generalization): updates old memories given the new input. They call this generalization because there is an opportunity for the network to compress and generalize its memories at this stage for some intended future use; this is the analogy mentioned before.

O (output feature map): produces a new output (in the feature representation space) given the new input and the current memory state. This component is responsible for performing inference. In a question answering system, this part selects the candidate sentences (which might contain the answer) from the story (conversation) so far.

R (response): converts the output into the desired response format, for example a textual response or an action. In the QA system described, this component finds the desired answer and then converts it from the feature representation to the actual word.

This is a fully supervised model, meaning all the candidate sentences from which the answer could be found are marked during the training phase; this can also be termed 'hard attention'.

The authors tested the QA system on various literature, including Lord of the Rings.

Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]


CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART

There has been rapid progress since the release of the SQuAD dataset, and currently the best ensemble models are close to human-level accuracy in machine comprehension. This is due to various ingenious methods that solve some of the problems of the previous approaches: Out-of-Vocabulary tokens were handled by using character embeddings, long-term dependencies within the context passage were addressed using self-attention, and many other techniques, such as contextualized vectors, history of words, and attention flow, were introduced. In this section we will look at some of the most important models that were fundamental to the progress of Question Answering.

Machine Comprehension Using Match-LSTM and Answer Pointer

In this paper [20], the authors propose an end-to-end neural architecture for the QA task. The architecture is based on match-LSTM [21], a model they previously proposed for textual entailment, and Pointer Net [22], a sequence-to-sequence model proposed by Vinyals et al. (2015) to constrain the output tokens to be from the input sequences. The model consists of an LSTM preprocessing layer, a match-LSTM layer, and an Answer Pointer layer.

We are given a piece of text, which we refer to as a passage, and a question related to the passage. The passage is represented by a matrix P of size d × P, where P is the length (number of tokens) of the passage and d is the dimensionality of the word embeddings. Similarly, the question is represented by a matrix Q of size d × Q, where Q is the length of the question. Our goal is to identify a subsequence of the passage as the answer to the question.

Figure 3-1 Match-LSTM Model Architecture [20]

LSTM Preprocessing Layer: They use a standard one-directional LSTM (Hochreiter & Schmidhuber, 1997) to process the passage and the question separately, obtaining hidden representations of both.

Match-LSTM Layer: They applied the match-LSTM model, proposed for textual entailment, to their machine comprehension problem by treating the question as a premise and the passage as a hypothesis. The match-LSTM sequentially goes through the passage. At position i of the passage, it first uses the standard word-by-word attention mechanism to obtain an attention weight vector, as follows:

(3-1)

where the weight matrices and bias vectors in Equation 3-1 are parameters to be learned.

Answer Pointer Layer: The Answer Pointer (Ans-Ptr) layer is motivated by the Pointer Net introduced by Vinyals et al. (2015) [22]. The boundary model produces only the start token and the end token of the answer, and all the tokens between these two in the original passage are then considered to be the answer.
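A small sketch of this boundary-style decoding is shown below: given start and end probabilities over passage tokens, it picks the highest-scoring span with start <= end (the maximum span length of 15 is an illustrative assumption, as are the hand-set probabilities).

import numpy as np

def best_span(p_start, p_end, max_len=15):
    # p_start, p_end: probability of each token being the answer start / end
    best, best_score = (0, 0), -1.0
    for i in range(len(p_start)):
        for j in range(i, min(i + max_len, len(p_end))):
            score = p_start[i] * p_end[j]
            if score > best_score:
                best_score, best = score, (i, j)
    return best

tokens = "Tesla responded by saying that he was affected by the Panic of 1901".split()
p_s = np.zeros(len(tokens))
p_e = np.zeros(len(tokens))
p_s[10], p_e[12] = 0.9, 0.8           # pretend the model points at "Panic of 1901"
i, j = best_span(p_s, p_e)
print(" ".join(tokens[i:j + 1]))      # -> Panic of 1901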

When this paper was released back in November 2016, the Match-LSTM method was the state of the art in Question Answering and was at the top of the leaderboard for the SQuAD dataset.

R-NET Matching Reading Comprehension with Self-Matching Networks

In this model [23], the question and the passage are first processed separately by a bidirectional recurrent network (Mikolov et al., 2010). They then match the question and passage with gated attention-based recurrent networks, obtaining a question-aware representation of the passage. On top of that, they apply self-matching attention to aggregate evidence from the whole passage and refine the passage representation, which is then fed into the output layer to predict the boundary of the answer span.

Question and Passage Encoding: First, the words are converted to their respective word-level embeddings and character-level embeddings. The character-level embeddings are generated by taking the final hidden states of a bi-directional recurrent neural network (RNN) applied to the embeddings of the characters in the token. Such character-level embeddings have been shown to be helpful in dealing with out-of-vocabulary (OOV) tokens. They then use a bi-directional RNN to produce new representations of all words in the question and the passage, respectively.

Figure 3-2 The task of Question Answering [23]

Gated Attention-Based Recurrent Networks: They use a variant of attention-based recurrent networks with an additional gate to determine the importance of information in the passage with regard to a question. Different from the gates in LSTM or GRU, the additional gate is based on the current passage word and its attention-pooling vector of the question, which focuses on the relation between the question and the current passage word. The gate effectively models the phenomenon that only parts of the passage are relevant to the question in reading comprehension and question answering, and the gated representation is utilized in subsequent calculations.
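A minimal PyTorch sketch of such an input gate follows; the dimensions are illustrative assumptions, and the attention pooling that produces the question vector is omitted for brevity.

import torch
import torch.nn as nn

class GatedInput(nn.Module):
    # Scales the recurrent network's input [passage word; attended question vector] by a learned gate
    def __init__(self, d):
        super().__init__()
        self.gate = nn.Linear(2 * d, 2 * d, bias=False)

    def forward(self, u_p, c_q):
        x = torch.cat([u_p, c_q], dim=-1)    # passage word and its question attention-pooling vector
        g = torch.sigmoid(self.gate(x))      # how relevant this word is to the question
        return g * x                         # gated input passed on to the recurrent network

d = 8
out = GatedInput(d)(torch.randn(3, d), torch.randn(3, d))
print(out.shape)    # torch.Size([3, 16])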

Self-Matching Attention: From the previous step, the question-aware passage representation is generated to highlight the important parts of the passage. One problem with such a representation is that it has very limited knowledge of context: one answer candidate is often oblivious to important cues in the passage outside its surrounding window. To address this problem, the authors propose directly matching the question-aware passage representation against itself. It dynamically collects evidence from the whole passage for each word in the passage and encodes the evidence relevant to the current passage word, together with its matching question information, into the passage representation.

Output Layer: They use the same method as Wang & Jiang (2016b) and use pointer networks (Vinyals et al., 2015) to predict the start and end positions of the answer. In addition, they use attention-pooling over the question representation to generate the initial hidden vector for the pointer network [23].

When the R-NET model first appeared on the leaderboard in March 2017, it was at the top with a 72.3% Exact Match and an 80.7% F1 score.

Bi-Directional Attention Flow (BiDAF) for Machine Comprehension

BiDAF [24] is a hierarchical multi-stage architecture for modeling the representations of the context paragraph at different levels of granularity. BiDAF includes character-level, word-level, and contextual embeddings, and uses bi-directional attention flow to obtain a query-aware context representation. Their attention layer is not used to summarize the context paragraph into a fixed-size vector. Instead, the attention is computed for every time step, and the attended vector at each time step, along with the representations from previous layers, can flow through to the subsequent modeling layer. This reduces the information loss caused by early summarization [24].

Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]

Their machine comprehension model is a hierarchical multi-stage process and consists of six layers:

1. Character Embedding Layer - maps each word to a vector space using character-level CNNs.

2. Word Embedding Layer - maps each word to a vector space using a pre-trained word embedding model.

3. Contextual Embedding Layer - utilizes contextual cues from surrounding words to refine the embedding of the words. These first three layers are applied to both the query and the context. They use an LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words, with an LSTM in each direction, and concatenate the outputs of the two LSTMs.

4. Attention Flow Layer - couples the query and context vectors and produces a set of query-aware feature vectors for each word in the context.

5. Modeling Layer - employs a Recurrent Neural Network to scan the context. The input to the modeling layer is G, which encodes the query-aware representations of the context words. The output of the modeling layer captures the interaction among the context words conditioned on the query. They use two layers of bi-directional LSTM with an output size of d for each direction; hence a matrix is obtained, which is passed on to the output layer to predict the answer.

6. Output Layer - provides an answer to the query. They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices under the predicted distributions, averaged over all examples [25].

In a further variation of their above work, they add a self-attention layer after the bi-attention layer to further improve the results; the architecture of this improved model is shown in their paper [25].

Summary

In this chapter we reviewed the methods that are fundamental to the state of the art in Machine Comprehension and the task of Question Answering. We have come close to human-level accuracy, and this is due to incremental developments over previous models. As we saw, Out-of-Vocabulary (OOV) tokens were handled by using character embeddings, long-term dependencies within the context passage were addressed using self-attention, and many other techniques, such as contextualized vectors and attention flow, were employed to get better results. In the next chapter we will see how we can build on these models and develop them further.

CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING

Although the advancements in Question Answering systems since the release of the SQuAD dataset have been impressive, and the results are getting close to human-level accuracy, it is far from being a fool-proof system. The models still make mistakes that would be obvious to a human. For example:

Passage: The Panthers used the San Jose State practice facility and stayed at the San Jose Marriott. The Broncos practiced at Florida State Facility and stayed at the Santa Clara Marriott.

Question: At what university's facility did the Panthers practice?

Actual Answer: San Jose State

Predicted Answer: Florida State Facility

To find out what was leading to the wrong predictions, we wanted to see the attention weights associated with such an example. We plotted the passage-question heat map, which is a 2D matrix where the intensity of each cell signifies the similarity between a passage word and a question word. For the above example, we found that while certain words of the question are given high weightage, other parts are not. The words 'at', 'facility', and 'practice' are given high attention, but 'Panthers' does not receive high attention. If it had received high attention, then the system would have predicted 'San Jose State' as the right answer. To solve this issue, we analyzed the base BiDAF model and proposed adding two things:

1. Bi-Attention and Self-Attention over the Query

2. A second level of attention over the outputs of (Bi-Attention + Self-Attention) from both the Context and the Query

Multi-Attention BiDAF Model

Our comprehension model is a hierarchical multi-stage process and consists of the following layers:

1. Embedding - Just as in all other models, we embed words using pretrained word vectors. We also embed the characters of each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to get character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training.

2. Pre-Process - A shared bi-directional GRU (Cho et al., 2014) is used to map the question and passage embeddings to context-aware embeddings.

3. Context Attention - The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al., 2016) is used to build a query-aware context representation. Let h_i be the vector for context word i, q_j be the vector for question word j, and n_q and n_c be the lengths of the question and context respectively. We compute attention between context word i and question word j as follows (a code sketch of this layer is given after the list):

a_ij = w1 · h_i + w2 · q_j + w3 · (h_i ⊙ q_j)   (4-1)

where w1, w2, and w3 are learned vectors and ⊙ is element-wise multiplication. We then compute an attended vector c_i for each context token as

c_i = Σ_j p_ij q_j,  where p_ij = softmax_j(a_ij)   (4-2)

We also compute a query-to-context vector q_c:

q_c = Σ_i softmax_i(m_i) h_i,  where m_i = max_j a_ij   (4-3)

The final vector computed for each context token is built by concatenating h_i, c_i, h_i ⊙ c_i, and q_c ⊙ c_i. In our model, we subsequently pass the result through a linear layer with ReLU activations.

4. Context Self-Attention - Next, we use a layer of residual self-attention. The input is passed through another bi-directional GRU. Then we apply the same attention mechanism, only now between the passage and itself. In this case we do not use query-to-context attention, and we set a_ij = -inf if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so its output is additionally summed with its input.

5. Query Attention - For this part we proceed the same way as for context attention, but calculate the weighted sum of the context words for each query word; thus the result has length equal to the number of query words. We then calculate context-to-query attention, similar to the query-to-context attention in the context attention layer.

Figure 4-1 The modified BiDAF model with multilevel attention

6. Query Self-Attention - This part is done the same way as the context self-attention layer, but on the output of the Query Attention layer.

7. Context-Query Bi-Attention + Self-Attention - The outputs of the Context Self-Attention and Query Self-Attention layers are taken as input, and the same process of bi-attention and self-attention is applied to these inputs.

8. Prediction - In the last layer of our model, a bidirectional GRU is applied, followed by a linear layer that computes answer start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log likelihood of selecting the correct start and end tokens.
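As referenced in the Context Attention step, the sketch below computes the similarity of Equation 4-1 and the attended vectors of Equations 4-2 and 4-3 in PyTorch; the tensor sizes are illustrative assumptions, and batching, masking, and dropout are omitted.

import torch

def bidaf_attention(h, q, w1, w2, w3):
    # h: (n_c, d) context vectors, q: (n_q, d) question vectors
    # a[i, j] = w1.h_i + w2.q_j + w3.(h_i * q_j)                             (4-1)
    a = (h @ w1).unsqueeze(1) + (q @ w2).unsqueeze(0) + (h * w3) @ q.t()
    c = torch.softmax(a, dim=1) @ q                       # attended vectors c_i   (4-2)
    q_c = torch.softmax(a.max(dim=1).values, dim=0) @ h   # query-to-context q_c   (4-3)
    return torch.cat([h, c, h * c, q_c.unsqueeze(0) * c], dim=1)   # per-token features

d, n_c, n_q = 6, 5, 3
h, q = torch.randn(n_c, d), torch.randn(n_q, d)
w1, w2, w3 = torch.randn(d), torch.randn(d), torch.randn(d)
print(bidaf_attention(h, q, w1, w2, w3).shape)   # torch.Size([5, 24])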

Having carried out this modification, we were able to solve the wrong example we started with: the multilevel attention model gives the correct output, "San Jose State". We also achieved slightly better scores than the original model, with an F1 score of 85.44 on the SQuAD dev dataset.

Chatbot Design Using a QA System

Designing a perfect chatbot that passes the Turing test is a fundamental goal of Artificial Intelligence. Although we are many orders of magnitude away from achieving such a goal, domain-specific tasks can be solved with chatbots made from current technology. We had started with a similar objective in mind, i.e., to design a domain-specific chatbot and then generalize to other areas once it achieves the first domain-specific objective robustly. This led us to the fundamental problem of Machine Comprehension and subsequently to the task of Question Answering. Having achieved some degree of success with QA systems, we looked at whether we could apply our newly acquired knowledge to the task of designing chatbots.

The chatbots made with today's technologies mostly rely on handcrafted techniques such as template matching, which requires anticipating all possible ways a user may articulate his requirements and a conversation may unfold. This requires a lot of man-hours for designing a domain-specific system and is still very error prone. In this section we propose a general chatbot design that would make designing a domain-specific chatbot easy and robust at the same time.

Every domain-specific chatbot needs to obtain a set of information from the user and show some results based on the user-specific information obtained. Traditional chatbots use template matching and keyword lookup to determine whether the user has provided the required information. Our idea is to use the Question Answering system in the backend to extract the required information from whatever the user has typed up to this point in the conversation. The information to be extracted can be posed in the form of a set of questions, and the answers obtained for those questions can be used as the parameters to supply the relevant information to the user.

We chose the flight reservation system as our chatbot domain. Our goal was to extract the information required from the user to be able to show him the available flights as per the user's requirements.

Figure 4-2 Flight reservation chatbotrsquos chat window

For a flight reservation task, the booking agent needs to know at minimum the origin city, the destination city, and the date of travel to be able to show the available flights. Optional information includes the number of tickets, the passenger's name, one-way or round trip, etc.

The minimal conversation with the user through the chat window would be as shown above. We had a platform called OneTask on which we wanted to implement our chatbot. The chat interface within the OneTask system looks as follows.

Figure 4-3 Chatbot within OneTask system

The working of the chat system is as follows (a simplified sketch of this loop is given after the list):

1. Initiation - The user opens the chat window, which starts a session with the chatbot. The chatbot asks 'How may I help you?'

2. User Reply - The user may reply with none of the information required for flight booking, or may reply with several pieces of information in the same message.

3. User Reply Parsing - The conversation up to this point is treated as a passage, and the internal questions are run on this passage. The questions that are run are:

Where do you want to go?

From where do you want to leave?

When do you want to depart?

4. Parsed Responses from the QA Model - After running the questions on the QA system with the given conversation, answers are obtained for all of them, along with their corresponding confidence values. Since answers will be returned even if the corresponding question has not actually been answered up to this point in the conversation, the confidence values play a crucial role in determining the validity of an answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly, a confidence of 2 - 10 signifies that it may have been answered but should be verified with the user for correctness, and any confidence below 2 is discarded.

5. Asking Remaining Questions Iteratively - After the parsing, it is checked whether any of the required questions are still unanswered. If so, the chatbot asks the remaining question, and steps 3 - 4 are carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request.
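As referenced above, a simplified sketch of this loop is shown below; qa_model stands in for the trained QA system, and the slot names, prompts, and the use of the confidence thresholds mirror the description above but are otherwise illustrative assumptions.

# Internal questions the bot needs answered before it can search for flights
SLOTS = {
    "destination": "Where do you want to go?",
    "origin": "From where do you want to leave?",
    "date": "When do you want to depart?",
}

def chat_turn(conversation, qa_model):
    # Treat the conversation so far as the passage and run every internal question over it
    filled, to_confirm = {}, {}
    for slot, question in SLOTS.items():
        answer, confidence = qa_model.answer(passage=conversation, question=question)
        if confidence > 10:
            filled[slot] = answer          # confidently answered
        elif confidence > 2:
            to_confirm[slot] = answer      # possibly answered; verify with the user
    for slot, answer in to_confirm.items():
        return f"Just to confirm, is '{answer}' your answer to: {SLOTS[slot]}"
    for slot, question in SLOTS.items():
        if slot not in filled:
            return question                # ask for the next missing piece of information
    return (f"Searching flights from {filled['origin']} to "
            f"{filled['destination']} on {filled['date']}...")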

Figure 4-4 The Flow diagram of the Flight booking Chatbot system

Online QA System and Attention Visualization

To be able to test out various examples, we set up the BiDAF [24] model as an online demo. One can either choose from the available examples in the drop-down menu or paste in their own passage and questions. While this is a useful and interesting system for testing the model in a user-friendly way, we created it primarily to be able to focus on the wrong samples.

Figure 4-5 QA system interface with attention highlight over candidate answers

The candidate answers are shown with blue highlights according to their confidence values: the higher the confidence, the darker the answer. The answer with the highest confidence value is chosen as the predicted answer. We developed the system to show the attention spread over the candidate answers in order to understand what needs to be done to improve the system. This led us to realize the importance of including the query attention part as well as multilevel attention in the BiDAF model, as described in the first section of this chapter.

CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION

Future Directions

Having done a thorough analysis of the current methods and the state of the art in Machine Comprehension and the QA task, and having developed systems on top of them, we have gained a strong sense of what needs to be done to further improve QA models. After observing the wrong samples, we can see that the system is still unable to encode meaning and is picking answers based on statistical patterns of the occurrence of answers in the training examples.

From the state-of-the-art models and the ongoing research literature, it is easy to conclude that more features need to be embedded to encode the meaning of words, phrases, and sentences. The paper Reinforced Mnemonic Reader for Machine Comprehension [6] encoded the POS and NER tags of words along with their word and character embeddings, which gave them better results.
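As an illustration of such linguistic features, the sketch below extracts POS and NER tags with spaCy (spaCy is an assumption made for illustration; the cited work does not necessarily use it); each tag can be mapped to a learned embedding and concatenated with the word and character embeddings.

import spacy

nlp = spacy.load("en_core_web_sm")   # small English pipeline with a POS tagger and NER
doc = nlp("Tesla wrote another plea to Morgan but it was also fruitless")

for token in doc:
    # token.pos_ is the coarse part-of-speech tag; token.ent_type_ is the entity label (empty if none)
    print(token.text, token.pos_, token.ent_type_)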

Figure 5-1 An English language semantic parse tree [26]

We have developed a method to encode the syntactic parse tree of a sentence that encodes not only the POS tags but also the relation of each word within its phrase and the relation of each phrase within the whole sentence, in a hierarchical manner.

Finally, data augmentation is another way to get better results. One definite way to reduce the errors would be to include in the training data samples similar to those the system is faltering on in the dev set. One could generate examples similar to the failure cases and include them in the training set to obtain better predictions. Another approach would be to train on similar and bigger datasets. Our models were trained on the SQuAD dataset; there are other datasets that address a similar question answering task, such as TriviaQA [3]. We could augment the training set of TriviaQA along with SQuAD to have a more robust system that is able to generalize better and thus have higher accuracy in predicting answer spans.

Conclusion

In this work we explored the most fundamental techniques that have shaped the current state of the art. We then proposed a minor improvement in architecture over an existing model. Furthermore, we developed two applications that use the base model: first, we discussed how a chatbot application can be made using the QA system, and second, we created a web interface where the model can be used with any passage and question. This interface also shows the attention spread over the candidate answers. While our effort to push the state of the art forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will have high dividends for society at large.

LIST OF REFERENCES

[1] Danqi Chen From Reading Comprehension To Open-Domain Question Answering 2018 Youtube Accessed March 16 2018 httpswwwyoutubecomwatchv=1RN88O9C13U

[2] Lehnert Wendy G The Process of Question Answering No RR-88 Yale Univ New Haven Conn Dept of computer science 1977

[3] Joshi Mandar Eunsol Choi Daniel S Weld and Luke Zettlemoyer Triviaqa A large scale distantly supervised challenge dataset for reading comprehension arXiv preprint arXiv170503551 (2017)

[4] Rajpurkar Pranav Jian Zhang Konstantin Lopyrev and Percy Liang Squad 100000+ questions for machine comprehension of text arXiv preprint arXiv160605250 (2016)

[5] The Stanford Question Answering Dataset 2018 RajpurkarGithubIo Accessed March 16 2018 httpsrajpurkargithubioSQuAD-explorer

[6] Hu Minghao Yuxing Peng and Xipeng Qiu Reinforced mnemonic reader for machine comprehension CoRR abs170502798 (2017)

[7] Schalkoff Robert J Artificial neural networks Vol 1 New York McGraw-Hill 1997

[8] Me For The AI and Neetesh Mehrotra 2018 The Connect Between Deep Learning And AI - Open Source For You Open Source For You Accessed March 16 2018 httpopensourceforucom201801connect-deep-learning-ai

[9] Krizhevsky Alex Ilya Sutskever and Geoffrey E Hinton Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems pp 1097-1105 2012

[10] An Intuitive Explanation Of Convolutional Neural Networks 2016 The Data Science Blog Accessed March 16 2018 httpsujjwalkarnme20160811intuitive-explanation-convnets

[11] Williams Ronald J and David Zipser A learning algorithm for continually running fully recurrent neural networks Neural computation 1 no 2 (1989) 270-280

[12] Hochreiter Sepp and Jürgen Schmidhuber Long short-term memory Neural computation 9 no 8 (1997) 1735-1780

[13] Chung Junyoung Caglar Gulcehre KyungHyun Cho and Yoshua Bengio Empirical evaluation of gated recurrent neural networks on sequence modeling arXiv preprint arXiv14123555 (2014)


[14] Mikolov Tomas Wen-tau Yih and Geoffrey Zweig Linguistic regularities in continuous space word representations In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics Human Language Technologies pp 746-751 2013

[15] Pennington Jeffrey Richard Socher and Christopher Manning Glove Global vectors for word representation In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) pp 1532-1543 2014

[16] Word2vec Tutorial - The Skip-Gram Model middot Chris Mccormick 2016 MccormickmlCom Accessed March 16 2018 httpmccormickmlcom20160419word2vec-tutorial-the-skip-gram-model

[17] Group 2015 [Paper Introduction] Bilingual Word Representations With Monolingual ... SlideshareNet Accessed March 16 2018 httpswwwslidesharenetnaist-mtstudybilingual-word-representations-with-monolingual-quality-in-mind

[18] Attention Mechanism 2016 BlogHeuritechCom Accessed March 16 2018 httpsblogheuritechcom20160120attention-mechanism

[19] Weston Jason Sumit Chopra and Antoine Bordes 2014 Memory Networks arXiv preprint arXiv14103916v11 Accessed March 16 2018 httpsarxivorgabs14103916

[20] Wang Shuohang and Jing Jiang Machine comprehension using match-lstm and answer pointer arXiv preprint arXiv160807905 (2016)

[21] Wang Shuohang and Jing Jiang Learning natural language inference with LSTM arXiv preprint arXiv151208849 (2015)

[22] Vinyals Oriol Meire Fortunato and Navdeep Jaitly Pointer networks In Advances in Neural Information Processing Systems pp 2692-2700 2015

[23] Wang Wenhui Nan Yang Furu Wei Baobao Chang and Ming Zhou Gated self-matching networks for reading comprehension and question answering In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1 Long Papers) vol 1 pp 189-198 2017

[24] Seo Minjoon Aniruddha Kembhavi Ali Farhadi and Hannaneh Hajishirzi Bidirectional attention flow for machine comprehension arXiv preprint arXiv161101603 (2016)

[25] Clark Christopher and Matt Gardner Simple and effective multi-paragraph reading comprehension arXiv preprint arXiv171010723 (2017)


[26] 8 Analyzing Sentence Structure 2018 NltkOrg Accessed March 16 2018 httpwwwnltkorgbookch08html


BIOGRAPHICAL SKETCH

Purnendu Mukherjee grew up in Kolkata, India, and had a strong interest in computers from an early age. After high school, he did his BSc in computer science at Ramakrishna Mission Residential College, Narendrapur, followed by an MSc in computer science at St. Xavier's College, Kolkata. He had a strong intuition and interest for human-like learning systems and wanted to work in this area. He started working at TCS Innovation Labs, Pune, on the application of Natural Language Processing in educational applications. As he was deeply passionate about learning systems that mimic the human brain and learn like a human child does, he was increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science from the University of Florida, Gainesville. His academic interests have been focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.

Page 12: QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISMufdcimages.uflib.ufl.edu/UF/E0/05/23/73/00001/MUKHERJEE_P.pdf · QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM By PURNENDU

12

have shaped the current State of the Art models then we will discuss our contribution in

terms of architecture and applications and finally conclude with future directions

13

CHAPTER 2 THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM

To build a question answering system one needs to be familiar with the

fundamental deep learning models such as Recurrent Neural Networks (RNN) Long

Short Term Memory (LSTM) etc In this chapter we will have an overview on these

techniques and see how they all connect to building a question answering system

Neural Networks

What makes Deep Learning so intriguing is that it has close resemblance with

the working of the mammalian brain or at least draws inspiration from it The same can

be said for Artificial Neural networks [7] which consists of a system of interconnected

units called lsquoneuronsrsquo that take input from similar units and produces a single output

Figure 2-1 Simple and Deep Learning Neural Networks [8]

The connection from one neuron to another can be weighted based on the input data

which enables the network to tune itself to produce a certain output based on the input

This is the learning process which is achieved through backpropagation which is a

system of propagating the error from the output layer to the previous layers

14

Convolutional Neural Network

The first wave of deep learningrsquos success was brought by Convolutional Neural

Networks (CNN) [9] when this was the technique used by the winning team of ImageNet

competition in 2012 CNNs are deep artificial neural networks (ANN) that can be used to

classify images cluster them by similarity and perform object recognition within scenes

It can be used to detect and identify faces people signs or any other visual data

Figure 2-2 Convolutional Neural Network Architecture [10]

There are primarily four operations in a standard CNN model (as shown in Fig above)

1 Convolution - The primary purpose of Convolution in the ConvNet (above) is to extract features from the input image The spatial relationship between pixels ie the image features are preserved and learned by the convolution using small squares of input data

2 Non-Linearity (ReLU) ndash Rectified Linear Unit (ReLU) is a non-linear operation that carries out an element wise operation on each pixel This operation replaces the negative pixel values in the feature map by zero

3 Pooling or Sub Sampling - Spatial Pooling reduces the dimensionality of each feature map but retains the most important information For max pooling the largest value in the square window is taken and rest are dropped Other types of pooling are Average Sum etc

4 Classification (Fully Connected Layer) - The Fully Connected layer is a traditional Multi-Layer Perceptron as described before that uses a softmax activation function in the output layer The high-level features of the image are encoded by the convolutional and pooling layers which is then fed to the fully connected layer which then uses these features for classifying the input image into various classes based on the training dataset [10]

15

When a new image is fed into the CNN model all the above-mentioned steps are

carried out (forward propagation) and a probability distribution is achieved on the set of

output classes With a large enough training dataset the network will learn and

generalize well enough to classify new images into their correct classes

Recurrent Neural Networks (RNN)

Whenever we want to predict or encode sequential data RNN [11] is our go to

method RNNs perform the same task for every element of a sequence where the

output of each element depends on previous computations thus the recurrence In

practice RNNs are unable to retain long-term dependencies and can look back only a

few steps because of the vanishing gradient problem

(2-1)

A solution to the dependency problem is to use gated cells such as LSTM [11] or

GRU [13] These cells pass on important information to the next cells while ignoring

non-important ones The gated units in a GRU block are

bull Update Gate ndash Computed based on current input and hidden state

(2-2)

bull Reset Gate ndash Calculated similarly but with different weights

(2-3)

bull New memory content - (2-4)

If reset gate unit is ~0 then previous memory is ignored and only new

information is kept

16

Final memory at current time step combines previous and current time steps

(2-5)

While the GRU is computationally efficient the LSTM on the other hand is a

general case where there are three gates as follows

bull Input Gate ndash What new information to add to the current cell state

bull Forget Gate ndash How much information from previous states to be kept

bull Output gate ndash How much info should be sent to the next states

Just like GRU the current cell state is a sum of the previous cell state but

weighted by the forget gate and the new value is added which is weighted by the input

gate Based on the cell state the output gate regulates the final output

Word Embedding

Computation or gradients can be applied on numbers and not on words or letters

So first we need to convert words into their corresponding numerical formation before

feeding into a deep learning model In general there are two types of word embedding

Frequency based (which constitutes count vectors tf-idf and co-occurrence vectors)

and Prediction based With frequency based embedding the order of the words are not

preserved and works as a bag of words model Whereas with prediction based model

the order of words or locality of words are taken into consideration to generate the

numerical representation of the word Within this prediction based category there are

two fundamental techniques called Continuous Bag of Words (CBOW) and Skip Gram

Model which forms the basis for word2vec [14] and GloVe [15]

The basic intuition behind word2vec is that if two different words have very

similar ldquocontextsrdquo (that is what words are likely to appear around them) then the model

17

will produce similar vector for those words Conversely if the two word vectors are

similar then the network will produce similar context predictions for the same two words

For examples synonyms like ldquointelligentrdquo and ldquosmartrdquo would have very similar contexts

Or that words that are related like ldquoenginerdquo and ldquotransmissionrdquo would probably have

similar contexts as well [16] Plotting the word vectors learned by a word2vec over a

large corpus we could find some very interesting relationships between words

Figure 2-3 Semantic relation between words in vector space [17]

Attention Mechanism

We as humans put our attention to things are important or are relevant in a

context For example when asked a question from a passage we try to find the most

relevant part of the passage the question is relevant with and then reason from our

understanding of that part of the passage The same idea applies for attention

mechanism in Deep Learning It is used to identify the specific parts of a given context

to which the current question is relevant to

Formally put the techniques take n arguments y_1 y_n (in our case the

passage having words say y_i through h_i) and a question word say q It returns a

vector z which is supposed to be the laquo summary raquo of the y_i focusing on information

linked to the question q More formally it returns a weighted arithmetic mean of the y_i

18

and the weights are chosen according the relevance of each y_i given the context c

[18]

Figure 2-4 Attention Mechanism flow [18]

Memory Networks

Convolutional Neural Networks and Recurrent Neural Networks which does

capture how we form our visual and sequential memories their memory (encoded by

hidden states and weights) were typically too small and was not compartmentalized

enough to accurately remember facts from the past (knowledge is compressed into

dense vectors) [19]

Deep Learning needed to cultivate a methodology that preserved memories as

they are such that it wonrsquot be lost in generalization and recalling exact words or

sequence of events would be possible mdash something computers are already good at This

effort led us to Memory Networks [19] published at ICLR 2015 by Facebook AI

Research

This paper provides a basic framework to store augment and retrieve memories

while seamlessly working with a Recurrent Neural Network architecture The memory

19

network consists of a memory m (an array of objects 1 indexed by m i) and four

(potentially learned) components I G O and R as follows

I (input feature map) mdash converts the incoming input to the internal feature

representation either a sparse or dense feature vector like that from word2vec or

GloVe

G (generalization) mdash updates old memories given the new input They call this

generalization as there is an opportunity for the network to compress and generalize its

memories at this stage for some intended future use The analogy Irsquove been talking

before

O (output feature map) mdash produces a new output (in the feature representation

space) given the new input and the current memory state This component is

responsible for performing inference In a question answering system this part will

select the candidate sentences (which might contain the answer) from the story

(conversation) so far

R (response) mdash converts the output into the response format desired For

example a textual response or an action In the QA system described this component

finds the desired answer and then converts it from feature representation to the actual

word

This model is a fully supervised model meaning all the candidate sentences from

which the answer could be found are marked during training phase and can also be

termed as lsquohard attentionrsquo

The authors tested out the QA system on various literature including Lord of the

Rings

20

Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]

21

CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART

There has been rapid progress since the release of the SQuAD dataset and

currently the best ensemble models are close to human level accuracy in machine

comprehension This is due to the various ingenious methods which solves some of the

problems with the previous methods Out of Vocabulary tokens were handled by using

Character embedding Long term dependency within context passage were solved

using self-attention And many other techniques such as Contextualized vectors History

of Words Attention Flow etc In this section we will have a look at the some of the most

important models that were fundamental to the progress of Questions Answering

Machine Comprehension Using Match-LSTM and Answer Pointer

In this paper [20] the authors propose an end-to-end neural architecture for the

QA task The architecture is based on match-LSTM [21] a model they proposed for

textual entailment and Pointer Net [22] a sequence-to-sequence model proposed by

Vinyals et al (2015) to constrain the output tokens to be from the input sequences

The model consists of an LSTM preprocessing layer a match-LSTM layer and an

Answer Pointer layer

We are given a piece of text which we refer to as a passage and a question

related to the passage The passage is represented by matrix P where P is the length

(number of tokens) of the passage and d is the dimensionality of word embeddings

Similarly the question is represented by matrix Q where Q is the length of the question

Our goal is to identify a subsequence from the passage as the answer to the question

22

Figure 3-1 Match-LSTM Model Architecture [20]

LSTM Preprocessing layer They use a standard one-directional LSTM

(Hochreiter amp Schmidhuber 1997) to process the passage and the question separately

as shown below

Match LSTM Layer They applied the match-LSTM model proposed for textual

entailment to their machine comprehension problem by treating the question as a

premise and the passage as a hypothesis The match-LSTM sequentially goes through

the passage At position i of the passage it first uses the standard word-by-word

attention mechanism to obtain attention weight vector as follows

(3-1)

23

where and are parameters to be learned

Answer Pointer Layer The Answer Pointer (Ans-Ptr) layer is motivated by the

Pointer Net introduced by Vinyals et al (2015) [22] The boundary model produces only

the start token and the end token of the answer and then all the tokens between these

two in the original passage are considered to be the answer

When this paper was released back in November 2016 Match-LSTM method

was the state of the art in Question Answering systems and was at the top of the

leaderboard for the SQuAD dataset

R-NET Matching Reading Comprehension with Self-Matching Networks

In this model [23] first the question and passage are processed by a

bidirectional recurrent network (Mikolov et al 2010) separately They then match the

question and passage with gated attention-based recurrent networks obtaining

question-aware representation for the passage On top of that they apply self-matching

attention to aggregate evidence from the whole passage and refine the passage

representation which is then fed into the output layer to predict the boundary of the

answer span

Question and passage encoding First the words are converted to their

respective word-level embeddings and character level embeddings The character-level

embeddings are generated by taking the final hidden states of a bi-directional recurrent

neural network (RNN) applied to embeddings of characters in the token Such

character-level embeddings have been shown to be helpful to deal with out-of-vocab

(OOV) tokens

24

They then use a bi-directional RNN to produce new representation and

of all words in the question and passage respectively

Figure 3-2 The task of Question Answering [23]

Gated Attention-based Recurrent Networks They use a variant of attention-

based recurrent networks with an additional gate to determine the importance of

information in the passage regarding a question Different from the gates in LSTM or

GRU the additional gate is based on the current passage word and its attention-pooling

vector of the question which focuses on the relation between the question and current

passage word The gate effectively model the phenomenon that only parts of the

passage are relevant to the question in reading comprehension and question answering

is utilized in subsequent calculations

25

Self-Matching Attention From the previous step the question aware passage

representation is generated to highlight the important parts of the passage One

problem with such representation is that it has very limited knowledge of context One

answer candidate is often oblivious to important cues in the passage outside its

surrounding window To address this problem the authors propose directly matching

the question-aware passage representation against itself It dynamically collects

evidence from the whole passage for words in passage and encodes the evidence

relevant to the current passage word and its matching question information into the

passage representation

Output Layer They use the same method as Wang amp Jiang (2016b) and use

pointer networks (Vinyals et al 2015) to predict the start and end position of the

answer In addition they use an attention-pooling over the question representation to

generate the initial hidden vector for the pointer network [23]

When the R-Net Model first appeared in the leaderboard in March 2017 it was at

the top with 723 Exact Match and 807 F1 score

Bi-Directional Attention Flow (BiDAF) for Machine Comprehension

BiDAF [24] is a hierarchical multi-stage architecture for modeling the

representations of the context paragraph at different levels of granularity BIDAF

includes character-level word-level and contextual embeddings and uses bi-directional

attention flow to obtain a query-aware context representation Their attention layer is not

used to summarize the context paragraph into a fixed-size vector Instead the attention

is computed for every time step and the attended vector at each time step along with

the representations from previous layers can flow through to the subsequent modeling

layer This reduces the information loss caused by early summarization [24]

26

Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]

Their machine comprehension model is a hierarchical multi-stage process and

consists of six layers

1 Character Embedding Layer maps each word to a vector space using character-level CNNs

2 Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model

3 Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embedding of the words These first three layers are applied to both the query and context They use a LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words We place an LSTM in both directions and concatenate the outputs of the two LSTMs

4 Attention Flow Layer couples the query and context vectors and produces a set of query aware feature vectors for each word in the context

5 Modeling Layer employs a Recurrent Neural Network to scan the context The input to the modeling layer is G which encodes the query-aware representations of context words The output of the modeling layer captures the interaction among the context words conditioned on the query They use two layers of bi-directional LSTM

27

with the output size of d for each direction Hence a matrix is obtained which is passed onto the output layer to predict the answer

6 Output Layer provides an answer to the query They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices by the predicted distributions averaged over all examples [25]

In a further variation of their above work they add a self-attention layer after the

Bi-attention layer to further improve the results The architecture of the model is as

Figure 3-3 The task of Question Answering [25]

28

Summary

In this chapter we reviewed the methods that are fundamental to the state of the

art in Machine Comprehension and for the task of Question Answering We have

reached closed to human level accuracy and this is due to incremental developments

over previous models As we saw Out of Vocabulary (OOV) tokens were handled by

using Character embedding Long term dependency within context passage were

solved using self-attention and many other techniques such as Contextualized vectors

Attention Flow etc were employed to get better results In the next chapter we will see

how we can build on these models and develop further

29

CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING

Although the advancements in Question Answering systems since the release of the SQuAD dataset have been impressive and the results are getting close to human-level accuracy, they are far from being fool-proof. The models still make mistakes which would be obvious to a human. For example:

Passage: The Panthers used the San Jose State practice facility and stayed at the San Jose Marriott. The Broncos practiced at Florida State Facility and stayed at the Santa Clara Marriott.

Question: At what university's facility did the Panthers practice?

Actual Answer: San Jose State

Predicted Answer: Florida State Facility

To find out what is leading to the wrong predictions, we wanted to see the attention weights associated with such an example. We plotted the passage and question heat map, which is a 2D matrix where the intensity of each cell signifies the similarity between a passage word and a question word (a plotting sketch is given after the list below). For the above example we found out that while certain words of the question are given high weight, other parts are not. The words 'At', 'facility', and 'practice' are given high attention, but 'Panthers' does not receive high attention. If it had received high attention, the system would have predicted 'San Jose State' as the right answer. To solve this issue, we analyzed the base BiDAF model and proposed adding two things:

1 Bi-Attention and Self-Attention over Query

2 Second level attention over the output of (Bi-Attention + Self-Attention) from both Context and Query
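For reference, a passage-question heat map of the kind described above can be drawn with a few lines of matplotlib. The attention matrix here is random stand-in data and the passage is shortened; in practice the values would come from the trained model's similarity scores.

import numpy as np
import matplotlib.pyplot as plt

passage = "The Panthers used the San Jose State practice facility".split()
question = ["At", "what", "university's", "facility", "did", "the", "Panthers", "practice"]

attn = np.random.default_rng(1).random((len(question), len(passage)))  # stand-in weights

fig, ax = plt.subplots(figsize=(8, 3))
im = ax.imshow(attn, cmap="Blues", aspect="auto")
ax.set_xticks(range(len(passage)))
ax.set_xticklabels(passage, rotation=45, ha="right")
ax.set_yticks(range(len(question)))
ax.set_yticklabels(question)
fig.colorbar(im, ax=ax, label="attention weight")
fig.tight_layout()
plt.show()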


Multi-Attention BiDAF Model

Our comprehension model is a hierarchical multi-stage process and consists of

the following layers

1 Embedding: Just as in all other models, we embed words using pretrained word vectors. We also embed the characters in each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to get character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training.

2 Pre-Process: A shared bi-directional GRU (Cho et al., 2014) is used to map the question and passage embeddings to context-aware embeddings.

3 Context Attention: The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al., 2016) is used to build a query-aware context representation. Let h_i be the vector for context word i, q_j be the vector for question word j, and n_q and n_c be the lengths of the question and context respectively. We compute attention between context word i and question word j as

(4-1)

where w1, w2, and w3 are learned vectors and ⊙ denotes element-wise multiplication. We then compute an attended vector c_i for each context token as

(4-2)

We also compute a query-to-context vector q_c

(4-3)


The final vector computed for each token is built by concatenating the vectors obtained above. In our model we subsequently pass the result through a linear layer with ReLU activations (a code sketch of these attention computations is given after this list).

4 Context Self-Attention: Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU. Then we apply the same attention mechanism, only now between the passage and itself. In this case we do not use query-to-context attention, and we set a_ij = -inf if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so this output is additionally summed with the input.

5 Query Attention: This part proceeds in the same way as context attention, but we calculate the weighted sum of the context words for each query word, so the output length equals the number of query words. Then we calculate a context-to-query vector similar to the query-to-context vector in the context attention layer.

Figure 4-1 The modified BiDAF model with multilevel attention

6 Query Self-Attention: This part is done in the same way as the context self-attention layer, but on the output of the Query Attention layer.


7 Context-Query Bi-Attention + Self-Attention: The outputs of the Context Self-Attention and Query Self-Attention layers are taken as input, and the same process of bi-attention and self-attention is applied to these inputs.

8 Prediction: In the last layer of our model, a bidirectional GRU is applied, followed by a linear layer that computes answer start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log likelihood of selecting the correct start and end tokens.
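To make the attention computations of layers 3 and 4 concrete, the following NumPy sketch implements them under the assumption that the scoring function takes the usual trilinear form implied by the description above (w1 applied to the context vector, w2 to the question vector, and w3 to their element-wise product). The GRUs, linear layers, and the exact concatenation of outputs are omitted, so this is an illustration of the attention arithmetic rather than the trained implementation.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def trilinear_scores(h, q, w1, w2, w3):
    # assumed form: a_ij = w1.h_i + w2.q_j + w3.(h_i * q_j)
    # h: (n_c, d) context states, q: (n_q, d) question states
    return (h @ w1)[:, None] + (q @ w2)[None, :] + (h * w3) @ q.T

def bi_attention(h, q, w1, w2, w3):
    a = trilinear_scores(h, q, w1, w2, w3)        # (n_c, n_q)
    c = softmax(a, axis=1) @ q                     # attended question vector for each context word
    m = softmax(a.max(axis=1))                     # weights over context words
    q_c = m @ h                                    # single query-to-context vector
    return c, q_c

def self_attention(h, w1, w2, w3):
    a = trilinear_scores(h, h, w1, w2, w3)
    np.fill_diagonal(a, -np.inf)                   # a_ij = -inf when i == j, as in layer 4
    return softmax(a, axis=1) @ h

rng = np.random.default_rng(0)
d, n_c, n_q = 8, 12, 5
h, q = rng.normal(size=(n_c, d)), rng.normal(size=(n_q, d))
w1, w2, w3 = rng.normal(size=(3, d))
c, q_c = bi_attention(h, q, w1, w2, w3)
s = self_attention(h, w1, w2, w3)
print(c.shape, q_c.shape, s.shape)                 # (12, 8) (8,) (12, 8)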

Having carried out this modification, we were able to solve the wrong example we started with: the multilevel attention model gives the correct output of "San Jose State". We also achieved slightly better scores than the original model, with an F1 score of 85.44 on the SQuAD dev set.

Chatbot Design Using a QA System

Designing a perfect chatbot that passes the Turing test is a fundamental goal for Artificial Intelligence. Although we are many orders of magnitude away from achieving such a goal, domain-specific tasks can be solved with chatbots made from current technology. We started with a similar objective in mind, i.e., to design a domain-specific chatbot and then generalize to other areas once it achieves the first domain-specific objective robustly. This led us to the fundamental problem of Machine Comprehension and subsequently to the task of Question Answering. Having achieved some degree of success with QA systems, we looked back to see whether we could apply our newly acquired knowledge to the task of designing chatbots.

The chatbots made with today's technologies mostly rely on handcrafted techniques such as template matching, which requires anticipating all possible ways a user may articulate his requirements and a conversation may unfold. This takes a lot of man-hours for designing a domain-specific system and is still very error prone. In this section we propose a general chatbot design that would make the design of a domain-specific chatbot easy and robust at the same time.

Every domain-specific chatbot needs to obtain a set of information from the user and show some results based on the user-specific information obtained. Traditional chatbots use template matching and keyword lookup to determine whether the user has provided the required information. Our idea is to use the Question Answering system in the backend to extract the required information from whatever the user has typed up to this point of the conversation. The information to be extracted can be posed in the form of a set of questions, and the answers obtained from those questions can be used as the parameters to supply the relevant information to the user.

We chose flight reservation as our chatbot domain. Our goal was to extract the required information from the user to be able to show the available flights as per the user's requirements.

Figure 4-2 Flight reservation chatbot's chat window


For a flight reservation task, the booking agent needs to know at minimum the origin city, destination city, and date of travel to be able to show the available flights. Optional information includes the number of tickets, the passenger's name, one way or round trip, etc.

The minimalistic conversation with the user through the chat window would be as shown above. We had a platform called OneTask on which we wanted to implement our chatbot. The chat interface within the OneTask system looks as follows.

Figure 4-3 Chatbot within OneTask system

The working of the chat system is as follows:

1 Initiation: The user opens the chat window, which starts a session with the chatbot. The chatbot asks 'How may I help you?'

2 User Reply: The user may reply with none of the required information for flight booking or may provide multiple pieces of information in the same message.

3 User Reply Parsing: The conversation up to this point is treated as a passage, and the internal questions are run on this passage. The questions that are run include:

Where do you want to go?

From where do you want to leave?

When do you want to depart?

4 Parsed Responses from the QA Model: After running the questions on the QA system with the given conversation, answers are obtained for all of them along with their corresponding confidence values. Since answers will be obtained even if the required question has not been answered up to this point in the conversation, the confidence values play a crucial role in determining the validity of an answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly; a confidence between 2 and 10 signifies that it may have been answered, but the chatbot should verify with the user for correctness; and any answer with confidence below 2 is discarded.

5 Asking Remaining Questions Iteratively: After the parsing, it is checked whether any of the required questions are still unanswered. If so, the chatbot asks the remaining question, and the process from steps 3-4 is carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request (a code sketch of this loop follows Figure 4-4).

Figure 4-4 The Flow diagram of the Flight booking Chatbot system
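The loop in Figure 4-4 can be summarized by the sketch below. The qa_model and ask_user callables are hypothetical placeholders (any extractive QA model that returns an answer string and a confidence value would fit), and the thresholds follow the ranges given in step 4.

# Sketch of the QA-backed slot-filling loop; qa_model(passage, question) -> (answer, confidence)
# and ask_user(prompt) -> str are placeholders for the real model and chat interface.
SLOT_QUESTIONS = {
    "destination": "Where do you want to go?",
    "origin": "From where do you want to leave?",
    "date": "When do you want to depart?",
}
ACCEPT, VERIFY = 10.0, 2.0        # confidence ranges from step 4

def fill_slots(conversation, qa_model, ask_user):
    slots = {}
    while len(slots) < len(SLOT_QUESTIONS):
        # steps 3-4: run every unanswered internal question over the conversation so far
        for slot, question in SLOT_QUESTIONS.items():
            if slot in slots:
                continue
            answer, conf = qa_model(passage=conversation, question=question)
            if conf > ACCEPT:
                slots[slot] = answer                      # accept silently
            elif conf > VERIFY:
                if ask_user("Did you mean '" + answer + "'? (y/n) ") == "y":
                    slots[slot] = answer                  # accept after user confirmation
            # anything below VERIFY is discarded
        # step 5: ask one remaining question and re-run the parse on the longer passage
        for slot, question in SLOT_QUESTIONS.items():
            if slot not in slots:
                conversation += "\nBot: " + question + "\nUser: " + ask_user(question + " ")
                break
    return slots    # origin, destination, and date; used to look up available flights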

Online QA System and Attention Visualization

To be able to test out various examples, we used the BiDAF [24] model for an online demo. One can either choose from the available examples in the drop-down menu or paste in their own passage and questions. While this is a useful and interesting system for testing the model in a user-friendly way, we created it primarily to be able to focus on the wrong samples.


Figure 4-5 QA system interface with attention highlight over candidate answers

The candidate answers are shown with blue highlights as per their confidence values; the higher the confidence, the darker the highlight. The answer with the highest confidence value is chosen as the predicted answer. We developed the system to show the attention spread of the candidate answers in order to understand what needs to be done to improve the system. This led us to realize the importance of including the query attention part as well as multilevel attention in the BiDAF model, as described in the first section of this chapter.


CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION

Future Directions

Having done a thorough analysis of the current methods and state of the art in Machine Comprehension and the QA task, and having developed systems on top of them, we have a strong sense of what needs to be done to further improve QA models. After observing the wrong samples, we can see that the system is still unable to encode meaning and is picking answers based on statistical patterns of answer occurrence in the training examples.

Looking at the state-of-the-art models and the ongoing research literature, it is easy to conclude that more features need to be embedded to encode the meaning of words, phrases, and sentences. The paper Reinforced Mnemonic Reader for Machine Comprehension [6] encoded POS and NER tags of words along with their word and character embeddings, which gave them better results.

Figure 5-1 An English language semantic parse tree [26]


We have developed a method to encode the syntax parse tree of a sentence that captures not only the POS tags but also the relation of each word within its phrase and the relation of the phrase within the whole sentence, in a hierarchical manner.
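Our encoding scheme itself is not reproduced here, but the following sketch illustrates, using the NLTK parser from [26] with a toy grammar, the kind of hierarchical information involved: each word is tagged with the full chain of phrase labels above it, from its POS tag up to the sentence node.

import nltk

# Toy grammar; a real system would use a trained constituency parser.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'dog' | 'cat'
V -> 'saw'
""")
tree = next(nltk.ChartParser(grammar).parse("the dog saw the cat".split()))

# For every word, collect the labels of all constituents above it (POS tag up to S).
for i, word in enumerate(tree.leaves()):
    pos = tree.leaf_treeposition(i)
    labels = [tree[pos[:d]].label() for d in range(len(pos))]
    print(word, "<-", " <- ".join(reversed(labels)))
# e.g. "dog <- N <- NP <- S": the POS tag, the phrase it sits in, and that phrase's place in the sentence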

Finally, data augmentation is another way to get better results. One definite way to reduce errors would be to include, in the training data, samples similar to those on which the system falters in the dev set. One could generate examples similar to the failure cases and include them in the training set for better prediction. Another approach would be to train on similar and bigger datasets. Our models were trained on the SQuAD dataset; there are other datasets that pose a similar question answering task, such as TriviaQA. We could augment the SQuAD training set with that of TriviaQA to have a more robust system that generalizes better and thus achieves higher accuracy in predicting answer spans.

Conclusion

In this work we have explored the most fundamental techniques that have shaped the current state of the art. We then proposed a minor architectural improvement over an existing model. Furthermore, we developed two applications that use the base model: first, we discussed how a chatbot application can be built using the QA system, and second, we created a web interface where the model can be used with any passage and question. This interface also shows the attention spread over the candidate answers. While our effort to push the state of the art forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will pay high dividends for society at large.


LIST OF REFERENCES

[1] Danqi Chen. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U

[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.

[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension." arXiv preprint arXiv:1705.03551 (2017).

[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ Questions for Machine Comprehension of Text." arXiv preprint arXiv:1606.05250 (2016).

[5] "The Stanford Question Answering Dataset." 2018. rajpurkar.github.io. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/

[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced Mnemonic Reader for Machine Comprehension." CoRR abs/1705.02798 (2017).

[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.

[8] Me For The AI, and Neetesh Mehrotra. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai/

[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.

[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/

[11] Williams, Ronald J., and David Zipser. "A Learning Algorithm for Continually Running Fully Recurrent Neural Networks." Neural Computation 1, no. 2 (1989): 270-280.

[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long Short-Term Memory." Neural Computation 9, no. 8 (1997): 1735-1780.

[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling." arXiv preprint arXiv:1412.3555 (2014).

[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic Regularities in Continuous Space Word Representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.

[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global Vectors for Word Representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.

[16] "Word2Vec Tutorial - The Skip-Gram Model." Chris McCormick. 2016. mccormickml.com. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

[17] Group. 2015. "[Paper Introduction] Bilingual Word Representations with Monolingual Quality in Mind." SlideShare. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind

[18] "Attention Mechanism." 2016. blog.heuritech.com. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism/

[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. 2014. "Memory Networks." arXiv preprint arXiv:1410.3916. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916

[20] Wang, Shuohang, and Jing Jiang. "Machine Comprehension Using Match-LSTM and Answer Pointer." arXiv preprint arXiv:1608.07905 (2016).

[21] Wang, Shuohang, and Jing Jiang. "Learning Natural Language Inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).

[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer Networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.

[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated Self-Matching Networks for Reading Comprehension and Question Answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.

[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional Attention Flow for Machine Comprehension." arXiv preprint arXiv:1611.01603 (2016).

[25] Clark, Christopher, and Matt Gardner. "Simple and Effective Multi-Paragraph Reading Comprehension." arXiv preprint arXiv:1710.10723 (2017).

[26] "8. Analyzing Sentence Structure." 2018. nltk.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html


BIOGRAPHICAL SKETCH

Purnendu Mukherjee grew up in Kolkata, India, and had a strong interest in computers from an early age. After high school, he did his B.Sc. in computer science at Ramakrishna Mission Residential College, Narendrapur, followed by an M.Sc. in computer science at St. Xavier's College, Kolkata. He had a strong intuition and interest for human-like learning systems and wanted to work in this area. He started working at TCS Innovation Labs, Pune, on the application of Natural Language Processing in educational applications. As he was deeply passionate about learning systems that mimic the human brain and learn like a human child does, he became increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science at the University of Florida, Gainesville. His academic interests have been focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.

answer In addition they use an attention-pooling over the question representation to

generate the initial hidden vector for the pointer network [23]

When the R-Net Model first appeared in the leaderboard in March 2017 it was at

the top with 723 Exact Match and 807 F1 score

Bi-Directional Attention Flow (BiDAF) for Machine Comprehension

BiDAF [24] is a hierarchical multi-stage architecture for modeling the

representations of the context paragraph at different levels of granularity BIDAF

includes character-level word-level and contextual embeddings and uses bi-directional

attention flow to obtain a query-aware context representation Their attention layer is not

used to summarize the context paragraph into a fixed-size vector Instead the attention

is computed for every time step and the attended vector at each time step along with

the representations from previous layers can flow through to the subsequent modeling

layer This reduces the information loss caused by early summarization [24]

26

Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]

Their machine comprehension model is a hierarchical multi-stage process and

consists of six layers

1 Character Embedding Layer maps each word to a vector space using character-level CNNs

2 Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model

3 Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embedding of the words These first three layers are applied to both the query and context They use a LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words We place an LSTM in both directions and concatenate the outputs of the two LSTMs

4 Attention Flow Layer couples the query and context vectors and produces a set of query aware feature vectors for each word in the context

5 Modeling Layer employs a Recurrent Neural Network to scan the context The input to the modeling layer is G which encodes the query-aware representations of context words The output of the modeling layer captures the interaction among the context words conditioned on the query They use two layers of bi-directional LSTM

27

with the output size of d for each direction Hence a matrix is obtained which is passed onto the output layer to predict the answer

6 Output Layer provides an answer to the query They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices by the predicted distributions averaged over all examples [25]

In a further variation of their above work they add a self-attention layer after the

Bi-attention layer to further improve the results The architecture of the model is as

Figure 3-3 The task of Question Answering [25]

28

Summary

In this chapter we reviewed the methods that are fundamental to the state of the

art in Machine Comprehension and for the task of Question Answering We have

reached closed to human level accuracy and this is due to incremental developments

over previous models As we saw Out of Vocabulary (OOV) tokens were handled by

using Character embedding Long term dependency within context passage were

solved using self-attention and many other techniques such as Contextualized vectors

Attention Flow etc were employed to get better results In the next chapter we will see

how we can build on these models and develop further

29

CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING

Although the advancements in Question Answering systems since the release of

the SQuAD dataset has been impressive and the results are getting close to human

level accuracy it is far from being a fool-proof system The models still make mistakes

which would be obvious to a human For example

Passage The Panthers used the San Jose State practice facility and stayed at

the San Jose Marriott The Broncos practiced at Florida State Facility and stayed at the

Santa Clara Marriott

Question At what universitys facility did the Panthers practice

Actual Answer San Jose State

Predicted Answer Florida State Facility

To find out what is leading to the wrong predictions we wanted to see the

attention weights associated with such an example We plotted the passage and

question heat map which is a 2D matrix where the intensity of each cell signifies the

similarity between a passage word and a question word For the above example we

found out that while certain words of the question are given high weightage other parts

are not The words lsquoAtrsquo lsquofacilityrsquo lsquopracticersquo are given high attention but lsquoPanthersrsquo does

not receive high attention If it had received high attention then the system would have

predicted lsquoSan Jose Statersquo as the right answer To solve this issue we analyzed the

base BiDAF model and proposed adding two things

1 Bi-Attention and Self-Attention over Query

2 Second level attention over the output of (Bi-Attention + Self-Attention) from both Context and Query

30

Multi-Attention BiDAF Model

Our comprehension model is a hierarchical multi-stage process and consists of

the following layers

1 Embedding Just as all other models we embed words using pretrained word vectors We also embed the characters in each word into size 20 vectors which are learned and run a convolution neural network followed by max-pooling to get character-derived embeddings for each word The character-level and word-level embeddings are then concatenated and passed to the next layer We do not update the word embeddings during training

2 Pre-Process A shared bi-directional GRU (Cho et al 2014) is used to map the question and passage embeddings to context aware embeddings

3 Context Attention The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al 2016) is used to build a query-aware context representation Let h_i be the vector for context word i q_j be the vector for question word j and n_q and n_c be the lengths of the question and context respectively We compute attention between context word i and question word j as

(4-1)

where w1 w2 and w3 are learned vectors and is element-wise multiplication We then compute an attended vector c_i for each context token as

(4-2)

We also compute a query-to-context vector q_c

(4-3)

31

The final vector computed for each token is built by concatenating

and In our model we subsequently pass the result through a linear layer with ReLU activations

4 Context Self-Attention Next we use a layer of residual self-attention The input is passed through another bi-directional GRU Then we apply the same attention mechanism only now between the passage and itself In this case we do not use query-to context attention and we set aij = 1048576inf if i = j As before we pass the concatenated output through a linear layer with ReLU activations This layer is applied residually so this output is additionally summed with the input

5 Query Attention For this part we do the same way as context attention but calculate the weighted sum of the context words for each query word Thus we get the length as number of query words Then we calculate context to query similar to query to context in context attention layer

Figure 4-1 The modified BiDAF model with multilevel attention

6 Query Self-Attention This part is done the same way as the context self-attention layer but from the output of the Query Attention layer

32

7 Context Query Bi-Attention + Self-Attention The output of the Context self-attention and the Query Self-Attention layers are taken as input and the same process for Bi-Attention and self-attention is applied on these inputs

8 Prediction In the last layer of our model a bidirectional GRU is applied followed by a linear layer that computes answer start scores for each token The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores The softmax operation is applied to the start and end scores to produce start and end probabilities and we optimize the negative log likelihood of selecting correct start and end tokens

Having carried out this modification we were able to solve the wrong example we

started with The multilevel attention model gives the correct output as ldquoSan Jose

Staterdquo Also we achieved slightly better scores than the original model with a F1 score

of 8544 on the SQuAD dev dataset

Chatbot Design Using a QA System

Designing a perfect chatbot that passes the Turing test is a fundamental goal for

Artificial Intelligence Although we are many order to magnitudes away from achieving

such a goal domain specific tasks can be solved with chatbots made from current

technology We had started our goal with a similar objective in mind ie to design a

domain specific chatbot and then generalize to other areas as it is able to achieve the

first domain specific objective robustly This led us to the fundamental problem of

Machine Comprehension and subsequently to the task of Question Answering Having

achieved some degree of success with QA systems we looked back if we could apply

our newly acquired knowledge in the task of designing Chatbots

The chatbots made with todayrsquos technologies are mostly handcrafted techniques

such as template matching that requires anticipating all possible ways a user may

articulate his requirements and a conversation may occur This requires a lot of man

hours for designing a domain specific system and is still very error prone In this section

33

we propose a general Chatbot design that would make the designing of a domain

specific chatbot very easy and robust at the same time

Every domain specific chatbot needs to obtain a set of information from the user

and show some results based on the user specific information obtained The traditional

chatbots use template matching and keywords lookup to determine if the user has

provided the required information Our idea is to use the Question Answering system in

the backend to extract out the required information from whatever the user has typed

until this point of the conversation The information to be extracted can be posed in the

form of a set of questions and the answers obtained from those questions can be used

as the parameters to supply the relevant information to the user

We had chosen our chatbot domain as the flight reservation system Our goal

was to extract the required information from the user to be able to show him the

available flights as per the userrsquos requirements

Figure 4-2 Flight reservation chatbotrsquos chat window

34

For a flight reservation task the booking agent needs to know the origin city

destination city and date of travel at minimum to be able to show the available flights

Optional information includes the number of tickets passengerrsquos name one way or

round trip etc

The minimalistic conversation with the user through the chat window would be as

shown above We had a platform called OneTask on which we wanted to implement our

chat bot The chat interface within the OneTask system looks as follows

Figure 4-3 Chatbot within OneTask system

The working of the chat system is as follows ndash

1 Initiation The user opens the chat window which starts a session with the chat bot The chat bot asks lsquoHow may I help yoursquo

2 User Reply The user may reply with none of the required information for flight booking or may reply with multiple information in the same message

3 User Reply parsing The conversation up to this point is treated as a passage and the internal questions are run on this passage So the four questions that are run are

Where do you want to go

35

From where do you want to leave

When do you want to depart

4 Parsed responses from QA model After running the questions on the QA system with the given conversation answers are obtained for all along with their corresponding confidence values Since answers will be obtained even if the required question was not answered up to this point in the conversation the confidence values plays a crucial role in determining the validity of the answer For this purpose we determined ranges of confidence values for validation A confidence value above 10 signifies the question has been answered correctly A confidence of 2 ndash 10 signifies that it may have been answered but should verify with the user for correctness and any confidence below 2 is discarded

5 Asking remaining questions iteratively After the parsing it is checked if any of the required questions are still unanswered If so the chatbot asks the remaining question and the process from 3 ndash 4 is carried out iteratively Once all the questions have been answered the user is shown with the available flight options as per his request

Figure 4-4 The Flow diagram of the Flight booking Chatbot system

Online QA System and Attention Visualization

To be able to test out various examples we used the BiDAF [23] model for an

online demo One can either choose from the available examples from the drop-down

menu or paste their own passage and examples While this is a useful and interesting

system to test the model in a user-friendly way we created this system to be able to

focus on the wrong samples

36

Figure 4-5 QA system interface with attention highlight over candidate answers

The candidate answers are shown with the blue highlights as per their

confidence values The higher the confidence the darker the answer The highest

confidence value is chosen as the predicted answer We developed the system to show

attention spread of the candidate answers to realize what needs to be done to improve

the system This led us to realize the importance of including the query attention part as

well as multilevel attention on the BiDAF model as described in the first section of this

chapter

37

CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION

Future Directions

Having done a thorough analysis of the current methods and state of the arts in

Machine Comprehension and for the QA task and developing systems on it we have

achieved a strong sense of what needs to be done to further improve the QA models

After observing the wrong samples we can see that the system is still unable to encode

meaning and is picking answers based on statistical patterns of occurrence of answers

on training examples

As per the State of the Art models and analyzing the ongoing research literature

it is easy to conclude that more features need to be embedded to encode the meaning

of words phrases and sentences A paper called Reinforced Mnemonic Reader for

Machine Comprehension encoded POS and NER tags of words along with their word

and character embedding This gave them better results

Figure 5-1 An English language semantic parse tree [26]

38

We have developed a method to encode the syntax parse tree of a sentence that

not only encodes the post tags but the relation of the word within the phrase and the

relation of the phrase within the whole sentence in a hierarchical manner

Finally data augmentation is another solution to get better results One definite

way to reduce the errors would be to include similar samples in the training data which

the system is faltering in the dev set One could generate similar examples as the failure

cases and include them in the training set to have better prediction Another system

would be to train to similar and bigger datasets Our models were trained on the SQuAD

dataset There are other datasets too which does the similar question answering task

such as TriviaQA We could augment the training set of TriviaQA along with SQuAD to

have a more robust system that is able to generalize better and thus have higher

accuracy for predicting answer spans

Conclusion

In this work we have tried to explore the most fundamental techniques that have

shaped the current state of the art Then we proposed a minor improvement of

architecture over an existing model Furthermore we developed two applications that

uses the base model First we talked about how a chatbot application can be made

using the QA system and lastly we also created a web interface where the model can

be used for any Passage and Question This interface also shows the attention spread

on the candidate answers While our effort is ongoing to push the state of the art

forward we strongly believe that surpassing human level accuracy on this task will have

high dividends for the society at large

39

LIST OF REFERENCES

[1] Danqi Chen From Reading Comprehension To Open-Domain Question Answering 2018 Youtube Accessed March 16 2018 httpswwwyoutubecomwatchv=1RN88O9C13U

[2] Lehnert Wendy G The Process of Question Answering No RR-88 Yale Univ New Haven Conn Dept of computer science 1977

[3] Joshi Mandar Eunsol Choi Daniel S Weld and Luke Zettlemoyer Triviaqa A large scale distantly supervised challenge dataset for reading comprehension arXiv preprint arXiv170503551 (2017)

[4] Rajpurkar Pranav Jian Zhang Konstantin Lopyrev and Percy Liang Squad 100000+ questions for machine comprehension of text arXiv preprint arXiv160605250 (2016)

[5] The Stanford Question Answering Dataset 2018 RajpurkarGithubIo Accessed March 16 2018 httpsrajpurkargithubioSQuAD-explorer

[6] Hu Minghao Yuxing Peng and Xipeng Qiu Reinforced mnemonic reader for machine comprehension CoRR abs170502798 (2017)

[7] Schalkoff Robert J Artificial neural networks Vol 1 New York McGraw-Hill 1997

[8] Me For The AI and Neetesh Mehrotra 2018 The Connect Between Deep Learning And AI - Open Source For You Open Source For You Accessed March 16 2018 httpopensourceforucom201801connect-deep-learning-ai

[9] Krizhevsky Alex Ilya Sutskever and Geoffrey E Hinton Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems pp 1097-1105 2012

[10] An Intuitive Explanation Of Convolutional Neural Networks 2016 The Data Science Blog Accessed March 16 2018 httpsujjwalkarnme20160811intuitive-explanation-convnets

[11] Williams Ronald J and David Zipser A learning algorithm for continually running fully recurrent neural networks Neural computation 1 no 2 (1989) 270-280

[12] Hochreiter Sepp and Juumlrgen Schmidhuber Long short-term memory Neural computation 9 no 8 (1997) 1735-1780

[13] Chung Junyoung Caglar Gulcehre KyungHyun Cho and Yoshua Bengio Empirical evaluation of gated recurrent neural networks on sequence modeling arXiv preprint arXiv14123555 (2014)

40

[14] Mikolov Tomas Wen-tau Yih and Geoffrey Zweig Linguistic regularities in continuous space word representations In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics Human Language Technologies pp 746-751 2013

[15] Pennington Jeffrey Richard Socher and Christopher Manning Glove Global vectors for word representation In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) pp 1532-1543 2014

[16] Word2vec Tutorial - The Skip-Gram Model middot Chris Mccormick 2016 MccormickmlCom Accessed March 16 2018 httpmccormickmlcom20160419word2vec-tutorial-the-skip-gram-model

[17] Group 2015 [Paper Introduction] Bilingual Word Representations With Monolingual hellip SlideshareNet Accessed March 16 2018 httpswwwslidesharenetnaist-mtstudybilingual-word-representations-with-monolingual-quality-in-mind

[18] Attention Mechanism 2016 BlogHeuritechCom Accessed March 16 2018 httpsblogheuritechcom20160120attention-mechanism

[19] Weston Jason Sumit Chopra and Antoine Bordes 2014 Memory Networks arXiv preprint arXiv14103916v11 Accessed March 16 2018 httpsarxivorgabs14103916

[20] Wang Shuohang and Jing Jiang Machine comprehension using match-lstm and answer pointer arXiv preprint arXiv160807905 (2016)

[21] Wang Shuohang and Jing Jiang Learning natural language inference with LSTM arXiv preprint arXiv151208849 (2015)

[22] Vinyals Oriol Meire Fortunato and Navdeep Jaitly Pointer networks In Advances in Neural Information Processing Systems pp 2692-2700 2015

[23] Wang Wenhui Nan Yang Furu Wei Baobao Chang and Ming Zhou Gated self-matching networks for reading comprehension and question answering In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1 Long Papers) vol 1 pp 189-198 2017

[24] Seo Minjoon Aniruddha Kembhavi Ali Farhadi and Hannaneh Hajishirzi Bidirectional attention flow for machine comprehension arXiv preprint arXiv161101603 (2016)

[25] Clark Christopher and Matt Gardner Simple and effective multi-paragraph reading comprehension arXiv preprint arXiv171010723 (2017)

41

[26] 8 Analyzing Sentence Structure 2018 NltkOrg Accessed March 16 2018 httpwwwnltkorgbookch08html

42

BIOGRAPHICAL SKETCH

Purnendu Mukherjee grew up in Kolkata India and had a strong interest for

computers since an early age After his high school he did his BSc in computer

science from Ramakrishna Mission Residential College Narendrapur followed by MSc

in computer science from St Xavierrsquos College Kolkata He had a strong intuition and

interest for human like learning systems and wanted to work in this area He started

working at TCS Innovation Labs Pune for the application of Natural Language

Processing in Educational Applications As he was deeply passionate about learning

system that mimic the human brain and learn like a human child does he was

increasing interested about Deep Learning and its applications After working for a year

he went on to pursue a Master of Science degree in computer science from the

University of Florida Gainesville His academic interests have been focused on Deep

Learning and Natural Language Processing and he has been working on Machine

Reading Comprehension since summer of 2017


Convolutional Neural Network

The first wave of deep learning's success was brought by Convolutional Neural Networks (CNNs) [9], the technique used by the winning team of the 2012 ImageNet competition. CNNs are deep artificial neural networks (ANNs) that can classify images, cluster them by similarity, and perform object recognition within scenes. They can be used to detect and identify faces, people, signs, or almost any other kind of visual data.

Figure 2-2 Convolutional Neural Network Architecture [10]

There are primarily four operations in a standard CNN model (as shown in the figure above):

1 Convolution - The primary purpose of convolution in the ConvNet above is to extract features from the input image. The spatial relationships between pixels, i.e., the image features, are preserved and learned by the convolution using small squares of input data.

2 Non-Linearity (ReLU) - The Rectified Linear Unit (ReLU) is a non-linear operation applied element-wise; it replaces every negative value in the feature map with zero.

3 Pooling or Sub-Sampling - Spatial pooling reduces the dimensionality of each feature map while retaining the most important information. For max pooling, the largest value in each square window is kept and the rest are dropped. Other types of pooling include average and sum pooling.

4 Classification (Fully Connected Layer) - The fully connected layer is a traditional multi-layer perceptron, as described before, that uses a softmax activation function in the output layer. The high-level features of the image are encoded by the convolutional and pooling layers and then fed to the fully connected layer, which uses them to classify the input image into various classes based on the training dataset [10]. A minimal sketch of this pipeline follows.
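To make the four operations concrete, the following minimal sketch stacks them in PyTorch. The channel counts, kernel sizes, 32x32 input resolution, and ten output classes are illustrative assumptions, not values used elsewhere in this thesis.

import torch
import torch.nn as nn

class SmallConvNet(nn.Module):
    """Convolution -> ReLU -> max pooling -> fully connected classifier."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolution extracts local image features
            nn.ReLU(),                                     # element-wise non-linearity
            nn.MaxPool2d(2),                               # spatial pooling keeps the strongest responses
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Fully connected layer; softmax over its outputs gives the class probabilities.
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):                                  # x: (batch, 3, 32, 32)
        h = self.features(x)
        return self.classifier(h.flatten(1))

logits = SmallConvNet()(torch.randn(4, 3, 32, 32))
probabilities = torch.softmax(logits, dim=1)               # distribution over the output classes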


When a new image is fed into the CNN model, all the above-mentioned steps are carried out (forward propagation) and a probability distribution over the set of output classes is produced. With a large enough training dataset, the network learns to generalize well enough to classify new images into their correct classes.

Recurrent Neural Networks (RNN)

Whenever we want to predict or encode sequential data, an RNN [11] is our go-to method. RNNs perform the same task for every element of a sequence, where the output at each step depends on the previous computations, hence the recurrence. In practice, RNNs are unable to retain long-term dependencies and can look back only a few steps because of the vanishing gradient problem.

h_t = f(W_h h_{t-1} + W_x x_t)   (2-1)

A solution to the dependency problem is to use gated cells such as the LSTM [12] or GRU [13]. These cells pass important information on to the next cells while ignoring unimportant information. The gated units in a GRU block are:

• Update Gate – computed from the current input and the previous hidden state:

z_t = σ(W_z x_t + U_z h_{t-1})   (2-2)

• Reset Gate – calculated similarly but with different weights:

r_t = σ(W_r x_t + U_r h_{t-1})   (2-3)

• New memory content:

h̃_t = tanh(W x_t + r_t ⊙ U h_{t-1})   (2-4)

If the reset gate unit is ~0, the previous memory is ignored and only the new information is kept.


The final memory at the current time step combines the previous memory and the new memory content:

h_t = z_t ⊙ h_{t-1} + (1 - z_t) ⊙ h̃_t   (2-5)

While the GRU is computationally efficient, the LSTM is the more general case, with three gates:

• Input Gate – what new information to add to the current cell state
• Forget Gate – how much information from previous states to keep
• Output Gate – how much information to send on to the next states

Just like the GRU, the current cell state is a sum of the previous cell state, weighted by the forget gate, and the new candidate value, weighted by the input gate. Based on the cell state, the output gate regulates the final output. A minimal sketch of a single GRU step is given below.
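The sketch below walks through a single GRU step in NumPy, following equations (2-2) through (2-5); the weight shapes and random initialization are assumptions made only for illustration.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    """One GRU time step: update gate, reset gate, candidate memory, final memory."""
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)                # update gate (2-2)
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)                # reset gate (2-3)
    h_tilde = np.tanh(W @ x_t + U @ (r_t * h_prev))      # new memory content (2-4)
    return z_t * h_prev + (1.0 - z_t) * h_tilde          # final memory (2-5)

d_in, d_h = 4, 3
rng = np.random.default_rng(0)
Wz, Wr, W = (rng.standard_normal((d_h, d_in)) for _ in range(3))
Uz, Ur, U = (rng.standard_normal((d_h, d_h)) for _ in range(3))
h = gru_step(rng.standard_normal(d_in), np.zeros(d_h), Wz, Uz, Wr, Ur, W, U)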

Word Embedding

Computations and gradients can be applied to numbers, not to words or letters, so we first need to convert words into a corresponding numerical representation before feeding them into a deep learning model. In general there are two types of word embedding: frequency-based (count vectors, tf-idf, and co-occurrence vectors) and prediction-based. With frequency-based embeddings the order of the words is not preserved and the text is treated as a bag-of-words model, whereas prediction-based models take the order, or locality, of words into consideration to generate the numerical representation of each word. Within the prediction-based category there are two fundamental techniques, Continuous Bag of Words (CBOW) and the Skip-Gram model, which form the basis for word2vec [14] and GloVe [15].

The basic intuition behind word2vec is that if two different words have very similar "contexts" (that is, the words that are likely to appear around them), then the model will produce similar vectors for those words. Conversely, if two word vectors are similar, then the network will produce similar context predictions for those two words. For example, synonyms like "intelligent" and "smart" would have very similar contexts, and related words like "engine" and "transmission" would probably have similar contexts as well [16]. Plotting the word vectors learned by word2vec over a large corpus, we can find some very interesting relationships between words.

Figure 2-3 Semantic relation between words in vector space [17]
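As a quick illustration of these relationships, the hedged sketch below uses the gensim library to query a pretrained word2vec model; the file name is a placeholder for whatever word2vec-format vectors are available.

from gensim.models import KeyedVectors

# Load pretrained vectors in word2vec format (the file name here is an assumption).
vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# Words with similar contexts end up close together in vector space.
print(vectors.similarity("intelligent", "smart"))

# Vector offsets capture semantic relations, e.g. king - man + woman is close to queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))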

Attention Mechanism

We as humans pay attention to the things that are important or relevant in a context. For example, when asked a question about a passage, we try to find the part of the passage most relevant to the question and then reason from our understanding of that part. The same idea applies to the attention mechanism in deep learning: it is used to identify the specific parts of a given context to which the current question is relevant.

Formally, the technique takes n arguments y_1, ..., y_n (in our case the representations of the passage words) and a question representation q. It returns a vector z which is a "summary" of the y_i, focusing on the information linked to the question q. More formally, it returns a weighted arithmetic mean of the y_i, where the weights are chosen according to the relevance of each y_i given the question q [18].

Figure 2-4 Attention Mechanism flow [18]
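A minimal NumPy sketch of this weighted arithmetic mean is given below; relevance is scored here with a simple dot product, which is an assumption, since any scoring function between the question and the passage words would do.

import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def attend(question, passage):
    """Return a summary z of the passage vectors, weighted by their relevance to the question."""
    scores = passage @ question          # relevance of each y_i to the question q
    weights = softmax(scores)            # attention weights, summing to 1
    return weights @ passage, weights    # z is the weighted arithmetic mean of the y_i

passage = np.random.randn(6, 8)          # six passage word vectors y_1..y_6 of dimension 8
question = np.random.randn(8)            # question vector q
z, w = attend(question, passage)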

Memory Networks

Although Convolutional Neural Networks and Recurrent Neural Networks do capture how we form our visual and sequential memories, their memory (encoded by hidden states and weights) is typically too small and is not compartmentalized enough to accurately remember facts from the past, since knowledge is compressed into dense vectors [19].

Deep learning needed a methodology that preserves memories as they are, so that they are not lost in generalization and so that recalling exact words or sequences of events remains possible, something computers are already good at. This effort led to Memory Networks [19], published at ICLR 2015 by Facebook AI Research.

This paper provides a basic framework to store, augment, and retrieve memories while working seamlessly with a recurrent neural network architecture. The memory network consists of a memory m (an array of objects indexed by m_i) and four (potentially learned) components I, G, O, and R, as follows:

I (input feature map) – converts the incoming input to the internal feature representation, either a sparse or dense feature vector like that from word2vec or GloVe.

G (generalization) – updates old memories given the new input. The authors call this generalization because the network has an opportunity to compress and generalize its memories at this stage for some intended future use.

O (output feature map) – produces a new output (in the feature representation space) given the new input and the current memory state. This component is responsible for performing inference; in a question answering system it selects the candidate sentences (which might contain the answer) from the story (conversation) so far.

R (response) – converts the output into the desired response format, for example a textual response or an action. In the QA system described, this component finds the desired answer and then converts it from the feature representation to the actual word.

This is a fully supervised model, meaning all the candidate sentences from which the answer can be found are marked during the training phase; this can also be termed 'hard attention'. A schematic sketch of the four components is given below.
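The skeleton below is only a schematic of the I, G, O, and R components described above, not the trained model of [19]; the bag-of-words hashing embedding and dot-product scoring are stand-ins for the learned components.

import numpy as np

def embed(sentence, dim=16):
    """I: toy hashing bag-of-words feature map (a stand-in for word2vec/GloVe features)."""
    v = np.zeros(dim)
    for w in sentence.lower().split():
        v[hash(w) % dim] += 1.0
    return v

class MemoryNetworkSketch:
    def __init__(self):
        self.memory = []                                    # m: the stored memories

    def write(self, sentence):
        self.memory.append((sentence, embed(sentence)))     # G: here generalization is just appending

    def answer(self, question):
        q = embed(question)
        # O: select the supporting memory most relevant to the question.
        best_sentence, _ = max(self.memory, key=lambda m: float(np.dot(m[1], q)))
        # R: convert the selected memory into a response (here, return the sentence itself).
        return best_sentence

net = MemoryNetworkSketch()
net.write("Frodo took the ring")
net.write("Sam went back to the Shire")
print(net.answer("Who took the ring"))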

The authors tested the QA system on various works of literature, including Lord of the Rings.

Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]

CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART

There has been rapid progress since the release of the SQuAD dataset, and currently the best ensemble models are close to human-level accuracy in machine comprehension. This is due to various ingenious methods that solve some of the problems of previous approaches: out-of-vocabulary tokens were handled using character embeddings, long-term dependencies within the context passage were addressed using self-attention, and many other techniques, such as contextualized vectors, history of words, and attention flow, were introduced. In this section we will look at some of the most important models that were fundamental to the progress of question answering.

Machine Comprehension Using Match-LSTM and Answer Pointer

In this paper [20] the authors propose an end-to-end neural architecture for the QA task. The architecture is based on match-LSTM [21], a model they previously proposed for textual entailment, and Pointer Net [22], a sequence-to-sequence model proposed by Vinyals et al. (2015) to constrain the output tokens to be from the input sequences. The model consists of an LSTM preprocessing layer, a match-LSTM layer, and an Answer Pointer layer.

We are given a piece of text, which we refer to as the passage, and a question related to the passage. The passage is represented by a matrix P of its word embeddings, where d is the dimensionality of the embeddings; similarly, the question is represented by a matrix Q of its word embeddings. Our goal is to identify a subsequence of the passage as the answer to the question.


Figure 3-1 Match-LSTM Model Architecture [20]

LSTM Preprocessing Layer: They use a standard one-directional LSTM (Hochreiter & Schmidhuber, 1997) to process the passage and the question separately.

Match-LSTM Layer: They applied the match-LSTM model proposed for textual entailment to the machine comprehension problem by treating the question as a premise and the passage as a hypothesis. The match-LSTM sequentially goes through the passage; at position i it uses the standard word-by-word attention mechanism to obtain an attention weight vector over the question:

G_i = tanh(W^q H^q + (W^p h_i^p + W^r h_{i-1}^r + b^p) ⊗ e_Q),   α_i = softmax(w^T G_i + b ⊗ e_Q)   (3-1)

where W^q, W^p, W^r, b^p, w, and b are parameters to be learned.

Answer Pointer Layer: The Answer Pointer (Ans-Ptr) layer is motivated by the Pointer Net introduced by Vinyals et al. (2015) [22]. The boundary model produces only the start token and the end token of the answer, and all the tokens between these two in the original passage are then considered to be the answer. A sketch of this boundary decoding is given below.
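The sketch below shows one way such boundary decoding can be done once start and end probabilities are available; the probability vectors and the maximum span length are illustrative assumptions.

import numpy as np

def best_span(p_start, p_end, max_len=15):
    """Pick (i, j) with i <= j maximizing p_start[i] * p_end[j]; tokens i..j form the answer."""
    best, best_score = (0, 0), -1.0
    for i in range(len(p_start)):
        for j in range(i, min(i + max_len, len(p_end))):
            score = p_start[i] * p_end[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

tokens = "the panthers used the san jose state practice facility".split()
p_start = np.random.dirichlet(np.ones(len(tokens)))   # placeholder start distribution
p_end = np.random.dirichlet(np.ones(len(tokens)))     # placeholder end distribution
i, j = best_span(p_start, p_end)
print(" ".join(tokens[i:j + 1]))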

When this paper was released back in November 2016 Match-LSTM method

was the state of the art in Question Answering systems and was at the top of the

leaderboard for the SQuAD dataset

R-NET Matching Reading Comprehension with Self-Matching Networks

In this model [23], the question and passage are first processed separately by a bidirectional recurrent network (Mikolov et al. 2010). They then match the question and passage with gated attention-based recurrent networks, obtaining a question-aware representation of the passage. On top of that, they apply self-matching attention to aggregate evidence from the whole passage and refine the passage representation, which is then fed into the output layer to predict the boundary of the answer span.

Question and Passage Encoding: First, the words are converted to their respective word-level and character-level embeddings. The character-level embeddings are generated by taking the final hidden states of a bi-directional recurrent neural network (RNN) applied to the embeddings of the characters in each token. Such character-level embeddings have been shown to be helpful in dealing with out-of-vocabulary (OOV) tokens. They then use a bi-directional RNN to produce new representations of all words in the question and the passage, respectively.

Figure 3-2 The task of Question Answering [23]

Gated Attention-based Recurrent Networks: They use a variant of attention-based recurrent networks with an additional gate to determine the importance of passage information with respect to the question. Unlike the gates in an LSTM or GRU, this additional gate is based on the current passage word and its attention-pooling vector of the question, so it focuses on the relation between the question and the current passage word. The gate effectively models the phenomenon that only parts of the passage are relevant to the question in reading comprehension, and the gated representation is used in the subsequent calculations. A sketch of this gating is given below.
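A hedged sketch of such a gate is shown below: the concatenation of the current passage word vector and its attention-pooled question vector is scaled element-wise by a learned sigmoid gate before being fed to the recurrent network. The tensor sizes and the single weight matrix W_g are assumptions for illustration.

import torch

def gated_input(u_p, c_q, W_g):
    """Gate the [passage word; question attention-pooling vector] pair before the matching RNN."""
    x = torch.cat([u_p, c_q], dim=-1)    # current passage word and its question context
    g = torch.sigmoid(x @ W_g)           # gate: how relevant this word is to the question
    return g * x                         # only the relevant parts pass through

d = 8
W_g = torch.randn(2 * d, 2 * d)
x_gated = gated_input(torch.randn(d), torch.randn(d), W_g)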


Self-Matching Attention: The previous step generates a question-aware passage representation that highlights the important parts of the passage. One problem with such a representation is that it has very limited knowledge of context: an answer candidate is often oblivious to important cues in the passage outside its surrounding window. To address this problem, the authors propose directly matching the question-aware passage representation against itself. It dynamically collects evidence from the whole passage and encodes the evidence relevant to the current passage word, together with its matching question information, into the passage representation.

Output Layer: They use the same method as Wang & Jiang (2016b), using pointer networks (Vinyals et al. 2015) to predict the start and end positions of the answer. In addition, they use attention-pooling over the question representation to generate the initial hidden vector for the pointer network [23].

When the R-NET model first appeared on the leaderboard in March 2017, it was at the top with a 72.3% Exact Match and an 80.7% F1 score.

Bi-Directional Attention Flow (BiDAF) for Machine Comprehension

BiDAF [24] is a hierarchical multi-stage architecture for modeling the representations of the context paragraph at different levels of granularity. BiDAF includes character-level, word-level, and contextual embeddings, and uses bi-directional attention flow to obtain a query-aware context representation. Its attention layer is not used to summarize the context paragraph into a fixed-size vector. Instead, the attention is computed at every time step, and the attended vector at each time step, along with the representations from previous layers, can flow through to the subsequent modeling layer. This reduces the information loss caused by early summarization [24].


Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]

Their machine comprehension model is a hierarchical multi-stage process and

consists of six layers

1 Character Embedding Layer maps each word to a vector space using character-level CNNs

2 Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model

3 Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embedding of the words These first three layers are applied to both the query and context They use a LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words We place an LSTM in both directions and concatenate the outputs of the two LSTMs

4 Attention Flow Layer couples the query and context vectors and produces a set of query aware feature vectors for each word in the context

5 Modeling Layer employs a Recurrent Neural Network to scan the context. The input to the modeling layer is G, which encodes the query-aware representations of the context words, and its output captures the interaction among the context words conditioned on the query. They use two layers of bi-directional LSTM with an output size of d for each direction; hence a matrix is obtained, which is passed on to the output layer to predict the answer.

6 Output Layer provides an answer to the query. They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices under the predicted distributions, averaged over all examples [25]. A sketch of this loss is given below.
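A short sketch of that span loss is given below, assuming the model produces one start logit and one end logit per context token; the tensor names and sizes are illustrative.

import torch
import torch.nn.functional as F

def span_loss(start_logits, end_logits, true_start, true_end):
    """Sum of the negative log probabilities of the true start and end indices, averaged over the batch."""
    return F.cross_entropy(start_logits, true_start) + F.cross_entropy(end_logits, true_end)

batch, ctx_len = 2, 50
loss = span_loss(torch.randn(batch, ctx_len), torch.randn(batch, ctx_len),
                 torch.tensor([3, 10]), torch.tensor([5, 12]))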

In a further variation of the above work, they add a self-attention layer after the bi-attention layer to further improve the results. The architecture of that model is shown in the figure below.

Figure 3-3 The task of Question Answering [25]


Summary

In this chapter we reviewed the methods that are fundamental to the state of the art in machine comprehension and the task of question answering. We have come close to human-level accuracy, and this is due to incremental developments over previous models. As we saw, out-of-vocabulary (OOV) tokens were handled using character embeddings, long-term dependencies within the context passage were addressed using self-attention, and many other techniques, such as contextualized vectors and attention flow, were employed to get better results. In the next chapter we will see how we can build on these models and develop them further.


CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING

Although the advancements in question answering systems since the release of the SQuAD dataset have been impressive and the results are getting close to human-level accuracy, these systems are far from fool-proof. The models still make mistakes that would be obvious to a human. For example:

Passage: The Panthers used the San Jose State practice facility and stayed at the San Jose Marriott. The Broncos practiced at Florida State Facility and stayed at the Santa Clara Marriott.

Question: At what university's facility did the Panthers practice?

Actual Answer: San Jose State

Predicted Answer: Florida State Facility

To find out what leads to the wrong predictions, we wanted to see the attention weights associated with such an example. We plotted the passage-question heat map, a 2D matrix where the intensity of each cell signifies the similarity between a passage word and a question word. For the above example we found that while certain words of the question are given high weight, other parts are not: the words 'At', 'facility', and 'practice' receive high attention, but 'Panthers' does not. If it had received high attention, the system would have predicted 'San Jose State' as the right answer. To solve this issue we analyzed the base BiDAF model and proposed adding two things:

1 Bi-Attention and Self-Attention over the query

2 A second level of attention over the outputs of (Bi-Attention + Self-Attention) from both the context and the query


Multi-Attention BiDAF Model

Our comprehension model is a hierarchical multi-stage process and consists of

the following layers

1 Embedding Just as in other models, we embed words using pretrained word vectors. We also embed the characters in each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to get character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training.

2 Pre-Process A shared bi-directional GRU (Cho et al 2014) is used to map the question and passage embeddings to context aware embeddings

3 Context Attention The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al. 2016) is used to build a query-aware context representation. Let h_i be the vector for context word i, q_j the vector for question word j, and n_q and n_c the lengths of the question and context respectively. We compute the attention between context word i and question word j as (a code sketch of this attention computation is given after this list):

a_ij = w1 · h_i + w2 · q_j + w3 · (h_i ⊙ q_j)   (4-1)

where w1, w2, and w3 are learned vectors and ⊙ is element-wise multiplication. We then compute an attended vector c_i for each context token as

p_ij = softmax_j(a_ij),   c_i = Σ_j p_ij q_j   (4-2)

We also compute a query-to-context vector q_c

m_i = softmax_i(max_j a_ij),   q_c = Σ_i m_i h_i   (4-3)

The final vector computed for each token is built by concatenating h_i, c_i, h_i ⊙ c_i, and q_c ⊙ c_i. In our model we subsequently pass the result through a linear layer with ReLU activations.

4 Context Self-Attention Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU, and we then apply the same attention mechanism, only now between the passage and itself. In this case we do not use query-to-context attention, and we set a_ij = -inf if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so its output is additionally summed with its input.

5 Query Attention This is done the same way as the context attention, but we calculate the weighted sum of the context words for each query word, so the output length is the number of query words. We then calculate context-to-query attention analogously to the query-to-context attention of the context attention layer.

Figure 4-1 The modified BiDAF model with multilevel attention

6 Query Self-Attention This part is done the same way as the context self-attention layer but from the output of the Query Attention layer


7 Context Query Bi-Attention + Self-Attention The output of the Context self-attention and the Query Self-Attention layers are taken as input and the same process for Bi-Attention and self-attention is applied on these inputs

8 Prediction In the last layer of our model a bidirectional GRU is applied followed by a linear layer that computes answer start scores for each token The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores The softmax operation is applied to the start and end scores to produce start and end probabilities and we optimize the negative log likelihood of selecting correct start and end tokens
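The sketch below implements the context attention of step 3, i.e. equations (4-1) to (4-3), for a single context-question pair; the dimensions, random weights, and the final concatenation order are assumptions made only for illustration.

import torch
import torch.nn.functional as F

def context_attention(h, q, w1, w2, w3):
    """Bi-directional attention of Eqs. (4-1)-(4-3): h is (n_c, d), q is (n_q, d)."""
    # a_ij = w1 . h_i + w2 . q_j + w3 . (h_i * q_j)                                     (4-1)
    a = (h @ w1).unsqueeze(1) + (q @ w2).unsqueeze(0) + (h * w3) @ q.t()
    c = F.softmax(a, dim=1) @ q                    # attended vector c_i per context token (4-2)
    m = F.softmax(a.max(dim=1).values, dim=0)      # weight of each context word
    q_c = m @ h                                    # query-to-context vector                (4-3)
    # Concatenate h_i, c_i, h_i * c_i and q_c * c_i before the linear + ReLU layer.
    return torch.cat([h, c, h * c, q_c.expand_as(h) * c], dim=1)

d = 16
h, q = torch.randn(7, d), torch.randn(4, d)
g = context_attention(h, q, torch.randn(d), torch.randn(d), torch.randn(d))   # shape (7, 4 * d)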

Having carried out this modification, we were able to solve the wrong example we started with: the multilevel attention model gives the correct output, "San Jose State". We also achieved slightly better scores than the original model, with an F1 score of 85.44 on the SQuAD dev set.

Chatbot Design Using a QA System

Designing a perfect chatbot that passes the Turing test is a fundamental goal of Artificial Intelligence. Although we are many orders of magnitude away from achieving such a goal, domain-specific tasks can be solved with chatbots built from current technology. We started with a similar objective in mind, i.e., to design a domain-specific chatbot and then generalize to other areas once the first domain-specific objective is achieved robustly. This led us to the fundamental problem of machine comprehension and subsequently to the task of question answering. Having achieved some degree of success with QA systems, we looked back to see whether we could apply our newly acquired knowledge to the task of designing chatbots.

The chatbots made with today's technologies mostly rely on handcrafted techniques such as template matching, which requires anticipating all the possible ways a user may articulate his requirements and a conversation may unfold. This takes many man-hours of design for a single domain-specific system and is still very error prone. In this section we propose a general chatbot design that makes designing a domain-specific chatbot easy and robust at the same time.

Every domain-specific chatbot needs to obtain a set of information from the user and show some results based on the user-specific information obtained. Traditional chatbots use template matching and keyword lookup to determine whether the user has provided the required information. Our idea is to use the question answering system in the backend to extract the required information from whatever the user has typed up to this point in the conversation. The information to be extracted can be posed as a set of questions, and the answers obtained from those questions can be used as the parameters to supply the relevant information to the user.

We chose flight reservation as our chatbot domain. Our goal was to extract the information required to show the user the available flights as per the user's requirements.

Figure 4-2 Flight reservation chatbotrsquos chat window


For a flight reservation task, the booking agent needs to know at minimum the origin city, destination city, and date of travel to be able to show the available flights. Optional information includes the number of tickets, the passenger's name, one-way or round trip, and so on.

The minimal conversation with the user through the chat window would be as shown above. We had a platform called OneTask on which we wanted to implement our chatbot; the chat interface within the OneTask system looks as follows.

Figure 4-3 Chatbot within OneTask system

The chat system works as follows:

1 Initiation: The user opens the chat window, which starts a session with the chatbot. The chatbot asks 'How may I help you?'

2 User Reply: The user may reply with none of the information required for flight booking, or may provide several pieces of information in the same message.

3 User Reply Parsing: The conversation up to this point is treated as a passage and the internal questions are run on it. The questions that are run are:

Where do you want to go?

From where do you want to leave?

When do you want to depart?

4 Parsed Responses from the QA Model: After running the questions on the QA system over the conversation so far, answers are obtained for all of them, along with their corresponding confidence values. Since answers are returned even when the corresponding question has not actually been answered at this point in the conversation, the confidence values play a crucial role in determining the validity of an answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly; a confidence of 2 to 10 signifies that it may have been answered, but the user should be asked to confirm; and any answer with confidence below 2 is discarded.

5 Asking Remaining Questions Iteratively: After parsing, the system checks whether any of the required questions are still unanswered. If so, the chatbot asks a remaining question and steps 3 and 4 are repeated. Once all the questions have been answered, the user is shown the available flight options as per his request. A sketch of this loop is given after the figure below.

Figure 4-4 The Flow diagram of the Flight booking Chatbot system
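The following sketch shows how steps 1 through 5 can be wired together; qa_model is a placeholder for the trained QA system (returning an answer span and a confidence), and the slot names and threshold values follow the ranges described above.

REQUIRED_QUESTIONS = {
    "destination": "Where do you want to go?",
    "origin": "From where do you want to leave?",
    "date": "When do you want to depart?",
}

def parse_user_reply(conversation, qa_model, slots, unconfirmed):
    """Run the internal questions over the conversation so far and keep confident answers."""
    for slot, question in REQUIRED_QUESTIONS.items():
        if slot in slots:
            continue
        answer, confidence = qa_model(passage=conversation, question=question)
        if confidence > 10:          # confident enough: accept the answer
            slots[slot] = answer
        elif confidence > 2:         # maybe answered: ask the user to confirm
            unconfirmed[slot] = answer
        # below 2: discard and ask the question explicitly later
    return slots, unconfirmed

def next_bot_turn(slots):
    """Ask the first still-missing question, or show flights once everything is filled."""
    for slot, question in REQUIRED_QUESTIONS.items():
        if slot not in slots:
            return question
    return "Here are the available flights for your trip."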

Online QA System and Attention Visualization

To be able to test various examples, we built an online demo around the BiDAF [24] model. One can either choose one of the available examples from the drop-down menu or paste in a new passage and question. While this is a useful and user-friendly way to test the model, we created this system primarily to be able to focus on the wrongly answered samples.


Figure 4-5 QA system interface with attention highlight over candidate answers

The candidate answers are shown with the blue highlights as per their

confidence values The higher the confidence the darker the answer The highest

confidence value is chosen as the predicted answer We developed the system to show

attention spread of the candidate answers to realize what needs to be done to improve

the system This led us to realize the importance of including the query attention part as

well as multilevel attention on the BiDAF model as described in the first section of this

chapter


CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION

Future Directions

Having done a thorough analysis of the current methods and state of the arts in

Machine Comprehension and for the QA task and developing systems on it we have

achieved a strong sense of what needs to be done to further improve the QA models

After observing the wrong samples we can see that the system is still unable to encode

meaning and is picking answers based on statistical patterns of occurrence of answers

on training examples

From the state-of-the-art models and the ongoing research literature, it is easy to conclude that more features need to be embedded to encode the meaning of words, phrases, and sentences. The Reinforced Mnemonic Reader for Machine Comprehension [6] encoded the POS and NER tags of words along with their word and character embeddings, which gave better results.

Figure 5-1 An English language semantic parse tree [26]


We have developed a method to encode the syntax parse tree of a sentence that captures not only the POS tags but also the relation of each word within its phrase and the relation of each phrase within the whole sentence, in a hierarchical manner. A sketch of this idea is given below.
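The hedged sketch below shows one way such a hierarchical encoding could be read off an NLTK parse tree [26]: each word is paired with the chain of labels from the sentence root down to its POS tag. The bracketed parse is an illustrative example, not output of our system.

from nltk import Tree

# An example bracketed parse such as those discussed in [26].
parse = Tree.fromstring(
    "(S (NP (DT the) (NNS panthers)) (VP (VBD used) (NP (DT the) (NN facility))))")

def hierarchical_tags(tree):
    """For every word, collect the labels on the path from the root down to its POS tag."""
    encoded = []
    for pos in tree.treepositions("leaves"):
        path = [tree[pos[:i]].label() for i in range(len(pos))]   # e.g. ['S', 'NP', 'DT']
        encoded.append((tree[pos], path))
    return encoded

for word, path in hierarchical_tags(parse):
    print(word, "->", path)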

Finally, data augmentation is another way to get better results. One definite way to reduce the errors would be to include in the training data samples similar to those the system gets wrong on the dev set; one could generate examples similar to the failure cases and include them in the training set for better prediction. Another approach is to train on similar and bigger datasets. Our models were trained on the SQuAD dataset, but other datasets, such as TriviaQA, address a similar question answering task. We could augment the SQuAD training set with TriviaQA to obtain a more robust system that generalizes better and thus predicts answer spans more accurately.

Conclusion

In this work we explored the most fundamental techniques that have shaped the current state of the art. We then proposed a minor architectural improvement over an existing model. Furthermore, we developed two applications that use the base model: first, a chatbot application built on top of the QA system, and second, a web interface where the model can be run on any passage and question and which shows the attention spread over the candidate answers. While our effort to push the state of the art forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will pay high dividends for society at large.


LIST OF REFERENCES

[1] Danqi Chen From Reading Comprehension To Open-Domain Question Answering 2018 Youtube Accessed March 16 2018 httpswwwyoutubecomwatchv=1RN88O9C13U

[2] Lehnert Wendy G The Process of Question Answering No RR-88 Yale Univ New Haven Conn Dept of computer science 1977

[3] Joshi Mandar Eunsol Choi Daniel S Weld and Luke Zettlemoyer Triviaqa A large scale distantly supervised challenge dataset for reading comprehension arXiv preprint arXiv170503551 (2017)

[4] Rajpurkar Pranav Jian Zhang Konstantin Lopyrev and Percy Liang Squad 100000+ questions for machine comprehension of text arXiv preprint arXiv160605250 (2016)

[5] The Stanford Question Answering Dataset 2018 RajpurkarGithubIo Accessed March 16 2018 httpsrajpurkargithubioSQuAD-explorer

[6] Hu Minghao Yuxing Peng and Xipeng Qiu Reinforced mnemonic reader for machine comprehension CoRR abs170502798 (2017)

[7] Schalkoff Robert J Artificial neural networks Vol 1 New York McGraw-Hill 1997

[8] Me For The AI and Neetesh Mehrotra 2018 The Connect Between Deep Learning And AI - Open Source For You Open Source For You Accessed March 16 2018 httpopensourceforucom201801connect-deep-learning-ai

[9] Krizhevsky Alex Ilya Sutskever and Geoffrey E Hinton Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems pp 1097-1105 2012

[10] An Intuitive Explanation Of Convolutional Neural Networks 2016 The Data Science Blog Accessed March 16 2018 httpsujjwalkarnme20160811intuitive-explanation-convnets

[11] Williams Ronald J and David Zipser A learning algorithm for continually running fully recurrent neural networks Neural computation 1 no 2 (1989) 270-280

[12] Hochreiter Sepp and Jürgen Schmidhuber Long short-term memory Neural computation 9 no 8 (1997) 1735-1780

[13] Chung Junyoung Caglar Gulcehre KyungHyun Cho and Yoshua Bengio Empirical evaluation of gated recurrent neural networks on sequence modeling arXiv preprint arXiv14123555 (2014)


[14] Mikolov Tomas Wen-tau Yih and Geoffrey Zweig Linguistic regularities in continuous space word representations In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics Human Language Technologies pp 746-751 2013

[15] Pennington Jeffrey Richard Socher and Christopher Manning Glove Global vectors for word representation In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) pp 1532-1543 2014

[16] Word2vec Tutorial - The Skip-Gram Model · Chris McCormick 2016 mccormickml.com Accessed March 16 2018 http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model

[17] Group 2015 [Paper Introduction] Bilingual Word Representations With Monolingual Quality in Mind slideshare.net Accessed March 16 2018 https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind

[18] Attention Mechanism 2016 BlogHeuritechCom Accessed March 16 2018 httpsblogheuritechcom20160120attention-mechanism

[19] Weston Jason Sumit Chopra and Antoine Bordes 2014 Memory Networks arXiv preprint arXiv14103916v11 Accessed March 16 2018 httpsarxivorgabs14103916

[20] Wang Shuohang and Jing Jiang Machine comprehension using match-lstm and answer pointer arXiv preprint arXiv160807905 (2016)

[21] Wang Shuohang and Jing Jiang Learning natural language inference with LSTM arXiv preprint arXiv151208849 (2015)

[22] Vinyals Oriol Meire Fortunato and Navdeep Jaitly Pointer networks In Advances in Neural Information Processing Systems pp 2692-2700 2015

[23] Wang Wenhui Nan Yang Furu Wei Baobao Chang and Ming Zhou Gated self-matching networks for reading comprehension and question answering In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1 Long Papers) vol 1 pp 189-198 2017

[24] Seo Minjoon Aniruddha Kembhavi Ali Farhadi and Hannaneh Hajishirzi Bidirectional attention flow for machine comprehension arXiv preprint arXiv161101603 (2016)

[25] Clark Christopher and Matt Gardner Simple and effective multi-paragraph reading comprehension arXiv preprint arXiv171010723 (2017)


[26] 8 Analyzing Sentence Structure 2018 NltkOrg Accessed March 16 2018 httpwwwnltkorgbookch08html


BIOGRAPHICAL SKETCH

Purnendu Mukherjee grew up in Kolkata, India, and had a strong interest in computers from an early age. After high school, he earned a B.Sc. in computer science from Ramakrishna Mission Residential College, Narendrapur, followed by an M.Sc. in computer science from St. Xavier's College, Kolkata. With a strong intuition and interest for human-like learning systems, he wanted to work in this area and started working at TCS Innovation Labs, Pune, on applications of Natural Language Processing in educational software. Deeply passionate about learning systems that mimic the human brain and learn like a human child does, he became increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science at the University of Florida, Gainesville. His academic interests have focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.

  • ACKNOWLEDGMENTS
  • LIST OF FIGURES
  • LIST OF ABBREVIATIONS
  • INTRODUCTION
  • THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
    • Neural Networks
    • Convolutional Neural Network
    • Recurrent Neural Networks (RNN)
    • Word Embedding
    • Attention Mechanism
    • Memory Networks
      • LITERATURE REVIEW AND STATE OF THE ART
        • Machine Comprehension Using Match-LSTM and Answer Pointer
        • R-NET Matching Reading Comprehension with Self-Matching Networks
        • Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
        • Summary
          • MULTI-ATTENTION QUESTION ANSWERING
            • Multi-Attention BiDAF Model
            • Chatbot Design Using a QA System
            • Online QA System and Attention Visualization
              • FUTURE DIRECTIONS AND CONCLUSION
                • Future Directions
                • Conclusion
                  • LIST OF REFERENCES
                  • BIOGRAPHICAL SKETCH
Page 15: QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISMufdcimages.uflib.ufl.edu/UF/E0/05/23/73/00001/MUKHERJEE_P.pdf · QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM By PURNENDU

15

When a new image is fed into the CNN model all the above-mentioned steps are

carried out (forward propagation) and a probability distribution is achieved on the set of

output classes With a large enough training dataset the network will learn and

generalize well enough to classify new images into their correct classes

Recurrent Neural Networks (RNN)

Whenever we want to predict or encode sequential data RNN [11] is our go to

method RNNs perform the same task for every element of a sequence where the

output of each element depends on previous computations thus the recurrence In

practice RNNs are unable to retain long-term dependencies and can look back only a

few steps because of the vanishing gradient problem

(2-1)

A solution to the dependency problem is to use gated cells such as LSTM [11] or

GRU [13] These cells pass on important information to the next cells while ignoring

non-important ones The gated units in a GRU block are

bull Update Gate ndash Computed based on current input and hidden state

(2-2)

bull Reset Gate ndash Calculated similarly but with different weights

(2-3)

bull New memory content - (2-4)

If reset gate unit is ~0 then previous memory is ignored and only new

information is kept

16

Final memory at current time step combines previous and current time steps

(2-5)

While the GRU is computationally efficient the LSTM on the other hand is a

general case where there are three gates as follows

bull Input Gate ndash What new information to add to the current cell state

bull Forget Gate ndash How much information from previous states to be kept

bull Output gate ndash How much info should be sent to the next states

Just like GRU the current cell state is a sum of the previous cell state but

weighted by the forget gate and the new value is added which is weighted by the input

gate Based on the cell state the output gate regulates the final output

Word Embedding

Computation or gradients can be applied on numbers and not on words or letters

So first we need to convert words into their corresponding numerical formation before

feeding into a deep learning model In general there are two types of word embedding

Frequency based (which constitutes count vectors tf-idf and co-occurrence vectors)

and Prediction based With frequency based embedding the order of the words are not

preserved and works as a bag of words model Whereas with prediction based model

the order of words or locality of words are taken into consideration to generate the

numerical representation of the word Within this prediction based category there are

two fundamental techniques called Continuous Bag of Words (CBOW) and Skip Gram

Model which forms the basis for word2vec [14] and GloVe [15]

The basic intuition behind word2vec is that if two different words have very

similar ldquocontextsrdquo (that is what words are likely to appear around them) then the model

17

will produce similar vector for those words Conversely if the two word vectors are

similar then the network will produce similar context predictions for the same two words

For examples synonyms like ldquointelligentrdquo and ldquosmartrdquo would have very similar contexts

Or that words that are related like ldquoenginerdquo and ldquotransmissionrdquo would probably have

similar contexts as well [16] Plotting the word vectors learned by a word2vec over a

large corpus we could find some very interesting relationships between words

Figure 2-3 Semantic relation between words in vector space [17]

Attention Mechanism

We as humans put our attention to things are important or are relevant in a

context For example when asked a question from a passage we try to find the most

relevant part of the passage the question is relevant with and then reason from our

understanding of that part of the passage The same idea applies for attention

mechanism in Deep Learning It is used to identify the specific parts of a given context

to which the current question is relevant to

Formally put the techniques take n arguments y_1 y_n (in our case the

passage having words say y_i through h_i) and a question word say q It returns a

vector z which is supposed to be the laquo summary raquo of the y_i focusing on information

linked to the question q More formally it returns a weighted arithmetic mean of the y_i

18

and the weights are chosen according the relevance of each y_i given the context c

[18]

Figure 2-4 Attention Mechanism flow [18]

Memory Networks

Convolutional Neural Networks and Recurrent Neural Networks which does

capture how we form our visual and sequential memories their memory (encoded by

hidden states and weights) were typically too small and was not compartmentalized

enough to accurately remember facts from the past (knowledge is compressed into

dense vectors) [19]

Deep Learning needed to cultivate a methodology that preserved memories as

they are such that it wonrsquot be lost in generalization and recalling exact words or

sequence of events would be possible mdash something computers are already good at This

effort led us to Memory Networks [19] published at ICLR 2015 by Facebook AI

Research

This paper provides a basic framework to store augment and retrieve memories

while seamlessly working with a Recurrent Neural Network architecture The memory

19

network consists of a memory m (an array of objects 1 indexed by m i) and four

(potentially learned) components I G O and R as follows

I (input feature map) mdash converts the incoming input to the internal feature

representation either a sparse or dense feature vector like that from word2vec or

GloVe

G (generalization) mdash updates old memories given the new input They call this

generalization as there is an opportunity for the network to compress and generalize its

memories at this stage for some intended future use The analogy Irsquove been talking

before

O (output feature map) mdash produces a new output (in the feature representation

space) given the new input and the current memory state This component is

responsible for performing inference In a question answering system this part will

select the candidate sentences (which might contain the answer) from the story

(conversation) so far

R (response) mdash converts the output into the response format desired For

example a textual response or an action In the QA system described this component

finds the desired answer and then converts it from feature representation to the actual

word

This model is a fully supervised model meaning all the candidate sentences from

which the answer could be found are marked during training phase and can also be

termed as lsquohard attentionrsquo

The authors tested out the QA system on various literature including Lord of the

Rings

20

Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]

21

CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART

There has been rapid progress since the release of the SQuAD dataset and

currently the best ensemble models are close to human level accuracy in machine

comprehension This is due to the various ingenious methods which solves some of the

problems with the previous methods Out of Vocabulary tokens were handled by using

Character embedding Long term dependency within context passage were solved

using self-attention And many other techniques such as Contextualized vectors History

of Words Attention Flow etc In this section we will have a look at the some of the most

important models that were fundamental to the progress of Questions Answering

Machine Comprehension Using Match-LSTM and Answer Pointer

In this paper [20] the authors propose an end-to-end neural architecture for the

QA task The architecture is based on match-LSTM [21] a model they proposed for

textual entailment and Pointer Net [22] a sequence-to-sequence model proposed by

Vinyals et al (2015) to constrain the output tokens to be from the input sequences

The model consists of an LSTM preprocessing layer a match-LSTM layer and an

Answer Pointer layer

We are given a piece of text which we refer to as a passage and a question

related to the passage The passage is represented by matrix P where P is the length

(number of tokens) of the passage and d is the dimensionality of word embeddings

Similarly the question is represented by matrix Q where Q is the length of the question

Our goal is to identify a subsequence from the passage as the answer to the question

22

Figure 3-1 Match-LSTM Model Architecture [20]

LSTM Preprocessing layer They use a standard one-directional LSTM

(Hochreiter amp Schmidhuber 1997) to process the passage and the question separately

as shown below

Match LSTM Layer They applied the match-LSTM model proposed for textual

entailment to their machine comprehension problem by treating the question as a

premise and the passage as a hypothesis The match-LSTM sequentially goes through

the passage At position i of the passage it first uses the standard word-by-word

attention mechanism to obtain attention weight vector as follows

(3-1)

23

where and are parameters to be learned

Answer Pointer Layer The Answer Pointer (Ans-Ptr) layer is motivated by the

Pointer Net introduced by Vinyals et al (2015) [22] The boundary model produces only

the start token and the end token of the answer and then all the tokens between these

two in the original passage are considered to be the answer

When this paper was released back in November 2016 Match-LSTM method

was the state of the art in Question Answering systems and was at the top of the

leaderboard for the SQuAD dataset

R-NET Matching Reading Comprehension with Self-Matching Networks

In this model [23] first the question and passage are processed by a

bidirectional recurrent network (Mikolov et al 2010) separately They then match the

question and passage with gated attention-based recurrent networks obtaining

question-aware representation for the passage On top of that they apply self-matching

attention to aggregate evidence from the whole passage and refine the passage

representation which is then fed into the output layer to predict the boundary of the

answer span

Question and passage encoding First the words are converted to their

respective word-level embeddings and character level embeddings The character-level

embeddings are generated by taking the final hidden states of a bi-directional recurrent

neural network (RNN) applied to embeddings of characters in the token Such

character-level embeddings have been shown to be helpful to deal with out-of-vocab

(OOV) tokens

24

They then use a bi-directional RNN to produce new representation and

of all words in the question and passage respectively

Figure 3-2 The task of Question Answering [23]

Gated Attention-based Recurrent Networks They use a variant of attention-

based recurrent networks with an additional gate to determine the importance of

information in the passage regarding a question Different from the gates in LSTM or

GRU the additional gate is based on the current passage word and its attention-pooling

vector of the question which focuses on the relation between the question and current

passage word The gate effectively model the phenomenon that only parts of the

passage are relevant to the question in reading comprehension and question answering

is utilized in subsequent calculations

25

Self-Matching Attention From the previous step the question aware passage

representation is generated to highlight the important parts of the passage One

problem with such representation is that it has very limited knowledge of context One

answer candidate is often oblivious to important cues in the passage outside its

surrounding window To address this problem the authors propose directly matching

the question-aware passage representation against itself It dynamically collects

evidence from the whole passage for words in passage and encodes the evidence

relevant to the current passage word and its matching question information into the

passage representation

Output Layer They use the same method as Wang amp Jiang (2016b) and use

pointer networks (Vinyals et al 2015) to predict the start and end position of the

answer In addition they use an attention-pooling over the question representation to

generate the initial hidden vector for the pointer network [23]

When the R-Net Model first appeared in the leaderboard in March 2017 it was at

the top with 723 Exact Match and 807 F1 score

Bi-Directional Attention Flow (BiDAF) for Machine Comprehension

BiDAF [24] is a hierarchical multi-stage architecture for modeling the

representations of the context paragraph at different levels of granularity BIDAF

includes character-level word-level and contextual embeddings and uses bi-directional

attention flow to obtain a query-aware context representation Their attention layer is not

used to summarize the context paragraph into a fixed-size vector Instead the attention

is computed for every time step and the attended vector at each time step along with

the representations from previous layers can flow through to the subsequent modeling

layer This reduces the information loss caused by early summarization [24]

26

Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]

Their machine comprehension model is a hierarchical multi-stage process and

consists of six layers

1 Character Embedding Layer maps each word to a vector space using character-level CNNs

2 Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model

3 Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embedding of the words These first three layers are applied to both the query and context They use a LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words We place an LSTM in both directions and concatenate the outputs of the two LSTMs

4 Attention Flow Layer couples the query and context vectors and produces a set of query aware feature vectors for each word in the context

5 Modeling Layer employs a Recurrent Neural Network to scan the context The input to the modeling layer is G which encodes the query-aware representations of context words The output of the modeling layer captures the interaction among the context words conditioned on the query They use two layers of bi-directional LSTM

27

with the output size of d for each direction Hence a matrix is obtained which is passed onto the output layer to predict the answer

6 Output Layer provides an answer to the query They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices by the predicted distributions averaged over all examples [25]

In a further variation of their above work they add a self-attention layer after the

Bi-attention layer to further improve the results The architecture of the model is as

Figure 3-3 The task of Question Answering [25]

28

Summary

In this chapter we reviewed the methods that are fundamental to the state of the

art in Machine Comprehension and for the task of Question Answering We have

reached closed to human level accuracy and this is due to incremental developments

over previous models As we saw Out of Vocabulary (OOV) tokens were handled by

using Character embedding Long term dependency within context passage were

solved using self-attention and many other techniques such as Contextualized vectors

Attention Flow etc were employed to get better results In the next chapter we will see

how we can build on these models and develop further

29

CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING

Although the advancements in Question Answering systems since the release of

the SQuAD dataset has been impressive and the results are getting close to human

level accuracy it is far from being a fool-proof system The models still make mistakes

which would be obvious to a human For example

Passage The Panthers used the San Jose State practice facility and stayed at

the San Jose Marriott The Broncos practiced at Florida State Facility and stayed at the

Santa Clara Marriott

Question At what universitys facility did the Panthers practice

Actual Answer San Jose State

Predicted Answer Florida State Facility

To find out what is leading to the wrong predictions we wanted to see the

attention weights associated with such an example We plotted the passage and

question heat map which is a 2D matrix where the intensity of each cell signifies the

similarity between a passage word and a question word For the above example we

found out that while certain words of the question are given high weightage other parts

are not The words lsquoAtrsquo lsquofacilityrsquo lsquopracticersquo are given high attention but lsquoPanthersrsquo does

not receive high attention If it had received high attention then the system would have

predicted lsquoSan Jose Statersquo as the right answer To solve this issue we analyzed the

base BiDAF model and proposed adding two things

1 Bi-Attention and Self-Attention over Query

2 Second level attention over the output of (Bi-Attention + Self-Attention) from both Context and Query

30

Multi-Attention BiDAF Model

Our comprehension model is a hierarchical multi-stage process and consists of

the following layers

1 Embedding Just as all other models we embed words using pretrained word vectors We also embed the characters in each word into size 20 vectors which are learned and run a convolution neural network followed by max-pooling to get character-derived embeddings for each word The character-level and word-level embeddings are then concatenated and passed to the next layer We do not update the word embeddings during training

2 Pre-Process A shared bi-directional GRU (Cho et al 2014) is used to map the question and passage embeddings to context aware embeddings

3 Context Attention The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al 2016) is used to build a query-aware context representation Let h_i be the vector for context word i q_j be the vector for question word j and n_q and n_c be the lengths of the question and context respectively We compute attention between context word i and question word j as

(4-1)

where w1 w2 and w3 are learned vectors and is element-wise multiplication We then compute an attended vector c_i for each context token as

(4-2)

We also compute a query-to-context vector q_c

(4-3)

31

The final vector computed for each token is built by concatenating

and In our model we subsequently pass the result through a linear layer with ReLU activations

4 Context Self-Attention Next we use a layer of residual self-attention The input is passed through another bi-directional GRU Then we apply the same attention mechanism only now between the passage and itself In this case we do not use query-to context attention and we set aij = 1048576inf if i = j As before we pass the concatenated output through a linear layer with ReLU activations This layer is applied residually so this output is additionally summed with the input

5. Query Attention: For this part we proceed in the same way as in the context attention layer, but we calculate the weighted sum of the context words for each query word, so the output length equals the number of query words. We then calculate a context-to-query vector analogous to the query-to-context vector of the context attention layer.

Figure 4-1 The modified BiDAF model with multilevel attention

6. Query Self-Attention: This part is done the same way as the context self-attention layer, but on the output of the Query Attention layer.

7. Context-Query Bi-Attention + Self-Attention: The outputs of the Context Self-Attention and Query Self-Attention layers are taken as input, and the same process of bi-attention and self-attention is applied to these inputs.

8. Prediction: In the last layer of our model, a bidirectional GRU is applied, followed by a linear layer that computes answer start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log likelihood of selecting correct start and end tokens.
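As referenced at the end of step 3, the following is a minimal, unbatched NumPy sketch of the bi-directional attention in Equations 4-1 to 4-3. It is illustrative only: the vectors and weights here are random stand-ins, whereas in the model h_i and q_j come from the shared GRU and w1, w2, w3 are learned.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bi_attention(H, Q, w1, w2, w3):
    # H: (n_c, d) context vectors h_i;  Q: (n_q, d) question vectors q_j
    # w1, w2, w3: (d,) weight vectors (learned in the real model)
    # Equation 4-1: a_ij = w1.h_i + w2.q_j + w3.(h_i * q_j)
    a = H @ w1[:, None] + (Q @ w2)[None, :] + (H * w3) @ Q.T      # (n_c, n_q)
    # Equation 4-2: attended vector c_i for each context token
    p = softmax(a, axis=1)                 # normalize over question words
    C = p @ Q                              # row i is c_i
    # Equation 4-3: query-to-context vector q_c
    m = a.max(axis=1)                      # best question match per context word
    q_c = softmax(m) @ H
    return C, q_c

# Toy usage with random stand-ins for the learned quantities.
d, n_c, n_q = 8, 5, 3
rng = np.random.default_rng(0)
H, Q = rng.normal(size=(n_c, d)), rng.normal(size=(n_q, d))
w1, w2, w3 = rng.normal(size=(3, d))
C, q_c = bi_attention(H, Q, w1, w2, w3)
print(C.shape, q_c.shape)                  # (5, 8) (8,)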

Having carried out this modification, we were able to solve the wrong example we started with: the multilevel attention model gives the correct output, "San Jose State". We also achieved slightly better scores than the original model, with an F1 score of 85.44 on the SQuAD dev dataset.

Chatbot Design Using a QA System

Designing a perfect chatbot that passes the Turing test is a fundamental goal for Artificial Intelligence. Although we are many orders of magnitude away from achieving such a goal, domain-specific tasks can be solved with chatbots built from current technology. We started with a similar objective in mind, i.e. to design a domain-specific chatbot and then generalize to other areas once it achieves the first domain-specific objective robustly. This led us to the fundamental problem of Machine Comprehension and subsequently to the task of Question Answering. Having achieved some degree of success with QA systems, we looked back to see whether we could apply our newly acquired knowledge to the task of designing chatbots.

The chatbots made with today's technologies mostly rely on handcrafted techniques such as template matching, which requires anticipating all possible ways a user may articulate his requirements and a conversation may unfold. This requires a lot of man-hours to design a domain-specific system and is still very error prone. In this section we propose a general chatbot design that makes designing a domain-specific chatbot easy and robust at the same time.

Every domain-specific chatbot needs to obtain a set of information from the user and show some results based on the user-specific information obtained. Traditional chatbots use template matching and keyword lookup to determine whether the user has provided the required information. Our idea is to use the Question Answering system in the backend to extract the required information from whatever the user has typed up to this point of the conversation. The information to be extracted can be posed as a set of questions, and the answers obtained from those questions can be used as the parameters to supply the relevant information to the user.

We chose flight reservation as our chatbot domain. Our goal was to extract the required information from the user to be able to show the available flights as per the user's requirements.

Figure 4-2 Flight reservation chatbot's chat window

For a flight reservation task, the booking agent needs to know at minimum the origin city, the destination city and the date of travel to be able to show the available flights. Optional information includes the number of tickets, the passenger's name, one way or round trip, etc.

The minimalistic conversation with the user through the chat window would be as shown above. We had a platform called OneTask on which we wanted to implement our chatbot. The chat interface within the OneTask system looks as follows:

Figure 4-3 Chatbot within OneTask system

The working of the chat system is as follows:

1. Initiation: The user opens the chat window, which starts a session with the chatbot. The chatbot asks 'How may I help you?'

2. User Reply: The user may reply with none of the required information for flight booking, or may provide multiple pieces of information in the same message.

3. User Reply Parsing: The conversation up to this point is treated as a passage, and the internal questions are run on this passage. The questions that are run are:

Where do you want to go?

From where do you want to leave?

When do you want to depart?

4. Parsed responses from the QA model: After running the questions on the QA system with the given conversation, answers are obtained for all of them along with their corresponding confidence values. Since answers will be obtained even if the required question has not been answered up to this point in the conversation, the confidence values play a crucial role in determining the validity of an answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly, a confidence of 2 to 10 signifies that it may have been answered but should be verified with the user for correctness, and any confidence below 2 is discarded.

5. Asking remaining questions iteratively: After the parsing, it is checked whether any of the required questions are still unanswered. If so, the chatbot asks the remaining question and the process from steps 3 to 4 is carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request (a code sketch of this loop follows Figure 4-4 below).

Figure 4-4 The Flow diagram of the Flight booking Chatbot system
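The flow in Figure 4-4 can be written out as a short slot-filling loop. The sketch below is illustrative only: qa_model.predict(passage, question) is a hypothetical stand-in for the actual QA system, the read/send callbacks stand in for the chat interface, and verification of medium-confidence answers is omitted for brevity. The thresholds follow the ranges given in step 4.

REQUIRED_QUESTIONS = {
    "destination": "Where do you want to go?",
    "origin": "From where do you want to leave?",
    "date": "When do you want to depart?",
}

def parse_conversation(conversation, qa_model):
    # Run every internal booking question over the conversation so far.
    # qa_model.predict(passage, question) is assumed to return
    # (answer_text, confidence); swap in the actual QA system here.
    slots, to_verify = {}, {}
    for slot, question in REQUIRED_QUESTIONS.items():
        answer, confidence = qa_model.predict(conversation, question)
        if confidence > 10:        # confidently answered
            slots[slot] = answer
        elif confidence >= 2:      # plausible, but should be confirmed with the user
            to_verify[slot] = answer
        # below 2: treated as unanswered and asked again later
    return slots, to_verify

def chat_loop(qa_model, read_user_message, send_bot_message):
    # Minimal dialogue loop: keep asking until all required slots are filled.
    send_bot_message("How may I help you?")
    conversation, slots = "", {}
    while set(slots) != set(REQUIRED_QUESTIONS):
        conversation += " " + read_user_message()
        slots, _to_verify = parse_conversation(conversation, qa_model)
        missing = [s for s in REQUIRED_QUESTIONS if s not in slots]
        if missing:
            send_bot_message(REQUIRED_QUESTIONS[missing[0]])
    return slots    # origin, destination and date: enough to look up flights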

Online QA System and Attention Visualization

To be able to test out various examples, we set up an online demo using the BiDAF [24] model. One can either choose from the available examples in the drop-down menu or paste in their own passage and questions. While this is a useful and interesting system for testing the model in a user-friendly way, we created it primarily to be able to focus on the wrong samples.

Figure 4-5 QA system interface with attention highlight over candidate answers

The candidate answers are shown with blue highlights according to their confidence values; the higher the confidence, the darker the highlight. The candidate with the highest confidence value is chosen as the predicted answer. We developed the system to show the attention spread over the candidate answers in order to understand what needs to be done to improve the system. This led us to realize the importance of adding the query attention part as well as multilevel attention to the BiDAF model, as described in the first section of this chapter.
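As a rough illustration of the highlighting scheme (not the demo's actual rendering code), the snippet below maps each candidate answer's confidence to the opacity of a blue HTML highlight; the character offsets and confidence values are assumed to come from the QA model.

import html

def highlight_candidates(passage, candidates):
    # candidates: list of (start_char, end_char, confidence) tuples over the
    # passage, assumed non-overlapping and sorted by start position.
    max_conf = max(conf for _, _, conf in candidates)
    out, cursor = [], 0
    for start, end, conf in candidates:
        out.append(html.escape(passage[cursor:start]))
        alpha = conf / max_conf            # the darkest highlight is the prediction
        out.append('<span style="background: rgba(0, 90, 255, %.2f)">%s</span>'
                   % (alpha, html.escape(passage[start:end])))
        cursor = end
    out.append(html.escape(passage[cursor:]))
    return "".join(out)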


CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION

Future Directions

Having done a thorough analysis of the current methods and the state of the art in Machine Comprehension and the QA task, and having developed systems on it, we have gained a strong sense of what needs to be done to further improve QA models. After observing the wrong samples, we can see that the system is still unable to encode meaning and is picking answers based on statistical patterns of answer occurrence in the training examples.

Looking at the state-of-the-art models and the ongoing research literature, it is easy to conclude that more features need to be embedded to encode the meaning of words, phrases and sentences. The Reinforced Mnemonic Reader for Machine Comprehension [6] encoded POS and NER tags of words along with their word and character embeddings, which gave them better results.

Figure 5-1 An English language semantic parse tree [26]

We have developed a method to encode the syntax parse tree of a sentence that encodes not only the POS tags but also the relation of each word within its phrase and the relation of each phrase within the whole sentence, in a hierarchical manner.
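To make the kind of structure involved concrete, the snippet below uses NLTK (the toolkit behind Figure 5-1 [26]) to POS-tag a sentence and chunk it into phrases with a small hand-written grammar. It is only an illustration of the tags and phrase memberships that such features would encode, not our actual encoding method.

import nltk

# Assumes the NLTK tokenizer and tagger models have been downloaded
# beforehand (e.g. via nltk.download()).
sentence = "The Panthers used the San Jose State practice facility."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))   # per-word POS tags

# A small chunk grammar groups tagged words into noun and verb phrases,
# giving a shallow hierarchical structure over the sentence.
grammar = r"""
  NP: {<DT>?<JJ>*<NN.*>+}    # determiner + adjectives + nouns
  VP: {<VB.*><NP>}           # verb followed by a noun phrase
"""
tree = nltk.RegexpParser(grammar).parse(tagged)
print(tree)

# Each word can then be paired with its POS tag and the label of the phrase
# it belongs to, and those labels embedded as additional input features.
for subtree in tree.subtrees():
    if subtree.label() in ("NP", "VP"):
        print(subtree.label(), [word for word, tag in subtree.leaves()])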

Finally, data augmentation is another way to get better results. One definite way to reduce the errors would be to include in the training data samples similar to those the system is faltering on in the dev set. One could generate examples similar to the failure cases and include them in the training set for better prediction. Another approach would be to train on similar and bigger datasets. Our models were trained on the SQuAD dataset, but there are other datasets that address a similar question answering task, such as TriviaQA. We could augment the training data with TriviaQA along with SQuAD to obtain a more robust system that generalizes better and thus achieves higher accuracy in predicting answer spans.

Conclusion

In this work we have explored the most fundamental techniques that have shaped the current state of the art. We then proposed a minor architectural improvement over an existing model. Furthermore, we developed two applications that use the base model: first, we described how a chatbot application can be built using the QA system, and second, we created a web interface where the model can be used for any passage and question. This interface also shows the attention spread over the candidate answers. While our effort to push the state of the art forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will pay high dividends for society at large.

LIST OF REFERENCES

[1] Danqi Chen. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U

[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.

[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension." arXiv preprint arXiv:1705.03551 (2017).

[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ questions for machine comprehension of text." arXiv preprint arXiv:1606.05250 (2016).

[5] "The Stanford Question Answering Dataset." 2018. rajpurkar.github.io. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/

[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced mnemonic reader for machine comprehension." CoRR abs/1705.02798 (2017).

[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.

[8] Me For The AI, and Neetesh Mehrotra. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai/

[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.

[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/

[11] Williams, Ronald J., and David Zipser. "A learning algorithm for continually running fully recurrent neural networks." Neural Computation 1, no. 2 (1989): 270-280.

[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9, no. 8 (1997): 1735-1780.

[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv preprint arXiv:1412.3555 (2014).

[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic regularities in continuous space word representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.

[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.

[16] "Word2Vec Tutorial - The Skip-Gram Model." Chris McCormick. 2016. mccormickml.com. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

[17] Group. 2015. "[Paper Introduction] Bilingual Word Representations with Monolingual ..." slideshare.net. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind

[18] "Attention Mechanism." 2016. blog.heuritech.com. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism/

[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. 2014. "Memory Networks." arXiv preprint arXiv:1410.3916v11. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916

[20] Wang, Shuohang, and Jing Jiang. "Machine comprehension using match-LSTM and answer pointer." arXiv preprint arXiv:1608.07905 (2016).

[21] Wang, Shuohang, and Jing Jiang. "Learning natural language inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).

[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.

[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated self-matching networks for reading comprehension and question answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.

[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional attention flow for machine comprehension." arXiv preprint arXiv:1611.01603 (2016).

[25] Clark, Christopher, and Matt Gardner. "Simple and effective multi-paragraph reading comprehension." arXiv preprint arXiv:1710.10723 (2017).

[26] "8. Analyzing Sentence Structure." 2018. nltk.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html

BIOGRAPHICAL SKETCH

Purnendu Mukherjee grew up in Kolkata, India, and had a strong interest in computers from an early age. After high school, he did his B.Sc. in computer science from Ramakrishna Mission Residential College, Narendrapur, followed by an M.Sc. in computer science from St. Xavier's College, Kolkata. He had a strong intuition for and interest in human-like learning systems and wanted to work in this area. He started working at TCS Innovation Labs, Pune, on the application of Natural Language Processing in educational applications. As he was deeply passionate about learning systems that mimic the human brain and learn like a human child does, he became increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science from the University of Florida, Gainesville. His academic interests have been focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.

Page 16: QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISMufdcimages.uflib.ufl.edu/UF/E0/05/23/73/00001/MUKHERJEE_P.pdf · QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM By PURNENDU

16

Final memory at current time step combines previous and current time steps

(2-5)

While the GRU is computationally efficient the LSTM on the other hand is a

general case where there are three gates as follows

bull Input Gate ndash What new information to add to the current cell state

bull Forget Gate ndash How much information from previous states to be kept

bull Output gate ndash How much info should be sent to the next states

Just like GRU the current cell state is a sum of the previous cell state but

weighted by the forget gate and the new value is added which is weighted by the input

gate Based on the cell state the output gate regulates the final output

Word Embedding

Computation or gradients can be applied on numbers and not on words or letters

So first we need to convert words into their corresponding numerical formation before

feeding into a deep learning model In general there are two types of word embedding

Frequency based (which constitutes count vectors tf-idf and co-occurrence vectors)

and Prediction based With frequency based embedding the order of the words are not

preserved and works as a bag of words model Whereas with prediction based model

the order of words or locality of words are taken into consideration to generate the

numerical representation of the word Within this prediction based category there are

two fundamental techniques called Continuous Bag of Words (CBOW) and Skip Gram

Model which forms the basis for word2vec [14] and GloVe [15]

The basic intuition behind word2vec is that if two different words have very

similar ldquocontextsrdquo (that is what words are likely to appear around them) then the model

17

will produce similar vector for those words Conversely if the two word vectors are

similar then the network will produce similar context predictions for the same two words

For examples synonyms like ldquointelligentrdquo and ldquosmartrdquo would have very similar contexts

Or that words that are related like ldquoenginerdquo and ldquotransmissionrdquo would probably have

similar contexts as well [16] Plotting the word vectors learned by a word2vec over a

large corpus we could find some very interesting relationships between words

Figure 2-3 Semantic relation between words in vector space [17]

Attention Mechanism

We as humans put our attention to things are important or are relevant in a

context For example when asked a question from a passage we try to find the most

relevant part of the passage the question is relevant with and then reason from our

understanding of that part of the passage The same idea applies for attention

mechanism in Deep Learning It is used to identify the specific parts of a given context

to which the current question is relevant to

Formally put the techniques take n arguments y_1 y_n (in our case the

passage having words say y_i through h_i) and a question word say q It returns a

vector z which is supposed to be the laquo summary raquo of the y_i focusing on information

linked to the question q More formally it returns a weighted arithmetic mean of the y_i

18

and the weights are chosen according the relevance of each y_i given the context c

[18]

Figure 2-4 Attention Mechanism flow [18]

Memory Networks

Convolutional Neural Networks and Recurrent Neural Networks which does

capture how we form our visual and sequential memories their memory (encoded by

hidden states and weights) were typically too small and was not compartmentalized

enough to accurately remember facts from the past (knowledge is compressed into

dense vectors) [19]

Deep Learning needed to cultivate a methodology that preserved memories as

they are such that it wonrsquot be lost in generalization and recalling exact words or

sequence of events would be possible mdash something computers are already good at This

effort led us to Memory Networks [19] published at ICLR 2015 by Facebook AI

Research

This paper provides a basic framework to store augment and retrieve memories

while seamlessly working with a Recurrent Neural Network architecture The memory

19

network consists of a memory m (an array of objects 1 indexed by m i) and four

(potentially learned) components I G O and R as follows

I (input feature map) mdash converts the incoming input to the internal feature

representation either a sparse or dense feature vector like that from word2vec or

GloVe

G (generalization) mdash updates old memories given the new input They call this

generalization as there is an opportunity for the network to compress and generalize its

memories at this stage for some intended future use The analogy Irsquove been talking

before

O (output feature map) mdash produces a new output (in the feature representation

space) given the new input and the current memory state This component is

responsible for performing inference In a question answering system this part will

select the candidate sentences (which might contain the answer) from the story

(conversation) so far

R (response) mdash converts the output into the response format desired For

example a textual response or an action In the QA system described this component

finds the desired answer and then converts it from feature representation to the actual

word

This model is a fully supervised model meaning all the candidate sentences from

which the answer could be found are marked during training phase and can also be

termed as lsquohard attentionrsquo

The authors tested out the QA system on various literature including Lord of the

Rings

20

Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]

21

CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART

There has been rapid progress since the release of the SQuAD dataset and

currently the best ensemble models are close to human level accuracy in machine

comprehension This is due to the various ingenious methods which solves some of the

problems with the previous methods Out of Vocabulary tokens were handled by using

Character embedding Long term dependency within context passage were solved

using self-attention And many other techniques such as Contextualized vectors History

of Words Attention Flow etc In this section we will have a look at the some of the most

important models that were fundamental to the progress of Questions Answering

Machine Comprehension Using Match-LSTM and Answer Pointer

In this paper [20] the authors propose an end-to-end neural architecture for the

QA task The architecture is based on match-LSTM [21] a model they proposed for

textual entailment and Pointer Net [22] a sequence-to-sequence model proposed by

Vinyals et al (2015) to constrain the output tokens to be from the input sequences

The model consists of an LSTM preprocessing layer a match-LSTM layer and an

Answer Pointer layer

We are given a piece of text which we refer to as a passage and a question

related to the passage The passage is represented by matrix P where P is the length

(number of tokens) of the passage and d is the dimensionality of word embeddings

Similarly the question is represented by matrix Q where Q is the length of the question

Our goal is to identify a subsequence from the passage as the answer to the question

22

Figure 3-1 Match-LSTM Model Architecture [20]

LSTM Preprocessing layer They use a standard one-directional LSTM

(Hochreiter amp Schmidhuber 1997) to process the passage and the question separately

as shown below

Match LSTM Layer They applied the match-LSTM model proposed for textual

entailment to their machine comprehension problem by treating the question as a

premise and the passage as a hypothesis The match-LSTM sequentially goes through

the passage At position i of the passage it first uses the standard word-by-word

attention mechanism to obtain attention weight vector as follows

(3-1)

23

where and are parameters to be learned

Answer Pointer Layer The Answer Pointer (Ans-Ptr) layer is motivated by the

Pointer Net introduced by Vinyals et al (2015) [22] The boundary model produces only

the start token and the end token of the answer and then all the tokens between these

two in the original passage are considered to be the answer

When this paper was released back in November 2016 Match-LSTM method

was the state of the art in Question Answering systems and was at the top of the

leaderboard for the SQuAD dataset

R-NET Matching Reading Comprehension with Self-Matching Networks

In this model [23] first the question and passage are processed by a

bidirectional recurrent network (Mikolov et al 2010) separately They then match the

question and passage with gated attention-based recurrent networks obtaining

question-aware representation for the passage On top of that they apply self-matching

attention to aggregate evidence from the whole passage and refine the passage

representation which is then fed into the output layer to predict the boundary of the

answer span

Question and passage encoding First the words are converted to their

respective word-level embeddings and character level embeddings The character-level

embeddings are generated by taking the final hidden states of a bi-directional recurrent

neural network (RNN) applied to embeddings of characters in the token Such

character-level embeddings have been shown to be helpful to deal with out-of-vocab

(OOV) tokens

24

They then use a bi-directional RNN to produce new representation and

of all words in the question and passage respectively

Figure 3-2 The task of Question Answering [23]

Gated Attention-based Recurrent Networks They use a variant of attention-

based recurrent networks with an additional gate to determine the importance of

information in the passage regarding a question Different from the gates in LSTM or

GRU the additional gate is based on the current passage word and its attention-pooling

vector of the question which focuses on the relation between the question and current

passage word The gate effectively model the phenomenon that only parts of the

passage are relevant to the question in reading comprehension and question answering

is utilized in subsequent calculations

25

Self-Matching Attention From the previous step the question aware passage

representation is generated to highlight the important parts of the passage One

problem with such representation is that it has very limited knowledge of context One

answer candidate is often oblivious to important cues in the passage outside its

surrounding window To address this problem the authors propose directly matching

the question-aware passage representation against itself It dynamically collects

evidence from the whole passage for words in passage and encodes the evidence

relevant to the current passage word and its matching question information into the

passage representation

Output Layer They use the same method as Wang amp Jiang (2016b) and use

pointer networks (Vinyals et al 2015) to predict the start and end position of the

answer In addition they use an attention-pooling over the question representation to

generate the initial hidden vector for the pointer network [23]

When the R-Net Model first appeared in the leaderboard in March 2017 it was at

the top with 723 Exact Match and 807 F1 score

Bi-Directional Attention Flow (BiDAF) for Machine Comprehension

BiDAF [24] is a hierarchical multi-stage architecture for modeling the

representations of the context paragraph at different levels of granularity BIDAF

includes character-level word-level and contextual embeddings and uses bi-directional

attention flow to obtain a query-aware context representation Their attention layer is not

used to summarize the context paragraph into a fixed-size vector Instead the attention

is computed for every time step and the attended vector at each time step along with

the representations from previous layers can flow through to the subsequent modeling

layer This reduces the information loss caused by early summarization [24]

26

Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]

Their machine comprehension model is a hierarchical multi-stage process and

consists of six layers

1 Character Embedding Layer maps each word to a vector space using character-level CNNs

2 Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model

3 Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embedding of the words These first three layers are applied to both the query and context They use a LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words We place an LSTM in both directions and concatenate the outputs of the two LSTMs

4 Attention Flow Layer couples the query and context vectors and produces a set of query aware feature vectors for each word in the context

5 Modeling Layer employs a Recurrent Neural Network to scan the context The input to the modeling layer is G which encodes the query-aware representations of context words The output of the modeling layer captures the interaction among the context words conditioned on the query They use two layers of bi-directional LSTM

27

with the output size of d for each direction Hence a matrix is obtained which is passed onto the output layer to predict the answer

6 Output Layer provides an answer to the query They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices by the predicted distributions averaged over all examples [25]

In a further variation of their above work they add a self-attention layer after the

Bi-attention layer to further improve the results The architecture of the model is as

Figure 3-3 The task of Question Answering [25]

28

Summary

In this chapter we reviewed the methods that are fundamental to the state of the

art in Machine Comprehension and for the task of Question Answering We have

reached closed to human level accuracy and this is due to incremental developments

over previous models As we saw Out of Vocabulary (OOV) tokens were handled by

using Character embedding Long term dependency within context passage were

solved using self-attention and many other techniques such as Contextualized vectors

Attention Flow etc were employed to get better results In the next chapter we will see

how we can build on these models and develop further

29

CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING

Although the advancements in Question Answering systems since the release of

the SQuAD dataset has been impressive and the results are getting close to human

level accuracy it is far from being a fool-proof system The models still make mistakes

which would be obvious to a human For example

Passage The Panthers used the San Jose State practice facility and stayed at

the San Jose Marriott The Broncos practiced at Florida State Facility and stayed at the

Santa Clara Marriott

Question At what universitys facility did the Panthers practice

Actual Answer San Jose State

Predicted Answer Florida State Facility

To find out what is leading to the wrong predictions we wanted to see the

attention weights associated with such an example We plotted the passage and

question heat map which is a 2D matrix where the intensity of each cell signifies the

similarity between a passage word and a question word For the above example we

found out that while certain words of the question are given high weightage other parts

are not The words lsquoAtrsquo lsquofacilityrsquo lsquopracticersquo are given high attention but lsquoPanthersrsquo does

not receive high attention If it had received high attention then the system would have

predicted lsquoSan Jose Statersquo as the right answer To solve this issue we analyzed the

base BiDAF model and proposed adding two things

1 Bi-Attention and Self-Attention over Query

2 Second level attention over the output of (Bi-Attention + Self-Attention) from both Context and Query

30

Multi-Attention BiDAF Model

Our comprehension model is a hierarchical multi-stage process and consists of

the following layers

1 Embedding Just as all other models we embed words using pretrained word vectors We also embed the characters in each word into size 20 vectors which are learned and run a convolution neural network followed by max-pooling to get character-derived embeddings for each word The character-level and word-level embeddings are then concatenated and passed to the next layer We do not update the word embeddings during training

2 Pre-Process A shared bi-directional GRU (Cho et al 2014) is used to map the question and passage embeddings to context aware embeddings

3 Context Attention The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al 2016) is used to build a query-aware context representation Let h_i be the vector for context word i q_j be the vector for question word j and n_q and n_c be the lengths of the question and context respectively We compute attention between context word i and question word j as

(4-1)

where w1 w2 and w3 are learned vectors and is element-wise multiplication We then compute an attended vector c_i for each context token as

(4-2)

We also compute a query-to-context vector q_c

(4-3)

31

The final vector computed for each token is built by concatenating

and In our model we subsequently pass the result through a linear layer with ReLU activations

4 Context Self-Attention Next we use a layer of residual self-attention The input is passed through another bi-directional GRU Then we apply the same attention mechanism only now between the passage and itself In this case we do not use query-to context attention and we set aij = 1048576inf if i = j As before we pass the concatenated output through a linear layer with ReLU activations This layer is applied residually so this output is additionally summed with the input

5 Query Attention For this part we do the same way as context attention but calculate the weighted sum of the context words for each query word Thus we get the length as number of query words Then we calculate context to query similar to query to context in context attention layer

Figure 4-1 The modified BiDAF model with multilevel attention

6 Query Self-Attention This part is done the same way as the context self-attention layer but from the output of the Query Attention layer

32

7 Context Query Bi-Attention + Self-Attention The output of the Context self-attention and the Query Self-Attention layers are taken as input and the same process for Bi-Attention and self-attention is applied on these inputs

8 Prediction In the last layer of our model a bidirectional GRU is applied followed by a linear layer that computes answer start scores for each token The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores The softmax operation is applied to the start and end scores to produce start and end probabilities and we optimize the negative log likelihood of selecting correct start and end tokens

Having carried out this modification we were able to solve the wrong example we

started with The multilevel attention model gives the correct output as ldquoSan Jose

Staterdquo Also we achieved slightly better scores than the original model with a F1 score

of 8544 on the SQuAD dev dataset

Chatbot Design Using a QA System

Designing a perfect chatbot that passes the Turing test is a fundamental goal for

Artificial Intelligence Although we are many order to magnitudes away from achieving

such a goal domain specific tasks can be solved with chatbots made from current

technology We had started our goal with a similar objective in mind ie to design a

domain specific chatbot and then generalize to other areas as it is able to achieve the

first domain specific objective robustly This led us to the fundamental problem of

Machine Comprehension and subsequently to the task of Question Answering Having

achieved some degree of success with QA systems we looked back if we could apply

our newly acquired knowledge in the task of designing Chatbots

The chatbots made with todayrsquos technologies are mostly handcrafted techniques

such as template matching that requires anticipating all possible ways a user may

articulate his requirements and a conversation may occur This requires a lot of man

hours for designing a domain specific system and is still very error prone In this section

33

we propose a general Chatbot design that would make the designing of a domain

specific chatbot very easy and robust at the same time

Every domain specific chatbot needs to obtain a set of information from the user

and show some results based on the user specific information obtained The traditional

chatbots use template matching and keywords lookup to determine if the user has

provided the required information Our idea is to use the Question Answering system in

the backend to extract out the required information from whatever the user has typed

until this point of the conversation The information to be extracted can be posed in the

form of a set of questions and the answers obtained from those questions can be used

as the parameters to supply the relevant information to the user

We had chosen our chatbot domain as the flight reservation system Our goal

was to extract the required information from the user to be able to show him the

available flights as per the userrsquos requirements

Figure 4-2 Flight reservation chatbotrsquos chat window

34

For a flight reservation task the booking agent needs to know the origin city

destination city and date of travel at minimum to be able to show the available flights

Optional information includes the number of tickets passengerrsquos name one way or

round trip etc

The minimalistic conversation with the user through the chat window would be as

shown above We had a platform called OneTask on which we wanted to implement our

chat bot The chat interface within the OneTask system looks as follows

Figure 4-3 Chatbot within OneTask system

The working of the chat system is as follows ndash

1 Initiation The user opens the chat window which starts a session with the chat bot The chat bot asks lsquoHow may I help yoursquo

2 User Reply The user may reply with none of the required information for flight booking or may reply with multiple information in the same message

3 User Reply parsing The conversation up to this point is treated as a passage and the internal questions are run on this passage So the four questions that are run are

Where do you want to go

35

From where do you want to leave

When do you want to depart

4 Parsed responses from QA model After running the questions on the QA system with the given conversation answers are obtained for all along with their corresponding confidence values Since answers will be obtained even if the required question was not answered up to this point in the conversation the confidence values plays a crucial role in determining the validity of the answer For this purpose we determined ranges of confidence values for validation A confidence value above 10 signifies the question has been answered correctly A confidence of 2 ndash 10 signifies that it may have been answered but should verify with the user for correctness and any confidence below 2 is discarded

5 Asking remaining questions iteratively After the parsing it is checked if any of the required questions are still unanswered If so the chatbot asks the remaining question and the process from 3 ndash 4 is carried out iteratively Once all the questions have been answered the user is shown with the available flight options as per his request

Figure 4-4 The Flow diagram of the Flight booking Chatbot system

Online QA System and Attention Visualization

To be able to test out various examples we used the BiDAF [23] model for an

online demo One can either choose from the available examples from the drop-down

menu or paste their own passage and examples While this is a useful and interesting

system to test the model in a user-friendly way we created this system to be able to

focus on the wrong samples

36

Figure 4-5 QA system interface with attention highlight over candidate answers

The candidate answers are shown with the blue highlights as per their

confidence values The higher the confidence the darker the answer The highest

confidence value is chosen as the predicted answer We developed the system to show

attention spread of the candidate answers to realize what needs to be done to improve

the system This led us to realize the importance of including the query attention part as

well as multilevel attention on the BiDAF model as described in the first section of this

chapter

37

CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION

Future Directions

Having done a thorough analysis of the current methods and state of the arts in

Machine Comprehension and for the QA task and developing systems on it we have

achieved a strong sense of what needs to be done to further improve the QA models

After observing the wrong samples we can see that the system is still unable to encode

meaning and is picking answers based on statistical patterns of occurrence of answers

on training examples

As per the State of the Art models and analyzing the ongoing research literature

it is easy to conclude that more features need to be embedded to encode the meaning

of words phrases and sentences A paper called Reinforced Mnemonic Reader for

Machine Comprehension encoded POS and NER tags of words along with their word

and character embedding This gave them better results

Figure 5-1 An English language semantic parse tree [26]

38

We have developed a method to encode the syntax parse tree of a sentence that

not only encodes the post tags but the relation of the word within the phrase and the

relation of the phrase within the whole sentence in a hierarchical manner

Finally data augmentation is another solution to get better results One definite

way to reduce the errors would be to include similar samples in the training data which

the system is faltering in the dev set One could generate similar examples as the failure

cases and include them in the training set to have better prediction Another system

would be to train to similar and bigger datasets Our models were trained on the SQuAD

dataset There are other datasets too which does the similar question answering task

such as TriviaQA We could augment the training set of TriviaQA along with SQuAD to

have a more robust system that is able to generalize better and thus have higher

accuracy for predicting answer spans

Conclusion

In this work we have tried to explore the most fundamental techniques that have

shaped the current state of the art Then we proposed a minor improvement of

architecture over an existing model Furthermore we developed two applications that

uses the base model First we talked about how a chatbot application can be made

using the QA system and lastly we also created a web interface where the model can

be used for any Passage and Question This interface also shows the attention spread

on the candidate answers While our effort is ongoing to push the state of the art

forward we strongly believe that surpassing human level accuracy on this task will have

high dividends for the society at large

39

LIST OF REFERENCES

[1] Danqi Chen From Reading Comprehension To Open-Domain Question Answering 2018 Youtube Accessed March 16 2018 httpswwwyoutubecomwatchv=1RN88O9C13U

[2] Lehnert Wendy G The Process of Question Answering No RR-88 Yale Univ New Haven Conn Dept of computer science 1977

[3] Joshi Mandar Eunsol Choi Daniel S Weld and Luke Zettlemoyer Triviaqa A large scale distantly supervised challenge dataset for reading comprehension arXiv preprint arXiv170503551 (2017)

[4] Rajpurkar Pranav Jian Zhang Konstantin Lopyrev and Percy Liang Squad 100000+ questions for machine comprehension of text arXiv preprint arXiv160605250 (2016)

[5] The Stanford Question Answering Dataset 2018 RajpurkarGithubIo Accessed March 16 2018 httpsrajpurkargithubioSQuAD-explorer

[6] Hu Minghao Yuxing Peng and Xipeng Qiu Reinforced mnemonic reader for machine comprehension CoRR abs170502798 (2017)

[7] Schalkoff Robert J Artificial neural networks Vol 1 New York McGraw-Hill 1997

[8] Me For The AI and Neetesh Mehrotra 2018 The Connect Between Deep Learning And AI - Open Source For You Open Source For You Accessed March 16 2018 httpopensourceforucom201801connect-deep-learning-ai

[9] Krizhevsky Alex Ilya Sutskever and Geoffrey E Hinton Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems pp 1097-1105 2012

[10] An Intuitive Explanation Of Convolutional Neural Networks 2016 The Data Science Blog Accessed March 16 2018 httpsujjwalkarnme20160811intuitive-explanation-convnets

[11] Williams Ronald J and David Zipser A learning algorithm for continually running fully recurrent neural networks Neural computation 1 no 2 (1989) 270-280

[12] Hochreiter Sepp and Juumlrgen Schmidhuber Long short-term memory Neural computation 9 no 8 (1997) 1735-1780

[13] Chung Junyoung Caglar Gulcehre KyungHyun Cho and Yoshua Bengio Empirical evaluation of gated recurrent neural networks on sequence modeling arXiv preprint arXiv14123555 (2014)

40

[14] Mikolov Tomas Wen-tau Yih and Geoffrey Zweig Linguistic regularities in continuous space word representations In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics Human Language Technologies pp 746-751 2013

[15] Pennington Jeffrey Richard Socher and Christopher Manning Glove Global vectors for word representation In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) pp 1532-1543 2014

[16] Word2vec Tutorial - The Skip-Gram Model middot Chris Mccormick 2016 MccormickmlCom Accessed March 16 2018 httpmccormickmlcom20160419word2vec-tutorial-the-skip-gram-model

[17] Group 2015 [Paper Introduction] Bilingual Word Representations With Monolingual hellip SlideshareNet Accessed March 16 2018 httpswwwslidesharenetnaist-mtstudybilingual-word-representations-with-monolingual-quality-in-mind

[18] Attention Mechanism 2016 BlogHeuritechCom Accessed March 16 2018 httpsblogheuritechcom20160120attention-mechanism

[19] Weston Jason Sumit Chopra and Antoine Bordes 2014 Memory Networks arXiv preprint arXiv14103916v11 Accessed March 16 2018 httpsarxivorgabs14103916

[20] Wang Shuohang and Jing Jiang Machine comprehension using match-lstm and answer pointer arXiv preprint arXiv160807905 (2016)

[21] Wang Shuohang and Jing Jiang Learning natural language inference with LSTM arXiv preprint arXiv151208849 (2015)

[22] Vinyals Oriol Meire Fortunato and Navdeep Jaitly Pointer networks In Advances in Neural Information Processing Systems pp 2692-2700 2015

[23] Wang Wenhui Nan Yang Furu Wei Baobao Chang and Ming Zhou Gated self-matching networks for reading comprehension and question answering In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1 Long Papers) vol 1 pp 189-198 2017

[24] Seo Minjoon Aniruddha Kembhavi Ali Farhadi and Hannaneh Hajishirzi Bidirectional attention flow for machine comprehension arXiv preprint arXiv161101603 (2016)

[25] Clark Christopher and Matt Gardner Simple and effective multi-paragraph reading comprehension arXiv preprint arXiv171010723 (2017)

41

[26] 8 Analyzing Sentence Structure 2018 NltkOrg Accessed March 16 2018 httpwwwnltkorgbookch08html

42

BIOGRAPHICAL SKETCH

Purnendu Mukherjee grew up in Kolkata India and had a strong interest for

computers since an early age After his high school he did his BSc in computer

science from Ramakrishna Mission Residential College Narendrapur followed by MSc

in computer science from St Xavierrsquos College Kolkata He had a strong intuition and

interest for human like learning systems and wanted to work in this area He started

working at TCS Innovation Labs Pune for the application of Natural Language

Processing in Educational Applications As he was deeply passionate about learning

system that mimic the human brain and learn like a human child does he was

increasing interested about Deep Learning and its applications After working for a year

he went on to pursue a Master of Science degree in computer science from the

University of Florida Gainesville His academic interests have been focused on Deep

Learning and Natural Language Processing and he has been working on Machine

Reading Comprehension since summer of 2017

  • ACKNOWLEDGMENTS
  • LIST OF FIGURES
  • LIST OF ABBREVIATIONS
  • INTRODUCTION
  • THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
    • Neural Networks
    • Convolutional Neural Network
    • Recurrent Neural Networks (RNN)
    • Word Embedding
    • Attention Mechanism
    • Memory Networks
      • LITERATURE REVIEW AND STATE OF THE ART
        • Machine Comprehension Using Match-LSTM and Answer Pointer
        • R-NET Matching Reading Comprehension with Self-Matching Networks
        • Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
        • Summary
          • MULTI-ATTENTION QUESTION ANSWERING
            • Multi-Attention BiDAF Model
            • Chatbot Design Using a QA System
            • Online QA System and Attention Visualization
              • FUTURE DIRECTIONS AND CONCLUSION
                • Future Directions
                • Conclusion
                  • LIST OF REFERENCES
                  • BIOGRAPHICAL SKETCH
Page 17: QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISMufdcimages.uflib.ufl.edu/UF/E0/05/23/73/00001/MUKHERJEE_P.pdf · QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM By PURNENDU

17

will produce similar vector for those words Conversely if the two word vectors are

similar then the network will produce similar context predictions for the same two words

For examples synonyms like ldquointelligentrdquo and ldquosmartrdquo would have very similar contexts

Or that words that are related like ldquoenginerdquo and ldquotransmissionrdquo would probably have

similar contexts as well [16] Plotting the word vectors learned by a word2vec over a

large corpus we could find some very interesting relationships between words

Figure 2-3 Semantic relation between words in vector space [17]

Attention Mechanism

We as humans put our attention to things are important or are relevant in a

context For example when asked a question from a passage we try to find the most

relevant part of the passage the question is relevant with and then reason from our

understanding of that part of the passage The same idea applies for attention

mechanism in Deep Learning It is used to identify the specific parts of a given context

to which the current question is relevant to

Formally put the techniques take n arguments y_1 y_n (in our case the

passage having words say y_i through h_i) and a question word say q It returns a

vector z which is supposed to be the laquo summary raquo of the y_i focusing on information

linked to the question q More formally it returns a weighted arithmetic mean of the y_i

18

and the weights are chosen according the relevance of each y_i given the context c

[18]

Figure 2-4 Attention Mechanism flow [18]

Memory Networks

Convolutional Neural Networks and Recurrent Neural Networks which does

capture how we form our visual and sequential memories their memory (encoded by

hidden states and weights) were typically too small and was not compartmentalized

enough to accurately remember facts from the past (knowledge is compressed into

dense vectors) [19]

Deep Learning needed to cultivate a methodology that preserved memories as

they are such that it wonrsquot be lost in generalization and recalling exact words or

sequence of events would be possible mdash something computers are already good at This

effort led us to Memory Networks [19] published at ICLR 2015 by Facebook AI

Research

This paper provides a basic framework to store augment and retrieve memories

while seamlessly working with a Recurrent Neural Network architecture The memory

19

network consists of a memory m (an array of objects 1 indexed by m i) and four

(potentially learned) components I G O and R as follows

I (input feature map) mdash converts the incoming input to the internal feature

representation either a sparse or dense feature vector like that from word2vec or

GloVe

G (generalization) mdash updates old memories given the new input They call this

generalization as there is an opportunity for the network to compress and generalize its

memories at this stage for some intended future use The analogy Irsquove been talking

before

O (output feature map) mdash produces a new output (in the feature representation

space) given the new input and the current memory state This component is

responsible for performing inference In a question answering system this part will

select the candidate sentences (which might contain the answer) from the story

(conversation) so far

R (response) mdash converts the output into the response format desired For

example a textual response or an action In the QA system described this component

finds the desired answer and then converts it from feature representation to the actual

word

This model is a fully supervised model meaning all the candidate sentences from

which the answer could be found are marked during training phase and can also be

termed as lsquohard attentionrsquo

The authors tested out the QA system on various literature including Lord of the

Rings

20

Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]

21

CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART

There has been rapid progress since the release of the SQuAD dataset and

currently the best ensemble models are close to human level accuracy in machine

comprehension This is due to the various ingenious methods which solves some of the

problems with the previous methods Out of Vocabulary tokens were handled by using

Character embedding Long term dependency within context passage were solved

using self-attention And many other techniques such as Contextualized vectors History

of Words Attention Flow etc In this section we will have a look at the some of the most

important models that were fundamental to the progress of Questions Answering

Machine Comprehension Using Match-LSTM and Answer Pointer

In this paper [20] the authors propose an end-to-end neural architecture for the

QA task The architecture is based on match-LSTM [21] a model they proposed for

textual entailment and Pointer Net [22] a sequence-to-sequence model proposed by

Vinyals et al (2015) to constrain the output tokens to be from the input sequences

The model consists of an LSTM preprocessing layer a match-LSTM layer and an

Answer Pointer layer

We are given a piece of text which we refer to as a passage and a question

related to the passage The passage is represented by matrix P where P is the length

(number of tokens) of the passage and d is the dimensionality of word embeddings

Similarly the question is represented by matrix Q where Q is the length of the question

Our goal is to identify a subsequence from the passage as the answer to the question

22

Figure 3-1 Match-LSTM Model Architecture [20]

LSTM Preprocessing layer They use a standard one-directional LSTM

(Hochreiter amp Schmidhuber 1997) to process the passage and the question separately

as shown below

Match LSTM Layer They applied the match-LSTM model proposed for textual

entailment to their machine comprehension problem by treating the question as a

premise and the passage as a hypothesis The match-LSTM sequentially goes through

the passage At position i of the passage it first uses the standard word-by-word

attention mechanism to obtain attention weight vector as follows

(3-1)

23

where and are parameters to be learned

Answer Pointer Layer The Answer Pointer (Ans-Ptr) layer is motivated by the

Pointer Net introduced by Vinyals et al (2015) [22] The boundary model produces only

the start token and the end token of the answer and then all the tokens between these

two in the original passage are considered to be the answer

When this paper was released back in November 2016 Match-LSTM method

was the state of the art in Question Answering systems and was at the top of the

leaderboard for the SQuAD dataset

R-NET Matching Reading Comprehension with Self-Matching Networks

In this model [23] first the question and passage are processed by a

bidirectional recurrent network (Mikolov et al 2010) separately They then match the

question and passage with gated attention-based recurrent networks obtaining

question-aware representation for the passage On top of that they apply self-matching

attention to aggregate evidence from the whole passage and refine the passage

representation which is then fed into the output layer to predict the boundary of the

answer span

Question and passage encoding First the words are converted to their

respective word-level embeddings and character level embeddings The character-level

embeddings are generated by taking the final hidden states of a bi-directional recurrent

neural network (RNN) applied to embeddings of characters in the token Such

character-level embeddings have been shown to be helpful to deal with out-of-vocab

(OOV) tokens
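A minimal PyTorch sketch of this character-level encoding is given below; the GRU cell, hidden sizes, and vocabulary handling are illustrative assumptions rather than the exact configuration of [23].

import torch
import torch.nn as nn

class CharEncoder(nn.Module):
    """Character-level token embedding: run a bi-directional GRU over the character
    embeddings of each token and concatenate the two final hidden states."""

    def __init__(self, n_chars=100, char_dim=8, hidden_dim=25):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.rnn = nn.GRU(char_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, char_ids):
        # char_ids: (num_tokens, max_chars) integer ids of the characters in each token
        x = self.char_emb(char_ids)                  # (num_tokens, max_chars, char_dim)
        _, h_n = self.rnn(x)                         # h_n: (2, num_tokens, hidden_dim)
        return torch.cat([h_n[0], h_n[1]], dim=-1)   # (num_tokens, 2 * hidden_dim)

# Toy usage: three tokens, padded to six characters each.
encoder = CharEncoder()
tokens = torch.randint(1, 100, (3, 6))
print(encoder(tokens).shape)   # torch.Size([3, 50])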

24

They then use a bi-directional RNN to produce new representations of all the words in the question and the passage, respectively.

Figure 3-2 The task of Question Answering [23]

Gated Attention-based Recurrent Networks They use a variant of attention-based recurrent networks with an additional gate to determine the importance of information in the passage with respect to a question. Different from the gates in an LSTM or GRU, the additional gate is based on the current passage word and its attention-pooling vector of the question, which focuses on the relation between the question and the current passage word. The gate effectively models the phenomenon that only parts of the passage are relevant to the question in reading comprehension and question answering, and the gated representation is what is utilized in subsequent calculations.
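A hedged sketch of such a gate is given below: the passage-word vector and its attention-pooled question vector are concatenated and scaled element-wise by a learned sigmoid gate before entering the matching RNN. The exact parameterization used in R-NET may differ.

import torch
import torch.nn as nn

class GatedAttentionInput(nn.Module):
    """Gate the concatenation of a passage-word vector u_p and its
    attention-pooled question vector c before feeding it to the matching RNN."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 2 * dim, bias=False)

    def forward(self, u_p, c):
        x = torch.cat([u_p, c], dim=-1)       # (batch, 2 * dim)
        g = torch.sigmoid(self.gate(x))       # how relevant is this word to the question?
        return g * x                          # element-wise gated input to the RNN

gated = GatedAttentionInput(dim=4)
u_p, c = torch.randn(2, 4), torch.randn(2, 4)
print(gated(u_p, c).shape)                    # torch.Size([2, 8])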

25

Self-Matching Attention From the previous step the question aware passage

representation is generated to highlight the important parts of the passage One

problem with such a representation is that it has very limited knowledge of context. One

answer candidate is often oblivious to important cues in the passage outside its

surrounding window To address this problem the authors propose directly matching

the question-aware passage representation against itself It dynamically collects

evidence from the whole passage for each word in the passage and encodes the evidence

relevant to the current passage word and its matching question information into the

passage representation

Output Layer They use the same method as Wang amp Jiang (2016b) and use

pointer networks (Vinyals et al 2015) to predict the start and end position of the

answer In addition they use an attention-pooling over the question representation to

generate the initial hidden vector for the pointer network [23]

When the R-Net Model first appeared in the leaderboard in March 2017 it was at

the top with 72.3% Exact Match and 80.7% F1 score

Bi-Directional Attention Flow (BiDAF) for Machine Comprehension

BiDAF [24] is a hierarchical multi-stage architecture for modeling the

representations of the context paragraph at different levels of granularity. BiDAF includes character-level, word-level, and contextual embeddings, and uses bi-directional

attention flow to obtain a query-aware context representation Their attention layer is not

used to summarize the context paragraph into a fixed-size vector Instead the attention

is computed for every time step and the attended vector at each time step along with

the representations from previous layers can flow through to the subsequent modeling

layer This reduces the information loss caused by early summarization [24]

26

Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]

Their machine comprehension model is a hierarchical multi-stage process and

consists of six layers

1 Character Embedding Layer maps each word to a vector space using character-level CNNs

2 Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model

3 Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embedding of the words These first three layers are applied to both the query and context They use an LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words They place an LSTM in both directions and concatenate the outputs of the two LSTMs

4 Attention Flow Layer couples the query and context vectors and produces a set of query aware feature vectors for each word in the context

5 Modeling Layer employs a Recurrent Neural Network to scan the context The input to the modeling layer is G which encodes the query-aware representations of context words The output of the modeling layer captures the interaction among the context words conditioned on the query They use two layers of bi-directional LSTM

27

with an output size of d for each direction. Hence a matrix is obtained, which is passed on to the output layer to predict the answer

6 Output Layer provides an answer to the query They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices under the predicted distributions, averaged over all examples [25]; a small sketch of this loss follows this list
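As a small illustration of that objective, the snippet below computes the span loss for a toy batch from predicted start and end distributions; the array shapes and values are assumptions made only for this sketch.

import numpy as np

def span_loss(p_start, p_end, y_start, y_end):
    """Sum of the negative log probabilities of the true start and end indices,
    averaged over the batch (the training objective described above)."""
    batch = np.arange(len(y_start))
    nll = -np.log(p_start[batch, y_start]) - np.log(p_end[batch, y_end])
    return nll.mean()

# Two toy examples with a 5-token context.
p_start = np.array([[0.7, 0.1, 0.1, 0.05, 0.05],
                    [0.1, 0.6, 0.1, 0.1, 0.1]])
p_end   = np.array([[0.1, 0.7, 0.1, 0.05, 0.05],
                    [0.1, 0.1, 0.1, 0.6, 0.1]])
print(span_loss(p_start, p_end, y_start=np.array([0, 1]), y_end=np.array([1, 3])))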

In a further variation of their above work they add a self-attention layer after the

bi-attention layer to further improve the results. The architecture of the model is shown below.

Figure 3-3 The task of Question Answering [25]

28

Summary

In this chapter we reviewed the methods that are fundamental to the state of the art in Machine Comprehension and the task of Question Answering. We have reached close to human-level accuracy, and this is due to incremental developments over previous models. As we saw, Out of Vocabulary (OOV) tokens were handled using character embeddings, long-term dependencies within the context passage were addressed using self-attention, and many other techniques, such as contextualized vectors and attention flow, were employed to get better results. In the next chapter we will see how we can build on these models and develop them further.

29

CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING

Although the advancements in Question Answering systems since the release of the SQuAD dataset have been impressive and the results are getting close to human-level accuracy, it is far from being a fool-proof system. The models still make mistakes that would be obvious to a human. For example:

Passage The Panthers used the San Jose State practice facility and stayed at

the San Jose Marriott The Broncos practiced at Florida State Facility and stayed at the

Santa Clara Marriott

Question At what university's facility did the Panthers practice

Actual Answer San Jose State

Predicted Answer Florida State Facility

To find out what is leading to the wrong predictions, we wanted to see the attention weights associated with such an example. We plotted the passage and question heat map, which is a 2D matrix where the intensity of each cell signifies the similarity between a passage word and a question word. For the above example we found that while certain words of the question are given high weightage, other parts are not. The words 'At', 'facility', and 'practice' are given high attention, but 'Panthers' does not receive high attention. If it had received high attention, then the system would have predicted 'San Jose State' as the right answer. To solve this issue we analyzed the base BiDAF model and proposed adding two things:

1 Bi-Attention and Self-Attention over Query

2 Second level attention over the output of (Bi-Attention + Self-Attention) from both Context and Query

30

Multi-Attention BiDAF Model

Our comprehension model is a hierarchical multi-stage process and consists of

the following layers

1 Embedding Just as in all other models, we embed words using pretrained word vectors. We also embed the characters in each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to get character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training

2 Pre-Process A shared bi-directional GRU (Cho et al 2014) is used to map the question and passage embeddings to context aware embeddings

3 Context Attention The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al 2016) is used to build a query-aware context representation. Let h_i be the vector for context word i, q_j be the vector for question word j, and n_q and n_c be the lengths of the question and context respectively. We compute attention between context word i and question word j as

(4-1)

where w1, w2, and w3 are learned vectors and ⊙ denotes element-wise multiplication (see the sketch after this list). We then compute an attended vector c_i for each context token as

(4-2)

We also compute a query-to-context vector q_c

(4-3)

31

The final vector computed for each token is built by concatenating h_i, c_i, and the query-to-context vector q_c. In our model we subsequently pass the result through a linear layer with ReLU activations

4 Context Self-Attention Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU. Then we apply the same attention mechanism, only now between the passage and itself. In this case we do not use query-to-context attention, and we set a_ij = -inf if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so this output is additionally summed with the input

5 Query Attention For this part we proceed the same way as in the context attention layer, but calculate the weighted sum of the context words for each query word. Thus the resulting sequence has length equal to the number of query words. Then we calculate context-to-query attention in the same way as the query-to-context attention of the context attention layer

Figure 4-1 The modified BiDAF model with multilevel attention

6 Query Self-Attention This part is done in the same way as the context self-attention layer, but on the output of the Query Attention layer

32

7 Context Query Bi-Attention + Self-Attention The output of the Context self-attention and the Query Self-Attention layers are taken as input and the same process for Bi-Attention and self-attention is applied on these inputs

8 Prediction In the last layer of our model, a bidirectional GRU is applied, followed by a linear layer that computes answer start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log likelihood of selecting correct start and end tokens
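The sketch below illustrates the context attention of layer 3 in NumPy. It assumes the trilinear form a_ij = w1·h_i + w2·q_j + w3·(h_i ⊙ q_j) for equation (4-1) and the usual BiDAF-style attended and query-to-context vectors for (4-2) and (4-3); the learned vectors are randomly initialized here purely for illustration.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def context_attention(H, Q, w1, w2, w3):
    """Bi-directional context attention (layer 3).

    H: (n_c, d) context vectors h_i, Q: (n_q, d) question vectors q_j,
    w1, w2, w3: learned d-dimensional vectors (random here, for illustration).
    Returns the attended vectors c_i and the query-to-context vector q_c."""
    # a_ij = w1.h_i + w2.q_j + w3.(h_i * q_j)   <- assumed form of (4-1)
    a = H @ w1[:, None] + (Q @ w2[:, None]).T + (H * w3) @ Q.T     # (n_c, n_q)

    p = softmax(a, axis=1)          # attention of each context word over the question
    C = p @ Q                       # attended vector c_i for each context token, as in (4-2)

    m = softmax(a.max(axis=1))      # one weight per context word
    q_c = m @ H                     # query-to-context vector, as in (4-3)

    # Each context token is then represented by [h_i ; c_i ; q_c] before the
    # linear + ReLU layer mentioned in the text.
    return C, q_c

rng = np.random.default_rng(0)
d, n_c, n_q = 8, 5, 3
H, Q = rng.normal(size=(n_c, d)), rng.normal(size=(n_q, d))
C, q_c = context_attention(H, Q, *rng.normal(size=(3, d)))
print(C.shape, q_c.shape)           # (5, 8) (8,)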

Having carried out this modification, we were able to solve the wrong example we started with: the multilevel attention model gives the correct output of "San Jose State". We also achieved slightly better scores than the original model, with an F1 score of 85.44 on the SQuAD dev dataset.

Chatbot Design Using a QA System

Designing a perfect chatbot that passes the Turing test is a fundamental goal of Artificial Intelligence. Although we are many orders of magnitude away from achieving such a goal, domain-specific tasks can be solved with chatbots built from current technology. We started with a similar objective in mind, i.e. to design a domain-specific chatbot and then generalize to other areas once it achieves the first domain-specific objective robustly. This led us to the fundamental problem of Machine Comprehension and subsequently to the task of Question Answering. Having achieved some degree of success with QA systems, we looked back to see whether we could apply our newly acquired knowledge to the task of designing chatbots.

The chatbots made with today's technologies mostly use handcrafted techniques such as template matching, which requires anticipating all the possible ways a user may articulate his requirements and a conversation may unfold. This requires a lot of man-hours for designing a domain-specific system and is still very error prone. In this section

33

we propose a general chatbot design that would make designing a domain-specific chatbot easy and robust at the same time

Every domain-specific chatbot needs to obtain a set of information from the user and show some results based on the user-specific information obtained. Traditional chatbots use template matching and keyword lookup to determine if the user has provided the required information. Our idea is to use the Question Answering system in the backend to extract the required information from whatever the user has typed up to this point of the conversation. The information to be extracted can be posed in the form of a set of questions, and the answers obtained from those questions can be used as the parameters to supply the relevant information to the user.

We chose flight reservation as our chatbot domain. Our goal was to extract the required information from the user to be able to show him the available flights as per the user's requirements.

Figure 4-2 Flight reservation chatbot's chat window

34

For a flight reservation task, the booking agent needs to know at minimum the origin city, the destination city, and the date of travel to be able to show the available flights. Optional information includes the number of tickets, the passenger's name, one way or round trip, etc.

The minimalistic conversation with the user through the chat window would be as shown above. We had a platform called OneTask on which we wanted to implement our chatbot. The chat interface within the OneTask system looks as follows:

Figure 4-3 Chatbot within OneTask system

The working of the chat system is as follows:

1 Initiation The user opens the chat window, which starts a session with the chatbot. The chatbot asks 'How may I help you?'

2 User Reply The user may reply with none of the required information for flight booking or may provide multiple pieces of information in the same message

3 User Reply parsing The conversation up to this point is treated as a passage and the internal questions are run on this passage So the four questions that are run are

Where do you want to go

35

From where do you want to leave

When do you want to depart

4 Parsed responses from QA model After running the questions on the QA system with the given conversation, answers are obtained for all of them along with their corresponding confidence values. Since answers will be obtained even if the required question was not answered up to this point in the conversation, the confidence values play a crucial role in determining the validity of the answers. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly, a confidence of 2–10 signifies that it may have been answered but should be verified with the user for correctness, and any confidence below 2 is discarded

5 Asking remaining questions iteratively After the parsing, it is checked whether any of the required questions are still unanswered. If so, the chatbot asks the remaining questions and the process from steps 3–4 is carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request; a minimal sketch of this loop is given below
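The following Python sketch captures this loop. The 'qa_model' callable stands in for the QA system described earlier, the thresholds mirror the ranges above, and all function and slot names are illustrative assumptions rather than the actual OneTask implementation.

# Hypothetical sketch of the flight-booking loop; 'qa_model' and the final
# search message are placeholders, not actual project APIs.

REQUIRED_QUESTIONS = {
    "destination": "Where do you want to go?",
    "origin": "From where do you want to leave?",
    "date": "When do you want to depart?",
}

def chat_turn(conversation, answers, qa_model):
    """Run the internal questions over the conversation-so-far and decide what to say next."""
    passage = " ".join(conversation)
    for slot, question in REQUIRED_QUESTIONS.items():
        if slot in answers:
            continue
        span, confidence = qa_model(passage, question)
        if confidence > 10:                 # accept the extracted answer
            answers[slot] = span
        elif confidence > 2:                # tentative: confirm with the user
            return f"Just to confirm, is '{span}' your {slot}?"
        # below 2: discard and ask explicitly below
    for slot, question in REQUIRED_QUESTIONS.items():
        if slot not in answers:
            return question                 # ask for the next missing piece of information
    return (f"Searching flights from {answers['origin']} "
            f"to {answers['destination']} on {answers['date']}...")

# Toy usage with a fake QA model that always answers with low confidence.
fake_qa = lambda passage, question: ("New York", 1.0)
print(chat_turn(["I want to book a flight"], {}, fake_qa))   # -> "Where do you want to go?"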

Figure 4-4 The Flow diagram of the Flight booking Chatbot system

Online QA System and Attention Visualization

To be able to test out various examples, we used the BiDAF [24] model for an online demo. One can either choose from the available examples in the drop-down menu or paste their own passage and examples. While this is a useful and interesting system to test the model in a user-friendly way, we created this system to be able to focus on the wrong samples.

36

Figure 4-5 QA system interface with attention highlight over candidate answers

The candidate answers are shown with the blue highlights as per their

confidence values The higher the confidence the darker the answer The highest

confidence value is chosen as the predicted answer. We developed the system to show the attention spread of the candidate answers in order to understand what needs to be done to improve the system. This led us to realize the importance of including the query attention part as well as multilevel attention in the BiDAF model, as described in the first section of this chapter.

37

CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION

Future Directions

Having done a thorough analysis of the current methods and the state of the art in Machine Comprehension and the QA task, and having developed systems on top of them, we have gained a strong sense of what needs to be done to further improve QA models. After observing the wrong samples, we can see that the system is still unable to encode meaning and is picking answers based on statistical patterns of occurrence of answers in the training examples.

From the state-of-the-art models and the ongoing research literature, it is easy to conclude that more features need to be embedded to encode the meaning of words, phrases, and sentences. A paper called Reinforced Mnemonic Reader for Machine Comprehension [6] encoded the POS and NER tags of words along with their word and character embeddings, which gave them better results.

Figure 5-1 An English language semantic parse tree [26]

38

We have developed a method to encode the syntax parse tree of a sentence that encodes not only the POS tags but also the relation of each word within its phrase and the relation of each phrase within the whole sentence in a hierarchical manner.
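As a small, hedged illustration of adding such linguistic features (this is not our hierarchical parse-tree encoding itself), POS tags from an off-the-shelf NLTK tagger could be appended to each token's embedding as one-hot features:

# Illustrative only: appends one-hot POS features to word vectors.
# Requires the NLTK tokenizer and tagger data on first use, e.g.
# nltk.download('punkt') and nltk.download('averaged_perceptron_tagger').
import numpy as np
import nltk

TAGSET = ["NN", "NNP", "VB", "VBD", "IN", "DT", "JJ", "OTHER"]   # truncated tagset for the sketch

def pos_features(sentence, word_vectors):
    """Concatenate each token's embedding with a one-hot encoding of its POS tag."""
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)
    feats = []
    for (token, tag), vec in zip(tagged, word_vectors):
        one_hot = np.zeros(len(TAGSET))
        one_hot[TAGSET.index(tag) if tag in TAGSET else TAGSET.index("OTHER")] = 1.0
        feats.append(np.concatenate([vec, one_hot]))
    return np.stack(feats)

sentence = "The Panthers practiced in San Jose"
vectors = np.random.randn(len(nltk.word_tokenize(sentence)), 50)   # stand-in word embeddings
print(pos_features(sentence, vectors).shape)                       # (6, 58)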

Finally, data augmentation is another way to get better results. One definite way to reduce the errors would be to include in the training data samples similar to those on which the system is faltering in the dev set: one could generate examples similar to the failure cases and include them in the training set for better prediction. Another approach would be to train on similar and bigger datasets. Our models were trained on the SQuAD dataset, but there are other datasets that pose a similar question answering task, such as TriviaQA. We could augment the training set with TriviaQA along with SQuAD to have a more robust system that is able to generalize better and thus have higher accuracy for predicting answer spans.

Conclusion

In this work we have explored the most fundamental techniques that have shaped the current state of the art. We then proposed a minor architectural improvement over an existing model. Furthermore, we developed two applications that use the base model: first, we described how a chatbot application can be built using the QA system, and second, we created a web interface where the model can be used for any passage and question. This interface also shows the attention spread over the candidate answers. While our effort to push the state of the art forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will pay high dividends for society at large.

39

LIST OF REFERENCES

[1] Danqi Chen From Reading Comprehension To Open-Domain Question Answering 2018 Youtube Accessed March 16 2018 httpswwwyoutubecomwatchv=1RN88O9C13U

[2] Lehnert Wendy G The Process of Question Answering No RR-88 Yale Univ New Haven Conn Dept of computer science 1977

[3] Joshi Mandar Eunsol Choi Daniel S Weld and Luke Zettlemoyer Triviaqa A large scale distantly supervised challenge dataset for reading comprehension arXiv preprint arXiv170503551 (2017)

[4] Rajpurkar Pranav Jian Zhang Konstantin Lopyrev and Percy Liang Squad 100000+ questions for machine comprehension of text arXiv preprint arXiv160605250 (2016)

[5] The Stanford Question Answering Dataset 2018 RajpurkarGithubIo Accessed March 16 2018 httpsrajpurkargithubioSQuAD-explorer

[6] Hu Minghao Yuxing Peng and Xipeng Qiu Reinforced mnemonic reader for machine comprehension CoRR abs170502798 (2017)

[7] Schalkoff Robert J Artificial neural networks Vol 1 New York McGraw-Hill 1997

[8] Me For The AI and Neetesh Mehrotra 2018 The Connect Between Deep Learning And AI - Open Source For You Open Source For You Accessed March 16 2018 httpopensourceforucom201801connect-deep-learning-ai

[9] Krizhevsky Alex Ilya Sutskever and Geoffrey E Hinton Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems pp 1097-1105 2012

[10] An Intuitive Explanation Of Convolutional Neural Networks 2016 The Data Science Blog Accessed March 16 2018 httpsujjwalkarnme20160811intuitive-explanation-convnets

[11] Williams Ronald J and David Zipser A learning algorithm for continually running fully recurrent neural networks Neural computation 1 no 2 (1989) 270-280

[12] Hochreiter Sepp and Jürgen Schmidhuber Long short-term memory Neural computation 9 no 8 (1997) 1735-1780

[13] Chung Junyoung Caglar Gulcehre KyungHyun Cho and Yoshua Bengio Empirical evaluation of gated recurrent neural networks on sequence modeling arXiv preprint arXiv14123555 (2014)

40

[14] Mikolov Tomas Wen-tau Yih and Geoffrey Zweig Linguistic regularities in continuous space word representations In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics Human Language Technologies pp 746-751 2013

[15] Pennington Jeffrey Richard Socher and Christopher Manning Glove Global vectors for word representation In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) pp 1532-1543 2014

[16] Word2vec Tutorial - The Skip-Gram Model · Chris Mccormick 2016 MccormickmlCom Accessed March 16 2018 httpmccormickmlcom20160419word2vec-tutorial-the-skip-gram-model

[17] Group 2015 [Paper Introduction] Bilingual Word Representations With Monolingual … SlideshareNet Accessed March 16 2018 httpswwwslidesharenetnaist-mtstudybilingual-word-representations-with-monolingual-quality-in-mind

[18] Attention Mechanism 2016 BlogHeuritechCom Accessed March 16 2018 httpsblogheuritechcom20160120attention-mechanism

[19] Weston Jason Sumit Chopra and Antoine Bordes 2014 Memory Networks arXiv preprint arXiv14103916v11 Accessed March 16 2018 httpsarxivorgabs14103916

[20] Wang Shuohang and Jing Jiang Machine comprehension using match-lstm and answer pointer arXiv preprint arXiv160807905 (2016)

[21] Wang Shuohang and Jing Jiang Learning natural language inference with LSTM arXiv preprint arXiv151208849 (2015)

[22] Vinyals Oriol Meire Fortunato and Navdeep Jaitly Pointer networks In Advances in Neural Information Processing Systems pp 2692-2700 2015

[23] Wang Wenhui Nan Yang Furu Wei Baobao Chang and Ming Zhou Gated self-matching networks for reading comprehension and question answering In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1 Long Papers) vol 1 pp 189-198 2017

[24] Seo Minjoon Aniruddha Kembhavi Ali Farhadi and Hannaneh Hajishirzi Bidirectional attention flow for machine comprehension arXiv preprint arXiv161101603 (2016)

[25] Clark Christopher and Matt Gardner Simple and effective multi-paragraph reading comprehension arXiv preprint arXiv171010723 (2017)

41

[26] 8 Analyzing Sentence Structure 2018 NltkOrg Accessed March 16 2018 httpwwwnltkorgbookch08html

42

BIOGRAPHICAL SKETCH

Purnendu Mukherjee grew up in Kolkata, India, and had a strong interest in computers from an early age. After high school, he did his BSc in computer science at Ramakrishna Mission Residential College, Narendrapur, followed by an MSc in computer science at St Xavier's College, Kolkata. He had a strong intuition for and interest in human-like learning systems and wanted to work in this area. He started working at TCS Innovation Labs, Pune, on the application of Natural Language Processing to educational applications. As he was deeply passionate about learning systems that mimic the human brain and learn like a human child does, he became increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science at the University of Florida, Gainesville. His academic interests have been focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.

Page 18: QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISMufdcimages.uflib.ufl.edu/UF/E0/05/23/73/00001/MUKHERJEE_P.pdf · QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM By PURNENDU

18

and the weights are chosen according the relevance of each y_i given the context c

[18]

Figure 2-4 Attention Mechanism flow [18]

Memory Networks

Convolutional Neural Networks and Recurrent Neural Networks which does

capture how we form our visual and sequential memories their memory (encoded by

hidden states and weights) were typically too small and was not compartmentalized

enough to accurately remember facts from the past (knowledge is compressed into

dense vectors) [19]

Deep Learning needed to cultivate a methodology that preserved memories as

they are such that it wonrsquot be lost in generalization and recalling exact words or

sequence of events would be possible mdash something computers are already good at This

effort led us to Memory Networks [19] published at ICLR 2015 by Facebook AI

Research

This paper provides a basic framework to store augment and retrieve memories

while seamlessly working with a Recurrent Neural Network architecture The memory

19

network consists of a memory m (an array of objects 1 indexed by m i) and four

(potentially learned) components I G O and R as follows

I (input feature map) mdash converts the incoming input to the internal feature

representation either a sparse or dense feature vector like that from word2vec or

GloVe

G (generalization) mdash updates old memories given the new input They call this

generalization as there is an opportunity for the network to compress and generalize its

memories at this stage for some intended future use The analogy Irsquove been talking

before

O (output feature map) mdash produces a new output (in the feature representation

space) given the new input and the current memory state This component is

responsible for performing inference In a question answering system this part will

select the candidate sentences (which might contain the answer) from the story

(conversation) so far

R (response) mdash converts the output into the response format desired For

example a textual response or an action In the QA system described this component

finds the desired answer and then converts it from feature representation to the actual

word

This model is a fully supervised model meaning all the candidate sentences from

which the answer could be found are marked during training phase and can also be

termed as lsquohard attentionrsquo

The authors tested out the QA system on various literature including Lord of the

Rings

20

Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]

21

CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART

There has been rapid progress since the release of the SQuAD dataset and

currently the best ensemble models are close to human level accuracy in machine

comprehension This is due to the various ingenious methods which solves some of the

problems with the previous methods Out of Vocabulary tokens were handled by using

Character embedding Long term dependency within context passage were solved

using self-attention And many other techniques such as Contextualized vectors History

of Words Attention Flow etc In this section we will have a look at the some of the most

important models that were fundamental to the progress of Questions Answering

Machine Comprehension Using Match-LSTM and Answer Pointer

In this paper [20] the authors propose an end-to-end neural architecture for the

QA task The architecture is based on match-LSTM [21] a model they proposed for

textual entailment and Pointer Net [22] a sequence-to-sequence model proposed by

Vinyals et al (2015) to constrain the output tokens to be from the input sequences

The model consists of an LSTM preprocessing layer a match-LSTM layer and an

Answer Pointer layer

We are given a piece of text which we refer to as a passage and a question

related to the passage The passage is represented by matrix P where P is the length

(number of tokens) of the passage and d is the dimensionality of word embeddings

Similarly the question is represented by matrix Q where Q is the length of the question

Our goal is to identify a subsequence from the passage as the answer to the question

22

Figure 3-1 Match-LSTM Model Architecture [20]

LSTM Preprocessing layer They use a standard one-directional LSTM

(Hochreiter amp Schmidhuber 1997) to process the passage and the question separately

as shown below

Match LSTM Layer They applied the match-LSTM model proposed for textual

entailment to their machine comprehension problem by treating the question as a

premise and the passage as a hypothesis The match-LSTM sequentially goes through

the passage At position i of the passage it first uses the standard word-by-word

attention mechanism to obtain attention weight vector as follows

(3-1)

23

where and are parameters to be learned

Answer Pointer Layer The Answer Pointer (Ans-Ptr) layer is motivated by the

Pointer Net introduced by Vinyals et al (2015) [22] The boundary model produces only

the start token and the end token of the answer and then all the tokens between these

two in the original passage are considered to be the answer

When this paper was released back in November 2016 Match-LSTM method

was the state of the art in Question Answering systems and was at the top of the

leaderboard for the SQuAD dataset

R-NET Matching Reading Comprehension with Self-Matching Networks

In this model [23] first the question and passage are processed by a

bidirectional recurrent network (Mikolov et al 2010) separately They then match the

question and passage with gated attention-based recurrent networks obtaining

question-aware representation for the passage On top of that they apply self-matching

attention to aggregate evidence from the whole passage and refine the passage

representation which is then fed into the output layer to predict the boundary of the

answer span

Question and passage encoding First the words are converted to their

respective word-level embeddings and character level embeddings The character-level

embeddings are generated by taking the final hidden states of a bi-directional recurrent

neural network (RNN) applied to embeddings of characters in the token Such

character-level embeddings have been shown to be helpful to deal with out-of-vocab

(OOV) tokens

24

They then use a bi-directional RNN to produce new representation and

of all words in the question and passage respectively

Figure 3-2 The task of Question Answering [23]

Gated Attention-based Recurrent Networks They use a variant of attention-

based recurrent networks with an additional gate to determine the importance of

information in the passage regarding a question Different from the gates in LSTM or

GRU the additional gate is based on the current passage word and its attention-pooling

vector of the question which focuses on the relation between the question and current

passage word The gate effectively model the phenomenon that only parts of the

passage are relevant to the question in reading comprehension and question answering

is utilized in subsequent calculations

25

Self-Matching Attention From the previous step the question aware passage

representation is generated to highlight the important parts of the passage One

problem with such representation is that it has very limited knowledge of context One

answer candidate is often oblivious to important cues in the passage outside its

surrounding window To address this problem the authors propose directly matching

the question-aware passage representation against itself It dynamically collects

evidence from the whole passage for words in passage and encodes the evidence

relevant to the current passage word and its matching question information into the

passage representation

Output Layer They use the same method as Wang amp Jiang (2016b) and use

pointer networks (Vinyals et al 2015) to predict the start and end position of the

answer In addition they use an attention-pooling over the question representation to

generate the initial hidden vector for the pointer network [23]

When the R-Net Model first appeared in the leaderboard in March 2017 it was at

the top with 723 Exact Match and 807 F1 score

Bi-Directional Attention Flow (BiDAF) for Machine Comprehension

BiDAF [24] is a hierarchical multi-stage architecture for modeling the

representations of the context paragraph at different levels of granularity BIDAF

includes character-level word-level and contextual embeddings and uses bi-directional

attention flow to obtain a query-aware context representation Their attention layer is not

used to summarize the context paragraph into a fixed-size vector Instead the attention

is computed for every time step and the attended vector at each time step along with

the representations from previous layers can flow through to the subsequent modeling

layer This reduces the information loss caused by early summarization [24]

26

Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]

Their machine comprehension model is a hierarchical multi-stage process and

consists of six layers

1 Character Embedding Layer maps each word to a vector space using character-level CNNs

2 Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model

3 Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embedding of the words These first three layers are applied to both the query and context They use a LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words We place an LSTM in both directions and concatenate the outputs of the two LSTMs

4 Attention Flow Layer couples the query and context vectors and produces a set of query aware feature vectors for each word in the context

5 Modeling Layer employs a Recurrent Neural Network to scan the context The input to the modeling layer is G which encodes the query-aware representations of context words The output of the modeling layer captures the interaction among the context words conditioned on the query They use two layers of bi-directional LSTM

27

with the output size of d for each direction Hence a matrix is obtained which is passed onto the output layer to predict the answer

6 Output Layer provides an answer to the query They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices by the predicted distributions averaged over all examples [25]

In a further variation of their above work they add a self-attention layer after the

Bi-attention layer to further improve the results The architecture of the model is as

Figure 3-3 The task of Question Answering [25]

28

Summary

In this chapter we reviewed the methods that are fundamental to the state of the

art in Machine Comprehension and for the task of Question Answering We have

reached closed to human level accuracy and this is due to incremental developments

over previous models As we saw Out of Vocabulary (OOV) tokens were handled by

using Character embedding Long term dependency within context passage were

solved using self-attention and many other techniques such as Contextualized vectors

Attention Flow etc were employed to get better results In the next chapter we will see

how we can build on these models and develop further

29

CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING

Although the advancements in Question Answering systems since the release of

the SQuAD dataset has been impressive and the results are getting close to human

level accuracy it is far from being a fool-proof system The models still make mistakes

which would be obvious to a human For example

Passage The Panthers used the San Jose State practice facility and stayed at

the San Jose Marriott The Broncos practiced at Florida State Facility and stayed at the

Santa Clara Marriott

Question At what universitys facility did the Panthers practice

Actual Answer San Jose State

Predicted Answer Florida State Facility

To find out what is leading to the wrong predictions we wanted to see the

attention weights associated with such an example We plotted the passage and

question heat map which is a 2D matrix where the intensity of each cell signifies the

similarity between a passage word and a question word For the above example we

found out that while certain words of the question are given high weightage other parts

are not The words lsquoAtrsquo lsquofacilityrsquo lsquopracticersquo are given high attention but lsquoPanthersrsquo does

not receive high attention If it had received high attention then the system would have

predicted lsquoSan Jose Statersquo as the right answer To solve this issue we analyzed the

base BiDAF model and proposed adding two things

1 Bi-Attention and Self-Attention over Query

2 Second level attention over the output of (Bi-Attention + Self-Attention) from both Context and Query

30

Multi-Attention BiDAF Model

Our comprehension model is a hierarchical multi-stage process and consists of

the following layers

1 Embedding Just as all other models we embed words using pretrained word vectors We also embed the characters in each word into size 20 vectors which are learned and run a convolution neural network followed by max-pooling to get character-derived embeddings for each word The character-level and word-level embeddings are then concatenated and passed to the next layer We do not update the word embeddings during training

2 Pre-Process A shared bi-directional GRU (Cho et al 2014) is used to map the question and passage embeddings to context aware embeddings

3 Context Attention The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al 2016) is used to build a query-aware context representation Let h_i be the vector for context word i q_j be the vector for question word j and n_q and n_c be the lengths of the question and context respectively We compute attention between context word i and question word j as

(4-1)

where w1 w2 and w3 are learned vectors and is element-wise multiplication We then compute an attended vector c_i for each context token as

(4-2)

We also compute a query-to-context vector q_c

(4-3)

31

The final vector computed for each token is built by concatenating

and In our model we subsequently pass the result through a linear layer with ReLU activations

4 Context Self-Attention Next we use a layer of residual self-attention The input is passed through another bi-directional GRU Then we apply the same attention mechanism only now between the passage and itself In this case we do not use query-to context attention and we set aij = 1048576inf if i = j As before we pass the concatenated output through a linear layer with ReLU activations This layer is applied residually so this output is additionally summed with the input

5 Query Attention For this part we do the same way as context attention but calculate the weighted sum of the context words for each query word Thus we get the length as number of query words Then we calculate context to query similar to query to context in context attention layer

Figure 4-1 The modified BiDAF model with multilevel attention

6 Query Self-Attention This part is done the same way as the context self-attention layer but from the output of the Query Attention layer

32

7 Context Query Bi-Attention + Self-Attention The output of the Context self-attention and the Query Self-Attention layers are taken as input and the same process for Bi-Attention and self-attention is applied on these inputs

8 Prediction In the last layer of our model a bidirectional GRU is applied followed by a linear layer that computes answer start scores for each token The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores The softmax operation is applied to the start and end scores to produce start and end probabilities and we optimize the negative log likelihood of selecting correct start and end tokens

Having carried out this modification we were able to solve the wrong example we

started with The multilevel attention model gives the correct output as ldquoSan Jose

Staterdquo Also we achieved slightly better scores than the original model with a F1 score

of 8544 on the SQuAD dev dataset

Chatbot Design Using a QA System

Designing a perfect chatbot that passes the Turing test is a fundamental goal for

Artificial Intelligence Although we are many order to magnitudes away from achieving

such a goal domain specific tasks can be solved with chatbots made from current

technology We had started our goal with a similar objective in mind ie to design a

domain specific chatbot and then generalize to other areas as it is able to achieve the

first domain specific objective robustly This led us to the fundamental problem of

Machine Comprehension and subsequently to the task of Question Answering Having

achieved some degree of success with QA systems we looked back if we could apply

our newly acquired knowledge in the task of designing Chatbots

The chatbots made with todayrsquos technologies are mostly handcrafted techniques

such as template matching that requires anticipating all possible ways a user may

articulate his requirements and a conversation may occur This requires a lot of man

hours for designing a domain specific system and is still very error prone In this section

33

we propose a general Chatbot design that would make the designing of a domain

specific chatbot very easy and robust at the same time

Every domain specific chatbot needs to obtain a set of information from the user

and show some results based on the user specific information obtained The traditional

chatbots use template matching and keywords lookup to determine if the user has

provided the required information Our idea is to use the Question Answering system in

the backend to extract out the required information from whatever the user has typed

until this point of the conversation The information to be extracted can be posed in the

form of a set of questions and the answers obtained from those questions can be used

as the parameters to supply the relevant information to the user

We had chosen our chatbot domain as the flight reservation system Our goal

was to extract the required information from the user to be able to show him the

available flights as per the userrsquos requirements

Figure 4-2 Flight reservation chatbotrsquos chat window

34

For a flight reservation task the booking agent needs to know the origin city

destination city and date of travel at minimum to be able to show the available flights

Optional information includes the number of tickets passengerrsquos name one way or

round trip etc

The minimalistic conversation with the user through the chat window would be as

shown above We had a platform called OneTask on which we wanted to implement our

chat bot The chat interface within the OneTask system looks as follows

Figure 4-3 Chatbot within OneTask system

The working of the chat system is as follows ndash

1 Initiation The user opens the chat window which starts a session with the chat bot The chat bot asks lsquoHow may I help yoursquo

2 User Reply The user may reply with none of the required information for flight booking or may reply with multiple information in the same message

3 User Reply parsing The conversation up to this point is treated as a passage and the internal questions are run on this passage So the four questions that are run are

Where do you want to go

35

From where do you want to leave

When do you want to depart

4 Parsed responses from QA model After running the questions on the QA system with the given conversation answers are obtained for all along with their corresponding confidence values Since answers will be obtained even if the required question was not answered up to this point in the conversation the confidence values plays a crucial role in determining the validity of the answer For this purpose we determined ranges of confidence values for validation A confidence value above 10 signifies the question has been answered correctly A confidence of 2 ndash 10 signifies that it may have been answered but should verify with the user for correctness and any confidence below 2 is discarded

5 Asking remaining questions iteratively After the parsing it is checked if any of the required questions are still unanswered If so the chatbot asks the remaining question and the process from 3 ndash 4 is carried out iteratively Once all the questions have been answered the user is shown with the available flight options as per his request

Figure 4-4 The Flow diagram of the Flight booking Chatbot system

Online QA System and Attention Visualization

To be able to test out various examples we used the BiDAF [23] model for an

online demo One can either choose from the available examples from the drop-down

menu or paste their own passage and examples While this is a useful and interesting

system to test the model in a user-friendly way we created this system to be able to

focus on the wrong samples

36

Figure 4-5 QA system interface with attention highlight over candidate answers

The candidate answers are shown with the blue highlights as per their

confidence values The higher the confidence the darker the answer The highest

confidence value is chosen as the predicted answer We developed the system to show

attention spread of the candidate answers to realize what needs to be done to improve

the system This led us to realize the importance of including the query attention part as

well as multilevel attention on the BiDAF model as described in the first section of this

chapter

37

CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION

Future Directions

Having done a thorough analysis of the current methods and state of the arts in

Machine Comprehension and for the QA task and developing systems on it we have

achieved a strong sense of what needs to be done to further improve the QA models

After observing the wrong samples we can see that the system is still unable to encode

meaning and is picking answers based on statistical patterns of occurrence of answers

on training examples

As per the State of the Art models and analyzing the ongoing research literature

it is easy to conclude that more features need to be embedded to encode the meaning

of words phrases and sentences A paper called Reinforced Mnemonic Reader for

Machine Comprehension encoded POS and NER tags of words along with their word

and character embedding This gave them better results

Figure 5-1 An English language semantic parse tree [26]

38

We have developed a method to encode the syntax parse tree of a sentence that

not only encodes the post tags but the relation of the word within the phrase and the

relation of the phrase within the whole sentence in a hierarchical manner

Finally data augmentation is another solution to get better results One definite

way to reduce the errors would be to include similar samples in the training data which

the system is faltering in the dev set One could generate similar examples as the failure

cases and include them in the training set to have better prediction Another system

would be to train to similar and bigger datasets Our models were trained on the SQuAD

dataset There are other datasets too which does the similar question answering task

such as TriviaQA We could augment the training set of TriviaQA along with SQuAD to

have a more robust system that is able to generalize better and thus have higher

accuracy for predicting answer spans

Conclusion

In this work we have tried to explore the most fundamental techniques that have

shaped the current state of the art Then we proposed a minor improvement of

architecture over an existing model Furthermore we developed two applications that

uses the base model First we talked about how a chatbot application can be made

using the QA system and lastly we also created a web interface where the model can

be used for any Passage and Question This interface also shows the attention spread

on the candidate answers While our effort is ongoing to push the state of the art

forward we strongly believe that surpassing human level accuracy on this task will have

high dividends for the society at large

39

LIST OF REFERENCES

[1] Danqi Chen From Reading Comprehension To Open-Domain Question Answering 2018 Youtube Accessed March 16 2018 httpswwwyoutubecomwatchv=1RN88O9C13U

[2] Lehnert Wendy G The Process of Question Answering No RR-88 Yale Univ New Haven Conn Dept of computer science 1977

[3] Joshi Mandar Eunsol Choi Daniel S Weld and Luke Zettlemoyer Triviaqa A large scale distantly supervised challenge dataset for reading comprehension arXiv preprint arXiv170503551 (2017)

[4] Rajpurkar Pranav Jian Zhang Konstantin Lopyrev and Percy Liang Squad 100000+ questions for machine comprehension of text arXiv preprint arXiv160605250 (2016)

[5] The Stanford Question Answering Dataset 2018 RajpurkarGithubIo Accessed March 16 2018 httpsrajpurkargithubioSQuAD-explorer

[6] Hu Minghao Yuxing Peng and Xipeng Qiu Reinforced mnemonic reader for machine comprehension CoRR abs170502798 (2017)

[7] Schalkoff Robert J Artificial neural networks Vol 1 New York McGraw-Hill 1997

[8] Me For The AI and Neetesh Mehrotra 2018 The Connect Between Deep Learning And AI - Open Source For You Open Source For You Accessed March 16 2018 httpopensourceforucom201801connect-deep-learning-ai

[9] Krizhevsky Alex Ilya Sutskever and Geoffrey E Hinton Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems pp 1097-1105 2012

[10] An Intuitive Explanation Of Convolutional Neural Networks 2016 The Data Science Blog Accessed March 16 2018 httpsujjwalkarnme20160811intuitive-explanation-convnets

[11] Williams Ronald J and David Zipser A learning algorithm for continually running fully recurrent neural networks Neural computation 1 no 2 (1989) 270-280

[12] Hochreiter Sepp and Juumlrgen Schmidhuber Long short-term memory Neural computation 9 no 8 (1997) 1735-1780

[13] Chung Junyoung Caglar Gulcehre KyungHyun Cho and Yoshua Bengio Empirical evaluation of gated recurrent neural networks on sequence modeling arXiv preprint arXiv14123555 (2014)

40

[14] Mikolov Tomas Wen-tau Yih and Geoffrey Zweig Linguistic regularities in continuous space word representations In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics Human Language Technologies pp 746-751 2013

[15] Pennington Jeffrey Richard Socher and Christopher Manning Glove Global vectors for word representation In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) pp 1532-1543 2014

[16] Word2vec Tutorial - The Skip-Gram Model middot Chris Mccormick 2016 MccormickmlCom Accessed March 16 2018 httpmccormickmlcom20160419word2vec-tutorial-the-skip-gram-model

[17] Group 2015 [Paper Introduction] Bilingual Word Representations With Monolingual hellip SlideshareNet Accessed March 16 2018 httpswwwslidesharenetnaist-mtstudybilingual-word-representations-with-monolingual-quality-in-mind

[18] Attention Mechanism 2016 BlogHeuritechCom Accessed March 16 2018 httpsblogheuritechcom20160120attention-mechanism

[19] Weston Jason Sumit Chopra and Antoine Bordes 2014 Memory Networks arXiv preprint arXiv14103916v11 Accessed March 16 2018 httpsarxivorgabs14103916

[20] Wang Shuohang and Jing Jiang Machine comprehension using match-lstm and answer pointer arXiv preprint arXiv160807905 (2016)

[21] Wang Shuohang and Jing Jiang Learning natural language inference with LSTM arXiv preprint arXiv151208849 (2015)

[22] Vinyals Oriol Meire Fortunato and Navdeep Jaitly Pointer networks In Advances in Neural Information Processing Systems pp 2692-2700 2015

[23] Wang Wenhui Nan Yang Furu Wei Baobao Chang and Ming Zhou Gated self-matching networks for reading comprehension and question answering In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1 Long Papers) vol 1 pp 189-198 2017

[24] Seo Minjoon Aniruddha Kembhavi Ali Farhadi and Hannaneh Hajishirzi Bidirectional attention flow for machine comprehension arXiv preprint arXiv161101603 (2016)

[25] Clark Christopher and Matt Gardner Simple and effective multi-paragraph reading comprehension arXiv preprint arXiv171010723 (2017)

41

[26] 8 Analyzing Sentence Structure 2018 NltkOrg Accessed March 16 2018 httpwwwnltkorgbookch08html

42

BIOGRAPHICAL SKETCH

Purnendu Mukherjee grew up in Kolkata India and had a strong interest for

computers since an early age After his high school he did his BSc in computer

science from Ramakrishna Mission Residential College Narendrapur followed by MSc

in computer science from St Xavierrsquos College Kolkata He had a strong intuition and

interest for human like learning systems and wanted to work in this area He started

working at TCS Innovation Labs Pune for the application of Natural Language

Processing in Educational Applications As he was deeply passionate about learning

system that mimic the human brain and learn like a human child does he was

increasing interested about Deep Learning and its applications After working for a year

he went on to pursue a Master of Science degree in computer science from the

University of Florida Gainesville His academic interests have been focused on Deep

Learning and Natural Language Processing and he has been working on Machine

Reading Comprehension since summer of 2017

  • ACKNOWLEDGMENTS
  • LIST OF FIGURES
  • LIST OF ABBREVIATIONS
  • INTRODUCTION
  • THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
    • Neural Networks
    • Convolutional Neural Network
    • Recurrent Neural Networks (RNN)
    • Word Embedding
    • Attention Mechanism
    • Memory Networks
      • LITERATURE REVIEW AND STATE OF THE ART
        • Machine Comprehension Using Match-LSTM and Answer Pointer
        • R-NET Matching Reading Comprehension with Self-Matching Networks
        • Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
        • Summary
          • MULTI-ATTENTION QUESTION ANSWERING
            • Multi-Attention BiDAF Model
            • Chatbot Design Using a QA System
            • Online QA System and Attention Visualization
              • FUTURE DIRECTIONS AND CONCLUSION
                • Future Directions
                • Conclusion
                  • LIST OF REFERENCES
                  • BIOGRAPHICAL SKETCH

  • ACKNOWLEDGMENTS
  • LIST OF FIGURES
  • LIST OF ABBREVIATIONS
  • INTRODUCTION
  • THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
    • Neural Networks
    • Convolutional Neural Network
    • Recurrent Neural Networks (RNN)
    • Word Embedding
    • Attention Mechanism
    • Memory Networks
      • LITERATURE REVIEW AND STATE OF THE ART
        • Machine Comprehension Using Match-LSTM and Answer Pointer
        • R-NET Matching Reading Comprehension with Self-Matching Networks
        • Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
        • Summary
          • MULTI-ATTENTION QUESTION ANSWERING
            • Multi-Attention BiDAF Model
            • Chatbot Design Using a QA System
            • Online QA System and Attention Visualization
              • FUTURE DIRECTIONS AND CONCLUSION
                • Future Directions
                • Conclusion
                  • LIST OF REFERENCES
                  • BIOGRAPHICAL SKETCH
Page 20: QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISMufdcimages.uflib.ufl.edu/UF/E0/05/23/73/00001/MUKHERJEE_P.pdf · QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM By PURNENDU

20

Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]

21

CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART

There has been rapid progress since the release of the SQuAD dataset and

currently the best ensemble models are close to human level accuracy in machine

comprehension This is due to the various ingenious methods which solves some of the

problems with the previous methods Out of Vocabulary tokens were handled by using

Character embedding Long term dependency within context passage were solved

using self-attention And many other techniques such as Contextualized vectors History

of Words Attention Flow etc In this section we will have a look at the some of the most

important models that were fundamental to the progress of Questions Answering

Machine Comprehension Using Match-LSTM and Answer Pointer

In this paper [20] the authors propose an end-to-end neural architecture for the

QA task The architecture is based on match-LSTM [21] a model they proposed for

textual entailment and Pointer Net [22] a sequence-to-sequence model proposed by

Vinyals et al (2015) to constrain the output tokens to be from the input sequences

The model consists of an LSTM preprocessing layer a match-LSTM layer and an

Answer Pointer layer

We are given a piece of text which we refer to as a passage and a question

related to the passage The passage is represented by matrix P where P is the length

(number of tokens) of the passage and d is the dimensionality of word embeddings

Similarly the question is represented by matrix Q where Q is the length of the question

Our goal is to identify a subsequence from the passage as the answer to the question

22

Figure 3-1 Match-LSTM Model Architecture [20]

LSTM Preprocessing layer They use a standard one-directional LSTM

(Hochreiter amp Schmidhuber 1997) to process the passage and the question separately

as shown below

Match LSTM Layer They applied the match-LSTM model proposed for textual

entailment to their machine comprehension problem by treating the question as a

premise and the passage as a hypothesis The match-LSTM sequentially goes through

the passage At position i of the passage it first uses the standard word-by-word

attention mechanism to obtain attention weight vector as follows

(3-1)

23

where and are parameters to be learned

Answer Pointer Layer The Answer Pointer (Ans-Ptr) layer is motivated by the

Pointer Net introduced by Vinyals et al (2015) [22] The boundary model produces only

the start token and the end token of the answer and then all the tokens between these

two in the original passage are considered to be the answer

When this paper was released back in November 2016 Match-LSTM method

was the state of the art in Question Answering systems and was at the top of the

leaderboard for the SQuAD dataset

R-NET Matching Reading Comprehension with Self-Matching Networks

In this model [23] first the question and passage are processed by a

bidirectional recurrent network (Mikolov et al 2010) separately They then match the

question and passage with gated attention-based recurrent networks obtaining

question-aware representation for the passage On top of that they apply self-matching

attention to aggregate evidence from the whole passage and refine the passage

representation which is then fed into the output layer to predict the boundary of the

answer span

Question and passage encoding First the words are converted to their

respective word-level embeddings and character level embeddings The character-level

embeddings are generated by taking the final hidden states of a bi-directional recurrent

neural network (RNN) applied to embeddings of characters in the token Such

character-level embeddings have been shown to be helpful to deal with out-of-vocab

(OOV) tokens

24

They then use a bi-directional RNN to produce new representation and

of all words in the question and passage respectively

Figure 3-2 The task of Question Answering [23]

Gated Attention-based Recurrent Networks They use a variant of attention-

based recurrent networks with an additional gate to determine the importance of

information in the passage regarding a question Different from the gates in LSTM or

GRU the additional gate is based on the current passage word and its attention-pooling

vector of the question which focuses on the relation between the question and current

passage word The gate effectively model the phenomenon that only parts of the

passage are relevant to the question in reading comprehension and question answering

is utilized in subsequent calculations

25

Self-Matching Attention From the previous step the question aware passage

representation is generated to highlight the important parts of the passage One

problem with such representation is that it has very limited knowledge of context One

answer candidate is often oblivious to important cues in the passage outside its

surrounding window To address this problem the authors propose directly matching

the question-aware passage representation against itself It dynamically collects

evidence from the whole passage for words in passage and encodes the evidence

relevant to the current passage word and its matching question information into the

passage representation

Output Layer They use the same method as Wang amp Jiang (2016b) and use

pointer networks (Vinyals et al 2015) to predict the start and end position of the

answer In addition they use an attention-pooling over the question representation to

generate the initial hidden vector for the pointer network [23]

When the R-Net Model first appeared in the leaderboard in March 2017 it was at

the top with 723 Exact Match and 807 F1 score

Bi-Directional Attention Flow (BiDAF) for Machine Comprehension

BiDAF [24] is a hierarchical multi-stage architecture for modeling the

representations of the context paragraph at different levels of granularity BIDAF

includes character-level word-level and contextual embeddings and uses bi-directional

attention flow to obtain a query-aware context representation Their attention layer is not

used to summarize the context paragraph into a fixed-size vector Instead the attention

is computed for every time step and the attended vector at each time step along with

the representations from previous layers can flow through to the subsequent modeling

layer This reduces the information loss caused by early summarization [24]

26

Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]

Their machine comprehension model is a hierarchical multi-stage process and

consists of six layers

1 Character Embedding Layer maps each word to a vector space using character-level CNNs

2 Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model

3 Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embedding of the words These first three layers are applied to both the query and context They use a LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words We place an LSTM in both directions and concatenate the outputs of the two LSTMs

4 Attention Flow Layer couples the query and context vectors and produces a set of query aware feature vectors for each word in the context

5 Modeling Layer employs a Recurrent Neural Network to scan the context The input to the modeling layer is G which encodes the query-aware representations of context words The output of the modeling layer captures the interaction among the context words conditioned on the query They use two layers of bi-directional LSTM

27

with the output size of d for each direction Hence a matrix is obtained which is passed onto the output layer to predict the answer

6 Output Layer provides an answer to the query They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices by the predicted distributions averaged over all examples [25]

In a further variation of their above work they add a self-attention layer after the

Bi-attention layer to further improve the results The architecture of the model is as

Figure 3-3 The task of Question Answering [25]

28

Summary

In this chapter we reviewed the methods that are fundamental to the state of the

art in Machine Comprehension and for the task of Question Answering We have

reached closed to human level accuracy and this is due to incremental developments

over previous models As we saw Out of Vocabulary (OOV) tokens were handled by

using Character embedding Long term dependency within context passage were

solved using self-attention and many other techniques such as Contextualized vectors

Attention Flow etc were employed to get better results In the next chapter we will see

how we can build on these models and develop further

29

CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING

Although the advancements in Question Answering systems since the release of

the SQuAD dataset has been impressive and the results are getting close to human

level accuracy it is far from being a fool-proof system The models still make mistakes

which would be obvious to a human For example

Passage The Panthers used the San Jose State practice facility and stayed at

the San Jose Marriott The Broncos practiced at Florida State Facility and stayed at the

Santa Clara Marriott

Question At what universitys facility did the Panthers practice

Actual Answer San Jose State

Predicted Answer Florida State Facility

To find out what is leading to the wrong predictions we wanted to see the

attention weights associated with such an example We plotted the passage and

question heat map which is a 2D matrix where the intensity of each cell signifies the

similarity between a passage word and a question word For the above example we

found out that while certain words of the question are given high weightage other parts

are not The words lsquoAtrsquo lsquofacilityrsquo lsquopracticersquo are given high attention but lsquoPanthersrsquo does

not receive high attention If it had received high attention then the system would have

predicted lsquoSan Jose Statersquo as the right answer To solve this issue we analyzed the

base BiDAF model and proposed adding two things

1 Bi-Attention and Self-Attention over Query

2 Second level attention over the output of (Bi-Attention + Self-Attention) from both Context and Query

30

Multi-Attention BiDAF Model

Our comprehension model is a hierarchical multi-stage process and consists of

the following layers

1 Embedding Just as all other models we embed words using pretrained word vectors We also embed the characters in each word into size 20 vectors which are learned and run a convolution neural network followed by max-pooling to get character-derived embeddings for each word The character-level and word-level embeddings are then concatenated and passed to the next layer We do not update the word embeddings during training

2 Pre-Process A shared bi-directional GRU (Cho et al 2014) is used to map the question and passage embeddings to context aware embeddings

3 Context Attention The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al 2016) is used to build a query-aware context representation Let h_i be the vector for context word i q_j be the vector for question word j and n_q and n_c be the lengths of the question and context respectively We compute attention between context word i and question word j as

(4-1)

where w1 w2 and w3 are learned vectors and is element-wise multiplication We then compute an attended vector c_i for each context token as

(4-2)

We also compute a query-to-context vector q_c

(4-3)

31

The final vector computed for each token is built by concatenating

and In our model we subsequently pass the result through a linear layer with ReLU activations

4 Context Self-Attention Next we use a layer of residual self-attention The input is passed through another bi-directional GRU Then we apply the same attention mechanism only now between the passage and itself In this case we do not use query-to context attention and we set aij = 1048576inf if i = j As before we pass the concatenated output through a linear layer with ReLU activations This layer is applied residually so this output is additionally summed with the input

5 Query Attention For this part we do the same way as context attention but calculate the weighted sum of the context words for each query word Thus we get the length as number of query words Then we calculate context to query similar to query to context in context attention layer

Figure 4-1 The modified BiDAF model with multilevel attention

6 Query Self-Attention This part is done the same way as the context self-attention layer but from the output of the Query Attention layer

32

7 Context Query Bi-Attention + Self-Attention The output of the Context self-attention and the Query Self-Attention layers are taken as input and the same process for Bi-Attention and self-attention is applied on these inputs

8 Prediction In the last layer of our model a bidirectional GRU is applied followed by a linear layer that computes answer start scores for each token The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores The softmax operation is applied to the start and end scores to produce start and end probabilities and we optimize the negative log likelihood of selecting correct start and end tokens

Having carried out this modification we were able to solve the wrong example we

started with The multilevel attention model gives the correct output as ldquoSan Jose

Staterdquo Also we achieved slightly better scores than the original model with a F1 score

of 8544 on the SQuAD dev dataset

Chatbot Design Using a QA System

Designing a perfect chatbot that passes the Turing test is a fundamental goal for

Artificial Intelligence Although we are many order to magnitudes away from achieving

such a goal domain specific tasks can be solved with chatbots made from current

technology We had started our goal with a similar objective in mind ie to design a

domain specific chatbot and then generalize to other areas as it is able to achieve the

first domain specific objective robustly This led us to the fundamental problem of

Machine Comprehension and subsequently to the task of Question Answering Having

achieved some degree of success with QA systems we looked back if we could apply

our newly acquired knowledge in the task of designing Chatbots

The chatbots made with todayrsquos technologies are mostly handcrafted techniques

such as template matching that requires anticipating all possible ways a user may

articulate his requirements and a conversation may occur This requires a lot of man

hours for designing a domain specific system and is still very error prone In this section

33

we propose a general Chatbot design that would make the designing of a domain

specific chatbot very easy and robust at the same time

Every domain specific chatbot needs to obtain a set of information from the user

and show some results based on the user specific information obtained The traditional

chatbots use template matching and keywords lookup to determine if the user has

provided the required information Our idea is to use the Question Answering system in

the backend to extract out the required information from whatever the user has typed

until this point of the conversation The information to be extracted can be posed in the

form of a set of questions and the answers obtained from those questions can be used

as the parameters to supply the relevant information to the user

We had chosen our chatbot domain as the flight reservation system Our goal

was to extract the required information from the user to be able to show him the

available flights as per the userrsquos requirements

Figure 4-2 Flight reservation chatbotrsquos chat window

34

For a flight reservation task the booking agent needs to know the origin city

destination city and date of travel at minimum to be able to show the available flights

Optional information includes the number of tickets passengerrsquos name one way or

round trip etc

The minimalistic conversation with the user through the chat window would be as

shown above We had a platform called OneTask on which we wanted to implement our

chat bot The chat interface within the OneTask system looks as follows

Figure 4-3 Chatbot within OneTask system

The working of the chat system is as follows ndash

1 Initiation The user opens the chat window which starts a session with the chat bot The chat bot asks lsquoHow may I help yoursquo

2 User Reply The user may reply with none of the required information for flight booking or may reply with multiple information in the same message

3 User Reply parsing The conversation up to this point is treated as a passage and the internal questions are run on this passage So the four questions that are run are

Where do you want to go

35

From where do you want to leave

When do you want to depart

4 Parsed responses from QA model After running the questions on the QA system with the given conversation answers are obtained for all along with their corresponding confidence values Since answers will be obtained even if the required question was not answered up to this point in the conversation the confidence values plays a crucial role in determining the validity of the answer For this purpose we determined ranges of confidence values for validation A confidence value above 10 signifies the question has been answered correctly A confidence of 2 ndash 10 signifies that it may have been answered but should verify with the user for correctness and any confidence below 2 is discarded

5 Asking remaining questions iteratively After the parsing it is checked if any of the required questions are still unanswered If so the chatbot asks the remaining question and the process from 3 ndash 4 is carried out iteratively Once all the questions have been answered the user is shown with the available flight options as per his request

Figure 4-4 The Flow diagram of the Flight booking Chatbot system

Online QA System and Attention Visualization

To be able to test out various examples we used the BiDAF [23] model for an

online demo One can either choose from the available examples from the drop-down

menu or paste their own passage and examples While this is a useful and interesting

system to test the model in a user-friendly way we created this system to be able to

focus on the wrong samples

36

Figure 4-5 QA system interface with attention highlight over candidate answers

The candidate answers are shown with the blue highlights as per their

confidence values The higher the confidence the darker the answer The highest

confidence value is chosen as the predicted answer We developed the system to show

attention spread of the candidate answers to realize what needs to be done to improve

the system This led us to realize the importance of including the query attention part as

well as multilevel attention on the BiDAF model as described in the first section of this

chapter

37

CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION

Future Directions

Having done a thorough analysis of the current methods and state of the arts in

Machine Comprehension and for the QA task and developing systems on it we have

achieved a strong sense of what needs to be done to further improve the QA models

After observing the wrong samples we can see that the system is still unable to encode

meaning and is picking answers based on statistical patterns of occurrence of answers

on training examples

As per the State of the Art models and analyzing the ongoing research literature

it is easy to conclude that more features need to be embedded to encode the meaning

of words phrases and sentences A paper called Reinforced Mnemonic Reader for

Machine Comprehension encoded POS and NER tags of words along with their word

and character embedding This gave them better results

Figure 5-1 An English language semantic parse tree [26]

38

We have developed a method to encode the syntax parse tree of a sentence that

not only encodes the post tags but the relation of the word within the phrase and the

relation of the phrase within the whole sentence in a hierarchical manner

Finally data augmentation is another solution to get better results One definite

way to reduce the errors would be to include similar samples in the training data which

the system is faltering in the dev set One could generate similar examples as the failure

cases and include them in the training set to have better prediction Another system

would be to train to similar and bigger datasets Our models were trained on the SQuAD

dataset There are other datasets too which does the similar question answering task

such as TriviaQA We could augment the training set of TriviaQA along with SQuAD to

have a more robust system that is able to generalize better and thus have higher

accuracy for predicting answer spans

Conclusion

In this work we have tried to explore the most fundamental techniques that have

shaped the current state of the art Then we proposed a minor improvement of

architecture over an existing model Furthermore we developed two applications that

uses the base model First we talked about how a chatbot application can be made

using the QA system and lastly we also created a web interface where the model can

be used for any Passage and Question This interface also shows the attention spread

on the candidate answers While our effort is ongoing to push the state of the art

forward we strongly believe that surpassing human level accuracy on this task will have

high dividends for the society at large

39

LIST OF REFERENCES

[1] Danqi Chen From Reading Comprehension To Open-Domain Question Answering 2018 Youtube Accessed March 16 2018 httpswwwyoutubecomwatchv=1RN88O9C13U

[2] Lehnert Wendy G The Process of Question Answering No RR-88 Yale Univ New Haven Conn Dept of computer science 1977

[3] Joshi Mandar Eunsol Choi Daniel S Weld and Luke Zettlemoyer Triviaqa A large scale distantly supervised challenge dataset for reading comprehension arXiv preprint arXiv170503551 (2017)

[4] Rajpurkar Pranav Jian Zhang Konstantin Lopyrev and Percy Liang Squad 100000+ questions for machine comprehension of text arXiv preprint arXiv160605250 (2016)

[5] The Stanford Question Answering Dataset 2018 RajpurkarGithubIo Accessed March 16 2018 httpsrajpurkargithubioSQuAD-explorer

[6] Hu Minghao Yuxing Peng and Xipeng Qiu Reinforced mnemonic reader for machine comprehension CoRR abs170502798 (2017)

[7] Schalkoff Robert J Artificial neural networks Vol 1 New York McGraw-Hill 1997

[8] Me For The AI and Neetesh Mehrotra 2018 The Connect Between Deep Learning And AI - Open Source For You Open Source For You Accessed March 16 2018 httpopensourceforucom201801connect-deep-learning-ai

[9] Krizhevsky Alex Ilya Sutskever and Geoffrey E Hinton Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems pp 1097-1105 2012

[10] An Intuitive Explanation Of Convolutional Neural Networks 2016 The Data Science Blog Accessed March 16 2018 httpsujjwalkarnme20160811intuitive-explanation-convnets

[11] Williams Ronald J and David Zipser A learning algorithm for continually running fully recurrent neural networks Neural computation 1 no 2 (1989) 270-280

[12] Hochreiter Sepp and Juumlrgen Schmidhuber Long short-term memory Neural computation 9 no 8 (1997) 1735-1780


CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART

There has been rapid progress since the release of the SQuAD dataset, and currently the best ensemble models are close to human-level accuracy in machine comprehension. This is due to various ingenious methods that solve some of the problems of earlier approaches: out-of-vocabulary tokens were handled using character embeddings, long-term dependencies within the context passage were addressed with self-attention, and many other techniques such as contextualized vectors, history of words, and attention flow were introduced. In this chapter we look at some of the most important models that were fundamental to the progress of Question Answering.

Machine Comprehension Using Match-LSTM and Answer Pointer

In this paper [20], the authors propose an end-to-end neural architecture for the QA task. The architecture is based on match-LSTM [21], a model they proposed for textual entailment, and Pointer Net [22], a sequence-to-sequence model proposed by Vinyals et al. (2015) to constrain the output tokens to be from the input sequences. The model consists of an LSTM preprocessing layer, a match-LSTM layer, and an Answer Pointer layer.

We are given a piece of text, which we refer to as a passage, and a question related to the passage. The passage is represented by a matrix $P \in \mathbb{R}^{d \times P}$, where P is the length (number of tokens) of the passage and d is the dimensionality of the word embeddings. Similarly, the question is represented by a matrix $Q \in \mathbb{R}^{d \times Q}$, where Q is the length of the question. Our goal is to identify a subsequence of the passage as the answer to the question.

Figure 3-1 Match-LSTM Model Architecture [20]

LSTM Preprocessing layer. They use a standard one-directional LSTM (Hochreiter & Schmidhuber, 1997) to process the passage and the question separately, obtaining hidden representations $H^p = \overrightarrow{\mathrm{LSTM}}(P)$ and $H^q = \overrightarrow{\mathrm{LSTM}}(Q)$.

Match-LSTM layer. They apply the match-LSTM model proposed for textual entailment to the machine comprehension problem by treating the question as a premise and the passage as a hypothesis. The match-LSTM goes through the passage sequentially. At position i of the passage, it first uses the standard word-by-word attention mechanism to obtain the attention weight vector $\vec{\alpha}_i$ as follows:

$\vec{G}_i = \tanh\left(W^{q} H^{q} + \left(W^{p} h^{p}_i + W^{r} \vec{h}^{r}_{i-1} + b^{p}\right) \otimes e_Q\right), \qquad \vec{\alpha}_i = \mathrm{softmax}\left(w^{\top} \vec{G}_i + b \otimes e_Q\right)$   (3-1)

where $W^{q}$, $W^{p}$, $W^{r}$, $b^{p}$, $w$, and $b$ are parameters to be learned, and $\otimes e_Q$ denotes repeating the vector or scalar on the left Q times to form a matrix.

Answer Pointer layer. The Answer Pointer (Ans-Ptr) layer is motivated by the Pointer Net introduced by Vinyals et al. (2015) [22]. The boundary model predicts only the start token and the end token of the answer; all the tokens between these two in the original passage are then considered to be the answer.
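As an illustration of how such a boundary prediction can be decoded at inference time, the short sketch below is a simplified NumPy example (not the authors' code); the probability vectors and the maximum-span-length cap are assumed inputs.

    import numpy as np

    def best_span(p_start, p_end, max_len=15):
        """Pick the span (i, j), i <= j, maximizing p_start[i] * p_end[j].

        p_start, p_end: 1-D arrays of start/end probabilities over passage tokens.
        max_len: assumed cap on answer length (not part of the original model).
        """
        n = len(p_start)
        best, best_score = (0, 0), -1.0
        for i in range(n):
            # only consider end positions within max_len tokens of the start
            for j in range(i, min(i + max_len, n)):
                score = p_start[i] * p_end[j]
                if score > best_score:
                    best, best_score = (i, j), score
        return best  # token indices of the answer start and end

    # usage: tokens[best[0]:best[1] + 1] would be the predicted answer text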

When this paper was released in November 2016, the match-LSTM method was the state of the art in Question Answering and was at the top of the leaderboard for the SQuAD dataset.

R-NET Matching Reading Comprehension with Self-Matching Networks

In this model [23], the question and passage are first processed separately by a bidirectional recurrent network (Mikolov et al., 2010). They then match the question and passage with gated attention-based recurrent networks, obtaining a question-aware representation for the passage. On top of that, they apply self-matching attention to aggregate evidence from the whole passage and refine the passage representation, which is then fed into the output layer to predict the boundary of the answer span.

Question and passage encoding. First, the words are converted to their respective word-level embeddings and character-level embeddings. The character-level embeddings are generated by taking the final hidden states of a bi-directional recurrent neural network (RNN) applied to the embeddings of the characters in the token. Such character-level embeddings have been shown to be helpful in dealing with out-of-vocabulary (OOV) tokens.

They then use a bi-directional RNN to produce new contextual representations of all the words in the question and the passage, respectively.

Figure 3-2 The task of Question Answering [23]

Gated Attention-based Recurrent Networks. They use a variant of attention-based recurrent networks with an additional gate to determine the importance of information in the passage with respect to a question. Different from the gates in an LSTM or GRU, the additional gate is based on the current passage word and its attention-pooling vector of the question, which focuses on the relation between the question and the current passage word. The gate effectively models the phenomenon that only parts of the passage are relevant to the question in reading comprehension, and the resulting question-aware passage representation is utilized in subsequent calculations.
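The gating idea can be sketched as follows; this is a minimal PyTorch-style illustration of the mechanism described in [23], not the authors' implementation, and the module name, tensor names, and dimensions are assumptions.

    import torch
    import torch.nn as nn

    class GatedAttentionInput(nn.Module):
        """Gate the concatenation of a passage word vector and its question
        attention-pooling vector before feeding it to the recurrent cell."""
        def __init__(self, dim):
            super().__init__()
            self.gate = nn.Linear(2 * dim, 2 * dim, bias=False)

        def forward(self, u_p, c_q):
            # u_p: current passage word representation, shape (batch, dim)
            # c_q: attention-pooled question vector for this word, shape (batch, dim)
            x = torch.cat([u_p, c_q], dim=-1)
            g = torch.sigmoid(self.gate(x))   # element-wise importance gate
            return g * x                      # gated input to the RNN step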


Self-Matching Attention. From the previous step, the question-aware passage representation is generated to highlight the important parts of the passage. One problem with such a representation is that it has very limited knowledge of context: one answer candidate is often oblivious to important cues in the passage outside its surrounding window. To address this problem, the authors propose directly matching the question-aware passage representation against itself. It dynamically collects evidence from the whole passage for each word in the passage and encodes the evidence relevant to the current passage word and its matching question information into the passage representation.

Output Layer. They use the same method as Wang & Jiang (2016b) and use pointer networks (Vinyals et al., 2015) to predict the start and end positions of the answer. In addition, they use attention-pooling over the question representation to generate the initial hidden vector for the pointer network [23].

When the R-NET model first appeared on the leaderboard in March 2017, it was at the top with a 72.3 Exact Match and 80.7 F1 score.

Bi-Directional Attention Flow (BiDAF) for Machine Comprehension

BiDAF [24] is a hierarchical multi-stage architecture for modeling the representations of the context paragraph at different levels of granularity. BiDAF includes character-level, word-level, and contextual embeddings, and uses bi-directional attention flow to obtain a query-aware context representation. Their attention layer is not used to summarize the context paragraph into a fixed-size vector. Instead, the attention is computed for every time step, and the attended vector at each time step, along with the representations from previous layers, can flow through to the subsequent modeling layer. This reduces the information loss caused by early summarization [24].


Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]

Their machine comprehension model is a hierarchical multi-stage process and consists of six layers:

1 Character Embedding Layer maps each word to a vector space using character-level CNNs.

2 Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model.

3 Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embeddings of the words. These first three layers are applied to both the query and the context. They use an LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words, placing an LSTM in each direction and concatenating the outputs of the two LSTMs.

4 Attention Flow Layer couples the query and context vectors and produces a set of query-aware feature vectors for each word in the context.

5 Modeling Layer employs a Recurrent Neural Network to scan the context. The input to the modeling layer is G, which encodes the query-aware representations of the context words. The output of the modeling layer captures the interaction among the context words conditioned on the query. They use two layers of bi-directional LSTM with an output size of d for each direction; hence a matrix $M \in \mathbb{R}^{2d \times T}$ is obtained, which is passed on to the output layer to predict the answer.

6 Output Layer provides an answer to the query. They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices given by the predicted distributions, averaged over all examples [25]. A sketch of this loss appears below.
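To make the output-layer objective concrete, the following is a minimal PyTorch-style sketch of such a span loss; it is an illustration under assumed tensor shapes, not the authors' code.

    import torch
    import torch.nn.functional as F

    def span_loss(start_logits, end_logits, true_start, true_end):
        """Sum of negative log probabilities of the true start and end indices,
        averaged over the batch. Assumed shapes:
        start_logits, end_logits: (batch, context_len); true_start, true_end: (batch,)."""
        # cross_entropy is the negative log softmax probability of the true index
        loss_start = F.cross_entropy(start_logits, true_start)
        loss_end = F.cross_entropy(end_logits, true_end)
        return loss_start + loss_end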

In a further variation of their above work [25], they add a self-attention layer after the bi-attention layer to further improve the results. The architecture of this model is shown in the figure below.

Figure 3-3 The task of Question Answering [25]


Summary

In this chapter we reviewed the methods that are fundamental to the state of the art in Machine Comprehension and the task of Question Answering. We have come close to human-level accuracy, and this is due to incremental developments over previous models. As we saw, out-of-vocabulary (OOV) tokens were handled using character embeddings, long-term dependencies within the context passage were addressed with self-attention, and many other techniques such as contextualized vectors and attention flow were employed to get better results. In the next chapter we will see how we can build on these models and develop them further.


CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING

Although the advancements in Question Answering systems since the release of the SQuAD dataset have been impressive and the results are getting close to human-level accuracy, the systems are far from fool-proof. The models still make mistakes that would be obvious to a human. For example:

Passage: The Panthers used the San Jose State practice facility and stayed at the San Jose Marriott. The Broncos practiced at Florida State Facility and stayed at the Santa Clara Marriott.

Question: At what university's facility did the Panthers practice?

Actual Answer: San Jose State

Predicted Answer: Florida State Facility

To find out what was leading to the wrong predictions, we wanted to see the attention weights associated with such an example. We plotted the passage and question heat map, which is a 2D matrix where the intensity of each cell signifies the similarity between a passage word and a question word. For the above example, we found that while certain words of the question are given high weight, other parts are not. The words 'At', 'facility', and 'practice' are given high attention, but 'Panthers' does not receive high attention. If it had, the system would have predicted 'San Jose State' as the right answer. To solve this issue, we analyzed the base BiDAF model and proposed adding two things:

1 Bi-Attention and Self-Attention over the Query

2 A second level of attention over the output of (Bi-Attention + Self-Attention) from both the Context and the Query


Multi-Attention BiDAF Model

Our comprehension model is a hierarchical multi-stage process and consists of the following layers:

1 Embedding Just as in all other models, we embed words using pretrained word vectors. We also embed the characters in each word into size-20 vectors which are learned, and run a convolutional neural network followed by max-pooling to get character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training.

2 Pre-Process A shared bi-directional GRU (Cho et al., 2014) is used to map the question and passage embeddings to context-aware embeddings.

3 Context Attention The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al., 2016) is used to build a query-aware context representation. Let h_i be the vector for context word i, q_j be the vector for question word j, and n_q and n_c be the lengths of the question and context respectively. We compute attention between context word i and question word j as

$a_{ij} = w_1 \cdot h_i + w_2 \cdot q_j + w_3 \cdot (h_i \odot q_j)$   (4-1)

where w_1, w_2, and w_3 are learned vectors and $\odot$ is element-wise multiplication. We then compute an attended vector c_i for each context token as

$p_{ij} = \frac{e^{a_{ij}}}{\sum_{k=1}^{n_q} e^{a_{ik}}}, \qquad c_i = \sum_{j=1}^{n_q} p_{ij}\, q_j$   (4-2)

We also compute a query-to-context vector q_c:

$m_i = \max_{1 \le j \le n_q} a_{ij}, \qquad p_i = \frac{e^{m_i}}{\sum_{k=1}^{n_c} e^{m_k}}, \qquad q_c = \sum_{i=1}^{n_c} p_i\, h_i$   (4-3)

The final vector computed for each token is built by concatenating $h_i$, $c_i$, $h_i \odot c_i$, and $q_c \odot h_i$. In our model we subsequently pass the result through a linear layer with ReLU activations. (A code sketch of this attention computation is given after this list.)

4 Context Self-Attention Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU. Then we apply the same attention mechanism, only now between the passage and itself; in this case we do not use query-to-context attention, and we set $a_{ij} = -\infty$ if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so its output is additionally summed with its input.

5 Query Attention For this part we proceed the same way as in the context attention layer, but calculate the weighted sum of the context words for each query word; thus the output length is the number of query words. We then calculate context-to-query attention, analogous to the query-to-context attention of the context attention layer.

Figure 4-1 The modified BiDAF model with multilevel attention

6 Query Self-Attention This part is done the same way as the context self-attention layer, but on the output of the Query Attention layer.


7 Context-Query Bi-Attention + Self-Attention The outputs of the Context Self-Attention and the Query Self-Attention layers are taken as input, and the same process of bi-attention and self-attention is applied to these inputs.

8 Prediction In the last layer of our model, a bidirectional GRU is applied, followed by a linear layer that computes answer start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log likelihood of selecting the correct start and end tokens.
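As referenced in the Context Attention layer above, the following is a minimal NumPy sketch of equations (4-1) through (4-3). It is an illustration only, not our training code; the array names and shapes are assumptions, and the learned vectors w1, w2, w3 are taken as given.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def bi_attention(H, Q, w1, w2, w3):
        """H: (n_c, d) context vectors; Q: (n_q, d) question vectors;
        w1, w2, w3: (d,) learned vectors. Returns per-token vectors per eqs (4-1)-(4-3)."""
        # (4-1) trilinear similarity between every context/question word pair
        A = H @ w1[:, None] + (Q @ w2)[None, :] + (H * w3) @ Q.T   # (n_c, n_q)
        # (4-2) context-to-query: attended question vector c_i for each context word
        C = softmax(A, axis=1) @ Q                                  # (n_c, d)
        # (4-3) query-to-context: a single vector q_c attending over context words
        q_c = softmax(A.max(axis=1), axis=0) @ H                    # (d,)
        # final per-token representation [h; c; h*c; q_c*h]
        return np.concatenate([H, C, H * C, q_c[None, :] * H], axis=1)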

Having carried out this modification, we were able to solve the wrong example we started with: the multilevel attention model gives the correct output, "San Jose State". We also achieved slightly better scores than the original model, with an F1 score of 85.44 on the SQuAD dev set.
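For reference, the F1 metric reported here is the SQuAD-style token-overlap F1 between the predicted and gold answer spans. The sketch below is a simplified version that ignores the official script's answer normalization and multiple-reference maximum.

    from collections import Counter

    def span_f1(prediction, ground_truth):
        """Token-overlap F1 between two answer strings (simplified: no lowercasing,
        punctuation stripping, or multi-reference max as in the official script)."""
        pred_tokens = prediction.split()
        gold_tokens = ground_truth.split()
        common = Counter(pred_tokens) & Counter(gold_tokens)
        num_same = sum(common.values())
        if num_same == 0:
            return 0.0
        precision = num_same / len(pred_tokens)
        recall = num_same / len(gold_tokens)
        return 2 * precision * recall / (precision + recall)

    # e.g. span_f1("San Jose State", "San Jose State practice facility") == 0.75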

Chatbot Design Using a QA System

Designing a perfect chatbot that passes the Turing test is a fundamental goal of Artificial Intelligence. Although we are many orders of magnitude away from achieving such a goal, domain-specific tasks can be solved with chatbots built from current technology. We started with a similar objective in mind, i.e., to design a domain-specific chatbot and then generalize to other areas once the first domain-specific objective is achieved robustly. This led us to the fundamental problem of Machine Comprehension and subsequently to the task of Question Answering. Having achieved some degree of success with QA systems, we looked back to see whether we could apply our newly acquired knowledge to the task of designing chatbots.

The chatbots made with today's technologies mostly rely on handcrafted techniques such as template matching, which requires anticipating all possible ways a user may articulate his requirements and a conversation may occur. This requires a lot of man-hours for designing a domain-specific system and is still very error prone. In this section we propose a general chatbot design that would make the design of a domain-specific chatbot easy and robust at the same time.

Every domain-specific chatbot needs to obtain a set of information from the user and show some results based on the user-specific information obtained. Traditional chatbots use template matching and keyword lookup to determine whether the user has provided the required information. Our idea is to use the Question Answering system in the backend to extract the required information from whatever the user has typed up to this point of the conversation. The information to be extracted can be posed as a set of questions, and the answers obtained for those questions can be used as the parameters to supply the relevant information to the user.

We chose the flight reservation system as our chatbot domain. Our goal was to extract the required information from the user to be able to show him the available flights as per the user's requirements.

Figure 4-2 Flight reservation chatbot's chat window


For a flight reservation task, the booking agent needs to know at minimum the origin city, the destination city, and the date of travel to be able to show the available flights. Optional information includes the number of tickets, the passenger's name, one-way or round trip, etc.

The minimalistic conversation with the user through the chat window would be as shown above. We had a platform called OneTask on which we wanted to implement our chatbot. The chat interface within the OneTask system looks as follows.

Figure 4-3 Chatbot within OneTask system

The working of the chat system is as follows:

1 Initiation The user opens the chat window, which starts a session with the chatbot. The chatbot asks 'How may I help you?'

2 User Reply The user may reply with none of the required information for flight booking, or may reply with multiple pieces of information in the same message.

3 User Reply parsing The conversation up to this point is treated as a passage, and the internal questions are run on this passage. The questions that are run are:

Where do you want to go?

From where do you want to leave?

When do you want to depart?

4 Parsed responses from QA model After running the questions on the QA system with the given conversation, answers are obtained for all of them along with their corresponding confidence values. Since answers will be returned even if the required information has not been given up to this point in the conversation, the confidence values play a crucial role in determining the validity of an answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies that the question has been answered correctly, a confidence of 2 to 10 signifies that it may have been answered but the chatbot should verify with the user for correctness, and any answer with confidence below 2 is discarded.

5 Asking remaining questions iteratively After the parsing, it is checked whether any of the required questions are still unanswered. If so, the chatbot asks the remaining question and the process from steps 3 and 4 is carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request. (A code sketch of this loop follows Figure 4-4 below.)

Figure 4-4 The Flow diagram of the Flight booking Chatbot system
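To make the flow concrete, the following is a minimal sketch of the QA-driven slot-filling loop described above. Names such as qa_model.predict and ask_user, as well as the structure of the loop, are assumptions that mirror the description; this is not the actual implementation. The threshold constants follow the confidence ranges given in step 4.

    REQUIRED_QUESTIONS = [
        "Where do you want to go?",
        "From where do you want to leave?",
        "When do you want to depart?",
    ]
    ACCEPT, VERIFY = 10.0, 2.0  # confidence thresholds from the description above

    def chat_turn(conversation, qa_model, ask_user):
        """conversation: the dialogue so far (string); qa_model.predict(passage, question)
        is assumed to return (answer, confidence); ask_user sends a prompt and returns a reply."""
        slots = {}
        while len(slots) < len(REQUIRED_QUESTIONS):
            for question in REQUIRED_QUESTIONS:
                if question in slots:
                    continue
                answer, confidence = qa_model.predict(conversation, question)
                if confidence >= ACCEPT:
                    slots[question] = answer
                elif confidence >= VERIFY:
                    # uncertain: confirm with the user before accepting the answer
                    reply = ask_user(f"Did you mean '{answer}' for: {question}")
                    if reply.strip().lower().startswith("y"):
                        slots[question] = answer
            # ask the first still-missing question and append the reply to the conversation
            missing = [q for q in REQUIRED_QUESTIONS if q not in slots]
            if missing:
                conversation += "\n" + ask_user(missing[0])
        return slots  # origin, destination, and date extracted from the dialogue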

Online QA System and Attention Visualization

To be able to test out various examples, we used the BiDAF [24] model for an online demo. One can either choose from the available examples in the drop-down menu or paste their own passage and examples. While this is a useful and interesting way to test the model in a user-friendly manner, we created this system primarily to be able to focus on the wrong samples.


Figure 4-5 QA system interface with attention highlight over candidate answers

The candidate answers are shown with blue highlights according to their confidence values: the higher the confidence, the darker the highlight. The candidate with the highest confidence value is chosen as the predicted answer. We developed the system to show the attention spread over the candidate answers in order to understand what needs to be done to improve the system. This led us to realize the importance of including the query attention part as well as multilevel attention in the BiDAF model, as described in the first section of this chapter.
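A passage-question attention heat map of the kind used in this analysis can be produced with a few lines of matplotlib. The sketch below assumes that a similarity matrix A of shape (context length, question length) and the corresponding token lists are already available.

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_attention(A, context_tokens, question_tokens):
        """A[i, j]: similarity between context token i and question token j (assumed given)."""
        fig, ax = plt.subplots(figsize=(6, 10))
        im = ax.imshow(A, aspect="auto", cmap="Blues")
        ax.set_xticks(range(len(question_tokens)))
        ax.set_xticklabels(question_tokens, rotation=90)
        ax.set_yticks(range(len(context_tokens)))
        ax.set_yticklabels(context_tokens)
        fig.colorbar(im, ax=ax, label="attention weight")
        plt.tight_layout()
        plt.show()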


CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION

Future Directions

Having done a thorough analysis of the current methods and state of the art in Machine Comprehension and the QA task, and having developed systems on top of them, we have gained a strong sense of what needs to be done to further improve QA models. After observing the wrong samples, we can see that the system is still unable to encode meaning and is picking answers based on statistical patterns of occurrence of answers in the training examples.

From the state-of-the-art models and the ongoing research literature, it is easy to conclude that more features need to be embedded to encode the meaning of words, phrases, and sentences. The paper Reinforced Mnemonic Reader for Machine Comprehension [6] encoded POS and NER tags of words along with their word and character embeddings, and this gave them better results.

Figure 5-1 An English language semantic parse tree [26]


We have developed a method to encode the syntax parse tree of a sentence that encodes not only the POS tags but also the relation of each word within its phrase and the relation of each phrase within the whole sentence, in a hierarchical manner.
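As a concrete illustration of the kind of structure referred to here, a constituency parse of the sort shown in Figure 5-1 can be produced with NLTK [26]. The toy grammar below is only an example for illustration; it is not the grammar or parser used in our method.

    import nltk

    # toy grammar for illustration; a real system would use a trained parser
    grammar = nltk.CFG.fromstring("""
      S  -> NP VP
      NP -> Det N | NP PP
      VP -> V NP | VP PP
      PP -> P NP
      Det -> 'the'
      N  -> 'dog' | 'park'
      V  -> 'saw'
      P  -> 'in'
    """)

    parser = nltk.ChartParser(grammar)
    sentence = "the dog saw the dog in the park".split()
    for tree in parser.parse(sentence):
        print(tree)          # bracketed parse, e.g. (S (NP (Det the) (N dog)) (VP ...))
        # tree.pos() gives the (word, POS tag) pairs at the leaves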

Finally, data augmentation is another way to get better results. One definite way to reduce the errors would be to include in the training data samples similar to those in the dev set on which the system is faltering. One could generate examples similar to the failure cases and include them in the training set to obtain better predictions. Another approach would be to train on similar and bigger datasets. Our models were trained on the SQuAD dataset; there are other datasets, such as TriviaQA, that pose a similar question answering task. We could augment the training set with TriviaQA along with SQuAD to obtain a more robust system that generalizes better and thus achieves higher accuracy when predicting answer spans.

Conclusion

In this work we explored the most fundamental techniques that have shaped the current state of the art. We then proposed a minor architectural improvement over an existing model. Furthermore, we developed two applications that use the base model: first, we showed how a chatbot application can be built using the QA system, and lastly, we created a web interface where the model can be used on any passage and question. This interface also shows the attention spread over the candidate answers. While our effort to push the state of the art forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will pay high dividends for society at large.


LIST OF REFERENCES

[1] Danqi Chen. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U

[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.

[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension." arXiv preprint arXiv:1705.03551 (2017).

[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ questions for machine comprehension of text." arXiv preprint arXiv:1606.05250 (2016).

[5] "The Stanford Question Answering Dataset." 2018. rajpurkar.github.io. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/

[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced mnemonic reader for machine comprehension." CoRR abs/1705.02798 (2017).

[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.

[8] Me For The AI, and Neetesh Mehrotra. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai/

[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.

[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/

[11] Williams, Ronald J., and David Zipser. "A learning algorithm for continually running fully recurrent neural networks." Neural Computation 1, no. 2 (1989): 270-280.

[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9, no. 8 (1997): 1735-1780.

[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv preprint arXiv:1412.3555 (2014).

[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic regularities in continuous space word representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.

[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.

[16] "Word2Vec Tutorial - The Skip-Gram Model · Chris McCormick." 2016. mccormickml.com. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

[17] Group. 2015. "[Paper Introduction] Bilingual Word Representations with Monolingual …" slideshare.net. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind

[18] "Attention Mechanism." 2016. blog.heuritech.com. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism/

[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. 2014. "Memory Networks." arXiv preprint arXiv:1410.3916v11. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916

[20] Wang, Shuohang, and Jing Jiang. "Machine comprehension using match-LSTM and answer pointer." arXiv preprint arXiv:1608.07905 (2016).

[21] Wang, Shuohang, and Jing Jiang. "Learning natural language inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).

[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.

[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated self-matching networks for reading comprehension and question answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.

[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional attention flow for machine comprehension." arXiv preprint arXiv:1611.01603 (2016).

[25] Clark, Christopher, and Matt Gardner. "Simple and effective multi-paragraph reading comprehension." arXiv preprint arXiv:1710.10723 (2017).

[26] "8. Analyzing Sentence Structure." 2018. nltk.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html


BIOGRAPHICAL SKETCH

Purnendu Mukherjee grew up in Kolkata, India, and had a strong interest in computers from an early age. After high school, he did his B.Sc. in computer science at Ramakrishna Mission Residential College, Narendrapur, followed by an M.Sc. in computer science at St. Xavier's College, Kolkata. He had a strong intuition for and interest in human-like learning systems and wanted to work in this area. He started working at TCS Innovation Labs, Pune, on the application of Natural Language Processing in educational applications. As he was deeply passionate about learning systems that mimic the human brain and learn like a human child does, he became increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science at the University of Florida, Gainesville. His academic interests have been focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.

Page 22: QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISMufdcimages.uflib.ufl.edu/UF/E0/05/23/73/00001/MUKHERJEE_P.pdf · QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM By PURNENDU

22

Figure 3-1 Match-LSTM Model Architecture [20]

LSTM Preprocessing layer They use a standard one-directional LSTM

(Hochreiter amp Schmidhuber 1997) to process the passage and the question separately

as shown below

Match LSTM Layer They applied the match-LSTM model proposed for textual

entailment to their machine comprehension problem by treating the question as a

premise and the passage as a hypothesis The match-LSTM sequentially goes through

the passage At position i of the passage it first uses the standard word-by-word

attention mechanism to obtain attention weight vector as follows

(3-1)

23

where and are parameters to be learned

Answer Pointer Layer The Answer Pointer (Ans-Ptr) layer is motivated by the

Pointer Net introduced by Vinyals et al (2015) [22] The boundary model produces only

the start token and the end token of the answer and then all the tokens between these

two in the original passage are considered to be the answer

When this paper was released back in November 2016 Match-LSTM method

was the state of the art in Question Answering systems and was at the top of the

leaderboard for the SQuAD dataset

R-NET Matching Reading Comprehension with Self-Matching Networks

In this model [23] first the question and passage are processed by a

bidirectional recurrent network (Mikolov et al 2010) separately They then match the

question and passage with gated attention-based recurrent networks obtaining

question-aware representation for the passage On top of that they apply self-matching

attention to aggregate evidence from the whole passage and refine the passage

representation which is then fed into the output layer to predict the boundary of the

answer span

Question and passage encoding First the words are converted to their

respective word-level embeddings and character level embeddings The character-level

embeddings are generated by taking the final hidden states of a bi-directional recurrent

neural network (RNN) applied to embeddings of characters in the token Such

character-level embeddings have been shown to be helpful to deal with out-of-vocab

(OOV) tokens

24

They then use a bi-directional RNN to produce new representation and

of all words in the question and passage respectively

Figure 3-2 The task of Question Answering [23]

Gated Attention-based Recurrent Networks They use a variant of attention-

based recurrent networks with an additional gate to determine the importance of

information in the passage regarding a question Different from the gates in LSTM or

GRU the additional gate is based on the current passage word and its attention-pooling

vector of the question which focuses on the relation between the question and current

passage word The gate effectively model the phenomenon that only parts of the

passage are relevant to the question in reading comprehension and question answering

is utilized in subsequent calculations

25

Self-Matching Attention From the previous step the question aware passage

representation is generated to highlight the important parts of the passage One

problem with such representation is that it has very limited knowledge of context One

answer candidate is often oblivious to important cues in the passage outside its

surrounding window To address this problem the authors propose directly matching

the question-aware passage representation against itself It dynamically collects

evidence from the whole passage for words in passage and encodes the evidence

relevant to the current passage word and its matching question information into the

passage representation

Output Layer They use the same method as Wang amp Jiang (2016b) and use

pointer networks (Vinyals et al 2015) to predict the start and end position of the

answer In addition they use an attention-pooling over the question representation to

generate the initial hidden vector for the pointer network [23]

When the R-Net Model first appeared in the leaderboard in March 2017 it was at

the top with 723 Exact Match and 807 F1 score

Bi-Directional Attention Flow (BiDAF) for Machine Comprehension

BiDAF [24] is a hierarchical multi-stage architecture for modeling the

representations of the context paragraph at different levels of granularity BIDAF

includes character-level word-level and contextual embeddings and uses bi-directional

attention flow to obtain a query-aware context representation Their attention layer is not

used to summarize the context paragraph into a fixed-size vector Instead the attention

is computed for every time step and the attended vector at each time step along with

the representations from previous layers can flow through to the subsequent modeling

layer This reduces the information loss caused by early summarization [24]

26

Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]

Their machine comprehension model is a hierarchical multi-stage process and

consists of six layers

1 Character Embedding Layer maps each word to a vector space using character-level CNNs

2 Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model

3 Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embedding of the words These first three layers are applied to both the query and context They use a LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words We place an LSTM in both directions and concatenate the outputs of the two LSTMs

4 Attention Flow Layer couples the query and context vectors and produces a set of query aware feature vectors for each word in the context

5 Modeling Layer employs a Recurrent Neural Network to scan the context The input to the modeling layer is G which encodes the query-aware representations of context words The output of the modeling layer captures the interaction among the context words conditioned on the query They use two layers of bi-directional LSTM

27

with the output size of d for each direction Hence a matrix is obtained which is passed onto the output layer to predict the answer

6 Output Layer provides an answer to the query They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices by the predicted distributions averaged over all examples [25]

In a further variation of their above work they add a self-attention layer after the

Bi-attention layer to further improve the results The architecture of the model is as

Figure 3-3 The task of Question Answering [25]

28

Summary

In this chapter we reviewed the methods that are fundamental to the state of the

art in Machine Comprehension and for the task of Question Answering We have

reached closed to human level accuracy and this is due to incremental developments

over previous models As we saw Out of Vocabulary (OOV) tokens were handled by

using Character embedding Long term dependency within context passage were

solved using self-attention and many other techniques such as Contextualized vectors

Attention Flow etc were employed to get better results In the next chapter we will see

how we can build on these models and develop further

29

CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING

Although the advancements in Question Answering systems since the release of

the SQuAD dataset has been impressive and the results are getting close to human

level accuracy it is far from being a fool-proof system The models still make mistakes

which would be obvious to a human For example

Passage The Panthers used the San Jose State practice facility and stayed at

the San Jose Marriott The Broncos practiced at Florida State Facility and stayed at the

Santa Clara Marriott

Question At what universitys facility did the Panthers practice

Actual Answer San Jose State

Predicted Answer Florida State Facility

To find out what is leading to the wrong predictions we wanted to see the

attention weights associated with such an example We plotted the passage and

question heat map which is a 2D matrix where the intensity of each cell signifies the

similarity between a passage word and a question word For the above example we

found out that while certain words of the question are given high weightage other parts

are not The words lsquoAtrsquo lsquofacilityrsquo lsquopracticersquo are given high attention but lsquoPanthersrsquo does

not receive high attention If it had received high attention then the system would have

predicted lsquoSan Jose Statersquo as the right answer To solve this issue we analyzed the

base BiDAF model and proposed adding two things

1 Bi-Attention and Self-Attention over Query

2 Second level attention over the output of (Bi-Attention + Self-Attention) from both Context and Query

30

Multi-Attention BiDAF Model

Our comprehension model is a hierarchical multi-stage process and consists of

the following layers

1 Embedding Just as all other models we embed words using pretrained word vectors We also embed the characters in each word into size 20 vectors which are learned and run a convolution neural network followed by max-pooling to get character-derived embeddings for each word The character-level and word-level embeddings are then concatenated and passed to the next layer We do not update the word embeddings during training

2 Pre-Process A shared bi-directional GRU (Cho et al 2014) is used to map the question and passage embeddings to context aware embeddings

3 Context Attention The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al 2016) is used to build a query-aware context representation Let h_i be the vector for context word i q_j be the vector for question word j and n_q and n_c be the lengths of the question and context respectively We compute attention between context word i and question word j as

(4-1)

where w1 w2 and w3 are learned vectors and is element-wise multiplication We then compute an attended vector c_i for each context token as

(4-2)

We also compute a query-to-context vector q_c

(4-3)

31

The final vector computed for each token is built by concatenating

and In our model we subsequently pass the result through a linear layer with ReLU activations

4 Context Self-Attention Next we use a layer of residual self-attention The input is passed through another bi-directional GRU Then we apply the same attention mechanism only now between the passage and itself In this case we do not use query-to context attention and we set aij = 1048576inf if i = j As before we pass the concatenated output through a linear layer with ReLU activations This layer is applied residually so this output is additionally summed with the input

5 Query Attention For this part we do the same way as context attention but calculate the weighted sum of the context words for each query word Thus we get the length as number of query words Then we calculate context to query similar to query to context in context attention layer

Figure 4-1 The modified BiDAF model with multilevel attention

6 Query Self-Attention This part is done the same way as the context self-attention layer but from the output of the Query Attention layer

32

7 Context Query Bi-Attention + Self-Attention The output of the Context self-attention and the Query Self-Attention layers are taken as input and the same process for Bi-Attention and self-attention is applied on these inputs

8 Prediction In the last layer of our model a bidirectional GRU is applied followed by a linear layer that computes answer start scores for each token The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores The softmax operation is applied to the start and end scores to produce start and end probabilities and we optimize the negative log likelihood of selecting correct start and end tokens

Having carried out this modification we were able to solve the wrong example we

started with The multilevel attention model gives the correct output as ldquoSan Jose

Staterdquo Also we achieved slightly better scores than the original model with a F1 score

of 8544 on the SQuAD dev dataset

Chatbot Design Using a QA System

Designing a perfect chatbot that passes the Turing test is a fundamental goal for

Artificial Intelligence Although we are many order to magnitudes away from achieving

such a goal domain specific tasks can be solved with chatbots made from current

technology We had started our goal with a similar objective in mind ie to design a

domain specific chatbot and then generalize to other areas as it is able to achieve the

first domain specific objective robustly This led us to the fundamental problem of

Machine Comprehension and subsequently to the task of Question Answering Having

achieved some degree of success with QA systems we looked back if we could apply

our newly acquired knowledge in the task of designing Chatbots

The chatbots made with todayrsquos technologies are mostly handcrafted techniques

such as template matching that requires anticipating all possible ways a user may

articulate his requirements and a conversation may occur This requires a lot of man

hours for designing a domain specific system and is still very error prone In this section

33

we propose a general Chatbot design that would make the designing of a domain

specific chatbot very easy and robust at the same time

Every domain specific chatbot needs to obtain a set of information from the user

and show some results based on the user specific information obtained The traditional

chatbots use template matching and keywords lookup to determine if the user has

provided the required information Our idea is to use the Question Answering system in

the backend to extract out the required information from whatever the user has typed

until this point of the conversation The information to be extracted can be posed in the

form of a set of questions and the answers obtained from those questions can be used

as the parameters to supply the relevant information to the user

We had chosen our chatbot domain as the flight reservation system Our goal

was to extract the required information from the user to be able to show him the

available flights as per the userrsquos requirements

Figure 4-2 Flight reservation chatbotrsquos chat window

34

For a flight reservation task the booking agent needs to know the origin city

destination city and date of travel at minimum to be able to show the available flights

Optional information includes the number of tickets passengerrsquos name one way or

round trip etc

The minimalistic conversation with the user through the chat window would be as

shown above We had a platform called OneTask on which we wanted to implement our

chat bot The chat interface within the OneTask system looks as follows

Figure 4-3 Chatbot within OneTask system

The working of the chat system is as follows ndash

1 Initiation The user opens the chat window which starts a session with the chat bot The chat bot asks lsquoHow may I help yoursquo

2 User Reply The user may reply with none of the required information for flight booking or may reply with multiple information in the same message

3 User Reply parsing The conversation up to this point is treated as a passage and the internal questions are run on this passage So the four questions that are run are

Where do you want to go

35

From where do you want to leave

When do you want to depart

4 Parsed responses from QA model After running the questions on the QA system with the given conversation answers are obtained for all along with their corresponding confidence values Since answers will be obtained even if the required question was not answered up to this point in the conversation the confidence values plays a crucial role in determining the validity of the answer For this purpose we determined ranges of confidence values for validation A confidence value above 10 signifies the question has been answered correctly A confidence of 2 ndash 10 signifies that it may have been answered but should verify with the user for correctness and any confidence below 2 is discarded

5 Asking remaining questions iteratively After the parsing it is checked if any of the required questions are still unanswered If so the chatbot asks the remaining question and the process from 3 ndash 4 is carried out iteratively Once all the questions have been answered the user is shown with the available flight options as per his request

Figure 4-4 The Flow diagram of the Flight booking Chatbot system

Online QA System and Attention Visualization

To be able to test out various examples we used the BiDAF [23] model for an

online demo One can either choose from the available examples from the drop-down

menu or paste their own passage and examples While this is a useful and interesting

system to test the model in a user-friendly way we created this system to be able to

focus on the wrong samples

36

Figure 4-5 QA system interface with attention highlight over candidate answers

The candidate answers are shown with the blue highlights as per their

confidence values The higher the confidence the darker the answer The highest

confidence value is chosen as the predicted answer We developed the system to show

attention spread of the candidate answers to realize what needs to be done to improve

the system This led us to realize the importance of including the query attention part as

well as multilevel attention on the BiDAF model as described in the first section of this

chapter

37

CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION

Future Directions

Having done a thorough analysis of the current methods and state of the arts in

Machine Comprehension and for the QA task and developing systems on it we have

achieved a strong sense of what needs to be done to further improve the QA models

After observing the wrong samples we can see that the system is still unable to encode

meaning and is picking answers based on statistical patterns of occurrence of answers

on training examples

As per the State of the Art models and analyzing the ongoing research literature

it is easy to conclude that more features need to be embedded to encode the meaning

of words phrases and sentences A paper called Reinforced Mnemonic Reader for

Machine Comprehension encoded POS and NER tags of words along with their word

and character embedding This gave them better results

Figure 5-1 An English language semantic parse tree [26]

38

We have developed a method to encode the syntax parse tree of a sentence that

not only encodes the post tags but the relation of the word within the phrase and the

relation of the phrase within the whole sentence in a hierarchical manner

Finally data augmentation is another solution to get better results One definite

way to reduce the errors would be to include similar samples in the training data which

the system is faltering in the dev set One could generate similar examples as the failure

cases and include them in the training set to have better prediction Another system

would be to train to similar and bigger datasets Our models were trained on the SQuAD

dataset There are other datasets too which does the similar question answering task

such as TriviaQA We could augment the training set of TriviaQA along with SQuAD to

have a more robust system that is able to generalize better and thus have higher

accuracy for predicting answer spans

Conclusion

In this work we have tried to explore the most fundamental techniques that have

shaped the current state of the art Then we proposed a minor improvement of

architecture over an existing model Furthermore we developed two applications that

uses the base model First we talked about how a chatbot application can be made

using the QA system and lastly we also created a web interface where the model can

be used for any Passage and Question This interface also shows the attention spread

on the candidate answers While our effort is ongoing to push the state of the art

forward we strongly believe that surpassing human level accuracy on this task will have

high dividends for the society at large

39

LIST OF REFERENCES

[1] Danqi Chen From Reading Comprehension To Open-Domain Question Answering 2018 Youtube Accessed March 16 2018 httpswwwyoutubecomwatchv=1RN88O9C13U

[2] Lehnert Wendy G The Process of Question Answering No RR-88 Yale Univ New Haven Conn Dept of computer science 1977

[3] Joshi Mandar Eunsol Choi Daniel S Weld and Luke Zettlemoyer Triviaqa A large scale distantly supervised challenge dataset for reading comprehension arXiv preprint arXiv170503551 (2017)

[4] Rajpurkar Pranav Jian Zhang Konstantin Lopyrev and Percy Liang Squad 100000+ questions for machine comprehension of text arXiv preprint arXiv160605250 (2016)

[5] The Stanford Question Answering Dataset 2018 RajpurkarGithubIo Accessed March 16 2018 httpsrajpurkargithubioSQuAD-explorer

[6] Hu Minghao Yuxing Peng and Xipeng Qiu Reinforced mnemonic reader for machine comprehension CoRR abs170502798 (2017)

[7] Schalkoff Robert J Artificial neural networks Vol 1 New York McGraw-Hill 1997

[8] Me For The AI and Neetesh Mehrotra 2018 The Connect Between Deep Learning And AI - Open Source For You Open Source For You Accessed March 16 2018 httpopensourceforucom201801connect-deep-learning-ai

[9] Krizhevsky Alex Ilya Sutskever and Geoffrey E Hinton Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems pp 1097-1105 2012

[10] An Intuitive Explanation Of Convolutional Neural Networks 2016 The Data Science Blog Accessed March 16 2018 httpsujjwalkarnme20160811intuitive-explanation-convnets

[11] Williams Ronald J and David Zipser A learning algorithm for continually running fully recurrent neural networks Neural computation 1 no 2 (1989) 270-280

[12] Hochreiter Sepp and Juumlrgen Schmidhuber Long short-term memory Neural computation 9 no 8 (1997) 1735-1780

[13] Chung Junyoung Caglar Gulcehre KyungHyun Cho and Yoshua Bengio Empirical evaluation of gated recurrent neural networks on sequence modeling arXiv preprint arXiv14123555 (2014)

40

[14] Mikolov Tomas Wen-tau Yih and Geoffrey Zweig Linguistic regularities in continuous space word representations In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics Human Language Technologies pp 746-751 2013

[15] Pennington Jeffrey Richard Socher and Christopher Manning Glove Global vectors for word representation In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) pp 1532-1543 2014

[16] Word2vec Tutorial - The Skip-Gram Model middot Chris Mccormick 2016 MccormickmlCom Accessed March 16 2018 httpmccormickmlcom20160419word2vec-tutorial-the-skip-gram-model

[17] Group 2015 [Paper Introduction] Bilingual Word Representations With Monolingual hellip SlideshareNet Accessed March 16 2018 httpswwwslidesharenetnaist-mtstudybilingual-word-representations-with-monolingual-quality-in-mind

[18] Attention Mechanism 2016 BlogHeuritechCom Accessed March 16 2018 httpsblogheuritechcom20160120attention-mechanism

[19] Weston Jason Sumit Chopra and Antoine Bordes 2014 Memory Networks arXiv preprint arXiv14103916v11 Accessed March 16 2018 httpsarxivorgabs14103916

[20] Wang Shuohang and Jing Jiang Machine comprehension using match-lstm and answer pointer arXiv preprint arXiv160807905 (2016)

[21] Wang Shuohang and Jing Jiang Learning natural language inference with LSTM arXiv preprint arXiv151208849 (2015)

[22] Vinyals Oriol Meire Fortunato and Navdeep Jaitly Pointer networks In Advances in Neural Information Processing Systems pp 2692-2700 2015

[23] Wang Wenhui Nan Yang Furu Wei Baobao Chang and Ming Zhou Gated self-matching networks for reading comprehension and question answering In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1 Long Papers) vol 1 pp 189-198 2017

[24] Seo Minjoon Aniruddha Kembhavi Ali Farhadi and Hannaneh Hajishirzi Bidirectional attention flow for machine comprehension arXiv preprint arXiv161101603 (2016)

[25] Clark Christopher and Matt Gardner Simple and effective multi-paragraph reading comprehension arXiv preprint arXiv171010723 (2017)

41

[26] 8 Analyzing Sentence Structure 2018 NltkOrg Accessed March 16 2018 httpwwwnltkorgbookch08html

42

BIOGRAPHICAL SKETCH

Purnendu Mukherjee grew up in Kolkata India and had a strong interest for

computers since an early age After his high school he did his BSc in computer

science from Ramakrishna Mission Residential College Narendrapur followed by MSc

in computer science from St Xavierrsquos College Kolkata He had a strong intuition and

interest for human like learning systems and wanted to work in this area He started

working at TCS Innovation Labs Pune for the application of Natural Language

Processing in Educational Applications As he was deeply passionate about learning

system that mimic the human brain and learn like a human child does he was

increasing interested about Deep Learning and its applications After working for a year

he went on to pursue a Master of Science degree in computer science from the

University of Florida Gainesville His academic interests have been focused on Deep

Learning and Natural Language Processing and he has been working on Machine

Reading Comprehension since summer of 2017

  • ACKNOWLEDGMENTS
  • LIST OF FIGURES
  • LIST OF ABBREVIATIONS
  • INTRODUCTION
  • THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
    • Neural Networks
    • Convolutional Neural Network
    • Recurrent Neural Networks (RNN)
    • Word Embedding
    • Attention Mechanism
    • Memory Networks
      • LITERATURE REVIEW AND STATE OF THE ART
        • Machine Comprehension Using Match-LSTM and Answer Pointer
        • R-NET Matching Reading Comprehension with Self-Matching Networks
        • Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
        • Summary
          • MULTI-ATTENTION QUESTION ANSWERING
            • Multi-Attention BiDAF Model
            • Chatbot Design Using a QA System
            • Online QA System and Attention Visualization
              • FUTURE DIRECTIONS AND CONCLUSION
                • Future Directions
                • Conclusion
                  • LIST OF REFERENCES
                  • BIOGRAPHICAL SKETCH
Page 23: QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISMufdcimages.uflib.ufl.edu/UF/E0/05/23/73/00001/MUKHERJEE_P.pdf · QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM By PURNENDU

23

where and are parameters to be learned

Answer Pointer Layer The Answer Pointer (Ans-Ptr) layer is motivated by the

Pointer Net introduced by Vinyals et al (2015) [22] The boundary model produces only

the start token and the end token of the answer and then all the tokens between these

two in the original passage are considered to be the answer

When this paper was released back in November 2016 Match-LSTM method

was the state of the art in Question Answering systems and was at the top of the

leaderboard for the SQuAD dataset

R-NET Matching Reading Comprehension with Self-Matching Networks

In this model [23], the question and passage are first processed separately by a bidirectional recurrent network (Mikolov et al. 2010). The question and passage are then matched with gated attention-based recurrent networks, yielding a question-aware representation of the passage. On top of that, self-matching attention is applied to aggregate evidence from the whole passage and refine the passage representation, which is then fed into the output layer to predict the boundary of the answer span.

Question and passage encoding. First, the words are converted to their respective word-level and character-level embeddings. The character-level embeddings are generated by taking the final hidden states of a bi-directional recurrent neural network (RNN) applied to the embeddings of the characters in each token. Such character-level embeddings have been shown to help deal with out-of-vocabulary (OOV) tokens.
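A minimal sketch of such a character encoder follows; the embedding sizes and the choice of a GRU cell are assumptions made for illustration, not the exact configuration used in [23].

    import torch
    import torch.nn as nn

    class CharEncoder(nn.Module):
        """Character-derived word embeddings: the final forward and backward hidden
        states of a bi-directional GRU run over the characters of each token."""
        def __init__(self, n_chars=100, char_dim=8, hidden=25):
            super().__init__()
            self.char_emb = nn.Embedding(n_chars, char_dim)
            self.rnn = nn.GRU(char_dim, hidden, bidirectional=True, batch_first=True)

        def forward(self, char_ids):              # char_ids: (num_words, max_word_len)
            x = self.char_emb(char_ids)           # (num_words, max_word_len, char_dim)
            _, h = self.rnn(x)                    # h: (2, num_words, hidden)
            return torch.cat([h[0], h[1]], -1)    # one (2 * hidden)-dim vector per word

Because the encoder works on characters, it still produces a usable vector for a word that never appeared in the word-embedding vocabulary.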

They then use a bi-directional RNN to produce new representations of all the words in the question and the passage, respectively.

Figure 3-2 The task of Question Answering [23]

Gated Attention-based Recurrent Networks. They use a variant of attention-based recurrent networks with an additional gate to determine the importance of information in the passage with respect to a question. Different from the gates in an LSTM or GRU, this additional gate is based on the current passage word and its attention-pooling vector over the question, so it focuses on the relation between the question and the current passage word. The gate effectively models the phenomenon that only parts of the passage are relevant to the question in reading comprehension and question answering, and the gated representation is what is used in subsequent calculations.
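The gate itself is easy to sketch (a rough illustration of the idea, not the authors' implementation; dimensions are assumptions): the concatenation of the current passage word vector and its question attention-pooling vector is scaled element-wise by a learned sigmoid gate before being fed to the matching RNN.

    import torch
    import torch.nn as nn

    class GatedInput(nn.Module):
        """Additional gate over [passage word; question attention-pooling vector]."""
        def __init__(self, dim):
            super().__init__()
            self.gate = nn.Linear(2 * dim, 2 * dim, bias=False)

        def forward(self, u_p, c):                 # u_p, c: (batch, dim)
            x = torch.cat([u_p, c], dim=-1)        # (batch, 2 * dim)
            g = torch.sigmoid(self.gate(x))        # element-wise gate in [0, 1]
            return g * x                           # gated input for the matching RNN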

Self-Matching Attention. From the previous step, a question-aware passage representation is generated to highlight the important parts of the passage. One problem with such a representation is that it has very limited knowledge of context: an answer candidate is often oblivious to important cues in the passage outside its surrounding window. To address this problem, the authors propose directly matching the question-aware passage representation against itself. It dynamically collects evidence from the whole passage for each passage word and encodes the evidence relevant to the current passage word, together with its matching question information, into the passage representation.
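The following sketch illustrates the self-matching step for one passage (additive attention is assumed here, and the sizes are placeholders): every passage position attends over the entire passage, and the collected evidence is concatenated to its representation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SelfMatching(nn.Module):
        """Each passage word attends over the whole passage (additive attention)."""
        def __init__(self, dim, att_dim=75):
            super().__init__()
            self.w1 = nn.Linear(dim, att_dim, bias=False)
            self.w2 = nn.Linear(dim, att_dim, bias=False)
            self.v = nn.Linear(att_dim, 1, bias=False)

        def forward(self, v_p):                    # v_p: (passage_len, dim)
            # s[t, j] = v . tanh(W1 v_p[t] + W2 v_p[j])
            s = self.v(torch.tanh(self.w1(v_p).unsqueeze(1) + self.w2(v_p).unsqueeze(0)))
            a = F.softmax(s.squeeze(-1), dim=-1)   # attention of each word over the passage
            c = a @ v_p                            # evidence gathered from the whole passage
            return torch.cat([v_p, c], dim=-1)     # refined passage representation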

Output Layer. They use the same method as Wang & Jiang (2016b), using pointer networks (Vinyals et al. 2015) to predict the start and end positions of the answer. In addition, they apply attention-pooling over the question representation to generate the initial hidden vector for the pointer network [23].

When the R-NET model first appeared on the leaderboard in March 2017, it was at the top with a 72.3% Exact Match and an 80.7% F1 score.

Bi-Directional Attention Flow (BiDAF) for Machine Comprehension

BiDAF [24] is a hierarchical multi-stage architecture for modeling representations of the context paragraph at different levels of granularity. BiDAF includes character-level, word-level, and contextual embeddings, and uses bi-directional attention flow to obtain a query-aware context representation. Its attention layer is not used to summarize the context paragraph into a fixed-size vector. Instead, the attention is computed for every time step, and the attended vector at each time step, along with the representations from previous layers, can flow through to the subsequent modeling layer. This reduces the information loss caused by early summarization [24].


Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]

Their machine comprehension model is a hierarchical multi-stage process and consists of six layers:

1. Character Embedding Layer maps each word to a vector space using character-level CNNs.

2. Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model.

3. Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embeddings of the words. These first three layers are applied to both the query and the context. An LSTM is used on top of the embeddings provided by the previous layers to model the temporal interactions between words; an LSTM is run in both directions and the outputs of the two LSTMs are concatenated.

4. Attention Flow Layer couples the query and context vectors and produces a set of query-aware feature vectors for each word in the context.

5. Modeling Layer employs a recurrent neural network to scan the context. The input to the modeling layer is G, which encodes the query-aware representations of the context words. The output of the modeling layer captures the interaction among the context words conditioned on the query. They use two layers of bi-directional LSTM with an output size of d for each direction; the resulting matrix is passed on to the output layer to predict the answer.

6. Output Layer provides an answer to the query. They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices given the predicted distributions, averaged over all examples [25]. (A small sketch of this loss appears after this list.)
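A small sketch of that loss (the logits and gold indices are assumed inputs): summing the start and end cross-entropies is exactly the negative log probability of the true boundary indices, averaged over the batch.

    import torch.nn.functional as F

    def span_loss(start_logits, end_logits, true_start, true_end):
        # start_logits, end_logits: (batch, context_len) unnormalized scores
        # true_start, true_end: (batch,) gold token indices
        return F.cross_entropy(start_logits, true_start) + F.cross_entropy(end_logits, true_end)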

In a further variation of their above work, they add a self-attention layer after the bi-attention layer to further improve the results [25].


Summary

In this chapter we reviewed the methods that are fundamental to the state of the art in Machine Comprehension and the task of Question Answering. We have come close to human-level accuracy, and this is due to incremental developments over previous models. As we saw, Out-of-Vocabulary (OOV) tokens were handled using character embeddings, long-term dependencies within the context passage were addressed using self-attention, and many other techniques, such as contextualized vectors and attention flow, were employed to get better results. In the next chapter we will see how we can build on these models and develop them further.


CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING

Although the advancements in Question Answering systems since the release of the SQuAD dataset have been impressive and the results are getting close to human-level accuracy, we are far from having a fool-proof system. The models still make mistakes that would be obvious to a human. For example:

Passage: The Panthers used the San Jose State practice facility and stayed at the San Jose Marriott. The Broncos practiced at Florida State Facility and stayed at the Santa Clara Marriott.

Question: At what university's facility did the Panthers practice?

Actual Answer: San Jose State

Predicted Answer: Florida State Facility

To find out what was leading to the wrong prediction, we wanted to see the attention weights associated with such an example. We plotted the passage-question heat map, a 2D matrix in which the intensity of each cell signifies the similarity between a passage word and a question word. For the above example, we found that while certain words of the question are given high weightage, other parts are not. The words 'At', 'facility', and 'practice' are given high attention, but 'Panthers' does not receive high attention. If it had, the system would have predicted 'San Jose State' as the right answer. To solve this issue, we analyzed the base BiDAF model and proposed adding two things:

1. Bi-Attention and Self-Attention over the Query

2. A second level of attention over the outputs of (Bi-Attention + Self-Attention) from both the Context and the Query


Multi-Attention BiDAF Model

Our comprehension model is a hierarchical multi-stage process and consists of the following layers:

1. Embedding: As in the other models, we embed words using pretrained word vectors. We also embed the characters in each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to get character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training.

2. Pre-Process: A shared bi-directional GRU (Cho et al. 2014) is used to map the question and passage embeddings to context-aware embeddings.

3. Context Attention: The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al. 2016) is used to build a query-aware context representation. Let h_i be the vector for context word i, q_j be the vector for question word j, and n_q and n_c be the lengths of the question and context, respectively. We compute attention between context word i and question word j as

a_{ij} = w_1 \cdot h_i + w_2 \cdot q_j + w_3 \cdot (h_i \odot q_j)    (4-1)

where w_1, w_2 and w_3 are learned vectors and \odot is element-wise multiplication. We then compute an attended vector c_i for each context token as

p_{ij} = \frac{\exp(a_{ij})}{\sum_{k=1}^{n_q} \exp(a_{ik})}, \qquad c_i = \sum_{j=1}^{n_q} p_{ij} \, q_j    (4-2)

We also compute a query-to-context vector q_c:

m_i = \max_{1 \le j \le n_q} a_{ij}, \qquad p_i = \frac{\exp(m_i)}{\sum_{k=1}^{n_c} \exp(m_k)}, \qquad q_c = \sum_{i=1}^{n_c} p_i \, h_i    (4-3)

The final vector computed for each token is built by concatenating h_i, c_i, h_i \odot c_i, and q_c \odot c_i. In our model, we subsequently pass the result through a linear layer with ReLU activations. (A small code sketch of this attention computation is given after this list.)

4. Context Self-Attention: Next, we use a layer of residual self-attention. The input is passed through another bi-directional GRU. We then apply the same attention mechanism, only now between the passage and itself. In this case we do not use query-to-context attention, and we set a_{ij} = -inf if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so its output is additionally summed with its input.

5. Query Attention: For this part, we proceed the same way as in the context attention layer but calculate the weighted sum of the context words for each query word; thus the output length equals the number of query words. We then calculate context-to-query attention in the same manner as query-to-context attention in the context attention layer.

Figure 4-1 The modified BiDAF model with multilevel attention

6. Query Self-Attention: This part is done the same way as the context self-attention layer, but on the output of the Query Attention layer.


7. Context-Query Bi-Attention + Self-Attention: The outputs of the Context Self-Attention and Query Self-Attention layers are taken as input, and the same bi-attention and self-attention process is applied to these inputs.

8. Prediction: In the last layer of our model, a bi-directional GRU is applied, followed by a linear layer that computes answer start scores for each token. The hidden states of that layer are concatenated with its input and fed into a second bi-directional GRU and linear layer to predict answer end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log likelihood of selecting the correct start and end tokens.
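The sketch below spells out the bi-attention of step 3 (Equations 4-1 through 4-3) for a single, unbatched example; the tensor shapes and the absence of batching are simplifications for illustration.

    import torch
    import torch.nn.functional as F

    def bi_attention(h, q, w1, w2, w3):
        """h: (n_c, d) context vectors, q: (n_q, d) question vectors,
        w1, w2, w3: learned (d,) vectors."""
        # Equation 4-1: a_ij = w1.h_i + w2.q_j + w3.(h_i * q_j)
        a = (h @ w1).unsqueeze(1) + (q @ w2).unsqueeze(0) + (h * w3) @ q.t()   # (n_c, n_q)
        # Equation 4-2: context-to-query attended vector for each context token
        c = F.softmax(a, dim=1) @ q                                            # (n_c, d)
        # Equation 4-3: a single query-to-context vector
        m = F.softmax(a.max(dim=1).values, dim=0)                              # (n_c,)
        q_c = m @ h                                                            # (d,)
        # Concatenate h_i, c_i, h_i * c_i and q_c * c_i for each token
        return torch.cat([h, c, h * c, q_c * c], dim=-1)                       # (n_c, 4d)

Applying the same routine between the passage and itself, with the diagonal of a set to -inf and without the q_c term, gives the self-attention of steps 4 and 6.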

Having carried out these modifications, we were able to solve the wrong example we started with: the multilevel attention model gives the correct output, "San Jose State". We also achieved slightly better scores than the original model, with an F1 score of 85.44 on the SQuAD dev set.

Chatbot Design Using a QA System

Designing a perfect chatbot that passes the Turing test is a fundamental goal of Artificial Intelligence. Although we are many orders of magnitude away from achieving such a goal, domain-specific tasks can be solved with chatbots made from current technology. We had started with a similar objective in mind, i.e., to design a domain-specific chatbot and then generalize to other areas once it achieves the first domain-specific objective robustly. This led us to the fundamental problem of Machine Comprehension and subsequently to the task of Question Answering. Having achieved some degree of success with QA systems, we looked back to see whether we could apply our newly acquired knowledge to the task of designing chatbots.

The chatbots made with today's technologies mostly rely on handcrafted techniques such as template matching, which requires anticipating all possible ways a user may articulate his requirements and a conversation may unfold. This requires many man-hours for designing a domain-specific system and is still very error prone. In this section we propose a general chatbot design that would make designing a domain-specific chatbot easy and robust at the same time.

Every domain-specific chatbot needs to obtain a set of information from the user and show some results based on the user-specific information obtained. Traditional chatbots use template matching and keyword lookup to determine whether the user has provided the required information. Our idea is to use the Question Answering system in the backend to extract the required information from whatever the user has typed up to this point of the conversation. The information to be extracted can be posed as a set of questions, and the answers obtained for those questions can be used as the parameters to supply the relevant information to the user.

We chose flight reservation as our chatbot domain. Our goal was to extract the required information from the user to be able to show him the available flights as per his requirements.

Figure 4-2 Flight reservation chatbot's chat window

For a flight reservation task, the booking agent needs to know at minimum the origin city, the destination city, and the date of travel to be able to show the available flights. Optional information includes the number of tickets, the passenger's name, one way or round trip, etc.

The minimalistic conversation with the user through the chat window would be as shown above. We had a platform called OneTask on which we wanted to implement our chatbot; the chat interface within the OneTask system looks as follows.

Figure 4-3 Chatbot within OneTask system

The working of the chat system is as follows:

1. Initiation: The user opens the chat window, which starts a session with the chatbot. The chatbot asks 'How may I help you?'

2. User Reply: The user may reply with none of the required information for flight booking, or may reply with multiple pieces of information in the same message.

3. User Reply Parsing: The conversation up to this point is treated as a passage, and the internal questions are run on this passage. The questions that are run are:

   Where do you want to go?

   From where do you want to leave?

   When do you want to depart?

4. Parsed responses from the QA model: After running the questions on the QA system with the given conversation, answers are obtained for all of them along with their corresponding confidence values. Since answers will be returned even if the required question has not actually been answered up to this point in the conversation, the confidence values play a crucial role in determining the validity of an answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly; a confidence of 2 to 10 signifies that it may have been answered but should be verified with the user for correctness; and any answer with confidence below 2 is discarded.

5. Asking remaining questions iteratively: After the parsing, it is checked whether any of the required questions are still unanswered. If so, the chatbot asks the remaining question and the process from steps 3 and 4 is carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request (a small code sketch of this loop follows).
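The sketch below captures this loop; the question wording, the get_answer(passage, question) call returning an (answer, confidence) pair, and the data structures are assumptions, while the thresholds of 10 and 2 are the ones stated above.

    # Required slots and the internal questions used to fill them (assumed wording).
    REQUIRED_QUESTIONS = {
        "destination": "Where do you want to go?",
        "origin": "From where do you want to leave?",
        "date": "When do you want to depart?",
    }

    def interpret(confidence):
        if confidence > 10:
            return "accept"      # treat the answer as correct
        if confidence >= 2:
            return "verify"      # keep it, but confirm with the user
        return "discard"         # consider the slot still unanswered

    def parse_reply(conversation, get_answer, slots):
        """Run the internal questions over the conversation so far (used as the passage)."""
        to_verify = []
        for slot, question in REQUIRED_QUESTIONS.items():
            if slot in slots:
                continue
            answer, confidence = get_answer(conversation, question)
            decision = interpret(confidence)
            if decision == "accept":
                slots[slot] = answer
            elif decision == "verify":
                to_verify.append((slot, answer))
        missing = [q for s, q in REQUIRED_QUESTIONS.items() if s not in slots]
        return slots, to_verify, missing   # ask 'missing' next, confirm 'to_verify' with the user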

Figure 4-4 The Flow diagram of the Flight booking Chatbot system

Online QA System and Attention Visualization

To be able to test out various examples, we set up an online demo of the BiDAF [24] model. One can either choose from the available examples in the drop-down menu or paste in their own passage and questions. While this is a useful and interesting system for testing the model in a user-friendly way, we created it primarily to be able to focus on the wrong samples.


Figure 4-5 QA system interface with attention highlight over candidate answers

The candidate answers are shown with blue highlights according to their confidence values: the higher the confidence, the darker the answer. The span with the highest confidence value is chosen as the predicted answer. We developed the system to show the attention spread over the candidate answers in order to understand what needs to be done to improve the system. This led us to realize the importance of including the query attention part as well as multilevel attention in the BiDAF model, as described in the first section of this chapter.
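The highlighting itself can be sketched as follows (an illustration of the idea rather than the demo's actual code; the (start, end, confidence) span format is an assumption): each candidate span is shaded blue with an opacity proportional to its normalized confidence.

    def render_passage(tokens, candidate_spans):
        """Render passage tokens as HTML, shading candidate answer spans in blue with
        an opacity proportional to their confidence."""
        if not candidate_spans:
            return " ".join(tokens)
        top = max(conf for _, _, conf in candidate_spans)
        shade = [0.0] * len(tokens)
        for start, end, conf in candidate_spans:          # inclusive token indices
            for i in range(start, end + 1):
                shade[i] = max(shade[i], conf / top if top else 0.0)
        parts = []
        for tok, s in zip(tokens, shade):
            if s > 0:
                parts.append(f'<span style="background: rgba(0, 0, 255, {s:.2f})">{tok}</span>')
            else:
                parts.append(tok)
        return " ".join(parts)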


CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION

Future Directions

Having done a thorough analysis of the current methods and the state of the art in Machine Comprehension and the QA task, and having developed systems on top of them, we have gained a strong sense of what needs to be done to further improve QA models. After observing the wrong samples, we can see that the system is still unable to encode meaning and is picking answers based on statistical patterns of answer occurrence in the training examples.

Judging by the state-of-the-art models and the ongoing research literature, it is easy to conclude that more features need to be embedded to encode the meaning of words, phrases, and sentences. A paper called Reinforced Mnemonic Reader for Machine Comprehension [6] encoded POS and NER tags of words along with their word and character embeddings, which gave them better results.
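A hedged sketch of that kind of feature augmentation is shown below, using NLTK's part-of-speech tagger [26]; the reduced tag set and the concatenation scheme are assumptions for illustration, and NER features could be appended in the same way.

    import numpy as np
    import nltk   # nltk.download('averaged_perceptron_tagger') may be required once

    # An assumed, reduced tag set; a real system would cover the full Penn Treebank set.
    POS_TAGS = ["NN", "NNP", "NNS", "VB", "VBD", "VBZ", "JJ", "IN", "DT", "CD", "OTHER"]

    def pos_features(tokens):
        """One-hot POS feature per token, to be concatenated with the word embedding."""
        feats = np.zeros((len(tokens), len(POS_TAGS)), dtype=np.float32)
        for i, (_, tag) in enumerate(nltk.pos_tag(tokens)):
            j = POS_TAGS.index(tag) if tag in POS_TAGS else POS_TAGS.index("OTHER")
            feats[i, j] = 1.0
        return feats

    def augment(word_vectors, tokens):
        # word_vectors: (n_words, d) word + character embeddings
        return np.concatenate([word_vectors, pos_features(tokens)], axis=1)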

Figure 5-1 An English language semantic parse tree [26]


We have developed a method to encode the syntax parse tree of a sentence that encodes not only the POS tags but also the relation of each word within its phrase and the relation of each phrase within the whole sentence, in a hierarchical manner.
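One simple way such a hierarchical encoding could be realized (an illustration only, not necessarily the method developed here) is to record, for every word, the chain of phrase labels from the root of the parse tree down to the word, using NLTK's Tree API [26].

    from nltk import Tree

    def phrase_paths(parse):
        """parse: an nltk.Tree for one sentence. Returns, per leaf word, the chain of
        labels from the sentence root down to the word's preterminal."""
        paths = []
        for i, word in enumerate(parse.leaves()):
            pos = parse.leaf_treeposition(i)
            labels = [parse[pos[:k]].label() for k in range(len(pos))]
            paths.append((word, labels))
        return paths

    tree = Tree.fromstring("(S (NP (DT the) (NN dog)) (VP (VBD barked)))")
    print(phrase_paths(tree))
    # [('the', ['S', 'NP', 'DT']), ('dog', ['S', 'NP', 'NN']), ('barked', ['S', 'VP', 'VBD'])]

These label paths could then be mapped to learned embeddings and concatenated to the word representation, in the same spirit as the POS and NER features above.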

Finally, data augmentation is another way to get better results. One definite way to reduce the errors would be to include in the training data samples similar to those the system is faltering on in the dev set: one could generate examples similar to the failure cases and add them to the training set to obtain better predictions. Another approach would be to train on similar and bigger datasets. Our models were trained on the SQuAD dataset, but other datasets address a similar question answering task, such as TriviaQA. We could augment the training set with TriviaQA along with SQuAD to build a more robust system that generalizes better and thus has higher accuracy in predicting answer spans.

Conclusion

In this work we have explored the most fundamental techniques that have shaped the current state of the art. We then proposed a minor architectural improvement over an existing model. Furthermore, we developed two applications that use the base model: first, we discussed how a chatbot application can be made using the QA system, and second, we created a web interface where the model can be used for any passage and question. This interface also shows the attention spread over the candidate answers. While our effort to push the state of the art forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will pay high dividends for society at large.


LIST OF REFERENCES

[1] Danqi Chen. "From Reading Comprehension To Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U

[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.

[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension." arXiv preprint arXiv:1705.03551 (2017).

[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ questions for machine comprehension of text." arXiv preprint arXiv:1606.05250 (2016).

[5] "The Stanford Question Answering Dataset." 2018. rajpurkar.github.io. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/

[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced mnemonic reader for machine comprehension." CoRR abs/1705.02798 (2017).

[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.

[8] Me For The AI, and Neetesh Mehrotra. 2018. "The Connect Between Deep Learning And AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai/

[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.

[10] "An Intuitive Explanation Of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/

[11] Williams, Ronald J., and David Zipser. "A learning algorithm for continually running fully recurrent neural networks." Neural Computation 1, no. 2 (1989): 270-280.

[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9, no. 8 (1997): 1735-1780.

[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv preprint arXiv:1412.3555 (2014).

[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic regularities in continuous space word representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.

[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.

[16] "Word2vec Tutorial - The Skip-Gram Model." 2016. Chris McCormick, mccormickml.com. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

[17] Group. 2015. "[Paper Introduction] Bilingual Word Representations With Monolingual …" slideshare.net. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind

[18] "Attention Mechanism." 2016. blog.heuritech.com. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism/

[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. 2014. "Memory Networks." arXiv preprint arXiv:1410.3916v11. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916

[20] Wang, Shuohang, and Jing Jiang. "Machine comprehension using Match-LSTM and answer pointer." arXiv preprint arXiv:1608.07905 (2016).

[21] Wang, Shuohang, and Jing Jiang. "Learning natural language inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).

[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.

[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated self-matching networks for reading comprehension and question answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.

[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional attention flow for machine comprehension." arXiv preprint arXiv:1611.01603 (2016).

[25] Clark, Christopher, and Matt Gardner. "Simple and effective multi-paragraph reading comprehension." arXiv preprint arXiv:1710.10723 (2017).

[26] "8. Analyzing Sentence Structure." 2018. nltk.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html


BIOGRAPHICAL SKETCH

Purnendu Mukherjee grew up in Kolkata, India, and had a strong interest in computers from an early age. After high school, he did his B.Sc. in computer science from Ramakrishna Mission Residential College, Narendrapur, followed by an M.Sc. in computer science from St. Xavier's College, Kolkata. He had a strong intuition and interest for human-like learning systems and wanted to work in this area. He started working at TCS Innovation Labs, Pune, on the application of Natural Language Processing in educational applications. As he was deeply passionate about learning systems that mimic the human brain and learn like a human child does, he became increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science from the University of Florida, Gainesville. His academic interests have been focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.

Page 25: QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISMufdcimages.uflib.ufl.edu/UF/E0/05/23/73/00001/MUKHERJEE_P.pdf · QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM By PURNENDU

25

Self-Matching Attention From the previous step the question aware passage

representation is generated to highlight the important parts of the passage One

problem with such representation is that it has very limited knowledge of context One

answer candidate is often oblivious to important cues in the passage outside its

surrounding window To address this problem the authors propose directly matching

the question-aware passage representation against itself It dynamically collects

evidence from the whole passage for words in passage and encodes the evidence

relevant to the current passage word and its matching question information into the

passage representation

Output Layer They use the same method as Wang amp Jiang (2016b) and use

pointer networks (Vinyals et al 2015) to predict the start and end position of the

answer In addition they use an attention-pooling over the question representation to

generate the initial hidden vector for the pointer network [23]

When the R-Net Model first appeared in the leaderboard in March 2017 it was at

the top with 723 Exact Match and 807 F1 score

Bi-Directional Attention Flow (BiDAF) for Machine Comprehension

BiDAF [24] is a hierarchical multi-stage architecture for modeling the

representations of the context paragraph at different levels of granularity BIDAF

includes character-level word-level and contextual embeddings and uses bi-directional

attention flow to obtain a query-aware context representation Their attention layer is not

used to summarize the context paragraph into a fixed-size vector Instead the attention

is computed for every time step and the attended vector at each time step along with

the representations from previous layers can flow through to the subsequent modeling

layer This reduces the information loss caused by early summarization [24]

26

Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]

Their machine comprehension model is a hierarchical multi-stage process and

consists of six layers

1 Character Embedding Layer maps each word to a vector space using character-level CNNs

2 Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model

3 Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embedding of the words These first three layers are applied to both the query and context They use a LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words We place an LSTM in both directions and concatenate the outputs of the two LSTMs

4 Attention Flow Layer couples the query and context vectors and produces a set of query aware feature vectors for each word in the context

5 Modeling Layer employs a Recurrent Neural Network to scan the context The input to the modeling layer is G which encodes the query-aware representations of context words The output of the modeling layer captures the interaction among the context words conditioned on the query They use two layers of bi-directional LSTM

27

with the output size of d for each direction Hence a matrix is obtained which is passed onto the output layer to predict the answer

6 Output Layer provides an answer to the query They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices by the predicted distributions averaged over all examples [25]

In a further variation of their above work they add a self-attention layer after the

Bi-attention layer to further improve the results The architecture of the model is as

Figure 3-3 The task of Question Answering [25]

28

Summary

In this chapter we reviewed the methods that are fundamental to the state of the

art in Machine Comprehension and for the task of Question Answering We have

reached closed to human level accuracy and this is due to incremental developments

over previous models As we saw Out of Vocabulary (OOV) tokens were handled by

using Character embedding Long term dependency within context passage were

solved using self-attention and many other techniques such as Contextualized vectors

Attention Flow etc were employed to get better results In the next chapter we will see

how we can build on these models and develop further

29

CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING

Although the advancements in Question Answering systems since the release of

the SQuAD dataset has been impressive and the results are getting close to human

level accuracy it is far from being a fool-proof system The models still make mistakes

which would be obvious to a human For example

Passage The Panthers used the San Jose State practice facility and stayed at

the San Jose Marriott The Broncos practiced at Florida State Facility and stayed at the

Santa Clara Marriott

Question At what universitys facility did the Panthers practice

Actual Answer San Jose State

Predicted Answer Florida State Facility

To find out what is leading to the wrong predictions we wanted to see the

attention weights associated with such an example We plotted the passage and

question heat map which is a 2D matrix where the intensity of each cell signifies the

similarity between a passage word and a question word For the above example we

found out that while certain words of the question are given high weightage other parts

are not The words lsquoAtrsquo lsquofacilityrsquo lsquopracticersquo are given high attention but lsquoPanthersrsquo does

not receive high attention If it had received high attention then the system would have

predicted lsquoSan Jose Statersquo as the right answer To solve this issue we analyzed the

base BiDAF model and proposed adding two things

1 Bi-Attention and Self-Attention over Query

2 Second level attention over the output of (Bi-Attention + Self-Attention) from both Context and Query

30

Multi-Attention BiDAF Model

Our comprehension model is a hierarchical multi-stage process and consists of

the following layers

1 Embedding Just as all other models we embed words using pretrained word vectors We also embed the characters in each word into size 20 vectors which are learned and run a convolution neural network followed by max-pooling to get character-derived embeddings for each word The character-level and word-level embeddings are then concatenated and passed to the next layer We do not update the word embeddings during training

2 Pre-Process A shared bi-directional GRU (Cho et al 2014) is used to map the question and passage embeddings to context aware embeddings

3 Context Attention The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al 2016) is used to build a query-aware context representation Let h_i be the vector for context word i q_j be the vector for question word j and n_q and n_c be the lengths of the question and context respectively We compute attention between context word i and question word j as

(4-1)

where w1, w2, and w3 are learned vectors and ⊙ is element-wise multiplication. We then compute an attended vector c_i for each context token as

(4-2)

We also compute a query-to-context vector q_c

(4-3)
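Spelled out in the notation above, and following the bi-attention formulation of Clark and Gardner [25] on which this layer is based, Equations 4-1 through 4-3 take the form

\[
a_{ij} = \mathbf{w}_1 \cdot \mathbf{h}_i + \mathbf{w}_2 \cdot \mathbf{q}_j + \mathbf{w}_3 \cdot (\mathbf{h}_i \odot \mathbf{q}_j) \tag{4-1}
\]
\[
p_{ij} = \frac{e^{a_{ij}}}{\sum_{j'=1}^{n_q} e^{a_{ij'}}}, \qquad \mathbf{c}_i = \sum_{j=1}^{n_q} p_{ij}\,\mathbf{q}_j \tag{4-2}
\]
\[
m_i = \max_{1 \le j \le n_q} a_{ij}, \qquad p_i = \frac{e^{m_i}}{\sum_{i'=1}^{n_c} e^{m_{i'}}}, \qquad \mathbf{q}_c = \sum_{i=1}^{n_c} p_i\,\mathbf{h}_i \tag{4-3}
\]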


The final vector computed for each token is built by concatenating h_i, c_i, h_i ⊙ c_i, and q_c ⊙ c_i (following [25]). In our model we subsequently pass the result through a linear layer with ReLU activations.

4. Context Self-Attention: Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU. Then we apply the same attention mechanism, only now between the passage and itself; in this case we do not use query-to-context attention, and we set a_ij = −∞ if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so its output is additionally summed with the input.

5. Query Attention: For this part we proceed the same way as in the context attention layer, but calculate the weighted sum of the context words for each query word; thus the output length equals the number of query words. Then we calculate context-to-query attention, analogous to the query-to-context attention in the context attention layer.

Figure 4-1 The modified BiDAF model with multilevel attention

6. Query Self-Attention: This part is done the same way as the context self-attention layer, but on the output of the Query Attention layer.


7. Context-Query Bi-Attention + Self-Attention: The outputs of the Context Self-Attention and the Query Self-Attention layers are taken as input, and the same process of bi-attention and self-attention is applied to these inputs.

8. Prediction: In the last layer of our model a bidirectional GRU is applied, followed by a linear layer that computes answer start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log likelihood of selecting correct start and end tokens.
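The sketch below (referenced in layer 3 above) shows how the attention of Equations 4-1 through 4-3 and the concatenated output can be computed in PyTorch. It is a minimal illustration under the assumptions noted in the comments; names such as BiAttention and hidden_dim are ours, not taken from the actual implementation.

    import torch
    import torch.nn as nn

    class BiAttention(nn.Module):
        """Sketch of the tri-linear bi-attention of Eqs. 4-1 to 4-3."""

        def __init__(self, hidden_dim):
            super().__init__()
            # w1, w2, w3 are the learned vectors of Eq. 4-1.
            self.w1 = nn.Linear(hidden_dim, 1, bias=False)
            self.w2 = nn.Linear(hidden_dim, 1, bias=False)
            self.w3 = nn.Parameter(torch.empty(hidden_dim).uniform_(-0.1, 0.1))

        def forward(self, h, q, self_attention=False):
            # h: (batch, n_c, d) context vectors, q: (batch, n_q, d) query vectors.
            # a_ij = w1.h_i + w2.q_j + w3.(h_i * q_j)               (Eq. 4-1)
            s1 = self.w1(h)                                  # (batch, n_c, 1)
            s2 = self.w2(q).transpose(1, 2)                  # (batch, 1, n_q)
            s3 = torch.einsum("bid,d,bjd->bij", h, self.w3, q)
            a = s1 + s2 + s3                                 # (batch, n_c, n_q)

            if self_attention:
                # Self-attention: q is the passage itself and the diagonal
                # is masked out (a_ij = -inf if i = j).
                eye = torch.eye(a.size(1), dtype=torch.bool, device=a.device)
                a = a.masked_fill(eye, float("-inf"))

            # Context-to-query attended vectors c_i                 (Eq. 4-2)
            c = torch.softmax(a, dim=2) @ q                  # (batch, n_c, d)

            if self_attention:
                # Query-to-context attention is not used in the self-attention case.
                return torch.cat([h, c, h * c], dim=2)

            # Query-to-context vector q_c                           (Eq. 4-3)
            p = torch.softmax(a.max(dim=2).values, dim=1)            # (batch, n_c)
            q_c = torch.einsum("bi,bid->bd", p, h).unsqueeze(1)      # (batch, 1, d)

            # Final per-token vector: [h_i ; c_i ; h_i * c_i ; q_c * c_i]
            return torch.cat([h, c, h * c, q_c * c], dim=2)

In the full model, this output would then pass through the linear layer with ReLU activations and, for the self-attention layers, be summed residually with the input, as described above.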

Having carried out this modification, we were able to solve the wrong example we started with: the multilevel attention model gives the correct output, "San Jose State". We also achieved slightly better scores than the original model, with an F1 score of 85.44 on the SQuAD dev dataset.

Chatbot Design Using a QA System

Designing a perfect chatbot that passes the Turing test is a fundamental goal for Artificial Intelligence. Although we are many orders of magnitude away from achieving such a goal, domain-specific tasks can be solved with chatbots made from current technology. We started with a similar objective in mind, i.e., to design a domain-specific chatbot and then generalize to other areas once it achieves the first domain-specific objective robustly. This led us to the fundamental problem of Machine Comprehension and subsequently to the task of Question Answering. Having achieved some degree of success with QA systems, we looked back to see whether we could apply our newly acquired knowledge to the task of designing chatbots.

The chatbots made with today's technologies mostly use handcrafted techniques such as template matching, which requires anticipating all possible ways a user may articulate his requirements and a conversation may unfold. This requires a lot of man-hours for designing a domain-specific system and is still very error prone. In this section


we propose a general chatbot design that would make the design of a domain-specific chatbot easy and robust at the same time.

Every domain-specific chatbot needs to obtain a set of information from the user and show some results based on the user-specific information obtained. The traditional chatbots use template matching and keyword lookup to determine if the user has provided the required information. Our idea is to use the Question Answering system in the backend to extract the required information from whatever the user has typed up to this point of the conversation. The information to be extracted can be posed in the form of a set of questions, and the answers obtained from those questions can be used as the parameters to supply the relevant information to the user.

We had chosen the flight reservation system as our chatbot domain. Our goal was to extract the required information from the user to be able to show him the available flights as per the user's requirements.

Figure 4-2 Flight reservation chatbot's chat window


For a flight reservation task, the booking agent needs to know the origin city, destination city, and date of travel at a minimum to be able to show the available flights. Optional information includes the number of tickets, passenger's name, one way or round trip, etc.

The minimalistic conversation with the user through the chat window would be as shown above. We had a platform called OneTask on which we wanted to implement our chatbot. The chat interface within the OneTask system looks as follows.

Figure 4-3 Chatbot within OneTask system

The working of the chat system is as follows:

1. Initiation: The user opens the chat window, which starts a session with the chatbot. The chatbot asks 'How may I help you?'

2. User Reply: The user may reply with none of the required information for flight booking, or may reply with multiple pieces of information in the same message.

3. User Reply Parsing: The conversation up to this point is treated as a passage and the internal questions are run on this passage. So the four questions that are run are:

Where do you want to go?


From where do you want to leave?

When do you want to depart?

4. Parsed responses from QA model: After running the questions on the QA system with the given conversation, answers are obtained for all of them along with their corresponding confidence values. Since answers will be obtained even if the required question was not answered up to this point in the conversation, the confidence values play a crucial role in determining the validity of an answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly; a confidence of 2–10 signifies that it may have been answered, but the bot should verify with the user for correctness; and any confidence below 2 is discarded.

5. Asking remaining questions iteratively: After the parsing, it is checked whether any of the required questions are still unanswered. If so, the chatbot asks the remaining question and the process from steps 3–4 is carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request. A sketch of this loop is given below.

Figure 4-4 The Flow diagram of the Flight booking Chatbot system
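As a rough sketch of steps 3–5 (not the production code), the loop can be written as follows. Here qa_model.predict is a hypothetical wrapper around the QA system that returns an answer span and a confidence score, and the thresholds are the ones given in step 4.

    # Hypothetical slot-filling loop around the QA system described above.
    REQUIRED_QUESTIONS = {
        "destination": "Where do you want to go?",
        "origin": "From where do you want to leave?",
        "date": "When do you want to depart?",
    }

    def update_slots(conversation, slots, qa_model):
        """Run the internal questions over the conversation so far (treated as a passage)."""
        for slot, question in REQUIRED_QUESTIONS.items():
            if slot in slots:
                continue  # already filled
            # qa_model.predict: hypothetical wrapper returning (answer_text, confidence_score)
            answer, confidence = qa_model.predict(passage=conversation, question=question)
            if confidence > 10:
                slots[slot] = answer            # confidently answered
            elif confidence > 2:
                slots[slot] = answer            # possibly answered: confirm with the user
            # below 2: discard and ask the question explicitly later
        return slots

    def next_bot_utterance(slots):
        """Ask the first still-unanswered required question, or show flights."""
        for slot, question in REQUIRED_QUESTIONS.items():
            if slot not in slots:
                return question
        return "Here are the available flights for your trip."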

Online QA System and Attention Visualization

To be able to test out various examples, we used the BiDAF [24] model for an online demo. One can either choose from the available examples in the drop-down menu or paste their own passage and questions. While this is a useful and interesting system to test the model in a user-friendly way, we created this system to be able to focus on the wrong samples.


Figure 4-5 QA system interface with attention highlight over candidate answers

The candidate answers are shown with blue highlights as per their confidence values: the higher the confidence, the darker the answer. The highest-confidence candidate is chosen as the predicted answer. We developed the system to show the attention spread of the candidate answers, to realize what needs to be done to improve the system. This led us to realize the importance of including the query attention part as well as multilevel attention in the BiDAF model, as described in the first section of this chapter.
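As a small illustration of this highlighting scheme (the function name and markup are ours, not the demo's actual front-end code), a candidate span can be wrapped in a blue highlight whose opacity grows with its confidence relative to the best candidate:

    def highlight_candidate(text, confidence, max_confidence):
        """Wrap a candidate answer in a blue highlight; darker means more confident."""
        alpha = max(0.1, min(1.0, confidence / max_confidence))
        return f'<span style="background-color: rgba(0, 0, 255, {alpha:.2f})">{text}</span>'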


CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION

Future Directions

Having done a thorough analysis of the current methods and the state of the art in Machine Comprehension and the QA task, and having developed systems on it, we have gained a strong sense of what needs to be done to further improve the QA models. After observing the wrong samples, we can see that the system is still unable to encode meaning and is picking answers based on statistical patterns of occurrence of answers in the training examples.

Going by the state-of-the-art models and the ongoing research literature, it is easy to conclude that more features need to be embedded to encode the meaning of words, phrases, and sentences. The Reinforced Mnemonic Reader for Machine Comprehension [6] encoded POS and NER tags of words along with their word and character embeddings, which gave them better results.

Figure 5-1 An English language semantic parse tree [26]


We have developed a method to encode the syntax parse tree of a sentence that encodes not only the POS tags but also the relation of the word within its phrase and the relation of the phrase within the whole sentence, in a hierarchical manner.
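For illustration, the kind of hierarchical constituency structure referred to here can be produced with NLTK, the source of Figure 5-1 [26]; the toy sentence and tags below are only an example:

    import nltk

    # A toy constituency parse: each word sits inside a phrase, and each
    # phrase sits inside the sentence, giving the hierarchy to be encoded.
    tree = nltk.Tree.fromstring(
        "(S (NP (DT The) (NN dog)) (VP (VBD saw) (NP (DT a) (NN man))))"
    )
    for subtree in tree.subtrees():
        print(subtree.label(), " ".join(subtree.leaves()))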

Finally, data augmentation is another solution for getting better results. One definite way to reduce the errors would be to include in the training data samples similar to those the system is faltering on in the dev set. One could generate examples similar to the failure cases and include them in the training set for better prediction. Another approach would be to train on similar and bigger datasets. Our models were trained on the SQuAD dataset; there are other datasets that pose a similar question answering task, such as TriviaQA. We could augment the training set of TriviaQA along with SQuAD to have a more robust system that is able to generalize better and thus have higher accuracy for predicting answer spans.

Conclusion

In this work we have tried to explore the most fundamental techniques that have shaped the current state of the art. Then we proposed a minor improvement of architecture over an existing model. Furthermore, we developed two applications that use the base model: first, we talked about how a chatbot application can be made using the QA system, and lastly, we also created a web interface where the model can be used for any passage and question. This interface also shows the attention spread on the candidate answers. While our effort to push the state of the art forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will pay high dividends for society at large.


LIST OF REFERENCES

[1] Danqi Chen. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U

[2] Lehnert, Wendy G. "The Process of Question Answering." No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.

[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension." arXiv preprint arXiv:1705.03551 (2017).

[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ Questions for Machine Comprehension of Text." arXiv preprint arXiv:1606.05250 (2016).

[5] "The Stanford Question Answering Dataset." 2018. rajpurkar.github.io. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/

[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced Mnemonic Reader for Machine Comprehension." CoRR abs/1705.02798 (2017).

[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.

[8] Me For The AI, and Neetesh Mehrotra. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai/

[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.

[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/

[11] Williams, Ronald J., and David Zipser. "A Learning Algorithm for Continually Running Fully Recurrent Neural Networks." Neural Computation 1, no. 2 (1989): 270-280.

[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long Short-Term Memory." Neural Computation 9, no. 8 (1997): 1735-1780.

[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling." arXiv preprint arXiv:1412.3555 (2014).


[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic Regularities in Continuous Space Word Representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.

[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global Vectors for Word Representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.

[16] "Word2Vec Tutorial - The Skip-Gram Model." 2016. Chris McCormick, mccormickml.com. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

[17] Group. 2015. "[Paper Introduction] Bilingual Word Representations with Monolingual …" slideshare.net. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind

[18] "Attention Mechanism." 2016. blog.heuritech.com. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism/

[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. 2014. "Memory Networks." arXiv preprint arXiv:1410.3916v11. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916

[20] Wang, Shuohang, and Jing Jiang. "Machine Comprehension Using Match-LSTM and Answer Pointer." arXiv preprint arXiv:1608.07905 (2016).

[21] Wang, Shuohang, and Jing Jiang. "Learning Natural Language Inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).

[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer Networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.

[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated Self-Matching Networks for Reading Comprehension and Question Answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.

[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional Attention Flow for Machine Comprehension." arXiv preprint arXiv:1611.01603 (2016).

[25] Clark, Christopher, and Matt Gardner. "Simple and Effective Multi-Paragraph Reading Comprehension." arXiv preprint arXiv:1710.10723 (2017).


[26] "8. Analyzing Sentence Structure." 2018. nltk.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html


BIOGRAPHICAL SKETCH

Purnendu Mukherjee grew up in Kolkata, India, and had a strong interest in computers from an early age. After high school, he did his B.Sc. in computer science from Ramakrishna Mission Residential College, Narendrapur, followed by an M.Sc. in computer science from St. Xavier's College, Kolkata. He had a strong intuition and interest for human-like learning systems and wanted to work in this area. He started working at TCS Innovation Labs, Pune, on the application of Natural Language Processing in educational applications. As he was deeply passionate about learning systems that mimic the human brain and learn like a human child does, he became increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science from the University of Florida, Gainesville. His academic interests have been focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.

Page 26: QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISMufdcimages.uflib.ufl.edu/UF/E0/05/23/73/00001/MUKHERJEE_P.pdf · QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM By PURNENDU

26

Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]

Their machine comprehension model is a hierarchical multi-stage process and

consists of six layers

1 Character Embedding Layer maps each word to a vector space using character-level CNNs

2 Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model

3 Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embedding of the words These first three layers are applied to both the query and context They use a LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words We place an LSTM in both directions and concatenate the outputs of the two LSTMs

4 Attention Flow Layer couples the query and context vectors and produces a set of query aware feature vectors for each word in the context

5 Modeling Layer employs a Recurrent Neural Network to scan the context The input to the modeling layer is G which encodes the query-aware representations of context words The output of the modeling layer captures the interaction among the context words conditioned on the query They use two layers of bi-directional LSTM

27

with the output size of d for each direction Hence a matrix is obtained which is passed onto the output layer to predict the answer

6 Output Layer provides an answer to the query They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices by the predicted distributions averaged over all examples [25]

In a further variation of their above work they add a self-attention layer after the

Bi-attention layer to further improve the results The architecture of the model is as

Figure 3-3 The task of Question Answering [25]

28

Summary

In this chapter we reviewed the methods that are fundamental to the state of the

art in Machine Comprehension and for the task of Question Answering We have

reached closed to human level accuracy and this is due to incremental developments

over previous models As we saw Out of Vocabulary (OOV) tokens were handled by

using Character embedding Long term dependency within context passage were

solved using self-attention and many other techniques such as Contextualized vectors

Attention Flow etc were employed to get better results In the next chapter we will see

how we can build on these models and develop further

29

CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING

Although the advancements in Question Answering systems since the release of

the SQuAD dataset has been impressive and the results are getting close to human

level accuracy it is far from being a fool-proof system The models still make mistakes

which would be obvious to a human For example

Passage The Panthers used the San Jose State practice facility and stayed at

the San Jose Marriott The Broncos practiced at Florida State Facility and stayed at the

Santa Clara Marriott

Question At what universitys facility did the Panthers practice

Actual Answer San Jose State

Predicted Answer Florida State Facility

To find out what is leading to the wrong predictions we wanted to see the

attention weights associated with such an example We plotted the passage and

question heat map which is a 2D matrix where the intensity of each cell signifies the

similarity between a passage word and a question word For the above example we

found out that while certain words of the question are given high weightage other parts

are not The words lsquoAtrsquo lsquofacilityrsquo lsquopracticersquo are given high attention but lsquoPanthersrsquo does

not receive high attention If it had received high attention then the system would have

predicted lsquoSan Jose Statersquo as the right answer To solve this issue we analyzed the

base BiDAF model and proposed adding two things

1 Bi-Attention and Self-Attention over Query

2 Second level attention over the output of (Bi-Attention + Self-Attention) from both Context and Query

30

Multi-Attention BiDAF Model

Our comprehension model is a hierarchical multi-stage process and consists of

the following layers

1 Embedding Just as all other models we embed words using pretrained word vectors We also embed the characters in each word into size 20 vectors which are learned and run a convolution neural network followed by max-pooling to get character-derived embeddings for each word The character-level and word-level embeddings are then concatenated and passed to the next layer We do not update the word embeddings during training

2 Pre-Process A shared bi-directional GRU (Cho et al 2014) is used to map the question and passage embeddings to context aware embeddings

3 Context Attention The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al 2016) is used to build a query-aware context representation Let h_i be the vector for context word i q_j be the vector for question word j and n_q and n_c be the lengths of the question and context respectively We compute attention between context word i and question word j as

(4-1)

where w1 w2 and w3 are learned vectors and is element-wise multiplication We then compute an attended vector c_i for each context token as

(4-2)

We also compute a query-to-context vector q_c

(4-3)

31

The final vector computed for each token is built by concatenating

and In our model we subsequently pass the result through a linear layer with ReLU activations

4 Context Self-Attention Next we use a layer of residual self-attention The input is passed through another bi-directional GRU Then we apply the same attention mechanism only now between the passage and itself In this case we do not use query-to context attention and we set aij = 1048576inf if i = j As before we pass the concatenated output through a linear layer with ReLU activations This layer is applied residually so this output is additionally summed with the input

5 Query Attention For this part we do the same way as context attention but calculate the weighted sum of the context words for each query word Thus we get the length as number of query words Then we calculate context to query similar to query to context in context attention layer

Figure 4-1 The modified BiDAF model with multilevel attention

6 Query Self-Attention This part is done the same way as the context self-attention layer but from the output of the Query Attention layer

32

7 Context Query Bi-Attention + Self-Attention The output of the Context self-attention and the Query Self-Attention layers are taken as input and the same process for Bi-Attention and self-attention is applied on these inputs

8 Prediction In the last layer of our model a bidirectional GRU is applied followed by a linear layer that computes answer start scores for each token The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores The softmax operation is applied to the start and end scores to produce start and end probabilities and we optimize the negative log likelihood of selecting correct start and end tokens

Having carried out this modification we were able to solve the wrong example we

started with The multilevel attention model gives the correct output as ldquoSan Jose

Staterdquo Also we achieved slightly better scores than the original model with a F1 score

of 8544 on the SQuAD dev dataset

Chatbot Design Using a QA System

Designing a perfect chatbot that passes the Turing test is a fundamental goal for

Artificial Intelligence Although we are many order to magnitudes away from achieving

such a goal domain specific tasks can be solved with chatbots made from current

technology We had started our goal with a similar objective in mind ie to design a

domain specific chatbot and then generalize to other areas as it is able to achieve the

first domain specific objective robustly This led us to the fundamental problem of

Machine Comprehension and subsequently to the task of Question Answering Having

achieved some degree of success with QA systems we looked back if we could apply

our newly acquired knowledge in the task of designing Chatbots

The chatbots made with todayrsquos technologies are mostly handcrafted techniques

such as template matching that requires anticipating all possible ways a user may

articulate his requirements and a conversation may occur This requires a lot of man

hours for designing a domain specific system and is still very error prone In this section

33

we propose a general Chatbot design that would make the designing of a domain

specific chatbot very easy and robust at the same time

Every domain specific chatbot needs to obtain a set of information from the user

and show some results based on the user specific information obtained The traditional

chatbots use template matching and keywords lookup to determine if the user has

provided the required information Our idea is to use the Question Answering system in

the backend to extract out the required information from whatever the user has typed

until this point of the conversation The information to be extracted can be posed in the

form of a set of questions and the answers obtained from those questions can be used

as the parameters to supply the relevant information to the user

We had chosen our chatbot domain as the flight reservation system Our goal

was to extract the required information from the user to be able to show him the

available flights as per the userrsquos requirements

Figure 4-2 Flight reservation chatbotrsquos chat window

34

For a flight reservation task the booking agent needs to know the origin city

destination city and date of travel at minimum to be able to show the available flights

Optional information includes the number of tickets passengerrsquos name one way or

round trip etc

The minimalistic conversation with the user through the chat window would be as

shown above We had a platform called OneTask on which we wanted to implement our

chat bot The chat interface within the OneTask system looks as follows

Figure 4-3 Chatbot within OneTask system

The working of the chat system is as follows ndash

1 Initiation The user opens the chat window which starts a session with the chat bot The chat bot asks lsquoHow may I help yoursquo

2 User Reply The user may reply with none of the required information for flight booking or may reply with multiple information in the same message

3 User Reply parsing The conversation up to this point is treated as a passage and the internal questions are run on this passage So the four questions that are run are

Where do you want to go

35

From where do you want to leave

When do you want to depart

4 Parsed responses from QA model After running the questions on the QA system with the given conversation answers are obtained for all along with their corresponding confidence values Since answers will be obtained even if the required question was not answered up to this point in the conversation the confidence values plays a crucial role in determining the validity of the answer For this purpose we determined ranges of confidence values for validation A confidence value above 10 signifies the question has been answered correctly A confidence of 2 ndash 10 signifies that it may have been answered but should verify with the user for correctness and any confidence below 2 is discarded

5 Asking remaining questions iteratively After the parsing it is checked if any of the required questions are still unanswered If so the chatbot asks the remaining question and the process from 3 ndash 4 is carried out iteratively Once all the questions have been answered the user is shown with the available flight options as per his request

Figure 4-4 The Flow diagram of the Flight booking Chatbot system

Online QA System and Attention Visualization

To be able to test out various examples we used the BiDAF [23] model for an

online demo One can either choose from the available examples from the drop-down

menu or paste their own passage and examples While this is a useful and interesting

system to test the model in a user-friendly way we created this system to be able to

focus on the wrong samples

36

Figure 4-5 QA system interface with attention highlight over candidate answers

The candidate answers are shown with the blue highlights as per their

confidence values The higher the confidence the darker the answer The highest

confidence value is chosen as the predicted answer We developed the system to show

attention spread of the candidate answers to realize what needs to be done to improve

the system This led us to realize the importance of including the query attention part as

well as multilevel attention on the BiDAF model as described in the first section of this

chapter

37

CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION

Future Directions

Having done a thorough analysis of the current methods and state of the arts in

Machine Comprehension and for the QA task and developing systems on it we have

achieved a strong sense of what needs to be done to further improve the QA models

After observing the wrong samples we can see that the system is still unable to encode

meaning and is picking answers based on statistical patterns of occurrence of answers

on training examples

As per the State of the Art models and analyzing the ongoing research literature

it is easy to conclude that more features need to be embedded to encode the meaning

of words phrases and sentences A paper called Reinforced Mnemonic Reader for

Machine Comprehension encoded POS and NER tags of words along with their word

and character embedding This gave them better results

Figure 5-1 An English language semantic parse tree [26]

38

We have developed a method to encode the syntax parse tree of a sentence that

not only encodes the post tags but the relation of the word within the phrase and the

relation of the phrase within the whole sentence in a hierarchical manner

Finally data augmentation is another solution to get better results One definite

way to reduce the errors would be to include similar samples in the training data which

the system is faltering in the dev set One could generate similar examples as the failure

cases and include them in the training set to have better prediction Another system

would be to train to similar and bigger datasets Our models were trained on the SQuAD

dataset There are other datasets too which does the similar question answering task

such as TriviaQA We could augment the training set of TriviaQA along with SQuAD to

have a more robust system that is able to generalize better and thus have higher

accuracy for predicting answer spans

Conclusion

In this work we have tried to explore the most fundamental techniques that have

shaped the current state of the art Then we proposed a minor improvement of

architecture over an existing model Furthermore we developed two applications that

uses the base model First we talked about how a chatbot application can be made

using the QA system and lastly we also created a web interface where the model can

be used for any Passage and Question This interface also shows the attention spread

on the candidate answers While our effort is ongoing to push the state of the art

forward we strongly believe that surpassing human level accuracy on this task will have

high dividends for the society at large

39

LIST OF REFERENCES

[1] Danqi Chen From Reading Comprehension To Open-Domain Question Answering 2018 Youtube Accessed March 16 2018 httpswwwyoutubecomwatchv=1RN88O9C13U

[2] Lehnert Wendy G The Process of Question Answering No RR-88 Yale Univ New Haven Conn Dept of computer science 1977

[3] Joshi Mandar Eunsol Choi Daniel S Weld and Luke Zettlemoyer Triviaqa A large scale distantly supervised challenge dataset for reading comprehension arXiv preprint arXiv170503551 (2017)

[4] Rajpurkar Pranav Jian Zhang Konstantin Lopyrev and Percy Liang Squad 100000+ questions for machine comprehension of text arXiv preprint arXiv160605250 (2016)

[5] The Stanford Question Answering Dataset 2018 RajpurkarGithubIo Accessed March 16 2018 httpsrajpurkargithubioSQuAD-explorer

[6] Hu Minghao Yuxing Peng and Xipeng Qiu Reinforced mnemonic reader for machine comprehension CoRR abs170502798 (2017)

[7] Schalkoff Robert J Artificial neural networks Vol 1 New York McGraw-Hill 1997

[8] Me For The AI and Neetesh Mehrotra 2018 The Connect Between Deep Learning And AI - Open Source For You Open Source For You Accessed March 16 2018 httpopensourceforucom201801connect-deep-learning-ai

[9] Krizhevsky Alex Ilya Sutskever and Geoffrey E Hinton Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems pp 1097-1105 2012

[10] An Intuitive Explanation Of Convolutional Neural Networks 2016 The Data Science Blog Accessed March 16 2018 httpsujjwalkarnme20160811intuitive-explanation-convnets

[11] Williams Ronald J and David Zipser A learning algorithm for continually running fully recurrent neural networks Neural computation 1 no 2 (1989) 270-280

[12] Hochreiter Sepp and Juumlrgen Schmidhuber Long short-term memory Neural computation 9 no 8 (1997) 1735-1780

[13] Chung Junyoung Caglar Gulcehre KyungHyun Cho and Yoshua Bengio Empirical evaluation of gated recurrent neural networks on sequence modeling arXiv preprint arXiv14123555 (2014)

40

[14] Mikolov Tomas Wen-tau Yih and Geoffrey Zweig Linguistic regularities in continuous space word representations In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics Human Language Technologies pp 746-751 2013

[15] Pennington Jeffrey Richard Socher and Christopher Manning Glove Global vectors for word representation In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) pp 1532-1543 2014

[16] Word2vec Tutorial - The Skip-Gram Model middot Chris Mccormick 2016 MccormickmlCom Accessed March 16 2018 httpmccormickmlcom20160419word2vec-tutorial-the-skip-gram-model

[17] Group 2015 [Paper Introduction] Bilingual Word Representations With Monolingual hellip SlideshareNet Accessed March 16 2018 httpswwwslidesharenetnaist-mtstudybilingual-word-representations-with-monolingual-quality-in-mind

[18] Attention Mechanism 2016 BlogHeuritechCom Accessed March 16 2018 httpsblogheuritechcom20160120attention-mechanism

[19] Weston Jason Sumit Chopra and Antoine Bordes 2014 Memory Networks arXiv preprint arXiv14103916v11 Accessed March 16 2018 httpsarxivorgabs14103916

[20] Wang Shuohang and Jing Jiang Machine comprehension using match-lstm and answer pointer arXiv preprint arXiv160807905 (2016)

[21] Wang Shuohang and Jing Jiang Learning natural language inference with LSTM arXiv preprint arXiv151208849 (2015)

[22] Vinyals Oriol Meire Fortunato and Navdeep Jaitly Pointer networks In Advances in Neural Information Processing Systems pp 2692-2700 2015

[23] Wang Wenhui Nan Yang Furu Wei Baobao Chang and Ming Zhou Gated self-matching networks for reading comprehension and question answering In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1 Long Papers) vol 1 pp 189-198 2017

[24] Seo Minjoon Aniruddha Kembhavi Ali Farhadi and Hannaneh Hajishirzi Bidirectional attention flow for machine comprehension arXiv preprint arXiv161101603 (2016)

[25] Clark Christopher and Matt Gardner Simple and effective multi-paragraph reading comprehension arXiv preprint arXiv171010723 (2017)

41

[26] 8 Analyzing Sentence Structure 2018 NltkOrg Accessed March 16 2018 httpwwwnltkorgbookch08html

42

BIOGRAPHICAL SKETCH

Purnendu Mukherjee grew up in Kolkata India and had a strong interest for

computers since an early age After his high school he did his BSc in computer

science from Ramakrishna Mission Residential College Narendrapur followed by MSc

in computer science from St Xavierrsquos College Kolkata He had a strong intuition and

interest for human like learning systems and wanted to work in this area He started

working at TCS Innovation Labs Pune for the application of Natural Language

Processing in Educational Applications As he was deeply passionate about learning

system that mimic the human brain and learn like a human child does he was

increasing interested about Deep Learning and its applications After working for a year

he went on to pursue a Master of Science degree in computer science from the

University of Florida Gainesville His academic interests have been focused on Deep

Learning and Natural Language Processing and he has been working on Machine

Reading Comprehension since summer of 2017

  • ACKNOWLEDGMENTS
  • LIST OF FIGURES
  • LIST OF ABBREVIATIONS
  • INTRODUCTION
  • THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
    • Neural Networks
    • Convolutional Neural Network
    • Recurrent Neural Networks (RNN)
    • Word Embedding
    • Attention Mechanism
    • Memory Networks
      • LITERATURE REVIEW AND STATE OF THE ART
        • Machine Comprehension Using Match-LSTM and Answer Pointer
        • R-NET Matching Reading Comprehension with Self-Matching Networks
        • Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
        • Summary
          • MULTI-ATTENTION QUESTION ANSWERING
            • Multi-Attention BiDAF Model
            • Chatbot Design Using a QA System
            • Online QA System and Attention Visualization
              • FUTURE DIRECTIONS AND CONCLUSION
                • Future Directions
                • Conclusion
                  • LIST OF REFERENCES
                  • BIOGRAPHICAL SKETCH
Page 27: QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISMufdcimages.uflib.ufl.edu/UF/E0/05/23/73/00001/MUKHERJEE_P.pdf · QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM By PURNENDU

27

with the output size of d for each direction Hence a matrix is obtained which is passed onto the output layer to predict the answer

6 Output Layer provides an answer to the query They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices by the predicted distributions averaged over all examples [25]

In a further variation of their above work they add a self-attention layer after the

Bi-attention layer to further improve the results The architecture of the model is as

Figure 3-3 The task of Question Answering [25]

28

Summary

In this chapter we reviewed the methods that are fundamental to the state of the

art in Machine Comprehension and for the task of Question Answering We have

reached closed to human level accuracy and this is due to incremental developments

over previous models As we saw Out of Vocabulary (OOV) tokens were handled by

using Character embedding Long term dependency within context passage were

solved using self-attention and many other techniques such as Contextualized vectors

Attention Flow etc were employed to get better results In the next chapter we will see

how we can build on these models and develop further

29

CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING

Although the advancements in Question Answering systems since the release of

the SQuAD dataset has been impressive and the results are getting close to human

level accuracy it is far from being a fool-proof system The models still make mistakes

which would be obvious to a human For example

Passage The Panthers used the San Jose State practice facility and stayed at

the San Jose Marriott The Broncos practiced at Florida State Facility and stayed at the

Santa Clara Marriott

Question At what universitys facility did the Panthers practice

Actual Answer San Jose State

Predicted Answer Florida State Facility

To find out what is leading to the wrong predictions we wanted to see the

attention weights associated with such an example We plotted the passage and

question heat map which is a 2D matrix where the intensity of each cell signifies the

similarity between a passage word and a question word For the above example we

found out that while certain words of the question are given high weightage other parts

are not The words lsquoAtrsquo lsquofacilityrsquo lsquopracticersquo are given high attention but lsquoPanthersrsquo does

not receive high attention If it had received high attention then the system would have

predicted lsquoSan Jose Statersquo as the right answer To solve this issue we analyzed the

base BiDAF model and proposed adding two things

1 Bi-Attention and Self-Attention over Query

2 Second level attention over the output of (Bi-Attention + Self-Attention) from both Context and Query

30

Multi-Attention BiDAF Model

Our comprehension model is a hierarchical multi-stage process and consists of

the following layers

1 Embedding Just as all other models we embed words using pretrained word vectors We also embed the characters in each word into size 20 vectors which are learned and run a convolution neural network followed by max-pooling to get character-derived embeddings for each word The character-level and word-level embeddings are then concatenated and passed to the next layer We do not update the word embeddings during training

2 Pre-Process A shared bi-directional GRU (Cho et al 2014) is used to map the question and passage embeddings to context aware embeddings

3 Context Attention The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al 2016) is used to build a query-aware context representation Let h_i be the vector for context word i q_j be the vector for question word j and n_q and n_c be the lengths of the question and context respectively We compute attention between context word i and question word j as

(4-1)

where w1 w2 and w3 are learned vectors and is element-wise multiplication We then compute an attended vector c_i for each context token as

(4-2)

We also compute a query-to-context vector q_c

(4-3)

31

The final vector computed for each token is built by concatenating

and In our model we subsequently pass the result through a linear layer with ReLU activations

4 Context Self-Attention Next we use a layer of residual self-attention The input is passed through another bi-directional GRU Then we apply the same attention mechanism only now between the passage and itself In this case we do not use query-to context attention and we set aij = 1048576inf if i = j As before we pass the concatenated output through a linear layer with ReLU activations This layer is applied residually so this output is additionally summed with the input

5 Query Attention For this part we do the same way as context attention but calculate the weighted sum of the context words for each query word Thus we get the length as number of query words Then we calculate context to query similar to query to context in context attention layer

Figure 4-1 The modified BiDAF model with multilevel attention

6 Query Self-Attention This part is done the same way as the context self-attention layer but from the output of the Query Attention layer

32

7 Context Query Bi-Attention + Self-Attention The output of the Context self-attention and the Query Self-Attention layers are taken as input and the same process for Bi-Attention and self-attention is applied on these inputs

8 Prediction In the last layer of our model a bidirectional GRU is applied followed by a linear layer that computes answer start scores for each token The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores The softmax operation is applied to the start and end scores to produce start and end probabilities and we optimize the negative log likelihood of selecting correct start and end tokens

Having carried out this modification we were able to solve the wrong example we

started with The multilevel attention model gives the correct output as ldquoSan Jose

Staterdquo Also we achieved slightly better scores than the original model with a F1 score

of 8544 on the SQuAD dev dataset

Chatbot Design Using a QA System

Designing a perfect chatbot that passes the Turing test is a fundamental goal for

Artificial Intelligence Although we are many order to magnitudes away from achieving

such a goal domain specific tasks can be solved with chatbots made from current

technology We had started our goal with a similar objective in mind ie to design a

domain specific chatbot and then generalize to other areas as it is able to achieve the

first domain specific objective robustly This led us to the fundamental problem of

Machine Comprehension and subsequently to the task of Question Answering Having

achieved some degree of success with QA systems we looked back if we could apply

our newly acquired knowledge in the task of designing Chatbots

The chatbots made with todayrsquos technologies are mostly handcrafted techniques

such as template matching that requires anticipating all possible ways a user may

articulate his requirements and a conversation may occur This requires a lot of man

hours for designing a domain specific system and is still very error prone In this section

33

we propose a general Chatbot design that would make the designing of a domain

specific chatbot very easy and robust at the same time

Every domain specific chatbot needs to obtain a set of information from the user

and show some results based on the user specific information obtained The traditional

chatbots use template matching and keywords lookup to determine if the user has

provided the required information Our idea is to use the Question Answering system in

the backend to extract out the required information from whatever the user has typed

until this point of the conversation The information to be extracted can be posed in the

form of a set of questions and the answers obtained from those questions can be used

as the parameters to supply the relevant information to the user

We had chosen our chatbot domain as the flight reservation system Our goal

was to extract the required information from the user to be able to show him the

available flights as per the userrsquos requirements

Figure 4-2 Flight reservation chatbotrsquos chat window

34

For a flight reservation task the booking agent needs to know the origin city

destination city and date of travel at minimum to be able to show the available flights

Optional information includes the number of tickets passengerrsquos name one way or

round trip etc

The minimalistic conversation with the user through the chat window would be as

shown above We had a platform called OneTask on which we wanted to implement our

chat bot The chat interface within the OneTask system looks as follows

Figure 4-3 Chatbot within OneTask system

The working of the chat system is as follows ndash

1 Initiation The user opens the chat window which starts a session with the chat bot The chat bot asks lsquoHow may I help yoursquo

2 User Reply The user may reply with none of the required information for flight booking or may reply with multiple information in the same message

3 User Reply parsing The conversation up to this point is treated as a passage and the internal questions are run on this passage So the four questions that are run are

Where do you want to go

35

From where do you want to leave

When do you want to depart

4 Parsed responses from QA model After running the questions on the QA system with the given conversation answers are obtained for all along with their corresponding confidence values Since answers will be obtained even if the required question was not answered up to this point in the conversation the confidence values plays a crucial role in determining the validity of the answer For this purpose we determined ranges of confidence values for validation A confidence value above 10 signifies the question has been answered correctly A confidence of 2 ndash 10 signifies that it may have been answered but should verify with the user for correctness and any confidence below 2 is discarded

5 Asking remaining questions iteratively After the parsing it is checked if any of the required questions are still unanswered If so the chatbot asks the remaining question and the process from 3 ndash 4 is carried out iteratively Once all the questions have been answered the user is shown with the available flight options as per his request

Figure 4-4 The Flow diagram of the Flight booking Chatbot system

Online QA System and Attention Visualization

To be able to test out various examples we used the BiDAF [23] model for an

online demo One can either choose from the available examples from the drop-down

menu or paste their own passage and examples While this is a useful and interesting

system to test the model in a user-friendly way we created this system to be able to

focus on the wrong samples

36

Figure 4-5 QA system interface with attention highlight over candidate answers

The candidate answers are shown with the blue highlights as per their

confidence values The higher the confidence the darker the answer The highest

confidence value is chosen as the predicted answer We developed the system to show

attention spread of the candidate answers to realize what needs to be done to improve

the system This led us to realize the importance of including the query attention part as

well as multilevel attention on the BiDAF model as described in the first section of this

chapter

37

CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION

Future Directions

Having done a thorough analysis of the current methods and state of the arts in

Machine Comprehension and for the QA task and developing systems on it we have

achieved a strong sense of what needs to be done to further improve the QA models

After observing the wrong samples we can see that the system is still unable to encode

meaning and is picking answers based on statistical patterns of occurrence of answers

on training examples

As per the State of the Art models and analyzing the ongoing research literature

it is easy to conclude that more features need to be embedded to encode the meaning

of words phrases and sentences A paper called Reinforced Mnemonic Reader for

Machine Comprehension encoded POS and NER tags of words along with their word

and character embedding This gave them better results

Figure 5-1 An English language semantic parse tree [26]

38

We have developed a method to encode the syntax parse tree of a sentence that

not only encodes the post tags but the relation of the word within the phrase and the

relation of the phrase within the whole sentence in a hierarchical manner

Finally data augmentation is another solution to get better results One definite

way to reduce the errors would be to include similar samples in the training data which

the system is faltering in the dev set One could generate similar examples as the failure

cases and include them in the training set to have better prediction Another system

would be to train to similar and bigger datasets Our models were trained on the SQuAD

dataset There are other datasets too which does the similar question answering task

such as TriviaQA We could augment the training set of TriviaQA along with SQuAD to

have a more robust system that is able to generalize better and thus have higher

accuracy for predicting answer spans

Conclusion

In this work we have tried to explore the most fundamental techniques that have

shaped the current state of the art Then we proposed a minor improvement of

architecture over an existing model Furthermore we developed two applications that

uses the base model First we talked about how a chatbot application can be made

using the QA system and lastly we also created a web interface where the model can

be used for any Passage and Question This interface also shows the attention spread

on the candidate answers While our effort is ongoing to push the state of the art

forward we strongly believe that surpassing human level accuracy on this task will have

high dividends for the society at large

39

LIST OF REFERENCES

[1] Danqi Chen From Reading Comprehension To Open-Domain Question Answering 2018 Youtube Accessed March 16 2018 httpswwwyoutubecomwatchv=1RN88O9C13U

[2] Lehnert Wendy G The Process of Question Answering No RR-88 Yale Univ New Haven Conn Dept of computer science 1977

Summary

In this chapter we reviewed the methods that are fundamental to the state of the art in Machine Comprehension and in the task of Question Answering. We have reached close to human-level accuracy, and this is due to incremental developments over previous models. As we saw, out-of-vocabulary (OOV) tokens were handled using character embeddings, long-term dependencies within the context passage were addressed using self-attention, and many other techniques, such as contextualized vectors and attention flow, were employed to get better results. In the next chapter we will see how we can build on these models and develop them further.

CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING

Although the advancements in Question Answering systems since the release of the SQuAD dataset have been impressive, and the results are getting close to human-level accuracy, these systems are far from fool-proof. The models still make mistakes that would be obvious to a human. For example:

Passage: The Panthers used the San Jose State practice facility and stayed at the San Jose Marriott. The Broncos practiced at Florida State Facility and stayed at the Santa Clara Marriott.

Question: At what university's facility did the Panthers practice?

Actual Answer: San Jose State

Predicted Answer: Florida State Facility

To find out what leads to such wrong predictions, we wanted to see the attention weights associated with this example. We plotted the passage-question heat map, a 2D matrix where the intensity of each cell signifies the similarity between a passage word and a question word (a plotting sketch is given after the list below). For the above example we found that while certain words of the question are given high weightage, other parts are not. The words 'At', 'facility', and 'practice' are given high attention, but 'Panthers' does not receive high attention. If it had received high attention, the system would have predicted 'San Jose State' as the right answer. To solve this issue we analyzed the base BiDAF model and proposed adding two things:

1. Bi-Attention and Self-Attention over the Query

2. A second level of attention over the outputs of (Bi-Attention + Self-Attention) from both the Context and the Query
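The heat-map inspection described above can be reproduced with a few lines of plotting code. The sketch below is illustrative only: the attention matrix a is a random stand-in for the real passage-question similarity weights produced by the model, and the token lists are shortened.

```python
# Minimal sketch of the passage-question attention heat map; `a` is a random
# stand-in for the model's (passage_len x question_len) similarity matrix.
import numpy as np
import matplotlib.pyplot as plt

passage = "The Panthers used the San Jose State practice facility".split()
question = "At what university 's facility did the Panthers practice ?".split()
a = np.random.rand(len(passage), len(question))

fig, ax = plt.subplots(figsize=(6, 4))
im = ax.imshow(a, cmap="Blues")            # darker cell = higher similarity
ax.set_xticks(range(len(question)))
ax.set_xticklabels(question, rotation=45, ha="right")
ax.set_yticks(range(len(passage)))
ax.set_yticklabels(passage)
fig.colorbar(im, ax=ax)
plt.tight_layout()
plt.show()
```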


Multi-Attention BiDAF Model

Our comprehension model is a hierarchical multi-stage process and consists of the following layers:

1. Embedding: As in all other models, we embed words using pretrained word vectors. We also embed the characters in each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to get character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training.

2. Pre-Process: A shared bi-directional GRU (Cho et al., 2014) is used to map the question and passage embeddings to context-aware embeddings.

3. Context Attention: The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al., 2016) is used to build a query-aware context representation. Let h_i be the vector for context word i, q_j be the vector for question word j, and n_q and n_c be the lengths of the question and context respectively. We compute attention between context word i and question word j as

a_ij = w1 · h_i + w2 · q_j + w3 · (h_i ⊙ q_j)    (4-1)

where w1, w2, and w3 are learned vectors and ⊙ is element-wise multiplication. We then compute an attended vector c_i for each context token as

p_ij = exp(a_ij) / Σ_{k=1..n_q} exp(a_ik),    c_i = Σ_{j=1..n_q} p_ij q_j    (4-2)

We also compute a query-to-context vector q_c:

m_i = max_{j=1..n_q} a_ij,    p_i = exp(m_i) / Σ_{k=1..n_c} exp(m_k),    q_c = Σ_{i=1..n_c} p_i h_i    (4-3)

The final vector computed for each token is built by concatenating h_i, c_i, h_i ⊙ c_i, and q_c ⊙ c_i. In our model we subsequently pass the result through a linear layer with ReLU activations. (A code sketch of this attention step is given after this list.)

4. Context Self-Attention: Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU, and then we apply the same attention mechanism, only now between the passage and itself. In this case we do not use query-to-context attention, and we set a_ij = −inf if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so its output is additionally summed with the input.

5. Query Attention: This part is done the same way as Context Attention, but we calculate the weighted sum of the context words for each query word; thus the output has length equal to the number of query words. We then calculate a context-to-query vector, analogous to the query-to-context vector in the Context Attention layer.

Figure 4-1 The modified BiDAF model with multilevel attention

6. Query Self-Attention: This part is done the same way as the Context Self-Attention layer, but on the output of the Query Attention layer.

7. Context-Query Bi-Attention + Self-Attention: The outputs of the Context Self-Attention and Query Self-Attention layers are taken as input, and the same process of Bi-Attention and Self-Attention is applied to these inputs.

8. Prediction: In the last layer of our model, a bidirectional GRU is applied, followed by a linear layer that computes answer start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log-likelihood of selecting the correct start and end tokens.
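To make the attention computation of layers 3-5 concrete, the following NumPy sketch implements equations 4-1 through 4-3 as reconstructed above. It is a minimal illustration rather than the training code: the weight vectors are random, and the names (h, q, w1, w2, w3) simply mirror the symbols in the text.

```python
# Minimal NumPy sketch of the bi-directional attention step (Eqs. 4-1 to 4-3).
# h: (n_c, d) context vectors, q: (n_q, d) question vectors,
# w1, w2, w3: (d,) learned weight vectors (random here for illustration).
import numpy as np

def bidirectional_attention(h, q, w1, w2, w3):
    # Eq. 4-1: a_ij = w1·h_i + w2·q_j + w3·(h_i ⊙ q_j)
    a = (h @ w1)[:, None] + (q @ w2)[None, :] + (h * w3) @ q.T   # (n_c, n_q)

    # Eq. 4-2: softmax over question words, then attended vector c_i
    p = np.exp(a - a.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    c = p @ q                                                    # (n_c, d)

    # Eq. 4-3: query-to-context vector q_c from the max attention per context word
    m = a.max(axis=1)
    pm = np.exp(m - m.max())
    pm /= pm.sum()
    q_c = pm @ h                                                 # (d,)

    # Final per-token vector: [h_i; c_i; h_i ⊙ c_i; q_c ⊙ c_i]
    q_c_row = np.broadcast_to(q_c, c.shape)
    return np.concatenate([h, c, h * c, q_c_row * c], axis=1)    # (n_c, 4d)

# Toy usage with random vectors
rng = np.random.default_rng(0)
d, n_c, n_q = 8, 5, 3
out = bidirectional_attention(rng.normal(size=(n_c, d)), rng.normal(size=(n_q, d)),
                              rng.normal(size=d), rng.normal(size=d), rng.normal(size=d))
print(out.shape)  # (5, 32)
```

In the full model this concatenated output then goes through the linear layer with ReLU activations described in layer 3, and the self-attention layers reuse the same attention function with the passage (or query) attending to itself.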

Having carried out this modification, we were able to solve the wrong example we started with: the multilevel attention model gives the correct output, 'San Jose State'. We also achieved slightly better scores than the original model, with an F1 score of 85.44 on the SQuAD dev set.

Chatbot Design Using a QA System

Designing a perfect chatbot that passes the Turing test is a fundamental goal of Artificial Intelligence. Although we are many orders of magnitude away from achieving such a goal, domain-specific tasks can be solved with chatbots built from current technology. We started with a similar objective in mind, i.e., to design a domain-specific chatbot and then generalize to other areas once it achieves the first domain-specific objective robustly. This led us to the fundamental problem of Machine Comprehension and subsequently to the task of Question Answering. Having achieved some degree of success with QA systems, we looked back at whether we could apply our newly acquired knowledge to the task of designing chatbots.

The chatbots made with today's technologies mostly rely on handcrafted techniques such as template matching, which requires anticipating all possible ways a user may articulate his requirements and a conversation may unfold. This requires many man-hours to design a domain-specific system and is still very error prone. In this section we propose a general chatbot design that would make designing a domain-specific chatbot easy and robust at the same time.

Every domain-specific chatbot needs to obtain a set of information from the user and show some results based on the user-specific information obtained. Traditional chatbots use template matching and keyword lookup to determine whether the user has provided the required information. Our idea is to use the Question Answering system in the backend to extract the required information from whatever the user has typed up to that point in the conversation. The information to be extracted can be posed as a set of questions, and the answers obtained from those questions can be used as the parameters to supply the relevant information to the user.

We chose flight reservation as our chatbot domain. Our goal was to extract the required information from the user so that we could show him the available flights as per the user's requirements.

Figure 4-2 Flight reservation chatbotrsquos chat window

For a flight reservation task, the booking agent needs to know at minimum the origin city, destination city, and date of travel to be able to show the available flights. Optional information includes the number of tickets, the passenger's name, one way or round trip, etc.

The minimal conversation with the user through the chat window would be as shown above. We had a platform called OneTask on which we wanted to implement our chatbot. The chat interface within the OneTask system looks as follows.

Figure 4-3 Chatbot within OneTask system

The working of the chat system is as follows:

1. Initiation: The user opens the chat window, which starts a session with the chatbot. The chatbot asks 'How may I help you?'

2. User Reply: The user may reply with none of the required information for flight booking, or may provide several pieces of information in the same message.

3. User Reply Parsing: The conversation up to this point is treated as a passage, and the internal questions are run on this passage. The questions that are run are:

Where do you want to go?

From where do you want to leave?

When do you want to depart?

4. Parsed Responses from the QA Model: After running the questions on the QA system with the given conversation, answers are obtained for all of them along with their corresponding confidence values. Since answers will be returned even if the required question has not actually been answered up to this point in the conversation, the confidence values play a crucial role in determining the validity of an answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies that the question has been answered correctly; a confidence of 2–10 signifies that it may have been answered, but the chatbot should verify with the user for correctness; and any answer with confidence below 2 is discarded.

5. Asking Remaining Questions Iteratively: After the parsing, the chatbot checks whether any of the required questions are still unanswered. If so, it asks the remaining question and the process from steps 3–4 is carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request. (A sketch of this slot-filling loop is given below.)
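A compact sketch of steps 3-5 is shown below. The function qa_model() is a hypothetical stand-in for the QA backend described earlier, the thresholds 10 and 2 are the confidence bands from step 4, and the slot names and question wording are illustrative.

```python
# Hypothetical slot-filling loop for the flight-booking chatbot (steps 3-5).
REQUIRED_QUESTIONS = {
    "destination": "Where do you want to go?",
    "origin": "From where do you want to leave?",
    "date": "When do you want to depart?",
}

def qa_model(passage, question):
    """Stand-in for the QA backend: returns (answer_text, confidence)."""
    raise NotImplementedError  # wire this to the actual model

def parse_user_reply(conversation, slots, to_confirm):
    """Steps 3-4: run the internal questions over the conversation so far."""
    for slot, question in REQUIRED_QUESTIONS.items():
        if slot in slots:
            continue
        answer, confidence = qa_model(conversation, question)
        if confidence > 10:        # answered correctly -> accept
            slots[slot] = answer
        elif confidence >= 2:      # possibly answered -> verify with the user
            to_confirm[slot] = answer
        # confidence below 2 -> discard; the question will be asked explicitly

def next_bot_message(slots):
    """Step 5: ask the first still-unanswered question, else show flights."""
    for slot, question in REQUIRED_QUESTIONS.items():
        if slot not in slots:
            return question
    return "Here are the available flights for your trip."
```

The chatbot alternates parse_user_reply and next_bot_message after every user message until all required slots are filled.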

Figure 4-4 The Flow diagram of the Flight booking Chatbot system

Online QA System and Attention Visualization

To be able to test out various examples, we set up the BiDAF [24] model as an online demo. One can either choose from the available examples in the drop-down menu or paste in their own passage and questions. While this is a useful and interesting system for testing the model in a user-friendly way, we created it primarily to be able to focus on the wrong samples.


Figure 4-5 QA system interface with attention highlight over candidate answers

The candidate answers are shown with blue highlights according to their confidence values: the higher the confidence, the darker the highlight, and the candidate with the highest confidence is chosen as the predicted answer (a sketch of this logic follows). We developed the system to show the attention spread over the candidate answers in order to understand what needs to be done to improve the model. This led us to realize the importance of including the query-attention part as well as multilevel attention in the BiDAF model, as described in the first section of this chapter.
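The highlighting logic itself is simple; a sketch is given below, assuming each candidate span comes with a confidence score from the model. The span format and the opacity mapping are illustrative choices, not the exact implementation.

```python
# Map candidate answer spans to highlight opacities and pick the prediction.
def shade_candidates(candidates):
    """candidates: list of (start_token, end_token, confidence)."""
    top = max(conf for _, _, conf in candidates)
    shades = [(start, end, conf / top) for start, end, conf in candidates]  # 1.0 = darkest blue
    prediction = max(candidates, key=lambda c: c[2])
    return shades, prediction

shades, prediction = shade_candidates([(10, 12, 0.62), (40, 41, 0.21), (55, 57, 0.08)])
print(prediction)  # (10, 12, 0.62) -> rendered with the darkest highlight
```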


CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION

Future Directions

Having done a thorough analysis of the current methods and state of the art in Machine Comprehension and the QA task, and having developed systems on top of them, we have gained a strong sense of what needs to be done to further improve QA models. After observing the wrong samples, we can see that the system is still unable to encode meaning and is picking answers based on statistical patterns of answer occurrence in the training examples.

From the state-of-the-art models and the ongoing research literature, it is easy to conclude that more features need to be embedded to encode the meaning of words, phrases, and sentences. The Reinforced Mnemonic Reader for Machine Comprehension [6] encoded POS and NER tags of words along with their word and character embeddings, which gave them better results. (A generic sketch of this kind of feature augmentation is shown below.)
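As an illustration of this kind of feature augmentation (a generic sketch, not the exact scheme of the Mnemonic Reader), POS and NER tags can be appended to the word- and character-level embeddings as one-hot features; the tag sets and dimensions below are arbitrary.

```python
# Generic sketch: concatenate word/char embeddings with one-hot POS and NER features.
import numpy as np

POS_TAGS = ["NOUN", "VERB", "ADJ", "ADP", "DET", "OTHER"]
NER_TAGS = ["PER", "ORG", "LOC", "DATE", "O"]

def one_hot(tag, vocab):
    v = np.zeros(len(vocab))
    v[vocab.index(tag) if tag in vocab else len(vocab) - 1] = 1.0
    return v

def token_features(word_vec, char_vec, pos_tag, ner_tag):
    return np.concatenate([word_vec, char_vec,
                           one_hot(pos_tag, POS_TAGS),
                           one_hot(ner_tag, NER_TAGS)])

feat = token_features(np.zeros(300), np.zeros(100), "NOUN", "ORG")
print(feat.shape)  # (411,) = 300 + 100 + 6 + 5
```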

Figure 5-1 An English language semantic parse tree [26]

We have developed a method to encode the syntax parse tree of a sentence that encodes not only the POS tags but also the relation of each word within its phrase and the relation of each phrase within the whole sentence, in a hierarchical manner.

Finally, data augmentation is another way to get better results. One definite way to reduce errors would be to include in the training data samples similar to those the system falters on in the dev set: one could generate examples similar to the failure cases and include them in the training set for better prediction. Another approach would be to train on similar, larger datasets. Our models were trained on the SQuAD dataset, but other datasets, such as TriviaQA, pose a similar question answering task. We could augment the SQuAD training set with TriviaQA to obtain a more robust system that generalizes better and thus achieves higher accuracy in predicting answer spans.

Conclusion

In this work we explored the most fundamental techniques that have shaped the current state of the art. We then proposed a minor architectural improvement over an existing model. Furthermore, we developed two applications that use the base model: first, we described how a chatbot application can be built using the QA system, and second, we created a web interface where the model can be used for any passage and question, with the attention spread shown over the candidate answers. While our effort to push the state of the art forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will pay high dividends for society at large.

LIST OF REFERENCES

[1] Danqi Chen. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U

[2] Lehnert, Wendy G. "The Process of Question Answering." No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.

[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension." arXiv preprint arXiv:1705.03551 (2017).

[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ Questions for Machine Comprehension of Text." arXiv preprint arXiv:1606.05250 (2016).

[5] "The Stanford Question Answering Dataset." 2018. rajpurkar.github.io. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/

[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced Mnemonic Reader for Machine Comprehension." CoRR abs/1705.02798 (2017).

[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.

[8] Me For The AI, and Neetesh Mehrotra. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai/

[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.

[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/

[11] Williams, Ronald J., and David Zipser. "A Learning Algorithm for Continually Running Fully Recurrent Neural Networks." Neural Computation 1, no. 2 (1989): 270-280.

[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long Short-Term Memory." Neural Computation 9, no. 8 (1997): 1735-1780.

[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling." arXiv preprint arXiv:1412.3555 (2014).

[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic Regularities in Continuous Space Word Representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.

[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global Vectors for Word Representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.

[16] "Word2Vec Tutorial - The Skip-Gram Model." Chris McCormick. 2016. mccormickml.com. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

[17] Group. 2015. "[Paper Introduction] Bilingual Word Representations With Monolingual …" slideshare.net. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind

[18] "Attention Mechanism." 2016. blog.heuritech.com. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism/

[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. 2014. "Memory Networks." arXiv preprint arXiv:1410.3916v11. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916

[20] Wang, Shuohang, and Jing Jiang. "Machine Comprehension Using Match-LSTM and Answer Pointer." arXiv preprint arXiv:1608.07905 (2016).

[21] Wang, Shuohang, and Jing Jiang. "Learning Natural Language Inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).

[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer Networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.

[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated Self-Matching Networks for Reading Comprehension and Question Answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.

[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional Attention Flow for Machine Comprehension." arXiv preprint arXiv:1611.01603 (2016).

[25] Clark, Christopher, and Matt Gardner. "Simple and Effective Multi-Paragraph Reading Comprehension." arXiv preprint arXiv:1710.10723 (2017).

[26] "8. Analyzing Sentence Structure." 2018. nltk.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html


BIOGRAPHICAL SKETCH

Purnendu Mukherjee grew up in Kolkata, India, and had a strong interest in computers from an early age. After high school, he completed his B.Sc. in computer science from Ramakrishna Mission Residential College, Narendrapur, followed by an M.Sc. in computer science from St. Xavier's College, Kolkata. He had a strong intuition for and interest in human-like learning systems and wanted to work in this area. He started working at TCS Innovation Labs, Pune, on applications of Natural Language Processing in education. As he was deeply passionate about learning systems that mimic the human brain and learn like a human child does, he became increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science from the University of Florida, Gainesville. His academic interests have been focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.

  • ACKNOWLEDGMENTS
  • LIST OF FIGURES
  • LIST OF ABBREVIATIONS
  • INTRODUCTION
  • THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
    • Neural Networks
    • Convolutional Neural Network
    • Recurrent Neural Networks (RNN)
    • Word Embedding
    • Attention Mechanism
    • Memory Networks
      • LITERATURE REVIEW AND STATE OF THE ART
        • Machine Comprehension Using Match-LSTM and Answer Pointer
        • R-NET Matching Reading Comprehension with Self-Matching Networks
        • Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
        • Summary
          • MULTI-ATTENTION QUESTION ANSWERING
            • Multi-Attention BiDAF Model
            • Chatbot Design Using a QA System
            • Online QA System and Attention Visualization
              • FUTURE DIRECTIONS AND CONCLUSION
                • Future Directions
                • Conclusion
                  • LIST OF REFERENCES
                  • BIOGRAPHICAL SKETCH
Page 29: QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISMufdcimages.uflib.ufl.edu/UF/E0/05/23/73/00001/MUKHERJEE_P.pdf · QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM By PURNENDU

29

CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING

Although the advancements in Question Answering systems since the release of

the SQuAD dataset has been impressive and the results are getting close to human

level accuracy it is far from being a fool-proof system The models still make mistakes

which would be obvious to a human For example

Passage The Panthers used the San Jose State practice facility and stayed at

the San Jose Marriott The Broncos practiced at Florida State Facility and stayed at the

Santa Clara Marriott

Question At what universitys facility did the Panthers practice

Actual Answer San Jose State

Predicted Answer Florida State Facility

To find out what is leading to the wrong predictions we wanted to see the

attention weights associated with such an example We plotted the passage and

question heat map which is a 2D matrix where the intensity of each cell signifies the

similarity between a passage word and a question word For the above example we

found out that while certain words of the question are given high weightage other parts

are not The words lsquoAtrsquo lsquofacilityrsquo lsquopracticersquo are given high attention but lsquoPanthersrsquo does

not receive high attention If it had received high attention then the system would have

predicted lsquoSan Jose Statersquo as the right answer To solve this issue we analyzed the

base BiDAF model and proposed adding two things

1 Bi-Attention and Self-Attention over Query

2 Second level attention over the output of (Bi-Attention + Self-Attention) from both Context and Query

30

Multi-Attention BiDAF Model

Our comprehension model is a hierarchical multi-stage process and consists of

the following layers

1 Embedding Just as all other models we embed words using pretrained word vectors We also embed the characters in each word into size 20 vectors which are learned and run a convolution neural network followed by max-pooling to get character-derived embeddings for each word The character-level and word-level embeddings are then concatenated and passed to the next layer We do not update the word embeddings during training

2 Pre-Process A shared bi-directional GRU (Cho et al 2014) is used to map the question and passage embeddings to context aware embeddings

3 Context Attention The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al 2016) is used to build a query-aware context representation Let h_i be the vector for context word i q_j be the vector for question word j and n_q and n_c be the lengths of the question and context respectively We compute attention between context word i and question word j as

(4-1)

where w1 w2 and w3 are learned vectors and is element-wise multiplication We then compute an attended vector c_i for each context token as

(4-2)

We also compute a query-to-context vector q_c

(4-3)

31

The final vector computed for each token is built by concatenating

and In our model we subsequently pass the result through a linear layer with ReLU activations

4 Context Self-Attention Next we use a layer of residual self-attention The input is passed through another bi-directional GRU Then we apply the same attention mechanism only now between the passage and itself In this case we do not use query-to context attention and we set aij = 1048576inf if i = j As before we pass the concatenated output through a linear layer with ReLU activations This layer is applied residually so this output is additionally summed with the input

5 Query Attention For this part we do the same way as context attention but calculate the weighted sum of the context words for each query word Thus we get the length as number of query words Then we calculate context to query similar to query to context in context attention layer

Figure 4-1 The modified BiDAF model with multilevel attention

6 Query Self-Attention This part is done the same way as the context self-attention layer but from the output of the Query Attention layer

32

7 Context Query Bi-Attention + Self-Attention The output of the Context self-attention and the Query Self-Attention layers are taken as input and the same process for Bi-Attention and self-attention is applied on these inputs

8 Prediction In the last layer of our model a bidirectional GRU is applied followed by a linear layer that computes answer start scores for each token The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores The softmax operation is applied to the start and end scores to produce start and end probabilities and we optimize the negative log likelihood of selecting correct start and end tokens

Having carried out this modification we were able to solve the wrong example we

started with The multilevel attention model gives the correct output as ldquoSan Jose

Staterdquo Also we achieved slightly better scores than the original model with a F1 score

of 8544 on the SQuAD dev dataset

Chatbot Design Using a QA System

Designing a perfect chatbot that passes the Turing test is a fundamental goal for

Artificial Intelligence Although we are many order to magnitudes away from achieving

such a goal domain specific tasks can be solved with chatbots made from current

technology We had started our goal with a similar objective in mind ie to design a

domain specific chatbot and then generalize to other areas as it is able to achieve the

first domain specific objective robustly This led us to the fundamental problem of

Machine Comprehension and subsequently to the task of Question Answering Having

achieved some degree of success with QA systems we looked back if we could apply

our newly acquired knowledge in the task of designing Chatbots

The chatbots made with todayrsquos technologies are mostly handcrafted techniques

such as template matching that requires anticipating all possible ways a user may

articulate his requirements and a conversation may occur This requires a lot of man

hours for designing a domain specific system and is still very error prone In this section

33

we propose a general Chatbot design that would make the designing of a domain

specific chatbot very easy and robust at the same time

Every domain specific chatbot needs to obtain a set of information from the user

and show some results based on the user specific information obtained The traditional

chatbots use template matching and keywords lookup to determine if the user has

provided the required information Our idea is to use the Question Answering system in

the backend to extract out the required information from whatever the user has typed

until this point of the conversation The information to be extracted can be posed in the

form of a set of questions and the answers obtained from those questions can be used

as the parameters to supply the relevant information to the user

We had chosen our chatbot domain as the flight reservation system Our goal

was to extract the required information from the user to be able to show him the

available flights as per the userrsquos requirements

Figure 4-2 Flight reservation chatbotrsquos chat window

34

For a flight reservation task the booking agent needs to know the origin city

destination city and date of travel at minimum to be able to show the available flights

Optional information includes the number of tickets passengerrsquos name one way or

round trip etc

The minimalistic conversation with the user through the chat window would be as

shown above We had a platform called OneTask on which we wanted to implement our

chat bot The chat interface within the OneTask system looks as follows

Figure 4-3 Chatbot within OneTask system

The working of the chat system is as follows ndash

1 Initiation The user opens the chat window which starts a session with the chat bot The chat bot asks lsquoHow may I help yoursquo

2 User Reply The user may reply with none of the required information for flight booking or may reply with multiple information in the same message

3 User Reply parsing The conversation up to this point is treated as a passage and the internal questions are run on this passage So the four questions that are run are

Where do you want to go

35

From where do you want to leave

When do you want to depart

4 Parsed responses from QA model After running the questions on the QA system with the given conversation answers are obtained for all along with their corresponding confidence values Since answers will be obtained even if the required question was not answered up to this point in the conversation the confidence values plays a crucial role in determining the validity of the answer For this purpose we determined ranges of confidence values for validation A confidence value above 10 signifies the question has been answered correctly A confidence of 2 ndash 10 signifies that it may have been answered but should verify with the user for correctness and any confidence below 2 is discarded

5 Asking remaining questions iteratively After the parsing it is checked if any of the required questions are still unanswered If so the chatbot asks the remaining question and the process from 3 ndash 4 is carried out iteratively Once all the questions have been answered the user is shown with the available flight options as per his request

Figure 4-4 The Flow diagram of the Flight booking Chatbot system

Online QA System and Attention Visualization

To be able to test out various examples we used the BiDAF [23] model for an

online demo One can either choose from the available examples from the drop-down

menu or paste their own passage and examples While this is a useful and interesting

system to test the model in a user-friendly way we created this system to be able to

focus on the wrong samples

36

Figure 4-5 QA system interface with attention highlight over candidate answers

The candidate answers are shown with the blue highlights as per their

confidence values The higher the confidence the darker the answer The highest

confidence value is chosen as the predicted answer We developed the system to show

attention spread of the candidate answers to realize what needs to be done to improve

the system This led us to realize the importance of including the query attention part as

well as multilevel attention on the BiDAF model as described in the first section of this

chapter

37

CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION

Future Directions

Having done a thorough analysis of the current methods and state of the arts in

Machine Comprehension and for the QA task and developing systems on it we have

achieved a strong sense of what needs to be done to further improve the QA models

After observing the wrong samples we can see that the system is still unable to encode

meaning and is picking answers based on statistical patterns of occurrence of answers

on training examples

As per the State of the Art models and analyzing the ongoing research literature

it is easy to conclude that more features need to be embedded to encode the meaning

of words phrases and sentences A paper called Reinforced Mnemonic Reader for

Machine Comprehension encoded POS and NER tags of words along with their word

and character embedding This gave them better results

Figure 5-1 An English language semantic parse tree [26]

38

We have developed a method to encode the syntax parse tree of a sentence that

not only encodes the post tags but the relation of the word within the phrase and the

relation of the phrase within the whole sentence in a hierarchical manner

Finally data augmentation is another solution to get better results One definite

way to reduce the errors would be to include similar samples in the training data which

the system is faltering in the dev set One could generate similar examples as the failure

cases and include them in the training set to have better prediction Another system

would be to train to similar and bigger datasets Our models were trained on the SQuAD

dataset There are other datasets too which does the similar question answering task

such as TriviaQA We could augment the training set of TriviaQA along with SQuAD to

have a more robust system that is able to generalize better and thus have higher

accuracy for predicting answer spans

Conclusion

In this work we have tried to explore the most fundamental techniques that have

shaped the current state of the art Then we proposed a minor improvement of

architecture over an existing model Furthermore we developed two applications that

uses the base model First we talked about how a chatbot application can be made

using the QA system and lastly we also created a web interface where the model can

be used for any Passage and Question This interface also shows the attention spread

on the candidate answers While our effort is ongoing to push the state of the art

forward we strongly believe that surpassing human level accuracy on this task will have

high dividends for the society at large

39

LIST OF REFERENCES

[1] Danqi Chen From Reading Comprehension To Open-Domain Question Answering 2018 Youtube Accessed March 16 2018 httpswwwyoutubecomwatchv=1RN88O9C13U

[2] Lehnert Wendy G The Process of Question Answering No RR-88 Yale Univ New Haven Conn Dept of computer science 1977

[3] Joshi Mandar Eunsol Choi Daniel S Weld and Luke Zettlemoyer Triviaqa A large scale distantly supervised challenge dataset for reading comprehension arXiv preprint arXiv170503551 (2017)

[4] Rajpurkar Pranav Jian Zhang Konstantin Lopyrev and Percy Liang Squad 100000+ questions for machine comprehension of text arXiv preprint arXiv160605250 (2016)

[5] The Stanford Question Answering Dataset 2018 RajpurkarGithubIo Accessed March 16 2018 httpsrajpurkargithubioSQuAD-explorer

[6] Hu Minghao Yuxing Peng and Xipeng Qiu Reinforced mnemonic reader for machine comprehension CoRR abs170502798 (2017)

[7] Schalkoff Robert J Artificial neural networks Vol 1 New York McGraw-Hill 1997

[8] Me For The AI and Neetesh Mehrotra 2018 The Connect Between Deep Learning And AI - Open Source For You Open Source For You Accessed March 16 2018 httpopensourceforucom201801connect-deep-learning-ai

[9] Krizhevsky Alex Ilya Sutskever and Geoffrey E Hinton Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems pp 1097-1105 2012

[10] An Intuitive Explanation Of Convolutional Neural Networks 2016 The Data Science Blog Accessed March 16 2018 httpsujjwalkarnme20160811intuitive-explanation-convnets

[11] Williams Ronald J and David Zipser A learning algorithm for continually running fully recurrent neural networks Neural computation 1 no 2 (1989) 270-280

[12] Hochreiter Sepp and Juumlrgen Schmidhuber Long short-term memory Neural computation 9 no 8 (1997) 1735-1780

[13] Chung Junyoung Caglar Gulcehre KyungHyun Cho and Yoshua Bengio Empirical evaluation of gated recurrent neural networks on sequence modeling arXiv preprint arXiv14123555 (2014)

40

[14] Mikolov Tomas Wen-tau Yih and Geoffrey Zweig Linguistic regularities in continuous space word representations In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics Human Language Technologies pp 746-751 2013

[15] Pennington Jeffrey Richard Socher and Christopher Manning Glove Global vectors for word representation In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) pp 1532-1543 2014

[16] Word2vec Tutorial - The Skip-Gram Model middot Chris Mccormick 2016 MccormickmlCom Accessed March 16 2018 httpmccormickmlcom20160419word2vec-tutorial-the-skip-gram-model

[17] Group 2015 [Paper Introduction] Bilingual Word Representations With Monolingual hellip SlideshareNet Accessed March 16 2018 httpswwwslidesharenetnaist-mtstudybilingual-word-representations-with-monolingual-quality-in-mind

[18] Attention Mechanism 2016 BlogHeuritechCom Accessed March 16 2018 httpsblogheuritechcom20160120attention-mechanism

[19] Weston Jason Sumit Chopra and Antoine Bordes 2014 Memory Networks arXiv preprint arXiv14103916v11 Accessed March 16 2018 httpsarxivorgabs14103916

[20] Wang Shuohang and Jing Jiang Machine comprehension using match-lstm and answer pointer arXiv preprint arXiv160807905 (2016)

[21] Wang Shuohang and Jing Jiang Learning natural language inference with LSTM arXiv preprint arXiv151208849 (2015)

[22] Vinyals Oriol Meire Fortunato and Navdeep Jaitly Pointer networks In Advances in Neural Information Processing Systems pp 2692-2700 2015

[23] Wang Wenhui Nan Yang Furu Wei Baobao Chang and Ming Zhou Gated self-matching networks for reading comprehension and question answering In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1 Long Papers) vol 1 pp 189-198 2017

[24] Seo Minjoon Aniruddha Kembhavi Ali Farhadi and Hannaneh Hajishirzi Bidirectional attention flow for machine comprehension arXiv preprint arXiv161101603 (2016)

[25] Clark Christopher and Matt Gardner Simple and effective multi-paragraph reading comprehension arXiv preprint arXiv171010723 (2017)

41

[26] 8 Analyzing Sentence Structure 2018 NltkOrg Accessed March 16 2018 httpwwwnltkorgbookch08html

42

BIOGRAPHICAL SKETCH

Purnendu Mukherjee grew up in Kolkata India and had a strong interest for

computers since an early age After his high school he did his BSc in computer

science from Ramakrishna Mission Residential College Narendrapur followed by MSc

in computer science from St Xavierrsquos College Kolkata He had a strong intuition and

interest for human like learning systems and wanted to work in this area He started

working at TCS Innovation Labs Pune for the application of Natural Language

Processing in Educational Applications As he was deeply passionate about learning

system that mimic the human brain and learn like a human child does he was

increasing interested about Deep Learning and its applications After working for a year

he went on to pursue a Master of Science degree in computer science from the

University of Florida Gainesville His academic interests have been focused on Deep

Learning and Natural Language Processing and he has been working on Machine

Reading Comprehension since summer of 2017

  • ACKNOWLEDGMENTS
  • LIST OF FIGURES
  • LIST OF ABBREVIATIONS
  • INTRODUCTION
  • THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
    • Neural Networks
    • Convolutional Neural Network
    • Recurrent Neural Networks (RNN)
    • Word Embedding
    • Attention Mechanism
    • Memory Networks
      • LITERATURE REVIEW AND STATE OF THE ART
        • Machine Comprehension Using Match-LSTM and Answer Pointer
        • R-NET Matching Reading Comprehension with Self-Matching Networks
        • Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
        • Summary
          • MULTI-ATTENTION QUESTION ANSWERING
            • Multi-Attention BiDAF Model
            • Chatbot Design Using a QA System
            • Online QA System and Attention Visualization
              • FUTURE DIRECTIONS AND CONCLUSION
                • Future Directions
                • Conclusion
                  • LIST OF REFERENCES
                  • BIOGRAPHICAL SKETCH
Page 30: QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISMufdcimages.uflib.ufl.edu/UF/E0/05/23/73/00001/MUKHERJEE_P.pdf · QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM By PURNENDU

30

Multi-Attention BiDAF Model

Our comprehension model is a hierarchical multi-stage process and consists of

the following layers

1 Embedding Just as all other models we embed words using pretrained word vectors We also embed the characters in each word into size 20 vectors which are learned and run a convolution neural network followed by max-pooling to get character-derived embeddings for each word The character-level and word-level embeddings are then concatenated and passed to the next layer We do not update the word embeddings during training

2 Pre-Process A shared bi-directional GRU (Cho et al 2014) is used to map the question and passage embeddings to context aware embeddings

3 Context Attention The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al 2016) is used to build a query-aware context representation Let h_i be the vector for context word i q_j be the vector for question word j and n_q and n_c be the lengths of the question and context respectively We compute attention between context word i and question word j as

(4-1)

where w1 w2 and w3 are learned vectors and is element-wise multiplication We then compute an attended vector c_i for each context token as

(4-2)

We also compute a query-to-context vector q_c

(4-3)

31

The final vector computed for each token is built by concatenating

and In our model we subsequently pass the result through a linear layer with ReLU activations

4 Context Self-Attention Next we use a layer of residual self-attention The input is passed through another bi-directional GRU Then we apply the same attention mechanism only now between the passage and itself In this case we do not use query-to context attention and we set aij = 1048576inf if i = j As before we pass the concatenated output through a linear layer with ReLU activations This layer is applied residually so this output is additionally summed with the input

5 Query Attention For this part we do the same way as context attention but calculate the weighted sum of the context words for each query word Thus we get the length as number of query words Then we calculate context to query similar to query to context in context attention layer

Figure 4-1 The modified BiDAF model with multilevel attention

6 Query Self-Attention This part is done the same way as the context self-attention layer but from the output of the Query Attention layer

32

7 Context Query Bi-Attention + Self-Attention The output of the Context self-attention and the Query Self-Attention layers are taken as input and the same process for Bi-Attention and self-attention is applied on these inputs

8 Prediction In the last layer of our model a bidirectional GRU is applied followed by a linear layer that computes answer start scores for each token The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores The softmax operation is applied to the start and end scores to produce start and end probabilities and we optimize the negative log likelihood of selecting correct start and end tokens

Having carried out this modification we were able to solve the wrong example we

started with The multilevel attention model gives the correct output as ldquoSan Jose

Staterdquo Also we achieved slightly better scores than the original model with a F1 score

of 8544 on the SQuAD dev dataset

Chatbot Design Using a QA System

Designing a perfect chatbot that passes the Turing test is a fundamental goal for

Artificial Intelligence Although we are many order to magnitudes away from achieving

such a goal domain specific tasks can be solved with chatbots made from current

technology We had started our goal with a similar objective in mind ie to design a

domain specific chatbot and then generalize to other areas as it is able to achieve the

first domain specific objective robustly This led us to the fundamental problem of

Machine Comprehension and subsequently to the task of Question Answering Having

achieved some degree of success with QA systems we looked back if we could apply

our newly acquired knowledge in the task of designing Chatbots

The chatbots made with todayrsquos technologies are mostly handcrafted techniques

such as template matching that requires anticipating all possible ways a user may

articulate his requirements and a conversation may occur This requires a lot of man

hours for designing a domain specific system and is still very error prone In this section

33

we propose a general Chatbot design that would make the designing of a domain

specific chatbot very easy and robust at the same time

Every domain specific chatbot needs to obtain a set of information from the user

and show some results based on the user specific information obtained The traditional

chatbots use template matching and keywords lookup to determine if the user has

provided the required information Our idea is to use the Question Answering system in

the backend to extract out the required information from whatever the user has typed

until this point of the conversation The information to be extracted can be posed in the

form of a set of questions and the answers obtained from those questions can be used

as the parameters to supply the relevant information to the user

We had chosen our chatbot domain as the flight reservation system Our goal

was to extract the required information from the user to be able to show him the

available flights as per the userrsquos requirements

Figure 4-2 Flight reservation chatbotrsquos chat window

34

For a flight reservation task the booking agent needs to know the origin city

destination city and date of travel at minimum to be able to show the available flights

Optional information includes the number of tickets passengerrsquos name one way or

round trip etc

The minimalistic conversation with the user through the chat window would be as

shown above We had a platform called OneTask on which we wanted to implement our

chat bot The chat interface within the OneTask system looks as follows

Figure 4-3 Chatbot within OneTask system

The working of the chat system is as follows ndash

1 Initiation The user opens the chat window which starts a session with the chat bot The chat bot asks lsquoHow may I help yoursquo

2 User Reply The user may reply with none of the required information for flight booking or may reply with multiple information in the same message

3 User Reply parsing The conversation up to this point is treated as a passage and the internal questions are run on this passage So the four questions that are run are

Where do you want to go

35

From where do you want to leave

When do you want to depart

4 Parsed responses from QA model After running the questions on the QA system with the given conversation answers are obtained for all along with their corresponding confidence values Since answers will be obtained even if the required question was not answered up to this point in the conversation the confidence values plays a crucial role in determining the validity of the answer For this purpose we determined ranges of confidence values for validation A confidence value above 10 signifies the question has been answered correctly A confidence of 2 ndash 10 signifies that it may have been answered but should verify with the user for correctness and any confidence below 2 is discarded

5 Asking remaining questions iteratively After the parsing it is checked if any of the required questions are still unanswered If so the chatbot asks the remaining question and the process from 3 ndash 4 is carried out iteratively Once all the questions have been answered the user is shown with the available flight options as per his request

Figure 4-4 The Flow diagram of the Flight booking Chatbot system

Online QA System and Attention Visualization

To be able to test out various examples we used the BiDAF [23] model for an

online demo One can either choose from the available examples from the drop-down

menu or paste their own passage and examples While this is a useful and interesting

system to test the model in a user-friendly way we created this system to be able to

focus on the wrong samples

36

Figure 4-5 QA system interface with attention highlight over candidate answers

The candidate answers are shown with the blue highlights as per their

confidence values The higher the confidence the darker the answer The highest

confidence value is chosen as the predicted answer We developed the system to show

attention spread of the candidate answers to realize what needs to be done to improve

the system This led us to realize the importance of including the query attention part as

well as multilevel attention on the BiDAF model as described in the first section of this

chapter

37

CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION

Future Directions

Having done a thorough analysis of the current methods and state of the arts in

Machine Comprehension and for the QA task and developing systems on it we have

achieved a strong sense of what needs to be done to further improve the QA models

After observing the wrong samples we can see that the system is still unable to encode

meaning and is picking answers based on statistical patterns of occurrence of answers

on training examples

As per the State of the Art models and analyzing the ongoing research literature

it is easy to conclude that more features need to be embedded to encode the meaning

of words phrases and sentences A paper called Reinforced Mnemonic Reader for

Machine Comprehension encoded POS and NER tags of words along with their word

and character embedding This gave them better results

Figure 5-1 An English language semantic parse tree [26]

38

We have developed a method to encode the syntax parse tree of a sentence that

not only encodes the post tags but the relation of the word within the phrase and the

relation of the phrase within the whole sentence in a hierarchical manner

Finally data augmentation is another solution to get better results One definite

way to reduce the errors would be to include similar samples in the training data which

the system is faltering in the dev set One could generate similar examples as the failure

cases and include them in the training set to have better prediction Another system

would be to train to similar and bigger datasets Our models were trained on the SQuAD

dataset There are other datasets too which does the similar question answering task

such as TriviaQA We could augment the training set of TriviaQA along with SQuAD to

have a more robust system that is able to generalize better and thus have higher

accuracy for predicting answer spans

Conclusion

In this work we have tried to explore the most fundamental techniques that have

shaped the current state of the art Then we proposed a minor improvement of

architecture over an existing model Furthermore we developed two applications that

uses the base model First we talked about how a chatbot application can be made

using the QA system and lastly we also created a web interface where the model can

be used for any Passage and Question This interface also shows the attention spread

on the candidate answers While our effort is ongoing to push the state of the art

forward we strongly believe that surpassing human level accuracy on this task will have

high dividends for the society at large

LIST OF REFERENCES

[1] Chen, Danqi. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U.

[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. New Haven, Conn.: Yale Univ., Dept. of Computer Science, 1977.

[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension." arXiv preprint arXiv:1705.03551 (2017).

[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ Questions for Machine Comprehension of Text." arXiv preprint arXiv:1606.05250 (2016).

[5] "The Stanford Question Answering Dataset." 2018. Rajpurkar.github.io. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/.

[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced Mnemonic Reader for Machine Comprehension." CoRR abs/1705.02798 (2017).

[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.

[8] Me For The AI, and Neetesh Mehrotra. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai/.

[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.

[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/.

[11] Williams, Ronald J., and David Zipser. "A Learning Algorithm for Continually Running Fully Recurrent Neural Networks." Neural Computation 1, no. 2 (1989): 270-280.

[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long Short-Term Memory." Neural Computation 9, no. 8 (1997): 1735-1780.

[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling." arXiv preprint arXiv:1412.3555 (2014).

[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic Regularities in Continuous Space Word Representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.

[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global Vectors for Word Representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.

[16] "Word2Vec Tutorial - The Skip-Gram Model." Chris McCormick, 2016. McCormickML.com. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/.

[17] Group. 2015. "[Paper Introduction] Bilingual Word Representations with Monolingual Quality in Mind." SlideShare.net. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind.

[18] "Attention Mechanism." 2016. Blog.Heuritech.com. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism/.

[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. 2014. "Memory Networks." arXiv preprint arXiv:1410.3916. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916.

[20] Wang, Shuohang, and Jing Jiang. "Machine Comprehension Using Match-LSTM and Answer Pointer." arXiv preprint arXiv:1608.07905 (2016).

[21] Wang, Shuohang, and Jing Jiang. "Learning Natural Language Inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).

[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer Networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.

[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated Self-Matching Networks for Reading Comprehension and Question Answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.

[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional Attention Flow for Machine Comprehension." arXiv preprint arXiv:1611.01603 (2016).

[25] Clark, Christopher, and Matt Gardner. "Simple and Effective Multi-Paragraph Reading Comprehension." arXiv preprint arXiv:1710.10723 (2017).

[26] "8. Analyzing Sentence Structure." 2018. NLTK.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html.


BIOGRAPHICAL SKETCH

Purnendu Mukherjee grew up in Kolkata, India, and has had a strong interest in computers since an early age. After high school, he earned his BSc in computer science from Ramakrishna Mission Residential College, Narendrapur, followed by an MSc in computer science from St. Xavier's College, Kolkata. With a strong intuition for and interest in human-like learning systems, he wanted to work in this area, and he began working at TCS Innovation Labs, Pune, on applications of Natural Language Processing in education. Being deeply passionate about systems that mimic the human brain and learn the way a human child does, he became increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science at the University of Florida, Gainesville. His academic interests are focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.

Page 32: QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISMufdcimages.uflib.ufl.edu/UF/E0/05/23/73/00001/MUKHERJEE_P.pdf · QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM By PURNENDU

32

7 Context Query Bi-Attention + Self-Attention The output of the Context self-attention and the Query Self-Attention layers are taken as input and the same process for Bi-Attention and self-attention is applied on these inputs

8 Prediction In the last layer of our model a bidirectional GRU is applied followed by a linear layer that computes answer start scores for each token The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores The softmax operation is applied to the start and end scores to produce start and end probabilities and we optimize the negative log likelihood of selecting correct start and end tokens

Having carried out this modification we were able to solve the wrong example we

started with The multilevel attention model gives the correct output as ldquoSan Jose

Staterdquo Also we achieved slightly better scores than the original model with a F1 score

of 8544 on the SQuAD dev dataset

Chatbot Design Using a QA System

Designing a perfect chatbot that passes the Turing test is a fundamental goal for

Artificial Intelligence Although we are many order to magnitudes away from achieving

such a goal domain specific tasks can be solved with chatbots made from current

technology We had started our goal with a similar objective in mind ie to design a

domain specific chatbot and then generalize to other areas as it is able to achieve the

first domain specific objective robustly This led us to the fundamental problem of

Machine Comprehension and subsequently to the task of Question Answering Having

achieved some degree of success with QA systems we looked back if we could apply

our newly acquired knowledge in the task of designing Chatbots

The chatbots made with todayrsquos technologies are mostly handcrafted techniques

such as template matching that requires anticipating all possible ways a user may

articulate his requirements and a conversation may occur This requires a lot of man

hours for designing a domain specific system and is still very error prone In this section

33

we propose a general Chatbot design that would make the designing of a domain

specific chatbot very easy and robust at the same time

Every domain specific chatbot needs to obtain a set of information from the user

and show some results based on the user specific information obtained The traditional

chatbots use template matching and keywords lookup to determine if the user has

provided the required information Our idea is to use the Question Answering system in

the backend to extract out the required information from whatever the user has typed

until this point of the conversation The information to be extracted can be posed in the

form of a set of questions and the answers obtained from those questions can be used

as the parameters to supply the relevant information to the user

We had chosen our chatbot domain as the flight reservation system Our goal

was to extract the required information from the user to be able to show him the

available flights as per the userrsquos requirements

Figure 4-2 Flight reservation chatbotrsquos chat window

34

For a flight reservation task the booking agent needs to know the origin city

destination city and date of travel at minimum to be able to show the available flights

Optional information includes the number of tickets passengerrsquos name one way or

round trip etc

The minimalistic conversation with the user through the chat window would be as

shown above We had a platform called OneTask on which we wanted to implement our

chat bot The chat interface within the OneTask system looks as follows

Figure 4-3 Chatbot within OneTask system

The working of the chat system is as follows ndash

1 Initiation The user opens the chat window which starts a session with the chat bot The chat bot asks lsquoHow may I help yoursquo

2 User Reply The user may reply with none of the required information for flight booking or may reply with multiple information in the same message

3 User Reply parsing The conversation up to this point is treated as a passage and the internal questions are run on this passage So the four questions that are run are

Where do you want to go

35

From where do you want to leave

When do you want to depart

4 Parsed responses from QA model After running the questions on the QA system with the given conversation answers are obtained for all along with their corresponding confidence values Since answers will be obtained even if the required question was not answered up to this point in the conversation the confidence values plays a crucial role in determining the validity of the answer For this purpose we determined ranges of confidence values for validation A confidence value above 10 signifies the question has been answered correctly A confidence of 2 ndash 10 signifies that it may have been answered but should verify with the user for correctness and any confidence below 2 is discarded

5 Asking remaining questions iteratively After the parsing it is checked if any of the required questions are still unanswered If so the chatbot asks the remaining question and the process from 3 ndash 4 is carried out iteratively Once all the questions have been answered the user is shown with the available flight options as per his request

Figure 4-4 The Flow diagram of the Flight booking Chatbot system

Online QA System and Attention Visualization

To be able to test out various examples we used the BiDAF [23] model for an

online demo One can either choose from the available examples from the drop-down

menu or paste their own passage and examples While this is a useful and interesting

system to test the model in a user-friendly way we created this system to be able to

focus on the wrong samples

36

Figure 4-5 QA system interface with attention highlight over candidate answers

The candidate answers are shown with the blue highlights as per their

confidence values The higher the confidence the darker the answer The highest

confidence value is chosen as the predicted answer We developed the system to show

attention spread of the candidate answers to realize what needs to be done to improve

the system This led us to realize the importance of including the query attention part as

well as multilevel attention on the BiDAF model as described in the first section of this

chapter

37

CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION

Future Directions

Having done a thorough analysis of the current methods and state of the arts in

Machine Comprehension and for the QA task and developing systems on it we have

achieved a strong sense of what needs to be done to further improve the QA models

After observing the wrong samples we can see that the system is still unable to encode

meaning and is picking answers based on statistical patterns of occurrence of answers

on training examples

As per the State of the Art models and analyzing the ongoing research literature

it is easy to conclude that more features need to be embedded to encode the meaning

of words phrases and sentences A paper called Reinforced Mnemonic Reader for

Machine Comprehension encoded POS and NER tags of words along with their word

and character embedding This gave them better results

Figure 5-1 An English language semantic parse tree [26]

38

We have developed a method to encode the syntax parse tree of a sentence that

not only encodes the post tags but the relation of the word within the phrase and the

relation of the phrase within the whole sentence in a hierarchical manner

Finally data augmentation is another solution to get better results One definite

way to reduce the errors would be to include similar samples in the training data which

the system is faltering in the dev set One could generate similar examples as the failure

cases and include them in the training set to have better prediction Another system

would be to train to similar and bigger datasets Our models were trained on the SQuAD

dataset There are other datasets too which does the similar question answering task

such as TriviaQA We could augment the training set of TriviaQA along with SQuAD to

have a more robust system that is able to generalize better and thus have higher

accuracy for predicting answer spans

Conclusion

In this work we have tried to explore the most fundamental techniques that have

shaped the current state of the art Then we proposed a minor improvement of

architecture over an existing model Furthermore we developed two applications that

uses the base model First we talked about how a chatbot application can be made

using the QA system and lastly we also created a web interface where the model can

be used for any Passage and Question This interface also shows the attention spread

on the candidate answers While our effort is ongoing to push the state of the art

forward we strongly believe that surpassing human level accuracy on this task will have

high dividends for the society at large

39

LIST OF REFERENCES

[1] Danqi Chen From Reading Comprehension To Open-Domain Question Answering 2018 Youtube Accessed March 16 2018 httpswwwyoutubecomwatchv=1RN88O9C13U

[2] Lehnert Wendy G The Process of Question Answering No RR-88 Yale Univ New Haven Conn Dept of computer science 1977

[3] Joshi Mandar Eunsol Choi Daniel S Weld and Luke Zettlemoyer Triviaqa A large scale distantly supervised challenge dataset for reading comprehension arXiv preprint arXiv170503551 (2017)

[4] Rajpurkar Pranav Jian Zhang Konstantin Lopyrev and Percy Liang Squad 100000+ questions for machine comprehension of text arXiv preprint arXiv160605250 (2016)

[5] The Stanford Question Answering Dataset 2018 RajpurkarGithubIo Accessed March 16 2018 httpsrajpurkargithubioSQuAD-explorer

[6] Hu Minghao Yuxing Peng and Xipeng Qiu Reinforced mnemonic reader for machine comprehension CoRR abs170502798 (2017)

[7] Schalkoff Robert J Artificial neural networks Vol 1 New York McGraw-Hill 1997

[8] Me For The AI and Neetesh Mehrotra 2018 The Connect Between Deep Learning And AI - Open Source For You Open Source For You Accessed March 16 2018 httpopensourceforucom201801connect-deep-learning-ai

[9] Krizhevsky Alex Ilya Sutskever and Geoffrey E Hinton Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems pp 1097-1105 2012

[10] An Intuitive Explanation Of Convolutional Neural Networks 2016 The Data Science Blog Accessed March 16 2018 httpsujjwalkarnme20160811intuitive-explanation-convnets

[11] Williams Ronald J and David Zipser A learning algorithm for continually running fully recurrent neural networks Neural computation 1 no 2 (1989) 270-280

[12] Hochreiter Sepp and Juumlrgen Schmidhuber Long short-term memory Neural computation 9 no 8 (1997) 1735-1780

[13] Chung Junyoung Caglar Gulcehre KyungHyun Cho and Yoshua Bengio Empirical evaluation of gated recurrent neural networks on sequence modeling arXiv preprint arXiv14123555 (2014)

40

[14] Mikolov Tomas Wen-tau Yih and Geoffrey Zweig Linguistic regularities in continuous space word representations In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics Human Language Technologies pp 746-751 2013

[15] Pennington Jeffrey Richard Socher and Christopher Manning Glove Global vectors for word representation In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) pp 1532-1543 2014

[16] Word2vec Tutorial - The Skip-Gram Model middot Chris Mccormick 2016 MccormickmlCom Accessed March 16 2018 httpmccormickmlcom20160419word2vec-tutorial-the-skip-gram-model

[17] Group 2015 [Paper Introduction] Bilingual Word Representations With Monolingual hellip SlideshareNet Accessed March 16 2018 httpswwwslidesharenetnaist-mtstudybilingual-word-representations-with-monolingual-quality-in-mind

[18] Attention Mechanism 2016 BlogHeuritechCom Accessed March 16 2018 httpsblogheuritechcom20160120attention-mechanism

[19] Weston Jason Sumit Chopra and Antoine Bordes 2014 Memory Networks arXiv preprint arXiv14103916v11 Accessed March 16 2018 httpsarxivorgabs14103916

[20] Wang Shuohang and Jing Jiang Machine comprehension using match-lstm and answer pointer arXiv preprint arXiv160807905 (2016)

[21] Wang Shuohang and Jing Jiang Learning natural language inference with LSTM arXiv preprint arXiv151208849 (2015)

[22] Vinyals Oriol Meire Fortunato and Navdeep Jaitly Pointer networks In Advances in Neural Information Processing Systems pp 2692-2700 2015

[23] Wang Wenhui Nan Yang Furu Wei Baobao Chang and Ming Zhou Gated self-matching networks for reading comprehension and question answering In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1 Long Papers) vol 1 pp 189-198 2017

[24] Seo Minjoon Aniruddha Kembhavi Ali Farhadi and Hannaneh Hajishirzi Bidirectional attention flow for machine comprehension arXiv preprint arXiv161101603 (2016)

[25] Clark Christopher and Matt Gardner Simple and effective multi-paragraph reading comprehension arXiv preprint arXiv171010723 (2017)

41

[26] 8 Analyzing Sentence Structure 2018 NltkOrg Accessed March 16 2018 httpwwwnltkorgbookch08html

42

BIOGRAPHICAL SKETCH

Purnendu Mukherjee grew up in Kolkata India and had a strong interest for

computers since an early age After his high school he did his BSc in computer

science from Ramakrishna Mission Residential College Narendrapur followed by MSc

in computer science from St Xavierrsquos College Kolkata He had a strong intuition and

interest for human like learning systems and wanted to work in this area He started

working at TCS Innovation Labs Pune for the application of Natural Language

Processing in Educational Applications As he was deeply passionate about learning

system that mimic the human brain and learn like a human child does he was

increasing interested about Deep Learning and its applications After working for a year

he went on to pursue a Master of Science degree in computer science from the

University of Florida Gainesville His academic interests have been focused on Deep

Learning and Natural Language Processing and he has been working on Machine

Reading Comprehension since summer of 2017

  • ACKNOWLEDGMENTS
  • LIST OF FIGURES
  • LIST OF ABBREVIATIONS
  • INTRODUCTION
  • THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
    • Neural Networks
    • Convolutional Neural Network
    • Recurrent Neural Networks (RNN)
    • Word Embedding
    • Attention Mechanism
    • Memory Networks
      • LITERATURE REVIEW AND STATE OF THE ART
        • Machine Comprehension Using Match-LSTM and Answer Pointer
        • R-NET Matching Reading Comprehension with Self-Matching Networks
        • Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
        • Summary
          • MULTI-ATTENTION QUESTION ANSWERING
            • Multi-Attention BiDAF Model
            • Chatbot Design Using a QA System
            • Online QA System and Attention Visualization
              • FUTURE DIRECTIONS AND CONCLUSION
                • Future Directions
                • Conclusion
                  • LIST OF REFERENCES
                  • BIOGRAPHICAL SKETCH
Page 33: QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISMufdcimages.uflib.ufl.edu/UF/E0/05/23/73/00001/MUKHERJEE_P.pdf · QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM By PURNENDU

33

we propose a general Chatbot design that would make the designing of a domain

specific chatbot very easy and robust at the same time

Every domain specific chatbot needs to obtain a set of information from the user

and show some results based on the user specific information obtained The traditional

chatbots use template matching and keywords lookup to determine if the user has

provided the required information Our idea is to use the Question Answering system in

the backend to extract out the required information from whatever the user has typed

until this point of the conversation The information to be extracted can be posed in the

form of a set of questions and the answers obtained from those questions can be used

as the parameters to supply the relevant information to the user

We had chosen our chatbot domain as the flight reservation system Our goal

was to extract the required information from the user to be able to show him the

available flights as per the userrsquos requirements

Figure 4-2 Flight reservation chatbotrsquos chat window

34

For a flight reservation task the booking agent needs to know the origin city

destination city and date of travel at minimum to be able to show the available flights

Optional information includes the number of tickets passengerrsquos name one way or

round trip etc

The minimalistic conversation with the user through the chat window would be as

shown above We had a platform called OneTask on which we wanted to implement our

chat bot The chat interface within the OneTask system looks as follows

Figure 4-3 Chatbot within OneTask system

The working of the chat system is as follows ndash

1 Initiation The user opens the chat window which starts a session with the chat bot The chat bot asks lsquoHow may I help yoursquo

2 User Reply The user may reply with none of the required information for flight booking or may reply with multiple information in the same message

3 User Reply parsing The conversation up to this point is treated as a passage and the internal questions are run on this passage So the four questions that are run are

Where do you want to go

35

From where do you want to leave

When do you want to depart

4 Parsed responses from QA model After running the questions on the QA system with the given conversation answers are obtained for all along with their corresponding confidence values Since answers will be obtained even if the required question was not answered up to this point in the conversation the confidence values plays a crucial role in determining the validity of the answer For this purpose we determined ranges of confidence values for validation A confidence value above 10 signifies the question has been answered correctly A confidence of 2 ndash 10 signifies that it may have been answered but should verify with the user for correctness and any confidence below 2 is discarded

5 Asking remaining questions iteratively After the parsing it is checked if any of the required questions are still unanswered If so the chatbot asks the remaining question and the process from 3 ndash 4 is carried out iteratively Once all the questions have been answered the user is shown with the available flight options as per his request

Figure 4-4 The Flow diagram of the Flight booking Chatbot system

Online QA System and Attention Visualization

To be able to test out various examples we used the BiDAF [23] model for an

online demo One can either choose from the available examples from the drop-down

menu or paste their own passage and examples While this is a useful and interesting

system to test the model in a user-friendly way we created this system to be able to

focus on the wrong samples

36

Figure 4-5 QA system interface with attention highlight over candidate answers

The candidate answers are shown with the blue highlights as per their

confidence values The higher the confidence the darker the answer The highest

confidence value is chosen as the predicted answer We developed the system to show

attention spread of the candidate answers to realize what needs to be done to improve

the system This led us to realize the importance of including the query attention part as

well as multilevel attention on the BiDAF model as described in the first section of this

chapter

37

CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION

Future Directions

Having done a thorough analysis of the current methods and state of the arts in

Machine Comprehension and for the QA task and developing systems on it we have

achieved a strong sense of what needs to be done to further improve the QA models

After observing the wrong samples we can see that the system is still unable to encode

meaning and is picking answers based on statistical patterns of occurrence of answers

on training examples

As per the State of the Art models and analyzing the ongoing research literature

it is easy to conclude that more features need to be embedded to encode the meaning

of words phrases and sentences A paper called Reinforced Mnemonic Reader for

Machine Comprehension encoded POS and NER tags of words along with their word

and character embedding This gave them better results

Figure 5-1 An English language semantic parse tree [26]

38

We have developed a method to encode the syntax parse tree of a sentence that

not only encodes the post tags but the relation of the word within the phrase and the

relation of the phrase within the whole sentence in a hierarchical manner

Finally data augmentation is another solution to get better results One definite

way to reduce the errors would be to include similar samples in the training data which

the system is faltering in the dev set One could generate similar examples as the failure

cases and include them in the training set to have better prediction Another system

would be to train to similar and bigger datasets Our models were trained on the SQuAD

dataset There are other datasets too which does the similar question answering task

such as TriviaQA We could augment the training set of TriviaQA along with SQuAD to

have a more robust system that is able to generalize better and thus have higher

accuracy for predicting answer spans

Conclusion

In this work we have tried to explore the most fundamental techniques that have

shaped the current state of the art Then we proposed a minor improvement of

architecture over an existing model Furthermore we developed two applications that

uses the base model First we talked about how a chatbot application can be made

using the QA system and lastly we also created a web interface where the model can

be used for any Passage and Question This interface also shows the attention spread

on the candidate answers While our effort is ongoing to push the state of the art

forward we strongly believe that surpassing human level accuracy on this task will have

high dividends for the society at large

39

LIST OF REFERENCES

[1] Danqi Chen From Reading Comprehension To Open-Domain Question Answering 2018 Youtube Accessed March 16 2018 httpswwwyoutubecomwatchv=1RN88O9C13U

[2] Lehnert Wendy G The Process of Question Answering No RR-88 Yale Univ New Haven Conn Dept of computer science 1977

[3] Joshi Mandar Eunsol Choi Daniel S Weld and Luke Zettlemoyer Triviaqa A large scale distantly supervised challenge dataset for reading comprehension arXiv preprint arXiv170503551 (2017)

[4] Rajpurkar Pranav Jian Zhang Konstantin Lopyrev and Percy Liang Squad 100000+ questions for machine comprehension of text arXiv preprint arXiv160605250 (2016)

[5] The Stanford Question Answering Dataset 2018 RajpurkarGithubIo Accessed March 16 2018 httpsrajpurkargithubioSQuAD-explorer

[6] Hu Minghao Yuxing Peng and Xipeng Qiu Reinforced mnemonic reader for machine comprehension CoRR abs170502798 (2017)

[7] Schalkoff Robert J Artificial neural networks Vol 1 New York McGraw-Hill 1997

[8] Me For The AI and Neetesh Mehrotra 2018 The Connect Between Deep Learning And AI - Open Source For You Open Source For You Accessed March 16 2018 httpopensourceforucom201801connect-deep-learning-ai

[9] Krizhevsky Alex Ilya Sutskever and Geoffrey E Hinton Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems pp 1097-1105 2012

[10] An Intuitive Explanation Of Convolutional Neural Networks 2016 The Data Science Blog Accessed March 16 2018 httpsujjwalkarnme20160811intuitive-explanation-convnets

[11] Williams Ronald J and David Zipser A learning algorithm for continually running fully recurrent neural networks Neural computation 1 no 2 (1989) 270-280

[12] Hochreiter Sepp and Juumlrgen Schmidhuber Long short-term memory Neural computation 9 no 8 (1997) 1735-1780

[13] Chung Junyoung Caglar Gulcehre KyungHyun Cho and Yoshua Bengio Empirical evaluation of gated recurrent neural networks on sequence modeling arXiv preprint arXiv14123555 (2014)

40

[14] Mikolov Tomas Wen-tau Yih and Geoffrey Zweig Linguistic regularities in continuous space word representations In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics Human Language Technologies pp 746-751 2013

[15] Pennington Jeffrey Richard Socher and Christopher Manning Glove Global vectors for word representation In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) pp 1532-1543 2014

[16] Word2vec Tutorial - The Skip-Gram Model middot Chris Mccormick 2016 MccormickmlCom Accessed March 16 2018 httpmccormickmlcom20160419word2vec-tutorial-the-skip-gram-model

[17] Group 2015 [Paper Introduction] Bilingual Word Representations With Monolingual hellip SlideshareNet Accessed March 16 2018 httpswwwslidesharenetnaist-mtstudybilingual-word-representations-with-monolingual-quality-in-mind

[18] Attention Mechanism 2016 BlogHeuritechCom Accessed March 16 2018 httpsblogheuritechcom20160120attention-mechanism

[19] Weston Jason Sumit Chopra and Antoine Bordes 2014 Memory Networks arXiv preprint arXiv14103916v11 Accessed March 16 2018 httpsarxivorgabs14103916

[20] Wang Shuohang and Jing Jiang Machine comprehension using match-lstm and answer pointer arXiv preprint arXiv160807905 (2016)

[21] Wang Shuohang and Jing Jiang Learning natural language inference with LSTM arXiv preprint arXiv151208849 (2015)

[22] Vinyals Oriol Meire Fortunato and Navdeep Jaitly Pointer networks In Advances in Neural Information Processing Systems pp 2692-2700 2015

[23] Wang Wenhui Nan Yang Furu Wei Baobao Chang and Ming Zhou Gated self-matching networks for reading comprehension and question answering In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1 Long Papers) vol 1 pp 189-198 2017

[24] Seo Minjoon Aniruddha Kembhavi Ali Farhadi and Hannaneh Hajishirzi Bidirectional attention flow for machine comprehension arXiv preprint arXiv161101603 (2016)

[25] Clark Christopher and Matt Gardner Simple and effective multi-paragraph reading comprehension arXiv preprint arXiv171010723 (2017)

41

[26] 8 Analyzing Sentence Structure 2018 NltkOrg Accessed March 16 2018 httpwwwnltkorgbookch08html

42

BIOGRAPHICAL SKETCH

Purnendu Mukherjee grew up in Kolkata India and had a strong interest for

computers since an early age After his high school he did his BSc in computer

science from Ramakrishna Mission Residential College Narendrapur followed by MSc

in computer science from St Xavierrsquos College Kolkata He had a strong intuition and

interest for human like learning systems and wanted to work in this area He started

working at TCS Innovation Labs Pune for the application of Natural Language

Processing in Educational Applications As he was deeply passionate about learning

system that mimic the human brain and learn like a human child does he was

increasing interested about Deep Learning and its applications After working for a year

he went on to pursue a Master of Science degree in computer science from the

University of Florida Gainesville His academic interests have been focused on Deep

Learning and Natural Language Processing and he has been working on Machine

Reading Comprehension since summer of 2017

  • ACKNOWLEDGMENTS
  • LIST OF FIGURES
  • LIST OF ABBREVIATIONS
  • INTRODUCTION
  • THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
    • Neural Networks
    • Convolutional Neural Network
    • Recurrent Neural Networks (RNN)
    • Word Embedding
    • Attention Mechanism
    • Memory Networks
      • LITERATURE REVIEW AND STATE OF THE ART
        • Machine Comprehension Using Match-LSTM and Answer Pointer
        • R-NET Matching Reading Comprehension with Self-Matching Networks
        • Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
        • Summary
          • MULTI-ATTENTION QUESTION ANSWERING
            • Multi-Attention BiDAF Model
            • Chatbot Design Using a QA System
            • Online QA System and Attention Visualization
              • FUTURE DIRECTIONS AND CONCLUSION
                • Future Directions
                • Conclusion
                  • LIST OF REFERENCES
                  • BIOGRAPHICAL SKETCH
Page 34: QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISMufdcimages.uflib.ufl.edu/UF/E0/05/23/73/00001/MUKHERJEE_P.pdf · QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM By PURNENDU

34

For a flight reservation task the booking agent needs to know the origin city

destination city and date of travel at minimum to be able to show the available flights

Optional information includes the number of tickets passengerrsquos name one way or

round trip etc

The minimalistic conversation with the user through the chat window would be as

shown above We had a platform called OneTask on which we wanted to implement our

chat bot The chat interface within the OneTask system looks as follows

Figure 4-3 Chatbot within OneTask system

The working of the chat system is as follows ndash

1 Initiation The user opens the chat window which starts a session with the chat bot The chat bot asks lsquoHow may I help yoursquo

2 User Reply The user may reply with none of the required information for flight booking or may reply with multiple information in the same message

3 User Reply parsing The conversation up to this point is treated as a passage and the internal questions are run on this passage So the four questions that are run are

Where do you want to go

35

From where do you want to leave

When do you want to depart

4 Parsed responses from QA model After running the questions on the QA system with the given conversation answers are obtained for all along with their corresponding confidence values Since answers will be obtained even if the required question was not answered up to this point in the conversation the confidence values plays a crucial role in determining the validity of the answer For this purpose we determined ranges of confidence values for validation A confidence value above 10 signifies the question has been answered correctly A confidence of 2 ndash 10 signifies that it may have been answered but should verify with the user for correctness and any confidence below 2 is discarded

5 Asking remaining questions iteratively After the parsing it is checked if any of the required questions are still unanswered If so the chatbot asks the remaining question and the process from 3 ndash 4 is carried out iteratively Once all the questions have been answered the user is shown with the available flight options as per his request

Figure 4-4 The Flow diagram of the Flight booking Chatbot system

Online QA System and Attention Visualization

To be able to test out various examples we used the BiDAF [23] model for an

online demo One can either choose from the available examples from the drop-down

menu or paste their own passage and examples While this is a useful and interesting

system to test the model in a user-friendly way we created this system to be able to

focus on the wrong samples

36

Figure 4-5 QA system interface with attention highlight over candidate answers

The candidate answers are shown with the blue highlights as per their

confidence values The higher the confidence the darker the answer The highest

confidence value is chosen as the predicted answer We developed the system to show

attention spread of the candidate answers to realize what needs to be done to improve

the system This led us to realize the importance of including the query attention part as

well as multilevel attention on the BiDAF model as described in the first section of this

chapter

37

CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION

Future Directions

Having done a thorough analysis of the current methods and state of the arts in

Machine Comprehension and for the QA task and developing systems on it we have

achieved a strong sense of what needs to be done to further improve the QA models

After observing the wrong samples we can see that the system is still unable to encode

meaning and is picking answers based on statistical patterns of occurrence of answers

on training examples

As per the State of the Art models and analyzing the ongoing research literature

it is easy to conclude that more features need to be embedded to encode the meaning

of words phrases and sentences A paper called Reinforced Mnemonic Reader for

Machine Comprehension encoded POS and NER tags of words along with their word

and character embedding This gave them better results

Figure 5-1 An English language semantic parse tree [26]

38

We have developed a method to encode the syntax parse tree of a sentence that

not only encodes the post tags but the relation of the word within the phrase and the

relation of the phrase within the whole sentence in a hierarchical manner

Finally data augmentation is another solution to get better results One definite

way to reduce the errors would be to include similar samples in the training data which

the system is faltering in the dev set One could generate similar examples as the failure

cases and include them in the training set to have better prediction Another system

would be to train to similar and bigger datasets Our models were trained on the SQuAD

dataset There are other datasets too which does the similar question answering task

such as TriviaQA We could augment the training set of TriviaQA along with SQuAD to

have a more robust system that is able to generalize better and thus have higher

accuracy for predicting answer spans

Conclusion

In this work we have tried to explore the most fundamental techniques that have

shaped the current state of the art Then we proposed a minor improvement of

architecture over an existing model Furthermore we developed two applications that

uses the base model First we talked about how a chatbot application can be made

using the QA system and lastly we also created a web interface where the model can

be used for any Passage and Question This interface also shows the attention spread

on the candidate answers While our effort is ongoing to push the state of the art

forward we strongly believe that surpassing human level accuracy on this task will have

high dividends for the society at large

39

LIST OF REFERENCES

[1] Danqi Chen From Reading Comprehension To Open-Domain Question Answering 2018 Youtube Accessed March 16 2018 httpswwwyoutubecomwatchv=1RN88O9C13U

[2] Lehnert Wendy G The Process of Question Answering No RR-88 Yale Univ New Haven Conn Dept of computer science 1977

[3] Joshi Mandar Eunsol Choi Daniel S Weld and Luke Zettlemoyer Triviaqa A large scale distantly supervised challenge dataset for reading comprehension arXiv preprint arXiv170503551 (2017)

[4] Rajpurkar Pranav Jian Zhang Konstantin Lopyrev and Percy Liang Squad 100000+ questions for machine comprehension of text arXiv preprint arXiv160605250 (2016)

[5] The Stanford Question Answering Dataset 2018 RajpurkarGithubIo Accessed March 16 2018 httpsrajpurkargithubioSQuAD-explorer

[6] Hu Minghao Yuxing Peng and Xipeng Qiu Reinforced mnemonic reader for machine comprehension CoRR abs170502798 (2017)

[7] Schalkoff Robert J Artificial neural networks Vol 1 New York McGraw-Hill 1997

[8] Me For The AI and Neetesh Mehrotra 2018 The Connect Between Deep Learning And AI - Open Source For You Open Source For You Accessed March 16 2018 httpopensourceforucom201801connect-deep-learning-ai

[9] Krizhevsky Alex Ilya Sutskever and Geoffrey E Hinton Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems pp 1097-1105 2012

[10] An Intuitive Explanation Of Convolutional Neural Networks 2016 The Data Science Blog Accessed March 16 2018 httpsujjwalkarnme20160811intuitive-explanation-convnets

[11] Williams Ronald J and David Zipser A learning algorithm for continually running fully recurrent neural networks Neural computation 1 no 2 (1989) 270-280

[12] Hochreiter Sepp and Juumlrgen Schmidhuber Long short-term memory Neural computation 9 no 8 (1997) 1735-1780

[13] Chung Junyoung Caglar Gulcehre KyungHyun Cho and Yoshua Bengio Empirical evaluation of gated recurrent neural networks on sequence modeling arXiv preprint arXiv14123555 (2014)


[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic Regularities in Continuous Space Word Representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.
[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global Vectors for Word Representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.
[16] "Word2Vec Tutorial - The Skip-Gram Model · Chris McCormick." 2016. mccormickml.com. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/.
[17] Group. 2015. "[Paper Introduction] Bilingual Word Representations with Monolingual …" slideshare.net. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind.
[18] "Attention Mechanism." 2016. blog.heuritech.com. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism/.
[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. 2014. "Memory Networks." arXiv preprint arXiv:1410.3916v11. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916.
[20] Wang, Shuohang, and Jing Jiang. "Machine Comprehension Using Match-LSTM and Answer Pointer." arXiv preprint arXiv:1608.07905 (2016).
[21] Wang, Shuohang, and Jing Jiang. "Learning Natural Language Inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).
[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer Networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.
[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated Self-Matching Networks for Reading Comprehension and Question Answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.
[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional Attention Flow for Machine Comprehension." arXiv preprint arXiv:1611.01603 (2016).
[25] Clark, Christopher, and Matt Gardner. "Simple and Effective Multi-Paragraph Reading Comprehension." arXiv preprint arXiv:1710.10723 (2017).


[26] "8. Analyzing Sentence Structure." 2018. nltk.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html.


BIOGRAPHICAL SKETCH

Purnendu Mukherjee grew up in Kolkata, India, and has had a strong interest in computers since an early age. After high school, he completed a B.Sc. in computer science at Ramakrishna Mission Residential College, Narendrapur, followed by an M.Sc. in computer science at St. Xavier's College, Kolkata. With a strong intuition for and interest in human-like learning systems, he wanted to work in this area and started working at TCS Innovation Labs, Pune, on applications of Natural Language Processing in education. Deeply passionate about learning systems that mimic the human brain and learn the way a human child does, he became increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science at the University of Florida, Gainesville. His academic interests have been focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.
