QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM
By
PURNENDU MUKHERJEE
A THESIS PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
UNIVERSITY OF FLORIDA
2018
© 2018 Purnendu Mukherjee
To my family, friends, and teachers
ACKNOWLEDGMENTS
I would like to thank my thesis advisor, Professor Andy Li, who has been a constant source of inspiration and encouragement for me to pursue my ideas. I am also thankful to him for providing all the necessary resources for my research efforts.

I am deeply thankful to Professor Kristy Boyer for her guidance and mentorship, as she motivated me to pursue my research interests in Natural Language Processing and Deep Learning.

I would like to express my heartfelt gratitude to Professor Jose Principe, who taught me the very fundamentals of learning systems and formally introduced me to Deep Learning.

I wish to extend my thanks to all the members of the CBL Lab, and especially to the NLP group, for their insightful remarks and overall support.

I am grateful to my close friend, roommate, and lab partner Yash Sinha, who supported and helped me throughout all aspects of my thesis work.

Finally, I must thank my parents for their unwavering support and dedication to my well-being and growth.
TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF FIGURES

LIST OF ABBREVIATIONS

ABSTRACT

CHAPTER

1 INTRODUCTION

2 THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
  Neural Networks
  Convolutional Neural Network
  Recurrent Neural Networks (RNN)
  Word Embedding
  Attention Mechanism
  Memory Networks

3 LITERATURE REVIEW AND STATE OF THE ART
  Machine Comprehension Using Match-LSTM and Answer Pointer
  R-NET: Machine Reading Comprehension with Self-Matching Networks
  Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
  Summary

4 MULTI-ATTENTION QUESTION ANSWERING
  Multi-Attention BiDAF Model
  Chatbot Design Using a QA System
  Online QA System and Attention Visualization

5 FUTURE DIRECTIONS AND CONCLUSION
  Future Directions
  Conclusion

LIST OF REFERENCES

BIOGRAPHICAL SKETCH
LIST OF FIGURES
1-1 The task of Question Answering
2-1 Simple and Deep Learning Neural Networks
2-2 Convolutional Neural Network Architecture
2-3 Semantic relation between words in vector space
2-4 Attention Mechanism flow
2-5 QA example on Lord of the Rings using Memory Networks
3-1 Match-LSTM Model Architecture
3-2 R-NET model architecture
3-3 Bi-Directional Attention Flow Model Architecture
4-1 The modified BiDAF model with multilevel attention
4-2 Flight reservation chatbot's chat window
4-3 Chatbot within OneTask system
4-4 Flow diagram of the flight booking chatbot system
4-5 QA system interface with attention highlight over candidate answers
5-1 An English language semantic parse tree
LIST OF ABBREVIATIONS
BiDAF Bi-Directional Attention Flow
CNN Convolutional Neural Network
GRU Gated Recurrent Unit
LSTM Long Short-Term Memory
NLP Natural Language Processing
NLU Natural Language Understanding
RNN Recurrent Neural Network
Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Science
QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM
By
Purnendu Mukherjee
May 2018
Chair: Xiaolin Li
Major: Computer Science
Question Answering (QA) systems have grown rapidly over the last three years and are close to reaching human-level accuracy. One of the fundamental reasons for this growth has been the use of the attention mechanism along with other Deep Learning methods. But just as with other Deep Learning methods, some of the failure cases are so obvious that they convince us there is a lot to improve upon. In this work we first present a literature review of the state-of-the-art and fundamental models in QA systems. Next, we introduce an architecture which has shown improvement in the targeted area. We then introduce a general method to enable easy design of domain-specific chatbot applications and present a proof of concept for the method. Finally, we present an easy-to-use Question Answering interface with attention visualization over the passage. We also propose a method to improve the current state of the art as part of our ongoing work.
CHAPTER 1 INTRODUCTION
Teaching machines to read and understand human natural language is a central and long-standing goal of Natural Language Processing and of Artificial Intelligence in general. The positive impact of machines being able to reason about and comprehend human language could be enormous. We are beginning to see how commercial systems like Alexa, Siri, Google Now, etc. are being widely used as Speech Recognition has improved to human levels. While speech recognition systems can transcribe speech to text, comprehension of that transcribed text is another task which is currently a major focus for both academia and industry because of its possible applications. Moreover, the vast amount of text available throughout the internet is a major reason why machine comprehension of text is such an important task.
With the growth of Deep Learning methods in the last few years, the field of Machine Comprehension, and Natural Language Processing (NLP) in general, has experienced a revolution. While traditional methods and practices are still prevalent and form the basis of our deep understanding of languages, Deep Learning methods have surpassed all traditional NLP and Machine Learning methods by a significant margin and are currently driving the growth of the field.
To be able to build a system that can understand human text, we first need to ask ourselves how we can evaluate a machine's comprehension ability. We had initially set a goal to build a chatbot for a specific domain and then generalize to other topics. While developing the system, we discovered the central role of reading comprehension and the question of how to measure it. We finally found the answer in Question Answering systems.
Just as we human beings are tested for our ability of language understanding with questions, we should ask machines similar questions about what they have just read. The performance of the system on such a question answering task lets us evaluate how well the machine is able to reason about what it just read [1]. Reading comprehension has been a topic of Natural Language Understanding since the 1970s. In 1977, Wendy Lehnert said in her doctoral thesis: "Only when we can ask a program to answer questions about what it reads will we be able to begin to access that program's comprehension" [2].
Figure 1-1 The task of Question Answering
To achieve this task, the NLP community has developed various datasets such as CNN/Daily Mail, WebQuestions, SQuAD, TriviaQA [3], etc. For our purpose we chose SQuAD, which stands for Stanford Question Answering Dataset [4]. SQuAD consists of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. With 100,000+ question-answer pairs on 500+ articles, SQuAD is significantly larger than previous reading comprehension datasets [5].

An example from the SQuAD dataset is as follows:
Passage: Tesla later approached Morgan to ask for more funds to build a more powerful transmitter. When asked where all the money had gone, Tesla responded by saying that he was affected by the Panic of 1901, which he (Morgan) had caused. Morgan was shocked by the reminder of his part in the stock market crash and by Tesla's breach of contract by asking for more funds. Tesla wrote another plea to Morgan, but it was also fruitless. Morgan still owed Tesla money on the original agreement, and Tesla had been facing foreclosure even before construction of the tower began.

Question: On what did Tesla blame for the loss of the initial money?

Answer: Panic of 1901
As we started exploring the QA task, we faced several challenges. Some of them we could solve with the help of other research, and some of the challenges still exist in the domain:

• Out-of-vocabulary words.
• Multi-sentence reasoning may be required.
• There may exist several candidate answers.
• Optimizing the Exact Match (EM) metric may fail when the answer boundary is fuzzy or too long, such as the answer to a "why" query [6].
• "One-hop" prediction may fail to fully understand the query [6].
• Models may fail to fully capture long-distance contextual interaction between parts of the context when using only an LSTM/GRU.
• Current models are unable to capture the semantics of the passage.
In the upcoming chapters, we will first briefly review the basics necessary for understanding the models, then delve deep into the fundamental models that have shaped the current state-of-the-art models, then discuss our contribution in terms of architecture and applications, and finally conclude with future directions.
CHAPTER 2 THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
To build a question answering system, one needs to be familiar with fundamental deep learning models such as Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM), etc. In this chapter we give an overview of these techniques and see how they all connect to building a question answering system.
Neural Networks
What makes Deep Learning so intriguing is that it closely resembles the working of the mammalian brain, or at least draws inspiration from it. The same can be said for Artificial Neural Networks [7], which consist of a system of interconnected units called 'neurons' that take input from similar units and produce a single output.
Figure 2-1 Simple and Deep Learning Neural Networks [8]
The connection from one neuron to another is weighted based on the input data, which enables the network to tune itself to produce a certain output based on the input. This is the learning process, which is achieved through backpropagation, a procedure for propagating the error from the output layer back to the previous layers.
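As a small illustration of these ideas (the layer sizes, sigmoid activation, learning rate, and toy data below are arbitrary choices for the sketch, not taken from the thesis), a two-layer network can be trained on a single example with plain numpy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer network: 3 inputs -> 4 hidden units -> 1 output
W1 = rng.normal(scale=0.5, size=(3, 4))
W2 = rng.normal(scale=0.5, size=(4, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([[0.5, -0.2, 0.8]])   # one input example
y = np.array([[1.0]])              # its target output

for step in range(500):
    # Forward pass: each unit takes weighted input from the previous layer
    h = sigmoid(x @ W1)
    out = sigmoid(h @ W2)
    # Backward pass: propagate the output error back through the layers
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ d_out
    W1 -= 0.5 * x.T @ d_h
```

After a few hundred updates the output approaches the target, which is exactly the "tune the weighted connections to produce a certain output" process described above.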
Convolutional Neural Network
The first wave of deep learning's success was brought by Convolutional Neural Networks (CNNs) [9], the technique used by the winning team of the ImageNet competition in 2012. CNNs are deep artificial neural networks (ANNs) that can be used to classify images, cluster them by similarity, and perform object recognition within scenes. They can be used to detect and identify faces, people, signs, or any other visual data.
Figure 2-2 Convolutional Neural Network Architecture [10]
There are primarily four operations in a standard CNN model (as shown in the figure above):

1. Convolution: The primary purpose of convolution is to extract features from the input image. The spatial relationships between pixels, i.e., the image features, are preserved and learned by the convolution using small squares of input data.

2. Non-Linearity (ReLU): The Rectified Linear Unit (ReLU) is a non-linear operation applied element-wise to each value, replacing the negative values in the feature map with zero.

3. Pooling or Sub-Sampling: Spatial pooling reduces the dimensionality of each feature map but retains the most important information. For max pooling, the largest value in each square window is kept and the rest are dropped. Other types of pooling are average, sum, etc.

4. Classification (Fully Connected Layer): The fully connected layer is a traditional Multi-Layer Perceptron, as described before, that uses a softmax activation function in the output layer. The high-level features of the image are encoded by the convolutional and pooling layers and then fed to the fully connected layer, which uses these features to classify the input image into various classes based on the training dataset [10].
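Operations 2 and 3 above are simple enough to sketch directly in numpy (the 4×4 "feature map" below is made up for illustration):

```python
import numpy as np

# A toy 4x4 feature map, standing in for the output of a convolution
fmap = np.array([[ 1., -2.,  0.,  3.],
                 [-1.,  5., -3.,  2.],
                 [ 0., -1.,  4., -2.],
                 [ 2.,  1., -1.,  0.]])

# Operation 2 (ReLU): replace negative values with zero, element-wise
relu = np.maximum(fmap, 0.0)

# Operation 3 (max pooling): keep the largest value in each 2x2 window
pooled = relu.reshape(2, 2, 2, 2).max(axis=(1, 3))
```

The reshape trick splits the map into 2×2 blocks, and taking the max over each block halves the spatial resolution while retaining the strongest activations.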
When a new image is fed into the CNN model, all the above-mentioned steps are carried out (forward propagation) and a probability distribution over the set of output classes is obtained. With a large enough training dataset, the network will learn and generalize well enough to classify new images into their correct classes.
Recurrent Neural Networks (RNN)
Whenever we want to predict or encode sequential data, the RNN [11] is our go-to method. RNNs perform the same task for every element of a sequence, where the output for each element depends on the previous computations, hence the recurrence. In practice, RNNs are unable to retain long-term dependencies and can look back only a few steps because of the vanishing gradient problem:
h_t = tanh(W_hh h_(t-1) + W_xh x_t)    (2-1)
A solution to the dependency problem is to use gated cells such as the LSTM [11] or the GRU [13]. These cells pass important information on to the next cells while ignoring non-important information. The gated units in a GRU block are:

• Update Gate – computed from the current input and the previous hidden state:

z_t = σ(W_z x_t + U_z h_(t-1))    (2-2)

• Reset Gate – calculated similarly, but with different weights:

r_t = σ(W_r x_t + U_r h_(t-1))    (2-3)

• New memory content:

h̃_t = tanh(W x_t + r_t ∘ U h_(t-1))    (2-4)

If the reset gate is ≈0, the previous memory is ignored and only the new information is kept.

The final memory at the current time step combines the previous and current time steps:

h_t = z_t ∘ h_(t-1) + (1 − z_t) ∘ h̃_t    (2-5)
While the GRU is computationally efficient, the LSTM is the more general case, with three gates as follows:

• Input Gate – what new information to add to the current cell state.
• Forget Gate – how much information from previous states to keep.
• Output Gate – how much information should be sent to the next states.

Just like in the GRU, the current cell state is a sum of the previous cell state, weighted by the forget gate, and the new value, weighted by the input gate. Based on the cell state, the output gate regulates the final output.
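A single GRU step following equations 2-2 to 2-5 can be sketched in a few lines of numpy; the weight matrices here are random placeholders rather than learned parameters, and the dimensions are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    z = sigmoid(Wz @ x_t + Uz @ h_prev)             # update gate (2-2)
    r = sigmoid(Wr @ x_t + Ur @ h_prev)             # reset gate (2-3)
    h_tilde = np.tanh(W @ x_t + r * (U @ h_prev))   # new memory content (2-4)
    return z * h_prev + (1 - z) * h_tilde           # final memory (2-5)

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
# Alternate input-to-hidden (d_h x d_in) and hidden-to-hidden (d_h x d_h) weights
params = [rng.normal(scale=0.1, size=(d_h, d_in)) if i % 2 == 0
          else rng.normal(scale=0.1, size=(d_h, d_h)) for i in range(6)]

h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):   # run over a length-5 input sequence
    h = gru_step(x, h, *params)
```

Because the final memory is a convex combination of the previous state and a tanh-bounded candidate, the hidden state stays bounded no matter how long the sequence runs.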
Word Embedding
Computation and gradients can be applied to numbers, not to words or letters. So we first need to convert words into corresponding numerical representations before feeding them into a deep learning model. In general there are two types of word embedding: frequency based (which includes count vectors, tf-idf, and co-occurrence vectors) and prediction based. With frequency-based embeddings the order of the words is not preserved, and the model works as a bag of words. With prediction-based models, the order or locality of words is taken into consideration to generate the numerical representation of a word. Within the prediction-based category there are two fundamental techniques, Continuous Bag of Words (CBOW) and the Skip-Gram model, which form the basis for word2vec [14] and GloVe [15].
The basic intuition behind word2vec is that if two different words have very similar "contexts" (that is, the words that are likely to appear around them), then the model will produce similar vectors for those words. Conversely, if two word vectors are similar, then the network will produce similar context predictions for those two words. For example, synonyms like "intelligent" and "smart" would have very similar contexts, and related words like "engine" and "transmission" would probably have similar contexts as well [16]. Plotting the word vectors learned by word2vec over a large corpus, we can find some very interesting relationships between words.
Figure 2-3 Semantic relation between words in vector space [17]
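The "similar contexts give similar vectors" intuition is usually measured with cosine similarity. The three-dimensional vectors below are hand-made toys for illustration, not actual word2vec output:

```python
import numpy as np

# Hand-made toy "word vectors" -- real word2vec/GloVe vectors are learned
vectors = {
    "intelligent": np.array([0.90, 0.80, 0.10]),
    "smart":       np.array([0.85, 0.75, 0.15]),
    "banana":      np.array([0.10, 0.05, 0.90]),
}

def cosine(a, b):
    # Similar contexts -> similar vectors -> cosine close to 1
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_syn = cosine(vectors["intelligent"], vectors["smart"])
sim_unrel = cosine(vectors["intelligent"], vectors["banana"])
```

The synonym pair scores far higher than the unrelated pair, which is the property the semantic plots in Figure 2-3 visualize at scale.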
Attention Mechanism
We as humans pay attention to things that are important or relevant in a context. For example, when asked a question about a passage, we try to find the part of the passage most relevant to the question and then reason from our understanding of that part. The same idea applies to the attention mechanism in Deep Learning: it is used to identify the specific parts of a given context to which the current question is relevant.
Formally put, the technique takes n arguments y_1, ..., y_n (in our case the passage word representations) and a question representation q. It returns a vector z which is the "summary" of the y_i, focusing on information linked to the question q. More formally, it returns a weighted arithmetic mean of the y_i, where the weights are chosen according to the relevance of each y_i given the context c [18].
Figure 2-4 Attention Mechanism flow [18]
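A minimal sketch of this weighted-mean attention, assuming simple dot-product relevance scores (the passage and question vectors below are random placeholders):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def attend(ys, q):
    """Return z, the weighted mean ("summary") of the y_i,
    with weights given by each y_i's relevance to the question q."""
    scores = ys @ q                  # dot-product relevance of each y_i to q
    weights = softmax(scores)        # normalize to a distribution
    return weights @ ys, weights     # z = sum_i w_i * y_i

rng = np.random.default_rng(0)
ys = rng.normal(size=(6, 4))                      # six "passage word" vectors
ys /= np.linalg.norm(ys, axis=1, keepdims=True)   # unit vectors for clarity
q = ys[2]                                          # a question "about" word y_3

z, w = attend(ys, q)
# w puts most weight on y_3, so z summarizes the passage around it
```

The returned z is exactly the weighted arithmetic mean described above, with the softmax supplying the relevance weights.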
Memory Networks
Convolutional Neural Networks and Recurrent Neural Networks do capture how we form our visual and sequential memories, but their memory (encoded by hidden states and weights) is typically too small and not compartmentalized enough to accurately remember facts from the past (knowledge is compressed into dense vectors) [19].
Deep Learning needed a methodology that preserves memories as they are, so that they are not lost in generalization, and so that recalling exact words or sequences of events remains possible, something computers are already good at. This effort led to Memory Networks [19], published at ICLR 2015 by Facebook AI Research.
The paper provides a basic framework to store, augment, and retrieve memories while seamlessly working with a Recurrent Neural Network architecture. The memory network consists of a memory m (an array of objects indexed by m_i) and four (potentially learned) components, I, G, O, and R, as follows:
I (input feature map) converts the incoming input to the internal feature representation, either a sparse or dense feature vector like those from word2vec or GloVe.

G (generalization) updates old memories given the new input. The authors call this generalization because there is an opportunity for the network to compress and generalize its memories at this stage for some intended future use; this is the compression discussed above.

O (output feature map) produces a new output (in the feature representation space) given the new input and the current memory state. This component is responsible for performing inference. In a question answering system, this part selects the candidate sentences (which might contain the answer) from the story (conversation) so far.

R (response) converts the output into the desired response format, for example a textual response or an action. In the QA system described, this component finds the desired answer and then converts it from the feature representation to the actual word.
This model is fully supervised, meaning all the candidate sentences from which the answer can be found are marked during the training phase; this can also be termed 'hard attention'.
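The I, G, O, R pipeline can be sketched schematically. The components below are trivial word-overlap stand-ins, not the learned components of the paper; the sentences and query are made up:

```python
class ToyMemoryNetwork:
    """Schematic I/G/O/R pipeline in the spirit of Memory Networks [19].
    Components here are trivial stand-ins, not learned models."""

    def __init__(self):
        self.memory = []                     # m: an array of stored objects

    def I(self, text):
        return text.lower().split()          # input -> internal feature repr.

    def G(self, features):
        self.memory.append(features)         # store as-is (no compression)

    def O(self, query_features):
        # Inference: pick the stored memory sharing the most words with the query
        return max(self.memory,
                   key=lambda m: len(set(m) & set(query_features)))

    def R(self, support, query_features):
        # Response: return a word in the supporting memory not in the query
        leftovers = [w for w in support if w not in query_features]
        return leftovers[-1] if leftovers else ""

net = ToyMemoryNetwork()
for sentence in ["Frodo took the ring", "Sam went to the shire"]:
    net.G(net.I(sentence))

q = net.I("who took the ring")
answer = net.R(net.O(q), q)   # O selects the supporting sentence, R extracts
```

Here O plays the role of hard attention: it selects a single supporting memory outright instead of computing a soft weighting over all of them.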
The authors tested the QA system on various literature, including Lord of the Rings.
Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]
CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART
There has been rapid progress since the release of the SQuAD dataset, and currently the best ensemble models are close to human-level accuracy in machine comprehension. This is due to various ingenious methods which solve some of the problems of previous approaches: out-of-vocabulary tokens were handled using character embeddings, long-term dependencies within the context passage were addressed using self-attention, and many other techniques were introduced, such as contextualized vectors, history of words, and attention flow. In this section we will look at some of the most important models that were fundamental to the progress of Question Answering.
Machine Comprehension Using Match-LSTM and Answer Pointer
In this paper [20] the authors propose an end-to-end neural architecture for the QA task. The architecture is based on match-LSTM [21], a model they previously proposed for textual entailment, and Pointer Net [22], a sequence-to-sequence model proposed by Vinyals et al. (2015) that constrains the output tokens to come from the input sequence. The model consists of an LSTM preprocessing layer, a match-LSTM layer, and an Answer Pointer layer.
We are given a piece of text, which we refer to as a passage, and a question related to the passage. The passage is represented by a matrix P of size P × d, where P is the length (number of tokens) of the passage and d is the dimensionality of the word embeddings. Similarly, the question is represented by a matrix Q of size Q × d, where Q is the length of the question. The goal is to identify a subsequence of the passage as the answer to the question.
Figure 3-1 Match-LSTM Model Architecture [20]
LSTM Preprocessing Layer: They use a standard one-directional LSTM (Hochreiter & Schmidhuber, 1997) to process the passage and the question separately.

Match-LSTM Layer: They apply the match-LSTM model, originally proposed for textual entailment, to the machine comprehension problem by treating the question as a premise and the passage as a hypothesis. The match-LSTM sequentially goes through the passage. At position i of the passage, it first uses the standard word-by-word attention mechanism to obtain the attention weight vector as follows:
α_i = softmax(w^T tanh(W^q H^q + (W^p h_i^p + W^r h_(i-1)^r + b^p) ⊗ e_Q) + b ⊗ e_Q)    (3-1)

where W^q, W^p, W^r, b^p, w, and b are parameters to be learned, and ⊗ e_Q denotes repeating the vector or scalar across the length of the question.
Answer Pointer Layer: The Answer Pointer (Ans-Ptr) layer is motivated by the Pointer Net introduced by Vinyals et al. (2015) [22]. The boundary model produces only the start token and the end token of the answer; all the tokens between these two in the original passage are then considered to be the answer.
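The boundary model's decoding step can be sketched as a search over (start, end) pairs with start ≤ end; the per-token scores and the passage below are made up for illustration:

```python
import numpy as np

def best_span(start_scores, end_scores, max_len=15):
    """Pick (start, end), start <= end, maximizing start_score + end_score."""
    best, best_pair = -np.inf, (0, 0)
    for s in range(len(start_scores)):
        for e in range(s, min(s + max_len, len(end_scores))):
            if start_scores[s] + end_scores[e] > best:
                best = start_scores[s] + end_scores[e]
                best_pair = (s, e)
    return best_pair

# Made-up boundary scores over a 6-token passage
start = np.array([0.1, 2.0, 0.3, 0.2, 0.1, 0.0])
end   = np.array([0.0, 0.1, 1.5, 0.2, 2.5, 0.1])

span = best_span(start, end)
tokens = "the Panic of 1901 hurt Tesla".split()
answer = " ".join(tokens[span[0]:span[1] + 1])
```

Everything between the chosen start and end tokens is emitted as the answer, exactly as the boundary model prescribes.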
When this paper was released in November 2016, the match-LSTM method was the state of the art in Question Answering systems and was at the top of the leaderboard for the SQuAD dataset.
R-NET: Machine Reading Comprehension with Self-Matching Networks
In this model [23], the question and passage are first processed separately by a bidirectional recurrent network (Mikolov et al., 2010). The question and passage are then matched with gated attention-based recurrent networks, obtaining a question-aware representation of the passage. On top of that, self-matching attention is applied to aggregate evidence from the whole passage and refine the passage representation, which is then fed into the output layer to predict the boundary of the answer span.
Question and Passage Encoding: First, the words are converted to their respective word-level and character-level embeddings. The character-level embeddings are generated by taking the final hidden states of a bi-directional recurrent neural network (RNN) applied to the embeddings of the characters in the token. Such character-level embeddings have been shown to help deal with out-of-vocabulary (OOV) tokens. A bi-directional RNN is then used to produce new representations of all words in the question and the passage, respectively.
Figure 3-2 R-NET model architecture [23]
Gated Attention-based Recurrent Networks: They use a variant of attention-based recurrent networks with an additional gate to determine the importance of information in the passage with regard to a question. Different from the gates in an LSTM or GRU, the additional gate is based on the current passage word and its attention-pooling vector of the question, which focuses on the relation between the question and the current passage word. The gate effectively models the phenomenon that, in reading comprehension and question answering, only parts of the passage are relevant to the question, and this is utilized in subsequent calculations.
Self-Matching Attention: From the previous step, the question-aware passage representation is generated to highlight the important parts of the passage. One problem with such a representation is that it has very limited knowledge of context: an answer candidate is often oblivious to important cues in the passage outside its surrounding window. To address this problem, the authors propose directly matching the question-aware passage representation against itself. It dynamically collects evidence from the whole passage for each word and encodes the evidence relevant to the current passage word, together with its matching question information, into the passage representation.
Output Layer: They use the same method as Wang & Jiang (2016b), using pointer networks (Vinyals et al., 2015) to predict the start and end positions of the answer. In addition, they use attention-pooling over the question representation to generate the initial hidden vector for the pointer network [23].

When the R-NET model first appeared on the leaderboard in March 2017, it was at the top, with a 72.3 Exact Match and an 80.7 F1 score.
Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
BiDAF [24] is a hierarchical multi-stage architecture for modeling the representations of the context paragraph at different levels of granularity. BiDAF includes character-level, word-level, and contextual embeddings, and uses bi-directional attention flow to obtain a query-aware context representation. Their attention layer is not used to summarize the context paragraph into a fixed-size vector. Instead, the attention is computed for every time step, and the attended vector at each time step, along with the representations from previous layers, can flow through to the subsequent modeling layer. This reduces the information loss caused by early summarization [24].
Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]
Their machine comprehension model is a hierarchical multi-stage process and consists of six layers:

1. Character Embedding Layer: maps each word to a vector space using character-level CNNs.

2. Word Embedding Layer: maps each word to a vector space using a pre-trained word embedding model.

3. Contextual Embedding Layer: utilizes contextual cues from surrounding words to refine the embedding of the words. These first three layers are applied to both the query and the context. An LSTM is used on top of the embeddings provided by the previous layers to model the temporal interactions between words; LSTMs are placed in both directions and their outputs concatenated.

4. Attention Flow Layer: couples the query and context vectors and produces a set of query-aware feature vectors for each word in the context.

5. Modeling Layer: employs a Recurrent Neural Network to scan the context. The input to the modeling layer is G, which encodes the query-aware representations of the context words. The output captures the interaction among the context words conditioned on the query. Two layers of bi-directional LSTM are used, with an output size of d for each direction; the resulting matrix is passed on to the output layer to predict the answer.

6. Output Layer: provides an answer to the query. The training loss (to be minimized) is defined as the sum of the negative log probabilities of the true start and end indices under the predicted distributions, averaged over all examples [25].
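The output layer's training loss can be illustrated directly; the logits and the true span indices below are made up, and the example compares a confident correct prediction against a uniform one:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def span_loss(start_logits, end_logits, true_start, true_end):
    """Sum of negative log probabilities of the true start and end indices."""
    p_start = softmax(start_logits)
    p_end = softmax(end_logits)
    return -np.log(p_start[true_start]) - np.log(p_end[true_end])

# Made-up logits over a 5-token context; the true answer span is tokens 1..3
loss_good = span_loss(np.array([0., 5., 0., 0., 0.]),
                      np.array([0., 0., 0., 5., 0.]), 1, 3)
loss_bad  = span_loss(np.zeros(5), np.zeros(5), 1, 3)
```

A model that concentrates probability on the true boundaries gets a much smaller loss than one that is indifferent, which is what gradient descent on this objective pushes toward.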
In a further variation of the above work, they add a self-attention layer after the bi-attention layer to further improve the results; the architecture of that model otherwise follows Figure 3-3 [25].
Summary
In this chapter we reviewed the methods that are fundamental to the state of the art in Machine Comprehension and the task of Question Answering. We have reached close to human-level accuracy, due to incremental developments over previous models. As we saw, out-of-vocabulary (OOV) tokens were handled using character embeddings, long-term dependencies within the context passage were addressed using self-attention, and many other techniques, such as contextualized vectors and attention flow, were employed to get better results. In the next chapter we will see how we can build on these models and develop them further.
CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING
Although the advancements in Question Answering systems since the release of the SQuAD dataset have been impressive, and the results are getting close to human-level accuracy, these are far from being fool-proof systems. The models still make mistakes which would be obvious to a human. For example:

Passage: The Panthers used the San Jose State practice facility and stayed at the San Jose Marriott. The Broncos practiced at Florida State Facility and stayed at the Santa Clara Marriott.

Question: At what university's facility did the Panthers practice?

Actual Answer: San Jose State

Predicted Answer: Florida State Facility
To find out what was leading to the wrong predictions, we examined the attention weights for such examples. We plotted the passage-question heat map, a 2D matrix where the intensity of each cell signifies the similarity between a passage word and a question word. For the above example, we found that while certain words of the question are given high weight, other parts are not. The words 'at', 'facility', and 'practice' receive high attention, but 'Panthers' does not. If it had received high attention, the system would have predicted 'San Jose State' as the right answer. To solve this issue, we analyzed the base BiDAF model and proposed adding two things:

1. Bi-attention and self-attention over the query.

2. A second level of attention over the outputs of (bi-attention + self-attention) from both the context and the query.
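Such a passage-question heat map can be produced with matplotlib. The attention matrix below is random, standing in for the model's real similarity weights, and the tokenization is simplified:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")          # render off-screen, no display needed
import matplotlib.pyplot as plt

passage = "The Panthers used the San Jose State practice facility".split()
question = "At what university 's facility did the Panthers practice".split()

# Made-up similarity matrix (rows: passage words, cols: question words)
rng = np.random.default_rng(0)
attn = rng.random((len(passage), len(question)))
attn /= attn.sum(axis=1, keepdims=True)   # normalize each passage row

fig, ax = plt.subplots()
im = ax.imshow(attn, cmap="viridis")      # cell intensity = similarity
ax.set_xticks(range(len(question)), labels=question, rotation=45)
ax.set_yticks(range(len(passage)), labels=passage)
fig.colorbar(im)
fig.savefig("attention_heatmap.png", bbox_inches="tight")
```

Reading the rendered grid row by row shows which question words each passage word attends to, which is how the missing weight on 'Panthers' was spotted.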
Multi-Attention BiDAF Model
Our comprehension model is a hierarchical multi-stage process and consists of the following layers:

1. Embedding: Just as in other models, we embed words using pretrained word vectors. We also embed the characters in each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to get character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training.
2. Pre-Process: A shared bi-directional GRU (Cho et al., 2014) is used to map the question and passage embeddings to context-aware embeddings.
3. Context Attention: The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al., 2016) is used to build a query-aware context representation. Let h_i be the vector for context word i, q_j be the vector for question word j, and n_q and n_c be the lengths of the question and context respectively. We compute the attention between context word i and question word j as

a_ij = w1 · h_i + w2 · q_j + w3 · (h_i ∘ q_j)    (4-1)

where w1, w2, and w3 are learned vectors and ∘ is element-wise multiplication. We then compute an attended vector c_i for each context token as

p_ij = exp(a_ij) / Σ_(j=1..n_q) exp(a_ij),    c_i = Σ_(j=1..n_q) p_ij q_j    (4-2)

We also compute a query-to-context vector q_c:

m_i = max_j a_ij,    p_i = exp(m_i) / Σ_(i=1..n_c) exp(m_i),    q_c = Σ_(i=1..n_c) p_i h_i    (4-3)

The final vector computed for each token is built by concatenating h_i, c_i, h_i ∘ c_i, and q_c ∘ c_i. In our model we subsequently pass the result through a linear layer with ReLU activations.
4. Context Self-Attention: Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU. Then we apply the same attention mechanism, only now between the passage and itself. In this case we do not use query-to-context attention, and we set a_ij = −∞ if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so its output is additionally summed with its input.
5. Query Attention: This is done the same way as context attention, but we calculate the weighted sum of the context words for each query word; thus the output length equals the number of query words. Then we calculate context-to-query attention, similar to the query-to-context attention in the context attention layer.
Figure 4-1 The modified BiDAF model with multilevel attention
6. Query Self-Attention: This is done the same way as the context self-attention layer, but on the output of the query attention layer.

7. Context-Query Bi-Attention + Self-Attention: The outputs of the context self-attention and query self-attention layers are taken as input, and the same bi-attention and self-attention process is applied to these inputs.

8. Prediction: In the last layer of our model a bi-directional GRU is applied, followed by a linear layer that computes answer start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bi-directional GRU and linear layer to predict answer end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log likelihood of selecting the correct start and end tokens.
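The context attention computation of equations 4-1 to 4-3 can be sketched in numpy; the dimensions, context/query matrices, and weight vectors below are placeholder values, not learned parameters:

```python
import numpy as np

def softmax(s, axis=-1):
    e = np.exp(s - s.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bi_attention(H, Q, w1, w2, w3):
    """Similarity a_ij = w1.h_i + w2.q_j + w3.(h_i * q_j)  (eq. 4-1)."""
    a = (H @ w1)[:, None] + (Q @ w2)[None, :] + (H * w3) @ Q.T
    C = softmax(a, axis=1) @ Q               # attended vectors c_i (eq. 4-2)
    m = softmax(a.max(axis=1), axis=-1)      # weights over context words
    q_c = m @ H                              # query-to-context vector (eq. 4-3)
    return a, C, q_c

rng = np.random.default_rng(0)
n_c, n_q, d = 7, 4, 5                        # context length, query length, dim
H, Q = rng.normal(size=(n_c, d)), rng.normal(size=(n_q, d))
w1, w2, w3 = rng.normal(size=(3, d))

a, C, q_c = bi_attention(H, Q, w1, w2, w3)
```

The query attention of layer 5 reuses the same machinery with the roles of H and Q swapped, which is why a shared implementation like this is convenient.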
Having carried out this modification, we were able to solve the wrong example we started with: the multilevel attention model gives the correct output, "San Jose State". We also achieved slightly better scores than the original model, with an F1 score of 85.44 on the SQuAD dev set.
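As context for the F1 number above: SQuAD's F1 measures token overlap between the predicted and gold answer spans. A minimal sketch follows; the official evaluation script additionally lowercases and strips punctuation and articles before comparing, which is omitted here.

```python
from collections import Counter

def f1_score(prediction, ground_truth):
    """Token-overlap F1 between a predicted and a gold answer span."""
    pred, gold = prediction.split(), ground_truth.split()
    common = Counter(pred) & Counter(gold)   # multiset intersection of tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting "Panic of 1901" against the gold answer "the Panic of 1901" gives precision 1.0 and recall 0.75, i.e. F1 ≈ 0.857.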
Chatbot Design Using a QA System
Designing a perfect chatbot that passes the Turing test is a fundamental goal of Artificial Intelligence. Although we are many orders of magnitude away from achieving such a goal, domain-specific tasks can be solved with chatbots built from current technology. We started with a similar objective in mind, i.e., to design a domain-specific chatbot and then generalize to other areas once it achieves the first domain-specific objective robustly. This led us to the fundamental problem of Machine Comprehension and subsequently to the task of Question Answering. Having achieved some degree of success with QA systems, we looked back to see whether we could apply our newly acquired knowledge to the task of designing chatbots.
Chatbots built with today's technologies mostly rely on handcrafted techniques such as template matching, which requires anticipating all possible ways a user may articulate his requirements and a conversation may unfold. This takes many man-hours when designing a domain-specific system and is still very error prone. In this section we propose a general chatbot design that would make designing a domain-specific chatbot easy and robust at the same time.
Every domain-specific chatbot needs to obtain a set of information from the user and show some results based on the user-specific information obtained. Traditional chatbots use template matching and keyword lookup to determine whether the user has provided the required information. Our idea is to use the Question Answering system in the backend to extract the required information from whatever the user has typed up to this point of the conversation. The information to be extracted can be posed in the form of a set of questions, and the answers obtained for those questions can be used as the parameters to supply the relevant information to the user.
We chose flight reservation as our chatbot domain. Our goal was to extract the required information from the user to be able to show the available flights as per the user's requirements.
Figure 4-2 Flight reservation chatbot's chat window
For a flight reservation task, the booking agent needs to know at minimum the origin city, destination city, and date of travel to be able to show the available flights. Optional information includes the number of tickets, the passenger's name, one way or round trip, etc.
The minimal conversation with the user through the chat window would be as shown above. We had a platform called OneTask on which we wanted to implement our chatbot. The chat interface within the OneTask system looks as follows.
Figure 4-3 Chatbot within OneTask system
The working of the chat system is as follows:
1. Initiation: The user opens the chat window, which starts a session with the chatbot. The chatbot asks "How may I help you?"
2. User Reply: The user may reply with none of the required pieces of information for flight booking, or may provide several of them in the same message.
3. User Reply Parsing: The conversation up to this point is treated as a passage, and the internal questions are run on this passage. The questions that are run include:
Where do you want to go?
From where do you want to leave?
When do you want to depart?
4. Parsed Responses from the QA Model: After running the questions on the QA system with the given conversation, answers are obtained for all of them along with their corresponding confidence values. Since answers will be returned even if the corresponding question has not actually been answered up to this point in the conversation, the confidence values play a crucial role in determining the validity of an answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly; a confidence of 2-10 signifies that it may have been answered, but we should verify with the user for correctness; and any answer with confidence below 2 is discarded.
5. Asking Remaining Questions Iteratively: After the parsing, it is checked whether any of the required questions are still unanswered. If so, the chatbot asks the remaining questions, and steps 3-4 are carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per the request.
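The parse-and-reask loop of steps 3-5 can be sketched as follows. The `qa_model` callable and the slot names are illustrative stand-ins for the BiDAF backend, not the thesis implementation; the thresholds are the confidence bands from step 4.

```python
REQUIRED_QUESTIONS = {
    "destination": "Where do you want to go?",
    "origin":      "From where do you want to leave?",
    "date":        "When do you want to depart?",
}
ACCEPT, VERIFY = 10.0, 2.0   # confidence bands from step 4

def parse_slots(conversation, qa_model):
    """Run each internal question over the conversation so far.

    qa_model(passage, question) -> (answer, confidence) stands in for the
    QA backend; answers below VERIFY are discarded and asked again.
    """
    accepted, to_verify, unanswered = {}, {}, []
    for slot, question in REQUIRED_QUESTIONS.items():
        answer, conf = qa_model(conversation, question)
        if conf > ACCEPT:
            accepted[slot] = answer          # treat as answered correctly
        elif conf >= VERIFY:
            to_verify[slot] = answer         # confirm with the user
        else:
            unanswered.append(question)      # chatbot asks this question next
    return accepted, to_verify, unanswered
```

The chatbot would call `parse_slots` after every user message and, once `unanswered` and `to_verify` are empty, query the flight database with the accepted slot values.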
Figure 4-4 The Flow diagram of the Flight booking Chatbot system
Online QA System and Attention Visualization
To be able to test out various examples, we used the BiDAF [24] model for an online demo. One can either choose from the available examples in the drop-down menu or paste in their own passage and questions. While this is a useful and interesting system for testing the model in a user-friendly way, we created it primarily to be able to focus on the wrongly answered samples.
Figure 4-5 QA system interface with attention highlight over candidate answers
The candidate answers are shown with blue highlights according to their confidence values; the higher the confidence, the darker the highlight. The candidate with the highest confidence value is chosen as the predicted answer. We developed the system to show the attention spread over the candidate answers, to understand what needs to be done to improve the system. This led us to realize the importance of including the query attention part, as well as multilevel attention, in the BiDAF model, as described in the first section of this chapter.
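The confidence-to-darkness mapping just described can be sketched as below. The RGBA color and the function name are illustrative assumptions, not taken from the demo's actual code; each candidate's opacity is scaled by its confidence relative to the top candidate.

```python
def highlight_css(candidates):
    """Map each candidate answer's confidence to a highlight opacity.

    candidates: list of (answer_text, confidence) pairs; the most
    confident candidate gets full opacity, others proportionally less.
    """
    top = max(conf for _, conf in candidates)
    return {ans: f"background: rgba(66, 133, 244, {conf / top:.2f})"
            for ans, conf in candidates}
```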
CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions
Having done a thorough analysis of the current methods and state of the art in Machine Comprehension and the QA task, and having developed systems for it, we have gained a strong sense of what needs to be done to further improve QA models. Observing the wrongly answered samples, we can see that the system is still unable to encode meaning and is picking answers based on statistical patterns of answer occurrence in the training examples.
Judging from the state-of-the-art models and the ongoing research literature, it is easy to conclude that more features need to be embedded to encode the meaning of words, phrases, and sentences. The paper Reinforced Mnemonic Reader for Machine Comprehension [6] encoded POS and NER tags of words along with their word and character embeddings, which gave better results.
Figure 5-1 An English language semantic parse tree [26]
We have developed a method to encode the syntactic parse tree of a sentence that encodes not only the POS tags but also the relation of a word within its phrase and the relation of the phrase within the whole sentence, in a hierarchical manner.
Finally, data augmentation is another way to get better results. One definite way to reduce errors would be to include in the training data samples similar to those on which the system falters in the dev set: one could generate examples resembling the failure cases and add them to the training set for better prediction. Another approach is to train on similar, larger datasets. Our models were trained on the SQuAD dataset; other datasets, such as TriviaQA, address a similar question answering task. We could augment the SQuAD training set with that of TriviaQA to obtain a more robust system that generalizes better and thus predicts answer spans with higher accuracy.
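The augmentation idea above amounts to pooling training examples from both datasets before shuffling. A minimal sketch, assuming each file holds a JSON list of flat {"passage", "question", "answer"} records (i.e. TriviaQA examples already converted to SQuAD-style answer spans; the record format here is an illustrative assumption):

```python
import json
import random

def merge_training_sets(paths, seed=0):
    """Concatenate and shuffle span-style QA examples from several files."""
    examples = []
    for path in paths:
        with open(path) as f:
            examples.extend(json.load(f))   # each file: a JSON list of records
    random.Random(seed).shuffle(examples)   # mix datasets within each epoch
    return examples
```

In practice one would also balance the proportion of each dataset per batch, since TriviaQA is distantly supervised and noisier than SQuAD.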
Conclusion
In this work we have explored the most fundamental techniques that have shaped the current state of the art. We then proposed a minor architectural improvement over an existing model. Furthermore, we developed two applications that use the base model: first, we discussed how a chatbot application can be made using the QA system, and second, we created a web interface where the model can be used with any passage and question. This interface also shows the attention spread over the candidate answers. While our effort to push the state of the art forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will pay high dividends for society at large.
LIST OF REFERENCES
[1] Chen, Danqi. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U
[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.
[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension." arXiv preprint arXiv:1705.03551 (2017).
[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ questions for machine comprehension of text." arXiv preprint arXiv:1606.05250 (2016).
[5] "The Stanford Question Answering Dataset." 2018. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/
[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced mnemonic reader for machine comprehension." CoRR abs/1705.02798 (2017).
[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.
[8] Mehrotra, Neetesh. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai/
[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.
[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
[11] Williams, Ronald J., and David Zipser. "A learning algorithm for continually running fully recurrent neural networks." Neural Computation 1, no. 2 (1989): 270-280.
[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9, no. 8 (1997): 1735-1780.
[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv preprint arXiv:1412.3555 (2014).
[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic regularities in continuous space word representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.
[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.
[16] McCormick, Chris. 2016. "Word2Vec Tutorial - The Skip-Gram Model." Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
[17] "[Paper Introduction] Bilingual Word Representations with Monolingual Quality in Mind." 2015. SlideShare. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind
[18] "Attention Mechanism." 2016. Heuritech Blog. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism/
[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. 2014. "Memory networks." arXiv preprint arXiv:1410.3916. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916
[20] Wang, Shuohang, and Jing Jiang. "Machine comprehension using match-LSTM and answer pointer." arXiv preprint arXiv:1608.07905 (2016).
[21] Wang, Shuohang, and Jing Jiang. "Learning natural language inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).
[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.
[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated self-matching networks for reading comprehension and question answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.
[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional attention flow for machine comprehension." arXiv preprint arXiv:1611.01603 (2016).
[25] Clark, Christopher, and Matt Gardner. "Simple and effective multi-paragraph reading comprehension." arXiv preprint arXiv:1710.10723 (2017).
[26] "8. Analyzing Sentence Structure." 2018. NLTK.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata, India, and had a strong interest in computers from an early age. After high school, he earned a BSc in computer science from Ramakrishna Mission Residential College, Narendrapur, followed by an MSc in computer science from St. Xavier's College, Kolkata. He had a strong intuition and interest for human-like learning systems and wanted to work in this area. He started working at TCS Innovation Labs, Pune, on applications of Natural Language Processing in educational software. Deeply passionate about learning systems that mimic the human brain and learn as a human child does, he became increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science at the University of Florida, Gainesville. His academic interests have been focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.
LIST OF FIGURES
Figure page 1-1 The task of Question Answering 10
2-1 Simple and Deep Learning Neural Networks 13
2-2 Convolutional Neural Network Architecture 14
2-3 Semantic relation between words in vector space 17
2-4 Attention Mechanism flow 18
2-5 QA example on Lord of the Rings using Memory Networks 20
3-1 Match-LSTM Model Architecture 22
3-2 The task of Question Answering 24
3-3 Bi-Directional Attention Flow Model Architecture 26
4-1 The modified BiDAF model with multilevel attention 31
4-2 Flight reservation chatbotrsquos chat window 33
4-3 Chatbot within OneTask system 34
4-4 The Flow diagram of the Flight booking Chatbot system 35
4-5 QA system interface with attention highlight over candidate answers 36
5-1 An English language semantic parse tree 37
7
LIST OF ABBREVIATIONS
BiDAF Bi-Directional Attention Flow
CNN Convolutional Neural Network
GRU Gated Recurrent Units
LSTM Long Short Term Memory
NLP Natural Language Processing
NLU Natural Language Understanding
RNN Recurrent Neural Network
8
Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Science
QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM
By
Purnendu Mukherjee
May 2018
Chair Xiaolin Li Major Computer Science
Question Answering(QA) systems have had rapid growth since the last 3 years
and are close to reaching human level accuracy One of the fundamental reason for this
growth has been the use of attention mechanism along with other methods of Deep
Learning But just as with other Deep Learning methods some of the failure cases are
so obvious that it convinces us that there is a lot to improve upon In this work we first
did a literature review of the State of the Art and fundamental models in QA systems
Next we introduce an architecture which has shown improvement in the targeted area
We then introduce a general method to enable easy design of domain-specific Chatbot
applications and present a proof of the concept with the same method Finally we
present an easy to use Question Answering interface with attention visualization on the
passage We also propose a method to improve the current state of the art as a part of
our ongoing work
9
CHAPTER 1 INTRODUCTION
Teaching machines to read and understand human natural language is our
central and long standing goal for Natural Language Processing and Artificial
Intelligence in general The positive side of machines being able to reason and
comprehend human language could be enormous We are beginning to see how
commercial systems like Alexa Siri Google Now etc are being widely used as Speech
Recognition improved to human levels While speech recognition systems can
transcribe speech to text comprehension of that transcribed text is another task which
is currently a major focus for both academia and industry because of its possible
applications Moreover all the text information available throughout the internet is a
major reason why machine comprehension of text is such an important task
With the growth of Deep Learning methods in the last few years the field of
Machine Comprehension and Natural Language Processing(NLP) in general has
experienced a revolution While the traditional methods and practices are still prevalent
and forms the basis of our deep understanding of languages Deep Learning methods
have surpassed all traditional NLP and Machine Learning methods by a significant
margin and are currently driving the growth of the field
To be able to build a system that can understand human text we need to first
ask ourselves how can we evaluate the machinersquos comprehension ability We had
initially set a goal to build a Chabot for a specific domain and generalize to other topics
as we go ahead While developing the system we found out the necessity for reading
comprehension and how to measure it We finally found the answer with Questions
Answering systems
10
Just like how we human beings are tested for our ability of language
understanding with questions we should ask machines similar questions about what it
has just read The performance of the system on such question answering task will let
us evaluate how much the machine is able to reason about what it just read [1]
Reading comprehension has been a topic of Natural Language Understanding since the
1970s In 1977 Wendy Lehnert said in his doctoral thesis ndash ldquoOnly when we can ask a
program to answer questions about what it reads will we be able to begin to access that
programrsquos comprehensionrdquo [2]
Figure 1-1 The task of Question Answering
To achieve this task the NLP community has developed various datasets such as
CNN Daily Mail WebQuestions SQuAD TriviaQA [3] etc For our purpose we chose
SQuAD which stands for Stanford Question Answering Dataset [4] SQuAD consists of
questions posed by crowdworkers on a set of Wikipedia articles where the answer to
every question is a segment of text or span from the corresponding reading passage
With 100000+ question-answer pairs on 500+ articles SQuAD is significantly larger
than previous reading comprehension datasets [5]
An example from the SQuAD dataset is as follows
11
Passage Tesla later approached Morgan to ask for more funds to build a more
powerful transmitter When asked where all the money had gone Tesla responded by
saying that he was affected by the Panic of 1901 which he (Morgan) had caused
Morgan was shocked by the reminder of his part in the stock market crash and by
Teslarsquos breach of contract by asking for more funds Tesla wrote another plea to
Morgan but it was also fruitless Morgan still owed Tesla money on the original
agreement and Tesla had been facing foreclosure even before construction of the
tower began
Question On what did Tesla blame for the loss of the initial money
Answer Panic of 1901
As we started exploring the QA task we faced several challenges Some of them
we could solve with the help of other research and some of the challenges still exist in
the domain
bull Out of Vocabulary words
bull Multi-sentence reasoning may be required
bull There may exist several candidate answers
bull Optimizing the Exact Match (EM) metric may fail when the answer boundary is fuzzy or too long such as the answer of the ldquowhyrdquo query [6]
bull ldquoOne-hoprdquo prediction may fail to fully understand the query [6]
bull Fail to fully capture the long-distance contextual interaction between parts of the context by only using LSTMGRU
bull Current models are unable to capture semantics of the passage
In the upcoming chapters we will first briefly review the basics necessary for
understanding the models then we will delve deep into the fundamental models that
12
have shaped the current State of the Art models then we will discuss our contribution in
terms of architecture and applications and finally conclude with future directions
13
CHAPTER 2 THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
To build a question answering system one needs to be familiar with the
fundamental deep learning models such as Recurrent Neural Networks (RNN) Long
Short Term Memory (LSTM) etc In this chapter we will have an overview on these
techniques and see how they all connect to building a question answering system
Neural Networks
What makes Deep Learning so intriguing is that it has close resemblance with
the working of the mammalian brain or at least draws inspiration from it The same can
be said for Artificial Neural networks [7] which consists of a system of interconnected
units called lsquoneuronsrsquo that take input from similar units and produces a single output
Figure 2-1 Simple and Deep Learning Neural Networks [8]
The connection from one neuron to another can be weighted based on the input data
which enables the network to tune itself to produce a certain output based on the input
This is the learning process which is achieved through backpropagation which is a
system of propagating the error from the output layer to the previous layers
14
Convolutional Neural Network
The first wave of deep learningrsquos success was brought by Convolutional Neural
Networks (CNN) [9] when this was the technique used by the winning team of ImageNet
competition in 2012 CNNs are deep artificial neural networks (ANN) that can be used to
classify images cluster them by similarity and perform object recognition within scenes
It can be used to detect and identify faces people signs or any other visual data
Figure 2-2 Convolutional Neural Network Architecture [10]
There are primarily four operations in a standard CNN model (as shown in Fig above)
1 Convolution - The primary purpose of Convolution in the ConvNet (above) is to extract features from the input image The spatial relationship between pixels ie the image features are preserved and learned by the convolution using small squares of input data
2 Non-Linearity (ReLU) ndash Rectified Linear Unit (ReLU) is a non-linear operation that carries out an element wise operation on each pixel This operation replaces the negative pixel values in the feature map by zero
3 Pooling or Sub Sampling - Spatial Pooling reduces the dimensionality of each feature map but retains the most important information For max pooling the largest value in the square window is taken and rest are dropped Other types of pooling are Average Sum etc
4 Classification (Fully Connected Layer) - The Fully Connected layer is a traditional Multi-Layer Perceptron as described before that uses a softmax activation function in the output layer The high-level features of the image are encoded by the convolutional and pooling layers which is then fed to the fully connected layer which then uses these features for classifying the input image into various classes based on the training dataset [10]
15
When a new image is fed into the CNN model all the above-mentioned steps are
carried out (forward propagation) and a probability distribution is achieved on the set of
output classes With a large enough training dataset the network will learn and
generalize well enough to classify new images into their correct classes
Recurrent Neural Networks (RNN)
Whenever we want to predict or encode sequential data RNN [11] is our go to
method RNNs perform the same task for every element of a sequence where the
output of each element depends on previous computations thus the recurrence In
practice RNNs are unable to retain long-term dependencies and can look back only a
few steps because of the vanishing gradient problem
(2-1)
A solution to the dependency problem is to use gated cells such as LSTM [11] or
GRU [13] These cells pass on important information to the next cells while ignoring
non-important ones The gated units in a GRU block are
bull Update Gate ndash Computed based on current input and hidden state
(2-2)
bull Reset Gate ndash Calculated similarly but with different weights
(2-3)
bull New memory content - (2-4)
If reset gate unit is ~0 then previous memory is ignored and only new
information is kept
16
Final memory at current time step combines previous and current time steps
(2-5)
While the GRU is computationally efficient the LSTM on the other hand is a
general case where there are three gates as follows
bull Input Gate ndash What new information to add to the current cell state
bull Forget Gate ndash How much information from previous states to be kept
bull Output gate ndash How much info should be sent to the next states
Just like GRU the current cell state is a sum of the previous cell state but
weighted by the forget gate and the new value is added which is weighted by the input
gate Based on the cell state the output gate regulates the final output
Word Embedding
Computation or gradients can be applied on numbers and not on words or letters
So first we need to convert words into their corresponding numerical formation before
feeding into a deep learning model In general there are two types of word embedding
Frequency based (which constitutes count vectors tf-idf and co-occurrence vectors)
and Prediction based With frequency based embedding the order of the words are not
preserved and works as a bag of words model Whereas with prediction based model
the order of words or locality of words are taken into consideration to generate the
numerical representation of the word Within this prediction based category there are
two fundamental techniques called Continuous Bag of Words (CBOW) and Skip Gram
Model which forms the basis for word2vec [14] and GloVe [15]
The basic intuition behind word2vec is that if two different words have very
similar ldquocontextsrdquo (that is what words are likely to appear around them) then the model
17
will produce similar vector for those words Conversely if the two word vectors are
similar then the network will produce similar context predictions for the same two words
For examples synonyms like ldquointelligentrdquo and ldquosmartrdquo would have very similar contexts
Or that words that are related like ldquoenginerdquo and ldquotransmissionrdquo would probably have
similar contexts as well [16] Plotting the word vectors learned by a word2vec over a
large corpus we could find some very interesting relationships between words
Figure 2-3 Semantic relation between words in vector space [17]
Attention Mechanism
We as humans put our attention to things are important or are relevant in a
context For example when asked a question from a passage we try to find the most
relevant part of the passage the question is relevant with and then reason from our
understanding of that part of the passage The same idea applies for attention
mechanism in Deep Learning It is used to identify the specific parts of a given context
to which the current question is relevant to
Formally put the techniques take n arguments y_1 y_n (in our case the
passage having words say y_i through h_i) and a question word say q It returns a
vector z which is supposed to be the laquo summary raquo of the y_i focusing on information
linked to the question q More formally it returns a weighted arithmetic mean of the y_i
18
and the weights are chosen according the relevance of each y_i given the context c
[18]
Figure 2-4 Attention Mechanism flow [18]
Memory Networks
Convolutional Neural Networks and Recurrent Neural Networks which does
capture how we form our visual and sequential memories their memory (encoded by
hidden states and weights) were typically too small and was not compartmentalized
enough to accurately remember facts from the past (knowledge is compressed into
dense vectors) [19]
Deep Learning needed to cultivate a methodology that preserved memories as
they are such that it wonrsquot be lost in generalization and recalling exact words or
sequence of events would be possible mdash something computers are already good at This
effort led us to Memory Networks [19] published at ICLR 2015 by Facebook AI
Research
This paper provides a basic framework to store augment and retrieve memories
while seamlessly working with a Recurrent Neural Network architecture The memory
19
network consists of a memory m (an array of objects 1 indexed by m i) and four
(potentially learned) components I G O and R as follows
I (input feature map) mdash converts the incoming input to the internal feature
representation either a sparse or dense feature vector like that from word2vec or
GloVe
G (generalization) mdash updates old memories given the new input They call this
generalization as there is an opportunity for the network to compress and generalize its
memories at this stage for some intended future use The analogy Irsquove been talking
before
O (output feature map): produces a new output (in the feature representation space) given the new input and the current memory state. This component is responsible for performing inference; in a question answering system, it selects the candidate sentences (which might contain the answer) from the story (conversation) so far.
R (response): converts the output into the desired response format, for example a textual response or an action. In the QA system described, this component finds the desired answer and then converts it from the feature representation to the actual word.
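The I-G-O-R decomposition can be made concrete with a toy, non-learned skeleton. Everything below is an illustrative assumption (bag-of-words features, dot-product matching, G as a plain append); the actual paper learns these components:

```python
import numpy as np

class ToyMemoryNetwork:
    """Minimal I-G-O-R skeleton, illustrative only; components are not learned.

    I: bag-of-words feature map.  G: store the new memory.
    O: retrieve the most relevant memory.  R: surface an answer from it.
    """
    def __init__(self, vocab):
        self.vocab = {w: i for i, w in enumerate(vocab)}
        self.memory = []            # m: an array of stored memory vectors
        self.texts = []             # the raw sentences, kept for R

    def I(self, sentence):
        v = np.zeros(len(self.vocab))
        for w in sentence.lower().split():
            if w in self.vocab:
                v[self.vocab[w]] += 1
        return v

    def G(self, sentence):          # generalization: here, simply append
        self.memory.append(self.I(sentence))
        self.texts.append(sentence)

    def O(self, question):          # output: index of the best-matching memory
        scores = [m @ self.I(question) for m in self.memory]
        return int(np.argmax(scores))

    def R(self, question):          # response: return the supporting sentence
        return self.texts[self.O(question)]
```

A real Memory Network would replace the append in G and the argmax in O with learned, differentiable (or hard-attention supervised) operations, but the data flow is the same.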
This model is fully supervised, meaning all the candidate sentences from which the answer can be found are marked during the training phase; this setup can also be termed 'hard attention'.
The authors tested the QA system on various literature, including Lord of the Rings.
Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]
CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART
There has been rapid progress since the release of the SQuAD dataset, and currently the best ensemble models are close to human-level accuracy in machine comprehension. This is due to various ingenious methods which solve some of the problems of previous approaches: out-of-vocabulary tokens were handled using character embeddings, long-term dependencies within the context passage were addressed using self-attention, and many other techniques such as contextualized vectors, history of words, and attention flow were introduced. In this section we will look at some of the most important models that were fundamental to the progress of Question Answering.
Machine Comprehension Using Match-LSTM and Answer Pointer
In this paper [20], the authors propose an end-to-end neural architecture for the QA task. The architecture is based on match-LSTM [21], a model they previously proposed for textual entailment, and Pointer Net [22], a sequence-to-sequence model proposed by Vinyals et al. (2015) that constrains the output tokens to come from the input sequences. The model consists of an LSTM preprocessing layer, a match-LSTM layer, and an Answer Pointer layer.
We are given a piece of text, which we refer to as a passage, and a question related to the passage. The passage is represented by a matrix P, where P is the length (number of tokens) of the passage and d is the dimensionality of the word embeddings. Similarly, the question is represented by a matrix Q, where Q is the length of the question. Our goal is to identify a subsequence of the passage as the answer to the question.
Figure 3-1 Match-LSTM Model Architecture [20]
LSTM Preprocessing Layer: They use a standard one-directional LSTM (Hochreiter & Schmidhuber, 1997) to process the passage and the question separately, as shown below.
Match-LSTM Layer: They applied the match-LSTM model proposed for textual entailment to their machine comprehension problem by treating the question as a premise and the passage as a hypothesis. The match-LSTM sequentially goes through the passage; at position i of the passage, it first uses the standard word-by-word attention mechanism to obtain an attention weight vector as follows
(3-1)
where the weight matrices and bias vectors are parameters to be learned.
Answer Pointer Layer: The Answer Pointer (Ans-Ptr) layer is motivated by the Pointer Net introduced by Vinyals et al. (2015) [22]. The boundary model produces only the start token and the end token of the answer; all the tokens between these two in the original passage are then considered to be the answer.
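The boundary model's decoding step can be sketched as a search over start/end probabilities. This is an illustrative sketch; the optional `max_len` cap on span length is our own assumption, not part of the original paper:

```python
def decode_boundary(p_start, p_end, max_len=None):
    """Pick the answer span (s, e) with e >= s maximizing p_start[s] * p_end[e].

    p_start, p_end: per-token probabilities of being the start/end of the answer.
    max_len: optional cap on the span length (an illustrative assumption).
    """
    n = len(p_start)
    best, best_score = (0, 0), -1.0
    for s in range(n):
        hi = n if max_len is None else min(n, s + max_len)
        for e in range(s, hi):          # enforce end >= start
            score = p_start[s] * p_end[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best
```

The O(n^2) loop is fine for SQuAD-length passages; production systems often vectorize the same search.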
When this paper was released in November 2016, the match-LSTM method was the state of the art in Question Answering systems and was at the top of the leaderboard for the SQuAD dataset.
R-NET: Machine Reading Comprehension with Self-Matching Networks
In this model [23], the question and passage are first processed separately by a bidirectional recurrent network (Mikolov et al., 2010). They then match the question and passage with gated attention-based recurrent networks, obtaining a question-aware representation of the passage. On top of that, they apply self-matching attention to aggregate evidence from the whole passage and refine the passage representation, which is then fed into the output layer to predict the boundary of the answer span.
Question and Passage Encoding: First, the words are converted to their respective word-level and character-level embeddings. The character-level embeddings are generated by taking the final hidden states of a bi-directional recurrent neural network (RNN) applied to the embeddings of the characters in the token. Such character-level embeddings have been shown to help deal with out-of-vocabulary (OOV) tokens.
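The "final hidden states of a bi-directional RNN over characters" idea can be sketched directly. The simple tanh RNN cell and the randomly initialized weights below are illustrative stand-ins for the trained network:

```python
import numpy as np

def char_embedding(word, char_emb, Wf, Uf, Wb, Ub):
    """Character-level word embedding: run a simple RNN over the word's
    character vectors in both directions and concatenate the two final
    hidden states.  Weights are illustrative, not trained.

    char_emb: dict mapping character -> (d_c,) vector
    Wf, Uf / Wb, Ub: input and recurrent weights for the forward / backward RNN
    """
    xs = [char_emb[c] for c in word]

    def run(w_in, w_rec, seq):
        h = np.zeros(w_rec.shape[0])
        for x in seq:
            h = np.tanh(w_in @ x + w_rec @ h)   # plain tanh RNN step
        return h

    h_fwd = run(Wf, Uf, xs)                     # left-to-right final state
    h_bwd = run(Wb, Ub, reversed(xs))           # right-to-left final state
    return np.concatenate([h_fwd, h_bwd])
```

Because any word spelled from known characters gets a vector, this handles OOV tokens that have no word-level embedding, which is exactly the benefit claimed above.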
They then use a bi-directional RNN to produce new representations of all words in the question and passage respectively.
Figure 3-2 The task of Question Answering [23]
Gated Attention-based Recurrent Networks: They use a variant of attention-based recurrent networks with an additional gate to determine the importance of information in the passage with regard to a question. Different from the gates in LSTM or GRU, the additional gate is based on the current passage word and its attention-pooled vector of the question, and thus focuses on the relation between the question and the current passage word. The gate effectively models the phenomenon that in reading comprehension and question answering only parts of the passage are relevant to the question, and only the gated representation is utilized in subsequent calculations.
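The extra gate amounts to an element-wise rescaling of the RNN input. A minimal sketch, where the gate matrix `Wg` is an untrained placeholder for the learned parameter:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_input(u_p, c_q, Wg):
    """R-NET-style additional gate (illustrative sketch).

    u_p: current passage word representation
    c_q: attention-pooled question vector for that word
    The concatenated input [u_p, c_q] is scaled element-wise by a gate in
    (0, 1), down-weighting passage positions irrelevant to the question.
    """
    x = np.concatenate([u_p, c_q])
    g = sigmoid(Wg @ x)        # gate, same size as x, each entry in (0, 1)
    return g * x               # gated input fed to the recurrent network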
Self-Matching Attention: From the previous step, the question-aware passage representation is generated to highlight the important parts of the passage. One problem with such a representation is that it has very limited knowledge of context: an answer candidate is often oblivious to important cues in the passage outside its surrounding window. To address this problem, the authors propose directly matching the question-aware passage representation against itself. It dynamically collects evidence from the whole passage for each word and encodes the evidence relevant to the current passage word, together with its matching question information, into the passage representation.
Output Layer: They use the same method as Wang & Jiang (2016b), applying pointer networks (Vinyals et al., 2015) to predict the start and end positions of the answer. In addition, they use attention-pooling over the question representation to generate the initial hidden vector for the pointer network [23].
When the R-NET model first appeared on the leaderboard in March 2017, it was at the top with a 72.3 Exact Match and an 80.7 F1 score.
Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
BiDAF [24] is a hierarchical multi-stage architecture for modeling representations of the context paragraph at different levels of granularity. BiDAF includes character-level, word-level, and contextual embeddings, and uses bi-directional attention flow to obtain a query-aware context representation. Their attention layer is not used to summarize the context paragraph into a fixed-size vector; instead, the attention is computed for every time step, and the attended vector at each time step, along with the representations from previous layers, can flow through to the subsequent modeling layer. This reduces the information loss caused by early summarization [24].
Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]
Their machine comprehension model is a hierarchical multi-stage process and consists of six layers:
1 Character Embedding Layer maps each word to a vector space using character-level CNNs
2 Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model
3 Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embedding of the words. These first three layers are applied to both the query and the context. An LSTM is used on top of the embeddings provided by the previous layers to model the temporal interactions between words; an LSTM is placed in both directions and the outputs of the two LSTMs are concatenated.
4 Attention Flow Layer couples the query and context vectors and produces a set of query-aware feature vectors for each word in the context.
5 Modeling Layer employs a Recurrent Neural Network to scan the context. The input to the modeling layer is G, which encodes the query-aware representations of the context words. The output of the modeling layer captures the interaction among the context words conditioned on the query. They use two layers of bi-directional LSTM with an output size of d for each direction; hence a matrix is obtained, which is passed on to the output layer to predict the answer.
6 Output Layer provides an answer to the query. They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices under the predicted distributions, averaged over all examples [25].
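The training loss described in the output layer can be written down directly. A minimal numpy sketch; the probability vectors here are illustrative stand-ins for the model's predicted start/end distributions:

```python
import numpy as np

def span_nll(p_start, p_end, true_starts, true_ends):
    """BiDAF-style span loss: sum of negative log probabilities of the true
    start and end indices under the predicted distributions, averaged over
    all examples.

    p_start, p_end: lists of per-example probability vectors over tokens.
    true_starts, true_ends: the gold start/end token indices per example.
    """
    losses = [-np.log(ps[s]) - np.log(pe[e])
              for ps, pe, s, e in zip(p_start, p_end, true_starts, true_ends)]
    return float(np.mean(losses))
```

A model that puts all its probability mass on the correct indices incurs zero loss; spreading mass over other tokens increases it.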
In a further variation of their above work, they add a self-attention layer after the bi-attention layer to further improve the results. The architecture of the model is as shown in Figure 3-3 [25].
Summary
In this chapter we reviewed the methods that are fundamental to the state of the art in Machine Comprehension and the task of Question Answering. We have come close to human-level accuracy, and this is due to incremental developments over previous models. As we saw, out-of-vocabulary (OOV) tokens were handled using character embeddings, long-term dependencies within the context passage were addressed using self-attention, and many other techniques such as contextualized vectors and attention flow were employed to get better results. In the next chapter we will see how we can build on these models and develop them further.
CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING
Although the advancements in Question Answering systems since the release of the SQuAD dataset have been impressive, and the results are getting close to human-level accuracy, these are far from being fool-proof systems. The models still make mistakes which would be obvious to a human. For example:
Passage: The Panthers used the San Jose State practice facility and stayed at the San Jose Marriott. The Broncos practiced at Florida State Facility and stayed at the Santa Clara Marriott.
Question: At what university's facility did the Panthers practice?
Actual Answer: San Jose State
Predicted Answer: Florida State Facility
To find out what leads to the wrong predictions, we wanted to see the attention weights associated with such an example. We plotted the passage-question heat map, a 2D matrix where the intensity of each cell signifies the similarity between a passage word and a question word. For the above example we found that while certain words of the question are given high weightage, other parts are not. The words 'At', 'facility', and 'practice' are given high attention, but 'Panthers' does not receive high attention. If it had received high attention, the system would have predicted 'San Jose State' as the right answer. To solve this issue we analyzed the base BiDAF model and proposed adding two things:
1 Bi-Attention and Self-Attention over the query
2 A second level of attention over the outputs of (Bi-Attention + Self-Attention) from both context and query
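The matrix behind such a heat map can be computed from the word vectors. A minimal sketch, where the dot-product similarity and the per-passage-word softmax normalization are illustrative assumptions rather than the model's exact score:

```python
import numpy as np

def attention_heatmap(P, Q):
    """Build the passage x question matrix behind the heat map.

    P: (n_p, d) passage word vectors; Q: (n_q, d) question word vectors.
    Cell (i, j) is the softmax-normalized (per passage word) similarity
    between passage word i and question word j, i.e. each row sums to 1.
    """
    S = P @ Q.T                                   # raw dot-product similarities
    S = S - S.max(axis=1, keepdims=True)          # stabilize the softmax
    A = np.exp(S)
    A /= A.sum(axis=1, keepdims=True)
    return A
```

Plotting A with any image routine (e.g. matplotlib's imshow) gives the heat map; a question word whose column is uniformly pale, like 'Panthers' in our example, is being ignored by the attention.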
Multi-Attention BiDAF Model
Our comprehension model is a hierarchical multi-stage process and consists of the following layers:
1 Embedding Just as in other models, we embed words using pretrained word vectors. We also embed the characters of each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to get character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training.
2 Pre-Process A shared bi-directional GRU (Cho et al., 2014) is used to map the question and passage embeddings to context-aware embeddings.
3 Context Attention The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al., 2016) is used to build a query-aware context representation. Let h_i be the vector for context word i, q_j be the vector for question word j, and n_q and n_c be the lengths of the question and context respectively. We compute attention between context word i and question word j as

a_ij = w1 · h_i + w2 · q_j + w3 · (h_i ⊙ q_j) (4-1)

where w1, w2, and w3 are learned vectors and ⊙ is element-wise multiplication. We then compute an attended vector c_i for each context token as

p_ij = exp(a_ij) / Σ_k exp(a_ik), c_i = Σ_j p_ij q_j (4-2)

We also compute a query-to-context vector q_c:

m_i = max_j a_ij, p_i = exp(m_i) / Σ_k exp(m_k), q_c = Σ_i p_i h_i (4-3)

The final vector computed for each token is built by concatenating h_i, c_i, h_i ⊙ c_i, and q_c ⊙ c_i. In our model we subsequently pass the result through a linear layer with ReLU activations.
4 Context Self-Attention Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU. Then we apply the same attention mechanism, only now between the passage and itself; in this case we do not use query-to-context attention, and we set a_ij = -inf if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so its output is additionally summed with its input.
5 Query Attention This is done the same way as context attention, but we calculate the weighted sum of the context words for each query word, so the output length equals the number of query words. We then calculate context-to-query attention, analogous to the query-to-context attention of the context attention layer.
Figure 4-1 The modified BiDAF model with multilevel attention
6 Query Self-Attention This is done the same way as the context self-attention layer, but on the output of the Query Attention layer.
7 Context-Query Bi-Attention + Self-Attention The outputs of the Context Self-Attention and Query Self-Attention layers are taken as input, and the same bi-attention and self-attention process is applied to these inputs.
8 Prediction In the last layer of our model, a bidirectional GRU is applied, followed by a linear layer that computes answer start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log likelihood of selecting the correct start and end tokens.
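As a sketch of the bi-attention used in the context attention step, the computation can be written compactly in numpy. The weight vectors below are randomly initialized stand-ins for the learned parameters w1, w2, w3:

```python
import numpy as np

def bi_attention(H, Qm, w1, w2, w3):
    """Trilinear bi-directional attention, as in the context attention layer.

    H: (n_c, d) context vectors h_i; Qm: (n_q, d) question vectors q_j;
    w1, w2, w3: learned d-dimensional vectors (placeholders here).
    Returns the attended context-to-query vectors c_i and the single
    query-to-context vector q_c.
    """
    # a_ij = w1.h_i + w2.q_j + w3.(h_i * q_j), computed for all (i, j) at once
    A = (H @ w1)[:, None] + (Qm @ w2)[None, :] + (H * w3) @ Qm.T

    def softmax(x, axis):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    C = softmax(A, axis=1) @ Qm            # c_i: one attended vector per context word
    m = softmax(A.max(axis=1), axis=0)     # attend over context words
    q_c = m @ H                            # single query-to-context vector
    return C, q_c
```

Note that `(H * w3) @ Qm.T` produces exactly the element-wise term: its (i, j) entry is the sum over dimensions of h_i * w3 * q_j.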
Having carried out this modification, we were able to solve the wrong example we started with: the multilevel attention model gives the correct output, "San Jose State". We also achieved slightly better scores than the original model, with an F1 score of 85.44 on the SQuAD dev dataset.
Chatbot Design Using a QA System
Designing a perfect chatbot that passes the Turing test is a fundamental goal of Artificial Intelligence. Although we are many orders of magnitude away from achieving such a goal, domain-specific tasks can be solved with chatbots made from current technology. We had started with a similar objective in mind, i.e., to design a domain-specific chatbot and then generalize to other areas once it achieves the first domain-specific objective robustly. This led us to the fundamental problem of Machine Comprehension and subsequently to the task of Question Answering. Having achieved some degree of success with QA systems, we looked back to see whether we could apply our newly acquired knowledge to the task of designing chatbots.
The chatbots made with today's technologies mostly use handcrafted techniques such as template matching, which requires anticipating all possible ways a user may articulate his requirements and a conversation may unfold. This requires many man-hours to design a domain-specific system, and the result is still very error-prone. In this section we propose a general chatbot design that would make designing a domain-specific chatbot easy and robust at the same time.
Every domain-specific chatbot needs to obtain a set of information from the user and show some results based on the user-specific information obtained. Traditional chatbots use template matching and keyword lookup to determine whether the user has provided the required information. Our idea is to use the Question Answering system in the backend to extract the required information from whatever the user has typed up to this point of the conversation. The information to be extracted can be posed as a set of questions, and the answers obtained from those questions can be used as the parameters to supply the relevant information to the user.
We chose flight reservation as our chatbot domain. Our goal was to extract the required information from the user to be able to show the available flights as per the user's requirements.
Figure 4-2 Flight reservation chatbotrsquos chat window
For a flight reservation task, the booking agent needs to know at minimum the origin city, destination city, and date of travel to be able to show the available flights. Optional information includes the number of tickets, the passenger's name, one-way or round trip, etc.
A minimal conversation with the user through the chat window would be as shown above. We had a platform called OneTask on which we wanted to implement our chatbot. The chat interface within the OneTask system looks as follows.
The working of the chat system is as follows:
1 Initiation The user opens the chat window, which starts a session with the chatbot. The chatbot asks 'How may I help you?'
2 User Reply The user may reply with none of the required information for flight booking, or may reply with multiple pieces of information in the same message.
3 User Reply Parsing The conversation up to this point is treated as a passage, and the internal questions are run on this passage. The questions that are run include:
Where do you want to go?
From where do you want to leave?
When do you want to depart?
4 Parsed Responses from the QA Model After running the questions on the QA system with the given conversation, answers are obtained for all of them, along with their corresponding confidence values. Since answers will be produced even if the required question has not been answered up to this point in the conversation, the confidence values play a crucial role in determining the validity of an answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies that the question has been answered correctly; a confidence of 2-10 signifies that it may have been answered, but we should verify with the user for correctness; and any answer with confidence below 2 is discarded.
5 Asking Remaining Questions Iteratively After the parsing, it is checked whether any of the required questions are still unanswered. If so, the chatbot asks the remaining questions, and steps 3-4 are carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request.
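The confidence-based validation in step 4 can be sketched as a small slot-filling routine. The function name and the answer format (slot name mapped to an answer/confidence pair) are our own illustrative choices; the thresholds are the ones stated above:

```python
def classify_answers(answers):
    """Validate QA-extracted slots using the thresholds from the text:
    confidence > 10 -> accept; 2 to 10 -> verify with the user; < 2 -> discard
    (the slot's question must be asked explicitly).

    answers: dict mapping slot name -> (answer text, confidence value)
    """
    accepted, verify, remaining = {}, {}, []
    for slot, (text, conf) in answers.items():
        if conf > 10:
            accepted[slot] = text          # treat as answered correctly
        elif conf >= 2:
            verify[slot] = text            # confirm this answer with the user
        else:
            remaining.append(slot)         # ask this question in the next turn
    return accepted, verify, remaining
```

The chatbot loop then re-runs the QA model after each user reply and calls this routine until `remaining` is empty.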
Figure 4-4 The Flow diagram of the Flight booking Chatbot system
Online QA System and Attention Visualization
To be able to test out various examples, we set up an online demo using the BiDAF [24] model. One can either choose from the available examples in the drop-down menu or paste in their own passage and questions. While this is a useful and interesting system for testing the model in a user-friendly way, we created it primarily to be able to focus on the wrong samples.
Figure 4-5 QA system interface with attention highlight over candidate answers
The candidate answers are shown with blue highlights according to their confidence values: the higher the confidence, the darker the answer. The candidate with the highest confidence value is chosen as the predicted answer. We developed the system to show the attention spread over the candidate answers in order to understand what needs to be done to improve the system. This led us to realize the importance of including the query attention part, as well as multilevel attention, in the BiDAF model, as described in the first section of this chapter.
CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions
Having done a thorough analysis of the current methods and the state of the art in Machine Comprehension and the QA task, and having developed systems on them, we have gained a strong sense of what needs to be done to further improve QA models. After observing the wrong samples, we can see that the system is still unable to encode meaning and is picking answers based on statistical patterns of occurrence of answers in the training examples.
Judging from the state-of-the-art models and the ongoing research literature, it is easy to conclude that more features need to be embedded to encode the meaning of words, phrases, and sentences. The paper Reinforced Mnemonic Reader for Machine Comprehension [6] encoded POS and NER tags of words along with their word and character embeddings, which gave them better results.
Figure 5-1 An English language semantic parse tree [26]
38
We have developed a method to encode the syntax parse tree of a sentence that encodes not only the POS tags but also the relation of a word within its phrase, and the relation of the phrase within the whole sentence, in a hierarchical manner.
Finally, data augmentation is another way to get better results. One definite way to reduce the errors would be to include in the training data samples similar to those the system falters on in the dev set: one could generate examples similar to the failure cases and include them in the training set for better prediction. Another approach would be to train on similar and bigger datasets. Our models were trained on the SQuAD dataset, but other datasets address the same question answering task, such as TriviaQA. We could augment the training set with TriviaQA along with SQuAD to obtain a more robust system that generalizes better and thus has higher accuracy in predicting answer spans.
Conclusion
In this work we first explored the most fundamental techniques that have shaped the current state of the art. We then proposed a minor architectural improvement over an existing model. Furthermore, we developed two applications that use the base model: first, we discussed how a chatbot application can be built using the QA system, and second, we created a web interface where the model can be used with any passage and question. This interface also shows the attention spread over the candidate answers. While our effort to push the state of the art forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will pay high dividends for society at large.
LIST OF REFERENCES
[1] Chen, Danqi. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U
[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.
[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension." arXiv preprint arXiv:1705.03551 (2017).
[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ Questions for Machine Comprehension of Text." arXiv preprint arXiv:1606.05250 (2016).
[5] "The Stanford Question Answering Dataset." 2018. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer
[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced Mnemonic Reader for Machine Comprehension." CoRR abs/1705.02798 (2017).
[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.
[8] Mehrotra, Neetesh. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai
[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.
[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets
[11] Williams, Ronald J., and David Zipser. "A Learning Algorithm for Continually Running Fully Recurrent Neural Networks." Neural Computation 1, no. 2 (1989): 270-280.
[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long Short-Term Memory." Neural Computation 9, no. 8 (1997): 1735-1780.
[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling." arXiv preprint arXiv:1412.3555 (2014).
[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic Regularities in Continuous Space Word Representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.
[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global Vectors for Word Representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.
[16] McCormick, Chris. 2016. "Word2Vec Tutorial - The Skip-Gram Model." Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model
[17] "[Paper Introduction] Bilingual Word Representations with Monolingual Quality in Mind." 2015. SlideShare. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind
[18] "Attention Mechanism." 2016. Heuritech Blog. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism
[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. 2014. "Memory Networks." arXiv preprint arXiv:1410.3916. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916
[20] Wang, Shuohang, and Jing Jiang. "Machine Comprehension Using Match-LSTM and Answer Pointer." arXiv preprint arXiv:1608.07905 (2016).
[21] Wang, Shuohang, and Jing Jiang. "Learning Natural Language Inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).
[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer Networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.
[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated Self-Matching Networks for Reading Comprehension and Question Answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.
[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional Attention Flow for Machine Comprehension." arXiv preprint arXiv:1611.01603 (2016).
[25] Clark, Christopher, and Matt Gardner. "Simple and Effective Multi-Paragraph Reading Comprehension." arXiv preprint arXiv:1710.10723 (2017).
[26] "8. Analyzing Sentence Structure." 2018. nltk.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata, India, and had a strong interest in computers from an early age. After high school, he completed his B.Sc. in computer science at Ramakrishna Mission Residential College, Narendrapur, followed by an M.Sc. in computer science at St. Xavier's College, Kolkata. He had a strong intuition and interest for human-like learning systems and wanted to work in this area. He started working at TCS Innovation Labs, Pune, on applications of Natural Language Processing in education. Being deeply passionate about learning systems that mimic the human brain and learn like a human child does, he became increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science at the University of Florida, Gainesville. His academic interests have been focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.
LIST OF FIGURES
Figure page
1-1 The task of Question Answering 10
2-1 Simple and Deep Learning Neural Networks 13
2-2 Convolutional Neural Network Architecture 14
2-3 Semantic relation between words in vector space 17
2-4 Attention Mechanism flow 18
2-5 QA example on Lord of the Rings using Memory Networks 20
3-1 Match-LSTM Model Architecture 22
3-2 The task of Question Answering 24
3-3 Bi-Directional Attention Flow Model Architecture 26
4-1 The modified BiDAF model with multilevel attention 31
4-2 Flight reservation chatbotrsquos chat window 33
4-3 Chatbot within OneTask system 34
4-4 The Flow diagram of the Flight booking Chatbot system 35
4-5 QA system interface with attention highlight over candidate answers 36
5-1 An English language semantic parse tree 37
LIST OF ABBREVIATIONS
BiDAF Bi-Directional Attention Flow
CNN Convolutional Neural Network
GRU Gated Recurrent Units
LSTM Long Short Term Memory
NLP Natural Language Processing
NLU Natural Language Understanding
RNN Recurrent Neural Network
Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Science
QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM
By
Purnendu Mukherjee
May 2018
Chair: Xiaolin Li
Major: Computer Science
Question Answering (QA) systems have grown rapidly over the last three years
and are close to reaching human-level accuracy. One of the fundamental reasons for this
growth has been the use of the attention mechanism along with other Deep Learning
methods. But just as with other Deep Learning methods, some of the failure cases are
so obvious that they convince us there is much to improve upon. In this work we first
present a literature review of the state-of-the-art and fundamental models in QA systems.
Next we introduce an architecture which has shown improvement in the targeted area.
We then introduce a general method to enable easy design of domain-specific chatbot
applications and present a proof of concept with the same method. Finally, we
present an easy-to-use Question Answering interface with attention visualization on the
passage. We also propose a method to improve the current state of the art as part of
our ongoing work.
CHAPTER 1 INTRODUCTION
Teaching machines to read and understand human natural language is a
central and long-standing goal of Natural Language Processing and of Artificial
Intelligence in general. The potential benefit of machines being able to reason about and
comprehend human language is enormous. We are already seeing how
commercial systems like Alexa, Siri, and Google Now are being widely used as speech
recognition has improved to human levels. While speech recognition systems can
transcribe speech to text, comprehension of that transcribed text is another task which
is currently a major focus for both academia and industry because of its possible
applications. Moreover, the vast amount of text available throughout the internet is a
major reason why machine comprehension of text is such an important task.
With the growth of Deep Learning methods in the last few years, the field of
Machine Comprehension, and Natural Language Processing (NLP) in general, has
experienced a revolution. While the traditional methods and practices are still prevalent
and form the basis of our deep understanding of languages, Deep Learning methods
have surpassed traditional NLP and Machine Learning methods by a significant
margin and are currently driving the growth of the field.
To be able to build a system that can understand human text, we need to first
ask ourselves how we can evaluate the machine's comprehension ability. We had
initially set a goal to build a chatbot for a specific domain and generalize it to other
topics later. While developing the system, we discovered the necessity of reading
comprehension and the question of how to measure it. We finally found the answer in
Question Answering systems.
Just as we human beings are tested for our ability of language
understanding with questions, we should ask machines similar questions about what they
have just read. The performance of the system on such a question answering task lets
us evaluate how well the machine is able to reason about what it just read [1].
Reading comprehension has been a topic of Natural Language Understanding since the
1970s. In 1977, Wendy Lehnert said in her doctoral thesis: "Only when we can ask a
program to answer questions about what it reads will we be able to begin to access that
program's comprehension" [2].
Figure 1-1 The task of Question Answering
To achieve this task the NLP community has developed various datasets such as
CNN/Daily Mail, WebQuestions, SQuAD, TriviaQA [3], etc. For our purpose we chose
SQuAD, which stands for Stanford Question Answering Dataset [4]. SQuAD consists of
questions posed by crowdworkers on a set of Wikipedia articles, where the answer to
every question is a segment of text, or span, from the corresponding reading passage.
With 100,000+ question-answer pairs on 500+ articles, SQuAD is significantly larger
than previous reading comprehension datasets [5].
An example from the SQuAD dataset is as follows:
Passage: Tesla later approached Morgan to ask for more funds to build a more
powerful transmitter. When asked where all the money had gone, Tesla responded by
saying that he was affected by the Panic of 1901, which he (Morgan) had caused.
Morgan was shocked by the reminder of his part in the stock market crash and by
Tesla's breach of contract by asking for more funds. Tesla wrote another plea to
Morgan, but it was also fruitless. Morgan still owed Tesla money on the original
agreement, and Tesla had been facing foreclosure even before construction of the
tower began.

Question: On what did Tesla blame for the loss of the initial money?

Answer: Panic of 1901
As we started exploring the QA task we faced several challenges. Some of them
we could solve with the help of other research, and some of the challenges still exist in
the domain:

• Out-of-vocabulary words.

• Multi-sentence reasoning may be required.

• There may exist several candidate answers.

• Optimizing the Exact Match (EM) metric may fail when the answer boundary is fuzzy or too long, such as the answer to a "why" query [6].

• "One-hop" prediction may fail to fully understand the query [6].

• Models fail to fully capture the long-distance contextual interaction between parts of the context when using only an LSTM/GRU.

• Current models are unable to capture the semantics of the passage.
In the upcoming chapters we will first briefly review the basics necessary for
understanding the models, then delve into the fundamental models that
have shaped the current state-of-the-art models, then discuss our contribution in
terms of architecture and applications, and finally conclude with future directions.
CHAPTER 2 THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
To build a question answering system one needs to be familiar with
fundamental deep learning models such as Recurrent Neural Networks (RNN), Long
Short-Term Memory (LSTM), etc. In this chapter we will give an overview of these
techniques and see how they all connect in building a question answering system.
Neural Networks
What makes Deep Learning so intriguing is that it closely resembles
the working of the mammalian brain, or at least draws inspiration from it. The same can
be said of Artificial Neural Networks [7], which consist of a system of interconnected
units called 'neurons' that take input from similar units and produce a single output.
Figure 2-1 Simple and Deep Learning Neural Networks [8]
The connection from one neuron to another is weighted based on the input data,
which enables the network to tune itself to produce a certain output for a given input.
This is the learning process, which is achieved through backpropagation: a
method of propagating the error from the output layer back to the previous layers.
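As an illustrative sketch (not from the thesis), the weighted connections and error propagation described above can be reduced to a single sigmoid neuron trained by gradient descent; all values here are toy numbers:

```python
import numpy as np

# One sigmoid neuron: weighted inputs, a forward pass, and the error
# propagated back to the weights (backpropagation in miniature).
rng = np.random.default_rng(0)
w = rng.normal(size=3)           # connection weights
x = np.array([1.0, 0.5, -0.2])   # input from upstream units
target = 1.0                     # desired output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(200):
    y = sigmoid(w @ x)                      # forward pass
    grad = (y - target) * y * (1 - y) * x   # error propagated to weights
    w -= 1.0 * grad                         # gradient-descent update

y_final = sigmoid(w @ x)  # prediction moves toward the target
```

After training, the neuron's output is close to the target, illustrating how weight tuning drives the network toward a desired output.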
Convolutional Neural Network
The first wave of deep learning's success was brought by Convolutional Neural
Networks (CNN) [9], the technique used by the winning team of the ImageNet
competition in 2012. CNNs are deep artificial neural networks (ANN) that can be used to
classify images, cluster them by similarity, and perform object recognition within scenes.
They can be used to detect and identify faces, people, signs, or any other visual data.
Figure 2-2 Convolutional Neural Network Architecture [10]
There are primarily four operations in a standard CNN model (as shown in the figure above):

1. Convolution - The primary purpose of convolution is to extract features from the input image. The spatial relationships between pixels, i.e., the image features, are preserved and learned by the convolution using small squares of input data.

2. Non-Linearity (ReLU) - Rectified Linear Unit (ReLU) is a non-linear operation applied element-wise on each pixel. This operation replaces the negative pixel values in the feature map with zero.

3. Pooling or Sub-Sampling - Spatial pooling reduces the dimensionality of each feature map while retaining the most important information. For max pooling, the largest value in the square window is kept and the rest are dropped. Other types of pooling are average, sum, etc.

4. Classification (Fully Connected Layer) - The fully connected layer is a traditional Multi-Layer Perceptron, as described before, that uses a softmax activation function in the output layer. The high-level features of the image are encoded by the convolutional and pooling layers and are then fed to the fully connected layer, which uses these features to classify the input image into various classes based on the training dataset [10].
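Two of these operations, ReLU and max pooling, are simple enough to sketch directly; the feature-map values below are made up for illustration:

```python
import numpy as np

# Element-wise ReLU and 2x2 max pooling over a single toy feature map.
feature_map = np.array([[ 1., -2.,  3.,  0.],
                        [-1.,  5., -3.,  2.],
                        [ 0.,  1.,  2., -4.],
                        [ 3., -1.,  0.,  1.]])

relu = np.maximum(feature_map, 0)  # negatives replaced by zero

# 2x2 max pooling with stride 2: keep the largest value in each window.
pooled = relu.reshape(2, 2, 2, 2).max(axis=(1, 3))
# pooled is [[5., 3.], [3., 2.]]
```

The reshape splits the 4x4 map into four 2x2 windows, and the reduction over the inner axes keeps only each window's maximum, halving the spatial resolution.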
When a new image is fed into the CNN model, all the above-mentioned steps are
carried out (forward propagation) and a probability distribution is produced over the set of
output classes. With a large enough training dataset, the network will learn and
generalize well enough to classify new images into their correct classes.
Recurrent Neural Networks (RNN)
Whenever we want to predict or encode sequential data, the RNN [11] is our go-to
method. RNNs perform the same task for every element of a sequence, where the
output for each element depends on previous computations, hence the recurrence. A
standard RNN computes its hidden state as

h_t = tanh(W_h h_(t-1) + W_x x_t)    (2-1)

In practice, RNNs are unable to retain long-term dependencies and can look back only a
few steps because of the vanishing gradient problem.
A solution to the dependency problem is to use gated cells such as the LSTM [12] or
the GRU [13]. These cells pass on important information to the next cells while ignoring
unimportant information. The gated units in a GRU block are:

• Update Gate - computed based on the current input and hidden state:

z_t = σ(W_z x_t + U_z h_(t-1))    (2-2)

• Reset Gate - calculated similarly but with different weights:

r_t = σ(W_r x_t + U_r h_(t-1))    (2-3)

• New memory content:

h̃_t = tanh(W x_t + r_t ∘ U h_(t-1))    (2-4)

If the reset gate unit is ~0, then the previous memory is ignored and only the new
information is kept.

The final memory at the current time step combines the previous and current time steps:

h_t = z_t ∘ h_(t-1) + (1 - z_t) ∘ h̃_t    (2-5)
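Assuming the standard GRU formulation behind equations (2-2) through (2-5), one update step can be written as follows; the weights here are random toy parameters, not trained values:

```python
import numpy as np

# One GRU step following the update-gate / reset-gate / new-memory
# structure of equations (2-2)-(2-5).
def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x, h_prev, Wz, Uz, Wr, Ur, W, U):
    z = sigmoid(Wz @ x + Uz @ h_prev)           # update gate (2-2)
    r = sigmoid(Wr @ x + Ur @ h_prev)           # reset gate (2-3)
    h_new = np.tanh(W @ x + r * (U @ h_prev))   # new memory content (2-4)
    return z * h_prev + (1 - z) * h_new         # final memory (2-5)

rng = np.random.default_rng(1)
d = 4
params = [rng.normal(scale=0.1, size=(d, d)) for _ in range(6)]
h = gru_step(rng.normal(size=d), np.zeros(d), *params)
```

With a zero previous state, the output is simply the gated new memory, so every component stays strictly inside (-1, 1).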
While the GRU is computationally efficient, the LSTM on the other hand is the more
general case, with three gates as follows:

• Input Gate - what new information to add to the current cell state.

• Forget Gate - how much information from previous states should be kept.

• Output Gate - how much information should be sent to the next states.

Just like the GRU, the current cell state is a sum in which the previous cell state is
weighted by the forget gate and the new value is weighted by the input
gate. Based on the cell state, the output gate regulates the final output.
Word Embedding
Computation and gradients can be applied to numbers, not to words or letters.
So we first need to convert words into corresponding numerical representations before
feeding them into a deep learning model. In general there are two types of word embedding:
frequency-based (which comprises count vectors, tf-idf, and co-occurrence vectors)
and prediction-based. With frequency-based embeddings the order of the words is not
preserved, and they work as a bag-of-words model. With prediction-based models,
the order, or locality, of the words is taken into consideration to generate the
numerical representation of a word. Within the prediction-based category there are
two fundamental techniques, called Continuous Bag of Words (CBOW) and the Skip-Gram
model, which form the basis for word2vec [14] and GloVe [15].
The basic intuition behind word2vec is that if two different words have very
similar "contexts" (that is, the words that are likely to appear around them), then the model
will produce similar vectors for those words. Conversely, if the two word vectors are
similar, then the network will produce similar context predictions for the two words.
For example, synonyms like "intelligent" and "smart" would have very similar contexts,
and related words like "engine" and "transmission" would probably have
similar contexts as well [16]. Plotting the word vectors learned by word2vec over a
large corpus, we can find some very interesting relationships between words.
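A toy illustration of this intuition: with hypothetical 3-dimensional vectors (real word2vec embeddings have hundreds of dimensions, and these values are invented), cosine similarity ranks the synonym pair above the unrelated pair:

```python
import numpy as np

# Hypothetical toy embeddings; values chosen so the synonyms point in
# nearly the same direction while "engine" points elsewhere.
vectors = {
    "intelligent": np.array([0.90, 0.80, 0.10]),
    "smart":       np.array([0.85, 0.75, 0.20]),
    "engine":      np.array([0.10, 0.20, 0.90]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_synonyms = cosine(vectors["intelligent"], vectors["smart"])
sim_unrelated = cosine(vectors["intelligent"], vectors["engine"])
```

Cosine similarity depends only on direction, which is why it is the standard measure of semantic closeness between word vectors.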
Figure 2-3 Semantic relation between words in vector space [17]
Attention Mechanism
We as humans pay attention to things that are important or relevant in a
context. For example, when asked a question about a passage, we try to find the most
relevant part of the passage to which the question relates and then reason from our
understanding of that part of the passage. The same idea applies to the attention
mechanism in Deep Learning. It is used to identify the specific parts of a given context
to which the current question is relevant.
Formally put, the technique takes n arguments y_1, ..., y_n (in our case the
word representations of the passage) and a question representation q. It returns a
vector z which is supposed to be the "summary" of the y_i, focusing on information
linked to the question q. More formally, it returns a weighted arithmetic mean of the y_i,
where the weights are chosen according to the relevance of each y_i given the context c
[18].
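This weighted arithmetic mean can be sketched in a few lines; the dot-product scoring used here is one simple choice among many, and the vectors are random stand-ins:

```python
import numpy as np

# Attention as a softmax-weighted mean: score each y_i against q,
# normalize the scores, and combine the y_i into a summary vector z.
def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def attend(q, ys):
    scores = ys @ q               # relevance of each y_i to q
    alpha = softmax(scores)       # weights are positive and sum to 1
    return alpha @ ys, alpha      # z = sum_i alpha_i * y_i

rng = np.random.default_rng(2)
ys = rng.normal(size=(5, 8))      # 5 passage word vectors
q = rng.normal(size=8)            # question vector
z, alpha = attend(q, ys)
```

Because the weights sum to one, z always stays inside the convex hull of the y_i: it is literally a weighted mean of the passage representations.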
Figure 2-4 Attention Mechanism flow [18]
Memory Networks
While Convolutional Neural Networks and Recurrent Neural Networks do
capture how we form our visual and sequential memories, their memory (encoded by
hidden states and weights) is typically too small and is not compartmentalized
enough to accurately remember facts from the past (knowledge is compressed into
dense vectors) [19].
Deep Learning needed a methodology that preserves memories as
they are, such that they are not lost in generalization and such that recalling exact words or
sequences of events remains possible, something computers are already good at. This
effort led to Memory Networks [19], published at ICLR 2015 by Facebook AI
Research.

This paper provides a basic framework to store, augment, and retrieve memories
while seamlessly working with a Recurrent Neural Network architecture. The memory
network consists of a memory m (an array of objects indexed by m_i) and four
(potentially learned) components I, G, O, and R, as follows:
I (input feature map): converts the incoming input to the internal feature
representation, either a sparse or dense feature vector like that from word2vec or
GloVe.

G (generalization): updates old memories given the new input. The authors call this
generalization, as there is an opportunity for the network to compress and generalize its
memories at this stage for some intended future use.

O (output feature map): produces a new output (in the feature representation
space) given the new input and the current memory state. This component is
responsible for performing inference. In a question answering system this part
selects the candidate sentences (which might contain the answer) from the story
(conversation) so far.

R (response): converts the output into the desired response format, for
example a textual response or an action. In the QA system described, this component
finds the desired answer and then converts it from the feature representation to the actual
word.
This model is a fully supervised model, meaning all the candidate sentences from
which the answer could be found are marked during the training phase; this can also be
termed 'hard attention'.
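A schematic, deliberately simplified rendering of the four components, using bag-of-words overlap in place of the paper's learned embeddings (the facts and scoring are toy stand-ins, not the actual model):

```python
# Toy Memory Network skeleton: I converts text to features, G stores a
# memory, O retrieves the best-supporting memory, R extracts an answer.
class MemoryNetwork:
    def __init__(self):
        self.memory = []                       # m: array of stored facts

    def I(self, text):                         # input feature map
        return set(text.lower().split())

    def G(self, features, text):               # generalization: store memory
        self.memory.append((features, text))

    def O(self, q_features):                   # output: best supporting memory
        return max(self.memory, key=lambda m: len(m[0] & q_features))

    def R(self, support, q_features):          # response: pick an answer word
        words = support[0] - q_features
        return sorted(words)[0] if words else ""

net = MemoryNetwork()
for fact in ["Frodo took the ring", "Sam went to the Shire"]:
    net.G(net.I(fact), fact)

question = net.I("who took the ring")
support = net.O(question)          # the fact with most word overlap
answer = net.R(support, question)  # "frodo"
```

The real model scores memories with learned embeddings rather than word overlap, but the I/G/O/R pipeline is the same: store facts verbatim, retrieve support, then produce a response.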
The authors tested the QA system on various literature, including Lord of the
Rings.
Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]
CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART
There has been rapid progress since the release of the SQuAD dataset, and
currently the best ensemble models are close to human-level accuracy in machine
comprehension. This is due to various ingenious methods which solve some of the
problems of the previous methods: out-of-vocabulary tokens were handled by using
character embeddings; long-term dependency within the context passage was addressed
using self-attention; and many other techniques such as contextualized vectors, history
of words, attention flow, etc. were introduced. In this section we will look at some of the most
important models that were fundamental to the progress of Question Answering.
Machine Comprehension Using Match-LSTM and Answer Pointer
In this paper [20] the authors propose an end-to-end neural architecture for the
QA task. The architecture is based on match-LSTM [21], a model they previously proposed for
textual entailment, and Pointer Net [22], a sequence-to-sequence model proposed by
Vinyals et al. (2015) to constrain the output tokens to be from the input sequences.
The model consists of an LSTM preprocessing layer, a match-LSTM layer, and an
Answer Pointer layer.

We are given a piece of text, which we refer to as a passage, and a question
related to the passage. The passage is represented by a matrix P, where P is the length
(number of tokens) of the passage and d is the dimensionality of the word embeddings.
Similarly, the question is represented by a matrix Q, where Q is the length of the question.
The goal is to identify a subsequence from the passage as the answer to the question.
Figure 3-1 Match-LSTM Model Architecture [20]
LSTM Preprocessing Layer: They use a standard one-directional LSTM
(Hochreiter & Schmidhuber, 1997) to process the passage and the question separately.

Match-LSTM Layer: They apply the match-LSTM model proposed for textual
entailment to the machine comprehension problem by treating the question as a
premise and the passage as a hypothesis. The match-LSTM sequentially goes through
the passage. At position i of the passage, it first uses the standard word-by-word
attention mechanism to obtain an attention weight vector as follows:
(3-1)

where the weight matrices and bias terms in (3-1) are parameters to be learned.
Answer Pointer Layer: The Answer Pointer (Ans-Ptr) layer is motivated by the
Pointer Net introduced by Vinyals et al. (2015) [22]. The boundary model produces only
the start token and the end token of the answer, and then all the tokens between these
two in the original passage are considered to be the answer.
When this paper was released back in November 2016, the match-LSTM method
was the state of the art in Question Answering and was at the top of the
leaderboard for the SQuAD dataset.
R-NET: Machine Reading Comprehension with Self-Matching Networks
In this model [23], the question and passage are first processed separately by a
bidirectional recurrent network (Mikolov et al., 2010). The authors then match the
question and passage with gated attention-based recurrent networks, obtaining a
question-aware representation of the passage. On top of that, they apply self-matching
attention to aggregate evidence from the whole passage and refine the passage
representation, which is then fed into the output layer to predict the boundary of the
answer span.

Question and Passage Encoding: First the words are converted to their
respective word-level embeddings and character-level embeddings. The character-level
embeddings are generated by taking the final hidden states of a bi-directional recurrent
neural network (RNN) applied to the embeddings of the characters in the token. Such
character-level embeddings have been shown to be helpful in dealing with out-of-vocabulary
(OOV) tokens.
They then use a bi-directional RNN to produce new representations of
all words in the question and passage, respectively.
Figure 3-2 The task of Question Answering [23]
Gated Attention-Based Recurrent Networks: They use a variant of attention-
based recurrent networks with an additional gate to determine the importance of
passage information with regard to a question. Different from the gates in an LSTM or
GRU, the additional gate is based on the current passage word and its attention-pooling
vector of the question, which focuses on the relation between the question and the current
passage word. The gate effectively models the phenomenon that only parts of the
passage are relevant to the question, and this gated representation
is utilized in subsequent calculations.
Self-Matching Attention: In the previous step, the question-aware passage
representation is generated to highlight the important parts of the passage. One
problem with such a representation is that it has very limited knowledge of context: an
answer candidate is often oblivious to important cues in the passage outside its
surrounding window. To address this problem the authors propose directly matching
the question-aware passage representation against itself. It dynamically collects
evidence from the whole passage for each word in the passage and encodes the evidence
relevant to the current passage word, and its matching question information, into the
passage representation.
Output Layer: They use the same method as Wang & Jiang (2016b) and use
pointer networks (Vinyals et al., 2015) to predict the start and end positions of the
answer. In addition, they use attention-pooling over the question representation to
generate the initial hidden vector for the pointer network [23].

When the R-NET model first appeared on the leaderboard in March 2017 it was at
the top, with a 72.3 Exact Match and an 80.7 F1 score.
Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
BiDAF [24] is a hierarchical multi-stage architecture for modeling the
representations of the context paragraph at different levels of granularity. BiDAF
includes character-level, word-level, and contextual embeddings, and uses bi-directional
attention flow to obtain a query-aware context representation. Their attention layer is not
used to summarize the context paragraph into a fixed-size vector. Instead, the attention
is computed for every time step, and the attended vector at each time step, along with
the representations from previous layers, can flow through to the subsequent modeling
layer. This reduces the information loss caused by early summarization [24].
Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]
Their machine comprehension model is a hierarchical multi-stage process and
consists of six layers:

1. Character Embedding Layer: maps each word to a vector space using character-level CNNs.

2. Word Embedding Layer: maps each word to a vector space using a pre-trained word embedding model.

3. Contextual Embedding Layer: utilizes contextual cues from surrounding words to refine the embeddings of the words. These first three layers are applied to both the query and the context. They use an LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words, placing an LSTM in both directions and concatenating the outputs of the two LSTMs.

4. Attention Flow Layer: couples the query and context vectors and produces a set of query-aware feature vectors for each word in the context.

5. Modeling Layer: employs a Recurrent Neural Network to scan the context. The input to the modeling layer is G, which encodes the query-aware representations of the context words. The output of the modeling layer captures the interaction among the context words conditioned on the query. They use two layers of bi-directional LSTM with an output size of d for each direction. The resulting matrix is passed on to the output layer to predict the answer.

6. Output Layer: provides an answer to the query. They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices under the predicted distributions, averaged over all examples [25].
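The loss used in the output layer can be illustrated directly; the score vectors below are made-up examples, not model outputs:

```python
import numpy as np

# Span loss: negative log probability of the true start and end indices
# under softmax distributions over the predicted scores.
def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def span_loss(start_scores, end_scores, true_start, true_end):
    p_start = softmax(start_scores)
    p_end = softmax(end_scores)
    return -(np.log(p_start[true_start]) + np.log(p_end[true_end]))

start_scores = np.array([0.1, 3.0, 0.2, 0.1])  # model favors token 1 as start
end_scores = np.array([0.1, 0.2, 2.5, 0.1])    # model favors token 2 as end
loss = span_loss(start_scores, end_scores, true_start=1, true_end=2)
```

When the true span matches the tokens the model already favors, the loss is small; pointing the labels at low-probability tokens makes it larger, which is exactly the gradient signal that trains the span predictor.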
In a further variation of the above work, they add a self-attention layer after the
bi-attention layer to further improve the results [25]. The architecture of that model
otherwise follows Figure 3-3.
Summary
In this chapter we reviewed the methods that are fundamental to the state of the
art in Machine Comprehension and the task of Question Answering. We have
come close to human-level accuracy, and this is due to incremental developments
over previous models. As we saw, out-of-vocabulary (OOV) tokens were handled by
using character embeddings, long-term dependency within the context passage was
addressed using self-attention, and many other techniques such as contextualized vectors,
attention flow, etc. were employed to get better results. In the next chapter we will see
how we can build on these models and develop them further.
CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING
Although the advancements in Question Answering systems since the release of
the SQuAD dataset have been impressive, and the results are getting close to human-level
accuracy, this is far from being a fool-proof system. The models still make mistakes
which would be obvious to a human. For example:

Passage: The Panthers used the San Jose State practice facility and stayed at
the San Jose Marriott. The Broncos practiced at Florida State Facility and stayed at the
Santa Clara Marriott.

Question: At what university's facility did the Panthers practice?

Actual Answer: San Jose State

Predicted Answer: Florida State Facility
To find out what was leading to the wrong predictions, we wanted to see the
attention weights associated with such an example. We plotted the passage and
question heat map, a 2D matrix where the intensity of each cell signifies the
similarity between a passage word and a question word. For the above example we
found that while certain words of the question are given high weight, other parts
are not. The words 'At', 'facility', and 'practice' are given high attention, but 'Panthers' does
not receive high attention. If it had received high attention, then the system would have
predicted 'San Jose State' as the right answer. To solve this issue we analyzed the
base BiDAF model and proposed adding two things:

1. Bi-attention and self-attention over the query.

2. A second level of attention over the outputs of (bi-attention + self-attention) from both the context and the query.
Multi-Attention BiDAF Model
Our comprehension model is a hierarchical multi-stage process and consists of
the following layers:

1. Embedding: Just as in other models, we embed words using pretrained word vectors. We also embed the characters in each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to get character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training.

2. Pre-Process: A shared bi-directional GRU (Cho et al., 2014) is used to map the question and passage embeddings to context-aware embeddings.

3. Context Attention: The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al., 2016) is used to build a query-aware context representation. Let h_i be the vector for context word i, q_j be the vector for question word j, and n_q and n_c be the lengths of the question and context respectively. We compute attention between context word i and question word j as

a_ij = w1 · h_i + w2 · q_j + w3 · (h_i ∘ q_j)    (4-1)

where w1, w2, and w3 are learned vectors and ∘ is element-wise multiplication. We then compute an attended vector c_i for each context token as

p_ij = exp(a_ij) / Σ_j exp(a_ij),  c_i = Σ_j p_ij q_j    (4-2)

We also compute a query-to-context vector q_c:

m_i = max_j a_ij,  p_i = exp(m_i) / Σ_i exp(m_i),  q_c = Σ_i p_i h_i    (4-3)

The final vector computed for each token is built by concatenating h_i, c_i, and their element-wise products with the attended vectors. In our model we subsequently pass the result through a linear layer with ReLU activations.

4. Context Self-Attention: Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU. Then we apply the same attention mechanism, only now between the passage and itself. In this case we do not use query-to-context attention, and we set a_ij = -inf if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so its output is additionally summed with the input.

5. Query Attention: For this part we proceed the same way as in context attention, but calculate the weighted sum of the context words for each query word. Thus we get a sequence whose length is the number of query words. Then we calculate context-to-query attention, similar to the query-to-context attention in the context attention layer.

Figure 4-1 The modified BiDAF model with multilevel attention

6. Query Self-Attention: This part is done the same way as the context self-attention layer, but on the output of the query attention layer.

7. Context-Query Bi-Attention + Self-Attention: The outputs of the context self-attention and the query self-attention layers are taken as input, and the same process of bi-attention and self-attention is applied to these inputs.

8. Prediction: In the last layer of our model, a bidirectional GRU is applied, followed by a linear layer that computes answer start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log likelihood of selecting correct start and end tokens.
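The attention of equations (4-1) through (4-3), together with the diagonal mask used in the self-attention layer, can be sketched with random toy vectors; this is a sketch of the mechanism, not the trained model:

```python
import numpy as np

# Trilinear attention a_ij = w1.h_i + w2.q_j + w3.(h_i * q_j), as in (4-1),
# followed by the attended vectors of (4-2)/(4-3) and a masked self-attention.
def softmax(s, axis=-1):
    e = np.exp(s - s.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(3)
n_c, n_q, d = 6, 3, 8
h = rng.normal(size=(n_c, d))     # context word vectors
q = rng.normal(size=(n_q, d))     # question word vectors
w1, w2, w3 = rng.normal(size=(3, d))  # learned vectors (random here)

# (4-1): full n_c x n_q similarity matrix via broadcasting
a = (h @ w1)[:, None] + (q @ w2)[None, :] + (h * w3) @ q.T

c = softmax(a, axis=1) @ q        # (4-2): attended vector per context token
q_c = softmax(a.max(axis=1)) @ h  # (4-3): single query-to-context vector

# Self-attention (layer 4): same scoring between the passage and itself,
# with the diagonal set to -inf so a word cannot attend to itself.
a_self = (h @ w1)[:, None] + (h @ w2)[None, :] + (h * w3) @ h.T
np.fill_diagonal(a_self, -np.inf)
p_self = softmax(a_self, axis=1)  # diagonal weights become exactly 0
self_attended = p_self @ h
```

The -inf mask is the standard trick for excluding a position from a softmax: its exponential is zero, so the word's own representation contributes nothing to its self-attended vector.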
Having carried out this modification, we were able to solve the wrong example we
started with: the multilevel attention model gives the correct output, "San Jose
State". We also achieved slightly better scores than the original model, with an F1 score
of 85.44 on the SQuAD dev dataset.
Chatbot Design Using a QA System
Designing a perfect chatbot that passes the Turing test is a fundamental goal of
Artificial Intelligence. Although we are many orders of magnitude away from achieving
such a goal, domain-specific tasks can be solved with chatbots made from current
technology. We had started with a similar objective in mind, i.e., to design a
domain-specific chatbot and then generalize to other areas once it could achieve the
first domain-specific objective robustly. This led us to the fundamental problem of
Machine Comprehension and subsequently to the task of Question Answering. Having
achieved some degree of success with QA systems, we looked back at whether we could apply
our newly acquired knowledge to the task of designing chatbots.
The chatbots made with today's technologies mostly use handcrafted techniques
such as template matching, which requires anticipating all possible ways a user may
articulate his requirements and a conversation may unfold. This requires many man-hours
for designing a domain-specific system and is still very error-prone. In this section
we propose a general chatbot design that would make the design of a domain-specific
chatbot easy and robust at the same time.
Every domain-specific chatbot needs to obtain a set of information from the user
and show some results based on the user-specific information obtained. Traditional
chatbots use template matching and keyword lookup to determine whether the user has
provided the required information. Our idea is to use the Question Answering system in
the backend to extract the required information from whatever the user has typed
up to this point in the conversation. The information to be extracted can be posed in the
form of a set of questions, and the answers obtained from those questions can be used
as the parameters to supply the relevant information to the user.
We chose the flight reservation system as our chatbot domain. Our goal
was to extract the required information from the user to be able to show the
available flights as per the user's requirements.
Figure 4-2 Flight reservation chatbotrsquos chat window
For a flight reservation task, the booking agent needs to know, at minimum, the origin
city, destination city, and date of travel to be able to show the available flights.
Optional information includes the number of tickets, passenger's name, one-way or
round trip, etc.

The minimal conversation with the user through the chat window would be as
shown above. We had a platform called OneTask on which we wanted to implement our
chatbot. The chat interface within the OneTask system looks as follows:
Figure 4-3 Chatbot within OneTask system
The working of the chat system is as follows ndash
1 Initiation The user opens the chat window which starts a session with the chat bot The chat bot asks lsquoHow may I help yoursquo
2 User Reply The user may reply with none of the required information for flight booking or may reply with multiple information in the same message
3 User Reply parsing The conversation up to this point is treated as a passage and the internal questions are run on this passage So the four questions that are run are
Where do you want to go
35
From where do you want to leave
When do you want to depart
4 Parsed responses from QA model After running the questions on the QA system with the given conversation answers are obtained for all along with their corresponding confidence values Since answers will be obtained even if the required question was not answered up to this point in the conversation the confidence values plays a crucial role in determining the validity of the answer For this purpose we determined ranges of confidence values for validation A confidence value above 10 signifies the question has been answered correctly A confidence of 2 ndash 10 signifies that it may have been answered but should verify with the user for correctness and any confidence below 2 is discarded
5. Asking remaining questions iteratively: After the parsing, it is checked whether any of the required questions are still unanswered. If so, the chatbot asks the remaining questions, and steps 3-4 are carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per the request.
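The validation logic in step 4 can be sketched as follows (a minimal illustration; the thresholds 10 and 2 come from the description above, while the function names and data structures are hypothetical):

```python
# Hypothetical sketch of the confidence-based answer validation in step 4.
REQUIRED_QUESTIONS = [
    "Where do you want to go?",
    "From where do you want to leave?",
    "When do you want to depart?",
]

ACCEPT_THRESHOLD = 10.0   # above this: accept the answer as correct
VERIFY_THRESHOLD = 2.0    # between 2 and 10: confirm with the user

def validate_answer(confidence):
    """Map a QA confidence score to an action for the chatbot."""
    if confidence > ACCEPT_THRESHOLD:
        return "accept"
    elif confidence >= VERIFY_THRESHOLD:
        return "verify"    # an answer was found, but confirm it with the user
    else:
        return "discard"   # treat the question as still unanswered

def unanswered(parsed):
    """Return the required questions whose answers were not accepted.

    `parsed` maps each question to an (answer, confidence) pair produced
    by the QA model on the conversation-so-far passage.
    """
    remaining = []
    for q in REQUIRED_QUESTIONS:
        _answer, conf = parsed.get(q, (None, 0.0))
        if validate_answer(conf) != "accept":
            remaining.append(q)
    return remaining
```

Questions returned by `unanswered` are the ones the chatbot asks in step 5 before re-running the parsing loop.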
Figure 4-4 The Flow diagram of the Flight booking Chatbot system
Online QA System and Attention Visualization
To be able to test various examples, we used the BiDAF [24] model for an online demo. One can either choose from the available examples in the drop-down menu or paste in their own passage and questions. While this is a useful and interesting system for testing the model in a user-friendly way, we created it to be able to focus on the wrongly answered samples.
Figure 4-5 QA system interface with attention highlight over candidate answers
The candidate answers are shown with blue highlights according to their confidence values: the higher the confidence, the darker the highlight. The answer with the highest confidence value is chosen as the predicted answer. We developed the system to show the attention spread over the candidate answers in order to understand what needs to be done to improve the system. This led us to realize the importance of including the query attention part as well as multilevel attention in the BiDAF model, as described in the first section of this chapter.
CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions
Having done a thorough analysis of the current methods and the state of the art in Machine Comprehension and the QA task, and having developed systems on it, we have gained a strong sense of what needs to be done to further improve QA models. After observing the wrongly answered samples, we can see that the system is still unable to encode meaning and is picking answers based on statistical patterns of answer occurrence in the training examples.
Going by the state-of-the-art models and the ongoing research literature, it is easy to conclude that more features need to be embedded to encode the meaning of words, phrases, and sentences. The Reinforced Mnemonic Reader for Machine Comprehension [6] encoded the POS and NER tags of words along with their word and character embeddings, which gave better results.
Figure 5-1 An English language semantic parse tree [26]
We have developed a method to encode the syntax parse tree of a sentence that encodes not only the POS tags but also the relation of each word within its phrase and the relation of each phrase within the whole sentence, in a hierarchical manner.
Finally, data augmentation is another way to get better results. One definite way to reduce the errors would be to include in the training data samples similar to those the system is faltering on in the dev set: one could generate examples similar to the failure cases and add them to the training set for better prediction. Another approach would be to train on similar and bigger datasets. Our models were trained on the SQuAD dataset, but there are other datasets for the same question answering task, such as TriviaQA. We could augment the SQuAD training set with that of TriviaQA to build a more robust system that generalizes better and thus predicts answer spans with higher accuracy.
Conclusion
In this work we have explored the most fundamental techniques that have shaped the current state of the art. We then proposed a minor architectural improvement over an existing model. Furthermore, we developed two applications that use the base model: first, we discussed how a chatbot application can be built using the QA system, and second, we created a web interface where the model can be used with any passage and question. This interface also shows the attention spread over the candidate answers. While our effort to push the state of the art forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will pay high dividends for society at large.
LIST OF REFERENCES
[1] Chen, Danqi. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U
[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.
[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension." arXiv preprint arXiv:1705.03551 (2017).
[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ questions for machine comprehension of text." arXiv preprint arXiv:1606.05250 (2016).
[5] "The Stanford Question Answering Dataset." 2018. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/
[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced mnemonic reader for machine comprehension." CoRR abs/1705.02798 (2017).
[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.
[8] Mehrotra, Neetesh. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai
[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.
[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
[11] Williams, Ronald J., and David Zipser. "A learning algorithm for continually running fully recurrent neural networks." Neural Computation 1, no. 2 (1989): 270-280.
[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9, no. 8 (1997): 1735-1780.
[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv preprint arXiv:1412.3555 (2014).
[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic regularities in continuous space word representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.
[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.
[16] "Word2Vec Tutorial - The Skip-Gram Model." Chris McCormick. 2016. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
[17] "[Paper Introduction] Bilingual Word Representations with Monolingual Quality in Mind." 2015. SlideShare. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind
[18] "Attention Mechanism." 2016. Heuritech Blog. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism/
[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. "Memory networks." arXiv preprint arXiv:1410.3916 (2014). Accessed March 16, 2018. https://arxiv.org/abs/1410.3916
[20] Wang, Shuohang, and Jing Jiang. "Machine comprehension using match-LSTM and answer pointer." arXiv preprint arXiv:1608.07905 (2016).
[21] Wang, Shuohang, and Jing Jiang. "Learning natural language inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).
[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.
[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated self-matching networks for reading comprehension and question answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.
[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional attention flow for machine comprehension." arXiv preprint arXiv:1611.01603 (2016).
[25] Clark, Christopher, and Matt Gardner. "Simple and effective multi-paragraph reading comprehension." arXiv preprint arXiv:1710.10723 (2017).
[26] "8. Analyzing Sentence Structure." Natural Language Processing with Python. 2018. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata, India, and has had a strong interest in computers since an early age. After high school, he did his B.Sc. in computer science at Ramakrishna Mission Residential College, Narendrapur, followed by an M.Sc. in computer science at St. Xavier's College, Kolkata. He had a strong intuition and interest for human-like learning systems and wanted to work in this area. He started working at TCS Innovation Labs, Pune, on the application of Natural Language Processing in educational applications. As he was deeply passionate about learning systems that mimic the human brain and learn like a human child does, he became increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science at the University of Florida, Gainesville. His academic interests have been focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF FIGURES
LIST OF ABBREVIATIONS
ABSTRACT

CHAPTER

1 INTRODUCTION

2 THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
    Neural Networks
    Convolutional Neural Network
    Recurrent Neural Networks (RNN)
    Word Embedding
    Attention Mechanism
    Memory Networks

3 LITERATURE REVIEW AND STATE OF THE ART
    Machine Comprehension Using Match-LSTM and Answer Pointer
    R-NET: Machine Reading Comprehension with Self-Matching Networks
    Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
    Summary

4 MULTI-ATTENTION QUESTION ANSWERING
    Multi-Attention BiDAF Model
    Chatbot Design Using a QA System
    Online QA System and Attention Visualization

5 FUTURE DIRECTIONS AND CONCLUSION
    Future Directions
    Conclusion

LIST OF REFERENCES
BIOGRAPHICAL SKETCH
LIST OF FIGURES
Figure
1-1 The task of Question Answering
2-1 Simple and Deep Learning Neural Networks
2-2 Convolutional Neural Network Architecture
2-3 Semantic relation between words in vector space
2-4 Attention Mechanism flow
2-5 QA example on Lord of the Rings using Memory Networks
3-1 Match-LSTM Model Architecture
3-2 The task of Question Answering
3-3 Bi-Directional Attention Flow Model Architecture
4-1 The modified BiDAF model with multilevel attention
4-2 Flight reservation chatbot's chat window
4-3 Chatbot within OneTask system
4-4 The flow diagram of the flight booking chatbot system
4-5 QA system interface with attention highlight over candidate answers
5-1 An English language semantic parse tree
LIST OF ABBREVIATIONS
BiDAF Bi-Directional Attention Flow
CNN Convolutional Neural Network
GRU Gated Recurrent Units
LSTM Long Short Term Memory
NLP Natural Language Processing
NLU Natural Language Understanding
RNN Recurrent Neural Network
Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Science
QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM
By
Purnendu Mukherjee
May 2018
Chair: Xiaolin Li
Major: Computer Science
Question Answering (QA) systems have grown rapidly over the last three years and are close to reaching human-level accuracy. One of the fundamental reasons for this growth has been the use of the attention mechanism along with other methods of Deep Learning. But just as with other Deep Learning methods, some of the failure cases are so obvious that they convince us there is a lot to improve upon. In this work we first present a literature review of the state-of-the-art and fundamental models in QA systems. Next, we introduce an architecture which has shown improvement in the targeted area. We then introduce a general method to enable easy design of domain-specific chatbot applications and present a proof of concept built with that method. Finally, we present an easy-to-use Question Answering interface with attention visualization over the passage. We also propose a method to improve the current state of the art as part of our ongoing work.
CHAPTER 1 INTRODUCTION
Teaching machines to read and understand human natural language is a central and long-standing goal of Natural Language Processing and of Artificial Intelligence in general. The positive impact of machines being able to reason about and comprehend human language could be enormous. We are beginning to see how commercial systems like Alexa, Siri, Google Now, etc. are being widely used as Speech Recognition has improved to human levels. While speech recognition systems can transcribe speech to text, comprehension of that transcribed text is another task which is currently a major focus for both academia and industry because of its possible applications. Moreover, all the text information available throughout the internet is a major reason why machine comprehension of text is such an important task.
With the growth of Deep Learning methods in the last few years, the field of Machine Comprehension, and Natural Language Processing (NLP) in general, has experienced a revolution. While the traditional methods and practices are still prevalent and form the basis of our deep understanding of languages, Deep Learning methods have surpassed all traditional NLP and Machine Learning methods by a significant margin and are currently driving the growth of the field.
To be able to build a system that can understand human text, we first need to ask ourselves how we can evaluate a machine's comprehension ability. We had initially set a goal to build a chatbot for a specific domain and generalize to other topics as we went ahead. While developing the system, we found out the necessity of reading comprehension and how to measure it. We finally found the answer in Question Answering systems.
Just as we human beings are tested for our language understanding ability with questions, we should ask machines similar questions about what they have just read. The performance of the system on such a question answering task lets us evaluate how well the machine can reason about what it just read [1]. Reading comprehension has been a topic of Natural Language Understanding since the 1970s. In 1977, Wendy Lehnert said in her doctoral thesis: "Only when we can ask a program to answer questions about what it reads will we be able to begin to access that program's comprehension" [2].
Figure 1-1 The task of Question Answering
To achieve this task, the NLP community has developed various datasets such as CNN/Daily Mail, WebQuestions, SQuAD, TriviaQA [3], etc. For our purpose we chose SQuAD, which stands for Stanford Question Answering Dataset [4]. SQuAD consists of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. With 100,000+ question-answer pairs on 500+ articles, SQuAD is significantly larger than previous reading comprehension datasets [5].
An example from the SQuAD dataset is as follows:
Passage: Tesla later approached Morgan to ask for more funds to build a more powerful transmitter. When asked where all the money had gone, Tesla responded by saying that he was affected by the Panic of 1901, which he (Morgan) had caused. Morgan was shocked by the reminder of his part in the stock market crash and by Tesla's breach of contract by asking for more funds. Tesla wrote another plea to Morgan, but it was also fruitless. Morgan still owed Tesla money on the original agreement, and Tesla had been facing foreclosure even before construction of the tower began.
Question: What did Tesla blame for the loss of the initial money?
Answer: Panic of 1901
As we started exploring the QA task, we faced several challenges. Some of them we could solve with the help of other research, and some still remain open in the domain:
• Out-of-vocabulary words
• Multi-sentence reasoning may be required
• There may exist several candidate answers
• Optimizing the Exact Match (EM) metric may fail when the answer boundary is fuzzy or too long, such as the answer to a "why" query [6]
• "One-hop" prediction may fail to fully understand the query [6]
• Using only an LSTM/GRU fails to fully capture the long-distance contextual interaction between parts of the context
• Current models are unable to capture the semantics of the passage
In the upcoming chapters, we will first briefly review the basics necessary for understanding the models; then delve into the fundamental models that have shaped the current state of the art; then discuss our contributions in terms of architecture and applications; and finally conclude with future directions.
CHAPTER 2 THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
To build a question answering system, one needs to be familiar with fundamental deep learning models such as Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), etc. In this chapter we will give an overview of these techniques and see how they all connect in building a question answering system.
Neural Networks
What makes Deep Learning so intriguing is that it closely resembles the working of the mammalian brain, or at least draws inspiration from it. The same can be said for Artificial Neural Networks [7], which consist of a system of interconnected units called "neurons" that take input from similar units and produce a single output.
Figure 2-1 Simple and Deep Learning Neural Networks [8]
The connection from one neuron to another can be weighted based on the input data, which enables the network to tune itself to produce a certain output for a given input. This is the learning process, which is achieved through backpropagation, a method of propagating the error from the output layer back to the previous layers.
Convolutional Neural Network
The first wave of deep learning's success was brought by Convolutional Neural Networks (CNN) [9], the technique used by the winning team of the ImageNet competition in 2012. CNNs are deep artificial neural networks (ANN) that can be used to classify images, cluster them by similarity, and perform object recognition within scenes. They can be used to detect and identify faces, people, signs, or any other visual data.
Figure 2-2 Convolutional Neural Network Architecture [10]
There are primarily four operations in a standard CNN model (as shown in the figure above):
1. Convolution: The primary purpose of convolution in the ConvNet above is to extract features from the input image. The spatial relationship between pixels, i.e., the image features, is preserved and learned by the convolution using small squares of input data.
2. Non-linearity (ReLU): Rectified Linear Unit (ReLU) is a non-linear operation that carries out an element-wise operation on each pixel, replacing the negative pixel values in the feature map by zero.
3. Pooling or subsampling: Spatial pooling reduces the dimensionality of each feature map while retaining the most important information. For max pooling, the largest value in each square window is kept and the rest are dropped. Other types of pooling are average, sum, etc.
4. Classification (fully connected layer): The fully connected layer is a traditional Multi-Layer Perceptron, as described before, that uses a softmax activation function in the output layer. The high-level features of the image are encoded by the convolutional and pooling layers and then fed to the fully connected layer, which uses these features to classify the input image into various classes based on the training dataset [10].
When a new image is fed into the CNN model, all the above-mentioned steps are carried out (forward propagation) and a probability distribution over the set of output classes is obtained. With a large enough training dataset, the network will learn and generalize well enough to classify new images into their correct classes.
Recurrent Neural Networks (RNN)
Whenever we want to predict or encode sequential data, an RNN [11] is our go-to method. RNNs perform the same task for every element of a sequence, where the output at each element depends on the previous computations, hence the recurrence. In practice, RNNs are unable to retain long-term dependencies and can look back only a few steps because of the vanishing gradient problem.
h_t = tanh(W x_t + U h_(t-1))    (2-1)
A solution to the dependency problem is to use gated cells such as the LSTM [12] or the GRU [13]. These cells pass important information on to the next cells while ignoring unimportant information. The gated units in a GRU block are:
• Update gate, computed from the current input and the previous hidden state:
z_t = σ(W_z x_t + U_z h_(t-1))    (2-2)
• Reset gate, calculated similarly but with different weights:
r_t = σ(W_r x_t + U_r h_(t-1))    (2-3)
• New memory content:
h̃_t = tanh(W x_t + r_t ⊙ (U h_(t-1)))    (2-4)
If the reset gate unit is close to 0, the previous memory is ignored and only the new information is kept.
The final memory at the current time step combines the previous and current time steps:
h_t = (1 − z_t) ⊙ h_(t-1) + z_t ⊙ h̃_t    (2-5)
While the GRU is computationally efficient, the LSTM is the more general case, with three gates as follows:
• Input gate: what new information to add to the current cell state
• Forget gate: how much information from previous states to keep
• Output gate: how much information should be sent to the next states
Just like in the GRU, the current cell state is a sum of the previous cell state, weighted by the forget gate, and the new candidate value, weighted by the input gate. Based on the cell state, the output gate regulates the final output.
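A single GRU step following Equations 2-2 through 2-5 can be sketched in NumPy as follows (an illustrative sketch with small random weights; in a real network these parameters are learned):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU step: update gate, reset gate, candidate memory, final memory."""
    Wz, Uz = params["Wz"], params["Uz"]
    Wr, Ur = params["Wr"], params["Ur"]
    W, U = params["W"], params["U"]
    z = sigmoid(Wz @ x_t + Uz @ h_prev)             # update gate (Eq. 2-2)
    r = sigmoid(Wr @ x_t + Ur @ h_prev)             # reset gate (Eq. 2-3)
    h_tilde = np.tanh(W @ x_t + r * (U @ h_prev))   # new memory content (Eq. 2-4)
    return (1 - z) * h_prev + z * h_tilde           # final memory (Eq. 2-5)

def init_params(input_dim, hidden_dim, seed=0):
    """Small random weights, standing in for learned parameters."""
    rng = np.random.default_rng(seed)
    shapes = {"Wz": (hidden_dim, input_dim), "Uz": (hidden_dim, hidden_dim),
              "Wr": (hidden_dim, input_dim), "Ur": (hidden_dim, hidden_dim),
              "W":  (hidden_dim, input_dim), "U":  (hidden_dim, hidden_dim)}
    return {k: 0.1 * rng.standard_normal(s) for k, s in shapes.items()}
```

Because the final memory is a convex combination of the previous state and a tanh-bounded candidate, the hidden state stays bounded no matter how many steps are run.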
Word Embedding
Computation and gradients can be applied to numbers, not to words or letters. So we first need to convert words into corresponding numerical representations before feeding them into a deep learning model. In general there are two types of word embedding: frequency-based (which includes count vectors, tf-idf, and co-occurrence vectors) and prediction-based. With frequency-based embeddings, the order of the words is not preserved, and they work as a bag-of-words model. With prediction-based models, the order or locality of words is taken into consideration to generate the numerical representation of each word. Within the prediction-based category there are two fundamental techniques, Continuous Bag of Words (CBOW) and the Skip-Gram model, which form the basis of word2vec [14] and GloVe [15].
The basic intuition behind word2vec is that if two different words have very similar "contexts" (that is, the words that are likely to appear around them), then the model will produce similar vectors for those words. Conversely, if two word vectors are similar, then the network will produce similar context predictions for those two words. For example, synonyms like "intelligent" and "smart" would have very similar contexts, and related words like "engine" and "transmission" would probably have similar contexts as well [16]. Plotting the word vectors learned by word2vec over a large corpus, we can find some very interesting relationships between words.
Figure 2-3 Semantic relation between words in vector space [17]
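The kind of vector-space relationship shown in Figure 2-3 can be reproduced with a toy example (the three-dimensional vectors below are hand-made for illustration only; real word2vec embeddings are learned and have hundreds of dimensions):

```python
import numpy as np

# Toy vectors, hand-made for illustration; real embeddings are learned.
vec = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.2, 0.9]),
    "fruit": np.array([0.8, 0.05, 0.05]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, 0 for orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(target, exclude):
    """Word whose vector is most similar to `target`, skipping `exclude`."""
    return max((w for w in vec if w not in exclude),
               key=lambda w: cosine(vec[w], target))

# The classic analogy: king - man + woman should land near queen.
result = nearest(vec["king"] - vec["man"] + vec["woman"],
                 exclude={"king", "man", "woman"})
```

With these toy vectors, `result` comes out as "queen", mirroring the analogy arithmetic that learned embeddings exhibit at scale.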
Attention Mechanism
We as humans pay attention to the things that are important or relevant in a context. For example, when asked a question about a passage, we try to find the part of the passage most relevant to the question and then reason from our understanding of that part. The same idea applies to the attention mechanism in Deep Learning: it is used to identify the specific parts of a given context to which the current question is relevant.
Formally put, the technique takes n arguments y_1, ..., y_n (in our case the representations of the passage words) and a query, say q. It returns a vector z which is intended to be the "summary" of the y_i, focusing on the information linked to the query q. More formally, it returns a weighted arithmetic mean of the y_i, where the weights are chosen according to the relevance of each y_i given the query q [18].
Figure 2-4 Attention Mechanism flow [18]
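The weighted arithmetic mean described above can be sketched with simple dot-product scoring (a minimal illustration; real models learn the scoring function rather than using a plain dot product):

```python
import numpy as np

def softmax(scores):
    """Normalize scores into positive weights that sum to one."""
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

def attend(ys, q):
    """Return z, a weighted arithmetic mean of the rows of `ys`.

    The weight of each y_i is its softmax-normalized dot-product
    relevance to the query vector q.
    """
    scores = ys @ q            # relevance of each y_i to the query
    weights = softmax(scores)  # attention distribution over the y_i
    z = weights @ ys           # the "summary" vector
    return z, weights
```

Rows of `ys` that are irrelevant to `q` receive near-zero weight and contribute almost nothing to the summary `z`.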
Memory Networks
Convolutional Neural Networks and Recurrent Neural Networks do capture how we form our visual and sequential memories, but their memory (encoded by hidden states and weights) is typically too small and not compartmentalized enough to accurately remember facts from the past (knowledge is compressed into dense vectors) [19].
Deep Learning needed a methodology that preserves memories as they are, so that they are not lost in generalization and so that recalling exact words or sequences of events, something computers are already good at, remains possible. This effort led to Memory Networks [19], published at ICLR 2015 by Facebook AI Research.
This paper provides a basic framework to store, augment, and retrieve memories while working seamlessly with a Recurrent Neural Network architecture. The memory network consists of a memory m (an array of objects indexed by m_i) and four (potentially learned) components I, G, O, and R, as follows:
I (input feature map): converts the incoming input to the internal feature representation, either a sparse or a dense feature vector like that from word2vec or GloVe.
G (generalization): updates old memories given the new input. The authors call this generalization because the network has an opportunity at this stage to compress and generalize its memories for some intended future use.
O (output feature map): produces a new output (in the feature representation space) given the new input and the current memory state. This component is responsible for performing inference; in a question answering system, it selects the candidate sentences (which might contain the answer) from the story (conversation) so far.
R (response): converts the output into the desired response format, for example a textual response or an action. In the QA system described, this component finds the desired answer and then converts it from the feature representation to the actual word.
This model is fully supervised, meaning that all the candidate sentences from which the answer can be found are marked during the training phase; this can also be termed "hard attention".
The authors tested the QA system on various literature, including Lord of the Rings.
Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]
CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART
There has been rapid progress since the release of the SQuAD dataset, and currently the best ensemble models are close to human-level accuracy in machine comprehension. This is due to various ingenious methods which solve some of the problems of earlier approaches: out-of-vocabulary tokens were handled using character embeddings; long-term dependencies within the context passage were addressed using self-attention; and there are many other techniques such as contextualized vectors, history of words, attention flow, etc. In this chapter we will look at some of the most important models that were fundamental to the progress of Question Answering.
Machine Comprehension Using Match-LSTM and Answer Pointer
In this paper [20], the authors propose an end-to-end neural architecture for the QA task. The architecture is based on match-LSTM [21], a model they previously proposed for textual entailment, and Pointer Net [22], a sequence-to-sequence model proposed by Vinyals et al. (2015) to constrain the output tokens to be from the input sequences. The model consists of an LSTM preprocessing layer, a match-LSTM layer, and an Answer Pointer layer.
We are given a piece of text, which we refer to as a passage, and a question related to the passage. The passage is represented by a matrix P of dimension d × P, where P is the length (number of tokens) of the passage and d is the dimensionality of the word embeddings. Similarly, the question is represented by a matrix Q of dimension d × Q, where Q is the length of the question. The goal is to identify a subsequence of the passage as the answer to the question.
Figure 3-1 Match-LSTM Model Architecture [20]
LSTM preprocessing layer: They use a standard one-directional LSTM (Hochreiter & Schmidhuber, 1997) to process the passage and the question separately.
Match-LSTM layer: They applied the match-LSTM model proposed for textual entailment to the machine comprehension problem by treating the question as a premise and the passage as a hypothesis. The match-LSTM goes through the passage sequentially. At position i of the passage, it first uses the standard word-by-word attention mechanism to obtain an attention weight vector as follows:
G_i = tanh(W^q H^q + (W^p h_i^p + W^r h_(i-1)^r + b^p) ⊗ e_Q)
α_i = softmax(w^T G_i + b ⊗ e_Q)    (3-1)
where W^q, W^p, W^r, b^p, w, and b are parameters to be learned.
Answer Pointer layer: The Answer Pointer (Ans-Ptr) layer is motivated by the Pointer Net introduced by Vinyals et al. (2015) [22]. The boundary model produces only the start token and the end token of the answer; all the tokens between these two in the original passage are then taken as the answer.
When this paper was released in November 2016, the match-LSTM method was the state of the art in Question Answering systems and was at the top of the SQuAD leaderboard.
R-NET: Machine Reading Comprehension with Self-Matching Networks
In this model [23], the question and passage are first processed separately by a bidirectional recurrent network (Mikolov et al., 2010). They then match the question and passage with gated attention-based recurrent networks, obtaining a question-aware representation of the passage. On top of that, they apply self-matching attention to aggregate evidence from the whole passage and refine the passage representation, which is then fed into the output layer to predict the boundary of the answer span.
Question and Passage Encoding. First, the words are converted to their respective word-level embeddings and character-level embeddings. The character-level embeddings are generated by taking the final hidden states of a bi-directional recurrent neural network (RNN) applied to the embeddings of the characters in the token. Such character-level embeddings have been shown to help deal with out-of-vocabulary (OOV) tokens.
They then use a bi-directional RNN to produce new representations of all words in the question and passage respectively.
Figure 3-2 The task of Question Answering [23]
Gated Attention-Based Recurrent Networks. They use a variant of attention-based recurrent networks with an additional gate to determine the importance of information in the passage with regard to a question. Different from the gates in an LSTM or GRU, the additional gate is based on the current passage word and its attention-pooling vector of the question, which focuses on the relation between the question and the current passage word. The gate effectively models the phenomenon that only parts of the passage are relevant to the question in reading comprehension and question answering, and the gated representation is utilized in subsequent calculations.
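The additional gate can be sketched as follows. The gate matrix W_g, the dimensions, and the random inputs are illustrative assumptions; in the real model these parameters are learned jointly with the RNN.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_input(u_t, c_t, W_g):
    """R-NET-style gate (a sketch): scale the concatenated vector [u_t; c_t]
    element-wise by a learned sigmoid gate before feeding it to the RNN.

    u_t: current passage-word representation.
    c_t: attention-pooled question vector for that word.
    W_g: learned gate matrix (an assumed, randomly initialized parameter here).
    """
    v = np.concatenate([u_t, c_t])   # [u_t; c_t]
    g = sigmoid(W_g @ v)             # element-wise gate, each entry in (0, 1)
    return g * v                     # gated input: irrelevant parts are damped

d = 4
u, c = rng.normal(size=d), rng.normal(size=d)
W_g = rng.normal(size=(2 * d, 2 * d))
out = gated_input(u, c, W_g)
print(out.shape)  # (8,)
```

Because every gate entry lies in (0, 1), the gated vector can only shrink each component of [u_t; c_t], which is how the model suppresses passage content irrelevant to the question.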
Self-Matching Attention. From the previous step, the question-aware passage representation is generated to highlight the important parts of the passage. One problem with such a representation is that it has very limited knowledge of context: an answer candidate is often oblivious to important cues in the passage outside its surrounding window. To address this problem, the authors propose directly matching the question-aware passage representation against itself. It dynamically collects evidence from the whole passage for each word in the passage, and encodes the evidence relevant to the current passage word and its matching question information into the passage representation.
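A stripped-down sketch of the idea: each passage word attends over the entire passage, so evidence can be collected far beyond the word's local window. Plain dot-product similarity stands in here for the paper's learned additive attention score.

```python
import numpy as np

def self_match(H):
    """Self-matching attention (a simplified sketch): every row of H (one
    passage word) attends over all rows of H, returning a passage-wide
    evidence vector for each word."""
    scores = H @ H.T                             # (n, n) word-to-word similarity
    scores = scores - scores.max(axis=1, keepdims=True)  # stabilize softmax
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)            # each row sums to 1
    return w @ H                                 # evidence from the whole passage

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 4))                      # 6 passage words, dimension 4
print(self_match(H).shape)  # (6, 4)
```

Each output row is a convex combination of all passage-word vectors, which is exactly the "collect evidence from the whole passage" behaviour described above.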
Output Layer. They use the same method as Wang & Jiang (2016b) [20] and use pointer networks (Vinyals et al., 2015) [22] to predict the start and end position of the answer. In addition, they use attention-pooling over the question representation to generate the initial hidden vector for the pointer network [23].
When the R-NET model first appeared on the leaderboard in March 2017, it was at the top with a 72.3 Exact Match and an 80.7 F1 score.
Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
BiDAF [24] is a hierarchical multi-stage architecture for modeling the representations of the context paragraph at different levels of granularity. BiDAF includes character-level, word-level, and contextual embeddings, and uses bi-directional attention flow to obtain a query-aware context representation. Their attention layer is not used to summarize the context paragraph into a fixed-size vector. Instead, the attention is computed for every time step, and the attended vector at each time step, along with the representations from previous layers, can flow through to the subsequent modeling layer. This reduces the information loss caused by early summarization [24].
Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]
Their machine comprehension model is a hierarchical multi-stage process and consists of six layers:
1. Character Embedding Layer maps each word to a vector space using character-level CNNs.
2. Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model.
3. Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embedding of the words. These first three layers are applied to both the query and context. They use an LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words, placing an LSTM in both directions and concatenating the outputs of the two LSTMs.
4. Attention Flow Layer couples the query and context vectors and produces a set of query-aware feature vectors for each word in the context.
5. Modeling Layer employs a Recurrent Neural Network to scan the context. The input to the modeling layer is G, which encodes the query-aware representations of context words. The output of the modeling layer captures the interaction among the context words conditioned on the query. They use two layers of bi-directional LSTM with an output size of d for each direction. Hence a matrix is obtained, which is passed on to the output layer to predict the answer.
6. Output Layer provides an answer to the query. They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices under the predicted distributions, averaged over all examples [25].
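The training loss of layer 6 can be written down directly. The sketch below is a generic span negative log-likelihood, not the authors' code; the toy distributions are made up.

```python
import numpy as np

def span_nll(start_probs, end_probs, true_starts, true_ends):
    """BiDAF-style training loss: negative log probability of the true start
    and end indices under the predicted distributions, averaged over the batch.

    start_probs, end_probs: (batch, passage_len) predicted distributions.
    true_starts, true_ends: integer indices of the gold answer span.
    """
    batch = np.arange(len(true_starts))
    nll = -(np.log(start_probs[batch, true_starts]) +
            np.log(end_probs[batch, true_ends]))
    return nll.mean()

# Two toy examples over a 4-token passage: one confident, one uniform.
sp = np.array([[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]])
ep = np.array([[0.1, 0.7, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]])
print(round(span_nll(sp, ep, np.array([0, 2]), np.array([1, 3])), 3))  # 1.743
```

The uniform example contributes a much larger loss, which is what pushes the model to concentrate probability mass on the correct boundary tokens.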
In a further variation of their above work [25], they add a self-attention layer after the bi-attention layer to further improve the results.
Summary
In this chapter we reviewed the methods that are fundamental to the state of the art in Machine Comprehension and the task of Question Answering. We have come close to human-level accuracy, and this is due to incremental developments over previous models. As we saw, Out-of-Vocabulary (OOV) tokens were handled using character embeddings, long-term dependencies within the context passage were addressed using self-attention, and many other techniques, such as contextualized vectors and attention flow, were employed to get better results. In the next chapter we will see how we can build on these models and develop further.
CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING
Although the advancements in Question Answering systems since the release of the SQuAD dataset have been impressive, and the results are getting close to human-level accuracy, it is far from being a fool-proof system. The models still make mistakes which would be obvious to a human. For example:
Passage: The Panthers used the San Jose State practice facility and stayed at the San Jose Marriott. The Broncos practiced at Florida State Facility and stayed at the Santa Clara Marriott.
Question: At what university's facility did the Panthers practice?
Actual Answer: San Jose State
Predicted Answer: Florida State Facility
To find out what was leading to the wrong predictions, we wanted to see the attention weights associated with such an example. We plotted the passage-question heat map, which is a 2D matrix where the intensity of each cell signifies the similarity between a passage word and a question word. For the above example, we found that while certain words of the question are given high weightage, other parts are not. The words 'At', 'facility', and 'practice' are given high attention, but 'Panthers' does not receive high attention. If it had received high attention, then the system would have predicted 'San Jose State' as the right answer. To solve this issue, we analyzed the base BiDAF model and proposed adding two things:
1. Bi-Attention and Self-Attention over the Query
2. Second-level attention over the output of (Bi-Attention + Self-Attention) from both Context and Query
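A heat map of the kind described above can be produced with a few lines of matplotlib. The words and similarity values below are illustrative stand-ins, not the model's real attention weights.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

passage = ["The", "Panthers", "practiced", "at", "San", "Jose", "State"]
question = ["where", "did", "Panthers", "practice"]

# Toy similarity matrix standing in for real passage-question attention scores.
rng = np.random.default_rng(0)
sim = rng.random((len(passage), len(question)))

fig, ax = plt.subplots()
ax.imshow(sim, cmap="viridis")          # cell intensity = word-pair similarity
ax.set_xticks(range(len(question)))
ax.set_xticklabels(question)
ax.set_yticks(range(len(passage)))
ax.set_yticklabels(passage)
fig.tight_layout()
fig.savefig("attention_heatmap.png")
```

Inspecting such a plot is how one can spot that a question word like 'Panthers' receives uniformly low attention across the passage.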
Multi-Attention BiDAF Model
Our comprehension model is a hierarchical multi-stage process and consists of the following layers:
1. Embedding. Just as in all other models, we embed words using pretrained word vectors. We also embed the characters in each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to get character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training.
2. Pre-Process. A shared bi-directional GRU (Cho et al., 2014) is used to map the question and passage embeddings to context-aware embeddings.
3. Context Attention. The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al., 2016) [24] is used to build a query-aware context representation. Let h_i be the vector for context word i, q_j be the vector for question word j, and n_q and n_c be the lengths of the question and context respectively. We compute attention between context word i and question word j as
a_ij = w1 · h_i + w2 · q_j + w3 · (h_i ⊙ q_j) (4-1)
where w1, w2, and w3 are learned vectors and ⊙ is element-wise multiplication. We then compute an attended vector c_i for each context token as
c_i = Σ_j softmax_j(a_ij) · q_j (4-2)
We also compute a query-to-context vector q_c:
q_c = Σ_i softmax_i(max_j a_ij) · h_i (4-3)
The final vector computed for each token is built by concatenating h_i, c_i, h_i ⊙ c_i, and q_c ⊙ h_i. In our model, we subsequently pass the result through a linear layer with ReLU activations.
4. Context Self-Attention. Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU. Then we apply the same attention mechanism, only now between the passage and itself. In this case, we do not use query-to-context attention, and we set a_ij = -inf if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so its output is additionally summed with the input.
5. Query Attention. For this part we proceed the same way as in context attention, but calculate the weighted sum of the context words for each query word; thus the attended sequence has length equal to the number of query words. Then we calculate context-to-query attention, analogous to the query-to-context attention in the context attention layer.
Figure 4-1 The modified BiDAF model with multilevel attention
6. Query Self-Attention. This part is done the same way as the context self-attention layer, but on the output of the Query Attention layer.
7. Context-Query Bi-Attention + Self-Attention. The outputs of the Context Self-Attention and Query Self-Attention layers are taken as input, and the same process of Bi-Attention and Self-Attention is applied to these inputs.
8. Prediction. In the last layer of our model, a bidirectional GRU is applied, followed by a linear layer that computes answer start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log likelihood of selecting correct start and end tokens.
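The attention of step 3 can be sketched in plain numpy. This is a toy re-implementation under our own naming, following the trilinear form referenced as Equation 4-1, not the training code itself.

```python
import numpy as np

def trilinear_scores(H, Q, w1, w2, w3):
    """Trilinear similarity: a_ij = w1·h_i + w2·q_j + w3·(h_i ⊙ q_j).

    H: (n_c, d) context vectors, Q: (n_q, d) question vectors,
    w1, w2, w3: learned d-dimensional vectors (random here for illustration).
    """
    n_c, n_q = H.shape[0], Q.shape[0]
    a = np.empty((n_c, n_q))
    for i in range(n_c):
        for j in range(n_q):
            a[i, j] = w1 @ H[i] + w2 @ Q[j] + w3 @ (H[i] * Q[j])
    return a

def context_to_query(a, Q):
    """Attended vector c_i = sum_j softmax_j(a_ij) q_j for each context token."""
    e = np.exp(a - a.max(axis=1, keepdims=True))   # numerically stable softmax
    p = e / e.sum(axis=1, keepdims=True)
    return p @ Q

rng = np.random.default_rng(0)
H, Q = rng.normal(size=(5, 3)), rng.normal(size=(4, 3))
w1, w2, w3 = rng.normal(size=3), rng.normal(size=3), rng.normal(size=3)
a = trilinear_scores(H, Q, w1, w2, w3)
print(context_to_query(a, Q).shape)  # (5, 3)
```

The same score function, applied between the passage and itself with the diagonal masked to -inf, yields the self-attention of step 4.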
Having carried out this modification, we were able to fix the failing example we started with: the multilevel attention model gives the correct output, 'San Jose State'. We also achieved slightly better scores than the original model, with an F1 score of 85.44 on the SQuAD dev set.
Chatbot Design Using a QA System
Designing a perfect chatbot that passes the Turing test is a fundamental goal of Artificial Intelligence. Although we are many orders of magnitude away from achieving such a goal, domain-specific tasks can be solved with chatbots made from current technology. We started with a similar objective in mind, i.e., to design a domain-specific chatbot and then generalize to other areas once it achieves the first domain-specific objective robustly. This led us to the fundamental problem of Machine Comprehension and subsequently to the task of Question Answering. Having achieved some degree of success with QA systems, we looked back to see if we could apply our newly acquired knowledge to the task of designing chatbots.
The chatbots made with today's technologies mostly use handcrafted techniques such as template matching, which requires anticipating all possible ways a user may articulate his requirements and a conversation may unfold. This requires many man-hours for designing a domain-specific system and is still very error prone. In this section we propose a general chatbot design that would make designing a domain-specific chatbot easy and robust at the same time.
Every domain-specific chatbot needs to obtain a set of information from the user and show some results based on the user-specific information obtained. Traditional chatbots use template matching and keyword lookup to determine whether the user has provided the required information. Our idea is to use the Question Answering system in the backend to extract the required information from whatever the user has typed up to this point in the conversation. The information to be extracted can be posed in the form of a set of questions, and the answers obtained from those questions can be used as the parameters to supply the relevant information to the user.
We chose the flight reservation system as our chatbot domain. Our goal was to extract the required information from the user to be able to show him the available flights as per the user's requirements.
Figure 4-2 Flight reservation chatbotrsquos chat window
For a flight reservation task, the booking agent needs to know, at a minimum, the origin city, destination city, and date of travel to be able to show the available flights. Optional information includes the number of tickets, the passenger's name, one way or round trip, etc.
The minimalistic conversation with the user through the chat window would be as shown above. We had a platform called OneTask on which we wanted to implement our chatbot. The chat interface within the OneTask system looks as follows.
Figure 4-3 Chatbot within OneTask system
The working of the chat system is as follows:
1. Initiation. The user opens the chat window, which starts a session with the chatbot. The chatbot asks 'How may I help you?'
2. User Reply. The user may reply with none of the required information for flight booking, or may reply with multiple pieces of information in the same message.
3. User Reply Parsing. The conversation up to this point is treated as a passage, and the internal questions are run on this passage. The questions that are run are:
Where do you want to go?
From where do you want to leave?
When do you want to depart?
4. Parsed Responses from the QA Model. After running the questions on the QA system with the given conversation, answers are obtained for all of them along with their corresponding confidence values. Since answers will be obtained even if the required question was not answered up to this point in the conversation, the confidence values play a crucial role in determining the validity of the answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly; a confidence of 2-10 signifies that it may have been answered, but we should verify with the user for correctness; and any answer with a confidence below 2 is discarded.
5. Asking Remaining Questions Iteratively. After the parsing, it is checked whether any of the required questions are still unanswered. If so, the chatbot asks the remaining questions, and steps 3-4 are carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request.
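The flow above can be sketched as a slot-filling loop. `run_qa` below is a hypothetical stub standing in for the real QA model (its answers and confidences are made up), and the confidence thresholds follow step 4.

```python
# Sketch of the iterative slot-filling loop with a stubbed-out QA model.
REQUIRED = {
    "destination": "Where do you want to go?",
    "origin": "From where do you want to leave?",
    "date": "When do you want to depart?",
}

def run_qa(conversation, question):
    """Hypothetical stub: the real system returns (answer, confidence)."""
    fake = {"Where do you want to go?": ("Delhi", 12.0),
            "From where do you want to leave?": ("", 1.1),
            "When do you want to depart?": ("next Monday", 5.0)}
    return fake[question]

def parse_slots(conversation):
    answered, verify, remaining = {}, {}, []
    for slot, question in REQUIRED.items():
        answer, conf = run_qa(conversation, question)
        if conf > 10:          # confident: accept the answer outright
            answered[slot] = answer
        elif conf >= 2:        # plausible: confirm this answer with the user
            verify[slot] = answer
        else:                  # too weak: ask the question again
            remaining.append(question)
    return answered, verify, remaining

answered, verify, remaining = parse_slots("I want to fly to Delhi next Monday")
print(answered, verify, remaining)
```

In this toy run the destination is accepted, the departure date is queued for confirmation, and the chatbot would ask the origin question again on the next turn.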
Figure 4-4 The Flow diagram of the Flight booking Chatbot system
Online QA System and Attention Visualization
To be able to test out various examples, we used the BiDAF [24] model for an online demo. One can either choose from the available examples in the drop-down menu or paste their own passage and questions. While this is a useful and interesting system for testing the model in a user-friendly way, we created it primarily to be able to focus on the wrong samples.
Figure 4-5 QA system interface with attention highlight over candidate answers
The candidate answers are shown with blue highlights as per their confidence values: the higher the confidence, the darker the answer. The candidate with the highest confidence value is chosen as the predicted answer. We developed the system to show the attention spread of the candidate answers, to understand what needs to be done to improve the system. This led us to realize the importance of including the query attention part as well as multilevel attention in the BiDAF model, as described in the first section of this chapter.
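The mapping from confidence values to highlight darkness can be sketched as a simple normalization. The opacity range [0.2, 1.0] is our own assumption for illustration, not the interface's actual scaling.

```python
def highlight_alpha(confidences):
    """Map candidate-answer confidences to highlight opacities in [0.2, 1.0]:
    the higher the confidence, the darker the highlight (the range is an
    assumed choice; the real interface may scale differently)."""
    lo, hi = min(confidences), max(confidences)
    span = (hi - lo) or 1.0          # avoid division by zero when all equal
    return [0.2 + 0.8 * (c - lo) / span for c in confidences]

print(highlight_alpha([1.0, 4.0, 9.0]))
```

Each opacity can then be applied to the blue highlight of the corresponding answer span in the rendered passage.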
CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions
Having done a thorough analysis of the current methods and state of the art in Machine Comprehension and the QA task, and having developed systems on top of it, we have gained a strong sense of what needs to be done to further improve QA models. After observing the wrong samples, we can see that the system is still unable to encode meaning and is picking answers based on statistical patterns of answer occurrence in the training examples.
Judging from the state-of-the-art models and the ongoing research literature, it is easy to conclude that more features need to be embedded to encode the meaning of words, phrases, and sentences. A paper called Reinforced Mnemonic Reader for Machine Comprehension [6] encoded the POS and NER tags of words along with their word and character embeddings, which gave them better results.
Figure 5-1 An English language semantic parse tree [26]
We have developed a method to encode the syntax parse tree of a sentence that encodes not only the POS tags but also the relation of the word within the phrase, and the relation of the phrase within the whole sentence, in a hierarchical manner.
Finally, data augmentation is another way to get better results. One definite way to reduce the errors would be to include in the training data samples similar to those the system is faltering on in the dev set. One could generate examples similar to the failure cases and include them in the training set for better prediction. Another approach would be to train on similar, bigger datasets. Our models were trained on the SQuAD dataset; other datasets, such as TriviaQA, pose a similar question answering task. We could augment the training set with TriviaQA along with SQuAD to have a more robust system that is able to generalize better and thus have higher accuracy in predicting answer spans.
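The augmentation idea can be sketched as a merge of SQuAD-format training files. The file names in the usage comment are placeholders, and TriviaQA would first need to be converted to the SQuAD layout.

```python
import json
import random

def merge_squad_style(*paths, seed=13):
    """Concatenate the `data` arrays of several SQuAD-format training files
    and shuffle the articles: a simple form of the data augmentation
    discussed above (file handling only; no format conversion is done)."""
    merged = []
    for path in paths:
        with open(path) as f:
            merged.extend(json.load(f)["data"])
    random.Random(seed).shuffle(merged)  # fixed seed for reproducibility
    return {"data": merged, "version": "merged"}

# Usage (assuming these files exist locally; names are placeholders):
# combined = merge_squad_style("squad-train-v1.1.json",
#                              "triviaqa-squad-format.json")
# with open("combined-train.json", "w") as f:
#     json.dump(combined, f)
```

Because both corpora would then share one schema, the same training loop can consume the combined file unchanged.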
Conclusion
In this work we explored the most fundamental techniques that have shaped the current state of the art. We then proposed a minor architectural improvement over an existing model. Furthermore, we developed two applications that use the base model: first, we discussed how a chatbot application can be made using the QA system, and lastly, we created a web interface where the model can be used on any passage and question. This interface also shows the attention spread over the candidate answers. While our effort to push the state of the art forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will pay high dividends for society at large.
LIST OF REFERENCES
[1] Chen, Danqi. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U
[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.
[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension." arXiv preprint arXiv:1705.03551 (2017).
[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ questions for machine comprehension of text." arXiv preprint arXiv:1606.05250 (2016).
[5] "The Stanford Question Answering Dataset." 2018. rajpurkar.github.io. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/
[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced mnemonic reader for machine comprehension." CoRR abs/1705.02798 (2017).
[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.
[8] Mehrotra, Neetesh. "The Connect Between Deep Learning and AI." 2018. Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai
[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.
[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
[11] Williams, Ronald J., and David Zipser. "A learning algorithm for continually running fully recurrent neural networks." Neural Computation 1, no. 2 (1989): 270-280.
[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9, no. 8 (1997): 1735-1780.
[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv preprint arXiv:1412.3555 (2014).
[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic regularities in continuous space word representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.
[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.
[16] "Word2Vec Tutorial - The Skip-Gram Model." 2016. Chris McCormick. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
[17] "[Paper Introduction] Bilingual Word Representations with Monolingual Quality in Mind." 2015. SlideShare. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind
[18] "Attention Mechanism." 2016. blog.heuritech.com. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism/
[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. 2014. "Memory Networks." arXiv preprint arXiv:1410.3916. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916
[20] Wang, Shuohang, and Jing Jiang. "Machine comprehension using match-LSTM and answer pointer." arXiv preprint arXiv:1608.07905 (2016).
[21] Wang, Shuohang, and Jing Jiang. "Learning natural language inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).
[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.
[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated self-matching networks for reading comprehension and question answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.
[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional attention flow for machine comprehension." arXiv preprint arXiv:1611.01603 (2016).
[25] Clark, Christopher, and Matt Gardner. "Simple and effective multi-paragraph reading comprehension." arXiv preprint arXiv:1710.10723 (2017).
[26] "8. Analyzing Sentence Structure." 2018. nltk.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata, India, and had a strong interest in computers from an early age. After high school, he earned a B.Sc. in computer science from Ramakrishna Mission Residential College, Narendrapur, followed by an M.Sc. in computer science from St. Xavier's College, Kolkata. He had a strong intuition and interest for human-like learning systems and wanted to work in this area. He started working at TCS Innovation Labs, Pune, on the application of Natural Language Processing in educational applications. As he was deeply passionate about learning systems that mimic the human brain and learn as a human child does, he grew increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science at the University of Florida, Gainesville. His academic interests have been focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.
LIST OF FIGURES
Figure page
1-1 The task of Question Answering 10
2-1 Simple and Deep Learning Neural Networks 13
2-2 Convolutional Neural Network Architecture 14
2-3 Semantic relation between words in vector space 17
2-4 Attention Mechanism flow 18
2-5 QA example on Lord of the Rings using Memory Networks 20
3-1 Match-LSTM Model Architecture 22
3-2 The task of Question Answering 24
3-3 Bi-Directional Attention Flow Model Architecture 26
4-1 The modified BiDAF model with multilevel attention 31
4-2 Flight reservation chatbotrsquos chat window 33
4-3 Chatbot within OneTask system 34
4-4 The Flow diagram of the Flight booking Chatbot system 35
4-5 QA system interface with attention highlight over candidate answers 36
5-1 An English language semantic parse tree 37
LIST OF ABBREVIATIONS
BiDAF Bi-Directional Attention Flow
CNN Convolutional Neural Network
GRU Gated Recurrent Units
LSTM Long Short Term Memory
NLP Natural Language Processing
NLU Natural Language Understanding
RNN Recurrent Neural Network
Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Science
QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM
By
Purnendu Mukherjee
May 2018
Chair: Xiaolin Li
Major: Computer Science
Question Answering (QA) systems have grown rapidly over the last three years and are close to reaching human-level accuracy. One of the fundamental reasons for this growth has been the use of the attention mechanism along with other methods of Deep Learning. But just as with other Deep Learning methods, some of the failure cases are so obvious that they convince us there is a lot to improve upon. In this work we first review the state of the art and the fundamental models in QA systems. Next we introduce an architecture which has shown improvement in the targeted area. We then introduce a general method to enable easy design of domain-specific chatbot applications, and present a proof of concept using that method. Finally, we present an easy-to-use Question Answering interface with attention visualization on the passage. We also propose a method to improve the current state of the art as part of our ongoing work.
CHAPTER 1 INTRODUCTION
Teaching machines to read and understand human natural language is a central and long-standing goal of Natural Language Processing and Artificial Intelligence in general. The positive impact of machines being able to reason about and comprehend human language could be enormous. We are beginning to see how commercial systems like Alexa, Siri, Google Now, etc. are being widely used, as Speech Recognition has improved to human levels. While speech recognition systems can transcribe speech to text, comprehension of that transcribed text is another task which is currently a major focus for both academia and industry because of its possible applications. Moreover, all the text information available throughout the internet is a major reason why machine comprehension of text is such an important task.
With the growth of Deep Learning methods in the last few years, the field of Machine Comprehension, and Natural Language Processing (NLP) in general, has experienced a revolution. While the traditional methods and practices are still prevalent and form the basis of our deep understanding of languages, Deep Learning methods have surpassed all traditional NLP and Machine Learning methods by a significant margin and are currently driving the growth of the field.
To be able to build a system that can understand human text, we need to first ask ourselves: how can we evaluate the machine's comprehension ability? We had initially set a goal to build a chatbot for a specific domain and generalize to other topics as we go ahead. While developing the system, we discovered the necessity of reading comprehension and asked how to measure it. We finally found the answer in Question Answering systems.
Just as we human beings are tested for our ability of language understanding with questions, we should ask machines similar questions about what they have just read. The performance of the system on such a question answering task lets us evaluate how well the machine is able to reason about what it just read [1]. Reading comprehension has been a topic of Natural Language Understanding since the 1970s. In 1977, Wendy Lehnert said in her doctoral thesis: "Only when we can ask a program to answer questions about what it reads will we be able to begin to access that program's comprehension." [2]
Figure 1-1 The task of Question Answering
To achieve this task, the NLP community has developed various datasets such as CNN/Daily Mail, WebQuestions, SQuAD, TriviaQA [3], etc. For our purpose we chose SQuAD, which stands for Stanford Question Answering Dataset [4]. SQuAD consists of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. With 100,000+ question-answer pairs on 500+ articles, SQuAD is significantly larger than previous reading comprehension datasets [5].
An example from the SQuAD dataset is as follows
11
Passage: Tesla later approached Morgan to ask for more funds to build a more powerful transmitter. When asked where all the money had gone, Tesla responded by saying that he was affected by the Panic of 1901, which he (Morgan) had caused. Morgan was shocked by the reminder of his part in the stock market crash and by Tesla's breach of contract by asking for more funds. Tesla wrote another plea to Morgan, but it was also fruitless. Morgan still owed Tesla money on the original agreement, and Tesla had been facing foreclosure even before construction of the tower began.
Question: On what did Tesla blame the loss of the initial money?
Answer: Panic of 1901
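For reference, SQuAD stores such examples in nested JSON (articles → paragraphs → question-answer pairs), where each answer is a literal character span into the paragraph's context. The miniature record below mimics that layout with a shortened passage.

```python
# A miniature record in the SQuAD v1.1 layout: data -> paragraphs -> qas,
# where each answer is a character span into the paragraph `context`.
squad_like = {"data": [{
    "title": "Nikola_Tesla",
    "paragraphs": [{
        "context": "Tesla responded by saying the Panic of 1901 was the cause.",
        "qas": [{
            "id": "q1",
            "question": "On what did Tesla blame the loss?",
            "answers": [{"text": "Panic of 1901", "answer_start": 30}],
        }],
    }],
}]}

for article in squad_like["data"]:
    for para in article["paragraphs"]:
        for qa in para["qas"]:
            ans = qa["answers"][0]
            # Recover the answer text directly from the context offsets.
            span = para["context"][ans["answer_start"]:
                                   ans["answer_start"] + len(ans["text"])]
            assert span == ans["text"]   # answers are literal spans
            print(qa["question"], "->", span)
```

This span property is what lets models predict answers as start/end indices into the passage rather than generating free-form text.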
As we started exploring the QA task, we faced several challenges. Some of them we could solve with the help of other research, and some still remain open in the domain:
• Out-of-vocabulary words
• Multi-sentence reasoning may be required
• There may exist several candidate answers
• Optimizing the Exact Match (EM) metric may fail when the answer boundary is fuzzy or too long, such as the answer to a "why" query [6]
• "One-hop" prediction may fail to fully understand the query [6]
• Models may fail to fully capture the long-distance contextual interaction between parts of the context when only using an LSTM/GRU
• Current models are unable to capture the semantics of the passage
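The EM metric mentioned above, together with the token-level F1 used for SQuAD, can be sketched as follows. The official evaluation script additionally strips articles and punctuation; that normalization is omitted here for brevity.

```python
def exact_match(pred, gold):
    """1 if the prediction equals the gold answer after lowercasing
    (simplified: no article/punctuation stripping)."""
    return int(pred.lower().strip() == gold.lower().strip())

def f1_score(pred, gold):
    """Token-overlap F1 between prediction and gold answer."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Panic of 1901", "panic of 1901"))            # 1
print(round(f1_score("the Panic of 1901", "Panic of 1901"), 2))  # 0.86
```

The second example shows why F1 is the softer metric: an answer with one extra token scores 0 on EM but still earns high partial credit on F1, which matters when the answer boundary is fuzzy.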
In the upcoming chapters we will first briefly review the basics necessary for understanding the models, then delve into the fundamental models that have shaped the current state of the art, then discuss our contributions in terms of architecture and applications, and finally conclude with future directions.
CHAPTER 2 THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
To build a question answering system one needs to be familiar with the
fundamental deep learning models such as Recurrent Neural Networks (RNN) Long
Short Term Memory (LSTM) etc In this chapter we will give an overview of these
techniques and see how they all connect in building a question answering system
Neural Networks
What makes Deep Learning so intriguing is that it has close resemblance with
the working of the mammalian brain, or at least draws inspiration from it The same can
be said for Artificial Neural Networks [7], which consist of a system of interconnected
units called 'neurons' that take input from similar units and produce a single output
Figure 2-1 Simple and Deep Learning Neural Networks [8]
The connection from one neuron to another is weighted, which enables the network to
tune itself to produce a certain output based on the input This is the learning process,
which is achieved through backpropagation, a procedure that propagates the error
from the output layer back to the previous layers
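To make the forward and backward passes concrete, here is a minimal sketch in plain Python: a single sigmoid neuron trained on one invented input/target pair with a squared-error gradient. All values are illustrative, not from any real dataset.

```python
import math

def train_neuron(inputs, target, steps=200, lr=0.5):
    """Fit one sigmoid neuron to a single example by propagating
    the output error back to the weights (gradient descent)."""
    w = [0.0] * len(inputs)
    b = 0.0
    for _ in range(steps):
        z = sum(wi * xi for wi, xi in zip(w, inputs)) + b
        y = 1.0 / (1.0 + math.exp(-z))            # forward pass
        grad_z = (y - target) * y * (1.0 - y)     # error times sigmoid derivative
        w = [wi - lr * grad_z * xi for wi, xi in zip(w, inputs)]
        b -= lr * grad_z                          # backward pass: weight update
    return y

out = train_neuron([1.0, 0.5], target=1.0)        # output approaches the target
```

A full network repeats this update layer by layer, applying the chain rule to pass the error backwards through each weighted connection.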
Convolutional Neural Network
The first wave of deep learningrsquos success was brought by Convolutional Neural
Networks (CNN) [9], the technique used by the winning team of the ImageNet
competition in 2012 CNNs are deep artificial neural networks (ANN) that can be used to
classify images, cluster them by similarity, and perform object recognition within scenes
They can be used to detect and identify faces, people, signs, or any other visual data
Figure 2-2 Convolutional Neural Network Architecture [10]
There are primarily four operations in a standard CNN model (as shown in Fig above)
1 Convolution - The primary purpose of Convolution in the ConvNet (above) is to extract features from the input image The spatial relationship between pixels ie the image features are preserved and learned by the convolution using small squares of input data
2 Non-Linearity (ReLU) – Rectified Linear Unit (ReLU) is a non-linear operation that carries out an element-wise operation on each pixel This operation replaces the negative pixel values in the feature map by zero
3 Pooling or Sub Sampling - Spatial Pooling reduces the dimensionality of each feature map but retains the most important information For max pooling the largest value in the square window is taken and rest are dropped Other types of pooling are Average Sum etc
4 Classification (Fully Connected Layer) - The Fully Connected layer is a traditional Multi-Layer Perceptron as described before that uses a softmax activation function in the output layer The high-level features of the image are encoded by the convolutional and pooling layers which is then fed to the fully connected layer which then uses these features for classifying the input image into various classes based on the training dataset [10]
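The first three operations can be sketched in plain Python on a toy 5×5 image; the image and kernel values below are invented for illustration, with the kernel chosen to respond to vertical edges.

```python
def conv2d_valid(img, kernel):
    """2-D 'valid' convolution (really cross-correlation, as in CNNs)."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(img) - kh + 1):
        row = []
        for j in range(len(img[0]) - kw + 1):
            row.append(sum(img[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

def relu(fmap):
    """Element-wise ReLU: negative activations become zero."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool2(fmap):
    """2x2 max pooling with stride 2: keep the largest value per window."""
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]

img = [[0, 0, 1, 1, 0],
       [0, 0, 1, 1, 0],
       [0, 0, 1, 1, 0],
       [0, 0, 1, 1, 0],
       [0, 0, 1, 1, 0]]
edge = [[1, -1], [1, -1]]          # responds to left-to-right intensity rises
fmap = max_pool2(relu(conv2d_valid(img, edge)))
```

The pooled feature map keeps only the strong positive edge response, illustrating how convolution, ReLU, and pooling together produce a compact, location-robust feature.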
When a new image is fed into the CNN model all the above-mentioned steps are
carried out (forward propagation) and a probability distribution is achieved on the set of
output classes With a large enough training dataset the network will learn and
generalize well enough to classify new images into their correct classes
Recurrent Neural Networks (RNN)
Whenever we want to predict or encode sequential data, RNNs [11] are our go-to
method RNNs perform the same task for every element of a sequence where the
output of each element depends on previous computations thus the recurrence In
practice RNNs are unable to retain long-term dependencies and can look back only a
few steps because of the vanishing gradient problem
h_t = tanh(W x_t + U h_{t-1}) (2-1)
A solution to the dependency problem is to use gated cells such as LSTM [12] or
GRU [13] These cells pass on important information to the next cells while ignoring
non-important ones The gated units in a GRU block are
• Update Gate – Computed based on current input and hidden state:
z_t = σ(W_z x_t + U_z h_{t-1}) (2-2)
• Reset Gate – Calculated similarly but with different weights:
r_t = σ(W_r x_t + U_r h_{t-1}) (2-3)
• New memory content:
h̃_t = tanh(W x_t + r_t ⊙ U h_{t-1}) (2-4)
If reset gate unit is ~0 then previous memory is ignored and only new
information is kept
The final memory at the current time step combines the previous memory and the new
memory content:
h_t = z_t ⊙ h_{t-1} + (1 - z_t) ⊙ h̃_t (2-5)
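A single GRU step following Equations 2-2 through 2-5 can be sketched for a scalar input and state; the weights here are arbitrary toy values, not trained parameters, and a real GRU uses weight matrices rather than scalars.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, Wz, Uz, Wr, Ur, W, U):
    """One scalar GRU step: update gate, reset gate, new content, final memory."""
    z = sigmoid(Wz * x + Uz * h_prev)               # update gate (Eq. 2-2)
    r = sigmoid(Wr * x + Ur * h_prev)               # reset gate (Eq. 2-3)
    h_tilde = math.tanh(W * x + r * U * h_prev)     # new memory content (Eq. 2-4)
    return z * h_prev + (1.0 - z) * h_tilde         # final memory (Eq. 2-5)

h = 0.0
for x in [1.0, 0.5, -1.0]:                          # a toy input sequence
    h = gru_step(x, h, Wz=1.0, Uz=0.5, Wr=1.0, Ur=0.5, W=1.0, U=1.0)
```

When r is near zero the `r * U * h_prev` term vanishes, so the previous memory is ignored and only the new input shapes the candidate state, exactly the behavior described above.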
While the GRU is computationally efficient the LSTM on the other hand is a
general case where there are three gates as follows
• Input Gate – What new information to add to the current cell state
• Forget Gate – How much information from previous states should be kept
• Output Gate – How much information should be sent to the next states
As in the GRU, the current cell state is a sum of the previous cell state, weighted by
the forget gate, and the new candidate value, weighted by the input gate Based on
the cell state, the output gate regulates the final output
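The LSTM update described above can likewise be sketched for a scalar cell; sharing one toy weight across all gates keeps the illustration short, whereas a real LSTM learns separate weight matrices per gate.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w=1.0):
    """One scalar LSTM step with a shared toy weight w (illustration only)."""
    i = sigmoid(w * x + w * h_prev)        # input gate: what new info to add
    f = sigmoid(w * x + w * h_prev)        # forget gate: how much past to keep
    o = sigmoid(w * x + w * h_prev)        # output gate: how much to emit
    c_tilde = math.tanh(w * x + w * h_prev)
    c = f * c_prev + i * c_tilde           # old state (forget-weighted) + new value
    h = o * math.tanh(c)                   # output regulated by the output gate
    return h, c

h, c = 0.0, 0.0
for x in [1.0, 1.0, 1.0]:                  # a toy input sequence
    h, c = lstm_step(x, h, c)
```

Note that the cell state c can grow beyond the (-1, 1) range of the hidden output h, which is what lets the LSTM accumulate information over long sequences.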
Word Embedding
Computations and gradients apply to numbers, not to words or letters, so we first need
to convert words into a corresponding numerical representation before feeding them
into a deep learning model In general there are two types of word embedding:
frequency based (which comprises count vectors, tf-idf, and co-occurrence vectors)
and prediction based With frequency-based embeddings the order of the words is not
preserved, and they work as a bag-of-words model, whereas with prediction-based
models the order, or locality, of words is taken into consideration to generate the
numerical representation of a word Within the prediction-based category there are
two fundamental techniques, Continuous Bag of Words (CBOW) and the Skip-Gram
model, which form the basis for word2vec [14] and GloVe [15]
The basic intuition behind word2vec is that if two different words have very
similar "contexts" (that is, the words that are likely to appear around them), then the
model will produce similar vectors for those words Conversely, if two word vectors are
similar, then the network will produce similar context predictions for those two words
For example, synonyms like "intelligent" and "smart" would have very similar contexts,
and related words like "engine" and "transmission" would probably have similar
contexts as well [16] Plotting the word vectors learned by word2vec over a
large corpus, we can find some very interesting relationships between words
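The "similar context implies similar vector" property is usually measured with cosine similarity. Here is a small sketch with invented 3-dimensional vectors; real word2vec vectors typically have hundreds of dimensions and are learned from a corpus.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product normalized by vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy vectors, made up so synonyms point in a similar direction
vec = {
    "intelligent": [0.90, 0.80, 0.10],
    "smart":       [0.85, 0.75, 0.20],
    "banana":      [0.10, 0.20, 0.90],
}
sim_syn = cosine(vec["intelligent"], vec["smart"])     # near 1: similar contexts
sim_unrel = cosine(vec["intelligent"], vec["banana"])  # much lower
```

The same similarity measure underlies the famous analogy arithmetic (e.g. king - man + woman ≈ queen) visible in plots like Figure 2-3.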
Figure 2-3 Semantic relation between words in vector space [17]
Attention Mechanism
We as humans pay attention to things that are important or relevant in a
context For example, when asked a question about a passage, we try to find the part
of the passage most relevant to the question and then reason from our understanding
of that part of the passage The same idea applies to the attention mechanism in Deep
Learning: it is used to identify the specific parts of a given context to which the current
question is relevant
Formally put, the technique takes n arguments y_1, …, y_n (in our case, the
representations of the passage words) and a context c (in our case, the question q)
It returns a vector z which is supposed to be the "summary" of the y_i, focusing on
the information linked to the question q More formally, it returns a weighted arithmetic
mean of the y_i, where the weights are chosen according to the relevance of each y_i
given the context c [18]
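This weighted arithmetic mean can be sketched directly; the relevance scores below are invented, whereas a real model computes them from the question and each y_i.

```python
import math

def attend(ys, scores):
    """Return z = sum_i alpha_i * y_i, with alpha = softmax(scores)."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]          # attention weights, sum to 1
    dim = len(ys[0])
    z = [sum(a * y[d] for a, y in zip(alphas, ys)) for d in range(dim)]
    return z, alphas

# Three toy passage-word vectors; the second is most relevant to the query
ys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.1, 3.0, 0.1]                        # relevance of each y_i to q
z, alphas = attend(ys, scores)
```

Because of the softmax, the summary vector z is dominated by the highly scored word while still retaining a little information from the others, which is exactly the "soft" focusing behavior shown in Figure 2-4.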
Figure 2-4 Attention Mechanism flow [18]
Memory Networks
While Convolutional Neural Networks and Recurrent Neural Networks do capture how
we form our visual and sequential memories, their memory (encoded by hidden states
and weights) is typically too small and not compartmentalized enough to accurately
remember facts from the past, as knowledge is compressed into dense vectors [19]
Deep Learning needed a methodology that preserves memories as they are, such
that they are not lost in generalization, and that makes recalling exact words or
sequences of events possible, something computers are already good at This
effort led to Memory Networks [19], published at ICLR 2015 by Facebook AI
Research
This paper provides a basic framework to store augment and retrieve memories
while seamlessly working with a Recurrent Neural Network architecture The memory
network consists of a memory m (an array of objects indexed by m_i) and four
(potentially learned) components I, G, O, and R, as follows
I (input feature map) — converts the incoming input to the internal feature
representation, either a sparse or dense feature vector like that from word2vec or
GloVe
G (generalization) — updates old memories given the new input They call this
generalization as there is an opportunity for the network to compress and generalize its
memories at this stage for some intended future use This is the analogy mentioned
earlier
O (output feature map) — produces a new output (in the feature representation
space) given the new input and the current memory state This component is
responsible for performing inference In a question answering system this part will
select the candidate sentences (which might contain the answer) from the story
(conversation) so far
R (response) — converts the output into the response format desired For
example a textual response or an action In the QA system described this component
finds the desired answer and then converts it from feature representation to the actual
word
This is a fully supervised model, meaning the candidate sentences in which
the answer can be found are marked during the training phase; this can also be
termed 'hard attention'
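A toy version of the I/G/O/R pipeline can be sketched using bag-of-words overlap as a stand-in for the learned scoring function; the stored sentences, the overlap heuristic, and the "last word as answer" response rule are all purely illustrative.

```python
def I(text):
    """Input map: raw text -> internal (bag-of-words) representation."""
    return set(text.lower().split())

class ToyMemoryNetwork:
    def __init__(self):
        self.memory = []                     # m: the array of stored facts

    def G(self, fact):
        """Generalization: here, simply append the new fact to memory."""
        self.memory.append(fact)

    def O(self, question):
        """Output map: hard attention, pick the best-matching stored fact."""
        q = I(question)
        return max(self.memory, key=lambda m: len(I(m) & q))

    def R(self, support):
        """Response: reduce the supporting fact to a short answer (last word)."""
        return support.split()[-1]

net = ToyMemoryNetwork()
net.G("Bilbo travelled to the cave")
net.G("Frodo went back to the Shire")
answer = net.R(net.O("Where did Frodo go"))
```

In the actual Memory Networks model, O and R are learned scoring functions over feature embeddings rather than word-overlap counts, but the division of labor between the four components is the same.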
The authors tested out the QA system on various literature including Lord of the
Rings
Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]
CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART
There has been rapid progress since the release of the SQuAD dataset and
currently the best ensemble models are close to human level accuracy in machine
comprehension This is due to various ingenious methods which solve some of the
problems of previous approaches: out-of-vocabulary tokens were handled using
character embeddings, long-term dependencies within the context passage were
addressed using self-attention, and many other techniques such as contextualized
vectors, history of words, and attention flow were introduced In this section we will
look at some of the most important models that were fundamental to the progress of
Question Answering
Machine Comprehension Using Match-LSTM and Answer Pointer
In this paper [20] the authors propose an end-to-end neural architecture for the
QA task The architecture is based on match-LSTM [21] a model they proposed for
textual entailment and Pointer Net [22] a sequence-to-sequence model proposed by
Vinyals et al (2015) to constrain the output tokens to be from the input sequences
The model consists of an LSTM preprocessing layer a match-LSTM layer and an
Answer Pointer layer
We are given a piece of text which we refer to as a passage and a question
related to the passage The passage is represented by a matrix P of dimension d × P,
where P is the length (number of tokens) of the passage and d is the dimensionality of
the word embeddings Similarly the question is represented by a matrix Q of dimension
d × Q, where Q is the length of the question
Our goal is to identify a subsequence from the passage as the answer to the question
Figure 3-1 Match-LSTM Model Architecture [20]
LSTM Preprocessing layer They use a standard one-directional LSTM
(Hochreiter & Schmidhuber 1997) to process the passage and the question separately
Match LSTM Layer They applied the match-LSTM model proposed for textual
entailment to their machine comprehension problem by treating the question as a
premise and the passage as a hypothesis The match-LSTM sequentially goes through
the passage At position i of the passage it first uses the standard word-by-word
attention mechanism to obtain the attention weight vector α_i as follows

G_i = tanh(W^q H^q + (W^p h_i^p + W^r h_{i-1}^r + b^p) ⊗ e_Q) (3-1)
α_i = softmax(w^T G_i + b ⊗ e_Q)

where W^q, W^p, W^r, b^p, w, and b are parameters to be learned, H^q is the hidden
representation of the question, h_i^p is the hidden state of the passage at position i,
and h_{i-1}^r is the hidden state of the match-LSTM at position i-1
Answer Pointer Layer The Answer Pointer (Ans-Ptr) layer is motivated by the
Pointer Net introduced by Vinyals et al (2015) [22] The boundary model produces only
the start token and the end token of the answer and then all the tokens between these
two in the original passage are considered to be the answer
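Boundary decoding of this kind can be sketched as a search over valid (start, end) pairs; the probability values below are invented for illustration.

```python
def best_span(start_probs, end_probs, max_len=15):
    """Pick (s, e) with e >= s maximizing start_probs[s] * end_probs[e]."""
    best, best_p = (0, 0), 0.0
    for s, ps in enumerate(start_probs):
        for e in range(s, min(s + max_len, len(end_probs))):
            p = ps * end_probs[e]
            if p > best_p:
                best, best_p = (s, e), p
    return best

# Hypothetical start/end distributions over a 6-token passage
start = [0.1, 0.6, 0.1, 0.1, 0.05, 0.05]
end   = [0.3, 0.1, 0.4, 0.1, 0.05, 0.05]
span = best_span(start, end)
```

Note that the unconstrained argmax of the end distribution is token 0, but the boundary constraint e >= s forces a valid span, so the decoder selects tokens 1 through 2 instead.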
When this paper was released back in November 2016, the Match-LSTM method
was the state of the art in Question Answering systems and was at the top of the
leaderboard for the SQuAD dataset
R-NET: Machine Reading Comprehension with Self-Matching Networks
In this model [23] first the question and passage are processed by a
bidirectional recurrent network (Mikolov et al 2010) separately They then match the
question and passage with gated attention-based recurrent networks obtaining
question-aware representation for the passage On top of that they apply self-matching
attention to aggregate evidence from the whole passage and refine the passage
representation which is then fed into the output layer to predict the boundary of the
answer span
Question and passage encoding First the words are converted to their
respective word-level embeddings and character level embeddings The character-level
embeddings are generated by taking the final hidden states of a bi-directional recurrent
neural network (RNN) applied to embeddings of characters in the token Such
character-level embeddings have been shown to be helpful to deal with out-of-vocab
(OOV) tokens
They then use a bi-directional RNN to produce new representation and
of all words in the question and passage respectively
Figure 3-2 The task of Question Answering [23]
Gated Attention-based Recurrent Networks They use a variant of attention-
based recurrent networks with an additional gate to determine the importance of
information in the passage regarding a question Different from the gates in LSTM or
GRU, the additional gate is based on the current passage word and its attention-pooling
vector of the question, which focuses on the relation between the question and the
current passage word The gate effectively models the phenomenon that in reading
comprehension and question answering only parts of the passage are relevant to the
question, and only that relevant information is utilized in subsequent calculations
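The gating idea can be sketched as follows, with invented 2-dimensional vectors and an identity-like gate matrix standing in for the learned gate weights; a real R-NET learns this matrix and operates on much higher-dimensional states.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_input(passage_word, question_ctx, gate_weights):
    """Gate the concatenated [word, attention-pooled question] vector:
    g = sigmoid(W_g v), then v* = g * v element-wise."""
    v = passage_word + question_ctx            # concatenation
    g = [sigmoid(sum(w * x for w, x in zip(row, v))) for row in gate_weights]
    return [gi * xi for gi, xi in zip(g, v)]

word = [1.0, -0.5]                             # toy passage-word vector
q_ctx = [0.2, 0.8]                             # toy attention-pooled question vector
W_g = [[1.0, 0.0, 0.0, 0.0],                   # illustrative 4x4 gate matrix
       [0.0, 1.0, 0.0, 0.0],
       [0.0, 0.0, 1.0, 0.0],
       [0.0, 0.0, 0.0, 1.0]]
v_star = gated_input(word, q_ctx, W_g)
```

Since each gate value lies in (0, 1), every component of the gated vector is attenuated relative to the raw input, which is how the model down-weights passage information that is irrelevant to the question.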
Self-Matching Attention From the previous step the question aware passage
representation is generated to highlight the important parts of the passage One
representation is generated to highlight the important parts of the passage One
problem with such a representation is that it has very limited knowledge of context:
one answer candidate is often oblivious to important cues in the passage outside its
the question-aware passage representation against itself It dynamically collects
evidence from the whole passage for words in passage and encodes the evidence
relevant to the current passage word and its matching question information into the
passage representation
Output Layer They use the same method as Wang amp Jiang (2016b) and use
pointer networks (Vinyals et al 2015) to predict the start and end position of the
answer In addition they use an attention-pooling over the question representation to
generate the initial hidden vector for the pointer network [23]
When the R-NET model first appeared on the leaderboard in March 2017 it was at
the top with a 72.3 Exact Match and an 80.7 F1 score
Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
BiDAF [24] is a hierarchical multi-stage architecture for modeling the
representations of the context paragraph at different levels of granularity BiDAF
includes character-level, word-level, and contextual embeddings and uses bi-directional
attention flow to obtain a query-aware context representation Their attention layer is not
used to summarize the context paragraph into a fixed-size vector Instead the attention
is computed for every time step and the attended vector at each time step along with
the representations from previous layers can flow through to the subsequent modeling
layer This reduces the information loss caused by early summarization [24]
Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]
Their machine comprehension model is a hierarchical multi-stage process and
consists of six layers
1 Character Embedding Layer maps each word to a vector space using character-level CNNs
2 Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model
3 Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embedding of the words These first three layers are applied to both the query and context They use an LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words, placing an LSTM in both directions and concatenating the outputs of the two LSTMs
4 Attention Flow Layer couples the query and context vectors and produces a set of query aware feature vectors for each word in the context
5 Modeling Layer employs a Recurrent Neural Network to scan the context The input to the modeling layer is G which encodes the query-aware representations of context words The output of the modeling layer captures the interaction among the context words conditioned on the query They use two layers of bi-directional LSTM with an output size of d for each direction Hence a matrix is obtained which is passed on to the output layer to predict the answer
6 Output Layer provides an answer to the query They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices by the predicted distributions averaged over all examples [25]
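The training loss used in layer 6 can be sketched directly; the probability distributions below are invented, and a real implementation works on logits over hundreds of tokens.

```python
import math

def span_nll(start_probs, end_probs, true_start, true_end):
    """Sum of negative log probabilities of the true start and end indices."""
    return -(math.log(start_probs[true_start]) + math.log(end_probs[true_end]))

# Toy predicted distributions over a 3-token context
start = [0.7, 0.2, 0.1]
end   = [0.1, 0.1, 0.8]
loss_good = span_nll(start, end, true_start=0, true_end=2)  # agrees with gold span
loss_bad  = span_nll(start, end, true_start=1, true_end=0)  # disagrees
```

Minimizing this loss pushes probability mass toward the gold start and end indices; averaging it over all training examples gives the objective described above.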
In a further variation of their above work they add a self-attention layer after the
bi-attention layer to further improve the results The architecture of this model is shown
in the figure below
Figure 3-4 BiDAF model with an additional self-attention layer [25]
Summary
In this chapter we reviewed the methods that are fundamental to the state of the
art in Machine Comprehension and the task of Question Answering We have
reached close to human-level accuracy, and this is due to incremental developments
over previous models As we saw, out-of-vocabulary (OOV) tokens were handled by
using character embeddings, long-term dependencies within the context passage were
solved using self-attention, and many other techniques such as contextualized vectors
and attention flow were employed to get better results In the next chapter we will see
how we can build on these models and develop further
CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING
Although the advancements in Question Answering systems since the release of
the SQuAD dataset have been impressive and the results are getting close to human-
level accuracy, it is far from being a fool-proof system The models still make mistakes
which would be obvious to a human For example
Passage The Panthers used the San Jose State practice facility and stayed at
the San Jose Marriott The Broncos practiced at Florida State Facility and stayed at the
Santa Clara Marriott
Question At what university's facility did the Panthers practice
Actual Answer San Jose State
Predicted Answer Florida State Facility
To find out what is leading to the wrong predictions we wanted to see the
attention weights associated with such an example We plotted the passage and
question heat map which is a 2D matrix where the intensity of each cell signifies the
similarity between a passage word and a question word For the above example we
found out that while certain words of the question are given high weightage, other parts
are not The words 'At', 'facility', and 'practice' are given high attention but 'Panthers'
does not receive high attention If it had received high attention then the system would
have predicted 'San Jose State' as the right answer To solve this issue we analyzed the
base BiDAF model and proposed adding two things
1 Bi-Attention and Self-Attention over Query
2 Second level attention over the output of (Bi-Attention + Self-Attention) from both Context and Query
Multi-Attention BiDAF Model
Our comprehension model is a hierarchical multi-stage process and consists of
the following layers
1 Embedding Just as in all other models, we embed words using pretrained word vectors We also embed the characters in each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to get character-derived embeddings for each word The character-level and word-level embeddings are then concatenated and passed to the next layer We do not update the word embeddings during training
2 Pre-Process A shared bi-directional GRU (Cho et al 2014) is used to map the question and passage embeddings to context aware embeddings
3 Context Attention The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al 2016) is used to build a query-aware context representation Let h_i be the vector for context word i q_j be the vector for question word j and n_q and n_c be the lengths of the question and context respectively We compute attention between context word i and question word j as
a_ij = w1 · h_i + w2 · q_j + w3 · (h_i ⊙ q_j) (4-1)
where w1, w2, and w3 are learned vectors and ⊙ is element-wise multiplication We then compute an attended vector c_i for each context token as
p_ij = exp(a_ij) / Σ_k exp(a_ik),  c_i = Σ_j p_ij q_j (4-2)
We also compute a query-to-context vector q_c
m_i = max_j a_ij,  p_i = exp(m_i) / Σ_k exp(m_k),  q_c = Σ_i p_i h_i (4-3)
The final vector computed for each token is built by concatenating h_i, c_i, h_i ⊙ c_i,
and q_c ⊙ h_i In our model we subsequently pass the result through a linear layer with ReLU activations
4 Context Self-Attention Next we use a layer of residual self-attention The input is passed through another bi-directional GRU Then we apply the same attention mechanism, only now between the passage and itself In this case we do not use query-to-context attention, and we set a_ij = −inf if i = j As before we pass the concatenated output through a linear layer with ReLU activations This layer is applied residually, so this output is additionally summed with the input
5 Query Attention For this part we proceed the same way as in the context attention layer but calculate the weighted sum of the context words for each query word, so the output length is the number of query words Then we calculate context-to-query attention similarly to the query-to-context attention of the context attention layer
Figure 4-1 The modified BiDAF model with multilevel attention
6 Query Self-Attention This part is done the same way as the context self-attention layer but from the output of the Query Attention layer
7 Context Query Bi-Attention + Self-Attention The output of the Context self-attention and the Query Self-Attention layers are taken as input and the same process for Bi-Attention and self-attention is applied on these inputs
8 Prediction In the last layer of our model a bidirectional GRU is applied followed by a linear layer that computes answer start scores for each token The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores The softmax operation is applied to the start and end scores to produce start and end probabilities and we optimize the negative log likelihood of selecting correct start and end tokens
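The context attention step (layer 3, Eqs. 4-1 and 4-2) can be sketched in plain Python; the vectors and the learned weights w1, w2, w3 are toy values, chosen so the effect of the element-wise term is visible.

```python
import math

def trilinear(h, q, w1, w2, w3):
    """Attention score a_ij = w1.h + w2.q + w3.(h*q), per Eq. 4-1."""
    return (sum(a * b for a, b in zip(w1, h))
            + sum(a * b for a, b in zip(w2, q))
            + sum(a * b * c for a, b, c in zip(w3, h, q)))

def attended_context(context, question, w1, w2, w3):
    """For each context word i, softmax scores over question words j and
    return the attended vector c_i = sum_j p_ij * q_j (Eq. 4-2)."""
    attended = []
    for h in context:
        scores = [trilinear(h, q, w1, w2, w3) for q in question]
        exps = [math.exp(s) for s in scores]
        z = sum(exps)
        p = [e / z for e in exps]
        dim = len(question[0])
        attended.append([sum(pj * qj[d] for pj, qj in zip(p, question))
                         for d in range(dim)])
    return attended

context = [[1.0, 0.0], [0.0, 1.0]]       # toy context word vectors
question = [[1.0, 0.0], [0.0, 1.0]]      # toy question word vectors
w1, w2, w3 = [0.0, 0.0], [0.0, 0.0], [2.0, 2.0]
c = attended_context(context, question, w1, w2, w3)
```

With these weights only the element-wise product term contributes, so each context word attends most strongly to the question word it overlaps with, which is the query-aware behavior the layer is designed to produce.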
Having carried out this modification, we were able to solve the wrong example we
started with: the multilevel attention model gives the correct output, "San Jose
State" We also achieved slightly better scores than the original model, with an F1
score of 85.44 on the SQuAD dev set
Chatbot Design Using a QA System
Designing a perfect chatbot that passes the Turing test is a fundamental goal for
Artificial Intelligence Although we are many orders of magnitude away from achieving
such a goal, domain-specific tasks can be solved with chatbots made from current
technology We had started with a similar objective in mind, ie to design a
domain-specific chatbot and then generalize to other areas once it achieves the
first domain-specific objective robustly This led us to the fundamental problem of
Machine Comprehension and subsequently to the task of Question Answering Having
achieved some degree of success with QA systems, we looked back to see whether
we could apply our newly acquired knowledge to the task of designing chatbots
The chatbots made with today's technologies mostly use handcrafted techniques
such as template matching, which requires anticipating all possible ways a user may
articulate his requirements and a conversation may occur This requires a lot of man
hours for designing a domain-specific system and is still very error prone In this section
we propose a general Chatbot design that would make the designing of a domain
specific chatbot very easy and robust at the same time
Every domain specific chatbot needs to obtain a set of information from the user
and show some results based on the user specific information obtained The traditional
chatbots use template matching and keywords lookup to determine if the user has
provided the required information Our idea is to use the Question Answering system in
the backend to extract out the required information from whatever the user has typed
until this point of the conversation The information to be extracted can be posed in the
form of a set of questions and the answers obtained from those questions can be used
as the parameters to supply the relevant information to the user
We had chosen our chatbot domain as the flight reservation system Our goal
was to extract the required information from the user to be able to show the
available flights as per the user's requirements
Figure 4-2 Flight reservation chatbotrsquos chat window
For a flight reservation task the booking agent needs to know the origin city
destination city and date of travel at minimum to be able to show the available flights
Optional information includes the number of tickets, passenger's name, one way or
round trip etc
The minimalistic conversation with the user through the chat window would be as
shown above We had a platform called OneTask on which we wanted to implement our
chat bot The chat interface within the OneTask system looks as follows
Figure 4-3 Chatbot within OneTask system
The working of the chat system is as follows:
1 Initiation The user opens the chat window which starts a session with the chat bot The chat bot asks 'How may I help you'
2 User Reply The user may reply with none of the required information for flight booking or may reply with multiple pieces of information in the same message
3 User Reply Parsing The conversation up to this point is treated as a passage and the internal questions are run on this passage The questions that are run are:
Where do you want to go
From where do you want to leave
When do you want to depart
4 Parsed responses from QA model After running the questions on the QA system with the given conversation, answers are obtained for all of them along with their corresponding confidence values Since answers will be returned even if the required information has not been given up to this point in the conversation, the confidence values play a crucial role in determining the validity of an answer For this purpose we determined ranges of confidence values for validation A confidence value above 10 signifies the question has been answered correctly, a confidence of 2 – 10 signifies that it may have been answered but should be verified with the user for correctness, and any confidence below 2 is discarded
5 Asking remaining questions iteratively After the parsing it is checked whether any of the required questions are still unanswered If so, the chatbot asks the remaining question and the process from 3 – 4 is carried out iteratively Once all the questions have been answered the user is shown the available flight options as per his request
Figure 4-4 The Flow diagram of the Flight booking Chatbot system
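The confidence-based slot validation in steps 4 and 5 can be sketched as follows; the thresholds (10 and 2) follow the ranges described above, while the slot answers and confidence values are invented for illustration.

```python
def classify_answer(answer, confidence):
    """Route a QA answer by confidence: accept, verify with the user, or discard.
    Thresholds follow the validation ranges described in step 4."""
    if confidence > 10:
        return ("accept", answer)
    if confidence >= 2:
        return ("verify", answer)
    return ("discard", None)

# Hypothetical QA outputs for the three required slot questions
slots = {
    "Where do you want to go": ("Paris", 14.2),
    "From where do you want to leave": ("Boston", 6.5),
    "When do you want to depart": ("tomorrow", 0.7),
}
decisions = {q: classify_answer(a, c) for q, (a, c) in slots.items()}
```

Any slot that ends up as "verify" or "discard" triggers the iterative re-asking loop of step 5 until every required question is confidently answered.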
Online QA System and Attention Visualization
To be able to test out various examples we used the BiDAF [24] model for an
online demo One can either choose from the available examples in the drop-down
menu or paste in their own passage and questions While this is a useful and interesting
way to test the model, we created this system primarily to be able to focus on the
wrong samples
Figure 4-5 QA system interface with attention highlight over candidate answers
The candidate answers are shown with blue highlights as per their confidence
values: the higher the confidence, the darker the highlight The answer with the
highest confidence value is chosen as the predicted answer We developed the system
to show the attention spread of the candidate answers in order to understand what
needs to be done to improve the system This led us to realize the importance of
including the query attention part as well as multilevel attention in the BiDAF model,
as described in the first section of this chapter
CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions
Having done a thorough analysis of the current methods and state of the art in
Machine Comprehension and the QA task, and having developed systems on them, we
have gained a strong sense of what needs to be done to further improve QA models
After observing the wrong samples we can see that the system is still unable to encode
meaning and is picking answers based on statistical patterns of occurrence of answers
in the training examples
Going by the state-of-the-art models and the ongoing research literature, it is
easy to conclude that more features need to be embedded to encode the meaning
of words, phrases, and sentences The paper "Reinforced Mnemonic Reader for
Machine Comprehension" [6] encoded POS and NER tags of words along with their
word and character embeddings, which gave them better results
Figure 5-1 An English language semantic parse tree [26]
We have developed a method to encode the syntax parse tree of a sentence that
encodes not only the POS tags but also the relation of the word within its phrase and
the relation of the phrase within the whole sentence, in a hierarchical manner
Finally, data augmentation is another way to get better results One definite
way to reduce the errors would be to include in the training data samples similar to
those the system is faltering on in the dev set One could generate examples similar
to the failure cases and include them in the training set for better prediction Another
approach would be to train on similar and bigger datasets Our models were trained on
the SQuAD dataset There are other datasets, such as TriviaQA, that address a similar
question answering task We could augment the training set with TriviaQA along with
SQuAD to have a more robust system that is able to generalize better and thus have
higher accuracy in predicting answer spans
Conclusion
In this work we have explored the most fundamental techniques that have
shaped the current state of the art Then we proposed a minor architectural
improvement over an existing model Furthermore, we developed two applications that
use the base model: first we discussed how a chatbot application can be made
using the QA system, and lastly we created a web interface where the model can
be used for any passage and question This interface also shows the attention spread
over the candidate answers While our effort to push the state of the art forward is
ongoing, we strongly believe that surpassing human-level accuracy on this task will
pay high dividends for society at large
LIST OF REFERENCES
[1] Danqi Chen From Reading Comprehension To Open-Domain Question Answering 2018 YouTube Accessed March 16 2018 https://www.youtube.com/watch?v=1RN88O9C13U
[2] Lehnert Wendy G The Process of Question Answering No RR-88 Yale Univ New Haven Conn Dept of computer science 1977
[3] Joshi Mandar Eunsol Choi Daniel S Weld and Luke Zettlemoyer TriviaQA A large scale distantly supervised challenge dataset for reading comprehension arXiv preprint arXiv:1705.03551 (2017)
[4] Rajpurkar Pranav Jian Zhang Konstantin Lopyrev and Percy Liang SQuAD 100000+ questions for machine comprehension of text arXiv preprint arXiv:1606.05250 (2016)
[5] The Stanford Question Answering Dataset 2018 rajpurkar.github.io Accessed March 16 2018 https://rajpurkar.github.io/SQuAD-explorer/
[6] Hu Minghao Yuxing Peng and Xipeng Qiu Reinforced mnemonic reader for machine comprehension CoRR abs/1705.02798 (2017)
[7] Schalkoff Robert J Artificial neural networks Vol 1 New York McGraw-Hill 1997
[8] Me For The AI and Neetesh Mehrotra 2018 The Connect Between Deep Learning And AI - Open Source For You Open Source For You Accessed March 16 2018 httpopensourceforucom201801connect-deep-learning-ai
[9] Krizhevsky Alex Ilya Sutskever and Geoffrey E Hinton Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems pp 1097-1105 2012
[10] An Intuitive Explanation Of Convolutional Neural Networks 2016 The Data Science Blog Accessed March 16 2018 httpsujjwalkarnme20160811intuitive-explanation-convnets
[11] Williams Ronald J and David Zipser A learning algorithm for continually running fully recurrent neural networks Neural computation 1 no 2 (1989) 270-280
[12] Hochreiter Sepp and Juumlrgen Schmidhuber Long short-term memory Neural computation 9 no 8 (1997) 1735-1780
[13] Chung Junyoung Caglar Gulcehre KyungHyun Cho and Yoshua Bengio Empirical evaluation of gated recurrent neural networks on sequence modeling arXiv preprint arXiv14123555 (2014)
40
[14] Mikolov Tomas Wen-tau Yih and Geoffrey Zweig Linguistic regularities in continuous space word representations In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics Human Language Technologies pp 746-751 2013
[15] Pennington Jeffrey Richard Socher and Christopher Manning Glove Global vectors for word representation In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) pp 1532-1543 2014
[16] Word2vec Tutorial - The Skip-Gram Model middot Chris Mccormick 2016 MccormickmlCom Accessed March 16 2018 httpmccormickmlcom20160419word2vec-tutorial-the-skip-gram-model
[17] Group 2015 [Paper Introduction] Bilingual Word Representations With Monolingual hellip SlideshareNet Accessed March 16 2018 httpswwwslidesharenetnaist-mtstudybilingual-word-representations-with-monolingual-quality-in-mind
[18] Attention Mechanism 2016 BlogHeuritechCom Accessed March 16 2018 httpsblogheuritechcom20160120attention-mechanism
[19] Weston Jason Sumit Chopra and Antoine Bordes 2014 Memory Networks arXiv preprint arXiv14103916v11 Accessed March 16 2018 httpsarxivorgabs14103916
[20] Wang Shuohang and Jing Jiang Machine comprehension using match-lstm and answer pointer arXiv preprint arXiv160807905 (2016)
[21] Wang Shuohang and Jing Jiang Learning natural language inference with LSTM arXiv preprint arXiv151208849 (2015)
[22] Vinyals Oriol Meire Fortunato and Navdeep Jaitly Pointer networks In Advances in Neural Information Processing Systems pp 2692-2700 2015
[23] Wang Wenhui Nan Yang Furu Wei Baobao Chang and Ming Zhou Gated self-matching networks for reading comprehension and question answering In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1 Long Papers) vol 1 pp 189-198 2017
[24] Seo Minjoon Aniruddha Kembhavi Ali Farhadi and Hannaneh Hajishirzi Bidirectional attention flow for machine comprehension arXiv preprint arXiv161101603 (2016)
[25] Clark Christopher and Matt Gardner Simple and effective multi-paragraph reading comprehension arXiv preprint arXiv171010723 (2017)
41
[26] 8 Analyzing Sentence Structure 2018 NltkOrg Accessed March 16 2018 httpwwwnltkorgbookch08html
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata, India, and has had a strong interest in computers since an early age. After high school, he earned a B.Sc. in computer science from Ramakrishna Mission Residential College, Narendrapur, followed by an M.Sc. in computer science from St. Xavier's College, Kolkata. He had a strong intuition for and interest in human-like learning systems and wanted to work in this area. He started working at TCS Innovation Labs, Pune, on the application of Natural Language Processing to educational applications. As he was deeply passionate about learning systems that mimic the human brain and learn like a human child does, he grew increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science at the University of Florida, Gainesville. His academic interests have been focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.
LIST OF FIGURES
Figure page
1-1 The task of Question Answering 10
2-1 Simple and Deep Learning Neural Networks 13
2-2 Convolutional Neural Network Architecture 14
2-3 Semantic relation between words in vector space 17
2-4 Attention Mechanism flow 18
2-5 QA example on Lord of the Rings using Memory Networks 20
3-1 Match-LSTM Model Architecture 22
3-2 The task of Question Answering 24
3-3 Bi-Directional Attention Flow Model Architecture 26
4-1 The modified BiDAF model with multilevel attention 31
4-2 Flight reservation chatbot's chat window 33
4-3 Chatbot within OneTask system 34
4-4 The Flow diagram of the Flight booking Chatbot system 35
4-5 QA system interface with attention highlight over candidate answers 36
5-1 An English language semantic parse tree 37
LIST OF ABBREVIATIONS
BiDAF Bi-Directional Attention Flow
CNN Convolutional Neural Network
GRU Gated Recurrent Units
LSTM Long Short Term Memory
NLP Natural Language Processing
NLU Natural Language Understanding
RNN Recurrent Neural Network
Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Science
QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM
By
Purnendu Mukherjee
May 2018
Chair: Xiaolin Li
Major: Computer Science
Question Answering (QA) systems have grown rapidly over the last three years and are close to reaching human-level accuracy. One of the fundamental reasons for this growth has been the use of the attention mechanism along with other Deep Learning methods. But just as with other Deep Learning methods, some of the failure cases are so obvious that they convince us there is still much to improve upon. In this work, we first review the literature on the state-of-the-art and fundamental models in QA systems. Next, we introduce an architecture which has shown improvement in the targeted area. We then introduce a general method to enable easy design of domain-specific chatbot applications and present a proof of concept using that method. Finally, we present an easy-to-use Question Answering interface with attention visualization over the passage. We also propose a method to improve the current state of the art as part of our ongoing work.
CHAPTER 1 INTRODUCTION
Teaching machines to read and understand human natural language is a central and long-standing goal of Natural Language Processing and of Artificial Intelligence in general. The positive impact of machines being able to reason about and comprehend human language could be enormous. We are beginning to see how commercial systems like Alexa, Siri, Google Now, etc. are being widely used as speech recognition has improved to human levels. While speech recognition systems can transcribe speech to text, comprehension of that transcribed text is another task, one that is currently a major focus for both academia and industry because of its possible applications. Moreover, all the text information available throughout the internet is a major reason why machine comprehension of text is such an important task.
With the growth of Deep Learning methods in the last few years, the field of Machine Comprehension, and Natural Language Processing (NLP) in general, has experienced a revolution. While the traditional methods and practices are still prevalent and form the basis of our deep understanding of languages, Deep Learning methods have surpassed all traditional NLP and Machine Learning methods by a significant margin and are currently driving the growth of the field.
To be able to build a system that can understand human text, we first need to ask ourselves how we can evaluate a machine's comprehension ability. We had initially set a goal to build a chatbot for a specific domain and generalize to other topics as we went ahead. While developing the system, we discovered the necessity of reading comprehension and the question of how to measure it. We finally found the answer in Question Answering systems.
Just as we human beings are tested on our language understanding with questions, we should ask machines similar questions about what they have just read. The performance of the system on such a question answering task lets us evaluate how much the machine is able to reason about what it just read [1]. Reading comprehension has been a topic of Natural Language Understanding since the 1970s. In 1977, Wendy Lehnert wrote in her doctoral thesis: "Only when we can ask a program to answer questions about what it reads will we be able to begin to access that program's comprehension" [2].
Figure 1-1 The task of Question Answering
To support this task, the NLP community has developed various datasets such as CNN/Daily Mail, WebQuestions, SQuAD, TriviaQA [3], etc. For our purpose we chose SQuAD, which stands for Stanford Question Answering Dataset [4]. SQuAD consists of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. With 100,000+ question-answer pairs on 500+ articles, SQuAD is significantly larger than previous reading comprehension datasets [5].

An example from the SQuAD dataset is as follows:
Passage: Tesla later approached Morgan to ask for more funds to build a more powerful transmitter. When asked where all the money had gone, Tesla responded by saying that he was affected by the Panic of 1901, which he (Morgan) had caused. Morgan was shocked by the reminder of his part in the stock market crash and by Tesla's breach of contract by asking for more funds. Tesla wrote another plea to Morgan, but it was also fruitless. Morgan still owed Tesla money on the original agreement, and Tesla had been facing foreclosure even before construction of the tower began.

Question: On what did Tesla blame for the loss of the initial money?

Answer: Panic of 1901
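The dataset's structure can be seen directly in code. The miniature record below is a hypothetical, shortened version of the Tesla example, but the "data" → "paragraphs" → "qas" nesting and the character-offset "answer_start" field follow the released SQuAD format, in which every answer is a literal span of the context:

```python
import json

# A miniature SQuAD v1.1-style record mirroring the Tesla example above.
raw = """
{"data": [{"title": "Nikola_Tesla", "paragraphs": [{
  "context": "Tesla responded by saying that he was affected by the Panic of 1901.",
  "qas": [{"id": "q1",
           "question": "On what did Tesla blame for the loss of the initial money?",
           "answers": [{"text": "Panic of 1901", "answer_start": 54}]}]}]}]}
"""
record = json.loads(raw)

paragraph = record["data"][0]["paragraphs"][0]
ans = paragraph["qas"][0]["answers"][0]
# The answer is a span of the passage: answer_start is a character offset.
span = paragraph["context"][ans["answer_start"]:ans["answer_start"] + len(ans["text"])]
```

Because answers are spans, a model only has to predict two indices into the passage rather than generate free text, which is what makes the pointer-style output layers discussed later a natural fit.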
As we started exploring the QA task, we faced several challenges. Some of them we could solve with the help of other research, and some still exist in the domain:
• Out-of-vocabulary words
• Multi-sentence reasoning may be required
• There may exist several candidate answers
• Optimizing the Exact Match (EM) metric may fail when the answer boundary is fuzzy or too long, such as the answer to a "why" query [6]
• "One-hop" prediction may fail to fully understand the query [6]
• Models may fail to fully capture the long-distance contextual interaction between parts of the context when only using an LSTM/GRU
• Current models are unable to capture the semantics of the passage
In the upcoming chapters, we will first briefly review the basics necessary for understanding the models; then we will delve into the fundamental models that have shaped the current state of the art; then we will discuss our contribution in terms of architecture and applications; and finally we will conclude with future directions.
CHAPTER 2 THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
To build a question answering system, one needs to be familiar with fundamental deep learning models such as Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), etc. In this chapter we will give an overview of these techniques and see how they all connect to building a question answering system.
Neural Networks
What makes Deep Learning so intriguing is that it closely resembles the working of the mammalian brain, or at least draws inspiration from it. The same can be said for Artificial Neural Networks [7], which consist of a system of interconnected units called 'neurons' that take input from similar units and produce a single output.
Figure 2-1 Simple and Deep Learning Neural Networks [8]
The connection from one neuron to another can be weighted based on the input data, which enables the network to tune itself to produce a certain output for a given input. This is the learning process, which is achieved through backpropagation, a method of propagating the error from the output layer back to the previous layers.
Convolutional Neural Network
The first wave of deep learning's success was brought by Convolutional Neural Networks (CNN) [9], the technique used by the winning team of the ImageNet competition in 2012. CNNs are deep artificial neural networks (ANN) that can be used to classify images, cluster them by similarity, and perform object recognition within scenes. They can be used to detect and identify faces, people, signs, or any other visual data.
Figure 2-2 Convolutional Neural Network Architecture [10]
There are primarily four operations in a standard CNN model (as shown in Figure 2-2):
1. Convolution - The primary purpose of convolution in the ConvNet is to extract features from the input image. The spatial relationships between pixels, i.e., the image features, are preserved and learned by the convolution using small squares of input data.
2. Non-Linearity (ReLU) - The Rectified Linear Unit (ReLU) is a non-linear operation that carries out an element-wise operation on each pixel. This operation replaces the negative pixel values in the feature map by zero.
3. Pooling or Sub-Sampling - Spatial pooling reduces the dimensionality of each feature map while retaining the most important information. For max pooling, the largest value in the square window is kept and the rest are dropped. Other types of pooling are average, sum, etc.
4. Classification (Fully Connected Layer) - The fully connected layer is a traditional Multi-Layer Perceptron, as described before, that uses a softmax activation function in the output layer. The high-level features of the image are encoded by the convolutional and pooling layers and then fed to the fully connected layer, which uses these features to classify the input image into various classes based on the training dataset [10].
When a new image is fed into the CNN model, all the above-mentioned steps are carried out (forward propagation) and a probability distribution is obtained over the set of output classes. With a large enough training dataset, the network will learn and generalize well enough to classify new images into their correct classes.
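The four operations above can be sketched with plain NumPy. This is a toy illustration, not a trainable network: the image, the edge-detecting kernel, and the single "fully connected" score are all made up for the example.

```python
import numpy as np

def conv2d(img, kernel):
    """Valid 'convolution' (cross-correlation, as in most deep learning libraries)."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0)  # replace negative values in the feature map by zero

def max_pool(x, size=2):
    h, w = x.shape
    h, w = h - h % size, w - w % size  # drop ragged edges
    return x[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# A 6x6 "image" with a vertical edge, and a kernel that responds to it.
img = np.zeros((6, 6))
img[:, 3:] = 1.0
kernel = np.array([[-1.0, 1.0], [-1.0, 1.0]])

feat = max_pool(relu(conv2d(img, kernel)))   # convolution -> ReLU -> pooling
scores = np.array([feat.sum(), 0.0])         # toy "fully connected" class scores
probs = softmax(scores)                      # probability distribution over classes
```

In a real CNN the kernels are learned by backpropagation rather than hand-chosen, and the fully connected layer has its own weight matrix; the data flow, however, is exactly the conv → ReLU → pool → softmax pipeline enumerated above.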
Recurrent Neural Networks (RNN)
Whenever we want to predict or encode sequential data, an RNN [11] is our go-to method. RNNs perform the same task for every element of a sequence, where the output for each element depends on previous computations, hence the recurrence:

h_t = tanh(W_h h_{t-1} + W_x x_t)    (2-1)

In practice, RNNs are unable to retain long-term dependencies and can look back only a few steps because of the vanishing gradient problem.
A solution to the dependency problem is to use gated cells such as the LSTM [12] or GRU [13]. These cells pass important information on to the next cells while ignoring unimportant information. The gated units in a GRU block are:

• Update Gate - computed from the current input and the previous hidden state:

z_t = σ(W_z x_t + U_z h_{t-1})    (2-2)

• Reset Gate - calculated similarly, but with different weights:

r_t = σ(W_r x_t + U_r h_{t-1})    (2-3)

• New memory content:

h̃_t = tanh(W x_t + r_t ∘ U h_{t-1})    (2-4)

If the reset gate unit is ~0, then the previous memory is ignored and only the new information is kept.

The final memory at the current time step combines the previous and current time steps:

h_t = z_t ∘ h_{t-1} + (1 - z_t) ∘ h̃_t    (2-5)
While the GRU is computationally efficient, the LSTM is the more general case, with three gates as follows:

• Input Gate - what new information to add to the current cell state
• Forget Gate - how much information from previous states should be kept
• Output Gate - how much information should be sent to the next states

Just as in the GRU, the current cell state is a sum of the previous cell state, weighted by the forget gate, and the new candidate value, weighted by the input gate. Based on the cell state, the output gate regulates the final output.
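Equations 2-2 through 2-5 translate almost line-for-line into code. The sketch below runs a single GRU step with randomly initialized weights; the parameter shapes are illustrative, and the gate convention follows the formulation given above (conventions differ slightly between papers).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    z = sigmoid(Wz @ x_t + Uz @ h_prev)             # update gate, Eq. 2-2
    r = sigmoid(Wr @ x_t + Ur @ h_prev)             # reset gate, Eq. 2-3
    h_tilde = np.tanh(W @ x_t + U @ (r * h_prev))   # new memory content, Eq. 2-4
    return z * h_prev + (1.0 - z) * h_tilde         # final memory, Eq. 2-5

rng = np.random.default_rng(0)
d_in, d_h = 3, 4                                    # toy input and hidden sizes
Wz, Wr, W = (rng.standard_normal((d_h, d_in)) for _ in range(3))
Uz, Ur, U = (rng.standard_normal((d_h, d_h)) for _ in range(3))
h = gru_step(rng.standard_normal(d_in), np.zeros(d_h), Wz, Uz, Wr, Ur, W, U)
```

Running this step in a loop over a token sequence, carrying h forward each time, is all an unrolled GRU layer does; a trained model differs only in that the six weight matrices are learned.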
Word Embedding
Computations and gradients can be applied to numbers, not to words or letters. So we first need to convert words into corresponding numerical representations before feeding them into a deep learning model. In general there are two types of word embedding: frequency-based (which comprises count vectors, tf-idf, and co-occurrence vectors) and prediction-based. With frequency-based embeddings, the order of the words is not preserved and the model works as a bag-of-words model, whereas with prediction-based models, the order or locality of words is taken into consideration to generate the numerical representation of a word. Within the prediction-based category there are two fundamental techniques, called Continuous Bag of Words (CBOW) and the Skip-Gram Model, which form the basis for word2vec [14] and GloVe [15].
The basic intuition behind word2vec is that if two different words have very similar "contexts" (that is, what words are likely to appear around them), then the model will produce similar vectors for those words. Conversely, if two word vectors are similar, then the network will produce similar context predictions for those two words. For example, synonyms like "intelligent" and "smart" would have very similar contexts, and related words like "engine" and "transmission" would probably have similar contexts as well [16]. Plotting the word vectors learned by word2vec over a large corpus, we can find some very interesting relationships between words.
Figure 2-3 Semantic relation between words in vector space [17]
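The "similar contexts → similar vectors" intuition is easy to probe with cosine similarity. The 2-dimensional vectors below are made up purely for illustration; real word2vec or GloVe embeddings have hundreds of dimensions and are learned from a corpus.

```python
import numpy as np

# Made-up 2-d vectors; real word2vec/GloVe embeddings have 100-300 dimensions.
vecs = {
    "intelligent": np.array([0.90, 0.10]),
    "smart":       np.array([0.85, 0.15]),
    "banana":      np.array([0.05, 0.95]),
}

def cosine(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_synonyms = cosine(vecs["intelligent"], vecs["smart"])
sim_unrelated = cosine(vecs["intelligent"], vecs["banana"])
```

With learned embeddings, the same few lines also expose the analogy structure shown in Figure 2-3, e.g. comparing vec("king") - vec("man") + vec("woman") against vec("queen").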
Attention Mechanism
We as humans put our attention on things that are important or relevant in a context. For example, when asked a question about a passage, we try to find the part of the passage most relevant to the question and then reason from our understanding of that part of the passage. The same idea applies to the attention mechanism in Deep Learning: it is used to identify the specific parts of a given context to which the current question is relevant.

Formally put, the technique takes n arguments y_1, ..., y_n (in our case the passage word representations) and a question representation, say q. It returns a vector z which is meant to be the "summary" of the y_i, focusing on the information linked to the question q. More formally, it returns a weighted arithmetic mean of the y_i, where the weights are chosen according to the relevance of each y_i given the question q [18].
Figure 2-4 Attention Mechanism flow [18]
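The weighted-mean formulation above fits in a few lines of code. Dot-product scoring stands in here for the learned relevance function, so this is an illustration of the idea rather than any particular model's attention layer:

```python
import numpy as np

def attend(ys, q):
    """z = sum_i alpha_i * y_i, with alpha a softmax over relevance scores.

    ys: (n, d) passage vectors; q: (d,) question vector. Dot-product
    scoring is a stand-in for the learned relevance function."""
    scores = ys @ q                       # relevance of each y_i to q
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()           # attention weights, sum to 1
    z = alpha @ ys                        # weighted arithmetic mean of the y_i
    return alpha, z

ys = np.array([[1.0, 0.0],                # three toy passage vectors
               [0.0, 1.0],
               [1.0, 1.0]])
alpha, z = attend(ys, np.array([2.0, 0.0]))
```

Note that z has the same dimensionality as each y_i regardless of the passage length n, which is what lets attention "summarize" arbitrarily long contexts into a fixed-size vector.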
Memory Networks
While Convolutional Neural Networks and Recurrent Neural Networks do capture how we form our visual and sequential memories, their memory (encoded by hidden states and weights) is typically too small and not compartmentalized enough to accurately remember facts from the past (knowledge is compressed into dense vectors) [19].

Deep Learning needed a methodology that preserves memories as they are, such that they are not lost in generalization, and such that recalling exact words or sequences of events is possible - something computers are already good at. This effort led to Memory Networks [19], published at ICLR 2015 by Facebook AI Research.
This paper provides a basic framework to store, augment, and retrieve memories while seamlessly working with a Recurrent Neural Network architecture. The memory network consists of a memory m (an array of objects, indexed by m_i) and four (potentially learned) components I, G, O, and R, as follows:
I (input feature map) - converts the incoming input to the internal feature representation, either a sparse or dense feature vector, like that from word2vec or GloVe.

G (generalization) - updates old memories given the new input. They call this generalization, as there is an opportunity for the network to compress and generalize its memories at this stage for some intended future use - the compression analogy described before.

O (output feature map) - produces a new output (in the feature representation space) given the new input and the current memory state. This component is responsible for performing inference. In a question answering system, this part selects the candidate sentences (which might contain the answer) from the story (conversation) so far.

R (response) - converts the output into the desired response format, for example a textual response or an action. In the QA system described, this component finds the desired answer and then converts it from the feature representation to the actual word.
This model is fully supervised, meaning all the candidate sentences from which the answer can be found are marked during the training phase; this can also be termed 'hard attention'.
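The I-G-O-R decomposition can be caricatured in a few lines of Python. Everything below is a hand-written stand-in for the learned components of the paper: bag-of-words features for I, an append-only memory for G, word-overlap scoring for O, and a pick-the-novel-word heuristic for R.

```python
class MemoryNetwork:
    """Caricature of the I-G-O-R framework; every component here is a
    hand-written stand-in for the learned modules of the paper."""

    def __init__(self):
        self.memory = []                          # m: an array of stored objects

    def I(self, text):                            # input feature map: bag of words
        return set(text.lower().split())

    def G(self, features, text):                  # generalization: just append
        self.memory.append((features, text))

    def O(self, q_feats):                         # output: best-overlapping memory
        return max(self.memory, key=lambda m: len(m[0] & q_feats))[1]

    def R(self, support, q_feats):                # response: a word not in the question
        return max(support.split(), key=lambda w: w.lower() not in q_feats)

    def ask(self, question):
        q = self.I(question)
        return self.R(self.O(q), q)

mn = MemoryNetwork()
for fact in ["Frodo took the ring", "Sam went to the Shire"]:
    mn.G(mn.I(fact), fact)
answer = mn.ask("Who took the ring")
```

Even this toy version shows the division of labor: O performs the inference step (picking the supporting memory), while R only extracts the answer from it, exactly the split the paper's learned components implement.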
The authors tested the QA system on various literature, including Lord of the Rings.
Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]
CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART
There has been rapid progress since the release of the SQuAD dataset, and currently the best ensemble models are close to human-level accuracy in machine comprehension. This is due to various ingenious methods which solve some of the problems of previous approaches: out-of-vocabulary tokens were handled by using character embeddings; long-term dependencies within the context passage were addressed using self-attention; and many other techniques such as contextualized vectors, history of words, attention flow, etc. were introduced. In this section we will look at some of the most important models that were fundamental to the progress of Question Answering.
Machine Comprehension Using Match-LSTM and Answer Pointer
In this paper [20], the authors propose an end-to-end neural architecture for the QA task. The architecture is based on match-LSTM [21], a model they previously proposed for textual entailment, and Pointer Net [22], a sequence-to-sequence model proposed by Vinyals et al. (2015) to constrain the output tokens to be from the input sequences. The model consists of an LSTM preprocessing layer, a match-LSTM layer, and an Answer Pointer layer.

We are given a piece of text, which we refer to as a passage, and a question related to the passage. The passage is represented by a matrix P ∈ R^{d×P}, where P is the length (number of tokens) of the passage and d is the dimensionality of the word embeddings. Similarly, the question is represented by a matrix Q ∈ R^{d×Q}, where Q is the length of the question. Our goal is to identify a subsequence of the passage as the answer to the question.
Figure 3-1 Match-LSTM Model Architecture [20]
LSTM Preprocessing Layer: They use a standard one-directional LSTM (Hochreiter & Schmidhuber, 1997) to process the passage and the question separately.

Match-LSTM Layer: They apply the match-LSTM model proposed for textual entailment to the machine comprehension problem by treating the question as a premise and the passage as a hypothesis. The match-LSTM sequentially goes through the passage. At position i of the passage, it first uses the standard word-by-word attention mechanism to obtain the attention weight vector α_i as follows:

G_i = tanh(W^q H^q + (W^p h_i^p + W^r h_{i-1}^r + b^p) ⊗ e_Q)
α_i = softmax(w^T G_i + b ⊗ e_Q)    (3-1)

where W^q, W^p, W^r ∈ R^{l×l}, b^p, w ∈ R^l, and b ∈ R are parameters to be learned; H^q is the question representation from the preprocessing layer; h_i^p is the representation of the i-th passage token; h_{i-1}^r is the match-LSTM hidden state at the previous position; and ⊗ e_Q repeats the vector Q times.
Answer Pointer Layer: The Answer Pointer (Ans-Ptr) layer is motivated by the Pointer Net introduced by Vinyals et al. (2015) [22]. The boundary model produces only the start token and the end token of the answer; all the tokens between these two in the original passage are then considered to be the answer.
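The decoding step of the boundary model, choosing a start index i and an end index j ≥ i, can be sketched as below. Scoring each span by the product of its start and end probabilities, under a maximum-length constraint, is a common decoding heuristic for such models rather than necessarily the exact procedure of this paper.

```python
import numpy as np

def best_span(p_start, p_end, max_len=15):
    """Return (i, j), j >= i, maximizing p_start[i] * p_end[j]."""
    best_score, best_pair = -1.0, (0, 0)
    for i, ps in enumerate(p_start):
        # Only consider end positions at or after the start, within max_len.
        for j in range(i, min(i + max_len, len(p_end))):
            score = ps * p_end[j]
            if score > best_score:
                best_score, best_pair = score, (i, j)
    return best_pair

# Toy start/end distributions over a four-token passage.
p_start = np.array([0.1, 0.6, 0.2, 0.1])
p_end = np.array([0.1, 0.1, 0.7, 0.1])
start, end = best_span(p_start, p_end)
```

The j ≥ i constraint is what makes the boundary model well-formed: taking independent argmaxes of the two distributions could yield an end index before the start index.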
When this paper was released back in November 2016, the match-LSTM method was the state of the art in Question Answering systems and was at the top of the leaderboard for the SQuAD dataset.
R-NET: Machine Reading Comprehension with Self-Matching Networks
In this model [23], the question and passage are first processed separately by a bidirectional recurrent network (Mikolov et al., 2010). They then match the question and passage with gated attention-based recurrent networks, obtaining a question-aware representation of the passage. On top of that, they apply self-matching attention to aggregate evidence from the whole passage and refine the passage representation, which is then fed into the output layer to predict the boundary of the answer span.

Question and Passage Encoding: First, the words are converted to their respective word-level and character-level embeddings. The character-level embeddings are generated by taking the final hidden states of a bi-directional recurrent neural network (RNN) applied to the embeddings of the characters in the token. Such character-level embeddings have been shown to help deal with out-of-vocabulary (OOV) tokens.
They then use a bi-directional RNN to produce new representations u_t^Q and u_t^P of all words in the question and passage, respectively.
Figure 3-2 The task of Question Answering [23]
Gated Attention-Based Recurrent Networks: They use a variant of attention-based recurrent networks with an additional gate to determine the importance of information in the passage with regard to a question. Different from the gates in an LSTM or GRU, this additional gate is based on the current passage word and its attention-pooling vector of the question, which focuses on the relation between the question and the current passage word. The gate effectively models the phenomenon that in reading comprehension and question answering, only parts of the passage are relevant to the question, and only that information is utilized in subsequent calculations.
Self-Matching Attention: From the previous step, a question-aware passage representation is generated to highlight the important parts of the passage. One problem with such a representation is that it has very limited knowledge of context; an answer candidate is often oblivious to important cues in the passage outside its surrounding window. To address this problem, the authors propose directly matching the question-aware passage representation against itself. It dynamically collects evidence from the whole passage for each word in the passage and encodes the evidence relevant to the current passage word, together with its matching question information, into the passage representation.

Output Layer: They use the same method as Wang & Jiang (2016b) and use pointer networks (Vinyals et al., 2015) to predict the start and end positions of the answer. In addition, they use attention-pooling over the question representation to generate the initial hidden vector for the pointer network [23].
When the R-NET model first appeared on the leaderboard in March 2017, it was at the top with a 72.3 Exact Match and an 80.7 F1 score.
Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
BiDAF [24] is a hierarchical multi-stage architecture for modeling the representations of the context paragraph at different levels of granularity. BiDAF includes character-level, word-level, and contextual embeddings, and uses bi-directional attention flow to obtain a query-aware context representation. Their attention layer is not used to summarize the context paragraph into a fixed-size vector. Instead, the attention is computed for every time step, and the attended vector at each time step, along with the representations from previous layers, can flow through to the subsequent modeling layer. This reduces the information loss caused by early summarization [24].
Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]
Their machine comprehension model is a hierarchical multi-stage process and consists of six layers:

1. Character Embedding Layer: maps each word to a vector space using character-level CNNs.
2. Word Embedding Layer: maps each word to a vector space using a pre-trained word embedding model.
3. Contextual Embedding Layer: utilizes contextual cues from surrounding words to refine the embedding of the words. These first three layers are applied to both the query and the context. They use an LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words, with an LSTM in each direction, and concatenate the outputs of the two LSTMs.
4. Attention Flow Layer: couples the query and context vectors and produces a set of query-aware feature vectors for each word in the context.
5. Modeling Layer: employs a Recurrent Neural Network to scan the context. The input to the modeling layer is G, which encodes the query-aware representations of the context words. The output of the modeling layer captures the interaction among the context words conditioned on the query. They use two layers of bi-directional LSTM with an output size of d for each direction. Hence a matrix M ∈ R^{2d×T} is obtained, which is passed on to the output layer to predict the answer.
6. Output Layer: provides an answer to the query. They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices under the predicted distributions, averaged over all examples [25].
In a further variation of their above work, they add a self-attention layer after the bi-attention layer to further improve the results [25]. The architecture of the model is shown in Figure 3-3.
Summary
In this chapter we reviewed the methods that are fundamental to the state of the art in Machine Comprehension and the task of Question Answering. We have come close to human-level accuracy, and this is due to incremental developments over previous models. As we saw, out-of-vocabulary (OOV) tokens were handled by using character embeddings, long-term dependencies within the context passage were addressed using self-attention, and many other techniques such as contextualized vectors, attention flow, etc. were employed to get better results. In the next chapter we will see how we can build on these models and develop further.
CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING
Although the advancements in Question Answering systems since the release of the SQuAD dataset have been impressive, and the results are getting close to human-level accuracy, QA is far from being a fool-proof task. The models still make mistakes that would be obvious to a human. For example:

Passage: The Panthers used the San Jose State practice facility and stayed at the San Jose Marriott. The Broncos practiced at Florida State Facility and stayed at the Santa Clara Marriott.

Question: At what university's facility did the Panthers practice?

Actual Answer: San Jose State

Predicted Answer: Florida State Facility
To find out what leads to the wrong predictions, we wanted to see the attention weights associated with such an example. We plotted the passage-question heat map, which is a 2D matrix where the intensity of each cell signifies the similarity between a passage word and a question word. For the above example, we found that while certain words of the question are given high weight, other parts are not. The words 'At', 'facility', and 'practice' are given high attention, but 'Panthers' does not receive high attention. If it had received high attention, then the system would have predicted 'San Jose State' as the right answer. To solve this issue, we analyzed the base BiDAF model and proposed adding two things:
1. Bi-attention and self-attention over the query
2. A second level of attention over the outputs of (bi-attention + self-attention) from both the context and the query
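The heat map described above is just a passage-by-question matrix of attention scores. The sketch below builds one with random stand-in embeddings and a plain dot-product score (the actual model uses learned contextual representations and the trilinear similarity of BiDAF); rendering A with matplotlib's imshow would produce the kind of heat map we inspected.

```python
import numpy as np

passage = ["Panthers", "used", "the", "practice", "facility"]
question = ["At", "what", "facility", "did", "Panthers", "practice"]

# Random stand-in embeddings, shared across the two lists so repeated
# words ("Panthers", "practice", "facility") score highly against themselves.
rng = np.random.default_rng(2)
emb = {w: rng.standard_normal(8) for w in set(passage + question)}

# S[i, j]: similarity between passage word i and question word j.
S = np.array([[emb[p] @ emb[q] for q in question] for p in passage])

# Softmax over the question axis gives the rows of the heat map.
A = np.exp(S - S.max(axis=1, keepdims=True))
A = A / A.sum(axis=1, keepdims=True)
```

A failure case like the one above shows up in such a matrix as a row or column ('Panthers') whose mass is spread thin instead of concentrating on its counterpart, which is exactly what motivated the two additions listed above.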
Multi-Attention BiDAF Model
Our comprehension model is a hierarchical multi-stage process and consists of
the following layers
1. Embedding: Just as in all other models, we embed words using pretrained word vectors. We also embed the characters of each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to obtain character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training.
2. Pre-Process: A shared bi-directional GRU (Cho et al. 2014) is used to map the question and passage embeddings to context-aware embeddings.
3. Context Attention: The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al. 2016) is used to build a query-aware context representation. Let h_i be the vector for context word i, q_j the vector for question word j, and n_q and n_c the lengths of the question and context respectively. We compute attention between context word i and question word j as
a_ij = w1 · h_i + w2 · q_j + w3 · (h_i ∘ q_j)    (4-1)
where w1, w2, and w3 are learned vectors and ∘ is element-wise multiplication. We then compute an attended vector c_i for each context token as
p_ij = softmax_j(a_ij),    c_i = Σ_j p_ij q_j    (4-2)
We also compute a query-to-context vector q_c
b_i = softmax_i(max_j a_ij),    q_c = Σ_i b_i h_i    (4-3)
The final vector computed for each token is built by concatenating h_i, c_i, h_i ∘ c_i,
and h_i ∘ q_c. In our model we subsequently pass the result through a linear layer with ReLU activations.
4. Context Self-Attention: Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU, and then we apply the same attention mechanism, only now between the passage and itself. In this case we do not use query-to-context attention, and we set a_ij = −inf if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so its output is additionally summed with its input.
5. Query Attention: Here we proceed the same way as in context attention, but calculate the weighted sum of the context words for each query word, so the output length equals the number of query words. We then calculate context-to-query attention, analogous to the query-to-context attention of the context attention layer.
Figure 4-1 The modified BiDAF model with multilevel attention
6. Query Self-Attention: This is done the same way as the context self-attention layer, but on the output of the Query Attention layer.
7. Context-Query Bi-Attention + Self-Attention: The outputs of the Context Self-Attention and Query Self-Attention layers are taken as input, and the same Bi-Attention and self-attention process is applied to these inputs.
8. Prediction: In the last layer of our model, a bidirectional GRU is applied, followed by a linear layer that computes answer start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log likelihood of selecting correct start and end tokens.
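As a rough illustration, the context attention of layer 3 can be sketched in a few lines of numpy. The dimensions and vectors are made up, and the final concatenation [h_i; c_i; h_i ∘ c_i; h_i ∘ q_c] is the usual BiDAF convention, assumed here rather than taken from the thesis code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, n_c, n_q = 8, 5, 3                      # hidden size, context length, question length
H = rng.standard_normal((n_c, d))          # context vectors h_i
Q = rng.standard_normal((n_q, d))          # question vectors q_j
w1, w2, w3 = rng.standard_normal((3, d))   # the learned vectors of Eq. 4-1

# a_ij = w1.h_i + w2.q_j + w3.(h_i * q_j)  (tri-linear similarity, Eq. 4-1)
A = (H @ w1)[:, None] + (Q @ w2)[None, :] + (H * w3) @ Q.T

# context-to-query: attended vector c_i per context token (Eq. 4-2)
C = softmax(A, axis=1) @ Q                 # shape (n_c, d)

# query-to-context: a single vector q_c from the max-over-question scores (Eq. 4-3)
b = softmax(A.max(axis=1))                 # shape (n_c,)
q_c = b @ H                                # shape (d,)

# final per-token representation: [h_i ; c_i ; h_i*c_i ; h_i*q_c]
out = np.concatenate([H, C, H * C, H * q_c[None, :]], axis=1)
print(out.shape)                           # (5, 32)
```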
Having carried out this modification, we were able to solve the wrong example we
started with: the multilevel attention model gives the correct output, 'San Jose
State'. We also achieved slightly better scores than the original model, with an F1 score
of 85.44 on the SQuAD dev set.
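The span prediction and training objective of layer 8 can be sketched as follows; the scores are random stand-ins for the GRU outputs, and the gold indices are invented for the example:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(3)
n = 6                                    # context length
start_scores = rng.standard_normal(n)    # stand-in for the first GRU + linear layer
end_scores = rng.standard_normal(n)      # stand-in for the second GRU + linear layer

p_start, p_end = softmax(start_scores), softmax(end_scores)

# training: negative log likelihood of the (invented) gold span
gold_start, gold_end = 2, 4
loss = -np.log(p_start[gold_start]) - np.log(p_end[gold_end])

# inference: most probable valid span with start <= end
best = max(((i, j) for i in range(n) for j in range(i, n)),
           key=lambda ij: p_start[ij[0]] * p_end[ij[1]])
print(best)
```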
Chatbot Design Using a QA System
Designing a perfect chatbot that passes the Turing test is a fundamental goal of
Artificial Intelligence. Although we are many orders of magnitude away from achieving
such a goal, domain-specific tasks can be solved with chatbots built from current
technology. We started with a similar objective in mind, i.e., to design a
domain-specific chatbot and then generalize to other areas once the first domain-specific
objective is achieved robustly. This led us to the fundamental problem of
Machine Comprehension and subsequently to the task of Question Answering. Having
achieved some degree of success with QA systems, we looked back at whether we could apply
our newly acquired knowledge to the task of designing chatbots.
The chatbots built with today's technologies mostly use handcrafted techniques
such as template matching, which require anticipating all the possible ways a user may
articulate his requirements and a conversation may unfold. This costs a lot of man-hours
when designing a domain-specific system and is still very error-prone. In this section
we propose a general chatbot design that would make the design of a domain-specific
chatbot easy and robust at the same time.
Every domain-specific chatbot needs to obtain a set of information from the user
and show some results based on the user-specific information obtained. Traditional
chatbots use template matching and keyword lookup to determine whether the user has
provided the required information. Our idea is to use the Question Answering system in
the backend to extract the required information from whatever the user has typed
up to this point in the conversation. The information to be extracted can be posed as
a set of questions, and the answers obtained from those questions can be used
as the parameters to supply the relevant information to the user.
We chose the flight reservation system as our chatbot domain. Our goal
was to extract the required information from the user to be able to show the
available flights as per the user's requirements.
Figure 4-2 Flight reservation chatbot's chat window
For a flight reservation task, the booking agent needs to know at minimum the origin city,
destination city, and date of travel to be able to show the available flights.
Optional information includes the number of tickets, the passenger's name, one-way or
round trip, etc.
The minimalistic conversation with the user through the chat window would be as
shown above. We had a platform called OneTask on which we wanted to implement our
chatbot. The chat interface within the OneTask system looks as follows:
Figure 4-3 Chatbot within OneTask system
The working of the chat system is as follows:
1. Initiation: The user opens the chat window, which starts a session with the chatbot. The chatbot asks 'How may I help you?'
2. User Reply: The user may reply with none of the required information for flight booking, or may provide several pieces of information in the same message.
3. User Reply Parsing: The conversation up to this point is treated as a passage, and the internal questions are run on this passage. The questions that are run are:
Where do you want to go?
From where do you want to leave?
When do you want to depart?
4. Parsed Responses from the QA Model: After running the questions on the QA system with the given conversation, answers are obtained for all of them, along with their corresponding confidence values. Since answers will be produced even if the required question has not been answered up to this point in the conversation, the confidence values play a crucial role in determining the validity of each answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly; a confidence of 2 to 10 signifies that it may have been answered, but the system should verify with the user for correctness; and any answer with confidence below 2 is discarded.
5. Asking Remaining Questions Iteratively: After the parsing, the system checks whether any of the required questions are still unanswered. If so, the chatbot asks the remaining question, and steps 3 and 4 are carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request.
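The five steps above can be sketched as a slot-filling loop. Here `ask_qa_model` is a hypothetical stand-in for the QA backend, and the thresholds follow the ranges described in step 4:

```python
# Sketch of the slot-filling loop. `ask_qa_model` is a hypothetical stand-in
# for the QA backend: it returns (answer, confidence) for a given question.
REQUIRED = {
    "origin":      "From where do you want to leave?",
    "destination": "Where do you want to go?",
    "date":        "When do you want to depart?",
}

def parse_slots(conversation, ask_qa_model):
    """Run every internal question over the conversation so far and sort the
    answers into confirmed / tentative / missing by confidence (step 4)."""
    confirmed, tentative = {}, {}
    for slot, question in REQUIRED.items():
        answer, conf = ask_qa_model(conversation, question)
        if conf > 10:              # high confidence: accept the answer
            confirmed[slot] = answer
        elif conf > 2:             # medium confidence: verify with the user
            tentative[slot] = answer
        # confidence below 2: discard and re-ask later (step 5)
    missing = [s for s in REQUIRED if s not in confirmed and s not in tentative]
    return confirmed, tentative, missing

# Toy stand-in model that pretends only the destination was mentioned so far.
def fake_model(conversation, question):
    return ("San Francisco", 12.0) if "go" in question else ("", 0.5)

confirmed, tentative, missing = parse_slots("I want to fly to San Francisco", fake_model)
print(confirmed, missing)
```

The chatbot would then ask the first question in `missing`, append the user's reply to the conversation, and call `parse_slots` again until nothing is missing.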
Figure 4-4 The Flow diagram of the Flight booking Chatbot system
Online QA System and Attention Visualization
To be able to test out various examples, we used the BiDAF [24] model for an
online demo. One can either choose from the available examples in the drop-down
menu or paste in their own passage and questions. While this is a useful and interesting
system for testing the model in a user-friendly way, we created it primarily to be able to
focus on the wrong samples.
Figure 4-5 QA system interface with attention highlight over candidate answers
The candidate answers are shown with blue highlights according to their
confidence values: the higher the confidence, the darker the answer. The answer with the
highest confidence value is chosen as the predicted answer. We developed the system to show
the attention spread of the candidate answers in order to understand what needs to be done
to improve the system. This led us to realize the importance of including the query attention
part, as well as multilevel attention, in the BiDAF model, as described in the first section
of this chapter.
CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions
Having done a thorough analysis of the current methods and state of the art in
Machine Comprehension and the QA task, and having developed systems for it, we have
gained a strong sense of what needs to be done to further improve QA models.
Observing the wrong samples, we can see that the system is still unable to encode
meaning and is picking answers based on statistical patterns of answer occurrence
in the training examples.
From the state-of-the-art models and the ongoing research literature, it is easy to
conclude that more features need to be embedded to encode the meaning
of words, phrases, and sentences. The paper 'Reinforced Mnemonic Reader for
Machine Comprehension' [6] encoded POS and NER tags of words along with their word
and character embeddings, which gave better results.
Figure 5-1 An English language semantic parse tree [26]
We have developed a method to encode the syntax parse tree of a sentence that
encodes not only the POS tags but also the relation of each word within its phrase and the
relation of each phrase within the whole sentence, in a hierarchical manner.
Finally, data augmentation is another way to get better results. One definite
way to reduce the errors would be to include in the training data samples similar to those
on which the system falters in the dev set: one could generate examples similar to the failure
cases and include them in the training set for better prediction. Another approach
would be to train on similar and bigger datasets. Our models were trained on the SQuAD
dataset, but other datasets, such as TriviaQA, pose a similar question answering task.
We could augment the SQuAD training set with that of TriviaQA to obtain
a more robust system that generalizes better and thus predicts answer spans with
higher accuracy.
Conclusion
In this work we have explored the most fundamental techniques that have
shaped the current state of the art. We then proposed a minor architectural improvement
over an existing model. Furthermore, we developed two applications that
use the base model: first, we discussed how a chatbot application can be built
using the QA system, and second, we created a web interface where the model can
be used for any passage and question. This interface also shows the attention spread
over the candidate answers. While our effort to push the state of the art
forward is ongoing, we strongly believe that surpassing human-level accuracy on this task
will pay high dividends for society at large.
LIST OF REFERENCES
[1] Danqi Chen. From Reading Comprehension To Open-Domain Question Answering. 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U
[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.
[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551 (2017).
[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016).
[5] The Stanford Question Answering Dataset. 2018. rajpurkar.github.io. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/
[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. Reinforced mnemonic reader for machine comprehension. CoRR abs/1705.02798 (2017).
[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.
[8] Me For The AI and Neetesh Mehrotra. 2018. The Connect Between Deep Learning And AI. Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai
[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.
[10] An Intuitive Explanation Of Convolutional Neural Networks. 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
[11] Williams, Ronald J., and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation 1, no. 2 (1989): 270-280.
[12] Hochreiter, Sepp, and Jürgen Schmidhuber. Long short-term memory. Neural Computation 9, no. 8 (1997): 1735-1780.
[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).
[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.
[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.
[16] Word2Vec Tutorial - The Skip-Gram Model. Chris McCormick. 2016. mccormickml.com. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
[17] Group. 2015. [Paper Introduction] Bilingual Word Representations With Monolingual Quality in Mind. slideshare.net. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind
[18] Attention Mechanism. 2016. blog.heuritech.com. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism/
[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. 2014. Memory Networks. arXiv preprint arXiv:1410.3916. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916
[20] Wang, Shuohang, and Jing Jiang. Machine comprehension using Match-LSTM and answer pointer. arXiv preprint arXiv:1608.07905 (2016).
[21] Wang, Shuohang, and Jing Jiang. Learning natural language inference with LSTM. arXiv preprint arXiv:1512.08849 (2015).
[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.
[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.
[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603 (2016).
[25] Clark, Christopher, and Matt Gardner. Simple and effective multi-paragraph reading comprehension. arXiv preprint arXiv:1710.10723 (2017).
[26] 8. Analyzing Sentence Structure. 2018. nltk.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata, India, and had a strong interest in
computers from an early age. After high school he earned his B.Sc. in computer
science from Ramakrishna Mission Residential College, Narendrapur, followed by an M.Sc.
in computer science from St. Xavier's College, Kolkata. He had a strong intuition and
interest for human-like learning systems and wanted to work in this area. He started
working at TCS Innovation Labs, Pune, on applications of Natural Language
Processing in education. Being deeply passionate about learning systems that mimic
the human brain and learn like a human child does, he became increasingly interested
in Deep Learning and its applications. After working for a year,
he went on to pursue a Master of Science degree in computer science at the
University of Florida, Gainesville. His academic interests have been focused on Deep
Learning and Natural Language Processing, and he has been working on Machine
Reading Comprehension since the summer of 2017.
LIST OF ABBREVIATIONS
BiDAF Bi-Directional Attention Flow
CNN Convolutional Neural Network
GRU Gated Recurrent Units
LSTM Long Short Term Memory
NLP Natural Language Processing
NLU Natural Language Understanding
RNN Recurrent Neural Network
Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Science
QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM
By
Purnendu Mukherjee
May 2018
Chair: Xiaolin Li
Major: Computer Science
Question Answering (QA) systems have grown rapidly over the last three years
and are close to reaching human-level accuracy. One of the fundamental reasons for this
growth has been the use of the attention mechanism along with other methods of Deep
Learning. But, just as with other Deep Learning methods, some of the failure cases are
so obvious that they convince us there is a lot to improve upon. In this work we first
present a literature review of the state-of-the-art and fundamental models in QA systems.
Next, we introduce an architecture which has shown improvement in the targeted area.
We then introduce a general method to enable easy design of domain-specific chatbot
applications and present a proof of concept using the same method. Finally, we
present an easy-to-use Question Answering interface with attention visualization over the
passage. We also propose a method to improve the current state of the art as part of
our ongoing work.
CHAPTER 1 INTRODUCTION
Teaching machines to read and understand human natural language is a
central and long-standing goal of Natural Language Processing and of Artificial
Intelligence in general. The positive impact of machines being able to reason about and
comprehend human language could be enormous. We are beginning to see how
commercial systems like Alexa, Siri, Google Now, etc. are being widely used as speech
recognition has improved to human levels. While speech recognition systems can
transcribe speech to text, comprehension of that transcribed text is another task which
is currently a major focus for both academia and industry because of its possible
applications. Moreover, all the text information available throughout the internet is a
major reason why machine comprehension of text is such an important task.
With the growth of Deep Learning methods in the last few years, the field of
Machine Comprehension, and Natural Language Processing (NLP) in general, has
experienced a revolution. While the traditional methods and practices are still prevalent
and form the basis of our deep understanding of languages, Deep Learning methods
have surpassed all traditional NLP and Machine Learning methods by a significant
margin and are currently driving the growth of the field.
To be able to build a system that can understand human text, we need to first
ask ourselves how we can evaluate a machine's comprehension ability. We had
initially set a goal to build a chatbot for a specific domain and generalize to other topics
later. While developing the system we discovered the necessity of reading
comprehension and of a way to measure it. We finally found the answer in Question
Answering systems.
Just as we human beings are tested for our ability of language
understanding with questions, we should ask machines similar questions about what
they have just read. The performance of the system on such a question answering task lets
us evaluate how much the machine is able to reason about what it just read [1].
Reading comprehension has been a topic of Natural Language Understanding since the
1970s. In 1977, Wendy Lehnert wrote in her doctoral thesis: 'Only when we can ask a
program to answer questions about what it reads will we be able to begin to access that
program's comprehension' [2].
Figure 1-1 The task of Question Answering
To achieve this task, the NLP community has developed various datasets such as
CNN/Daily Mail, WebQuestions, SQuAD, TriviaQA [3], etc. For our purpose we chose
SQuAD, which stands for Stanford Question Answering Dataset [4]. SQuAD consists of
questions posed by crowdworkers on a set of Wikipedia articles, where the answer to
every question is a segment of text, or span, from the corresponding reading passage.
With 100,000+ question-answer pairs on 500+ articles, SQuAD is significantly larger
than previous reading comprehension datasets [5].
An example from the SQuAD dataset is as follows:
Passage: Tesla later approached Morgan to ask for more funds to build a more
powerful transmitter. When asked where all the money had gone, Tesla responded by
saying that he was affected by the Panic of 1901, which he (Morgan) had caused.
Morgan was shocked by the reminder of his part in the stock market crash and by
Tesla's breach of contract by asking for more funds. Tesla wrote another plea to
Morgan, but it was also fruitless. Morgan still owed Tesla money on the original
agreement, and Tesla had been facing foreclosure even before construction of the
tower began.
Question: On what did Tesla blame for the loss of the initial money?
Answer: Panic of 1901
As we started exploring the QA task we faced several challenges. Some of them
we could solve with the help of other research, and some still exist in
the domain:
• Out of Vocabulary words
• Multi-sentence reasoning may be required
• There may exist several candidate answers
• Optimizing the Exact Match (EM) metric may fail when the answer boundary is fuzzy or too long, such as the answer to a "why" query [6]
• "One-hop" prediction may fail to fully understand the query [6]
• Failure to fully capture the long-distance contextual interaction between parts of the context when only using an LSTM/GRU
• Current models are unable to capture the semantics of the passage
In the upcoming chapters we will first briefly review the basics necessary for
understanding the models; then we will delve into the fundamental models that
have shaped the current state-of-the-art models; then we will discuss our contributions in
terms of architecture and applications; and finally we will conclude with future directions.
CHAPTER 2 THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
To build a question answering system, one needs to be familiar with
fundamental deep learning models such as Recurrent Neural Networks (RNN), Long
Short Term Memory (LSTM), etc. In this chapter we will give an overview of these
techniques and see how they all connect in building a question answering system.
Neural Networks
What makes Deep Learning so intriguing is that it closely resembles
the working of the mammalian brain, or at least draws inspiration from it. The same can
be said of Artificial Neural Networks [7], which consist of a system of interconnected
units called 'neurons' that take input from similar units and produce a single output.
Figure 2-1 Simple and Deep Learning Neural Networks [8]
The connection from one neuron to another is weighted based on the input data,
which enables the network to tune itself to produce a certain output for a given input.
This learning process is achieved through backpropagation, a method
of propagating the error from the output layer back to the previous layers.
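As a toy illustration of this error-propagation idea, a single tanh neuron can be trained by gradient descent (the weights, data, and learning rate below are all made up):

```python
import numpy as np

# Toy illustration of learning by error backpropagation: one tanh neuron,
# with made-up initial weights, input, target, and learning rate.
w = np.array([0.5, -0.3])               # initial weights
x, target = np.array([1.0, -0.5]), 0.2  # one training example
lr = 0.1

for _ in range(200):
    y = np.tanh(w @ x)                  # forward pass through the neuron
    err = y - target                    # error at the output
    grad = err * (1 - y**2) * x         # chain rule: error propagated back to the weights
    w -= lr * grad                      # weight update

print(round(float(np.tanh(w @ x)), 3))  # approaches the target 0.2
```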
Convolutional Neural Network
The first wave of deep learning's success was brought about by Convolutional Neural
Networks (CNN) [9], the technique used by the winning team of the ImageNet
competition in 2012. CNNs are deep artificial neural networks that can be used to
classify images, cluster them by similarity, and perform object recognition within scenes.
They can be used to detect and identify faces, people, signs, or any other visual data.
Figure 2-2 Convolutional Neural Network Architecture [10]
There are primarily four operations in a standard CNN model (as shown in the figure above):
1. Convolution: The primary purpose of convolution in the ConvNet above is to extract features from the input image. The spatial relationships between pixels, i.e., the image features, are preserved and learned by the convolution using small squares of input data.
2. Non-Linearity (ReLU): The Rectified Linear Unit (ReLU) is a non-linear operation applied element-wise on each value; it replaces the negative values in the feature map by zero.
3. Pooling or Sub-Sampling: Spatial pooling reduces the dimensionality of each feature map but retains the most important information. For max pooling, the largest value in each square window is kept and the rest are dropped. Other types of pooling are average, sum, etc.
4. Classification (Fully Connected Layer): The fully connected layer is a traditional Multi-Layer Perceptron, as described before, that uses a softmax activation function in the output layer. The high-level features of the image are encoded by the convolutional and pooling layers and then fed to the fully connected layer, which uses these features to classify the input image into various classes based on the training dataset [10].
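The first three operations can be sketched in plain numpy (the image and filter are toy placeholders, for illustration only):

```python
import numpy as np

def conv2d(img, kernel):
    """Step 1: valid 2-D convolution (cross-correlation, as in most CNN libraries)."""
    kh, kw = kernel.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0.0)            # step 2: replace negative values by zero

def max_pool(x, size=2):
    """Step 3: keep the largest value in each size-by-size window."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

rng = np.random.default_rng(0)
img = rng.standard_normal((6, 6))               # toy 6x6 "image"
kernel = np.array([[1.0, -1.0], [1.0, -1.0]])   # toy vertical-edge filter
feat = max_pool(relu(conv2d(img, kernel)))      # steps 1-3 chained
print(feat.shape)                               # (2, 2)
```

The resulting feature map would then be flattened and fed to the fully connected layer of step 4.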
When a new image is fed into the CNN model, all the above-mentioned steps are
carried out (forward propagation) and a probability distribution over the set of
output classes is obtained. With a large enough training dataset, the network will learn and
generalize well enough to classify new images into their correct classes.
Recurrent Neural Networks (RNN)
Whenever we want to predict or encode sequential data, the RNN [11] is our go-to
method. RNNs perform the same task for every element of a sequence, where the
output for each element depends on the previous computations, hence the recurrence. In
practice, RNNs are unable to retain long-term dependencies and can look back only a
few steps because of the vanishing gradient problem:
h_t = tanh(W x_t + U h_{t-1})    (2-1)
A solution to the dependency problem is to use gated cells such as the LSTM [12] or
the GRU [13]. These cells pass important information on to later steps while ignoring
unimportant information. The gated units in a GRU block are:
• Update Gate - computed from the current input and the previous hidden state:
z_t = σ(W_z x_t + U_z h_{t-1})    (2-2)
• Reset Gate - calculated similarly but with different weights:
r_t = σ(W_r x_t + U_r h_{t-1})    (2-3)
• New memory content:
h̃_t = tanh(W x_t + r_t ∘ (U h_{t-1}))    (2-4)
If a reset gate unit is ~0, the previous memory is ignored and only the new
information is kept.
The final memory at the current time step combines the previous and current time steps:
h_t = (1 − z_t) ∘ h_{t-1} + z_t ∘ h̃_t    (2-5)
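Equations 2-2 through 2-5 can be checked with a small numpy sketch of one GRU step (all weights and inputs are random stand-ins):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU step following Eqs. 2-2 to 2-5 (weight names are made up)."""
    Wz, Uz, Wr, Ur, W, U = params
    z = sigmoid(Wz @ x_t + Uz @ h_prev)             # update gate (2-2)
    r = sigmoid(Wr @ x_t + Ur @ h_prev)             # reset gate (2-3)
    h_tilde = np.tanh(W @ x_t + r * (U @ h_prev))   # new memory content (2-4)
    return (1.0 - z) * h_prev + z * h_tilde         # final memory (2-5)

rng = np.random.default_rng(1)
d_in, d_h = 4, 3
params = [rng.standard_normal((d_h, d_in)) if i % 2 == 0
          else rng.standard_normal((d_h, d_h)) for i in range(6)]
h = np.zeros(d_h)
for _ in range(5):                                  # unroll over a short sequence
    h = gru_step(rng.standard_normal(d_in), h, params)
print(h.shape)                                      # (3,)
```

Because h_t is a gated mix of the previous state and a tanh candidate, the hidden state stays bounded in (−1, 1) no matter how long the sequence is.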
While the GRU is computationally efficient, the LSTM is the more general case,
with three gates:
• Input Gate - what new information to add to the current cell state
• Forget Gate - how much information from previous states to keep
• Output Gate - how much information should be sent to the next states
Just as in the GRU, the current cell state is a sum of the previous cell state,
weighted by the forget gate, and the new value, weighted by the input
gate. Based on the cell state, the output gate regulates the final output.
Word Embedding
Computation and gradients can be applied to numbers, not to words or letters,
so we first need to convert words into a corresponding numerical representation before
feeding them into a deep learning model. In general there are two types of word embedding:
frequency-based (which comprises count vectors, tf-idf, and co-occurrence vectors)
and prediction-based. With frequency-based embeddings the order of the words is not
preserved, and they work as a bag-of-words model, whereas prediction-based models
take the order, or locality, of words into consideration when generating the
numerical representation of a word. Within the prediction-based category there are
two fundamental techniques, called Continuous Bag of Words (CBOW) and the Skip-Gram
model, which form the basis for word2vec [14] and GloVe [15].
The basic intuition behind word2vec is that if two different words have very
similar 'contexts' (that is, what words are likely to appear around them), then the model
will produce similar vectors for those words. Conversely, if the two word vectors are
similar, then the network will produce similar context predictions for those two words.
For example, synonyms like 'intelligent' and 'smart' have very similar contexts,
and related words like 'engine' and 'transmission' probably have
similar contexts as well [16]. Plotting the word vectors learned by word2vec over a
large corpus, we can find some very interesting relationships between words.
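This vector arithmetic can be illustrated with toy vectors (invented here; real embeddings are learned from a corpus and typically have 100-300 dimensions):

```python
import numpy as np

# Toy 3-d vectors invented for illustration; real word2vec/GloVe vectors are
# learned from a corpus, not hand-written.
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The classic regularity: "king" - "man" + "woman" lands closest to "queen".
target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max((w for w in vecs if w != "king"), key=lambda w: cosine(target, vecs[w]))
print(best)                              # -> queen
```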
Figure 2-3 Semantic relation between words in vector space [17]
Attention Mechanism
We as humans pay attention to things that are important or relevant in a
context. For example, when asked a question about a passage, we try to find the part
of the passage most relevant to the question and then reason from our
understanding of that part. The same idea applies to the attention
mechanism in Deep Learning: it is used to identify the specific parts of a given context
to which the current question is relevant.
Formally put, the technique takes n arguments y_1, ..., y_n (in our case the
vectors of the passage words) and a question vector q. It returns a
vector z which is supposed to be the 'summary' of the y_i, focusing on information
linked to the question q. More formally, it returns a weighted arithmetic mean of the y_i,
where the weights are chosen according to the relevance of each y_i given the question q
[18].
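A minimal sketch of this weighted mean, using dot-product relevance scores (all vectors are random placeholders):

```python
import numpy as np

# Minimal attention sketch: summarize vectors y_1..y_n into z, weighted by
# dot-product relevance to a question vector q.
rng = np.random.default_rng(2)
n, d = 4, 6
Y = rng.standard_normal((n, d))      # e.g. passage word vectors y_i
q = rng.standard_normal(d)           # question word vector

scores = Y @ q                       # relevance of each y_i to q
weights = np.exp(scores - scores.max())
weights /= weights.sum()             # softmax: weights are positive and sum to 1
z = weights @ Y                      # z: weighted arithmetic mean of the y_i
print(z.shape)                       # (6,)
```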
Figure 2-4 Attention Mechanism flow [18]
Memory Networks
While Convolutional and Recurrent Neural Networks do capture how we form our
visual and sequential memories, their memory (encoded by hidden states and weights)
is typically too small and not compartmentalized enough to accurately remember facts
from the past (knowledge is compressed into dense vectors) [19].
Deep Learning needed a methodology that preserves memories as
they are, so that they are not lost in generalization and so that recalling exact words or
sequences of events remains possible, something computers are already good at. This
effort led to Memory Networks [19], published at ICLR 2015 by Facebook AI
Research.
This paper provides a basic framework to store, augment, and retrieve memories
while working seamlessly with a Recurrent Neural Network architecture. The memory
network consists of a memory m (an array of objects indexed by m_i) and four
(potentially learned) components I, G, O, and R, as follows:
I (input feature map): converts the incoming input to the internal feature
representation, either a sparse or a dense feature vector like those from word2vec or
GloVe.
G (generalization): updates old memories given the new input. The authors call this generalization because the network has an opportunity to compress and generalize its memories at this stage for some intended future use.
O (output feature map): produces a new output (in the feature representation space) given the new input and the current memory state. This component is responsible for performing inference. In a question answering system, this part selects the candidate sentences (which might contain the answer) from the story (conversation) so far.
R (response): converts the output into the desired response format, for example a textual response or an action. In the QA system described, this component finds the desired answer and then converts it from the feature representation to the actual word.
This is a fully supervised model, meaning all the candidate sentences from which the answer could be found are marked during the training phase; this can also be termed 'hard attention'.
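The four components can be caricatured in plain Python. This is a deliberately simplified, non-learned sketch of the I/G/O/R pipeline: memories are bags of tokens, relevance is token overlap, and the "response" just picks a memory token absent from the question; a real memory network learns each component.

```python
def I(x):
    """Input feature map: convert raw text to an internal feature
    representation (here, simply a bag of lowercased tokens)."""
    return frozenset(x.lower().split())

def G(memories, features):
    """Generalization: update the memory store with the new input
    (here we just append; a learned G could compress or reorganize)."""
    memories.append(features)
    return memories

def O(memories, question_features):
    """Output feature map: the inference step, selecting the stored
    memory most relevant to the question (maximum token overlap)."""
    return max(memories, key=lambda m: len(m & question_features))

def R(selected_memory, question_features):
    """Response: convert the selected memory into an answer word --
    here, a token of the memory that is not part of the question."""
    candidates = selected_memory - question_features
    return sorted(candidates)[0] if candidates else ""

memories = []
for sentence in ["frodo took the ring", "sam went to the shire"]:
    memories = G(memories, I(sentence))
q = I("who took the ring")
answer = R(O(memories, q), q)   # -> "frodo"
```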
The authors tested the QA system on various literature, including Lord of the Rings.
Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]
CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART
There has been rapid progress since the release of the SQuAD dataset, and currently the best ensemble models are close to human-level accuracy in machine comprehension. This is due to various ingenious methods that solve some of the problems with previous approaches: out-of-vocabulary tokens were handled using character embeddings, long-term dependencies within the context passage were addressed using self-attention, and many other techniques such as contextualized vectors, history of words and attention flow were introduced. In this section we will look at some of the most important models that were fundamental to the progress of Question Answering.
Machine Comprehension Using Match-LSTM and Answer Pointer
In this paper [20], the authors propose an end-to-end neural architecture for the QA task. The architecture is based on match-LSTM [21], a model they previously proposed for textual entailment, and Pointer Net [22], a sequence-to-sequence model proposed by Vinyals et al. (2015) to constrain the output tokens to be from the input sequences. The model consists of an LSTM preprocessing layer, a match-LSTM layer and an Answer Pointer layer.
We are given a piece of text, which we refer to as a passage, and a question related to the passage. The passage is represented by a matrix P, whose number of columns is the length (number of tokens) of the passage and whose number of rows is d, the dimensionality of the word embeddings. Similarly, the question is represented by a matrix Q, whose number of columns is the length of the question. Our goal is to identify a subsequence of the passage as the answer to the question.
Figure 3-1 Match-LSTM Model Architecture [20]
LSTM preprocessing layer: They use a standard one-directional LSTM (Hochreiter & Schmidhuber, 1997) to process the passage and the question separately, producing hidden representations H^p of the passage and H^q of the question.
Match-LSTM layer: They apply the match-LSTM model proposed for textual entailment to the machine comprehension problem by treating the question as a premise and the passage as a hypothesis. The match-LSTM sequentially goes through the passage. At position i of the passage, it first uses the standard word-by-word attention mechanism to obtain the attention weight vector α_i as follows:

G_i = tanh(W^q H^q + (W^p h_i^p + W^r h_{i-1}^r + b^p) ⊗ e_Q)
α_i = softmax(w^T G_i + b ⊗ e_Q)                                        (3-1)

where W^q, W^p, W^r, b^p, w and b are parameters to be learned, h_i^p is the representation of passage token i, h_{i-1}^r is the previous match-LSTM hidden state, and (· ⊗ e_Q) repeats a vector or scalar once for every question token.
Answer Pointer layer: The Answer Pointer (Ans-Ptr) layer is motivated by the Pointer Net introduced by Vinyals et al. (2015) [22]. The boundary model produces only the start token and the end token of the answer; all the tokens between these two in the original passage are then considered to be the answer.
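The decoding step of such a boundary model can be sketched as follows. This is a toy numpy illustration, not the authors' code: given per-token start and end probabilities, it exhaustively picks the span (start, end) with the highest joint score, subject to start ≤ end and a maximum answer length (the length cap is an assumption for the sketch).

```python
import numpy as np

def best_span(p_start, p_end, max_answer_len=15):
    """Boundary-model decoding: choose (start, end) maximizing
    p_start[start] * p_end[end], with start <= end; every token
    between the two boundaries is part of the answer."""
    best_i, best_j, best_score = 0, 0, -1.0
    n = len(p_start)
    for i in range(n):
        for j in range(i, min(n, i + max_answer_len)):
            score = p_start[i] * p_end[j]
            if score > best_score:
                best_i, best_j, best_score = i, j, score
    return best_i, best_j

# toy distributions over a 5-token passage
p_start = np.array([0.05, 0.70, 0.10, 0.10, 0.05])
p_end   = np.array([0.05, 0.10, 0.10, 0.70, 0.05])
span = best_span(p_start, p_end)   # -> (1, 3): tokens 1 through 3 are the answer
```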
When this paper was released in November 2016, the match-LSTM method was the state of the art in Question Answering and sat at the top of the leaderboard for the SQuAD dataset.
R-NET Matching Reading Comprehension with Self-Matching Networks
In this model [23], the question and passage are first processed separately by a bidirectional recurrent network (Mikolov et al., 2010). The question and passage are then matched with gated attention-based recurrent networks, obtaining a question-aware representation of the passage. On top of that, self-matching attention is applied to aggregate evidence from the whole passage and refine the passage representation, which is then fed into the output layer to predict the boundary of the answer span.
Question and passage encoding: First, the words are converted to their respective word-level and character-level embeddings. The character-level embeddings are generated by taking the final hidden states of a bi-directional recurrent neural network (RNN) applied to the embeddings of the characters in the token. Such character-level embeddings have been shown to help deal with out-of-vocabulary (OOV) tokens.
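The character-level encoding can be sketched with a minimal numpy RNN. This is an illustration under simplifying assumptions: the weights are random and untrained, a bare tanh recurrence stands in for a GRU/LSTM cell, and the two directions share parameters (a real bi-directional RNN learns separate weights per direction). The point it demonstrates is that any string over the character vocabulary gets an embedding, even an OOV word.

```python
import numpy as np

rng = np.random.default_rng(0)
CHAR_DIM, HIDDEN = 8, 16
# one embedding per character; any string of these characters gets a vector
char_emb = {c: rng.normal(size=CHAR_DIM) for c in "abcdefghijklmnopqrstuvwxyz"}
Wx = rng.normal(size=(HIDDEN, CHAR_DIM)) * 0.1
Wh = rng.normal(size=(HIDDEN, HIDDEN)) * 0.1

def final_state(chars):
    """Final hidden state of a simple tanh RNN run over the characters."""
    h = np.zeros(HIDDEN)
    for c in chars:
        h = np.tanh(Wx @ char_emb[c] + Wh @ h)
    return h

def char_level_embedding(token):
    """Concatenate the final states of a forward and a backward pass over
    the token's characters: the bi-RNN character-level token embedding."""
    return np.concatenate([final_state(token), final_state(token[::-1])])

vec = char_level_embedding("xyloquandrix")  # embeds even an OOV token
```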
They then use a bi-directional RNN to produce new representations of all the words in the question and the passage respectively.
Figure 3-2 The task of Question Answering [23]
Gated attention-based recurrent networks: They use a variant of attention-based recurrent networks, with an additional gate to determine the importance of information in the passage with regard to a question. Different from the gates in LSTM or GRU, the additional gate is based on the current passage word and its attention-pooling vector of the question, which focuses on the relation between the question and the current passage word. The gate effectively models the phenomenon that only parts of the passage are relevant to the question in reading comprehension and question answering, and the gated representation is what is utilized in subsequent calculations.
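The extra gate can be sketched as follows. This is a minimal numpy illustration, with a random untrained weight matrix W_g as a placeholder: a sigmoid gate over the concatenation of the passage-word vector and its question attention-pooling vector scales each dimension before the pair enters the recurrent network.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
d = 4
W_g = rng.normal(size=(2 * d, 2 * d)) * 0.5   # untrained placeholder weights

def gated_rnn_input(u_p, c_q):
    """Gate over [passage word u_p; its question attention-pooling
    vector c_q]: g lies in (0,1) per dimension, so the RNN input is
    down-weighted where the passage word is irrelevant to the question."""
    x = np.concatenate([u_p, c_q])
    g = sigmoid(W_g @ x)
    return g * x

out = gated_rnn_input(np.ones(d), np.ones(d))
```

Because every gate value is strictly between 0 and 1, the gated input never exceeds the raw input in magnitude.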
Self-matching attention: From the previous step, the question-aware passage representation is generated to highlight the important parts of the passage. One problem with such a representation is that it has very limited knowledge of context: an answer candidate is often oblivious to important cues in the passage outside its surrounding window. To address this problem, the authors propose directly matching the question-aware passage representation against itself. It dynamically collects evidence from the whole passage for each word in the passage, and encodes the evidence relevant to the current passage word and its matching question information into the passage representation.
Output layer: They use the same method as Wang & Jiang (2016b) and use pointer networks (Vinyals et al., 2015) to predict the start and end positions of the answer. In addition, they use attention-pooling over the question representation to generate the initial hidden vector for the pointer network [23].
When the R-NET model first appeared on the leaderboard in March 2017, it was at the top with a 72.3 Exact Match and an 80.7 F1 score.
Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
BiDAF [24] is a hierarchical multi-stage architecture for modeling the representations of the context paragraph at different levels of granularity. BiDAF includes character-level, word-level and contextual embeddings, and uses bi-directional attention flow to obtain a query-aware context representation. Their attention layer is not used to summarize the context paragraph into a fixed-size vector. Instead, the attention is computed for every time step, and the attended vector at each time step, along with the representations from previous layers, can flow through to the subsequent modeling layer. This reduces the information loss caused by early summarization [24].
Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]
Their machine comprehension model is a hierarchical multi-stage process and consists of six layers:
1. Character Embedding Layer maps each word to a vector space using character-level CNNs.
2. Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model.
3. Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embeddings of the words. These first three layers are applied to both the query and the context. They use an LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words, placing an LSTM in each direction and concatenating the outputs of the two LSTMs.
4. Attention Flow Layer couples the query and context vectors and produces a set of query-aware feature vectors for each word in the context.
5. Modeling Layer employs a Recurrent Neural Network to scan the context. The input to the modeling layer is G, which encodes the query-aware representations of the context words. The output of the modeling layer captures the interaction among the context words conditioned on the query. They use two layers of bi-directional LSTM with an output size of d for each direction; hence a matrix is obtained, which is passed on to the output layer to predict the answer.
6. Output Layer provides an answer to the query. They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices under the predicted distributions, averaged over all examples [25].
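The per-example training loss in the output layer can be sketched as follows (a numpy illustration; the logits are toy values): it is the sum of the negative log probabilities of the true start and end indices under the predicted softmax distributions.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log of the softmax distribution."""
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def span_loss(start_logits, end_logits, true_start, true_end):
    """Sum of negative log probabilities of the true start and end
    indices under the predicted distributions (one example)."""
    return -(log_softmax(start_logits)[true_start]
             + log_softmax(end_logits)[true_end])

# uniform logits over a 4-token passage: each true index has probability 1/4,
# so the loss is 2 * log(4)
loss = span_loss(np.zeros(4), np.zeros(4), true_start=1, true_end=2)
```

Averaging `span_loss` over a batch gives the quantity the paper minimizes.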
In a further variation of the above work, they add a self-attention layer after the bi-attention layer to further improve the results [25]. The architecture of that model is shown below.
Figure 3-4 BiDAF with self-attention model architecture [25]
Summary
In this chapter we reviewed the methods that are fundamental to the state of the art in Machine Comprehension and the task of Question Answering. We have come close to human-level accuracy, and this is due to incremental developments over previous models. As we saw, out-of-vocabulary (OOV) tokens were handled using character embeddings, long-term dependencies within the context passage were addressed using self-attention, and many other techniques such as contextualized vectors and attention flow were employed to get better results. In the next chapter we will see how we can build on these models and develop them further.
CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING
Although the advancements in Question Answering systems since the release of the SQuAD dataset have been impressive, and the results are getting close to human-level accuracy, they are far from fool-proof. The models still make mistakes that would be obvious to a human. For example:
Passage: The Panthers used the San Jose State practice facility and stayed at the San Jose Marriott. The Broncos practiced at Florida State Facility and stayed at the Santa Clara Marriott.
Question: At what university's facility did the Panthers practice?
Actual Answer: San Jose State
Predicted Answer: Florida State Facility
To find out what was leading to the wrong predictions, we wanted to see the attention weights associated with such an example. We plotted the passage-question heat map, a 2D matrix where the intensity of each cell signifies the similarity between a passage word and a question word. For the above example we found that while certain words of the question are given high weightage, other parts are not: the words 'At', 'facility' and 'practice' receive high attention, but 'Panthers' does not. If it had received high attention, the system would have predicted 'San Jose State' as the right answer. To solve this issue we analyzed the base BiDAF model and proposed adding two things:
1. Bi-attention and self-attention over the query
2. A second level of attention over the outputs of (bi-attention + self-attention) from both the context and the query
Multi-Attention BiDAF Model
Our comprehension model is a hierarchical multi-stage process and consists of the following layers:
1. Embedding: Just as in other models, we embed words using pretrained word vectors. We also embed the characters of each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to obtain character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training.
2. Pre-Process: A shared bi-directional GRU (Cho et al., 2014) is used to map the question and passage embeddings to context-aware embeddings.
3. Context Attention: The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al., 2016) is used to build a query-aware context representation. Let h_i be the vector for context word i, q_j be the vector for question word j, and n_q and n_c be the lengths of the question and context respectively. We compute the attention between context word i and question word j as

a_ij = w1 · h_i + w2 · q_j + w3 · (h_i ⊙ q_j)                            (4-1)
where w1, w2 and w3 are learned vectors and ⊙ is element-wise multiplication. We then compute an attended vector c_i for each context token as

p_ij = exp(a_ij) / Σ_k exp(a_ik),    c_i = Σ_j p_ij q_j                  (4-2)
We also compute a query-to-context vector q_c:

m_i = max_j a_ij,    p_i = exp(m_i) / Σ_k exp(m_k),    q_c = Σ_i p_i h_i  (4-3)

The final vector computed for each token is built by concatenating h_i, c_i, h_i ⊙ c_i and q_c ⊙ c_i. In our model we subsequently pass the result through a linear layer with ReLU activations.
4. Context Self-Attention: Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU, and we then apply the same attention mechanism, only now between the passage and itself. In this case we do not use query-to-context attention, and we set a_ij = -inf if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so its output is additionally summed with its input.
5. Query Attention: This is done the same way as the context attention, but we calculate the weighted sum of the context words for each query word, so the output length equals the number of query words. We then calculate context-to-query attention, analogous to the query-to-context attention in the context attention layer.
Figure 4-1 The modified BiDAF model with multilevel attention
6. Query Self-Attention: This is done the same way as the context self-attention layer, but on the output of the Query Attention layer.
7. Context-Query Bi-Attention + Self-Attention: The outputs of the context self-attention and the query self-attention layers are taken as input, and the same bi-attention and self-attention process is applied to these inputs.
8. Prediction: In the last layer of our model, a bidirectional GRU is applied, followed by a linear layer that computes answer start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log likelihood of selecting the correct start and end tokens.
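The context-attention computation of step 3 (equations 4-1 through 4-3) can be sketched in numpy. This is a toy illustration following the formulation of Clark & Gardner (2017) [25]: the dimensions are tiny and the weight vectors w1, w2, w3 are random placeholders rather than learned parameters.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
d, n_c, n_q = 4, 5, 3
H = rng.normal(size=(n_c, d))   # context word vectors h_i
Q = rng.normal(size=(n_q, d))   # question word vectors q_j
w1, w2, w3 = (rng.normal(size=d) for _ in range(3))

# a_ij = w1.h_i + w2.q_j + w3.(h_i * q_j)  -- the learned similarity (4-1)
A = (H @ w1)[:, None] + (Q @ w2)[None, :] + (H * w3) @ Q.T

C = softmax(A, axis=1) @ Q                 # attended vector c_i per token (4-2)
q_c = softmax(A.max(axis=1), axis=0) @ H   # query-to-context vector      (4-3)

# final per-token vector: [h_i; c_i; h_i * c_i; q_c * c_i]
G = np.concatenate([H, C, H * C, q_c * C], axis=1)
```

In the full model, G would then go through the linear layer with ReLU activations and on to the self-attention stage.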
With this modification we were able to solve the wrong example we started with: the multilevel attention model gives the correct output, "San Jose State". We also achieved slightly better scores than the original model, with an F1 score of 85.44 on the SQuAD dev set.
Chatbot Design Using a QA System
Designing a perfect chatbot that passes the Turing test is a fundamental goal of Artificial Intelligence. Although we are many orders of magnitude away from achieving such a goal, domain-specific tasks can be solved with chatbots built from current technology. We had started with a similar objective in mind, i.e., to design a domain-specific chatbot and then generalize to other areas once the first domain-specific objective is achieved robustly. This led us to the fundamental problem of Machine Comprehension and subsequently to the task of Question Answering. Having achieved some degree of success with QA systems, we looked back to see whether we could apply our newly acquired knowledge to the task of designing chatbots.
Chatbots made with today's technologies mostly rely on handcrafted techniques such as template matching, which requires anticipating all possible ways a user may articulate his requirements and a conversation may unfold. This takes many man-hours to design for a specific domain and is still very error prone. In this section we propose a general chatbot design that would make designing a domain-specific chatbot easy and robust at the same time.
Every domain-specific chatbot needs to obtain a set of information from the user and show some results based on the user-specific information obtained. Traditional chatbots use template matching and keyword lookup to determine whether the user has provided the required information. Our idea is to use the Question Answering system in the backend to extract the required information from whatever the user has typed up to this point in the conversation. The information to be extracted can be posed as a set of questions, and the answers obtained for those questions can be used as the parameters to supply the relevant information to the user.
We chose flight reservation as our chatbot domain. Our goal was to extract the required information from the user to be able to show the available flights as per the user's requirements.
Figure 4-2 Flight reservation chatbotrsquos chat window
For a flight reservation task, the booking agent needs to know at minimum the origin city, the destination city and the date of travel to be able to show the available flights. Optional information includes the number of tickets, the passenger's name, one-way or round trip, etc.
A minimal conversation with the user through the chat window would be as shown above. We had a platform called OneTask on which we wanted to implement our chatbot. The chat interface within the OneTask system looks as follows:
Figure 4-3 Chatbot within OneTask system
The working of the chat system is as follows:
1. Initiation: The user opens the chat window, which starts a session with the chatbot. The chatbot asks 'How may I help you?'
2. User reply: The user may reply with none of the required information for flight booking, or may provide multiple pieces of information in the same message.
3. User reply parsing: The conversation up to this point is treated as a passage, and the internal questions are run on this passage. The questions that are run are:
Where do you want to go?
From where do you want to leave?
When do you want to depart?
4. Parsed responses from QA model: After running the questions on the QA system with the given conversation, answers are obtained for all of them, along with their corresponding confidence values. Since answers will be returned even if the required question has not actually been answered up to this point in the conversation, the confidence values play a crucial role in determining the validity of an answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly; a confidence between 2 and 10 signifies that it may have been answered, but the system should verify with the user for correctness; and any answer with confidence below 2 is discarded.
5. Asking remaining questions iteratively: After the parsing, the system checks whether any of the required questions are still unanswered. If so, the chatbot asks the remaining questions, and steps 3-4 are repeated. Once all the questions have been answered, the user is shown the available flight options as per his request.
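The validation thresholds and the iterative questioning loop described above can be sketched as follows. The `stub_qa` function is hypothetical, standing in for the real QA backend, which would return an (answer, confidence) pair for each internal question run over the conversation-so-far.

```python
REQUIRED_QUESTIONS = [
    "Where do you want to go?",
    "From where do you want to leave?",
    "When do you want to depart?",
]

def classify(confidence):
    """Map a QA confidence score to a dialogue action using the
    thresholds above: >10 accept, 2-10 verify with the user, <2 discard."""
    if confidence > 10:
        return "accept"
    if confidence >= 2:
        return "verify"
    return "discard"

def remaining_questions(conversation, qa_model):
    """Run every required question against the conversation-so-far and
    return those still unanswered; qa_model(passage, question) is a
    stand-in returning (answer, confidence)."""
    return [q for q in REQUIRED_QUESTIONS
            if classify(qa_model(conversation, q)[1]) == "discard"]

# stub QA model: pretend only the destination was stated confidently
def stub_qa(passage, question):
    return ("Boston", 12.0) if question == REQUIRED_QUESTIONS[0] else ("?", 0.5)

pending = remaining_questions("I want to fly to Boston", stub_qa)
```

The chatbot would then ask the questions in `pending` and re-run the loop until it is empty.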
Figure 4-4 The Flow diagram of the Flight booking Chatbot system
Online QA System and Attention Visualization
To be able to test various examples, we set up an online demo using the BiDAF [24] model. One can either choose from the available examples in the drop-down menu or paste in their own passage and questions. While this is a useful and interesting system for testing the model in a user-friendly way, we created it primarily to be able to focus on the wrong samples.
Figure 4-5 QA system interface with attention highlight over candidate answers
The candidate answers are shown with blue highlights according to their confidence values: the higher the confidence, the darker the highlight. The answer with the highest confidence value is chosen as the predicted answer. We developed the system to show the attention spread over the candidate answers in order to understand what needed to be done to improve the system. This led us to realize the importance of including the query attention part, as well as multilevel attention, in the BiDAF model, as described in the first section of this chapter.
CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions
Having done a thorough analysis of the current methods and state of the art in Machine Comprehension and the QA task, and having developed systems on top of them, we have gained a strong sense of what needs to be done to further improve QA models. After observing the wrong samples, we can see that the system is still unable to encode meaning and is picking answers based on statistical patterns of answer occurrence in the training examples.
Judging from the state-of-the-art models and the ongoing research literature, it is easy to conclude that more features need to be embedded to encode the meaning of words, phrases and sentences. The paper 'Reinforced Mnemonic Reader for Machine Comprehension' [6] encoded the POS and NER tags of words along with their word and character embeddings, which gave them better results.
Figure 5-1 An English language semantic parse tree [26]
We have developed a method to encode the syntax parse tree of a sentence that encodes not only the POS tags but also the relation of each word within its phrase and the relation of each phrase within the whole sentence, in a hierarchical manner.
Finally, data augmentation is another way to get better results. One definite way to reduce errors would be to include in the training data samples similar to those the system falters on in the dev set: one could generate examples similar to the failure cases and add them to the training set for better prediction. Another approach would be to train on similar, larger datasets. Our models were trained on the SQuAD dataset, but other datasets, such as TriviaQA, target a similar question answering task. We could augment the SQuAD training set with that of TriviaQA to build a more robust system that generalizes better and thus predicts answer spans with higher accuracy.
Conclusion
In this work we explored the most fundamental techniques that have shaped the current state of the art. We then proposed a minor architectural improvement over an existing model. Furthermore, we developed two applications that use the base model: first, we described how a chatbot application can be built using the QA system, and second, we created a web interface where the model can be run on any passage and question. This interface also shows the attention spread over the candidate answers. While our effort to push the state of the art forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will pay high dividends for society at large.
LIST OF REFERENCES
[1] Chen, Danqi. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U
[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.
[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension." arXiv preprint arXiv:1705.03551 (2017).
[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ questions for machine comprehension of text." arXiv preprint arXiv:1606.05250 (2016).
[5] "The Stanford Question Answering Dataset." 2018. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/
[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced mnemonic reader for machine comprehension." CoRR abs/1705.02798 (2017).
[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.
[8] Mehrotra, Neetesh. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai/
[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.
[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
[11] Williams, Ronald J., and David Zipser. "A learning algorithm for continually running fully recurrent neural networks." Neural Computation 1, no. 2 (1989): 270-280.
[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9, no. 8 (1997): 1735-1780.
[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv preprint arXiv:1412.3555 (2014).
[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic regularities in continuous space word representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.
[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.
[16] "Word2Vec Tutorial - The Skip-Gram Model." 2016. Chris McCormick. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
[17] "[Paper Introduction] Bilingual Word Representations with Monolingual Quality in Mind." 2015. SlideShare. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind
[18] "Attention Mechanism." 2016. Heuritech Blog. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism/
[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. 2014. "Memory Networks." arXiv preprint arXiv:1410.3916. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916
[20] Wang, Shuohang, and Jing Jiang. "Machine comprehension using match-LSTM and answer pointer." arXiv preprint arXiv:1608.07905 (2016).
[21] Wang, Shuohang, and Jing Jiang. "Learning natural language inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).
[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.
[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated self-matching networks for reading comprehension and question answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.
[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional attention flow for machine comprehension." arXiv preprint arXiv:1611.01603 (2016).
[25] Clark, Christopher, and Matt Gardner. "Simple and effective multi-paragraph reading comprehension." arXiv preprint arXiv:1710.10723 (2017).
[26] "8. Analyzing Sentence Structure." 2018. NLTK.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata, India, and had a strong interest in computers from an early age. After high school, he earned a B.Sc. in computer science from Ramakrishna Mission Residential College, Narendrapur, followed by an M.Sc. in computer science from St. Xavier's College, Kolkata. He had a strong intuition and interest for human-like learning systems and wanted to work in this area. He started working at TCS Innovation Labs, Pune, on applications of Natural Language Processing in education. Deeply passionate about learning systems that mimic the human brain and learn as a human child does, he became increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science at the University of Florida, Gainesville. His academic interests have been focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.
Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Science
QUESTION ANSWERING SYSTEMS WITH ATTENTION MECHANISM
By
Purnendu Mukherjee
May 2018
Chair: Xiaolin Li
Major: Computer Science
Question Answering (QA) systems have grown rapidly over the last three years and are close to reaching human-level accuracy. One of the fundamental reasons for this growth has been the use of the attention mechanism along with other methods of Deep Learning. But just as with other Deep Learning methods, some of the failure cases are so obvious that they convince us there is a lot to improve upon. In this work we first review the state-of-the-art and fundamental models in QA systems. Next, we introduce an architecture which has shown improvement in the targeted area. We then introduce a general method to enable easy design of domain-specific chatbot applications and present a proof of concept of that method. Finally, we present an easy-to-use Question Answering interface with attention visualization over the passage. We also propose a method to improve the current state of the art as part of our ongoing work.
CHAPTER 1 INTRODUCTION
Teaching machines to read and understand human natural language is a central and long-standing goal of Natural Language Processing and Artificial Intelligence in general. The positive impact of machines being able to reason about and comprehend human language could be enormous. We are beginning to see how commercial systems like Alexa, Siri, Google Now, etc. are being widely used as Speech Recognition has improved to human levels. While speech recognition systems can transcribe speech to text, comprehension of that transcribed text is another task, one that is currently a major focus for both academia and industry because of its possible applications. Moreover, all the text information available throughout the internet is a major reason why machine comprehension of text is such an important task.
With the growth of Deep Learning methods in the last few years, the field of Machine Comprehension, and Natural Language Processing (NLP) in general, has experienced a revolution. While the traditional methods and practices are still prevalent and form the basis of our deep understanding of languages, Deep Learning methods have surpassed all traditional NLP and Machine Learning methods by a significant margin and are currently driving the growth of the field.
To be able to build a system that can understand human text, we first need to ask ourselves how we can evaluate a machine's comprehension ability. We had initially set a goal to build a chatbot for a specific domain and generalize to other topics from there. While developing the system, we discovered the necessity of reading comprehension and of a way to measure it. We finally found the answer in Question Answering systems.
Just as we human beings are tested for our ability of language understanding with questions, we should ask machines similar questions about what they have just read. The performance of the system on such question answering tasks lets us evaluate how much the machine is able to reason about what it just read [1]. Reading comprehension has been a topic of Natural Language Understanding since the 1970s. In 1977, Wendy Lehnert said in her doctoral thesis: "Only when we can ask a program to answer questions about what it reads will we be able to begin to access that program's comprehension" [2].
Figure 1-1 The task of Question Answering
To achieve this task, the NLP community has developed various datasets such as CNN/Daily Mail, WebQuestions, SQuAD, TriviaQA [3], etc. For our purpose we chose SQuAD, which stands for Stanford Question Answering Dataset [4]. SQuAD consists of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. With 100,000+ question-answer pairs on 500+ articles, SQuAD is significantly larger than previous reading comprehension datasets [5].
An example from the SQuAD dataset is as follows:
Passage: Tesla later approached Morgan to ask for more funds to build a more powerful transmitter. When asked where all the money had gone, Tesla responded by saying that he was affected by the Panic of 1901, which he (Morgan) had caused. Morgan was shocked by the reminder of his part in the stock market crash and by Tesla's breach of contract by asking for more funds. Tesla wrote another plea to Morgan, but it was also fruitless. Morgan still owed Tesla money on the original agreement, and Tesla had been facing foreclosure even before construction of the tower began.
Question: On what did Tesla blame for the loss of the initial money?
Answer: Panic of 1901
As we started exploring the QA task, we faced several challenges. Some of them we could solve with the help of other research, and some still exist in the domain:
• Out-of-vocabulary words
• Multi-sentence reasoning may be required
• There may exist several candidate answers
• Optimizing the Exact Match (EM) metric may fail when the answer boundary is fuzzy or too long, such as the answer to a "why" query [6]
• "One-hop" prediction may fail to fully understand the query [6]
• Using only LSTM/GRU fails to fully capture the long-distance contextual interaction between parts of the context
• Current models are unable to capture the semantics of the passage
In the upcoming chapters, we will first briefly review the basics necessary for understanding the models, then delve deep into the fundamental models that have shaped the current state of the art, then discuss our contribution in terms of architecture and applications, and finally conclude with future directions.
CHAPTER 2 THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
To build a question answering system, one needs to be familiar with fundamental deep learning models such as Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), etc. In this chapter we give an overview of these techniques and see how they all connect in building a question answering system.
Neural Networks
What makes Deep Learning so intriguing is its close resemblance to the working of the mammalian brain, or at least the inspiration it draws from it. The same can be said for Artificial Neural Networks [7], which consist of a system of interconnected units called 'neurons' that take input from similar units and produce a single output.
Figure 2-1 Simple and Deep Learning Neural Networks [8]
The connections between neurons carry weights that are tuned based on the input data, which enables the network to produce a certain output for a given input. This learning process is achieved through backpropagation, a method of propagating the error from the output layer back through the previous layers.
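The forward and backward passes just described can be sketched on a toy problem; the network size, task, and learning rate below are our own illustrative choices, not from any cited model:

```python
import numpy as np

# Minimal backpropagation sketch: a tiny 2-layer network learning XOR.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(0, 1, (2, 8)), np.zeros(8)   # input -> hidden
W2, b2 = rng.normal(0, 1, (8, 1)), np.zeros(1)   # hidden -> output
sigmoid = lambda z: 1 / (1 + np.exp(-z))

for step in range(5000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: propagate the output error to the earlier layer
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ d_out; b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_h;   b1 -= 0.5 * d_h.sum(axis=0)

print(np.round(out).ravel())
```

After training, the rounded outputs reproduce the XOR targets, illustrating how the error signal flowing backward tunes both layers of weights.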
14
Convolutional Neural Network
The first wave of deep learning's success was brought by Convolutional Neural Networks (CNN) [9], the technique used by the winning team of the ImageNet competition in 2012. CNNs are deep artificial neural networks (ANN) that can be used to classify images, cluster them by similarity, and perform object recognition within scenes. They can be used to detect and identify faces, people, signs, or any other visual data.
Figure 2-2 Convolutional Neural Network Architecture [10]
There are primarily four operations in a standard CNN model (as shown in Figure 2-2):
1. Convolution - The primary purpose of convolution in the ConvNet is to extract features from the input image. The spatial relationships between pixels, i.e. the image features, are preserved and learned by the convolution using small squares of input data.
2. Non-Linearity (ReLU) - Rectified Linear Unit (ReLU) is a non-linear operation that carries out an element-wise operation on each pixel, replacing the negative pixel values in the feature map with zero.
3. Pooling or Sub-Sampling - Spatial pooling reduces the dimensionality of each feature map but retains the most important information. For max pooling, the largest value in each square window is taken and the rest are dropped. Other types of pooling are average, sum, etc.
4. Classification (Fully Connected Layer) - The fully connected layer is a traditional Multi-Layer Perceptron, as described before, that uses a softmax activation function in the output layer. The high-level features of the image are encoded by the convolutional and pooling layers and then fed to the fully connected layer, which uses these features to classify the input image into various classes based on the training dataset [10].
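The four operations can be traced on a toy example; the 8x8 "image", the kernel, and the class count below are invented purely for illustration:

```python
import numpy as np

# Toy walk-through of the four CNN operations on a random 8x8 "image".
rng = np.random.default_rng(0)
image = rng.random((8, 8))
kernel = rng.normal(0, 1, (3, 3))

# 1. Convolution: slide the 3x3 filter over the image (valid padding)
conv = np.array([[np.sum(image[i:i + 3, j:j + 3] * kernel)
                  for j in range(6)] for i in range(6)])

# 2. Non-linearity (ReLU): replace negative values with zero
relu = np.maximum(conv, 0)

# 3. Max pooling: keep the largest value in each 2x2 window
pooled = relu.reshape(3, 2, 3, 2).max(axis=(1, 3))

# 4. Fully connected layer with a softmax over, say, 4 classes
W = rng.normal(0, 1, (pooled.size, 4))
logits = pooled.ravel() @ W
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.shape)
```

Each stage shrinks or transforms the feature map until the softmax yields a probability distribution over the output classes, as described in the text.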
When a new image is fed into the CNN model, all the above-mentioned steps are carried out (forward propagation) and a probability distribution over the set of output classes is obtained. With a large enough training dataset, the network will learn and generalize well enough to classify new images into their correct classes.
Recurrent Neural Networks (RNN)
Whenever we want to predict or encode sequential data, RNNs [11] are our go-to method. RNNs perform the same task for every element of a sequence, where the output at each element depends on previous computations, hence the recurrence. In practice, RNNs are unable to retain long-term dependencies and can look back only a few steps because of the vanishing gradient problem.

h_t = tanh(W_h h_{t-1} + W_x x_t)    (2-1)
A solution to the dependency problem is to use gated cells such as the LSTM [12] or the GRU [13]. These cells pass on important information to the next cells while ignoring non-important information. The gated units in a GRU block are:
• Update Gate - computed based on the current input and hidden state:

z_t = σ(W_z x_t + U_z h_{t-1})    (2-2)

• Reset Gate - calculated similarly, but with different weights:

r_t = σ(W_r x_t + U_r h_{t-1})    (2-3)

• New memory content:

h̃_t = tanh(W x_t + r_t ∘ U h_{t-1})    (2-4)

If a reset gate unit is ~0, the previous memory is ignored and only the new information is kept.
The final memory at the current time step combines the previous and current time steps:

h_t = z_t ∘ h_{t-1} + (1 − z_t) ∘ h̃_t    (2-5)
While the GRU is computationally efficient, the LSTM is a more general case with three gates:
• Input Gate - what new information to add to the current cell state
• Forget Gate - how much information from previous states to keep
• Output Gate - how much information should be sent to the next states
Just as in the GRU, the current cell state is a sum of the previous cell state, weighted by the forget gate, and the new value, weighted by the input gate. Based on the cell state, the output gate regulates the final output.
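A single GRU step following the gating scheme just described can be sketched as follows; the dimensions and random weights are illustrative only:

```python
import numpy as np

# Sketch of one GRU step: update gate, reset gate, new memory, final memory.
rng = np.random.default_rng(0)
d_in, d_h = 4, 3
Wz, Uz = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))
Wr, Ur = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))
W,  U  = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))
sigmoid = lambda a: 1 / (1 + np.exp(-a))

def gru_step(x, h_prev):
    z = sigmoid(Wz @ x + Uz @ h_prev)           # update gate (2-2)
    r = sigmoid(Wr @ x + Ur @ h_prev)           # reset gate (2-3)
    h_new = np.tanh(W @ x + U @ (r * h_prev))   # new memory content (2-4)
    return z * h_prev + (1 - z) * h_new         # final memory (2-5)

h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):  # run over a toy sequence
    h = gru_step(x, h)
print(h.shape)
```

Note how, when the reset gate r is near zero, the previous hidden state is shut out of the new memory content, matching the description above.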
Word Embedding
Computations and gradients can be applied to numbers, not to words or letters, so we first need to convert words into a corresponding numerical representation before feeding them into a deep learning model. In general there are two types of word embedding: frequency based (which comprises count vectors, tf-idf, and co-occurrence vectors) and prediction based. With frequency-based embeddings the order of the words is not preserved, and the text is treated as a bag of words, whereas with prediction-based models the order, or locality, of words is taken into consideration to generate the numerical representation of a word. Within the prediction-based category there are two fundamental techniques, Continuous Bag of Words (CBOW) and the Skip-Gram model, which form the basis for word2vec [14] and GloVe [15].
The basic intuition behind word2vec is that if two different words have very similar "contexts" (that is, the words that are likely to appear around them), then the model will produce similar vectors for those words. Conversely, if two word vectors are similar, then the network will produce similar context predictions for those two words. For example, synonyms like "intelligent" and "smart" would have very similar contexts, and related words like "engine" and "transmission" would probably have similar contexts as well [16]. Plotting the word vectors learned by word2vec over a large corpus, we can find some very interesting relationships between words.
Figure 2-3 Semantic relation between words in vector space [17]
Attention Mechanism
We as humans pay attention to the things that are important or relevant in a context. For example, when asked a question about a passage, we try to find the part of the passage most relevant to the question and then reason from our understanding of that part. The same idea applies to the attention mechanism in Deep Learning: it is used to identify the specific parts of a given context to which the current question is relevant.
Formally put, the technique takes n arguments y_1, ..., y_n (in our case the encoded passage words) and a question representation q. It returns a vector z which is supposed to be the "summary" of the y_i, focusing on the information linked to the question q. More formally, it returns a weighted arithmetic mean of the y_i, where the weights are chosen according to the relevance of each y_i given the context c [18].
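The weighted-mean computation described above can be sketched directly; the vectors below are random stand-ins for learned representations, purely for illustration:

```python
import numpy as np

# Minimal attention sketch: a weighted mean of passage vectors y_i,
# weighted by their relevance (dot product) to a question vector q.
rng = np.random.default_rng(0)
n, d = 6, 4
Y = rng.normal(size=(n, d))   # passage word vectors y_1..y_n
q = rng.normal(size=d)        # question vector

scores = Y @ q                            # relevance of each y_i to q
weights = np.exp(scores - scores.max())
weights /= weights.sum()                  # softmax -> weights sum to 1
z = weights @ Y                           # the "summary" vector z
print(z.shape)
```

The softmax turns raw relevance scores into a probability distribution, so z is literally a weighted arithmetic mean of the y_i, as the text states.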
Figure 2-4 Attention Mechanism flow [18]
Memory Networks
While Convolutional Neural Networks and Recurrent Neural Networks do capture how we form our visual and sequential memories, their memory (encoded by hidden states and weights) is typically too small and not compartmentalized enough to accurately remember facts from the past (knowledge is compressed into dense vectors) [19].
Deep Learning needed a methodology that preserves memories as they are, such that they are not lost in generalization and recalling exact words or sequences of events remains possible, something computers are already good at. This effort led to Memory Networks [19], published at ICLR 2015 by Facebook AI Research.
This paper provides a basic framework to store, augment, and retrieve memories while working seamlessly with a Recurrent Neural Network architecture. The memory network consists of a memory m (an array of objects indexed by m_i) and four (potentially learned) components I, G, O, and R, as follows:
I (input feature map): converts the incoming input to the internal feature representation, either a sparse or dense feature vector like those from word2vec or GloVe.
G (generalization): updates old memories given the new input. The authors call this generalization because the network has the opportunity to compress and generalize its memories at this stage for some intended future use.
O (output feature map): produces a new output (in the feature representation space) given the new input and the current memory state. This component is responsible for performing inference. In a question answering system, this part selects the candidate sentences (which might contain the answer) from the story (conversation) so far.
R (response): converts the output into the desired response format, for example a textual response or an action. In the QA system described, this component finds the desired answer and then converts it from the feature representation to the actual word.
This model is fully supervised, meaning all the candidate sentences from which the answer can be found are marked during the training phase; this can also be termed 'hard attention'.
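As a rough illustration, the I/G/O/R pipeline can be sketched with toy stand-ins for the learned components; the bag-of-words features and overlap scoring below are our own simplifications, not the paper's learned models:

```python
# Skeleton of the four Memory Network components described above.
class MemoryNetwork:
    def __init__(self):
        self.memory = []                      # the memory array m

    def I(self, text):
        return set(text.lower().split())      # toy "feature map": bag of words

    def G(self, features):
        self.memory.append(features)          # store as-is (no compression)

    def O(self, query_features):
        # inference: select the stored memory most relevant to the query
        return max(self.memory, key=lambda m: len(m & query_features))

    def R(self, output_features):
        return " ".join(sorted(output_features))  # decode back to words

net = MemoryNetwork()
for fact in ["Bilbo travelled to the cave", "Frodo went to Mount Doom"]:
    net.G(net.I(fact))
best = net.O(net.I("Where did Frodo go"))
print(net.R(best))
```

A real memory network would learn I, G, O, and R end-to-end; here the overlap count merely mimics the O component's role of selecting candidate sentences.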
The authors tested the QA system on various works of literature, including Lord of the Rings.
Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]
CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART
There has been rapid progress since the release of the SQuAD dataset, and currently the best ensemble models are close to human-level accuracy in machine comprehension. This is due to various ingenious methods which solve some of the problems of previous approaches: out-of-vocabulary tokens were handled using character embeddings, long-term dependencies within the context passage were addressed using self-attention, and many other techniques such as contextualized vectors, history of words, and attention flow were introduced. In this chapter we will look at some of the most important models that were fundamental to the progress of Question Answering.
Machine Comprehension Using Match-LSTM and Answer Pointer
In this paper [20], the authors propose an end-to-end neural architecture for the QA task. The architecture is based on match-LSTM [21], a model they previously proposed for textual entailment, and Pointer Net [22], a sequence-to-sequence model proposed by Vinyals et al. (2015) to constrain the output tokens to be from the input sequences. The model consists of an LSTM preprocessing layer, a match-LSTM layer, and an Answer Pointer layer.
We are given a piece of text, which we refer to as a passage, and a question related to the passage. The passage is represented by a matrix P, where P is the length (number of tokens) of the passage and d is the dimensionality of the word embeddings. Similarly, the question is represented by a matrix Q, where Q is the length of the question. Our goal is to identify a subsequence of the passage as the answer to the question.
Figure 3-1 Match-LSTM Model Architecture [20]
LSTM Preprocessing Layer: They use a standard one-directional LSTM (Hochreiter & Schmidhuber, 1997) to process the passage and the question separately.
Match-LSTM Layer: They apply the match-LSTM model, originally proposed for textual entailment, to the machine comprehension problem by treating the question as a premise and the passage as a hypothesis. The match-LSTM sequentially goes through the passage; at position i of the passage, it first uses the standard word-by-word attention mechanism to obtain an attention weight vector as follows:
(3-1)
where the weight matrices and vectors in Equation 3-1 are parameters to be learned.
Answer Pointer Layer: The Answer Pointer (Ans-Ptr) layer is motivated by the Pointer Net introduced by Vinyals et al. (2015) [22]. The boundary model produces only the start token and the end token of the answer; all the tokens between these two in the original passage are then considered to be the answer.
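The boundary model's decoding step can be sketched as follows; the probability vectors and the toy passage are invented here, whereas a real system would obtain them from the trained pointer network:

```python
import numpy as np

# Boundary decoding sketch: given start/end probabilities over passage
# tokens, choose the span (s, e) with s <= e maximizing p_start[s] * p_end[e].
p_start = np.array([0.1, 0.6, 0.1, 0.1, 0.1])
p_end   = np.array([0.1, 0.1, 0.2, 0.5, 0.1])

best, best_span = -1.0, (0, 0)
for s in range(len(p_start)):
    for e in range(s, len(p_end)):            # enforce start <= end
        if p_start[s] * p_end[e] > best:
            best, best_span = p_start[s] * p_end[e], (s, e)

passage = ["Tesla", "blamed", "the", "Panic", "of"]
print(best_span, passage[best_span[0]:best_span[1] + 1])
```

Constraining the answer to a contiguous span of the input is exactly what makes the Pointer Net formulation attractive for SQuAD-style extraction.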
When this paper was released in November 2016, the match-LSTM method was the state of the art in Question Answering and topped the leaderboard for the SQuAD dataset.
R-NET: Machine Reading Comprehension with Self-Matching Networks
In this model [23], the question and passage are first processed separately by a bidirectional recurrent network (Mikolov et al., 2010). The question and passage are then matched with gated attention-based recurrent networks, obtaining a question-aware representation of the passage. On top of that, self-matching attention is applied to aggregate evidence from the whole passage and refine the passage representation, which is then fed into the output layer to predict the boundary of the answer span.
Question and Passage Encoding: First, the words are converted to their respective word-level and character-level embeddings. The character-level embeddings are generated by taking the final hidden states of a bidirectional recurrent neural network (RNN) applied to the embeddings of the characters in the token. Such character-level embeddings have been shown to help deal with out-of-vocabulary (OOV) tokens. A bidirectional RNN is then used to produce new representations of all the words in the question and the passage, respectively.
Figure 3-2 R-NET model architecture [23]
Gated Attention-based Recurrent Networks: They use a variant of attention-based recurrent networks with an additional gate to determine the importance of information in the passage with regard to a question. Unlike the gates in an LSTM or GRU, this additional gate is based on the current passage word and its attention-pooling vector over the question, which focuses on the relation between the question and the current passage word. The gate effectively models the phenomenon that, in reading comprehension and question answering, only parts of the passage are relevant to the question, and the gated representation is utilized in subsequent calculations.
25
Self-Matching Attention: The previous step generates a question-aware passage representation that highlights the important parts of the passage. One problem with such a representation is that it has very limited knowledge of context: an answer candidate is often oblivious to important cues in the passage outside its surrounding window. To address this problem, the authors propose directly matching the question-aware passage representation against itself. It dynamically collects evidence from the whole passage for each word in the passage and encodes the evidence relevant to the current passage word, together with its matching question information, into the passage representation.
Output Layer: They use the same method as Wang & Jiang (2016b), employing pointer networks (Vinyals et al., 2015) to predict the start and end positions of the answer. In addition, they use attention-pooling over the question representation to generate the initial hidden vector for the pointer network [23].
When the R-NET model first appeared on the leaderboard in March 2017, it was at the top with a 72.3 Exact Match and 80.7 F1 score.
Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
BiDAF [24] is a hierarchical multi-stage architecture for modeling the representations of the context paragraph at different levels of granularity. BiDAF includes character-level, word-level, and contextual embeddings, and uses bi-directional attention flow to obtain a query-aware context representation. Its attention layer is not used to summarize the context paragraph into a fixed-size vector; instead, the attention is computed for every time step, and the attended vector at each time step, along with the representations from previous layers, can flow through to the subsequent modeling layer. This reduces the information loss caused by early summarization [24].
Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]
Their machine comprehension model is a hierarchical multi-stage process and consists of six layers:
1. Character Embedding Layer: maps each word to a vector space using character-level CNNs.
2. Word Embedding Layer: maps each word to a vector space using a pre-trained word embedding model.
3. Contextual Embedding Layer: utilizes contextual cues from surrounding words to refine the embeddings of the words. These first three layers are applied to both the query and the context. An LSTM is run on top of the embeddings provided by the previous layers to model the temporal interactions between words; LSTMs are placed in both directions and their outputs concatenated.
4. Attention Flow Layer: couples the query and context vectors and produces a set of query-aware feature vectors for each word in the context.
5. Modeling Layer: employs a Recurrent Neural Network to scan the context. The input to the modeling layer is G, which encodes the query-aware representations of the context words. The output of the modeling layer captures the interaction among the context words conditioned on the query. They use two layers of bi-directional LSTM with an output size of d for each direction; the resulting matrix is passed to the output layer to predict the answer.
6. Output Layer: provides an answer to the query. The training loss (to be minimized) is defined as the sum of the negative log probabilities of the true start and end indices under the predicted distributions, averaged over all examples [25].
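The attention flow layer's bidirectional attention can be written out as a sketch. The BiDAF paper defines the similarity between context vector h and query vector u as S = w^T [h; u; h ∘ u]; the vectors and the weight w below are random stand-ins:

```python
import numpy as np

# Sketch of BiDAF attention flow: similarity matrix, context-to-query
# attention, and query-to-context attention.
rng = np.random.default_rng(0)
T, J, d = 5, 3, 4
H = rng.normal(size=(T, d))      # contextual embeddings of context words
U = rng.normal(size=(J, d))      # contextual embeddings of query words
w = rng.normal(size=3 * d)       # learned similarity weights

# S[t, j] = w . [h_t; u_j; h_t * u_j]
S = np.array([[w @ np.concatenate([H[t], U[j], H[t] * U[j]])
               for j in range(J)] for t in range(T)])

def softmax(z, axis):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

A = softmax(S, axis=1) @ U                # context-to-query: attended query per context word
b = softmax(S.max(axis=1), axis=0) @ H    # query-to-context: one attended context vector
print(S.shape, A.shape, b.shape)
```

Because attention is computed per time step rather than pooled into one vector, every context position keeps its own attended query summary, which is what lets attended vectors "flow" to the modeling layer.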
In a further variation of their above work, they add a self-attention layer after the bi-attention layer to further improve the results. The architecture of this model is shown in Figure 3-4.
Figure 3-4 BiDAF with self-attention model architecture [25]
Summary
In this chapter we reviewed the methods that are fundamental to the state of the
art in Machine Comprehension and for the task of Question Answering We have
reached closed to human level accuracy and this is due to incremental developments
over previous models As we saw Out of Vocabulary (OOV) tokens were handled by
using Character embedding Long term dependency within context passage were
solved using self-attention and many other techniques such as Contextualized vectors
Attention Flow etc were employed to get better results In the next chapter we will see
how we can build on these models and develop further
CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING
Although the advancements in Question Answering systems since the release of the SQuAD dataset have been impressive, and the results are getting close to human-level accuracy, QA is far from being a fool-proof task. The models still make mistakes that would be obvious to a human. For example:
Passage: The Panthers used the San Jose State practice facility and stayed at the San Jose Marriott. The Broncos practiced at Florida State Facility and stayed at the Santa Clara Marriott.
Question: At what university's facility did the Panthers practice?
Actual Answer: San Jose State
Predicted Answer: Florida State Facility
To find out what leads to the wrong predictions, we wanted to see the attention weights associated with such an example. We plotted the passage-question heat map, a 2D matrix where the intensity of each cell signifies the similarity between a passage word and a question word. For the above example we found that while certain words of the question are given high weight, other parts are not. The words 'at', 'facility', and 'practice' are given high attention, but 'Panthers' does not receive high attention; if it had, the system would have predicted 'San Jose State' as the right answer. To solve this issue, we analyzed the base BiDAF model and propose adding two things:
1. Bi-attention and self-attention over the query
2. A second level of attention over the outputs of the (bi-attention + self-attention) layers from both the context and the query
Multi-Attention BiDAF Model
Our comprehension model is a hierarchical multi-stage process and consists of the following layers:
1. Embedding: Just as in other models, we embed words using pretrained word vectors. We also embed the characters of each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to get character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training.
2. Pre-Process: A shared bi-directional GRU (Cho et al., 2014) is used to map the question and passage embeddings to context-aware embeddings.
3. Context Attention: The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al., 2016) is used to build a query-aware context representation. Let h_i be the vector for context word i, q_j the vector for question word j, and n_q and n_c the lengths of the question and context respectively. We compute the attention between context word i and question word j as

a_ij = w_1 · h_i + w_2 · q_j + w_3 · (h_i ∘ q_j)    (4-1)

where w_1, w_2, and w_3 are learned vectors and ∘ is element-wise multiplication. We then compute an attended vector c_i for each context token as

p_ij = exp(a_ij) / Σ_j exp(a_ij),    c_i = Σ_j p_ij q_j    (4-2)

We also compute a query-to-context vector q_c:

m_i = max_j a_ij,    p_i = exp(m_i) / Σ_i exp(m_i),    q_c = Σ_i p_i h_i    (4-3)
The final vector computed for each token is built by concatenating h_i, c_i, and h_i ∘ c_i. In our model we subsequently pass the result through a linear layer with ReLU activations.
4. Context Self-Attention: Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU; then we apply the same attention mechanism, only now between the passage and itself. In this case we do not use query-to-context attention, and we set a_ij = -inf if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so its output is additionally summed with the input.
5. Query Attention: This is done in the same way as context attention, but we calculate the weighted sum of the context words for each query word, so the output length is the number of query words. We then calculate context-to-query attention, similar to the query-to-context attention in the context attention layer.
Figure 4-1 The modified BiDAF model with multilevel attention
6. Query Self-Attention: This is done in the same way as the context self-attention layer, but on the output of the query attention layer.
7. Context-Query Bi-Attention + Self-Attention: The outputs of the context self-attention and query self-attention layers are taken as input, and the same bi-attention and self-attention process is applied to these inputs.
8. Prediction: In the last layer of our model, a bidirectional GRU is applied, followed by a linear layer that computes answer start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log likelihood of selecting the correct start and end tokens.
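The context-attention step of layer 3 (the trilinear similarity, the attended vectors, and the query-to-context vector) can be sketched as follows, with random stand-ins for the learned weights and representations:

```python
import numpy as np

# Sketch of the context-attention computation (Equations 4-1 to 4-3).
rng = np.random.default_rng(0)
n_c, n_q, d = 6, 4, 5
H = rng.normal(size=(n_c, d))            # context word vectors h_i
Q = rng.normal(size=(n_q, d))            # question word vectors q_j
w1, w2, w3 = rng.normal(size=(3, d))     # learned vectors w_1, w_2, w_3

# (4-1): a_ij = w1.h_i + w2.q_j + w3.(h_i * q_j), computed for all i, j
A = (H @ w1)[:, None] + (Q @ w2)[None, :] + (H * w3) @ Q.T

def softmax(z, axis):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

C = softmax(A, axis=1) @ Q               # (4-2): attended vector c_i per context word
qc = softmax(A.max(axis=1), axis=0) @ H  # (4-3): query-to-context vector q_c
print(A.shape, C.shape, qc.shape)
```

The same machinery is reused for the query-attention and self-attention layers, only with the roles of the two sequences swapped or the sequence matched against itself.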
Having carried out these modifications, we were able to solve the wrong example we started with: the multilevel attention model gives the correct output, "San Jose State". We also achieved slightly better scores than the original model, with an F1 score of 85.44 on the SQuAD dev set.
Chatbot Design Using a QA System
Designing a perfect chatbot that passes the Turing test is a fundamental goal of Artificial Intelligence. Although we are many orders of magnitude away from achieving such a goal, domain-specific tasks can be solved with chatbots made from current technology. We started with a similar objective in mind, i.e., to design a domain-specific chatbot and then generalize to other areas once the first domain-specific objective is achieved robustly. This led us to the fundamental problem of Machine Comprehension and subsequently to the task of Question Answering. Having achieved some degree of success with QA systems, we looked back at whether we could apply our newly acquired knowledge to the task of designing chatbots.
Chatbots made with today's technologies mostly rely on handcrafted techniques, such as template matching, that require anticipating all possible ways a user may articulate his requirements and a conversation may unfold. This requires many man-hours for designing a domain-specific system and is still very error prone. In this section we propose a general chatbot design that would make designing a domain-specific chatbot easy and robust at the same time.
Every domain-specific chatbot needs to obtain a set of information from the user and show some results based on the user-specific information obtained. Traditional chatbots use template matching and keyword lookup to determine whether the user has provided the required information. Our idea is to use the Question Answering system in the backend to extract the required information from whatever the user has typed up to this point in the conversation. The information to be extracted can be posed as a set of questions, and the answers obtained to those questions can be used as the parameters to supply the relevant information to the user.
We chose flight reservation as our chatbot domain. Our goal was to extract the information required to show the user the available flights as per the user's requirements.
Figure 4-2 Flight reservation chatbotrsquos chat window
For a flight reservation task, the booking agent needs to know, at minimum, the origin city, the destination city, and the date of travel to be able to show the available flights. Optional information includes the number of tickets, the passenger's name, one-way or round trip, etc.
A minimal conversation with the user through the chat window would be as shown above. We had a platform called OneTask on which we wanted to implement our chatbot; the chat interface within the OneTask system looks as follows.
Figure 4-3 Chatbot within OneTask system
The working of the chat system is as follows:
1. Initiation: The user opens the chat window, which starts a session with the chatbot. The chatbot asks 'How may I help you?'
2. User Reply: The user may reply with none of the information required for flight booking, or may provide multiple pieces of information in the same message.
3. User Reply Parsing: The conversation up to this point is treated as a passage, and the internal questions are run on this passage. The three required questions are:
Where do you want to go?
From where do you want to leave?
When do you want to depart?
4. Parsed Responses from the QA Model: After running the questions on the QA system with the given conversation, answers are obtained for all of them, along with their corresponding confidence values. Since answers will be obtained even if the required question has not been answered up to this point in the conversation, the confidence values play a crucial role in determining the validity of an answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly; a confidence of 2-10 signifies that it may have been answered, but we should verify with the user for correctness; and any answer with confidence below 2 is discarded.
5. Asking Remaining Questions Iteratively: After the parsing, we check whether any of the required questions are still unanswered. If so, the chatbot asks a remaining question, and steps 3-4 are carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request.
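The confidence-threshold logic of step 4 can be sketched as follows; the QA model is replaced by a stub returning canned (answer, confidence) pairs, since the real model is a trained network:

```python
# Sketch of the slot-filling loop driven by QA confidence thresholds.
REQUIRED_QUESTIONS = [
    "Where do you want to go",
    "From where do you want to leave",
    "When do you want to depart",
]

def fake_qa_model(question, conversation):
    # Stand-in for the real QA system: canned answers with confidences.
    canned = {
        "Where do you want to go": ("Boston", 14.2),
        "From where do you want to leave": ("Miami", 5.1),
        "When do you want to depart": ("tomorrow", 0.7),
    }
    return canned[question]

def parse_slots(conversation, qa_model):
    accepted, verify, unanswered = {}, {}, []
    for q in REQUIRED_QUESTIONS:
        answer, conf = qa_model(q, conversation)
        if conf > 10:          # confident: accept the answer
            accepted[q] = answer
        elif conf >= 2:        # uncertain: verify with the user
            verify[q] = answer
        else:                  # too low: discard and ask again
            unanswered.append(q)
    return accepted, verify, unanswered

accepted, verify, unanswered = parse_slots("I want to fly to Boston", fake_qa_model)
print(accepted, verify, unanswered)
```

The chatbot would then ask only the questions in `unanswered` (and confirm those in `verify`), re-running the parse after each user turn until all slots are filled.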
Figure 4-4 The Flow diagram of the Flight booking Chatbot system
Online QA System and Attention Visualization
To be able to test various examples, we deployed the BiDAF [24] model in an online demo. One can either choose from the available examples in the drop-down menu or paste their own passage and questions. While this is a useful and interesting system for testing the model in a user-friendly way, we created it primarily to be able to focus on the wrong samples.
Figure 4-5 QA system interface with attention highlight over candidate answers
The candidate answers are shown with blue highlights according to their confidence values: the higher the confidence, the darker the answer. The answer with the highest confidence value is chosen as the predicted answer. We developed the system to show the attention spread over the candidate answers in order to understand what needs to be done to improve the system. This led us to realize the importance of including query attention, as well as multilevel attention, in the BiDAF model, as described in the first section of this chapter.
CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions
Having done a thorough analysis of the current methods and state of the art in Machine Comprehension and the QA task, and having developed systems on them, we have acquired a strong sense of what needs to be done to further improve QA models. After observing the wrong samples, we can see that the system is still unable to encode meaning and is picking answers based on statistical patterns of occurrence of answers in the training examples.
Analyzing the state-of-the-art models and the ongoing research literature, it is easy to conclude that more features need to be embedded to encode the meaning of words, phrases, and sentences. A paper called Reinforced Mnemonic Reader for Machine Comprehension encoded the POS and NER tags of words along with their word and character embeddings, which gave them better results.
Figure 5-1 An English language semantic parse tree [26]
We have developed a method to encode the syntactic parse tree of a sentence that
encodes not only the POS tags but also the relation of each word within its phrase and
of each phrase within the whole sentence, in a hierarchical manner.
Finally, data augmentation is another way to get better results. One definite
way to reduce the errors would be to include in the training data samples similar to
those on which the system falters in the dev set: one could generate examples similar to the failure
cases and include them in the training set for better prediction. Another approach
would be to train on similar, larger datasets. Our models were trained on the SQuAD
dataset, but there are other datasets that pose a similar question answering task,
such as TriviaQA. We could augment the training set with TriviaQA along with SQuAD to
obtain a more robust system that generalizes better and thus has higher
accuracy when predicting answer spans.
Conclusion
In this work we have tried to explore the most fundamental techniques that have
shaped the current state of the art. We then proposed a minor architectural improvement
over an existing model. Furthermore, we developed two applications that
use the base model: first, we described how a chatbot application can be built
on top of the QA system, and second, we created a web interface where the model can
be used on any passage and question. This interface also shows the attention spread
over the candidate answers. While our effort to push the state of the art forward is
ongoing, we strongly believe that surpassing human-level accuracy on this task will pay
high dividends for society at large.
LIST OF REFERENCES
[1] Danqi Chen. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U
[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.
[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension." arXiv preprint arXiv:1705.03551 (2017).
[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ questions for machine comprehension of text." arXiv preprint arXiv:1606.05250 (2016).
[5] "The Stanford Question Answering Dataset." 2018. rajpurkar.github.io. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer
[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced mnemonic reader for machine comprehension." CoRR abs/1705.02798 (2017).
[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.
[8] Mehrotra, Neetesh. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai
[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.
[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets
[11] Williams, Ronald J., and David Zipser. "A learning algorithm for continually running fully recurrent neural networks." Neural Computation 1, no. 2 (1989): 270-280.
[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9, no. 8 (1997): 1735-1780.
[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv preprint arXiv:1412.3555 (2014).
[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic regularities in continuous space word representations." In Proceedings of NAACL-HLT, pp. 746-751. 2013.
[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation." In Proceedings of EMNLP, pp. 1532-1543. 2014.
[16] "Word2Vec Tutorial - The Skip-Gram Model." 2016. mccormickml.com. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model
[17] "[Paper Introduction] Bilingual Word Representations With Monolingual Quality in Mind." 2015. slideshare.net. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind
[18] "Attention Mechanism." 2016. blog.heuritech.com. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism
[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. 2014. "Memory Networks." arXiv preprint arXiv:1410.3916. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916
[20] Wang, Shuohang, and Jing Jiang. "Machine comprehension using match-LSTM and answer pointer." arXiv preprint arXiv:1608.07905 (2016).
[21] Wang, Shuohang, and Jing Jiang. "Learning natural language inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).
[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.
[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated self-matching networks for reading comprehension and question answering." In Proceedings of ACL (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.
[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional attention flow for machine comprehension." arXiv preprint arXiv:1611.01603 (2016).
[25] Clark, Christopher, and Matt Gardner. "Simple and effective multi-paragraph reading comprehension." arXiv preprint arXiv:1710.10723 (2017).
[26] "8. Analyzing Sentence Structure." 2018. nltk.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata, India, and has had a strong interest in
computers since an early age. After high school, he completed a B.Sc. in computer
science at Ramakrishna Mission Residential College, Narendrapur, followed by an M.Sc.
in computer science at St. Xavier's College, Kolkata. With a strong intuition for and
interest in human-like learning systems, he wanted to work in this area. He started
working at TCS Innovation Labs, Pune, on applications of Natural Language
Processing in education. As he was deeply passionate about learning
systems that mimic the human brain and learn like a human child does, he became
increasingly interested in Deep Learning and its applications. After working for a year,
he went on to pursue a Master of Science degree in computer science at the
University of Florida, Gainesville. His academic interests have been focused on Deep
Learning and Natural Language Processing, and he has been working on Machine
Reading Comprehension since the summer of 2017.
CHAPTER 1 INTRODUCTION
Teaching machines to read and understand human natural language is a
central and long-standing goal of Natural Language Processing and Artificial
Intelligence in general. The potential benefits of machines being able to reason about and
comprehend human language are enormous. We are beginning to see how
commercial systems like Alexa, Siri, and Google Now are being widely used as speech
recognition has improved to human levels. While speech recognition systems can
transcribe speech to text, comprehension of that transcribed text is another task, one which
is currently a major focus for both academia and industry because of its possible
applications. Moreover, the sheer amount of text available throughout the internet is a
major reason why machine comprehension of text is such an important task.
With the growth of Deep Learning methods in the last few years, the field of
Machine Comprehension, and Natural Language Processing (NLP) in general, has
experienced a revolution. While traditional methods and practices are still prevalent
and form the basis of our deep understanding of languages, Deep Learning methods
have surpassed all traditional NLP and Machine Learning methods by a significant
margin and are currently driving the growth of the field.
To be able to build a system that can understand human text, we first need to
ask ourselves how we can evaluate a machine's comprehension ability. We had
initially set a goal to build a chatbot for a specific domain and to generalize to other topics
as we went ahead. While developing the system, we discovered the necessity of reading
comprehension and the question of how to measure it. We finally found the answer in
question answering systems.
Just as we human beings are tested for our ability to understand language
through questions, we should ask machines similar questions about what they
have just read. The performance of the system on such a question answering task lets
us evaluate how well the machine can reason about what it has just read [1].
Reading comprehension has been a topic of Natural Language Understanding since the
1970s. In 1977, Wendy Lehnert wrote in her doctoral thesis: "Only when we can ask a
program to answer questions about what it reads will we be able to begin to access that
program's comprehension" [2].
Figure 1-1 The task of Question Answering
To support this task, the NLP community has developed various datasets such as
CNN/Daily Mail, WebQuestions, SQuAD, and TriviaQA [3]. For our purpose we chose
SQuAD, which stands for the Stanford Question Answering Dataset [4]. SQuAD consists of
questions posed by crowdworkers on a set of Wikipedia articles, where the answer to
every question is a segment of text, or span, from the corresponding reading passage.
With 100,000+ question-answer pairs on 500+ articles, SQuAD is significantly larger
than previous reading comprehension datasets [5].
An example from the SQuAD dataset is as follows:
Passage: Tesla later approached Morgan to ask for more funds to build a more
powerful transmitter. When asked where all the money had gone, Tesla responded by
saying that he was affected by the Panic of 1901, which he (Morgan) had caused.
Morgan was shocked by the reminder of his part in the stock market crash and by
Tesla's breach of contract by asking for more funds. Tesla wrote another plea to
Morgan, but it was also fruitless. Morgan still owed Tesla money on the original
agreement, and Tesla had been facing foreclosure even before construction of the
tower began.
Question: On what did Tesla blame for the loss of the initial money?

Answer: Panic of 1901
As we started exploring the QA task, we faced several challenges. Some of them
we could solve with the help of other research, and some still exist in
the domain:
• Out-of-vocabulary words

• Multi-sentence reasoning may be required

• There may exist several candidate answers

• Optimizing the Exact Match (EM) metric may fail when the answer boundary is fuzzy or too long, such as the answer to a "why" question [6]

• "One-hop" prediction may fail to fully understand the query [6]

• Using only an LSTM/GRU fails to fully capture long-distance contextual interactions between parts of the context

• Current models are unable to capture the semantics of the passage
In the upcoming chapters, we will first briefly review the basics necessary for
understanding the models; then we will delve into the fundamental models that
have shaped the current state of the art; then we will discuss our contribution in
terms of architecture and applications; and finally we will conclude with future directions.
CHAPTER 2 THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
To build a question answering system, one needs to be familiar with
fundamental deep learning models such as Recurrent Neural Networks (RNNs) and Long
Short-Term Memory (LSTM) networks. In this chapter we give an overview of these
techniques and see how they all connect in building a question answering system.
Neural Networks
What makes Deep Learning so intriguing is its close resemblance to
the working of the mammalian brain, or at least the inspiration it draws from it. The same can
be said of Artificial Neural Networks [7], which consist of a system of interconnected
units, called 'neurons', that take input from similar units and produce a single output.
Figure 2-1 Simple and Deep Learning Neural Networks [8]
The connections between neurons are weighted based on the input data,
which enables the network to tune itself to produce a certain output for a given input.
This is the learning process, which is achieved through backpropagation, a
method of propagating the error from the output layer back to the previous layers.
Convolutional Neural Network
The first wave of deep learning's success was brought by Convolutional Neural
Networks (CNNs) [9], the technique used by the winning team of the ImageNet
competition in 2012. CNNs are deep artificial neural networks (ANNs) that can be used to
classify images, cluster them by similarity, and perform object recognition within scenes.
They can be used to detect and identify faces, people, signs, or any other visual data.
Figure 2-2 Convolutional Neural Network Architecture [10]
There are primarily four operations in a standard CNN model (as shown in the figure above):

1. Convolution - The primary purpose of convolution in the ConvNet above is to extract features from the input image. The spatial relationships between pixels, i.e., the image features, are preserved and learned by the convolution using small squares of input data.

2. Non-Linearity (ReLU) - The Rectified Linear Unit (ReLU) is a non-linear operation applied element-wise on the feature map, replacing the negative values with zero.

3. Pooling or Sub-Sampling - Spatial pooling reduces the dimensionality of each feature map while retaining the most important information. For max pooling, the largest value in each window is kept and the rest are discarded. Other types of pooling include average and sum pooling.

4. Classification (Fully Connected Layer) - The fully connected layer is a traditional Multi-Layer Perceptron, as described before, that uses a softmax activation function in the output layer. The high-level features of the image are encoded by the convolutional and pooling layers and then fed to the fully connected layer, which uses these features to classify the input image into various classes based on the training dataset [10].
When a new image is fed into the CNN model, all the above-mentioned steps are
carried out (forward propagation) and a probability distribution over the set of
output classes is obtained. With a large enough training dataset, the network learns to
generalize well enough to classify new images into their correct classes.
Recurrent Neural Networks (RNN)
Whenever we want to predict or encode sequential data, an RNN [11] is our go-to
method. RNNs perform the same computation for every element of a sequence, with the
output for each element depending on previous computations, hence the recurrence. In
practice, RNNs are unable to retain long-term dependencies and can look back only a
few steps because of the vanishing gradient problem.
h_t = tanh(W_h h_(t-1) + W_x x_t)    (2-1)
A solution to the dependency problem is to use gated cells such as the LSTM [12] or
GRU [13]. These cells pass important information on to the next time steps while ignoring
unimportant information. The gated units in a GRU block are:
• Update Gate - computed from the current input and the previous hidden state:

z_t = σ(W_z x_t + U_z h_(t-1))    (2-2)
• Reset Gate - calculated similarly, but with different weights:

r_t = σ(W_r x_t + U_r h_(t-1))    (2-3)
• New memory content:

h̃_t = tanh(W x_t + r_t ⊙ U h_(t-1))    (2-4)
If the reset gate is close to 0, the previous memory is ignored and only the new
input information is kept.
The final memory at the current time step combines the previous and current time steps:

h_t = z_t ⊙ h_(t-1) + (1 − z_t) ⊙ h̃_t    (2-5)
While the GRU is computationally efficient, the LSTM is the more general case,
with three gates:

• Input Gate - what new information to add to the current cell state

• Forget Gate - how much information from previous states to keep

• Output Gate - how much information to pass on to the next states
Just as in the GRU, the current cell state is a sum of the previous cell state,
weighted by the forget gate, and the new candidate value, weighted by the input
gate. Based on the cell state, the output gate regulates the final output.
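The GRU update described by Equations 2-2 through 2-5 can be sketched in a few lines of numpy. The weights here are random toy values (in a real network they are learned), and the gating convention follows the equations above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU time step following Eqs. 2-2 through 2-5."""
    Wz, Uz, Wr, Ur, W, U = params
    z = sigmoid(Wz @ x_t + Uz @ h_prev)             # update gate (2-2)
    r = sigmoid(Wr @ x_t + Ur @ h_prev)             # reset gate (2-3)
    h_tilde = np.tanh(W @ x_t + r * (U @ h_prev))   # new memory content (2-4)
    return z * h_prev + (1.0 - z) * h_tilde         # final memory (2-5)

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
# Even-indexed matrices act on the input, odd-indexed on the hidden state.
params = [rng.normal(size=(d_h, d_in)) if i % 2 == 0
          else rng.normal(size=(d_h, d_h)) for i in range(6)]

h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):    # run a short input sequence
    h = gru_step(x, h, params)
```

Because the hidden state is always a convex combination of the previous state and a tanh-bounded candidate, its entries stay in (-1, 1).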
Word Embedding
Computations and gradients can be applied to numbers, not to words or letters,
so we first need to convert words into a corresponding numerical representation before
feeding them into a deep learning model. In general there are two types of word embedding:
frequency-based (count vectors, tf-idf, and co-occurrence vectors)
and prediction-based. With frequency-based embeddings the order of the words is not
preserved, and they work as a bag-of-words model, whereas prediction-based models
take the order, or locality, of words into consideration to generate the
numerical representation of a word. Within the prediction-based category there are
two fundamental techniques, Continuous Bag of Words (CBOW) and the Skip-Gram
model, which form the basis of word2vec [14] and GloVe [15].
The basic intuition behind word2vec is that if two different words have very
similar "contexts" (that is, the words that are likely to appear around them), then the model
will produce similar vectors for those words. Conversely, if two word vectors are
similar, then the network will produce similar context predictions for those two words.
For example, synonyms like "intelligent" and "smart" have very similar contexts,
and related words like "engine" and "transmission" probably have
similar contexts as well [16]. Plotting the word vectors learned by word2vec over a
large corpus, we can find some very interesting relationships between words.
Figure 2-3 Semantic relation between words in vector space [17]
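The "similar context, similar vector" intuition can be checked with cosine similarity on toy vectors. The 3-dimensional embeddings below are made up for illustration; real word2vec or GloVe vectors have hundreds of dimensions:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings: the synonyms point in nearly the same direction,
# while the unrelated word points elsewhere.
vectors = {
    "intelligent": np.array([0.90, 0.80, 0.10]),
    "smart":       np.array([0.85, 0.75, 0.15]),
    "banana":      np.array([0.10, -0.20, 0.90]),
}

sim_synonyms = cosine(vectors["intelligent"], vectors["smart"])
sim_unrelated = cosine(vectors["intelligent"], vectors["banana"])
```

With real embeddings, the same comparison underlies the analogy arithmetic shown in Figure 2-3 (e.g., king − man + woman landing near queen).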
Attention Mechanism
We as humans pay attention to the things that are important or relevant in a
context. For example, when asked a question about a passage, we try to find the part of
the passage most relevant to the question and then reason from our
understanding of that part. The same idea applies to the attention
mechanism in Deep Learning: it is used to identify the specific parts of a given context
to which the current question is relevant.
Formally put, the technique takes n arguments y_1, ..., y_n (in our case, the
encoded words of the passage) and a question representation, say q. It returns a
vector z which is a "summary" of the y_i, focused on the information
linked to the question q. More formally, it returns a weighted arithmetic mean of the y_i,
where the weights are chosen according to the relevance of each y_i given the context c
[18].
Figure 2-4 Attention Mechanism flow [18]
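A minimal sketch of this computation: score each y_i against the question q, softmax the scores into weights, and return the weighted mean z. Dot-product scoring is one common choice (an assumption here; other scoring functions exist), and the vectors are toy values:

```python
import numpy as np

def attend(Y, q):
    """Return z, the attention-weighted mean of the rows of Y.

    Y : (n, d) matrix of context vectors y_1 ... y_n
    q : (d,) question vector used to score relevance
    """
    scores = Y @ q                       # relevance of each y_i to q
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax: positive, sums to 1
    z = weights @ Y                      # weighted arithmetic mean of the y_i
    return z, weights

Y = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
q = np.array([1.0, 0.0])                 # question "points at" the first axis
z, w = attend(Y, q)
```

Rows of Y aligned with q receive the larger weights, so z is dominated by the relevant context vectors.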
Memory Networks
While Convolutional Neural Networks and Recurrent Neural Networks do
capture how we form our visual and sequential memories, their memory (encoded in
hidden states and weights) is typically too small and not compartmentalized
enough to accurately remember facts from the past, since knowledge is compressed into
dense vectors [19].
Deep Learning needed a methodology that preserves memories as
they are, so that they are not lost in generalization and so that recalling exact words or
sequences of events remains possible, something computers are already good at. This
effort led to Memory Networks [19], published at ICLR 2015 by Facebook AI
Research.
The paper provides a basic framework to store, augment, and retrieve memories
while working seamlessly with a recurrent neural network architecture. The memory
network consists of a memory m (an array of objects indexed by m_i) and four
(potentially learned) components I, G, O, and R, as follows:
I (input feature map) - converts the incoming input to the internal feature
representation, either a sparse or dense feature vector like those from word2vec or
GloVe.

G (generalization) - updates old memories given the new input. The authors call this
generalization because the network has an opportunity to compress and generalize its
memories at this stage for some intended future use.
O (output feature map) - produces a new output (in the feature representation
space) given the new input and the current memory state. This component is
responsible for performing inference; in a question answering system, this part
selects the candidate sentences (which might contain the answer) from the story
(conversation) so far.

R (response) - converts the output into the desired response format, for
example a textual response or an action. In the QA system described, this component
finds the desired answer and then converts it from the feature representation to the actual
word.
This is a fully supervised model, meaning that all the candidate sentences from
which the answer can be found are marked during the training phase; this can also be
termed 'hard attention'.

The authors tested the QA system on various literature, including The Lord of the
Rings.
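A toy skeleton of the I/G/O/R pipeline described above. The components here are deliberately simplistic stand-ins: I is a bag-of-words featurizer, G simply appends to memory, O scores memories by word overlap, and R echoes the selected memory; in the actual Memory Networks model [19] these components are learned:

```python
class ToyMemoryNetwork:
    """Minimal illustration of the I, G, O, R components of a memory network."""

    def __init__(self):
        self.memory = []                  # m: an array of stored objects

    def I(self, text):
        """Input feature map: convert text to an internal representation."""
        return set(text.lower().split())

    def G(self, features, text):
        """Generalization: update memories given the new input (here: append)."""
        self.memory.append((features, text))

    def O(self, query_features):
        """Output feature map: pick the memory most relevant to the query."""
        return max(self.memory, key=lambda m: len(m[0] & query_features))

    def R(self, best_memory):
        """Response: convert the selected memory into the answer format."""
        return best_memory[1]

    def tell(self, text):
        self.G(self.I(text), text)

    def ask(self, question):
        return self.R(self.O(self.I(question)))

net = ToyMemoryNetwork()
net.tell("Bilbo travelled to the cave")
net.tell("Frodo went to Mount Doom")
answer = net.ask("Where did Frodo go")
```

Because every stored sentence is kept verbatim, exact recall is trivial here, which is precisely the property that motivated memory networks over compressed RNN states.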
Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]
CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART
There has been rapid progress since the release of the SQuAD dataset, and
currently the best ensemble models are close to human-level accuracy in machine
comprehension. This is due to various ingenious methods that solve some of the
problems of previous approaches: out-of-vocabulary tokens are handled using
character embeddings; long-term dependencies within the context passage are addressed
using self-attention; and many other techniques, such as contextualized vectors, history
of words, and attention flow, have been introduced. In this chapter we will look at some of the most
important models that were fundamental to the progress of question answering.
Machine Comprehension Using Match-LSTM and Answer Pointer
In this paper [20], the authors propose an end-to-end neural architecture for the
QA task. The architecture is based on match-LSTM [21], a model they previously proposed for
textual entailment, and Pointer Net [22], a sequence-to-sequence model proposed by
Vinyals et al. (2015) that constrains the output tokens to come from the input sequence.
The model consists of an LSTM preprocessing layer, a match-LSTM layer, and an
Answer Pointer layer.
We are given a piece of text, which we refer to as a passage, and a question
related to the passage. The passage is represented by a d × P matrix P, where P is the length
(number of tokens) of the passage and d is the dimensionality of the word embeddings.
Similarly, the question is represented by a d × Q matrix Q, where Q is the length of the question.
The goal is to identify a subsequence of the passage as the answer to the question.
Figure 3-1 Match-LSTM Model Architecture [20]
LSTM Preprocessing Layer: They use a standard one-directional LSTM
(Hochreiter & Schmidhuber, 1997) to process the passage and the question separately,
producing hidden representations of both.
Match-LSTM Layer: They apply the match-LSTM model, originally proposed for textual
entailment, to the machine comprehension problem by treating the question as a
premise and the passage as a hypothesis. The match-LSTM goes through
the passage sequentially; at position i of the passage, it first uses the standard word-by-word
attention mechanism to obtain an attention weight vector as follows:
G_i = tanh(W^q H^q + (W^p h_i^p + W^r h_(i-1)^r + b^p) ⊗ e_Q)

α_i = softmax(w^T G_i + b ⊗ e_Q)    (3-1)

where W^q, W^p, W^r, b^p, w, and b are parameters to be learned, H^q is the question representation, h_i^p is the passage representation at position i, h_(i-1)^r is the previous hidden state of the match-LSTM, and ⊗ e_Q denotes repeating the vector across the question length.
Answer Pointer Layer: The Answer Pointer (Ans-Ptr) layer is motivated by the
Pointer Net introduced by Vinyals et al. (2015) [22]. Their boundary model produces only
the start token and the end token of the answer; all the tokens between these
two in the original passage are then taken to be the answer.
When this paper was released in November 2016, the match-LSTM method
was the state of the art in question answering systems and was at the top of the
leaderboard for the SQuAD dataset.
R-NET: Machine Reading Comprehension with Self-Matching Networks
In this model [23], the question and passage are first processed separately by a
bidirectional recurrent network (Mikolov et al. 2010). The question and passage are then
matched with gated attention-based recurrent networks, obtaining a
question-aware representation of the passage. On top of that, self-matching
attention is applied to aggregate evidence from the whole passage and refine the passage
representation, which is then fed into the output layer to predict the boundary of the
answer span.
Question and Passage Encoding: First, the words are converted to their
respective word-level and character-level embeddings. The character-level
embeddings are generated by taking the final hidden states of a bi-directional recurrent
neural network (RNN) applied to the embeddings of the characters in the token. Such
character-level embeddings have been shown to be helpful in dealing with out-of-vocabulary
(OOV) tokens.

A bi-directional RNN is then used to produce new representations of all the
words in the question and in the passage, respectively.
Figure 3-2. R-NET model architecture [23]
Gated Attention-based Recurrent Networks: They use a variant of attention-based
recurrent networks with an additional gate to determine the importance of
information in the passage with regard to the question. Unlike the gates in an LSTM or
GRU, this additional gate is based on the current passage word and its attention-pooling
vector of the question, so it focuses on the relation between the question and the current
passage word. The gate effectively models the phenomenon that in reading
comprehension and question answering only parts of the passage are relevant to the
question, and this is utilized in subsequent calculations.
Self-Matching Attention: The previous step generates a question-aware passage
representation that highlights the important parts of the passage. One
problem with such a representation is that it has very limited knowledge of context: an
answer candidate is often oblivious to important cues in the passage outside its
surrounding window. To address this problem, the authors propose directly matching
the question-aware passage representation against itself. This dynamically collects
evidence from the whole passage for each word in the passage and encodes the evidence
relevant to the current passage word, together with its matching question information, into the
passage representation.
Output Layer: They use the same method as Wang & Jiang (2016b), employing
pointer networks (Vinyals et al. 2015) to predict the start and end positions of the
answer. In addition, they use attention-pooling over the question representation to
generate the initial hidden vector for the pointer network [23].
When the R-NET model first appeared on the leaderboard in March 2017, it was at
the top with a 72.3 Exact Match and an 80.7 F1 score.
Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
BiDAF [24] is a hierarchical multi-stage architecture for modeling
representations of the context paragraph at different levels of granularity. BiDAF
includes character-level, word-level, and contextual embeddings, and uses bi-directional
attention flow to obtain a query-aware context representation. Its attention layer is not
used to summarize the context paragraph into a fixed-size vector. Instead, the attention
is computed at every time step, and the attended vector at each time step, along with
the representations from previous layers, can flow through to the subsequent modeling
layer. This reduces the information loss caused by early summarization [24].
Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]
Their machine comprehension model is a hierarchical multi-stage process and
consists of six layers
1 Character Embedding Layer maps each word to a vector space using character-level CNNs
2 Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model
3 Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embedding of the words. These first three layers are applied to both the query and the context. They use an LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words, placing LSTMs in both directions and concatenating their outputs.
4 Attention Flow Layer couples the query and context vectors and produces a set of query aware feature vectors for each word in the context
5 Modeling Layer employs a Recurrent Neural Network to scan the context. The input to the modeling layer is G, which encodes the query-aware representations of the context words. The output of the modeling layer captures the interaction among the context words conditioned on the query. They use two layers of bi-directional LSTM with an output size of d for each direction; hence a matrix is obtained which is passed on to the output layer to predict the answer.

6 Output Layer provides an answer to the query. They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices under the predicted distributions, averaged over all examples [25].
In a further variation of their work above, they add a self-attention layer after the
bi-attention layer to further improve the results [25]. The architecture of that model is
shown in the figure below.

Figure 3-4. Model architecture with the added self-attention layer [25]
Summary
In this chapter we reviewed the methods that are fundamental to the state of the
art in Machine Comprehension and the task of question answering. We have
come close to human-level accuracy, and this is due to incremental developments
over previous models. As we saw, out-of-vocabulary (OOV) tokens were handled
using character embeddings; long-term dependencies within the context passage were
addressed using self-attention; and many other techniques, such as contextualized vectors
and attention flow, were employed to get better results. In the next chapter we will see
how we can build on these models and develop further.
CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING
Although the advancements in question answering systems since the release of
the SQuAD dataset have been impressive, and the results are getting close to human-level
accuracy, these are far from fool-proof systems. The models still make mistakes
that would be obvious to a human. For example:

Passage: The Panthers used the San Jose State practice facility and stayed at
the San Jose Marriott. The Broncos practiced at Florida State Facility and stayed at the
Santa Clara Marriott.

Question: At what university's facility did the Panthers practice?

Actual Answer: San Jose State

Predicted Answer: Florida State Facility
To find out what leads to such wrong predictions, we wanted to see the
attention weights for an example of this kind. We plotted the passage-question
heat map, a 2D matrix in which the intensity of each cell signifies the
similarity between a passage word and a question word. For the above example, we
found that while certain words of the question are given high weight, other parts
are not. The words 'At', 'facility', and 'practice' receive high attention, but 'Panthers' does
not. If it had received high attention, the system would have
predicted 'San Jose State' as the right answer. To solve this issue, we analyzed the
base BiDAF model and propose adding two things:
1 Bi-Attention and Self-Attention over Query
2 Second level attention over the output of (Bi-Attention + Self-Attention) from both Context and Query
Multi-Attention BiDAF Model
Our comprehension model is a hierarchical multi-stage process and consists of the following layers:
1 Embedding Just as in other models, we embed words using pretrained word vectors. We also embed the characters of each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to get character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training.
2 Pre-Process A shared bi-directional GRU (Cho et al 2014) is used to map the question and passage embeddings to context aware embeddings
3 Context Attention The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al 2016) is used to build a query-aware context representation Let h_i be the vector for context word i q_j be the vector for question word j and n_q and n_c be the lengths of the question and context respectively We compute attention between context word i and question word j as
a_ij = w1 · h_i + w2 · q_j + w3 · (h_i ⊙ q_j) (4-1)
where w1, w2, and w3 are learned vectors and ⊙ is element-wise multiplication. We then compute an attended vector c_i for each context token as
c_i = Σ_j p_ij q_j, where p_ij = softmax_j(a_ij) (4-2)
We also compute a query-to-context vector q_c
q_c = Σ_i m_i h_i, where m_i = softmax_i(max_j a_ij) (4-3)
The final vector computed for each token is built by concatenating h_i, c_i, h_i ⊙ c_i, and q_c ⊙ h_i. In our model we subsequently pass the result through a linear layer with ReLU activations.
4 Context Self-Attention Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU. Then we apply the same attention mechanism, only now between the passage and itself. In this case we do not use query-to-context attention, and we set a_ij = −∞ if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so its output is additionally summed with the input.
5 Query Attention For this part we proceed the same way as in context attention, but calculate the weighted sum of the context words for each query word; thus the output length equals the number of query words. We then calculate context-to-query attention, similar to query-to-context in the context attention layer.
Figure 4-1 The modified BiDAF model with multilevel attention
6 Query Self-Attention This part is done the same way as the context self-attention layer but from the output of the Query Attention layer
7 Context Query Bi-Attention + Self-Attention The output of the Context self-attention and the Query Self-Attention layers are taken as input and the same process for Bi-Attention and self-attention is applied on these inputs
8 Prediction In the last layer of our model, a bidirectional GRU is applied, followed by a linear layer that computes answer start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log likelihood of selecting the correct start and end tokens.
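The bi-attention of Equations 4-1 to 4-3 can be sketched in NumPy. This is an illustrative toy with random inputs and weight vectors; in the actual model w1, w2, w3 are learned and the vectors come from the GRU layers:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bi_attention(H, Q, w1, w2, w3):
    """BiDAF-style attention with the trilinear similarity of Eq. 4-1:
    a_ij = w1.h_i + w2.q_j + w3.(h_i * q_j).  Shapes: H (n_c, d), Q (n_q, d)."""
    a = (H @ w1)[:, None] + (Q @ w2)[None, :] + (H * w3) @ Q.T  # (n_c, n_q)
    C = softmax(a, axis=1) @ Q              # attended vectors c_i (Eq. 4-2)
    m = softmax(a.max(axis=1), axis=0)      # one weight per context word
    q_c = m @ H                             # query-to-context vector (Eq. 4-3)
    return C, q_c

rng = np.random.default_rng(1)
n_c, n_q, d = 6, 4, 5
H, Q = rng.standard_normal((n_c, d)), rng.standard_normal((n_q, d))
w1, w2, w3 = rng.standard_normal((3, d))
C, q_c = bi_attention(H, Q, w1, w2, w3)
```

Each context token gets one attended vector c_i, while q_c is a single vector for the whole passage, matching the concatenation described in step 3.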
Having carried out this modification, we were able to solve the wrong example we started with: the multilevel attention model gives the correct output, "San Jose State". We also achieved slightly better scores than the original model, with an F1 score of 85.44 on the SQuAD dev dataset.
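The F1 metric referred to here is the token-overlap F1 used for SQuAD evaluation; a minimal sketch, applied to the Panthers example from the start of this chapter:

```python
from collections import Counter

def token_f1(prediction, truth):
    """Token-overlap F1, as used to score SQuAD answer spans."""
    p, t = prediction.lower().split(), truth.lower().split()
    common = Counter(p) & Counter(t)        # multiset intersection of tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(t)
    return 2 * precision * recall / (precision + recall)

# The wrong prediction still earns partial credit for the shared token "State".
score = token_f1("Florida State Facility", "San Jose State")
exact = token_f1("San Jose State", "San Jose State")
```

This is why F1 is reported alongside Exact Match: partially overlapping spans score between 0 and 1.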
Chatbot Design Using a QA System
Designing a perfect chatbot that passes the Turing test is a fundamental goal of Artificial Intelligence. Although we are many orders of magnitude away from achieving such a goal, domain-specific tasks can be solved with chatbots made from current technology. We had started with a similar objective in mind, i.e., to design a domain-specific chatbot and then generalize to other areas once it achieves the first domain-specific objective robustly. This led us to the fundamental problem of Machine Comprehension and subsequently to the task of Question Answering. Having achieved some degree of success with QA systems, we looked back to see if we could apply our newly acquired knowledge to the task of designing chatbots.
The chatbots made with today's technologies mostly rely on handcrafted techniques such as template matching, which requires anticipating all possible ways a user may articulate his requirements and a conversation may unfold. This requires many man-hours for designing a domain-specific system and is still very error prone. In this section we propose a general chatbot design that would make designing a domain-specific chatbot easy and robust at the same time.
Every domain-specific chatbot needs to obtain a set of information from the user and show some results based on the user-specific information obtained. Traditional chatbots use template matching and keyword lookup to determine whether the user has provided the required information. Our idea is to use the Question Answering system in the backend to extract the required information from whatever the user has typed up to this point of the conversation. The information to be extracted can be posed as a set of questions, and the answers obtained from those questions can be used as the parameters to supply the relevant information to the user.
We chose flight reservation as our chatbot domain. Our goal was to extract the required information from the user to be able to show him the available flights as per the user's requirements.
Figure 4-2 Flight reservation chatbot's chat window
For a flight reservation task, the booking agent needs to know at minimum the origin city, destination city, and date of travel to be able to show the available flights. Optional information includes the number of tickets, passenger's name, one way or round trip, etc.
The minimalistic conversation with the user through the chat window would be as shown above. We had a platform called OneTask on which we wanted to implement our chatbot. The chat interface within the OneTask system looks as follows:
Figure 4-3 Chatbot within OneTask system
The working of the chat system is as follows:
1 Initiation The user opens the chat window, which starts a session with the chatbot. The chatbot asks 'How may I help you?'
2 User Reply The user may reply with none of the required information for flight booking, or may reply with multiple pieces of information in the same message.
3 User Reply Parsing The conversation up to this point is treated as a passage, and the internal questions are run on this passage. The questions that are run include:
Where do you want to go?
From where do you want to leave?
When do you want to depart?
4 Parsed responses from QA model After running the questions on the QA system with the given conversation, answers are obtained for all of them along with their corresponding confidence values. Since answers will be returned even if the required question has not actually been answered up to this point in the conversation, the confidence values play a crucial role in determining the validity of an answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly; a confidence of 2 to 10 signifies that it may have been answered, but the system should verify with the user for correctness; and any answer with confidence below 2 is discarded.
5 Asking remaining questions iteratively After the parsing, it is checked whether any of the required questions are still unanswered. If so, the chatbot asks the remaining question, and steps 3 and 4 are carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request.
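The parsing loop in steps 3 to 5 can be sketched as slot filling. The qa_model below is a hypothetical stand-in that returns canned (answer, confidence) pairs; a real system would query the comprehension model with the conversation so far. The confidence bands are the ones given in step 4:

```python
# Hypothetical QA backend: returns (answer, confidence) for a question
# asked against the conversation so far.
def qa_model(question, conversation):
    canned = {
        "Where do you want to go": ("Boston", 14.2),
        "From where do you want to leave": ("Miami", 6.1),
        "When do you want to depart": ("tomorrow", 0.9),
    }
    return canned[question]

REQUIRED = ["Where do you want to go",
            "From where do you want to leave",
            "When do you want to depart"]

def parse_slots(conversation):
    """Apply the confidence bands from step 4: >10 accept, 2-10 verify, <2 discard."""
    slots = {}
    for q in REQUIRED:
        answer, conf = qa_model(q, conversation)
        if conf > 10:
            slots[q] = ("accepted", answer)
        elif conf >= 2:
            slots[q] = ("verify", answer)   # confirm this answer with the user
        # below 2: leave unfilled; the bot re-asks this question next turn
    return slots

slots = parse_slots("I want to fly from Miami to Boston")
```

Any question still missing from the returned slots is asked explicitly on the next turn, which is exactly the iteration described in step 5.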
Figure 4-4 The Flow diagram of the Flight booking Chatbot system
Online QA System and Attention Visualization
To be able to test out various examples, we used the BiDAF [24] model for an online demo. One can either choose from the available examples in the drop-down menu or paste one's own passage and questions. While this is a useful and interesting system for testing the model in a user-friendly way, we created it primarily to be able to focus on the wrongly answered samples.
Figure 4-5 QA system interface with attention highlight over candidate answers
The candidate answers are shown with blue highlights according to their confidence values: the higher the confidence, the darker the answer. The candidate with the highest confidence value is chosen as the predicted answer. We developed the system to show the attention spread of the candidate answers in order to understand what needs to be done to improve the system. This led us to realize the importance of including the query attention part as well as multilevel attention in the BiDAF model, as described in the first section of this chapter.
CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions
Having done a thorough analysis of the current methods and the state of the art in Machine Comprehension and the QA task, and having developed systems on them, we have achieved a strong sense of what needs to be done to further improve QA models. After observing the wrong samples, we can see that the system is still unable to encode meaning and is picking answers based on statistical patterns of answer occurrence in the training examples.
Going by the state-of-the-art models and the ongoing research literature, it is easy to conclude that more features need to be embedded to encode the meaning of words, phrases, and sentences. The paper Reinforced Mnemonic Reader for Machine Comprehension [6] encoded POS and NER tags of words along with their word and character embeddings, which gave better results.
Figure 5-1 An English language semantic parse tree [26]
We have developed a method to encode the syntax parse tree of a sentence that encodes not only the POS tags but also the relation of the word within the phrase and the relation of the phrase within the whole sentence, in a hierarchical manner.
Finally, data augmentation is another way to get better results. One definite way to reduce the errors would be to include in the training data samples similar to those the system falters on in the dev set. One could generate examples similar to the failure cases and include them in the training set for better prediction. Another approach would be to train on similar, bigger datasets. Our models were trained on the SQuAD dataset, but other datasets address a similar question answering task, such as TriviaQA. We could augment the training set with TriviaQA along with SQuAD to have a more robust system that generalizes better and thus has higher accuracy in predicting answer spans.
Conclusion
In this work we explored the most fundamental techniques that have shaped the current state of the art. We then proposed a minor architectural improvement over an existing model. Furthermore, we developed two applications that use the base model: first, we discussed how a chatbot application can be built using the QA system, and second, we created a web interface where the model can be used with any passage and question. This interface also shows the attention spread over the candidate answers. While our effort to push the state of the art forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will pay high dividends for society at large.
LIST OF REFERENCES
[1] Danqi Chen From Reading Comprehension To Open-Domain Question Answering 2018 YouTube Accessed March 16 2018 https://www.youtube.com/watch?v=1RN88O9C13U
[2] Lehnert Wendy G The Process of Question Answering No RR-88 Yale Univ New Haven Conn Dept of computer science 1977
[3] Joshi Mandar Eunsol Choi Daniel S Weld and Luke Zettlemoyer Triviaqa A large scale distantly supervised challenge dataset for reading comprehension arXiv preprint arXiv170503551 (2017)
[4] Rajpurkar Pranav Jian Zhang Konstantin Lopyrev and Percy Liang Squad 100000+ questions for machine comprehension of text arXiv preprint arXiv160605250 (2016)
[5] The Stanford Question Answering Dataset 2018 rajpurkar.github.io Accessed March 16 2018 https://rajpurkar.github.io/SQuAD-explorer
[6] Hu Minghao Yuxing Peng and Xipeng Qiu Reinforced mnemonic reader for machine comprehension CoRR abs170502798 (2017)
[7] Schalkoff Robert J Artificial neural networks Vol 1 New York McGraw-Hill 1997
[8] Me For The AI and Neetesh Mehrotra 2018 The Connect Between Deep Learning And AI - Open Source For You Open Source For You Accessed March 16 2018 httpopensourceforucom201801connect-deep-learning-ai
[9] Krizhevsky Alex Ilya Sutskever and Geoffrey E Hinton Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems pp 1097-1105 2012
[10] An Intuitive Explanation Of Convolutional Neural Networks 2016 The Data Science Blog Accessed March 16 2018 httpsujjwalkarnme20160811intuitive-explanation-convnets
[11] Williams Ronald J and David Zipser A learning algorithm for continually running fully recurrent neural networks Neural computation 1 no 2 (1989) 270-280
[12] Hochreiter Sepp and Juumlrgen Schmidhuber Long short-term memory Neural computation 9 no 8 (1997) 1735-1780
[13] Chung Junyoung Caglar Gulcehre KyungHyun Cho and Yoshua Bengio Empirical evaluation of gated recurrent neural networks on sequence modeling arXiv preprint arXiv14123555 (2014)
[14] Mikolov Tomas Wen-tau Yih and Geoffrey Zweig Linguistic regularities in continuous space word representations In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics Human Language Technologies pp 746-751 2013
[15] Pennington Jeffrey Richard Socher and Christopher Manning Glove Global vectors for word representation In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) pp 1532-1543 2014
[16] Word2vec Tutorial - The Skip-Gram Model middot Chris Mccormick 2016 MccormickmlCom Accessed March 16 2018 httpmccormickmlcom20160419word2vec-tutorial-the-skip-gram-model
[17] Group 2015 [Paper Introduction] Bilingual Word Representations With Monolingual hellip SlideshareNet Accessed March 16 2018 httpswwwslidesharenetnaist-mtstudybilingual-word-representations-with-monolingual-quality-in-mind
[18] Attention Mechanism 2016 BlogHeuritechCom Accessed March 16 2018 httpsblogheuritechcom20160120attention-mechanism
[19] Weston Jason Sumit Chopra and Antoine Bordes 2014 Memory Networks arXiv preprint arXiv14103916v11 Accessed March 16 2018 httpsarxivorgabs14103916
[20] Wang Shuohang and Jing Jiang Machine comprehension using match-lstm and answer pointer arXiv preprint arXiv160807905 (2016)
[21] Wang Shuohang and Jing Jiang Learning natural language inference with LSTM arXiv preprint arXiv151208849 (2015)
[22] Vinyals Oriol Meire Fortunato and Navdeep Jaitly Pointer networks In Advances in Neural Information Processing Systems pp 2692-2700 2015
[23] Wang Wenhui Nan Yang Furu Wei Baobao Chang and Ming Zhou Gated self-matching networks for reading comprehension and question answering In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1 Long Papers) vol 1 pp 189-198 2017
[24] Seo Minjoon Aniruddha Kembhavi Ali Farhadi and Hannaneh Hajishirzi Bidirectional attention flow for machine comprehension arXiv preprint arXiv161101603 (2016)
[25] Clark Christopher and Matt Gardner Simple and effective multi-paragraph reading comprehension arXiv preprint arXiv171010723 (2017)
[26] 8 Analyzing Sentence Structure 2018 NltkOrg Accessed March 16 2018 httpwwwnltkorgbookch08html
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata, India, and had a strong interest in computers from an early age. After high school, he did his BSc in computer science from Ramakrishna Mission Residential College, Narendrapur, followed by an MSc in computer science from St. Xavier's College, Kolkata. He had a strong intuition and interest for human-like learning systems and wanted to work in this area. He started working at TCS Innovation Labs, Pune, on applications of Natural Language Processing in education. As he was deeply passionate about learning systems that mimic the human brain and learn like a human child does, he became increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science at the University of Florida, Gainesville. His academic interests have been focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.
Just as we human beings are tested for our ability of language understanding with questions, we should ask machines similar questions about what they have just read. The performance of the system on such a question answering task lets us evaluate how much the machine is able to reason about what it just read [1]. Reading comprehension has been a topic of Natural Language Understanding since the 1970s. In 1977, Wendy Lehnert said in her doctoral thesis: "Only when we can ask a program to answer questions about what it reads will we be able to begin to access that program's comprehension" [2]
Figure 1-1 The task of Question Answering
To achieve this task, the NLP community has developed various datasets such as CNN/Daily Mail, WebQuestions, SQuAD, and TriviaQA [3]. For our purpose we chose SQuAD, which stands for Stanford Question Answering Dataset [4]. SQuAD consists of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. With 100,000+ question-answer pairs on 500+ articles, SQuAD is significantly larger than previous reading comprehension datasets [5]
An example from the SQuAD dataset is as follows
Passage: Tesla later approached Morgan to ask for more funds to build a more powerful transmitter. When asked where all the money had gone, Tesla responded by saying that he was affected by the Panic of 1901, which he (Morgan) had caused. Morgan was shocked by the reminder of his part in the stock market crash and by Tesla's breach of contract by asking for more funds. Tesla wrote another plea to Morgan, but it was also fruitless. Morgan still owed Tesla money on the original agreement, and Tesla had been facing foreclosure even before construction of the tower began.
Question: On what did Tesla blame for the loss of the initial money?
Answer: Panic of 1901
As we started exploring the QA task, we faced several challenges. Some of them we could solve with the help of other research, and some still persist in the domain:
• Out of Vocabulary words
• Multi-sentence reasoning may be required
• There may exist several candidate answers
• Optimizing the Exact Match (EM) metric may fail when the answer boundary is fuzzy or too long, such as the answer to a "why" query [6]
• "One-hop" prediction may fail to fully understand the query [6]
• Failure to fully capture the long-distance contextual interaction between parts of the context when using only an LSTM/GRU
• Current models are unable to capture the semantics of the passage
In the upcoming chapters, we will first briefly review the basics necessary for understanding the models, then delve into the fundamental models that have shaped the current state of the art, then discuss our contributions in terms of architecture and applications, and finally conclude with future directions.
CHAPTER 2 THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
To build a question answering system, one needs to be familiar with fundamental deep learning models such as Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), etc. In this chapter we will give an overview of these techniques and see how they all connect in building a question answering system.
Neural Networks
What makes Deep Learning so intriguing is that it closely resembles, or at least draws inspiration from, the working of the mammalian brain. The same can be said for Artificial Neural Networks [7], which consist of a system of interconnected units called 'neurons' that take input from similar units and produce a single output.
Figure 2-1 Simple and Deep Learning Neural Networks [8]
The connection from one neuron to another is weighted, and based on the input data the network tunes these weights to produce a certain output for a given input. This is the learning process, achieved through backpropagation, a method of propagating the error from the output layer back to the previous layers.
Convolutional Neural Network
The first wave of deep learning's success was brought by Convolutional Neural Networks (CNN) [9], the technique used by the winning team of the ImageNet competition in 2012. CNNs are deep artificial neural networks (ANN) that can be used to classify images, cluster them by similarity, and perform object recognition within scenes. They can be used to detect and identify faces, people, signs, or any other visual data.
Figure 2-2 Convolutional Neural Network Architecture [10]
There are primarily four operations in a standard CNN model (as shown in Figure 2-2):
1 Convolution - The primary purpose of convolution in the ConvNet above is to extract features from the input image. The spatial relationships between pixels, i.e., the image features, are preserved and learned by the convolution using small squares of input data.
2 Non-Linearity (ReLU) - Rectified Linear Unit (ReLU) is a non-linear operation that carries out an element-wise operation on each pixel, replacing the negative values in the feature map with zero.
3 Pooling or Sub-Sampling - Spatial pooling reduces the dimensionality of each feature map while retaining the most important information. For max pooling, the largest value in each square window is kept and the rest are dropped. Other types of pooling are average, sum, etc.
4 Classification (Fully Connected Layer) - The fully connected layer is a traditional Multi-Layer Perceptron, as described before, that uses a softmax activation function in the output layer. The high-level features of the image are encoded by the convolutional and pooling layers and then fed to the fully connected layer, which uses these features to classify the input image into various classes based on the training dataset [10]
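The first three operations above can be sketched in a few lines of NumPy. This is an illustrative toy (a single channel and one hand-picked 2x2 kernel), not a trained network:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution (cross-correlation): slide the kernel over the
    image and take dot products over each small square of input data."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kH, j:j+kW] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0)          # replace negative values with zero

def max_pool(x, size=2):
    """Keep only the largest value in each size x size window."""
    H, W = x.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 "image"
kernel = np.array([[1., 0.], [0., -1.]])           # toy diagonal filter
feat = max_pool(relu(conv2d(image, kernel)))
```

The resulting feature map would then be flattened and fed to the fully connected layer for classification.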
When a new image is fed into the CNN model, all the above-mentioned steps are carried out (forward propagation) and a probability distribution is obtained over the set of output classes. With a large enough training dataset, the network will learn and generalize well enough to classify new images into their correct classes.
Recurrent Neural Networks (RNN)
Whenever we want to predict or encode sequential data RNN [11] is our go to
method RNNs perform the same task for every element of a sequence where the
output of each element depends on previous computations thus the recurrence In
practice RNNs are unable to retain long-term dependencies and can look back only a
few steps because of the vanishing gradient problem
h_t = tanh(W x_t + U h_(t-1)) (2-1)
A solution to the dependency problem is to use gated cells such as the LSTM [12] or GRU [13]. These cells pass important information on to the next cells while ignoring unimportant information. The gated units in a GRU block are:
• Update Gate - Computed from the current input and the previous hidden state:
z_t = σ(W_z x_t + U_z h_(t-1)) (2-2)
• Reset Gate - Calculated similarly but with different weights:
r_t = σ(W_r x_t + U_r h_(t-1)) (2-3)
• New memory content:
h̃_t = tanh(W x_t + r_t ⊙ U h_(t-1)) (2-4)
If the reset gate unit is ~0, the previous memory is ignored and only the new information is kept.
The final memory at the current time step combines the previous and current time steps:
h_t = (1 − z_t) ⊙ h_(t-1) + z_t ⊙ h̃_t (2-5)
While the GRU is computationally efficient, the LSTM is the more general case, with three gates as follows:
• Input Gate - What new information to add to the current cell state
• Forget Gate - How much information from previous states should be kept
• Output Gate - How much information should be sent to the next states
Just as in the GRU, the current cell state is a sum of the previous cell state, weighted by the forget gate, and the new value, weighted by the input gate. Based on the cell state, the output gate regulates the final output.
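A single GRU step following Equations 2-2 to 2-5 can be sketched in NumPy; the weight matrices here are random placeholders rather than learned parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, params):
    """One GRU step (Eqs. 2-2 to 2-5)."""
    Wz, Uz, Wr, Ur, W, U = params
    z = sigmoid(Wz @ x + Uz @ h_prev)            # update gate (2-2)
    r = sigmoid(Wr @ x + Ur @ h_prev)            # reset gate (2-3)
    h_tilde = np.tanh(W @ x + r * (U @ h_prev))  # new memory content (2-4)
    return (1 - z) * h_prev + z * h_tilde        # final memory (2-5)

rng = np.random.default_rng(0)
d_in, d_h = 4, 3
params = [rng.standard_normal((d_h, d_in)), rng.standard_normal((d_h, d_h)),
          rng.standard_normal((d_h, d_in)), rng.standard_normal((d_h, d_h)),
          rng.standard_normal((d_h, d_in)), rng.standard_normal((d_h, d_h))]
h = np.zeros(d_h)
for x in rng.standard_normal((5, d_in)):   # run over a 5-step toy sequence
    h = gru_step(x, h, params)
```

Because the final state is a gated blend of the previous state and a tanh-bounded candidate, the hidden state stays bounded as the sequence grows.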
Word Embedding
Computations and gradients can be applied to numbers, not to words or letters, so we first need to convert words into corresponding numerical representations before feeding them into a deep learning model. In general there are two types of word embedding: frequency based (which comprises count vectors, tf-idf, and co-occurrence vectors) and prediction based. With frequency-based embeddings, the order of the words is not preserved; they work as a bag-of-words model. With prediction-based models, the order or locality of words is taken into consideration to generate the numerical representation of each word. Within the prediction-based category there are two fundamental techniques, Continuous Bag of Words (CBOW) and the Skip-Gram model, which form the basis for word2vec [14] and GloVe [15].
The basic intuition behind word2vec is that if two different words have very similar "contexts" (that is, the words likely to appear around them), then the model will produce similar vectors for those words. Conversely, if two word vectors are similar, then the network will produce similar context predictions for those two words. For example, synonyms like "intelligent" and "smart" have very similar contexts, and related words like "engine" and "transmission" probably have similar contexts as well [16]. Plotting the word vectors learned by word2vec over a large corpus, we can find some very interesting relationships between words.
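As a toy illustration of "similar contexts produce similar vectors", here is cosine similarity over hand-made 3-dimensional vectors; real word2vec embeddings are typically 100 to 300 dimensional and learned from a corpus, so these values are purely illustrative:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: 1 for parallel vectors, ~0 for unrelated ones."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hand-made toy "embeddings" (not trained vectors).
vec = {
    "intelligent": np.array([0.90, 0.80, 0.10]),
    "smart":       np.array([0.85, 0.75, 0.20]),
    "engine":      np.array([0.10, 0.20, 0.90]),
}

sim_syn = cosine(vec["intelligent"], vec["smart"])     # near-synonyms
sim_unrel = cosine(vec["intelligent"], vec["engine"])  # unrelated words
```

In a trained embedding space, the synonym pair scores close to 1 while the unrelated pair scores much lower, which is what the toy numbers reproduce.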
Figure 2-3 Semantic relation between words in vector space [17]
Attention Mechanism
We humans put our attention on things that are important or relevant in a context. For example, when asked a question about a passage, we try to find the part of the passage most relevant to the question and then reason from our understanding of that part. The same idea applies to the attention mechanism in Deep Learning: it is used to identify the specific parts of a given context to which the current question is relevant.
Formally put, the technique takes n arguments y_1, ..., y_n (in our case the passage words) and a question representation q. It returns a vector z which is a "summary" of the y_i, focusing on the information linked to the question q. More formally, it returns a weighted arithmetic mean of the y_i, where the weights are chosen according to the relevance of each y_i given the context c [18]
Figure 2-4 Attention Mechanism flow [18]
Memory Networks
While Convolutional Neural Networks and Recurrent Neural Networks do capture how we form our visual and sequential memories, their memory (encoded by hidden states and weights) is typically too small and not compartmentalized enough to accurately remember facts from the past (knowledge is compressed into dense vectors) [19].
Deep Learning needed a methodology that preserves memories as they are, so that they are not lost in generalization and recalling exact words or sequences of events remains possible, something computers are already good at. This effort led to Memory Networks [19], published at ICLR 2015 by Facebook AI Research.
The paper provides a basic framework to store, augment, and retrieve memories while working seamlessly with a Recurrent Neural Network architecture. The memory network consists of a memory m (an array of objects indexed by m_i) and four (potentially learned) components I, G, O, and R, as follows:
I (input feature map) - converts the incoming input to the internal feature representation, either a sparse or dense feature vector like that from word2vec or GloVe.
G (generalization) - updates old memories given the new input. They call this generalization because the network has an opportunity to compress and generalize its memories at this stage for some intended future use.
O (output feature map) - produces a new output (in the feature representation space) given the new input and the current memory state. This component is responsible for performing inference; in a question answering system, it selects the candidate sentences (which might contain the answer) from the story (conversation) so far.
R (response) - converts the output into the desired response format, for example a textual response or an action. In the QA system described, this component finds the desired answer and then converts it from the feature representation to the actual word.
This is a fully supervised model, meaning all the candidate sentences from which the answer can be found are marked during the training phase; this approach can also be termed 'hard attention'.
The authors tested the QA system on various literature, including The Lord of the Rings.
Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]
CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART
There has been rapid progress since the release of the SQuAD dataset, and currently the best ensemble models are close to human-level accuracy in machine comprehension. This is due to various ingenious methods that solve some of the problems of previous approaches: Out of Vocabulary tokens were handled using character embeddings; long-term dependencies within the context passage were addressed using self-attention; and many other techniques were introduced, such as contextualized vectors, history of words, and attention flow. In this section we will look at some of the most important models that were fundamental to the progress of Question Answering.
Machine Comprehension Using Match-LSTM and Answer Pointer
In this paper [20], the authors propose an end-to-end neural architecture for the QA task. The architecture is based on match-LSTM [21], a model they proposed earlier for textual entailment, and Pointer Net [22], a sequence-to-sequence model proposed by Vinyals et al. (2015) to constrain the output tokens to be from the input sequences. The model consists of an LSTM preprocessing layer, a match-LSTM layer, and an Answer Pointer layer.
We are given a piece of text, which we refer to as a passage, and a question related to the passage. The passage is represented by a matrix of size d × P, where P is the length (number of tokens) of the passage and d is the dimensionality of the word embeddings. Similarly, the question is represented by a matrix of size d × Q, where Q is the length of the question. Our goal is to identify a subsequence of the passage as the answer to the question.
Figure 3-1 Match-LSTM Model Architecture [20]
LSTM Preprocessing Layer: They use a standard one-directional LSTM (Hochreiter & Schmidhuber, 1997) to process the passage and the question separately, obtaining hidden representations

H^p = LSTM(P),  H^q = LSTM(Q)

where H^p ∈ R^(l×P) and H^q ∈ R^(l×Q), with l the hidden dimension.
Match-LSTM Layer: They applied the match-LSTM model proposed for textual entailment to their machine comprehension problem by treating the question as a premise and the passage as a hypothesis. The match-LSTM sequentially goes through the passage. At position i of the passage, it first uses the standard word-by-word attention mechanism to obtain the attention weight vector α_i as follows:

G_i = tanh(W^q H^q + (W^p h_i^p + W^r h_{i-1}^r + b^p) ⊗ e_Q)
α_i = softmax(w^T G_i + b ⊗ e_Q)     (3-1)

where W^q, W^p, W^r ∈ R^(l×l) and b^p, w ∈ R^l, b ∈ R are parameters to be learned.
Answer Pointer Layer: The Answer Pointer (Ans-Ptr) layer is motivated by the Pointer Net introduced by Vinyals et al. (2015) [22]. The boundary model produces only the start token and the end token of the answer, and all the tokens between these two in the original passage are then considered to be the answer.
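At inference time the boundary model reduces to searching for the span whose start and end probabilities have the largest product. A minimal sketch of this idea (the function name, span-length cap, and toy probabilities are our own illustration, not from the paper):

```python
import numpy as np

def best_span(p_start, p_end, max_len=15):
    """Return (s, e) maximizing p_start[s] * p_end[e] with s <= e < s + max_len."""
    best, best_score = (0, 0), -1.0
    for s in range(len(p_start)):
        for e in range(s, min(s + max_len, len(p_end))):
            score = p_start[s] * p_end[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

# Toy distributions over a 6-token passage: the span (1, 2) should win.
p_start = np.array([0.05, 0.70, 0.10, 0.05, 0.05, 0.05])
p_end   = np.array([0.05, 0.10, 0.60, 0.10, 0.10, 0.05])
print(best_span(p_start, p_end))  # (1, 2)
```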
When this paper was released in November 2016, the match-LSTM method was the state of the art in Question Answering and was at the top of the leaderboard for the SQuAD dataset.
R-NET: Machine Reading Comprehension with Self-Matching Networks
In this model [23], the question and passage are first processed separately by a bidirectional recurrent network (Mikolov et al., 2010). They then match the question and passage with gated attention-based recurrent networks, obtaining a question-aware representation of the passage. On top of that, they apply self-matching attention to aggregate evidence from the whole passage and refine the passage representation, which is then fed into the output layer to predict the boundary of the answer span.
Question and passage encoding: First, the words are converted to their respective word-level and character-level embeddings. The character-level embeddings are generated by taking the final hidden states of a bi-directional recurrent neural network (RNN) applied to the embeddings of the characters in the token. Such character-level embeddings have been shown to help deal with out-of-vocabulary (OOV) tokens.
They then use a bi-directional RNN to produce new representations u_t^Q and u_t^P of all words in the question and passage respectively.
Figure 3-2 The task of Question Answering [23]
Gated Attention-based Recurrent Networks: They use a variant of attention-based recurrent networks with an additional gate to determine the importance of passage information with respect to a question. Unlike the gates in an LSTM or GRU, the additional gate is based on the current passage word and its attention-pooling vector of the question, which focuses on the relation between the question and the current passage word. The gate effectively models the phenomenon that, in reading comprehension and question answering, only parts of the passage are relevant to the question; the gated representation is utilized in subsequent calculations.
Self-Matching Attention: The question-aware passage representation generated in the previous step highlights the important parts of the passage. One problem with such a representation is that it has very limited knowledge of context: an answer candidate is often oblivious to important cues in the passage outside its surrounding window. To address this problem, the authors propose directly matching the question-aware passage representation against itself. It dynamically collects evidence from the whole passage for each word in the passage and encodes the evidence relevant to the current passage word, together with its matching question information, into the passage representation.
Output Layer They use the same method as Wang amp Jiang (2016b) and use
pointer networks (Vinyals et al 2015) to predict the start and end position of the
answer In addition they use an attention-pooling over the question representation to
generate the initial hidden vector for the pointer network [23]
When the R-NET model first appeared on the leaderboard in March 2017, it was at the top with a 72.3 Exact Match and 80.7 F1 score.
Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
BiDAF [24] is a hierarchical multi-stage architecture for modeling representations of the context paragraph at different levels of granularity. BiDAF includes character-level, word-level, and contextual embeddings, and uses bi-directional attention flow to obtain a query-aware context representation. Their attention layer is not used to summarize the context paragraph into a fixed-size vector. Instead, the attention is computed for every time step, and the attended vector at each time step, along with the representations from previous layers, can flow through to the subsequent modeling layer. This reduces the information loss caused by early summarization [24].
Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]
Their machine comprehension model is a hierarchical multi-stage process and
consists of six layers
1 Character Embedding Layer maps each word to a vector space using character-level CNNs
2 Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model
3 Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embedding of the words. These first three layers are applied to both the query and context. They use an LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words, placing an LSTM in both directions and concatenating the outputs of the two LSTMs
4 Attention Flow Layer couples the query and context vectors and produces a set of query aware feature vectors for each word in the context
5 Modeling Layer employs a Recurrent Neural Network to scan the context. The input to the modeling layer is G, which encodes the query-aware representations of context words. The output of the modeling layer captures the interaction among the context words conditioned on the query. They use two layers of bi-directional LSTM with an output size of d for each direction. Hence a matrix M ∈ R^(2d×T) is obtained, which is passed on to the output layer to predict the answer
6 Output Layer provides an answer to the query. They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices under the predicted distributions, averaged over all examples [25]
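The training objective in layer 6 can be made concrete. A minimal sketch, assuming toy predicted distributions (the function name and data are ours, not the authors' code):

```python
import numpy as np

def span_loss(p_start, p_end, true_start, true_end):
    """Average over examples of -log p_start[y1] - log p_end[y2]:
    the sum of negative log probabilities of the true start and end indices."""
    losses = [-np.log(ps[y1]) - np.log(pe[y2])
              for ps, pe, y1, y2 in zip(p_start, p_end, true_start, true_end)]
    return float(np.mean(losses))

# Loss is zero when the model puts all its mass on the true indices.
print(span_loss([[1.0, 0.0]], [[0.0, 1.0]], [0], [1]))
```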
In a further variation of their above work, they add a self-attention layer after the bi-attention layer to further improve the results. The architecture of the model is shown below.
Figure 3-3 The task of Question Answering [25]
Summary
In this chapter we reviewed the methods that are fundamental to the state of the art in Machine Comprehension and the task of Question Answering. We have come close to human-level accuracy, and this is due to incremental developments over previous models. As we saw, out-of-vocabulary (OOV) tokens were handled using character embeddings, long-term dependencies within the context passage were addressed using self-attention, and many other techniques such as contextualized vectors and attention flow were employed to get better results. In the next chapter we will see how we can build on these models and develop them further.
CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING
Although the advancements in Question Answering systems since the release of the SQuAD dataset have been impressive and the results are getting close to human-level accuracy, we are far from a fool-proof system. The models still make mistakes that would be obvious to a human. For example:
Passage: The Panthers used the San Jose State practice facility and stayed at the San Jose Marriott. The Broncos practiced at Florida State Facility and stayed at the Santa Clara Marriott.
Question: At what university's facility did the Panthers practice?
Actual Answer: San Jose State
Predicted Answer: Florida State Facility
To find out what leads to such wrong predictions, we wanted to see the attention weights associated with this example. We plotted the passage-question heat map, a 2D matrix where the intensity of each cell signifies the similarity between a passage word and a question word. For the above example we found that while certain words of the question are given high weightage, other parts are not. The words 'At', 'facility', and 'practice' are given high attention, but 'Panthers' does not receive high attention. If it had, the system would have predicted 'San Jose State' as the right answer. To solve this issue we analyzed the base BiDAF model and proposed adding two things:
1 Bi-Attention and Self-Attention over Query
2 Second level attention over the output of (Bi-Attention + Self-Attention) from both Context and Query
Multi-Attention BiDAF Model
Our comprehension model is a hierarchical multi-stage process and consists of
the following layers
1 Embedding Just as in other models, we embed words using pretrained word vectors. We also embed the characters in each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to obtain character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training
2 Pre-Process A shared bi-directional GRU (Cho et al 2014) is used to map the question and passage embeddings to context-aware embeddings
3 Context Attention The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al 2016) is used to build a query-aware context representation. Let h_i be the vector for context word i, q_j be the vector for question word j, and n_q and n_c be the lengths of the question and context respectively. We compute attention between context word i and question word j as

a_ij = w_1 · h_i + w_2 · q_j + w_3 · (h_i ⊙ q_j)     (4-1)
where w_1, w_2, and w_3 are learned vectors and ⊙ is element-wise multiplication. We then compute an attended vector c_i for each context token as

p_ij = exp(a_ij) / Σ_j exp(a_ij),   c_i = Σ_j p_ij q_j     (4-2)
We also compute a query-to-context vector q_c:

m_i = max_j a_ij,   p_i = exp(m_i) / Σ_i exp(m_i),   q_c = Σ_i p_i h_i     (4-3)
The final vector computed for each token is built by concatenating h_i, c_i, h_i ⊙ c_i, and q_c ⊙ h_i. In our model we subsequently pass the result through a linear layer with ReLU activations
4 Context Self-Attention Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU. Then we apply the same attention mechanism, only now between the passage and itself; in this case we do not use query-to-context attention, and we set a_ij = −∞ if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so its output is additionally summed with its input
5 Query Attention For this part we proceed the same way as in the context attention layer, but calculate the weighted sum of the context words for each query word, so the output length equals the number of query words. We then calculate context-to-query attention, similar to the query-to-context attention in the context attention layer
Figure 4-1 The modified BiDAF model with multilevel attention
6 Query Self-Attention This part is done the same way as the context self-attention layer but from the output of the Query Attention layer
7 Context Query Bi-Attention + Self-Attention The outputs of the Context Self-Attention and the Query Self-Attention layers are taken as input, and the same process of bi-attention and self-attention is applied to these inputs
8 Prediction In the last layer of our model a bidirectional GRU is applied followed by a linear layer that computes answer start scores for each token The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores The softmax operation is applied to the start and end scores to produce start and end probabilities and we optimize the negative log likelihood of selecting correct start and end tokens
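The attention computation of equations (4-1) to (4-3) can be sketched in NumPy. The weight vectors below are random stand-ins for the learned parameters w_1, w_2, w_3, and the shapes are a toy illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bi_attention(H, Q, w1, w2, w3):
    """H: (n_c, d) context vectors; Q: (n_q, d) question vectors; w1, w2, w3: (d,).
    Computes a_ij = w1.h_i + w2.q_j + w3.(h_i * q_j), the attended vectors c_i,
    and the query-to-context vector q_c."""
    A = (H @ w1)[:, None] + (Q @ w2)[None, :] + (H * w3) @ Q.T  # (n_c, n_q) scores
    C = softmax(A, axis=1) @ Q                 # context-to-query attended vectors
    q_c = softmax(A.max(axis=1), axis=0) @ H   # single query-to-context vector
    return C, q_c

rng = np.random.default_rng(0)
H, Q = rng.normal(size=(5, 4)), rng.normal(size=(3, 4))
w1, w2, w3 = rng.normal(size=4), rng.normal(size=4), rng.normal(size=4)
C, q_c = bi_attention(H, Q, w1, w2, w3)
print(C.shape, q_c.shape)  # (5, 4) (4,)
```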
Having made this modification, we were able to fix the wrong example we started with: the multilevel attention model gives the correct output "San Jose State". We also achieved slightly better scores than the original model, with an F1 score of 85.44 on the SQuAD dev set.
Chatbot Design Using a QA System
Designing a perfect chatbot that passes the Turing test is a fundamental goal of Artificial Intelligence. Although we are many orders of magnitude away from achieving such a goal, domain-specific tasks can be solved with chatbots built from current technology. We started with a similar objective in mind, i.e., to design a domain-specific chatbot and then generalize to other areas once the first domain-specific objective is achieved robustly. This led us to the fundamental problem of Machine Comprehension and subsequently to the task of Question Answering. Having achieved some degree of success with QA systems, we looked back to see whether we could apply our newly acquired knowledge to the task of designing chatbots.
Chatbots made with today's technologies mostly rely on handcrafted techniques such as template matching, which requires anticipating all possible ways a user may articulate his requirements and a conversation may unfold. This requires many man-hours to design a domain-specific system, and the result is still very error prone. In this section we propose a general chatbot design that makes designing a domain-specific chatbot easy and robust at the same time.
Every domain-specific chatbot needs to obtain a set of information from the user and show some results based on the user-specific information obtained. Traditional chatbots use template matching and keyword lookup to determine whether the user has provided the required information. Our idea is to use the Question Answering system in the backend to extract the required information from whatever the user has typed up to this point in the conversation. The information to be extracted can be posed as a set of questions, and the answers obtained from those questions can be used as the parameters to supply the relevant information to the user.
We chose flight reservation as our chatbot domain. Our goal was to extract the required information from the user to be able to show the available flights as per the user's requirements.
Figure 4-2 Flight reservation chatbotrsquos chat window
For a flight reservation task, the booking agent needs to know at minimum the origin city, destination city, and date of travel to be able to show the available flights. Optional information includes the number of tickets, passenger's name, one way or round trip, etc.
The minimal conversation with the user through the chat window would be as shown above. We had a platform called OneTask on which we wanted to implement our chatbot. The chat interface within the OneTask system looks as follows:
Figure 4-3 Chatbot within OneTask system
The working of the chat system is as follows:
1 Initiation The user opens the chat window, which starts a session with the chatbot. The chatbot asks 'How may I help you'
2 User Reply The user may reply with none of the required information for flight booking, or may reply with several pieces of information in the same message
3 User Reply Parsing The conversation up to this point is treated as a passage and the internal questions are run on this passage. The four questions that are run are:
Where do you want to go?
From where do you want to leave?
When do you want to depart?
4 Parsed responses from QA model After running the questions on the QA system with the given conversation, answers are obtained for all of them along with their corresponding confidence values. Since answers will be produced even if the required question has not actually been answered up to this point in the conversation, the confidence values play a crucial role in determining the validity of an answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly, a confidence of 2 to 10 signifies that it may have been answered but should be verified with the user for correctness, and any confidence below 2 is discarded
5 Asking remaining questions iteratively After the parsing, it is checked whether any of the required questions are still unanswered. If so, the chatbot asks the remaining question, and the process from steps 3 to 4 is carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request
Figure 4-4 The Flow diagram of the Flight booking Chatbot system
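The confidence triage in step 4 can be sketched as a small routine. The slot names and dictionary shapes below are our own illustration; the thresholds (10 and 2) are the ones stated above:

```python
# Required slots and the internal questions used to fill them (from the text).
REQUIRED_SLOTS = {
    "destination": "Where do you want to go",
    "origin": "From where do you want to leave",
    "date": "When do you want to depart",
}

def triage(answers):
    """answers: {slot: (answer_text, confidence)} returned by the QA model.
    Accept above 10, ask the user to verify between 2 and 10, discard below 2."""
    accepted, verify, unanswered = {}, {}, []
    for slot, question in REQUIRED_SLOTS.items():
        text, conf = answers.get(slot, ("", 0.0))
        if conf > 10:
            accepted[slot] = text
        elif conf >= 2:
            verify[slot] = text
        else:
            unanswered.append(question)
    return accepted, verify, unanswered

acc, ver, ask = triage({"destination": ("Paris", 12.0), "origin": ("NYC", 5.0)})
print(acc, ver, ask)
```

The chatbot keeps asking the questions in `unanswered` until the list is empty, at which point all required slots are filled.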
Online QA System and Attention Visualization
To be able to test various examples, we used the BiDAF [24] model for an online demo. One can either choose from the available examples in the drop-down menu or paste in their own passage and questions. While this is a useful and interesting system for testing the model in a user-friendly way, we created it primarily to be able to focus on the wrong samples.
Figure 4-5 QA system interface with attention highlight over candidate answers
The candidate answers are shown with blue highlights according to their confidence values: the higher the confidence, the darker the answer. The answer with the highest confidence value is chosen as the predicted answer. We developed the system to show the attention spread over the candidate answers in order to understand what needs to be done to improve the system. This led us to realize the importance of including query attention as well as multilevel attention in the BiDAF model, as described in the first section of this chapter.
CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions
Having done a thorough analysis of the current methods and state of the art in Machine Comprehension and the QA task, and having developed systems on top of them, we have a strong sense of what needs to be done to further improve QA models. After observing the wrong samples, we can see that the system is still unable to encode meaning and is picking answers based on statistical patterns of occurrence of answers in the training examples.
Judging by the state-of-the-art models and the ongoing research literature, it is easy to conclude that more features need to be embedded to encode the meaning of words, phrases, and sentences. The Reinforced Mnemonic Reader for Machine Comprehension [6] encoded POS and NER tags of words along with their word and character embeddings, which gave them better results.
Figure 5-1 An English language semantic parse tree [26]
We have developed a method to encode the syntax parse tree of a sentence that encodes not only the POS tags but also the relation of the word within its phrase and the relation of the phrase within the whole sentence, in a hierarchical manner.
Finally, data augmentation is another way to get better results. One definite way to reduce the errors would be to include in the training data samples similar to those the system falters on in the dev set: one could generate examples similar to the failure cases and include them in the training set to obtain better predictions. Another approach would be to train on similar and bigger datasets. Our models were trained on the SQuAD dataset, but there are other datasets that address the same question answering task, such as TriviaQA. We could augment the training set with TriviaQA along with SQuAD to obtain a more robust system that generalizes better and thus has higher accuracy in predicting answer spans.
Conclusion
In this work we explored the most fundamental techniques that have shaped the current state of the art. We then proposed a minor architectural improvement over an existing model. Furthermore, we developed two applications that use the base model: first, we discussed how a chatbot application can be built using the QA system, and second, we created a web interface where the model can be used with any passage and question. This interface also shows the attention spread over the candidate answers. While our effort to push the state of the art forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will pay high dividends for society at large.
LIST OF REFERENCES
[1] Chen, Danqi. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U
[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.
[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension." arXiv preprint arXiv:1705.03551 (2017).
[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ questions for machine comprehension of text." arXiv preprint arXiv:1606.05250 (2016).
[5] "The Stanford Question Answering Dataset." 2018. rajpurkar.github.io. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer
[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced mnemonic reader for machine comprehension." CoRR abs/1705.02798 (2017).
[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.
[8] Mehrotra, Neetesh. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai
[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.
[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets
[11] Williams, Ronald J., and David Zipser. "A learning algorithm for continually running fully recurrent neural networks." Neural Computation 1, no. 2 (1989): 270-280.
[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9, no. 8 (1997): 1735-1780.
[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv preprint arXiv:1412.3555 (2014).
[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic regularities in continuous space word representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.
[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.
[16] "Word2Vec Tutorial - The Skip-Gram Model." 2016. Chris McCormick. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model
[17] "[Paper Introduction] Bilingual Word Representations with Monolingual Quality in Mind." 2015. SlideShare. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind
[18] "Attention Mechanism." 2016. blog.heuritech.com. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism
[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. 2014. "Memory Networks." arXiv preprint arXiv:1410.3916. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916
[20] Wang, Shuohang, and Jing Jiang. "Machine comprehension using match-LSTM and answer pointer." arXiv preprint arXiv:1608.07905 (2016).
[21] Wang, Shuohang, and Jing Jiang. "Learning natural language inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).
[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.
[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated self-matching networks for reading comprehension and question answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.
[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional attention flow for machine comprehension." arXiv preprint arXiv:1611.01603 (2016).
[25] Clark, Christopher, and Matt Gardner. "Simple and effective multi-paragraph reading comprehension." arXiv preprint arXiv:1710.10723 (2017).
[26] "8. Analyzing Sentence Structure." 2018. nltk.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata, India, and had a strong interest in computers from an early age. After high school he earned a BSc in computer science from Ramakrishna Mission Residential College, Narendrapur, followed by an MSc in computer science from St. Xavier's College, Kolkata. With a strong intuition and interest in human-like learning systems, he wanted to work in this area. He started working at TCS Innovation Labs, Pune, on applications of Natural Language Processing in education. Deeply passionate about learning systems that mimic the human brain and learn the way a human child does, he became increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science at the University of Florida, Gainesville. His academic interests have been focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.
Passage: Tesla later approached Morgan to ask for more funds to build a more powerful transmitter. When asked where all the money had gone, Tesla responded by saying that he was affected by the Panic of 1901, which he (Morgan) had caused. Morgan was shocked by the reminder of his part in the stock market crash and by Tesla's breach of contract by asking for more funds. Tesla wrote another plea to Morgan, but it was also fruitless. Morgan still owed Tesla money on the original agreement, and Tesla had been facing foreclosure even before construction of the tower began.
Question: On what did Tesla blame for the loss of the initial money?
Answer: Panic of 1901
As we started exploring the QA task, we faced several challenges. Some of them we could solve with the help of other research, and some still remain open in the domain:
• Out-of-vocabulary words
• Multi-sentence reasoning may be required
• There may exist several candidate answers
• Optimizing the Exact Match (EM) metric may fail when the answer boundary is fuzzy or too long, such as the answer to a "why" query [6]
• "One-hop" prediction may fail to fully understand the query [6]
• Using only an LSTM/GRU fails to fully capture the long-distance contextual interaction between parts of the context
• Current models are unable to capture the semantics of the passage
In the upcoming chapters, we will first briefly review the basics necessary for understanding the models; then we will delve into the fundamental models that have shaped the current state of the art; then we will discuss our contributions in terms of architecture and applications; and finally we will conclude with future directions.
CHAPTER 2 THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
To build a question answering system, one needs to be familiar with fundamental deep learning models such as Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM). In this chapter we will give an overview of these techniques and see how they all connect in building a question answering system.
Neural Networks
What makes Deep Learning so intriguing is that it closely resembles the working of the mammalian brain, or at least draws inspiration from it. The same can be said of Artificial Neural Networks [7], which consist of a system of interconnected units called 'neurons' that take input from similar units and produce a single output.
Figure 2-1 Simple and Deep Learning Neural Networks [8]
The connection from one neuron to another is weighted based on the input data, which enables the network to tune itself to produce a certain output for a given input. This is the learning process, achieved through backpropagation: a method of propagating the error from the output layer back to the previous layers.
Convolutional Neural Network
The first wave of deep learning's success was brought by Convolutional Neural Networks (CNN) [9], the technique used by the winning team of the ImageNet competition in 2012. CNNs are deep artificial neural networks that can be used to classify images, cluster them by similarity, and perform object recognition within scenes. They can be used to detect and identify faces, people, signs, or any other visual data.
Figure 2-2 Convolutional Neural Network Architecture [10]
There are primarily four operations in a standard CNN model (as shown in the figure above):
1 Convolution - The primary purpose of convolution is to extract features from the input image. The spatial relationships between pixels, i.e., the image features, are preserved and learned by the convolution using small squares of input data
2 Non-Linearity (ReLU) - The Rectified Linear Unit (ReLU) is a non-linear operation applied element-wise; it replaces the negative values in the feature map with zero
3 Pooling or Sub-Sampling - Spatial pooling reduces the dimensionality of each feature map but retains the most important information. For max pooling, the largest value in the square window is kept and the rest are dropped. Other types of pooling are average, sum, etc.
4 Classification (Fully Connected Layer) - The fully connected layer is a traditional Multi-Layer Perceptron, as described before, that uses a softmax activation function in the output layer. The high-level features of the image are encoded by the convolutional and pooling layers and then fed to the fully connected layer, which uses these features to classify the input image into various classes based on the training dataset [10]
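Operations 2 and 3 are easy to make concrete. A minimal NumPy sketch (the toy feature map is our own; 2x2 max pooling with stride 2):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)  # replace negative values in the feature map with zero

def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 over an (H, W) feature map, H and W even."""
    h, w = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[ 1., -2.,  3.,  0.],
                 [ 0.,  5., -1.,  2.],
                 [-3.,  1.,  4., -4.],
                 [ 2.,  0.,  0.,  1.]])
print(max_pool_2x2(relu(fmap)))  # [[5. 3.] [2. 4.]]
```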
When a new image is fed into the CNN model, all the above-mentioned steps are carried out (forward propagation) and a probability distribution over the set of output classes is obtained. With a large enough training dataset, the network will learn and generalize well enough to classify new images into their correct classes.
Recurrent Neural Networks (RNN)
Whenever we want to predict or encode sequential data, the RNN [11] is our go-to method. RNNs perform the same task for every element of a sequence, with the output for each element depending on previous computations, hence the recurrence. In practice, RNNs are unable to retain long-term dependencies and can look back only a few steps because of the vanishing gradient problem.

h_t = tanh(W_h h_{t-1} + W_x x_t)     (2-1)
A solution to the dependency problem is to use gated cells such as the LSTM [12] or GRU [13]. These cells pass important information on to the next cells while ignoring unimportant information. The gated units in a GRU block are:
• Update Gate – computed based on the current input and hidden state:
z_t = σ(W_z x_t + U_z h_{t-1}) (2-2)
• Reset Gate – calculated similarly but with different weights:
r_t = σ(W_r x_t + U_r h_{t-1}) (2-3)
• New memory content:
h̃_t = tanh(W x_t + U (r_t ∘ h_{t-1})) (2-4)
If the reset gate unit is ~0, then the previous memory is ignored and only the new information is kept.
The final memory at the current time step combines the previous memory and the new memory content:
h_t = z_t ∘ h_{t-1} + (1 − z_t) ∘ h̃_t (2-5)
While the GRU is computationally efficient, the LSTM is a more general case with three gates, as follows:
• Input Gate – what new information to add to the current cell state
• Forget Gate – how much information from previous states to keep
• Output Gate – how much information to send on to the next states
Just like in the GRU, the current cell state is a sum of the previous cell state, weighted by the forget gate, and the new value, weighted by the input gate. Based on the cell state, the output gate regulates the final output.
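A single GRU step following equations (2-2) to (2-5) can be sketched in NumPy. This is a minimal illustration with randomly initialized weights (one common gating convention; implementations differ in which term the update gate weights), not a trained model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, params):
    """One GRU step following equations (2-2)-(2-5)."""
    Wz, Uz, Wr, Ur, W, U = params
    z = sigmoid(Wz @ x_t + Uz @ h_prev)            # update gate (2-2)
    r = sigmoid(Wr @ x_t + Ur @ h_prev)            # reset gate (2-3)
    h_tilde = np.tanh(W @ x_t + U @ (r * h_prev))  # new memory content (2-4)
    return z * h_prev + (1 - z) * h_tilde          # final memory (2-5)

rng = np.random.default_rng(0)
d_in, d_h = 4, 3
# Even indices are input-to-hidden matrices, odd are hidden-to-hidden
params = [rng.standard_normal((d_h, d_in)) if i % 2 == 0
          else rng.standard_normal((d_h, d_h)) for i in range(6)]

h = np.zeros(d_h)
for x in rng.standard_normal((5, d_in)):  # a length-5 input sequence
    h = gru_cell(x, h, params)
```

The same recurrence applied at every position is what lets the cell carry information forward, with the gates deciding what to keep and what to overwrite.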
Word Embedding
Computation or gradients can be applied on numbers and not on words or letters
So first we need to convert words into their corresponding numerical formation before
feeding into a deep learning model In general there are two types of word embedding
Frequency based (which constitutes count vectors tf-idf and co-occurrence vectors)
and Prediction based With frequency based embedding the order of the words are not
preserved and works as a bag of words model Whereas with prediction based model
the order of words or locality of words are taken into consideration to generate the
numerical representation of the word Within this prediction based category there are
two fundamental techniques called Continuous Bag of Words (CBOW) and Skip Gram
Model which forms the basis for word2vec [14] and GloVe [15]
The basic intuition behind word2vec is that if two different words have very similar "contexts" (that is, the words that are likely to appear around them), then the model will produce similar vectors for those words. Conversely, if two word vectors are similar, then the network will produce similar context predictions for the two words. For example, synonyms like "intelligent" and "smart" would have very similar contexts, and related words like "engine" and "transmission" would probably have similar contexts as well [16]. Plotting the word vectors learned by word2vec over a large corpus, we can find some very interesting relationships between words.
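The "locality" idea behind the skip-gram model can be illustrated by how its training pairs are derived from word order. This is a sketch of pair generation only (the training of the vectors themselves is not shown):

```python
def skip_gram_pairs(tokens, window=2):
    # For each center word, emit (center, context) training pairs from
    # the surrounding window -- the "context" that skip-gram predicts
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skip_gram_pairs("the engine drives the transmission".split(), window=1)
```

Words that repeatedly share contexts (here both "engine" and "transmission" appear next to "the") end up with similar vectors after training, which is exactly the intuition described above.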
Figure 2-3 Semantic relation between words in vector space [17]
Attention Mechanism
We as humans put our attention on things that are important or relevant in a context. For example, when asked a question about a passage, we try to find the part of the passage most relevant to the question and then reason from our understanding of that part. The same idea applies to the attention mechanism in deep learning: it is used to identify the specific parts of a given context to which the current question is relevant.
Formally put, the technique takes n arguments y_1, ..., y_n (in our case the passage, where each y_i is the hidden representation h_i of a word) and a question representation, say q. It returns a vector z which is a "summary" of the y_i, focusing on the information linked to the question q. More formally, it returns a weighted arithmetic mean of the y_i, where the weights are chosen according to the relevance of each y_i given the context c [18].
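This weighted arithmetic mean can be sketched in a few lines of NumPy. A dot product is used here as the relevance score for simplicity; real models learn the scoring function:

```python
import numpy as np

def attention_summary(Y, q):
    """Return z, the attention-weighted mean of the rows y_i given q.

    Relevance of each y_i is scored against q, then normalized with a
    softmax so the weights sum to one.
    """
    scores = Y @ q                        # relevance of each y_i to q
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over the n arguments
    return weights @ Y                    # weighted mean of the y_i

Y = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # y_1..y_3
q = np.array([1.0, 0.0])                             # "question" vector
z = attention_summary(Y, q)
```

Rows aligned with q receive larger weights, so the summary z leans toward the relevant parts of the context.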
Figure 2-4 Attention Mechanism flow [18]
Memory Networks
While Convolutional Neural Networks and Recurrent Neural Networks do capture how we form our visual and sequential memories, their memory (encoded by hidden states and weights) is typically too small and not compartmentalized enough to accurately remember facts from the past (knowledge is compressed into dense vectors) [19].
Deep learning needed a methodology that preserves memories as they are, so that they are not lost in generalization and recalling exact words or sequences of events remains possible, something computers are already good at. This effort led to Memory Networks [19], published at ICLR 2015 by Facebook AI Research.
This paper provides a basic framework to store, augment and retrieve memories while working seamlessly with a recurrent neural network architecture. The memory network consists of a memory m (an array of objects indexed by m_i) and four (potentially learned) components I, G, O and R, as follows:
I (input feature map) – converts the incoming input to the internal feature representation, either a sparse or a dense feature vector like those from word2vec or GloVe.
G (generalization) – updates old memories given the new input. The authors call this generalization because the network has an opportunity to compress and generalize its memories at this stage for some intended future use.
O (output feature map) – produces a new output (in the feature representation space) given the new input and the current memory state. This component is responsible for performing inference. In a question answering system, this part selects the candidate sentences (which might contain the answer) from the story (conversation) so far.
R (response) – converts the output into the desired response format, for example a textual response or an action. In the QA system described, this component finds the desired answer and then converts it from the feature representation to the actual word.
This is a fully supervised model, meaning that all the candidate sentences from which the answer can be found are marked during the training phase; this can also be termed 'hard attention'.
The authors tested the QA system on various literature, including The Lord of the Rings.
Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]
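A toy sketch of the four components can make the framework concrete. The bag-of-words feature map and the overlap-based scoring below are hypothetical stand-ins for illustration, not the learned components of the paper:

```python
class MemoryNetwork:
    """Toy sketch of the I, G, O, R components from Weston et al. [19]."""

    def __init__(self):
        self.memory = []                      # the array m of stored objects

    def I(self, text):
        # input feature map: here, a simple bag of words
        return set(text.lower().split())

    def G(self, features):
        # generalization: here, store the memory as-is
        self.memory.append(features)

    def O(self, query_features):
        # output feature map: select the stored memory most relevant
        # to the query (here, by word overlap)
        return max(self.memory, key=lambda m: len(m & query_features))

    def R(self, output_features, query_features):
        # response: surface words from the supporting memory that are
        # not already in the query
        return " ".join(sorted(output_features - query_features))

net = MemoryNetwork()
for fact in ["Bilbo travelled to the cave", "Gollum dropped the ring there"]:
    net.G(net.I(fact))
q = net.I("where is the ring")
answer = net.R(net.O(q), q)
```

Even this crude version shows the division of labor: G stores facts unchanged, O performs the inference step of selecting a supporting memory, and R turns it into a response.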
CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART
There has been rapid progress since the release of the SQuAD dataset, and currently the best ensemble models are close to human-level accuracy in machine comprehension. This is due to various ingenious methods that solve some of the problems of previous approaches: out-of-vocabulary tokens were handled using character embeddings, long-term dependencies within the context passage were addressed using self-attention, and many other techniques such as contextualized vectors, history of words and attention flow were introduced. In this section we will look at some of the most important models that were fundamental to the progress of question answering.
Machine Comprehension Using Match-LSTM and Answer Pointer
In this paper [20] the authors propose an end-to-end neural architecture for the QA task. The architecture is based on match-LSTM [21], a model they previously proposed for textual entailment, and Pointer Net [22], a sequence-to-sequence model proposed by Vinyals et al. (2015) that constrains the output tokens to be from the input sequences. The model consists of an LSTM preprocessing layer, a match-LSTM layer and an Answer Pointer layer.
We are given a piece of text, which we refer to as a passage, and a question related to the passage. The passage is represented by a matrix P of d-dimensional word vectors, where P is the length (number of tokens) of the passage and d is the dimensionality of the word embeddings. Similarly, the question is represented by a matrix Q, where Q is the length of the question. Our goal is to identify a subsequence of the passage as the answer to the question.
Figure 3-1 Match-LSTM Model Architecture [20]
LSTM Preprocessing Layer: They use a standard one-directional LSTM (Hochreiter & Schmidhuber, 1997) to process the passage and the question separately, as shown below.
Match-LSTM Layer: They apply the match-LSTM model proposed for textual entailment to the machine comprehension problem by treating the question as a premise and the passage as a hypothesis. The match-LSTM goes through the passage sequentially. At position i of the passage, it first uses the standard word-by-word attention mechanism to obtain the attention weight vector as follows:
G_i = tanh(W^q H^q + (W^p h_i^p + W^r h_{i-1}^r + b^p) ⊗ e_Q)
α_i = softmax(w^T G_i + b ⊗ e_Q) (3-1)
where W^q, W^p, W^r, b^p, w and b are parameters to be learned
Answer Pointer Layer: The Answer Pointer (Ans-Ptr) layer is motivated by the Pointer Net introduced by Vinyals et al. (2015) [22]. The boundary model produces only the start token and the end token of the answer; all the tokens between these two in the original passage are then considered to be the answer.
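A common way to decode such a boundary model (a sketch of the general idea, not necessarily the paper's exact search) is to pick the start/end pair that maximizes the product of the two predicted probabilities, subject to the start not coming after the end:

```python
import numpy as np

def best_span(p_start, p_end, max_len=15):
    # Pick (start, end) maximizing p_start[s] * p_end[e] subject to
    # s <= e and, optionally, a maximum answer length
    best, best_score = (0, 0), -1.0
    for s in range(len(p_start)):
        for e in range(s, min(len(p_end), s + max_len)):
            score = p_start[s] * p_end[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

p_start = np.array([0.1, 0.6, 0.2, 0.1])
p_end   = np.array([0.1, 0.1, 0.7, 0.1])
span = best_span(p_start, p_end)   # tokens 1..2 form the answer
```

The span constraint is what distinguishes the boundary model from predicting each token independently.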
When this paper was released in November 2016, the match-LSTM method was the state of the art in question answering systems and was at the top of the leaderboard for the SQuAD dataset.
R-NET Matching Reading Comprehension with Self-Matching Networks
In this model [23], the question and passage are first processed separately by a bidirectional recurrent network (Mikolov et al., 2010). They then match the question and passage with gated attention-based recurrent networks, obtaining a question-aware representation of the passage. On top of that, they apply self-matching attention to aggregate evidence from the whole passage and refine the passage representation, which is then fed into the output layer to predict the boundary of the answer span.
Question and Passage Encoding: First, the words are converted to their respective word-level and character-level embeddings. The character-level embeddings are generated by taking the final hidden states of a bi-directional recurrent neural network (RNN) applied to the embeddings of the characters in the token. Such character-level embeddings have been shown to help deal with out-of-vocabulary (OOV) tokens.
They then use a bi-directional RNN to produce new representations u_t^Q and u_t^P of all words in the question and passage respectively.
Figure 3-2 The task of Question Answering [23]
Gated Attention-based Recurrent Networks: They use a variant of attention-based recurrent networks with an additional gate to determine the importance of information in the passage with regard to a question. Different from the gates in the LSTM or GRU, the additional gate is based on the current passage word and its attention-pooling vector of the question, which focuses on the relation between the question and the current passage word. The gate effectively models the phenomenon that, in reading comprehension and question answering, only parts of the passage are relevant to the question, and only the gated representation is utilized in subsequent calculations.
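The additional gate can be sketched as follows. This is a minimal illustration of the idea (one sigmoid gate over the concatenated word and attention-pooling vectors), with illustrative names, shapes and random weights rather than the paper's exact implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_input(u_p, c_t, Wg):
    # u_p: current passage word representation
    # c_t: attention-pooling vector of the question for this word
    # The gate scales each dimension of the concatenated input, so
    # passage content irrelevant to the question is suppressed
    v = np.concatenate([u_p, c_t])
    g = sigmoid(Wg @ v)          # one gate value per input dimension
    return g * v                 # gated input fed to the recurrent cell

rng = np.random.default_rng(0)
d = 4
Wg = rng.standard_normal((2 * d, 2 * d))
v_star = gated_input(rng.standard_normal(d), rng.standard_normal(d), Wg)
```

Because each gate value lies in (0, 1), the gated vector can only shrink components of the input, which is what lets the network down-weight irrelevant passage words.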
Self-Matching Attention: From the previous step, the question-aware passage representation is generated to highlight the important parts of the passage. One problem with such a representation is that it has very limited knowledge of context: an answer candidate is often oblivious to important cues in the passage outside its surrounding window. To address this problem, the authors propose directly matching the question-aware passage representation against itself. It dynamically collects evidence from the whole passage for each word in the passage and encodes the evidence relevant to the current passage word and its matching question information into the passage representation.
Output Layer: They use the same method as Wang & Jiang (2016b) and use pointer networks (Vinyals et al., 2015) to predict the start and end positions of the answer. In addition, they use attention-pooling over the question representation to generate the initial hidden vector for the pointer network [23].
When the R-NET model first appeared on the leaderboard in March 2017, it was at the top with a 72.3 Exact Match and an 80.7 F1 score.
Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
BiDAF [24] is a hierarchical multi-stage architecture for modeling the representations of the context paragraph at different levels of granularity. BiDAF includes character-level, word-level and contextual embeddings, and uses bi-directional attention flow to obtain a query-aware context representation. Their attention layer is not used to summarize the context paragraph into a fixed-size vector. Instead, the attention is computed for every time step, and the attended vector at each time step, along with the representations from previous layers, can flow through to the subsequent modeling layer. This reduces the information loss caused by early summarization [24].
Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]
Their machine comprehension model is a hierarchical multi-stage process and consists of six layers:
1. Character Embedding Layer maps each word to a vector space using character-level CNNs.
2. Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model.
3. Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embedding of the words. These first three layers are applied to both the query and the context. They use an LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words, placing an LSTM in both directions and concatenating the outputs of the two LSTMs.
4. Attention Flow Layer couples the query and context vectors and produces a set of query-aware feature vectors for each word in the context.
5. Modeling Layer employs a recurrent neural network to scan the context. The input to the modeling layer is G, which encodes the query-aware representations of context words. The output of the modeling layer captures the interaction among the context words conditioned on the query. They use two layers of bi-directional LSTM with output size d for each direction; the resulting matrix is passed on to the output layer to predict the answer.
6. Output Layer provides an answer to the query. They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices under the predicted distributions, averaged over all examples [25].
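The training loss of the output layer can be sketched for a single example; p_start and p_end below are toy predicted distributions, not outputs of a real model:

```python
import numpy as np

def span_loss(p_start, p_end, true_start, true_end):
    # Sum of the negative log probabilities of the true start and end
    # indices under the predicted distributions; in training this is
    # averaged over all examples in a batch
    return -(np.log(p_start[true_start]) + np.log(p_end[true_end]))

p_start = np.array([0.7, 0.2, 0.1])
p_end   = np.array([0.1, 0.1, 0.8])
loss = span_loss(p_start, p_end, true_start=0, true_end=2)
```

Minimizing this loss pushes probability mass toward the true start and end tokens of the answer span.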
In a further variation of their above work, they add a self-attention layer after the bi-attention layer to further improve the results. The architecture of the model is shown in the figure below.
Figure 3-4 The task of Question Answering [25]
Summary
In this chapter we reviewed the methods that are fundamental to the state of the art in machine comprehension and the task of question answering. We have come close to human-level accuracy, and this is due to incremental developments over previous models. As we saw, out-of-vocabulary (OOV) tokens were handled using character embeddings, long-term dependencies within the context passage were addressed using self-attention, and many other techniques such as contextualized vectors and attention flow were employed to get better results. In the next chapter we will see how we can build on these models and develop them further.
CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING
Although the advancements in question answering systems since the release of the SQuAD dataset have been impressive and the results are getting close to human-level accuracy, we are far from a fool-proof system. The models still make mistakes that would be obvious to a human. For example:
Passage: The Panthers used the San Jose State practice facility and stayed at the San Jose Marriott. The Broncos practiced at Florida State Facility and stayed at the Santa Clara Marriott.
Question: At what university's facility did the Panthers practice?
Actual Answer: San Jose State
Predicted Answer: Florida State Facility
To find out what was leading to the wrong predictions, we wanted to see the attention weights associated with such an example. We plotted the passage-question heat map, a 2D matrix where the intensity of each cell signifies the similarity between a passage word and a question word. For the above example, we found that while certain words of the question are given high weightage, other parts are not. The words 'At', 'facility' and 'practice' receive high attention, but 'Panthers' does not. If it had received high attention, the system would have predicted 'San Jose State' as the right answer. To solve this issue, we analyzed the base BiDAF model and proposed adding two things:
1. Bi-Attention and Self-Attention over the query
2. A second level of attention over the outputs of (Bi-Attention + Self-Attention) from both the context and the query
Multi-Attention BiDAF Model
Our comprehension model is a hierarchical multi-stage process and consists of
the following layers
1. Embedding: Just as in other models, we embed words using pretrained word vectors. We also embed the characters in each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to get character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training.
2. Pre-Process: A shared bi-directional GRU (Cho et al., 2014) is used to map the question and passage embeddings to context-aware embeddings.
3. Context Attention: The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al., 2016) is used to build a query-aware context representation. Let h_i be the vector for context word i, q_j the vector for question word j, and n_q and n_c the lengths of the question and context respectively. We compute attention between context word i and question word j as
a_ij = w1 · h_i + w2 · q_j + w3 · (h_i ∘ q_j) (4-1)
where w1, w2 and w3 are learned vectors and ∘ is element-wise multiplication. We then compute an attended vector c_i for each context token as
c_i = Σ_j p_ij q_j, where p_ij = exp(a_ij) / Σ_k exp(a_ik) (4-2)
We also compute a query-to-context vector q_c:
q_c = Σ_i p_i h_i, where p_i = exp(m_i) / Σ_k exp(m_k) and m_i = max_j a_ij (4-3)
The final vector computed for each token is built by concatenating h_i, c_i, h_i ∘ c_i and q_c ∘ h_i. In our model, we subsequently pass the result through a linear layer with ReLU activations.
4. Context Self-Attention: Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU. Then we apply the same attention mechanism, only now between the passage and itself; in this case we do not use query-to-context attention, and we set a_ij = −inf if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so its output is additionally summed with its input.
5. Query Attention: For this part we proceed in the same way as for context attention, but calculate the weighted sum of the context words for each query word; thus the output has length equal to the number of query words. Then we calculate context-to-query attention, similar to the query-to-context attention in the context attention layer.
Figure 4-1 The modified BiDAF model with multilevel attention
6. Query Self-Attention: This is done in the same way as the context self-attention layer, but on the output of the Query Attention layer.
7. Context-Query Bi-Attention + Self-Attention: The outputs of the Context Self-Attention and Query Self-Attention layers are taken as input, and the same bi-attention and self-attention process is applied to these inputs.
8. Prediction: In the last layer of our model, a bidirectional GRU is applied, followed by a linear layer that computes answer start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log likelihood of selecting the correct start and end tokens.
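The attention computation of equations (4-1) to (4-3) can be sketched in NumPy. The weight vectors here are random stand-ins for the learned parameters w1, w2 and w3:

```python
import numpy as np

def trilinear_scores(H, Q, w1, w2, w3):
    """a_ij = w1.h_i + w2.q_j + w3.(h_i * q_j), as in equation (4-1)."""
    return (H @ w1)[:, None] + (Q @ w2)[None, :] + (H * w3) @ Q.T

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
n_c, n_q, d = 5, 3, 4
H = rng.standard_normal((n_c, d))     # context word vectors h_i
Q = rng.standard_normal((n_q, d))     # question word vectors q_j
w1, w2, w3 = rng.standard_normal((3, d))

a = trilinear_scores(H, Q, w1, w2, w3)     # (n_c, n_q) similarity matrix
C = softmax(a, axis=1) @ Q                 # attended vectors c_i (4-2)
q_c = softmax(a.max(axis=1), axis=0) @ H   # query-to-context vector (4-3)
```

Note that `(H * w3) @ Q.T` computes w3 · (h_i ∘ q_j) for every pair (i, j) at once, which is why the trilinear similarity can be evaluated without materializing the element-wise products.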
Having made this modification, we were able to solve the wrong example we started with: the multilevel attention model gives the correct output, "San Jose State". We also achieved slightly better scores than the original model, with an F1 score of 85.44 on the SQuAD dev set.
Chatbot Design Using a QA System
Designing a perfect chatbot that passes the Turing test is a fundamental goal of Artificial Intelligence. Although we are many orders of magnitude away from achieving such a goal, domain-specific tasks can be solved with chatbots made from current technology. We started with a similar objective in mind, i.e., to design a domain-specific chatbot and then generalize to other areas once it achieves the first domain-specific objective robustly. This led us to the fundamental problem of machine comprehension and subsequently to the task of question answering. Having achieved some degree of success with QA systems, we looked back at whether we could apply our newly acquired knowledge to the task of designing chatbots.
The chatbots made with today's technologies mostly use handcrafted techniques such as template matching, which requires anticipating all possible ways a user may articulate his requirements and a conversation may unfold. This requires a lot of man-hours for designing a domain-specific system and is still very error prone. In this section we propose a general chatbot design that would make designing a domain-specific chatbot easy and robust at the same time.
Every domain-specific chatbot needs to obtain a set of information from the user and show some results based on the user-specific information obtained. Traditional chatbots use template matching and keyword lookup to determine whether the user has provided the required information. Our idea is to use the question answering system in the backend to extract the required information from whatever the user has typed up to this point in the conversation. The information to be extracted can be posed as a set of questions, and the answers obtained from those questions can be used as the parameters to supply the relevant information to the user.
We chose the flight reservation system as our chatbot domain. Our goal was to extract the information required to show the user the available flights as per the user's requirements.
Figure 4-2 Flight reservation chatbotrsquos chat window
For a flight reservation task, the booking agent needs to know, at minimum, the origin city, destination city and date of travel to be able to show the available flights. Optional information includes the number of tickets, the passenger's name, one way or round trip, etc.
The minimalistic conversation with the user through the chat window would be as shown above. We had a platform called OneTask on which we wanted to implement our chatbot. The chat interface within the OneTask system looks as follows.
The working of the chat system is as follows:
1. Initiation: The user opens the chat window, which starts a session with the chatbot. The chatbot asks 'How may I help you?'
2. User Reply: The user may reply with none of the required information for flight booking, or may reply with multiple pieces of information in the same message.
3. User Reply Parsing: The conversation up to this point is treated as a passage, and the internal questions are run on this passage. The questions that are run are:
Where do you want to go?
From where do you want to leave?
When do you want to depart?
4. Parsed responses from the QA model: After running the questions on the QA system with the given conversation, answers are obtained for all of them along with their corresponding confidence values. Since answers will be produced even if the required information has not been given up to this point in the conversation, the confidence values play a crucial role in determining the validity of an answer. For this purpose, we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly, a confidence of 2–10 signifies that it may have been answered but should be verified with the user for correctness, and any answer with confidence below 2 is discarded.
5. Asking remaining questions iteratively: After the parsing, it is checked whether any of the required questions are still unanswered. If so, the chatbot asks the remaining questions and steps 3–4 are carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request.
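The loop of steps 3–5 can be sketched as follows. The `qa_model` interface and `StubQA` class are hypothetical stand-ins for the real QA backend, and the slot names are illustrative:

```python
# The three required slots and the internal questions used to fill them
REQUIRED = {
    "destination": "Where do you want to go?",
    "origin": "From where do you want to leave?",
    "date": "When do you want to depart?",
}

def update_slots(conversation, slots, qa_model):
    # Steps 3-4: treat the conversation so far as a passage, run each
    # unanswered internal question through the QA model, and validate
    # the answer by its confidence value
    for slot, question in REQUIRED.items():
        if slot in slots:
            continue
        answer, confidence = qa_model.answer(conversation, question)
        if confidence > 10:                  # confidently answered
            slots[slot] = answer
        elif confidence > 2:                 # uncertain: verify with the user
            slots[slot] = ("verify", answer)
        # below 2: discard; the chatbot will re-ask this question (step 5)
    return slots

class StubQA:
    # Hypothetical stand-in for the real QA model, for illustration only
    def answer(self, passage, question):
        if "go" in question and "Miami" in passage:
            return "Miami", 12.0
        return "", 0.5

slots = update_slots("I want to fly to Miami", {}, StubQA())
```

The chatbot would call this update after every user message and ask only the questions whose slots remain empty.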
Figure 4-4 The Flow diagram of the Flight booking Chatbot system
Online QA System and Attention Visualization
To be able to test various examples, we deployed the BiDAF [24] model as an online demo. One can either choose from the available examples in the drop-down menu or paste in one's own passage and questions. While this is a useful and interesting way to test the model in a user-friendly manner, we created this system primarily to be able to focus on the wrong samples.
Figure 4-5 QA system interface with attention highlight over candidate answers
The candidate answers are shown with blue highlights according to their confidence values: the higher the confidence, the darker the answer. The answer with the highest confidence value is chosen as the predicted answer. We developed the system to show the attention spread of the candidate answers in order to understand what needs to be done to improve the system. This led us to realize the importance of including the query attention part as well as multilevel attention in the BiDAF model, as described in the first section of this chapter.
CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions
Having done a thorough analysis of the current methods and the state of the art in machine comprehension and the QA task, and having developed systems on top of them, we have a strong sense of what needs to be done to further improve QA models. After observing the wrong samples, we can see that the system is still unable to encode meaning and is picking answers based on statistical patterns in the occurrence of answers in the training examples.
Going by the state-of-the-art models and the ongoing research literature, it is easy to conclude that more features need to be embedded to encode the meaning of words, phrases and sentences. The paper Reinforced Mnemonic Reader for Machine Comprehension [6] encoded POS and NER tags of words along with their word and character embeddings, which gave them better results.
Figure 5-1 An English language semantic parse tree [26]
We have developed a method to encode the syntax parse tree of a sentence that encodes not only the POS tags but also the relation of a word within its phrase and the relation of the phrase within the whole sentence, in a hierarchical manner.
Finally, data augmentation is another way to get better results. One definite way to reduce the errors would be to include in the training data samples similar to those the system is faltering on in the dev set: one could generate examples similar to the failure cases and include them in the training set for better prediction. Another approach would be to train on similar, larger datasets. Our models were trained on the SQuAD dataset, but there are other datasets for a similar question answering task, such as TriviaQA. We could augment the training set with TriviaQA along with SQuAD to obtain a more robust system that generalizes better and thus predicts answer spans with higher accuracy.
Conclusion
In this work, we have explored the most fundamental techniques that have shaped the current state of the art. We then proposed a minor architectural improvement over an existing model. Furthermore, we developed two applications that use the base model: first, we discussed how a chatbot application can be made using the QA system, and second, we created a web interface where the model can be run on any passage and question. This interface also shows the attention spread over the candidate answers. While our effort to push the state of the art forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will pay high dividends for society at large.
LIST OF REFERENCES
[1] Chen, Danqi. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U
[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. Yale University, New Haven, Conn., Dept. of Computer Science, 1977.
[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension." arXiv preprint arXiv:1705.03551 (2017).
[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ questions for machine comprehension of text." arXiv preprint arXiv:1606.05250 (2016).
[5] "The Stanford Question Answering Dataset." 2018. rajpurkar.github.io. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/
[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced mnemonic reader for machine comprehension." CoRR abs/1705.02798 (2017).
[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.
[8] Mehrotra, Neetesh. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai/
[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.
[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
[11] Williams, Ronald J., and David Zipser. "A learning algorithm for continually running fully recurrent neural networks." Neural Computation 1, no. 2 (1989): 270-280.
[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9, no. 8 (1997): 1735-1780.
[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv preprint arXiv:1412.3555 (2014).
[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic regularities in continuous space word representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.
[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.
[16] "Word2Vec Tutorial - The Skip-Gram Model." 2016. Chris McCormick. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
[17] "[Paper Introduction] Bilingual Word Representations with Monolingual Quality in Mind." 2015. SlideShare. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind
[18] "Attention Mechanism." 2016. blog.heuritech.com. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism/
[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. 2014. "Memory Networks." arXiv preprint arXiv:1410.3916. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916
[20] Wang, Shuohang, and Jing Jiang. "Machine comprehension using match-LSTM and answer pointer." arXiv preprint arXiv:1608.07905 (2016).
[21] Wang, Shuohang, and Jing Jiang. "Learning natural language inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).
[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.
[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated self-matching networks for reading comprehension and question answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.
[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional attention flow for machine comprehension." arXiv preprint arXiv:1611.01603 (2016).
[25] Clark, Christopher, and Matt Gardner. "Simple and effective multi-paragraph reading comprehension." arXiv preprint arXiv:1710.10723 (2017).
[26] "8. Analyzing Sentence Structure." 2018. nltk.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata, India, and has had a strong interest in computers since an early age. After high school, he did his B.Sc. in computer science at Ramakrishna Mission Residential College, Narendrapur, followed by an M.Sc. in computer science at St. Xavier's College, Kolkata. He had a strong intuition and interest for human-like learning systems and wanted to work in this area. He started working at TCS Innovation Labs, Pune, on applications of Natural Language Processing in educational software. Being deeply passionate about learning systems that mimic the human brain and learn like a human child does, he became increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science at the University of Florida, Gainesville. His academic interests have focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.
have shaped the current state-of-the-art models; then we will discuss our contribution in terms of architecture and applications, and finally conclude with future directions.
CHAPTER 2 THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
To build a question answering system, one needs to be familiar with the fundamental deep learning models such as Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), etc. In this chapter we give an overview of these techniques and see how they all connect in building a question answering system.
Neural Networks
What makes Deep Learning so intriguing is its close resemblance to the working of the mammalian brain, or at least the inspiration it draws from it. The same can be said of Artificial Neural Networks [7], which consist of a system of interconnected units called 'neurons' that take input from similar units and produce a single output.
Figure 2-1 Simple and Deep Learning Neural Networks [8]
The connection from one neuron to another is weighted based on the input data, which enables the network to tune itself to produce a certain output for a given input. This is the learning process, which is achieved through backpropagation, a procedure that propagates the error from the output layer back through the previous layers.
Convolutional Neural Network
The first wave of deep learning's success was brought about by Convolutional Neural Networks (CNN) [9], the technique used by the winning team of the ImageNet competition in 2012. CNNs are deep artificial neural networks (ANN) that can be used to classify images, cluster them by similarity, and perform object recognition within scenes. They can be used to detect and identify faces, people, signs, or any other visual data.
Figure 2-2 Convolutional Neural Network Architecture [10]
There are primarily four operations in a standard CNN model (as shown in the figure above):
1. Convolution - The primary purpose of convolution in the ConvNet (above) is to extract features from the input image. The spatial relationships between pixels, i.e., the image features, are preserved and learned by the convolution using small squares of input data.
2. Non-Linearity (ReLU) - Rectified Linear Unit (ReLU) is a non-linear operation applied element-wise on each pixel. This operation replaces the negative pixel values in the feature map with zero.
3. Pooling or Sub-Sampling - Spatial pooling reduces the dimensionality of each feature map but retains the most important information. For max pooling, the largest value in the square window is taken and the rest are dropped. Other types of pooling are average, sum, etc.
4. Classification (Fully Connected Layer) - The fully connected layer is a traditional Multi-Layer Perceptron, as described before, that uses a softmax activation function in the output layer. The high-level features of the image are encoded by the convolutional and pooling layers and then fed to the fully connected layer, which uses these features to classify the input image into various classes based on the training dataset [10].
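The four operations above can be sketched in a few lines of NumPy. This is a toy illustration rather than a trained network: the kernel and the fully connected weights below are made-up stand-ins for parameters that would normally be learned.

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid cross-correlation: slide the kernel over the image and take
    dot products, preserving spatial relationships between pixels."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def relu(x):
    """Element-wise non-linearity: negative values become zero."""
    return np.maximum(x, 0)

def max_pool(fmap, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy forward pass: a 6x6 "image", a 3x3 vertical-edge-style kernel, then a
# hypothetical, untrained fully connected layer over 3 classes.
image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[1., 0., -1.], [1., 0., -1.], [1., 0., -1.]])
features = max_pool(relu(convolve2d(image, kernel)))   # 2x2 feature map
weights = np.ones((3, features.size)) * 0.01           # stand-in FC weights
probs = softmax(weights @ features.flatten())          # class probabilities
```

With untrained weights the distribution is uninformative; training via backpropagation is what shapes these parameters so that the probabilities concentrate on the correct class.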
When a new image is fed into the CNN model, all the above-mentioned steps are carried out (forward propagation) and a probability distribution over the set of output classes is obtained. With a large enough training dataset, the network will learn to generalize well enough to classify new images into their correct classes.
Recurrent Neural Networks (RNN)
Whenever we want to predict or encode sequential data, RNNs [11] are our go-to method. RNNs perform the same task for every element of a sequence, where the output for each element depends on the previous computations, hence the recurrence. At each time step the hidden state is computed from the current input and the previous hidden state:

h_t = σ(W^(hh) h_(t-1) + W^(hx) x_t) (2-1)

In practice, RNNs are unable to retain long-term dependencies and can look back only a few steps because of the vanishing gradient problem.
A solution to the dependency problem is to use gated cells such as LSTM [11] or GRU [13]. These cells pass on important information to the next cells while ignoring non-important ones. The gated units in a GRU block are:

• Update Gate - computed based on the current input and hidden state:

z_t = σ(W^(z) x_t + U^(z) h_(t-1)) (2-2)

• Reset Gate - calculated similarly, but with different weights:

r_t = σ(W^(r) x_t + U^(r) h_(t-1)) (2-3)

• New memory content:

h̃_t = tanh(W x_t + r_t ∘ U h_(t-1)) (2-4)

If the reset gate unit is ~0, the previous memory is ignored and only the new information is kept.

The final memory at the current time step combines the previous and current time steps:

h_t = z_t ∘ h_(t-1) + (1 - z_t) ∘ h̃_t (2-5)
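Equations 2-2 through 2-5 can be sketched as a single GRU step in NumPy. The weight matrices here are random stand-ins for parameters that a real model would learn.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    """One GRU time step following Eqs. 2-2 to 2-5."""
    z = sigmoid(Wz @ x_t + Uz @ h_prev)            # update gate (2-2)
    r = sigmoid(Wr @ x_t + Ur @ h_prev)            # reset gate (2-3)
    h_tilde = np.tanh(W @ x_t + r * (U @ h_prev))  # new memory content (2-4)
    return z * h_prev + (1.0 - z) * h_tilde        # final memory (2-5)

rng = np.random.default_rng(0)
d_in, d_h = 4, 3
Wz, Wr, W = (rng.standard_normal((d_h, d_in)) for _ in range(3))
Uz, Ur, U = (rng.standard_normal((d_h, d_h)) for _ in range(3))
h = gru_step(rng.standard_normal(d_in), np.zeros(d_h), Wz, Uz, Wr, Ur, W, U)
```

Note how a reset gate near zero zeroes out the U h_(t-1) term in Eq. 2-4, so the candidate memory depends only on the new input.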
While the GRU is computationally efficient, the LSTM is a more general case with three gates, as follows:

• Input Gate - what new information to add to the current cell state
• Forget Gate - how much information from previous states should be kept
• Output Gate - how much information should be sent to the next states

Just like the GRU, the current cell state is a sum of the previous cell state, weighted by the forget gate, and the new value, weighted by the input gate. Based on the cell state, the output gate regulates the final output.
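The three gates can be sketched as one LSTM step in the same style as the GRU sketch, again with random stand-in weights rather than learned ones.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, Wi, Ui, Wf, Uf, Wo, Uo, Wc, Uc):
    """One LSTM time step with input, forget, and output gates."""
    i = sigmoid(Wi @ x_t + Ui @ h_prev)        # input gate: what new info to add
    f = sigmoid(Wf @ x_t + Uf @ h_prev)        # forget gate: how much past to keep
    o = sigmoid(Wo @ x_t + Uo @ h_prev)        # output gate: how much to expose
    c_tilde = np.tanh(Wc @ x_t + Uc @ h_prev)  # candidate cell value
    c = f * c_prev + i * c_tilde               # old state weighted by f, new by i
    h = o * np.tanh(c)                         # output regulated by the output gate
    return h, c

rng = np.random.default_rng(1)
d_in, d_h = 4, 3
Wi, Wf, Wo, Wc = (rng.standard_normal((d_h, d_in)) for _ in range(4))
Ui, Uf, Uo, Uc = (rng.standard_normal((d_h, d_h)) for _ in range(4))
h, c = lstm_step(rng.standard_normal(d_in), np.zeros(d_h), np.zeros(d_h),
                 Wi, Ui, Wf, Uf, Wo, Uo, Wc, Uc)
```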
Word Embedding
Computation and gradients can be applied to numbers, not to words or letters. So we first need to convert words into a corresponding numerical representation before feeding them into a deep learning model. In general, there are two types of word embedding: frequency-based (which comprises count vectors, tf-idf, and co-occurrence vectors) and prediction-based. With frequency-based embeddings, the order of the words is not preserved, and they work as a bag-of-words model, whereas with prediction-based models the order, or locality, of words is taken into consideration to generate the numerical representation of a word. Within the prediction-based category there are two fundamental techniques, Continuous Bag of Words (CBOW) and the Skip-Gram model, which form the basis for word2vec [14] and GloVe [15].
The basic intuition behind word2vec is that if two different words have very similar "contexts" (that is, the words likely to appear around them), then the model will produce similar vectors for those words. Conversely, if two word vectors are similar, then the network will produce similar context predictions for those two words. For example, synonyms like "intelligent" and "smart" would have very similar contexts, and related words like "engine" and "transmission" would probably have similar contexts as well [16]. Plotting the word vectors learned by word2vec over a large corpus, we can find some very interesting relationships between words.
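This "similar context, similar vector" intuition is usually measured with cosine similarity. The three-dimensional vectors below are made up purely for illustration; real word2vec vectors have hundreds of dimensions and come from training on a corpus.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, near 0 for unrelated."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy embeddings (not from any trained model):
vecs = {
    "intelligent": np.array([0.9, 0.8, 0.1]),
    "smart":       np.array([0.85, 0.75, 0.2]),
    "banana":      np.array([-0.2, 0.1, 0.9]),
}
sim_syn = cosine(vecs["intelligent"], vecs["smart"])   # high: similar contexts
sim_far = cosine(vecs["intelligent"], vecs["banana"])  # low: unrelated contexts
```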
Figure 2-3 Semantic relation between words in vector space [17]
Attention Mechanism
We as humans pay attention to things that are important or relevant in a context. For example, when asked a question about a passage, we try to find the part of the passage the question is most relevant to, and then reason from our understanding of that part of the passage. The same idea applies to the attention mechanism in Deep Learning: it is used to identify the specific parts of a given context to which the current question is relevant.

Formally put, the technique takes n arguments y_1, ..., y_n (in our case the words of the passage, say y_i, with hidden states h_i) and a question representation, say q. It returns a vector z which is a "summary" of the y_i, focusing on the information linked to the question q. More formally, it returns a weighted arithmetic mean of the y_i, where the weights are chosen according to the relevance of each y_i given the context c [18].
Figure 2-4 Attention Mechanism flow [18]
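The weighted arithmetic mean described above can be sketched as follows, with plain dot products standing in for the learned relevance scoring that a real attention layer would use.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(ys, q):
    """Return z, the weighted mean of the y_i, with weights given by the
    relevance (here a dot-product score) of each y_i to the question q."""
    scores = np.array([y @ q for y in ys])
    weights = softmax(scores)        # non-negative, sums to 1
    z = weights @ np.stack(ys)       # weighted arithmetic mean of the y_i
    return z, weights

ys = [np.array([1., 0.]), np.array([0., 1.]), np.array([1., 1.])]
q = np.array([3., -1.])              # most relevant to the first vector
z, weights = attend(ys, q)
```

The summary z is pulled toward whichever y_i scores highest against q, which is exactly the "focus on the relevant part of the passage" behavior described above.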
Memory Networks
While Convolutional Neural Networks and Recurrent Neural Networks do capture how we form our visual and sequential memories, their memory (encoded by hidden states and weights) is typically too small and not compartmentalized enough to accurately remember facts from the past (knowledge is compressed into dense vectors) [19].
Deep Learning needed a methodology that preserves memories as they are, so that they are not lost in generalization and recalling exact words or sequences of events remains possible, something computers are already good at. This effort led to Memory Networks [19], published at ICLR 2015 by Facebook AI Research.
This paper provides a basic framework to store, augment, and retrieve memories while working seamlessly with a Recurrent Neural Network architecture. The memory network consists of a memory m (an array of objects indexed by m_i) and four (potentially learned) components I, G, O, and R, as follows:
I (input feature map) - converts the incoming input to the internal feature representation, either a sparse or dense feature vector like those from word2vec or GloVe.

G (generalization) - updates old memories given the new input. The authors call this generalization because the network has an opportunity to compress and generalize its memories at this stage for some intended future use.

O (output feature map) - produces a new output (in the feature representation space) given the new input and the current memory state. This component is responsible for performing inference. In a question answering system, this part selects the candidate sentences (which might contain the answer) from the story (conversation) so far.

R (response) - converts the output into the desired response format, for example a textual response or an action. In the QA system described, this component finds the desired answer and then converts it from the feature representation to the actual word.
This model is fully supervised, meaning all the candidate sentences from which the answer could be found are marked during the training phase; this can also be termed 'hard attention'.

The authors tested the QA system on various literature, including Lord of the Rings.
Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]
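As a toy illustration of the I/G/O/R framework, the sketch below substitutes simple word overlap for the learned components of [19]; the sentences and the scoring rule are made up purely for illustration.

```python
def I(text):
    """Input feature map: a bag-of-words set (a crude stand-in for a
    learned sparse or dense feature vector)."""
    return {w.strip(".,?!") for w in text.lower().split()}

class ToyMemoryNetwork:
    def __init__(self):
        self.memory = []                    # m: an array of stored objects

    def G(self, features, raw):
        """Generalization: here we simply append; a learned G could
        compress or reorganize the memory."""
        self.memory.append((features, raw))

    def O(self, q_features):
        """Output feature map: select the supporting memory, here by word
        overlap with the question (a stand-in for learned inference)."""
        return max(self.memory, key=lambda m: len(m[0] & q_features))

    def R(self, selected, q_features):
        """Response: convert the selected memory into an answer word."""
        leftover = sorted(selected[0] - q_features)
        return leftover[0] if leftover else selected[1]

net = ToyMemoryNetwork()
for s in ["Bilbo travelled to the cave.", "Gollum dropped the ring there."]:
    net.G(I(s), s)
q = I("Who dropped the ring?")
answer = net.R(net.O(q), q)
```

Even this crude version shows the appeal of the design: the supporting sentence is stored verbatim, so recalling an exact fact is a lookup rather than a decoding from a compressed dense vector.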
CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART
There has been rapid progress since the release of the SQuAD dataset, and currently the best ensemble models are close to human-level accuracy in machine comprehension. This is due to various ingenious methods that solve some of the problems of earlier approaches: out-of-vocabulary tokens were handled using character embeddings, long-term dependencies within the context passage were addressed using self-attention, and many other techniques, such as contextualized vectors, history of words, and attention flow, were introduced. In this section we will look at some of the most important models that were fundamental to the progress of question answering.
Machine Comprehension Using Match-LSTM and Answer Pointer
In this paper [20], the authors propose an end-to-end neural architecture for the QA task. The architecture is based on match-LSTM [21], a model they previously proposed for textual entailment, and Pointer Net [22], a sequence-to-sequence model proposed by Vinyals et al. (2015) to constrain the output tokens to come from the input sequences. The model consists of an LSTM preprocessing layer, a match-LSTM layer, and an Answer Pointer layer.

We are given a piece of text, which we refer to as a passage, and a question related to the passage. The passage is represented by a matrix P ∈ R^(d×P), where P is the length (number of tokens) of the passage and d is the dimensionality of the word embeddings. Similarly, the question is represented by a matrix Q ∈ R^(d×Q), where Q is the length of the question. Our goal is to identify a subsequence of the passage as the answer to the question.
Figure 3-1 Match-LSTM Model Architecture [20]
LSTM Preprocessing Layer. They use a standard one-directional LSTM (Hochreiter & Schmidhuber, 1997) to process the passage and the question separately, as shown below.

Match-LSTM Layer. They apply the match-LSTM model proposed for textual entailment to the machine comprehension problem by treating the question as a premise and the passage as a hypothesis. The match-LSTM sequentially goes through the passage. At position i of the passage, it first uses the standard word-by-word attention mechanism to obtain the attention weight vector α_i as follows:

G_i = tanh(W^q H^q + (W^p h_i^p + W^r h_(i-1)^r + b^p) ⊗ e_Q)
α_i = softmax(w^T G_i + b ⊗ e_Q) (3-1)

where W^q, W^p, W^r and b^p, w, b are parameters to be learned.
Answer Pointer Layer. The Answer Pointer (Ans-Ptr) layer is motivated by the Pointer Net introduced by Vinyals et al. (2015) [22]. The boundary model produces only the start token and the end token of the answer; all the tokens between these two in the original passage are then taken as the answer.
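Given start and end probability distributions over passage tokens, the boundary model's span selection can be sketched as below. The passage and the probability numbers are made up for illustration; they are not output from the actual model.

```python
import numpy as np

def best_span(start_probs, end_probs):
    """Pick (start, end) maximizing start_probs[s] * end_probs[e]
    subject to s <= e, as in the boundary model."""
    best, best_score = (0, 0), -1.0
    n = len(start_probs)
    for s in range(n):
        for e in range(s, n):
            score = start_probs[s] * end_probs[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

passage = ["The", "answer", "is", "San", "Jose", "State", "today"]
start = np.array([.05, .05, .05, .60, .10, .10, .05])  # hypothetical numbers
end   = np.array([.05, .05, .05, .05, .10, .65, .05])
s, e = best_span(start, end)
span = " ".join(passage[s:e+1])
```

The s <= e constraint is what makes the two pointers define a contiguous answer subsequence of the passage.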
When this paper was released in November 2016, the match-LSTM method was the state of the art in question answering systems and was at the top of the leaderboard for the SQuAD dataset.
R-NET: Machine Reading Comprehension with Self-Matching Networks
In this model [23], the question and passage are first processed separately by a bidirectional recurrent network (Mikolov et al., 2010). They then match the question and passage with gated attention-based recurrent networks, obtaining a question-aware representation of the passage. On top of that, they apply self-matching attention to aggregate evidence from the whole passage and refine the passage representation, which is then fed into the output layer to predict the boundary of the answer span.
Question and Passage Encoding. First, the words are converted to their respective word-level and character-level embeddings. The character-level embeddings are generated by taking the final hidden states of a bi-directional recurrent neural network (RNN) applied to the embeddings of the characters in the token. Such character-level embeddings have been shown to be helpful in dealing with out-of-vocabulary (OOV) tokens.

They then use a bi-directional RNN to produce new representations of all words in the question and passage, respectively.
Figure 3-2 The task of Question Answering [23]
Gated Attention-Based Recurrent Networks. They use a variant of attention-based recurrent networks with an additional gate to determine the importance of information in the passage with regard to a question. Unlike the gates in LSTM or GRU, the additional gate is based on the current passage word and its attention-pooling vector of the question, which focuses on the relation between the question and the current passage word. The gate effectively models the phenomenon that only parts of the passage are relevant to the question in reading comprehension and question answering, and this gated representation is utilized in subsequent calculations.
Self-Matching Attention. From the previous step, the question-aware passage representation is generated to highlight the important parts of the passage. One problem with such a representation is that it has very limited knowledge of context: an answer candidate is often oblivious to important cues in the passage outside its surrounding window. To address this problem, the authors propose directly matching the question-aware passage representation against itself. It dynamically collects evidence from the whole passage for each word in the passage and encodes the evidence relevant to the current passage word, together with its matching question information, into the passage representation.

Output Layer. They use the same method as Wang & Jiang (2016b), using pointer networks (Vinyals et al., 2015) to predict the start and end positions of the answer. In addition, they use attention-pooling over the question representation to generate the initial hidden vector for the pointer network [23].
When the R-NET model first appeared on the leaderboard in March 2017, it was at the top with a 72.3 Exact Match and an 80.7 F1 score.
Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
BiDAF [24] is a hierarchical multi-stage architecture for modeling the representations of the context paragraph at different levels of granularity. BiDAF includes character-level, word-level, and contextual embeddings, and uses bi-directional attention flow to obtain a query-aware context representation. Their attention layer is not used to summarize the context paragraph into a fixed-size vector. Instead, the attention is computed for every time step, and the attended vector at each time step, along with the representations from previous layers, can flow through to the subsequent modeling layer. This reduces the information loss caused by early summarization [24].
Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]
Their machine comprehension model is a hierarchical multi-stage process and consists of six layers:

1. Character Embedding Layer: maps each word to a vector space using character-level CNNs.

2. Word Embedding Layer: maps each word to a vector space using a pre-trained word embedding model.

3. Contextual Embedding Layer: utilizes contextual cues from surrounding words to refine the embeddings of the words. These first three layers are applied to both the query and the context. They use an LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words, placing LSTMs in both directions and concatenating their outputs.

4. Attention Flow Layer: couples the query and context vectors and produces a set of query-aware feature vectors for each word in the context.

5. Modeling Layer: employs a Recurrent Neural Network to scan the context. The input to the modeling layer is G, which encodes the query-aware representations of the context words. The output of the modeling layer captures the interaction among the context words conditioned on the query. They use two layers of bi-directional LSTM with an output size of d for each direction; the resulting matrix is passed to the output layer to predict the answer.

6. Output Layer: provides an answer to the query. They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices under the predicted distributions, averaged over all examples [25].
In a further variation of the above work, they add a self-attention layer after the bi-attention layer to further improve the results. The architecture of the model is shown below.
Figure 3-3 The task of Question Answering [25]
Summary
In this chapter we reviewed the methods that are fundamental to the state of the art in machine comprehension and the task of question answering. We have come close to human-level accuracy, and this is due to incremental developments over previous models. As we saw, out-of-vocabulary (OOV) tokens were handled using character embeddings, long-term dependencies within the context passage were addressed using self-attention, and many other techniques, such as contextualized vectors and attention flow, were employed to get better results. In the next chapter we will see how we can build on these models and develop them further.
CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING
Although the advancements in question answering systems since the release of the SQuAD dataset have been impressive, and the results are getting close to human-level accuracy, they are far from fool-proof. The models still make mistakes that would be obvious to a human. For example:

Passage: The Panthers used the San Jose State practice facility and stayed at the San Jose Marriott. The Broncos practiced at Florida State Facility and stayed at the Santa Clara Marriott.

Question: At what university's facility did the Panthers practice?

Actual Answer: San Jose State

Predicted Answer: Florida State Facility
To find out what leads to such wrong predictions, we wanted to see the attention weights for such an example. We plotted the passage-question heat map, a 2D matrix where the intensity of each cell signifies the similarity between a passage word and a question word. For the above example, we found that while certain words of the question are given high weight, other parts are not. The words 'At', 'facility', and 'practice' receive high attention, but 'Panthers' does not. If it had received high attention, the system would have predicted 'San Jose State' as the right answer. To solve this issue, we analyzed the base BiDAF model and proposed adding two things:

1. Bi-attention and self-attention over the query

2. A second level of attention over the outputs of (bi-attention + self-attention) from both the context and the query
Multi-Attention BiDAF Model
Our comprehension model is a hierarchical multi-stage process and consists of the following layers:

1. Embedding: Just as in other models, we embed words using pretrained word vectors. We also embed the characters of each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to get character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training.

2. Pre-Process: A shared bi-directional GRU (Cho et al., 2014) is used to map the question and passage embeddings to context-aware embeddings.
3. Context Attention: The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al., 2016) is used to build a query-aware context representation. Let h_i be the vector for context word i, q_j the vector for question word j, and n_q and n_c the lengths of the question and context respectively. We compute attention between context word i and question word j as

a_ij = w_1 · h_i + w_2 · q_j + w_3 · (h_i ∘ q_j) (4-1)

where w_1, w_2, and w_3 are learned vectors and ∘ is element-wise multiplication. We then compute an attended vector c_i for each context token as

p_ij = exp(a_ij) / Σ_(j'=1..n_q) exp(a_ij'),  c_i = Σ_(j=1..n_q) p_ij q_j (4-2)

We also compute a query-to-context vector q_c:

m_i = max_(1≤j≤n_q) a_ij,  p_i = exp(m_i) / Σ_(i'=1..n_c) exp(m_i'),  q_c = Σ_(i=1..n_c) p_i h_i (4-3)

The final vector computed for each token is built by concatenating h_i, c_i, h_i ∘ c_i, and q_c ∘ h_i. In our model we subsequently pass the result through a linear layer with ReLU activations.

4. Context Self-Attention: Next, we use a layer of residual self-attention. The input is passed through another bi-directional GRU. Then we apply the same attention mechanism, only now between the passage and itself. In this case we do not use query-to-context attention, and we set a_ij = -∞ if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so its output is additionally summed with the input.
5. Query Attention: For this part we proceed the same way as in the context attention layer, but calculate the weighted sum of the context words for each query word; thus the output length is the number of query words. We then calculate context-to-query attention, analogously to the query-to-context attention of the context attention layer.
Figure 4-1 The modified BiDAF model with multilevel attention
6. Query Self-Attention: This part is done the same way as the context self-attention layer, but on the output of the Query Attention layer.
7. Context-Query Bi-Attention + Self-Attention: The outputs of the Context Self-Attention and Query Self-Attention layers are taken as input, and the same process of bi-attention and self-attention is applied to these inputs.

8. Prediction: In the last layer of our model, a bidirectional GRU is applied, followed by a linear layer that computes answer start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log likelihood of selecting the correct start and end tokens.
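The context attention computation of step 3 (Eqs. 4-1 to 4-3) can be sketched in NumPy as follows. The vectors w_1, w_2, w_3 and the context/question matrices below are random stand-ins for values a trained model would learn and encode.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bi_attention(H, Q, w1, w2, w3):
    """Bi-directional attention following Eqs. 4-1 to 4-3.
    H: (n_c, d) context vectors h_i; Q: (n_q, d) question vectors q_j;
    w1, w2, w3: learned d-dimensional vectors (random stand-ins here)."""
    # (4-1): a_ij = w1 . h_i + w2 . q_j + w3 . (h_i * q_j);
    # note w3 . (h_i * q_j) equals (h_i * w3) . q_j, giving the matrix form.
    a = (H @ w1)[:, None] + (Q @ w2)[None, :] + (H * w3) @ Q.T
    # (4-2): attended vector c_i, softmax over question words
    C = softmax(a, axis=1) @ Q
    # (4-3): query-to-context vector q_c, softmax over context of max_j a_ij
    q_c = softmax(a.max(axis=1), axis=0) @ H
    return C, q_c

rng = np.random.default_rng(0)
n_c, n_q, d = 5, 3, 4
H, Q = rng.standard_normal((n_c, d)), rng.standard_normal((n_q, d))
w1, w2, w3 = (rng.standard_normal(d) for _ in range(3))
C, q_c = bi_attention(H, Q, w1, w2, w3)
```

Each row of C is the question summary attended for one context token, and q_c is a single vector summarizing the context from the query's point of view, matching the concatenation step that follows Eq. 4-3.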
Having carried out this modification, we were able to solve the wrong example we started with: the multilevel attention model gives the correct output, "San Jose State". We also achieved slightly better scores than the original model, with an F1 score of 85.44 on the SQuAD dev set.
Chatbot Design Using a QA System
Designing a perfect chatbot that passes the Turing test is a fundamental goal of Artificial Intelligence. Although we are many orders of magnitude away from achieving such a goal, domain-specific tasks can be solved with chatbots built from current technology. We started with a similar objective in mind, i.e., to design a domain-specific chatbot and then generalize to other areas once it achieves the first domain-specific objective robustly. This led us to the fundamental problem of machine comprehension and subsequently to the task of question answering. Having achieved some degree of success with QA systems, we looked back to see if we could apply our newly acquired knowledge to the task of designing chatbots.
Chatbots made with today's technologies mostly rely on handcrafted techniques such as template matching, which requires anticipating all the possible ways a user may articulate his requirements and a conversation may unfold. This takes a lot of man-hours to design for a single domain and is still very error-prone. In this section we propose a general chatbot design that makes building a domain-specific chatbot easy and robust at the same time.
Every domain-specific chatbot needs to obtain a set of information from the user and show some results based on the user-specific information obtained. Traditional chatbots use template matching and keyword lookup to determine whether the user has provided the required information. Our idea is to use the question answering system in the backend to extract the required information from whatever the user has typed up to this point of the conversation. The information to be extracted can be posed as a set of questions, and the answers obtained from those questions can be used as the parameters to supply the relevant information to the user.
We chose flight reservation as our chatbot domain. Our goal was to extract the required information from the user to be able to show him the available flights as per his requirements.
Figure 4-2. Flight reservation chatbot's chat window
For a flight reservation task, the booking agent needs to know at minimum the origin city, the destination city, and the date of travel to be able to show the available flights. Optional information includes the number of tickets, the passenger's name, one-way or round trip, etc.

The minimalistic conversation with the user through the chat window would be as shown above. We had a platform called OneTask on which we wanted to implement our chatbot. The chat interface within the OneTask system looks as follows:
Figure 4-3 Chatbot within OneTask system
The working of the chat system is as follows:

1. Initiation: The user opens the chat window, which starts a session with the chatbot. The chatbot asks, 'How may I help you?'

2. User Reply: The user may reply with none of the required information for flight booking, or may reply with multiple pieces of information in the same message.
3. User Reply Parsing: The conversation up to this point is treated as a passage, and the internal questions are run on this passage. The questions that are run are:

Where do you want to go?

From where do you want to leave?

When do you want to depart?

4. Parsed Responses from the QA Model: After running the questions on the QA system with the given conversation, answers are obtained for all of them, along with their corresponding confidence values. Since answers will be produced even if the required question has not been answered up to this point in the conversation, the confidence values play a crucial role in determining the validity of the answers. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly; a confidence of 2-10 signifies that it may have been answered, but the system should verify correctness with the user; and any answer with confidence below 2 is discarded.

5. Asking Remaining Questions Iteratively: After the parsing, it is checked whether any of the required questions are still unanswered. If so, the chatbot asks the remaining questions, and steps 3-4 are carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request.
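The confidence ranges in step 4 can be expressed as a small helper. The slot answers and confidence numbers below are hypothetical examples, not output from the actual system.

```python
def classify_answer(answer, confidence):
    """Apply the confidence ranges described above: accept above 10,
    verify with the user between 2 and 10, discard below 2."""
    if confidence > 10:
        return ("accept", answer)
    if confidence >= 2:
        return ("verify", answer)
    return ("discard", None)

# Hypothetical QA outputs for the three required slots:
slots = {
    "Where do you want to go?": ("Paris", 14.2),
    "From where do you want to leave?": ("Boston", 5.1),
    "When do you want to depart?": ("tomorrow", 0.7),
}
decisions = {q: classify_answer(a, c) for q, (a, c) in slots.items()}
```

Slots marked "discard" or "verify" are exactly the ones the chatbot asks about (or re-confirms) in the next iteration of steps 3-4.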
Figure 4-4 The Flow diagram of the Flight booking Chatbot system
Online QA System and Attention Visualization
To be able to test out various examples we used the BiDAF [23] model for an
online demo One can either choose from the available examples from the drop-down
menu or paste their own passage and examples While this is a useful and interesting
system to test the model in a user-friendly way we created this system to be able to
focus on the wrong samples
Figure 4-5 QA system interface with attention highlight over candidate answers
The candidate answers are shown with blue highlights as per their confidence values: the higher the confidence, the darker the answer. The answer with the highest confidence value is chosen as the prediction. We developed the system to show the attention spread of the candidate answers, in order to understand what needs to be done to improve the system. This led us to realize the importance of including the query attention part, as well as multilevel attention, in the BiDAF model, as described in the first section of this chapter.
CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions
Having done a thorough analysis of the current methods and the state of the art in machine comprehension and the QA task, and having developed systems on top of them, we have gained a strong sense of what needs to be done to further improve QA models. After observing the wrong samples, we can see that the system is still unable to encode meaning and is picking answers based on statistical patterns of occurrence of answers in the training examples.
From the state-of-the-art models and the ongoing research literature, it is easy to conclude that more features need to be embedded to encode the meaning of words, phrases, and sentences. A paper called "Reinforced Mnemonic Reader for Machine Comprehension" [6] encoded POS and NER tags of words along with their word and character embeddings, which gave them better results.
Figure 5-1 An English language semantic parse tree [26]
We have developed a method to encode the syntax parse tree of a sentence that encodes not only the POS tags but also the relation of the word within its phrase and the relation of the phrase within the whole sentence, in a hierarchical manner.
Finally, data augmentation is another way to get better results. One definite way to reduce the errors would be to include in the training data samples similar to those the system is faltering on in the dev set. One could generate examples similar to the failure cases and include them in the training set for better prediction. Another approach would be to train on similar and bigger datasets. Our models were trained on the SQuAD dataset; other datasets, such as TriviaQA, address a similar question answering task. We could augment the training set with TriviaQA alongside SQuAD to have a more robust system that generalizes better and thus predicts answer spans with higher accuracy.
Conclusion
In this work we have explored the most fundamental techniques that have shaped the current state of the art. We then proposed a minor architectural improvement over an existing model. Furthermore, we developed two applications that use the base model: first, a chatbot application built on top of the QA system, and second, a web interface where the model can be applied to any passage and question. This interface also shows the attention spread over the candidate answers. While our effort to push the state of the art forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will pay high dividends for society at large.
LIST OF REFERENCES
[1] Chen, Danqi. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U

[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.

[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension." arXiv preprint arXiv:1705.03551 (2017).

[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ questions for machine comprehension of text." arXiv preprint arXiv:1606.05250 (2016).

[5] "The Stanford Question Answering Dataset." 2018. rajpurkar.github.io. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/

[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced mnemonic reader for machine comprehension." CoRR abs/1705.02798 (2017).

[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.

[8] Mehrotra, Neetesh. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai

[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.

[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/

[11] Williams, Ronald J., and David Zipser. "A learning algorithm for continually running fully recurrent neural networks." Neural Computation 1, no. 2 (1989): 270-280.

[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9, no. 8 (1997): 1735-1780.

[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv preprint arXiv:1412.3555 (2014).

[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic regularities in continuous space word representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.

[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.

[16] "Word2Vec Tutorial - The Skip-Gram Model." 2016. mccormickml.com. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

[17] "[Paper Introduction] Bilingual Word Representations with Monolingual Quality in Mind." 2015. SlideShare. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind

[18] "Attention Mechanism." 2016. blog.heuritech.com. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism/

[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. 2014. "Memory Networks." arXiv preprint arXiv:1410.3916. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916

[20] Wang, Shuohang, and Jing Jiang. "Machine comprehension using match-LSTM and answer pointer." arXiv preprint arXiv:1608.07905 (2016).

[21] Wang, Shuohang, and Jing Jiang. "Learning natural language inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).

[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.

[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated self-matching networks for reading comprehension and question answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.

[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional attention flow for machine comprehension." arXiv preprint arXiv:1611.01603 (2016).

[25] Clark, Christopher, and Matt Gardner. "Simple and effective multi-paragraph reading comprehension." arXiv preprint arXiv:1710.10723 (2017).

[26] "8. Analyzing Sentence Structure." 2018. nltk.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata, India, and has had a strong interest in computers since an early age. After high school, he earned a B.Sc. in computer science from Ramakrishna Mission Residential College, Narendrapur, followed by an M.Sc. in computer science from St. Xavier's College, Kolkata. With a strong intuition for and interest in human-like learning systems, he wanted to work in this area. He started working at TCS Innovation Labs, Pune, on applications of Natural Language Processing in education. Deeply passionate about learning systems that mimic the human brain and learn as a human child does, he became increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science at the University of Florida, Gainesville. His academic interests have focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.
CHAPTER 2 THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
To build a question answering system, one needs to be familiar with fundamental deep learning models such as Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM). In this chapter we give an overview of these techniques and see how they all connect in building a question answering system.
Neural Networks
What makes Deep Learning so intriguing is its close resemblance to the working of the mammalian brain, or at least the inspiration it draws from it. The same can be said of Artificial Neural Networks [7], which consist of a system of interconnected units called 'neurons' that take input from similar units and produce a single output.
Figure 2-1 Simple and Deep Learning Neural Networks [8]
The connection from one neuron to another is weighted, and based on the input data the network tunes these weights to produce a certain output for a given input. This is the learning process, which is achieved through backpropagation, a procedure that propagates the error from the output layer back to the previous layers.
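As a toy illustration (not code from this thesis), backpropagating the output error through a tiny two-layer network can be sketched in a few lines of NumPy; the architecture, activation, and learning rate here are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy network: 2 inputs -> 3 hidden units (sigmoid) -> 1 output (sigmoid).
W1 = rng.normal(size=(2, 3)); b1 = np.zeros(3)
W2 = rng.normal(size=(3, 1)); b2 = np.zeros(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(x, y, lr=0.5):
    """One forward pass plus one backward pass; returns the squared error
    measured before the weight update."""
    global W1, b1, W2, b2
    h = sigmoid(x @ W1 + b1)           # hidden activations (forward)
    out = sigmoid(h @ W2 + b2)         # network output (forward)
    err = out - y                      # dL/dout for L = 0.5 * (out - y)^2
    # Propagate the error from the output layer back to the hidden layer.
    d_out = err * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * np.outer(h, d_out); b2 -= lr * d_out
    W1 -= lr * np.outer(x, d_h);  b1 -= lr * d_h
    return float(err[0] ** 2)

# Repeatedly fitting a single example: the error should shrink.
x, y = np.array([1.0, 0.0]), np.array([1.0])
errors = [train_step(x, y) for _ in range(300)]
```

Each call performs one gradient-descent update, so the recorded error decreases as training proceeds.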
Convolutional Neural Network
The first wave of deep learning's success was brought by Convolutional Neural Networks (CNN) [9], the technique used by the winning team of the ImageNet competition in 2012. CNNs are deep artificial neural networks (ANN) that can be used to classify images, cluster them by similarity, and perform object recognition within scenes. They can be used to detect and identify faces, people, signs, or any other visual data.
Figure 2-2 Convolutional Neural Network Architecture [10]
There are primarily four operations in a standard CNN model (as shown in the figure above):

1. Convolution - The primary purpose of convolution in the ConvNet above is to extract features from the input image. The spatial relationships between pixels, i.e., the image features, are preserved and learned by the convolution using small squares of input data.

2. Non-Linearity (ReLU) - The Rectified Linear Unit (ReLU) is a non-linear operation applied element-wise to each pixel. It replaces the negative pixel values in the feature map by zero.

3. Pooling or Sub-Sampling - Spatial pooling reduces the dimensionality of each feature map while retaining the most important information. In max pooling, the largest value in the square window is kept and the rest are dropped. Other types of pooling are average, sum, etc.

4. Classification (Fully Connected Layer) - The fully connected layer is a traditional Multi-Layer Perceptron, as described before, that uses a softmax activation function in the output layer. The high-level features of the image are encoded by the convolutional and pooling layers and then fed to the fully connected layer, which uses these features to classify the input image into various classes based on the training dataset [10].
When a new image is fed into the CNN model, all the above-mentioned steps are carried out (forward propagation) and a probability distribution over the set of output classes is obtained. With a large enough training dataset, the network will generalize well enough to classify new images into their correct classes.
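The convolution, ReLU, and pooling operations can be illustrated with a minimal NumPy sketch; the hand-made edge-detector kernel below is an illustrative stand-in for filters that a CNN would learn:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (really cross-correlation, as in most CNN libraries)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def relu(x):
    # Replaces the negative values in the feature map by zero.
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Non-overlapping spatial max pooling: keep the largest value per window."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

# A tiny 6x6 "image": dark on the left, bright on the right.
image = np.zeros((6, 6)); image[:, 3:] = 1.0
kernel = np.array([[-1.0, 1.0]] * 3)          # responds to dark-to-bright edges
feature_map = relu(conv2d(image, kernel))     # convolution + non-linearity
pooled = max_pool(feature_map)                # sub-sampling
```

The feature map responds only at the vertical edge, and pooling keeps that strongest response while shrinking the map.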
Recurrent Neural Networks (RNN)
Whenever we want to predict or encode sequential data, the RNN [11] is our go-to method. RNNs perform the same task for every element of a sequence, where the output for each element depends on previous computations, hence the recurrence. In practice, RNNs are unable to retain long-term dependencies and can look back only a few steps because of the vanishing gradient problem.
h_t = tanh(W x_t + U h_{t-1}) (2-1)
A solution to the dependency problem is to use gated cells such as the LSTM [12] or the GRU [13]. These cells pass important information on to the next cells while ignoring unimportant information. The gated units in a GRU block are:

• Update Gate - computed from the current input and hidden state:

z_t = σ(W_z x_t + U_z h_{t-1}) (2-2)

• Reset Gate - calculated similarly but with different weights:

r_t = σ(W_r x_t + U_r h_{t-1}) (2-3)

• New memory content:

h̃_t = tanh(W x_t + r_t ∘ U h_{t-1}) (2-4)

If the reset gate is close to 0, the previous memory is ignored and only the new information is kept.
The final memory at the current time step combines the previous and current time steps:

h_t = z_t ∘ h_{t-1} + (1 - z_t) ∘ h̃_t (2-5)
While the GRU is computationally efficient, the LSTM is a more general case with three gates:

• Input Gate - what new information to add to the current cell state

• Forget Gate - how much information from previous states to keep

• Output Gate - how much information should be sent to the next states

As in the GRU, the current cell state is a sum of the previous cell state, weighted by the forget gate, and the new value, weighted by the input gate. Based on the cell state, the output gate regulates the final output.
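A single GRU step following the gate equations above can be sketched as below. The random weights are stand-ins for learned parameters, and note that some formulations swap the roles of z_t and 1 - z_t:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU time step following Eqs. 2-2 through 2-5."""
    Wz, Uz, Wr, Ur, W, U = params
    z = sigmoid(Wz @ x_t + Uz @ h_prev)            # update gate (2-2)
    r = sigmoid(Wr @ x_t + Ur @ h_prev)            # reset gate (2-3)
    h_tilde = np.tanh(W @ x_t + r * (U @ h_prev))  # new memory content (2-4)
    return z * h_prev + (1.0 - z) * h_tilde        # final memory (2-5)

rng = np.random.default_rng(1)
dim_x, dim_h = 4, 3
# Alternate input-to-hidden and hidden-to-hidden weight matrices.
params = [rng.normal(size=(dim_h, dim_x)) if i % 2 == 0
          else rng.normal(size=(dim_h, dim_h)) for i in range(6)]

h = np.zeros(dim_h)
for t in range(5):                                 # run a short input sequence
    h = gru_step(rng.normal(size=dim_x), h, params)
```

Because each step forms a convex combination of the previous state and a tanh output, the hidden state stays bounded in (-1, 1).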
Word Embedding
Computations and gradients can be applied to numbers, not to words or letters. So we first need to convert words into a corresponding numerical representation before feeding them into a deep learning model. In general there are two types of word embedding: frequency based (which comprises count vectors, tf-idf, and co-occurrence vectors) and prediction based. With frequency-based embeddings, the order of the words is not preserved, and they work as a bag-of-words model. With prediction-based models, the order or locality of words is taken into consideration to generate the numerical representation of a word. Within the prediction-based category there are two fundamental techniques, Continuous Bag of Words (CBOW) and the Skip-Gram model, which form the basis for word2vec [14] and GloVe [15].
The basic intuition behind word2vec is that if two different words have very similar "contexts" (that is, the words that are likely to appear around them), then the model will produce similar vectors for those words. Conversely, if two word vectors are similar, then the network will produce similar context predictions for those two words. For example, synonyms like "intelligent" and "smart" have very similar contexts, and related words like "engine" and "transmission" probably have similar contexts as well [16]. Plotting the word vectors learned by word2vec over a large corpus, we can find some very interesting relationships between words.
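This intuition can be illustrated with cosine similarity on hand-made toy vectors; real word2vec vectors are learned from a large corpus and typically have 100-300 dimensions:

```python
import numpy as np

# Hypothetical 3-dimensional vectors, hand-made for illustration only.
vectors = {
    "intelligent": np.array([0.90, 0.80, 0.10]),
    "smart":       np.array([0.85, 0.75, 0.20]),
    "engine":      np.array([0.10, 0.20, 0.90]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 for parallel vectors, 0.0 for orthogonal ones."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_synonyms = cosine(vectors["intelligent"], vectors["smart"])
sim_unrelated = cosine(vectors["intelligent"], vectors["engine"])
```

With well-trained embeddings, the synonym pair scores much higher than the unrelated pair, exactly as the toy vectors are constructed to show.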
Figure 2-3 Semantic relation between words in vector space [17]
Attention Mechanism
We as humans pay attention to things that are important or relevant in a context. For example, when asked a question about a passage, we try to find the part of the passage most relevant to the question and then reason from our understanding of that part. The same idea applies to the attention mechanism in Deep Learning: it is used to identify the specific parts of a given context to which the current question is relevant.

Formally put, the technique takes n arguments y_1, ..., y_n (in our case the words of the passage) and a question representation q. It returns a vector z which is meant to be the "summary" of the y_i, focusing on the information linked to the question q. More formally, it returns a weighted arithmetic mean of the y_i, where the weights are chosen according to the relevance of each y_i given the context c [18].
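A minimal sketch of this weighted-mean attention, using dot-product relevance scores as an illustrative choice (real models learn the scoring function):

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())   # subtract max for numerical stability
    return e / e.sum()

def attend(ys, q):
    """Return z, a weighted arithmetic mean of the y_i, with weights given by
    the relevance (here: dot product) of each y_i to the query q."""
    scores = ys @ q                # one relevance score per y_i
    weights = softmax(scores)      # normalize scores into a distribution
    z = weights @ ys               # weighted mean of the y_i
    return z, weights

# Three context vectors; the query points in the direction of the second one.
ys = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [0.5, 0.5]])
q = np.array([0.0, 2.0])
z, weights = attend(ys, q)
```

The summary vector z is dominated by the context vectors most relevant to q, and the weights form a proper probability distribution.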
Figure 2-4 Attention Mechanism flow [18]
Memory Networks
While Convolutional Neural Networks and Recurrent Neural Networks do capture how we form our visual and sequential memories, their memory (encoded by hidden states and weights) is typically too small and not compartmentalized enough to accurately remember facts from the past, as knowledge is compressed into dense vectors [19].

Deep Learning needed a methodology that preserves memories as they are, so that they are not lost in generalization, and so that recalling exact words or sequences of events is possible, something computers are already good at. This effort led to Memory Networks [19], published at ICLR 2015 by Facebook AI Research.
The paper provides a basic framework to store, augment, and retrieve memories while working seamlessly with a Recurrent Neural Network architecture. The memory network consists of a memory m (an array of objects indexed by m_i) and four (potentially learned) components I, G, O, and R, as follows:

I (input feature map) converts the incoming input to the internal feature representation, either a sparse or dense feature vector such as one from word2vec or GloVe.

G (generalization) updates old memories given the new input. The authors call this generalization because at this stage the network has an opportunity to compress and generalize its memories for some intended future use.

O (output feature map) produces a new output (in the feature representation space) given the new input and the current memory state. This component is responsible for performing inference. In a question answering system, this part selects the candidate sentences (which might contain the answer) from the story (conversation) so far.

R (response) converts the output into the desired response format, for example a textual response or an action. In the QA system described, this component finds the desired answer and then converts it from the feature representation to the actual word.

This is a fully supervised model, meaning all the candidate sentences from which the answer could be found are marked during the training phase; this can also be termed 'hard attention'.
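As a purely illustrative, non-learned sketch of the I/G/O/R framework (bag-of-words overlap stands in for the learned scoring functions, and every name here is hypothetical):

```python
class MemoryNetwork:
    """Schematic I/G/O/R sketch: I is a bag-of-words featurizer, G appends to
    memory, O scores memories by word overlap, R extracts a non-question word."""

    def __init__(self):
        self.memory = []                      # the memory array m

    def I(self, text):                        # input feature map
        return set(text.lower().split())

    def G(self, features, text):              # generalization: store new memory
        self.memory.append((features, text))

    def O(self, q_features):                  # output: best supporting memory
        return max(self.memory, key=lambda m: len(m[0] & q_features))[1]

    def R(self, supporting, q_features):      # response: pick an answer word
        words = [w for w in supporting.split() if w.lower() not in q_features]
        return words[-1] if words else supporting

    def tell(self, text):
        self.G(self.I(text), text)

    def ask(self, question):
        qf = self.I(question)
        return self.R(self.O(qf), qf)

net = MemoryNetwork()
net.tell("Bilbo travelled to the cave")
net.tell("Frodo went to Mount Doom")
answer = net.ask("Where did Frodo go")
```

The real components are learned jointly; this stub only mirrors the data flow: featurize the question, retrieve the best-matching memory, and produce a response from it.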
The authors tested the QA system on various literature, including Lord of the Rings.
Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]
CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART
There has been rapid progress since the release of the SQuAD dataset, and currently the best ensemble models are close to human-level accuracy in machine comprehension. This is due to various ingenious methods that solve some of the problems of previous approaches: out-of-vocabulary tokens were handled using character embeddings, long-term dependencies within the context passage were addressed using self-attention, and many other techniques such as contextualized vectors, history of words, and attention flow were introduced. In this section we will look at some of the most important models that were fundamental to the progress of Question Answering.
Machine Comprehension Using Match-LSTM and Answer Pointer
In this paper [20] the authors propose an end-to-end neural architecture for the QA task. The architecture is based on match-LSTM [21], a model they proposed for textual entailment, and Pointer Net [22], a sequence-to-sequence model proposed by Vinyals et al. (2015) that constrains the output tokens to be from the input sequences. The model consists of an LSTM preprocessing layer, a match-LSTM layer, and an Answer Pointer layer.

We are given a piece of text, which we refer to as a passage, and a question related to the passage. The passage is represented by a matrix P ∈ R^(d×P), where P is the length (number of tokens) of the passage and d is the dimensionality of the word embeddings. Similarly, the question is represented by a matrix Q ∈ R^(d×Q), where Q is the length of the question. Our goal is to identify a subsequence from the passage as the answer to the question.
Figure 3-1 Match-LSTM Model Architecture [20]
LSTM Preprocessing Layer: They use a standard one-directional LSTM (Hochreiter & Schmidhuber, 1997) to process the passage and the question separately, as shown below.

Match-LSTM Layer: They apply the match-LSTM model proposed for textual entailment to the machine comprehension problem by treating the question as a premise and the passage as a hypothesis. The match-LSTM goes through the passage sequentially. At position i of the passage, it first uses the standard word-by-word attention mechanism to obtain the attention weight vector α_i as follows:

G_i = tanh(W^q H^q + (W^p h_i^p + W^r h_{i-1}^r + b^p) ⊗ e_Q), α_i = softmax(w^T G_i + b ⊗ e_Q) (3-1)
where W^q, W^p, W^r, b^p, w, and b are parameters to be learned.
Answer Pointer Layer: The Answer Pointer (Ans-Ptr) layer is motivated by the Pointer Net introduced by Vinyals et al. (2015) [22]. The boundary model produces only the start token and the end token of the answer; all the tokens between these two in the original passage are then taken as the answer.
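The boundary model's span selection can be sketched as follows; the start/end probabilities below are made-up numbers for illustration, not model outputs:

```python
import numpy as np

def best_span(p_start, p_end, max_len=15):
    """Pick (s, e) maximizing p_start[s] * p_end[e] subject to
    s <= e < s + max_len, as in pointer-network boundary models."""
    best, best_score = (0, 0), -1.0
    for s in range(len(p_start)):
        for e in range(s, min(s + max_len, len(p_end))):
            score = p_start[s] * p_end[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

passage = "the panthers used the san jose state practice facility".split()
# Illustrative probabilities peaking at "san" (start) and "state" (end).
p_start = np.array([.02, .05, .03, .05, .60, .10, .05, .05, .05])
p_end   = np.array([.02, .02, .02, .02, .05, .10, .62, .10, .05])
s, e = best_span(p_start, p_end)
answer = " ".join(passage[s:e+1])
```

The constraint s <= e guarantees a valid span, so the extracted answer is always a contiguous subsequence of the passage.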
When this paper was released in November 2016, the match-LSTM method was the state of the art in question answering systems and topped the leaderboard for the SQuAD dataset.
R-NET Matching Reading Comprehension with Self-Matching Networks
In this model [23], the question and passage are first processed separately by a bidirectional recurrent network (Mikolov et al., 2010). The question and passage are then matched with gated attention-based recurrent networks, obtaining a question-aware representation of the passage. On top of that, self-matching attention is applied to aggregate evidence from the whole passage and refine the passage representation, which is then fed into the output layer to predict the boundary of the answer span.

Question and Passage Encoding: First, the words are converted to their respective word-level and character-level embeddings. The character-level embeddings are generated by taking the final hidden states of a bidirectional recurrent neural network (RNN) applied to the embeddings of the characters in the token. Such character-level embeddings have been shown to help deal with out-of-vocabulary (OOV) tokens.
They then use a bidirectional RNN to produce new representations of all words in the question and the passage, respectively.
Figure 3-2 The task of Question Answering [23]
Gated Attention-Based Recurrent Networks: They use a variant of attention-based recurrent networks with an additional gate to determine the importance of passage information with regard to a question. Different from the gates in the LSTM or GRU, this additional gate is based on the current passage word and its attention-pooling vector over the question, focusing on the relation between the question and the current passage word. The gate effectively models the phenomenon that in reading comprehension and question answering only parts of the passage are relevant to the question, and the gated representation is used in subsequent calculations.
Self-Matching Attention: The previous step generates the question-aware passage representation to highlight the important parts of the passage. One problem with such a representation is that it has very limited knowledge of context: an answer candidate is often oblivious to important cues in the passage outside its surrounding window. To address this problem, the authors propose directly matching the question-aware passage representation against itself. It dynamically collects evidence from the whole passage for each word and encodes the evidence relevant to the current passage word, together with its matching question information, into the passage representation.

Output Layer: They use the same method as Wang & Jiang (2016b), using pointer networks (Vinyals et al., 2015) to predict the start and end positions of the answer. In addition, they use attention-pooling over the question representation to generate the initial hidden vector for the pointer network [23].
When the R-NET model first appeared on the leaderboard in March 2017, it was at the top with a 72.3 Exact Match and an 80.7 F1 score.
Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
BiDAF [24] is a hierarchical multi-stage architecture for modeling representations of the context paragraph at different levels of granularity. BiDAF includes character-level, word-level, and contextual embeddings, and uses bi-directional attention flow to obtain a query-aware context representation. Their attention layer is not used to summarize the context paragraph into a fixed-size vector. Instead, the attention is computed at every time step, and the attended vector at each time step, along with the representations from previous layers, can flow through to the subsequent modeling layer. This reduces the information loss caused by early summarization [24].
Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]
Their machine comprehension model is a hierarchical multi-stage process and consists of six layers:

1. Character Embedding Layer: maps each word to a vector space using character-level CNNs.

2. Word Embedding Layer: maps each word to a vector space using a pre-trained word embedding model.

3. Contextual Embedding Layer: utilizes contextual cues from surrounding words to refine the embedding of the words. These first three layers are applied to both the query and the context. An LSTM is used on top of the embeddings provided by the previous layers to model the temporal interactions between words; LSTMs are placed in both directions and their outputs concatenated.

4. Attention Flow Layer: couples the query and context vectors and produces a set of query-aware feature vectors for each word in the context.

5. Modeling Layer: employs a Recurrent Neural Network to scan the context. The input to the modeling layer is G, which encodes the query-aware representations of the context words. The output of the modeling layer captures the interaction among the context words conditioned on the query. They use two layers of bi-directional LSTM with an output size of d for each direction. The resulting matrix is passed to the output layer to predict the answer.

6. Output Layer: provides an answer to the query. They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices given the predicted distributions, averaged over all examples [25].
In a further variation of the above work, they add a self-attention layer after the bi-attention layer to further improve the results. The architecture of the model is shown below.

Figure 3-4 BiDAF with an additional self-attention layer [25]
Summary
In this chapter we reviewed the methods that are fundamental to the state of the art in Machine Comprehension and the task of Question Answering. We have come close to human-level accuracy, thanks to incremental developments over previous models. As we saw, out-of-vocabulary (OOV) tokens were handled using character embeddings, long-term dependencies within the context passage were addressed using self-attention, and many other techniques such as contextualized vectors and attention flow were employed to get better results. In the next chapter we will see how we can build on these models and develop them further.
CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING
Although the advancements in question answering systems since the release of the SQuAD dataset have been impressive, and the results are getting close to human-level accuracy, these are far from fool-proof systems. The models still make mistakes that would be obvious to a human. For example:

Passage: The Panthers used the San Jose State practice facility and stayed at the San Jose Marriott. The Broncos practiced at Florida State Facility and stayed at the Santa Clara Marriott.

Question: At what university's facility did the Panthers practice?

Actual Answer: San Jose State

Predicted Answer: Florida State Facility
To find out what leads to the wrong predictions, we wanted to see the attention weights for such an example. We plotted the passage-question heat map, a 2D matrix where the intensity of each cell signifies the similarity between a passage word and a question word. For the above example, we found that while certain words of the question are given high weight, other parts are not. The words 'at', 'facility', and 'practice' receive high attention, but 'Panthers' does not. Had it received high attention, the system would have predicted 'San Jose State' as the right answer. To solve this issue we analyzed the base BiDAF model and proposed adding two things:

1. Bi-attention and self-attention over the query

2. A second level of attention over the outputs of (bi-attention + self-attention) from both the context and the query
Multi-Attention BiDAF Model
Our comprehension model is a hierarchical multi-stage process and consists of the following layers:
1. Embedding: As in other models, we embed words using pretrained word vectors. We also embed the characters of each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to get character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training.
2. Pre-Process: A shared bi-directional GRU (Cho et al., 2014) is used to map the question and passage embeddings to context-aware embeddings.
3. Context Attention: The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al., 2016) is used to build a query-aware context representation. Let h_i be the vector for context word i, q_j the vector for question word j, and n_q and n_c the lengths of the question and context, respectively. We compute attention between context word i and question word j as

a_ij = w1 · h_i + w2 · q_j + w3 · (h_i ∘ q_j) (4-1)

where w1, w2, and w3 are learned vectors and ∘ is element-wise multiplication. We then compute an attended vector c_i for each context token as

p_ij = exp(a_ij) / Σ_j exp(a_ij), c_i = Σ_j p_ij q_j (4-2)

We also compute a query-to-context vector q_c:

m_i = max_j a_ij, p_i = exp(m_i) / Σ_i exp(m_i), q_c = Σ_i p_i h_i (4-3)

The final vector computed for each token is built by concatenating h_i, c_i, h_i ∘ c_i, and q_c ∘ c_i. In our model we subsequently pass the result through a linear layer with ReLU activations.
4. Context Self-Attention: Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU. Then we apply the same attention mechanism, only now between the passage and itself; in this case we do not use query-to-context attention, and we set a_ij = -inf if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so its output is additionally summed with its input.
5. Query Attention: Here we proceed as in the context attention layer, but calculate the weighted sum of the context words for each query word, giving a sequence whose length is the number of query words. We then calculate context-to-query attention analogously to the query-to-context attention in the context attention layer.
Figure 4-1 The modified BiDAF model with multilevel attention
6. Query Self-Attention: This is done the same way as the context self-attention layer, but on the output of the query attention layer.
7. Context-Query Bi-Attention + Self-Attention: The outputs of the context self-attention and query self-attention layers are taken as input, and the same bi-attention and self-attention process is applied to these inputs.
8. Prediction: In the last layer of our model, a bidirectional GRU is applied, followed by a linear layer that computes answer start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores. A softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log likelihood of selecting the correct start and end tokens.
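The context attention computation of Eqs. 4-1 through 4-3 can be sketched in NumPy, with random matrices standing in for the GRU outputs and the learned vectors w1, w2, w3:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_c, n_q = 6, 5, 3                      # hidden size, context/query lengths
H = rng.normal(size=(n_c, d))              # context vectors h_i (stand-ins)
Q = rng.normal(size=(n_q, d))              # question vectors q_j (stand-ins)
w1, w2, w3 = rng.normal(size=(3, d))       # "learned" vectors (random here)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Eq. 4-1: a_ij = w1.h_i + w2.q_j + w3.(h_i * q_j), for all i, j at once.
A = (H @ w1)[:, None] + (Q @ w2)[None, :] + (H * w3) @ Q.T

# Eq. 4-2: attended vector c_i for each context token (softmax over j).
C = softmax(A, axis=1) @ Q

# Eq. 4-3: query-to-context vector q_c (softmax over the row maxima).
q_c = softmax(A.max(axis=1)) @ H
```

The term `(H * w3) @ Q.T` computes w3 · (h_i ∘ q_j) for every pair in one matrix product, which is why the whole attention matrix A can be built without explicit loops.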
With this modification, we were able to solve the wrong example we started with: the multilevel attention model gives the correct output, 'San Jose State'. We also achieved slightly better scores than the original model, with an F1 score of 85.44 on the SQuAD dev set.
Chatbot Design Using a QA System
Designing a chatbot that passes the Turing test is a fundamental goal of Artificial Intelligence. Although we are many orders of magnitude away from achieving such a goal, domain-specific tasks can be solved with chatbots built from current technology. We started with a similar objective in mind: to design a domain-specific chatbot and then generalize to other areas once the first domain-specific objective is achieved robustly. This led us to the fundamental problem of Machine Comprehension and subsequently to the task of Question Answering. Having achieved some degree of success with QA systems, we looked back to see whether we could apply our newly acquired knowledge to designing chatbots.
Chatbots made with today's technologies mostly rely on handcrafted techniques such as template matching, which require anticipating all possible ways a user may articulate his requirements and a conversation may unfold. This takes many man-hours of design for a domain-specific system and is still very error prone. In this section we propose a general chatbot design that makes designing a domain-specific chatbot easy and robust at the same time.
Every domain-specific chatbot needs to obtain a set of information from the user and show results based on the user-specific information obtained. Traditional chatbots use template matching and keyword lookup to determine whether the user has provided the required information. Our idea is to use the question answering system in the backend to extract the required information from whatever the user has typed up to this point of the conversation. The information to be extracted can be posed as a set of questions, and the answers obtained for those questions can be used as the parameters for supplying relevant information to the user.
We chose flight reservation as our chatbot domain. Our goal was to extract the
required information from the user to be able to show him the available flights as per
his requirements.
Figure 4-2 Flight reservation chatbotrsquos chat window
For a flight reservation task, the booking agent needs to know the origin city,
destination city, and date of travel at minimum to be able to show the available flights.
Optional information includes the number of tickets, the passenger's name, one way or
round trip, etc.
The minimalistic conversation with the user through the chat window would be as
shown above. We had a platform called OneTask on which we wanted to implement our
chatbot. The chat interface within the OneTask system looks as follows.
Figure 4-3 Chatbot within OneTask system
The working of the chat system is as follows:
1. Initiation: The user opens the chat window, which starts a session with the chatbot. The chatbot asks "How may I help you?"
2. User Reply: The user may reply with none of the required information for flight booking, or may reply with multiple pieces of information in the same message.
3. User Reply Parsing: The conversation up to this point is treated as a passage, and the internal questions are run on this passage. The questions that are run are:
Where do you want to go?
From where do you want to leave?
When do you want to depart?
4. Parsed Responses from the QA Model: After running the questions on the QA system with the given conversation, answers are obtained for all of them, along with their corresponding confidence values. Since answers will be produced even if the required question has not been answered up to this point in the conversation, the confidence values play a crucial role in determining the validity of an answer. For this purpose, we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly; a confidence of 2-10 signifies that it may have been answered but should be verified with the user for correctness; and any answer with confidence below 2 is discarded.
5. Asking Remaining Questions Iteratively: After the parsing, it is checked whether any of the required questions are still unanswered. If so, the chatbot asks the remaining questions, and steps 3-4 are carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request.
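The confidence-based slot filling of steps 3 through 5 can be sketched as follows. This is an illustrative sketch, not our production code: the `qa_model` callable, the function name, and the question strings are stand-ins, while the thresholds follow the ranges described above.

```python
# Sketch of the parsing loop described above. The QA model is abstracted as a
# callable returning (answer, confidence); names here are illustrative.
REQUIRED_QUESTIONS = [
    "Where do you want to go",
    "From where do you want to leave",
    "When do you want to depart",
]

ACCEPT = 10.0   # above 10: treated as answered correctly
VERIFY = 2.0    # 2-10: verify with the user; below 2: discard

def parse_conversation(conversation, qa_model):
    """Run every required question against the conversation treated as a passage."""
    slots, to_verify, unanswered = {}, {}, []
    for question in REQUIRED_QUESTIONS:
        answer, confidence = qa_model(conversation, question)
        if confidence > ACCEPT:          # confident: accept the answer
            slots[question] = answer
        elif confidence >= VERIFY:       # uncertain: confirm with the user
            to_verify[question] = answer
        else:                            # too low: ask this question again later
            unanswered.append(question)
    return slots, to_verify, unanswered
```

Unanswered questions are then asked one at a time, and the loop is re-run on the growing conversation until every slot is filled.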
Figure 4-4 The Flow diagram of the Flight booking Chatbot system
Online QA System and Attention Visualization
To be able to test out various examples, we used the BiDAF [24] model for an
online demo. One can either choose from the available examples in the drop-down
menu or paste their own passage and questions. While this is a useful and interesting
system for testing the model in a user-friendly way, we created it primarily to be able to
focus on the wrongly answered samples.
Figure 4-5 QA system interface with attention highlight over candidate answers
The candidate answers are shown with blue highlights as per their confidence
values: the higher the confidence, the darker the highlight. The answer with the highest
confidence value is chosen as the predicted answer. We developed the system to show
the attention spread of the candidate answers, to understand what needs to be done to
improve the system. This led us to realize the importance of including the query
attention part as well as multilevel attention in the BiDAF model, as described in the
first section of this chapter.
CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions
Having done a thorough analysis of the current methods and state of the art in
Machine Comprehension for the QA task, and having developed systems on it, we have
gained a strong sense of what needs to be done to further improve QA models. After
observing the wrongly answered samples, we can see that the system is still unable to
encode meaning and is picking answers based on statistical patterns of answer
occurrence in the training examples.
From the state-of-the-art models and the ongoing research literature, it is easy to
conclude that more features need to be embedded to encode the meaning of words,
phrases, and sentences. The paper Reinforced Mnemonic Reader for Machine
Comprehension [6] encoded the POS and NER tags of words along with their word
and character embeddings, which gave them better results.
Figure 5-1 An English language semantic parse tree [26]
We have developed a method to encode the syntax parse tree of a sentence that
not only encodes the POS tags but also the relation of the word within the phrase and
the relation of the phrase within the whole sentence, in a hierarchical manner.
Finally, data augmentation is another way to get better results. One definite way
to reduce the errors would be to include in the training data samples similar to those
the system falters on in the dev set. One could generate examples similar to the failure
cases and include them in the training set for better prediction. Another approach
would be to train on similar and bigger datasets. Our models were trained on the
SQuAD dataset, but there are other datasets that pose a similar question answering
task, such as TriviaQA. We could augment the training set of SQuAD with that of
TriviaQA to have a more robust system that generalizes better and thus achieves
higher accuracy in predicting answer spans.
Conclusion
In this work, we explored the most fundamental techniques that have shaped the
current state of the art. We then proposed a minor architectural improvement over an
existing model. Furthermore, we developed two applications that use the base model:
first, we described how a chatbot application can be built on top of the QA system, and
second, we created a web interface where the model can be used with any passage
and question. This interface also shows the attention spread over the candidate
answers. While our effort to push the state of the art forward is ongoing, we strongly
believe that surpassing human-level accuracy on this task will pay high dividends for
society at large.
LIST OF REFERENCES
[1] Chen, Danqi. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U
[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.
[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension." arXiv preprint arXiv:1705.03551 (2017).
[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ Questions for Machine Comprehension of Text." arXiv preprint arXiv:1606.05250 (2016).
[5] "The Stanford Question Answering Dataset." 2018. rajpurkar.github.io. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/
[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced Mnemonic Reader for Machine Comprehension." CoRR abs/1705.02798 (2017).
[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.
[8] Mehrotra, Neetesh. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai/
[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.
[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
[11] Williams, Ronald J., and David Zipser. "A Learning Algorithm for Continually Running Fully Recurrent Neural Networks." Neural Computation 1, no. 2 (1989): 270-280.
[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long Short-Term Memory." Neural Computation 9, no. 8 (1997): 1735-1780.
[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling." arXiv preprint arXiv:1412.3555 (2014).
[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic Regularities in Continuous Space Word Representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.
[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global Vectors for Word Representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.
[16] "Word2Vec Tutorial - The Skip-Gram Model." 2016. Chris McCormick. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
[17] "[Paper Introduction] Bilingual Word Representations with Monolingual Quality in Mind." 2015. SlideShare. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind
[18] "Attention Mechanism." 2016. blog.heuritech.com. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism/
[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. 2014. "Memory Networks." arXiv preprint arXiv:1410.3916. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916
[20] Wang, Shuohang, and Jing Jiang. "Machine Comprehension Using Match-LSTM and Answer Pointer." arXiv preprint arXiv:1608.07905 (2016).
[21] Wang, Shuohang, and Jing Jiang. "Learning Natural Language Inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).
[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer Networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.
[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated Self-Matching Networks for Reading Comprehension and Question Answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.
[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional Attention Flow for Machine Comprehension." arXiv preprint arXiv:1611.01603 (2016).
[25] Clark, Christopher, and Matt Gardner. "Simple and Effective Multi-Paragraph Reading Comprehension." arXiv preprint arXiv:1710.10723 (2017).
[26] "8. Analyzing Sentence Structure." 2018. nltk.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata, India, and has had a strong interest in
computers since an early age. After high school, he earned a B.Sc. in computer science
from Ramakrishna Mission Residential College, Narendrapur, followed by an M.Sc. in
computer science from St. Xavier's College, Kolkata. He had a strong intuition and
interest for human-like learning systems and wanted to work in this area. He started
working at TCS Innovation Labs, Pune, on applications of Natural Language
Processing in education. As he was deeply passionate about learning systems that
mimic the human brain and learn like a human child does, he became increasingly
interested in Deep Learning and its applications. After working for a year, he went on to
pursue a Master of Science degree in computer science at the University of Florida,
Gainesville. His academic interests have been focused on Deep Learning and Natural
Language Processing, and he has been working on Machine Reading Comprehension
since the summer of 2017.
Convolutional Neural Network
The first wave of deep learning's success was brought by Convolutional Neural
Networks (CNN) [9], the technique used by the winning team of the ImageNet
competition in 2012. CNNs are deep artificial neural networks (ANN) that can be used
to classify images, cluster them by similarity, and perform object recognition within
scenes. They can be used to detect and identify faces, people, signs, or any other
visual data.
Figure 2-2 Convolutional Neural Network Architecture [10]
There are primarily four operations in a standard CNN model (as shown in the figure above):
1. Convolution: The primary purpose of convolution in the ConvNet above is to extract features from the input image. The spatial relationships between pixels, i.e., the image features, are preserved and learned by the convolution using small squares of input data.
2. Non-Linearity (ReLU): The Rectified Linear Unit (ReLU) is a non-linear operation applied element-wise to each pixel. It replaces the negative pixel values in the feature map with zero.
3. Pooling or Sub-Sampling: Spatial pooling reduces the dimensionality of each feature map while retaining the most important information. For max pooling, the largest value in each square window is kept and the rest are dropped. Other types of pooling are average, sum, etc.
4. Classification (Fully Connected Layer): The fully connected layer is a traditional Multi-Layer Perceptron, as described before, that uses a softmax activation function in the output layer. The high-level features of the image are encoded by the convolutional and pooling layers and then fed to the fully connected layer, which uses these features to classify the input image into various classes based on the training dataset [10].
When a new image is fed into the CNN model, all the above-mentioned steps are
carried out (forward propagation) and a probability distribution is obtained over the set
of output classes. With a large enough training dataset, the network will learn and
generalize well enough to classify new images into their correct classes.
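The four operations above can be sketched in a few lines of NumPy rather than a deep learning framework; the function names and toy shapes are our own, for illustration only.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Convolution: slide a small square filter over the image, producing a feature map."""
    h, w = kernel.shape
    out = np.zeros((image.shape[0] - h + 1, image.shape[1] - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * kernel)
    return out

def relu(x):
    """Non-linearity: replace negative values in the feature map by zero."""
    return np.maximum(x, 0)

def max_pool(feature_map, size=2):
    """Pooling: keep only the largest value in each non-overlapping square window."""
    h, w = feature_map.shape[0] // size, feature_map.shape[1] // size
    trimmed = feature_map[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))

def softmax(logits):
    """Output activation: turn class scores into a probability distribution."""
    e = np.exp(logits - logits.max())
    return e / e.sum()
```

A real CNN stacks many such convolution/ReLU/pooling stages with learned kernels before the fully connected classifier; this sketch only shows the individual operations.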
Recurrent Neural Networks (RNN)
Whenever we want to predict or encode sequential data, RNNs [11] are our go-to
method. RNNs perform the same task for every element of a sequence, where the
output for each element depends on the previous computations, hence the recurrence.
In practice, RNNs are unable to retain long-term dependencies and can look back only
a few steps because of the vanishing gradient problem.
h_t = \tanh(W h_{t-1} + U x_t)  (2-1)
A solution to the dependency problem is to use gated cells such as LSTM [12] or
GRU [13]. These cells pass important information on to the next cells while ignoring
non-important information. The gated units in a GRU block are:

• Update Gate - computed from the current input and the previous hidden state:

z_t = \sigma(W_z x_t + U_z h_{t-1})  (2-2)

• Reset Gate - calculated similarly, but with different weights:

r_t = \sigma(W_r x_t + U_r h_{t-1})  (2-3)

• New memory content:

\tilde{h}_t = \tanh(W x_t + r_t \odot U h_{t-1})  (2-4)

If a reset gate unit is ~0, then the previous memory is ignored and only the new
information is kept.

The final memory at the current time step combines the previous and current time steps:

h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t  (2-5)
While the GRU is computationally efficient, the LSTM is a more general case with
three gates:

• Input Gate - what new information to add to the current cell state.

• Forget Gate - how much information from previous states should be kept.

• Output Gate - how much information should be sent to the next states.

Just as in the GRU, the current cell state is a sum of the previous cell state,
weighted by the forget gate, and the new value, weighted by the input gate. Based on
the cell state, the output gate regulates the final output.
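The GRU gate equations (2-2) to (2-5) can be sketched directly in NumPy; the weight matrices here are toy parameters rather than trained values, and the helper names are our own.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    """One GRU step following equations (2-2) to (2-5)."""
    z = sigmoid(Wz @ x_t + Uz @ h_prev)              # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)              # reset gate
    h_tilde = np.tanh(W @ x_t + r * (U @ h_prev))    # new memory content
    return (1 - z) * h_prev + z * h_tilde            # final memory at time t
```

Note that with r close to 0 the term U @ h_prev is suppressed, which is exactly the "previous memory is ignored" behavior described above.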
Word Embedding
Computations and gradients can be applied to numbers, not to words or letters.
So we first need to convert words into their corresponding numerical representation
before feeding them into a deep learning model. In general, there are two types of word
embedding: frequency-based (which comprises count vectors, tf-idf, and co-occurrence
vectors) and prediction-based. With frequency-based embeddings, the order of the
words is not preserved, and they work as a bag-of-words model. With prediction-based
models, the order or locality of words is taken into consideration to generate the
numerical representation of a word. Within the prediction-based category there are two
fundamental techniques, Continuous Bag of Words (CBOW) and the Skip-Gram model,
which form the basis for word2vec [14] and GloVe [15].
The basic intuition behind word2vec is that if two different words have very similar
"contexts" (that is, the words that are likely to appear around them), then the model will
produce similar vectors for those words. Conversely, if two word vectors are similar,
then the network will produce similar context predictions for those two words. For
example, synonyms like "intelligent" and "smart" would have very similar contexts, and
related words like "engine" and "transmission" would probably have similar contexts as
well [16]. Plotting the word vectors learned by word2vec over a large corpus, we can
find some very interesting relationships between words.
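This notion of "similar vectors" is usually measured by cosine similarity. A minimal sketch, using hand-made 3-dimensional toy vectors rather than trained embeddings (real word2vec vectors typically have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(u, v):
    """Similarity of two word vectors: 1.0 means identical direction."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors standing in for learned embeddings; values are illustrative only.
vectors = {
    "intelligent": np.array([0.9, 0.1, 0.0]),
    "smart":       np.array([0.85, 0.15, 0.05]),
    "banana":      np.array([0.0, 0.2, 0.9]),
}
```

With trained embeddings, the same function recovers the synonym relationships described above: the "intelligent"/"smart" pair scores far higher than unrelated pairs.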
Figure 2-3 Semantic relation between words in vector space [17]
Attention Mechanism
We as humans pay attention to things that are important or relevant in a context.
For example, when asked a question about a passage, we try to find the part of the
passage most relevant to the question and then reason from our understanding of that
part. The same idea applies to the attention mechanism in Deep Learning: it is used to
identify the specific parts of a given context to which the current question is relevant.
Formally put, the technique takes n arguments y_1, ..., y_n (in our case the vectors
encoding the passage words) and a question representation q. It returns a vector z
which is supposed to be the "summary" of the y_i, focusing on information linked to the
question q. More formally, it returns a weighted arithmetic mean of the y_i, where the
weights are chosen according to the relevance of each y_i given the context c [18].
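The weighted-mean formulation above can be sketched as follows; the plain dot product used for relevance is a stand-in for the learned scoring function of real attention models.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def attend(y, q):
    """Return z, a weighted arithmetic mean of the rows y_i.

    y: (n, d) array of context vectors; q: (d,) query vector.
    The weights are a softmax over a relevance score of each y_i to q
    (here a simple dot product, for illustration)."""
    weights = softmax(y @ q)    # one weight per y_i, summing to 1
    return weights @ y          # weighted mean: the "summary" vector z
```

Rows of y that align strongly with q dominate the summary, which is exactly the "focus on the relevant part of the passage" behavior described above.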
Figure 2-4 Attention Mechanism flow [18]
Memory Networks
While Convolutional Neural Networks and Recurrent Neural Networks do capture
how we form our visual and sequential memories, their memory (encoded by hidden
states and weights) is typically too small and not compartmentalized enough to
accurately remember facts from the past (knowledge is compressed into dense
vectors) [19].
Deep Learning needed a methodology that preserves memories as they are, so
that they are not lost in generalization and recalling exact words or sequences of
events remains possible, something computers are already good at. This effort led to
Memory Networks [19], published at ICLR 2015 by Facebook AI Research.
The paper provides a basic framework to store, augment, and retrieve memories
while working seamlessly with a Recurrent Neural Network architecture. The memory
network consists of a memory m (an array of objects indexed by m_i) and four
(potentially learned) components I, G, O, and R, as follows:
I (input feature map): converts the incoming input to the internal feature
representation, either a sparse or dense feature vector like those from word2vec or
GloVe.

G (generalization): updates old memories given the new input. The authors call
this generalization because the network has an opportunity to compress and generalize
its memories at this stage for some intended future use.

O (output feature map): produces a new output (in the feature representation
space) given the new input and the current memory state. This component is
responsible for performing inference. In a question answering system, this part selects
the candidate sentences (which might contain the answer) from the story
(conversation) so far.

R (response): converts the output into the desired response format, for example a
textual response or an action. In the QA system described, this component finds the
desired answer and then converts it from the feature representation to the actual word.
This is a fully supervised model, meaning that all the candidate sentences from
which the answer can be found are marked during the training phase; this can also be
termed "hard attention."
The authors tested the QA system on various literature, including Lord of the
Rings.
Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]
CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART
There has been rapid progress since the release of the SQuAD dataset, and
currently the best ensemble models are close to human-level accuracy in machine
comprehension. This is due to various ingenious methods which solve some of the
problems of previous approaches: out-of-vocabulary tokens were handled using
character embeddings, long-term dependencies within the context passage were
addressed using self-attention, and many other techniques such as contextualized
vectors, history of words, and attention flow were introduced. In this chapter we will
look at some of the most important models that were fundamental to the progress of
Question Answering.
Machine Comprehension Using Match-LSTM and Answer Pointer
In this paper [20], the authors propose an end-to-end neural architecture for the
QA task. The architecture is based on match-LSTM [21], a model they previously
proposed for textual entailment, and Pointer Net [22], a sequence-to-sequence model
proposed by Vinyals et al. (2015) to constrain the output tokens to be from the input
sequences. The model consists of an LSTM preprocessing layer, a match-LSTM layer,
and an Answer Pointer layer.

We are given a piece of text, which we refer to as a passage, and a question
related to the passage. The passage is represented by a matrix P, where P is the length
(number of tokens) of the passage and d is the dimensionality of the word embeddings.
Similarly, the question is represented by a matrix Q, where Q is the length of the
question. Our goal is to identify a subsequence from the passage as the answer to the
question.
Figure 3-1 Match-LSTM Model Architecture [20]
LSTM Preprocessing Layer: They use a standard one-directional LSTM
(Hochreiter & Schmidhuber, 1997) [12] to process the passage and the question
separately.
Match-LSTM Layer: They apply the match-LSTM model, originally proposed for
textual entailment, to the machine comprehension problem by treating the question as
a premise and the passage as a hypothesis. The match-LSTM sequentially goes
through the passage. At position i of the passage, it first uses the standard word-by-word
attention mechanism to obtain the attention weight vector \alpha_i as follows:

\vec{G}_i = \tanh(W^q H^q + (W^p h_i^p + W^r h_{i-1}^r + b^p) \otimes e_Q), \quad
\vec{\alpha}_i = \mathrm{softmax}(w^T \vec{G}_i + b \otimes e_Q)  (3-1)

where W^q, W^p, W^r, b^p, w, and b are parameters to be learned.
Answer Pointer Layer: The Answer Pointer (Ans-Ptr) layer is motivated by the
Pointer Net introduced by Vinyals et al. (2015) [22]. The boundary model produces only
the start token and the end token of the answer, and all the tokens between these two
in the original passage are then considered to be the answer.

When this paper was released in November 2016, the match-LSTM method was
the state of the art in Question Answering and was at the top of the leaderboard for the
SQuAD dataset.
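One simple way to decode the boundary model's output is to choose the (start, end) pair that maximizes the product of the start and end probabilities with start <= end. This sketch is our own illustration of that decoding step, not the authors' code; the optional `max_len` cap on answer length is likewise an illustrative addition.

```python
def best_span(p_start, p_end, max_len=None):
    """Pick (start, end) maximizing p_start[i] * p_end[j] subject to i <= j.

    p_start, p_end: per-token probabilities of being the answer's start/end.
    max_len: optional cap on the answer span length (illustrative)."""
    n = len(p_start)
    best, best_score = (0, 0), -1.0
    for i in range(n):
        j_max = n if max_len is None else min(n, i + max_len)
        for j in range(i, j_max):       # only spans that end at or after the start
            score = p_start[i] * p_end[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best
```

The brute-force double loop is quadratic in passage length; it can be made linear by tracking the running best start probability, but the quadratic form makes the constraint i <= j explicit.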
R-NET: Machine Reading Comprehension with Self-Matching Networks
In this model [23], the question and passage are first processed separately by a
bidirectional recurrent network (Mikolov et al., 2010). They then match the question
and passage with gated attention-based recurrent networks, obtaining a question-aware
representation of the passage. On top of that, they apply self-matching attention to
aggregate evidence from the whole passage and refine the passage representation,
which is then fed into the output layer to predict the boundary of the answer span.
Question and Passage Encoding: First, the words are converted to their
respective word-level and character-level embeddings. The character-level embeddings
are generated by taking the final hidden states of a bi-directional recurrent neural
network (RNN) applied to the embeddings of the characters in the token. Such
character-level embeddings have been shown to be helpful in dealing with
out-of-vocabulary (OOV) tokens.

They then use a bi-directional RNN to produce new representations u_t^q and
u_t^p of all the words in the question and passage respectively.
Figure 3-2 The task of Question Answering [23]
Gated Attention-Based Recurrent Networks: They use a variant of attention-based
recurrent networks with an additional gate to determine the importance of information
in the passage with regard to a question. Different from the gates in LSTM or GRU, the
additional gate is based on the current passage word and its attention-pooling vector of
the question, which focuses on the relation between the question and the current
passage word. The gate effectively models the phenomenon that in reading
comprehension and question answering only parts of the passage are relevant to the
question, and this is utilized in subsequent calculations.
Self-Matching Attention: The previous step generates the question-aware
passage representation to highlight the important parts of the passage. One problem
with such a representation is that it has very limited knowledge of context: an answer
candidate is often oblivious to important cues in the passage outside its surrounding
window. To address this problem, the authors propose directly matching the
question-aware passage representation against itself. It dynamically collects evidence
from the whole passage for each word in the passage and encodes the evidence
relevant to the current passage word, together with its matching question information,
into the passage representation.
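The core of self-matching, attending the passage representation over itself, can be sketched as follows. The plain dot-product scoring is a stand-in for R-NET's learned gated attention, so this shows only the "collect evidence from the whole passage" structure, not the full model.

```python
import numpy as np

def self_match(p):
    """Self-matching attention sketch: each passage vector attends over the
    whole passage, aggregating evidence beyond its surrounding window.

    p: (n, d) array of question-aware passage vectors."""
    scores = p @ p.T                                   # (n, n) passage-vs-passage relevance
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights = e / e.sum(axis=1, keepdims=True)         # row-wise softmax
    return weights @ p                                 # evidence-aggregated representation
```

In R-NET the aggregated vectors are further combined with the inputs through a gate and fed into another recurrent layer; the sketch stops at the attention step itself.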
Output Layer: They use the same method as Wang & Jiang (2016b) [20] and use
pointer networks (Vinyals et al., 2015) [22] to predict the start and end positions of the
answer. In addition, they use attention-pooling over the question representation to
generate the initial hidden vector for the pointer network [23].
When the R-NET model first appeared on the leaderboard in March 2017, it was
at the top with a 72.3 Exact Match and an 80.7 F1 score.
Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
BiDAF [24] is a hierarchical multi-stage architecture for modeling the
representations of the context paragraph at different levels of granularity. BiDAF
includes character-level, word-level, and contextual embeddings, and uses
bi-directional attention flow to obtain a query-aware context representation. Its
attention layer is not used to summarize the context paragraph into a fixed-size vector.
Instead, the attention is computed for every time step, and the attended vector at each
time step, along with the representations from previous layers, can flow through to the
subsequent modeling layer. This reduces the information loss caused by early
summarization [24].
Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]
Their machine comprehension model is a hierarchical multi-stage process and
consists of six layers:
1. Character Embedding Layer: maps each word to a vector space using character-level CNNs.
2. Word Embedding Layer: maps each word to a vector space using a pre-trained word embedding model.
3. Contextual Embedding Layer: utilizes contextual cues from surrounding words to refine the embeddings of the words. These first three layers are applied to both the query and the context. They use an LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words, placing an LSTM in both directions and concatenating the outputs of the two LSTMs.
4. Attention Flow Layer: couples the query and context vectors and produces a set of query-aware feature vectors for each word in the context.
5. Modeling Layer: employs a Recurrent Neural Network to scan the context. The input to the modeling layer is G, which encodes the query-aware representations of the context words. The output of the modeling layer captures the interaction among the context words conditioned on the query. They use two layers of bi-directional LSTM with an output size of d for each direction. Hence a matrix M \in R^{2d \times T} is obtained, which is passed on to the output layer to predict the answer.
6. Output Layer: provides an answer to the query. They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices under the predicted distributions, averaged over all examples [25].
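The training loss of layer 6 above can be written compactly; this is a NumPy sketch of the loss formula, with our own function and argument names, rather than the authors' implementation.

```python
import numpy as np

def span_nll(start_probs, end_probs, true_starts, true_ends):
    """Sum of the negative log probabilities of the true start and end indices,
    averaged over all examples.

    start_probs, end_probs: (batch, n) arrays of predicted distributions.
    true_starts, true_ends: (batch,) integer arrays of gold indices."""
    batch = np.arange(len(true_starts))
    nll = -np.log(start_probs[batch, true_starts]) \
          - np.log(end_probs[batch, true_ends])
    return float(nll.mean())
```

The loss is 0 only when the model puts probability 1 on both gold indices; in practice a small epsilon is often added inside the log for numerical stability.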
In a further variation of the above work [25], they add a self-attention layer after
the bi-attention layer to further improve the results. The architecture of that model is
shown below.

Figure 3-4 BiDAF model with an additional self-attention layer [25]
Summary
In this chapter we reviewed the methods that are fundamental to the state of the
art in Machine Comprehension and the task of Question Answering. We have come
close to human-level accuracy, and this is due to incremental developments over
previous models. As we saw, out-of-vocabulary (OOV) tokens were handled by using
character embeddings, long-term dependencies within the context passage were
addressed using self-attention, and many other techniques, such as contextualized
vectors and attention flow, were employed to get better results. In the next chapter we
will see how we can build on these models and develop them further.
CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING
Although the advancements in Question Answering systems since the release of
the SQuAD dataset have been impressive, and the results are getting close to
human-level accuracy, they are far from fool-proof. The models still make mistakes that
would be obvious to a human. For example:

Passage: The Panthers used the San Jose State practice facility and stayed at
the San Jose Marriott. The Broncos practiced at Florida State Facility and stayed at the
Santa Clara Marriott.

Question: At what university's facility did the Panthers practice?

Actual Answer: San Jose State

Predicted Answer: Florida State Facility
To find out what leads to the wrong predictions, we wanted to see the attention
weights associated with such an example. We plotted the passage-question heat map,
a 2D matrix where the intensity of each cell signifies the similarity between a passage
word and a question word. For the above example, we found that while certain words of
the question are given high weight, other parts are not. The words "At", "facility", and
"practice" are given high attention, but "Panthers" does not receive high attention. If it
had, the system would have predicted "San Jose State" as the right answer. To solve
this issue, we analyzed the base BiDAF model and proposed adding two things:

1. Bi-Attention and Self-Attention over the query
2. A second level of attention over the outputs of (Bi-Attention + Self-Attention) from both the context and the query
Multi-Attention BiDAF Model
Our comprehension model is a hierarchical multi-stage process and consists of
the following layers:
1. Embedding: Just as in other models, we embed words using pretrained word vectors. We also embed the characters of each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to get character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training.
2. Pre-Process: A shared bi-directional GRU (Cho et al., 2014) is used to map the question and passage embeddings to context-aware embeddings.
3. Context Attention: The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al., 2016) is used to build a query-aware context representation. Let h_i be the vector for context word i, q_j be the vector for question word j, and n_q and n_c be the lengths of the question and context respectively. We compute attention between context word i and question word j as
a_ij = w1 · h_i + w2 · q_j + w3 · (h_i ⊙ q_j) (4-1)
where w1, w2, and w3 are learned vectors and ⊙ is element-wise multiplication. We then compute an attended vector c_i for each context token as
p_ij = exp(a_ij) / Σ_j exp(a_ij),  c_i = Σ_j p_ij q_j (4-2)
We also compute a query-to-context vector q_c
m_i = max_j a_ij,  p_i = exp(m_i) / Σ_i exp(m_i),  q_c = Σ_i p_i h_i (4-3)
The final vector computed for each token is built by concatenating h_i, c_i, h_i ⊙ c_i, and q_c ⊙ h_i. In our model, we subsequently pass the result through a linear layer with ReLU activations.
4. Context Self-Attention: Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU. Then we apply the same attention mechanism, only now between the passage and itself; in this case we do not use query-to-context attention, and we set a_ij = −∞ if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so its output is additionally summed with the input.
5. Query Attention: This works the same way as the context attention layer, except that we calculate the weighted sum of the context words for each query word, so the output length equals the number of query words. We then calculate context-to-query attention, analogous to the query-to-context attention of the context attention layer.
Figure 4-1 The modified BiDAF model with multilevel attention
6. Query Self-Attention: This is done the same way as the context self-attention layer, but on the output of the Query Attention layer.
7. Context-Query Bi-Attention + Self-Attention: The outputs of the Context Self-Attention and Query Self-Attention layers are taken as input, and the same bi-attention and self-attention process is applied to these inputs.
8. Prediction: In the last layer of our model, a bidirectional GRU is applied, followed by a linear layer that computes answer start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log likelihood of selecting the correct start and end tokens.
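The attention scoring in layers 3 and 4 can be sketched in plain Python; the learned vectors w1, w2, w3 carry placeholder values here, and real models run this over high-dimensional batched tensors:

```python
import math

# Sketch of the layer 3-4 attention: the trilinear score over learned
# vectors w1, w2, w3 (placeholders here), softmax normalization, and the
# -inf self-match mask.

def trilinear_score(h, q, w1, w2, w3):
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return dot(w1, h) + dot(w2, q) + dot(w3, [x * y for x, y in zip(h, q)])

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(context, query, w1, w2, w3):
    """One attended query vector per context token."""
    out = []
    for h in context:
        weights = softmax([trilinear_score(h, q, w1, w2, w3) for q in query])
        out.append([sum(w * q[d] for w, q in zip(weights, query))
                    for d in range(len(query[0]))])
    return out

def self_attend(context, w1, w2, w3):
    """Same scoring between the passage and itself, masking i == j."""
    out = []
    for i, h in enumerate(context):
        scores = [float("-inf") if i == j else trilinear_score(h, c, w1, w2, w3)
                  for j, c in enumerate(context)]
        weights = softmax(scores)
        out.append([sum(w * c[d] for w, c in zip(weights, context))
                    for d in range(len(context[0]))])
    return out

context = [[1.0, 0.0], [0.0, 1.0]]
query = [[1.0, 0.0]]
wv = [0.5, 0.5]
attended = attend(context, query, wv, wv, wv)
self_att = self_attend(context, wv, wv, wv)
```

The −∞ mask makes each token's softmax weight on itself exactly zero, so the self-attention output for a token is built only from the other tokens.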
Having carried out this modification, we were able to fix the failing example we
started with: the multilevel attention model gives the correct output, 'San Jose
State'. We also achieved slightly better scores than the original model, with an
F1 score of 85.44 on the SQuAD dev set.
Chatbot Design Using a QA System
Designing a perfect chatbot that passes the Turing test is a fundamental goal of
Artificial Intelligence. Although we are many orders of magnitude away from
achieving such a goal, domain-specific tasks can be solved with chatbots built
from current technology. We started with a similar objective in mind, i.e., to
design a domain-specific chatbot and then generalize to other areas once the
first domain-specific objective is achieved robustly. This led us to the
fundamental problem of machine comprehension and subsequently to the task of
question answering. Having achieved some degree of success with QA systems, we
looked back to see whether we could apply our newly acquired knowledge to the
task of designing chatbots.
The chatbots built with today's technologies mostly use handcrafted techniques,
such as template matching, which requires anticipating all the possible ways a
user may articulate his requirements and a conversation may unfold. This takes
many man-hours when designing a domain-specific system and is still very error
prone. In this section
we propose a general Chatbot design that would make the designing of a domain
specific chatbot very easy and robust at the same time
Every domain specific chatbot needs to obtain a set of information from the user
and show some results based on the user specific information obtained The traditional
chatbots use template matching and keywords lookup to determine if the user has
provided the required information Our idea is to use the Question Answering system in
the backend to extract out the required information from whatever the user has typed
until this point of the conversation The information to be extracted can be posed in the
form of a set of questions and the answers obtained from those questions can be used
as the parameters to supply the relevant information to the user
We chose flight reservation as our chatbot domain. Our goal was to extract the
required information from the user to be able to show him the available flights
as per his requirements.
Figure 4-2 Flight reservation chatbotrsquos chat window
For a flight reservation task the booking agent needs to know the origin city
destination city and date of travel at minimum to be able to show the available flights
Optional information includes the number of tickets, the passenger's name, one-way
or round trip, etc.
The minimalistic conversation with the user through the chat window would be as
shown above. We had a platform called OneTask on which we wanted to implement our
chatbot. The chat interface within the OneTask system looks as follows:
Figure 4-3 Chatbot within OneTask system
The chat system works as follows:
1. Initiation: The user opens the chat window, which starts a session with the chatbot. The chatbot asks 'How may I help you?'
2. User Reply: The user may reply with none of the required information for flight booking, or may reply with multiple pieces of information in the same message.
3. User Reply Parsing: The conversation up to this point is treated as a passage, and the internal questions are run on this passage. The questions that are run are:
Where do you want to go?
From where do you want to leave?
When do you want to depart?
4. Parsed Responses from the QA Model: After running the questions on the QA system with the given conversation, answers are obtained for all of them, along with their corresponding confidence values. Since answers will be produced even if the required question has not actually been answered up to this point in the conversation, the confidence values play a crucial role in determining the validity of an answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly; a confidence of 2 to 10 signifies that it may have been answered, but we should verify with the user for correctness; and any answer with confidence below 2 is discarded.
5. Asking Remaining Questions Iteratively: After the parsing, we check whether any of the required questions are still unanswered. If so, the chatbot asks the remaining questions, and the process from steps 3-4 is carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request.
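The confidence-band logic of steps 3-5 can be sketched as follows; run_qa is a stand-in for the real QA model, and the canned answers and confidence values are made up for illustration:

```python
# Sketch of the slot-filling loop from steps 3-5. run_qa stands in for the
# real QA model; the canned answers and confidences are made up.

SLOT_QUESTIONS = {
    "destination": "Where do you want to go?",
    "origin": "From where do you want to leave?",
    "date": "When do you want to depart?",
}

def classify(confidence):
    """Map a confidence value to an action using the bands from the text."""
    if confidence > 10:
        return "accept"
    if confidence >= 2:
        return "verify"            # confirm this answer with the user
    return "discard"

def fill_slots(conversation, run_qa):
    filled, pending = {}, []
    for slot, q in SLOT_QUESTIONS.items():
        answer, conf = run_qa(conversation, q)
        action = classify(conf)
        if action == "accept":
            filled[slot] = answer
        elif action == "verify":
            pending.append((slot, answer))
    return filled, pending         # unanswered slots are asked next turn

def fake_qa(conversation, q):      # mocked model for illustration only
    canned = {
        "Where do you want to go?": ("Boston", 14.2),
        "From where do you want to leave?": ("Miami", 5.1),
        "When do you want to depart?": ("tomorrow", 0.7),
    }
    return canned[q]

filled, pending = fill_slots("I need a flight to Boston from Miami", fake_qa)
```

Slots that end up neither filled nor pending are simply asked again on the next turn, which is the iterative loop of step 5.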
Figure 4-4 The Flow diagram of the Flight booking Chatbot system
Online QA System and Attention Visualization
To be able to test out various examples, we used the BiDAF [24] model for an
online demo. One can either choose from the available examples in the drop-down
menu or paste in their own passage and questions. While this is a useful and
interesting system for testing the model in a user-friendly way, we created it
primarily to be able to focus on the wrong samples.
Figure 4-5 QA system interface with attention highlight over candidate answers
The candidate answers are shown with blue highlights as per their confidence
values: the higher the confidence, the darker the answer. The candidate with the
highest confidence value is chosen as the predicted answer. We developed the
system to show the attention spread of the candidate answers, to understand what
needs to be done to improve the system. This led us to realize the importance of
including the query attention part, as well as multilevel attention, in the BiDAF
model, as described in the first section of this chapter.
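The shading rule just described (darker for higher confidence) can be sketched as follows; the normalization to opacities, the floor value, and the numbers themselves are our illustrative assumptions rather than the interface's exact implementation:

```python
# Sketch of the highlight shading: scale candidate confidences so the
# strongest answer is drawn darkest. The floor keeps weak candidates
# faintly visible; both the rule and the values are illustrative.

def highlight_opacities(confidences, floor=0.15):
    top = max(confidences)
    return [floor + (1 - floor) * c / top for c in confidences]

candidates = {"San Jose State": 12.0, "Florida State Facility": 6.0}
opacities = highlight_opacities(list(candidates.values()))
prediction = max(candidates, key=candidates.get)  # highest confidence wins
```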
CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions
Having done a thorough analysis of the current methods and the state of the art
in machine comprehension and the QA task, and having developed systems on top of
them, we have gained a strong sense of what needs to be done to further improve
QA models. After observing the wrong samples, we can see that the system is still
unable to encode meaning and is picking answers based on statistical patterns of
answer occurrence in the training examples.
Going by the state-of-the-art models and the ongoing research literature, it is
easy to conclude that more features need to be embedded to encode the meaning of
words, phrases, and sentences. The Reinforced Mnemonic Reader for Machine
Comprehension [6] encoded the POS and NER tags of words along with their word and
character embeddings, which gave better results.
Figure 5-1 An English language semantic parse tree [26]
We have developed a method to encode the syntactic parse tree of a sentence that
encodes not only the POS tags but also the relation of each word within its
phrase and the relation of each phrase within the whole sentence, in a
hierarchical manner.
Finally, data augmentation is another route to better results. One definite way
to reduce errors would be to include in the training data samples similar to
those the system falters on in the dev set: one could generate examples similar
to the failure cases and add them to the training set for better prediction.
Another approach would be to train on similar and bigger datasets. Our models
were trained on the SQuAD dataset, but other datasets, such as TriviaQA, pose a
similar question answering task. We could augment the SQuAD training set with
TriviaQA to build a more robust system that generalizes better and thus predicts
answer spans more accurately.
Conclusion
In this work we explored the most fundamental techniques that have shaped the
current state of the art. We then proposed a minor architectural improvement over
an existing model. Furthermore, we developed two applications that use the base
model: first, we discussed how a chatbot application can be built using the QA
system, and second, we created a web interface where the model can be run on any
passage and question. This interface also shows the attention spread over the
candidate answers. While our effort to push the state of the art forward is
ongoing, we strongly believe that surpassing human-level accuracy on this task
will pay high dividends for society at large.
LIST OF REFERENCES
[1] Chen, Danqi. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U
[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. New Haven, Conn.: Yale Univ. Dept. of Computer Science, 1977.
[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension." arXiv preprint arXiv:1705.03551 (2017).
[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ Questions for Machine Comprehension of Text." arXiv preprint arXiv:1606.05250 (2016).
[5] "The Stanford Question Answering Dataset." 2018. rajpurkar.github.io. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/
[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced Mnemonic Reader for Machine Comprehension." CoRR abs/1705.02798 (2017).
[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.
[8] Mehrotra, Neetesh. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai
[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.
[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
[11] Williams, Ronald J., and David Zipser. "A Learning Algorithm for Continually Running Fully Recurrent Neural Networks." Neural Computation 1, no. 2 (1989): 270-280.
[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long Short-Term Memory." Neural Computation 9, no. 8 (1997): 1735-1780.
[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling." arXiv preprint arXiv:1412.3555 (2014).
[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic Regularities in Continuous Space Word Representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.
[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global Vectors for Word Representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.
[16] McCormick, Chris. "Word2Vec Tutorial - The Skip-Gram Model." 2016. mccormickml.com. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
[17] "[Paper Introduction] Bilingual Word Representations with Monolingual Quality in Mind." 2015. slideshare.net. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind
[18] "Attention Mechanism." 2016. blog.heuritech.com. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism/
[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. "Memory Networks." arXiv preprint arXiv:1410.3916 (2014). Accessed March 16, 2018. https://arxiv.org/abs/1410.3916
[20] Wang, Shuohang, and Jing Jiang. "Machine Comprehension Using Match-LSTM and Answer Pointer." arXiv preprint arXiv:1608.07905 (2016).
[21] Wang, Shuohang, and Jing Jiang. "Learning Natural Language Inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).
[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer Networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.
[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated Self-Matching Networks for Reading Comprehension and Question Answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 189-198. 2017.
[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional Attention Flow for Machine Comprehension." arXiv preprint arXiv:1611.01603 (2016).
[25] Clark, Christopher, and Matt Gardner. "Simple and Effective Multi-Paragraph Reading Comprehension." arXiv preprint arXiv:1710.10723 (2017).
[26] "8. Analyzing Sentence Structure." 2018. nltk.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata, India, and had a strong interest in
computers from an early age. After high school, he earned a BSc in computer
science from Ramakrishna Mission Residential College, Narendrapur, followed by an
MSc in computer science from St. Xavier's College, Kolkata. With a strong
intuition and interest for human-like learning systems, he wanted to work in this
area, and started at TCS Innovation Labs, Pune, on applications of Natural
Language Processing in education. Deeply passionate about learning systems that
mimic the human brain and learn like a human child does, he became increasingly
interested in Deep Learning and its applications. After working for a year, he
went on to pursue a Master of Science degree in computer science at the
University of Florida, Gainesville. His academic interests have been focused on
Deep Learning and Natural Language Processing, and he has been working on Machine
Reading Comprehension since the summer of 2017.
When a new image is fed into the CNN model, all the above-mentioned steps are
carried out (forward propagation) and a probability distribution is obtained over
the set of output classes. With a large enough training dataset, the network will
learn and generalize well enough to classify new images into their correct classes.
Recurrent Neural Networks (RNN)
Whenever we want to predict or encode sequential data, an RNN [11] is our go-to
method. RNNs perform the same task for every element of a sequence, where the
output for each element depends on the previous computations, hence the
recurrence. In practice, RNNs are unable to retain long-term dependencies and can
look back only a few steps because of the vanishing gradient problem:
h_t = tanh(W h_{t−1} + U x_t) (2-1)
A solution to the dependency problem is to use gated cells such as the LSTM [12]
or GRU [13]. These cells pass important information on to the next cells while
ignoring unimportant ones. The gated units in a GRU block are:
• Update gate – computed based on the current input and hidden state:
z_t = σ(W_z x_t + U_z h_{t−1}) (2-2)
• Reset gate – calculated similarly but with different weights:
r_t = σ(W_r x_t + U_r h_{t−1}) (2-3)
• New memory content:
h̃_t = tanh(W x_t + r_t ⊙ U h_{t−1}) (2-4)
If the reset gate unit is ~0, the previous memory is ignored and only the new
information is kept.
The final memory at the current time step combines the previous and current time
steps:
h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h̃_t (2-5)
While the GRU is computationally efficient, the LSTM is the more general cell,
with three gates as follows:
• Input gate – what new information to add to the current cell state
• Forget gate – how much information from previous states to keep
• Output gate – how much information should be sent to the next states
Just as in the GRU, the current cell state is the sum of the previous cell state,
weighted by the forget gate, and the new candidate value, weighted by the input
gate. Based on the cell state, the output gate regulates the final output.
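The GRU update described by Equations 2-2 to 2-5 can be sketched as a scalar toy; real cells use weight matrices over vectors, and the scalar weights here are placeholders:

```python
import math

# Toy scalar GRU cell following Equations 2-2 to 2-5. Real cells use weight
# matrices over vectors; the scalar weights here are placeholders.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, Wz, Uz, Wr, Ur, W, U):
    z = sigmoid(Wz * x + Uz * h_prev)             # update gate (2-2)
    r = sigmoid(Wr * x + Ur * h_prev)             # reset gate (2-3)
    h_cand = math.tanh(W * x + r * (U * h_prev))  # new memory content (2-4)
    return z * h_prev + (1 - z) * h_cand          # final memory (2-5)

# Forcing the reset gate toward 0 makes the candidate memory ignore the
# previous state, as noted in the text:
h = gru_step(1.0, 5.0, Wz=0.0, Uz=0.0, Wr=-100.0, Ur=0.0, W=1.0, U=1.0)
```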
Word Embedding
Computations and gradients can be applied to numbers, not to words or letters, so
we first need to convert words into corresponding numerical representations
before feeding them into a deep learning model. In general there are two types of
word embedding: frequency-based (count vectors, tf-idf, and co-occurrence
vectors) and prediction-based. With frequency-based embeddings the order of the
words is not preserved, and they work as a bag-of-words model, whereas
prediction-based models take the order, or locality, of words into consideration
to generate the numerical representation of a word. Within the prediction-based
category there are two fundamental techniques, Continuous Bag of Words (CBOW) and
the Skip-Gram model, which form the basis for word2vec [14] and GloVe [15].
The basic intuition behind word2vec is that if two different words have very
similar "contexts" (that is, the words likely to appear around them), then the
model will produce similar vectors for those words. Conversely, if two word
vectors are similar, the network will produce similar context predictions for
those two words. For example, synonyms like "intelligent" and "smart" have very
similar contexts, and related words like "engine" and "transmission" probably
have similar contexts as well [16]. Plotting the word vectors learned by word2vec
over a large corpus, we can find some very interesting relationships between words.
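A toy illustration of this intuition, with made-up 3-dimensional vectors (real word2vec/GloVe vectors have hundreds of dimensions learned from corpus statistics):

```python
import math

# Toy illustration of "similar contexts -> similar vectors": cosine
# similarity between made-up 3-d vectors. Real embeddings are learned.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

vectors = {
    "intelligent": [0.9, 0.8, 0.1],
    "smart":       [0.85, 0.75, 0.2],
    "engine":      [0.1, 0.2, 0.9],
}

sim_syn = cosine(vectors["intelligent"], vectors["smart"])    # near 1
sim_diff = cosine(vectors["intelligent"], vectors["engine"])  # much lower
```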
Figure 2-3 Semantic relation between words in vector space [17]
Attention Mechanism
We as humans pay attention to things that are important or relevant in a context.
For example, when asked a question about a passage, we try to find the part of
the passage most relevant to the question and then reason from our understanding
of that part. The same idea applies to the attention mechanism in deep learning:
it is used to identify the specific parts of a given context to which the current
question is relevant.
Formally put, the technique takes n arguments y_1, ..., y_n (in our case, the
words of the passage) and a question representation q. It returns a vector z
which is supposed to be the "summary" of the y_i, focusing on the information
linked to the question q. More formally, it returns a weighted arithmetic mean of
the y_i, where the weights are chosen according to the relevance of each y_i
given the context c [18].
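The weighted-mean view of attention can be sketched directly; scoring relevance with a dot product against q is one common choice, assumed here for illustration:

```python
import math

# Sketch of attention as a weighted mean: score each y_i against q (dot
# product here, one common choice), softmax the scores, and average.

def attention_summary(ys, q):
    scores = [sum(a * b for a, b in zip(y, q)) for y in ys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    z = [sum(w * y[d] for w, y in zip(weights, ys)) for d in range(len(q))]
    return z, weights

ys = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # three "passage word" vectors
q = [0.0, 1.0]                             # question representation
z, weights = attention_summary(ys, q)
# The two y_i aligned with q receive the largest weights, so the summary
# vector z leans toward their direction.
```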
Figure 2-4 Attention Mechanism flow [18]
Memory Networks
While Convolutional Neural Networks and Recurrent Neural Networks do capture how
we form our visual and sequential memories, their memory (encoded by hidden
states and weights) is typically too small and not compartmentalized enough to
accurately remember facts from the past (knowledge is compressed into dense
vectors) [19].
Deep learning needed a methodology that preserves memories as they are, so that
they are not lost in generalization and so that recalling exact words or
sequences of events remains possible, something computers are already good at.
This effort led to Memory Networks [19], published at ICLR 2015 by Facebook AI
Research.
This paper provides a basic framework to store, augment, and retrieve memories
while working seamlessly with a Recurrent Neural Network architecture. The memory
network consists of a memory m (an array of objects indexed by m_i) and four
(potentially learned) components I, G, O, and R, as follows:
I (input feature map): converts the incoming input to the internal feature
representation, either a sparse or dense feature vector such as those from
word2vec or GloVe.
G (generalization): updates old memories given the new input. The authors call
this generalization because the network has the opportunity to compress and
generalize its memories at this stage for some intended future use.
O (output feature map): produces a new output (in the feature representation
space) given the new input and the current memory state. This component is
responsible for performing inference; in a question answering system, it selects
the candidate sentences (which might contain the answer) from the story
(conversation) so far.
R (response): converts the output into the desired response format, for example a
textual response or an action. In the QA system described, this component finds
the desired answer and then converts it from the feature representation to the
actual word.
This model is fully supervised, meaning all the candidate sentences from which
the answer can be found are marked during the training phase; this can also be
termed 'hard attention'.
The authors tested the QA system on various works of literature, including Lord
of the Rings.
Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]
CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART
There has been rapid progress since the release of the SQuAD dataset, and
currently the best ensemble models are close to human-level accuracy in machine
comprehension. This is due to various ingenious methods that solve some of the
problems of earlier approaches: out-of-vocabulary tokens were handled using
character embeddings, long-term dependencies within the context passage were
addressed with self-attention, and many other techniques, such as contextualized
vectors, history of words, and attention flow, were introduced. In this section
we will look at some of the most important models that were fundamental to the
progress of question answering.
Machine Comprehension Using Match-LSTM and Answer Pointer
In this paper [20] the authors propose an end-to-end neural architecture for the
QA task The architecture is based on match-LSTM [21] a model they proposed for
textual entailment and Pointer Net [22] a sequence-to-sequence model proposed by
Vinyals et al (2015) to constrain the output tokens to be from the input sequences
The model consists of an LSTM preprocessing layer a match-LSTM layer and an
Answer Pointer layer
We are given a piece of text, which we refer to as a passage, and a question
related to the passage. The passage is represented by a matrix P ∈ R^{d×P}, where
P is the length (number of tokens) of the passage and d is the dimensionality of
the word embeddings. Similarly, the question is represented by a matrix
Q ∈ R^{d×Q}, where Q is the length of the question. Our goal is to identify a
subsequence from the passage as the answer to the question.
Figure 3-1 Match-LSTM Model Architecture [20]
LSTM Preprocessing Layer. They use a standard one-directional LSTM (Hochreiter &
Schmidhuber, 1997) to process the passage and the question separately.
Match LSTM Layer They applied the match-LSTM model proposed for textual
entailment to their machine comprehension problem by treating the question as a
premise and the passage as a hypothesis The match-LSTM sequentially goes through
the passage At position i of the passage it first uses the standard word-by-word
attention mechanism to obtain attention weight vector as follows
G_i = tanh(W^q H^q + (W^p h_i^p + W^r h_{i−1}^r + b^p) ⊗ e_Q)
α_i = softmax(w^T G_i + b ⊗ e_Q) (3-1)
where the matrices W^q, W^p, W^r, the vectors b^p, w, and the scalar b are parameters to be learned, and (· ⊗ e_Q) repeats the vector on its left across the Q question positions
Answer Pointer Layer The Answer Pointer (Ans-Ptr) layer is motivated by the
Pointer Net introduced by Vinyals et al (2015) [22] The boundary model produces only
the start token and the end token of the answer and then all the tokens between these
two in the original passage are considered to be the answer
When this paper was released back in November 2016, the match-LSTM method was the
state of the art in question answering systems and sat at the top of the
leaderboard for the SQuAD dataset.
R-NET Matching Reading Comprehension with Self-Matching Networks
In this model [23] first the question and passage are processed by a
bidirectional recurrent network (Mikolov et al 2010) separately They then match the
question and passage with gated attention-based recurrent networks obtaining
question-aware representation for the passage On top of that they apply self-matching
attention to aggregate evidence from the whole passage and refine the passage
representation which is then fed into the output layer to predict the boundary of the
answer span
Question and passage encoding First the words are converted to their
respective word-level embeddings and character level embeddings The character-level
embeddings are generated by taking the final hidden states of a bi-directional recurrent
neural network (RNN) applied to embeddings of characters in the token Such
character-level embeddings have been shown to be helpful to deal with out-of-vocab
(OOV) tokens
They then use a bi-directional RNN to produce new representations u^Q_t and u^P_t
of all words in the question and passage, respectively.
Figure 3-2 The task of Question Answering [23]
Gated Attention-based Recurrent Networks. They use a variant of attention-based
recurrent networks with an additional gate to determine the importance of
information in the passage with regard to a question. Different from the gates in
an LSTM or GRU, the additional gate is based on the current passage word and its
attention-pooling vector of the question, which focuses on the relation between
the question and the current passage word. The gate effectively models the
phenomenon that, in reading comprehension and question answering, only parts of
the passage are relevant to the question, and only that information is carried
into subsequent calculations.
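The additional gate can be sketched as follows, with a placeholder weight matrix Wg standing in for the learned gate parameters:

```python
import math

# Sketch of R-NET's extra gate: concatenate the passage word vector u with
# its question attention-pooling vector c, compute a sigmoid gate from the
# concatenation, and scale the concatenation elementwise. Wg is a
# placeholder for the learned gate weights.

def gated_input(u, c, Wg):
    x = u + c                                    # concatenation [u_t, c_t]
    g = [1.0 / (1.0 + math.exp(-sum(w * xi for w, xi in zip(row, x))))
         for row in Wg]                          # g_t = sigmoid(Wg . x)
    return [gi * xi for gi, xi in zip(g, x)]     # gated [u_t, c_t]*

gated = gated_input([1.0], [2.0], [[0.1, 0.2], [0.3, -0.1]])
```

Because each gate value lies in (0, 1), irrelevant components of the passage word and its question summary are attenuated before the recurrent step.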
Self-Matching Attention From the previous step the question aware passage
representation is generated to highlight the important parts of the passage One
problem with such representation is that it has very limited knowledge of context One
answer candidate is often oblivious to important cues in the passage outside its
surrounding window To address this problem the authors propose directly matching
the question-aware passage representation against itself It dynamically collects
evidence from the whole passage for words in passage and encodes the evidence
relevant to the current passage word and its matching question information into the
passage representation
Output Layer They use the same method as Wang amp Jiang (2016b) and use
pointer networks (Vinyals et al 2015) to predict the start and end position of the
answer In addition they use an attention-pooling over the question representation to
generate the initial hidden vector for the pointer network [23]
When the R-NET model first appeared on the leaderboard in March 2017, it was at
the top with a 72.3 Exact Match and an 80.7 F1 score.
Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
BiDAF [24] is a hierarchical multi-stage architecture for modeling the
representations of the context paragraph at different levels of granularity.
BiDAF includes character-level, word-level, and contextual embeddings, and uses
bi-directional attention flow to obtain a query-aware context representation.
Their attention layer is not used to summarize the context paragraph into a
fixed-size vector. Instead, the attention is computed for every time step, and
the attended vector at each time step, along with the representations from
previous layers, can flow through to the subsequent modeling layer. This reduces
the information loss caused by early summarization [24].
Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]
Their machine comprehension model is a hierarchical multi-stage process and
consists of six layers
1 Character Embedding Layer maps each word to a vector space using character-level CNNs
2 Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model
3 Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embedding of the words These first three layers are applied to both the query and context They use a LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words We place an LSTM in both directions and concatenate the outputs of the two LSTMs
4 Attention Flow Layer couples the query and context vectors and produces a set of query aware feature vectors for each word in the context
5 Modeling Layer employs a Recurrent Neural Network to scan the context. The input to the modeling layer is G, which encodes the query-aware representations of the context words. The output of the modeling layer captures the interaction among the context words conditioned on the query. They use two layers of bi-directional LSTM with an output size of d for each direction; hence a matrix M ∈ R^{2d×T} is obtained, which is passed on to the output layer to predict the answer.
6 Output Layer provides an answer to the query They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices by the predicted distributions averaged over all examples [25]
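The training loss defined in layer 6 can be sketched directly; the probability vectors below are illustrative:

```python
import math

# Sketch of the BiDAF training loss: negative log probability of the true
# start and end indices under the predicted distributions, averaged over
# all examples. The probability vectors here are illustrative.

def span_loss(examples):
    total = 0.0
    for p_start, p_end, (true_s, true_e) in examples:
        total += -math.log(p_start[true_s]) - math.log(p_end[true_e])
    return total / len(examples)

examples = [
    ([0.7, 0.2, 0.1], [0.1, 0.1, 0.8], (0, 2)),  # confident -> low loss
    ([0.3, 0.4, 0.3], [0.5, 0.3, 0.2], (1, 1)),  # uncertain -> higher loss
]
loss = span_loss(examples)
```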
In a further variation of the above work, they add a self-attention layer after
the bi-attention layer to further improve the results [25]. The architecture of
that model is shown below.
Figure 3-4 The BiDAF model with an added self-attention layer [25]
Summary
In this chapter we reviewed the methods that are fundamental to the state of the
art in machine comprehension and the task of question answering. We have come
close to human-level accuracy thanks to incremental developments over previous
models. As we saw, out-of-vocabulary (OOV) tokens were handled using character
embeddings, long-term dependencies within the context passage were addressed with
self-attention, and many other techniques, such as contextualized vectors and
attention flow, were employed to get better results. In the next chapter we will
see how we can build on these models and develop them further.
CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING
Although the advancements in question answering systems since the release of the
SQuAD dataset have been impressive, and the results are getting close to
human-level accuracy, the models are far from fool-proof. They still make
mistakes that would be obvious to a human. For example:
Passage: The Panthers used the San Jose State practice facility and stayed at the
San Jose Marriott. The Broncos practiced at Florida State Facility and stayed at
the Santa Clara Marriott.
Question: At what university's facility did the Panthers practice?
Actual Answer: San Jose State
Predicted Answer Florida State Facility
To find out what was leading to the wrong predictions, we wanted to see the attention weights associated with such an example. We plotted the passage-question heat map, which is a 2D matrix where the intensity of each cell signifies the similarity between a passage word and a question word. For the above example we found that while certain words of the question are given high weightage, other parts are not: the words 'At', 'facility', and 'practice' are given high attention, but 'Panthers' does not receive high attention. If it had received high attention, the system would have predicted 'San Jose State' as the right answer. To solve this issue, we analyzed the base BiDAF model and proposed adding two things:

1. Bi-attention and self-attention over the query
2. A second level of attention over the output of (bi-attention + self-attention) from both the context and the query
Multi-Attention BiDAF Model

Our comprehension model is a hierarchical multi-stage process and consists of the following layers:

1. Embedding. Just as in other models, we embed words using pretrained word vectors. We also embed the characters in each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to obtain character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training.
2. Pre-Process. A shared bi-directional GRU (Cho et al., 2014) is used to map the question and passage embeddings to context-aware embeddings.
3. Context Attention. The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al., 2016) is used to build a query-aware context representation. Let h_i be the vector for context word i, q_j be the vector for question word j, and n_q and n_c be the lengths of the question and context respectively. We compute attention between context word i and question word j as

a_ij = w1 · h_i + w2 · q_j + w3 · (h_i ⊙ q_j)   (4-1)

where w1, w2, and w3 are learned vectors and ⊙ is element-wise multiplication. We then compute an attended vector c_i for each context token as

p_ij = exp(a_ij) / Σ_{j'=1..n_q} exp(a_ij'),   c_i = Σ_{j=1..n_q} p_ij q_j   (4-2)

We also compute a query-to-context vector q_c:

m_i = max_{1≤j≤n_q} a_ij,   p_i = exp(m_i) / Σ_{i'=1..n_c} exp(m_i'),   q_c = Σ_{i=1..n_c} p_i h_i   (4-3)

The final vector computed for each token is built by concatenating h_i, c_i, h_i ⊙ c_i, and q_c ⊙ c_i. In our model we subsequently pass the result through a linear layer with ReLU activations.
4. Context Self-Attention. Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU, and then we apply the same attention mechanism, only now between the passage and itself. In this case we do not use query-to-context attention, and we set a_ij = -inf if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so its output is additionally summed with its input.
5. Query Attention. This is done the same way as context attention, but we calculate the weighted sum of the context words for each query word, so the output length is the number of query words. Then we calculate context-to-query attention, analogous to query-to-context in the context attention layer.

Figure 4-1. The modified BiDAF model with multilevel attention

6. Query Self-Attention. This is done the same way as the context self-attention layer, but on the output of the Query Attention layer.
7. Context-Query Bi-Attention + Self-Attention. The outputs of the Context Self-Attention and Query Self-Attention layers are taken as input, and the same bi-attention and self-attention process is applied to these inputs.
8. Prediction. In the last layer of our model, a bidirectional GRU is applied, followed by a linear layer that computes answer start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log likelihood of selecting the correct start and end tokens.
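The context attention of layer 3 (Equations 4-1 to 4-3) can be sketched in NumPy as follows; dimensions and weight names are illustrative, and the real model operates on batches inside a neural network framework:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bidaf_attention(H, Q, w1, w2, w3):
    """H: (n_c, d) context vectors; Q: (n_q, d) question vectors;
    w1, w2, w3: learned vectors of size d.
    Returns the (n_c, 4d) concatenated token representations."""
    # Eq. 4-1: a_ij = w1.h_i + w2.q_j + w3.(h_i * q_j)
    a = (H @ w1)[:, None] + (Q @ w2)[None, :] + (H * w3) @ Q.T   # (n_c, n_q)
    # Eq. 4-2: context-to-query attended vectors c_i
    C = softmax(a, axis=1) @ Q                                   # (n_c, d)
    # Eq. 4-3: query-to-context vector q_c from the max over question words
    p = softmax(a.max(axis=1), axis=0)                           # (n_c,)
    q_c = p @ H                                                  # (d,)
    # Final per-token vector: [h_i; c_i; h_i * c_i; q_c * c_i]
    return np.concatenate([H, C, H * C, q_c[None, :] * C], axis=1)
```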
Having carried out this modification, we were able to solve the wrong example we started with: the multilevel attention model gives the correct output, "San Jose State". We also achieved slightly better scores than the original model, with an F1 score of 85.44 on the SQuAD dev set.
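For reference, the token-level F1 quoted here can be sketched as below; this is a simplified version of the official SQuAD metric, which additionally normalizes punctuation and articles before comparing:

```python
from collections import Counter

def f1_score(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 between a predicted and a true answer span."""
    pred_tokens = prediction.lower().split()
    true_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(true_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

# The wrong prediction from the example earlier still earns partial credit:
print(round(f1_score("Florida State Facility", "San Jose State"), 2))  # 0.33
```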
Chatbot Design Using a QA System

Designing a perfect chatbot that passes the Turing test is a fundamental goal of Artificial Intelligence. Although we are many orders of magnitude away from achieving such a goal, domain-specific tasks can be solved with chatbots built from current technology. We started with a similar objective in mind, i.e., to design a domain-specific chatbot and then generalize to other areas once the first domain-specific objective is achieved robustly. This led us to the fundamental problem of Machine Comprehension and subsequently to the task of Question Answering. Having achieved some degree of success with QA systems, we looked back to see whether we could apply our newly acquired knowledge to the task of designing chatbots.

The chatbots made with today's technologies mostly rely on handcrafted techniques, such as template matching, that require anticipating all possible ways a user may articulate his requirements and a conversation may unfold. This requires a lot of man-hours for designing a domain-specific system and is still very error-prone. In this section we propose a general chatbot design that would make designing a domain-specific chatbot easy and robust at the same time.
Every domain-specific chatbot needs to obtain a set of information from the user and show some results based on the user-specific information obtained. Traditional chatbots use template matching and keyword lookup to determine whether the user has provided the required information. Our idea is to use the Question Answering system in the backend to extract the required information from whatever the user has typed up to this point in the conversation. The information to be extracted can be posed as a set of questions, and the answers obtained from those questions can be used as the parameters to supply the relevant information to the user.

We chose flight reservation as our chatbot domain. Our goal was to extract the required information from the user to be able to show him the available flights as per his requirements.
Figure 4-2. Flight reservation chatbot's chat window
For a flight reservation task, the booking agent needs to know at minimum the origin city, the destination city, and the date of travel to be able to show the available flights. Optional information includes the number of tickets, the passenger's name, one-way or round trip, etc.

The minimal conversation with the user through the chat window would be as shown above. We had a platform called OneTask on which we wanted to implement our chatbot; the chat interface within the OneTask system looks as follows.

Figure 4-3. Chatbot within OneTask system

The working of the chat system is as follows:
1. Initiation. The user opens the chat window, which starts a session with the chatbot. The chatbot asks, 'How may I help you?'
2. User Reply. The user may reply with none of the required information for flight booking, or may reply with multiple pieces of information in the same message.
3. User Reply Parsing. The conversation up to this point is treated as a passage, and the internal questions are run on it. The questions include:
   Where do you want to go?
   From where do you want to leave?
   When do you want to depart?
4. Parsed responses from the QA model. After running the questions on the QA system with the given conversation, answers are obtained for all of them, along with their corresponding confidence values. Since answers will be obtained even if the required question has not been answered up to this point in the conversation, the confidence values play a crucial role in determining the validity of an answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies that the question has been answered correctly; a confidence of 2-10 signifies that it may have been answered, but the system should verify with the user for correctness; and any answer with confidence below 2 is discarded.
5. Asking remaining questions iteratively. After the parsing, it is checked whether any of the required questions are still unanswered. If so, the chatbot asks the remaining questions, and steps 3-4 are carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request.
Figure 4-4. The flow diagram of the flight booking chatbot system
Online QA System and Attention Visualization

To be able to test out various examples, we used the BiDAF [24] model for an online demo. One can either choose from the available examples in the drop-down menu or paste in one's own passage and questions. While this is a useful and interesting system for testing the model in a user-friendly way, we created it to be able to focus on the wrong samples.

Figure 4-5. QA system interface with attention highlight over candidate answers

The candidate answers are shown with blue highlights according to their confidence values: the higher the confidence, the darker the answer. The candidate with the highest confidence value is chosen as the predicted answer. We developed the system to show the attention spread of the candidate answers in order to realize what needs to be done to improve the system. This led us to realize the importance of including the query attention part, as well as multilevel attention, in the BiDAF model, as described in the first section of this chapter.
CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions

Having done a thorough analysis of the current methods and state of the art in Machine Comprehension and the QA task, and having developed systems on them, we have gained a strong sense of what needs to be done to further improve QA models. After observing the wrong samples, we can see that the system is still unable to encode meaning and is picking answers based on statistical patterns in the occurrence of answers in the training examples.

From the state-of-the-art models and the ongoing research literature, it is easy to conclude that more features need to be embedded to encode the meaning of words, phrases, and sentences. The paper Reinforced Mnemonic Reader for Machine Comprehension [6] encoded POS and NER tags of words along with their word and character embeddings, which gave them better results.

Figure 5-1. An English language semantic parse tree [26]
We have developed a method to encode the syntax parse tree of a sentence that encodes not only the POS tags but also the relation of each word within its phrase and the relation of each phrase within the whole sentence, in a hierarchical manner.
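As a toy sketch of such a hierarchical encoding (illustrative only, not the exact method developed here), each word can be paired with the chain of constituent labels from the sentence root down to its POS tag:

```python
# A parse tree as nested tuples: (label, children...), with leaves as strings.
def hierarchical_tags(node, prefix=()):
    label, *children = node
    path = prefix + (label,)
    features = []
    for child in children:
        if isinstance(child, str):      # leaf word: emit its label chain
            features.append((child, list(path)))
        else:
            features.extend(hierarchical_tags(child, path))
    return features

sentence = ("S",
            ("NP", ("DT", "The"), ("NN", "dog")),
            ("VP", ("VBD", "barked")))

print(hierarchical_tags(sentence))
# [('The', ['S', 'NP', 'DT']), ('dog', ['S', 'NP', 'NN']), ('barked', ['S', 'VP', 'VBD'])]
```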
Finally, data augmentation is another way to get better results. One definite way to reduce errors would be to include in the training data samples similar to those the system falters on in the dev set: one could generate examples similar to the failure cases and include them in the training set for better prediction. Another approach would be to train on similar and bigger datasets. Our models were trained on the SQuAD dataset, but other datasets, such as TriviaQA, address a similar question answering task. We could augment SQuAD's training set with that of TriviaQA to build a more robust system that generalizes better and thus has higher accuracy in predicting answer spans.
Conclusion

In this work we explored the most fundamental techniques that have shaped the current state of the art. We then proposed a minor architectural improvement over an existing model. Furthermore, we developed two applications that use the base model: first, we described how a chatbot application can be built using the QA system, and second, we created a web interface where the model can be used with any passage and question. This interface also shows the attention spread over the candidate answers. While our effort to push the state of the art forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will pay high dividends for society at large.
LIST OF REFERENCES
[1] Danqi Chen. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U

[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.

[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension." arXiv preprint arXiv:1705.03551 (2017).

[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ questions for machine comprehension of text." arXiv preprint arXiv:1606.05250 (2016).

[5] "The Stanford Question Answering Dataset." 2018. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/

[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced mnemonic reader for machine comprehension." CoRR abs/1705.02798 (2017).

[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.

[8] Me For The AI, and Neetesh Mehrotra. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai

[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.

[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/

[11] Williams, Ronald J., and David Zipser. "A learning algorithm for continually running fully recurrent neural networks." Neural Computation 1, no. 2 (1989): 270-280.

[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9, no. 8 (1997): 1735-1780.

[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv preprint arXiv:1412.3555 (2014).

[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic regularities in continuous space word representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.

[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.

[16] "Word2Vec Tutorial - The Skip-Gram Model." 2016. Chris McCormick. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

[17] "[Paper Introduction] Bilingual Word Representations with Monolingual Quality in Mind." 2015. SlideShare. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind

[18] "Attention Mechanism." 2016. Heuritech blog. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism/

[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. 2014. "Memory Networks." arXiv preprint arXiv:1410.3916. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916

[20] Wang, Shuohang, and Jing Jiang. "Machine comprehension using match-LSTM and answer pointer." arXiv preprint arXiv:1608.07905 (2016).

[21] Wang, Shuohang, and Jing Jiang. "Learning natural language inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).

[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.

[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated self-matching networks for reading comprehension and question answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.

[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional attention flow for machine comprehension." arXiv preprint arXiv:1611.01603 (2016).

[25] Clark, Christopher, and Matt Gardner. "Simple and effective multi-paragraph reading comprehension." arXiv preprint arXiv:1710.10723 (2017).

[26] "8. Analyzing Sentence Structure." 2018. nltk.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata, India, and had a strong interest in computers from an early age. After high school, he earned a B.Sc. in computer science from Ramakrishna Mission Residential College, Narendrapur, followed by an M.Sc. in computer science from St. Xavier's College, Kolkata. He had a strong intuition and interest for human-like learning systems and wanted to work in this area. He started working at TCS Innovation Labs, Pune, on applications of Natural Language Processing in education. As he was deeply passionate about learning systems that mimic the human brain and learn like a human child does, he became increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science at the University of Florida, Gainesville. His academic interests have been focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.
Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]
Their machine comprehension model is a hierarchical multi-stage process and
consists of six layers
1 Character Embedding Layer maps each word to a vector space using character-level CNNs
2 Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model
3 Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embedding of the words These first three layers are applied to both the query and context They use a LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words We place an LSTM in both directions and concatenate the outputs of the two LSTMs
4 Attention Flow Layer couples the query and context vectors and produces a set of query aware feature vectors for each word in the context
5 Modeling Layer employs a Recurrent Neural Network to scan the context The input to the modeling layer is G which encodes the query-aware representations of context words The output of the modeling layer captures the interaction among the context words conditioned on the query They use two layers of bi-directional LSTM
27
with the output size of d for each direction Hence a matrix is obtained which is passed onto the output layer to predict the answer
6 Output Layer provides an answer to the query They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices by the predicted distributions averaged over all examples [25]
In a further variation of their above work, they add a self-attention layer after the
bi-attention layer to further improve the results [25]. The architecture of that model is shown in
Figure 3-3 The task of Question Answering [25]
Summary
In this chapter we reviewed the methods that are fundamental to the state of the
art in Machine Comprehension and the task of Question Answering. We have
come close to human-level accuracy, and this is due to incremental developments
over previous models. As we saw, Out of Vocabulary (OOV) tokens were handled
using character embeddings, long-term dependencies within the context passage were
resolved using self-attention, and many other techniques, such as contextualized vectors
and attention flow, were employed to get better results. In the next chapter we will see
how we can build on these models and develop them further.
CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING
Although the advancements in Question Answering systems since the release of
the SQuAD dataset have been impressive, and the results are getting close to human-level
accuracy, these are far from fool-proof systems. The models still make mistakes
that would be obvious to a human. For example:
Passage: The Panthers used the San Jose State practice facility and stayed at
the San Jose Marriott. The Broncos practiced at Florida State Facility and stayed at the
Santa Clara Marriott.
Question: At what university's facility did the Panthers practice?
Actual Answer: San Jose State
Predicted Answer: Florida State Facility
To find out what was leading to the wrong predictions, we wanted to see the
attention weights associated with such an example. We plotted the passage-question
heat map, a 2D matrix where the intensity of each cell signifies the
similarity between a passage word and a question word. For the above example we
found that while certain words of the question are given high weight, other parts
are not. The words 'At', 'facility', and 'practice' are given high attention, but 'Panthers' does
not receive high attention. If it had received high attention, the system would have
predicted 'San Jose State' as the right answer. To solve this issue we analyzed the
base BiDAF model and proposed adding two things:
1. Bi-Attention and Self-Attention over the query
2. A second level of attention over the outputs of (Bi-Attention + Self-Attention) from both the context and the query
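The heat map described above is simply a rendering of the passage-question similarity matrix. The sketch below uses a plain dot-product similarity as a stand-in for the model's learned attention scores, which is an assumption for illustration only; rendering the returned matrix (e.g. with matplotlib's imshow) produces the heat map.

```python
import numpy as np

def attention_heatmap(ctx_vecs, q_vecs):
    """Row-normalized similarity matrix between passage and question words.
    Cell (i, j) shows how strongly context word i attends to question word j."""
    sim = ctx_vecs @ q_vecs.T                       # raw similarities (n_c, n_q)
    e = np.exp(sim - sim.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)         # softmax over question words

rng = np.random.default_rng(0)
weights = attention_heatmap(rng.normal(size=(5, 8)), rng.normal(size=(3, 8)))
print(weights.shape)  # (5, 3); each row sums to 1
```

A dim row for a question word such as 'Panthers' is exactly the failure mode observed above.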
Multi-Attention BiDAF Model
Our comprehension model is a hierarchical multi-stage process and consists of
the following layers:
1. Embedding: As in other models, we embed words using pre-trained word vectors. We also embed the characters in each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to get character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training.
2. Pre-Process: A shared bi-directional GRU (Cho et al., 2014) is used to map the question and passage embeddings to context-aware embeddings.
3. Context Attention: The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al., 2016) is used to build a query-aware context representation. Let h_i be the vector for context word i, q_j the vector for question word j, and n_q and n_c the lengths of the question and context respectively. We compute attention between context word i and question word j as

a_ij = w1 · h_i + w2 · q_j + w3 · (h_i ⊙ q_j)    (4-1)

where w1, w2, and w3 are learned vectors and ⊙ is element-wise multiplication. We then compute an attended vector c_i for each context token as

p_ij = exp(a_ij) / Σ_{k=1..n_q} exp(a_ik),    c_i = Σ_{j=1..n_q} p_ij q_j    (4-2)

We also compute a query-to-context vector q_c:

m_i = max_j a_ij,    p_i = exp(m_i) / Σ_{k=1..n_c} exp(m_k),    q_c = Σ_{i=1..n_c} p_i h_i    (4-3)
The final vector computed for each token is built by concatenating h_i, c_i,
h_i ⊙ c_i, and q_c ⊙ c_i. In our model we subsequently pass the result through a linear layer with ReLU activations.
4. Context Self-Attention: Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU. Then we apply the same attention mechanism, only now between the passage and itself. In this case we do not use query-to-context attention, and we set a_ij = −inf if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so its output is additionally summed with the input.
5. Query Attention: Here we proceed the same way as in context attention, but calculate the weighted sum of the context words for each query word, so the output length is the number of query words. We then calculate context-to-query attention, analogous to the query-to-context attention in the context attention layer.
Figure 4-1 The modified BiDAF model with multilevel attention
6. Query Self-Attention: This is done the same way as the context self-attention layer, but on the output of the Query Attention layer.
7. Context-Query Bi-Attention + Self-Attention: The outputs of the Context Self-Attention and Query Self-Attention layers are taken as input, and the same bi-attention and self-attention process is applied to these inputs.
8. Prediction: In the last layer of our model, a bidirectional GRU is applied, followed by a linear layer that computes answer start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log likelihood of selecting the correct start and end tokens.
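The core tensor arithmetic of the attention layers and the span prediction can be sketched in NumPy. This is a simplified sketch under stated assumptions: the GRUs and learned linear layers are omitted, all parameters are random placeholders, and only the attention arithmetic of Eqs. 4-1 to 4-3, the diagonal masking of the self-attention, and a product-of-probabilities span selection (a common decoding choice, assumed here) are shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def trilinear(A, B, w1, w2, w3):
    """Attention scores a_ij = w1.a_i + w2.b_j + w3.(a_i * b_j)  (Eq. 4-1)."""
    return (A @ w1)[:, None] + (B @ w2)[None, :] + (A * w3) @ B.T

def bi_attention(H, Q, w1, w2, w3):
    """Context attention (layer 3): an attended vector c_i per context word
    (Eq. 4-2) and a single query-to-context vector q_c (Eq. 4-3)."""
    a = trilinear(H, Q, w1, w2, w3)          # (n_c, n_q)
    c = softmax(a, axis=1) @ Q               # context-to-query
    q_c = softmax(a.max(axis=1)) @ H         # query-to-context
    return c, q_c

def self_attention(H, w1, w2, w3):
    """Self-attention (layer 4): the same mechanism between the passage and
    itself, with a_ij = -inf on the diagonal, applied residually."""
    a = trilinear(H, H, w1, w2, w3)
    np.fill_diagonal(a, -np.inf)             # a word never attends to itself
    return H + softmax(a, axis=1) @ H        # residual sum (linear layer omitted)

def predict_span(start_scores, end_scores, max_len=17):
    """Prediction (layer 8): the span (i, j), i <= j, that maximizes
    p_start[i] * p_end[j] at test time."""
    p_s, p_e = softmax(start_scores), softmax(end_scores)
    spans = [(i, j) for i in range(len(p_s))
             for j in range(i, min(i + max_len, len(p_e)))]
    return max(spans, key=lambda ij: p_s[ij[0]] * p_e[ij[1]])

# Toy run: 6 context words, 4 query words, d = 8.
d, n_c, n_q = 8, 6, 4
H = rng.normal(size=(n_c, d))
Q = rng.normal(size=(n_q, d))
w1, w2, w3 = rng.normal(size=(3, d))
c, q_c = bi_attention(H, Q, w1, w2, w3)
G = self_attention(H, w1, w2, w3)
start, end = predict_span(rng.normal(size=n_c), rng.normal(size=n_c))
print(c.shape, q_c.shape, G.shape)  # (6, 8) (8,) (6, 8)
```

The query attention of layer 5 reuses the same machinery with the roles of H and Q swapped.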
Having carried out this modification, we were able to solve the wrong example we
started with: the multilevel attention model gives the correct output, "San Jose
State". We also achieved slightly better scores than the original model, with an F1 score
of 85.44 on the SQuAD dev dataset.
Chatbot Design Using a QA System
Designing a perfect chatbot that passes the Turing test is a fundamental goal of
Artificial Intelligence. Although we are many orders of magnitude away from achieving
such a goal, domain-specific tasks can be solved with chatbots made from current
technology. We had started with a similar objective in mind, i.e., to design a
domain-specific chatbot and then generalize to other areas once it achieves the
first domain-specific objective robustly. This led us to the fundamental problem of
Machine Comprehension and subsequently to the task of Question Answering. Having
achieved some degree of success with QA systems, we looked back at whether we could apply
our newly acquired knowledge to the task of designing chatbots.
The chatbots made with today's technologies mostly rely on handcrafted techniques
such as template matching, which requires anticipating all possible ways a user may
articulate his requirements and a conversation may occur. This requires a lot of man-hours
for designing a domain-specific system and is still very error-prone. In this section
we propose a general chatbot design that would make the design of a domain-specific
chatbot easy and robust at the same time.
Every domain-specific chatbot needs to obtain a set of information from the user
and show some results based on the user-specific information obtained. Traditional
chatbots use template matching and keyword lookup to determine whether the user has
provided the required information. Our idea is to use the Question Answering system in
the backend to extract the required information from whatever the user has typed
up to this point in the conversation. The information to be extracted can be posed in the
form of a set of questions, and the answers obtained from those questions can be used
as the parameters to supply the relevant information to the user.
We chose flight reservation as our chatbot domain. Our goal
was to extract the required information from the user to be able to show him the
available flights as per his requirements.
Figure 4-2 Flight reservation chatbot's chat window
For a flight reservation task, the booking agent needs to know at minimum the origin city,
destination city, and date of travel to be able to show the available flights.
Optional information includes the number of tickets, the passenger's name, one way or
round trip, etc.
The minimal conversation with the user through the chat window would be as
shown above. We had a platform called OneTask on which we wanted to implement our
chatbot. The chat interface within the OneTask system looks as follows:
Figure 4-3 Chatbot within OneTask system
The working of the chat system is as follows:
1. Initiation: The user opens the chat window, which starts a session with the chatbot. The chatbot asks 'How may I help you?'
2. User Reply: The user may reply with none of the required information for flight booking, or may reply with multiple pieces of information in the same message.
3. User Reply Parsing: The conversation up to this point is treated as a passage, and the internal questions are run on this passage. The questions that are run are:
Where do you want to go?
From where do you want to leave?
When do you want to depart?
4. Parsed responses from the QA model: After running the questions on the QA system with the given conversation, answers are obtained for all of them along with their corresponding confidence values. Since answers will be produced even if the required question has not been answered up to this point in the conversation, the confidence values play a crucial role in determining the validity of an answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly; a confidence of 2 to 10 signifies that it may have been answered, but we should verify with the user for correctness; and any answer with confidence below 2 is discarded.
5. Asking remaining questions iteratively: After the parsing, it is checked whether any of the required questions are still unanswered. If so, the chatbot asks the remaining questions, and steps 3 and 4 are carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request.
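The parsing-and-prompting loop in steps 3 to 5 can be sketched as follows. Here `qa_model` is a hypothetical stand-in for the deployed QA system, taking (passage, question) and returning (answer, confidence); the slot names are illustrative, while the thresholds follow the text.

```python
REQUIRED = {
    "destination": "Where do you want to go?",
    "origin": "From where do you want to leave?",
    "date": "When do you want to depart?",
}

def parse_slots(conversation, qa_model):
    """Run every internal question over the conversation so far; keep answers
    whose confidence clears the thresholds (>10 accept, 2-10 verify, <2 drop)."""
    accepted, to_verify = {}, {}
    for slot, question in REQUIRED.items():
        answer, confidence = qa_model(conversation, question)
        if confidence > 10:
            accepted[slot] = answer
        elif confidence >= 2:
            to_verify[slot] = answer
    return accepted, to_verify

def next_prompt(accepted):
    """Ask the first still-unanswered required question; None means all slots
    are filled and the available flights can be shown."""
    for slot, question in REQUIRED.items():
        if slot not in accepted:
            return question
    return None
```

In the real system this loop would re-run after every user message until `next_prompt` returns None, at which point the flight search is executed.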
Figure 4-4 The Flow diagram of the Flight booking Chatbot system
Online QA System and Attention Visualization
To be able to test out various examples, we used the BiDAF [24] model for an
online demo. One can either choose from the available examples in the drop-down
menu or paste in one's own passage and questions. While this is a useful and interesting
system for testing the model in a user-friendly way, we created it primarily to be able to
focus on the wrong samples.
Figure 4-5 QA system interface with attention highlight over candidate answers
The candidate answers are shown with blue highlights according to their
confidence values: the higher the confidence, the darker the answer. The answer with the highest
confidence value is chosen as the predicted answer. We developed the system to show
the attention spread of the candidate answers in order to understand what needs to be done to improve
the system. This led us to realize the importance of including the query attention part as
well as multilevel attention in the BiDAF model, as described in the first section of this
chapter.
CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions
Having done a thorough analysis of the current methods and state of the art in
Machine Comprehension and the QA task, and having developed systems on them, we have
achieved a strong sense of what needs to be done to further improve the QA models.
After observing the wrong samples, we can see that the system is still unable to encode
meaning and is picking answers based on statistical patterns in the occurrence of answers
in the training examples.
Going by the state-of-the-art models and the ongoing research literature,
it is easy to conclude that more features need to be embedded to encode the meaning
of words, phrases, and sentences. The paper Reinforced Mnemonic Reader for
Machine Comprehension [6] encoded POS and NER tags of words along with their word
and character embeddings, which gave them better results.
Figure 5-1 An English language semantic parse tree [26]
We have developed a method to encode the syntax parse tree of a sentence that
encodes not only the POS tags but also the relation of a word within its phrase and the
relation of the phrase within the whole sentence, in a hierarchical manner.
Finally, data augmentation is another way to get better results. One definite
way to reduce the errors would be to include in the training data samples similar to those
the system is faltering on in the dev set. One could generate examples similar to the failure
cases and include them in the training set for better prediction. Another approach
would be to train on similar, larger datasets. Our models were trained on the SQuAD
dataset, but there are other datasets, such as TriviaQA, that address a similar question
answering task. We could augment the training set with TriviaQA along with SQuAD to
have a more robust system that is able to generalize better and thus have higher
accuracy in predicting answer spans.
Conclusion
In this work we have explored the most fundamental techniques that have
shaped the current state of the art. We then proposed a minor architectural improvement
over an existing model. Furthermore, we developed two applications that
use the base model: first, we discussed how a chatbot application can be built
using the QA system, and second, we created a web interface where the model can
be used for any passage and question. This interface also shows the attention spread
over the candidate answers. While our effort to push the state of the art
forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will
pay high dividends for society at large.
LIST OF REFERENCES
[1] Chen, Danqi. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U
[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.
[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension." arXiv preprint arXiv:1705.03551 (2017).
[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ questions for machine comprehension of text." arXiv preprint arXiv:1606.05250 (2016).
[5] "The Stanford Question Answering Dataset." 2018. rajpurkar.github.io. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer
[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced mnemonic reader for machine comprehension." CoRR abs/1705.02798 (2017).
[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.
[8] Mehrotra, Neetesh. "The Connect Between Deep Learning and AI." 2018. Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai
[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.
[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets
[11] Williams, Ronald J., and David Zipser. "A learning algorithm for continually running fully recurrent neural networks." Neural Computation 1, no. 2 (1989): 270-280.
[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9, no. 8 (1997): 1735-1780.
[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv preprint arXiv:1412.3555 (2014).
[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic regularities in continuous space word representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.
[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.
[16] "Word2Vec Tutorial - The Skip-Gram Model." 2016. mccormickml.com. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model
[17] "[Paper Introduction] Bilingual Word Representations with Monolingual Quality in Mind." 2015. SlideShare. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind
[18] "Attention Mechanism." 2016. blog.heuritech.com. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism
[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. "Memory networks." arXiv preprint arXiv:1410.3916 (2014). Accessed March 16, 2018. https://arxiv.org/abs/1410.3916
[20] Wang, Shuohang, and Jing Jiang. "Machine comprehension using match-LSTM and answer pointer." arXiv preprint arXiv:1608.07905 (2016).
[21] Wang, Shuohang, and Jing Jiang. "Learning natural language inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).
[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.
[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated self-matching networks for reading comprehension and question answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.
[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional attention flow for machine comprehension." arXiv preprint arXiv:1611.01603 (2016).
[25] Clark, Christopher, and Matt Gardner. "Simple and effective multi-paragraph reading comprehension." arXiv preprint arXiv:1710.10723 (2017).
[26] "8. Analyzing Sentence Structure." 2018. nltk.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata, India, and has had a strong interest in
computers since an early age. After high school, he earned his B.Sc. in computer
science from Ramakrishna Mission Residential College, Narendrapur, followed by an M.Sc.
in computer science from St. Xavier's College, Kolkata. He had a strong intuition and
interest for human-like learning systems and wanted to work in this area. He started
working at TCS Innovation Labs, Pune, on the application of Natural Language
Processing in educational applications. As he was deeply passionate about learning
systems that mimic the human brain and learn like a human child does, he became
increasingly interested in Deep Learning and its applications. After working for a year,
he went on to pursue a Master of Science degree in computer science at the
University of Florida, Gainesville. His academic interests have been focused on Deep
Learning and Natural Language Processing, and he has been working on Machine
Reading Comprehension since the summer of 2017.
- ACKNOWLEDGMENTS
- LIST OF FIGURES
- LIST OF ABBREVIATIONS
- INTRODUCTION
- THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
-
- Neural Networks
- Convolutional Neural Network
- Recurrent Neural Networks (RNN)
- Word Embedding
- Attention Mechanism
- Memory Networks
-
- LITERATURE REVIEW AND STATE OF THE ART
-
- Machine Comprehension Using Match-LSTM and Answer Pointer
- R-NET Matching Reading Comprehension with Self-Matching Networks
- Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
- Summary
-
- MULTI-ATTENTION QUESTION ANSWERING
-
- Multi-Attention BiDAF Model
- Chatbot Design Using a QA System
- Online QA System and Attention Visualization
-
- FUTURE DIRECTIONS AND CONCLUSION
-
- Future Directions
- Conclusion
-
- LIST OF REFERENCES
- BIOGRAPHICAL SKETCH
-
17
will produce similar vector for those words Conversely if the two word vectors are
similar then the network will produce similar context predictions for the same two words
For examples synonyms like ldquointelligentrdquo and ldquosmartrdquo would have very similar contexts
Or that words that are related like ldquoenginerdquo and ldquotransmissionrdquo would probably have
similar contexts as well [16] Plotting the word vectors learned by a word2vec over a
large corpus we could find some very interesting relationships between words
Figure 2-3 Semantic relation between words in vector space [17]
Attention Mechanism
We as humans put our attention to things are important or are relevant in a
context For example when asked a question from a passage we try to find the most
relevant part of the passage the question is relevant with and then reason from our
understanding of that part of the passage The same idea applies for attention
mechanism in Deep Learning It is used to identify the specific parts of a given context
to which the current question is relevant to
Formally put the techniques take n arguments y_1 y_n (in our case the
passage having words say y_i through h_i) and a question word say q It returns a
vector z which is supposed to be the laquo summary raquo of the y_i focusing on information
linked to the question q More formally it returns a weighted arithmetic mean of the y_i
18
and the weights are chosen according the relevance of each y_i given the context c
[18]
Figure 2-4 Attention Mechanism flow [18]
Memory Networks
Convolutional Neural Networks and Recurrent Neural Networks which does
capture how we form our visual and sequential memories their memory (encoded by
hidden states and weights) were typically too small and was not compartmentalized
enough to accurately remember facts from the past (knowledge is compressed into
dense vectors) [19]
Deep Learning needed to cultivate a methodology that preserved memories as
they are such that it wonrsquot be lost in generalization and recalling exact words or
sequence of events would be possible mdash something computers are already good at This
effort led us to Memory Networks [19] published at ICLR 2015 by Facebook AI
Research
This paper provides a basic framework to store augment and retrieve memories
while seamlessly working with a Recurrent Neural Network architecture The memory
19
network consists of a memory m (an array of objects 1 indexed by m i) and four
(potentially learned) components I G O and R as follows
I (input feature map) mdash converts the incoming input to the internal feature
representation either a sparse or dense feature vector like that from word2vec or
GloVe
G (generalization) mdash updates old memories given the new input They call this
generalization as there is an opportunity for the network to compress and generalize its
memories at this stage for some intended future use The analogy Irsquove been talking
before
O (output feature map) mdash produces a new output (in the feature representation
space) given the new input and the current memory state This component is
responsible for performing inference In a question answering system this part will
select the candidate sentences (which might contain the answer) from the story
(conversation) so far
R (response) mdash converts the output into the response format desired For
example a textual response or an action In the QA system described this component
finds the desired answer and then converts it from feature representation to the actual
word
This model is a fully supervised model meaning all the candidate sentences from
which the answer could be found are marked during training phase and can also be
termed as lsquohard attentionrsquo
The authors tested out the QA system on various literature including Lord of the
Rings
20
Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]
21
CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART
There has been rapid progress since the release of the SQuAD dataset and
currently the best ensemble models are close to human level accuracy in machine
comprehension This is due to the various ingenious methods which solves some of the
problems with the previous methods Out of Vocabulary tokens were handled by using
Character embedding Long term dependency within context passage were solved
using self-attention And many other techniques such as Contextualized vectors History
of Words Attention Flow etc In this section we will have a look at the some of the most
important models that were fundamental to the progress of Questions Answering
Machine Comprehension Using Match-LSTM and Answer Pointer
In this paper [20] the authors propose an end-to-end neural architecture for the
QA task The architecture is based on match-LSTM [21] a model they proposed for
textual entailment and Pointer Net [22] a sequence-to-sequence model proposed by
Vinyals et al (2015) to constrain the output tokens to be from the input sequences
The model consists of an LSTM preprocessing layer a match-LSTM layer and an
Answer Pointer layer
We are given a piece of text which we refer to as a passage and a question
related to the passage The passage is represented by matrix P where P is the length
(number of tokens) of the passage and d is the dimensionality of word embeddings
Similarly the question is represented by matrix Q where Q is the length of the question
Our goal is to identify a subsequence from the passage as the answer to the question
22
Figure 3-1 Match-LSTM Model Architecture [20]
LSTM Preprocessing layer They use a standard one-directional LSTM
(Hochreiter amp Schmidhuber 1997) to process the passage and the question separately
as shown below
Match LSTM Layer They applied the match-LSTM model proposed for textual
entailment to their machine comprehension problem by treating the question as a
premise and the passage as a hypothesis The match-LSTM sequentially goes through
the passage At position i of the passage it first uses the standard word-by-word
attention mechanism to obtain attention weight vector as follows
(3-1)
23
where and are parameters to be learned
Answer Pointer Layer The Answer Pointer (Ans-Ptr) layer is motivated by the
Pointer Net introduced by Vinyals et al (2015) [22] The boundary model produces only
the start token and the end token of the answer and then all the tokens between these
two in the original passage are considered to be the answer
When this paper was released back in November 2016 Match-LSTM method
was the state of the art in Question Answering systems and was at the top of the
leaderboard for the SQuAD dataset
R-NET Matching Reading Comprehension with Self-Matching Networks
In this model [23] first the question and passage are processed by a
bidirectional recurrent network (Mikolov et al 2010) separately They then match the
question and passage with gated attention-based recurrent networks obtaining
question-aware representation for the passage On top of that they apply self-matching
attention to aggregate evidence from the whole passage and refine the passage
representation which is then fed into the output layer to predict the boundary of the
answer span
Question and passage encoding First the words are converted to their
respective word-level embeddings and character level embeddings The character-level
embeddings are generated by taking the final hidden states of a bi-directional recurrent
neural network (RNN) applied to embeddings of characters in the token Such
character-level embeddings have been shown to be helpful to deal with out-of-vocab
(OOV) tokens
24
They then use a bi-directional RNN to produce new representation and
of all words in the question and passage respectively
Figure 3-2 The task of Question Answering [23]
Gated Attention-based Recurrent Networks They use a variant of attention-
based recurrent networks with an additional gate to determine the importance of
information in the passage regarding a question Different from the gates in LSTM or
GRU the additional gate is based on the current passage word and its attention-pooling
vector of the question which focuses on the relation between the question and current
passage word The gate effectively model the phenomenon that only parts of the
passage are relevant to the question in reading comprehension and question answering
is utilized in subsequent calculations
25
Self-Matching Attention From the previous step the question aware passage
representation is generated to highlight the important parts of the passage One
problem with such representation is that it has very limited knowledge of context One
answer candidate is often oblivious to important cues in the passage outside its
surrounding window To address this problem the authors propose directly matching
the question-aware passage representation against itself It dynamically collects
evidence from the whole passage for words in passage and encodes the evidence
relevant to the current passage word and its matching question information into the
passage representation
Output Layer They use the same method as Wang amp Jiang (2016b) and use
pointer networks (Vinyals et al 2015) to predict the start and end position of the
answer In addition they use an attention-pooling over the question representation to
generate the initial hidden vector for the pointer network [23]
When the R-Net Model first appeared in the leaderboard in March 2017 it was at
the top with 723 Exact Match and 807 F1 score
Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
BiDAF [24] is a hierarchical multi-stage architecture for modeling the
representations of the context paragraph at different levels of granularity BIDAF
includes character-level word-level and contextual embeddings and uses bi-directional
attention flow to obtain a query-aware context representation Their attention layer is not
used to summarize the context paragraph into a fixed-size vector Instead the attention
is computed for every time step and the attended vector at each time step along with
the representations from previous layers can flow through to the subsequent modeling
layer This reduces the information loss caused by early summarization [24]
26
Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]
Their machine comprehension model is a hierarchical multi-stage process and
consists of six layers
1 Character Embedding Layer maps each word to a vector space using character-level CNNs
2 Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model
3 Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embedding of the words These first three layers are applied to both the query and context They use a LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words We place an LSTM in both directions and concatenate the outputs of the two LSTMs
4 Attention Flow Layer couples the query and context vectors and produces a set of query aware feature vectors for each word in the context
5 Modeling Layer employs a Recurrent Neural Network to scan the context The input to the modeling layer is G which encodes the query-aware representations of context words The output of the modeling layer captures the interaction among the context words conditioned on the query They use two layers of bi-directional LSTM
27
with the output size of d for each direction Hence a matrix is obtained which is passed onto the output layer to predict the answer
6 Output Layer provides an answer to the query They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices by the predicted distributions averaged over all examples [25]
In a further variation of their above work they add a self-attention layer after the
Bi-attention layer to further improve the results The architecture of the model is as
Figure 3-3 The task of Question Answering [25]
28
Summary
In this chapter we reviewed the methods that are fundamental to the state of the
art in Machine Comprehension and the task of Question Answering. We have come
close to human-level accuracy, and this is due to incremental developments over
previous models. As we saw, Out of Vocabulary (OOV) tokens were handled using
character embeddings, long-term dependencies within the context passage were
addressed using self-attention, and many other techniques such as contextualized
vectors and Attention Flow were employed to get better results. In the next chapter
we will see how we can build on these models and develop them further
CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING
Although the advancements in Question Answering systems since the release of
the SQuAD dataset have been impressive, and the results are getting close to human-level
accuracy, these are far from fool-proof systems. The models still make mistakes
that would be obvious to a human. For example:
Passage The Panthers used the San Jose State practice facility and stayed at
the San Jose Marriott The Broncos practiced at Florida State Facility and stayed at the
Santa Clara Marriott
Question At what university's facility did the Panthers practice
Actual Answer San Jose State
Predicted Answer Florida State Facility
To find out what leads to the wrong prediction, we wanted to see the
attention weights associated with such an example. We plotted the passage and
question heat map, a 2D matrix where the intensity of each cell signifies the
similarity between a passage word and a question word. For the above example we
found that while certain words of the question are given high weight, other parts
are not: the words 'At', 'facility', and 'practice' receive high attention, but 'Panthers'
does not. Had 'Panthers' received high attention, the system would have
predicted 'San Jose State' as the right answer. To solve this issue we analyzed the
base BiDAF model and proposed adding two things
1 Bi-Attention and Self-Attention over Query
2 Second level attention over the output of (Bi-Attention + Self-Attention) from both Context and Query
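The heat map described above is simply a passage-by-question similarity matrix. The sketch below computes one with stand-in random word vectors; a real plot would use the model's trained embeddings and attention weights, and matplotlib's imshow can render the resulting matrix.

```python
import numpy as np

rng = np.random.default_rng(1)

passage = "The Panthers used the San Jose State practice facility".split()
question = "At what university 's facility did the Panthers practice".split()

# Stand-in word vectors; a real system would use the model's embeddings.
vocab = {w: rng.normal(size=16) for w in set(passage + question)}

def similarity_matrix(p_words, q_words):
    """Cosine similarity between every passage word and every question word."""
    P = np.stack([vocab[w] / np.linalg.norm(vocab[w]) for w in p_words])
    Q = np.stack([vocab[w] / np.linalg.norm(vocab[w]) for w in q_words])
    return P @ Q.T          # shape (len(passage), len(question))

S = similarity_matrix(passage, question)
print(S.shape)   # (9, 9)
```

Each row of S corresponds to one passage word; inspecting a row shows which question words it attends to, which is exactly what revealed the under-weighted 'Panthers' in the example above.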
Multi-Attention BiDAF Model
Our comprehension model is a hierarchical multi-stage process and consists of
the following layers
1 Embedding Just as in other models, we embed words using pretrained word vectors. We also embed the characters in each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to get character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training
2 Pre-Process A shared bi-directional GRU (Cho et al 2014) is used to map the question and passage embeddings to context-aware embeddings
3 Context Attention The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al 2016) is used to build a query-aware context representation Let h_i be the vector for context word i q_j be the vector for question word j and n_q and n_c be the lengths of the question and context respectively We compute attention between context word i and question word j as
a_ij = w1 · h_i + w2 · q_j + w3 · (h_i ⊙ q_j)    (4-1)
where w1, w2, and w3 are learned vectors and ⊙ is element-wise multiplication. We then compute an attended vector c_i for each context token as
p_ij = exp(a_ij) / Σ_k exp(a_ik),    c_i = Σ_j p_ij q_j    (4-2)
We also compute a query-to-context vector q_c
p_i = softmax_i(max_j a_ij),    q_c = Σ_i p_i h_i    (4-3)
The final vector computed for each token is built by concatenating h_i, c_i,
h_i ⊙ c_i, and q_c ⊙ h_i. In our model we subsequently pass the result through a linear layer with ReLU activations
4 Context Self-Attention Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU. Then we apply the same attention mechanism, only now between the passage and itself; in this case we do not use query-to-context attention, and we set a_ij = -inf if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so its output is additionally summed with the input
5 Query Attention Here we proceed as in the context attention layer but calculate the weighted sum of the context words for each query word, so the output length equals the number of query words. We then calculate context-to-query attention analogously to the query-to-context attention of the context attention layer
Figure 4-1 The modified BiDAF model with multilevel attention
6 Query Self-Attention This part is done the same way as the context self-attention layer but from the output of the Query Attention layer
7 Context Query Bi-Attention + Self-Attention The output of the Context self-attention and the Query Self-Attention layers are taken as input and the same process for Bi-Attention and self-attention is applied on these inputs
8 Prediction In the last layer of our model a bidirectional GRU is applied followed by a linear layer that computes answer start scores for each token The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores The softmax operation is applied to the start and end scores to produce start and end probabilities and we optimize the negative log likelihood of selecting correct start and end tokens
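As a sanity check on the shapes involved in the context attention layer (Eqs. 4-1 to 4-3), the computation can be sketched in numpy. The dimensions and random values below are illustrative only, not the trained parameters.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_c, n_q = 8, 6, 4                      # hidden size, context/question lengths
H = rng.normal(size=(n_c, d))              # context vectors h_i
Q = rng.normal(size=(n_q, d))              # question vectors q_j
w1, w2, w3 = rng.normal(size=(3, d))       # learned vectors of Eq. 4-1

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Eq. 4-1: a_ij = w1·h_i + w2·q_j + w3·(h_i ⊙ q_j), for all i, j at once
A = H @ w1[:, None] + (Q @ w2)[None, :] + (H * w3) @ Q.T   # (n_c, n_q)

# Eq. 4-2: attended vector c_i for every context token
C = softmax(A, axis=1) @ Q                 # (n_c, d)

# Eq. 4-3: single query-to-context vector q_c
q_c = softmax(A.max(axis=1)) @ H           # (d,)

# Final per-token representation: [h_i; c_i; h_i ⊙ c_i; q_c ⊙ h_i]
G = np.concatenate([H, C, H * C, q_c * H], axis=1)
print(G.shape)   # (6, 32)
```

Note the asymmetry: every context token gets its own attended question vector c_i, while the whole passage shares one query-to-context vector q_c.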
Having carried out this modification, we were able to fix the failing example we
started with: the multilevel attention model gives the correct output, 'San Jose
State'. We also achieved slightly better scores than the original model, with an F1
score of 85.44 on the SQuAD dev dataset
Chatbot Design Using a QA System
Designing a perfect chatbot that passes the Turing test is a fundamental goal of
Artificial Intelligence. Although we are many orders of magnitude away from achieving
such a goal, domain-specific tasks can be solved with chatbots built from current
technology. We started with a similar objective in mind, i.e., to design a
domain-specific chatbot and then generalize to other areas once it achieves the
first domain-specific objective robustly. This led us to the fundamental problem of
Machine Comprehension and subsequently to the task of Question Answering. Having
achieved some degree of success with QA systems, we looked back to see whether we
could apply our newly acquired knowledge to the task of designing chatbots
The chatbots made with today's technologies mostly rely on handcrafted techniques
such as template matching, which requires anticipating all possible ways a user may
articulate his requirements and a conversation may unfold. This takes many man-hours
of design for a single domain-specific system and is still very error prone. In this section
we propose a general Chatbot design that would make the designing of a domain
specific chatbot very easy and robust at the same time
Every domain-specific chatbot needs to obtain a set of information from the user
and show some results based on the user-specific information obtained. Traditional
chatbots use template matching and keyword lookup to determine whether the user has
provided the required information. Our idea is to use the Question Answering system in
the backend to extract the required information from whatever the user has typed
up to this point of the conversation. The information to be extracted can be posed in the
form of a set of questions, and the answers obtained from those questions can be used
as the parameters to supply the relevant information to the user
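The idea can be sketched as follows. Here qa_model is a hypothetical stand-in for the BiDAF-style backend, faked with a keyword rule so the example runs without the real model; the slot names and questions mirror the flight-booking domain below.

```python
# Sketch: pose each required slot as a question against the conversation-so-far.

def qa_model(passage: str, question: str):
    """Return (answer, confidence). Placeholder for the real QA backend:
    a crude keyword rule stands in for the trained model."""
    if "go to" in passage and question.startswith("Where do you want to go"):
        return passage.split("go to ")[1].split()[0], 12.0
    return "", 0.5

SLOT_QUESTIONS = {
    "destination": "Where do you want to go?",
    "origin": "From where do you want to leave?",
    "date": "When do you want to depart?",
}

def extract_slots(conversation: str):
    """Run every slot question against the conversation treated as a passage."""
    return {slot: qa_model(conversation, q) for slot, q in SLOT_QUESTIONS.items()}

slots = extract_slots("I want to go to Boston next week")
print(slots["destination"])   # ('Boston', 12.0)
```

The returned confidences are what the validation thresholds described later in this section operate on.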
We chose flight reservation as our chatbot domain. Our goal was to extract the
required information from the user to be able to show the available flights as per
the user's requirements
Figure 4-2 Flight reservation chatbotrsquos chat window
For a flight reservation task the booking agent needs to know, at minimum, the
origin city, destination city, and date of travel to be able to show the available flights.
Optional information includes the number of tickets, the passenger's name, one-way or
round trip, etc.
The minimalistic conversation with the user through the chat window would be as
shown above. We had a platform called OneTask on which we wanted to implement our
chat bot; the chat interface within the OneTask system looks as follows
Figure 4-3 Chatbot within OneTask system
The working of the chat system is as follows:
1 Initiation The user opens the chat window, which starts a session with the chat bot. The chat bot asks 'How may I help you?'
2 User Reply The user may reply with none of the required information for flight booking or may provide multiple pieces of information in the same message
3 User Reply parsing The conversation up to this point is treated as a passage and the internal questions are run on this passage. The questions that are run include:
Where do you want to go?
From where do you want to leave?
When do you want to depart?
4 Parsed responses from QA model After running the questions on the QA system with the given conversation, answers are obtained for all of them along with their corresponding confidence values. Since answers will be produced even if the required question has not been answered up to this point in the conversation, the confidence values play a crucial role in determining the validity of an answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly, a confidence of 2 - 10 signifies that it may have been answered but the system should verify with the user for correctness, and any answer with confidence below 2 is discarded
5 Asking remaining questions iteratively After the parsing, it is checked whether any of the required questions are still unanswered. If so, the chatbot asks a remaining question and the process from steps 3 - 4 is carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request
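Steps 4 and 5 above amount to a confidence-thresholded slot-filling loop, which can be sketched as follows; the cut-offs (10 and 2) follow the ranges stated in the text and are specific to our model's score scale.

```python
def validate_answer(confidence: float) -> str:
    """Map a QA confidence score to an action:
    above 10 accept, 2-10 verify with the user, below 2 discard."""
    if confidence > 10:
        return "accept"
    if confidence >= 2:
        return "verify"
    return "discard"

def next_question(questions, confidences):
    """Return the first required question whose answer is not yet accepted,
    or None once every slot is filled."""
    for q in questions:
        if validate_answer(confidences.get(q, 0.0)) != "accept":
            return q
    return None

questions = ["Where do you want to go?",
             "From where do you want to leave?",
             "When do you want to depart?"]
confidences = {"Where do you want to go?": 12.5}
print(next_question(questions, confidences))  # From where do you want to leave?
```

The chatbot simply calls next_question after each parse and stops asking once it returns None.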
Figure 4-4 The Flow diagram of the Flight booking Chatbot system
Online QA System and Attention Visualization
To be able to test out various examples, we set up the BiDAF [24] model as an
online demo. One can either choose from the available examples in the drop-down
menu or paste one's own passage and questions. While this is a useful and interesting
system for testing the model in a user-friendly way, we created it primarily to be able to
focus on the wrong samples
Figure 4-5 QA system interface with attention highlight over candidate answers
The candidate answers are shown with blue highlights according to their
confidence values: the higher the confidence, the darker the answer. The candidate
with the highest confidence value is chosen as the predicted answer. We developed
the system to show the attention spread over the candidate answers in order to
understand what needs to be done to improve the model. This led us to realize the
importance of including the query attention part as well as multilevel attention in the
BiDAF model, as described in the first section of this chapter
CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions
Having done a thorough analysis of the current methods and the state of the art in
Machine Comprehension and the QA task, and having developed systems on top of
them, we have a strong sense of what needs to be done to further improve QA models.
After observing the wrong samples, we can see that the system is still unable to encode
meaning and is picking answers based on statistical patterns of answer occurrence
in the training examples
From the state-of-the-art models and the ongoing research literature,
it is easy to conclude that more features need to be embedded to encode the meaning
of words, phrases, and sentences. The Reinforced Mnemonic Reader for
Machine Comprehension [6] encoded POS and NER tags of words along with their word
and character embeddings, which gave better results
Figure 5-1 An English language semantic parse tree [26]
We have developed a method to encode the syntax parse tree of a sentence that
encodes not only the POS tags but also the relation of the word within its phrase and
the relation of the phrase within the whole sentence, in a hierarchical manner
Finally, data augmentation is another way to get better results. One definite
way to reduce the errors would be to include in the training data samples similar to
those the system falters on in the dev set: one could generate examples similar to the
failure cases and include them in the training set for better prediction. Another option
is to train on similar, bigger datasets. Our models were trained on the SQuAD
dataset, but other datasets such as TriviaQA address a similar question answering
task. We could augment the SQuAD training set with TriviaQA to obtain a more robust
system that generalizes better and thus predicts answer spans with higher accuracy
Conclusion
In this work we explored the most fundamental techniques that have shaped the
current state of the art. We then proposed a minor architectural improvement over an
existing model. Furthermore, we developed two applications that use the base model:
first, a chatbot application built on the QA system, and second, a web interface where
the model can be run on any passage and question, which also shows the attention
spread over the candidate answers. While our effort to push the state of the art forward
is ongoing, we strongly believe that surpassing human-level accuracy on this task will
pay high dividends for society at large
LIST OF REFERENCES
[1] Chen, Danqi. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U
[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.
[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension." arXiv preprint arXiv:1705.03551 (2017).
[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ questions for machine comprehension of text." arXiv preprint arXiv:1606.05250 (2016).
[5] "The Stanford Question Answering Dataset." 2018. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/
[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced mnemonic reader for machine comprehension." CoRR abs/1705.02798 (2017).
[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.
[8] Mehrotra, Neetesh. "The Connect Between Deep Learning and AI." 2018. Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai/
[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.
[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
[11] Williams, Ronald J., and David Zipser. "A learning algorithm for continually running fully recurrent neural networks." Neural Computation 1, no. 2 (1989): 270-280.
[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9, no. 8 (1997): 1735-1780.
[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv preprint arXiv:1412.3555 (2014).
[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic regularities in continuous space word representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.
[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.
[16] "Word2Vec Tutorial - The Skip-Gram Model." 2016. Chris McCormick. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
[17] "[Paper Introduction] Bilingual Word Representations with Monolingual Quality in Mind." 2015. SlideShare. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind
[18] "Attention Mechanism." 2016. Heuritech Blog. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism/
[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. "Memory Networks." 2014. arXiv preprint arXiv:1410.3916. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916
[20] Wang, Shuohang, and Jing Jiang. "Machine comprehension using Match-LSTM and answer pointer." arXiv preprint arXiv:1608.07905 (2016).
[21] Wang, Shuohang, and Jing Jiang. "Learning natural language inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).
[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.
[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated self-matching networks for reading comprehension and question answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.
[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional attention flow for machine comprehension." arXiv preprint arXiv:1611.01603 (2016).
[25] Clark, Christopher, and Matt Gardner. "Simple and effective multi-paragraph reading comprehension." arXiv preprint arXiv:1710.10723 (2017).
[26] "8. Analyzing Sentence Structure." 2018. nltk.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata, India, and had a strong interest in
computers from an early age. After high school he did his BSc in computer
science at Ramakrishna Mission Residential College, Narendrapur, followed by an MSc
in computer science at St. Xavier's College, Kolkata. He had a strong intuition and
interest for human-like learning systems and wanted to work in this area. He started
working at TCS Innovation Labs, Pune, on applications of Natural Language
Processing in education. Deeply passionate about learning systems that mimic the
human brain and learn like a human child does, he became increasingly interested in
Deep Learning and its applications. After working for a year, he went on to pursue a
Master of Science degree in computer science at the University of Florida, Gainesville.
His academic interests have been focused on Deep Learning and Natural Language
Processing, and he has been working on Machine Reading Comprehension since the
summer of 2017.
and the weights are chosen according to the relevance of each y_i given the context c
[18]
Figure 2-4 Attention Mechanism flow [18]
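In code, this soft attention is a softmax-weighted sum. The dot-product scoring below is one common choice of relevance function, used here for illustration rather than taken from [18].

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(c, ys):
    """Soft attention: score each y_i against the context c,
    normalize the scores, and return the weighted sum."""
    scores = np.array([y @ c for y in ys])   # dot-product relevance
    alphas = softmax(scores)                 # weights sum to 1
    return alphas @ np.stack(ys), alphas

c = np.array([1.0, 0.0])
ys = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
z, alphas = attend(c, ys)
print(alphas)   # the y_i aligned with c receives the larger weight
```

Because the weights are a differentiable function of the scores, the whole mechanism can be trained end to end by backpropagation.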
Memory Networks
While Convolutional Neural Networks and Recurrent Neural Networks do capture
how we form our visual and sequential memories, their memory (encoded by hidden
states and weights) is typically too small and not compartmentalized enough to
accurately remember facts from the past, since knowledge is compressed into dense
vectors [19]
Deep Learning needed a methodology that preserves memories as they are, so
that they are not lost in generalization and recalling exact words or sequences of
events remains possible, something computers are already good at. This effort led to
Memory Networks [19], published at ICLR 2015 by Facebook AI Research
This paper provides a basic framework to store, augment, and retrieve memories
while working seamlessly with a Recurrent Neural Network architecture. The memory
network consists of a memory m (an array of objects indexed by m_i) and four
(potentially learned) components I, G, O, and R, as follows:
I (input feature map): converts the incoming input to the internal feature
representation, either a sparse or dense feature vector like that from word2vec or
GloVe.
G (generalization): updates old memories given the new input. They call this
generalization because the network has an opportunity to compress and generalize its
memories at this stage for some intended future use.
O (output feature map): produces a new output (in the feature representation
space) given the new input and the current memory state. This component is
responsible for performing inference; in a question answering system it selects the
candidate sentences (which might contain the answer) from the story (conversation)
so far.
R (response): converts the output into the desired response format, for example
a textual response or an action. In the QA system described, this component finds the
desired answer and then converts it from the feature representation to the actual
word.
This is a fully supervised model, meaning all the candidate sentences from
which the answer could be found are marked during the training phase; this can also
be termed 'hard attention'
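A toy skeleton of the I, G, O, R pipeline is shown below. Bag-of-words features and word-overlap scoring stand in for the learned components of the actual Memory Network; the story sentences are invented for illustration.

```python
class SimpleMemoryNetwork:
    """Toy I/G/O/R skeleton: set-of-words features, overlap-based inference."""

    def __init__(self):
        self.memory = []                      # m: array of stored objects

    def I(self, text):                        # input feature map
        return set(text.lower().split())

    def G(self, features, raw):               # generalization: store new memory
        self.memory.append((features, raw))

    def O(self, q_features):                  # output: best supporting memory
        return max(self.memory, key=lambda m: len(m[0] & q_features))

    def R(self, memory, q_features):          # response: format the answer
        return memory[1]

    def answer(self, question):
        q = self.I(question)
        return self.R(self.O(q), q)

mn = SimpleMemoryNetwork()
for fact in ["Bilbo travelled to the cave", "Gollum dropped the ring there"]:
    mn.G(mn.I(fact), fact)
print(mn.answer("Where is the ring"))   # Gollum dropped the ring there
```

Each memory is stored verbatim rather than compressed into a hidden state, which is exactly the property that motivated the architecture.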
The authors tested out the QA system on various literature including Lord of the
Rings
Figure 2-5 QA example on Lord of the Rings using Memory Networks [19]
CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART
There has been rapid progress since the release of the SQuAD dataset, and
currently the best ensemble models are close to human-level accuracy in machine
comprehension. This is due to various ingenious methods that solve some of the
problems with previous approaches: Out of Vocabulary tokens were handled using
character embeddings, long-term dependencies within the context passage were
addressed using self-attention, and many other techniques such as contextualized
vectors, history of words, and Attention Flow were introduced. In this chapter we will
look at some of the most important models that were fundamental to the progress of
Question Answering
Machine Comprehension Using Match-LSTM and Answer Pointer
In this paper [20] the authors propose an end-to-end neural architecture for the
QA task The architecture is based on match-LSTM [21] a model they proposed for
textual entailment and Pointer Net [22] a sequence-to-sequence model proposed by
Vinyals et al (2015) to constrain the output tokens to be from the input sequences
The model consists of an LSTM preprocessing layer a match-LSTM layer and an
Answer Pointer layer
We are given a piece of text, which we refer to as a passage, and a question
related to the passage. The passage is represented by a matrix P of word embeddings,
where P is the length (number of tokens) of the passage and d is the dimensionality of
the word embeddings. Similarly, the question is represented by a matrix Q, where Q is
the length of the question. Our goal is to identify a subsequence from the passage as
the answer to the question
Figure 3-1 Match-LSTM Model Architecture [20]
LSTM Preprocessing layer They use a standard one-directional LSTM
(Hochreiter & Schmidhuber 1997) to process the passage and the question separately,
obtaining hidden representations H^p = LSTM(P) and H^q = LSTM(Q)
Match LSTM Layer They applied the match-LSTM model proposed for textual
entailment to their machine comprehension problem by treating the question as a
premise and the passage as a hypothesis. The match-LSTM sequentially goes through
the passage; at position i of the passage, it first uses the standard word-by-word
attention mechanism to obtain an attention weight vector as follows
G_i = tanh(W^q H^q + (W^p h_i^p + W^r h_{i-1}^r + b^p) ⊗ e_Q),    α_i = softmax(w^T G_i + b ⊗ e_Q)    (3-1)
where W^q, W^p, W^r, b^p, w, and b are parameters to be learned, and ⊗ e_Q denotes repeating the vector across the question length
Answer Pointer Layer The Answer Pointer (Ans-Ptr) layer is motivated by the
Pointer Net introduced by Vinyals et al (2015) [22] The boundary model produces only
the start token and the end token of the answer and then all the tokens between these
two in the original passage are considered to be the answer
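Decoding with the boundary model reduces to a constrained argmax over start and end positions. The max_len cap below is a common practical addition for sketching purposes, not part of the original formulation.

```python
import numpy as np

def best_span(start_scores, end_scores, max_len=15):
    """Boundary model decoding: pick the (start, end) pair that maximizes
    start_score + end_score, subject to end >= start (and a length cap)."""
    best, best_score = (0, 0), -np.inf
    for s in range(len(start_scores)):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = start_scores[s] + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

start = np.array([0.1, 2.0, 0.3, 0.2])
end   = np.array([0.0, 0.5, 3.0, 0.1])
print(best_span(start, end))   # (1, 2)
```

All tokens between the chosen start and end positions are then returned as the answer, which is what constrains the output to be a span of the input passage.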
When this paper was released in November 2016, the Match-LSTM method
was the state of the art in Question Answering and was at the top of the
leaderboard for the SQuAD dataset
R-NET Matching Reading Comprehension with Self-Matching Networks
In this model [23], the question and passage are first processed by a
bidirectional recurrent network (Mikolov et al 2010) separately. They then match the
question and passage with gated attention-based recurrent networks, obtaining a
question-aware representation of the passage. On top of that, they apply self-matching
attention to aggregate evidence from the whole passage and refine the passage
representation, which is then fed into the output layer to predict the boundary of the
answer span
Question and passage encoding First the words are converted to their
respective word-level embeddings and character level embeddings The character-level
embeddings are generated by taking the final hidden states of a bi-directional recurrent
neural network (RNN) applied to embeddings of characters in the token Such
character-level embeddings have been shown to be helpful to deal with out-of-vocab
(OOV) tokens
They then use a bi-directional RNN to produce new representations u^q_t and u^p_t
of all words in the question and passage respectively
Figure 3-2 R-NET model architecture [23]
Gated Attention-based Recurrent Networks They use a variant of attention-based
recurrent networks with an additional gate to determine the importance of
information in the passage with regard to a question. Different from the gates in LSTM
or GRU, the additional gate is based on the current passage word and its
attention-pooling vector of the question, which focuses on the relation between the
question and the current passage word. The gate effectively models the phenomenon
that only parts of the passage are relevant to the question in reading comprehension
and question answering, and only that information is utilized in subsequent calculations
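The additional gate can be sketched as an elementwise sigmoid applied to the concatenation of the passage word vector and its attention-pooling vector. The matrix shapes and random values here are illustrative, not the trained parameters of R-NET.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 6
W_g = rng.normal(size=(2 * d, 2 * d))     # gate parameters (illustrative)

def gated_input(u_p, c_t):
    """Scale [u_p; c_t] with a learned elementwise gate before it enters
    the recurrent network, so irrelevant parts are suppressed."""
    v = np.concatenate([u_p, c_t])        # passage word + its question context
    g = sigmoid(W_g @ v)                  # gate values in (0, 1)
    return g * v

out = gated_input(rng.normal(size=d), rng.normal(size=d))
print(out.shape)   # (12,)
```

Unlike the internal gates of an LSTM or GRU, this gate acts on the RNN's input itself, which is what lets the model ignore passage words that are irrelevant to the question.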
Self-Matching Attention The previous step generates a question-aware passage
representation that highlights the important parts of the passage. One problem with
such a representation is that it has very limited knowledge of context: an answer
candidate is often oblivious to important cues in the passage outside its surrounding
window. To address this problem, the authors propose directly matching the
question-aware passage representation against itself. It dynamically collects evidence
from the whole passage for each word and encodes the evidence relevant to the
current passage word, together with its matching question information, into the
passage representation
Output Layer They use the same method as Wang & Jiang (2016b), using
pointer networks (Vinyals et al 2015) to predict the start and end positions of the
answer. In addition, they use attention-pooling over the question representation to
generate the initial hidden vector for the pointer network [23]
When the R-NET model first appeared on the leaderboard in March 2017, it was
at the top with a 72.3 Exact Match and 80.7 F1 score
Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
BiDAF [24] is a hierarchical multi-stage architecture for modeling the
representations of the context paragraph at different levels of granularity BIDAF
includes character-level word-level and contextual embeddings and uses bi-directional
attention flow to obtain a query-aware context representation Their attention layer is not
used to summarize the context paragraph into a fixed-size vector Instead the attention
is computed for every time step and the attended vector at each time step along with
the representations from previous layers can flow through to the subsequent modeling
layer This reduces the information loss caused by early summarization [24]
Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]
Their machine comprehension model is a hierarchical multi-stage process and
consists of six layers
1 Character Embedding Layer maps each word to a vector space using character-level CNNs
2 Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model
3 Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embedding of the words These first three layers are applied to both the query and context They use a LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words We place an LSTM in both directions and concatenate the outputs of the two LSTMs
4 Attention Flow Layer couples the query and context vectors and produces a set of query aware feature vectors for each word in the context
5 Modeling Layer employs a Recurrent Neural Network to scan the context The input to the modeling layer is G which encodes the query-aware representations of context words The output of the modeling layer captures the interaction among the context words conditioned on the query They use two layers of bi-directional LSTM
27
with the output size of d for each direction Hence a matrix is obtained which is passed onto the output layer to predict the answer
6 Output Layer provides an answer to the query They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices by the predicted distributions averaged over all examples [25]
In a further variation of their above work they add a self-attention layer after the
Bi-attention layer to further improve the results The architecture of the model is as
Figure 3-3 The task of Question Answering [25]
28
Summary
In this chapter we reviewed the methods that are fundamental to the state of the
art in Machine Comprehension and for the task of Question Answering We have
reached closed to human level accuracy and this is due to incremental developments
over previous models As we saw Out of Vocabulary (OOV) tokens were handled by
using Character embedding Long term dependency within context passage were
solved using self-attention and many other techniques such as Contextualized vectors
Attention Flow etc were employed to get better results In the next chapter we will see
how we can build on these models and develop further
29
CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING
Although the advancements in Question Answering systems since the release of
the SQuAD dataset has been impressive and the results are getting close to human
level accuracy it is far from being a fool-proof system The models still make mistakes
which would be obvious to a human For example
Passage The Panthers used the San Jose State practice facility and stayed at
the San Jose Marriott The Broncos practiced at Florida State Facility and stayed at the
Santa Clara Marriott
Question At what universitys facility did the Panthers practice
Actual Answer San Jose State
Predicted Answer Florida State Facility
To find out what is leading to the wrong predictions we wanted to see the
attention weights associated with such an example We plotted the passage and
question heat map which is a 2D matrix where the intensity of each cell signifies the
similarity between a passage word and a question word For the above example we
found out that while certain words of the question are given high weightage other parts
are not The words lsquoAtrsquo lsquofacilityrsquo lsquopracticersquo are given high attention but lsquoPanthersrsquo does
not receive high attention If it had received high attention then the system would have
predicted lsquoSan Jose Statersquo as the right answer To solve this issue we analyzed the
base BiDAF model and proposed adding two things
1. Bi-Attention and Self-Attention over the query
2. A second level of attention over the outputs of (Bi-Attention + Self-Attention) from both context and query
Multi-Attention BiDAF Model
Our comprehension model is a hierarchical multi-stage process consisting of
the following layers:
1. Embedding: As in other models, we embed words using pretrained word vectors. We also embed the characters of each word into size-20 learned vectors and run a convolutional neural network followed by max-pooling to obtain character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training.
2. Pre-Process: A shared bi-directional GRU (Cho et al. 2014) is used to map the question and passage embeddings to context-aware embeddings.
3. Context Attention: The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al. 2016) is used to build a query-aware context representation. Let h_i be the vector for context word i, q_j be the vector for question word j, and n_q and n_c be the lengths of the question and context respectively. We compute attention between context word i and question word j as

a_ij = w1 · h_i + w2 · q_j + w3 · (h_i ⊙ q_j)    (4-1)

where w1, w2 and w3 are learned vectors and ⊙ is element-wise multiplication. We then compute an attended vector c_i for each context token as

p_ij = exp(a_ij) / Σ_{j=1..n_q} exp(a_ij),    c_i = Σ_{j=1..n_q} p_ij q_j    (4-2)

We also compute a query-to-context vector q_c:

m_i = max_{1≤j≤n_q} a_ij,    p_i = exp(m_i) / Σ_{i=1..n_c} exp(m_i),    q_c = Σ_{i=1..n_c} p_i h_i    (4-3)
The final vector computed for each token is built by concatenating h_i, c_i,
h_i ⊙ c_i, and q_c ⊙ c_i. In our model we subsequently pass the result through a linear layer with ReLU activations.
4. Context Self-Attention: Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU. Then we apply the same attention mechanism, only now between the passage and itself. In this case we do not use query-to-context attention, and we set a_ij = −inf if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so its output is additionally summed with the input.
5. Query Attention: This works the same way as context attention, but we calculate the weighted sum of context words for each query word, so the output length equals the number of query words. We then compute a context-to-query vector, analogous to the query-to-context vector in the context attention layer.
Figure 4-1 The modified BiDAF model with multilevel attention
6. Query Self-Attention: This is done the same way as the context self-attention layer, but on the output of the Query Attention layer.
7. Context-Query Bi-Attention + Self-Attention: The outputs of the Context Self-Attention and Query Self-Attention layers are taken as input, and the same bi-attention and self-attention process is applied to these inputs.
8. Prediction: In the last layer of our model, a bidirectional GRU is applied, followed by a linear layer that computes answer start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log likelihood of selecting correct start and end tokens.
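As a concrete illustration, the context attention of layer 3 can be sketched in a few lines of NumPy. This is a simplified sketch, not the trained model: H and Q stand in for the GRU outputs for context and question, and the learned vectors w1, w2, w3 are taken as given rather than trained.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def context_attention(H, Q, w1, w2, w3):
    """Tri-linear attention sketch.
    H: (n_c, d) context vectors; Q: (n_q, d) question vectors;
    w1, w2, w3: learned (d,) vectors (given here as inputs)."""
    # a[i, j] = w1 . h_i + w2 . q_j + w3 . (h_i * q_j)      -- Eq. 4-1
    a = (H @ w1)[:, None] + (Q @ w2)[None, :] + (H * w3) @ Q.T
    # context-to-query: attended question vector c_i per context token -- Eq. 4-2
    C = softmax(a, axis=1) @ Q                 # (n_c, d)
    # query-to-context: one vector, weighted by each row's max attention -- Eq. 4-3
    q_c = softmax(a.max(axis=1), axis=0) @ H   # (d,)
    # per-token output: [h_i; c_i; h_i * c_i; q_c * c_i]
    return np.concatenate([H, C, H * C, q_c[None, :] * C], axis=1)
```

The output has dimension 4d per context token, which is what the subsequent linear layer with ReLU activations consumes.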
Having carried out this modification, we were able to solve the wrong example we
started with: the multilevel attention model gives the correct output, "San Jose
State". We also achieved slightly better scores than the original model, with an F1
score of 85.44 on the SQuAD dev set.
Chatbot Design Using a QA System
Designing a chatbot that passes the Turing test is a fundamental goal of
Artificial Intelligence. Although we are many orders of magnitude away from achieving
such a goal, domain-specific tasks can be solved with chatbots built from current
technology. We started with a similar objective in mind: to design a
domain-specific chatbot, and then generalize to other areas once the
first domain-specific objective is achieved robustly. This led us to the fundamental problem of
Machine Comprehension, and subsequently to the task of Question Answering. Having
achieved some degree of success with QA systems, we looked back to see whether we could apply
our newly acquired knowledge to the task of designing chatbots.
Chatbots made with today's technologies mostly rely on handcrafted techniques
such as template matching, which requires anticipating all possible ways a user may
articulate his requirements and a conversation may unfold. This takes many man-hours
when designing a domain-specific system and is still very error prone. In this section
we propose a general chatbot design that would make designing a domain-specific
chatbot easy and robust at the same time.
Every domain-specific chatbot needs to obtain a set of information from the user
and show results based on the user-specific information obtained. Traditional
chatbots use template matching and keyword lookup to determine whether the user has
provided the required information. Our idea is to use the Question Answering system in
the backend to extract the required information from whatever the user has typed
up to this point in the conversation. The information to be extracted can be posed in the
form of a set of questions, and the answers obtained from those questions can be used
as the parameters to supply the relevant information to the user.
We chose flight reservation as our chatbot domain. Our goal
was to extract the required information from the user to be able to show the
available flights as per the user's requirements.
Figure 4-2 Flight reservation chatbotrsquos chat window
For a flight reservation task, the booking agent needs to know at minimum the origin city,
destination city, and date of travel to be able to show the available flights.
Optional information includes the number of tickets, the passenger's name, one-way or
round trip, etc.
A minimal conversation with the user through the chat window would be as
shown above. We had a platform called OneTask on which we wanted to implement our
chatbot. The chat interface within the OneTask system looks as follows.
Figure 4-3 Chatbot within OneTask system
The working of the chat system is as follows:
1. Initiation: The user opens the chat window, which starts a session with the chatbot. The chatbot asks 'How may I help you?'
2. User Reply: The user may reply with none of the required information for flight booking, or may provide multiple pieces of information in the same message.
3. User Reply Parsing: The conversation up to this point is treated as a passage, and the internal questions are run on it. Four questions are run, including:
Where do you want to go?
From where do you want to leave?
When do you want to depart?
4. Parsed Responses from QA Model: After running the questions on the QA system with the given conversation, answers are obtained for all of them, along with their corresponding confidence values. Since answers will be returned even if the required information has not yet appeared in the conversation, the confidence values play a crucial role in determining the validity of an answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly; a confidence of 2-10 signifies that it may have been answered, but the system should verify with the user for correctness; and any answer with confidence below 2 is discarded.
5. Asking Remaining Questions Iteratively: After parsing, the system checks whether any of the required questions are still unanswered. If so, the chatbot asks the remaining questions, and steps 3-4 are carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per the request.
Figure 4-4 The Flow diagram of the Flight booking Chatbot system
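The iterative flow above can be sketched as a simple slot-filling loop. This is a sketch under stated assumptions: the function answer_question stands in for the QA backend (it is a hypothetical interface, not an actual API), and the question set and thresholds mirror the ones described in the text.

```python
# Hypothetical interface to the QA backend:
#   answer_question(passage, question) -> (answer_text, confidence)
REQUIRED_QUESTIONS = {
    "destination": "Where do you want to go?",
    "origin": "From where do you want to leave?",
    "date": "When do you want to depart?",
}
ACCEPT, VERIFY = 10.0, 2.0  # confidence thresholds from the text

def parse_conversation(conversation, answer_question):
    """Run the internal questions over the conversation-as-passage."""
    accepted, to_verify = {}, {}
    for slot, question in REQUIRED_QUESTIONS.items():
        answer, conf = answer_question(conversation, question)
        if conf > ACCEPT:
            accepted[slot] = answer      # treated as answered correctly
        elif conf >= VERIFY:
            to_verify[slot] = answer     # confirm with the user
        # below VERIFY: discard; the chatbot will ask this question next
    return accepted, to_verify

def next_question(accepted):
    """Pick the next unanswered required question, if any."""
    for slot, question in REQUIRED_QUESTIONS.items():
        if slot not in accepted:
            return question
    return None  # all slots filled: show available flights
```

The chatbot loops: it calls parse_conversation on the growing transcript, asks next_question until it returns None, and then queries the flight database with the filled slots.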
Online QA System and Attention Visualization
To be able to test various examples, we set up the BiDAF [24] model as an
online demo. One can either choose from the available examples in the drop-down
menu or paste in one's own passage and questions. While this is a useful and interesting
system for testing the model in a user-friendly way, we created it primarily to be able to
focus on the wrongly answered samples.
Figure 4-5 QA system interface with attention highlight over candidate answers
Candidate answers are shown with blue highlights according to their
confidence values: the higher the confidence, the darker the highlight. The candidate with the highest
confidence value is chosen as the predicted answer. We developed the system to show the
attention spread of the candidate answers in order to understand what needs to be done to improve
the system. This led us to realize the importance of adding query attention as
well as multilevel attention to the BiDAF model, as described in the first section of this
chapter.
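The candidate answers shown in the interface can be produced directly from the start and end probabilities emitted by the prediction layer. A minimal sketch follows; the span-length cap of 15 tokens is an illustrative assumption, not a value from the original system.

```python
def candidate_spans(p_start, p_end, k=5, max_len=15):
    """Rank answer spans (i, j) by p_start[i] * p_end[j], with i <= j.
    Returns the top-k spans with their confidence scores."""
    scored = []
    for i, ps in enumerate(p_start):
        # only consider spans of bounded length starting at token i
        for j in range(i, min(i + max_len, len(p_end))):
            scored.append((ps * p_end[j], i, j))
    scored.sort(reverse=True)
    return scored[:k]
```

Each returned score can then be mapped to a highlight intensity, so darker blue corresponds to higher confidence.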
CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions
Having done a thorough analysis of the current methods and state of the art in
Machine Comprehension and the QA task, and having developed systems upon them, we have
gained a strong sense of what needs to be done to further improve QA models.
After observing the wrongly answered samples, we can see that the system is still unable to encode
meaning, and is picking answers based on statistical patterns in the occurrence of answers
in the training examples.
From the state-of-the-art models and the ongoing research literature,
it is easy to conclude that more features need to be embedded to encode the meaning
of words, phrases, and sentences. The Reinforced Mnemonic Reader for
Machine Comprehension [6] encoded POS and NER tags of words along with their word
and character embeddings, which gave better results.
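The idea of enriching token representations with linguistic tags can be sketched as follows. The tag vocabularies here are illustrative placeholders, not the tag sets used in the cited paper.

```python
import numpy as np

POS_TAGS = ["NOUN", "VERB", "ADJ", "DET", "OTHER"]   # illustrative tag set
NER_TAGS = ["PER", "LOC", "ORG", "O"]                # illustrative tag set

def one_hot(tag, vocab):
    # one-hot encode a tag against its vocabulary
    v = np.zeros(len(vocab))
    v[vocab.index(tag)] = 1.0
    return v

def enrich(word_vec, char_vec, pos, ner):
    """Concatenate word and character embeddings with POS/NER one-hots."""
    return np.concatenate([word_vec, char_vec,
                           one_hot(pos, POS_TAGS), one_hot(ner, NER_TAGS)])
```

The enriched vector simply replaces the plain concatenated word/character embedding at the input of the encoder; everything downstream is unchanged.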
Figure 5-1 An English language semantic parse tree [26]
We have developed a method to encode the syntax parse tree of a sentence that
encodes not only the POS tags but also the relation of each word within its phrase and the
relation of each phrase within the whole sentence, in a hierarchical manner.
Finally, data augmentation is another way to get better results. One definite
way to reduce errors would be to include in the training data samples similar to those
the system falters on in the dev set: one could generate examples similar to the failure
cases and include them in the training set for better prediction. Another approach
would be to train on similar, larger datasets. Our models were trained on the SQuAD
dataset, but other datasets, such as TriviaQA, address a similar question answering task.
We could augment the SQuAD training set with that of TriviaQA to
build a more robust system that generalizes better and thus predicts answer spans
with higher accuracy.
Conclusion
In this work we explored the most fundamental techniques that have
shaped the current state of the art. We then proposed a minor architectural improvement
over an existing model. Furthermore, we developed two applications that
use the base model: first, a chatbot application built on the QA system, and second,
a web interface where the model can be run on any passage and question, which also
shows the attention spread over the candidate answers. While our effort to push the
state of the art forward is ongoing, we strongly believe that surpassing human-level
accuracy on this task will pay high dividends for society at large.
LIST OF REFERENCES
[1] Chen, Danqi. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U
[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.
[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension." arXiv preprint arXiv:1705.03551 (2017).
[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ questions for machine comprehension of text." arXiv preprint arXiv:1606.05250 (2016).
[5] "The Stanford Question Answering Dataset." 2018. rajpurkar.github.io. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/
[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced mnemonic reader for machine comprehension." CoRR abs/1705.02798 (2017).
[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.
[8] Mehrotra, Neetesh. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai/
[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.
[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
[11] Williams, Ronald J., and David Zipser. "A learning algorithm for continually running fully recurrent neural networks." Neural Computation 1, no. 2 (1989): 270-280.
[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9, no. 8 (1997): 1735-1780.
[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv preprint arXiv:1412.3555 (2014).
[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic regularities in continuous space word representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.
[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.
[16] "Word2Vec Tutorial - The Skip-Gram Model." 2016. Chris McCormick. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
[17] "[Paper Introduction] Bilingual Word Representations with Monolingual Quality in Mind." 2015. SlideShare. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind
[18] "Attention Mechanism." 2016. Heuritech Blog. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism/
[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. 2014. "Memory Networks." arXiv preprint arXiv:1410.3916. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916
[20] Wang, Shuohang, and Jing Jiang. "Machine comprehension using match-LSTM and answer pointer." arXiv preprint arXiv:1608.07905 (2016).
[21] Wang, Shuohang, and Jing Jiang. "Learning natural language inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).
[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.
[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated self-matching networks for reading comprehension and question answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.
[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional attention flow for machine comprehension." arXiv preprint arXiv:1611.01603 (2016).
[25] Clark, Christopher, and Matt Gardner. "Simple and effective multi-paragraph reading comprehension." arXiv preprint arXiv:1710.10723 (2017).
[26] "8. Analyzing Sentence Structure." 2018. nltk.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata, India, and had a strong interest in
computers from an early age. After high school, he earned a BSc in computer
science from Ramakrishna Mission Residential College, Narendrapur, followed by an MSc
in computer science from St. Xavier's College, Kolkata. With a strong intuition and
interest for human-like learning systems, he wanted to work in this area. He started
working at TCS Innovation Labs, Pune, on applications of Natural Language
Processing in education. Being deeply passionate about learning
systems that mimic the human brain and learn like a human child does, he grew
increasingly interested in Deep Learning and its applications. After working for a year,
he went on to pursue a Master of Science degree in computer science from the
University of Florida, Gainesville. His academic interests have been focused on Deep
Learning and Natural Language Processing, and he has been working on Machine
Reading Comprehension since the summer of 2017.
Summary
In this chapter we reviewed the methods that are fundamental to the state of the
art in Machine Comprehension and for the task of Question Answering We have
reached closed to human level accuracy and this is due to incremental developments
over previous models As we saw Out of Vocabulary (OOV) tokens were handled by
using Character embedding Long term dependency within context passage were
solved using self-attention and many other techniques such as Contextualized vectors
Attention Flow etc were employed to get better results In the next chapter we will see
how we can build on these models and develop further
29
CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING
Although the advancements in Question Answering systems since the release of
the SQuAD dataset has been impressive and the results are getting close to human
level accuracy it is far from being a fool-proof system The models still make mistakes
which would be obvious to a human For example
Passage The Panthers used the San Jose State practice facility and stayed at
the San Jose Marriott The Broncos practiced at Florida State Facility and stayed at the
Santa Clara Marriott
Question At what universitys facility did the Panthers practice
Actual Answer San Jose State
Predicted Answer Florida State Facility
To find out what is leading to the wrong predictions we wanted to see the
attention weights associated with such an example We plotted the passage and
question heat map which is a 2D matrix where the intensity of each cell signifies the
similarity between a passage word and a question word For the above example we
found out that while certain words of the question are given high weightage other parts
are not The words lsquoAtrsquo lsquofacilityrsquo lsquopracticersquo are given high attention but lsquoPanthersrsquo does
not receive high attention If it had received high attention then the system would have
predicted lsquoSan Jose Statersquo as the right answer To solve this issue we analyzed the
base BiDAF model and proposed adding two things
1 Bi-Attention and Self-Attention over Query
2 Second level attention over the output of (Bi-Attention + Self-Attention) from both Context and Query
30
Multi-Attention BiDAF Model
Our comprehension model is a hierarchical multi-stage process and consists of
the following layers
1 Embedding Just as all other models we embed words using pretrained word vectors We also embed the characters in each word into size 20 vectors which are learned and run a convolution neural network followed by max-pooling to get character-derived embeddings for each word The character-level and word-level embeddings are then concatenated and passed to the next layer We do not update the word embeddings during training
2 Pre-Process A shared bi-directional GRU (Cho et al 2014) is used to map the question and passage embeddings to context aware embeddings
3 Context Attention The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al 2016) is used to build a query-aware context representation Let h_i be the vector for context word i q_j be the vector for question word j and n_q and n_c be the lengths of the question and context respectively We compute attention between context word i and question word j as
(4-1)
where w1 w2 and w3 are learned vectors and is element-wise multiplication We then compute an attended vector c_i for each context token as
(4-2)
We also compute a query-to-context vector q_c
(4-3)
31
The final vector computed for each token is built by concatenating
and In our model we subsequently pass the result through a linear layer with ReLU activations
4 Context Self-Attention Next we use a layer of residual self-attention The input is passed through another bi-directional GRU Then we apply the same attention mechanism only now between the passage and itself In this case we do not use query-to context attention and we set aij = 1048576inf if i = j As before we pass the concatenated output through a linear layer with ReLU activations This layer is applied residually so this output is additionally summed with the input
5 Query Attention For this part we do the same way as context attention but calculate the weighted sum of the context words for each query word Thus we get the length as number of query words Then we calculate context to query similar to query to context in context attention layer
Figure 4-1 The modified BiDAF model with multilevel attention
6 Query Self-Attention This part is done the same way as the context self-attention layer but from the output of the Query Attention layer
32
7 Context Query Bi-Attention + Self-Attention The output of the Context self-attention and the Query Self-Attention layers are taken as input and the same process for Bi-Attention and self-attention is applied on these inputs
8 Prediction In the last layer of our model a bidirectional GRU is applied followed by a linear layer that computes answer start scores for each token The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores The softmax operation is applied to the start and end scores to produce start and end probabilities and we optimize the negative log likelihood of selecting correct start and end tokens
Having carried out this modification, we were able to solve the wrong example we started with: the multilevel attention model gives the correct output, "San Jose State". We also achieved slightly better scores than the original model, with an F1 score of 85.44 on the SQuAD dev dataset.
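The prediction layer (step 8 above) ultimately reduces to two softmaxes and a span log-likelihood. A minimal sketch, with illustrative names; the per-token logits would come from the two GRU + linear stages described earlier:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def span_nll(start_scores, end_scores, true_start, true_end):
    """Negative log likelihood of selecting the correct start and end
    tokens, as optimized in the prediction layer. `start_scores` and
    `end_scores` are per-token logits over the passage."""
    p_start = softmax(np.asarray(start_scores, dtype=float))
    p_end = softmax(np.asarray(end_scores, dtype=float))
    return -(np.log(p_start[true_start]) + np.log(p_end[true_end]))
```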
Chatbot Design Using a QA System
Designing a perfect chatbot that passes the Turing test is a fundamental goal of
Artificial Intelligence. Although we are many orders of magnitude away from achieving
such a goal, domain-specific tasks can be solved with chatbots built from current
technology. We started with a similar objective in mind, i.e., to design a
domain-specific chatbot and then generalize to other areas once it achieves the
first domain-specific objective robustly. This led us to the fundamental problem of
Machine Comprehension and subsequently to the task of Question Answering. Having
achieved some degree of success with QA systems, we looked back at whether we could
apply our newly acquired knowledge to the task of designing chatbots.
The chatbots made with today's technologies mostly rely on handcrafted techniques
such as template matching, which requires anticipating all possible ways a user may
articulate his requirements and a conversation may unfold. This takes many man-hours
when designing a domain-specific system and is still very error-prone. In this section
we propose a general chatbot design that would make designing a domain-specific
chatbot easy and robust at the same time.
Every domain-specific chatbot needs to obtain a set of information from the user
and show some results based on the user-specific information obtained. Traditional
chatbots use template matching and keyword lookup to determine whether the user has
provided the required information. Our idea is to use the Question Answering system in
the backend to extract the required information from whatever the user has typed
up to this point of the conversation. The information to be extracted can be posed as
a set of questions, and the answers obtained from those questions can be used
as the parameters to supply the relevant information to the user.
We chose flight reservation as our chatbot domain. Our goal was to extract the
required information from the user to be able to show the available flights as per
the user's requirements.
Figure 4-2 Flight reservation chatbotrsquos chat window
For a flight reservation task the booking agent needs to know, at minimum, the
origin city, destination city, and date of travel to be able to show the available
flights. Optional information includes the number of tickets, the passenger's name,
one-way or round trip, etc.
The minimalistic conversation with the user through the chat window would be as
shown above. We had a platform called OneTask on which we wanted to implement our
chatbot; the chat interface within the OneTask system looks as follows.
Figure 4-3 Chatbot within OneTask system
The chat system works as follows:
1. Initiation: The user opens the chat window, which starts a session with the chatbot. The chatbot asks 'How may I help you?'
2. User Reply: The user may reply with none of the required information for flight booking, or may include several pieces of it in the same message.
3. User Reply Parsing: The conversation up to this point is treated as a passage, and the internal questions are run on this passage. The four questions that are run include:
Where do you want to go?
From where do you want to leave?
When do you want to depart?
4. Parsed Responses from the QA Model: After running the questions on the QA system with the given conversation, answers are obtained for all of them along with their corresponding confidence values. Since answers will be returned even if the required question has not actually been answered up to this point in the conversation, the confidence values play a crucial role in determining the validity of each answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly; a confidence between 2 and 10 signifies that it may have been answered, but the chatbot should verify with the user for correctness; and any answer with confidence below 2 is discarded.
5. Asking Remaining Questions Iteratively: After the parsing, it is checked whether any of the required questions are still unanswered. If so, the chatbot asks the remaining question, and steps 3-4 are carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request.
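The steps above can be sketched as a slot-filling loop around the QA model. Here `qa_model` is a hypothetical stand-in for the trained QA system (it takes a passage and a question and returns an answer with its confidence); the slot names are also illustrative. Only the internal questions and the confidence thresholds come from the description above.

```python
REQUIRED = {
    "destination": "Where do you want to go?",
    "origin": "From where do you want to leave?",
    "date": "When do you want to depart?",
}

def parse_conversation(conversation, qa_model):
    """Run the internal questions over the conversation-so-far and
    bucket each answer by its confidence value (step 4)."""
    confirmed, to_verify = {}, {}
    for slot, question in REQUIRED.items():
        answer, confidence = qa_model(conversation, question)
        if confidence > 10:        # accepted as answered correctly
            confirmed[slot] = answer
        elif confidence >= 2:      # plausible: verify with the user
            to_verify[slot] = answer
        # below 2: discarded; the question will be asked explicitly
    return confirmed, to_verify

def next_question(confirmed):
    """First still-unanswered required question (step 5), or None."""
    for slot, question in REQUIRED.items():
        if slot not in confirmed:
            return question
    return None
```

Each user turn would append to the conversation string, re-run `parse_conversation`, and either ask `next_question(confirmed)` or show the flight results once it returns `None`.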
Figure 4-4 The Flow diagram of the Flight booking Chatbot system
Online QA System and Attention Visualization
To be able to test various examples, we used the BiDAF [24] model for an
online demo. One can either choose from the available examples in the drop-down
menu or paste one's own passage and questions. While this is a useful and interesting
system for testing the model in a user-friendly way, we created it to be able to
focus on the wrong samples.
Figure 4-5 QA system interface with attention highlight over candidate answers
The candidate answers are shown with blue highlights according to their
confidence values: the higher the confidence, the darker the answer. The candidate
with the highest confidence value is chosen as the predicted answer. We developed the
system to show the attention spread over the candidate answers in order to understand
what needs to be done to improve the system. This led us to realize the importance of
adding the query attention part as well as multilevel attention to the BiDAF model,
as described in the first section of this chapter.
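The highlighting itself is simple to reproduce: wrap each candidate span in an HTML `<span>` whose blue background opacity grows with the model's confidence. The function below is an illustrative sketch under that assumption; the demo's actual rendering code is not shown in the thesis.

```python
def highlight(passage, candidates):
    """candidates: sorted, non-overlapping (start, end, confidence)
    character spans; returns HTML with confidence-shaded spans, the
    darkest shade marking the predicted answer."""
    max_conf = max(c for _, _, c in candidates)
    html, cursor = [], 0
    for start, end, conf in candidates:
        alpha = conf / max_conf          # scale opacity to [0, 1]
        html.append(passage[cursor:start])
        html.append('<span style="background: rgba(0, 0, 255, %.2f)">%s</span>'
                    % (alpha, passage[start:end]))
        cursor = end
    html.append(passage[cursor:])
    return "".join(html)
```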
CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions
Having done a thorough analysis of the current methods and the state of the art in
Machine Comprehension and the QA task, and having developed systems on it, we have
gained a strong sense of what needs to be done to further improve QA models.
Observing the wrong samples, we can see that the system is still unable to encode
meaning and is picking answers based on statistical patterns of answer occurrence
in the training examples.
From the state-of-the-art models and the ongoing research literature,
it is easy to conclude that more features need to be embedded to encode the meaning
of words, phrases, and sentences. The Reinforced Mnemonic Reader for
Machine Comprehension [6] encoded POS and NER tags of words along with their word
and character embeddings, which gave better results.
Figure 5-1 An English language semantic parse tree [26]
We have developed a method to encode the syntax parse tree of a sentence that
encodes not only the POS tags but also the relation of the word within its phrase and
the relation of the phrase within the whole sentence, in a hierarchical manner.
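The hierarchical idea can be pictured with a toy sketch: represent the parse tree as nested tuples and, for each word, record the chain of constituent labels from the sentence root down to its POS tag, so the word's position within its phrase and within the sentence is captured together. The tree representation and label names here are illustrative, not our actual encoder.

```python
def leaf_paths(tree, prefix=()):
    """tree: (label, child, ...) nested tuples, with words as strings
    at the leaves. Yields (word, path-of-labels-from-root) pairs; the
    path places each word inside its phrase and the whole sentence."""
    label, *children = tree
    for child in children:
        if isinstance(child, str):
            yield child, prefix + (label,)
        else:
            yield from leaf_paths(child, prefix + (label,))

# "The dog barked" as a toy parse tree:
parse = ("S",
         ("NP", ("DT", "The"), ("NN", "dog")),
         ("VP", ("VBD", "barked")))
paths = dict(leaf_paths(parse))
# paths["dog"] is ("S", "NP", "NN")
```

Each path could then be embedded (e.g., one label embedding per level) and concatenated to the word and character embeddings, analogously to the POS/NER features of [6].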
Finally, data augmentation is another way to get better results. One definite
way to reduce the errors would be to include in the training data samples similar to
those the system falters on in the dev set: one could generate examples similar to the
failure cases and include them in the training set for better prediction. Another
approach would be to train on similar and bigger datasets. Our models were trained on
the SQuAD dataset, but there are other datasets, such as TriviaQA, that pose a similar
question answering task. We could augment the SQuAD training set with that of TriviaQA
to obtain a more robust system that generalizes better and thus predicts answer spans
with higher accuracy.
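Mechanically, the augmentation is a concatenation: both SQuAD and TriviaQA (once converted to SQuAD's JSON layout) keep their articles under a top-level "data" list. The sketch below assumes that pre-conversion has been done; the file names in the commented usage are hypothetical.

```python
import json  # used only in the commented-out file-loading example

def merge_squad_style(*datasets):
    """Concatenate the "data" lists of SQuAD-format dataset dicts
    into one combined training set."""
    merged = {"version": "merged", "data": []}
    for ds in datasets:
        merged["data"].extend(ds["data"])
    return merged

# squad = json.load(open("train-v1.1.json"))              # hypothetical paths
# trivia = json.load(open("triviaqa-squad-format.json"))
# combined = merge_squad_style(squad, trivia)
```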
Conclusion
In this work we explored the most fundamental techniques that have
shaped the current state of the art. We then proposed a minor architectural
improvement over an existing model. Furthermore, we developed two applications that
use the base model: first, we discussed how a chatbot application can be built on top
of the QA system, and second, we created a web interface where the model can be used
with any passage and question. This interface also shows the attention spread over
the candidate answers. While our effort to push the state of the art forward is
ongoing, we strongly believe that surpassing human-level accuracy on this task will
pay high dividends for society at large.
LIST OF REFERENCES
[1] Danqi Chen. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U

[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. New Haven, Conn.: Yale Univ., Dept. of Computer Science, 1977.

[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension." arXiv preprint arXiv:1705.03551 (2017).

[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ questions for machine comprehension of text." arXiv preprint arXiv:1606.05250 (2016).

[5] "The Stanford Question Answering Dataset." 2018. rajpurkar.github.io. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/

[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced mnemonic reader for machine comprehension." CoRR abs/1705.02798 (2017).

[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.

[8] Mehrotra, Neetesh. "The Connect Between Deep Learning and AI." 2018. Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai

[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.

[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/

[11] Williams, Ronald J., and David Zipser. "A learning algorithm for continually running fully recurrent neural networks." Neural Computation 1, no. 2 (1989): 270-280.

[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9, no. 8 (1997): 1735-1780.

[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv preprint arXiv:1412.3555 (2014).
[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic regularities in continuous space word representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.

[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.

[16] "Word2Vec Tutorial - The Skip-Gram Model." 2016. Chris McCormick. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

[17] "[Paper Introduction] Bilingual Word Representations with Monolingual …" 2015. SlideShare. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind

[18] "Attention Mechanism." 2016. blog.heuritech.com. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism/

[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. 2014. "Memory Networks." arXiv preprint arXiv:1410.3916. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916

[20] Wang, Shuohang, and Jing Jiang. "Machine comprehension using Match-LSTM and answer pointer." arXiv preprint arXiv:1608.07905 (2016).

[21] Wang, Shuohang, and Jing Jiang. "Learning natural language inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).

[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.

[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated self-matching networks for reading comprehension and question answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.

[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional attention flow for machine comprehension." arXiv preprint arXiv:1611.01603 (2016).

[25] Clark, Christopher, and Matt Gardner. "Simple and effective multi-paragraph reading comprehension." arXiv preprint arXiv:1710.10723 (2017).
[26] "8. Analyzing Sentence Structure." 2018. nltk.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata, India, and has had a strong interest in
computers from an early age. After high school he earned a B.Sc. in computer
science from Ramakrishna Mission Residential College, Narendrapur, followed by an
M.Sc. in computer science from St. Xavier's College, Kolkata. With a strong intuition
for and interest in human-like learning systems, he wanted to work in this area, and
began working at TCS Innovation Labs, Pune, on applications of Natural Language
Processing in educational software. Deeply passionate about learning systems that
mimic the human brain and learn as a human child does, he grew increasingly
interested in Deep Learning and its applications. After working for a year,
he went on to pursue a Master of Science degree in computer science at the
University of Florida, Gainesville. His academic interests have focused on Deep
Learning and Natural Language Processing, and he has been working on Machine
Reading Comprehension since the summer of 2017.
where w1 w2 and w3 are learned vectors and is element-wise multiplication We then compute an attended vector c_i for each context token as
(4-2)
We also compute a query-to-context vector q_c
(4-3)
31
The final vector computed for each token is built by concatenating
and In our model we subsequently pass the result through a linear layer with ReLU activations
4 Context Self-Attention Next we use a layer of residual self-attention The input is passed through another bi-directional GRU Then we apply the same attention mechanism only now between the passage and itself In this case we do not use query-to context attention and we set aij = 1048576inf if i = j As before we pass the concatenated output through a linear layer with ReLU activations This layer is applied residually so this output is additionally summed with the input
5 Query Attention For this part we do the same way as context attention but calculate the weighted sum of the context words for each query word Thus we get the length as number of query words Then we calculate context to query similar to query to context in context attention layer
Figure 4-1 The modified BiDAF model with multilevel attention
6 Query Self-Attention This part is done the same way as the context self-attention layer but from the output of the Query Attention layer
32
7 Context Query Bi-Attention + Self-Attention The output of the Context self-attention and the Query Self-Attention layers are taken as input and the same process for Bi-Attention and self-attention is applied on these inputs
8 Prediction In the last layer of our model a bidirectional GRU is applied followed by a linear layer that computes answer start scores for each token The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores The softmax operation is applied to the start and end scores to produce start and end probabilities and we optimize the negative log likelihood of selecting correct start and end tokens
Having carried out this modification we were able to solve the wrong example we
started with The multilevel attention model gives the correct output as ldquoSan Jose
Staterdquo Also we achieved slightly better scores than the original model with a F1 score
of 8544 on the SQuAD dev dataset
Chatbot Design Using a QA System
Designing a perfect chatbot that passes the Turing test is a fundamental goal for
Artificial Intelligence Although we are many order to magnitudes away from achieving
such a goal domain specific tasks can be solved with chatbots made from current
technology We had started our goal with a similar objective in mind ie to design a
domain specific chatbot and then generalize to other areas as it is able to achieve the
first domain specific objective robustly This led us to the fundamental problem of
Machine Comprehension and subsequently to the task of Question Answering Having
achieved some degree of success with QA systems we looked back if we could apply
our newly acquired knowledge in the task of designing Chatbots
The chatbots made with todayrsquos technologies are mostly handcrafted techniques
such as template matching that requires anticipating all possible ways a user may
articulate his requirements and a conversation may occur This requires a lot of man
hours for designing a domain specific system and is still very error prone In this section
33
we propose a general Chatbot design that would make the designing of a domain
specific chatbot very easy and robust at the same time
Every domain specific chatbot needs to obtain a set of information from the user
and show some results based on the user specific information obtained The traditional
chatbots use template matching and keywords lookup to determine if the user has
provided the required information Our idea is to use the Question Answering system in
the backend to extract out the required information from whatever the user has typed
until this point of the conversation The information to be extracted can be posed in the
form of a set of questions and the answers obtained from those questions can be used
as the parameters to supply the relevant information to the user
We had chosen our chatbot domain as the flight reservation system Our goal
was to extract the required information from the user to be able to show him the
available flights as per the userrsquos requirements
Figure 4-2 Flight reservation chatbotrsquos chat window
34
For a flight reservation task the booking agent needs to know the origin city
destination city and date of travel at minimum to be able to show the available flights
Optional information includes the number of tickets passengerrsquos name one way or
round trip etc
The minimalistic conversation with the user through the chat window would be as
shown above We had a platform called OneTask on which we wanted to implement our
chat bot The chat interface within the OneTask system looks as follows
Figure 4-3 Chatbot within OneTask system
The working of the chat system is as follows ndash
1 Initiation The user opens the chat window which starts a session with the chat bot The chat bot asks lsquoHow may I help yoursquo
2 User Reply The user may reply with none of the required information for flight booking or may reply with multiple information in the same message
3 User Reply parsing The conversation up to this point is treated as a passage and the internal questions are run on this passage So the four questions that are run are
Where do you want to go
35
From where do you want to leave
When do you want to depart
4 Parsed responses from QA model After running the questions on the QA system with the given conversation answers are obtained for all along with their corresponding confidence values Since answers will be obtained even if the required question was not answered up to this point in the conversation the confidence values plays a crucial role in determining the validity of the answer For this purpose we determined ranges of confidence values for validation A confidence value above 10 signifies the question has been answered correctly A confidence of 2 ndash 10 signifies that it may have been answered but should verify with the user for correctness and any confidence below 2 is discarded
5. Asking Remaining Questions Iteratively: After the parsing, it is checked whether any of the required questions are still unanswered. If so, the chatbot asks the remaining questions, and steps 3-4 are carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request.
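The confidence-gated slot-filling loop above can be sketched as follows. This is only an illustration of the control flow: the QA backend is stubbed out with a fixed answer/confidence table (`ask_qa_model` and its values are our invention), while the thresholds mirror the validation ranges described above (accept above 10, verify between 2 and 10, discard below 2).

```python
REQUIRED_QUESTIONS = [
    "Where do you want to go?",
    "From where do you want to leave?",
    "When do you want to depart?",
]

def ask_qa_model(question, conversation):
    """Stand-in for the real QA system: returns (answer, confidence)."""
    fake_scores = {
        "Where do you want to go?": ("New York", 14.2),
        "From where do you want to leave?": ("Miami", 6.1),
        "When do you want to depart?": ("next week", 0.8),
    }
    return fake_scores[question]

def parse_conversation(conversation):
    """Classify each required slot as accepted, needs-verification, or missing."""
    slots = {}
    for q in REQUIRED_QUESTIONS:
        answer, conf = ask_qa_model(q, conversation)
        if conf > 10:
            slots[q] = ("accepted", answer)
        elif conf >= 2:
            slots[q] = ("verify", answer)   # ask the user to confirm this answer
        else:
            slots[q] = ("missing", None)    # chatbot must still ask this question
    return slots

slots = parse_conversation("I want to fly to New York, maybe from Miami.")
```

The chatbot would then iterate: ask every "missing" question, re-parse the grown conversation, and stop once every slot is "accepted".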
Figure 4-4 Flow diagram of the flight booking chatbot system
Online QA System and Attention Visualization
To be able to test out various examples, we used the BiDAF [24] model for an online demo. One can either choose from the available examples in the drop-down menu or paste in one's own passage and questions. While this is a useful and interesting system for testing the model in a user-friendly way, we created it primarily to be able to focus on the wrongly answered samples.
36
Figure 4-5 QA system interface with attention highlight over candidate answers
The candidate answers are highlighted in blue according to their confidence values: the higher the confidence, the darker the highlight. The candidate with the highest confidence value is chosen as the predicted answer. We developed the system to show the attention spread over the candidate answers, in order to understand what needs to be done to improve the system. This led us to realize the importance of including query attention, as well as multilevel attention, in the BiDAF model, as described in the first section of this chapter.
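The confidence-to-color mapping described above can be sketched in a few lines. This is our own illustration of the idea, not the thesis code: each candidate's confidence is normalized against the best candidate and used as the opacity of a blue highlight span.

```python
def highlight(passage, candidates):
    """candidates: list of (answer_text, confidence). Darker blue = more confident."""
    top = max(conf for _, conf in candidates)
    html = passage
    for text, conf in candidates:
        opacity = conf / top                       # normalize to (0, 1]
        span = ('<span style="background: rgba(0, 0, 255, %.2f)">%s</span>'
                % (opacity, text))
        html = html.replace(text, span, 1)         # wrap first occurrence only
    return html

out = highlight("The Broncos practiced at Stanford.", [("Stanford", 8.0)])
```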
37
CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions
Having done a thorough analysis of the current methods and the state of the art in Machine Comprehension and the QA task, and having developed systems on top of them, we have gained a strong sense of what needs to be done to further improve QA models. After observing the wrongly answered samples, we can see that the system is still unable to encode meaning and instead picks answers based on statistical patterns of answer occurrence in the training examples.
Judging from the state-of-the-art models and the ongoing research literature, it is easy to conclude that more features need to be embedded to encode the meaning of words, phrases, and sentences. The paper Reinforced Mnemonic Reader for Machine Comprehension [6] encoded the POS and NER tags of words along with their word and character embeddings, which gave them better results.
Figure 5-1 An English language semantic parse tree [26]
38
We have developed a method to encode the syntactic parse tree of a sentence that encodes not only the POS tags but also the relation of each word within its phrase and the relation of each phrase within the whole sentence, in a hierarchical manner.
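The hierarchical idea can be illustrated with a toy tree walk. This is our own construction, not the thesis method: each word is tagged with the full chain of phrase labels above it, so the feature carries word-in-phrase and phrase-in-sentence relations rather than the POS tag alone.

```python
def hierarchical_tags(tree, path=()):
    """tree: nested (label, children...) tuples with words at the leaves.
    Returns [(word, path_of_labels_from_root)] in left-to-right order."""
    label, *children = tree
    out = []
    for child in children:
        if isinstance(child, tuple):
            out.extend(hierarchical_tags(child, path + (label,)))
        else:  # leaf word under a POS label
            out.append((child, path + (label,)))
    return out

# Constituency parse of "the dog barked", written as nested tuples
sent = ("S", ("NP", ("DT", "the"), ("NN", "dog")),
             ("VP", ("VBD", "barked")))
tags = hierarchical_tags(sent)
```

Here "dog" is encoded as the path (S, NP, NN) rather than the bare tag NN.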
Finally, data augmentation is another way to get better results. One definite way to reduce errors would be to include in the training data samples similar to those on which the system falters in the dev set: one could generate examples similar to the failure cases and include them in the training set for better prediction. Another approach would be to train on similar, bigger datasets. Our models were trained on the SQuAD dataset, but other datasets, such as TriviaQA, pose a similar question answering task. We could augment the SQuAD training set with that of TriviaQA to obtain a more robust system that generalizes better and thus predicts answer spans more accurately.
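The augmentation step itself is mechanically simple. The sketch below assumes both datasets are already in SQuAD's JSON layout ({"data": [{"title", "paragraphs": [{"context", "qas"}]}]}); TriviaQA ships in a different format and would first need converting into this layout.

```python
import json

def merge_squad_style(path_a, path_b, out_path):
    """Concatenate the article lists of two SQuAD-format training files."""
    with open(path_a) as f:
        a = json.load(f)
    with open(path_b) as f:
        b = json.load(f)
    merged = {"version": a.get("version", "1.1"),
              "data": a["data"] + b["data"]}
    with open(out_path, "w") as f:
        json.dump(merged, f)
    return len(merged["data"])   # number of articles in the merged set
```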
Conclusion
In this work we have explored the most fundamental techniques that have shaped the current state of the art. We then proposed a minor architectural improvement over an existing model. Furthermore, we developed two applications that use the base model: first, a chatbot application built on top of the QA system, and second, a web interface where the model can be applied to any passage and question and which also shows the attention spread over the candidate answers. While our effort to push the state of the art forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will pay high dividends for society at large.
39
LIST OF REFERENCES
[1] Chen, Danqi. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U
[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.
[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension." arXiv preprint arXiv:1705.03551 (2017).
[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ questions for machine comprehension of text." arXiv preprint arXiv:1606.05250 (2016).
[5] "The Stanford Question Answering Dataset." 2018. rajpurkar.github.io. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/
[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced mnemonic reader for machine comprehension." CoRR abs/1705.02798 (2017).
[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.
[8] Mehrotra, Neetesh. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai/
[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.
[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
[11] Williams, Ronald J., and David Zipser. "A learning algorithm for continually running fully recurrent neural networks." Neural Computation 1, no. 2 (1989): 270-280.
[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9, no. 8 (1997): 1735-1780.
[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv preprint arXiv:1412.3555 (2014).
40
[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic regularities in continuous space word representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.
[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.
[16] "Word2Vec Tutorial - The Skip-Gram Model." 2016. Chris McCormick, mccormickml.com. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
[17] "[Paper Introduction] Bilingual Word Representations with Monolingual Quality in Mind." 2015. SlideShare. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind
[18] "Attention Mechanism." 2016. blog.heuritech.com. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism/
[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. 2014. "Memory Networks." arXiv preprint arXiv:1410.3916. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916
[20] Wang, Shuohang, and Jing Jiang. "Machine comprehension using match-LSTM and answer pointer." arXiv preprint arXiv:1608.07905 (2016).
[21] Wang, Shuohang, and Jing Jiang. "Learning natural language inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).
[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.
[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated self-matching networks for reading comprehension and question answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.
[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional attention flow for machine comprehension." arXiv preprint arXiv:1611.01603 (2016).
[25] Clark, Christopher, and Matt Gardner. "Simple and effective multi-paragraph reading comprehension." arXiv preprint arXiv:1710.10723 (2017).
41
[26] "8. Analyzing Sentence Structure." 2018. nltk.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html
42
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata, India, and had a strong interest in computers from an early age. After high school, he earned a B.Sc. in computer science from Ramakrishna Mission Residential College, Narendrapur, followed by an M.Sc. in computer science from St. Xavier's College, Kolkata. He had a strong intuition for and interest in human-like learning systems and wanted to work in this area. He started working at TCS Innovation Labs, Pune, on applications of Natural Language Processing in education. Deeply passionate about learning systems that mimic the human brain and learn the way a human child does, he became increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science at the University of Florida, Gainesville. His academic interests have focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.
21
CHAPTER 3 LITERATURE REVIEW AND STATE OF THE ART
There has been rapid progress since the release of the SQuAD dataset, and currently the best ensemble models are close to human-level accuracy in machine comprehension. This is due to a number of ingenious methods that solve some of the problems of earlier approaches: out-of-vocabulary tokens were handled using character embeddings, long-term dependencies within the context passage were addressed with self-attention, and many other techniques, such as contextualized vectors, history of words, and attention flow, contributed further gains. In this chapter we will look at some of the most important models that were fundamental to the progress of Question Answering.
Machine Comprehension Using Match-LSTM and Answer Pointer
In this paper [20] the authors propose an end-to-end neural architecture for the QA task. The architecture is based on match-LSTM [21], a model they had previously proposed for textual entailment, and Pointer Net [22], a sequence-to-sequence model proposed by Vinyals et al. (2015) that constrains the output tokens to come from the input sequence. The model consists of an LSTM preprocessing layer, a match-LSTM layer, and an Answer Pointer layer.
We are given a piece of text, which we refer to as a passage, and a question related to the passage. The passage is represented by a matrix P of dimensions d x P, where P is the length (number of tokens) of the passage and d is the dimensionality of the word embeddings. Similarly, the question is represented by a matrix Q of dimensions d x Q, where Q is the length of the question. Our goal is to identify a subsequence of the passage as the answer to the question.
22
Figure 3-1 Match-LSTM Model Architecture [20]
LSTM Preprocessing Layer: They use a standard one-directional LSTM (Hochreiter & Schmidhuber, 1997) to process the passage and the question separately, obtaining hidden representations H^p = LSTM(P) and H^q = LSTM(Q).
Match-LSTM Layer: They applied the match-LSTM model, originally proposed for textual entailment, to the machine comprehension problem by treating the question as the premise and the passage as the hypothesis. The match-LSTM goes through the passage sequentially. At position i of the passage, it first uses the standard word-by-word attention mechanism to obtain an attention weight vector over the question:

G_i = tanh(W^q H^q + (W^p h_i^p + W^r h_{i-1}^r + b^p) ⊗ e_Q)
alpha_i = softmax(w^T G_i + b ⊗ e_Q) (3-1)

23
where W^q, W^p, W^r, b^p, w, and b are parameters to be learned, h_i^p is the preprocessed representation of passage token i, h_{i-1}^r is the match-LSTM hidden state at the previous position, and (· ⊗ e_Q) repeats the vector on its left across the Q question positions.
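As a numerical illustration of this word-by-word attention, the sketch below computes a single attention vector alpha_i with random placeholder weights (l and Q are arbitrary toy sizes, not trained values):

```python
import numpy as np

rng = np.random.default_rng(0)
l, Q = 4, 5                             # hidden size, question length
Hq  = rng.normal(size=(l, Q))           # question representation H^q
h_p = rng.normal(size=(l, 1))           # passage token state h_i^p
h_r = rng.normal(size=(l, 1))           # previous match-LSTM state h_{i-1}^r
Wq, Wp, Wr = (rng.normal(size=(l, l)) for _ in range(3))
bp, w = rng.normal(size=(l, 1)), rng.normal(size=(l, 1))
b = 0.5

# tanh(W^q H^q + (W^p h_p + W^r h_r + b^p)); the bias term broadcasts
# over the Q columns, playing the role of the outer product with e_Q
G = np.tanh(Wq @ Hq + (Wp @ h_p + Wr @ h_r + bp))
scores = (w.T @ G + b).ravel()
alpha = np.exp(scores) / np.exp(scores).sum()   # attention over question words
```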
Answer Pointer Layer: The Answer Pointer (Ans-Ptr) layer is motivated by the Pointer Net introduced by Vinyals et al. (2015) [22]. The boundary model produces only the start token and the end token of the answer, and all the tokens between these two in the original passage are then taken to be the answer.
When this paper was released in November 2016, the match-LSTM method was the state of the art in Question Answering systems and sat at the top of the leaderboard for the SQuAD dataset.
R-NET: Machine Reading Comprehension with Self-Matching Networks
In this model [23], the question and passage are first processed separately by a bidirectional recurrent network (Mikolov et al., 2010). The question and passage are then matched with gated attention-based recurrent networks, yielding a question-aware representation of the passage. On top of that, self-matching attention is applied to aggregate evidence from the whole passage and refine the passage representation, which is then fed into the output layer to predict the boundary of the answer span.
Question and Passage Encoding: First the words are converted to their respective word-level and character-level embeddings. The character-level embeddings are generated by taking the final hidden states of a bi-directional recurrent neural network (RNN) applied to the embeddings of the characters in each token. Such character-level embeddings have been shown to help in dealing with out-of-vocabulary (OOV) tokens.
24
They then use a bi-directional RNN to produce new representations u^Q_t and u^P_t of all words in the question and passage, respectively.
Figure 3-2 The task of Question Answering [23]
Gated Attention-Based Recurrent Networks: They use a variant of attention-based recurrent networks with an additional gate to determine the importance of passage information with regard to a question. Different from the gates in an LSTM or GRU, this additional gate is based on the current passage word and its attention-pooling vector of the question, so it focuses on the relation between the question and the current passage word. The gate effectively models the phenomenon that, in reading comprehension and question answering, only parts of the passage are relevant to the question, and the gated representation is what is utilized in subsequent calculations.
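The gate can be sketched numerically: the RNN input, the concatenation [u_t; c_t] of the current passage word and its question attention-pooling vector, is scaled element-wise by a sigmoid gate computed from itself. All values here are random placeholders for a learned matrix W_g, shown only to make the shapes concrete:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
d = 6
u_t = rng.normal(size=d)            # current passage word representation
c_t = rng.normal(size=d)            # attention-pooled question vector
x   = np.concatenate([u_t, c_t])    # [u_t; c_t]
W_g = rng.normal(size=(2 * d, 2 * d))

g = sigmoid(W_g @ x)                # gate values in (0, 1)
x_gated = g * x                     # damped input fed to the recurrent network
```

Because every gate value lies in (0, 1), irrelevant components of the input can only be attenuated, never amplified.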
25
Self-Matching Attention: The previous step generates a question-aware passage representation that highlights the important parts of the passage. One problem with such a representation is that it has very limited knowledge of context: an answer candidate is often oblivious to important cues in the passage outside its surrounding window. To address this problem, the authors propose directly matching the question-aware passage representation against itself. This dynamically collects evidence from the whole passage for each passage word and encodes the evidence relevant to the current passage word, together with its matching question information, into the passage representation.
Output Layer: They use the same method as Wang & Jiang (2016b) [20] and apply pointer networks (Vinyals et al., 2015) [22] to predict the start and end positions of the answer. In addition, they use attention-pooling over the question representation to generate the initial hidden vector for the pointer network [23].
When the R-NET model first appeared on the leaderboard in March 2017, it was at the top with a 72.3 Exact Match and an 80.7 F1 score.
Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
BiDAF [24] is a hierarchical multi-stage architecture for modeling representations of the context paragraph at different levels of granularity. BiDAF includes character-level, word-level, and contextual embeddings, and uses bi-directional attention flow to obtain a query-aware context representation. Its attention layer is not used to summarize the context paragraph into a fixed-size vector. Instead, the attention is computed for every time step, and the attended vector at each time step, along with the representations from previous layers, can flow through to the subsequent modeling layer. This reduces the information loss caused by early summarization [24].
26
Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]
Their machine comprehension model is a hierarchical multi-stage process and consists of six layers:
1. Character Embedding Layer: maps each word to a vector space using character-level CNNs.
2. Word Embedding Layer: maps each word to a vector space using a pre-trained word embedding model.
3. Contextual Embedding Layer: utilizes contextual cues from surrounding words to refine the embeddings of the words. These first three layers are applied to both the query and the context. They use an LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words, placing an LSTM in each direction and concatenating the outputs of the two LSTMs.
4. Attention Flow Layer: couples the query and context vectors and produces a set of query-aware feature vectors for each word in the context.
5. Modeling Layer: employs a Recurrent Neural Network to scan the context. The input to the modeling layer is G, which encodes the query-aware representations of the context words. The output of the modeling layer captures the interaction among the context words conditioned on the query. They use two layers of bi-directional LSTM
27
with an output size of d for each direction; hence a matrix M of dimensions 2d x T is obtained, which is passed on to the output layer to predict the answer.
6. Output Layer: provides an answer to the query. They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices given by the predicted distributions, averaged over all examples [25].
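The attention flow layer at the heart of this architecture can be sketched at the shape level. Following Seo et al., a similarity matrix S couples the context H (d x T) and query U (d x J) via S_tj = w · [h_t; u_j; h_t ⊙ u_j], and then yields context-to-query and query-to-context summaries; the weights below are random placeholders, and the loops favor clarity over speed:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
d, T, J = 4, 6, 3
H = rng.normal(size=(d, T))       # contextual context embeddings
U = rng.normal(size=(d, J))       # contextual query embeddings
w = rng.normal(size=3 * d)        # trainable similarity weights

S = np.empty((T, J))              # S[t, j] = w . [h_t; u_j; h_t * u_j]
for t in range(T):
    for j in range(J):
        S[t, j] = w @ np.concatenate([H[:, t], U[:, j], H[:, t] * U[:, j]])

U_tilde = U @ softmax(S, axis=1).T        # context-to-query vectors, (d, T)
b = softmax(S.max(axis=1))                # query-to-context weights over T
h_tilde = H @ b                           # single query-to-context vector, (d,)
```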
In a further variation of the above work, they add a self-attention layer after the bi-attention layer to further improve the results. The architecture of the model is shown in Figure 3-4.
Figure 3-4 The task of Question Answering [25]
28
Summary
In this chapter we reviewed the methods that are fundamental to the state of the art in Machine Comprehension and the task of Question Answering. We have come close to human-level accuracy, and this is due to incremental developments over previous models. As we saw, out-of-vocabulary (OOV) tokens were handled using character embeddings, long-term dependencies within the context passage were addressed with self-attention, and many other techniques, such as contextualized vectors and attention flow, were employed to get better results. In the next chapter we will see how we can build on these models and develop them further.
29
CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING
Although the advancements in Question Answering systems since the release of the SQuAD dataset have been impressive, and the results are getting close to human-level accuracy, these systems are far from fool-proof. The models still make mistakes that would be obvious to a human. For example:
Passage: The Panthers used the San Jose State practice facility and stayed at the San Jose Marriott. The Broncos practiced at Florida State Facility and stayed at the Santa Clara Marriott.
Question: At what university's facility did the Panthers practice?
Actual Answer: San Jose State
Predicted Answer: Florida State Facility
To find out what leads to such wrong predictions, we wanted to see the attention weights associated with this example. We plotted the passage-question heat map, a 2D matrix in which the intensity of each cell signifies the similarity between a passage word and a question word. For the above example we found that while certain words of the question are given high weight, other parts are not: the words 'At', 'facility', and 'practice' receive high attention, but 'Panthers' does not. If it had received high attention, the system would have predicted 'San Jose State' as the right answer. To solve this issue we analyzed the base BiDAF model and proposed adding two things:
1. Bi-Attention and Self-Attention over the Query.
2. Second-level attention over the outputs of (Bi-Attention + Self-Attention) from both the Context and the Query.
30
Multi-Attention BiDAF Model
Our comprehension model is a hierarchical multi-stage process and consists of the following layers:
1. Embedding: As in all other models, we embed words using pretrained word vectors. We also embed the characters of each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to obtain character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training.
2. Pre-Process: A shared bi-directional GRU (Cho et al., 2014) is used to map the question and passage embeddings to context-aware embeddings.
3. Context Attention: The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al., 2016) is used to build a query-aware context representation. Let h_i be the vector for context word i, q_j be the vector for question word j, and n_q and n_c be the lengths of the question and context, respectively. We compute attention between context word i and question word j as

a_ij = w1 · h_i + w2 · q_j + w3 · (h_i ⊙ q_j) (4-1)

where w1, w2, and w3 are learned vectors and ⊙ is element-wise multiplication. We then compute an attended vector c_i for each context token as

p_ij = exp(a_ij) / Σ_{j=1..n_q} exp(a_ij),  c_i = Σ_{j=1..n_q} p_ij q_j (4-2)

We also compute a query-to-context vector q_c:

m_i = max_{1<=j<=n_q} a_ij,  p_i = exp(m_i) / Σ_{i=1..n_c} exp(m_i),  q_c = Σ_{i=1..n_c} p_i h_i (4-3)

31
The final vector computed for each token is built by concatenating h_i, c_i, h_i ⊙ c_i, and q_c ⊙ h_i. In our model we subsequently pass the result through a linear layer with ReLU activations.
4. Context Self-Attention: Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU. Then we apply the same attention mechanism, only now between the passage and itself; in this case we do not use query-to-context attention, and we set a_ij = -inf if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so its output is additionally summed with its input.
5. Query Attention: This part works the same way as the context attention layer, but we calculate the weighted sum of the context words for each query word, so the output length is the number of query words. We then calculate context-to-query attention, analogous to the query-to-context attention of the context attention layer.
Figure 4-1 The modified BiDAF model with multilevel attention
6. Query Self-Attention: This part works the same way as the context self-attention layer, but on the output of the Query Attention layer.
32
7. Context-Query Bi-Attention + Self-Attention: The outputs of the Context Self-Attention and Query Self-Attention layers are taken as input, and the same bi-attention and self-attention process is applied to these inputs.
8. Prediction: In the last layer of our model a bidirectional GRU is applied, followed by a linear layer that computes answer start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log likelihood of selecting the correct start and end tokens.
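The context attention computation of Equations 4-1 to 4-3 can be sketched with random placeholder vectors (shapes only; w1, w2, w3 stand in for the learned weights):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(3)
d, n_c, n_q = 4, 5, 3
h = rng.normal(size=(n_c, d))       # context word vectors h_i
q = rng.normal(size=(n_q, d))       # question word vectors q_j
w1, w2, w3 = rng.normal(size=(3, d))

# a_ij = w1.h_i + w2.q_j + w3.(h_i * q_j), computed for all (i, j) at once
a = (h @ w1)[:, None] + (q @ w2)[None, :] + (h * w3) @ q.T   # (n_c, n_q)

c = softmax(a, axis=1) @ q          # attended vector c_i per context token
p = softmax(a.max(axis=1))          # weights over context rows
q_c = p @ h                         # query-to-context vector
```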
Having carried out this modification, we were able to solve the wrong example we started with: the multilevel attention model gives the correct output, "San Jose State". We also achieved slightly better scores than the original model, with an F1 score of 85.44 on the SQuAD dev set.
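The F1 number quoted above is SQuAD's token-overlap metric; a minimal sketch of it (omitting SQuAD's answer normalization, such as article and punctuation stripping) is:

```python
from collections import Counter

def span_f1(prediction, ground_truth):
    """Bag-of-tokens F1 between a predicted and a gold answer span."""
    pred = prediction.lower().split()
    gold = ground_truth.lower().split()
    common = Counter(pred) & Counter(gold)     # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For the running example, the wrong prediction "Florida State Facility" still earns partial credit against "San Jose State" through the shared token "state".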
Chatbot Design Using a QA System
Designing a perfect chatbot that passes the Turing test is a fundamental goal of Artificial Intelligence. Although we are many orders of magnitude away from achieving such a goal, domain-specific tasks can be solved with chatbots built from current technology. We started with a similar objective in mind: to design a domain-specific chatbot, and then generalize to other areas once it achieves the first domain-specific objective robustly. This led us to the fundamental problem of Machine Comprehension and subsequently to the task of Question Answering. Having achieved some degree of success with QA systems, we looked back at whether we could apply our newly acquired knowledge to the task of designing chatbots.
The chatbots made with today's technologies mostly rely on handcrafted techniques such as template matching, which requires anticipating all possible ways a user may articulate his requirements and a conversation may unfold. This takes a lot of man-hours for designing a domain-specific system, and the result is still very error prone. In this section
33
we propose a general chatbot design that makes building a domain-specific chatbot easy and robust at the same time.
Every domain-specific chatbot needs to obtain a set of information from the user and show some results based on the user-specific information obtained. Traditional chatbots use template matching and keyword lookup to determine whether the user has provided the required information. Our idea is to use the Question Answering system in the backend to extract the required information from whatever the user has typed up to this point in the conversation. The information to be extracted can be posed as a set of questions, and the answers obtained for those questions can be used as the parameters needed to supply the relevant information to the user.
We had chosen the flight reservation system as our chatbot domain; our goal was to extract the required information from the user in order to show the available flights matching the user's requirements.
[16] Word2vec Tutorial - The Skip-Gram Model middot Chris Mccormick 2016 MccormickmlCom Accessed March 16 2018 httpmccormickmlcom20160419word2vec-tutorial-the-skip-gram-model
[17] Group 2015 [Paper Introduction] Bilingual Word Representations With Monolingual hellip SlideshareNet Accessed March 16 2018 httpswwwslidesharenetnaist-mtstudybilingual-word-representations-with-monolingual-quality-in-mind
[18] Attention Mechanism 2016 BlogHeuritechCom Accessed March 16 2018 httpsblogheuritechcom20160120attention-mechanism
[19] Weston Jason Sumit Chopra and Antoine Bordes 2014 Memory Networks arXiv preprint arXiv14103916v11 Accessed March 16 2018 httpsarxivorgabs14103916
[20] Wang Shuohang and Jing Jiang Machine comprehension using match-lstm and answer pointer arXiv preprint arXiv160807905 (2016)
[21] Wang Shuohang and Jing Jiang Learning natural language inference with LSTM arXiv preprint arXiv151208849 (2015)
[22] Vinyals Oriol Meire Fortunato and Navdeep Jaitly Pointer networks In Advances in Neural Information Processing Systems pp 2692-2700 2015
[23] Wang Wenhui Nan Yang Furu Wei Baobao Chang and Ming Zhou Gated self-matching networks for reading comprehension and question answering In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1 Long Papers) vol 1 pp 189-198 2017
[24] Seo Minjoon Aniruddha Kembhavi Ali Farhadi and Hannaneh Hajishirzi Bidirectional attention flow for machine comprehension arXiv preprint arXiv161101603 (2016)
[25] Clark Christopher and Matt Gardner Simple and effective multi-paragraph reading comprehension arXiv preprint arXiv171010723 (2017)
41
[26] 8 Analyzing Sentence Structure 2018 NltkOrg Accessed March 16 2018 httpwwwnltkorgbookch08html
42
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata India and had a strong interest for
computers since an early age After his high school he did his BSc in computer
science from Ramakrishna Mission Residential College Narendrapur followed by MSc
in computer science from St Xavierrsquos College Kolkata He had a strong intuition and
interest for human like learning systems and wanted to work in this area He started
working at TCS Innovation Labs Pune for the application of Natural Language
Processing in Educational Applications As he was deeply passionate about learning
system that mimic the human brain and learn like a human child does he was
increasing interested about Deep Learning and its applications After working for a year
he went on to pursue a Master of Science degree in computer science from the
University of Florida Gainesville His academic interests have been focused on Deep
Learning and Natural Language Processing and he has been working on Machine
Reading Comprehension since summer of 2017
- ACKNOWLEDGMENTS
- LIST OF FIGURES
- LIST OF ABBREVIATIONS
- INTRODUCTION
- THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
-
- Neural Networks
- Convolutional Neural Network
- Recurrent Neural Networks (RNN)
- Word Embedding
- Attention Mechanism
- Memory Networks
-
- LITERATURE REVIEW AND STATE OF THE ART
-
- Machine Comprehension Using Match-LSTM and Answer Pointer
- R-NET Matching Reading Comprehension with Self-Matching Networks
- Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
- Summary
-
- MULTI-ATTENTION QUESTION ANSWERING
-
- Multi-Attention BiDAF Model
- Chatbot Design Using a QA System
- Online QA System and Attention Visualization
-
- FUTURE DIRECTIONS AND CONCLUSION
-
- Future Directions
- Conclusion
-
- LIST OF REFERENCES
- BIOGRAPHICAL SKETCH
-
Figure 3-1 Match-LSTM Model Architecture [20]
LSTM preprocessing layer. They use a standard one-directional LSTM (Hochreiter & Schmidhuber, 1997) to process the passage and the question separately.
Match-LSTM layer. They applied the match-LSTM model, originally proposed for textual entailment, to the machine comprehension problem by treating the question as a premise and the passage as a hypothesis. The match-LSTM goes through the passage sequentially; at position i of the passage, it first uses the standard word-by-word attention mechanism to obtain an attention weight vector over the question (Equation 3-1), where the projection matrices and bias terms are parameters to be learned.
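The body of Equation 3-1 did not survive transcription; as given in the cited match-LSTM paper [20], the word-by-word attention at passage position i has the form (notation follows the paper, not the thesis, so symbols may differ slightly):

```latex
\begin{aligned}
\vec{G}_i &= \tanh\big(W^{q} H^{q} + (W^{p} h_i^{p} + W^{r} h_{i-1}^{r}) \otimes e_Q\big) \\
\vec{\alpha}_i &= \operatorname{softmax}\big(w^{\top} \vec{G}_i + b \otimes e_Q\big)
\end{aligned}
```

Here H^q is the question representation, h_i^p the passage hidden state, h_{i-1}^r the previous match-LSTM state, and the outer product with e_Q broadcasts a vector across all question positions.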
Answer pointer layer. The Answer Pointer (Ans-Ptr) layer is motivated by the Pointer Net introduced by Vinyals et al. (2015) [22]. The boundary model produces only the start token and the end token of the answer; all the tokens between these two in the original passage are then considered to be the answer.

When this paper was released in November 2016, the match-LSTM method was the state of the art in question answering systems and was at the top of the leaderboard for the SQuAD dataset.
R-NET Matching Reading Comprehension with Self-Matching Networks
In this model [23], the question and passage are first processed separately by a bidirectional recurrent network (Mikolov et al., 2010). The question and passage are then matched with gated attention-based recurrent networks, obtaining a question-aware representation of the passage. On top of that, self-matching attention is applied to aggregate evidence from the whole passage and refine the passage representation, which is then fed into the output layer to predict the boundary of the answer span.
Question and passage encoding. First, the words are converted to their respective word-level and character-level embeddings. The character-level embeddings are generated by taking the final hidden states of a bidirectional recurrent neural network (RNN) applied to the embeddings of the characters in each token. Such character-level embeddings have been shown to help deal with out-of-vocabulary (OOV) tokens.
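The character-level encoding described above can be sketched as follows. This is a minimal illustration with random weights and a vanilla tanh RNN standing in for the trained recurrent cell; the alphabet, dimensions, and weight values are placeholders, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 26-letter alphabet, 8-dim char embeddings, 16-dim RNN state.
CHAR_EMB, HIDDEN = 8, 16
char_emb = rng.normal(0.0, 0.1, (26, CHAR_EMB))   # character embedding table
Wx = rng.normal(0.0, 0.1, (HIDDEN, CHAR_EMB))     # input-to-hidden weights
Wh = rng.normal(0.0, 0.1, (HIDDEN, HIDDEN))       # hidden-to-hidden weights

def rnn_final_state(embs):
    """Run a vanilla tanh RNN over a sequence of char embeddings; return final state."""
    h = np.zeros(HIDDEN)
    for e in embs:
        h = np.tanh(Wx @ e + Wh @ h)
    return h

def char_level_embedding(token):
    """Bidirectional pass over the token's characters; concatenate final states."""
    embs = [char_emb[ord(c) - ord('a')] for c in token.lower() if c.isalpha()]
    fwd = rnn_final_state(embs)          # left-to-right final hidden state
    bwd = rnn_final_state(embs[::-1])    # right-to-left final hidden state
    return np.concatenate([fwd, bwd])    # 2*HIDDEN-dim token vector

vec = char_level_embedding("unforeseeable")   # works even for OOV words
print(vec.shape)  # (32,)
```

Because the vector is built from characters, any token, including one never seen at training time, still receives a representation.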
They then use a bidirectional RNN to produce new contextual representations of all the words in the question and the passage, respectively.
Figure 3-2. R-NET model architecture [23]
Gated attention-based recurrent networks. They use a variant of attention-based recurrent networks with an additional gate to determine the importance of passage information with respect to a question. Unlike the gates in an LSTM or GRU, this additional gate is based on the current passage word and its attention-pooling vector over the question, so it focuses on the relation between the question and the current passage word. The gate effectively models the phenomenon that only parts of the passage are relevant to the question in reading comprehension and question answering, and the gated representation is used in subsequent calculations.
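The extra gate can be sketched as below: a sigmoid gate computed from the concatenation of the current passage word vector and its attention-pooled question vector, applied element-wise to that same concatenation before it enters the recurrent cell. The dimensions and weights are illustrative placeholders, not the trained model's.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical dimensions: passage word vector and its attention-pooled
# question vector are both 4-dim, so the gate acts on their 8-dim concatenation.
D = 4
Wg = rng.normal(0.0, 0.5, (2 * D, 2 * D))   # learned gate weights (random here)

u_p = rng.normal(size=D)   # current passage word representation
c_t = rng.normal(size=D)   # attention-pooled question vector for this word

x = np.concatenate([u_p, c_t])   # [u_t^P ; c_t]
g = sigmoid(Wg @ x)              # element-wise gate, each entry in (0, 1)
gated = g * x                    # scaled input fed to the recurrent network

print(gated.shape)  # (8,)
```

Entries of the gate near 0 suppress passage positions that are irrelevant to the question, which is exactly the selective behavior the paragraph describes.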
Self-matching attention. The question-aware passage representation generated in the previous step highlights the important parts of the passage. One problem with such a representation is that it has very limited knowledge of context: an answer candidate is often oblivious to important cues in the passage outside its surrounding window. To address this problem, the authors propose directly matching the question-aware passage representation against itself. This dynamically collects evidence from the whole passage for each passage word and encodes the evidence relevant to the current passage word, together with its matching question information, into the passage representation.
Output layer. They use the same method as Wang & Jiang (2016b), employing pointer networks (Vinyals et al., 2015) to predict the start and end positions of the answer. In addition, they use attention pooling over the question representation to generate the initial hidden vector for the pointer network [23].

When the R-NET model first appeared on the leaderboard in March 2017, it was at the top with a 72.3 Exact Match and 80.7 F1 score.
Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
BiDAF [24] is a hierarchical multi-stage architecture for modeling the representations of the context paragraph at different levels of granularity. BiDAF includes character-level, word-level, and contextual embeddings, and uses bi-directional attention flow to obtain a query-aware context representation. Their attention layer is not used to summarize the context paragraph into a fixed-size vector. Instead, the attention is computed for every time step, and the attended vector at each time step, along with the representations from previous layers, can flow through to the subsequent modeling layer. This reduces the information loss caused by early summarization [24].
Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]
Their machine comprehension model is a hierarchical multi-stage process and consists of six layers:

1. Character embedding layer: maps each word to a vector space using character-level CNNs.

2. Word embedding layer: maps each word to a vector space using a pre-trained word embedding model.

3. Contextual embedding layer: utilizes contextual cues from surrounding words to refine the embeddings of the words. These first three layers are applied to both the query and the context. They use an LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words, placing an LSTM in both directions and concatenating the outputs of the two LSTMs.

4. Attention flow layer: couples the query and context vectors and produces a set of query-aware feature vectors for each word in the context.

5. Modeling layer: employs a recurrent neural network to scan the context. The input to the modeling layer is G, which encodes the query-aware representations of the context words. The output of the modeling layer captures the interaction among the context words conditioned on the query. They use two layers of bi-directional LSTM with an output size of d for each direction; hence a matrix is obtained, which is passed on to the output layer to predict the answer.

6. Output layer: provides an answer to the query. They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices given by the predicted distributions, averaged over all examples [25].
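The attention flow layer (item 4) can be sketched with BiDAF's trilinear similarity function. The sizes and weights below are random toy placeholders; a real model learns w and produces H and U with the embedding layers above.

```python
import numpy as np

rng = np.random.default_rng(2)
T, J, D = 5, 3, 4            # context length, query length, hidden size (toy values)
H = rng.normal(size=(T, D))  # contextual embeddings of the context words
U = rng.normal(size=(J, D))  # contextual embeddings of the query words
w = rng.normal(size=3 * D)   # learned trilinear weights (random here)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Similarity S[t, j] = w . [h; u; h*u] for every context/query word pair.
S = np.empty((T, J))
for t in range(T):
    for j in range(J):
        S[t, j] = w @ np.concatenate([H[t], U[j], H[t] * U[j]])

# Context-to-query attention: an attended query vector for each context word.
U_tilde = softmax(S, axis=1) @ U          # shape (T, D)

# Query-to-context attention: attend over context via max similarity per row.
b = softmax(S.max(axis=1))                # shape (T,)
h_tilde = np.tile(b @ H, (T, 1))          # shape (T, D), same vector in each row

# G feeds the modeling layer: [h; u~; h*u~; h*h~] for each context word.
G = np.concatenate([H, U_tilde, H * U_tilde, H * h_tilde], axis=1)
print(G.shape)  # (5, 16)
```

Note that G keeps one attended vector per time step rather than a single summary vector, which is the "no early summarization" property the text emphasizes.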
In a further variation of the above work, they add a self-attention layer after the bi-attention layer to further improve the results. The architecture of this model is shown in Figure 3-4.

Figure 3-4. Model architecture with the additional self-attention layer [25]
Summary
In this chapter we reviewed the methods that are fundamental to the state of the art in machine comprehension and the task of question answering. We have come close to human-level accuracy, and this is due to incremental developments over previous models. As we saw, out-of-vocabulary (OOV) tokens were handled using character embeddings, long-term dependencies within the context passage were addressed using self-attention, and many other techniques, such as contextualized vectors and attention flow, were employed to get better results. In the next chapter we will see how we can build on these models and develop further.
CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING
Although the advancements in question answering systems since the release of the SQuAD dataset have been impressive, and the results are getting close to human-level accuracy, it is far from being a fool-proof system. The models still make mistakes that would be obvious to a human. For example:

Passage: The Panthers used the San Jose State practice facility and stayed at the San Jose Marriott. The Broncos practiced at Florida State Facility and stayed at the Santa Clara Marriott.

Question: At what university's facility did the Panthers practice?

Actual answer: San Jose State

Predicted answer: Florida State Facility
To find out what leads to the wrong predictions, we wanted to see the attention weights associated with such an example. We plotted the passage-question heat map, a 2D matrix where the intensity of each cell signifies the similarity between a passage word and a question word. For the above example, we found that while certain words of the question are given high weight, other parts are not. The words 'At', 'facility', and 'practice' receive high attention, but 'Panthers' does not. If it had received high attention, the system would have predicted 'San Jose State' as the right answer. To solve this issue, we analyzed the base BiDAF model and proposed adding two things:

1. Bi-attention and self-attention over the query

2. A second level of attention over the outputs of (bi-attention + self-attention) from both the context and the query
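The heat-map diagnosis described above can be sketched as follows. The attention matrix here is a random stand-in; in a real run it would be read out of the model's attention layer, and the words are the ones from the failing example.

```python
import numpy as np

question = ["At", "what", "facility", "did", "Panthers", "practice"]
passage = ["Panthers", "used", "San", "Jose", "State", "facility"]

# Stand-in attention matrix: rows = passage words, columns = question words.
rng = np.random.default_rng(3)
att = rng.random((len(passage), len(question)))
att /= att.sum(axis=0, keepdims=True)   # normalize each question word's column

# For each question word, report the passage word it attends to most strongly.
# A question word whose strongest cell is still weak (e.g. 'Panthers' in the
# failing example) is the one being under-attended.
for j, q_word in enumerate(question):
    i = int(att[:, j].argmax())
    print(f"{q_word!r} -> {passage[i]!r} ({att[i, j]:.2f})")
```

Rendering `att` as an image (one cell per word pair) gives exactly the heat map described in the text.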
Multi-Attention BiDAF Model
Our comprehension model is a hierarchical multi-stage process and consists of the following layers:

1. Embedding: Just as in other models, we embed words using pretrained word vectors. We also embed the characters of each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to get character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training.

2. Pre-process: A shared bi-directional GRU (Cho et al., 2014) is used to map the question and passage embeddings to context-aware embeddings.
3. Context attention: The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al., 2016) is used to build a query-aware context representation. Let h_i be the vector for context word i, q_j be the vector for question word j, and n_q and n_c be the lengths of the question and context respectively. We compute attention between context word i and question word j as

a_ij = w1 · h_i + w2 · q_j + w3 · (h_i ⊙ q_j)    (4-1)

where w1, w2, and w3 are learned vectors and ⊙ is element-wise multiplication. We then compute an attended vector c_i for each context token as

c_i = Σ_j softmax_j(a_ij) · q_j    (4-2)

We also compute a query-to-context vector q_c:

q_c = Σ_i softmax_i(max_j a_ij) · h_i    (4-3)

The final vector computed for each token is built by concatenating h_i, c_i, h_i ⊙ c_i, and q_c ⊙ c_i. In our model we subsequently pass the result through a linear layer with ReLU activations.
4. Context self-attention: Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU. Then we apply the same attention mechanism, only now between the passage and itself. In this case we do not use query-to-context attention, and we set a_ij = -inf if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so its output is additionally summed with the input.

5. Query attention: Here we proceed the same way as in context attention, but calculate the weighted sum of the context words for each query word, so the output length equals the number of query words. We then calculate context-to-query attention analogously to the query-to-context attention of the context attention layer.
Figure 4-1 The modified BiDAF model with multilevel attention
6. Query self-attention: This is done the same way as the context self-attention layer, but on the output of the query attention layer.

7. Context-query bi-attention + self-attention: The outputs of the context self-attention and the query self-attention layers are taken as input, and the same bi-attention and self-attention process is applied to these inputs.

8. Prediction: In the last layer of our model, a bidirectional GRU is applied, followed by a linear layer that computes answer start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log likelihood of selecting the correct start and end tokens.
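The prediction step (item 8) reduces to a pair of softmaxes and a negative log likelihood. In this sketch the per-token scores are stand-in numbers; in the model they come from the GRU-plus-linear layers described above.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Stand-in per-token scores for a 5-token passage.
start_scores = np.array([0.2, 3.1, 0.4, -1.0, 0.5])
end_scores   = np.array([-0.5, 0.1, 2.8, 0.3, 0.0])

p_start = softmax(start_scores)   # start probabilities over tokens
p_end   = softmax(end_scores)     # end probabilities over tokens

# Predicted span: most probable start, then most probable end at or after it.
start = int(p_start.argmax())
end   = start + int(p_end[start:].argmax())
print("predicted span:", (start, end))

# Training loss for a labeled example: NLL of the true start and end tokens.
true_start, true_end = 1, 2
loss = -np.log(p_start[true_start]) - np.log(p_end[true_end])
print("NLL loss:", round(float(loss), 3))
```

Everything between the predicted start and end indices in the passage is then returned as the answer span.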
Having carried out this modification, we were able to solve the wrong example we started with: the multilevel attention model gives the correct output, "San Jose State". We also achieved slightly better scores than the original model, with an F1 score of 85.44 on the SQuAD dev dataset.
Chatbot Design Using a QA System
Designing a perfect chatbot that passes the Turing test is a fundamental goal of Artificial Intelligence. Although we are many orders of magnitude away from achieving such a goal, domain-specific tasks can be solved with chatbots made from current technology. We started with a similar objective in mind, i.e., to design a domain-specific chatbot and then generalize to other areas once it achieves the first domain-specific objective robustly. This led us to the fundamental problem of machine comprehension and subsequently to the task of question answering. Having achieved some degree of success with QA systems, we looked back at whether we could apply our newly acquired knowledge to the task of designing chatbots.
The chatbots made with today's technologies mostly use handcrafted techniques, such as template matching, that require anticipating every possible way a user may articulate his requirements and a conversation may unfold. This requires many man-hours for designing a domain-specific system and is still very error prone. In this section we propose a general chatbot design that would make designing a domain-specific chatbot easy and robust at the same time.
Every domain-specific chatbot needs to obtain a set of information from the user and show some results based on the user-specific information obtained. Traditional chatbots use template matching and keyword lookup to determine whether the user has provided the required information. Our idea is to use the question answering system in the backend to extract the required information from whatever the user has typed up to that point of the conversation. The information to be extracted can be posed as a set of questions, and the answers obtained from those questions can be used as the parameters to supply the relevant information to the user.

We chose the flight reservation system as our chatbot domain. Our goal was to extract the required information from the user to be able to show the available flights as per the user's requirements.
Figure 4-2 Flight reservation chatbotrsquos chat window
For a flight reservation task, the booking agent needs to know at minimum the origin city, destination city, and date of travel to be able to show the available flights. Optional information includes the number of tickets, the passenger's name, one-way or round trip, etc.

The minimal conversation with the user through the chat window would be as shown above. We had a platform called OneTask on which we wanted to implement our chat bot; the chat interface within the OneTask system looks as follows.
Figure 4-3 Chatbot within OneTask system
The working of the chat system is as follows:

1. Initiation: The user opens the chat window, which starts a session with the chat bot. The chat bot asks, "How may I help you?"

2. User reply: The user may reply with none of the required information for flight booking or may reply with multiple pieces of information in the same message.

3. User reply parsing: The conversation up to this point is treated as a passage, and the internal questions are run on this passage:

Where do you want to go?

From where do you want to leave?

When do you want to depart?

4. Parsed responses from the QA model: After running the questions on the QA system with the given conversation, answers are obtained for all of them, along with their corresponding confidence values. Since answers will be obtained even if the required question has not been answered up to this point in the conversation, the confidence values play a crucial role in determining the validity of an answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly; a confidence of 2-10 signifies that it may have been answered but should be verified with the user for correctness; and any answer with confidence below 2 is discarded.

5. Asking remaining questions iteratively: After the parsing, it is checked whether any of the required questions are still unanswered. If so, the chatbot asks the remaining question, and steps 3-4 are carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request.
Figure 4-4 The Flow diagram of the Flight booking Chatbot system
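The confidence-threshold logic of steps 3-5 can be sketched as follows. The `ask_qa` function is a stub standing in for the real QA model (which would return an answer span and its confidence for the conversation so far), and the thresholds 2 and 10 are the ranges chosen above.

```python
# Slot-filling loop driven by a QA model over the conversation-so-far.
REQUIRED = {
    "destination": "Where do you want to go?",
    "origin": "From where do you want to leave?",
    "date": "When do you want to depart?",
}

def ask_qa(question, conversation):
    # Stub: pretend the model extracted these (answer, confidence) pairs.
    fake = {
        "Where do you want to go?": ("Orlando", 14.2),       # confident
        "From where do you want to leave?": ("Miami", 5.1),  # verify with user
        "When do you want to depart?": ("next week", 0.7),   # too low, discard
    }
    return fake[question]

def parse_slots(conversation):
    """Classify each required slot as answered, to-verify, or still missing."""
    answered, verify, missing = {}, {}, []
    for slot, question in REQUIRED.items():
        answer, conf = ask_qa(question, conversation)
        if conf > 10:          # high confidence: accept the answer
            answered[slot] = answer
        elif conf >= 2:        # medium confidence: confirm with the user
            verify[slot] = answer
        else:                  # low confidence: treat the slot as unanswered
            missing.append(slot)
    return answered, verify, missing

answered, verify, missing = parse_slots("user: I want to fly to Orlando from Miami")
print(answered)  # {'destination': 'Orlando'}
print(verify)    # {'origin': 'Miami'}
print(missing)   # ['date']
```

The chatbot then asks only the questions in `missing` (and confirms those in `verify`), re-running the parse after each user turn until every slot is filled.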
Online QA System and Attention Visualization
To be able to test various examples, we deployed the BiDAF [24] model as an online demo. One can either choose from the available examples in the drop-down menu or paste one's own passage and questions. While this is a useful and interesting system for testing the model in a user-friendly way, we created it primarily to be able to focus on the wrong samples.
Figure 4-5 QA system interface with attention highlight over candidate answers
The candidate answers are shown with blue highlights according to their confidence values: the higher the confidence, the darker the answer. The answer with the highest confidence value is chosen as the predicted answer. We developed the system to show the attention spread of the candidate answers in order to understand what needed to be done to improve the system. This led us to realize the importance of including the query attention part, as well as multilevel attention, in the BiDAF model, as described in the first section of this chapter.
CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions
Having done a thorough analysis of the current methods and the state of the art in machine comprehension and the QA task, and having developed systems on them, we have achieved a strong sense of what needs to be done to further improve QA models. After observing the wrong samples, we can see that the system is still unable to encode meaning and is picking answers based on statistical patterns of occurrence of answers in the training examples.

Going by the state-of-the-art models and the ongoing research literature, it is easy to conclude that more features need to be embedded to encode the meaning of words, phrases, and sentences. A paper called "Reinforced Mnemonic Reader for Machine Comprehension" [6] encoded POS and NER tags of words along with their word and character embeddings, which gave them better results.
Figure 5-1 An English language semantic parse tree [26]
We have developed a method to encode the syntax parse tree of a sentence that encodes not only the POS tags but also the relation of each word within its phrase and the relation of each phrase within the whole sentence, in a hierarchical manner.

Finally, data augmentation is another way to get better results. One definite way to reduce the errors would be to include in the training data samples similar to those the system is faltering on in the dev set. One could generate examples similar to the failure cases and include them in the training set to obtain better predictions. Another approach would be to train on similar and bigger datasets. Our models were trained on the SQuAD dataset, but there are other datasets, such as TriviaQA, that pose a similar question answering task. We could augment the training set with TriviaQA along with SQuAD to build a more robust system that generalizes better and thus has higher accuracy in predicting answer spans.
Conclusion
In this work we have explored the most fundamental techniques that have shaped the current state of the art. We then proposed a minor architectural improvement over an existing model. Furthermore, we developed two applications that use the base model: first, we discussed how a chatbot application can be built using the QA system; and second, we created a web interface where the model can be used with any passage and question. This interface also shows the attention spread over the candidate answers. While our effort to push the state of the art forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will pay high dividends for society at large.
LIST OF REFERENCES

[1] Chen, Danqi. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U

[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.

[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension." arXiv preprint arXiv:1705.03551 (2017).

[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ questions for machine comprehension of text." arXiv preprint arXiv:1606.05250 (2016).

[5] "The Stanford Question Answering Dataset." 2018. rajpurkar.github.io. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/

[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced mnemonic reader for machine comprehension." CoRR abs/1705.02798 (2017).

[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.

[8] Mehrotra, Neetesh. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. https://opensourceforu.com/2018/01/connect-deep-learning-ai

[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.

[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets

[11] Williams, Ronald J., and David Zipser. "A learning algorithm for continually running fully recurrent neural networks." Neural Computation 1, no. 2 (1989): 270-280.

[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9, no. 8 (1997): 1735-1780.

[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv preprint arXiv:1412.3555 (2014).

[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic regularities in continuous space word representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.

[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.

[16] "Word2Vec Tutorial - The Skip-Gram Model." 2016. Chris McCormick, mccormickml.com. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model

[17] "[Paper Introduction] Bilingual Word Representations with Monolingual Quality in Mind." 2015. SlideShare. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind

[18] "Attention Mechanism." 2016. Heuritech Blog. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism

[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. 2014. "Memory Networks." arXiv preprint arXiv:1410.3916. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916

[20] Wang, Shuohang, and Jing Jiang. "Machine comprehension using match-LSTM and answer pointer." arXiv preprint arXiv:1608.07905 (2016).

[21] Wang, Shuohang, and Jing Jiang. "Learning natural language inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).

[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.

[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated self-matching networks for reading comprehension and question answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.

[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional attention flow for machine comprehension." arXiv preprint arXiv:1611.01603 (2016).

[25] Clark, Christopher, and Matt Gardner. "Simple and effective multi-paragraph reading comprehension." arXiv preprint arXiv:1710.10723 (2017).

[26] "8. Analyzing Sentence Structure." 2018. nltk.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html
BIOGRAPHICAL SKETCH

Purnendu Mukherjee grew up in Kolkata, India, and has had a strong interest in computers since an early age. After high school, he earned a B.Sc. in computer science from Ramakrishna Mission Residential College, Narendrapur, followed by an M.Sc. in computer science from St. Xavier's College, Kolkata. He had a strong intuition for, and interest in, human-like learning systems and wanted to work in this area. He started working at TCS Innovation Labs, Pune, on applications of Natural Language Processing in education. As he was deeply passionate about learning systems that mimic the human brain and learn the way a human child does, he grew increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science at the University of Florida, Gainesville. His academic interests have been focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.
23
where and are parameters to be learned
Answer Pointer Layer The Answer Pointer (Ans-Ptr) layer is motivated by the
Pointer Net introduced by Vinyals et al (2015) [22] The boundary model produces only
the start token and the end token of the answer and then all the tokens between these
two in the original passage are considered to be the answer
When this paper was released in November 2016, the Match-LSTM method
was the state of the art in Question Answering and was at the top of the
leaderboard for the SQuAD dataset.
R-NET Matching Reading Comprehension with Self-Matching Networks
In this model [23], the question and passage are first processed separately by a
bidirectional recurrent network (Mikolov et al. 2010). The question and passage are
then matched with gated attention-based recurrent networks, obtaining a
question-aware representation of the passage. On top of that, self-matching
attention is applied to aggregate evidence from the whole passage and refine the
passage representation, which is then fed into the output layer to predict the boundary
of the answer span.
Question and passage encoding. First, the words are converted to their
respective word-level and character-level embeddings. The character-level
embeddings are generated by taking the final hidden states of a bidirectional recurrent
neural network (RNN) applied to the embeddings of the characters in the token. Such
character-level embeddings have been shown to help deal with out-of-vocabulary
(OOV) tokens.
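A minimal sketch of this idea, using a plain (ungated) RNN cell in place of the paper's gated cell, and random stand-ins for the learned character embeddings and weights:

```python
import numpy as np

rng = np.random.default_rng(0)
CHARS = "abcdefghijklmnopqrstuvwxyz"
char_emb = {c: rng.standard_normal(8) for c in CHARS}   # learned in practice
d = 16                                                   # hidden size per direction
Wx = rng.standard_normal((d, 8)) * 0.1
Wh = rng.standard_normal((d, d)) * 0.1

def rnn_final_state(vectors):
    """Final hidden state of a simple RNN run over a sequence of vectors."""
    h = np.zeros(d)
    for x in vectors:
        h = np.tanh(Wx @ x + Wh @ h)
    return h

def char_level_embedding(token):
    """Concatenate the final states of a forward and a backward pass over
    the token's characters -- defined even for out-of-vocabulary tokens."""
    xs = [char_emb[c] for c in token.lower() if c in char_emb]
    return np.concatenate([rnn_final_state(xs), rnn_final_state(xs[::-1])])

v = char_level_embedding("blorptastic")   # works for any unseen word
print(v.shape)  # (32,)
```

The point is that the embedding is built from characters, so a word never seen during training still gets a meaningful vector.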
They then use a bidirectional RNN to produce new representations u_t^Q and u_t^P
of all words in the question and passage respectively.
Figure 3-2 The task of Question Answering [23]
Gated Attention-based Recurrent Networks. They use a variant of attention-
based recurrent networks with an additional gate to determine the importance of
information in the passage with regard to a question. Different from the gates in an
LSTM or GRU, the additional gate is based on the current passage word and its
attention-pooling vector of the question, which focuses on the relation between the
question and the current passage word. The gate effectively models the phenomenon
that only parts of the passage are relevant to the question in reading comprehension
and question answering, and the gated representation is utilized in subsequent
calculations.
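A rough numpy sketch of this extra gate (dot-product scores stand in for the paper's additive attention, and all weights are random stand-ins for learned parameters):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
dq, dp = 6, 6
uQ = rng.standard_normal((4, dq))   # question word representations
u_p = rng.standard_normal(dp)       # current passage word representation

# attention-pooled question vector for this passage word
scores = uQ @ u_p
c = softmax(scores) @ uQ

# additional gate over [u_p; c]: lets the network down-weight passage
# words that are irrelevant to the question
Wg = rng.standard_normal((dq + dp, dq + dp)) * 0.1
inp = np.concatenate([u_p, c])
g = sigmoid(Wg @ inp)
gated = g * inp                      # fed to the matching RNN
print(gated.shape)  # (12,)
```

Because g is a sigmoid output, each dimension of the input is scaled by a value in (0, 1) before entering the recurrent matching layer.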
Self-Matching Attention. The question-aware passage representation generated
in the previous step highlights the important parts of the passage. One problem with
such a representation is that it has very limited knowledge of context: an answer
candidate is often oblivious to important cues in the passage outside its surrounding
window. To address this problem, the authors propose directly matching the
question-aware passage representation against itself. It dynamically collects evidence
from the whole passage for each word in the passage and encodes the evidence
relevant to the current passage word, together with its matching question information,
into the passage representation.
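The self-matching step can be sketched in a few lines (dot-product similarity stands in for the paper's additive attention, and the representation V is a random stand-in for the question-aware passage encoding):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
n, d = 5, 8
V = rng.standard_normal((n, d))   # question-aware passage representation

# match the passage against itself: every word attends over the whole
# passage, so evidence outside a word's local window can flow in
scores = V @ V.T                  # (n, n) pairwise word similarities
A = softmax(scores, axis=1)       # each row is a distribution over the passage
context = A @ V                   # whole-passage evidence per word
refined = np.concatenate([V, context], axis=1)  # input to another BiRNN
print(refined.shape)  # (5, 16)
```

Each word's refined vector now carries a weighted summary of the entire passage, not just its neighborhood.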
Output Layer. They use the same method as Wang & Jiang (2016b) and use
pointer networks (Vinyals et al. 2015) to predict the start and end positions of the
answer. In addition, they use attention-pooling over the question representation to
generate the initial hidden vector for the pointer network [23].
When the R-NET model first appeared on the leaderboard in March 2017, it was at
the top with a 72.3 Exact Match and 80.7 F1 score.
Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
BiDAF [24] is a hierarchical multi-stage architecture for modeling the
representations of the context paragraph at different levels of granularity. BiDAF
includes character-level, word-level, and contextual embeddings, and uses bi-directional
attention flow to obtain a query-aware context representation. Their attention layer is not
used to summarize the context paragraph into a fixed-size vector. Instead, the attention
is computed for every time step, and the attended vector at each time step, along with
the representations from previous layers, can flow through to the subsequent modeling
layer. This reduces the information loss caused by early summarization [24].
Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]
Their machine comprehension model is a hierarchical multi-stage process and
consists of six layers:
1. Character Embedding Layer maps each word to a vector space using character-level CNNs.
2. Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model.
3. Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embedding of the words. These first three layers are applied to both the query and context. They use an LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words, placing an LSTM in both directions and concatenating the outputs of the two LSTMs.
4. Attention Flow Layer couples the query and context vectors and produces a set of query-aware feature vectors for each word in the context.
5. Modeling Layer employs a Recurrent Neural Network to scan the context. The input to the modeling layer is G, which encodes the query-aware representations of the context words. The output of the modeling layer captures the interaction among the context words conditioned on the query. They use two layers of bi-directional LSTM with an output size of d for each direction. Hence a matrix is obtained, which is passed on to the output layer to predict the answer.
6. Output Layer provides an answer to the query. They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices under the predicted distributions, averaged over all examples [25].
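The loss described in the output layer above can be illustrated on toy logits (the numbers are made up):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def qa_loss(start_logits, end_logits, true_start, true_end):
    """Negative log probability of the true start plus that of the true end."""
    p1 = softmax(start_logits)[true_start]
    p2 = softmax(end_logits)[true_end]
    return -(np.log(p1) + np.log(p2))

# batch of two toy examples, averaged as in the paper
batch = [
    (np.array([4.0, 1.0, 0.0]), np.array([0.0, 1.0, 4.0]), 0, 2),
    (np.array([0.0, 3.0, 0.0]), np.array([0.0, 0.0, 3.0]), 1, 2),
]
loss = np.mean([qa_loss(*ex) for ex in batch])
print(float(loss))
```

The loss goes to zero as the predicted distributions concentrate on the true start and end indices.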
In a further variation of their above work, they add a self-attention layer after the
bi-attention layer to further improve the results. The architecture of the model is
shown below.
Figure 3-4 BiDAF with an added self-attention layer [25]
Summary
In this chapter we reviewed the methods that are fundamental to the state of the
art in Machine Comprehension and the task of Question Answering. We have
reached close to human-level accuracy, and this is due to incremental developments
over previous models. As we saw, Out-of-Vocabulary (OOV) tokens were handled
using character embeddings; long-term dependencies within the context passage were
addressed using self-attention; and many other techniques, such as contextualized
vectors and attention flow, were employed to get better results. In the next chapter we
will see how we can build on these models and develop further.
CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING
Although the advancements in Question Answering systems since the release of
the SQuAD dataset have been impressive and the results are getting close to human-
level accuracy, these systems are far from fool-proof. The models still make mistakes
that would be obvious to a human. For example:
Passage: The Panthers used the San Jose State practice facility and stayed at
the San Jose Marriott. The Broncos practiced at Florida State Facility and stayed at the
Santa Clara Marriott.
Question: At what university's facility did the Panthers practice?
Actual Answer: San Jose State
Predicted Answer: Florida State Facility
To find out what leads to the wrong predictions, we wanted to see the
attention weights associated with such an example. We plotted the passage-question
heat map, a 2D matrix where the intensity of each cell signifies the similarity between a
passage word and a question word. For the above example, we found that while certain
words of the question are given high weight, other parts are not. The words 'At',
'facility', and 'practice' are given high attention, but 'Panthers' does not receive high
attention. If it had, the system would have predicted 'San Jose State' as the right
answer. To solve this issue, we analyzed the base BiDAF model and proposed adding
two things:
1. Bi-Attention and Self-Attention over the Query
2. A second level of attention over the outputs of (Bi-Attention + Self-Attention) from both Context and Query
Multi-Attention BiDAF Model
Our comprehension model is a hierarchical multi-stage process and consists of
the following layers:
1. Embedding: As in other models, we embed words using pretrained word vectors. We also embed the characters in each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to get character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training.
2. Pre-Process: A shared bi-directional GRU (Cho et al. 2014) is used to map the question and passage embeddings to context-aware embeddings.
3. Context Attention: The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al. 2016) is used to build a query-aware context representation. Let h_i be the vector for context word i, q_j the vector for question word j, and n_q and n_c the lengths of the question and context respectively. We compute the attention between context word i and question word j as
a_ij = w1 · h_i + w2 · q_j + w3 · (h_i ⊙ q_j)    (4-1)
where w1, w2, and w3 are learned vectors and ⊙ is element-wise multiplication. We then compute an attended vector c_i for each context token as
p_ij = exp(a_ij) / Σ_k exp(a_ik),    c_i = Σ_j p_ij q_j    (4-2)
We also compute a query-to-context vector q_c
s_i = max_j a_ij,    m_i = exp(s_i) / Σ_k exp(s_k),    q_c = Σ_i m_i h_i    (4-3)
The final vector computed for each token is built by concatenating h_i, c_i, h_i ⊙ c_i, and q_c ⊙ c_i. In our model we subsequently pass the result through a linear layer with ReLU activations.
4. Context Self-Attention: Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU. Then we apply the same attention mechanism, only now between the passage and itself. In this case we do not use query-to-context attention, and we set a_ij = −∞ if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so its output is additionally summed with its input.
5. Query Attention: Here we proceed as in the context attention layer, but calculate the weighted sum of the context words for each query word, so the resulting length is the number of query words. We then calculate context-to-query attention analogously to the query-to-context attention of the context attention layer.
Figure 4-1 The modified BiDAF model with multilevel attention
6. Query Self-Attention: This is done the same way as the context self-attention layer, but on the output of the Query Attention layer.
7. Context-Query Bi-Attention + Self-Attention: The outputs of the Context Self-Attention and Query Self-Attention layers are taken as input, and the same bi-attention and self-attention process is applied to these inputs.
8. Prediction: In the last layer of our model, a bidirectional GRU is applied, followed by a linear layer that computes answer start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log likelihood of selecting the correct start and end tokens.
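Reading equations (4-1) through (4-3) as the attention of Clark and Gardner (2017) [25], the context attention layer can be sketched as follows; the weights are random stand-ins for learned parameters, and the final concatenation [h_i; c_i; h_i ⊙ c_i; q_c ⊙ c_i] is our assumption from that paper:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(3)
n_c, n_q, d = 6, 4, 8
H = rng.standard_normal((n_c, d))   # context word vectors h_i
Q = rng.standard_normal((n_q, d))   # question word vectors q_j
w1, w2, w3 = rng.standard_normal((3, d))

# (4-1): a_ij = w1 . h_i + w2 . q_j + w3 . (h_i * q_j)
A = (H @ w1)[:, None] + (Q @ w2)[None, :] + (H * w3) @ Q.T

# (4-2): attended question vector c_i for each context token
C = softmax(A, axis=1) @ Q

# (4-3): one query-to-context vector q_c from the row-wise score maxima
m = softmax(A.max(axis=1))
q_c = m @ H

# final per-token vector [h_i; c_i; h_i * c_i; q_c * c_i], then (in the
# model) a linear layer with ReLU activations
G = np.concatenate([H, C, H * C, q_c * C], axis=1)
print(G.shape)  # (6, 32)
```

Each context token thus receives a fixed-width, query-aware feature vector that the self-attention and prediction layers consume.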
Having carried out these modifications, we were able to solve the wrong example
we started with: the multilevel attention model gives the correct output, "San Jose
State". We also achieved slightly better scores than the original model, with an F1
score of 85.44 on the SQuAD dev dataset.
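For reference, the token-level F1 used for SQuAD evaluation can be computed as below (the official script additionally strips articles and punctuation, which this sketch omits):

```python
from collections import Counter

def f1_score(prediction, ground_truth):
    """Token-level F1 between a predicted and a gold answer string."""
    pred, gold = prediction.lower().split(), ground_truth.lower().split()
    common = Counter(pred) & Counter(gold)   # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(f1_score("San Jose State", "San Jose State"))          # 1.0
print(f1_score("Florida State Facility", "San Jose State"))  # only 'state' overlaps
```

The dataset-level F1 is this score averaged over all examples (taking the maximum over the available gold answers).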
Chatbot Design Using a QA System
Designing a chatbot that passes the Turing test is a fundamental goal of
Artificial Intelligence. Although we are many orders of magnitude away from achieving
such a goal, domain-specific tasks can be solved with chatbots built from current
technology. We started with a similar objective in mind: to design a domain-specific
chatbot and then generalize to other areas once it achieves the first domain-specific
objective robustly. This led us to the fundamental problem of Machine Comprehension
and subsequently to the task of Question Answering. Having achieved some degree of
success with QA systems, we looked back to see whether we could apply our newly
acquired knowledge to the task of designing chatbots.
The chatbots made with today's technologies mostly rely on handcrafted
techniques such as template matching, which requires anticipating all possible ways a
user may articulate his requirements and a conversation may unfold. This requires
many man-hours to design a domain-specific system and is still very error-prone. In this
section we propose a general chatbot design that makes designing a domain-specific
chatbot easy and robust at the same time.
Every domain-specific chatbot needs to obtain a set of information from the user
and show results based on the user-specific information obtained. Traditional
chatbots use template matching and keyword lookup to determine whether the user has
provided the required information. Our idea is to use the Question Answering system in
the backend to extract the required information from whatever the user has typed up to
this point of the conversation. The information to be extracted can be posed as a set of
questions, and the answers obtained from those questions can be used as the
parameters to supply the relevant information to the user.
We chose flight reservation as our chatbot domain. Our goal was to extract the
information required to show the user the available flights as per his requirements.
Figure 4-2 Flight reservation chatbot's chat window
For a flight reservation task, the booking agent needs to know at minimum the
origin city, destination city, and date of travel to be able to show the available flights.
Optional information includes the number of tickets, the passenger's name, one way or
round trip, etc.
A minimal conversation with the user through the chat window would be as
shown above. We had a platform called OneTask on which we wanted to implement our
chatbot. The chat interface within the OneTask system looks as follows.
Figure 4-3 Chatbot within OneTask system
The working of the chat system is as follows:
1. Initiation: The user opens the chat window, which starts a session with the chatbot. The chatbot asks 'How may I help you?'
2. User Reply: The user may reply with none of the information required for flight booking, or may provide multiple pieces of information in the same message.
3. User Reply Parsing: The conversation up to this point is treated as a passage, and the internal questions are run on this passage. The questions that are run include:
Where do you want to go?
From where do you want to leave?
When do you want to depart?
4. Parsed Responses from the QA Model: After running the questions on the QA system with the given conversation, answers are obtained for all of them along with their corresponding confidence values. Since answers will be returned even if the required question has not been answered up to this point in the conversation, the confidence values play a crucial role in determining the validity of an answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly; a confidence of 2 to 10 signifies that it may have been answered, but the system should verify with the user for correctness; and any confidence below 2 is discarded.
5. Asking Remaining Questions Iteratively: After the parsing, it is checked whether any of the required questions are still unanswered. If so, the chatbot asks the remaining questions, and steps 3 and 4 are carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request.
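The steps above can be sketched as a slot-filling loop. The qa() function here is a hypothetical stand-in for the real QA model, and the sample answers and confidences are made up:

```python
REQUIRED = {
    "destination": "Where do you want to go?",
    "origin": "From where do you want to leave?",
    "date": "When do you want to depart?",
}

def qa(passage, question):
    """Stand-in for the real QA model: returns (answer, confidence).
    A real system would run the passage and question through the model."""
    facts = {"Where do you want to go?": ("New York", 12.0),
             "From where do you want to leave?": ("", 0.5),
             "When do you want to depart?": ("next Friday", 6.0)}
    return facts[question]

def parse_slots(conversation):
    """Route each internal question's answer by its confidence band."""
    filled, to_verify, missing = {}, {}, []
    for slot, question in REQUIRED.items():
        answer, conf = qa(conversation, question)
        if conf > 10:            # accept outright
            filled[slot] = answer
        elif conf >= 2:          # ask the user to confirm
            to_verify[slot] = answer
        else:                    # discard; ask the question explicitly
            missing.append(question)
    return filled, to_verify, missing

filled, to_verify, missing = parse_slots("I need a flight to New York next Friday")
print(filled, to_verify, missing)
```

The chatbot would then confirm the to_verify slots and ask the missing questions, repeating until all required slots are filled.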
Figure 4-4 The flow diagram of the flight booking chatbot system
Online QA System and Attention Visualization
To be able to test various examples, we set up the BiDAF [24] model as an
online demo. One can either choose from the available examples in the drop-down
menu or paste one's own passage and questions. While this is a useful and interesting
system for testing the model in a user-friendly way, we created it primarily to be able to
focus on the wrong samples.
Figure 4-5 QA system interface with attention highlight over candidate answers
The candidate answers are shown with blue highlights according to their
confidence values: the higher the confidence, the darker the answer. The candidate
with the highest confidence value is chosen as the predicted answer. We developed the
system to show the attention spread of the candidate answers in order to understand
what needs to be done to improve the system. This led us to realize the importance of
including the query attention part as well as multilevel attention in the BiDAF model, as
described in the first section of this chapter.
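The rendering idea can be sketched as mapping normalized confidence to highlight opacity; the color and markup below are illustrative, not the demo's actual code:

```python
def highlight_html(candidates):
    """Map answer confidences to blue-highlight opacity: the higher the
    confidence, the darker the highlighted span."""
    top = max(conf for _, conf in candidates)
    spans = []
    for text, conf in candidates:
        alpha = conf / top                    # normalize to (0, 1]
        spans.append('<span style="background: rgba(30, 100, 220, %.2f)">%s</span>'
                     % (alpha, text))
    return " ... ".join(spans)

html = highlight_html([("San Jose State", 11.2), ("Florida State Facility", 4.1)])
print(html)
```

The top-scoring candidate renders at full opacity, while weaker candidates fade proportionally, which makes the model's relative certainty visible at a glance.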
CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions
Having done a thorough analysis of the current methods and the state of the art
in Machine Comprehension and the QA task, and having developed systems on top of
them, we have gained a strong sense of what needs to be done to further improve QA
models. After observing the wrong samples, we can see that the system is still unable
to encode meaning and is picking answers based on statistical patterns of answer
occurrence in the training examples.
From the state-of-the-art models and the ongoing research literature, it is easy to
conclude that more features need to be embedded to encode the meaning of words,
phrases, and sentences. The Reinforced Mnemonic Reader for Machine
Comprehension [6] encoded POS and NER tags of words along with their word and
character embeddings, which gave better results.
Figure 5-1 An English language semantic parse tree [26]
We have developed a method to encode the syntax parse tree of a sentence that
encodes not only the POS tags but also the relation of each word within its phrase and
the relation of each phrase within the whole sentence, in a hierarchical manner.
Finally, data augmentation is another way to get better results. One definite way
to reduce the errors would be to include in the training data samples similar to those the
system is faltering on in the dev set: one could generate examples similar to the failure
cases and include them in the training set for better prediction. Another approach is to
train on similar and bigger datasets. Our models were trained on the SQuAD dataset,
but there are other datasets that pose a similar question answering task, such as
TriviaQA [3]. We could augment the SQuAD training set with TriviaQA to build a more
robust system that generalizes better and thus predicts answer spans with higher
accuracy.
Conclusion
In this work we explored the most fundamental techniques that have shaped the
current state of the art. We then proposed a minor architectural improvement over an
existing model. Furthermore, we developed two applications that use the base model:
first, we discussed how a chatbot application can be built using the QA system, and
second, we created a web interface where the model can be run on any passage and
question. This interface also shows the attention spread over the candidate answers.
While our effort to push the state of the art forward is ongoing, we strongly believe that
surpassing human-level accuracy on this task will pay high dividends for society at
large.
LIST OF REFERENCES
[1] Chen, Danqi. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U
[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.
[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension." arXiv preprint arXiv:1705.03551 (2017).
[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ Questions for Machine Comprehension of Text." arXiv preprint arXiv:1606.05250 (2016).
[5] "The Stanford Question Answering Dataset." 2018. rajpurkar.github.io. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/
[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced Mnemonic Reader for Machine Comprehension." CoRR abs/1705.02798 (2017).
[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.
[8] Mehrotra, Neetesh. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai
[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.
[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
[11] Williams, Ronald J., and David Zipser. "A Learning Algorithm for Continually Running Fully Recurrent Neural Networks." Neural Computation 1, no. 2 (1989): 270-280.
[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long Short-Term Memory." Neural Computation 9, no. 8 (1997): 1735-1780.
[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling." arXiv preprint arXiv:1412.3555 (2014).
[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic Regularities in Continuous Space Word Representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.
[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global Vectors for Word Representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.
[16] "Word2Vec Tutorial - The Skip-Gram Model." 2016. mccormickml.com. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model
[17] "[Paper Introduction] Bilingual Word Representations with Monolingual Quality in Mind." 2015. slideshare.net. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind
[18] "Attention Mechanism." 2016. blog.heuritech.com. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism
[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. 2014. "Memory Networks." arXiv preprint arXiv:1410.3916. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916
[20] Wang, Shuohang, and Jing Jiang. "Machine Comprehension Using Match-LSTM and Answer Pointer." arXiv preprint arXiv:1608.07905 (2016).
[21] Wang, Shuohang, and Jing Jiang. "Learning Natural Language Inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).
[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer Networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.
[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated Self-Matching Networks for Reading Comprehension and Question Answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.
[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional Attention Flow for Machine Comprehension." arXiv preprint arXiv:1611.01603 (2016).
[25] Clark, Christopher, and Matt Gardner. "Simple and Effective Multi-Paragraph Reading Comprehension." arXiv preprint arXiv:1710.10723 (2017).
[26] "8. Analyzing Sentence Structure." 2018. nltk.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html
8 Prediction In the last layer of our model a bidirectional GRU is applied followed by a linear layer that computes answer start scores for each token The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores The softmax operation is applied to the start and end scores to produce start and end probabilities and we optimize the negative log likelihood of selecting correct start and end tokens
Having carried out this modification we were able to solve the wrong example we
started with The multilevel attention model gives the correct output as ldquoSan Jose
Staterdquo Also we achieved slightly better scores than the original model with a F1 score
of 8544 on the SQuAD dev dataset
Chatbot Design Using a QA System
Designing a perfect chatbot that passes the Turing test is a fundamental goal for
Artificial Intelligence Although we are many order to magnitudes away from achieving
such a goal domain specific tasks can be solved with chatbots made from current
technology We had started our goal with a similar objective in mind ie to design a
domain specific chatbot and then generalize to other areas as it is able to achieve the
first domain specific objective robustly This led us to the fundamental problem of
Machine Comprehension and subsequently to the task of Question Answering Having
achieved some degree of success with QA systems we looked back if we could apply
our newly acquired knowledge in the task of designing Chatbots
The chatbots made with todayrsquos technologies are mostly handcrafted techniques
such as template matching that requires anticipating all possible ways a user may
articulate his requirements and a conversation may occur This requires a lot of man
hours for designing a domain specific system and is still very error prone In this section
33
we propose a general Chatbot design that would make the designing of a domain
specific chatbot very easy and robust at the same time
Every domain specific chatbot needs to obtain a set of information from the user
and show some results based on the user specific information obtained The traditional
chatbots use template matching and keywords lookup to determine if the user has
provided the required information Our idea is to use the Question Answering system in
the backend to extract out the required information from whatever the user has typed
until this point of the conversation The information to be extracted can be posed in the
form of a set of questions and the answers obtained from those questions can be used
as the parameters to supply the relevant information to the user
We had chosen our chatbot domain as the flight reservation system Our goal
was to extract the required information from the user to be able to show him the
available flights as per the userrsquos requirements
Figure 4-2 Flight reservation chatbotrsquos chat window
34
For a flight reservation task the booking agent needs to know the origin city
destination city and date of travel at minimum to be able to show the available flights
Optional information includes the number of tickets passengerrsquos name one way or
round trip etc
The minimalistic conversation with the user through the chat window would be as
shown above We had a platform called OneTask on which we wanted to implement our
chat bot The chat interface within the OneTask system looks as follows
Figure 4-3 Chatbot within OneTask system
The working of the chat system is as follows ndash
1 Initiation The user opens the chat window which starts a session with the chat bot The chat bot asks lsquoHow may I help yoursquo
2 User Reply The user may reply with none of the required information for flight booking or may reply with multiple information in the same message
3 User Reply parsing The conversation up to this point is treated as a passage and the internal questions are run on this passage So the four questions that are run are
Where do you want to go
35
From where do you want to leave
When do you want to depart
4 Parsed responses from QA model After running the questions on the QA system with the given conversation answers are obtained for all along with their corresponding confidence values Since answers will be obtained even if the required question was not answered up to this point in the conversation the confidence values plays a crucial role in determining the validity of the answer For this purpose we determined ranges of confidence values for validation A confidence value above 10 signifies the question has been answered correctly A confidence of 2 ndash 10 signifies that it may have been answered but should verify with the user for correctness and any confidence below 2 is discarded
5 Asking remaining questions iteratively After the parsing it is checked if any of the required questions are still unanswered If so the chatbot asks the remaining question and the process from 3 ndash 4 is carried out iteratively Once all the questions have been answered the user is shown with the available flight options as per his request
Figure 4-4 The Flow diagram of the Flight booking Chatbot system
Online QA System and Attention Visualization
To be able to test out various examples we used the BiDAF [23] model for an
online demo One can either choose from the available examples from the drop-down
menu or paste their own passage and examples While this is a useful and interesting
system to test the model in a user-friendly way we created this system to be able to
focus on the wrong samples
36
Figure 4-5 QA system interface with attention highlight over candidate answers
The candidate answers are shown with the blue highlights as per their
confidence values The higher the confidence the darker the answer The highest
confidence value is chosen as the predicted answer We developed the system to show
attention spread of the candidate answers to realize what needs to be done to improve
the system This led us to realize the importance of including the query attention part as
well as multilevel attention on the BiDAF model as described in the first section of this
chapter
37
CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions
Having done a thorough analysis of the current methods and state of the arts in
Machine Comprehension and for the QA task and developing systems on it we have
achieved a strong sense of what needs to be done to further improve the QA models
After observing the wrong samples we can see that the system is still unable to encode
meaning and is picking answers based on statistical patterns of occurrence of answers
on training examples
As per the State of the Art models and analyzing the ongoing research literature
it is easy to conclude that more features need to be embedded to encode the meaning
of words phrases and sentences A paper called Reinforced Mnemonic Reader for
Machine Comprehension encoded POS and NER tags of words along with their word
and character embedding This gave them better results
Figure 5-1 An English language semantic parse tree [26]
38
We have developed a method to encode the syntax parse tree of a sentence that
not only encodes the post tags but the relation of the word within the phrase and the
relation of the phrase within the whole sentence in a hierarchical manner
Finally data augmentation is another solution to get better results One definite
way to reduce the errors would be to include similar samples in the training data which
the system is faltering in the dev set One could generate similar examples as the failure
cases and include them in the training set to have better prediction Another system
would be to train to similar and bigger datasets Our models were trained on the SQuAD
dataset There are other datasets too which does the similar question answering task
such as TriviaQA We could augment the training set of TriviaQA along with SQuAD to
have a more robust system that is able to generalize better and thus have higher
accuracy for predicting answer spans
Conclusion
In this work we have tried to explore the most fundamental techniques that have
shaped the current state of the art Then we proposed a minor improvement of
architecture over an existing model Furthermore we developed two applications that
uses the base model First we talked about how a chatbot application can be made
using the QA system and lastly we also created a web interface where the model can
be used for any Passage and Question This interface also shows the attention spread
on the candidate answers While our effort is ongoing to push the state of the art
forward we strongly believe that surpassing human level accuracy on this task will have
high dividends for the society at large
39
LIST OF REFERENCES

[1] Chen, Danqi. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U

[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.

[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension." arXiv preprint arXiv:1705.03551 (2017).

[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ Questions for Machine Comprehension of Text." arXiv preprint arXiv:1606.05250 (2016).

[5] "The Stanford Question Answering Dataset." 2018. rajpurkar.github.io. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/

[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced Mnemonic Reader for Machine Comprehension." CoRR abs/1705.02798 (2017).

[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.

[8] Mehrotra, Neetesh. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai

[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.

[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/

[11] Williams, Ronald J., and David Zipser. "A Learning Algorithm for Continually Running Fully Recurrent Neural Networks." Neural Computation 1, no. 2 (1989): 270-280.

[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long Short-Term Memory." Neural Computation 9, no. 8 (1997): 1735-1780.

[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling." arXiv preprint arXiv:1412.3555 (2014).

[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic Regularities in Continuous Space Word Representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.

[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global Vectors for Word Representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.

[16] McCormick, Chris. 2016. "Word2Vec Tutorial - The Skip-Gram Model." mccormickml.com. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model

[17] "[Paper Introduction] Bilingual Word Representations with Monolingual Quality in Mind." 2015. SlideShare. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind

[18] "Attention Mechanism." 2016. blog.heuritech.com. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism

[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. "Memory Networks." arXiv preprint arXiv:1410.3916 (2014). Accessed March 16, 2018. https://arxiv.org/abs/1410.3916

[20] Wang, Shuohang, and Jing Jiang. "Machine Comprehension Using Match-LSTM and Answer Pointer." arXiv preprint arXiv:1608.07905 (2016).

[21] Wang, Shuohang, and Jing Jiang. "Learning Natural Language Inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).

[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer Networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.

[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated Self-Matching Networks for Reading Comprehension and Question Answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.

[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional Attention Flow for Machine Comprehension." arXiv preprint arXiv:1611.01603 (2016).

[25] Clark, Christopher, and Matt Gardner. "Simple and Effective Multi-Paragraph Reading Comprehension." arXiv preprint arXiv:1710.10723 (2017).

[26] "8. Analyzing Sentence Structure." 2018. nltk.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata, India, and has had a strong interest in computers since an early age. After high school, he earned a B.Sc. in computer science from Ramakrishna Mission Residential College, Narendrapur, followed by an M.Sc. in computer science from St. Xavier's College, Kolkata. With a strong intuition for and interest in human-like learning systems, he wanted to work in this area, and he started at TCS Innovation Labs, Pune, on applications of Natural Language Processing in education. Deeply passionate about learning systems that mimic the human brain and learn the way a human child does, he grew increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science at the University of Florida, Gainesville. His academic interests have been focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.
Self-Matching Attention: From the previous step, the question-aware passage representation is generated to highlight the important parts of the passage. One problem with such a representation is that it has very limited knowledge of context: an answer candidate is often oblivious to important cues in the passage outside its surrounding window. To address this problem, the authors propose directly matching the question-aware passage representation against itself. It dynamically collects evidence from the whole passage for each passage word and encodes the evidence relevant to the current passage word, together with its matching question information, into the passage representation.
Output Layer: They use the same method as Wang & Jiang (2016b), employing pointer networks (Vinyals et al., 2015) to predict the start and end position of the answer. In addition, they use attention-pooling over the question representation to generate the initial hidden vector for the pointer network [23].
When the R-NET model first appeared on the leaderboard in March 2017, it was at the top with a 72.3 Exact Match and an 80.7 F1 score.
Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
BiDAF [24] is a hierarchical multi-stage architecture for modeling representations of the context paragraph at different levels of granularity. BiDAF includes character-level, word-level, and contextual embeddings, and uses bi-directional attention flow to obtain a query-aware context representation. Their attention layer is not used to summarize the context paragraph into a fixed-size vector. Instead, the attention is computed for every time step, and the attended vector at each time step, along with the representations from previous layers, can flow through to the subsequent modeling layer. This reduces the information loss caused by early summarization [24].
Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]
Their machine comprehension model is a hierarchical multi-stage process and consists of six layers:
1. Character Embedding Layer maps each word to a vector space using character-level CNNs.

2. Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model.

3. Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embeddings of the words. These first three layers are applied to both the query and the context. They use an LSTM on top of the embeddings provided by the previous layers to model the temporal interactions between words, placing an LSTM in each direction and concatenating the outputs of the two LSTMs.

4. Attention Flow Layer couples the query and context vectors and produces a set of query-aware feature vectors for each word in the context.

5. Modeling Layer employs a Recurrent Neural Network to scan the context. The input to the modeling layer is G, which encodes the query-aware representations of context words. The output of the modeling layer captures the interaction among the context words conditioned on the query. They use two layers of bi-directional LSTM with an output size of d for each direction. Hence a matrix is obtained, which is passed on to the output layer to predict the answer.

6. Output Layer provides an answer to the query. They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices, as given by the predicted distributions, averaged over all examples [25].
In a further variation of the above work, they add a self-attention layer after the bi-attention layer to further improve the results. The architecture of the model is shown below.
Figure 3-3. The task of Question Answering [25]
Summary
In this chapter we reviewed the methods that are fundamental to the state of the art in Machine Comprehension and the task of Question Answering. We have come close to human-level accuracy, and this is due to incremental developments over previous models. As we saw, Out-of-Vocabulary (OOV) tokens were handled using character embeddings, long-term dependencies within the context passage were addressed using self-attention, and many other techniques, such as contextualized vectors and attention flow, were employed to get better results. In the next chapter we will see how we can build on these models and develop further.
CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING
Although the advancements in Question Answering systems since the release of the SQuAD dataset have been impressive, and the results are getting close to human-level accuracy, these are far from fool-proof systems. The models still make mistakes that would be obvious to a human. For example:
Passage: The Panthers used the San Jose State practice facility and stayed at the San Jose Marriott. The Broncos practiced at Florida State Facility and stayed at the Santa Clara Marriott.
Question: At what university's facility did the Panthers practice?
Actual Answer: San Jose State
Predicted Answer: Florida State Facility
To find out what leads to the wrong predictions, we wanted to see the attention weights associated with such an example. We plotted the passage-question heat map, a 2D matrix in which the intensity of each cell signifies the similarity between a passage word and a question word. For the above example, we found that while certain words of the question are given high weight, other parts are not: the words 'At', 'facility', and 'practice' receive high attention, but 'Panthers' does not. If it had received high attention, the system would have predicted 'San Jose State' as the right answer. To solve this issue, we analyzed the base BiDAF model and proposed adding two things:
1. Bi-Attention and Self-Attention over the Query
2. Second-level attention over the outputs of (Bi-Attention + Self-Attention) from both Context and Query
Multi-Attention BiDAF Model
Our comprehension model is a hierarchical multi-stage process and consists of the following layers:
1. Embedding: As in other models, we embed words using pretrained word vectors. We also embed the characters of each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to get character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training.
2. Pre-Process: A shared bi-directional GRU (Cho et al., 2014) is used to map the question and passage embeddings to context-aware embeddings.
3. Context Attention: The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al., 2016) is used to build a query-aware context representation. Let h_i be the vector for context word i, q_j the vector for question word j, and n_q and n_c the lengths of the question and context respectively. We compute attention between context word i and question word j as

a_{ij} = w_1 \cdot h_i + w_2 \cdot q_j + w_3 \cdot (h_i \odot q_j)    (4-1)

where w_1, w_2, and w_3 are learned vectors and \odot is element-wise multiplication. We then compute an attended vector c_i for each context token as

c_i = \sum_{j=1}^{n_q} \frac{\exp(a_{ij})}{\sum_{j'=1}^{n_q} \exp(a_{ij'})} \, q_j    (4-2)

We also compute a query-to-context vector q_c:

q_c = \sum_{i=1}^{n_c} \frac{\exp(m_i)}{\sum_{i'=1}^{n_c} \exp(m_{i'})} \, h_i, \quad m_i = \max_{1 \le j \le n_q} a_{ij}    (4-3)
The final vector computed for each token is built by concatenating h_i, c_i, h_i \odot c_i, and q_c \odot h_i. In our model we subsequently pass the result through a linear layer with ReLU activations.
4. Context Self-Attention: Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU. Then we apply the same attention mechanism, only now between the passage and itself. In this case we do not use query-to-context attention, and we set a_{ij} = -\infty if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so its output is additionally summed with the input.
5. Query Attention: For this part we proceed the same way as in context attention, but calculate the weighted sum of the context words for each query word; the output length is therefore the number of query words. We then calculate context-to-query attention, analogous to the query-to-context attention in the context attention layer.
Figure 4-1 The modified BiDAF model with multilevel attention
6. Query Self-Attention: This part is done the same way as the context self-attention layer, but on the output of the Query Attention layer.
7. Context-Query Bi-Attention + Self-Attention: The outputs of the Context Self-Attention and Query Self-Attention layers are taken as input, and the same bi-attention and self-attention process is applied to these inputs.
8. Prediction: In the last layer of our model, a bidirectional GRU is applied, followed by a linear layer that computes answer start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log-likelihood of selecting the correct start and end tokens.
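The bi-attention of layer 3, equations (4-1)-(4-3), and the final concatenation can be sketched numerically. The snippet below is a minimal NumPy illustration with random toy vectors; the dimensions and weight vectors are placeholders, not the trained model's parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

n_c, n_q, d = 6, 4, 8               # context length, question length, hidden size
H = rng.standard_normal((n_c, d))   # context vectors h_i
Q = rng.standard_normal((n_q, d))   # question vectors q_j
w1, w2, w3 = rng.standard_normal((3, d))  # learned vectors (random here)

# (4-1): a_ij = w1 . h_i + w2 . q_j + w3 . (h_i ⊙ q_j)
A = (H @ w1)[:, None] + (Q @ w2)[None, :] + (H * w3) @ Q.T

# (4-2): attended vector c_i = sum_j softmax_j(a_ij) q_j
C = softmax(A, axis=1) @ Q

# (4-3): query-to-context vector q_c, with m_i = max_j a_ij
qc = softmax(A.max(axis=1), axis=0) @ H

# final per-token vector: [h_i; c_i; h_i ⊙ c_i; q_c ⊙ h_i]
X = np.concatenate([H, C, H * C, qc[None, :] * H], axis=1)
print(X.shape)  # (6, 32)
```

In the full model, X would then be passed through the linear layer with ReLU activations described above.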
Having carried out this modification, we were able to solve the failing example we started with: the multilevel attention model gives the correct output, "San Jose State". We also achieved slightly better scores than the original model, with an F1 score of 85.44 on the SQuAD dev set.
Chatbot Design Using a QA System
Designing a perfect chatbot that passes the Turing test is a fundamental goal of Artificial Intelligence. Although we are many orders of magnitude away from achieving such a goal, domain-specific tasks can be solved with chatbots built from current technology. We started with a similar objective in mind, i.e., to design a domain-specific chatbot and then generalize to other areas once it achieves the first domain-specific objective robustly. This led us to the fundamental problem of Machine Comprehension and subsequently to the task of Question Answering. Having achieved some degree of success with QA systems, we looked back to see whether we could apply our newly acquired knowledge to the task of designing chatbots.
The chatbots made with today's technologies mostly rely on handcrafted techniques such as template matching, which requires anticipating all possible ways a user may articulate his requirements and a conversation may unfold. This requires many man-hours to design a domain-specific system, and the result is still very error prone. In this section we propose a general chatbot design that would make designing a domain-specific chatbot easy and robust at the same time.
Every domain-specific chatbot needs to obtain a set of information from the user and show some results based on the user-specific information obtained. Traditional chatbots use template matching and keyword lookup to determine whether the user has provided the required information. Our idea is to use the Question Answering system in the backend to extract the required information from whatever the user has typed up to this point in the conversation. The information to be extracted can be posed as a set of questions, and the answers obtained from those questions can be used as the parameters to supply the relevant information to the user.
We chose flight reservation as our chatbot domain. Our goal was to extract the required information from the user to be able to show the available flights as per the user's requirements.
Figure 4-2. Flight reservation chatbot's chat window
For a flight reservation task, the booking agent needs to know, at minimum, the origin city, destination city, and date of travel to be able to show the available flights. Optional information includes the number of tickets, the passenger's name, one-way or round trip, etc.
The minimal conversation with the user through the chat window would be as shown above. We had a platform called OneTask on which we wanted to implement our chatbot. The chat interface within the OneTask system looks as follows:
Figure 4-3 Chatbot within OneTask system
The working of the chat system is as follows:
1. Initiation: The user opens the chat window, which starts a session with the chatbot. The chatbot asks, 'How may I help you?'
2. User Reply: The user may reply with none of the required information for flight booking, or may include multiple pieces of information in the same message.
3. User Reply Parsing: The conversation up to this point is treated as a passage, and the internal questions are run on this passage. The questions that are run are:
Where do you want to go?
From where do you want to leave?
When do you want to depart?
4. Parsed Responses from the QA Model: After running the questions on the QA system with the given conversation, answers are obtained for all of them, along with their corresponding confidence values. Since answers will be returned even if the required question has not yet been answered at this point in the conversation, the confidence values play a crucial role in determining the validity of an answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly; a confidence of 2-10 signifies it may have been answered, but we should verify with the user for correctness; and any answer with confidence below 2 is discarded.
5. Asking Remaining Questions Iteratively: After the parsing, it is checked whether any of the required questions are still unanswered. If so, the chatbot asks the remaining questions, and steps 3-4 are carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request.
Figure 4-4 The Flow diagram of the Flight booking Chatbot system
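The steps above can be sketched as a slot-filling loop. In this sketch, `ask_qa_model` is a hypothetical stand-in for the trained QA model (the real system returns an answer span and a confidence); the thresholds follow the ranges described in step 4, and `fake_qa` is a toy stub for illustration only.

```python
REQUIRED_QUESTIONS = [
    "Where do you want to go?",
    "From where do you want to leave?",
    "When do you want to depart?",
]

def parse_slots(conversation, ask_qa_model):
    """Run each internal question against the conversation-so-far,
    keeping answers according to the confidence thresholds."""
    slots = {}
    for question in REQUIRED_QUESTIONS:
        answer, confidence = ask_qa_model(conversation, question)
        if confidence > 10:
            slots[question] = (answer, "accepted")
        elif confidence >= 2:
            slots[question] = (answer, "needs_verification")
        # below 2: discard; the chatbot will ask this question explicitly
    return slots

def next_prompt(slots):
    """First unanswered required question, or None if all slots are filled."""
    for question in REQUIRED_QUESTIONS:
        if question not in slots:
            return question
    return None

# Toy QA stub for illustration only.
def fake_qa(passage, question):
    if "go" in question and "Miami" in passage:
        return "Miami", 12.0
    return "", 0.5

slots = parse_slots("I want to fly to Miami", fake_qa)
print(next_prompt(slots))  # From where do you want to leave?
```

Once `next_prompt` returns None, the extracted parameters can be used to query and display the available flights.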
Online QA System and Attention Visualization
To be able to test various examples, we used the BiDAF [24] model for an online demo. One can either choose from the available examples in the drop-down menu or paste one's own passage and questions. While this is a useful and interesting system for testing the model in a user-friendly way, we created it primarily to be able to focus on the wrong samples.
Figure 4-5 QA system interface with attention highlight over candidate answers
The candidate answers are shown with blue highlights according to their confidence values: the higher the confidence, the darker the answer. The highest-confidence candidate is chosen as the predicted answer. We developed the system to show the attention spread of the candidate answers in order to understand what needs to be done to improve the system. This led us to realize the importance of including the query attention part, as well as multilevel attention, in the BiDAF model, as described in the first section of this chapter.
CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions
Having done a thorough analysis of the current methods and state of the art in Machine Comprehension and the QA task, and having developed systems for it, we have gained a strong sense of what needs to be done to further improve QA models. After observing the wrong samples, we can see that the system is still unable to encode meaning and is picking answers based on statistical patterns of answer occurrence in the training examples.
From the state-of-the-art models and the ongoing research literature, it is easy to conclude that more features need to be embedded to encode the meaning of words, phrases, and sentences. The Reinforced Mnemonic Reader for Machine Comprehension [6] encoded POS and NER tags of words along with their word and character embeddings, which gave them better results.
Figure 5-1 An English language semantic parse tree [26]
38
We have developed a method to encode the syntax parse tree of a sentence that
not only encodes the post tags but the relation of the word within the phrase and the
relation of the phrase within the whole sentence in a hierarchical manner
Finally data augmentation is another solution to get better results One definite
way to reduce the errors would be to include similar samples in the training data which
the system is faltering in the dev set One could generate similar examples as the failure
cases and include them in the training set to have better prediction Another system
would be to train to similar and bigger datasets Our models were trained on the SQuAD
dataset There are other datasets too which does the similar question answering task
such as TriviaQA We could augment the training set of TriviaQA along with SQuAD to
have a more robust system that is able to generalize better and thus have higher
accuracy for predicting answer spans
Conclusion
In this work we have tried to explore the most fundamental techniques that have
shaped the current state of the art Then we proposed a minor improvement of
architecture over an existing model Furthermore we developed two applications that
uses the base model First we talked about how a chatbot application can be made
using the QA system and lastly we also created a web interface where the model can
be used for any Passage and Question This interface also shows the attention spread
on the candidate answers While our effort is ongoing to push the state of the art
forward we strongly believe that surpassing human level accuracy on this task will have
high dividends for the society at large
39
LIST OF REFERENCES
[1] Danqi Chen From Reading Comprehension To Open-Domain Question Answering 2018 Youtube Accessed March 16 2018 httpswwwyoutubecomwatchv=1RN88O9C13U
[2] Lehnert Wendy G The Process of Question Answering No RR-88 Yale Univ New Haven Conn Dept of computer science 1977
[3] Joshi Mandar Eunsol Choi Daniel S Weld and Luke Zettlemoyer Triviaqa A large scale distantly supervised challenge dataset for reading comprehension arXiv preprint arXiv170503551 (2017)
[4] Rajpurkar Pranav Jian Zhang Konstantin Lopyrev and Percy Liang Squad 100000+ questions for machine comprehension of text arXiv preprint arXiv160605250 (2016)
[5] The Stanford Question Answering Dataset 2018 RajpurkarGithubIo Accessed March 16 2018 httpsrajpurkargithubioSQuAD-explorer
[6] Hu Minghao Yuxing Peng and Xipeng Qiu Reinforced mnemonic reader for machine comprehension CoRR abs170502798 (2017)
[7] Schalkoff Robert J Artificial neural networks Vol 1 New York McGraw-Hill 1997
[8] Me For The AI and Neetesh Mehrotra 2018 The Connect Between Deep Learning And AI - Open Source For You Open Source For You Accessed March 16 2018 httpopensourceforucom201801connect-deep-learning-ai
[9] Krizhevsky Alex Ilya Sutskever and Geoffrey E Hinton Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems pp 1097-1105 2012
[10] An Intuitive Explanation Of Convolutional Neural Networks 2016 The Data Science Blog Accessed March 16 2018 httpsujjwalkarnme20160811intuitive-explanation-convnets
[11] Williams Ronald J and David Zipser A learning algorithm for continually running fully recurrent neural networks Neural computation 1 no 2 (1989) 270-280
[12] Hochreiter Sepp and Juumlrgen Schmidhuber Long short-term memory Neural computation 9 no 8 (1997) 1735-1780
[13] Chung Junyoung Caglar Gulcehre KyungHyun Cho and Yoshua Bengio Empirical evaluation of gated recurrent neural networks on sequence modeling arXiv preprint arXiv14123555 (2014)
40
[14] Mikolov Tomas Wen-tau Yih and Geoffrey Zweig Linguistic regularities in continuous space word representations In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics Human Language Technologies pp 746-751 2013
[15] Pennington Jeffrey Richard Socher and Christopher Manning Glove Global vectors for word representation In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) pp 1532-1543 2014
[16] Word2vec Tutorial - The Skip-Gram Model middot Chris Mccormick 2016 MccormickmlCom Accessed March 16 2018 httpmccormickmlcom20160419word2vec-tutorial-the-skip-gram-model
[17] Group 2015 [Paper Introduction] Bilingual Word Representations With Monolingual hellip SlideshareNet Accessed March 16 2018 httpswwwslidesharenetnaist-mtstudybilingual-word-representations-with-monolingual-quality-in-mind
[18] Attention Mechanism 2016 BlogHeuritechCom Accessed March 16 2018 httpsblogheuritechcom20160120attention-mechanism
[19] Weston Jason Sumit Chopra and Antoine Bordes 2014 Memory Networks arXiv preprint arXiv14103916v11 Accessed March 16 2018 httpsarxivorgabs14103916
[20] Wang Shuohang and Jing Jiang Machine comprehension using match-lstm and answer pointer arXiv preprint arXiv160807905 (2016)
[21] Wang Shuohang and Jing Jiang Learning natural language inference with LSTM arXiv preprint arXiv151208849 (2015)
[22] Vinyals Oriol Meire Fortunato and Navdeep Jaitly Pointer networks In Advances in Neural Information Processing Systems pp 2692-2700 2015
[23] Wang Wenhui Nan Yang Furu Wei Baobao Chang and Ming Zhou Gated self-matching networks for reading comprehension and question answering In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1 Long Papers) vol 1 pp 189-198 2017
[24] Seo Minjoon Aniruddha Kembhavi Ali Farhadi and Hannaneh Hajishirzi Bidirectional attention flow for machine comprehension arXiv preprint arXiv161101603 (2016)
[25] Clark Christopher and Matt Gardner Simple and effective multi-paragraph reading comprehension arXiv preprint arXiv171010723 (2017)
41
[26] 8 Analyzing Sentence Structure 2018 NltkOrg Accessed March 16 2018 httpwwwnltkorgbookch08html
42
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata India and had a strong interest for
computers since an early age After his high school he did his BSc in computer
science from Ramakrishna Mission Residential College Narendrapur followed by MSc
in computer science from St Xavierrsquos College Kolkata He had a strong intuition and
interest for human like learning systems and wanted to work in this area He started
working at TCS Innovation Labs Pune for the application of Natural Language
Processing in Educational Applications As he was deeply passionate about learning
system that mimic the human brain and learn like a human child does he was
increasing interested about Deep Learning and its applications After working for a year
he went on to pursue a Master of Science degree in computer science from the
University of Florida Gainesville His academic interests have been focused on Deep
Learning and Natural Language Processing and he has been working on Machine
Reading Comprehension since summer of 2017
- ACKNOWLEDGMENTS
- LIST OF FIGURES
- LIST OF ABBREVIATIONS
- INTRODUCTION
- THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
-
- Neural Networks
- Convolutional Neural Network
- Recurrent Neural Networks (RNN)
- Word Embedding
- Attention Mechanism
- Memory Networks
-
- LITERATURE REVIEW AND STATE OF THE ART
-
- Machine Comprehension Using Match-LSTM and Answer Pointer
- R-NET Matching Reading Comprehension with Self-Matching Networks
- Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
- Summary
-
- MULTI-ATTENTION QUESTION ANSWERING
-
- Multi-Attention BiDAF Model
- Chatbot Design Using a QA System
- Online QA System and Attention Visualization
-
- FUTURE DIRECTIONS AND CONCLUSION
-
- Future Directions
- Conclusion
-
- LIST OF REFERENCES
- BIOGRAPHICAL SKETCH
-
Figure 3-3 Bi-Directional Attention Flow Model Architecture [24]
Their machine comprehension model is a hierarchical multi-stage process consisting of six layers:
1. Character Embedding Layer: maps each word to a vector space using character-level CNNs.
2. Word Embedding Layer: maps each word to a vector space using a pre-trained word embedding model.
3. Contextual Embedding Layer: utilizes contextual cues from surrounding words to refine the word embeddings. These first three layers are applied to both the query and the context. An LSTM is run on top of the embeddings provided by the previous layers to model the temporal interactions between words; LSTMs are placed in both directions and their outputs concatenated.
4. Attention Flow Layer: couples the query and context vectors and produces a set of query-aware feature vectors for each word in the context.
5. Modeling Layer: employs a Recurrent Neural Network to scan the context. The input to the modeling layer is G, which encodes the query-aware representations of the context words; the output captures the interaction among the context words conditioned on the query. They use two layers of bi-directional LSTM with an output size of d for each direction, and the resulting matrix is passed on to the output layer to predict the answer.
6. Output Layer: provides an answer to the query. They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices under the predicted distributions, averaged over all examples [25].
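The loss in layer 6 can be written out concretely. The following is a minimal NumPy sketch, not the authors' implementation; `p_start` and `p_end` stand for the predicted start/end distributions:

```python
import numpy as np

def qa_loss(p_start, p_end, true_start, true_end):
    """Sum of negative log probabilities of the true start/end
    indices, averaged over all examples in the batch."""
    idx = np.arange(len(true_start))
    nll = -(np.log(p_start[idx, true_start]) +
            np.log(p_end[idx, true_end]))
    return nll.mean()

# Two toy examples over a context of length 4
p_start = np.array([[0.7, 0.1, 0.1, 0.1],
                    [0.25, 0.25, 0.25, 0.25]])
p_end = np.array([[0.1, 0.7, 0.1, 0.1],
                  [0.25, 0.25, 0.25, 0.25]])
loss = qa_loss(p_start, p_end, np.array([0, 1]), np.array([1, 2]))
```

A confident, correct prediction (first example) contributes a small loss; a uniform prediction (second example) contributes a large one.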
In a further variation of their work above, they add a self-attention layer after the bi-attention layer to further improve the results. The architecture of this model is shown in the accompanying figure.
Figure 3-3 The task of Question Answering [25]
Summary
In this chapter we reviewed the methods that are fundamental to the state of the art in Machine Comprehension and the task of Question Answering. We have come close to human-level accuracy, and this is due to incremental developments over previous models. As we saw, Out-of-Vocabulary (OOV) tokens were handled using character embeddings, long-term dependencies within the context passage were addressed using self-attention, and many other techniques such as contextualized vectors and Attention Flow were employed to get better results. In the next chapter we will see how we can build on these models and develop further.
CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING
Although the advancements in Question Answering systems since the release of the SQuAD dataset have been impressive and the results are getting close to human-level accuracy, these are far from fool-proof systems. The models still make mistakes that would be obvious to a human. For example:
Passage: The Panthers used the San Jose State practice facility and stayed at the San Jose Marriott. The Broncos practiced at Florida State Facility and stayed at the Santa Clara Marriott.
Question: At what university's facility did the Panthers practice?
Actual Answer: San Jose State
Predicted Answer: Florida State Facility
To find out what was leading to the wrong predictions, we wanted to see the attention weights associated with such an example. We plotted the passage-question heat map, a 2D matrix where the intensity of each cell signifies the similarity between a passage word and a question word. For the above example, we found that while certain words of the question are given high weight, other parts are not. The words 'At', 'facility', and 'practice' receive high attention, but 'Panthers' does not. If it had received high attention, the system would have predicted 'San Jose State' as the right answer. To solve this issue, we analyzed the base BiDAF model and proposed adding two things:
1. Bi-Attention and Self-Attention over the Query
2. A second level of attention over the outputs of (Bi-Attention + Self-Attention) from both the Context and the Query
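The heat map described above can be reproduced with a short script. This is a minimal sketch, assuming dot-product similarity over hypothetical word vectors rather than the model's actual attention logits:

```python
import numpy as np

def similarity_heatmap(passage_vecs, question_vecs):
    """Return an (n_passage, n_question) matrix where cell (i, j)
    is the dot-product similarity between passage word i and
    question word j -- the quantity visualized in the heat map."""
    return passage_vecs @ question_vecs.T

# Toy 4-word passage and 3-word question with random embeddings
rng = np.random.default_rng(0)
passage_vecs = rng.normal(size=(4, 8))
question_vecs = rng.normal(size=(3, 8))
heat = similarity_heatmap(passage_vecs, question_vecs)
# The matrix can then be rendered, e.g. with matplotlib's plt.imshow(heat)
```

Inspecting which question-word columns carry high values reveals exactly the failure mode described above, e.g. a 'Panthers' column that stays uniformly dim.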
Multi-Attention BiDAF Model
Our comprehension model is a hierarchical multi-stage process and consists of the following layers:
1. Embedding: As in other models, we embed words using pretrained word vectors. We also embed the characters of each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to obtain character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training.
2. Pre-Process: A shared bi-directional GRU (Cho et al. 2014) is used to map the question and passage embeddings to context-aware embeddings.
3. Context Attention: The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al. 2016) is used to build a query-aware context representation. Let h_i be the vector for context word i, q_j the vector for question word j, and n_q and n_c the lengths of the question and context respectively. We compute the attention between context word i and question word j as

a_ij = w1 · h_i + w2 · q_j + w3 · (h_i ⊙ q_j)    (4-1)

where w1, w2, and w3 are learned vectors and ⊙ is element-wise multiplication. We then compute an attended vector c_i for each context token as

p_ij = exp(a_ij) / Σ_{j'=1}^{n_q} exp(a_ij'),    c_i = Σ_{j=1}^{n_q} p_ij q_j    (4-2)

We also compute a query-to-context vector q_c:

m_i = max_{1≤j≤n_q} a_ij,    p_i = exp(m_i) / Σ_{i'=1}^{n_c} exp(m_i'),    q_c = Σ_{i=1}^{n_c} p_i h_i    (4-3)
The final vector computed for each token is built by concatenating h_i, c_i, h_i ⊙ c_i, and q_c ⊙ c_i. In our model we subsequently pass the result through a linear layer with ReLU activations.
4. Context Self-Attention: Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU, and the same attention mechanism is then applied between the passage and itself. In this case we do not use query-to-context attention, and we set a_ij = −∞ if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so its output is additionally summed with its input.
5. Query Attention: This is done the same way as context attention, but we calculate the weighted sum of the context words for each query word, so the output length is the number of query words. We then calculate context-to-query attention analogously to the query-to-context attention of the context attention layer.
Figure 4-1 The modified BiDAF model with multilevel attention
6. Query Self-Attention: This is done the same way as the context self-attention layer, but on the output of the Query Attention layer.
7. Context-Query Bi-Attention + Self-Attention: The outputs of the Context Self-Attention and Query Self-Attention layers are taken as input, and the same bi-attention and self-attention process is applied to these inputs.
8. Prediction: In the last layer of our model, a bidirectional GRU is applied, followed by a linear layer that computes answer-start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer-end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log-likelihood of selecting the correct start and end tokens.
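The attention arithmetic of Equations 4-1 through 4-3 can be illustrated with a small NumPy sketch. Toy dimensions are used, and randomly initialized vectors stand in for the learned w1, w2, w3; this illustrates the computation, not the trained model:

```python
import numpy as np

def bi_attention(H, Q, w1, w2, w3):
    """BiDAF-style bi-directional attention (Eqs. 4-1 to 4-3).
    H: (n_c, d) context vectors; Q: (n_q, d) question vectors."""
    # Eq. 4-1: a_ij = w1.h_i + w2.q_j + w3.(h_i * q_j)
    a = (H @ w1)[:, None] + (Q @ w2)[None, :] + (H * w3) @ Q.T
    # Eq. 4-2: softmax over question words, then attended vector c_i
    p = np.exp(a - a.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    C = p @ Q                                  # (n_c, d)
    # Eq. 4-3: query-to-context vector q_c
    m = a.max(axis=1)                          # m_i = max_j a_ij
    pm = np.exp(m - m.max()); pm /= pm.sum()
    q_c = pm @ H                               # (d,)
    # Final per-token vector: [h_i; c_i; h_i * c_i; q_c * c_i]
    q_c_tiled = np.broadcast_to(q_c, C.shape)
    return np.concatenate([H, C, H * C, q_c_tiled * C], axis=1)

rng = np.random.default_rng(1)
d, n_c, n_q = 8, 5, 3
H, Q = rng.normal(size=(n_c, d)), rng.normal(size=(n_q, d))
w1, w2, w3 = (rng.normal(size=d) for _ in range(3))
G = bi_attention(H, Q, w1, w2, w3)  # (n_c, 4*d) query-aware representation
```

G here plays the role of the query-aware context representation fed to the subsequent self-attention and prediction layers.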
Having carried out this modification, we were able to solve the wrong example we started with: the multilevel attention model gives the correct output, 'San Jose State'. We also achieved slightly better scores than the original model, with an F1 score of 85.44 on the SQuAD dev dataset.
Chatbot Design Using a QA System
Designing a chatbot that passes the Turing test is a fundamental goal of Artificial Intelligence. Although we are many orders of magnitude away from achieving such a goal, domain-specific tasks can be solved with chatbots built from current technology. We started with a similar objective in mind, i.e., to design a domain-specific chatbot and then generalize it to other areas once the first domain-specific objective is achieved robustly. This led us to the fundamental problem of Machine Comprehension and subsequently to the task of Question Answering. Having achieved some degree of success with QA systems, we looked back to see whether we could apply our newly acquired knowledge to the task of designing chatbots.
The chatbots made with today's technologies mostly use handcrafted techniques such as template matching, which requires anticipating all possible ways a user may articulate his requirements and a conversation may unfold. This requires many man-hours of design for a domain-specific system and is still very error-prone. In this section we propose a general chatbot design that would make designing a domain-specific chatbot easy and robust at the same time.
Every domain-specific chatbot needs to obtain a set of information from the user and show results based on the user-specific information obtained. Traditional chatbots use template matching and keyword lookup to determine whether the user has provided the required information. Our idea is to use the Question Answering system in the backend to extract the required information from whatever the user has typed up to this point in the conversation. The information to be extracted can be posed as a set of questions, and the answers obtained from those questions can be used as the parameters to supply the relevant information to the user.
We chose flight reservation as our chatbot domain. Our goal was to extract the required information from the user so as to show the available flights as per the user's requirements.
Figure 4-2 Flight reservation chatbotrsquos chat window
For a flight reservation task, the booking agent needs to know at minimum the origin city, destination city, and date of travel to be able to show the available flights. Optional information includes the number of tickets, the passenger's name, one-way or round trip, etc.
The minimalistic conversation with the user through the chat window would be as shown above. We had a platform called OneTask on which we wanted to implement our chatbot; the chat interface within the OneTask system looks as follows.
Figure 4-3 Chatbot within OneTask system
The working of the chat system is as follows:
1. Initiation: The user opens the chat window, which starts a session with the chatbot. The chatbot asks 'How may I help you?'
2. User Reply: The user may reply with none of the required information for flight booking, or may provide multiple pieces of information in the same message.
3. User Reply Parsing: The conversation up to this point is treated as a passage, and the internal questions are run on this passage. The questions that are run are:
Where do you want to go?
From where do you want to leave?
When do you want to depart?
4. Parsed Responses from the QA Model: After running the questions on the QA system with the given conversation, answers are obtained for all of them along with their corresponding confidence values. Since answers will be returned even if the required question was not answered up to this point in the conversation, the confidence values play a crucial role in determining the validity of an answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly; a confidence of 2-10 signifies that it may have been answered, but the system should verify with the user for correctness; and any answer with confidence below 2 is discarded.
5. Asking Remaining Questions Iteratively: After the parsing, it is checked whether any of the required questions are still unanswered. If so, the chatbot asks the remaining questions, and the process from steps 3-4 is carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request.
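The validation thresholds and the iterative loop in steps 4 and 5 can be sketched in Python. This is a hedged illustration: the thresholds and questions come from the text above, but `qa_model`, `validate`, and `collect_answers` are hypothetical names, and the (answer, confidence) return format is assumed:

```python
REQUIRED_QUESTIONS = [
    "Where do you want to go?",
    "From where do you want to leave?",
    "When do you want to depart?",
]

def validate(confidence):
    """Map a QA confidence score to an action (thresholds from the text)."""
    if confidence > 10:
        return "accept"   # answered correctly
    if confidence >= 2:
        return "verify"   # may be answered; confirm with the user
    return "discard"      # treat as unanswered

def collect_answers(conversation, qa_model):
    """Run each required question against the conversation so far and
    keep only answers whose confidence passes validation."""
    answers = {}
    for q in REQUIRED_QUESTIONS:
        answer, confidence = qa_model(passage=conversation, question=q)
        if validate(confidence) == "accept":
            answers[q] = answer
    return answers

# Step 5: any question missing from `answers` is asked next, and the
# conversation (with the user's reply appended) is parsed again.
```

The loop terminates once every required question has an accepted answer, at which point the flight options can be shown.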
Figure 4-4 The Flow diagram of the Flight booking Chatbot system
Online QA System and Attention Visualization
To be able to test out various examples, we set up the BiDAF [24] model as an online demo. One can either choose from the available examples in the drop-down menu or paste in one's own passage and questions. While this is a useful and interesting way to test the model in a user-friendly manner, we created this system primarily to be able to focus on the wrong samples.
Figure 4-5 QA system interface with attention highlight over candidate answers
The candidate answers are shown with blue highlights according to their confidence values: the higher the confidence, the darker the highlight, and the candidate with the highest confidence value is chosen as the predicted answer. We developed the system to show the attention spread over the candidate answers in order to understand what needed to be done to improve the system. This led us to realize the importance of including the query-attention component, as well as multilevel attention, in the BiDAF model, as described in the first section of this chapter.
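The confidence-to-shade mapping used for the highlights can be sketched as a simple normalization. This is a hypothetical helper; the actual web interface may scale its highlight colors differently:

```python
def highlight_alphas(confidences):
    """Normalize candidate-answer confidences to [0, 1] opacities,
    so that higher confidence yields a darker blue highlight."""
    top = max(confidences)
    if top == 0:
        return [0.0] * len(confidences)
    return [c / top for c in confidences]

alphas = highlight_alphas([12.0, 6.0, 3.0])
# The candidate with alpha 1.0 (highest confidence) is the predicted answer.
```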
CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions
Having done a thorough analysis of the current methods and state of the art in Machine Comprehension and the QA task, and having developed systems on them, we have gained a strong sense of what needs to be done to further improve QA models. After observing the wrong samples, we can see that the system is still unable to encode meaning and is picking answers based on statistical patterns of answer occurrence in the training examples.
From the state-of-the-art models and the ongoing research literature, it is easy to conclude that more features need to be embedded to encode the meaning of words, phrases, and sentences. The paper Reinforced Mnemonic Reader for Machine Comprehension [6] encoded POS and NER tags of words along with their word and character embeddings, which gave better results.
Figure 5-1 An English language semantic parse tree [26]
We have developed a method to encode the syntax parse tree of a sentence that encodes not only the POS tags but also the relation of each word within its phrase and the relation of each phrase within the whole sentence, in a hierarchical manner.
Finally, data augmentation is another way to get better results. One definite way to reduce the errors would be to include in the training data samples similar to those on which the system falters in the dev set: one could generate examples similar to the failure cases and include them in the training set for better prediction. Another approach would be to train on similar and larger datasets. Our models were trained on the SQuAD dataset, but other datasets address a similar question answering task, such as TriviaQA. We could augment the SQuAD training set with that of TriviaQA to obtain a more robust system that generalizes better and thus has higher accuracy in predicting answer spans.
Conclusion
In this work we explored the most fundamental techniques that have shaped the current state of the art. We then proposed a minor architectural improvement over an existing model. Furthermore, we developed two applications that use the base model: first, we showed how a chatbot application can be built using the QA system, and second, we created a web interface where the model can be run on any passage and question. This interface also shows the attention spread over the candidate answers. While our effort to push the state of the art forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will pay high dividends for society at large.
LIST OF REFERENCES
[1] Chen, Danqi. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U
[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.
[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension." arXiv preprint arXiv:1705.03551 (2017).
[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ Questions for Machine Comprehension of Text." arXiv preprint arXiv:1606.05250 (2016).
[5] "The Stanford Question Answering Dataset." 2018. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/
[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced Mnemonic Reader for Machine Comprehension." CoRR abs/1705.02798 (2017).
[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.
[8] Mehrotra, Neetesh. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai
[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.
[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
[11] Williams, Ronald J., and David Zipser. "A Learning Algorithm for Continually Running Fully Recurrent Neural Networks." Neural Computation 1, no. 2 (1989): 270-280.
[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long Short-Term Memory." Neural Computation 9, no. 8 (1997): 1735-1780.
[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling." arXiv preprint arXiv:1412.3555 (2014).
[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic Regularities in Continuous Space Word Representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.
[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global Vectors for Word Representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.
[16] "Word2Vec Tutorial - The Skip-Gram Model." 2016. Chris McCormick. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
[17] "[Paper Introduction] Bilingual Word Representations with Monolingual Quality in Mind." 2015. SlideShare. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind
[18] "Attention Mechanism." 2016. blog.heuritech.com. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism/
[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. 2014. "Memory Networks." arXiv preprint arXiv:1410.3916. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916
[20] Wang, Shuohang, and Jing Jiang. "Machine Comprehension Using Match-LSTM and Answer Pointer." arXiv preprint arXiv:1608.07905 (2016).
[21] Wang, Shuohang, and Jing Jiang. "Learning Natural Language Inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).
[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer Networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.
[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated Self-Matching Networks for Reading Comprehension and Question Answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.
[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional Attention Flow for Machine Comprehension." arXiv preprint arXiv:1611.01603 (2016).
[25] Clark, Christopher, and Matt Gardner. "Simple and Effective Multi-Paragraph Reading Comprehension." arXiv preprint arXiv:1710.10723 (2017).
[26] "8. Analyzing Sentence Structure." 2018. nltk.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata, India, and has had a strong interest in computers from an early age. After high school, he earned his B.Sc. in computer science from Ramakrishna Mission Residential College, Narendrapur, followed by an M.Sc. in computer science from St. Xavier's College, Kolkata. He had a strong intuition and interest for human-like learning systems and wanted to work in this area. He started working at TCS Innovation Labs, Pune, on applications of Natural Language Processing in education. As he was deeply passionate about learning systems that mimic the human brain and learn as a human child does, he became increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science at the University of Florida, Gainesville. His academic interests have been focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.
27
with the output size of d for each direction Hence a matrix is obtained which is passed onto the output layer to predict the answer
6 Output Layer provides an answer to the query They define the training loss (to be minimized) as the sum of the negative log probabilities of the true start and end indices by the predicted distributions averaged over all examples [25]
In a further variation of their above work they add a self-attention layer after the
Bi-attention layer to further improve the results The architecture of the model is as
Figure 3-3 The task of Question Answering [25]
28
Summary
In this chapter we reviewed the methods that are fundamental to the state of the
art in Machine Comprehension and for the task of Question Answering We have
reached closed to human level accuracy and this is due to incremental developments
over previous models As we saw Out of Vocabulary (OOV) tokens were handled by
using Character embedding Long term dependency within context passage were
solved using self-attention and many other techniques such as Contextualized vectors
Attention Flow etc were employed to get better results In the next chapter we will see
how we can build on these models and develop further
29
CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING
Although the advancements in Question Answering systems since the release of
the SQuAD dataset has been impressive and the results are getting close to human
level accuracy it is far from being a fool-proof system The models still make mistakes
which would be obvious to a human For example
Passage The Panthers used the San Jose State practice facility and stayed at
the San Jose Marriott The Broncos practiced at Florida State Facility and stayed at the
Santa Clara Marriott
Question At what universitys facility did the Panthers practice
Actual Answer San Jose State
Predicted Answer Florida State Facility
To find out what is leading to the wrong predictions, we wanted to see the attention weights associated with such an example. We plotted the passage and question heat map, which is a 2D matrix where the intensity of each cell signifies the similarity between a passage word and a question word. For the above example we found that while certain words of the question are given high weightage, other parts are not. The words 'At', 'facility', and 'practice' are given high attention, but 'Panthers' does not receive high attention. If it had received high attention, the system would have predicted 'San Jose State' as the right answer. To solve this issue, we analyzed the base BiDAF model and proposed adding two things:
1. Bi-Attention and Self-Attention over Query
2. Second-level attention over the output of (Bi-Attention + Self-Attention) from both Context and Query
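The kind of passage-question heat map described above can be produced from any word-pair similarity score. The following is a minimal NumPy sketch with random stand-in vectors and a plain dot product as the similarity; the real system would use the model's learned attention scores instead:

```python
import numpy as np

rng = np.random.default_rng(4)
passage = ["The", "Panthers", "used", "the", "San", "Jose", "State", "facility"]
question = ["At", "what", "facility", "did", "the", "Panthers", "practice"]

d = 16                                    # embedding size (illustrative)
P = rng.normal(size=(len(passage), d))    # stand-in passage word vectors
Q = rng.normal(size=(len(question), d))   # stand-in question word vectors

# 2D heat-map matrix: cell (i, j) holds the similarity between passage
# word i and question word j (dot product as a simple stand-in)
heat = P @ Q.T
print(heat.shape)                         # (8, 7)
```

Plotting this matrix with cell intensities proportional to `heat[i, j]` gives exactly the heat map used to diagnose the 'Panthers' example.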
Multi-Attention BiDAF Model
Our comprehension model is a hierarchical multi-stage process and consists of the following layers:
1. Embedding: As in other models, we embed words using pretrained word vectors. We also embed the characters in each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to get character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training.
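As a concrete illustration of this layer, here is a minimal NumPy sketch of the character-CNN path. The character-embedding size of 20 comes from the text; the character-vocabulary size, word-vector size, filter count, and window width are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

char_dim, word_dim = 20, 100   # char embedding size 20, as stated in the text
n_filters, width = 50, 5       # CNN filter count and window width (assumed)

def char_cnn_embedding(char_ids, char_emb, filters):
    """Embed a word's characters, convolve over positions, then max-pool."""
    x = char_emb[char_ids]                                          # (n_chars, char_dim)
    n = len(char_ids) - width + 1
    windows = np.stack([x[i:i + width].ravel() for i in range(n)])  # (n, width*char_dim)
    conv = np.maximum(windows @ filters, 0.0)                       # ReLU conv outputs
    return conv.max(axis=0)                                         # max-pool -> (n_filters,)

char_emb = rng.normal(size=(70, char_dim))            # learned character table
filters = rng.normal(size=(width * char_dim, n_filters))
word_vec = rng.normal(size=(word_dim,))               # frozen pretrained word vector

char_ids = np.array([3, 7, 1, 9, 4, 2, 8])            # character ids of one word
token_vec = np.concatenate([word_vec,
                            char_cnn_embedding(char_ids, char_emb, filters)])
print(token_vec.shape)                                # (150,)
```

The concatenated 150-dimensional vector (100 word + 50 character dimensions here) is what gets passed to the shared GRU in the next layer.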
2. Pre-Process: A shared bi-directional GRU (Cho et al., 2014) is used to map the question and passage embeddings to context-aware embeddings.
3. Context Attention: The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al., 2016) is used to build a query-aware context representation. Let h_i be the vector for context word i, q_j be the vector for question word j, and n_q and n_c be the lengths of the question and context respectively. We compute attention between context word i and question word j as
a_ij = w1 · h_i + w2 · q_j + w3 · (h_i ⊙ q_j)    (4-1)
where w1, w2, and w3 are learned vectors and ⊙ is element-wise multiplication. We then compute an attended vector c_i for each context token as
p_ij = exp(a_ij) / Σ_{j=1..n_q} exp(a_ij),    c_i = Σ_{j=1..n_q} p_ij q_j    (4-2)
We also compute a query-to-context vector q_c:
m_i = max_{1≤j≤n_q} a_ij,    p_i = exp(m_i) / Σ_{i=1..n_c} exp(m_i),    q_c = Σ_{i=1..n_c} p_i h_i    (4-3)
The final vector computed for each token is built by concatenating h_i, c_i, h_i ⊙ c_i, and q_c ⊙ h_i. In our model we subsequently pass the result through a linear layer with ReLU activations.
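Equations 4-1 through 4-3 and the final concatenation can be sketched in NumPy as follows; the dimensions and vectors are random illustrative stand-ins, and the trailing ReLU linear layer is omitted:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_c, n_q = 8, 6, 4                 # hidden size, context/question lengths (assumed)
H = rng.normal(size=(n_c, d))         # context vectors h_i
Q = rng.normal(size=(n_q, d))         # question vectors q_j
w1, w2, w3 = rng.normal(size=(3, d))  # learned vectors of Equation 4-1

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Equation 4-1: a_ij = w1·h_i + w2·q_j + w3·(h_i ⊙ q_j), all pairs at once
A = H @ w1[:, None] + (Q @ w2)[None, :] + (H * w3) @ Q.T   # (n_c, n_q)

# Equation 4-2: attended vector c_i for each context token
C = softmax(A, axis=1) @ Q                                 # (n_c, d)

# Equation 4-3: single query-to-context vector q_c
q_c = softmax(A.max(axis=1), axis=0) @ H                   # (d,)

# Final per-token vector: [h_i; c_i; h_i ⊙ c_i; q_c ⊙ h_i]
G = np.concatenate([H, C, H * C, q_c[None, :] * H], axis=1)
print(G.shape)                                             # (6, 32)
```

Each context token ends up with a 4d-dimensional query-aware representation, which the model then projects back down through the ReLU linear layer.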
4. Context Self-Attention: Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU. Then we apply the same attention mechanism, only now between the passage and itself. In this case we do not use query-to-context attention, and we set a_ij = -inf if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so its output is additionally summed with the input.
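A minimal sketch of the diagonal masking and residual sum described in this layer, using a plain dot product as a stand-in for the learned attention score of Equation 4-1:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 5, 8                                  # passage length and hidden size (assumed)
P = rng.normal(size=(n, d))                  # stand-in for the second GRU's outputs

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

A = P @ P.T                                  # passage-to-passage similarity (stand-in)
np.fill_diagonal(A, -np.inf)                 # a_ij = -inf when i == j: no self-matching
S = softmax(A, axis=1) @ P                   # attended passage vectors
out = P + S                                  # residual connection: output summed with input

# after masking, each token attends to every position except itself
assert np.allclose(softmax(A, axis=1).diagonal(), 0.0)
```

Setting the diagonal to -inf before the softmax zeroes a token's attention to itself, which forces the layer to gather long-range evidence from the rest of the passage.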
5. Query Attention: Here we proceed as in the context attention layer, but calculate the weighted sum of the context words for each query word, so the output length equals the number of query words. We then calculate context-to-query attention analogously to the query-to-context attention of the context attention layer.
Figure 4-1 The modified BiDAF model with multilevel attention
6. Query Self-Attention: This is done the same way as the context self-attention layer, but on the output of the Query Attention layer.
7. Context Query Bi-Attention + Self-Attention: The outputs of the Context Self-Attention and Query Self-Attention layers are taken as input, and the same process of Bi-Attention and Self-Attention is applied to these inputs.
8. Prediction: In the last layer of our model, a bidirectional GRU is applied, followed by a linear layer that computes answer start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log likelihood of selecting correct start and end tokens.
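The span scoring and loss of this layer can be sketched as follows; random matrices stand in for the hidden states of the two bidirectional GRUs, and the gold span indices are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 10, 16                                # token count and hidden size (assumed)
M_start = rng.normal(size=(n, d))            # stand-in for the first GRU's hidden states
M_end = rng.normal(size=(n, d))              # stand-in for the second GRU's hidden states
w_start = rng.normal(size=(d,))              # linear layer for start scores
w_end = rng.normal(size=(d,))                # linear layer for end scores

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

p_start = softmax(M_start @ w_start)         # start probability per token
p_end = softmax(M_end @ w_end)               # end probability per token

true_start, true_end = 2, 4                  # gold answer span (illustrative)
nll = -np.log(p_start[true_start]) - np.log(p_end[true_end])
```

Training minimizes `nll` over the dataset; at prediction time, the span maximizing `p_start[i] * p_end[j]` with i ≤ j is the usual choice of answer.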
Having carried out this modification, we were able to solve the wrong example we started with: the multilevel attention model gives the correct output, "San Jose State". We also achieved slightly better scores than the original model, with an F1 score of 85.44 on the SQuAD dev set.
Chatbot Design Using a QA System
Designing a perfect chatbot that passes the Turing test is a fundamental goal of Artificial Intelligence. Although we are many orders of magnitude away from achieving such a goal, domain-specific tasks can be solved with chatbots made from current technology. We started with a similar objective in mind, i.e., to design a domain-specific chatbot and then generalize to other areas once it achieves the first domain-specific objective robustly. This led us to the fundamental problem of Machine Comprehension and subsequently to the task of Question Answering. Having achieved some degree of success with QA systems, we looked back to see whether we could apply our newly acquired knowledge to the task of designing chatbots.
The chatbots made with today's technologies mostly rely on handcrafted techniques such as template matching, which requires anticipating all possible ways a user may articulate his requirements and a conversation may unfold. This takes many man-hours for designing a domain-specific system and is still very error prone. In this section we propose a general chatbot design that would make designing a domain-specific chatbot easy and robust at the same time.
Every domain-specific chatbot needs to obtain a set of information from the user and show some results based on the user-specific information obtained. Traditional chatbots use template matching and keyword lookup to determine whether the user has provided the required information. Our idea is to use the Question Answering system in the backend to extract the required information from whatever the user has typed up to this point of the conversation. The information to be extracted can be posed as a set of questions, and the answers obtained from those questions can be used as the parameters to supply the relevant information to the user.
We chose flight reservation as our chatbot domain. Our goal was to extract the required information from the user to be able to show the available flights as per the user's requirements.
Figure 4-2 Flight reservation chatbot's chat window
For a flight reservation task, the booking agent needs to know at minimum the origin city, destination city, and date of travel to be able to show the available flights. Optional information includes the number of tickets, the passenger's name, one way or round trip, etc.
The minimalistic conversation with the user through the chat window would be as shown above. We had a platform called OneTask on which we wanted to implement our chatbot. The chat interface within the OneTask system looks as follows.
Figure 4-3 Chatbot within OneTask system
The working of the chat system is as follows:
1. Initiation: The user opens the chat window, which starts a session with the chatbot. The chatbot asks 'How may I help you?'
2. User Reply: The user may reply with none of the required information for flight booking, or may reply with several pieces of information in the same message.
3. User Reply Parsing: The conversation up to this point is treated as a passage, and the internal questions are run on this passage. The questions that are run are:
Where do you want to go?
From where do you want to leave?
When do you want to depart?
4. Parsed Responses from the QA Model: After running the questions on the QA system with the given conversation, answers are obtained for all of them, along with their corresponding confidence values. Since answers will be obtained even if the required question was not answered up to this point in the conversation, the confidence values play a crucial role in determining the validity of each answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly; a confidence of 2–10 signifies that it may have been answered, but the answer should be verified with the user; and any answer with confidence below 2 is discarded.
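The validation ranges above reduce to a simple decision rule; the function name is ours, and the thresholds are the ones stated in the text:

```python
def classify_answer(confidence):
    """Map a QA confidence score to an action, using the stated thresholds."""
    if confidence > 10:
        return "accept"    # treat the slot as correctly answered
    if confidence >= 2:
        return "verify"    # echo the answer back to the user for confirmation
    return "discard"       # ignore; ask the question explicitly later

print(classify_answer(12.5))  # accept
print(classify_answer(5))     # verify
print(classify_answer(0.7))   # discard
```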
5. Asking Remaining Questions Iteratively: After the parsing, it is checked whether any of the required questions are still unanswered. If so, the chatbot asks a remaining question, and the process from steps 3–4 is carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request.
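The loop of steps 3–5 can be sketched as follows; `qa_model` is a hypothetical stand-in for the backend QA system (it takes a passage and a question and returns an answer with a confidence), and only the three stated required questions are included:

```python
REQUIRED = {
    "destination": "Where do you want to go?",
    "origin": "From where do you want to leave?",
    "date": "When do you want to depart?",
}

def fill_slots(conversation, qa_model, slots=None):
    """Run each still-unanswered internal question over the conversation so far."""
    slots = dict(slots or {})
    for name, question in REQUIRED.items():
        if name in slots:
            continue                        # already extracted on an earlier turn
        answer, confidence = qa_model(conversation, question)
        if confidence > 10:                 # "accept" threshold from step 4
            slots[name] = answer
    missing = [q for n, q in REQUIRED.items() if n not in slots]
    return slots, missing                   # chatbot asks missing[0] next, if any
```

Each user turn appends to `conversation` and calls `fill_slots` again, so information volunteered early (e.g. origin and date in one message) is captured without re-asking.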
Figure 4-4 The Flow diagram of the Flight booking Chatbot system
Online QA System and Attention Visualization
To be able to test out various examples, we used the BiDAF [23] model for an online demo. One can either choose from the available examples in the drop-down menu or paste in their own passage and questions. While this is a useful and interesting system for testing the model in a user-friendly way, we created it primarily to be able to focus on the wrong samples.
Figure 4-5 QA system interface with attention highlight over candidate answers
The candidate answers are shown with blue highlights as per their confidence values: the higher the confidence, the darker the answer. The candidate with the highest confidence value is chosen as the predicted answer. We developed the system to show the attention spread of the candidate answers, to understand what needs to be done to improve the system. This led us to realize the importance of including the query attention part as well as multilevel attention in the BiDAF model, as described in the first section of this chapter.
CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions
Having done a thorough analysis of the current methods and state of the art in Machine Comprehension and the QA task, and having developed systems on it, we have achieved a strong sense of what needs to be done to further improve QA models. After observing the wrong samples, we can see that the system is still unable to encode meaning and is picking answers based on statistical patterns of occurrence of answers in the training examples.
From the state-of-the-art models and the ongoing research literature, it is easy to conclude that more features need to be embedded to encode the meaning of words, phrases, and sentences. The paper Reinforced Mnemonic Reader for Machine Comprehension [6] encoded POS and NER tags of words along with their word and character embeddings, which gave better results.
Figure 5-1 An English language semantic parse tree [26]
We have developed a method to encode the syntax parse tree of a sentence that encodes not only the POS tags but also the relation of the word within the phrase and the relation of the phrase within the whole sentence, in a hierarchical manner.
Finally, data augmentation is another way to get better results. One definite way to reduce the errors would be to include in the training data samples similar to those the system falters on in the dev set: one could generate examples resembling the failure cases and add them to the training set for better prediction. Another approach would be to train on similar and bigger datasets. Our models were trained on the SQuAD dataset, but other datasets address a similar question answering task, such as TriviaQA. We could augment the training set with TriviaQA along with SQuAD to have a more robust system that generalizes better and thus has higher accuracy in predicting answer spans.
Conclusion
In this work we have explored the most fundamental techniques that have shaped the current state of the art. We then proposed a minor architectural improvement over an existing model. Furthermore, we developed two applications that use the base model: first, we discussed how a chatbot application can be made using the QA system, and second, we created a web interface where the model can be used for any passage and question. This interface also shows the attention spread over the candidate answers. While our effort to push the state of the art forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will pay high dividends for society at large.
LIST OF REFERENCES
[1] Chen, Danqi. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U
[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.
[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension." arXiv preprint arXiv:1705.03551 (2017).
[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ questions for machine comprehension of text." arXiv preprint arXiv:1606.05250 (2016).
[5] "The Stanford Question Answering Dataset." 2018. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/
[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced mnemonic reader for machine comprehension." CoRR abs/1705.02798 (2017).
[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.
[8] Mehrotra, Neetesh. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai/
[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.
[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
[11] Williams, Ronald J., and David Zipser. "A learning algorithm for continually running fully recurrent neural networks." Neural Computation 1, no. 2 (1989): 270-280.
[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9, no. 8 (1997): 1735-1780.
[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv preprint arXiv:1412.3555 (2014).
[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic regularities in continuous space word representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.
[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.
[16] "Word2Vec Tutorial - The Skip-Gram Model." 2016. Chris McCormick. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
[17] "[Paper Introduction] Bilingual Word Representations with Monolingual Quality in Mind." 2015. SlideShare. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind
[18] "Attention Mechanism." 2016. Heuritech Blog. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism/
[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. 2014. "Memory Networks." arXiv preprint arXiv:1410.3916. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916
[20] Wang, Shuohang, and Jing Jiang. "Machine comprehension using Match-LSTM and answer pointer." arXiv preprint arXiv:1608.07905 (2016).
[21] Wang, Shuohang, and Jing Jiang. "Learning natural language inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).
[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.
[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated self-matching networks for reading comprehension and question answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.
[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional attention flow for machine comprehension." arXiv preprint arXiv:1611.01603 (2016).
[25] Clark, Christopher, and Matt Gardner. "Simple and effective multi-paragraph reading comprehension." arXiv preprint arXiv:1710.10723 (2017).
[26] "8. Analyzing Sentence Structure." 2018. NLTK.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata, India, and has had a strong interest in computers since an early age. After high school, he completed his B.Sc. in computer science at Ramakrishna Mission Residential College, Narendrapur, followed by an M.Sc. in computer science at St. Xavier's College, Kolkata. He had a strong intuition and interest for human-like learning systems and wanted to work in this area. He started working at TCS Innovation Labs, Pune, on applications of Natural Language Processing in education. Deeply passionate about learning systems that mimic the human brain and learn like a human child does, he became increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science at the University of Florida, Gainesville. His academic interests have been focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.
[18] Attention Mechanism 2016 BlogHeuritechCom Accessed March 16 2018 httpsblogheuritechcom20160120attention-mechanism
[19] Weston Jason Sumit Chopra and Antoine Bordes 2014 Memory Networks arXiv preprint arXiv14103916v11 Accessed March 16 2018 httpsarxivorgabs14103916
[20] Wang Shuohang and Jing Jiang Machine comprehension using match-lstm and answer pointer arXiv preprint arXiv160807905 (2016)
[21] Wang Shuohang and Jing Jiang Learning natural language inference with LSTM arXiv preprint arXiv151208849 (2015)
[22] Vinyals Oriol Meire Fortunato and Navdeep Jaitly Pointer networks In Advances in Neural Information Processing Systems pp 2692-2700 2015
[23] Wang Wenhui Nan Yang Furu Wei Baobao Chang and Ming Zhou Gated self-matching networks for reading comprehension and question answering In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1 Long Papers) vol 1 pp 189-198 2017
[24] Seo Minjoon Aniruddha Kembhavi Ali Farhadi and Hannaneh Hajishirzi Bidirectional attention flow for machine comprehension arXiv preprint arXiv161101603 (2016)
[25] Clark Christopher and Matt Gardner Simple and effective multi-paragraph reading comprehension arXiv preprint arXiv171010723 (2017)
41
[26] 8 Analyzing Sentence Structure 2018 NltkOrg Accessed March 16 2018 httpwwwnltkorgbookch08html
42
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata India and had a strong interest for
computers since an early age After his high school he did his BSc in computer
science from Ramakrishna Mission Residential College Narendrapur followed by MSc
in computer science from St Xavierrsquos College Kolkata He had a strong intuition and
interest for human like learning systems and wanted to work in this area He started
working at TCS Innovation Labs Pune for the application of Natural Language
Processing in Educational Applications As he was deeply passionate about learning
system that mimic the human brain and learn like a human child does he was
increasing interested about Deep Learning and its applications After working for a year
he went on to pursue a Master of Science degree in computer science from the
University of Florida Gainesville His academic interests have been focused on Deep
Learning and Natural Language Processing and he has been working on Machine
Reading Comprehension since summer of 2017
- ACKNOWLEDGMENTS
- LIST OF FIGURES
- LIST OF ABBREVIATIONS
- INTRODUCTION
- THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
-
- Neural Networks
- Convolutional Neural Network
- Recurrent Neural Networks (RNN)
- Word Embedding
- Attention Mechanism
- Memory Networks
-
- LITERATURE REVIEW AND STATE OF THE ART
-
- Machine Comprehension Using Match-LSTM and Answer Pointer
- R-NET Matching Reading Comprehension with Self-Matching Networks
- Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
- Summary
-
- MULTI-ATTENTION QUESTION ANSWERING
-
- Multi-Attention BiDAF Model
- Chatbot Design Using a QA System
- Online QA System and Attention Visualization
-
- FUTURE DIRECTIONS AND CONCLUSION
-
- Future Directions
- Conclusion
-
- LIST OF REFERENCES
- BIOGRAPHICAL SKETCH
-
29
CHAPTER 4 MULTI-ATTENTION QUESTION ANSWERING
Although the advancements in Question Answering systems since the release of
the SQuAD dataset have been impressive, and the results are getting close to human-level
accuracy, these systems are far from fool-proof. The models still make mistakes
that would be obvious to a human. For example:

Passage: The Panthers used the San Jose State practice facility and stayed at
the San Jose Marriott. The Broncos practiced at Florida State Facility and stayed at the
Santa Clara Marriott.

Question: At what university's facility did the Panthers practice?

Actual Answer: San Jose State

Predicted Answer: Florida State Facility
To find out what was leading to the wrong predictions, we wanted to see the
attention weights associated with such an example. We plotted the passage-question
heat map, a 2D matrix in which the intensity of each cell signifies the
similarity between a passage word and a question word. For the above example we
found that while certain words of the question are given high weight, other parts
are not. The words 'At', 'facility', and 'practice' are given high attention, but 'Panthers' does
not receive high attention. Had it received high attention, the system would have
predicted 'San Jose State' as the right answer. To solve this issue we analyzed the
base BiDAF model and proposed adding two things:

1. Bi-Attention and Self-Attention over the Query.

2. A second level of attention over the outputs of (Bi-Attention + Self-Attention) from both Context and Query.
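The heat map described above is just a matrix of word-pair similarities. A minimal NumPy sketch (using made-up toy embeddings, not the model's actual hidden states) shows how such a matrix can be computed; each cell is then one square of the heat map, which can be rendered with, e.g., matplotlib's `imshow`:

```python
import numpy as np

def similarity_matrix(passage_vecs, question_vecs):
    """Cosine similarity between every (passage word, question word) pair.
    Rows index passage words, columns index question words."""
    p = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    q = question_vecs / np.linalg.norm(question_vecs, axis=1, keepdims=True)
    return p @ q.T  # shape: (n_passage, n_question)

# Toy 4-dimensional "embeddings" for a 3-word passage and a 2-word question.
passage = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0],
                    [0.9, 0.1, 0.0, 0.0]])
question = np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0, 0.0]])
sim = similarity_matrix(passage, question)
# A dark row/column intersection marks a passage word attending strongly
# to a question word; e.g. plt.imshow(sim) would draw the heat map.
```

A question word like 'Panthers' failing to attend shows up as a uniformly light column in this matrix.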
Multi-Attention BiDAF Model

Our comprehension model is a hierarchical multi-stage process consisting of
the following layers:

1. Embedding: As in other models, we embed words using pretrained word vectors. We also embed the characters in each word into size-20 vectors, which are learned, and run a convolutional neural network followed by max-pooling to get character-derived embeddings for each word. The character-level and word-level embeddings are then concatenated and passed to the next layer. We do not update the word embeddings during training.

2. Pre-Process: A shared bi-directional GRU (Cho et al. 2014) is used to map the question and passage embeddings to context-aware embeddings.

3. Context Attention: The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al. 2016) is used to build a query-aware context representation. Let h_i be the vector for context word i, q_j the vector for question word j, and n_q and n_c the lengths of the question and context respectively. We compute attention between context word i and question word j as

a_ij = w1 · h_i + w2 · q_j + w3 · (h_i ⊙ q_j) (4-1)

where w1, w2, and w3 are learned vectors and ⊙ is element-wise multiplication. We then compute an attended vector c_i for each context token as

p_ij = exp(a_ij) / Σ_{k=1..n_q} exp(a_ik),  c_i = Σ_{j=1..n_q} p_ij q_j (4-2)

We also compute a query-to-context vector q_c from m_i = max_{1≤j≤n_q} a_ij:

p_i = exp(m_i) / Σ_{k=1..n_c} exp(m_k),  q_c = Σ_{i=1..n_c} p_i h_i (4-3)
The final vector computed for each token is built by concatenating h_i, c_i,
h_i ⊙ c_i, and q_c ⊙ c_i. In our model we subsequently pass the result through a linear layer with ReLU activations.
4. Context Self-Attention: Next we use a layer of residual self-attention. The input is passed through another bi-directional GRU. Then we apply the same attention mechanism, only now between the passage and itself; in this case we do not use query-to-context attention, and we set a_ij = -inf if i = j. As before, we pass the concatenated output through a linear layer with ReLU activations. This layer is applied residually, so its output is additionally summed with the input.

5. Query Attention: This is done the same way as context attention, but we calculate the weighted sum of the context words for each query word, so the output length equals the number of query words. We then calculate context-to-query attention analogously to the query-to-context attention of the context attention layer.
Figure 4-1 The modified BiDAF model with multilevel attention
6. Query Self-Attention: This is done the same way as the context self-attention layer, but on the output of the Query Attention layer.
7. Context-Query Bi-Attention + Self-Attention: The outputs of the Context Self-Attention and Query Self-Attention layers are taken as input, and the same bi-attention and self-attention process is applied to these inputs.

8. Prediction: In the last layer of our model, a bidirectional GRU is applied, followed by a linear layer that computes answer-start scores for each token. The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer-end scores. The softmax operation is applied to the start and end scores to produce start and end probabilities, and we optimize the negative log-likelihood of selecting the correct start and end tokens.
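The bi-directional attention of the Context Attention layer (Equations 4-1 through 4-3) and the final concatenation can be sketched in NumPy. This is a toy illustration with random, untrained weight vectors, not the trained model:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bi_attention(h, q, w1, w2, w3):
    """h: (n_c, d) context vectors, q: (n_q, d) question vectors,
    w1, w2, w3: (d,) weight vectors (learned in the real model)."""
    # a[i, j] = w1·h_i + w2·q_j + w3·(h_i ⊙ q_j)        (Equation 4-1)
    a = (h @ w1)[:, None] + (q @ w2)[None, :] + (h * w3) @ q.T
    # Context-to-query attended vector c_i                (Equation 4-2)
    c = softmax(a, axis=1) @ q
    # Query-to-context vector q_c from per-row maxima     (Equation 4-3)
    m = a.max(axis=1)
    q_c = softmax(m) @ h
    # Final token representation [h_i; c_i; h_i ⊙ c_i; q_c ⊙ c_i]
    return np.concatenate([h, c, h * c, q_c[None, :] * c], axis=1)

rng = np.random.default_rng(0)
n_c, n_q, d = 5, 3, 8
out = bi_attention(rng.normal(size=(n_c, d)), rng.normal(size=(n_q, d)),
                   rng.normal(size=d), rng.normal(size=d), rng.normal(size=d))
# out has shape (n_c, 4 * d): one query-aware vector per context token
```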
Having carried out this modification, we were able to solve the wrong example we
started with: the multilevel attention model gives the correct output, "San Jose
State". We also achieved slightly better scores than the original model, with an F1 score
of 85.44 on the SQuAD dev set.
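The F1 score reported here is SQuAD's token-overlap F1. A simplified version of that metric (omitting the lowercasing, article removal, and punctuation stripping the official evaluation script performs) shows how the example above is scored:

```python
from collections import Counter

def squad_f1(prediction, ground_truth):
    """Token-overlap F1, as used by the SQuAD evaluation (simplified)."""
    pred_tokens = prediction.split()
    gold_tokens = ground_truth.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# The wrong prediction from the start of the chapter shares only "State"
# with the gold answer, so it still earns partial credit:
partial = squad_f1("Florida State Facility", "San Jose State")
exact = squad_f1("San Jose State", "San Jose State")  # 1.0
```

Because a wrong span can overlap the gold answer, F1 is averaged over the whole dev set rather than judged per example.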
Chatbot Design Using a QA System

Designing a perfect chatbot that passes the Turing test is a fundamental goal of
Artificial Intelligence. Although we are many orders of magnitude away from achieving
such a goal, domain-specific tasks can be solved with chatbots built from current
technology. We started with a similar objective in mind, i.e., to design a
domain-specific chatbot and then generalize to other areas once it achieves the
first domain-specific objective robustly. This led us to the fundamental problem of
Machine Comprehension and subsequently to the task of Question Answering. Having
achieved some degree of success with QA systems, we looked back at whether we could apply
our newly acquired knowledge to the task of designing chatbots.

The chatbots made with today's technologies mostly rely on handcrafted techniques
such as template matching, which requires anticipating all possible ways a user may
articulate his requirements and a conversation may unfold. This requires many man-hours
to design a domain-specific system and is still very error prone. In this section
we propose a general chatbot design that makes designing a domain-specific
chatbot easy and robust at the same time.
Every domain-specific chatbot needs to obtain a set of information from the user
and show some results based on the user-specific information obtained. Traditional
chatbots use template matching and keyword lookup to determine whether the user has
provided the required information. Our idea is to use the Question Answering system in
the backend to extract the required information from whatever the user has typed
up to that point in the conversation. The information to be extracted can be posed as
a set of questions, and the answers obtained from those questions can be used
as the parameters to supply the relevant information to the user.

We chose flight reservation as our chatbot domain. Our goal
was to extract the required information from the user to be able to show him the
available flights as per the user's requirements.
Figure 4-2. Flight reservation chatbot's chat window
For a flight reservation task, the booking agent needs to know at minimum the origin city,
destination city, and date of travel to be able to show the available flights.
Optional information includes the number of tickets, passenger's name, one-way or
round trip, etc.

The minimal conversation with the user through the chat window would be as
shown above. We had a platform called OneTask on which we wanted to implement our
chatbot. The chat interface within the OneTask system looks as follows.
Figure 4-3 Chatbot within OneTask system
The working of the chat system is as follows:

1. Initiation: The user opens the chat window, which starts a session with the chatbot. The chatbot asks 'How may I help you?'

2. User Reply: The user may reply with none of the required information for flight booking, or may provide multiple pieces of information in the same message.

3. User Reply Parsing: The conversation up to this point is treated as a passage and the internal questions are run on this passage. The questions that are run are:

Where do you want to go?

From where do you want to leave?

When do you want to depart?

4. Parsed Responses from the QA Model: After running the questions on the QA system with the given conversation, answers are obtained for all of them along with their corresponding confidence values. Since answers will be returned even if the required question has not been answered up to this point in the conversation, the confidence values play a crucial role in determining the validity of an answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly; a confidence of 2 - 10 signifies that it may have been answered, but we should verify with the user for correctness; and any answer with confidence below 2 is discarded.

5. Asking Remaining Questions Iteratively: After the parsing, we check whether any of the required questions are still unanswered. If so, the chatbot asks the remaining questions, and steps 3 - 4 are carried out iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request.
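The steps above can be sketched as a small slot-filling loop. Here `fake_qa` is a hypothetical stand-in for the real QA model (which returns an answer span and a confidence for a question-passage pair), and the threshold constants simply mirror the ranges described in step 4:

```python
REQUIRED_QUESTIONS = [
    "Where do you want to go",
    "From where do you want to leave",
    "When do you want to depart",
]
ACCEPT, VERIFY = 10.0, 2.0  # confidence cut-offs from step 4

def extract_slots(conversation, qa_model):
    """Run every required question against the conversation-so-far."""
    answered, to_verify, unanswered = {}, {}, []
    for question in REQUIRED_QUESTIONS:
        answer, confidence = qa_model(question, conversation)
        if confidence > ACCEPT:
            answered[question] = answer       # accept outright
        elif confidence > VERIFY:
            to_verify[question] = answer      # confirm with the user
        else:
            unanswered.append(question)       # ask explicitly next turn
    return answered, to_verify, unanswered

def fake_qa(question, passage):  # hypothetical stand-in for the QA system
    if "go" in question and "Delhi" in passage:
        return "Delhi", 12.0
    return "", 0.5

answered, to_verify, unanswered = extract_slots("I want to fly to Delhi", fake_qa)
# destination is filled; origin and date remain to be asked next turn
```

The chatbot would then ask the first entry of `unanswered`, append the user's reply to the conversation, and call `extract_slots` again until all slots are filled.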
Figure 4-4 The Flow diagram of the Flight booking Chatbot system
Online QA System and Attention Visualization

To be able to test various examples, we used the BiDAF [24] model for an
online demo. One can either choose from the available examples in the drop-down
menu or paste one's own passage and questions. While this is a useful and interesting
system for testing the model in a user-friendly way, we created it primarily to be able to
focus on the wrong samples.
Figure 4-5 QA system interface with attention highlight over candidate answers
The candidate answers are shown with blue highlights according to their
confidence values: the higher the confidence, the darker the answer. The candidate with the
highest confidence value is chosen as the predicted answer. We developed the system to show
the attention spread over the candidate answers in order to understand what needed to be done to improve
the system. This led us to realize the importance of adding query attention, as
well as multilevel attention, to the BiDAF model, as described in the first section of this
chapter.
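The confidence-proportional highlighting can be sketched as a small hypothetical helper (not the actual web-interface code) that maps each candidate's confidence, normalized to [0, 1], to the opacity of a blue background:

```python
def highlight_html(candidates):
    """candidates: list of (answer_text, confidence in [0, 1]).
    Higher confidence -> more opaque (darker) blue highlight."""
    spans = []
    for text, conf in sorted(candidates, key=lambda c: -c[1]):
        spans.append(
            '<span style="background: rgba(0, 0, 255, {:.2f})">{}</span>'
            .format(conf, text))
    return " ".join(spans)

html = highlight_html([("Florida State Facility", 0.35),
                       ("San Jose State", 0.85)])
# "San Jose State" is listed first and rendered with the darker highlight
```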
CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions
Having done a thorough analysis of the current methods and state of the art in
Machine Comprehension and the QA task, and having developed systems on them, we have
achieved a strong sense of what needs to be done to further improve QA models.
After observing the wrong samples, we can see that the system is still unable to encode
meaning and is picking answers based on statistical patterns of answer occurrence
in the training examples.

From the state-of-the-art models and the ongoing research literature,
it is easy to conclude that more features need to be embedded to encode the meaning
of words, phrases, and sentences. The paper Reinforced Mnemonic Reader for
Machine Comprehension encoded POS and NER tags of words along with their word
and character embeddings, which gave better results.
Figure 5-1 An English language semantic parse tree [26]
We have developed a method to encode the syntax parse tree of a sentence that
encodes not only the POS tags but also the relation of the word within its phrase and the
relation of the phrase within the whole sentence, in a hierarchical manner.
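As a toy illustration of hierarchical parse-tree encoding (a hypothetical scheme sketched for exposition, not the exact method developed in this work), each word can be tagged with the chain of labels from the sentence root down to its POS tag, capturing both its role in its phrase and the phrase's role in the sentence:

```python
def encode_paths(tree, prefix=()):
    """tree: (label, children...) where a leaf child is a plain string.
    Returns a list of (word, path-of-labels-from-root-to-POS-tag)."""
    label, *children = tree
    path = prefix + (label,)
    out = []
    for child in children:
        if isinstance(child, str):
            out.append((child, path))       # leaf word under its POS tag
        else:
            out.extend(encode_paths(child, path))  # recurse into sub-phrase
    return out

# "the dog barked" with a toy grammar: S -> NP VP, NP -> DT NN, VP -> VBD
parse = ("S", ("NP", ("DT", "the"), ("NN", "dog")), ("VP", ("VBD", "barked")))
paths = encode_paths(parse)
# e.g. ("dog", ("S", "NP", "NN")): the word, its phrase, and sentence roles
```

Each path could then be embedded (e.g. one vector per label, summed or concatenated) and appended to the word's input features.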
Finally, data augmentation is another way to get better results. One definite
way to reduce errors would be to include in the training data samples similar to those
the system falters on in the dev set: one could generate examples similar to the failure
cases and include them in the training set for better prediction. Another approach
would be to train on similar, larger datasets. Our models were trained on the SQuAD
dataset, but other datasets address a similar question answering task,
such as TriviaQA. We could augment the SQuAD training set with that of TriviaQA to
obtain a more robust system that generalizes better and thus has higher
accuracy in predicting answer spans.
Conclusion

In this work we have explored the most fundamental techniques that have
shaped the current state of the art. We then proposed a minor architectural
improvement over an existing model. Furthermore, we developed two applications that
use the base model: first, we discussed how a chatbot application can be built
using the QA system; and second, we created a web interface where the model can
be used for any passage and question. This interface also shows the attention spread
over the candidate answers. While our effort to push the state of the art
forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will pay
high dividends for society at large.
LIST OF REFERENCES
[1] Chen, Danqi. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U

[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.

[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension." arXiv preprint arXiv:1705.03551 (2017).

[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ questions for machine comprehension of text." arXiv preprint arXiv:1606.05250 (2016).

[5] "The Stanford Question Answering Dataset." 2018. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/

[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced mnemonic reader for machine comprehension." CoRR abs/1705.02798 (2017).

[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.

[8] Mehrotra, Neetesh. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai/

[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.

[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/

[11] Williams, Ronald J., and David Zipser. "A learning algorithm for continually running fully recurrent neural networks." Neural Computation 1, no. 2 (1989): 270-280.

[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9, no. 8 (1997): 1735-1780.

[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv preprint arXiv:1412.3555 (2014).

[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic regularities in continuous space word representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.

[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.

[16] McCormick, Chris. 2016. "Word2Vec Tutorial - The Skip-Gram Model." Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

[17] "[Paper Introduction] Bilingual Word Representations with Monolingual Quality in Mind." 2015. SlideShare. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind

[18] "Attention Mechanism." 2016. Heuritech Blog. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism/

[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. 2014. "Memory Networks." arXiv preprint arXiv:1410.3916. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916

[20] Wang, Shuohang, and Jing Jiang. "Machine comprehension using Match-LSTM and answer pointer." arXiv preprint arXiv:1608.07905 (2016).

[21] Wang, Shuohang, and Jing Jiang. "Learning natural language inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).

[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.

[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated self-matching networks for reading comprehension and question answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.

[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional attention flow for machine comprehension." arXiv preprint arXiv:1611.01603 (2016).

[25] Clark, Christopher, and Matt Gardner. "Simple and effective multi-paragraph reading comprehension." arXiv preprint arXiv:1710.10723 (2017).

[26] "8. Analyzing Sentence Structure." 2018. NLTK.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata, India, and has had a strong interest in
computers from an early age. After high school, he earned his B.Sc. in computer
science from Ramakrishna Mission Residential College, Narendrapur, followed by an M.Sc.
in computer science from St. Xavier's College, Kolkata. He had a strong intuition and
interest for human-like learning systems and wanted to work in this area. He started
working at TCS Innovation Labs, Pune, on applications of Natural Language
Processing in education. As he was deeply passionate about learning
systems that mimic the human brain and learn like a human child does, he became
increasingly interested in Deep Learning and its applications. After working for a year,
he went on to pursue a Master of Science degree in computer science at the
University of Florida, Gainesville. His academic interests have been focused on Deep
Learning and Natural Language Processing, and he has been working on Machine
Reading Comprehension since the summer of 2017.
30
Multi-Attention BiDAF Model
Our comprehension model is a hierarchical multi-stage process and consists of
the following layers
1 Embedding Just as all other models we embed words using pretrained word vectors We also embed the characters in each word into size 20 vectors which are learned and run a convolution neural network followed by max-pooling to get character-derived embeddings for each word The character-level and word-level embeddings are then concatenated and passed to the next layer We do not update the word embeddings during training
2 Pre-Process A shared bi-directional GRU (Cho et al 2014) is used to map the question and passage embeddings to context aware embeddings
3 Context Attention The bi-directional attention mechanism from the Bi-Directional Attention Flow (BiDAF) model (Seo et al 2016) is used to build a query-aware context representation Let h_i be the vector for context word i q_j be the vector for question word j and n_q and n_c be the lengths of the question and context respectively We compute attention between context word i and question word j as
(4-1)
where w1 w2 and w3 are learned vectors and is element-wise multiplication We then compute an attended vector c_i for each context token as
(4-2)
We also compute a query-to-context vector q_c
(4-3)
31
The final vector computed for each token is built by concatenating
and In our model we subsequently pass the result through a linear layer with ReLU activations
4 Context Self-Attention Next we use a layer of residual self-attention The input is passed through another bi-directional GRU Then we apply the same attention mechanism only now between the passage and itself In this case we do not use query-to context attention and we set aij = 1048576inf if i = j As before we pass the concatenated output through a linear layer with ReLU activations This layer is applied residually so this output is additionally summed with the input
5 Query Attention For this part we do the same way as context attention but calculate the weighted sum of the context words for each query word Thus we get the length as number of query words Then we calculate context to query similar to query to context in context attention layer
Figure 4-1 The modified BiDAF model with multilevel attention
6 Query Self-Attention This part is done the same way as the context self-attention layer but from the output of the Query Attention layer
32
7 Context Query Bi-Attention + Self-Attention The output of the Context self-attention and the Query Self-Attention layers are taken as input and the same process for Bi-Attention and self-attention is applied on these inputs
8 Prediction In the last layer of our model a bidirectional GRU is applied followed by a linear layer that computes answer start scores for each token The hidden states of that layer are concatenated with the input and fed into a second bidirectional GRU and linear layer to predict answer end scores The softmax operation is applied to the start and end scores to produce start and end probabilities and we optimize the negative log likelihood of selecting correct start and end tokens
Having carried out this modification we were able to solve the wrong example we
started with The multilevel attention model gives the correct output as ldquoSan Jose
Staterdquo Also we achieved slightly better scores than the original model with a F1 score
of 8544 on the SQuAD dev dataset
Chatbot Design Using a QA System
Designing a perfect chatbot that passes the Turing test is a fundamental goal for
Artificial Intelligence Although we are many order to magnitudes away from achieving
such a goal domain specific tasks can be solved with chatbots made from current
technology We had started our goal with a similar objective in mind ie to design a
domain specific chatbot and then generalize to other areas as it is able to achieve the
first domain specific objective robustly This led us to the fundamental problem of
Machine Comprehension and subsequently to the task of Question Answering Having
achieved some degree of success with QA systems we looked back if we could apply
our newly acquired knowledge in the task of designing Chatbots
The chatbots made with todayrsquos technologies are mostly handcrafted techniques
such as template matching that requires anticipating all possible ways a user may
articulate his requirements and a conversation may occur This requires a lot of man
hours for designing a domain specific system and is still very error prone In this section
33
we propose a general Chatbot design that would make the designing of a domain
specific chatbot very easy and robust at the same time
Every domain specific chatbot needs to obtain a set of information from the user
and show some results based on the user specific information obtained The traditional
chatbots use template matching and keywords lookup to determine if the user has
provided the required information Our idea is to use the Question Answering system in
the backend to extract out the required information from whatever the user has typed
until this point of the conversation The information to be extracted can be posed in the
form of a set of questions and the answers obtained from those questions can be used
as the parameters to supply the relevant information to the user
We had chosen our chatbot domain as the flight reservation system Our goal
was to extract the required information from the user to be able to show him the
available flights as per the userrsquos requirements
Figure 4-2 Flight reservation chatbotrsquos chat window
34
For a flight reservation task the booking agent needs to know the origin city
destination city and date of travel at minimum to be able to show the available flights
Optional information includes the number of tickets passengerrsquos name one way or
round trip etc
The minimalistic conversation with the user through the chat window would be as
shown above We had a platform called OneTask on which we wanted to implement our
chat bot The chat interface within the OneTask system looks as follows
Figure 4-3 Chatbot within OneTask system
The working of the chat system is as follows ndash
1 Initiation The user opens the chat window which starts a session with the chat bot The chat bot asks lsquoHow may I help yoursquo
2 User Reply The user may reply with none of the required information for flight booking or may reply with multiple information in the same message
3 User Reply parsing The conversation up to this point is treated as a passage and the internal questions are run on this passage So the four questions that are run are
Where do you want to go
35
From where do you want to leave
When do you want to depart
4 Parsed responses from QA model After running the questions on the QA system with the given conversation answers are obtained for all along with their corresponding confidence values Since answers will be obtained even if the required question was not answered up to this point in the conversation the confidence values plays a crucial role in determining the validity of the answer For this purpose we determined ranges of confidence values for validation A confidence value above 10 signifies the question has been answered correctly A confidence of 2 ndash 10 signifies that it may have been answered but should verify with the user for correctness and any confidence below 2 is discarded
5 Asking remaining questions iteratively After the parsing it is checked if any of the required questions are still unanswered If so the chatbot asks the remaining question and the process from 3 ndash 4 is carried out iteratively Once all the questions have been answered the user is shown with the available flight options as per his request
Figure 4-4 The Flow diagram of the Flight booking Chatbot system
Online QA System and Attention Visualization
To be able to test various examples, we used the BiDAF [24] model for an online demo. One can either choose from the available examples in the drop-down menu or paste one's own passage and questions. While this is a useful and interesting way to test the model in a user-friendly manner, we created this system primarily to be able to focus on the wrong samples.
Figure 4-5. QA system interface with attention highlights over candidate answers
The candidate answers are shown with blue highlights according to their confidence values: the higher the confidence, the darker the answer. The candidate with the highest confidence value is chosen as the predicted answer. We developed the system to show the attention spread of the candidate answers in order to understand what needs to be done to improve the system. This led us to realize the importance of including the query-attention part, as well as multilevel attention, in the BiDAF model, as described in the first section of this chapter.
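As a concrete illustration of the highlighting rule (darker means more confident, and the darkest span is the prediction), here is a small, hypothetical helper; the function names and the choice of linear scaling are assumptions for illustration, not taken from the actual interface code.

```python
def highlight_opacities(candidates):
    """Given (answer_span, confidence) pairs, return each span with an
    opacity in (0, 1] so that the most confident span renders darkest."""
    top = max(conf for _, conf in candidates)
    return [(span, conf / top) for span, conf in candidates]

def predicted_answer(candidates):
    """The span with the highest confidence is the predicted answer."""
    return max(candidates, key=lambda pair: pair[1])[0]
```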
CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions
Having done a thorough analysis of the current methods and the state of the art in Machine Comprehension and the QA task, and having developed systems for it, we have gained a strong sense of what needs to be done to further improve QA models. After observing the wrong samples, we can see that the system is still unable to encode meaning, and instead picks answers based on statistical patterns in the occurrence of answers in the training examples.
Judging from the state-of-the-art models and the ongoing research literature, it is easy to conclude that more features need to be embedded to encode the meaning of words, phrases, and sentences. The paper "Reinforced Mnemonic Reader for Machine Comprehension" [6] encoded the POS and NER tags of words along with their word and character embeddings, which gave them better results.
Figure 5-1. An English language parse tree [26]
We have developed a method to encode the syntax parse tree of a sentence that encodes not only the POS tags but also the relation of the word within its phrase and the relation of the phrase within the whole sentence, in a hierarchical manner.
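To make the idea concrete, here is a small, self-contained sketch of one possible hierarchical encoding; the tuple-based tree format and the function are illustrative assumptions, not the actual method developed in this work. Each word is paired with the chain of phrase labels from the sentence root down to its POS tag, so the encoding reflects both the word's role in its phrase and the phrase's role in the sentence.

```python
def leaf_paths(tree, prefix=()):
    """Walk a constituency tree given as (label, children) tuples, where a
    leaf is (POS_tag, word). Return (word, path-of-labels) pairs."""
    label, children = tree
    if isinstance(children, str):           # leaf: (POS tag, word)
        return [(children, prefix + (label,))]
    pairs = []
    for child in children:
        pairs.extend(leaf_paths(child, prefix + (label,)))
    return pairs

# A toy parse of "the dog barked"
parse = ("S", [("NP", [("DT", "the"), ("NN", "dog")]),
               ("VP", [("VBD", "barked")])])
# leaf_paths(parse)[0] == ("the", ("S", "NP", "DT"))
```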
Finally, data augmentation is another way to get better results. One definite way to reduce the errors would be to include in the training data samples similar to those on which the system falters in the dev set: one could generate examples similar to the failure cases and include them in the training set for better prediction. Another approach would be to train on similar and bigger datasets. Our models were trained on the SQuAD dataset, but other datasets address the same question answering task, such as TriviaQA. We could augment the SQuAD training set with that of TriviaQA to obtain a more robust system that generalizes better and thus has higher accuracy in predicting answer spans.
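The augmentation idea can be sketched as follows. This assumes both datasets have already been converted to a shared example format; the function name and the shared-format assumption are illustrative, not part of the work described above.

```python
import random

def merge_training_sets(squad_examples, trivia_examples, seed=13):
    """Pool SQuAD and TriviaQA training examples (assumed to share a
    {passage, question, answer_span} format) and shuffle them so each
    training batch mixes both distributions."""
    merged = list(squad_examples) + list(trivia_examples)
    random.Random(seed).shuffle(merged)   # deterministic for a fixed seed
    return merged
```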
Conclusion
In this work we explored the most fundamental techniques that have shaped the current state of the art. We then proposed a minor architectural improvement over an existing model. Furthermore, we developed two applications that use the base model: first, we showed how a chatbot application can be built using the QA system, and second, we created a web interface where the model can be used with any passage and question. This interface also shows the attention spread over the candidate answers. While our effort to push the state of the art forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will pay high dividends for society at large.
LIST OF REFERENCES
[1] Chen, Danqi. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U

[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. New Haven, Conn.: Yale Univ., Dept. of Computer Science, 1977.

[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension." arXiv preprint arXiv:1705.03551 (2017).

[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ Questions for Machine Comprehension of Text." arXiv preprint arXiv:1606.05250 (2016).

[5] "The Stanford Question Answering Dataset." 2018. rajpurkar.github.io. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/

[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced Mnemonic Reader for Machine Comprehension." CoRR abs/1705.02798 (2017).

[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.

[8] Mehrotra, Neetesh. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai

[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.

[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/

[11] Williams, Ronald J., and David Zipser. "A Learning Algorithm for Continually Running Fully Recurrent Neural Networks." Neural Computation 1, no. 2 (1989): 270-280.

[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long Short-Term Memory." Neural Computation 9, no. 8 (1997): 1735-1780.

[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling." arXiv preprint arXiv:1412.3555 (2014).
[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic Regularities in Continuous Space Word Representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.

[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global Vectors for Word Representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.

[16] "Word2Vec Tutorial - The Skip-Gram Model." 2016. McCormickML.com. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

[17] "[Paper Introduction] Bilingual Word Representations with Monolingual Quality in Mind." 2015. SlideShare.net. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind

[18] "Attention Mechanism." 2016. blog.heuritech.com. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism/

[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. 2014. "Memory Networks." arXiv preprint arXiv:1410.3916. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916

[20] Wang, Shuohang, and Jing Jiang. "Machine Comprehension Using Match-LSTM and Answer Pointer." arXiv preprint arXiv:1608.07905 (2016).

[21] Wang, Shuohang, and Jing Jiang. "Learning Natural Language Inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).

[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer Networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.

[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated Self-Matching Networks for Reading Comprehension and Question Answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.

[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional Attention Flow for Machine Comprehension." arXiv preprint arXiv:1611.01603 (2016).

[25] Clark, Christopher, and Matt Gardner. "Simple and Effective Multi-Paragraph Reading Comprehension." arXiv preprint arXiv:1710.10723 (2017).
[26] "8. Analyzing Sentence Structure." 2018. NLTK.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata, India, and has had a strong interest in computers from an early age. After high school, he earned his B.Sc. in computer science from Ramakrishna Mission Residential College, Narendrapur, followed by an M.Sc. in computer science from St. Xavier's College, Kolkata. He had a strong intuition about, and interest in, human-like learning systems and wanted to work in this area. He started working at TCS Innovation Labs, Pune, on applications of Natural Language Processing in education. As he was deeply passionate about learning systems that mimic the human brain and learn the way a human child does, he grew increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science at the University of Florida, Gainesville. His academic interests have been focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.
architecture over an existing model Furthermore we developed two applications that
uses the base model First we talked about how a chatbot application can be made
using the QA system and lastly we also created a web interface where the model can
be used for any Passage and Question This interface also shows the attention spread
on the candidate answers While our effort is ongoing to push the state of the art
forward we strongly believe that surpassing human level accuracy on this task will have
high dividends for the society at large
39
LIST OF REFERENCES
[1] Danqi Chen From Reading Comprehension To Open-Domain Question Answering 2018 Youtube Accessed March 16 2018 httpswwwyoutubecomwatchv=1RN88O9C13U
[2] Lehnert Wendy G The Process of Question Answering No RR-88 Yale Univ New Haven Conn Dept of computer science 1977
[3] Joshi Mandar Eunsol Choi Daniel S Weld and Luke Zettlemoyer Triviaqa A large scale distantly supervised challenge dataset for reading comprehension arXiv preprint arXiv170503551 (2017)
[4] Rajpurkar Pranav Jian Zhang Konstantin Lopyrev and Percy Liang Squad 100000+ questions for machine comprehension of text arXiv preprint arXiv160605250 (2016)
[5] The Stanford Question Answering Dataset 2018 RajpurkarGithubIo Accessed March 16 2018 httpsrajpurkargithubioSQuAD-explorer
[6] Hu Minghao Yuxing Peng and Xipeng Qiu Reinforced mnemonic reader for machine comprehension CoRR abs170502798 (2017)
[7] Schalkoff Robert J Artificial neural networks Vol 1 New York McGraw-Hill 1997
[8] Me For The AI and Neetesh Mehrotra 2018 The Connect Between Deep Learning And AI - Open Source For You Open Source For You Accessed March 16 2018 httpopensourceforucom201801connect-deep-learning-ai
[9] Krizhevsky Alex Ilya Sutskever and Geoffrey E Hinton Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems pp 1097-1105 2012
[10] An Intuitive Explanation Of Convolutional Neural Networks 2016 The Data Science Blog Accessed March 16 2018 httpsujjwalkarnme20160811intuitive-explanation-convnets
[11] Williams Ronald J and David Zipser A learning algorithm for continually running fully recurrent neural networks Neural computation 1 no 2 (1989) 270-280
[12] Hochreiter Sepp and Juumlrgen Schmidhuber Long short-term memory Neural computation 9 no 8 (1997) 1735-1780
[13] Chung Junyoung Caglar Gulcehre KyungHyun Cho and Yoshua Bengio Empirical evaluation of gated recurrent neural networks on sequence modeling arXiv preprint arXiv14123555 (2014)
40
[14] Mikolov Tomas Wen-tau Yih and Geoffrey Zweig Linguistic regularities in continuous space word representations In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics Human Language Technologies pp 746-751 2013
[15] Pennington Jeffrey Richard Socher and Christopher Manning Glove Global vectors for word representation In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) pp 1532-1543 2014
[16] Word2vec Tutorial - The Skip-Gram Model middot Chris Mccormick 2016 MccormickmlCom Accessed March 16 2018 httpmccormickmlcom20160419word2vec-tutorial-the-skip-gram-model
[17] Group 2015 [Paper Introduction] Bilingual Word Representations With Monolingual hellip SlideshareNet Accessed March 16 2018 httpswwwslidesharenetnaist-mtstudybilingual-word-representations-with-monolingual-quality-in-mind
[18] Attention Mechanism 2016 BlogHeuritechCom Accessed March 16 2018 httpsblogheuritechcom20160120attention-mechanism
[19] Weston Jason Sumit Chopra and Antoine Bordes 2014 Memory Networks arXiv preprint arXiv14103916v11 Accessed March 16 2018 httpsarxivorgabs14103916
[20] Wang Shuohang and Jing Jiang Machine comprehension using match-lstm and answer pointer arXiv preprint arXiv160807905 (2016)
[21] Wang Shuohang and Jing Jiang Learning natural language inference with LSTM arXiv preprint arXiv151208849 (2015)
[22] Vinyals Oriol Meire Fortunato and Navdeep Jaitly Pointer networks In Advances in Neural Information Processing Systems pp 2692-2700 2015
[23] Wang Wenhui Nan Yang Furu Wei Baobao Chang and Ming Zhou Gated self-matching networks for reading comprehension and question answering In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1 Long Papers) vol 1 pp 189-198 2017
[24] Seo Minjoon Aniruddha Kembhavi Ali Farhadi and Hannaneh Hajishirzi Bidirectional attention flow for machine comprehension arXiv preprint arXiv161101603 (2016)
[25] Clark Christopher and Matt Gardner Simple and effective multi-paragraph reading comprehension arXiv preprint arXiv171010723 (2017)
41
[26] 8 Analyzing Sentence Structure 2018 NltkOrg Accessed March 16 2018 httpwwwnltkorgbookch08html
42
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata India and had a strong interest for
computers since an early age After his high school he did his BSc in computer
science from Ramakrishna Mission Residential College Narendrapur followed by MSc
in computer science from St Xavierrsquos College Kolkata He had a strong intuition and
interest for human like learning systems and wanted to work in this area He started
working at TCS Innovation Labs Pune for the application of Natural Language
Processing in Educational Applications As he was deeply passionate about learning
system that mimic the human brain and learn like a human child does he was
increasing interested about Deep Learning and its applications After working for a year
he went on to pursue a Master of Science degree in computer science from the
University of Florida Gainesville His academic interests have been focused on Deep
Learning and Natural Language Processing and he has been working on Machine
Reading Comprehension since summer of 2017
we propose a general chatbot design that makes building a domain-specific chatbot easy and robust at the same time.
Every domain-specific chatbot needs to obtain a set of information from the user and show results based on that information. Traditional chatbots use template matching and keyword lookup to determine whether the user has provided the required information. Our idea is to use a Question Answering system in the backend to extract the required information from whatever the user has typed up to that point in the conversation. The information to be extracted can be posed as a set of questions, and the answers obtained from those questions can be used as the parameters for supplying the relevant information to the user.
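This design can be sketched as follows. Here `run_qa` is a hypothetical stand-in for the real QA model (stubbed with canned outputs so the control flow can be shown end to end), and the slot names are illustrative assumptions, not the production implementation:

```python
# Required information is posed as a fixed set of internal questions; the QA
# model is run with the conversation-so-far as the passage.
SLOT_QUESTIONS = {
    "destination": "Where do you want to go?",
    "origin": "From where do you want to leave?",
    "date": "When do you want to depart?",
}

def run_qa(passage, question):
    """Stub QA model (hypothetical): a real system would run a model such as
    BiDAF over the passage and return (answer span, confidence)."""
    canned = {
        "Where do you want to go?": ("Miami", 14.2),
        "From where do you want to leave?": ("Orlando", 11.7),
        "When do you want to depart?": ("tomorrow", 0.4),
    }
    return canned[question]

def extract_slots(conversation):
    """Ask every slot question against the conversation passage."""
    return {slot: run_qa(conversation, q) for slot, q in SLOT_QUESTIONS.items()}

answers = extract_slots("User: I need a flight from Orlando to Miami")
```

The chatbot then only has to inspect `answers` to decide which pieces of information it still needs to request.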
We chose flight reservation as our chatbot domain. Our goal was to extract the required information from the user in order to show him the available flights as per the user's requirements.
Figure 4-2 Flight reservation chatbot's chat window
For a flight reservation task, the booking agent needs to know, at minimum, the origin city, destination city, and date of travel to be able to show the available flights. Optional information includes the number of tickets, the passenger's name, one way or round trip, etc.
The minimalistic conversation with the user through the chat window would be as shown above. We had a platform called OneTask on which we wanted to implement our chatbot. The chat interface within the OneTask system looks as follows:
Figure 4-3 Chatbot within OneTask system
The chat system works as follows:
1. Initiation: The user opens the chat window, which starts a session with the chatbot. The chatbot asks 'How may I help you?'
2. User reply: The user may reply with none of the required information for flight booking, or may provide multiple pieces of information in the same message.
3. User reply parsing: The conversation up to this point is treated as a passage, and the internal questions are run on this passage. The four questions that are run are:
Where do you want to go?
From where do you want to leave?
When do you want to depart?
4. Parsed responses from QA model: After the questions are run on the QA system with the given conversation, answers are obtained for all of them, along with their corresponding confidence values. Since answers are returned even when the required question has not yet been answered at this point in the conversation, the confidence values play a crucial role in determining the validity of an answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly; a confidence of 2–10 signifies that it may have been answered, but the chatbot should verify the answer with the user; and any answer with confidence below 2 is discarded.
5. Asking remaining questions iteratively: After parsing, the chatbot checks whether any of the required questions are still unanswered. If so, it asks the remaining questions, and steps 3–4 are repeated iteratively. Once all the questions have been answered, the user is shown the available flight options as per his request.
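The steps above can be sketched in a few lines. The thresholds (10 and 2) come from the validation scheme described in step 4; the function names and the canned parsed output are illustrative assumptions, not the production code:

```python
def validate(answer, confidence):
    """Classify a QA answer by the confidence bands from step 4."""
    if confidence > 10:
        return "accept"      # treat the question as answered correctly
    if confidence >= 2:
        return "verify"      # ask the user to confirm this answer
    return "discard"         # too low: the question remains unanswered

def remaining_questions(parsed):
    """parsed maps each internal question to (answer, confidence);
    return the questions the chatbot must still ask (step 5)."""
    return [q for q, (ans, conf) in parsed.items()
            if validate(ans, conf) == "discard"]

# Example parsed output after one user turn (canned values):
parsed = {
    "Where do you want to go?": ("Miami", 14.2),
    "From where do you want to leave?": ("Orlando", 5.1),
    "When do you want to depart?": ("tomorrow", 0.4),
}
todo = remaining_questions(parsed)
# todo contains only the low-confidence question, which is asked next
```

In a full system this loop would repeat, re-running the QA model on the growing conversation until `remaining_questions` returns an empty list.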
Figure 4-4 Flow diagram of the flight booking chatbot system
Online QA System and Attention Visualization
To be able to test out various examples, we used the BiDAF [24] model for an online demo. One can either choose from the available examples in the drop-down menu or paste in one's own passage and questions. While this is a useful and interesting system for testing the model in a user-friendly way, we created it primarily to be able to focus on the incorrectly answered samples.
Figure 4-5 QA system interface with attention highlight over candidate answers
The candidate answers are shown with blue highlights according to their confidence values: the higher the confidence, the darker the highlight. The candidate with the highest confidence value is chosen as the predicted answer. We developed the system to show the attention spread over the candidate answers, in order to understand what needs to be done to improve the system. This led us to realize the importance of including the query-attention component, as well as multilevel attention, in the BiDAF model, as described in the first section of this chapter.
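The confidence-to-darkness mapping can be sketched as simple HTML generation. The `rgba` markup and the normalization by the top confidence are assumptions for illustration, not the actual interface code:

```python
def highlight(text, confidence, max_confidence):
    """Wrap a candidate answer in a span whose blue background gets darker
    (more opaque) as the model's confidence grows."""
    alpha = min(confidence / max_confidence, 1.0)   # normalize to [0, 1]
    return f'<span style="background: rgba(0, 0, 255, {alpha:.2f})">{text}</span>'

# Toy candidate spans with confidences (illustrative values):
candidates = [("in 1869", 14.0), ("1869", 7.0), ("the 19th century", 2.0)]
top = max(c for _, c in candidates)
spans = [highlight(t, c, top) for t, c in candidates]
# The first span is fully opaque; the others are proportionally lighter.
```

The predicted answer is then simply the candidate whose span received the highest confidence.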
CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions
Having done a thorough analysis of the current methods and the state of the art in Machine Comprehension for the QA task, and having developed systems on top of them, we have gained a strong sense of what needs to be done to further improve QA models. After observing the incorrectly answered samples, we can see that the system is still unable to encode meaning and instead picks answers based on statistical patterns of answer occurrence in the training examples.
From the state-of-the-art models and the ongoing research literature, it is easy to conclude that more features need to be embedded to encode the meaning of words, phrases, and sentences. The paper Reinforced Mnemonic Reader for Machine Comprehension [6] encoded POS and NER tags of words along with their word and character embeddings, which gave them better results.
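As a sketch of that feature-augmentation idea (with toy vectors and tag sets standing in for real pretrained embeddings and full tagsets), each token's input representation becomes the concatenation of its word embedding with one-hot encodings of its POS and NER tags:

```python
# Toy tag inventories; a real system would use the full Penn Treebank POS
# tagset and a standard NER scheme.
POS_TAGS = ["NOUN", "VERB", "ADJ", "OTHER"]
NER_TAGS = ["PER", "LOC", "O"]

def one_hot(tag, vocabulary):
    """One-hot encode a tag against a fixed vocabulary."""
    return [1.0 if tag == t else 0.0 for t in vocabulary]

def token_features(word_vec, pos, ner):
    """Concatenate word embedding + POS one-hot + NER one-hot."""
    return word_vec + one_hot(pos, POS_TAGS) + one_hot(ner, NER_TAGS)

feats = token_features([0.2, -0.1], "NOUN", "LOC")
# dimensionality: 2 (word) + 4 (POS) + 3 (NER) = 9
```

Character embeddings would be appended in the same way, giving the encoder richer per-token evidence than the word vector alone.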
Figure 5-1 An English language semantic parse tree [26]
We have developed a method to encode the syntactic parse tree of a sentence that encodes not only the POS tags but also the relation of each word within its phrase and of each phrase within the whole sentence, in a hierarchical manner.
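One simple way to illustrate the idea is to pair each word with the chain of phrase labels above it, which captures both the word's relation to its phrase and the phrase's relation to the sentence. The nested-tuple tree format and the function below are illustrative assumptions, not our actual encoding scheme:

```python
def word_paths(tree, path=()):
    """Walk a parse tree top-down, yielding (word, label_chain) pairs,
    where label_chain runs from the sentence root down to the POS tag."""
    label, children = tree
    if isinstance(children, str):            # leaf: a POS tag over a word
        yield children, path + (label,)
    else:
        for child in children:
            yield from word_paths(child, path + (label,))

# Nested-tuple form of: (S (NP (DT the) (NN dog)) (VP (VBD barked)))
tree = ("S", [("NP", [("DT", "the"), ("NN", "dog")]),
              ("VP", [("VBD", "barked")])])
paths = dict(word_paths(tree))
# paths["dog"] == ("S", "NP", "NN")
```

These label chains can then be embedded and concatenated onto the token features, in the same spirit as the POS/NER augmentation above.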
Finally, data augmentation is another way to obtain better results. One definite way to reduce errors would be to include in the training data samples similar to those on which the system falters in the dev set: one could generate examples resembling the failure cases and add them to the training set for better prediction. Another approach would be to train on similar, larger datasets. Our models were trained on the SQuAD dataset, but other datasets, such as TriviaQA, address a similar question answering task. We could augment the SQuAD training set with that of TriviaQA to build a more robust system that generalizes better and thus predicts answer spans with higher accuracy.
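The augmentation step itself is mechanical once both corpora share a format. A minimal sketch, assuming both have been converted to the SQuAD JSON layout (`{"version": ..., "data": [...]}` with one entry per article); the toy datasets are hypothetical:

```python
def merge_squad_style(*datasets):
    """Concatenate the article lists of several SQuAD-format datasets."""
    return {"version": "merged",
            "data": [article for d in datasets for article in d["data"]]}

# Hypothetical minimal datasets in SQuAD layout:
squad = {"version": "1.1", "data": [{"title": "UF", "paragraphs": []}]}
trivia = {"version": "tqa", "data": [{"title": "Jeopardy", "paragraphs": []}]}
combined = merge_squad_style(squad, trivia)
# combined["data"] now holds articles from both corpora
```

The merged file can then be fed to the existing SQuAD training pipeline unchanged, which is what makes this form of augmentation cheap to try.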
Conclusion
In this work we have explored the most fundamental techniques that have shaped the current state of the art. We then proposed a minor architectural improvement over an existing model. Furthermore, we developed two applications that use the base model: first, a chatbot built on the QA system, and second, a web interface where the model can be run on any passage and question. The interface also shows the attention spread over the candidate answers. While our effort to push the state of the art forward is ongoing, we strongly believe that surpassing human-level accuracy on this task will pay high dividends for society at large.
LIST OF REFERENCES
[1] Chen, Danqi. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U
[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.
[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension." arXiv preprint arXiv:1705.03551 (2017).
[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ Questions for Machine Comprehension of Text." arXiv preprint arXiv:1606.05250 (2016).
[5] "The Stanford Question Answering Dataset." 2018. rajpurkar.github.io. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/
[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced Mnemonic Reader for Machine Comprehension." CoRR abs/1705.02798 (2017).
[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.
[8] Mehrotra, Neetesh. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai/
[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." In Advances in Neural Information Processing Systems, pp. 1097–1105. 2012.
[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
[11] Williams, Ronald J., and David Zipser. "A Learning Algorithm for Continually Running Fully Recurrent Neural Networks." Neural Computation 1, no. 2 (1989): 270–280.
[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long Short-Term Memory." Neural Computation 9, no. 8 (1997): 1735–1780.
[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling." arXiv preprint arXiv:1412.3555 (2014).
[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic Regularities in Continuous Space Word Representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746–751. 2013.
[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global Vectors for Word Representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. 2014.
[16] "Word2Vec Tutorial - The Skip-Gram Model." Chris McCormick. 2016. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
[17] "[Paper Introduction] Bilingual Word Representations with Monolingual Quality in Mind." 2015. SlideShare. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind
[18] "Attention Mechanism." 2016. blog.heuritech.com. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism/
[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. "Memory Networks." 2014. arXiv preprint arXiv:1410.3916. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916
[20] Wang, Shuohang, and Jing Jiang. "Machine Comprehension Using Match-LSTM and Answer Pointer." arXiv preprint arXiv:1608.07905 (2016).
[21] Wang, Shuohang, and Jing Jiang. "Learning Natural Language Inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).
[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer Networks." In Advances in Neural Information Processing Systems, pp. 2692–2700. 2015.
[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated Self-Matching Networks for Reading Comprehension and Question Answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189–198. 2017.
[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional Attention Flow for Machine Comprehension." arXiv preprint arXiv:1611.01603 (2016).
[25] Clark, Christopher, and Matt Gardner. "Simple and Effective Multi-Paragraph Reading Comprehension." arXiv preprint arXiv:1710.10723 (2017).
[26] "8. Analyzing Sentence Structure." 2018. nltk.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata, India, and has had a strong interest in computers from an early age. After high school, he completed his B.Sc. in computer science at Ramakrishna Mission Residential College, Narendrapur, followed by an M.Sc. in computer science at St. Xavier's College, Kolkata. With a strong intuition for, and interest in, human-like learning systems, he wanted to work in this area. He began working at TCS Innovation Labs, Pune, on applications of Natural Language Processing in education. Deeply passionate about learning systems that mimic the human brain and learn as a human child does, he became increasingly interested in Deep Learning and its applications. After working for a year, he went on to pursue a Master of Science degree in computer science at the University of Florida, Gainesville. His academic interests are focused on Deep Learning and Natural Language Processing, and he has been working on Machine Reading Comprehension since the summer of 2017.
- ACKNOWLEDGMENTS
- LIST OF FIGURES
- LIST OF ABBREVIATIONS
- INTRODUCTION
- THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
-
- Neural Networks
- Convolutional Neural Network
- Recurrent Neural Networks (RNN)
- Word Embedding
- Attention Mechanism
- Memory Networks
-
- LITERATURE REVIEW AND STATE OF THE ART
-
- Machine Comprehension Using Match-LSTM and Answer Pointer
- R-NET Matching Reading Comprehension with Self-Matching Networks
- Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
- Summary
-
- MULTI-ATTENTION QUESTION ANSWERING
-
- Multi-Attention BiDAF Model
- Chatbot Design Using a QA System
- Online QA System and Attention Visualization
-
- FUTURE DIRECTIONS AND CONCLUSION
-
- Future Directions
- Conclusion
-
- LIST OF REFERENCES
- BIOGRAPHICAL SKETCH
-
34
For a flight reservation task the booking agent needs to know the origin city
destination city and date of travel at minimum to be able to show the available flights
Optional information includes the number of tickets passengerrsquos name one way or
round trip etc
The minimalistic conversation with the user through the chat window would be as
shown above We had a platform called OneTask on which we wanted to implement our
chat bot The chat interface within the OneTask system looks as follows
Figure 4-3 Chatbot within OneTask system
The working of the chat system is as follows ndash
1 Initiation The user opens the chat window which starts a session with the chat bot The chat bot asks lsquoHow may I help yoursquo
2 User Reply The user may reply with none of the required information for flight booking or may reply with multiple information in the same message
3 User Reply parsing The conversation up to this point is treated as a passage and the internal questions are run on this passage So the four questions that are run are
Where do you want to go
35
From where do you want to leave
When do you want to depart
4 Parsed responses from QA model After running the questions on the QA system with the given conversation answers are obtained for all along with their corresponding confidence values Since answers will be obtained even if the required question was not answered up to this point in the conversation the confidence values plays a crucial role in determining the validity of the answer For this purpose we determined ranges of confidence values for validation A confidence value above 10 signifies the question has been answered correctly A confidence of 2 ndash 10 signifies that it may have been answered but should verify with the user for correctness and any confidence below 2 is discarded
5 Asking remaining questions iteratively After the parsing it is checked if any of the required questions are still unanswered If so the chatbot asks the remaining question and the process from 3 ndash 4 is carried out iteratively Once all the questions have been answered the user is shown with the available flight options as per his request
Figure 4-4 The Flow diagram of the Flight booking Chatbot system
Online QA System and Attention Visualization
To be able to test out various examples we used the BiDAF [23] model for an
online demo One can either choose from the available examples from the drop-down
menu or paste their own passage and examples While this is a useful and interesting
system to test the model in a user-friendly way we created this system to be able to
focus on the wrong samples
36
Figure 4-5 QA system interface with attention highlight over candidate answers
The candidate answers are shown with the blue highlights as per their
confidence values The higher the confidence the darker the answer The highest
confidence value is chosen as the predicted answer We developed the system to show
attention spread of the candidate answers to realize what needs to be done to improve
the system This led us to realize the importance of including the query attention part as
well as multilevel attention on the BiDAF model as described in the first section of this
chapter
37
CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions
Having done a thorough analysis of the current methods and state of the arts in
Machine Comprehension and for the QA task and developing systems on it we have
achieved a strong sense of what needs to be done to further improve the QA models
After observing the wrong samples we can see that the system is still unable to encode
meaning and is picking answers based on statistical patterns of occurrence of answers
on training examples
As per the State of the Art models and analyzing the ongoing research literature
it is easy to conclude that more features need to be embedded to encode the meaning
of words phrases and sentences A paper called Reinforced Mnemonic Reader for
Machine Comprehension encoded POS and NER tags of words along with their word
and character embedding This gave them better results
Figure 5-1 An English language semantic parse tree [26]
38
We have developed a method to encode the syntax parse tree of a sentence that
not only encodes the post tags but the relation of the word within the phrase and the
relation of the phrase within the whole sentence in a hierarchical manner
Finally data augmentation is another solution to get better results One definite
way to reduce the errors would be to include similar samples in the training data which
the system is faltering in the dev set One could generate similar examples as the failure
cases and include them in the training set to have better prediction Another system
would be to train to similar and bigger datasets Our models were trained on the SQuAD
dataset There are other datasets too which does the similar question answering task
such as TriviaQA We could augment the training set of TriviaQA along with SQuAD to
have a more robust system that is able to generalize better and thus have higher
accuracy for predicting answer spans
Conclusion
In this work we have tried to explore the most fundamental techniques that have
shaped the current state of the art Then we proposed a minor improvement of
architecture over an existing model Furthermore we developed two applications that
uses the base model First we talked about how a chatbot application can be made
using the QA system and lastly we also created a web interface where the model can
be used for any Passage and Question This interface also shows the attention spread
on the candidate answers While our effort is ongoing to push the state of the art
forward we strongly believe that surpassing human level accuracy on this task will have
high dividends for the society at large
39
LIST OF REFERENCES
[1] Danqi Chen From Reading Comprehension To Open-Domain Question Answering 2018 Youtube Accessed March 16 2018 httpswwwyoutubecomwatchv=1RN88O9C13U
[2] Lehnert Wendy G The Process of Question Answering No RR-88 Yale Univ New Haven Conn Dept of computer science 1977
[3] Joshi Mandar Eunsol Choi Daniel S Weld and Luke Zettlemoyer Triviaqa A large scale distantly supervised challenge dataset for reading comprehension arXiv preprint arXiv170503551 (2017)
[4] Rajpurkar Pranav Jian Zhang Konstantin Lopyrev and Percy Liang Squad 100000+ questions for machine comprehension of text arXiv preprint arXiv160605250 (2016)
[5] The Stanford Question Answering Dataset 2018 RajpurkarGithubIo Accessed March 16 2018 httpsrajpurkargithubioSQuAD-explorer
[6] Hu Minghao Yuxing Peng and Xipeng Qiu Reinforced mnemonic reader for machine comprehension CoRR abs170502798 (2017)
[7] Schalkoff Robert J Artificial neural networks Vol 1 New York McGraw-Hill 1997
[8] Me For The AI and Neetesh Mehrotra 2018 The Connect Between Deep Learning And AI - Open Source For You Open Source For You Accessed March 16 2018 httpopensourceforucom201801connect-deep-learning-ai
[9] Krizhevsky Alex Ilya Sutskever and Geoffrey E Hinton Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems pp 1097-1105 2012
[10] An Intuitive Explanation Of Convolutional Neural Networks 2016 The Data Science Blog Accessed March 16 2018 httpsujjwalkarnme20160811intuitive-explanation-convnets
[11] Williams Ronald J and David Zipser A learning algorithm for continually running fully recurrent neural networks Neural computation 1 no 2 (1989) 270-280
[12] Hochreiter Sepp and Juumlrgen Schmidhuber Long short-term memory Neural computation 9 no 8 (1997) 1735-1780
[13] Chung Junyoung Caglar Gulcehre KyungHyun Cho and Yoshua Bengio Empirical evaluation of gated recurrent neural networks on sequence modeling arXiv preprint arXiv14123555 (2014)
40
[14] Mikolov Tomas Wen-tau Yih and Geoffrey Zweig Linguistic regularities in continuous space word representations In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics Human Language Technologies pp 746-751 2013
[15] Pennington Jeffrey Richard Socher and Christopher Manning Glove Global vectors for word representation In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) pp 1532-1543 2014
[16] Word2vec Tutorial - The Skip-Gram Model middot Chris Mccormick 2016 MccormickmlCom Accessed March 16 2018 httpmccormickmlcom20160419word2vec-tutorial-the-skip-gram-model
From where do you want to leave?
When do you want to depart?
4. Parsed responses from the QA model. After running the questions on the QA system with the given conversation, answers are obtained for all of them, along with their corresponding confidence values. Since answers are returned even when the required question has not yet been answered at that point in the conversation, the confidence values play a crucial role in determining the validity of an answer. For this purpose we determined ranges of confidence values for validation: a confidence value above 10 signifies the question has been answered correctly; a confidence of 2 to 10 signifies that it may have been answered but should be verified with the user for correctness; and any answer with confidence below 2 is discarded.
5. Asking remaining questions iteratively. After the parsing, it is checked whether any of the required questions are still unanswered. If so, the chatbot asks the remaining questions, and steps 3 to 4 are carried out iteratively. Once all the questions have been answered, the user is shown the available flight options matching the request.
Figure 4-4. Flow diagram of the flight-booking chatbot system
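The validation step above can be sketched in a few lines. This is an illustrative sketch, not the chatbot's actual implementation; the function names and the response format (question mapped to an answer/confidence pair) are assumptions, while the thresholds follow the text.

```python
# Hypothetical sketch of the answer-validation step (step 4). The QA model is
# assumed to return {question: (answer, confidence)}; thresholds are from the text.
def validate_answer(confidence):
    """Map a confidence value to a validation decision.

    > 10    -> answered correctly
    2 to 10 -> ask the user to verify
    < 2     -> discard
    """
    if confidence > 10:
        return "accept"
    if confidence >= 2:
        return "verify"
    return "discard"


def parse_responses(responses):
    """Split QA-model responses into accepted, to-verify, and unanswered questions."""
    accepted, to_verify, unanswered = {}, {}, []
    for question, (answer, confidence) in responses.items():
        decision = validate_answer(confidence)
        if decision == "accept":
            accepted[question] = answer
        elif decision == "verify":
            to_verify[question] = answer
        else:
            unanswered.append(question)  # ask again in the next iteration (step 5)
    return accepted, to_verify, unanswered
```

Step 5 then loops: any question in the unanswered list is re-asked, and the new conversation is parsed again until the list is empty.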
Online QA System and Attention Visualization
To be able to test out various examples, we used the BiDAF [24] model for an
online demo. One can either choose from the available examples in the drop-down
menu or paste in their own passage and questions. While this is a useful and
interesting way to test the model through a user-friendly interface, we created
this system primarily to be able to focus on the incorrectly answered samples.
Figure 4-5. QA system interface with attention highlights over candidate answers
The candidate answers are highlighted in blue according to their
confidence values: the higher the confidence, the darker the highlight. The
candidate with the highest confidence value is chosen as the predicted answer. We
developed the system to show the attention spread over the candidate answers in
order to understand what needs to be done to improve the system. This led us to
realize the importance of including query attention as well as multilevel attention
in the BiDAF model, as described in the first section of this chapter.
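The confidence-to-shade mapping can be sketched as follows. This is a minimal illustration, not the actual interface code; the function name and the simple string-replacement rendering are assumptions, and a real implementation would work with answer spans rather than raw text matching.

```python
# Illustrative sketch: render candidate answers as HTML spans whose blue
# highlight darkens with confidence, as in Figure 4-5.
def highlight_candidates(passage, candidates):
    """candidates: list of (answer_text, confidence) pairs, confidences > 0."""
    max_conf = max(conf for _, conf in candidates)
    html = passage
    for text, conf in candidates:
        alpha = round(conf / max_conf, 2)  # darker highlight for higher confidence
        span = f'<span style="background: rgba(0, 0, 255, {alpha})">{text}</span>'
        html = html.replace(text, span)
    return html
```

The predicted answer is simply the candidate whose alpha is 1.0, i.e., the one with the maximum confidence.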
CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions
Having done a thorough analysis of the current methods and state-of-the-art
models for Machine Comprehension and the QA task, and having developed systems
on top of them, we have gained a strong sense of what needs to be done to further
improve QA models. After observing the incorrectly answered samples, we can see
that the system is still unable to encode meaning and is picking answers based on
statistical patterns in the occurrence of answers in the training examples.
Judging from the state-of-the-art models and the ongoing research literature,
it is easy to conclude that more features need to be embedded to encode the meaning
of words, phrases, and sentences. The Reinforced Mnemonic Reader for
Machine Comprehension [6] encoded POS and NER tags of words along with their word
and character embeddings, which gave better results.
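The feature scheme described above can be sketched as follows. This is a toy illustration of the idea, not the Reinforced Mnemonic Reader's actual code; the tag inventories and the 4-dimensional word vector are placeholder assumptions.

```python
# Sketch of the feature idea from the Reinforced Mnemonic Reader [6]:
# concatenate each word's embedding with one-hot POS and NER indicator features.
# Toy tag sets; real systems use the full Penn Treebank / CoNLL inventories.
POS_TAGS = ["NOUN", "VERB", "ADJ", "OTHER"]
NER_TAGS = ["PER", "LOC", "O"]


def one_hot(tag, tag_set):
    """One-hot indicator vector for `tag` over `tag_set`."""
    return [1.0 if tag == t else 0.0 for t in tag_set]


def augment_embedding(word_vec, pos, ner):
    """Return the word embedding extended with POS and NER one-hot features."""
    return word_vec + one_hot(pos, POS_TAGS) + one_hot(ner, NER_TAGS)
```

The resulting vector feeds into the encoder exactly like a plain word embedding, just with a few extra dimensions carrying linguistic annotation.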
Figure 5-1. An English-language syntactic parse tree [26]
We have developed a method to encode the syntactic parse tree of a sentence that
encodes not only the POS tags but also the relation of each word within its phrase
and the relation of each phrase within the whole sentence, in a hierarchical manner.
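One simple way to realize such a hierarchical encoding is to record, for each word, the chain of phrase labels from the sentence root down to its POS tag. The sketch below is an assumption-laden illustration of that idea, not our actual encoder; the `(label, children)` tuple format stands in for a parser's output such as NLTK's [26].

```python
# Sketch: for each leaf word, collect the path of constituent labels from the
# root to the word's POS tag. Trees are (label, children) tuples, where a
# preterminal is (POS_tag, word_string).
def leaf_paths(node, path=()):
    label, children = node
    if isinstance(children, str):  # preterminal: (POS tag, word)
        return [(children, path + (label,))]
    paths = []
    for child in children:
        paths.extend(leaf_paths(child, path + (label,)))
    return paths


# Toy parse of "He saw the dog" in the assumed tuple format.
tree = ("S", [
    ("NP", [("PRP", "He")]),
    ("VP", [("VBD", "saw"), ("NP", [("DT", "the"), ("NN", "dog")])]),
])
```

Each path, e.g. `("S", "VP", "NP", "NN")` for "dog", captures both the word's role in its phrase and the phrase's role in the sentence, and can then be embedded as an extra per-word feature.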
Finally, data augmentation is another way to get better results. One definite
way to reduce errors would be to include in the training data samples similar to
those the system falters on in the dev set: one could generate examples similar to
the failure cases and include them in the training set to obtain better predictions.
Another approach would be to train on similar and bigger datasets. Our models were
trained on the SQuAD dataset, but other datasets, such as TriviaQA, pose a similar
question answering task. We could augment the SQuAD training set with that of
TriviaQA to build a more robust system that generalizes better and thus predicts
answer spans more accurately.
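Pooling the two datasets amounts to normalizing their examples to one shape before training. The sketch below shows the idea only; the field names are illustrative assumptions, not the real SQuAD or TriviaQA schemas.

```python
# Hedged sketch of the augmentation idea: pool SQuAD-style training examples
# with TriviaQA examples normalized to a common (context, question, answer)
# shape. Field names here are illustrative, not either dataset's real schema.
def normalize(example):
    """Map an example from either assumed schema onto one common shape."""
    return {
        "context": example.get("context") or example.get("evidence", ""),
        "question": example["question"],
        "answer": example.get("answer") or example.get("answer_text", ""),
    }


def merge_training_sets(squad, triviaqa):
    """Return one pooled, normalized training set."""
    return [normalize(ex) for ex in squad + triviaqa]
```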
Conclusion
In this work we have explored the most fundamental techniques that have
shaped the current state of the art. We then proposed a minor architectural
improvement over an existing model. Furthermore, we developed two applications
that use the base model: first, we described how a chatbot application can be built
using the QA system, and second, we created a web interface where the model can be
run on any passage and question. This interface also shows the attention spread
over the candidate answers. While our effort to push the state of the art forward
is ongoing, we strongly believe that surpassing human-level accuracy on this task
will pay high dividends for society at large.
LIST OF REFERENCES
[1] Chen, Danqi. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U
[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.
[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension." arXiv preprint arXiv:1705.03551 (2017).
[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ Questions for Machine Comprehension of Text." arXiv preprint arXiv:1606.05250 (2016).
[5] "The Stanford Question Answering Dataset." 2018. rajpurkar.github.io. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer
[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced Mnemonic Reader for Machine Comprehension." CoRR abs/1705.02798 (2017).
[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.
[8] Mehrotra, Neetesh. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai
[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.
[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets
[11] Williams, Ronald J., and David Zipser. "A Learning Algorithm for Continually Running Fully Recurrent Neural Networks." Neural Computation 1, no. 2 (1989): 270-280.
[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long Short-Term Memory." Neural Computation 9, no. 8 (1997): 1735-1780.
[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling." arXiv preprint arXiv:1412.3555 (2014).
[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic Regularities in Continuous Space Word Representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.
[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global Vectors for Word Representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.
[16] "Word2Vec Tutorial - The Skip-Gram Model." 2016. Chris McCormick. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model
[17] "[Paper Introduction] Bilingual Word Representations with Monolingual Quality in Mind." 2015. SlideShare. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind
[18] "Attention Mechanism." 2016. blog.heuritech.com. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism
[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. 2014. "Memory Networks." arXiv preprint arXiv:1410.3916. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916
[20] Wang, Shuohang, and Jing Jiang. "Machine Comprehension Using Match-LSTM and Answer Pointer." arXiv preprint arXiv:1608.07905 (2016).
[21] Wang, Shuohang, and Jing Jiang. "Learning Natural Language Inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).
[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer Networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.
[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated Self-Matching Networks for Reading Comprehension and Question Answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.
[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional Attention Flow for Machine Comprehension." arXiv preprint arXiv:1611.01603 (2016).
[25] Clark, Christopher, and Matt Gardner. "Simple and Effective Multi-Paragraph Reading Comprehension." arXiv preprint arXiv:1710.10723 (2017).
[26] "8. Analyzing Sentence Structure." 2018. NLTK.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata, India, and has had a strong interest
in computers from an early age. After high school, he completed a B.Sc. in computer
science from Ramakrishna Mission Residential College, Narendrapur, followed by an
M.Sc. in computer science from St. Xavier's College, Kolkata. He had a strong
intuition about and interest in human-like learning systems and wanted to work in
this area. He started working at TCS Innovation Labs, Pune, on the application of
Natural Language Processing to educational applications. As he was deeply
passionate about learning systems that mimic the human brain and learn as a human
child does, he became increasingly interested in Deep Learning and its
applications. After working for a year, he went on to pursue a Master of Science
degree in computer science from the University of Florida, Gainesville. His
academic interests have been focused on Deep Learning and Natural Language
Processing, and he has been working on Machine Reading Comprehension since the
summer of 2017.
- ACKNOWLEDGMENTS
- LIST OF FIGURES
- LIST OF ABBREVIATIONS
- INTRODUCTION
- THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
-
- Neural Networks
- Convolutional Neural Network
- Recurrent Neural Networks (RNN)
- Word Embedding
- Attention Mechanism
- Memory Networks
-
- LITERATURE REVIEW AND STATE OF THE ART
-
- Machine Comprehension Using Match-LSTM and Answer Pointer
- R-NET Matching Reading Comprehension with Self-Matching Networks
- Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
- Summary
-
- MULTI-ATTENTION QUESTION ANSWERING
-
- Multi-Attention BiDAF Model
- Chatbot Design Using a QA System
- Online QA System and Attention Visualization
-
- FUTURE DIRECTIONS AND CONCLUSION
-
- Future Directions
- Conclusion
-
- LIST OF REFERENCES
- BIOGRAPHICAL SKETCH
-
36
Figure 4-5 QA system interface with attention highlight over candidate answers
The candidate answers are shown with the blue highlights as per their
confidence values The higher the confidence the darker the answer The highest
confidence value is chosen as the predicted answer We developed the system to show
attention spread of the candidate answers to realize what needs to be done to improve
the system This led us to realize the importance of including the query attention part as
well as multilevel attention on the BiDAF model as described in the first section of this
chapter
37
CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions
Having done a thorough analysis of the current methods and state of the arts in
Machine Comprehension and for the QA task and developing systems on it we have
achieved a strong sense of what needs to be done to further improve the QA models
After observing the wrong samples we can see that the system is still unable to encode
meaning and is picking answers based on statistical patterns of occurrence of answers
on training examples
As per the State of the Art models and analyzing the ongoing research literature
it is easy to conclude that more features need to be embedded to encode the meaning
of words phrases and sentences A paper called Reinforced Mnemonic Reader for
Machine Comprehension encoded POS and NER tags of words along with their word
and character embedding This gave them better results
Figure 5-1 An English language semantic parse tree [26]
38
We have developed a method to encode the syntax parse tree of a sentence that
not only encodes the post tags but the relation of the word within the phrase and the
relation of the phrase within the whole sentence in a hierarchical manner
Finally data augmentation is another solution to get better results One definite
way to reduce the errors would be to include similar samples in the training data which
the system is faltering in the dev set One could generate similar examples as the failure
cases and include them in the training set to have better prediction Another system
would be to train to similar and bigger datasets Our models were trained on the SQuAD
dataset There are other datasets too which does the similar question answering task
such as TriviaQA We could augment the training set of TriviaQA along with SQuAD to
have a more robust system that is able to generalize better and thus have higher
accuracy for predicting answer spans
Conclusion
In this work we have tried to explore the most fundamental techniques that have
shaped the current state of the art Then we proposed a minor improvement of
architecture over an existing model Furthermore we developed two applications that
uses the base model First we talked about how a chatbot application can be made
using the QA system and lastly we also created a web interface where the model can
be used for any Passage and Question This interface also shows the attention spread
on the candidate answers While our effort is ongoing to push the state of the art
forward we strongly believe that surpassing human level accuracy on this task will have
high dividends for the society at large
39
LIST OF REFERENCES
[1] Danqi Chen From Reading Comprehension To Open-Domain Question Answering 2018 Youtube Accessed March 16 2018 httpswwwyoutubecomwatchv=1RN88O9C13U
[2] Lehnert Wendy G The Process of Question Answering No RR-88 Yale Univ New Haven Conn Dept of computer science 1977
[3] Joshi Mandar Eunsol Choi Daniel S Weld and Luke Zettlemoyer Triviaqa A large scale distantly supervised challenge dataset for reading comprehension arXiv preprint arXiv170503551 (2017)
[4] Rajpurkar Pranav Jian Zhang Konstantin Lopyrev and Percy Liang Squad 100000+ questions for machine comprehension of text arXiv preprint arXiv160605250 (2016)
[5] The Stanford Question Answering Dataset 2018 RajpurkarGithubIo Accessed March 16 2018 httpsrajpurkargithubioSQuAD-explorer
[6] Hu Minghao Yuxing Peng and Xipeng Qiu Reinforced mnemonic reader for machine comprehension CoRR abs170502798 (2017)
[7] Schalkoff Robert J Artificial neural networks Vol 1 New York McGraw-Hill 1997
[8] Me For The AI and Neetesh Mehrotra 2018 The Connect Between Deep Learning And AI - Open Source For You Open Source For You Accessed March 16 2018 httpopensourceforucom201801connect-deep-learning-ai
[9] Krizhevsky Alex Ilya Sutskever and Geoffrey E Hinton Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems pp 1097-1105 2012
[10] An Intuitive Explanation Of Convolutional Neural Networks 2016 The Data Science Blog Accessed March 16 2018 httpsujjwalkarnme20160811intuitive-explanation-convnets
[11] Williams Ronald J and David Zipser A learning algorithm for continually running fully recurrent neural networks Neural computation 1 no 2 (1989) 270-280
[12] Hochreiter Sepp and Juumlrgen Schmidhuber Long short-term memory Neural computation 9 no 8 (1997) 1735-1780
[13] Chung Junyoung Caglar Gulcehre KyungHyun Cho and Yoshua Bengio Empirical evaluation of gated recurrent neural networks on sequence modeling arXiv preprint arXiv14123555 (2014)
40
[14] Mikolov Tomas Wen-tau Yih and Geoffrey Zweig Linguistic regularities in continuous space word representations In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics Human Language Technologies pp 746-751 2013
[15] Pennington Jeffrey Richard Socher and Christopher Manning Glove Global vectors for word representation In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) pp 1532-1543 2014
[16] Word2vec Tutorial - The Skip-Gram Model middot Chris Mccormick 2016 MccormickmlCom Accessed March 16 2018 httpmccormickmlcom20160419word2vec-tutorial-the-skip-gram-model
[17] Group 2015 [Paper Introduction] Bilingual Word Representations With Monolingual hellip SlideshareNet Accessed March 16 2018 httpswwwslidesharenetnaist-mtstudybilingual-word-representations-with-monolingual-quality-in-mind
[18] Attention Mechanism 2016 BlogHeuritechCom Accessed March 16 2018 httpsblogheuritechcom20160120attention-mechanism
[19] Weston Jason Sumit Chopra and Antoine Bordes 2014 Memory Networks arXiv preprint arXiv14103916v11 Accessed March 16 2018 httpsarxivorgabs14103916
[20] Wang Shuohang and Jing Jiang Machine comprehension using match-lstm and answer pointer arXiv preprint arXiv160807905 (2016)
[21] Wang Shuohang and Jing Jiang Learning natural language inference with LSTM arXiv preprint arXiv151208849 (2015)
[22] Vinyals Oriol Meire Fortunato and Navdeep Jaitly Pointer networks In Advances in Neural Information Processing Systems pp 2692-2700 2015
[23] Wang Wenhui Nan Yang Furu Wei Baobao Chang and Ming Zhou Gated self-matching networks for reading comprehension and question answering In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1 Long Papers) vol 1 pp 189-198 2017
[24] Seo Minjoon Aniruddha Kembhavi Ali Farhadi and Hannaneh Hajishirzi Bidirectional attention flow for machine comprehension arXiv preprint arXiv161101603 (2016)
[25] Clark Christopher and Matt Gardner Simple and effective multi-paragraph reading comprehension arXiv preprint arXiv171010723 (2017)
41
[26] 8 Analyzing Sentence Structure 2018 NltkOrg Accessed March 16 2018 httpwwwnltkorgbookch08html
42
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata India and had a strong interest for
computers since an early age After his high school he did his BSc in computer
science from Ramakrishna Mission Residential College Narendrapur followed by MSc
in computer science from St Xavierrsquos College Kolkata He had a strong intuition and
interest for human like learning systems and wanted to work in this area He started
working at TCS Innovation Labs Pune for the application of Natural Language
Processing in Educational Applications As he was deeply passionate about learning
system that mimic the human brain and learn like a human child does he was
increasing interested about Deep Learning and its applications After working for a year
he went on to pursue a Master of Science degree in computer science from the
University of Florida Gainesville His academic interests have been focused on Deep
Learning and Natural Language Processing and he has been working on Machine
Reading Comprehension since summer of 2017
- ACKNOWLEDGMENTS
- LIST OF FIGURES
- LIST OF ABBREVIATIONS
- INTRODUCTION
- THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
-
- Neural Networks
- Convolutional Neural Network
- Recurrent Neural Networks (RNN)
- Word Embedding
- Attention Mechanism
- Memory Networks
-
- LITERATURE REVIEW AND STATE OF THE ART
-
- Machine Comprehension Using Match-LSTM and Answer Pointer
- R-NET Matching Reading Comprehension with Self-Matching Networks
- Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
- Summary
-
- MULTI-ATTENTION QUESTION ANSWERING
-
- Multi-Attention BiDAF Model
- Chatbot Design Using a QA System
- Online QA System and Attention Visualization
-
- FUTURE DIRECTIONS AND CONCLUSION
-
- Future Directions
- Conclusion
-
- LIST OF REFERENCES
- BIOGRAPHICAL SKETCH
-
37
CHAPTER 5 FUTURE DIRECTIONS AND CONCLUSION
Future Directions
Having done a thorough analysis of the current methods and state of the arts in
Machine Comprehension and for the QA task and developing systems on it we have
achieved a strong sense of what needs to be done to further improve the QA models
After observing the wrong samples we can see that the system is still unable to encode
meaning and is picking answers based on statistical patterns of occurrence of answers
on training examples
As per the State of the Art models and analyzing the ongoing research literature
it is easy to conclude that more features need to be embedded to encode the meaning
of words phrases and sentences A paper called Reinforced Mnemonic Reader for
Machine Comprehension encoded POS and NER tags of words along with their word
and character embedding This gave them better results
Figure 5-1 An English language semantic parse tree [26]
38
We have developed a method to encode the syntax parse tree of a sentence that
not only encodes the post tags but the relation of the word within the phrase and the
relation of the phrase within the whole sentence in a hierarchical manner
Finally data augmentation is another solution to get better results One definite
way to reduce the errors would be to include similar samples in the training data which
the system is faltering in the dev set One could generate similar examples as the failure
cases and include them in the training set to have better prediction Another system
would be to train to similar and bigger datasets Our models were trained on the SQuAD
dataset There are other datasets too which does the similar question answering task
such as TriviaQA We could augment the training set of TriviaQA along with SQuAD to
have a more robust system that is able to generalize better and thus have higher
accuracy for predicting answer spans
Conclusion
In this work we have tried to explore the most fundamental techniques that have
shaped the current state of the art Then we proposed a minor improvement of
architecture over an existing model Furthermore we developed two applications that
uses the base model First we talked about how a chatbot application can be made
using the QA system and lastly we also created a web interface where the model can
be used for any Passage and Question This interface also shows the attention spread
on the candidate answers While our effort is ongoing to push the state of the art
forward we strongly believe that surpassing human level accuracy on this task will have
high dividends for the society at large
39
LIST OF REFERENCES
[1] Danqi Chen From Reading Comprehension To Open-Domain Question Answering 2018 Youtube Accessed March 16 2018 httpswwwyoutubecomwatchv=1RN88O9C13U
[2] Lehnert Wendy G The Process of Question Answering No RR-88 Yale Univ New Haven Conn Dept of computer science 1977
[3] Joshi Mandar Eunsol Choi Daniel S Weld and Luke Zettlemoyer Triviaqa A large scale distantly supervised challenge dataset for reading comprehension arXiv preprint arXiv170503551 (2017)
[4] Rajpurkar Pranav Jian Zhang Konstantin Lopyrev and Percy Liang Squad 100000+ questions for machine comprehension of text arXiv preprint arXiv160605250 (2016)
[5] The Stanford Question Answering Dataset 2018 RajpurkarGithubIo Accessed March 16 2018 httpsrajpurkargithubioSQuAD-explorer
[6] Hu Minghao Yuxing Peng and Xipeng Qiu Reinforced mnemonic reader for machine comprehension CoRR abs170502798 (2017)
[7] Schalkoff Robert J Artificial neural networks Vol 1 New York McGraw-Hill 1997
[8] Me For The AI and Neetesh Mehrotra 2018 The Connect Between Deep Learning And AI - Open Source For You Open Source For You Accessed March 16 2018 httpopensourceforucom201801connect-deep-learning-ai
[9] Krizhevsky Alex Ilya Sutskever and Geoffrey E Hinton Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems pp 1097-1105 2012
[10] An Intuitive Explanation Of Convolutional Neural Networks 2016 The Data Science Blog Accessed March 16 2018 httpsujjwalkarnme20160811intuitive-explanation-convnets
[11] Williams Ronald J and David Zipser A learning algorithm for continually running fully recurrent neural networks Neural computation 1 no 2 (1989) 270-280
[12] Hochreiter Sepp and Juumlrgen Schmidhuber Long short-term memory Neural computation 9 no 8 (1997) 1735-1780
[13] Chung Junyoung Caglar Gulcehre KyungHyun Cho and Yoshua Bengio Empirical evaluation of gated recurrent neural networks on sequence modeling arXiv preprint arXiv14123555 (2014)
40
[14] Mikolov Tomas Wen-tau Yih and Geoffrey Zweig Linguistic regularities in continuous space word representations In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics Human Language Technologies pp 746-751 2013
[15] Pennington Jeffrey Richard Socher and Christopher Manning Glove Global vectors for word representation In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) pp 1532-1543 2014
[16] Word2vec Tutorial - The Skip-Gram Model middot Chris Mccormick 2016 MccormickmlCom Accessed March 16 2018 httpmccormickmlcom20160419word2vec-tutorial-the-skip-gram-model
[17] Group 2015 [Paper Introduction] Bilingual Word Representations With Monolingual hellip SlideshareNet Accessed March 16 2018 httpswwwslidesharenetnaist-mtstudybilingual-word-representations-with-monolingual-quality-in-mind
[18] Attention Mechanism 2016 BlogHeuritechCom Accessed March 16 2018 httpsblogheuritechcom20160120attention-mechanism
[19] Weston Jason Sumit Chopra and Antoine Bordes 2014 Memory Networks arXiv preprint arXiv14103916v11 Accessed March 16 2018 httpsarxivorgabs14103916
[20] Wang Shuohang and Jing Jiang Machine comprehension using match-lstm and answer pointer arXiv preprint arXiv160807905 (2016)
[21] Wang Shuohang and Jing Jiang Learning natural language inference with LSTM arXiv preprint arXiv151208849 (2015)
[22] Vinyals Oriol Meire Fortunato and Navdeep Jaitly Pointer networks In Advances in Neural Information Processing Systems pp 2692-2700 2015
[23] Wang Wenhui Nan Yang Furu Wei Baobao Chang and Ming Zhou Gated self-matching networks for reading comprehension and question answering In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1 Long Papers) vol 1 pp 189-198 2017
[24] Seo Minjoon Aniruddha Kembhavi Ali Farhadi and Hannaneh Hajishirzi Bidirectional attention flow for machine comprehension arXiv preprint arXiv161101603 (2016)
[25] Clark Christopher and Matt Gardner Simple and effective multi-paragraph reading comprehension arXiv preprint arXiv171010723 (2017)
41
[26] 8 Analyzing Sentence Structure 2018 NltkOrg Accessed March 16 2018 httpwwwnltkorgbookch08html
42
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata India and had a strong interest for
computers since an early age After his high school he did his BSc in computer
science from Ramakrishna Mission Residential College Narendrapur followed by MSc
in computer science from St Xavierrsquos College Kolkata He had a strong intuition and
interest for human like learning systems and wanted to work in this area He started
working at TCS Innovation Labs Pune for the application of Natural Language
Processing in Educational Applications As he was deeply passionate about learning
system that mimic the human brain and learn like a human child does he was
increasing interested about Deep Learning and its applications After working for a year
he went on to pursue a Master of Science degree in computer science from the
University of Florida Gainesville His academic interests have been focused on Deep
Learning and Natural Language Processing and he has been working on Machine
Reading Comprehension since summer of 2017
- ACKNOWLEDGMENTS
- LIST OF FIGURES
- LIST OF ABBREVIATIONS
- INTRODUCTION
- THE BUILDING BLOCKS OF A QUESTION ANSWERING SYSTEM
-
- Neural Networks
- Convolutional Neural Network
- Recurrent Neural Networks (RNN)
- Word Embedding
- Attention Mechanism
- Memory Networks
-
- LITERATURE REVIEW AND STATE OF THE ART
-
- Machine Comprehension Using Match-LSTM and Answer Pointer
- R-NET Matching Reading Comprehension with Self-Matching Networks
- Bi-Directional Attention Flow (BiDAF) for Machine Comprehension
- Summary
-
- MULTI-ATTENTION QUESTION ANSWERING
-
- Multi-Attention BiDAF Model
- Chatbot Design Using a QA System
- Online QA System and Attention Visualization
-
- FUTURE DIRECTIONS AND CONCLUSION
-
- Future Directions
- Conclusion
-
- LIST OF REFERENCES
- BIOGRAPHICAL SKETCH
-
38
We have developed a method to encode the syntax parse tree of a sentence that
not only encodes the post tags but the relation of the word within the phrase and the
relation of the phrase within the whole sentence in a hierarchical manner
Finally data augmentation is another solution to get better results One definite
way to reduce the errors would be to include similar samples in the training data which
the system is faltering in the dev set One could generate similar examples as the failure
cases and include them in the training set to have better prediction Another system
would be to train to similar and bigger datasets Our models were trained on the SQuAD
dataset There are other datasets too which does the similar question answering task
such as TriviaQA We could augment the training set of TriviaQA along with SQuAD to
have a more robust system that is able to generalize better and thus have higher
accuracy for predicting answer spans
Conclusion
In this work we have tried to explore the most fundamental techniques that have
shaped the current state of the art Then we proposed a minor improvement of
architecture over an existing model Furthermore we developed two applications that
uses the base model First we talked about how a chatbot application can be made
using the QA system and lastly we also created a web interface where the model can
be used for any Passage and Question This interface also shows the attention spread
on the candidate answers While our effort is ongoing to push the state of the art
forward we strongly believe that surpassing human level accuracy on this task will have
high dividends for the society at large
39
LIST OF REFERENCES
[1] Chen, Danqi. "From Reading Comprehension to Open-Domain Question Answering." 2018. YouTube. Accessed March 16, 2018. https://www.youtube.com/watch?v=1RN88O9C13U
[2] Lehnert, Wendy G. The Process of Question Answering. No. RR-88. Yale Univ., New Haven, Conn., Dept. of Computer Science, 1977.
[3] Joshi, Mandar, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. "TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension." arXiv preprint arXiv:1705.03551 (2017).
[4] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ Questions for Machine Comprehension of Text." arXiv preprint arXiv:1606.05250 (2016).
[5] "The Stanford Question Answering Dataset." 2018. Rajpurkar.github.io. Accessed March 16, 2018. https://rajpurkar.github.io/SQuAD-explorer/
[6] Hu, Minghao, Yuxing Peng, and Xipeng Qiu. "Reinforced Mnemonic Reader for Machine Comprehension." CoRR abs/1705.02798 (2017).
[7] Schalkoff, Robert J. Artificial Neural Networks. Vol. 1. New York: McGraw-Hill, 1997.
[8] Mehrotra, Neetesh. 2018. "The Connect Between Deep Learning and AI." Open Source For You. Accessed March 16, 2018. http://opensourceforu.com/2018/01/connect-deep-learning-ai/
[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.
[10] "An Intuitive Explanation of Convolutional Neural Networks." 2016. The Data Science Blog. Accessed March 16, 2018. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
[11] Williams, Ronald J., and David Zipser. "A Learning Algorithm for Continually Running Fully Recurrent Neural Networks." Neural Computation 1, no. 2 (1989): 270-280.
[12] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long Short-Term Memory." Neural Computation 9, no. 8 (1997): 1735-1780.
[13] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling." arXiv preprint arXiv:1412.3555 (2014).
40
[14] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic Regularities in Continuous Space Word Representations." In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751. 2013.
[15] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global Vectors for Word Representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543. 2014.
[16] McCormick, Chris. "Word2Vec Tutorial - The Skip-Gram Model." 2016. McCormickML.com. Accessed March 16, 2018. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
[17] "[Paper Introduction] Bilingual Word Representations with Monolingual Quality in Mind." 2015. SlideShare.net. Accessed March 16, 2018. https://www.slideshare.net/naist-mtstudy/bilingual-word-representations-with-monolingual-quality-in-mind
[18] "Attention Mechanism." 2016. Blog.Heuritech.com. Accessed March 16, 2018. https://blog.heuritech.com/2016/01/20/attention-mechanism/
[19] Weston, Jason, Sumit Chopra, and Antoine Bordes. "Memory Networks." 2014. arXiv preprint arXiv:1410.3916v11. Accessed March 16, 2018. https://arxiv.org/abs/1410.3916
[20] Wang, Shuohang, and Jing Jiang. "Machine Comprehension Using Match-LSTM and Answer Pointer." arXiv preprint arXiv:1608.07905 (2016).
[21] Wang, Shuohang, and Jing Jiang. "Learning Natural Language Inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).
[22] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer Networks." In Advances in Neural Information Processing Systems, pp. 2692-2700. 2015.
[23] Wang, Wenhui, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. "Gated Self-Matching Networks for Reading Comprehension and Question Answering." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 189-198. 2017.
[24] Seo, Minjoon, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. "Bidirectional Attention Flow for Machine Comprehension." arXiv preprint arXiv:1611.01603 (2016).
[25] Clark, Christopher, and Matt Gardner. "Simple and Effective Multi-Paragraph Reading Comprehension." arXiv preprint arXiv:1710.10723 (2017).
41
[26] "8. Analyzing Sentence Structure." 2018. NLTK.org. Accessed March 16, 2018. http://www.nltk.org/book/ch08.html
42
BIOGRAPHICAL SKETCH
Purnendu Mukherjee grew up in Kolkata, India, and had a strong interest in
computers from an early age. After high school, he earned a B.Sc. in computer
science from Ramakrishna Mission Residential College, Narendrapur, followed by an
M.Sc. in computer science from St. Xavier's College, Kolkata. With a strong intuition
for and interest in human-like learning systems, he wanted to work in this area. He
started working at TCS Innovation Labs, Pune, on applications of Natural Language
Processing in education. Deeply passionate about learning systems that mimic the
human brain and learn as a human child does, he became increasingly interested in
Deep Learning and its applications. After working for a year, he went on to pursue a
Master of Science degree in computer science at the University of Florida,
Gainesville. His academic interests have been focused on Deep Learning and
Natural Language Processing, and he has been working on Machine Reading
Comprehension since the summer of 2017.