
Neural Structured Turing Machine

Hao Liu * 1 Xinyi Yang * 2 Zenglin Xu 2

Abstract

Differentiable external memory has greatly improved attention-based models such as the Neural Turing Machine (NTM); however, none of the existing attention models and memory-augmented attention models has considered real-valued structured attention, which may hinder their representation power. Here we take structured information into consideration and propose a Neural Structured Turing Machine (NSTM). Our preliminary experiment on a variable-length sequence copy task shows the benefit of such a structured attention mechanism, suggesting a novel perspective for improving mechanisms based on differentiable external memory, e.g., the NTM.

1. Introduction

Neural processes involving attention have been widely studied in neuroscience and computational neuroscience (Itti et al., 1998). Broadly, an attention model is a mechanism that takes n arguments and a context and returns a vector intended to summarize the arguments, focusing on the information linked to the context. More formally, it returns a weighted arithmetic mean of the arguments, where the weights are chosen according to the relevance of each argument given the context.

Attention mechanisms are commonly divided into soft attention and hard attention. Soft attention is a fully differentiable, deterministic mechanism that can be plugged into an existing system; gradients are propagated through the attention mechanism at the same time as they are propagated through the rest of the network. In contrast, hard attention is a stochastic process: instead of using all the hidden states as input for the decoding, the system samples a hidden state according to stochastic probabilities.

*Equal contribution. 1SMILE Lab & Yingcai Honors College, University of Electronic Science and Technology of China. 2SMILE Lab & Big Data Research Center, School of Computer Science and Engineering, University of Electronic Science and Technology of China. Correspondence to: Zenglin Xu <[email protected]>.

Proceedings of the ICML 17 Workshop on Deep Structured Prediction, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Commonly used approaches to estimate the gradient in the hard-attention case are Monte Carlo sampling and REINFORCE.

Recent work has advanced attention mechanisms to fully differentiable addressable memory, such as Memory Networks (Weston et al., 2014), Neural Turing Machines (Graves et al., 2014), and many other models (Kurach et al., 2016; Chandar et al., 2016; Gulcehre et al., 2016), all of which use an external memory that the model can read from (and possibly write to). Fully differentiable addressable memory enables a model to bind values to specific locations in data structures (Fodor & Pylyshyn, 1988). Differentiable memory matters because many everyday algorithmic tasks are hard to learn when a system works in absolutes, as conventional computers do. Most neural network models do not: they work with real numbers and smooth functions, which makes them much easier to train, because small parameter changes can be tracked back through the network to improve its behaviour.

However, none of the current attention models and other differentiable memory mechanisms considers the structured dependencies between memory locations and query content when addressing memory; instead they simply use content to compute a relevance score, which may hinder their representation power. Here we show how to use a general structured inference mechanism to model complex dependencies between memory locations and the query in the Neural Turing Machine (NTM), and we validate through experiments the benefit of incorporating real-valued structured attention into the NTM.

2. Model

2.1. Energy-Based Structured Attention

Let $m_{t-1}$ denote the memory (input) and $q_t$ the query. To enable query-based content search, $m_{t-1}$ is combined with $q_t$ via $x = m_{t-1}^{\top} W q_t$, where $W$ comes from a neural network (e.g., an MLP) or is a learned parameter matrix. Let $y = (y_1, y_2, \ldots, y_n)$ denote the attention values we want to predict, with $y_i \in [0, 1]$, and let $E(x, y, w): \mathbb{R}^K \rightarrow \mathbb{R}$ be an energy function, defined as
$$E = \sum_i f_i(y_i, x; w_u) + \sum_{\alpha} f_{\alpha}(y_{\alpha}, x; w_{\alpha}).$$


Figure 1. Neural Structured Turing Machine. (Architecture diagram: the controller issues a query $q_t$ against memory $m_{t-1}$ through query-based structured attention; the interpolate, convolve, and sharpen stages with parameters $g_t$, $s_t$, $\gamma$ produce the weighting used by the reading and writing heads, which update memory to $m_t$ via $e_t$, $a_t$ and return the read vector $r_t$.)

This energy can be viewed as a sum of unary terms plus higher-order energy terms. Our aim is to obtain
$$y^{\star} = \operatorname*{arg\,min}_{y \in \mathcal{Y}} \; \sum_i f_i(y_i, x; w_u) + \sum_{\alpha} f_{\alpha}(y_{\alpha}, x; w_{\alpha}).$$

Models of this kind are in general intractable, although a variety of methods have been developed in the context of predicting discrete outputs (Zheng et al., 2015b; Chen et al., 2015; Zheng et al., 2015a). A Markov random field (MRF) with Gaussian potentials is also real-valued, and when the precision matrix is positive semi-definite and satisfies the spectral radius condition (Weiss & Freeman, 2001), message passing allows exact inference. However, an MRF with Gaussian potentials may not be powerful enough to extract complex features from memory and to model complex dependencies between query and memory. We therefore turn to proximal methods as a remedy; for background on proximal methods, see (Parikh et al., 2014). We make the following assumptions: $f_i(y_i, x; w) = g_i(y_i, h_i(x, w))$ and $f_{\alpha}(y_{\alpha}, x; w_{\alpha}) = h_{\alpha}(x; w)\, g_{\alpha}(w_{\alpha}^{\top} y_{\alpha})$, so that
$$E = \sum_i g_i(y_i, h_i(x, w)) + \sum_{\alpha} h_{\alpha}(x; w)\, g_{\alpha}(w_{\alpha}^{\top} y_{\alpha}). \qquad (1)$$

The above problem can be solved effectively by a primal-dual method. We denote the output of the $t$-th iteration as $y_t^c$.
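To make the addressing concrete, the following NumPy sketch computes the bilinear query-content scores $x = m_{t-1}^{\top} W q_t$ and evaluates an energy with the structure of Eq. (1). The particular choices of $g_i$ (a quadratic attachment to the squashed scores) and $g_{\alpha}$ (a softplus of a linear statistic) are illustrative assumptions for the sketch, not the exact parameterization used in the paper.

```python
import numpy as np

def content_query_scores(M_prev, q, W):
    """Bilinear query-content interaction x = M_{t-1}^T W q_t.

    M_prev: (N, C) memory from the previous step, one row per location.
    q:      (C,)   query emitted by the controller.
    W:      (C, C) bilinear map (here just a fixed array).
    Returns x: (N,) one raw score per memory location.
    """
    return M_prev @ (W @ q)

def energy(y, x, w_alpha, h_alpha=1.0):
    """Illustrative instance of E = sum_i g_i(y_i, h_i(x, w))
    + sum_alpha h_alpha(x; w) g_alpha(w_alpha^T y_alpha).

    Unary term: quadratic attachment of y_i to the sigmoid-squashed score.
    Higher-order term: a softplus of a linear statistic of y.
    Both are assumptions made only for illustration.
    """
    unary = 0.5 * np.sum((y - 1.0 / (1.0 + np.exp(-x))) ** 2)
    higher_order = h_alpha * np.logaddexp(0.0, w_alpha @ y)  # softplus
    return unary + higher_order

# Toy usage: 8 memory rows, 4-dimensional content.
rng = np.random.default_rng(0)
M_prev = rng.standard_normal((8, 4))
q = rng.standard_normal(4)
W = rng.standard_normal((4, 4))
x = content_query_scores(M_prev, q, W)
y = 1.0 / (1.0 + np.exp(-x))          # y_i in [0, 1]
print(energy(y, x, w_alpha=rng.standard_normal(8)))
```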

2.2. Controller Parameters

Having shown how to address memory using the structured attention mechanism, we now turn to another addressing situation: in some cases we may want to read from specific memory locations instead of reading specific memory values. This is called location-based addressing, and implementing it requires three more stages. In the second stage, a scalar parameter $g_t \in (0, 1)$ called the interpolation gate blends the content weight vector $y_t^c$ with the previous time step's weight vector $y_{t-1}$ to produce the gated weighting $y_t^g$. This allows the system to learn when to use (or ignore) content-based addressing:
$$y_t^g \leftarrow g_t\, y_t^c + (1 - g_t)\, y_{t-1}.$$

We would also like the controller to be able to shift focus to other rows. Suppose that the range of allowable shifts is specified as one of the system's parameters: for example, a head's attention could shift forward one row (+1), stay still (0), or shift backward one row (-1). Shifts are taken modulo $R$, so that a shift forward at the bottom row of memory moves the head's attention to the top row, and similarly for a shift backward at the top row. After interpolation, each head emits a normalized shift weighting $s_t$ and applies the following convolutional shift:
$$y_t(i) \leftarrow \sum_{j} y_t^g(j)\, s_t(i - j).$$

To prevent the shift weighting $s_t$ from blurring the attention over time, we sharpen the resulting weights as
$$y_t(i) \leftarrow \frac{y_t(i)^{\gamma_t}}{\sum_j y_t(j)^{\gamma_t}},$$
where $g_t$, $s_t$, and $\gamma_t$ are controller parameters.
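A minimal NumPy sketch of the three location-based addressing stages (interpolation, circular convolutional shift, sharpening) is given below. The use of np.roll for the modulo-$R$ shift and the renormalization after exponentiation are implementation assumptions consistent with the description above, not the authors' code.

```python
import numpy as np

def interpolate(y_c, y_prev, g):
    """Interpolation gate: y^g = g * y^c + (1 - g) * y_{t-1}."""
    return g * y_c + (1.0 - g) * y_prev

def circular_shift(y_g, s):
    """Convolutional shift modulo R.

    y_g: (R,) gated weighting.
    s:   dict mapping integer shift (e.g. -1, 0, +1) to its weight.
    Mixes circular shifts of y_g according to the shift weighting s.
    """
    return sum(weight * np.roll(y_g, shift) for shift, weight in s.items())

def sharpen(y, gamma):
    """Sharpening: raise to the power gamma (>= 1) and renormalize
    (renormalization is assumed here so the weights remain a distribution)."""
    y = y ** gamma
    return y / y.sum()

# Toy usage with R = 5 memory rows.
y_c = np.array([0.1, 0.6, 0.1, 0.1, 0.1])
y_prev = np.full(5, 0.2)
y_g = interpolate(y_c, y_prev, g=0.9)
y_shift = circular_shift(y_g, s={-1: 0.1, 0: 0.8, +1: 0.1})
print(sharpen(y_shift, gamma=2.0))
```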

2.3. Writing

Writing involves two separate steps: erasing, then adding. To erase old data, a write head needs, in addition to our length-$R$ normalized weight vector $y_t$, a length-$C$ erase vector $e_t$. The erase vector is used in conjunction with the weight vector to specify which elements in a row should be erased, left unchanged, or something in between: if the weight vector tells us to focus on a row and the erase vector tells us to erase an element, the element in that row is erased. The write head then uses a length-$C$ add vector $a_t$ to complete the write:

$$m_t^{\text{erased}}(i) \leftarrow m_{t-1}(i)\,[\mathbf{1} - y_t(i)\, e_t],$$
$$m_t(i) \leftarrow m_t^{\text{erased}}(i) + y_t(i)\, a_t.$$
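A small NumPy sketch of the erase-then-add write follows, assuming memory of shape $(R, C)$; the variable names are illustrative.

```python
import numpy as np

def write(M_prev, y, e, a):
    """Erase-then-add write.

    M_prev: (R, C) memory at time t-1.
    y:      (R,)   write weighting over rows.
    e:      (C,)   erase vector, entries in [0, 1].
    a:      (C,)   add vector.
    Returns M_t of shape (R, C).
    """
    erased = M_prev * (1.0 - np.outer(y, e))   # m^erased(i) = m_{t-1}(i) [1 - y(i) e]
    return erased + np.outer(y, a)             # m_t(i) = m^erased(i) + y(i) a

# Toy usage: 4 rows of 3-dimensional content.
M_prev = np.ones((4, 3))
y = np.array([0.0, 1.0, 0.0, 0.0])
e = np.array([1.0, 0.0, 0.5])
a = np.array([0.3, 0.3, 0.3])
print(write(M_prev, y, e, a))
```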

2.4. Optimization and Learning

The general idea of primal-dual solvers is to introduce auxiliary variables $z$ to decompose the higher-order terms. We can then minimize over $y$ and $z$ alternately by computing their proximal operators. In particular, we can transform the primal problem in Eq. (1) into the following saddle-point problem:

$$E = \min_{y \in \mathcal{Y}} \max_{z \in \mathcal{Z}} \; \sum_i g_i(y_i, h_i(x, w)) - \sum_{\alpha} h_{\alpha}(x, w)\, g_{\alpha}^{\star}(z_{\alpha}) + \sum_{\alpha} h_{\alpha}(x, w)\, \langle w_{\alpha}^{\top} y_{\alpha}, z_{\alpha} \rangle,$$


where $g_{\alpha}^{\star}(\cdot)$ is the convex conjugate of $g_{\alpha}$, i.e., $g_{\alpha}^{\star}(z^{\star}) = \sup\{\langle z^{\star}, z \rangle - g_{\alpha}(z) \mid z \in \mathcal{Z}\}$. Note that the whole inference process has two stages: we first compute the unaries $h(x, w_u)$ with a forward pass, followed by MAP inference. This kind of optimization approach was also proposed in (Wang et al., 2016); note that both are special cases of (Domke, 2012).
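As a rough illustration of how such a saddle-point problem can be attacked, the sketch below runs plain alternating primal-dual updates (gradient ascent on $z$, projected gradient descent on $y$ with an extrapolation step) on a toy instance where $g_{\alpha}$ is a softplus, so its conjugate is the binary entropy on $(0, 1)$. The step sizes, potentials, and projection of $y$ onto $[0, 1]$ are assumptions made for the sketch; a faithful solver would use the proximal operators of the actual potentials.

```python
import numpy as np

def saddle_point_solve(x_scores, w_alpha, h_alpha=1.0, steps=200,
                       tau=0.1, sigma=0.1):
    """Toy primal-dual loop for min_y max_z g(y) - h g*(z) + h <w^T y, z>.

    Here g(y) = 0.5 ||y - sigmoid(x)||^2 (unary term) and, since the
    higher-order potential is a softplus, its conjugate is
    g*(z) = z log z + (1 - z) log(1 - z) with z in (0, 1).
    """
    target = 1.0 / (1.0 + np.exp(-x_scores))
    y = target.copy()
    y_bar = y.copy()
    z = 0.5                                    # scalar dual variable
    for _ in range(steps):
        # Dual ascent on z: gradient of <w^T y, z> - g*(z), scaled by h.
        grad_z = w_alpha @ y_bar - np.log(z / (1.0 - z))
        z = float(np.clip(z + sigma * h_alpha * grad_z, 1e-6, 1.0 - 1e-6))
        # Primal descent on y: gradient of g(y) + h z <w, y>.
        y_new = np.clip(y - tau * ((y - target) + h_alpha * z * w_alpha), 0.0, 1.0)
        y_bar = 2.0 * y_new - y                # extrapolation step
        y = y_new
    return y

rng = np.random.default_rng(0)
print(saddle_point_solve(rng.standard_normal(8), rng.standard_normal(8)))
```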

3. Experiments

The goal is not only to establish that the NSTM is able to solve these problems, but also that it is able to do so by learning compact internal programs. The hallmark of such solutions is that they generalise well beyond the range of the training data. For example, we were curious to see whether a network that had been trained to copy sequences of length up to 20 could copy a sequence of length 100 with no further training.

We compare the NSTM to prior work, namely an NTM with a feedforward controller, an NTM with an LSTM controller, and a standard LSTM network. All tasks were supervised learning problems with binary targets; all networks had logistic sigmoid output layers and were trained with the cross-entropy objective.
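For reference, the cost per sequence reported below can be computed as the per-sequence sum of bitwise binary cross-entropy in base 2; the sketch assumes sigmoid outputs and 0/1 targets.

```python
import numpy as np

def cost_per_sequence_bits(targets, predictions, eps=1e-9):
    """Sum of bitwise binary cross-entropy over one sequence, in bits.

    targets:     (T, B) array of 0/1 target bits.
    predictions: (T, B) array of sigmoid outputs in (0, 1).
    """
    p = np.clip(predictions, eps, 1.0 - eps)
    ce = -(targets * np.log2(p) + (1.0 - targets) * np.log2(1.0 - p))
    return ce.sum()

# A near-perfectly copied 3-step, 8-bit sequence costs close to 0 bits.
t = np.random.default_rng(0).integers(0, 2, size=(3, 8)).astype(float)
print(cost_per_sequence_bits(t, np.clip(t, 0.01, 0.99)))
```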

Figure 2. Results of the Neural Structured Turing Machine on different sequence lengths. The four pairs of plots in the top row depict network outputs and corresponding copy targets for test sequences of length 10, 20, 30, and 50, respectively. The plots in the bottom row are for a length-120 sequence. The network was only trained on sequences of up to length 20. Compared to the other methods shown in Tables 1 and 2, the first four sequences are reproduced by the Neural Structured Turing Machine with high confidence and very few mistakes.

3.1. Copy

To examine whether the NSTM can store and recall a long sequence of arbitrary information, we apply the NSTM to the copy task. This experiment was designed to test whether having a structured attention system allows for better performance. The network is presented with an input sequence of random binary vectors followed by a delimiter flag; a generator for such examples is sketched below. Storage and access of information over long time periods has always been problematic for RNNs and other dynamic architectures. Three different models are compared: a Neural Turing Machine with an LSTM controller, a basic LSTM, and the NSTM proposed in this paper. To qualify the NSTM's performance on longer sequences that were never seen before, the networks were trained to copy sequences of eight-bit random vectors, where the sequence lengths were randomised between 1 and 20. The target sequence was simply a copy of the input sequence (without the delimiter flag). Note that no inputs were presented to the network while it receives the targets, to ensure that it recalls the entire sequence with no intermediate assistance. Since the training sequences had between 1 and 20 random vectors, the LSTM and NTMs were compared using sequences of lengths 10, 20, 30, 50, and 120. Results of the NSTM are shown in Figure 2; compared to the NTM and LSTM results shown in Tables 1 and 2, the NSTM remembers longer and can recall a longer sequence of arbitrary information, so we conclude that enabling structured attention is beneficial.
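For concreteness, a minimal generator for one copy-task example is sketched here; the channel layout (an extra delimiter channel, blank inputs during recall) is an assumption, as the paper does not spell out the exact encoding.

```python
import numpy as np

def make_copy_example(rng, min_len=1, max_len=20, width=8):
    """One copy-task example: random binary vectors, a delimiter flag,
    then blank input steps during which the target is the original sequence.

    Inputs use width + 1 channels; the extra channel carries the delimiter.
    """
    length = rng.integers(min_len, max_len + 1)
    seq = rng.integers(0, 2, size=(length, width)).astype(float)

    inputs = np.zeros((2 * length + 1, width + 1))
    inputs[:length, :width] = seq
    inputs[length, width] = 1.0           # delimiter flag

    targets = np.zeros((2 * length + 1, width))
    targets[length + 1:] = seq            # recall phase: reproduce the sequence
    return inputs, targets

rng = np.random.default_rng(0)
x, y = make_copy_example(rng)
print(x.shape, y.shape)
```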

3.2. Repeat Copy

To compare the NTM and NSTM on learning a simple nested function, we use the repeat copy task, since it requires the model to output the copied sequence a specified number of times and then emit an end marker. The model is fed random-length sequences of random binary vectors, followed by a scalar value indicating the desired number of copies, which is presented through a separate input channel. The model must be able both to interpret the extra input and to keep count of the number of copies it has performed so far, in order to emit the end marker at the correct time. The models were trained to reproduce sequences of size-eight random binary vectors, where both the sequence length and the number of repetitions were chosen randomly from one to ten. The input representing the repeat number was normalised to have zero mean and unit variance; a data-generation sketch follows. A comparison between the Neural Structured Turing Machine with an LSTM controller, an LSTM, and NTMs with feedforward/LSTM controllers on the repeat copy learning task is shown in Figure 3. It shows that, compared to the NTM and LSTM, the Neural Structured Turing Machine converges faster, although all of them eventually converge to the optimum.
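A corresponding sketch for one repeat-copy example is given below; the channel layout, the end-marker channel, and the exact normalization of the repeat count are assumptions made for illustration.

```python
import numpy as np

def make_repeat_copy_example(rng, max_len=10, max_reps=10, width=8):
    """One repeat-copy example: a random binary sequence, a normalized
    repeat count on its own channel, and a target that repeats the
    sequence and ends with an end-marker channel.
    """
    length = rng.integers(1, max_len + 1)
    reps = rng.integers(1, max_reps + 1)
    seq = rng.integers(0, 2, size=(length, width)).astype(float)

    total_out = reps * length + 1                      # +1 for the end marker
    inputs = np.zeros((length + 1 + total_out, width + 2))
    inputs[:length, :width] = seq
    # Repeat count, normalized to roughly zero mean and unit variance.
    inputs[length, width] = (reps - (1 + max_reps) / 2.0) / np.sqrt(
        (max_reps ** 2 - 1) / 12.0)

    targets = np.zeros((length + 1 + total_out, width + 1))
    targets[length + 1:length + 1 + reps * length, :width] = np.tile(seq, (reps, 1))
    targets[-1, width] = 1.0                           # end marker
    return inputs, targets

rng = np.random.default_rng(0)
x, y = make_repeat_copy_example(rng)
print(x.shape, y.shape)
```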

4. Related Work

4.1. Attention Mechanism and Neural Turing Machine

In the Neural Turing Machine (Graves et al., 2014), the authors augment an attention model with differentiable external memory to enable neural networks to bind values to specific locations in a data structure by learning how to use this memory. (Kim et al., 2017) propose using a linear-chain CRF to model discrete attention in a structured way. However, their model is limited to the discrete setting, which means that the dependencies this kind of structured attention can model are rather limited.


Table 1. Results of the Neural Turing Machine on different length sequences in the copy task.

Table 2. Results of the Long Short-Term Memory network on different length sequences in the copy task.

(Figure 3 plot: cost per sequence (bits) versus sequence number (thousands, 0-500), with curves for LSTM, NTM with feedforward controller, NTM with LSTM controller, and NSTM with LSTM controller.)

Figure 3. Comparison between the Neural Structured Turing Machine with LSTM controller, LSTM, and NTM with feedforward/LSTM controller on the repeat copy learning task. The compared baselines, NTM and LSTM, use the same parameter settings as in (Graves et al., 2014).

(Mnih et al., 2014) view vision as a sequential decision task: the model sequentially chooses small windows of information in the data, integrating information from all past windows to make its next decision. (Gregor et al., 2015) combine an attention mechanism with a sequential variational auto-encoder and make both reading and writing sequential tasks. (Xu et al., 2015) use an attention-based model to dynamically select needed features based on the intuitions of attention.

4.2. Neural Structured Models

It is natural to combine MRFs or CRFs with neural networks to obtain (deep) neural structured models. (LeCun et al., 2006; Collobert et al., 2011; Chen et al., 2014; Schwing & Urtasun; Chen et al., 2015) parametrize the potentials of a CRF using deep features. Alternatively, non-iterative feed-forward predictors can be constructed using structured models as motivation (Domke, 2013; Li & Zemel, 2014; Zheng et al., 2015b). (Tompson et al., 2014) jointly train convolutional neural networks and a graphical model for pose estimation, and (Johnson et al., 2016) also combine graphical models with neural networks for behaviour modelling. (Meshi et al., 2010; Taskar et al., 2005) used local-consistency relaxation message-passing algorithms to do MAP inference in a logical MRF, using the dual objectives from fast message-passing algorithms for graphical model inference as the inner dual. (Schwing et al., 2012) used this for latent variable modelling, also using message-passing duals. (Chen et al., 2015) used the inner-dual method twice to even more aggressively dualize expensive inferences during latent variable learning. (Chavez, 2015) revisited the inner-dual idea to train deep models with structured predictors attached to them.

5. Conclusion and Future Work

The Neural Turing Machine is a promising path towards more general intelligence; it shows that access to external memory is crucial for learning and generalisation. This paper further improves upon this idea by considering a structured, real-valued addressing attention mechanism, and our preliminary experiments show the benefit of such structured attention in the Neural Turing Machine. Future work includes conducting more experiments on various tasks, including important vision and NLP tasks, and exploring the recurrent nature of the structured attention mechanism.

References

Chandar, A. P. Sarath, Ahn, Sungjin, Larochelle, Hugo, Vincent, Pascal, Tesauro, Gerald, and Bengio, Yoshua. Hierarchical memory networks. CoRR, 2016.

Chavez, Hugo. Paired-dual learning for fast training of latent variable hinge-loss MRFs: Appendices. 2015.

Chen, Liang-Chieh, Papandreou, George, Kokkinos, Iasonas, Murphy, Kevin, and Yuille, Alan L. Semantic image segmentation with deep convolutional nets and fully connected CRFs. ICCV, 2014.

Chen, Liang-Chieh, Schwing, Alexander, Yuille, Alan, and Urtasun, Raquel. Learning deep structured models. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1785–1794, 2015.


Collobert, Ronan, Weston, Jason, Bottou, Leon, Karlen, Michael, Kavukcuoglu, Koray, and Kuksa, Pavel. Natural language processing (almost) from scratch. JMLR, 12:2493–2537, 2011.

Domke, Justin. Generic methods for optimization-based modeling. In AISTATS, 2012.

Domke, Justin. Learning graphical model parameters with approximate marginal inference. PAMI, 2013.

Fodor, Jerry A and Pylyshyn, Zenon W. Connectionism and cognitive architecture: A critical analysis. Cognition, 28(1):3–71, 1988.

Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.

Gregor, Karol, Danihelka, Ivo, Graves, Alex, Rezende, Danilo Jimenez, and Wierstra, Daan. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.

Gulcehre, Caglar, Chandar, Sarath, Cho, Kyunghyun, and Bengio, Yoshua. Dynamic neural Turing machine with soft and hard addressing schemes. arXiv preprint arXiv:1607.00036, 2016.

Itti, Laurent, Koch, Christof, and Niebur, Ernst. A model of saliency-based visual attention for rapid scene analysis. PAMI, 20(11):1254–1259, 1998.

Johnson, Matthew J., Duvenaud, David K., Wiltschko, Alex, Adams, Ryan P., and Datta, Sandeep R. Composing graphical models with neural networks for structured representations and fast inference. In NIPS, 2016.

Kim, Yoon, Denton, Carl, Hoang, Luong, and Rush, Alexander M. Structured attention networks. arXiv preprint arXiv:1702.00887, 2017.

Kurach, Karol, Andrychowicz, Marcin, and Sutskever, Ilya. Neural random-access machines. ERCIM News, 2016.

LeCun, Yann, Chopra, Sumit, Hadsell, Raia, and Huang, Jie. A tutorial on energy-based learning. 2006.

Li, Yujia and Zemel, Richard S. Mean-field networks. CoRR, abs/1410.5884, 2014.

Meshi, Ofer, Sontag, David, Jaakkola, Tommi S., and Globerson, Amir. Learning efficiently with approximate inference via dual losses. In ICML, 2010.

Mnih, Volodymyr, Heess, Nicolas, Graves, Alex, et al. Recurrent models of visual attention. In NIPS, 2014.

Parikh, Neal, Boyd, Stephen, et al. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.

Schwing, Alexander G. and Urtasun, Raquel. Fully connected deep structured networks.

Schwing, Alexander G., Hazan, Tamir, Pollefeys, Marc, and Urtasun, Raquel. Efficient structured prediction with latent variables for general graphical models. In ICML, 2012.

Taskar, Ben, Chatalbashev, Vassil, Koller, Daphne, and Guestrin, Carlos. Learning structured prediction models: a large margin approach. In ICML, 2005.

Tompson, Jonathan, Jain, Arjun, LeCun, Yann, and Bregler, Christoph. Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS, 2014.

Wang, Shenlong, Fidler, Sanja, and Urtasun, Raquel. Proximal deep structured models. In NIPS, 2016.

Weiss, Yair and Freeman, William T. Correctness of belief propagation in Gaussian graphical models of arbitrary topology. Neural Computation, 2001.

Weston, Jason, Chopra, Sumit, and Bordes, Antoine. Memory networks. arXiv preprint arXiv:1410.3916, 2014.

Xu, Kelvin, Ba, Jimmy, Kiros, Ryan, Cho, Kyunghyun, Courville, Aaron, Salakhudinov, Ruslan, Zemel, Rich, and Bengio, Yoshua. Show, attend and tell: Neural image caption generation with visual attention. In ICML, pp. 2048–2057, 2015.

Zheng, Shuai, Jayasumana, Sadeep, Romera-Paredes, Bernardino, Vineet, Vibhav, Su, Zhizhong, Du, Dalong, Huang, Chang, and Torr, Philip H. S. Conditional random fields as recurrent neural networks. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1529–1537, 2015a.

Zheng, Shuai, Jayasumana, Sadeep, Romera-Paredes, Bernardino, Vineet, Vibhav, Su, Zhizhong, Du, Dalong, Huang, Chang, and Torr, Philip H. S. Conditional random fields as recurrent neural networks. In ICCV, pp. 1529–1537, 2015b.