lamBERT: Language and Action Learning Using Multimodal BERT


Kazuki Miyazawa1, Tatsuya Aoki1,2, Takato Horii1, and Takayuki Nagai1,3

Abstract— Recently, the bidirectional encoder representations from transformers (BERT) model has attracted much attention in the field of natural language processing, owing to its high performance in language understanding tasks. The BERT model learns language representations that can be adapted to various tasks via pre-training on a large corpus in an unsupervised manner. This study proposes the language and action learning using multimodal BERT (lamBERT) model, which enables the learning of language and actions by 1) extending the BERT model to multimodal representations and 2) integrating it with reinforcement learning. To verify the proposed model, an experiment is conducted in a grid environment that requires language understanding for the agent to act properly. The lamBERT model obtained higher rewards in multitask and transfer settings than other models, such as a convolutional neural network-based model and the lamBERT model without pre-training.

I. INTRODUCTION

Human support robots that can work together with us and support our lives in our surrounding environments are expected to be realized in the near future. However, despite the rapid development of AI and robotics in recent years, such robots have not yet been realized. One of the reasons is that it is currently difficult for such robots to "understand" the physical environment and human language and to "act" appropriately. As human language commands are ambiguous and the situation changes continually, the robot is required to keep adapting to such changes while acquiring knowledge based on its own experience in a bottom-up manner. It is therefore not practicable to collect a huge amount of labeled data in advance. In other words, the robot has to learn representations and actions from data collected through its own actions.

To fully understand complex environments, multiple modalities should be handled. The relationship between the modalities is the key to understanding the complex physical world. Research has been conducted on real-world understanding, including language acquisition, by robots [1]. In this context, "understanding" is defined as the prediction of unobserved information through concepts. A concept is a structure of multimodal information obtained via the robot's multiple sensors. By predicting information through concepts, the robot can predict unobserved information more quickly and accurately.

This research was supported by JST CREST (JPMJCR15E3) and JSPS KAKENHI Grant Number JP19J23364.

1 Graduate School of Engineering Science, Osaka University, Osaka, Japan {k.miyazawa, t.aoki}@rlg.sys.es.osaka-u.ac.jp, {takato, nagai}@sys.es.osaka-u.ac.jp

2 Graduate School of Informatics and Engineering, The University of Electro-Communications, Tokyo, Japan

3 Artificial Intelligence Exploration Research Center, The University of Electro-Communications, Tokyo, Japan

This is because detailed information, including noise, is discarded in the concept formation process, and the important information is retained. In [2], the authors proposed a model that categorizes the multimodal information obtained by real robots using a probabilistic generative model to form concepts and predict unobserved information. Moreover, this framework has been extended to a model that learns concepts and language simultaneously [3], [4]. Despite these achievements, some problems remain in multimodal representation learning with these probabilistic generative models. First, it is not clear whether the knowledge required by a robot operating in complex environments can be represented in the form of generative processes. Second, inference via latent variables can learn a multimodal representation suited to a specific task (classification, reinforcement learning, etc.); however, it might be difficult to acquire a representation that is effective for multiple tasks. In probabilistic generative models, the relationship between the latent variables is specified in the form of a directed acyclic graph (DAG), and it is unknown how the structural design of the DAG affects task performance. In any case, these factors can pose problems for a robot operating in complex environments. Although multimodal representation learning has also been studied with deep generative models such as variational autoencoders (VAEs) [5], [6], these models pose similar problems. It remains unclear which mechanism is suitable for learning multimodal representations.

Recently, deep reinforcement learning has greatly contributed to the development of robot action learning. Various models have been proposed, and their usefulness has been demonstrated on real robots [7]. Moreover, transferring the experience learned in one environment to a new environment is very important in real-robot learning, where the available samples are limited. To achieve this, several methods have been proposed in which meta-knowledge that can be used across multiple tasks is acquired and used to adapt to a new environment [8]. Furthermore, attempts are being made to learn knowledge applicable to multiple robots from a large amount of data acquired by the robots [9]. Thus, the problem of acquiring knowledge across multiple tasks and making appropriate decisions in a new environment is crucial, but not completely solved thus far.

In contrast, natural language processing (NLP) has in recent years begun to outperform humans in tasks that require language understanding, and it has therefore received considerable attention. These advances are based on language models, such as ELMo [10] and BERT [11], that perform unsupervised pre-training using a large text corpus.


In particular, the BERT model enables pre-training for various NLP tasks and is now considered a general pre-training model for language understanding. The pre-training is based solely on self-supervised learning, such as the "mask prediction task" and the "next sentence prediction task," which require no labels for the training data. The pre-trained BERT model scored the highest among all models on eight different NLP tasks.

Multimodal representation learning that can be applied to multiple tasks is also a critical requirement for robots. One of the recent developments concerning BERT is its extension to multimodal data: models that enable multimodal representation learning using the structure and learning mechanism of BERT have been proposed [12], [13], [14]. By further extending these frameworks, we believe it is possible for robots to understand the environment in a way that lets them respond to human language by connecting it to the real world, which is referred to as "symbol grounding." In addition, because the robot can act autonomously, it can continuously obtain information corresponding to the pre-training data of NLP tasks via its own actions and keep updating its knowledge of the world and language.

In this study, we propose lamBERT (language and action learning using multimodal BERT), in which the robot learns concepts from multiple sensory inputs using a multimodal BERT structure and learns actions using the learned multimodal representation and a reinforcement learning algorithm. We verify the usefulness of the model in a grid environment that requires language instructions to be understood, and show that the structure and learning method of the BERT model are effective for multitask learning and transfer learning.

II. RELATED WORK

Related work can be broadly classified into three viewpoints, each of which is described below.

A. Multimodal representation learning

For a robot to understand the real world, the organization of multimodal information and the prediction of unobserved information are indispensable processes. Various models that perform multimodal representation learning and enable such predictions have been proposed, as mentioned earlier. For example, JMVAE [5] and TELBO [15] use VAEs to acquire a multimodal representation through a latent variable z and enable the mutual prediction of information across modalities. In VideoBERT [12], ViLBERT [13], and ImageBERT [14], a multimodal representation of language and images is obtained using the structure of BERT, enabling cross-modal mutual prediction. The multimodal latent Dirichlet allocation (MLDA) [2] and spatial concept SLAM (SpCoSLAM) [16] are examples of multimodal representation learning using probabilistic generative models. By acquiring and structuring multimodal information, these models enable real robots to learn object and spatial concepts and to predict unobserved information.

B. Transfer learning

Various models have been proposed for learning and acting while adapting to multiple environments. UNREAL [17] shows that adding an auxiliary task to a reinforcement learning task improves performance in the environment to which the agent is transferred. Similarly, the performance of autonomous driving systems can be improved by adding auxiliary tasks, such as predicting depth and segmented images from the RGB image [18]. Moreover, as collecting data with a robot in the real world is expensive, a method called domain randomization, in which training is conducted in simulation and the model is then fine-tuned in the real world, has also attracted attention [19].

C. Integrated models

As previously mentioned, research focused solely on representation learning or reinforcement learning, as well as models that handle multimodal representation learning and reinforcement learning in a unified manner, has been proposed. Hermann et al. proposed a model for learning agent actions and language in a three-dimensional maze environment [20]. We previously proposed a model, which we call the integrated cognitive architecture, for learning concepts, language, and actions using a robot in the real world [21]. Vision-and-Language Navigation (VLN) is a task that requires dealing with multimodal information and making decisions, and models that address this task can be considered integrated models. Zhu et al. proposed a VLN agent model [22] that uses attention to integrate image and language information.

In this study, to realize an autonomous agent that can act by understanding human language, we propose the lamBERT model, which can learn language and actions using the structure and learning method of BERT employed in NLP tasks. This is realized by extending the BERT model to multimodal data and integrating the multimodal representation learning network with deep reinforcement learning. The contributions of this paper are twofold: 1) BERT is shown to be effective for multimodal representation learning in a reinforcement learning scenario, and 2) we demonstrate that BERT is suitable for transfer reinforcement learning. To the best of our knowledge, this is the first attempt to apply BERT to reinforcement learning.

III. PROPOSED METHOD

In this section, we briefly introduce BERT, followed by explanations of the structure and learning method of the proposed lamBERT model.

A. Overview of BERT

The BERT model is a pre-trained model used for solving NLP tasks. The model is trained on a large amount of text data in a self-supervised manner. Using this pre-training scheme, BERT can adapt to various tasks through fine-tuning. The BERT model consists of multilayered transformers that utilize a multi-head self-attention mechanism. The input to the BERT model is the sum of three embedding vectors.

[Fig. 1 diagram: the 7×7×2 observation is flattened into 98 vision tokens, combined with the mission words (e.g., "you must fetch a green key"), embedded with input, segment, and position embeddings, passed through transformer layers, and read out by a fully connected actor-critic head trained with a PPO loss and a mask loss.]

Fig. 1: Overview of lamBERT: lamBERT outputs policies π_θ(·|o_t) and values V_θ(o_t) using visual and language information as input tokens o_t. The visual information is input as a tokenized sequence of a flattened image Tok^v_n, and the language information is input as tokenized words Tok^l_m. Model learning is performed by inputting a masked token sequence and calculating the loss function. The sum of the "Proximal Policy Optimization (PPO) loss" and the "mask loss" is the loss of the entire model. The policy is calculated using the input tokens without masking at the time of action.

The token embeddings embed the language token information, the position embeddings embed the order of the words in the sequence, and the segment embeddings embed information for discriminating between two sentences. In the pre-training phase, the language model is trained by performing two unsupervised learning tasks on this input. One is the mask prediction task for predicting masked tokens, which is called the masked language model (Masked LM). The other is the next sentence prediction task, which determines whether the two sentences in the input are consecutive. By fine-tuning for a specific task using the weights obtained by pre-training, the BERT model has demonstrated high performance in various NLP tasks.

B. lamBERT model

An overview of the lamBERT model is shown in Fig. 1. The lamBERT model takes tokens of two modalities, vision tokens Tok^v_n and language tokens Tok^l_m, as input o_t, learns a multimodal representation of these two modalities through the BERT structure, and outputs the policy π_θ(·|o_t) and value V_θ(o_t) via an actor-critic network. The model parameters θ are updated by the reinforcement learning algorithm and the mask prediction task. We now explain the details of the lamBERT model.

1) Multimodal representation: The proposed model uses the BERT structure extended to multimodal input. This is realized by feeding the BERT structure a token sequence that combines visual and language information as the input. This method is inspired by models that handle multimodal information with the BERT structure [12], [13], [23]. There are various ways to input visual information to BERT; in this model, the vision token sequence uses a flattened image directly. The proposed model assumes that the visual information and language information are input simultaneously. In this research, we use the image observed by the agent in a grid environment, as shown in Fig. 1.

The input tokens are composed of the vision tokens and the language tokens by inserting the classification (CLS) token and separation (SEP) tokens, as shown in Fig. 1. If the sequence length of the input tokens is shorter than the maximum length, padding is performed using the PAD token. Position tokens and segmentation tokens are used to represent the token position information and the modality, respectively. The position tokens assign position IDs from 0 to the sequence length to the token string that combines the vision tokens and the language tokens. The segmentation tokens assign 0 to the vision tokens and 1 to the language tokens. For the final input to the model, the input tokens, position tokens, and segmentation tokens are mapped into a d-dimensional embedding space using learnable networks, and their sum is used as the input to the model.
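To make the token construction concrete, the following is a minimal PyTorch sketch of the scheme described above, not the authors' implementation. The special-token IDs, vocabulary size, and maximum length are illustrative assumptions; the embedding size of 64 follows Table I, and the 98 vision tokens correspond to the flattened 7×7×2 observation.

```python
import torch
import torch.nn as nn

# Assumed special-token IDs and sizes (illustrative, not the authors' settings).
CLS_ID, SEP_ID, PAD_ID = 0, 1, 2
VOCAB_SIZE, D_MODEL, MAX_LEN = 512, 64, 128

token_emb = nn.Embedding(VOCAB_SIZE, D_MODEL, padding_idx=PAD_ID)  # input-token embedding
pos_emb = nn.Embedding(MAX_LEN, D_MODEL)                           # position embedding
seg_emb = nn.Embedding(2, D_MODEL)                                  # segment: 0 = vision, 1 = language

def build_input(vision_tokens, language_tokens):
    """vision_tokens: 98 token IDs from the flattened 7x7x2 observation;
       language_tokens: token IDs of the mission words."""
    ids = [CLS_ID] + vision_tokens + [SEP_ID] + language_tokens + [SEP_ID]
    segs = [0] * (len(vision_tokens) + 2) + [1] * (len(language_tokens) + 1)
    # pad to the maximum sequence length with the PAD token
    ids += [PAD_ID] * (MAX_LEN - len(ids))
    segs += [0] * (MAX_LEN - len(segs))
    pos = list(range(MAX_LEN))  # position IDs 0 .. MAX_LEN-1
    ids, segs, pos = map(torch.tensor, (ids, segs, pos))
    # the final model input is the sum of the three embeddings, shape (MAX_LEN, d)
    return token_emb(ids) + seg_emb(segs) + pos_emb(pos)
```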

To learn the multimodal representation, the mask prediction task performed by the original BERT is also applied in the proposed method. The mask prediction task in the proposed model is performed as follows. First, 5% of the tokens are selected from the input tokens (vision + language) as mask prediction candidates. Then, 80% of the mask prediction candidates are replaced with the "MASK" token, 10% are replaced with another random token, and the remaining 10% are left unchanged. The masked token sequence o_{t\m_t} is input to lamBERT, and the lamBERT model learns to reconstruct the original token sequence. D masked token sequences o_{t\m_t} are created per token sequence o_t, and the model uses them as training data. For learning, the sum of the cross-entropy loss over only the tokens selected as mask candidates is used. The loss function of the mask prediction task is expressed as

$$L_{ml} = \frac{1}{|C|} \sum_{cid \in C} \left[ -\log \left( \frac{\exp(x_{cid})}{\sum_{id}^{ID} \exp(x_{id})} \right) \right], \qquad (1)$$

where x, ID, and cid ∈ C represent the final embedding vector of the masked token, the number of all token IDs, and the ID of a token selected as a mask candidate, respectively. |C| represents the total number of tokens selected as mask candidates.
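A minimal sketch of the corruption step and the loss of Eq. (1) is given below; the MASK token ID and helper names are assumptions, and for brevity every token is selected uniformly at the 5% rate, whereas the modality-balanced selection described next would replace that in practice.

```python
import torch
import torch.nn.functional as F

MASK_ID = 3          # assumed ID of the "MASK" token
MASK_RATE = 0.05     # 5% of the input tokens become mask prediction candidates

def mask_tokens(tokens, vocab_size):
    """tokens: 1-D LongTensor of token IDs. Returns the corrupted sequence and
       the candidate positions. (Special and PAD tokens would normally be excluded.)"""
    tokens = tokens.clone()
    cand = torch.rand(tokens.shape) < MASK_RATE           # ~5% mask candidates
    r = torch.rand(tokens.shape)
    to_mask = cand & (r < 0.8)                             # 80%: replace with MASK
    to_rand = cand & (r >= 0.8) & (r < 0.9)                # 10%: replace with random token
    tokens[to_mask] = MASK_ID                              # remaining 10%: left unchanged
    tokens[to_rand] = torch.randint(vocab_size, (int(to_rand.sum()),))
    return tokens, cand

def mask_loss(logits, original_tokens, cand):
    """logits: (seq_len, vocab_size). Eq. (1): cross-entropy averaged over candidates only."""
    if cand.sum() == 0:
        return logits.new_zeros(())
    return F.cross_entropy(logits[cand], original_tokens[cand])
```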

In modalities such as visual and language information, where the number of tokens differs between the modalities, the effect of the mask prediction task is dominated by the modality with many tokens. Therefore, when selecting mask candidates, we adjust the probability of selection. When the lengths of the modality sequences are N and M, respectively, each token is selected for mask prediction with probability P(Tok^v_n = mask) = 1/(2N) and P(Tok^l_m = mask) = 1/(2M).
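As a concrete example, with the 98 vision tokens of the flattened observation (N = 98) and a six-word mission such as "you must fetch a green key" (M = 6), each vision token would be selected with probability 1/196 and each word with probability 1/12, so the two modalities each contribute about half a masked token per sequence in expectation.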

2) Action learning: We use Proximal Policy Optimization (PPO) [24], a type of actor-critic algorithm, for action learning. The actor and the critic are computed from the last-layer embedding vector of the CLS token, as shown in Fig. 1. In this manner, action decisions become possible using the structure of BERT. PPO learning is performed based on the following formula, using data sampled by the agents acting in the environment.

$$L_{ppo} = -\min\left[ r_t(\theta) A_t,\ \mathrm{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon) A_t \right] + c_1 L_t^{VF}(\theta) - c_2 S[\pi_\theta](o_{t \backslash m_t}), \qquad (2)$$

where r_t(θ) is the probability ratio:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid o_{t \backslash m_t})}{\pi_{\theta_{old}}(a_t \mid o_t)}, \qquad (3)$$

where a_t, o_t, o_{t\m_t}, and A_t represent the agent action, the input tokens, the masked input tokens, and the estimated advantage, respectively. L_t^{VF}, S, and ε represent the squared-error loss for the value function, the entropy bonus, and the clipping hyperparameter, respectively. c_1 and c_2 are weights for the value loss and the entropy bonus, respectively. Note that the policy π_{θ_old} is calculated using the input tokens o_t during action, whereas the policy π_θ is calculated using the masked input tokens o_{t\m_t} during the model update.
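As a minimal sketch, the clipped objective of Eq. (2) could be written as follows; variable names are illustrative, the default coefficients mirror Table I, and the advantage, return, and entropy estimates are assumed to be computed elsewhere.

```python
import torch

def ppo_loss(logp_new, logp_old, advantages, values, returns, entropy,
             eps=0.2, c1=0.5, c2=0.01):
    """logp_new: log pi_theta(a_t | masked tokens); logp_old: log pi_theta_old(a_t | tokens)."""
    ratio = torch.exp(logp_new - logp_old)                      # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()         # clipped surrogate term
    value_loss = (values - returns).pow(2).mean()                # L^VF: squared-error value loss
    return policy_loss + c1 * value_loss - c2 * entropy.mean()   # Eq. (2)
```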

3) Learning: In reinforcement learning, unlike the problem setting in NLP, it is rare for a large amount of environment data to be available before learning in the environment; the agent must acquire the data and the necessary knowledge by itself. Therefore, unlike the BERT model used in NLP, the lamBERT model obtains the data used for mask prediction learning through trial and error in the environment. In other words, the lamBERT model does not have the two separate steps of pre-training and fine-tuning performed in the original BERT model. Instead, the lamBERT model learns the mask prediction task simultaneously with data collection and policy learning by reinforcement learning. The loss function of the lamBERT model is expressed as

$$L = L_{ml} + L_{ppo}. \qquad (4)$$

The learning flow of the lamBERT model is shown in Algorithm 1.
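The loop below is only an outline of that flow under the assumption that `collect_rollout`, `make_masked_copies`, and `compute_loss` (wrapping Eqs. (1), (2), and (4)) are provided by the surrounding code; it shows the control flow of acting, generating D masked copies per observation, and jointly optimizing the combined loss.

```python
import random

def train_iteration(collect_rollout, make_masked_copies, compute_loss,
                    optimizer, actors, T, D, K, minibatch_size):
    # 1) every actor acts for T timesteps with the current (old) policy
    samples = []
    for actor in actors:
        samples.extend(collect_rollout(actor, T))      # (o_t, a_t, advantage, return, ...)

    # 2) generate D masked token sequences per collected observation
    training_data = []
    for sample in samples:
        training_data.extend(make_masked_copies(sample, D))

    # 3) optimize L = Lml + Lppo for K epochs with minibatch size B
    for _ in range(K):
        random.shuffle(training_data)
        for start in range(0, len(training_data), minibatch_size):
            batch = training_data[start:start + minibatch_size]
            loss = compute_loss(batch)                 # Eq. (4): mask loss + PPO loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```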

4) Transfer: Transfer learning is performed by applying the model learned with the above algorithm in one environment to other environments. In the source environment, the lamBERT model adapts to that environment by updating the model with the PPO algorithm, and at the same time learns general knowledge, through the mask prediction task, that carries over to new languages and environments.

IV. EXPERIMENTS

To verify the effectiveness of the proposed method, we examined action learning and language learning in a multitask environment and the transfer of the learned knowledge to a new environment.

Algorithm 1 lamBERT model algorithm

1: for iteration = 1, 2, ... do
2:   for actor = 1, 2, ..., N do
3:     Act by policy π_{θ_old} for T timesteps
4:     Calculate advantage estimates A_1, ..., A_T
5:   end for
6:   for duple = 1, 2, ..., D do
7:     Generate T × N masked token sequences o_{t\m_t}
8:   end for
9:   Optimize L with respect to θ with K epochs and minibatch size B < N × T × D
10:  θ_old ← θ
11: end for

The experiments were conducted in a grid environment with language instructions. First, we describe the experimental environment and the experimental settings, and finally, we describe the experimental results.

1) Experimental environment: We used gym-minigrid [25] as the experimental environment. In this environment, objects (e.g., key, ball, and box) and doors are placed in the grid world. The agent can take any of seven actions: move forward, turn left, turn right, pick up an object, drop an object, toggle an object, and done. A language instruction (mission) in natural language is given as the task to be performed by the agent, and the agent receives a reward for taking actions that achieve this instruction.

Gym-minigrid provides various environments; in this study, we used three environments in which language understanding is needed for the agent to obtain the reward. Fig. 2 shows the three environments used in this study and the details of the language instructions, objects, and reward conditions given in each environment. In these environments, the visual information obtained by the agent is the 7×7 area of squares in front of the agent. In addition, the same language instruction is given at every step of an episode. The goal of the agent is to use these two modalities (visual information and language information) to make appropriate action decisions that reflect an understanding of the language and achieve the language instructions.
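As an illustration of how one observation could be turned into the two token sequences, a possible sketch follows; the dictionary keys, ID offsets, and helper name are assumptions for this example rather than gym-minigrid's exact interface.

```python
def observation_to_tokens(obs, word_to_id, type_offset=10, color_offset=30):
    """obs["image"]: 7x7 grid whose cells carry an object-type index and a color index;
       obs["mission"]: the language instruction string.
       Returns 98 vision tokens (7*7*2) and the mission word tokens."""
    vision_tokens = []
    for row in obs["image"]:
        for cell in row:
            obj_type, color = int(cell[0]), int(cell[1])
            vision_tokens.append(type_offset + obj_type)   # object-type channel token
            vision_tokens.append(color_offset + color)     # color-channel token
    language_tokens = [word_to_id[word] for word in obs["mission"].split()]
    return vision_tokens, language_tokens
```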

2) Experimental setting: To verify multitask learning and transfer learning, the agents were trained in the following three settings:
• From scratch: learning from scratch in the GoToObject environment
• Multitask learning: multitask learning in the Fetch environment and the GoToDoor environment
• Transfer learning: learning in the GoToObject environment using the agent trained in the above multitask setting

In multitask learning, the model was updated after obtaining the same number of samples from the two environments.

To compare the proposed method, the three models shown in Fig. 3 were trained in each of the above settings.

[Fig. 2 diagram: a table comparing the Fetch, GoToDoor, and GoToObject environments, listing the mission templates (e.g., "get a {color} {object}," "go to the {color} door," "go to the {color} {object}"), the available objects and colors (red, green, blue, purple, yellow, grey), the reward condition for each mission, the setting (multitask or transfer learning), and example missions such as "you must fetch a red ball," "go to the yellow door," and "go to the green box."]

Fig. 2: Experimental environments: Agents (shown as a red triangle) receive rewards only when they achieve the mission. The reward value is at most 1 and decays over time. The color and object of the mission are randomly selected from those present in the environment. For the Fetch environment's mission, one fixed phrase is randomly selected from the above missions. The agent's initial position and the object positions in each episode are set at random. The agent's visual information is the partial observation of the 7×7 squares in front of the agent (the highlighted area). The visual information has two channels of discrete values representing the type and color of the object.

TABLE I: Hyperparameters of the lamBERT models, shown separately for the BERT part and the PPO part.

BERT network: embedding token size 64; number of attention heads 4; number of hidden layers 3
PPO network: c_1 = 0.5; c_2 = 0.01; ε = 0.2

As comparison methods, we used the lamBERT model without the mask prediction loss (lamBERT (w/o ML)) and a convolutional neural network (CNN)-based model (CNN+GRU) that encodes the vision and language information with a CNN and a gated recurrent unit (GRU), respectively. When training lamBERT (w/o ML), the mask processing was applied to the input tokens during learning in the same way as in lamBERT (w/ ML); this was done to rule out the possibility that simply inputting masked data improves the generalization of the model. The hyperparameters of the lamBERT models are shown in Table I.

3) Experimental results: The mean and standard deviation (0.5σ) of the rewards for each model in each setting are shown in Fig. 4. Each setting and each model were trained three times with different random seed values. Looking first at the results of Fig. 4 (a) GoToObject, lamBERT and lamBERT (w/o ML) initially have lower average rewards than CNN+GRU, but as learning progresses, the lamBERT models obtain higher average rewards.

Next, we explain multitask learning. Fig. 4 (b) shows the result of learning in the Fetch environment and the GoToDoor environment at the same time.


Fig. 3: Comparison models: (a) lamBERT (proposed), (b) lamBERT (w/o ML), and (c) CNN+GRU. The hyperparameters of models (a) and (b) are equal. In models (a) and (b), mask processing is performed on the input tokens during learning, but not during action in the environment. Model (b) masks the input tokens during learning but does not update the network using the mask prediction loss L_ml; the black dots in the figure represent this.

This figure shows that the CNN+GRU model obtained a higher reward value early in learning, but as learning progressed, lamBERT and lamBERT (w/o ML) obtained higher reward values. In addition, the results around 600 to 800 k steps show that lamBERT obtained the highest average reward value.

Based on these results, we examined whether each model was able to make appropriate action decisions for each language instruction in the multitask environment. For this verification, each model learned in the multitask environment was again run in the multitask environment, and the reward obtained for each language instruction was summarized. The result is shown in Fig. 5. This figure shows that lamBERT received a high reward for all language instructions. For lamBERT (w/o ML), the reward value for the language instruction "go to the" was lower than the reward values for the other language instructions. Conversely, the reward value of the CNN+GRU model for the language instruction "go to the" was high, but its reward values for the other language instructions were low. The language instruction "go to the" belongs to the GoToDoor environment, and the other language instructions belong to the Fetch environment. In summary, lamBERT (w/o ML) and the CNN+GRU model could not act properly in both environments, whereas the lamBERT model could act properly in all environments. These results suggest that the lamBERT model's mask prediction task is effective for reinforcement learning tasks in a multitask environment.

Next, we explain the results of transfer learning. Fig. 4 (c) shows the result of learning in the GoToObject environment using a model learned in the multitask setting. Fig. 4 (c) shows that all models obtain higher reward values compared to learning from scratch (Fig. 4 (a)). Comparing the results of transfer learning, lamBERT obtained the highest reward values and reached a high reward value fastest among the compared methods.

[Fig. 4 plots: reward vs. training steps for lamBERT, lamBERT (w/o ML), and CNN+GRU in (a) from scratch, (b) multitask learning, and (c) transfer learning.]

Fig. 4: Reward value for each model in each environment; (a) from scratch in GoToObject, (b) multitask learning in Fetch and GoToDoor, and (c) transfer learning in GoToObject. The results for each model are the mean and standard deviation (0.5σ) of reward values over three random seeds. Each model updates its parameters every 8064 steps. In the transfer learning setting, the final model learned in the multitask setting was used as the initial weights. In the multitask setting, the number of samples obtained from each environment is equal.

[Fig. 5 plot: reward per language instruction ("fetch a," "get a," "go fetch a," "go get a," "go to the," "you must fetch a") for lamBERT, lamBERT (w/o ML), and CNN+GRU.]

Fig. 5: Reward value for each language instruction: The mean and standard deviation (0.5σ) of the reward values for each language instruction in the multitask setting are shown. For each model, the results of the models learned with three seed values are shown. For each language instruction, the reward values are averaged over colors and object names; hence, the figure shows the reward values for the different verb phrases.

The comparison methods, lamBERT (w/o ML) and CNN+GRU, obtained almost the same reward values as each other. These results suggest that the lamBERT model's mask prediction task is effective for transfer learning in reinforcement learning tasks.

Finally, we explain the attention weights. Fig. 6 shows the attention weights when the agent makes an action decision. Fig. 6 (a) shows the attention weights of the agent acting in the GoToDoor environment under the mission "go to the green door." Fig. 6 (b) shows the attention weights of the agent acting in the Fetch environment under the mission "fetch a yellow ball." The attention weights in Fig. 6 are those of the CLS token, because, as shown in Fig. 1, the actor-critic network uses the CLS token. The color of each attention weight indicates the attention head. These results were obtained using the lamBERT model learned in the multitask setting. There are two types of attention, namely, visual attention and language attention. The visual attention corresponds to the partial visual observation and is divided into object-type attention and color-channel attention, as the visual observation has two channels (object type and color). The language attention indicates the attention weight of each attention head for each word of the mission. In Fig. 6, the attention is shown only for the second layer of the lamBERT model.

Fig. 6 shows that the visual (object type) attention tends to take higher values at the positions of certain objects, such as a door or a ball. Further, the visual (color channel) attention of the green attention head in Fig. 6 (b) shows that the attention weight is high only for the "yellow" ball. Furthermore, the language attention of the green attention head shows a high value for the color word. These results indicate that the attention weights for important information are high. Moreover, as shown by the green attention head, the model appears to have obtained a multimodal attention representation.

V. DISCUSSION

The purpose of the present study is to show that, using the structure and learning method of BERT, representations that are effective for the action and language learning of autonomous agents can be learned. The experimental results show that the proposed method obtains higher rewards than the other models in the multitask and transfer learning settings. We discuss these results from two aspects: 1) BERT representation learning and 2) reinforcement learning.

1) BERT representation learning: The proposed method is a model in which reinforcement learning is applied as the fine-tuning task of BERT. By applying reinforcement learning to BERT, agents can perform data acquisition and the mask prediction task simultaneously. Thus, the model can be trained autonomously while collecting data to understand actions and language, with the mask prediction task in the current environment serving as pre-training for a new environment. The results (Fig. 6) show that attention is placed on important information for both the vision and language modalities. From these results, we consider that the lamBERT model learned a multimodal representation that is important for action decisions. The results of transfer learning in Fig. 4 (c) demonstrate that the agent with the proposed model adapted faster and gained higher rewards than the other models in the new environment. As described above, by performing the pre-training task (the mask prediction task) in the pre-transfer environment, the agent can better understand the language and the environment when acting in a new environment. We consider that, in the proposed method, representations specific to the environment were acquired by reinforcement learning, while general representations, such as grammatical structures and environmental structures, were acquired by the mask prediction task.

[Fig. 6 diagrams: per-step visual observations and attention weights (visual object-type, visual color-channel, and language attention) for (a) the GoToDoor environment with the mission "go to the green door" and (b) the Fetch environment with the mission "fetch a yellow ball."]

Fig. 6: Visualization of attention weights: This figure shows the attention weights for the CLS token at each step. The color of each attention weight indicates the attention head. The attention for visual information is shown separately for the two channels of the visual observation (object type and color). The attention for language information indicates the attention weight of each attention head for each word of the mission. The attention is shown only for the second layer of the lamBERT model. We used the lamBERT (w/ ML) agent learned in the multitask setting to obtain this result.

We consider that whether such representation learning has actually taken place can be analyzed using the attention weights, as shown in Fig. 6. Furthermore, by comparing the attention weights of the models with and without the mask prediction task, it should be possible to analyze the effect of the mask prediction task on decision making.

2) Reinforcement learning: The mask prediction task in the proposed model can be considered an auxiliary task, which is often used in reinforcement learning. An auxiliary task is designed in addition to the main task (cumulative reward maximization) in reinforcement learning, for example, predicting the reward value or predicting other modalities. As reported, such auxiliary tasks are advantageous for obtaining the agents' rewards [17], [20]. The proposed model can be regarded as a reinforcement learning model that uses the mask prediction task as an auxiliary task, and we consider this to be one reason that the mask prediction task was effective for obtaining rewards in the multitask environment and in transfer learning. A known problem with auxiliary tasks is that their effective design is unclear; using a mask prediction task as the auxiliary task is useful because no task-specific auxiliary task needs to be designed manually. Furthermore, if the relationship between the mask prediction task in BERT and auxiliary tasks in reinforcement learning becomes clearer, better mask candidates could be selected. In addition, from the viewpoint of reinforcement learning, the effects of auxiliary tasks may be interpreted by analyzing the attention weights of the BERT model.

3) Limitations of the proposed model: Finally, we describe three limitations. First, the proposed model cannot take anything other than a token sequence as input; in other words, it only handles discrete information, such as the language and visual information in the grid environment. However, robots need continuous multimodal information, such as sounds, haptics, and motor commands. This problem could be addressed by applying the mask prediction task to continuous values using regression. Second, the proposed model has difficulty processing high-dimensional data because self-attention requires a computational cost of O(n^2) for an input sequence of length n. Reducing the calculation cost and the number of parameters of self-attention is one solution, for which several methods have been proposed [26], [27]; we consider that these methods are also applicable to the proposed method.

Third, the proposed model cannot learn multimodal representations in which the modalities have dependencies over time, because the proposed method is trained using data at a single time t. In general, however, the modalities do have dependencies over time. For example, the language information observed by the agent is closely related not only to a certain time t but also to a state-action time series. Therefore, it is important to understand the multimodal relationship between such a state-action time series and the language information. One way to obtain this representation is to extend the input of the model to state-action time-series information.

VI. CONCLUSION

In this study, we proposed the lamBERT model, in which the robot learns concepts from multiple sensory inputs using the BERT structure and learns actions using the learned multimodal representation and reinforcement learning. We verified the usefulness of the model in a grid environment that requires language instructions to be understood, and showed that the structure and learning method of BERT are effective for multitask learning and transfer learning. In this study, we evaluated the model mainly by examining the agent's actions. In the future, we will analyze the internal representation of the model and the effects of representation learning on decision making. As further future work, we will extend the model to apply it to a real robot, considering the integration of continuous, high-dimensional sensory input and motor information with discrete language information. Furthermore, we aim to learn more complex language-environment relationships by extending the model over time.

REFERENCES

[1] T. Taniguchi, T. Nagai, T. Nakamura, N. Iwahashi, T. Ogata, and H. Asoh, "Symbol emergence in robotics: a survey," Advanced Robotics, vol. 30, no. 11-12, pp. 706–728, 2016.

[2] T. Nakamura, T. Nagai, and N. Iwahashi, "Grounding of word meanings in multimodal concepts using LDA," in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3943–3948, IEEE, 2009.

[3] T. Araki, T. Nakamura, T. Nagai, S. Nagasaka, T. Taniguchi, and N. Iwahashi, "Online learning of concepts and words using multimodal LDA and hierarchical Pitman-Yor language model," in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1623–1630, IEEE, 2012.

[4] T. Araki, T. Nakamura, and T. Nagai, "Long-term learning of concept and word by robots: Interactive learning framework and preliminary results," in 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2280–2287, IEEE, 2013.

[5] M. Suzuki, K. Nakayama, and Y. Matsuo, "Joint multimodal learning with deep generative models," arXiv preprint arXiv:1611.01891, 2016.

[6] Y. Shi, N. Siddharth, B. Paige, and P. Torr, "Variational mixture-of-experts autoencoders for multi-modal deep generative models," in Advances in Neural Information Processing Systems 32, pp. 15718–15729, 2019.

[7] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.

[8] C. Finn, P. Abbeel, and S. Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," in Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1126–1135, JMLR.org, 2017.

[9] S. Dasari, F. Ebert, S. Tian, S. Nair, B. Bucher, K. Schmeckpeper, S. Singh, S. Levine, and C. Finn, "RoboNet: Large-scale multi-robot learning," arXiv preprint arXiv:1910.11215, 2019.

[10] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, "Deep contextualized word representations," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237, 2018.

[11] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.

[12] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid, "VideoBERT: A joint model for video and language representation learning," in Proceedings of the IEEE International Conference on Computer Vision, pp. 7464–7473, 2019.

[13] J. Lu, D. Batra, D. Parikh, and S. Lee, "ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks," in Advances in Neural Information Processing Systems, pp. 13–23, 2019.

[14] D. Qi, L. Su, J. Song, E. Cui, T. Bharti, and A. Sacheti, "ImageBERT: Cross-modal pre-training with large-scale weak-supervised image-text data," arXiv preprint arXiv:2001.07966, 2020.

[15] R. Vedantam, I. Fischer, J. Huang, and K. Murphy, "Generative models of visually grounded imagination," arXiv preprint arXiv:1705.10762, 2017.

[16] A. Taniguchi, Y. Hagiwara, T. Taniguchi, and T. Inamura, "Online spatial concept and lexical acquisition with simultaneous localization and mapping," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 811–818, Sep. 2017.

[17] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu, "Reinforcement learning with unsupervised auxiliary tasks," arXiv preprint arXiv:1611.05397, 2016.

[18] Z. Li, T. Motoyoshi, K. Sasaki, T. Ogata, and S. Sugano, "Rethinking self-driving: Multi-task knowledge for better generalization and accident explanation ability," arXiv preprint arXiv:1809.11100, 2018.

[19] Y. Chebotar, A. Handa, V. Makoviychuk, M. Macklin, J. Issac, N. Ratliff, and D. Fox, "Closing the sim-to-real loop: Adapting simulation randomization with real world experience," in 2019 International Conference on Robotics and Automation (ICRA), pp. 8973–8979, IEEE, 2019.

[20] K. M. Hermann, F. Hill, S. Green, F. Wang, R. Faulkner, H. Soyer, D. Szepesvari, W. M. Czarnecki, M. Jaderberg, D. Teplyashin, et al., "Grounded language learning in a simulated 3D world," arXiv preprint arXiv:1706.06551, 2017.

[21] K. Miyazawa, T. Horii, T. Aoki, and T. Nagai, "Integrated cognitive architecture for robot learning of action and language," Frontiers in Robotics and AI, vol. 6, p. 131, 2019.

[22] F. Zhu, Y. Zhu, X. Chang, and X. Liang, "Vision-language navigation with self-supervised auxiliary reasoning tasks," arXiv preprint arXiv:1911.07883, 2019.

[23] D. Kiela, S. Bhooshan, H. Firooz, and D. Testuggine, "Supervised multimodal bitransformers for classifying images and text," arXiv preprint arXiv:1909.02950, 2019.

[24] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.

[25] M. Chevalier-Boisvert, D. Bahdanau, S. Lahlou, L. Willems, C. Saharia, T. H. Nguyen, and Y. Bengio, "BabyAI: First steps towards grounded language learning with a human in the loop," in International Conference on Learning Representations, 2019.

[26] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, "ALBERT: A lite BERT for self-supervised learning of language representations," arXiv preprint arXiv:1909.11942, 2019.

[27] N. Kitaev, Ł. Kaiser, and A. Levskaya, "Reformer: The efficient transformer," arXiv preprint arXiv:2001.04451, 2020.