
Structure and Interpretation of Neural Codes

Jacob Andreas

Translating Neuralese

Jacob Andreas, Anca Drăgan and Dan Klein

Learning to Communicate

[Wagner et al. 03, Sukhbaatar et al. 16, Foerster et al. 16]

Neuralese

[1.0, 2.3, -0.3, 0.4, -1.2, 1.1]

Translating neuralese

[1.0, 2.3, -0.3, 0.4, -1.2, 1.1] → “all clear”

• Interoperate with autonomous systems

• Diagnose errors

• Learn from solutions

[Lazaridou et al. 16]

Outline

Natural language & neuralese

Statistical machine translation

Semantic machine translation

Implementation details

Evaluation


A statistical MT problem

[1.0, 2.3, -0.3, 0.4, -1.2, 1.1] → “all clear”   [e.g. Koehn 10]

Write $z$ for a neuralese message and $z'$ for a candidate natural-language translation. The standard noisy-channel criterion picks

$$\arg\max_{z'} \; p(z \mid z')\, p(z')$$

How do we induce a translation model? There is no parallel data, but both kinds of messages are grounded in the same world states $s$, so we can marginalize over states:

$$\arg\max_{z'} \; \sum_s p(z \mid s)\, p(z' \mid s)\, p(s)$$
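A minimal sketch of this marginalized criterion, assuming hypothetical callables p_msg(z, s) and p_text(zp, s) for the two grounded speaker models and a dict p_state of state priors (these names are illustrative stand-ins; none appear in the talk):

```python
def translation_score(z, zp, states, p_state, p_msg, p_text):
    # score(z') = sum_s p(z | s) * p(z' | s) * p(s)
    return sum(p_msg(z, s) * p_text(zp, s) * p_state[s] for s in states)

def best_translation_statistical(z, candidates, states, p_state, p_msg, p_text):
    # pick the candidate natural-language message with the highest
    # co-occurrence score against the neuralese message z
    return max(candidates,
               key=lambda zp: translation_score(z, zp, states, p_state,
                                                p_msg, p_text))
```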

Strategy mismatch

Ask several speakers a hard question, say about the value of

$$\gamma(s) = \frac{1}{\Gamma(s)} \int_0^1 e^{-x}\, x^{s-1}\, dx$$

One speaker answers “not sure”, another “dunno”; other speakers in the same state answer “yes”, “yes”, “no”, “yes”.

“not sure” and “dunno” mean the same thing, but the co-occurrence score

$$\sum_s p(z \mid s)\, p(\textit{not sure} \mid s)\, p(s)$$

can just as easily pair a neuralese message with “yes” as with “not sure”: it rewards messages that are produced in the same states, not messages that have the same meaning.

Stat MT criterion doesn’t capture meaning

[Driving example: an agent moving (0,3) → (1,4) has its message translated as “in the intersection”.]


A “semantic MT” problem

“I’m going north”

The meaning of an utterance is given by its truth conditions [Davidson 67]: the set of states in which it is true (✔ ✔ ✘), classically written as a logical form such as (loc(goal-blue) north).

Softening truth conditions: the meaning is the distribution over states in which the utterance is uttered (e.g. 0.4, 0.2, 0.001) [Beltagy et al. 14],

or equivalently, the belief it induces in listeners [Frank et al. 09, A & Klein 16].

Representing meaning

The meaning of an utterance is given by its truth conditions,

the distribution over states in which it is uttered,

or equivalently, the belief it induces in listeners.

This distribution is well-defined even if the “utterance” is a vector rather than a sequence of tokens.
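Since this belief is just the conditional distribution over states given a message, it can be estimated directly from gameplay logs. A hedged sketch, assuming `rollouts` is a hypothetical list of (state, message) pairs and that messages have been discretized (e.g. clustered) so they can key a dictionary:

```python
from collections import Counter, defaultdict

def estimate_beliefs(rollouts):
    """Empirical listener belief beta(z) = p(s | z) for each message z."""
    counts = defaultdict(Counter)
    for state, message in rollouts:
        counts[message][state] += 1
    beliefs = {}
    for message, state_counts in counts.items():
        total = sum(state_counts.values())
        beliefs[message] = {s: c / total for s, c in state_counts.items()}
    return beliefs
```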

Translating with meaning

[1.0, 2.3, -0.3, 0.4, -1.2, 1.1] → “In the intersection”? “I’m going north”?

Compare the state distributions the two messages induce: $p(s \mid z)$ for the neuralese message and $p(s \mid z')$ for the candidate translation. Write these listener beliefs as $\beta(z)$ and $\beta(z')$.

Interlingua!

source text → $\beta(z)$ ≈ $\beta(z')$ ← target text

Translation criterion

$$\arg\min_{z'} \; \mathrm{KL}\!\left( \beta(z) \,\|\, \beta(z') \right)$$

Computing representations: sparsity

To evaluate $\arg\min_{z'} \mathrm{KL}(\beta(z) \,\|\, \beta(z'))$ we need the beliefs $p(s \mid z)$ and $p(s \mid z')$. The raw empirical distributions are badly sparse: most messages (especially real-valued neuralese vectors) are observed only a handful of times.

Computing representations: smoothing

Instead of raw counts, smooth with learned models: fit an agent model to the actions & messages produced by the agent policy, and a human model to the actions & messages produced by human players. These models supply $p(z \mid s)$ for arbitrary states.

The smoothed models give each message a full distribution over sampled states, e.g. (state labels illustrative):

            s1     s2     s3
  β(z)     0.10   0.05   0.13
  β(z')    0.08   0.01   0.22
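A sketch of this smoothing step under the same assumptions as above, with `speaker_model(z, s)` standing in for a trained agent (or human) model that returns p(z | s); β(z) is then obtained by Bayes’ rule over a sampled set of states:

```python
def smoothed_belief(z, states, p_state, speaker_model):
    """beta(z) via Bayes' rule: p(s | z) proportional to p(z | s) * p(s)."""
    unnorm = {s: speaker_model(z, s) * p_state[s] for s in states}
    total = sum(unnorm.values()) or 1.0  # guard against an all-zero row
    return {s: v / total for s, v in unnorm.items()}
```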

Computing KL

$$\arg\min_{z'} \; \mathrm{KL}\!\left( \beta(z) \,\|\, \beta(z') \right)$$

$$\mathrm{KL}(p \,\|\, q) = \mathbb{E}_{p}\!\left[ \log \frac{p(s)}{q(s)} \right]$$

Computing KL: sampling

Approximate the expectation as a sum over sampled states $s_i$:

$$\mathrm{KL}(p \,\|\, q) = \sum_i p(s_i) \log \frac{p(s_i)}{q(s_i)}$$
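The sampled KL translates almost directly into code. A minimal sketch; the `eps` floor on q is an implementation detail the slides do not specify:

```python
import math

def kl_divergence(p, q, states, eps=1e-12):
    """KL(p || q) = sum_i p(s_i) * log(p(s_i) / q(s_i)) over sampled states."""
    return sum(p[s] * math.log(p[s] / max(q.get(s, 0.0), eps))
               for s in states
               if p.get(s, 0.0) > 0.0)
```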

Finding translations: brute force

Evaluate $\arg\min_{z'} \mathrm{KL}(\beta(z) \,\|\, \beta(z'))$ by scoring every candidate message and keeping the best:

  going north                 0.5
  crossing the intersection   2.3
  I’m done                    0.2   ← argmin
  after you                   9.7
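Putting the pieces together, a brute-force translator is just an argmin over candidates, reusing the (assumed) `smoothed_belief` and `kl_divergence` helpers sketched above:

```python
def translate(z, candidates, states, p_state, agent_model, human_model):
    """Return the candidate message whose induced belief best matches z's."""
    beta_z = smoothed_belief(z, states, p_state, agent_model)
    return min(candidates,
               key=lambda c: kl_divergence(
                   beta_z,
                   smoothed_belief(c, states, p_state, human_model),
                   states))
```

On the slide’s example candidates, this argmin would return “I’m done” (KL 0.2).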


Referring expression games

[1.0, 2.3, -0.3, 0.4, -1.2, 1.1] → “orange bird with black face”

Evaluation: translator-in-the-loop

Put the translator between a neuralese speaker and a (model) human listener, or vice versa, and measure task success through the translation layer.

Experiment: color references

[Bar chart: state-guessing accuracy, axis 0.50–1.00]

  Neuralese → Neuralese: 1.00
  English → English*: 0.83
  Statistical MT: Neuralese → English* 0.72 · English → Neuralese 0.70
  Semantic MT: Neuralese → English* 0.86 · English → Neuralese 0.73

…translation), we focus on developing automatic measures of system performance. We use the available training data to develop simulated models of human decisions; by first showing that these models track well with human judgments, we can be confident that their use in evaluations will correlate with human understanding. We employ the following two metrics:

Belief evaluation  This evaluation focuses on the denotational perspective in semantics that motivated the initial development of our model. We have successfully understood the semantics of a message z_r if, after translating z_r ↦ z_h, a human listener can form a correct belief about the state in which z_r was produced. We construct a simple state-guessing game where the listener is presented with a translated message and two state observations, and must guess which state the speaker was in when the message was emitted.

When translating from natural language to neuralese, we use the learned agent model to directly guess the hidden state. For neuralese to natural language we must first construct a “model human listener” to map from strings back to state representations; we do this by using the training data to fit a simple regression model that scores (state, sentence) pairs using a bag-of-words sentence representation. We find that our “model human” matches the judgments of real humans 83% of the time on the colors task, 77% of the time on the birds task, and 77% of the time on the driving task. This gives us confidence that the model human gives a reasonably accurate proxy for human interpretation.

Behavior evaluation  This evaluation focuses on the cooperative aspects of interpretability: we measure the extent to which learned models are able to interoperate with each other by way of a translation layer. In the case of reference games, the goal of this semantic evaluation is identical to the goal of the game itself (to identify the hidden state of the speaker), so we perform this additional pragmatic evaluation only for the driving game. We found that the most data-efficient and reliable way to make use of human game traces was to construct a “deaf” model human. The evaluation selects a full game trace from a human player, and replays both the human’s actions and messages exactly (disregarding any incoming messages); the evaluation measures the quality of the natural-language-to-neuralese translator, and the extent to which the learned agent model can accommodate a (real) human given translations of the human’s messages.

Baselines  We compare our approach to two baselines: a random baseline that chooses a translation of each input uniformly from messages observed during training, and a direct baseline that directly maximizes p(z′ | z) (by analogy to a conventional machine translation system). This is accomplished by sampling from a DCP speaker in training states labeled with natural language strings.

Table 1: Evaluation results for reference games. (a) The colors task. (b) The birds task. Whether the model human is in a listener or speaker role, translation based on belief matching outperforms both random and machine translation baselines.

(a) Colors task
  listener R, speaker R: 1.00
  listener R, speaker H: random 0.50 · direct 0.70 · belief (ours) 0.73
  listener H*, speaker R: random 0.50 · direct 0.72 · belief (ours) 0.86
  listener H*, speaker H: 0.83

(b) Birds task
  listener R, speaker R: 0.95
  listener R, speaker H: random 0.50 · direct 0.55 · belief (ours) 0.60
  listener H*, speaker R: random 0.50 · direct 0.57 · belief (ours) 0.75
  listener H*, speaker H: 0.77

Figure 7: Best-scoring translations generated for color task: “magenta, hot, rose, violet, purple”; “magenta, hot, violet, rose, purple”; “olive, puke, pea, grey, brown”; “pinkish, grey, dull, pale, light”.
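A hedged sketch of the “model human listener” described above: a regression model scoring (state, sentence) pairs from a bag-of-words sentence representation. The feature layout and the use of logistic regression with sampled negative pairs are illustrative assumptions, not the paper’s stated setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def bow_features(sentence, vocab):
    """Bag-of-words count vector over a fixed word-to-index vocabulary."""
    vec = np.zeros(len(vocab))
    for word in sentence.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1.0
    return vec

def fit_model_human(examples, vocab):
    """examples: (state_features, sentence, label) triples, where label is 1
    if the sentence was actually produced in that state and 0 for sampled
    negative pairs (negative sampling is our assumption, not the paper's)."""
    X = np.stack([np.concatenate([np.asarray(s), bow_features(t, vocab)])
                  for s, t, _ in examples])
    y = np.array([label for _, _, label in examples])
    return LogisticRegression(max_iter=1000).fit(X, y)
```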


Experiment: image references

[Bar chart: state-guessing accuracy for the birds task, axis 50–95, comparing Neuralese → Neuralese and English → English* against Statistical MT and Semantic MT in both directions; see Table 1(b) for the underlying numbers.]

…alization and ablation techniques used in previous work on understanding complex models (Strobelt et al., 2016; Ribeiro et al., 2016).

While structurally quite similar to the task of machine translation between pairs of human languages, interpretation of neuralese poses a number of novel challenges. First, there is no natural source of parallel data: there are no bilingual “speakers” of both neuralese and natural language. Second, there may not be a direct correspondence between the strategy employed by humans and DCP agents: even if it were constrained to communicate using natural language, an automated agent might choose to produce a different message from humans in a given state. We tackle both of these challenges by appealing to the grounding of messages in gameplay. Our approach is based on one of the core insights in natural language semantics: messages (whether in neuralese or natural language) have similar meanings when they induce similar beliefs about the state of the world. We explore several related questions:

• What makes a good translation, and under what conditions is translation possible at all (Section 4)?

• How can we build a model to translate between neuralese and natural language (Section 5)?

• How does this model trade off the competing demands of translatability and accuracy in learned policies (Section 6)?

We conclude by applying our approach to agents trained on a number of different tasks, finding that it outperforms a more conventional machine translation criterion both when attempting to interoperate with a neuralese speaker and when predicting its hidden state.

2 Related work

A variety of approaches for learning deep policies with communication were proposed essentially simultaneously in the past year. We have broadly labeled these as “deep communicating policies”; concrete examples include Lazaridou et al. (2016b), Foerster et al. (2016), and Sukhbaatar et al. (2016). The policy representation we employ in this paper is similar to the latter two of these, although the general framework is agnostic to low-level modeling details and could be straightforwardly applied to other architectures. Analysis of communication strategies in all these papers has been largely ad hoc, obtained by clustering states from which similar messages are emitted and attempting to manually assign semantics to these clusters. The present work aims at developing tools for performing this analysis automatically.

Most closely related to our approach is that of Lazaridou et al. (2016a), who also develop a model for assigning natural language interpretations to learned messages; however, this approach relies on a combination of supervised cluster labels and is targeted specifically towards referring expression games. Here we attempt to develop an approach that can handle general multiagent interactions without assuming a prior discrete structure in the space of observations.

The literature on learning decentralized multiagent policies in general is considerably larger (Bernstein et al., 2002; Dibangoye et al., 2016). This includes work focused on communication in multiagent settings (Roth et al., 2005) and even communication using natural language messages (Vogel et al., 2013b). All of these approaches employ structured communication schemes with manually engineered messaging protocols; these are, in some sense, automatically interpretable, but at the cost of introducing considerable complexity into both training and inference.

Our evaluation in this paper investigates communication strategies that arise in a number of different games, including reference games and an extended-horizon driving game. Communication strategies for reference games were previously explored by Vogel et al. (2013a), Andreas and Klein (2016) and Kazemzadeh et al. (2014), and refer…

Figure 2: Preview of our approach: best-scoring translations generated for a reference game involving images of birds. The speaking agent’s goal is to send a message that uniquely identifies the bird on the left. From these translations it can be seen that the learned model appears to discriminate based on coarse attributes like size and color. Example translations: “large bird, black wings, black crown”; “small brown, light brown, dark brown”.


Experiment: driving game

[Bar chart: average reward]

  Neuralese → Neuralese: 1.93
  Neuralese ↔ English*: Statistical MT 1.49 · Semantic MT 1.54 (random baseline 1.35)


8 Results

In all below, “R” indicates a DCP agent, “H” indicates a real human, and “H*” indicates a model human player.

Reference games  Results for the two reference games are shown in Table 1. The end-to-end trained model achieves nearly perfect accuracy in both cases, while a model trained to communicate in natural language achieves somewhat lower performance. Regardless of whether the speaker is a DCP and the listener a model human or vice versa, translation based on the belief-matching criterion in Section 5 achieves the best performance; indeed, when translating from neuralese to natural language, the listener is able to achieve a higher score than it is natively. This suggests that the automated agent has discovered a more effective strategy than the one demonstrated by humans in the dataset, and that the effectiveness of this strategy is preserved by translation. Example translations from the reference games are depicted in Figure 2 and Figure 7.

Table 2: Behavior evaluation results for the driving game. Scores are presented in the form “reward / completion rate”. While less accurate than either humans or DCPs with a shared language, the models that employ a translation layer obtain higher reward and a greater overall success rate than baselines.

                 R / R         H / H       R / H
  no translation 1.93 / 0.71   — / 0.77
  random                                   1.35 / 0.64
  direct                                   1.49 / 0.67
  belief (ours)                            1.54 / 0.67

Table 3: Belief evaluation results for the driving game. Driving states are challenging to identify based on messages alone (as evidenced by the comparatively low scores obtained by single-language pairs). Translation based on belief achieves the best overall performance in both directions.

  listener R, speaker R: 0.85
  listener R, speaker H: random 0.50 · direct 0.45 · belief (ours) 0.61
  listener H*, speaker R: random 0.50 · direct 0.45 · belief (ours) 0.57
  listener H*, speaker H: 0.77

Driving game  Behavior evaluation of the driving game is shown in Table 2, and belief evaluation is shown in Table 3. Translation of messages in the driving game is considerably more challenging than in the reference games, and scores are uniformly lower; however, a clear benefit from the belief-matching model is still visible. Belief matching leads to higher scores on the belief evaluation in both directions, and allows agents to obtain a higher reward on average (though task completion rates remain roughly the same across all agents). Some example translations of driving game messages are shown in Figure 8.

Figure 8: Best-scoring translations generated for driving task: “at goal, done, left to top”; “going in intersection, proceed, going”; “you first, following, going down”.

9 Conclusion

We have investigated the problem of interpreting message vectors from deep communicating policies by translating them. After introducing a translation criterion based on matching listener beliefs about speaker states, we presented both theoretical and empirical evidence that this criterion outperforms a more conventional machine translation approach at both recovering the content of message vectors and facilitating collaboration between neuralese and natural language speakers.


Conclusions so far

• Classical notions of “meaning” apply even to non-language-like things (e.g. RNN states)

• These meanings can be compactly represented without logical forms if we have access to world states

• Communicating policies “say” interpretable things!

Limitations

$$\arg\min_{z'} \; \mathrm{KL}\!\left( \beta(z) \,\|\, \beta(z') \right), \qquad \mathrm{KL}(p \,\|\, q) = \sum_i p(s_i) \log \frac{p(s_i)}{q(s_i)}$$

This criterion scores messages as unanalyzed wholes: but what about compositionality?

Analogs of linguistic structure in deep representations

Jacob Andreas and Dan Klein

“Flat” semantics

(a)

as speakerR H

aslis

tene

r R 1.000.50 random0.70 direct0.73 belief (ours)

H*0.50

0.830.720.86

(b)

as speakerR H

aslis

tene

r R 0.950.50 random0.55 direct0.60 belief (ours)

H*0.5

0.710.570.75

Table 1: Evaluation results for reference games. (a) The colorstask. (b) The birds task. Whether the model human is in alistener or speaker role, translation based on belief matchingoutperforms both random and machine translation baselines.

R / R H / H R / H

1.93 / 0.71 — / 0.771.35 / 0.641.49 / 0.67

1.54 / 0.67

Table 2: Behavior evaluation results for the driving game.Scores are presented in the form “reward / completion rate”.While less accurate than either humans or CDPs with a sharedlanguage, the models that employ a translation layer obtainhigher reward and a greater overall success rate than baselines.

Reference games Results for the two referencegames are shown in Table 1. The end-to-end trainedmodel achieves nearly perfect accuracy in bothcases, while a model trained to communicate innatural language achieves somewhat lower perfor-mance. Regardless of whether the speaker is a CDPand the listener a model human or vice-versa, trans-lation based on the belief-matching criterion in Sec-tion 5 achieves the best performance; indeed, whentranslating from neuralese to natural language, thelistener is able to achieve a higher score than it isnatively. This suggests that the automated agenthas discovered a more effective strategy than theone demonstrated by humans in the dataset, andthat the effectiveness of this strategy is preservedby translation. Example translations from the refer-ence games are depicted in Figure 2 and Figure 7.

magenta, hot, rose, violet, purple

magenta, hot, violet, rose, purple

olive, puke, pea, grey, brown

pinkish, grey, dull, pale, light

Figure 7: Best-scoring translations generated for color task.

as speakerR H

aslis

tene

r R 0.850.50 random0.45 direct0.61 belief (ours)

H*0.5

0.770.450.57

Table 3: Belief evaluation results for the driving game. Drivingstates are challenging to identify based on messages alone (asevidenced by the comparatively low scores obtained by single-language pairs) . Translation based on belief achieves the bestoverall performance in both directions.

Driving game Behavior evaluation of the drivinggame is shown in Table 2, and belief evaluation isshown in Table 3. Translation of messages in thedriving game is considerably more challenging thanin the reference games, and scores are uniformlylower; however, a clear benefit from the belief-matching model is still visible. Belief matchingleads to higher scores on the belief evaluation inboth directions, and allows agents to obtain a higherreward on average (though task completion ratesremain roughly the same across all agents).

Some example translations of driving game mes-sages are shown in Figure 8.

9 Conclusion

We have investigated the problem of interpretingmessage vectors from communicating deep policiesby translating them. After introducing a translationcriterion based on matching listener beliefs aboutspeaker states, we presented both theoretical andempirical evidence that this criterion outperformsa more conventional machine translation approachat both recovering the content of message vectorsand facilitating collaboration between neuraleseand natural language speakers.

at goal, done, left to top

going in intersection, proceed, going

you first, following, going down

Figure 8: Best-scoring translations generated for driving task.

(a)

as speakerR H

aslis

tene

r R 1.000.50 random0.70 direct0.73 belief (ours)

H*0.50

0.830.720.86

(b)

as speakerR H

aslis

tene

r R 0.950.50 random0.55 direct0.60 belief (ours)

H*0.5

0.710.570.75

Table 1: Evaluation results for reference games. (a) The colorstask. (b) The birds task. Whether the model human is in alistener or speaker role, translation based on belief matchingoutperforms both random and machine translation baselines.

R / R H / H R / H

1.93 / 0.71 — / 0.771.35 / 0.641.49 / 0.67

1.54 / 0.67

Table 2: Behavior evaluation results for the driving game.Scores are presented in the form “reward / completion rate”.While less accurate than either humans or CDPs with a sharedlanguage, the models that employ a translation layer obtainhigher reward and a greater overall success rate than baselines.

Reference games Results for the two referencegames are shown in Table 1. The end-to-end trainedmodel achieves nearly perfect accuracy in bothcases, while a model trained to communicate innatural language achieves somewhat lower perfor-mance. Regardless of whether the speaker is a CDPand the listener a model human or vice-versa, trans-lation based on the belief-matching criterion in Sec-tion 5 achieves the best performance; indeed, whentranslating from neuralese to natural language, thelistener is able to achieve a higher score than it isnatively. This suggests that the automated agenthas discovered a more effective strategy than theone demonstrated by humans in the dataset, andthat the effectiveness of this strategy is preservedby translation. Example translations from the refer-ence games are depicted in Figure 2 and Figure 7.

magenta, hot, rose, violet, purple

magenta, hot, violet, rose, purple

olive, puke, pea, grey, brown

pinkish, grey, dull, pale, light

Figure 7: Best-scoring translations generated for color task.

as speakerR H

aslis

tene

r R 0.850.50 random0.45 direct0.61 belief (ours)

H*0.5

0.770.450.57

Table 3: Belief evaluation results for the driving game. Drivingstates are challenging to identify based on messages alone (asevidenced by the comparatively low scores obtained by single-language pairs) . Translation based on belief achieves the bestoverall performance in both directions.


Driving game Behavior evaluation of the driving game is shown in Table 2, and belief evaluation is shown in Table 3. Translation of messages in the driving game is considerably more challenging than in the reference games, and scores are uniformly lower; however, a clear benefit from the belief-matching model is still visible. Belief matching leads to higher scores on the belief evaluation in both directions, and allows agents to obtain a higher reward on average (though task completion rates remain roughly the same across all agents).

Some example translations of driving game messages are shown in Figure 8.

  at goal, done, left to top
  going in intersection, proceed, going
  you first, following, going down

Figure 8: Best-scoring translations generated for the driving task.

9 Conclusion

We have investigated the problem of interpreting message vectors from communicating deep policies by translating them. After introducing a translation criterion based on matching listener beliefs about speaker states, we presented both theoretical and empirical evidence that this criterion outperforms a more conventional machine translation approach at both recovering the content of message vectors and facilitating collaboration between neuralese and natural language speakers.


Compositional semantics

[Figure: a scene of colored shapes; checkmarks mark the objects a message selects]

Natural-language messages about such scenes have compositional meaning representations [FitzGerald et al. 2013]:

  everything but the blue shapes  →  λx.not(blue(x))
  orange square and non-squares   →  λx.or(orange(x), not(square(x)))

What is the corresponding meaning representation for a neuralese message?
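As a toy illustration of what these logical forms denote, they can be executed directly against a scene. The scene and predicate definitions below are invented for illustration:

```python
# Toy denotation of compositional logical forms over a scene.
scene = [
    {"color": "blue",   "shape": "circle"},
    {"color": "orange", "shape": "square"},
    {"color": "green",  "shape": "triangle"},
]

blue   = lambda x: x["color"] == "blue"
orange = lambda x: x["color"] == "orange"
square = lambda x: x["shape"] == "square"

# "everything but the blue shapes"  ->  λx.not(blue(x))
not_blue = lambda x: not blue(x)

# "orange square and non-squares"  ->  λx.or(orange(x), not(square(x)))
orange_or_nonsquare = lambda x: orange(x) or not square(x)

def denotation(pred, scene):
    # The objects the predicate selects: the checkmarks in the figures.
    return [obj for obj in scene if pred(obj)]

print(denotation(not_blue, scene))             # orange square, green triangle
print(denotation(orange_or_nonsquare, scene))  # all three objects
```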

Model architecture

[Figure: model architecture for the reference task; an encoder produces a message vector from the speaker's view of the scene, and a decoder uses it to checkmark the intended objects. Diagram not fully recoverable from the transcript.]

Computing meaning representations

[Figure: a neuralese vector (1.0, 2.3, -0.3, 0.4, -1.2, 1.1) paired with the message "on the left"]

To compute the meaning of a message, apply it to a collection of scenes and record which objects it selects in each:

  "everything but squares"          →  selection patterns across scenes  →  λx.not(square(x))
  (-0.1, 1.3, 0.5, -0.4, 0.2, 1.0)  →  selection patterns across scenes  →  ?

A natural-language message and a neuralese vector can then be compared via the selection patterns they induce.
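A minimal sketch of this idea, assuming a hypothetical listener(message, scene) that returns the objects the message selects: the meaning of a message, whether neuralese or natural language, is the profile of selections it induces across scenes.

```python
# Compute a message's meaning as its induced behavior across scenes
# (a truth-condition profile). `listener` is a hypothetical stand-in
# for the trained listener agent.

def meaning(message, scenes, listener):
    # For each scene, record which objects the message selects.
    return [frozenset(listener(message, scene)) for scene in scenes]

def same_meaning(m1, m2, scenes, listener):
    # Two messages mean the same thing if they induce the same
    # selections in every scene.
    return meaning(m1, scenes, listener) == meaning(m2, scenes, listener)
```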

Translation criterion

q(z, a) = KL(β(z) ‖ β(a))

[Figure: belief distributions over states induced by a neuralese message z and a natural-language message a, e.g. 0.10, 0.05, 0.13, 0.08, 0.01, 0.22]

Taking the expectation over contexts:

q(z, a) = E[ KL(β(z) ‖ β(a)) ]

[Figure: a neuralese message and a natural-language message translate well when the selection patterns (beliefs) they induce across scenes match: =]
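Below is a minimal sketch of this criterion under simplifying assumptions: discrete states, messages usable as dictionary keys, and beliefs estimated as smoothed empirical distributions. The smoothing constant and the final argmin over a candidate list are our implementation choices, not details given on the slides.

```python
import math
from collections import Counter

def belief(message, observations, states, smooth=1e-6):
    """Empirical distribution beta(message) over the states in which
    `message` was observed. `observations` is a list of (message, state)
    pairs; add-constant smoothing keeps the KL divergence finite."""
    counts = Counter(s for m, s in observations if m == message)
    total = sum(counts.values()) + smooth * len(states)
    return {s: (counts[s] + smooth) / total for s in states}

def kl(p, q):
    """D_KL(p || q) for distributions given as dicts over the same states."""
    return sum(p[s] * math.log(p[s] / q[s]) for s in p)

def q_score(z, a, obs_r, obs_h, states):
    """Translation quality q(z, a) = KL(beta(z) || beta(a))."""
    return kl(belief(z, obs_r, states), belief(a, obs_h, states))

def translate(z, candidates, obs_r, obs_h, states):
    """Best translation: the natural-language message whose induced
    belief most closely matches the belief induced by z."""
    return min(candidates, key=lambda a: q_score(z, a, obs_r, obs_h, states))
```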

Experiments

"High-level" communicative behavior

"Low-level" message structure

Comparing strategies

Do the human message and the neuralese message implement the same strategy?

  "everything but squares"          →  selection patterns across scenes
  (-0.1, 1.3, 0.5, -0.4, 0.2, 1.0)  →  selection patterns across scenes   =?

Theories of model behavior: random

Theories of model behavior: literal

[Figures: each theory predicts a selection pattern, which is compared (=?) against the pattern the model's message actually induces]

Evaluation: high-level scene agreement

[Bar chart, scale 0.00–1.00: agreement values include 0.50, 0.63, 0.27; labels include Literal and Human. Exact bar-label pairing not recoverable from the transcript.]

Evaluation: high-level object agreement

[Bar chart, scale 0.00–1.00: agreement values include 0.72, 0.50, 0.92, 0.74; labels include Random, Literal, and Human.]

Experiments

“High-level” communicative behavior

“Low-level” message structure


Collecting translation data

  all the red shapes   →  λx.red(x)             →  (0.1, -0.3, 0.5, 1.1)
  blue objects         →  λx.blu(x)             →  (-0.3, 0.2, 0.1, 0.1)
  everything but red   →  λx.¬red(x)            →  (1.4, -0.3, -0.5, 0.8)
  green squares        →  λx.grn(x)∧sqr(x)      →  (0.2, -0.2, 0.5, -0.1)
  not green squares    →  λx.¬(grn(x)∧sqr(x))   →  (0.3, -1.3, -1.5, 0.1)

Extracting related pairs

  λx.red(x)  (0.1, -0.3, 0.5, 1.1)          ↔  λx.¬red(x)  (1.4, -0.3, -0.5, 0.8)
  λx.grn(x)∧sqr(x)  (0.2, -0.2, 0.5, -0.1)  ↔  λx.¬(grn(x)∧sqr(x))  (0.3, -1.3, -1.5, 0.1)

Learning compositional operators

From the extracted pairs, learn an operator N that maps the vector for a form p to the vector for its negation:

  N* = argmin_N Σ_p ‖ N(v_p) − v_¬p ‖²
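A minimal sketch of this fit, additionally assuming N is a linear map (the slide fixes only the least-squares objective), using the four-dimensional example vectors above:

```python
import numpy as np

# Representation pairs (v_p, v_not_p) extracted from matched logical forms.
pairs = [
    (np.array([0.1, -0.3, 0.5,  1.1]), np.array([1.4, -0.3, -0.5, 0.8])),  # red / not red
    (np.array([0.2, -0.2, 0.5, -0.1]), np.array([0.3, -1.3, -1.5, 0.1])),  # grn&sqr / not
]

P = np.stack([p for p, _ in pairs])   # inputs,  shape (n, d)
Q = np.stack([q for _, q in pairs])   # targets, shape (n, d)

# Least-squares fit of a linear negation operator:
#   N* = argmin_N sum_p || N v_p - v_not_p ||^2
# With only two example pairs the system is underdetermined; lstsq
# returns the minimum-norm solution (real datasets use many pairs).
N, *_ = np.linalg.lstsq(P, Q, rcond=None)   # solves P @ N ≈ Q

# Predicted vector for the negation of a held-out form (row-vector
# convention: N acts on the right).
predicted = np.array([0.2, -0.2, 0.5, -0.1]) @ N
```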

Evaluating learned operators

Hold out a pair and apply the learned operator: given the vector for λx.f(x), e.g. (0.2, -0.2, 0.5, -0.1), the operator predicts a vector for λx.¬f(x), e.g. (-0.2, 0.4, -0.3, 0.0), which can be compared against the true vector (0.3, -1.3, -1.5, 0.1). Does the predicted vector behave like the true negation?
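One way to answer that question, again with a hypothetical listener function: measure how often the predicted vector induces exactly the same selections as the true one, which is the scene-agreement statistic reported next.

```python
def scene_agreement(v_pred, v_true, scenes, listener):
    # Fraction of scenes where the predicted vector induces exactly
    # the same selection as the true vector.
    agree = sum(
        set(listener(v_pred, scene)) == set(listener(v_true, scene))
        for scene in scenes
    )
    return agree / len(scenes)
```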

Evaluation: scene agreement for negation

[Bar chart, scale 0.00–1.00: values 0.50 and 0.81, consistent with a random baseline of 0.50 and learned-operator agreement of 0.81]

Visualizing negation

[Figure: 2D projection of message vectors, with Input, Predicted, and True points; labeled examples include "every thing that is red" / "all the toys that are not red" and "only the blue and green objects" / "all items that are not blue or green"]

Evaluation: scene agreement for disjunction

[Bar chart, scale 0.00–1.00: values 0.50, 0.54, 0.90, with 0.50 consistent with a random baseline; exact labels for the remaining bars not recoverable from the transcript]

Visualizing disjunction

[Figure: 2D projection of message vectors, with Input, Predicted, and True points; labeled examples include "all of the red objects", "the blue objects", "the blue and red items", "the blue and yellow items", "all the yellow toys", "all yellow or red items"]
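Maps like these can be produced with any 2D projection of the message vectors; here is a minimal sketch using PCA via SVD (the projection actually used for the figures is not recoverable from the transcript):

```python
import numpy as np

def project_2d(vectors):
    """Project message vectors to 2D for plotting (PCA via SVD)."""
    X = np.asarray(vectors, dtype=float)
    X = X - X.mean(axis=0)                        # center the data
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:2].T                           # top-2 principal components
```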

Conclusions

• We can translate between neuralese and natural language by grounding in distributions over world states

• Under the right conditions, neuralese exhibits interpretable pragmatics and compositional structure

• Not just communication games: language might be a good general-purpose tool for interpreting deep representations



Thank you!

http://github.com/jacobandreas/{neuralese,rnn-syn}