Dual Learning: Pushing New Frontier of Artificial Intelligence
Tie-Yan Liu
Principal Researcher
Microsoft Research Asia
AI is Making Fast Progress!
Speech Recognition
• In 2016, Microsoft's speech recognition system achieved human parity on conversational data (word error rate: 5.9%).
• This result was powered by the Microsoft Cognitive Toolkit (CNTK).
AI is Making Fast Progress!
[Figures: Image Classification, Object Segmentation]
AI is Making Fast Progress!
Machine Translation
[Chart: BLEU4 scores for Chinese→English translation, comparing Youdao, Baidu, and MSRA systems from 2012.08 to 2016.04]
AI is Making Fast Progress!
[Figures: Atari game playing (Deep Q-Networks), Go playing]
Driving Forces
• Deep Learning (FNN, CNN, RNN)
• Reinforcement Learning
Limitations
Deep Learning
• Cannot live without huge amounts of human-labeled training data.
Task: typical training data
• Image classification: millions of labeled images
• Speech recognition: thousands of hours of annotated voice data
• Machine translation: tens of millions of bilingual sentence pairs
• Go playing: tens of millions of expert moves
Human labeling is in general very expensive, and it is hard, if not impossible, to obtain large-scale labeled data for rare domains.
Reinforcement Learning
• Reinforcement learning learns much more slowly than supervised learning, and it relies on feedback signals obtained from many rounds of "trial and error".
• This means high-frequency interactions with the environment are needed during learning, which is feasible for repeatable games but not for many real applications. This explains why the recent successes of reinforcement learning mainly lie in game playing (Atari or Go).
Desirable Learning Scheme
• To overcome the limitations of today's deep learning and reinforcement learning technologies, one may want a scheme that could learn
  • without large-scale labeled data
  • without frequent interactions with the real environment
Is this going to be possible?
An Important Observation
• Many real applications involve two dual AI tasks:

Application: primal task / dual task
• Machine translation: translation from language A to B / translation from language B to A
• Speech processing: speech recognition / text to speech
• Image understanding: image captioning / image generation
• Conversation: question answering / question generation (e.g., Jeopardy!)
• Search engine: query-document matching / query/keyword suggestion
One can obtain feedback signals that are useful for deep learning through the interplay of the two dual tasks, even when no labeled data are available.
The two dual tasks can serve as "virtual" environments for each other in reinforcement learning, enabling many rounds of "trial and error" without the need to interact with the real environment.
Dual Learning
Unlabeled data x → Predicted label y = f(x) → Reconstructed data x′ = g(y)

Feedback signals during the loop:
• R(x, f, g) = s(x, x′): reconstruction error
• L(y) and L(x′): likelihoods
Primal task f: x → y; dual task g: y → x
[Diagram: each model acts as an agent, with the other model serving as its environment]
Algorithms like policy gradient can be used to improve both the primal and dual models according to the feedback signals.
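To make the loop concrete, here is a minimal sketch of one dual-learning round in Python. The interfaces (.sample(), .update(), lm_score, similarity) are illustrative assumptions, not the exact implementation:

```python
# One round of dual learning on an unlabeled sample x.
# Hypothetical interfaces: `primal` and `dual` are sequence models with
# .sample() and a policy-gradient .update(); `lm_score` is a pretrained
# likelihood/language-model scorer; `similarity` computes s(x, x').

def dual_learning_step(x, primal, dual, lm_score, similarity, alpha=0.5):
    y = primal.sample(x)          # primal agent: y = f(x)
    x_prime = dual.sample(y)      # dual agent: x' = g(y)

    # Feedback signals generated by the loop itself, no human label needed:
    r_fluency = lm_score(y)                  # L(y): is y well-formed?
    r_reconstruct = similarity(x, x_prime)   # s(x, x'): reconstruction quality
    reward = alpha * r_fluency + (1 - alpha) * r_reconstruct

    # Policy gradient: reinforce both models with the shared reward.
    primal.update(x, y, reward)
    dual.update(y, x_prime, reward)
    return reward
```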
Dual Machine Translation as Example
English sentence x → Chinese sentence y = f(x) → New English sentence x′ = g(y)

Feedback signals during the loop:
• R(x, f, g) = s(x, x′): BLEU of x′ given x
• L(y) and L(x′): language-model scores of y and x′
Primal task f: x → y (En→Ch translation); dual task g: y → x (Ch→En translation)
[Diagram: the two translation models act as agent and environment for each other]
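As a sketch, the loop feedback for dual MT might be computed as below, using NLTK's sentence-level BLEU for s(x, x′) and a hypothetical language-model scorer lm_zh; the small weight on the fluency reward is a tunable hyperparameter, shown here only for illustration:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def mt_loop_feedback(x_tokens, y_tokens, x_prime_tokens, lm_zh, alpha=0.005):
    """Feedback for one En->Ch->En loop; `lm_zh.score` (a normalized
    log-probability from a pretrained Chinese LM) is a hypothetical interface."""
    smooth = SmoothingFunction().method1
    # R(x, f, g) = s(x, x'): BLEU of the reconstructed English sentence vs. x.
    reconstruction = sentence_bleu([x_tokens], x_prime_tokens,
                                   smoothing_function=smooth)
    # L(y): fluency of the intermediate Chinese sentence.
    fluency = lm_zh.score(y_tokens)
    return alpha * fluency + (1 - alpha) * reconstruction
```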
Experimental Setting
• Baseline:
  • State-of-the-art neural machine translation (NMT) model, trained on 100% of the bilingual data
  • ICLR 2015, "Neural Machine Translation by Jointly Learning to Align and Translate", from Y. Bengio's group
• Our algorithm:
  • Step 1: Initialization. Start from a weak NMT model learned from 10% of the training data.
  • Step 2: Dual learning with monolingual data. Use the policy gradient algorithm to update the dual models (sketched below).
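For Step 2, a bare-bones REINFORCE-style update might look like the following (PyTorch assumed; `model(x, y)` returning the differentiable sequence log-probability log P(y|x) is a hypothetical interface):

```python
import torch

def reinforce_update(model, optimizer, x, sampled_y, reward, baseline=0.0):
    # Increase log P(y|x) in proportion to how much the loop reward
    # exceeds a baseline; decrease it when the reward falls short.
    log_prob = model(x, sampled_y)           # log P(y|x), differentiable
    loss = -(reward - baseline) * log_prob   # REINFORCE surrogate loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```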
Experimental Results
The larger the BLEU score, the better.
[Charts: BLEU scores for French→English and English→French, each comparing NMT with 10% bilingual data, dual-NMT with 10% bilingual data, and NMT with 100% bilingual data]
Starting from initial models trained on only 10% of the bilingual data, dual learning achieves accuracy similar to that of the NMT model trained on 100% of the bilingual data!
Other Examples: Dual Speech Processing
Speech segment x → Text sentence y = f(x) → New speech segment x′ = g(y)

Feedback signals during the loop:
• R(x, f, g) = s(x, x′): acoustic similarity between x′ and x
• L(y): language-model score of y; L(x′): naturalness of x′
Primal task f: x → y (speech recognition); dual task g: y → x (text to speech)
[Diagram: the two models act as agent and environment for each other]
Other Examples: Dual Image Processing
Image x → Text sentence y = f(x) → New image x′ = g(y)

Feedback signals during the loop:
• R(x, f, g) = s(x, x′): similarity between x′ and x
• L(y): language-model score of y; L(x′): naturalness of x′
Primal task f: x → y (image captioning); dual task g: y → x (image generation)
[Diagram: the two models act as agent and environment for each other]
Other Examples: Dual Conversation
Question x → Answer y = f(x) → New question x′ = g(y)

Feedback signals during the loop:
• R(x, f, g) = s(x, x′): BLEU of x′ given x
• L(y) and L(x′): language-model scores of y and x′
Primal task f: x → y (question answering); dual task g: y → x (question generation)
[Diagram: the two models act as agent and environment for each other]
Actually, the idea of “Dual Learning” is much more generally applicable…
Even for tasks without physical duality
Virtual Duality: Auto Encoder
Raw data x → Hidden representation y = f(x) (encoder f) → New data x′ = g(y) (decoder g)

Feedback signal during the loop:
• R(x, f, g) = s(x, x′): similarity between x and x′
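A minimal PyTorch sketch of this loop (PyTorch and the layer sizes are assumptions; the slides do not prescribe an implementation):

```python
import torch
import torch.nn as nn

# Encoder f: x -> y and decoder g: y -> x' close the loop; the only
# feedback signal is the reconstruction similarity s(x, x').
f = nn.Sequential(nn.Linear(784, 64), nn.ReLU())      # encoder f
g = nn.Sequential(nn.Linear(64, 784), nn.Sigmoid())   # decoder g
opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)

x = torch.rand(32, 784)                    # raw data, no labels required
x_prime = g(f(x))                          # x' = g(f(x))
loss = nn.functional.mse_loss(x_prime, x)  # negative of s(x, x')
opt.zero_grad(); loss.backward(); opt.step()
```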
Virtual Duality: Generative Adversarial Nets
Noise x → Generated image y = f(x) (generator f) → Real or fake (discriminator g)
• The generator receives a reward signal from the discriminator, telling it whether the generated data looks natural.
• Feedback signal: R(x, f, g) = g(y) = g(f(x))
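A minimal sketch of the generator's side of this loop in PyTorch (the network shapes and the log-reward form are illustrative assumptions):

```python
import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(16, 784), nn.Tanh())    # generator f
g = nn.Sequential(nn.Linear(784, 1), nn.Sigmoid())  # discriminator g
opt_f = torch.optim.Adam(f.parameters(), lr=2e-4)

x = torch.randn(32, 16)   # noise x
y = f(x)                  # generated image y = f(x)
reward = g(y)             # feedback signal R(x, f, g) = g(f(x))
# The generator ascends the reward, i.e., descends -log g(f(x)).
loss = -torch.log(reward + 1e-8).mean()
opt_f.zero_grad(); loss.backward(); opt_f.step()
```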
Even for supervised learning & inference
“Dual” Supervised Learning
Labeled data x → Predicted label y = f(x) → Reconstructed data x′ = g(y)

Feedback signal during the loop:
• R(x, f, g) = |P(x)P(y|x; f) − P(y)P(x|y; g)|: the gap between the joint probability P(x, y) computed in the two directions
Primal task f: x → y; dual task g: y → x
[Diagram: the two models act as agent and environment for each other]

max log P(y|x; f)
max log P(x|y; g)
min |P(x)P(y|x; f) − P(y)P(x|y; g)|
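One plausible way to turn this feedback into a training loss is to penalize the squared log-space duality gap alongside the two likelihood terms; a sketch with illustrative names (all log-probabilities assumed given):

```python
def dual_supervised_loss(log_p_x, log_p_y, log_p_y_given_x, log_p_x_given_y,
                         lam=0.01):
    """Joint loss for one labeled pair (x, y); lam weights the soft
    probabilistic-duality constraint (an assumed hyperparameter)."""
    nll_primal = -log_p_y_given_x    # max log P(y|x; f)
    nll_dual = -log_p_x_given_y      # max log P(x|y; g)
    # P(x)P(y|x; f) should equal P(y)P(x|y; g); penalize the log-space gap.
    duality_gap = (log_p_x + log_p_y_given_x
                   - log_p_y - log_p_x_given_y) ** 2
    return nll_primal + nll_dual + lam * duality_gap
```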
“Dual” Inference
Test data x → Predicted label y = f(x) → Reconstructed data x′ = g(y)
Primal task f: x → y; dual task g: y → x
[Diagram: the two models act as agent and environment for each other]
P(y|x) = P(x|y)P(y) / P(x)
Standard inference: choose the y that maximizes P(y|x; f).
Dual inference: choose the y that maximizes both P(y|x; f) and P(x|y; g)P(y)/P(x).

Dual inference leverages both the primal model and the dual model at test time.
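A minimal sketch of dual inference as candidate reranking (the scorer functions are hypothetical interfaces returning log-probabilities; P(x) is shared by all candidates, so it can be dropped):

```python
def dual_inference(x, candidates, log_p_y_given_x, log_p_x_given_y, log_p_y,
                   alpha=0.5):
    """Pick the candidate y that scores well in both directions."""
    def score(y):
        forward = log_p_y_given_x(y, x)                 # log P(y|x; f)
        backward = log_p_x_given_y(x, y) + log_p_y(y)   # log P(x|y; g)P(y)
        return alpha * forward + (1 - alpha) * backward
    return max(candidates, key=score)
```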
Dual Learning: A New Learning Paradigm
• Unsupervised/semi-supervised learning: no feedback signals for unlabeled data; only one task.
  Dual learning: automatically generates reinforcement feedback for unlabeled data; multiple tasks are involved.
• Multi-task learning: multiple tasks share the same representation.
  Dual learning: the dual tasks need not share representations, as long as the loop is closed.
• Transfer learning: uses auxiliary tasks to boost the target task.
  Dual learning: all the tasks are mutually and simultaneously boosted.
• Co-training: only one task, assuming different feature sets that provide complementary information about the instance.
  Dual learning: multiple tasks are involved, with no assumption on feature sets.
Thanks!
http://research.microsoft.com/users/tyliu/
http://weibo.com/tieyanliu/