Dual Learning: Pushing New Frontier of Artificial Intelligence
Tie-Yan Liu
Principal Researcher
Microsoft Research Asia
AI is Making Fast Progress!
Speech Recognition
• In 2016, Microsoft's speech recognition system achieved human parity on conversational data (word error rate: 5.9%).
• This result was powered by the Microsoft Cognitive Toolkit (CNTK).
AI is Making Fast Progress!
[Figures: Image Classification, Object Segmentation]
AI is Making Fast Progress!
Machine Translation
[Chart: BLEU4 scores for Chinese→English translation, comparing Youdao, Baidu, and MSRA systems from 2012.08 to 2016.04]
AI is Making Fast Progress!
[Figures: Atari game playing (Deep Q-Networks), Go playing]
Driving Forces
• Deep Learning (FNN, CNN, RNN)
• Reinforcement Learning
Limitations
Deep Learning
• Cannot live without huge amounts of human-labeled training data.
Task: typical training data
• Image classification: millions of labeled images
• Speech recognition: thousands of hours of annotated voice data
• Machine translation: tens of millions of bilingual sentence pairs
• Go playing: tens of millions of expert moves
Human labeling is in general very expensive, and it is hard, if not impossible, to obtain large-scale labeled data for rare domains.
Reinforcement Learning
• Reinforcement learning learns much more slowly than supervised learning, and it relies on feedback signals obtained from many rounds of "trial and error".
• This means high-frequency interactions with the environment are needed during learning, which is feasible for repeatable games but not for many real applications. This explains why the recent successes of reinforcement learning mainly lie in game playing (Atari or Go).
Desirable Learning Scheme
• To overcome the limitations of today's deep learning and reinforcement learning technologies, one may want a scheme that could learn
  • without large-scale labeled data
  • without frequent interactions with the real environment
Is this going to be possible?
An Important Observation
• Many real applications involve two dual AI tasks:

Application: primal task / dual task
• Machine translation: translation from language A to B / translation from language B to A
• Speech processing: speech recognition / text to speech
• Image understanding: image captioning / image generation
• Conversation: question answering / question generation (e.g., Jeopardy!)
• Search engine: query-document matching / query/keyword suggestion
One can obtain feedback signals that are useful for deep learning through the interplay of the two dual tasks, even when no labeled data are available.
The two dual tasks can serve as "virtual" environments for each other in reinforcement learning, enabling many rounds of "trial and error" without the need to interact with the real environment.
Dual Learning
Unlabeled data x → Predicted label y = f(x) → Reconstructed data x′ = g(y)

Feedback signals during the loop:
• R(x, f, g) = s(x, x′): reconstruction error
• L(y) and L(x′): likelihoods
Primal task f: x → y; dual task g: y → x
[Diagram: each model acts as an agent, with the other model serving as its environment]
Algorithms like policy gradient can be used to improve both the primal and dual models according to the feedback signals.
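To make the loop concrete, here is a minimal sketch of one dual-learning round in Python. The interfaces (.sample(), .update(), lm_score, similarity) are illustrative assumptions, not the exact implementation:

```python
# One round of dual learning on an unlabeled sample x.
# Hypothetical interfaces: `primal` and `dual` are sequence models with
# .sample() and a policy-gradient .update(); `lm_score` is a pretrained
# likelihood/language-model scorer; `similarity` computes s(x, x').

def dual_learning_step(x, primal, dual, lm_score, similarity, alpha=0.5):
    y = primal.sample(x)          # primal agent: y = f(x)
    x_prime = dual.sample(y)      # dual agent: x' = g(y)

    # Feedback signals generated by the loop itself, no human label needed:
    r_fluency = lm_score(y)                  # L(y): is y well-formed?
    r_reconstruct = similarity(x, x_prime)   # s(x, x'): reconstruction quality
    reward = alpha * r_fluency + (1 - alpha) * r_reconstruct

    # Policy gradient: reinforce both models with the shared reward.
    primal.update(x, y, reward)
    dual.update(y, x_prime, reward)
    return reward
```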
Dual Machine Translation as Example
English sentence x → Chinese sentence y = f(x) → New English sentence x′ = g(y)

Feedback signals during the loop:
• R(x, f, g) = s(x, x′): BLEU of x′ given x
• L(y) and L(x′): language-model scores of y and x′
Primal task f: x → y (En→Ch translation); dual task g: y → x (Ch→En translation)
[Diagram: the two translation models act as agent and environment for each other]
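As a sketch, the loop feedback for dual MT might be computed as below, using NLTK's sentence-level BLEU for s(x, x′) and a hypothetical language-model scorer lm_zh; the small weight on the fluency reward is a tunable hyperparameter, shown here only for illustration:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def mt_loop_feedback(x_tokens, y_tokens, x_prime_tokens, lm_zh, alpha=0.005):
    """Feedback for one En->Ch->En loop; `lm_zh.score` (a normalized
    log-probability from a pretrained Chinese LM) is a hypothetical interface."""
    smooth = SmoothingFunction().method1
    # R(x, f, g) = s(x, x'): BLEU of the reconstructed English sentence vs. x.
    reconstruction = sentence_bleu([x_tokens], x_prime_tokens,
                                   smoothing_function=smooth)
    # L(y): fluency of the intermediate Chinese sentence.
    fluency = lm_zh.score(y_tokens)
    return alpha * fluency + (1 - alpha) * reconstruction
```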
Experimental Setting
• Baseline:
  • State-of-the-art neural machine translation (NMT) model, trained on 100% of the bilingual data
  • ICLR 2015, "Neural Machine Translation by Jointly Learning to Align and Translate", from Y. Bengio's group
• Our algorithm:
  • Step 1: Initialization. Start from a weak NMT model learned from 10% of the training data.
  • Step 2: Dual learning with monolingual data. Use the policy gradient algorithm to update the dual models (sketched below).
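For Step 2, a bare-bones REINFORCE-style update might look like the following (PyTorch assumed; `model(x, y)` returning the differentiable sequence log-probability log P(y|x) is a hypothetical interface):

```python
import torch

def reinforce_update(model, optimizer, x, sampled_y, reward, baseline=0.0):
    # Increase log P(y|x) in proportion to how much the loop reward
    # exceeds a baseline; decrease it when the reward falls short.
    log_prob = model(x, sampled_y)           # log P(y|x), differentiable
    loss = -(reward - baseline) * log_prob   # REINFORCE surrogate loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```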
Experimental Results
The larger the BLEU score, the better.
[Charts: BLEU scores for French→English and English→French, each comparing NMT with 10% bilingual data, dual-NMT with 10% bilingual data, and NMT with 100% bilingual data]
Starting from initial models trained on only 10% of the bilingual data, dual learning achieves accuracy similar to that of the NMT model trained on 100% of the bilingual data!
Other Examples: Dual Speech Processing
Speech segment x → Text sentence y = f(x) → New speech segment x′ = g(y)

Feedback signals during the loop:
• R(x, f, g) = s(x, x′): acoustic similarity between x′ and x
• L(y): language-model score of y; L(x′): naturalness of x′
Primal task f: x → y (speech recognition); dual task g: y → x (text to speech)
[Diagram: the two models act as agent and environment for each other]
Other Examples: Dual Image Processing
Image x → Text sentence y = f(x) → New image x′ = g(y)

Feedback signals during the loop:
• R(x, f, g) = s(x, x′): similarity between x′ and x
• L(y): language-model score of y; L(x′): naturalness of x′
Primal task f: x → y (image captioning); dual task g: y → x (image generation)
[Diagram: the two models act as agent and environment for each other]
Other Examples: Dual Conversation
Question x → Answer y = f(x) → New question x′ = g(y)

Feedback signals during the loop:
• R(x, f, g) = s(x, x′): BLEU of x′ given x
• L(y) and L(x′): language-model scores of y and x′
Primal task f: x → y (question answering); dual task g: y → x (question generation)
[Diagram: the two models act as agent and environment for each other]
Actually, the idea of “Dual Learning” is much more generally applicable…
Even for tasks without physical duality
Virtual Duality: Auto Encoder
Raw data x → Hidden representation y = f(x) (encoder f) → New data x′ = g(y) (decoder g)

Feedback signal during the loop:
• R(x, f, g) = s(x, x′): similarity between x and x′
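A minimal PyTorch sketch of this loop (PyTorch and the layer sizes are assumptions; the slides do not prescribe an implementation):

```python
import torch
import torch.nn as nn

# Encoder f: x -> y and decoder g: y -> x' close the loop; the only
# feedback signal is the reconstruction similarity s(x, x').
f = nn.Sequential(nn.Linear(784, 64), nn.ReLU())      # encoder f
g = nn.Sequential(nn.Linear(64, 784), nn.Sigmoid())   # decoder g
opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)

x = torch.rand(32, 784)                    # raw data, no labels required
x_prime = g(f(x))                          # x' = g(f(x))
loss = nn.functional.mse_loss(x_prime, x)  # negative of s(x, x')
opt.zero_grad(); loss.backward(); opt.step()
```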
Virtual Duality: Generative Adversarial Nets
Noise x → Generated image y = f(x) (generator f) → Real or fake (discriminator g)
• The generator receives a reward signal from the discriminator, telling it whether the generated data looks natural.
• Feedback signal: R(x, f, g) = g(y) = g(f(x))
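A minimal sketch of the generator's side of this loop in PyTorch (the network shapes and the log-reward form are illustrative assumptions):

```python
import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(16, 784), nn.Tanh())    # generator f
g = nn.Sequential(nn.Linear(784, 1), nn.Sigmoid())  # discriminator g
opt_f = torch.optim.Adam(f.parameters(), lr=2e-4)

x = torch.randn(32, 16)   # noise x
y = f(x)                  # generated image y = f(x)
reward = g(y)             # feedback signal R(x, f, g) = g(f(x))
# The generator ascends the reward, i.e., descends -log g(f(x)).
loss = -torch.log(reward + 1e-8).mean()
opt_f.zero_grad(); loss.backward(); opt_f.step()
```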
Even for supervised learning & inference
“Dual” Supervised Learning
Labeled data x → Predicted label y = f(x) → Reconstructed data x′ = g(y)

Feedback signal during the loop:
• R(x, f, g) = |P(x)P(y|x; f) − P(y)P(x|y; g)|: the gap between the joint probability P(x, y) computed in the two directions
Primal task f: x → y; dual task g: y → x
[Diagram: the two models act as agent and environment for each other]

max log P(y|x; f)
max log P(x|y; g)
min |P(x)P(y|x; f) − P(y)P(x|y; g)|
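One plausible way to turn this feedback into a training loss is to penalize the squared log-space duality gap alongside the two likelihood terms; a sketch with illustrative names (all log-probabilities assumed given):

```python
def dual_supervised_loss(log_p_x, log_p_y, log_p_y_given_x, log_p_x_given_y,
                         lam=0.01):
    """Joint loss for one labeled pair (x, y); lam weights the soft
    probabilistic-duality constraint (an assumed hyperparameter)."""
    nll_primal = -log_p_y_given_x    # max log P(y|x; f)
    nll_dual = -log_p_x_given_y      # max log P(x|y; g)
    # P(x)P(y|x; f) should equal P(y)P(x|y; g); penalize the log-space gap.
    duality_gap = (log_p_x + log_p_y_given_x
                   - log_p_y - log_p_x_given_y) ** 2
    return nll_primal + nll_dual + lam * duality_gap
```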
“Dual” Inference
Test data x → Predicted label y = f(x) → Reconstructed data x′ = g(y)
Primal task f: x → y; dual task g: y → x
[Diagram: the two models act as agent and environment for each other]
P(y|x) = P(x|y)P(y) / P(x)
Standard inference: choose the y that maximizes P(y|x; f).
Dual inference: choose the y that maximizes both P(y|x; f) and P(x|y; g)P(y)/P(x).

Dual inference leverages both the primal model and the dual model at test time.
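A minimal sketch of dual inference as candidate reranking (the scorer functions are hypothetical interfaces returning log-probabilities; P(x) is shared by all candidates, so it can be dropped):

```python
def dual_inference(x, candidates, log_p_y_given_x, log_p_x_given_y, log_p_y,
                   alpha=0.5):
    """Pick the candidate y that scores well in both directions."""
    def score(y):
        forward = log_p_y_given_x(y, x)                 # log P(y|x; f)
        backward = log_p_x_given_y(x, y) + log_p_y(y)   # log P(x|y; g)P(y)
        return alpha * forward + (1 - alpha) * backward
    return max(candidates, key=score)
```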
Dual Learning: A New Learning Paradigm
• Unsupervised/semi-supervised learning: no feedback signals for unlabeled data; only one task.
  Dual learning: automatically generates reinforcement feedback for unlabeled data; multiple tasks are involved.
• Multi-task learning: multiple tasks share the same representation.
  Dual learning: the dual tasks need not share representations, as long as the loop is closed.
• Transfer learning: uses auxiliary tasks to boost the target task.
  Dual learning: all the tasks are mutually and simultaneously boosted.
• Co-training: only one task, assuming different feature sets that provide complementary information about the instance.
  Dual learning: multiple tasks are involved, with no assumption on feature sets.
Thanks!
http://research.microsoft.com/users/tyliu/
http://weibo.com/tieyanliu/