Sequence to sequence (encoder-decoder) learning

Seq2seq ... and beyond

Page 1: Sequence to sequence (encoder-decoder) learning

Seq2seq ... and beyond

Page 2: Sequence to sequence (encoder-decoder) learning

Hello! I am Roberto Silveira

EE engineer, ML enthusiast

[email protected]

@rsilveira79

Page 3: Sequence to sequence (encoder-decoder) learning

Sequence is a matter of time

Page 4: Sequence to sequence (encoder-decoder) learning

RNN is what you need!

Page 5: Sequence to sequence (encoder-decoder) learning

Basic Recurrent cells (RNN)

Source: http://colah.github.io/

Issues
× Difficult to deal with long-term dependencies
× Difficult to train (vanishing gradient issues; see the recurrence below)
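For reference, a vanilla RNN cell in standard notation (not spelled out on the slide) updates its hidden state as

h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)

Backpropagating through many such steps multiplies the gradient by W_{hh} (and the tanh derivative) over and over, which is why it tends to vanish or explode over long sequences.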

Page 6: Sequence to sequence (encoder-decoder) learning

Long-term issues

Source: http://colah.github.io/, CS224d notes

Sentence 1: "Jane walked into the room. John walked in too. Jane said hi to ___"

Sentence 2: "Jane walked into the room. John walked in too. It was late in the day, and everyone was walking home after a long day at work. Jane said hi to ___"

Page 7: Sequence to sequence (encoder-decoder) learning

LSTM in 2 min...

Review
× Addresses long-term dependencies
× More complex to train
× Very powerful with lots of data

Source: http://colah.github.io/

Page 8: Sequence to sequence (encoder-decoder) learning

LSTM in 2 min...

Review
× Addresses long-term dependencies
× More complex to train
× Very powerful with lots of data

[LSTM diagram: cell state, forget gate, input gate, output gate; equations below]

Source: http://colah.github.io/
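In the notation of the colah post cited above, the pieces labelled in the diagram are computed as

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)            (forget gate)
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)            (input gate)
\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)     (candidate cell state)
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t   (cell state update)
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)            (output gate)
h_t = o_t \odot \tanh(C_t)                        (hidden state)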

Page 9: Sequence to sequence (encoder-decoder) learning

Gated recurrent unit (GRU) in 2 min ...

Review
× Fewer hyperparameters
× Trains faster
× Better solution w/ less data

Source: http://www.wildml.com/, arXiv:1412.3555

Page 10: Sequence to sequence (encoder-decoder) learning

Gated recurrent unit (GRU) in 2 min ...

Review
× Fewer hyperparameters
× Trains faster
× Better solution w/ less data

Source: http://www.wildml.com/, arXiv:1412.3555

[GRU diagram: reset gate, update gate; equations below]
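Following the sources cited above, the two gates and the state update are

z_t = \sigma(W_z [h_{t-1}, x_t])                       (update gate)
r_t = \sigma(W_r [h_{t-1}, x_t])                       (reset gate)
\tilde{h}_t = \tanh(W [r_t \odot h_{t-1}, x_t])        (candidate state)
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t  (hidden state)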

Page 11: Sequence to sequence (encoder-decoder) learning

Seq2seq learning

Or encoder-decoder architectures

Page 12: Sequence to sequence (encoder-decoder) learning

Variable size input - output

Source: http://karpathy.github.io/

Page 13: Sequence to sequence (encoder-decoder) learning

Variable size input - output

Source: http://karpathy.github.io/

Page 14: Sequence to sequence (encoder-decoder) learning

Basic idea: "variable" size input (encoder) -> fixed size vector representation -> "variable" size output (decoder)

[Diagram: the input "Machine", "Learning", "is", "fun" is fed one word at a time into a first, stateful RNN (the encoder), which produces an encoded sequence, e.g. the vector (0.636, 0.122, 0.981); a second, stateful RNN (the decoder) then emits the output "Aprendizado", "de", "Máquina", "é", "divertido" one word at a time. In both RNNs, the memory of the previous word influences the next result.]
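A runnable toy sketch of that data flow in Python/NumPy (random, untrained weights, so the emitted words are meaningless; the point is only the encode-then-decode structure, and every name and size here is illustrative):

import numpy as np

rng = np.random.default_rng(0)

# Toy vocabularies and sizes, purely illustrative
src_vocab = ["Machine", "Learning", "is", "fun"]
tgt_vocab = ["<EOS>", "Aprendizado", "de", "Máquina", "é", "divertido"]
emb, hid = 8, 16

E_src = rng.normal(size=(len(src_vocab), emb))  # source word embeddings
E_tgt = rng.normal(size=(len(tgt_vocab), emb))  # target word embeddings
W_enc = rng.normal(size=(hid, hid + emb))       # encoder recurrence weights
W_dec = rng.normal(size=(hid, hid + emb))       # decoder recurrence weights
W_out = rng.normal(size=(len(tgt_vocab), hid))  # decoder output projection

def rnn_step(state, x, W):
    # Plain tanh RNN cell: next state from previous state and current input
    return np.tanh(W @ np.concatenate([state, x]))

# Encoder: read the input one word at a time, keep only the final state
state = np.zeros(hid)
for word in ["Machine", "Learning", "is", "fun"]:
    state = rnn_step(state, E_src[src_vocab.index(word)], W_enc)

# Decoder: unroll from the encoded state, emitting one word at a time
prev = E_tgt[tgt_vocab.index("<EOS>")]          # start symbol (reusing <EOS>)
for _ in range(10):
    state = rnn_step(state, prev, W_dec)
    word = tgt_vocab[int(np.argmax(W_out @ state))]  # greedy word choice
    print(word)
    if word == "<EOS>":
        break
    prev = E_tgt[tgt_vocab.index(word)]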

Page 15: Sequence to sequence (encoder-decoder) learning

Sequence to Sequence Learning with Neural Networks (2014)

[Diagram: "Machine", "Learning", "is", "fun" → LSTM encoder (4 layers, 1000 cells/layer, 1000d word embeddings) → encoded sequence, e.g. (0.636, 0.122, 0.981) → LSTM decoder (4 layers, 1000 cells/layer) → "Aprendizado", "de", "Máquina", "é", "divertido"]

Source: arXiv 1409.3215v3

TRAINING → SGD w/o momentum, fixed learning rate of 0.7, 7.5 epochs, batches of 128 sentences, 10 days of training (WMT'14 English-to-French dataset)

Page 16: Sequence to sequence (encoder-decoder) learning

Recurrent encoder-decoders

Les chiens aiment les os <EOS> Dogs love bones

Dogs love bones <EOS>

Source Sequence Target Sequence

Source: arXiv 1409.3215v3

Page 17: Sequence to sequence (encoder-decoder) learning

Recurrent encoder-decoders

Les chiens aiment les os <EOS> Dogs love bones

Dogs love bones <EOS>

Source: arXiv 1409.3215v3

Page 18: Sequence to sequence (encoder-decoder) learning

Recurrent encoder-decoders

Les chiens aiment les os <EOS> Dogs love bones

Dogs love bones <EOS>

Source: arXiv 1409.3215v3

Page 19: Sequence to sequence (encoder-decoder) learning

Source: arXiv 1409.3215v3

Recurrent encoder-decoders - issues

● Difficult to cope with long sentences (longer than those seen in the training corpus)

● Decoder w/ attention mechanism → relieves the encoder from squashing everything into a fixed-length vector

Page 20: Sequence to sequence (encoder-decoder) learning

NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE (2015)

Source: arXiv 1409.0473v7

[Decoder diagram: a context vector for each target word, built from weights over each annotation h_j; equations below]
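Concretely, in the paper's notation (arXiv 1409.0473), the context vector c_i for target word i is a weighted sum of the encoder annotations h_j, with weights given by a softmax over alignment scores produced by a small feed-forward network a conditioned on the previous decoder state s_{i-1}:

e_{ij} = a(s_{i-1}, h_j)
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}
c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j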

Page 21: Sequence to sequence (encoder-decoder) learning

NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE (2015)

Source: arXiv 1409.0473v7

[Decoder diagram: a context vector for each target word, built from weights over each annotation h_j; the alignment can be non-monotonic]

Page 22: Sequence to sequence (encoder-decoder) learning

Attention models for NLP

Source: arXiv 1409.0473v7

[Diagram: the decoder attends over "Les chiens aiment les os <EOS>", taking a weighted sum (+) of the encoder states, starting from <EOS>]

Page 23: Sequence to sequence (encoder-decoder) learning

Attention models for NLP

Source: arXiv 1409.0473v7

[Diagram: attending over "Les chiens aiment les os <EOS>", the decoder emits "Dogs"]

Page 24: Sequence to sequence (encoder-decoder) learning

Attention models for NLP

Source: arXiv 1409.0473v7

[Diagram: attending over "Les chiens aiment les os <EOS>", the decoder emits "Dogs", then "love"]

Page 25: Sequence to sequence (encoder-decoder) learning

Attention models for NLP

Source: arXiv 1409.0473v7

[Diagram: attending over "Les chiens aiment les os <EOS>", the decoder emits "Dogs", "love", "bones"]

Page 26: Sequence to sequence (encoder-decoder) learning

Challenges in using the model
● Cannot handle true variable-size input
● Hard to deal with both short and long sentences

Techniques: PADDING, BUCKETING, WORD EMBEDDINGS (capture context / semantic meaning)

Source: http://suriyadeepan.github.io/

Page 27: Sequence to sequence (encoder-decoder) learning

padding

Source: http://suriyadeepan.github.io/

EOS: end of sentence
PAD: filler
GO: start decoding
UNK: unknown; word not in vocabulary

Q: "What time is it?"
A: "It is seven thirty."

Q: [ PAD, PAD, PAD, PAD, PAD, "?", "it", "is", "time", "What" ]
A: [ GO, "It", "is", "seven", "thirty", ".", EOS, PAD, PAD, PAD ]
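A minimal sketch of that preprocessing in plain Python (the token names and the fixed length of 10 come from the slide; note that the question is also reversed before padding, as in the example above):

# Special tokens used on the slide
PAD, GO, EOS = "PAD", "GO", "EOS"

def pad_question(tokens, length=10):
    # Reverse the source tokens and left-pad with PAD up to `length`
    rev = list(reversed(tokens))
    return [PAD] * (length - len(rev)) + rev

def pad_answer(tokens, length=10):
    # Prepend GO, append EOS, then right-pad with PAD up to `length`
    seq = [GO] + list(tokens) + [EOS]
    return seq + [PAD] * (length - len(seq))

print(pad_question(["What", "time", "is", "it", "?"]))
# ['PAD', 'PAD', 'PAD', 'PAD', 'PAD', '?', 'it', 'is', 'time', 'What']
print(pad_answer(["It", "is", "seven", "thirty", "."]))
# ['GO', 'It', 'is', 'seven', 'thirty', '.', 'EOS', 'PAD', 'PAD', 'PAD']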

Page 28: Sequence to sequence (encoder-decoder) learning

Source: https://www.tensorflow.org/

bucketing

Efficiently handles sentences of different lengths

Ex: the largest sentence in the corpus has 100 tokens

What about short sentences like "How are you?" → lots of PAD

Bucket list: [(5, 10), (10, 15), (20, 25), (40, 50)] (default in TensorFlow's translate.py)

Q: [ PAD, PAD, ".", "go", "I" ]
A: [ GO, "Je", "vais", ".", EOS, PAD, PAD, PAD, PAD, PAD ]
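A small sketch of how a bucket could be picked for a sentence pair (bucket sizes taken from the slide; the helper below is illustrative, not the actual translate.py code):

# Buckets of (max source length, max target length), as listed on the slide
BUCKETS = [(5, 10), (10, 15), (20, 25), (40, 50)]

def pick_bucket(source_len, target_len, buckets=BUCKETS):
    # Choose the smallest bucket that fits both sequences; pad up to its sizes
    for src_max, tgt_max in buckets:
        if source_len <= src_max and target_len <= tgt_max:
            return (src_max, tgt_max)
    raise ValueError("sentence pair longer than the largest bucket")

# "I go ." (3 source tokens) and "Je vais ." plus GO/EOS (5 target tokens)
print(pick_bucket(3, 5))  # -> (5, 10), matching the padded example above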

Page 29: Sequence to sequence (encoder-decoder) learning

Word embeddings (remember the previous presentation ;-)
Distributed representations → syntactic and semantic information is captured

"Take" = [ 0.286, 0.792, -0.177, -0.107, 0.109, -0.542, 0.349, 0.271 ]

Page 30: Sequence to sequence (encoder-decoder) learning

Word embeddings (remember the previous presentation ;-)
Linguistic regularities (recap)

Page 31: Sequence to sequence (encoder-decoder) learning

Phrase representations (Paper: Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation)

Source: arXiv 1406.1078v3

Page 32: Sequence to sequence (encoder-decoder) learning

Phrase representations (Paper: Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation)

Source: arXiv 1406.1078v3

1000d vector representation

Page 33: Sequence to sequence (encoder-decoder) learning

applications

Page 34: Sequence to sequence (encoder-decoder) learning

Neural conversational model - chatbots

Source: arXiv 1506.05869v3

Page 35: Sequence to sequence (encoder-decoder) learning

Google Smart reply

Page 36: Sequence to sequence (encoder-decoder) learning

Google Smart reply

Source: arXiv 1606.04870v1

Interesting facts
● Currently responsible for 10% of Inbox replies
● Training set of 238 million messages

Page 37: Sequence to sequence (encoder-decoder) learning

Google Smart reply

Source: arXiv 1606.04870v1

Interesting facts
● Currently responsible for 10% of Inbox replies
● Training set of 238 million messages

[System diagram: Seq2Seq model, feedforward triggering model, semi-supervised semantic clustering]

Page 38: Sequence to sequence (encoder-decoder) learning

Image captioning (Paper: Show and Tell: A Neural Image Caption Generator)

Source: arXiv 1411.4555v2

Page 39: Sequence to sequence (encoder-decoder) learning

Image captioning (Paper: Show and Tell: A Neural Image Caption Generator)

Encoder

Decoder

Source: arXiv 1411.4555v2

Page 40: Sequence to sequence (encoder-decoder) learning

What's next?

And so?

Page 41: Sequence to sequence (encoder-decoder) learning

Multi-task sequence to sequence (Paper: Multi-Task Sequence to Sequence Learning)

Source: arXiv 1511.06114v4

One-to-Many (common encoder)

Many-to-One (common decoder)

Many-to-Many

Page 42: Sequence to sequence (encoder-decoder) learning

Neural programmer (Paper: Neural Programmer: Inducing Latent Programs with Gradient Descent)

Source: arXiv 1511.04834v3

Page 43: Sequence to sequence (encoder-decoder) learning

Unsupervised pre-training for seq2seq - 2017 (Paper: Unsupervised Pretraining for Sequence to Sequence Learning)

Source: arXiv 1611.02683v1

Page 44: Sequence to sequence (encoder-decoder) learning

Unsupervised pre-training for seq2seq - 2017 (Paper: Unsupervised Pretraining for Sequence to Sequence Learning)

Source: arXiv 1611.02683v1

[Diagram: pre-trained encoder and pre-trained decoder]

Page 45: Sequence to sequence (encoder-decoder) learning

[email protected]

@rsilveira79

Page 46: Sequence to sequence (encoder-decoder) learning


A quick example on TensorFlow
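The live TensorFlow demo is not captured in this transcript; as a stand-in, here is a minimal encoder-decoder sketch using today's tf.keras API rather than the TF 1.x translate.py code of the era (vocabulary sizes and dimensions are arbitrary placeholders, and this is a generic sketch, not the speaker's actual example):

import tensorflow as tf
from tensorflow.keras import layers, Model

# Illustrative sizes only
SRC_VOCAB, TGT_VOCAB = 8000, 8000
EMB_DIM, HIDDEN = 128, 256

# Encoder: embed the source tokens, keep only the final LSTM state
enc_in = layers.Input(shape=(None,), name="source_tokens")
enc_emb = layers.Embedding(SRC_VOCAB, EMB_DIM)(enc_in)
_, state_h, state_c = layers.LSTM(HIDDEN, return_state=True)(enc_emb)

# Decoder: start from the encoder state, predict the next target token
dec_in = layers.Input(shape=(None,), name="target_tokens")
dec_emb = layers.Embedding(TGT_VOCAB, EMB_DIM)(dec_in)
dec_out, _, _ = layers.LSTM(HIDDEN, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
probs = layers.Dense(TGT_VOCAB, activation="softmax")(dec_out)

# Trained with teacher forcing: the decoder input is the target shifted by one
model = Model([enc_in, dec_in], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()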