Transformers and sequence-to-sequence learning


Transcript of Transformers and sequence-to-sequence learning

Page 1: Transformers and sequence-to-sequence learning

Transformers and sequence-to-sequence learning

CS 685, Fall 2021

Mohit Iyyer
College of Information and Computer Sciences

University of Massachusetts Amherst

some slides from Emma Strubell

Page 2: Transformers and sequence-to-sequence learning

Attention is great

• Attention significantly improves NMT performance
  • It’s very useful to allow decoder to focus on certain parts of the source

• Attention solves the bottleneck problem
  • Attention allows decoder to look directly at source; bypass bottleneck

• Attention helps with vanishing gradient problem
  • Provides shortcut to faraway states

• Attention provides some interpretability
  • By inspecting attention distribution, we can see what the decoder was focusing on
  • We get alignment for free!
  • This is cool because we never explicitly trained an alignment system
  • The network just learned alignment by itself


Alignments: harder

Example: "The balance was the territory of the aboriginal people" ↔ "Le reste appartenait aux autochtones"

many-to-one alignments


Alignments: hardest

Example: "The poor don’t have any money" ↔ "Les pauvres sont démunis"

many-to-many alignment


phrase alignment

Alignment as a vector

English: Mary did not slap the green witch (i = 1 … 7)
Spanish: Maria no daba una bofetada a la bruja verde (j = 1 … 9)
Alignment vector: aj = 1 3 4 4 4 0 5 7 6 (i.e., aj = i)

• used in all IBM models
• a is a vector of length J
• maps indexes j to indexes i
• each aj ∈ {0, 1, …, I}
• aj = 0 means fj is "spurious"
• no one-to-many alignments
• no many-to-many alignments
• but provides foundation for phrase-based alignment

IBM Model 1 generative story

Example: e = "And the program has been implemented", f = "Le programme a été mis en application", with alignments aj = 2 3 4 5 6 6 6

Given English sentence e1, e2, … eI:
• Choose length J for French sentence
• For each j in 1 to J:
  – Choose aj uniformly from 0, 1, … I
  – Choose fj by translating e_aj

We want to learn how to do this. Want: P(f | e)
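To make the generative story concrete, here is a minimal Python sketch of it. The toy translation table t(f | e) and the function name `sample_french` are illustrative assumptions, not learned Model 1 parameters:

```python
import random

# Toy translation table t(f | e) with made-up probabilities (illustration only).
translation_table = {
    "NULL": {"les": 0.5, "de": 0.5},   # e_0 = NULL generates "spurious" French words (aj = 0)
    "the": {"les": 0.8, "le": 0.2},
    "poor": {"pauvres": 1.0},
}

def sample_french(english_tokens, J):
    """IBM Model 1 generative story: given e_1 .. e_I and a chosen length J,
    pick each alignment a_j uniformly from {0, 1, ..., I}, then generate the
    French word f_j by translating e_{a_j}."""
    e = ["NULL"] + english_tokens            # index 0 is the NULL word
    I = len(english_tokens)
    french, alignment = [], []
    for _ in range(J):
        a_j = random.randint(0, I)           # choose a_j uniformly from 0 .. I
        dist = translation_table[e[a_j]]
        f_j = random.choices(list(dist), weights=list(dist.values()))[0]
        french.append(f_j)
        alignment.append(a_j)
    return french, alignment

print(sample_french(["the", "poor"], J=2))   # e.g. (['les', 'pauvres'], [1, 2])
```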

IBM Model 1 parameters

(Same example as above: "And the program has been implemented" / "Le programme a été mis en application", aj = 2 3 4 5 6 6 6.)

Applying Model 1*

P(f, a | e) can be used:
• as a translation model
• as an alignment model

* Actually, any P(f, a | e), e.g., any IBM model

From last time

• Project proposals now due 10/1!

• Quiz 2 due Friday

• Can TAs record zoom office hours? Maybe

• How did we get this grid in the previous lecture? Will explain in today’s class.

• Final proj. reports due Dec. 16th

Page 3: Transformers and sequence-to-sequence learning

iPad

Page 4: Transformers and sequence-to-sequence learning

sequence-to-sequence learning

• we’ll use French (f) to English (e) as a running example

• goal: given French sentence f with tokens f1, f2, … fn, produce English translation e with tokens e1, e2, … em

• real goal: compute arg max_e p(e | f)

Used when inputs and outputs are both sequences of words (e.g., machine translation, summarization)

Page 5: Transformers and sequence-to-sequence learning

This is an instance of conditional language modeling

p(e | f) = p(e1, e2, …, em | f)
         = p(e1 | f) ⋅ p(e2 | e1, f) ⋅ p(e3 | e2, e1, f) ⋅ …
         = ∏_{i=1}^{m} p(ei | e1, …, ei−1, f)

Just like we’ve seen before, except we additionally condition our prediction of the next word on some other input (here, the French sentence)
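As a concrete illustration of this factorization, here is a minimal sketch that scores a candidate translation by summing per-token log-probabilities. The function `next_token_logprobs` is a hypothetical stand-in for whatever conditional model supplies p(ei | e1, …, ei−1, f):

```python
import math

def score_translation(f_tokens, e_tokens, next_token_logprobs):
    """Compute log p(e | f) = sum_i log p(e_i | e_1 .. e_{i-1}, f).

    `next_token_logprobs(prefix, f_tokens)` is assumed to return a dict
    mapping each candidate next English word to its log-probability.
    """
    total = 0.0
    for i, e_i in enumerate(e_tokens):
        prefix = e_tokens[:i]                      # e_1 .. e_{i-1}
        logprobs = next_token_logprobs(prefix, f_tokens)
        total += logprobs[e_i]                     # log p(e_i | prefix, f)
    return total

# Toy example with a uniform model over a 5-word vocabulary:
uniform = lambda prefix, f: {w: math.log(1 / 5) for w in
                             ["the", "poor", "don't", "have", "money"]}
print(score_translation(["les", "pauvres"], ["the", "poor"], uniform))
# -> 2 * log(1/5) ≈ -3.22
```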

Page 6: Transformers and sequence-to-sequence learning

seq2seq models

• use two different neural networks to model ∏_{i=1}^{m} p(ei | e1, …, ei−1, f)

• first we have the encoder, which encodes the French sentence f

• then, we have the decoder, which produces the English sentence e

Page 7: Transformers and sequence-to-sequence learning

Neural Machine Translation (NMT): the sequence-to-sequence model

Source sentence (input): les pauvres sont démunis
Target sentence (output): the poor don’t have any money <END>

The Encoder RNN produces an encoding of the source sentence. The encoding of the source sentence provides the initial hidden state for the Decoder RNN. The Decoder RNN is a Language Model that generates the target sentence conditioned on the encoding: starting from <START>, the argmax of the predicted distribution at each step ("the", "poor", "don’t", "have", "any", "money", <END>) is taken as that step’s output word.

Note: This diagram shows test time behavior: decoder output is fed in as next step’s input.
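A minimal sketch of the greedy (argmax) decoding loop the diagram depicts, assuming a hypothetical model object with `encode` and `step` methods (these names and signatures are illustrative, not a real library API):

```python
def greedy_decode(model, source_tokens, max_len=20):
    """Greedy decoding: at each step, feed the previously generated word
    back in as the next input and take the argmax of the predicted distribution."""
    encoding = model.encode(source_tokens)   # encoder produces the source encoding
    output, prev, state = [], "<START>", None
    for _ in range(max_len):
        # hypothetical API: returns (distribution over vocab, new decoder state)
        probs, state = model.step(prev, state, encoding)
        prev = max(probs, key=probs.get)     # argmax over the vocabulary
        if prev == "<END>":
            break
        output.append(prev)
    return output

# e.g. greedy_decode(model, ["les", "pauvres", "sont", "démunis"])
# might produce ["the", "poor", "don't", "have", "any", "money"]
```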

Page 8: Transformers and sequence-to-sequence learning

(Same sequence-to-sequence diagram as on Page 7.)

Page 9: Transformers and sequence-to-sequence learning

Training a Neural Machine Translation system

Source sentence (from corpus): les pauvres sont démunis
Target sentence (from corpus): <START> the poor don’t have any money

Seq2seq is optimized as a single system. Backpropagation operates “end to end”.

The training loss averages the per-step losses over the T decoder steps: J = (1/T) Σ_{t=1}^{T} J_t, where J_t is the negative log probability of the gold target word at step t (e.g., J_1 = negative log prob of “the”, J_4 = negative log prob of “have”, J_T = negative log prob of <END>).
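A minimal PyTorch-style sketch of this training setup (teacher forcing: feed in the gold target words, and the loss is the average negative log probability of each gold next word). The model call signature here is a hypothetical assumption:

```python
import torch.nn.functional as F

def seq2seq_loss(model, src_ids, tgt_ids):
    """J = (1/T) * sum_t J_t, where J_t is the negative log probability
    of the gold target word at step t (cross-entropy with teacher forcing)."""
    # decoder inputs are the gold words shifted right: <START> the poor ...
    decoder_inputs = tgt_ids[:, :-1]
    gold_next_words = tgt_ids[:, 1:]          # the poor ... <END>
    # hypothetical model API: returns logits of shape (batch, T, vocab)
    logits = model(src_ids, decoder_inputs)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (batch*T, vocab)
        gold_next_words.reshape(-1),          # (batch*T,)
    )                                         # mean over all steps = (1/T) Σ J_t
```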

Page 10: Transformers and sequence-to-sequence learning

We’ll talk much more about machine translation / other seq2seq problems later… but for now, let’s go back to the Transformer

Page 11: Transformers and sequence-to-sequence learning
Page 12: Transformers and sequence-to-sequence learning

encoder

Page 13: Transformers and sequence-to-sequence learning

encoder

decoder

Page 14: Transformers and sequence-to-sequence learning

encoder

decoder

So far we’ve just talked about self-attention… what is all this other stuff?

Page 15: Transformers and sequence-to-sequence learning

Self-attention (in encoder) [Vaswani et al. 2017]. Slides by Emma Strubell!

(Diagram: at Layer p, query (Q), key (K), and value (V) vectors are computed for each token of the example sentence "Nobel committee awards Strickland who advanced optics".)

Page 16: Transformers and sequence-to-sequence learning

(Same self-attention diagram as on Page 15.)

Page 17: Transformers and sequence-to-sequence learning

Self-attention (in encoder), continued. (Diagram: each token’s query is compared against the key of every token in the sentence: Nobel, committee, awards, Strickland, who, advanced, optics.) [Vaswani et al. 2017] Slides by Emma Strubell!

Page 18: Transformers and sequence-to-sequence learning

Self-attention (in encoder), continued. (Diagram: the query/key comparisons produce the attention weights A.) [Vaswani et al. 2017] Slides by Emma Strubell!

Page 19: Transformers and sequence-to-sequence learning

(Same diagram as on Page 18.)

Page 20: Transformers and sequence-to-sequence learning

(Same diagram as on Page 18.)

Page 21: Transformers and sequence-to-sequence learning

Self-attention (in encoder), continued. (Diagram: the attention weights A are used to take a weighted average of the value vectors, producing the output M.) [Vaswani et al. 2017] Slides by Emma Strubell!

Page 22: Transformers and sequence-to-sequence learning

(Same diagram as on Page 21.)
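To make the Q/K/V picture concrete, here is a minimal single-head self-attention sketch using scaled dot-product attention as in Vaswani et al. (2017); the random projection matrices and sizes are illustrative stand-ins:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over a sequence of token vectors X (n x d).
    A = softmax(Q K^T / sqrt(d_k)) are the attention weights; the output
    M = A V is an attention-weighted average of the value vectors."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # (n, n) token-to-token scores
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)              # softmax over each row
    return A @ V, A

# toy example: 7 tokens ("Nobel committee awards Strickland who advanced optics"),
# embedding size 16, head size 8
rng = np.random.default_rng(0)
X = rng.normal(size=(7, 16))
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
M, A = self_attention(X, Wq, Wk, Wv)
print(M.shape, A.shape)   # (7, 8) (7, 7)
```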

Page 23: Transformers and sequence-to-sequence learning

Multi-head self-attention [Vaswani et al. 2017]. Slides by Emma Strubell!

(Diagram: the self-attention computation is repeated in parallel with H heads, producing outputs M1 … MH.)

Page 24: Transformers and sequence-to-sequence learning

(Same multi-head self-attention diagram as on Page 23.)

Page 25: Transformers and sequence-to-sequence learning

Multi-head self-attention, continued. (Diagram: a Feed Forward layer is applied at every position on top of the multi-head outputs, producing the representations of Layer p+1.) [Vaswani et al. 2017] Slides by Emma Strubell!

Page 26: Transformers and sequence-to-sequence learning

(Same diagram as on Page 25, showing Layer p+1.)

Page 27: Transformers and sequence-to-sequence learning

Multi-head self-attention + feed forward. (Diagram: the encoder stacks these blocks, each consisting of multi-head self-attention + feed forward, from Layer 1 through Layer p up to Layer J.) [Vaswani et al. 2017] Slides by Emma Strubell!

Page 28: Transformers and sequence-to-sequence learning

Position embeddings are added to each word embedding. Otherwise, since we have no recurrence, our model is unaware of the position of a word in the sequence!

Page 29: Transformers and sequence-to-sequence learning

Residual connections, which mean that we add the input to a particular block to its output, help improve gradient flow.

Page 30: Transformers and sequence-to-sequence learning

A feed-forward layer on top of the attention-weighted averaged value vectors allows us to add more parameters / nonlinearity.

Page 31: Transformers and sequence-to-sequence learning

We stack as many of these Transformer blocks on top of each other as we can (bigger models are generally better given enough data!)
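A minimal sketch of one such block and the stacking, pulling together the pieces from the previous few slides (multi-head self-attention, the feed-forward layer, residual connections, and position embeddings). It uses PyTorch's built-in nn.MultiheadAttention for brevity; the layer-norm placement and default sizes follow Vaswani et al. (2017) rather than anything stated on the slides, and the learned position-embedding table is one option (the sinusoidal encoding discussed later is the alternative):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer encoder block: multi-head self-attention + feed forward,
    each wrapped with a residual connection (add the block's input to its output)
    and layer norm (placement as in Vaswani et al. 2017)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):                           # x: (batch, seq_len, d_model)
        x = self.norm1(x + self.attn(x, x, x)[0])   # residual around attention
        x = self.norm2(x + self.ff(x))              # residual around feed forward
        return x

class Encoder(nn.Module):
    """Stack of n_layers encoder blocks, with position embeddings added to the
    word embeddings so the model knows where each word is."""
    def __init__(self, vocab_size, max_len=512, d_model=512, n_layers=6):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)   # learned positions (sinusoidal is an alternative)
        self.blocks = nn.ModuleList(EncoderBlock(d_model) for _ in range(n_layers))

    def forward(self, token_ids):                   # token_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.word_emb(token_ids) + self.pos_emb(positions)
        for block in self.blocks:
            x = block(x)
        return x

# enc = Encoder(vocab_size=30000); enc(torch.randint(0, 30000, (2, 10))).shape
# -> torch.Size([2, 10, 512])
```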

Page 32: Transformers and sequence-to-sequence learning

Moving on to the decoder, which takes in English sequences that have been shifted to the right (e.g., <START> schools opened their).

Page 33: Transformers and sequence-to-sequence learning

We first have an instance of masked self-attention. Since the decoder is responsible for predicting the English words, we need to apply masking as we saw before.
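A minimal sketch of what the masking does: before the softmax, every position to the right of the current token gets a score of −∞, so the decoder cannot attend to words it has not generated yet. The helper names here are my own:

```python
import numpy as np

def causal_mask(n):
    """Mask attention to future positions: entry (i, j) is 0 if j <= i
    (token i may attend to token j) and -inf if j > i (j is in the future)."""
    mask = np.zeros((n, n))
    mask[np.triu_indices(n, k=1)] = -np.inf
    return mask

def masked_softmax(scores):
    scores = scores + causal_mask(scores.shape[-1])   # add -inf above the diagonal
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                 # uniform raw scores for 4 tokens
print(masked_softmax(scores).round(2))
# row i attends only to positions 0..i, e.g. row 0 -> [1, 0, 0, 0]
```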

Page 34: Transformers and sequence-to-sequence learning


Why don’t we do masked self-attention in the encoder?

Page 35: Transformers and sequence-to-sequence learning

Now, we have cross attention, which connects the decoder to the encoder by enabling it to attend over the encoder’s final hidden states.
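A minimal sketch of cross-attention: the only change from self-attention is that the queries come from the decoder states while the keys and values come from the encoder's final hidden states. The random matrices and sizes are illustrative stand-ins:

```python
import numpy as np

def cross_attention(decoder_states, encoder_states, Wq, Wk, Wv):
    """Queries from the decoder, keys/values from the encoder: each decoder
    position computes a weighted average over the encoder's hidden states."""
    Q = decoder_states @ Wq                  # (m, d_k): one query per target token
    K = encoder_states @ Wk                  # (n, d_k): one key per source token
    V = encoder_states @ Wv                  # (n, d_v)
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (m, n): target-to-source attention
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V                             # (m, d_v)

rng = np.random.default_rng(0)
dec = rng.normal(size=(3, 16))               # 3 decoder positions
enc = rng.normal(size=(7, 16))               # 7 encoder positions
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
print(cross_attention(dec, enc, Wq, Wk, Wv).shape)   # (3, 8)
```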

Page 36: Transformers and sequence-to-sequence learning

After stacking a bunch of these decoder blocks, we finally have our familiar Softmax layer to predict the next English word.

Page 37: Transformers and sequence-to-sequence learning

Positional encoding

Page 38: Transformers and sequence-to-sequence learning

Creating positional encodings?

• We could just concatenate a fixed value to each time step (e.g., 1, 2, 3, … 1000) that corresponds to its position, but then what happens if we get a sequence with 5000 words at test time?

• We want something that can generalize to arbitrary sequence lengths. We also may want to make attending to relative positions (e.g., tokens in a local window around the current token) easier.

• The distance between two positions should be consistent with variable-length inputs.

Page 39: Transformers and sequence-to-sequence learning

Intuitive example

https://kazemnejad.com/blog/transformer_architecture_positional_encoding/

Page 40: Transformers and sequence-to-sequence learning

Transformer positional encoding

The positional encoding is a 512-d vector; i = a particular dimension of this vector; pos = the position of the word in the sequence; d_model = 512.
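A minimal sketch of the sinusoidal encoding from Vaswani et al. (2017) that this slide describes, where PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)):

```python
import numpy as np

def positional_encoding(n_positions, d_model=512):
    """Sinusoidal positional encoding (Vaswani et al. 2017):
    PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))"""
    pos = np.arange(n_positions)[:, None]            # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))      # (n_positions, d_model/2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions
    return pe

# e.g. the 50 x 512 matrix visualized on the next slide:
print(positional_encoding(50).shape)                 # (50, 512)
```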

Page 41: Transformers and sequence-to-sequence learning

What does this look like? (each row is the positional embedding of one position in a 50-word sentence)

https://kazemnejad.com/blog/transformer_architecture_positional_encoding/

Page 42: Transformers and sequence-to-sequence learning

Despite the intuitive flaws, many models these days use learned positional embeddings (i.e., they cannot generalize to longer sequences, but this isn’t a big deal for their use cases).

Page 43: Transformers and sequence-to-sequence learning

Hacks to make Transformers work

Page 44: Transformers and sequence-to-sequence learning
Page 45: Transformers and sequence-to-sequence learning

I went to class and took ___

word:                   cats    TV      notes   took    sofa
one-hot target:         0       0       1       0       0
with label smoothing:   0.025   0.025   0.9     0.025   0.025
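A minimal sketch of label smoothing as shown above: the gold word keeps 1 − ε of the probability mass and ε is spread uniformly over the remaining words (ε = 0.1 with a 5-word vocabulary reproduces the 0.9 / 0.025 numbers on the slide):

```python
import numpy as np

def label_smooth(one_hot, epsilon=0.1):
    """Label smoothing as on the slide: the gold word keeps 1 - epsilon of the
    probability mass, and epsilon is spread uniformly over the other words."""
    one_hot = np.asarray(one_hot, dtype=float)
    V = one_hot.shape[-1]
    return one_hot * (1 - epsilon) + (1 - one_hot) * epsilon / (V - 1)

vocab  = ["cats", "TV", "notes", "took", "sofa"]
target = [0, 0, 1, 0, 0]                       # gold word: "notes"
print(dict(zip(vocab, np.round(label_smooth(target), 3).tolist())))
# -> {'cats': 0.025, 'TV': 0.025, 'notes': 0.9, 'took': 0.025, 'sofa': 0.025}
```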

Page 46: Transformers and sequence-to-sequence learning

Get penalized for overconfidence! (Plot: loss as a function of target word confidence.)

Page 47: Transformers and sequence-to-sequence learning

Why these decisions? Unsatisfying answer: they empirically worked well.

Neural architecture search finds even better Transformer variants:

Primer: Searching for efficient Transformer architectures… So et al., Sep. 2021

Page 48: Transformers and sequence-to-sequence learning

OpenAI’s Transformer LMs

• GPT (Jun 2018): 117 million parameters, trained on 13GB of data (~1 billion tokens)

• GPT-2 (Feb 2019): 1.5 billion parameters, trained on 40GB of data

• GPT-3 (July 2020): 175 billion parameters, ~500GB of data (300 billion tokens)

Page 49: Transformers and sequence-to-sequence learning

Coming up!

• Transfer learning via Transformer models like BERT

• Tokenization (word vs subword vs character/byte)

• Prompt-based learning

• Efficient / long-range Transformers

• Downstream tasks