Kapil Thadani (kapil@cs.columbia.edu)
Lecture 8: Recurrent Neural Networks and Sequence-to-Sequence Networks
2
Outline

◦ Recurrent neural networks
- Connections and updates
- Activation functions
- Gated units

◦ Sequence-to-sequence networks
- Machine translation
- Encoder-decoder architectures
- Attention mechanism
- Large vocabularies
- Copying mechanism
- Scheduled sampling
- Multilingual MT
3
Recurrent connections

[Diagram: the input vector xt feeds the hidden state ht through Wxh; ht feeds back into itself through Whh; the output vector yt is read out from ht through Why]

ht = φh(Wxh xt + Whh ht−1)
yt = φy(Why ht)
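As a concrete illustration of these two updates, here is a minimal numpy sketch of one recurrent step; the identity read-out for φy and all dimensions are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

def rnn_step(x_t, h_prev, Wxh, Whh, Why):
    """One step of the recurrence: h_t from x_t and h_{t-1}, then y_t from h_t."""
    h_t = np.tanh(Wxh @ x_t + Whh @ h_prev)   # phi_h = tanh
    y_t = Why @ h_t                           # phi_y taken as identity here
    return h_t, y_t

d_in, d_h, d_out = 4, 8, 3
rng = np.random.default_rng(0)
Wxh = rng.normal(scale=0.1, size=(d_h, d_in))
Whh = rng.normal(scale=0.1, size=(d_h, d_h))
Why = rng.normal(scale=0.1, size=(d_out, d_h))

h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):          # a short input sequence
    h, y = rnn_step(x, h, Wxh, Whh, Why)
```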
4
Recurrent connections: Unfolding

[Diagram: the recurrence unfolded over inputs x1, x2, x3, x4, · · · ; at each step xt produces ht via Wxh, ht−1 feeds ht via Whh, and ht produces yt via Why, with the same weight matrices shared across all steps]
4
Recurrent connections: Backprop through time

[Diagram: the same unfolded network; the loss gradients ∂L/∂Wxh, ∂L/∂Whh and ∂L/∂Why are computed at every time step and accumulated backwards through the unrolled sequence]
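The following is a rough numpy sketch of backprop through time for the plain RNN above, assuming a squared-error loss at every output step; it is only meant to show how the gradients for the shared weights Wxh, Whh, Why accumulate across the unfolded steps.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 4, 8, 3, 5
Wxh = rng.normal(scale=0.1, size=(d_h, d_in))
Whh = rng.normal(scale=0.1, size=(d_h, d_h))
Why = rng.normal(scale=0.1, size=(d_out, d_h))
xs = rng.normal(size=(T, d_in))
targets = rng.normal(size=(T, d_out))

# Forward pass: unfold the recurrence and cache every hidden state
hs = [np.zeros(d_h)]
ys = []
for t in range(T):
    hs.append(np.tanh(Wxh @ xs[t] + Whh @ hs[-1]))
    ys.append(Why @ hs[-1])

# Backward pass: gradients for the shared weights sum over all time steps
dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
dh_next = np.zeros(d_h)                    # gradient flowing in from step t+1
for t in reversed(range(T)):
    dy = ys[t] - targets[t]                # dL/dy_t for a squared-error loss
    dWhy += np.outer(dy, hs[t + 1])
    dh = Why.T @ dy + dh_next              # local + through-time contribution
    da = dh * (1.0 - hs[t + 1] ** 2)       # backprop through tanh
    dWxh += np.outer(da, xs[t])
    dWhh += np.outer(da, hs[t])
    dh_next = Whh.T @ da                   # pass gradient back to step t-1
```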
5
Activation functions

φh is typically a smooth, bounded function, e.g., σ, tanh

[Diagram: ht−1 and xt combined through a tanh unit to produce ht]

ht = tanh(Wxh xt + Whh ht−1)

− Susceptible to vanishing gradients
− Can fail to capture long-term dependencies
6
Long short-term memory (LSTM) Hochreiter & Schmidhuber (1997)

[Diagram: the cell state ct−1 flows through the unit largely unchanged; forget, input, and output gates computed from xt and ht−1 control what is erased, written, and exposed as ht]

ft = σ(Wfx xt + Wfh ht−1)
it = σ(Wix xt + Wih ht−1)
ot = σ(Wox xt + Woh ht−1)
c̃t = tanh(Wxh xt + Whh ht−1)
ct = ft ⊙ ct−1 + it ⊙ c̃t
ht = ot ⊙ tanh(ct)
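A minimal numpy sketch of one LSTM step following these equations; biases are omitted and the weight-dictionary naming is illustrative, not from the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W):
    """One LSTM step: gates from x_t and h_{t-1}, then new cell and hidden states."""
    f = sigmoid(W["fx"] @ x_t + W["fh"] @ h_prev)         # forget gate
    i = sigmoid(W["ix"] @ x_t + W["ih"] @ h_prev)         # input gate
    o = sigmoid(W["ox"] @ x_t + W["oh"] @ h_prev)         # output gate
    c_tilde = np.tanh(W["cx"] @ x_t + W["ch"] @ h_prev)   # candidate cell state
    c = f * c_prev + i * c_tilde
    h = o * np.tanh(c)
    return h, c

d_in, d_h = 4, 8
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(d_h, d_in if k.endswith("x") else d_h))
     for k in ["fx", "fh", "ix", "ih", "ox", "oh", "cx", "ch"]}
h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):
    h, c = lstm_step(x, h, c, W)
```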
7
Gated Recurrent Unit (GRU) Cho et al. (2014)

[Diagram: reset and update gates computed from xt and ht−1 control the candidate state and the interpolation between ht−1 and h̃t]

rt = σ(Wrx xt + Wrh ht−1)
zt = σ(Wzx xt + Wzh ht−1)
h̃t = tanh(Wxh xt + Whh (rt ⊙ ht−1))
ht = (1 − zt) ⊙ ht−1 + zt ⊙ h̃t
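Analogously, a minimal numpy sketch of one GRU step following these equations; biases are omitted and the naming is illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W):
    """One GRU step: reset and update gates, candidate state, then interpolation."""
    r = sigmoid(W["rx"] @ x_t + W["rh"] @ h_prev)              # reset gate
    z = sigmoid(W["zx"] @ x_t + W["zh"] @ h_prev)              # update gate
    h_tilde = np.tanh(W["hx"] @ x_t + W["hh"] @ (r * h_prev))  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde

d_in, d_h = 4, 8
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(d_h, d_in if k.endswith("x") else d_h))
     for k in ["rx", "rh", "zx", "zh", "hx", "hh"]}
h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):
    h = gru_step(x, h, W)
```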
8
Processing text with RNNs

Input
- Word/sentence embeddings
- One-hot words/characters
- CNNs over characters/words/sentences, e.g., document modeling
- Absent, e.g., RNN-LMs

Recurrent layer
- Gated units: LSTMs, GRUs
- Forward, backward, bidirectional
- ReLUs initialized with identity matrix

Output
- Softmax over words/characters/labels, e.g., text generation
- Deeper RNN layers
- Absent, e.g., text encoders
9
Machine Translation

“One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’”

— Warren Weaver, Translation (1955)
10
The MT Pyramid

[Diagram: the classic MT pyramid; analysis climbs the source side and generation descends the target side through lexical, syntactic, semantic, and pragmatic levels, meeting at an interlingua at the apex]
11
Phrase-based MT

[Word-aligned sentence pair:]
Tomorrow I will fly to the conference in Canada
Morgen fliege Ich nach Kanada zur Konferenz
12
Phrase-based MT

1. Collect bilingual dataset ⟨Si, Ti⟩ ∈ D
2. Unsupervised phrase-based alignment ▶ phrase table π
3. Unsupervised n-gram language modeling ▶ language model ψ
4. Supervised decoder ▶ parameters θ

T̂ = argmaxT p(T|S) = argmaxT p(S|T, π, θ) · p(T|ψ)
12
Neural MT

1. Collect bilingual dataset ⟨Si, Ti⟩ ∈ D
2. Unsupervised phrase-based alignment ▶ phrase table π
3. Unsupervised n-gram language modeling ▶ language model ψ
4. Supervised encoder-decoder framework ▶ parameters θ
13
Encoder

▶ Input: source words x1, . . . , xn
▶ Output: context vector c

[Diagram: an RNN with gated units reads x1 . . . xn and produces hidden states h1 . . . hn; the final state summarizes the source as the context vector c]
14
Decoder

▶ Input: context vector c
▶ Output: translated words y1, . . . , ym

[Diagram: an RNN with gated units produces decoder states s1 . . . sm from c; a softmax over each state yields the output words y1 . . . ym]

si = f(si−1, yi−1, c)
15
Sequence-to-sequence learning Sutskever, Vinyals & Le (2014)

[Diagram: the encoder RNN maps x1 . . . xn to hidden states h1 . . . hn; the final state hn initializes the decoder, whose states s1 . . . sm generate y1 . . . ym]

si = f(si−1, yi−1, hn)
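A compact numpy sketch of this encoder-decoder setup; as a simplification of si = f(si−1, yi−1, hn), the context hn here only initializes the decoder state, and greedy decoding stands in for beam search. All sizes, names, and the start-symbol handling are illustrative assumptions.

```python
import numpy as np

def rnn_step(x, h, Wx, Wh):
    return np.tanh(Wx @ x + Wh @ h)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_emb, d_h, V_tgt, max_len = 8, 16, 20, 6
enc = {"Wx": rng.normal(scale=0.1, size=(d_h, d_emb)),
       "Wh": rng.normal(scale=0.1, size=(d_h, d_h))}
dec = {"Wx": rng.normal(scale=0.1, size=(d_h, d_emb)),
       "Wh": rng.normal(scale=0.1, size=(d_h, d_h)),
       "Wy": rng.normal(scale=0.1, size=(V_tgt, d_h))}
E_tgt = rng.normal(scale=0.1, size=(V_tgt, d_emb))   # target word embeddings

# Encoder: read source embeddings; the final hidden state is the context c = h_n
source = rng.normal(size=(5, d_emb))
h = np.zeros(d_h)
for x in source:
    h = rnn_step(x, h, enc["Wx"], enc["Wh"])
c = h

# Decoder: each state depends on the previous state and the previous output word
s, y_prev = c, np.zeros(d_emb)                        # y_0: a start-symbol embedding
outputs = []
for _ in range(max_len):
    s = rnn_step(y_prev, s, dec["Wx"], dec["Wh"])
    y_i = int(np.argmax(softmax(dec["Wy"] @ s)))      # greedy choice of next word
    outputs.append(y_i)
    y_prev = E_tgt[y_i]
```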
16
Sequence-to-sequence learning Sutskever, Vinyals & Le (2014)

Produces a fixed-length representation of the input
- “sentence embedding” or “thought vector”
17
Sequence-to-sequence learning Sutskever, Vinyals & Le (2014)

LSTM units do not solve vanishing gradients
- Poor performance on long sentences
- Need to reverse the input
18
Attention-based translation Bahdanau et al (2014)

[Diagram: for decoder step 5, the previous decoder state s4 is scored against each encoder state h1 . . . hn by a feedforward network; the scores e4,1 . . . e4,n are normalized with a softmax into weights α4,1 . . . α4,n, and the weighted average of the encoder states forms the context vector c5 that feeds the next decoder state s5]

eij = a(si−1, hj)
αij = exp(eij) / ∑k exp(eik)
ci = ∑j αij hj
si = f(si−1, yi−1, ci)
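A minimal numpy sketch of one attention step; the additive (feedforward) scoring function and its parameter shapes are assumptions in the spirit of Bahdanau et al., not taken verbatim from the slides.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n, d_h, d_s, d_a = 6, 8, 8, 10
H = rng.normal(size=(n, d_h))            # encoder states h_1 .. h_n
s_prev = rng.normal(size=d_s)            # previous decoder state s_{i-1}

# Additive scoring function a(s_{i-1}, h_j), a small feedforward network
Wa = rng.normal(scale=0.1, size=(d_a, d_s))
Ua = rng.normal(scale=0.1, size=(d_a, d_h))
va = rng.normal(scale=0.1, size=d_a)

e = np.array([va @ np.tanh(Wa @ s_prev + Ua @ h_j) for h_j in H])  # e_{ij}
alpha = softmax(e)                        # alpha_{ij}, the attention weights
c_i = alpha @ H                           # c_i = sum_j alpha_{ij} h_j
# c_i then feeds the next decoder state: s_i = f(s_{i-1}, y_{i-1}, c_i)
```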
19
Attention-based translation Bahdanau et al (2014)

◦ Encoder:
- Bidirectional RNN: forward (hj) + backward (h′j)

◦ GRUs instead of LSTMs
- Simpler, fewer parameters

◦ Decoder:
- si and yi also depend on yi−1
- Additional hidden layer prior to softmax for yi
- Inference is O(mn) instead of O(m) for seq-to-seq
20
Attention-based translation Bahdanau et al (2014)
Improved results on long sentences
21
Attention-based translation Bahdanau et al (2014)

Sensible induced alignments
22
Images
Show, Attend & Tell: Neural Image Caption Generation with Visual Attention (Xu et al. 2015)
23
Videos
Describing Videos by Exploiting Temporal Structure (Yao et al. 2015)
24
Large vocabularies
Sequence-to-sequence models can typically scale to 30K-50K words
But real-world applications need at least 500K-1M words
25
Large vocabularies

Alternative 1: Hierarchical softmax

- Predict path in binary tree representation of output layer
- Reduces to log2(V) binary decisions

p(wt = “dog” | · · · ) = (1 − σ(U0 ht)) × σ(U1 ht) × σ(U4 ht)

[Diagram: a binary tree with internal nodes 0–6; the leaves, left to right, are cow, duck, cat, dog, she, he, and, the]
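A small sketch of how such a path probability could be evaluated; the tree encoding and the convention that σ(Un ht) is the probability of taking the right branch at node n are illustrative assumptions matching the slide's example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical tree: each word maps to a root-to-leaf path of
# (internal node id, go_right) pairs, as in the slide's 8-word example.
paths = {
    "dog": [(0, 0), (1, 1), (4, 1)],   # left at node 0, right at 1, right at 4
    "cow": [(0, 0), (1, 0), (3, 0)],
}

def word_prob(word, h, U):
    """p(word | h) as a product of log2(V) binary decisions along the path."""
    p = 1.0
    for node, go_right in paths[word]:
        s = sigmoid(U[node] @ h)       # probability of taking the right branch
        p *= s if go_right else (1.0 - s)
    return p

d = 8
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(7, d))  # one vector per internal node 0..6
h = rng.normal(size=d)
print(word_prob("dog", h, U))
```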
26
Large vocabularies Jean et al (2015)

Alternative 2: Importance sampling

- Expensive to compute the softmax normalization term over V

  p(yi = wj | y<i, x) = exp(Wj⊤ f(si, yi−1, ci)) / ∑k=1..|V| exp(Wk⊤ f(si, yi−1, ci))

- Use a small subset of the target vocabulary for each update
- Approximate expectation over gradient of loss with fewer samples
- Partition the training corpus and maintain local vocabularies in each partition to use GPUs efficiently
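A rough sketch of the cost saving from normalizing over a sampled subset of the vocabulary; note this omits the importance-weight correction that Jean et al. actually apply, so it only illustrates the idea, and all sizes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 50000, 16                          # the full softmax over V is the bottleneck
W = rng.normal(scale=0.1, size=(V, d))    # output word vectors W_j
f_i = rng.normal(size=d)                  # f(s_i, y_{i-1}, c_i) for the current step
target = 1234                             # index of the reference word

# Score only the target plus a small sampled subset of the vocabulary,
# and normalize over that subset instead of over all |V| words
sample = rng.choice(V, size=500, replace=False)
subset = np.unique(np.concatenate(([target], sample)))
logits = W[subset] @ f_i
probs = np.exp(logits - logits.max())
probs /= probs.sum()
p_target = probs[np.where(subset == target)[0][0]]
```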
27
Large vocabularies Wu et al (2016)
Alternative 3: Wordpiece units
- Reduce vocabulary by replacing infrequent words with wordpieces
Jet makers feud over seat width with big orders at stake
⇓
_J et _makers _fe ud _over _seat _width _with _big _orders _at _stake
28
Copying mechanism Gu et al (2016)

In monolingual tasks, copy rare words directly from the input

Generation via standard attention-based decoder
  ψg(yi = wj) = Wj⊤ f(si, yi−1, ci)        for wj ∈ V

Copying via a non-linear projection of input hidden states
  ψc(yi = xj) = tanh(hj⊤ U) f(si, yi−1, ci)        for xj ∈ X

Both modes compete via the softmax
  p(yi = wj | y<i, x) = (1/Z) [ exp(ψg(wj)) + ∑k: xk=wj exp(ψc(xk)) ]
29
Copying mechanism Gu et al (2016)
Decoding probability p(yt| · · · )
30
Copying mechanism Gu et al (2016)
31
Scheduled sampling Bengio et al (2015)
Decoder outputs yi and hidden states si are typically conditioned on previous outputs

si = f(si−1, yi−1, ci)

At training time, these previous outputs are taken from the labels (“teacher forcing”)
▶ Model can’t learn to recover from its own errors

Replace labels with model outputs during training, using an annealed sampling probability
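A minimal sketch of scheduled sampling inside a hypothetical training loop, using the inverse-sigmoid decay schedule described by Bengio et al.; the constant k and the helper names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sampling_prob(step, k=1000.0):
    """Probability of feeding the ground-truth token, annealed with inverse-sigmoid decay."""
    return k / (k + np.exp(step / k))

def choose_prev_token(gold_prev, model_prev, step):
    """At each decoder position, use the gold previous token with annealed probability,
    otherwise feed back the model's own previous output."""
    use_gold = rng.random() < sampling_prob(step)
    return gold_prev if use_gold else model_prev
```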
32
Multilingual MT Johnson et al (2016)

One model for translating between multiple languages
- Just add a language identification token before each sentence (e.g., a target-language tag such as <2es> prepended to the source)

t-SNE projection of learned representations of 74 sentences and different translations in English, Japanese and Korean