Neural Machine Translation with Source Dependency Representation
Kehai Chen¹, Rui Wang², Masao Utiyama², Lemao Liu³, Akihiro Tamura⁴, Eiichiro Sumita² and Tiejun Zhao¹
¹Harbin Institute of Technology, China; ²National Institute of Information and Communications Technology, Japan; ³Tencent AI Lab, China; ⁴Ehime University, Japan
Overview
Standard NMT model
[Figure: standard encoder-decoder NMT translating the source words x1..x7 (Src) into the target words y1..y8 (Trg)]
• Traditional NMT Model
• Our proposed NMT model
Our proposed NMT model
[Figure: the proposed NMT model adds a source dependency tree (Src dep, with root) over the source words x1..x7 alongside the source sequence, translated into y1..y8]
Inspired by the use of syntax knowledge in SMT, we explicitly integrate source dependency information into NMT.
Related Work
• NMT with source syntax information
- Tree2seq (Eriguchi et al., 2016; Li et al., 2017; among others): a tree-based neural network is used to encode source phrase structures
- Extending source inputs with syntax labels (Sennrich et al., 2016; Chen et al., 2017; among others): dependency labels are concatenated to the source words
• Our work
- A compromise between these two lines of work
- A novel double-context approach to utilizing source dependency constraints
Source Dependency Representation (SDR)
• Extracting a dependency unit for each source word to capture source long-distance dependency constraints:
$U_j = \langle PA_{x_j}, SI_{x_j}, CH_{x_j} \rangle$
[Figure: source dependency tree (root) over the source words x1..x7]
Here $PA_{x_j}$, $SI_{x_j}$, and $CH_{x_j}$ denote the parent, sibling, and child words of the source word $x_j$ in the dependency structure.
Take $x_2$ as an example:
$PA_{x_2} = \langle x_3 \rangle, \quad SI_{x_2} = \langle x_1, x_4, x_7 \rangle, \quad CH_{x_2} = \langle \varepsilon \rangle$
then $U_2 = \langle x_3, x_1, x_4, x_7, \varepsilon \rangle$.
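To make this extraction concrete, here is a minimal Python sketch that builds each $U_j$ from a head-indexed dependency parse; the function name extract_units, the <eps> placeholder, and the head-index input format are illustrative assumptions, not from the paper:

```python
# A minimal sketch of dependency-unit extraction, assuming a parse is given
# as a list of head indices (1-based; 0 = root), e.g. from a dependency parser.

EPS = "<eps>"  # placeholder for an empty parent/sibling/child slot

def extract_units(words, heads):
    """For each word x_j, build U_j = <PA_xj, SI_xj, CH_xj>:
    its parent, sibling, and child words in the dependency tree."""
    n = len(words)
    units = []
    for j in range(1, n + 1):
        h = heads[j - 1]
        parent = [words[h - 1]] if h > 0 else [EPS]
        siblings = [words[k - 1] for k in range(1, n + 1)
                    if k != j and heads[k - 1] == h]
        children = [words[k - 1] for k in range(1, n + 1)
                    if heads[k - 1] == j]
        units.append(parent + (siblings or [EPS]) + (children or [EPS]))
    return units

# Toy example: x2's head is x3; x1, x4, x7 share that head; x2 has no
# children, giving U_2 = <x3, x1, x4, x7, eps> as on the slide.
words = ["x1", "x2", "x3", "x4", "x5", "x6", "x7"]
heads = [3, 3, 0, 3, 4, 5, 3]
print(extract_units(words, heads)[1])  # ['x3', 'x1', 'x4', 'x7', '<eps>']
```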
Source Dependency Representation (SDR)
[Figure: CNN computing $V_{U_2}$: input layer of padded unit embeddings (x3, x1, x4, x7, ε) → convolution layer 1 (3×d kernel) → max-pooling layer 1 → convolution layer 2 (3×d kernel) → max-pooling layer 2 → output layer $V_{U_2}$, with the size schedule 10×d → 8×d → 4×d → 2×d → 1×d]
• Learn the semantic representation of each dependency unit. Take $x_2$ as an example:
$U_2 = \langle x_3, x_1, x_4, x_7, \varepsilon \rangle$
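A minimal numpy sketch of such a two-layer CNN follows. The per-dimension (depthwise) kernels, tanh nonlinearity, and pairwise max-pooling are assumptions chosen to reproduce the 10×d → 8×d → 4×d → 2×d → 1×d size schedule from the figure, not the paper's exact parameterization:

```python
import numpy as np

d = 8                           # embedding size (260 in the paper's setup)
rng = np.random.default_rng(0)

def conv(x, W):                 # x: (n, d); W: (3, d) per-dimension kernel
    n = x.shape[0] - 2
    return np.tanh(np.stack([(x[i:i+3] * W).sum(axis=0) for i in range(n)]))

def max_pool(x):                # halve the rows by pairwise max
    return np.maximum(x[0::2], x[1::2])

def unit_representation(unit_embs, W1, W2):
    h = max_pool(conv(unit_embs, W1))   # 10xd -> 8xd -> 4xd
    h = max_pool(conv(h, W2))           # 4xd -> 2xd -> 1xd
    return h[0]                         # V_U, a d-dim vector

U = rng.normal(size=(10, d))    # padded embeddings of <PA, SI, CH>
W1 = rng.normal(size=(3, d))
W2 = rng.normal(size=(3, d))
print(unit_representation(U, W1, W2).shape)  # (8,)
```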
Neural Machine Translation with SDR
SDRNMT-1:
[Figure: SDRNMT-1. CNNs map the dependency tuples U_1..U_J (e.g., U_2 = <x3, x1, x4, x7, ε>) to representations $V_{U_j}$; each is concatenated with the word embedding $V_{x_j}$ and fed to the encoder states h_1..h_J, which the decoder attends over with weights α_{i,1}..α_{i,J} to form the context c_i for the states s_i and outputs y_i]
$h_j = f_{enc}([V_{x_j} : V_{U_j}], h_{j-1})$
where $V_{x_j}$ is 360-dimensional and the learned $V_{U_j}$ is 260-dimensional.
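The encoder step can be sketched as below; a plain tanh RNN cell stands in for the actual recurrent unit, and the weight shapes are illustrative assumptions (note that 360 + 260 = 620 matches the baseline's input embedding size reported later):

```python
import numpy as np

# A minimal sketch of the SDRNMT-1 encoder step: the word embedding V_xj
# (360-dim) is concatenated with the CNN unit representation V_Uj (260-dim)
# and fed to the recurrent encoder, h_j = f_enc([V_xj : V_Uj], h_{j-1}).

dx, du, dh = 360, 260, 512
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(dh, dx + du))
R = rng.normal(scale=0.01, size=(dh, dh))

def encoder_step(v_x, v_u, h_prev):
    x = np.concatenate([v_x, v_u])          # [V_xj : V_Uj], 620-dim
    return np.tanh(W @ x + R @ h_prev)      # h_j

h = np.zeros(dh)
for v_x, v_u in zip(rng.normal(size=(7, dx)), rng.normal(size=(7, du))):
    h = encoder_step(v_x, v_u, h)           # annotations h_1 .. h_J
```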
Neural Machine Translation with SDR
SDRNMT-2:
[Figure: SDRNMT-2. The word embeddings $V_{x_1}..V_{x_J}$ and the CNN unit representations $V_{U_1}..V_{U_J}$ feed two parallel encoder chains: word annotations h_1..h_J and dependency annotations d_1..d_J, attended jointly with weights α_{i,1}..α_{i,J} by a decoder with paired states $s^s_i$, $s^d_i$, contexts $c^s_i$, $c^d_i$, and outputs y_i]
Neural Machine Translation with SDRSDRNMT-2:
1
1
( , ),
( , )j
j
j enc x j
j enc U j
h f V h
d f V d-
-
=
=x1 x2 x3 x4 x5 x6 x7
Src dep
Src
root
U2=<x3, x1, x4, x7 , ε>Dep Tuples U1
Enc
oder
Vx1 VU1
h1
Vx2 VU2
h2
VxJ VUJ
hJ
……
…
UJ
CNNCNN CNN CNN
d1 d2 dJ…
…
Encoder:
12
Attention:
$e^s_{i,j} = f^s(s^s_{i-1} + h_j), \qquad e^d_{i,j} = f^d(s^d_{i-1} + d_j)$
$\alpha_{i,j} = \dfrac{\exp\big(\lambda e^s_{i,j} + (1-\lambda)\, e^d_{i,j}\big)}{\sum_{j'=1}^{J} \exp\big(\lambda e^s_{i,j'} + (1-\lambda)\, e^d_{i,j'}\big)}$
Decoder:
$c^s_i = \sum_{j=1}^{J} \alpha_{i,j}\, h_j, \qquad c^d_i = \sum_{j=1}^{J} \alpha_{i,j}\, d_j$
$s^s_i = \varphi(s^s_{i-1}, y_{i-1}, c^s_i), \qquad s^d_i = \varphi(s^d_{i-1}, y_{i-1}, c^d_i)$
Output:
$p(y_i \mid y_{<i}, x, T) = g(y_{i-1}, s^s_i, s^d_i, c^s_i, c^d_i)$
Double-Context NMT: each target word is predicted from both the word context $c^s_i$ and the dependency context $c^d_i$.
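Putting the SDRNMT-2 attention together, the following numpy sketch mixes the two energy terms with a single weight λ and reuses the shared weights $\alpha_{i,j}$ for both contexts; reducing $f^s$ and $f^d$ to dot products is an illustrative assumption:

```python
import numpy as np

# A minimal sketch of the double-context attention in SDRNMT-2: energies over
# the word annotations h_j and dependency annotations d_j are mixed with a
# single weight lambda, and the shared weights alpha_{i,j} yield two context
# vectors c_s and c_d.

rng = np.random.default_rng(0)
J, dh = 7, 16
H = rng.normal(size=(J, dh))    # word annotations h_1..h_J
D = rng.normal(size=(J, dh))    # dependency annotations d_1..d_J
s_s = rng.normal(size=dh)       # previous word-decoder state s^s_{i-1}
s_d = rng.normal(size=dh)       # previous dependency-decoder state s^d_{i-1}
lam = 0.5

e_s = H @ s_s                   # e^s_{i,j}, with f^s reduced to a dot product
e_d = D @ s_d                   # e^d_{i,j}, with f^d reduced to a dot product
z = lam * e_s + (1 - lam) * e_d
alpha = np.exp(z - z.max())
alpha /= alpha.sum()            # shared attention weights alpha_{i,j}

c_s = alpha @ H                 # word context c^s_i
c_d = alpha @ D                 # dependency context c^d_i
```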
Experiments
• Chinese-to-English translation task, 1.42M-sentence LDC corpus
• Source sentences of the training data are parsed with the Stanford Parser (Chang et al., 2009)
• For SDRNMT-1 and SDRNMT-2, the dimension of $V_{x_j}$ is 360 and the dimension of $V_{U_j}$ is 260; the input embedding of the baseline is 620-dimensional
• Baselines: Phrase-Based Statistical Machine Translation (PBSMT) (Koehn et al., 2007), standard attentional NMT (AttNMT) (Bahdanau et al., 2014), and NMT with dependency labels (Sennrich and Haddow, 2016)
Experimental Results
System Dev(NIST02) NIST03 NIST04 NIST05 NIST06 NIST08 AVG
PBSMT 33.15 31.02 33.78 30.33 29.62 23.53 29.66
AttNMT 36.31 34.02 37.11 32.86 32.54 25.44 32.40
Sennrich-deponly 36.68 34.51 38.09 33.37 32.96 26.96 32.98
SDRNMT-1 36.88 34.98* 38.14 34.61** 33.58* 27.06 33.32
SDRNMT-2 37.34 35.91** 38.73* 34.18* 33.76** 27.64* 34.04
“*” indicates statistically significantly better than “Sennrich-deponly” at p < 0.05 and “**” at p < 0.01, by bootstrap resampling (Koehn, 2004)
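The significance test can be sketched as paired bootstrap resampling; summing per-sentence scores stands in for recomputing corpus BLEU on each resample (a simplifying assumption), and the function name is illustrative:

```python
import random

# A minimal sketch of paired bootstrap resampling (Koehn, 2004) for judging
# whether system A beats system B on a test set. In practice each resample
# would be scored with corpus BLEU rather than summed sentence scores.

def paired_bootstrap(scores_a, scores_b, samples=1000, seed=1):
    """Return the fraction of resamples in which A outscores B."""
    random.seed(seed)
    n, wins = len(scores_a), 0
    for _ in range(samples):
        idx = [random.randrange(n) for _ in range(n)]   # resample with replacement
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / samples

# If A wins in >= 95% of resamples, the improvement is significant at p < 0.05.
```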
Experimental Results
• Translation quality for different sentence lengths
[Figure: BLEU (approx. 25-35) versus source sentence length (10 to 80+) for PBSMT, AttNMT, Sennrich-deponly, SDRNMT-1, and SDRNMT-2]
Conclusion
• A source dependency unit captures source long-distance dependency constraints
• The proposed SDRNMT-1 and SDRNMT-2 combine NMT with a CNN, jointly trained to learn the SDR and the translation rather than trained separately
• A double-context approach further utilizes the source dependency representation