Neural Machine Translation with Source Dependency Representation
Kehai Chen¹, Rui Wang², Masao Utiyama², Lemao Liu³, Akihiro Tamura⁴, Eiichiro Sumita² and Tiejun Zhao¹
¹Harbin Institute of Technology, China; ²National Institute of Information and Communications Technology, Japan; ³Tencent AI Lab, China; ⁴Ehime University, Japan
Overview
Standard NMT model
[Figure: standard encoder-decoder NMT translating the source words x1..x7 (Src) into the target words y1..y8 (Trg)]
• Traditional NMT Model
• Our proposed NMT model
Our proposed NMT model
[Figure: the proposed NMT model adds a source dependency tree (Src dep, with root) over the source words x1..x7 alongside the source sequence, translated into y1..y8]
Inspired by the use of syntax knowledge in SMT, we explicitly integrate source dependency information into NMT.
Related Work
• NMT with source syntax information
- Tree2seq (Eriguchi et al., 2016; Li et al., 2017; among others): a tree-based neural network is used to encode source phrase structures
- Extending source inputs with syntax labels (Sennrich et al., 2016; Chen et al., 2017; among others): dependency labels are concatenated to the source words
• Our work
- A compromise between these two lines of work
- A novel double-context approach to utilizing source dependency constraints
Source Dependency Representation (SDR)
• Extracting a dependency unit for each source word to capture source long-distance dependency constraints:
$U_j = \langle PA_{x_j}, SI_{x_j}, CH_{x_j} \rangle$
[Figure: source dependency tree (root) over the source words x1..x7]
Here $PA_{x_j}$, $SI_{x_j}$, and $CH_{x_j}$ denote the parent, sibling, and child words of the source word $x_j$ in the dependency structure.
Take $x_2$ as an example:
$PA_{x_2} = \langle x_3 \rangle, \quad SI_{x_2} = \langle x_1, x_4, x_7 \rangle, \quad CH_{x_2} = \langle \varepsilon \rangle$
then $U_2 = \langle x_3, x_1, x_4, x_7, \varepsilon \rangle$.
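To make this extraction concrete, here is a minimal Python sketch that builds each $U_j$ from a head-indexed dependency parse; the function name extract_units, the <eps> placeholder, and the head-index input format are illustrative assumptions, not from the paper:

```python
# A minimal sketch of dependency-unit extraction, assuming a parse is given
# as a list of head indices (1-based; 0 = root), e.g. from a dependency parser.

EPS = "<eps>"  # placeholder for an empty parent/sibling/child slot

def extract_units(words, heads):
    """For each word x_j, build U_j = <PA_xj, SI_xj, CH_xj>:
    its parent, sibling, and child words in the dependency tree."""
    n = len(words)
    units = []
    for j in range(1, n + 1):
        h = heads[j - 1]
        parent = [words[h - 1]] if h > 0 else [EPS]
        siblings = [words[k - 1] for k in range(1, n + 1)
                    if k != j and heads[k - 1] == h]
        children = [words[k - 1] for k in range(1, n + 1)
                    if heads[k - 1] == j]
        units.append(parent + (siblings or [EPS]) + (children or [EPS]))
    return units

# Toy example: x2's head is x3; x1, x4, x7 share that head; x2 has no
# children, giving U_2 = <x3, x1, x4, x7, eps> as on the slide.
words = ["x1", "x2", "x3", "x4", "x5", "x6", "x7"]
heads = [3, 3, 0, 3, 4, 5, 3]
print(extract_units(words, heads)[1])  # ['x3', 'x1', 'x4', 'x7', '<eps>']
```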
Source Dependency Representation (SDR)
[Figure: CNN computing $V_{U_2}$: input layer of padded unit embeddings (x3, x1, x4, x7, ε) → convolution layer 1 (3×d kernel) → max-pooling layer 1 → convolution layer 2 (3×d kernel) → max-pooling layer 2 → output layer $V_{U_2}$, with the size schedule 10×d → 8×d → 4×d → 2×d → 1×d]
• Learn the semantic representation of each dependency unit. Take $x_2$ as an example:
$U_2 = \langle x_3, x_1, x_4, x_7, \varepsilon \rangle$
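A minimal numpy sketch of such a two-layer CNN follows. The per-dimension (depthwise) kernels, tanh nonlinearity, and pairwise max-pooling are assumptions chosen to reproduce the 10×d → 8×d → 4×d → 2×d → 1×d size schedule from the figure, not the paper's exact parameterization:

```python
import numpy as np

d = 8                           # embedding size (260 in the paper's setup)
rng = np.random.default_rng(0)

def conv(x, W):                 # x: (n, d); W: (3, d) per-dimension kernel
    n = x.shape[0] - 2
    return np.tanh(np.stack([(x[i:i+3] * W).sum(axis=0) for i in range(n)]))

def max_pool(x):                # halve the rows by pairwise max
    return np.maximum(x[0::2], x[1::2])

def unit_representation(unit_embs, W1, W2):
    h = max_pool(conv(unit_embs, W1))   # 10xd -> 8xd -> 4xd
    h = max_pool(conv(h, W2))           # 4xd -> 2xd -> 1xd
    return h[0]                         # V_U, a d-dim vector

U = rng.normal(size=(10, d))    # padded embeddings of <PA, SI, CH>
W1 = rng.normal(size=(3, d))
W2 = rng.normal(size=(3, d))
print(unit_representation(U, W1, W2).shape)  # (8,)
```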
Neural Machine Translation with SDR
SDRNMT-1:
[Figure: SDRNMT-1. CNNs map the dependency tuples U_1..U_J (e.g., U_2 = <x3, x1, x4, x7, ε>) to representations $V_{U_j}$; each is concatenated with the word embedding $V_{x_j}$ and fed to the encoder states h_1..h_J, which the decoder attends over with weights α_{i,1}..α_{i,J} to form the context c_i for the states s_i and outputs y_i]
$h_j = f_{enc}([V_{x_j} : V_{U_j}], h_{j-1})$
where $V_{x_j}$ is 360-dimensional and the learned $V_{U_j}$ is 260-dimensional.
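The encoder step can be sketched as below; a plain tanh RNN cell stands in for the actual recurrent unit, and the weight shapes are illustrative assumptions (note that 360 + 260 = 620 matches the baseline's input embedding size reported later):

```python
import numpy as np

# A minimal sketch of the SDRNMT-1 encoder step: the word embedding V_xj
# (360-dim) is concatenated with the CNN unit representation V_Uj (260-dim)
# and fed to the recurrent encoder, h_j = f_enc([V_xj : V_Uj], h_{j-1}).

dx, du, dh = 360, 260, 512
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(dh, dx + du))
R = rng.normal(scale=0.01, size=(dh, dh))

def encoder_step(v_x, v_u, h_prev):
    x = np.concatenate([v_x, v_u])          # [V_xj : V_Uj], 620-dim
    return np.tanh(W @ x + R @ h_prev)      # h_j

h = np.zeros(dh)
for v_x, v_u in zip(rng.normal(size=(7, dx)), rng.normal(size=(7, du))):
    h = encoder_step(v_x, v_u, h)           # annotations h_1 .. h_J
```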
Neural Machine Translation with SDR
SDRNMT-2:
[Figure: SDRNMT-2. The word embeddings $V_{x_1}..V_{x_J}$ and the CNN unit representations $V_{U_1}..V_{U_J}$ feed two parallel encoder chains: word annotations h_1..h_J and dependency annotations d_1..d_J, attended jointly with weights α_{i,1}..α_{i,J} by a decoder with paired states $s^s_i$, $s^d_i$, contexts $c^s_i$, $c^d_i$, and outputs y_i]
Neural Machine Translation with SDRSDRNMT-2:
1
1
( , ),
( , )j
j
j enc x j
j enc U j
h f V h
d f V d-
-
=
=x1 x2 x3 x4 x5 x6 x7
Src dep
Src
root
U2=<x3, x1, x4, x7 , ε>Dep Tuples U1
Enc
oder
Vx1 VU1
h1
Vx2 VU2
h2
VxJ VUJ
hJ
……
…
UJ
CNNCNN CNN CNN
d1 d2 dJ…
…
Encoder:
12
Attention:
$e^s_{i,j} = f^s(s^s_{i-1} + h_j), \qquad e^d_{i,j} = f^d(s^d_{i-1} + d_j)$
$\alpha_{i,j} = \dfrac{\exp\big(\lambda e^s_{i,j} + (1-\lambda)\, e^d_{i,j}\big)}{\sum_{j'=1}^{J} \exp\big(\lambda e^s_{i,j'} + (1-\lambda)\, e^d_{i,j'}\big)}$
Decoder:
$c^s_i = \sum_{j=1}^{J} \alpha_{i,j}\, h_j, \qquad c^d_i = \sum_{j=1}^{J} \alpha_{i,j}\, d_j$
$s^s_i = \varphi(s^s_{i-1}, y_{i-1}, c^s_i), \qquad s^d_i = \varphi(s^d_{i-1}, y_{i-1}, c^d_i)$
Output:
$p(y_i \mid y_{<i}, x, T) = g(y_{i-1}, s^s_i, s^d_i, c^s_i, c^d_i)$
Double-Context NMT: each target word is predicted from both the word context $c^s_i$ and the dependency context $c^d_i$.
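Putting the SDRNMT-2 attention together, the following numpy sketch mixes the two energy terms with a single weight λ and reuses the shared weights $\alpha_{i,j}$ for both contexts; reducing $f^s$ and $f^d$ to dot products is an illustrative assumption:

```python
import numpy as np

# A minimal sketch of the double-context attention in SDRNMT-2: energies over
# the word annotations h_j and dependency annotations d_j are mixed with a
# single weight lambda, and the shared weights alpha_{i,j} yield two context
# vectors c_s and c_d.

rng = np.random.default_rng(0)
J, dh = 7, 16
H = rng.normal(size=(J, dh))    # word annotations h_1..h_J
D = rng.normal(size=(J, dh))    # dependency annotations d_1..d_J
s_s = rng.normal(size=dh)       # previous word-decoder state s^s_{i-1}
s_d = rng.normal(size=dh)       # previous dependency-decoder state s^d_{i-1}
lam = 0.5

e_s = H @ s_s                   # e^s_{i,j}, with f^s reduced to a dot product
e_d = D @ s_d                   # e^d_{i,j}, with f^d reduced to a dot product
z = lam * e_s + (1 - lam) * e_d
alpha = np.exp(z - z.max())
alpha /= alpha.sum()            # shared attention weights alpha_{i,j}

c_s = alpha @ H                 # word context c^s_i
c_d = alpha @ D                 # dependency context c^d_i
```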
Experiments
• Chinese-to-English translation task, 1.42M-sentence LDC corpus
• Source sentences of the training data are parsed with the Stanford Parser (Chang et al., 2009)
• For SDRNMT-1 and SDRNMT-2, the dimension of $V_{x_j}$ is 360 and the dimension of $V_{U_j}$ is 260; the input embedding of the baseline is 620-dimensional
• Baselines: Phrase-Based Statistical Machine Translation (PBSMT) (Koehn et al., 2007), standard attentional NMT (AttNMT) (Bahdanau et al., 2014), and NMT with dependency labels (Sennrich and Haddow, 2016)
Experimental Results
System Dev(NIST02) NIST03 NIST04 NIST05 NIST06 NIST08 AVG
PBSMT 33.15 31.02 33.78 30.33 29.62 23.53 29.66
AttNMT 36.31 34.02 37.11 32.86 32.54 25.44 32.40
Sennrich-deponly 36.68 34.51 38.09 33.37 32.96 26.96 32.98
SDRNMT-1 36.88 34.98* 38.14 34.61** 33.58* 27.06 33.32
SDRNMT-2 37.34 35.91** 38.73* 34.18* 33.76** 27.64* 34.04
“*” indicates statistically significantly better than “Sennrich-deponly” at p < 0.05 and “**” at p < 0.01, by bootstrap resampling (Koehn, 2004)
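The significance test can be sketched as paired bootstrap resampling; summing per-sentence scores stands in for recomputing corpus BLEU on each resample (a simplifying assumption), and the function name is illustrative:

```python
import random

# A minimal sketch of paired bootstrap resampling (Koehn, 2004) for judging
# whether system A beats system B on a test set. In practice each resample
# would be scored with corpus BLEU rather than summed sentence scores.

def paired_bootstrap(scores_a, scores_b, samples=1000, seed=1):
    """Return the fraction of resamples in which A outscores B."""
    random.seed(seed)
    n, wins = len(scores_a), 0
    for _ in range(samples):
        idx = [random.randrange(n) for _ in range(n)]   # resample with replacement
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / samples

# If A wins in >= 95% of resamples, the improvement is significant at p < 0.05.
```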
Experimental Results
• Translation quality for different sentence lengths
[Figure: BLEU (approx. 25-35) versus source sentence length (10 to 80+) for PBSMT, AttNMT, Sennrich-deponly, SDRNMT-1, and SDRNMT-2]
Conclusion
• A source dependency unit captures source long-distance dependency constraints
• The proposed SDRNMT-1 and SDRNMT-2 combine NMT with a CNN, jointly trained to learn the SDR and the translation rather than trained separately
• A double-context approach further utilizes the source dependency representation