Transcript of Deeplearning2015 Manning Deep Learning 01

Page 1

NLP and Deep Learning 2: Compositional Deep Learning

Christopher Manning

Stanford University

@chrmanning

2015 Deep Learning Summer School, Montreal

Page 2

Compositionality

Page 3

Artificial Intelligence requires being able to understand bigger things from knowing about smaller parts

Page 4

We need more than word embeddings! What of larger semantic units?

How can we know when larger units are similar in meaning?

•  The snowboarder is leaping over the mogul

•  A person on a snowboard jumps into the air

People interpret the meaning of larger text units – entities, descriptive terms, facts, arguments, stories – by semantic composition of smaller elements

Page 5

Representing Phrases as Vectors

Can we extend the ideas of word vector spaces to phrases?

[Figure: a 2-D vector space (axes x1, x2) with words plotted as points – Monday (9, 2), Tuesday (9.5, 1.5), France (2, 2.5), Germany (1, 3) – alongside the phrases "the country of my birth" (1, 5) and "the place where I was born" (1.1, 4); similar items lie close together.]

Vectors for single words are useful as features but limited!

Page 6

How should we map phrases into a vector space?

[Figure: the phrase "the country of my birth" with a 2-D vector under each word (e.g. the = [0.4, 0.3], country = [2.3, 3.6], of = [4, 4.5], my = [7, 7], birth = [2.1, 3.3]); the word vectors are combined bottom-up into vectors for larger and larger phrases.]

Use the principle of compositionality!

The meaning (vector) of a sentence is determined by (1) the meanings of its words and (2) a method that combines them.

[Figure: the same 2-D vector space as on Page 5 (axes x1, x2), now with the phrases "the country of my birth" and "the place where I was born" mapped to nearby points, alongside Monday, Tuesday, France, and Germany.]
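To make (1) and (2) concrete, here is a deliberately naive sketch (all vector values hypothetical, matching the toy 2-d figures above) in which the combining method is just averaging — the weakest possible composition function, and the baseline the models below try to improve on:

```python
import numpy as np

# Hypothetical 2-d word vectors, loosely following the figures above.
emb = {
    "the": np.array([0.4, 0.3]),
    "country": np.array([2.3, 3.6]),
    "of": np.array([4.0, 4.5]),
    "my": np.array([7.0, 7.0]),
    "birth": np.array([2.1, 3.3]),
}

def compose_by_average(words):
    # The simplest "method that combines them": the phrase vector
    # is just the mean of its word vectors.
    return np.mean([emb[w] for w in words], axis=0)

print(compose_by_average("the country of my birth".split()))
```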

Page 7

Page 8

Page 9

Can we build meaning composition functions in deep learning systems?

Page 10

Conjecture

You can attempt to model language with a simple, uniform architecture:

•  A sequence model (RNN, LSTM, …)

•  A 1d convolutional neural network

However, maybe one can produce a better composition function for language by modeling an input-specific compositional tree

Page 11

A “generic” hierarchy on natural language doesn’t make sense?

A node has to represent the sentence fragment "cat sat on". That doesn't make much sense.

[Figure: a generic, fixed hierarchy imposed on "The cat sat on the mat.", with a 2-D feature vector under each word (feature representation for words).]

Page 12

What we want: An input-dependent tree structure

[Figure: a syntactic parse of "The cat sat on the mat." with constituents NP, VP, PP, and S, and a 2-D feature vector under each word. This node's job is to represent "on the mat."]

Page 13

Strong priors? Universals of language?

•  This is a controversial issue (read: Chomsky!), but there does seem to be a fairly common structure over all human languages

•  Much of this may be functionally motivated

•  To what extent should we use these priors in our ML models?

Page 14

Page 15

Page 16

Where does the tree structure come from?

1.  It can come from a conventional statistical NLP parser, such as the Stanford Parser's PCFG

2.  It can be built by a neural network component, such as a neural network dependency parser [advertisement]

3.  It can be learned and built as part of the training/operation of the TreeRNN system, by adding another matrix to score the goodness of the constituents built

Mainly, we've done 1 or 2.

Page 17

Transition-based dependency parsers

Decide next move from configuration

Feature templates: usually a combination of 1–3 elements from the configuration.

Sparse! Incomplete! Slow!!! (95% of time)

[Figure: an indicator feature vector such as 0 0 0 1 0 0 1 0 0 0 1 0 — binary, sparse, dim = 10^6–10^7]
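To fix ideas, here is a minimal, hypothetical sketch of the arc-standard transition system such parsers use (unlabeled arcs; `choose_action` stands in for the classifier that decides the next move from the configuration). In Chen & Manning (2014), that classifier is a small feed-forward network over dense embeddings of words, POS tags, and arc labels drawn from the configuration, replacing the sparse indicator features above.

```python
def parse(n_words, choose_action):
    """Arc-standard transition parsing (unlabeled sketch).

    Configuration = (stack, buffer, arcs). choose_action inspects the
    configuration and returns "SHIFT", "LEFT-ARC", or "RIGHT-ARC";
    for simplicity we assume it only proposes legal moves.
    """
    stack, buffer, arcs = [], list(range(n_words)), []
    while buffer or len(stack) > 1:
        action = choose_action(stack, buffer)
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":
            dep = stack.pop(-2)            # second-from-top becomes a dependent
            arcs.append((stack[-1], dep))  # (head, dependent)
        else:                              # RIGHT-ARC
            dep = stack.pop()              # top becomes a dependent
            arcs.append((stack[-1], dep))
    return arcs
```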

Page 18

Deep Learning Dependency Parser [Chen & Manning, EMNLP 2014] http://nlp.stanford.edu/software/nndep.shtml

Page 19

Deep Learning Dependency Parser [Chen & Manning, EMNLP 2014]

•  An accurate and fast neural-network-based dependency parser!

•  Parsing to Stanford Dependencies:

•  Unlabeled attachment score (UAS) = head

•  Labeled attachment score (LAS) = head and label

Parser                             UAS    LAS    sent / s
MaltParser                         89.8   87.2   469
MSTParser                          91.4   88.1   10
TurboParser                        92.3   89.6   8
Our Parser                         92.0   89.7   654
Google pulling out all the stops   94.3   92.4   –

Page 20

Five attempts at meaning composition

Page 21

Tree Recursive Neural Networks (Tree RNNs)

Computational unit: Neural Network layer(s), applied recursively

[Figure: a small tree over "on the mat." — 2-D word vectors (e.g. [9, 1], [4, 3], [3, 3]) are fed pairwise into a neural network layer, which outputs both a parent vector (e.g. [8, 3]) in the same space and a plausibility score (e.g. 1.3); the same unit is then applied recursively up the tree.]

(Goller & Küchler 1996, Costa et al. 2003, Socher et al. ICML 2011)

Page 22

Version 1: Simple concatenation Tree RNN

Earlier TreeRNN work includes Goller & Küchler (1996), with a fixed tree structure; Costa et al. (2003), using an RNN for PP attachment, but on one-hot vectors; and Bottou (2011), for compositionality with recursion.

[Figure: children c1 and c2 are concatenated and passed through one neural network layer, producing the parent vector p and a plausibility score s.]

p = tanh(W [c1; c2] + b), where tanh is applied element-wise

score = W_score · p
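A minimal numpy sketch of this single-matrix composition unit (dimension and initialization hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                        # toy vector dimension
W = rng.normal(scale=0.1, size=(d, 2 * d))   # the single composition matrix
b = np.zeros(d)
W_score = rng.normal(scale=0.1, size=d)

def compose(c1, c2):
    # p = tanh(W [c1; c2] + b); the parent lives in the same space as
    # its children, so the unit can be applied recursively up the tree.
    p = np.tanh(W @ np.concatenate([c1, c2]) + b)
    score = W_score @ p      # scalar "goodness" of this constituent
    return p, score

p, s = compose(rng.normal(size=d), rng.normal(size=d))
```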

Page 23

Semantic similarity: nearest neighbors

All the figures are adjusted for seasonal variations
1. All the numbers are adjusted for seasonal fluctuations
2. All the figures are adjusted to remove usual seasonal patterns

Knight-Ridder would n't comment on the offer
1. Harsco declined to say what country placed the order
2. Coastal would n't disclose the terms

Sales grew almost 7% to $UNK m. from $UNK m.
1. Sales rose more than 7% to $94.9 m. from $88.3 m.
2. Sales surged 40% to UNK b. yen from UNK b.

Page 24

Version 1 Limitations

Composition function is a single weight matrix!

No real interaction between the input words!

Not adequate as a composition function for human language

Page 25

Version 2: PCFG + Syntactically-Untied RNN

•  A symbolic Context-Free Grammar (CFG) backbone is adequate for basic syntactic structure

•  We use the discrete syntactic categories of the children to choose the composition matrix

•  An RNN can do better with a different composition matrix for different syntactic environments

•  The result gives us better semantics (see the sketch below)
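In code, "syntactic untying" just means indexing the composition matrix by the children's categories. A hedged sketch (category pairs, dimensions, and the near-averaging initialization are assumptions based on the "Initialization" note on the slides that follow):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W = {}   # one composition matrix per (left-category, right-category) pair

def get_W(left_cat, right_cat):
    key = (left_cat, right_cat)
    if key not in W:
        # Start near an averaging transform of the two children, plus noise;
        # training then learns which child to weight more (a soft head word).
        W[key] = 0.5 * np.hstack([np.eye(d), np.eye(d)]) \
                 + rng.normal(scale=0.01, size=(d, 2 * d))
    return W[key]

def compose(c1, cat1, c2, cat2):
    return np.tanh(get_W(cat1, cat2) @ np.concatenate([c1, c2]))

p = compose(rng.normal(size=d), "DT", rng.normal(size=d), "NP")
```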

Page 26

Experiments

Standard WSJ split, labeled F1

Parser                                                   Test, All Sentences
Stanford PCFG (Klein and Manning, 2003a)                 85.5
Stanford Factored (Klein and Manning, 2003b)             86.6
Factored PCFGs (Hall and Klein, 2012)                    89.4
Collins (Collins, 1997)                                  87.7
SSN (Henderson, 2004)                                    89.4
Berkeley Parser (Petrov and Klein, 2007)                 90.1
CVG (RNN) (Socher et al., ACL 2013)                      85.0
CVG (SU-RNN) (Socher et al., ACL 2013)                   90.4
Charniak - Self Trained (McClosky et al. 2006)           91.0
Charniak - Self Trained-ReRanked (McClosky et al. 2006)  92.1

Page 27

SU-RNN / CVG [Socher, Bauer, Manning, Ng 2013]

Learns a soft notion of head words.

[Figure: learned composition matrices vs. their initialization for the category pairs NP-CC, PP-NP, and PRP$-NP.]

Page 28

SU-RNN / CVG [Socher, Bauer, Manning, Ng 2013]

[Figure: learned composition matrices for the category pairs ADJP-NP, ADVP-ADJP, JJ-NP, and DT-NP.]

Page 29

Version 3: Matrix-vector RNNs [Socher, Huval, Bhat, Manning, & Ng, 2012]

Each word (and each phrase) is represented by both a vector and a matrix.

[Figure: a tree in which every node carries a (vector, matrix) pair; the parent p is computed from the children's pairs.]

Page 30

Version 3: Matrix-vector RNNs [Socher, Huval, Bhat, Manning, & Ng, 2012]

p = g(W [Ba; Ab])  — each child's matrix transforms the other child's vector before the usual composition

P = W_M [A; B]  — the parent's matrix is a linear function of the children's matrices A and B
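A sketch of one MV-RNN composition step (shapes hypothetical; word matrices initialized near the identity, as in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W   = rng.normal(scale=0.1, size=(d, 2 * d))   # combines the two mediated vectors
W_M = rng.normal(scale=0.1, size=(d, 2 * d))   # combines the two child matrices

def mv_compose(a, A, b, B):
    # Vector: each child's matrix transforms the *other* child's vector.
    p = np.tanh(W @ np.concatenate([B @ a, A @ b]))
    # Matrix: the parent's matrix is a linear map of the stacked child matrices.
    P = W_M @ np.vstack([A, B])                # (d x 2d) @ (2d x d) -> (d x d)
    return p, P

a, b = rng.normal(size=d), rng.normal(size=d)
A = np.eye(d) + rng.normal(scale=0.01, size=(d, d))
B = np.eye(d) + rng.normal(scale=0.01, size=(d, d))
p, P = mv_compose(a, A, b, B)
```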

Page 31

Classification of Semantic Relationships

•  Can an MV-RNN learn how a large syntactic context conveys a semantic relationship?

•  My [apartment]e1 has a pretty large [kitchen]e2  →  component-whole relationship (e2, e1)

•  Build a single compositional semantics for the minimal constituent including both terms

Page 32

Classification of Semantic Relationships

Classifier   Features                                                       F1
SVM          POS, stemming, syntactic patterns                              60.1
MaxEnt       POS, WordNet, morphological features, noun compound
             system, thesauri, Google n-grams                               77.6
SVM          POS, WordNet, prefixes, morphological features, dependency
             parse features, Levin classes, PropBank, FrameNet,
             NomLex-Plus, Google n-grams, paraphrases, TextRunner           82.2
RNN          –                                                              74.8
MV-RNN       –                                                              79.1
MV-RNN       POS, WordNet, NER                                              82.4

Page 33

Version 4: Recursive Neural Tensor Network

•  Fewer parameters than the MV-RNN

•  Allows the two word or phrase vectors to interact multiplicatively

Page 34

Beyond the bag of words: Sentiment detection

Is the tone of a piece of text positive, negative, or neutral?

•  The received wisdom is that sentiment is "easy"

•  Detection accuracy for longer documents is ~90%, BUT …

…  …  loved  …  …  …  …  …  great  …  …  …  …  …  …  impressed  …  …  …  …  …  …  marvelous  …  …  …  …  

Page 35

Stanford Sentiment Treebank

•  215,154 phrases labeled in 11,855 sentences

•  Can actually train and test compositions

http://nlp.stanford.edu:8080/sentiment/

Page 36

Better Dataset Helped All Models

•  Hard negation cases are still mostly incorrect

•  We also need a more powerful model!

[Figure: bar chart (accuracy roughly 75–84) comparing Bi NB, RNN, and MV-RNN when training with sentence labels only vs. training with the treebank; all models improve with treebank training.]

Page 37

Version 4: Recursive Neural Tensor Network

Idea: Allow both additive and mediated multiplicative interactions of vectors
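Concretely, the RNTN replaces the per-word matrices of the MV-RNN with one shared tensor V (hence fewer parameters): each output dimension k gets its own bilinear form over the concatenated children. A minimal sketch (shapes hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W = rng.normal(scale=0.1, size=(d, 2 * d))          # additive part, as in Version 1
V = rng.normal(scale=0.01, size=(d, 2 * d, 2 * d))  # one 2d x 2d slice per output dim

def rntn_compose(c1, c2):
    # p_k = tanh( c^T V[k] c + (W c)_k ), with c = [c1; c2]
    c = np.concatenate([c1, c2])
    bilinear = np.einsum("i,kij,j->k", c, V, c)   # mediated multiplicative term
    return np.tanh(bilinear + W @ c)

p = rntn_compose(rng.normal(size=d), rng.normal(size=d))
```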

Page 38

Recursive Neural Tensor Network

Page 39

Recursive Neural Tensor Network

Page 40

Recursive Neural Tensor Network

•  Use the resulting vectors in the tree as input to a classifier like logistic regression (a sketch follows below)

•  Train all weights jointly with gradient descent
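A sketch of that final step (hypothetical shapes; 5 classes as in the fine-grained treebank task). The same softmax layer is applied to the vector at every tree node, so each phrase — not just the sentence root — gets its own sentiment prediction:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_classes = 4, 5
W_s = rng.normal(scale=0.1, size=(n_classes, d))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_sentiment(node_vec):
    # Applied at every node of the tree, not just the root.
    return softmax(W_s @ node_vec)
```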

Page 41

Positive/Negative Results on Treebank

[Figure: bar chart (accuracy roughly 74–86) for Bi NB, RNN, MV-RNN, and RNTN, training with sentence labels vs. training with the treebank; the RNTN with treebank training is best.]

Classifying Sentences: Accuracy improves to 85.4

Page 42

[Figure: a 2-D projection of the learned word vector space for the movie-review vocabulary — thousands of words (from "awful", "bad", and "boring" to "beautiful", "brilliant", and "wonderful") plotted as a text scatter over axes roughly spanning −0.1 to 0.08.]

Page 43

Experimental Results on Treebank

•  RNTN can capture constructions like "X but Y"

•  RNTN accuracy of 72%, compared to MV-RNN (65%), biword NB (58%) and RNN (54%)

Page 44

Negation Results

When negating negatives, positive activation should increase!

Demo: http://nlp.stanford.edu:8080/sentiment/

Page 45

A disappointment

Beaten by a Paragraph Vector – a word2vec extension with no sentence structure! [Le & Mikolov 2014]

[Figure: bar chart (accuracy roughly 78–89) comparing NB, RNTN, and ParaVec; the Paragraph Vector comes out on top.]

Page 46

Deep Recursive Neural Networks for Compositionality in Language (Irsoy & Cardie, NIPS 2014)

Two ideas:

•  Separate word and phrase embedding spaces

•  Stack NNs for depth at each node

Beats the paragraph vector!

Page 47

Version 5: Improving Deep Learning Semantic Representations using a TreeLSTM [Tai et al., ACL 2015]

Goals:

•  Still trying to represent the meaning of a sentence as a location in a (high-dimensional, continuous) vector space

•  In a way that accurately handles semantic composition and sentence meaning

•  Generalizing the widely used chain-structured LSTM to trees

•  Beat the Paragraph Vector!

Page 48

Long Short-Term Memory (LSTM) Units for Sequential Composition

Gates are vectors in [0, 1]^d, multiplied element-wise for soft masking
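A minimal numpy sketch of one sequential LSTM step, with the gates written out (parameter names hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    i = sigmoid(p["Wi"] @ x + p["Ui"] @ h_prev + p["bi"])  # input gate
    f = sigmoid(p["Wf"] @ x + p["Uf"] @ h_prev + p["bf"])  # forget gate
    o = sigmoid(p["Wo"] @ x + p["Uo"] @ h_prev + p["bo"])  # output gate
    u = np.tanh(p["Wu"] @ x + p["Uu"] @ h_prev + p["bu"])  # candidate update
    c = f * c_prev + i * u        # gates soft-mask the cell state element-wise
    h = o * np.tanh(c)
    return h, c
```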

Page 49

Tree-Structured Long Short-Term Memory Networks

Use Long Short-Term Memories (Hochreiter and Schmidhuber 1997)

Sentences have structure beyond word order – use this syntactic structure

Page 50

Tree-Structured Long Short-Term Memory Networks [Tai et al., ACL 2015]

Page 51

Tree-structured LSTM

Generalizes the sequential LSTM to trees with any branching factor
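A sketch of the Child-Sum variant from Tai et al. (parameter dict p as in the LSTM sketch above; input and hidden dimensions taken equal for simplicity). The key generalization: the hidden states of all children are summed, and each child gets its own forget gate, so a node can selectively preserve or discard each child's memory cell. With exactly one child per node this reduces to the sequential LSTM.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tree_lstm_node(x, children, p):
    """Child-Sum Tree-LSTM node. `children` is a list of (h_k, c_k) pairs."""
    h_sum = sum(h for h, _ in children) if children else np.zeros_like(x)
    i = sigmoid(p["Wi"] @ x + p["Ui"] @ h_sum + p["bi"])
    o = sigmoid(p["Wo"] @ x + p["Uo"] @ h_sum + p["bo"])
    u = np.tanh(p["Wu"] @ x + p["Uu"] @ h_sum + p["bu"])
    c = i * u
    for h_k, c_k in children:
        f_k = sigmoid(p["Wf"] @ x + p["Uf"] @ h_k + p["bf"])  # one gate per child
        c = c + f_k * c_k
    h = o * np.tanh(c)
    return h, c
```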

Page 52

Page 53

Results: Sentiment Analysis: Stanford Sentiment Treebank

Method                              Accuracy % (fine-grained, 5 classes)
RNTN (Socher et al. 2013)           45.7
Paragraph-Vec (Le & Mikolov 2014)   48.7
DRNN (Irsoy & Cardie 2014)          49.8
LSTM                                46.4
Tree LSTM (this work)               50.9

Page 54

Results: Semantic Relatedness: SICK 2014 (Sentences Involving Compositional Knowledge)

Method                                 Pearson correlation
Word vector average                    0.758
Meaning Factory (Bjerva et al. 2014)   0.827
ECNU (Zhao et al. 2014)                0.841
LSTM                                   0.853
Tree LSTM                              0.868

Page 55

Forget Gates: Selective State Preservation

•  Stripes = forget gate activations; more white ⇒ more preserved

Page 56

Tree structure helps

It 's actually pretty good in the first few minutes , but the longer the movie goes , the worse it gets .
    Gold: −    LSTM: −    TreeLSTM: −

The longer the movie goes , the worse it gets , but it was actually pretty good in the first few minutes .
    Gold: −    LSTM: +    TreeLSTM: −

Page 57

Natural Language Inference

Can we tell if one piece of text follows from another?

•  Two senators received contributions engineered by lobbyist Jack Abramoff in return for political favors.

•  Jack Abramoff attempted to bribe two legislators.

Natural Language Inference = Recognizing Textual Entailment [Dagan 2005, MacCartney & Manning 2009]

Page 58

Natural language inference: The 3-way classification task

James Byron Dean refused to move without blue jeans
    {entails, contradicts, neither}
James Dean didn't dance without pants

Page 59

The task: Natural language inference

Claim: Simple to define, but engages the full complexity of compositional semantics:

•  Lexical entailment

•  Quantification

•  Coreference

•  Lexical/scope ambiguity

•  Commonsense knowledge

•  Propositional attitudes

•  Modality

•  Factivity and implicativity ...

Page 60

Natural logic approach: relations (van Benthem 1988, MacCartney & Manning 2008)

Seven possible relations between phrases/sentences:

symbol   name                                     example
x ≡ y    equivalence                              couch ≡ sofa
x ⊏ y    forward entailment (strict)              crow ⊏ bird
x ⊐ y    reverse entailment (strict)              European ⊐ French
x ^ y    negation (exhaustive exclusion)          human ^ nonhuman
x | y    alternation (non-exhaustive exclusion)   cat | dog
x ‿ y    cover (exhaustive non-exclusion)         animal ‿ nonhuman
x # y    independence                             hungry # hippo

Page 61

Natural logic: relation joins

Can our NNs learn to make these inferences over pairs of embedding vectors?
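The slide's join table did not survive transcription, but a few representative joins give the flavor (my illustration, not the full 7×7 table): composing x R1 y with y R2 z yields x (R1 ⋈ R2) z.

```python
# A few representative relation joins (illustrative, not exhaustive):
JOINS = {
    ("≡", "⊏"): "⊏",  # couch ≡ sofa,  sofa ⊏ furniture    =>  couch ⊏ furniture
    ("⊏", "⊏"): "⊏",  # crow ⊏ bird,   bird ⊏ animal       =>  crow ⊏ animal
    ("⊏", "^"): "|",  # crow ⊏ bird,   bird ^ non-bird     =>  crow | non-bird
    ("^", "^"): "≡",  # human ^ nonhuman, nonhuman ^ human  =>  human ≡ human
}

def join(r1, r2):
    # Unlisted pairs default to independence (#) purely for this sketch;
    # the real algebra returns a set of possible relations.
    return JOINS.get((r1, r2), "#")
```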

Page 62

MacCartney's natural logic

An implementable logic for natural language inference without logical forms (MacCartney and Manning '09)

●  Sound logical interpretation (Icard and Moss '13)

Page 63

A neural network for NLI [Bowman 2014]

●  Words are learned embedding vectors.

●  One TreeRNN or TreeRNTN per sentence

●  A softmax emits the label

●  Learn everything with SGD. (A sketch follows.)
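A skeletal version of this architecture (dimensions hypothetical; for brevity the comparison layer of Bowman's model is collapsed into the softmax, and trees are encoded as nested pairs of word vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W_comp = rng.normal(scale=0.1, size=(d, 2 * d))   # shared TreeRNN composition
W_cls  = rng.normal(scale=0.1, size=(3, 2 * d))   # {entails, contradicts, neither}

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def encode(tree):
    # A tree is either a word vector (leaf) or a (left, right) pair.
    if isinstance(tree, np.ndarray):
        return tree
    left, right = tree
    return np.tanh(W_comp @ np.concatenate([encode(left), encode(right)]))

def classify(premise, hypothesis):
    # e.g. classify(((w1, w2), w3), (w4, w5)) with wi = word vectors
    pair = np.concatenate([encode(premise), encode(hypothesis)])
    return softmax(W_cls @ pair)   # distribution over the 3 NLI labels
```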

Page 64

Natural language inference data [Bowman, Manning & Potts, to appear EMNLP 2015]

•  To do NLI on real English, we need to teach an NN model English almost from scratch

•  What data do we have to work with?

•  Word embeddings: GloVe/word2vec (useful with any data source)

•  SICK: Thousands of examples created by editing and pairing hundreds of sentences

•  RTE: Hundreds of examples created by hand

•  DenotationGraph: Millions of extremely noisy examples (~73% correct?) constructed fully automatically

Page 65

Results on SICK (+DG, +tricks)

Model              SICK Train   DG Train   Test
Most freq. class   56.7%        50.0%      56.7%
30 dim TreeRNN     95.4%        67.0%      74.9%
50 dim TreeRNTN    97.8%        74.0%      76.9%

Page 66

Are we competitive on SICK? Sort of...

Best result (U. Illinois):       84.5%  ≈ inter-annotator agreement!
Median submission (out of 18):   77%
Our TreeRNTN:                    76.9%

We're a purely-learned system; none of the systems in the competition were.

Page 67

Natural language inference data [Bowman, Manning & Potts, to appear EMNLP 2015]

•  To do NLI on real English, we need to teach an NN model English almost from scratch

•  What data do we have to work with?

•  GloVe/word2vec (useful with any data source)

•  SICK: Thousands of examples created by editing and pairing hundreds of sentences

•  RTE: Hundreds of examples created by hand

•  DenotationGraph: Millions of extremely noisy examples (~73% correct?) constructed fully automatically

•  Stanford NLI corpus: ~600k examples, written by Turkers

Page 68

The Stanford NLI corpus

Page 69

Initial SNLI Results

Model                                                      Accuracy
100d sum of words                                          75.3
100d TreeRNN                                               72.2
100d LSTM TreeRNN                                          77.6
Lexicalized linear classifier with cross-bigram features   78.2

Page 70

Envoi

We want more than word meanings!

We want:

•  Meanings of larger units, calculated compositionally

•  The ability to do natural language inference