NLP and Deep Learning 2: Compositional Deep Learning
Christopher Manning
Stanford University
@chrmanning
2015 Deep Learning Summer School, Montreal
Compositionality
Artificial Intelligence requires being able to understand bigger things from knowing about smaller parts
We need more than word embeddings! What of larger semantic units?
How can we know when larger units are similar in meaning?
• The snowboarder is leaping over the mogul
• A person on a snowboard jumps into the air
People interpret the meaning of larger text units – entities, descriptive terms, facts, arguments, stories – by semantic composition of smaller elements
Representing Phrases as Vectors
[Figure: a 2-D word vector space with points such as Monday (9, 2), Tuesday (9.5, 1.5), France (2, 2.5), Germany (1, 3), and the phrases "the country of my birth" (1, 5) and "the place where I was born" (1.1, 4) plotted close together.]
Can we extend the ideas of word vector spaces to phrases?
Vectors for single words are useful as features but limited!
How should we map phrases into a vector space?
[Figure: the phrase "the country of my birth", with a 2-d vector below each word.]
Use the principle of compositionality!
The meaning (vector) of a sentence is determined by (1) the meanings of its words and (2) a method that combines them.
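As a minimal illustration of "a method that combines them", the simplest composition function just averages the word vectors. The 2-d vectors below are made-up illustrative values, not numbers from the slides:

```python
import numpy as np

# Hypothetical 2-d word vectors (illustrative values only).
vec = {
    "the": np.array([0.4, 0.3]),
    "country": np.array([2.3, 3.6]),
    "of": np.array([0.5, 0.1]),
    "my": np.array([0.2, 0.4]),
    "birth": np.array([2.1, 3.3]),
}

def compose_average(words):
    """Simplest composition method: average the word vectors."""
    return np.mean([vec[w] for w in words], axis=0)

# One vector for the whole phrase, living in the same space as the words.
phrase = compose_average(["the", "country", "of", "my", "birth"])
```

Averaging ignores word order and syntax entirely; the models below are progressively better "methods that combine".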
[Figure: the same 2-D vector space, now with composed phrase vectors: "the country of my birth" and "the place where I was born" land near each other, alongside Monday, Tuesday, France, and Germany.]
Can we build meaning composition functions in deep learning systems?
Conjecture
You can attempt to model language with a simple, uniform architecture:
• A sequence model (RNN, LSTM, …)
• A 1-d convolutional neural network
However, maybe one can produce a better composition function for language by modeling an input-specific compositional tree
A "generic" hierarchy on natural language doesn't make sense: a node would have to represent a sentence fragment like "cat sat on," which doesn't make much sense.
[Figure: the words "The cat sat on the mat." each paired with a 2-d feature vector, combined under a generic balanced tree.]
Feature representation for words
What we want: An input-dependent tree structure
[Figure: a syntactic parse tree over "The cat sat on the mat." (S → NP VP; VP → V PP; PP → P NP), with a feature vector at each word. The PP node's job is to represent "on the mat."]
Strong priors? Universals of language?
• This is a controversial issue (read: Chomsky!), but there does seem to be a fairly common structure over all human languages
• Much of this may be functionally motivated
• To what extent should we use these priors in our ML models?
Where does the tree structure come from?
1. It can come from a conventional statistical NLP parser, such as the Stanford Parser's PCFG
2. It can be built by a neural network component, such as a neural network dependency parser [advertisement]
3. It can be learned and built as part of the training/operation of the TreeRNN system, by adding another matrix to score the goodness of constituents built
Mainly, we've done 1 or 2.
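Option 3 can be sketched as greedy constituent building: repeatedly merge whichever adjacent pair of vectors the scoring weights like best. The numpy sketch below uses random, untrained weights purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Illustrative random parameters: a composition matrix and a scoring vector.
W = rng.standard_normal((d, 2 * d)) * 0.1
b = np.zeros(d)
w_score = rng.standard_normal(d)

def compose(c1, c2):
    """Compose two child vectors into a parent vector plus a goodness score."""
    p = np.tanh(W @ np.concatenate([c1, c2]) + b)
    return p, w_score @ p

def greedy_tree(leaves):
    """Option 3: repeatedly merge the adjacent pair with the highest score."""
    nodes = list(leaves)
    while len(nodes) > 1:
        scores = [compose(nodes[i], nodes[i + 1])[1] for i in range(len(nodes) - 1)]
        i = int(np.argmax(scores))
        parent, _ = compose(nodes[i], nodes[i + 1])
        nodes[i:i + 2] = [parent]   # replace the pair with its parent
    return nodes[0]

root = greedy_tree([rng.standard_normal(d) for _ in range(5)])
```

In training, the scoring matrix is learned so that good (gold-tree) merges outscore bad ones.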
Transition-based dependency parsers
Decide the next move from the configuration
Feature templates: usually a combination of 1–3 elements from the configuration.
Sparse! Incomplete! Slow!!! (95% of time)
Indicator features: binary, sparse, dim = 10^6 – 10^7
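For concreteness, a transition-based (arc-standard style) configuration and its three moves can be sketched as below. The toy sentence and the hand-chosen oracle move sequence are hypothetical; a real parser chooses each move with a classifier over features of the configuration:

```python
# A minimal arc-standard transition system: the parser state is
# (stack, buffer, arcs), and each step applies one of three moves.

def shift(stack, buf, arcs):
    """Move the next buffer word onto the stack."""
    stack.append(buf.pop(0))

def left_arc(stack, buf, arcs):
    """Second-from-top becomes a dependent of the top; record (head, dep)."""
    dep = stack.pop(-2)
    arcs.append((stack[-1], dep))

def right_arc(stack, buf, arcs):
    """Top becomes a dependent of the new top; record (head, dep)."""
    dep = stack.pop()
    arcs.append((stack[-1], dep))

# Parse "He saw her" with a hand-chosen (oracle) transition sequence.
stack, buf, arcs = [], ["He", "saw", "her"], []
for move in [shift, shift, left_arc, shift, right_arc]:
    move(stack, buf, arcs)
# Final state: stack holds only the root word "saw";
# arcs are ("saw", "He") and ("saw", "her").
```

The sparse indicator features above are templates over this configuration (e.g. the word on top of the stack); the neural parser below replaces them with dense embeddings of the same configuration elements.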
Deep Learning Dependency Parser [Chen & Manning, EMNLP 2014] http://nlp.stanford.edu/software/nndep.shtml
• An accurate and fast neural-network-based dependency parser!
• Parsing to Stanford Dependencies:
• Unlabeled attachment score (UAS) = head
• Labeled attachment score (LAS) = head and label

Parser        UAS   LAS   sent/s
MaltParser    89.8  87.2  469
MSTParser     91.4  88.1  10
TurboParser   92.3  89.6  8
Our Parser    92.0  89.7  654
Google (pulling out all the stops)  94.3  92.4
Five attempts at meaning composition
Tree Recursive Neural Networks (Tree RNNs)
[Figure: a TreeRNN over "on the mat.": child vectors are fed through a neural network layer to produce a parent vector and a score; the same network is applied recursively up the tree.]
Computational unit: Neural Network layer(s), applied recursively
(Goller & Küchler 1996, Costa et al. 2003, Socher et al. ICML, 2011)
Version 1: Simple concatenation Tree RNN
Earlier TreeRNN work includes Goller & Küchler (1996), with a fixed tree structure; Costa et al. (2003), using an RNN for PP attachment, but on one-hot vectors; and Bottou (2011) for compositionality with recursion.

p = tanh(W [c1; c2] + b), where tanh applies element-wise
score = W_score · p
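Applying this single composition function recursively over a given binary tree might look like the following numpy sketch. The weights are random illustrative values, and the binary tree over "the cat sat on the mat" is just an example parse:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
W = rng.standard_normal((d, 2 * d)) * 0.1   # the single shared composition matrix
b = np.zeros(d)

def tree_rnn(node, vec):
    """Leaves are words (strings); internal nodes are (left, right) pairs."""
    if isinstance(node, str):
        return vec[node]
    left = tree_rnn(node[0], vec)
    right = tree_rnn(node[1], vec)
    # Version 1 composition: p = tanh(W [c1; c2] + b)
    return np.tanh(W @ np.concatenate([left, right]) + b)

vec = {w: rng.standard_normal(d) for w in ["the", "cat", "sat", "on", "mat"]}
tree = (("the", "cat"), ("sat", ("on", ("the", "mat"))))   # example parse
root = tree_rnn(tree, vec)   # one vector for the whole sentence
```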
Semantic similarity: nearest neighbors
"All the figures are adjusted for seasonal variations"
1. All the numbers are adjusted for seasonal fluctuations
2. All the figures are adjusted to remove usual seasonal patterns
"Knight-Ridder would n't comment on the offer"
1. Harsco declined to say what country placed the order
2. Coastal would n't disclose the terms
"Sales grew almost 7% to $UNK m. from $UNK m."
1. Sales rose more than 7% to $94.9 m. from $88.3 m.
2. Sales surged 40% to UNK b. yen from UNK b.
Version 1 Limitations
• Composition function is a single weight matrix!
• No real interaction between the input words!
• Not adequate as a composition function for human language
Version 2: PCFG + Syntactically-Untied RNN
• A symbolic Context-Free Grammar (CFG) backbone is adequate for basic syntactic structure
• We use the discrete syntactic categories of the children to choose the composition matrix
• An RNN can do better with a different composition matrix for different syntactic environments
• The result gives us a better semantics
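Syntactic untying amounts to indexing the composition matrix by the children's categories. A toy sketch, where both the category pairs and the weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
d = 3

# One composition matrix per (left-category, right-category) pair.
# Pairs and values are illustrative, not the learned parameters.
W_by_pair = {
    ("DT", "NP"): rng.standard_normal((d, 2 * d)) * 0.1,
    ("VP", "PP"): rng.standard_normal((d, 2 * d)) * 0.1,
}
b = np.zeros(d)

def compose(cat1, c1, cat2, c2):
    """Choose the matrix by the children's discrete syntactic categories."""
    W = W_by_pair[(cat1, cat2)]
    return np.tanh(W @ np.concatenate([c1, c2]) + b)

p = compose("DT", rng.standard_normal(d), "NP", rng.standard_normal(d))
```

The PCFG backbone supplies the categories, so the neural part only has to model meaning within each syntactic environment.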
Experiments
Standard WSJ split, labeled F1
Parser                                                      Test F1, all sentences
Stanford PCFG (Klein and Manning, 2003a)                    85.5
Stanford Factored (Klein and Manning, 2003b)                86.6
Factored PCFGs (Hall and Klein, 2012)                       89.4
Collins (Collins, 1997)                                     87.7
SSN (Henderson, 2004)                                       89.4
Berkeley Parser (Petrov and Klein, 2007)                    90.1
CVG (RNN) (Socher et al., ACL 2013)                         85.0
CVG (SU-RNN) (Socher et al., ACL 2013)                      90.4
Charniak, Self-Trained (McClosky et al. 2006)               91.0
Charniak, Self-Trained & Reranked (McClosky et al. 2006)    92.1
SU-RNN / CVG [Socher, Bauer, Manning, Ng 2013]
Learns a soft notion of head words. Initialization:
[Figure: visualizations of the learned composition matrices for syntactic category pairs NP-CC, PP-NP, PRP$-NP, ADJP-NP, ADVP-ADJP, JJ-NP, DT-NP.]
Version 3: Matrix-vector RNNs [Socher, Huval, Bhat, Manning, & Ng, 2012]
Every word and phrase has both a vector and a matrix. For children (a, A) and (b, B):
p = g(W [B a ; A b])
P = W_M [A ; B]
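Under the usual MV-RNN formulation, each child contributes a (vector, matrix) pair, and each child's matrix first modifies the other child's vector. A numpy sketch with random illustrative weights:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 3
W = rng.standard_normal((d, 2 * d)) * 0.1     # combines the cross-modified vectors
W_M = rng.standard_normal((d, 2 * d)) * 0.1   # combines the two child matrices

def mv_compose(a, A, b_vec, B):
    """Children are (vector, matrix) pairs; matrices modify the sibling's vector."""
    p = np.tanh(W @ np.concatenate([B @ a, A @ b_vec]))   # parent vector
    P = W_M @ np.vstack([A, B])                           # parent matrix (d x d)
    return p, P

a, b_vec = rng.standard_normal(d), rng.standard_normal(d)
A, B = rng.standard_normal((d, d)), rng.standard_normal((d, d))
p, P = mv_compose(a, A, b_vec, B)
```

The price is a d×d matrix per vocabulary word, which is a lot of parameters; Version 4 addresses exactly this.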
Classification of Semantic Relationships
• Can an MV-RNN learn how a large syntactic context conveys a semantic relationship?
• My [apartment]e1 has a pretty large [kitchen]e2 → component-whole relationship (e2, e1)
• Build a single compositional semantics for the minimal constituent including both terms
Classification of Semantic Relationships
Classifier  Features                                                                   F1
SVM         POS, stemming, syntactic patterns                                          60.1
MaxEnt      POS, WordNet, morphological features, noun compound system, thesauri,
            Google n-grams                                                             77.6
SVM         POS, WordNet, prefixes, morphological features, dependency parse
            features, Levin classes, PropBank, FrameNet, NomLex-Plus,
            Google n-grams, paraphrases, TextRunner                                    82.2
RNN         –                                                                          74.8
MV-RNN      –                                                                          79.1
MV-RNN      POS, WordNet, NER                                                          82.4
Version 4: Recursive Neural Tensor Network
• Fewer parameters than the MV-RNN
• Allows the two word or phrase vectors to interact multiplicatively
Beyond the bag of words: Sentiment detection
Is the tone of a piece of text positive, negative, or neutral?
• A common claim is that sentiment is "easy"
• Detection accuracy for longer documents is ~90%, BUT…
[Example: a long document where scattered cue words – loved, great, impressed, marvelous – suffice for a bag-of-words classifier.]
Stanford Sentiment Treebank
• 215,154 phrases labeled in 11,855 sentences
• Can actually train and test compositions
http://nlp.stanford.edu:8080/sentiment/
Better Dataset Helped All Models
• Hard negation cases are still mostly incorrect
• We also need a more powerful model!
[Bar chart: accuracy (roughly 75–84%) of Bi NB, RNN, and MV-RNN, each trained with sentence labels vs. with the full treebank; treebank training helps all models.]
Version 4: Recursive Neural Tensor Network
Idea: Allow both additive and mediated multiplicative interactions of vectors
• Use resulting vectors in tree as input to a classifier like logistic regression
• Train all weights jointly with gradient descent
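The tensor layer gives each output dimension its own bilinear form over the concatenated children, added to the ordinary linear term. A numpy sketch with random illustrative weights:

```python
import numpy as np

rng = np.random.default_rng(5)
d = 3
V = rng.standard_normal((d, 2 * d, 2 * d)) * 0.01   # one tensor slice per output dim
W = rng.standard_normal((d, 2 * d)) * 0.1           # ordinary additive term
b = np.zeros(d)

def rntn_compose(c1, c2):
    """p_k = tanh(x^T V[k] x + (W x + b)_k), with x = [c1; c2]."""
    x = np.concatenate([c1, c2])
    # Multiplicative (bilinear) interaction between the two children:
    tensor_term = np.array([x @ V[k] @ x for k in range(d)])
    return np.tanh(tensor_term + W @ x + b)

p = rntn_compose(rng.standard_normal(d), rng.standard_normal(d))
```

Because the tensor is shared across the whole tree rather than stored per word (as the MV-RNN's matrices are), the model has fewer parameters while keeping multiplicative interactions.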
Positive/Negative Results on Treebank
[Bar chart: accuracy (roughly 74–86%) of Bi NB, RNN, MV-RNN, and RNTN, each trained with sentence labels vs. with the treebank.]
Classifying Sentences: Accuracy improves to 85.4
[Figure: a 2-D projection of the learned word vectors over the sentiment-model vocabulary, plotting several hundred items, from punctuation and "ability" through "young" and "your".]
Experimental Results on Treebank
• RNTN can capture constructions like X but Y
• RNTN accuracy of 72%, compared to MV-RNN (65%), biword NB (58%), and RNN (54%)
Negation Results
When negating negatives, positive activation should increase!
Demo: http://nlp.stanford.edu:8080/sentiment/
A disappointment
[Bar chart: accuracy (roughly 78–89%) of NB, RNTN, and ParaVec.]
Beaten by a Paragraph Vector – a word2vec extension with no sentence structure! [Le & Mikolov 2014]
Deep Recursive Neural Networks for Compositionality in Language (Irsoy & Cardie NIPS, 2014)
Two ideas:
• Separate word and phrase embedding spaces
• Stack NNs for depth at each node
Beats paragraph vector!
Version 5: Improving Deep Learning Semantic Representations using a TreeLSTM [Tai et al., ACL 2015]
Goals:
• Still trying to represent the meaning of a sentence as a location in a (high-dimensional, continuous) vector space
• In a way that accurately handles semantic composition and sentence meaning
• Generalizing the widely used chain-structured LSTM to trees
• Beat Paragraph Vector!
Long Short-Term Memory (LSTM) Units for Sequential Composition
Use Long Short-Term Memories (Hochreiter and Schmidhuber 1997)
Gates are vectors in [0,1]^d, multiplied element-wise for soft masking
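A single LSTM cell update, written out to show the gates acting as soft element-wise masks in [0,1]^d (weights are random illustrative values; biases omitted for brevity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(9)
d = 4
x, h_prev, c_prev = (rng.standard_normal(d) for _ in range(3))
Wi, Wf, Wo, Wc = (rng.standard_normal((d, 2 * d)) * 0.1 for _ in range(4))
z = np.concatenate([x, h_prev])

i = sigmoid(Wi @ z)          # input gate: how much new content to write
f = sigmoid(Wf @ z)          # forget gate: how much old cell state to keep
o = sigmoid(Wo @ z)          # output gate: how much of the cell to expose
c_tilde = np.tanh(Wc @ z)    # candidate cell content

c = f * c_prev + i * c_tilde  # element-wise soft masking, not hard selection
h = o * np.tanh(c)
```

Because each gate value is strictly between 0 and 1 per dimension, the cell can partially preserve some dimensions of its memory while overwriting others.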
Tree-Structured Long Short-Term Memory Networks [Tai et al., ACL 2015]
Sentences have structure beyond word order – use this syntactic structure
Tree-structured LSTM: generalizes the sequential LSTM to trees with any branching factor
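A Child-Sum Tree-LSTM cell along these lines can be sketched as below (random illustrative weights, biases omitted). Note the one-forget-gate-per-child structure, which lets the cell keep or drop each child's memory separately:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(11)
d = 4
Wi, Wo, Wu, Wf = (rng.standard_normal((d, 2 * d)) * 0.1 for _ in range(4))

def child_sum_treelstm(x, children):
    """children: list of (h, c) pairs; any branching factor is allowed."""
    h_sum = sum(h for h, _ in children) if children else np.zeros(d)
    z = np.concatenate([x, h_sum])
    i, o, u = sigmoid(Wi @ z), sigmoid(Wo @ z), np.tanh(Wu @ z)
    c = i * u
    # One forget gate PER child: each child's memory is masked independently.
    for h_k, c_k in children:
        f_k = sigmoid(Wf @ np.concatenate([x, h_k]))
        c = c + f_k * c_k
    h = o * np.tanh(c)
    return h, c

leaf = lambda x: child_sum_treelstm(x, [])   # leaves have no children
h1, c1 = leaf(rng.standard_normal(d))
h2, c2 = leaf(rng.standard_normal(d))
h, c = child_sum_treelstm(rng.standard_normal(d), [(h1, c1), (h2, c2)])
```

With one child the recursion reduces to an ordinary sequential LSTM, which is exactly the sense in which the tree version generalizes the chain version.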
Results: Sentiment Analysis: Stanford Sentiment Treebank

Method                              Accuracy % (fine-grain, 5 classes)
RNTN (Socher et al. 2013)           45.7
Paragraph-Vec (Le & Mikolov 2014)   48.7
DRNN (Irsoy & Cardie 2014)          49.8
LSTM                                46.4
Tree LSTM (this work)               50.9
Results: Semantic Relatedness: SICK 2014 (Sentences Involving Compositional Knowledge)

Method                                Pearson correlation
Word vector average                   0.758
Meaning Factory (Bjerva et al. 2014)  0.827
ECNU (Zhao et al. 2014)               0.841
LSTM                                  0.853
Tree LSTM                             0.868
Forget Gates: Selective State Preservation
• Stripes = forget gate activations; more white ⇒ more preserved
Tree structure helps
"It 's actually pretty good in the first few minutes , but the longer the movie goes , the worse it gets ."
Gold: −  LSTM: −  TreeLSTM: −
"The longer the movie goes , the worse it gets , but it was actually pretty good in the first few minutes ."
Gold: −  LSTM: +  TreeLSTM: −
Natural Language Inference
Can we tell if one piece of text follows from another?
• Two senators received contributions engineered by lobbyist Jack Abramoff in return for political favors.
• Jack Abramoff attempted to bribe two legislators.
Natural Language Inference = Recognizing Textual Entailment [Dagan 2005, MacCartney & Manning, 2009]
Natural language inference: The 3-way classification task
James Byron Dean refused to move without blue jeans
{entails, contradicts, neither}
James Dean didn't dance without pants
The task: Natural language inference
Claim: Simple task to define, but engages the full complexity of compositional semantics:
• Lexical entailment
• Quantification
• Coreference
• Lexical/scope ambiguity
• Commonsense knowledge
• Propositional attitudes
• Modality
• Factivity and implicativity
• ...
Natural logic approach: relations (van Benthem 1988, MacCartney & Manning 2008)
Seven possible relations between phrases/sentences:

symbol  name                                    example
x ≡ y   equivalence                             couch ≡ sofa
x ⊏ y   forward entailment (strict)             crow ⊏ bird
x ⊐ y   reverse entailment (strict)             European ⊐ French
x ^ y   negation (exhaustive exclusion)         human ^ nonhuman
x | y   alternation (non-exhaustive exclusion)  cat | dog
x ‿ y   cover (exhaustive non-exclusion)        animal ‿ nonhuman
x # y   independence                            hungry # hippo
Natural logic: relation joins
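Joining relations across an inference chain can be sketched with a tiny, deliberately partial join table. Only the safe identity and transitivity cases are encoded here; MacCartney's full calculus has a complete 7×7 join table, and unlisted joins are conservatively mapped to independence in this sketch:

```python
# Relations used: ≡ equivalence, ⊏ forward entailment,
# ⊐ reverse entailment, # independence.
PARTIAL_JOIN = {
    ("⊏", "⊏"): "⊏",   # crow ⊏ bird, bird ⊏ animal  =>  crow ⊏ animal
    ("⊐", "⊐"): "⊐",   # reverse entailment is likewise transitive
}

def join(r1, r2):
    """Join two relations along a chain: x r1 y and y r2 z => x ? z."""
    if r1 == "≡":
        return r2       # equivalence is the identity element of join
    if r2 == "≡":
        return r1
    # Conservative fallback: joins not encoded here yield "no information".
    return PARTIAL_JOIN.get((r1, r2), "#")

# e.g. sofa ≡ couch, couch ⊏ furniture, furniture ⊏ artifact  =>  ⊏
chain = join(join("≡", "⊏"), "⊏")
```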
Can our NNs learn to make these inferences over pairs of embedding vectors?
MacCartney's natural logic
An implementable logic for natural language inference without logical forms (MacCartney and Manning '09)
● Sound logical interpretation (Icard and Moss '13)
A neural network for NLI [Bowman 2014]
● Words are learned embedding vectors
● One TreeRNN or TreeRNTN per sentence
● Softmax emits label
● Learn everything with SGD
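At the top of such a network, the two sentence encodings feed a softmax over the three labels. A sketch assuming the sentence vectors are already computed (e.g. by a TreeRNN); weights are random and names hypothetical:

```python
import numpy as np

rng = np.random.default_rng(13)
d, n_classes = 6, 3   # 3-way: entails / contradicts / neither

# Illustrative classifier weights over the concatenated sentence pair.
W_cls = rng.standard_normal((n_classes, 2 * d)) * 0.1
b_cls = np.zeros(n_classes)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify_pair(premise_vec, hypothesis_vec):
    """Emit a distribution over {entails, contradicts, neither}."""
    logits = W_cls @ np.concatenate([premise_vec, hypothesis_vec]) + b_cls
    return softmax(logits)

probs = classify_pair(rng.standard_normal(d), rng.standard_normal(d))
```

In the full model, gradients flow from this softmax back through the sentence encoders and the word embeddings, so everything is learned jointly with SGD.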
Natural language inference data [Bowman, Manning & Potts, to appear EMNLP 2015]
• To do NLI on real English, we need to teach an NN model English almost from scratch
• What data do we have to work with?
• Word embeddings: GloVe/word2vec (useful with any data source)
• SICK: Thousands of examples created by editing and pairing hundreds of sentences
• RTE: Hundreds of examples created by hand
• DenotationGraph: Millions of extremely noisy examples (~73% correct?) constructed fully automatically
Results on SICK (+DG, +tricks)

Model             SICK Train  DG Train  Test
Most freq. class  56.7%       50.0%     56.7%
30 dim TreeRNN    95.4%       67.0%     74.9%
50 dim TreeRNTN   97.8%       74.0%     76.9%
Are we competitive on SICK? Sort of...
Best result (U. Illinois): 84.5% ≈ interannotator agreement!
Median submission (out of 18): 77%
Our TreeRNTN: 76.9%
We're a purely-learned system; none of the ones in the competition were.
Natural language inference data [Bowman, Manning & Potts, to appear EMNLP 2015]
• Stanford NLI corpus: ~600k examples, written by Turkers
The Stanford NLI corpus
Initial SNLI Results
Model Accuracy
100d sum of words 75.3
100d TreeRNN 72.2
100d LSTM TreeRNN 77.6
Lexicalized linear classifier with cross-‐bigram features 78.2
Envoi
We want more than word meanings! We want:
• Meanings of larger units, calculated compositionally
• The ability to do natural language inference