NLP and Deep Learning 2: Compositional Deep Learning
Christopher Manning
Stanford University
@chrmanning
2015 Deep Learning Summer School, Montreal
Compositionality
Artificial Intelligence requires being able to understand bigger things from knowing about smaller parts
We need more than word embeddings! What of larger semantic units?
How can we know when larger units are similar in meaning?
• The snowboarder is leaping over the mogul
• A person on a snowboard jumps into the air
People interpret the meaning of larger text units – entities, descriptive terms, facts, arguments, stories – by semantic composition of smaller elements
Representing Phrases as Vectors
[Figure: a 2-D word vector space with points such as Monday (9, 2), Tuesday (9.5, 1.5), France (2, 2.5), Germany (1, 3), and the phrases "the country of my birth" (1, 5) and "the place where I was born" (1.1, 4) plotted close together.]
Can we extend the ideas of word vector spaces to phrases?
Vectors for single words are useful as features but limited!
How should we map phrases into a vector space?
[Figure: the phrase "the country of my birth", with a 2-d vector below each word.]
Use the principle of compositionality!
The meaning (vector) of a sentence is determined by (1) the meanings of its words and (2) a method that combines them.
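As a minimal illustration of "a method that combines them", the simplest composition function just averages the word vectors. The 2-d vectors below are made-up illustrative values, not numbers from the slides:

```python
import numpy as np

# Hypothetical 2-d word vectors (illustrative values only).
vec = {
    "the": np.array([0.4, 0.3]),
    "country": np.array([2.3, 3.6]),
    "of": np.array([0.5, 0.1]),
    "my": np.array([0.2, 0.4]),
    "birth": np.array([2.1, 3.3]),
}

def compose_average(words):
    """Simplest composition method: average the word vectors."""
    return np.mean([vec[w] for w in words], axis=0)

# One vector for the whole phrase, living in the same space as the words.
phrase = compose_average(["the", "country", "of", "my", "birth"])
```

Averaging ignores word order and syntax entirely; the models below are progressively better "methods that combine".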
[Figure: the same 2-D vector space, now with composed phrase vectors: "the country of my birth" and "the place where I was born" land near each other, alongside Monday, Tuesday, France, and Germany.]
Can we build meaning composition functions in deep learning systems?
Conjecture
You can attempt to model language with a simple, uniform architecture:
• A sequence model (RNN, LSTM, …)
• A 1-d convolutional neural network
However, maybe one can produce a better composition function for language by modeling an input-specific compositional tree
A "generic" hierarchy on natural language doesn't make sense: a node would have to represent a sentence fragment like "cat sat on," which doesn't make much sense.
[Figure: the words "The cat sat on the mat." each paired with a 2-d feature vector, combined under a generic balanced tree.]
Feature representation for words
What we want: An input-dependent tree structure
[Figure: a syntactic parse tree over "The cat sat on the mat." (S → NP VP; VP → V PP; PP → P NP), with a feature vector at each word. The PP node's job is to represent "on the mat."]
Strong priors? Universals of language?
• This is a controversial issue (read: Chomsky!), but there does seem to be a fairly common structure over all human languages
• Much of this may be functionally motivated
• To what extent should we use these priors in our ML models?
Where does the tree structure come from?
1. It can come from a conventional statistical NLP parser, such as the Stanford Parser's PCFG
2. It can be built by a neural network component, such as a neural network dependency parser [advertisement]
3. It can be learned and built as part of the training/operation of the TreeRNN system, by adding another matrix to score the goodness of constituents built
Mainly, we've done 1 or 2.
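Option 3 can be sketched as greedy constituent building: repeatedly merge whichever adjacent pair of vectors the scoring weights like best. The numpy sketch below uses random, untrained weights purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Illustrative random parameters: a composition matrix and a scoring vector.
W = rng.standard_normal((d, 2 * d)) * 0.1
b = np.zeros(d)
w_score = rng.standard_normal(d)

def compose(c1, c2):
    """Compose two child vectors into a parent vector plus a goodness score."""
    p = np.tanh(W @ np.concatenate([c1, c2]) + b)
    return p, w_score @ p

def greedy_tree(leaves):
    """Option 3: repeatedly merge the adjacent pair with the highest score."""
    nodes = list(leaves)
    while len(nodes) > 1:
        scores = [compose(nodes[i], nodes[i + 1])[1] for i in range(len(nodes) - 1)]
        i = int(np.argmax(scores))
        parent, _ = compose(nodes[i], nodes[i + 1])
        nodes[i:i + 2] = [parent]   # replace the pair with its parent
    return nodes[0]

root = greedy_tree([rng.standard_normal(d) for _ in range(5)])
```

In training, the scoring matrix is learned so that good (gold-tree) merges outscore bad ones.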
Transition-based dependency parsers
Decide the next move from the configuration
Feature templates: usually a combination of 1–3 elements from the configuration.
Sparse! Incomplete! Slow!!! (95% of time)
Indicator features: binary, sparse, dim = 10^6 – 10^7
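For concreteness, a transition-based (arc-standard style) configuration and its three moves can be sketched as below. The toy sentence and the hand-chosen oracle move sequence are hypothetical; a real parser chooses each move with a classifier over features of the configuration:

```python
# A minimal arc-standard transition system: the parser state is
# (stack, buffer, arcs), and each step applies one of three moves.

def shift(stack, buf, arcs):
    """Move the next buffer word onto the stack."""
    stack.append(buf.pop(0))

def left_arc(stack, buf, arcs):
    """Second-from-top becomes a dependent of the top; record (head, dep)."""
    dep = stack.pop(-2)
    arcs.append((stack[-1], dep))

def right_arc(stack, buf, arcs):
    """Top becomes a dependent of the new top; record (head, dep)."""
    dep = stack.pop()
    arcs.append((stack[-1], dep))

# Parse "He saw her" with a hand-chosen (oracle) transition sequence.
stack, buf, arcs = [], ["He", "saw", "her"], []
for move in [shift, shift, left_arc, shift, right_arc]:
    move(stack, buf, arcs)
# Final state: stack holds only the root word "saw";
# arcs are ("saw", "He") and ("saw", "her").
```

The sparse indicator features above are templates over this configuration (e.g. the word on top of the stack); the neural parser below replaces them with dense embeddings of the same configuration elements.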
Deep Learning Dependency Parser [Chen & Manning, EMNLP 2014] http://nlp.stanford.edu/software/nndep.shtml
• An accurate and fast neural-network-based dependency parser!
• Parsing to Stanford Dependencies:
• Unlabeled attachment score (UAS) = head
• Labeled attachment score (LAS) = head and label

Parser        UAS   LAS   sent/s
MaltParser    89.8  87.2  469
MSTParser     91.4  88.1  10
TurboParser   92.3  89.6  8
Our Parser    92.0  89.7  654
Google (pulling out all the stops)  94.3  92.4
Five attempts at meaning composition
Tree Recursive Neural Networks (Tree RNNs)
[Figure: a TreeRNN over "on the mat.": child vectors are fed through a neural network layer to produce a parent vector and a score; the same network is applied recursively up the tree.]
Computational unit: Neural Network layer(s), applied recursively
(Goller & Küchler 1996, Costa et al. 2003, Socher et al. ICML, 2011)
Version 1: Simple concatenation Tree RNN
Earlier TreeRNN work includes Goller & Küchler (1996), with a fixed tree structure; Costa et al. (2003), using an RNN for PP attachment, but on one-hot vectors; and Bottou (2011) for compositionality with recursion.

p = tanh(W [c1; c2] + b), where tanh applies element-wise
score = W_score · p
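Applying this single composition function recursively over a given binary tree might look like the following numpy sketch. The weights are random illustrative values, and the binary tree over "the cat sat on the mat" is just an example parse:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
W = rng.standard_normal((d, 2 * d)) * 0.1   # the single shared composition matrix
b = np.zeros(d)

def tree_rnn(node, vec):
    """Leaves are words (strings); internal nodes are (left, right) pairs."""
    if isinstance(node, str):
        return vec[node]
    left = tree_rnn(node[0], vec)
    right = tree_rnn(node[1], vec)
    # Version 1 composition: p = tanh(W [c1; c2] + b)
    return np.tanh(W @ np.concatenate([left, right]) + b)

vec = {w: rng.standard_normal(d) for w in ["the", "cat", "sat", "on", "mat"]}
tree = (("the", "cat"), ("sat", ("on", ("the", "mat"))))   # example parse
root = tree_rnn(tree, vec)   # one vector for the whole sentence
```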
Semantic similarity: nearest neighbors
"All the figures are adjusted for seasonal variations"
1. All the numbers are adjusted for seasonal fluctuations
2. All the figures are adjusted to remove usual seasonal patterns
"Knight-Ridder would n't comment on the offer"
1. Harsco declined to say what country placed the order
2. Coastal would n't disclose the terms
"Sales grew almost 7% to $UNK m. from $UNK m."
1. Sales rose more than 7% to $94.9 m. from $88.3 m.
2. Sales surged 40% to UNK b. yen from UNK b.
Version 1 Limitations
• Composition function is a single weight matrix!
• No real interaction between the input words!
• Not adequate as a composition function for human language
Version 2: PCFG + Syntactically-Untied RNN
• A symbolic Context-Free Grammar (CFG) backbone is adequate for basic syntactic structure
• We use the discrete syntactic categories of the children to choose the composition matrix
• An RNN can do better with a different composition matrix for different syntactic environments
• The result gives us a better semantics
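Syntactic untying amounts to indexing the composition matrix by the children's categories. A toy sketch, where both the category pairs and the weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
d = 3

# One composition matrix per (left-category, right-category) pair.
# Pairs and values are illustrative, not the learned parameters.
W_by_pair = {
    ("DT", "NP"): rng.standard_normal((d, 2 * d)) * 0.1,
    ("VP", "PP"): rng.standard_normal((d, 2 * d)) * 0.1,
}
b = np.zeros(d)

def compose(cat1, c1, cat2, c2):
    """Choose the matrix by the children's discrete syntactic categories."""
    W = W_by_pair[(cat1, cat2)]
    return np.tanh(W @ np.concatenate([c1, c2]) + b)

p = compose("DT", rng.standard_normal(d), "NP", rng.standard_normal(d))
```

The PCFG backbone supplies the categories, so the neural part only has to model meaning within each syntactic environment.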
Experiments
Standard WSJ split, labeled F1
Parser                                                      Test F1, all sentences
Stanford PCFG (Klein and Manning, 2003a)                    85.5
Stanford Factored (Klein and Manning, 2003b)                86.6
Factored PCFGs (Hall and Klein, 2012)                       89.4
Collins (Collins, 1997)                                     87.7
SSN (Henderson, 2004)                                       89.4
Berkeley Parser (Petrov and Klein, 2007)                    90.1
CVG (RNN) (Socher et al., ACL 2013)                         85.0
CVG (SU-RNN) (Socher et al., ACL 2013)                      90.4
Charniak, Self-Trained (McClosky et al. 2006)               91.0
Charniak, Self-Trained & Reranked (McClosky et al. 2006)    92.1
SU-RNN / CVG [Socher, Bauer, Manning, Ng 2013]
Learns a soft notion of head words. Initialization:
[Figure: visualizations of the learned composition matrices for syntactic category pairs NP-CC, PP-NP, PRP$-NP, ADJP-NP, ADVP-ADJP, JJ-NP, DT-NP.]
Version 3: Matrix-vector RNNs [Socher, Huval, Bhat, Manning, & Ng, 2012]
Every word and phrase has both a vector and a matrix. For children (a, A) and (b, B):
p = g(W [B a ; A b])
P = W_M [A ; B]
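Under the usual MV-RNN formulation, each child contributes a (vector, matrix) pair, and each child's matrix first modifies the other child's vector. A numpy sketch with random illustrative weights:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 3
W = rng.standard_normal((d, 2 * d)) * 0.1     # combines the cross-modified vectors
W_M = rng.standard_normal((d, 2 * d)) * 0.1   # combines the two child matrices

def mv_compose(a, A, b_vec, B):
    """Children are (vector, matrix) pairs; matrices modify the sibling's vector."""
    p = np.tanh(W @ np.concatenate([B @ a, A @ b_vec]))   # parent vector
    P = W_M @ np.vstack([A, B])                           # parent matrix (d x d)
    return p, P

a, b_vec = rng.standard_normal(d), rng.standard_normal(d)
A, B = rng.standard_normal((d, d)), rng.standard_normal((d, d))
p, P = mv_compose(a, A, b_vec, B)
```

The price is a d×d matrix per vocabulary word, which is a lot of parameters; Version 4 addresses exactly this.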
Classification of Semantic Relationships
• Can an MV-RNN learn how a large syntactic context conveys a semantic relationship?
• My [apartment]e1 has a pretty large [kitchen]e2 → component-whole relationship (e2, e1)
• Build a single compositional semantics for the minimal constituent including both terms
Classification of Semantic Relationships
Classifier  Features                                                                   F1
SVM         POS, stemming, syntactic patterns                                          60.1
MaxEnt      POS, WordNet, morphological features, noun compound system, thesauri,
            Google n-grams                                                             77.6
SVM         POS, WordNet, prefixes, morphological features, dependency parse
            features, Levin classes, PropBank, FrameNet, NomLex-Plus,
            Google n-grams, paraphrases, TextRunner                                    82.2
RNN         –                                                                          74.8
MV-RNN      –                                                                          79.1
MV-RNN      POS, WordNet, NER                                                          82.4
Version 4: Recursive Neural Tensor Network
• Fewer parameters than the MV-RNN
• Allows the two word or phrase vectors to interact multiplicatively
Beyond the bag of words: Sentiment detection
Is the tone of a piece of text positive, negative, or neutral?
• A common claim is that sentiment is "easy"
• Detection accuracy for longer documents is ~90%, BUT…
[Example: a long document where scattered cue words – loved, great, impressed, marvelous – suffice for a bag-of-words classifier.]
Stanford Sentiment Treebank
• 215,154 phrases labeled in 11,855 sentences
• Can actually train and test compositions
http://nlp.stanford.edu:8080/sentiment/
Better Dataset Helped All Models
• Hard negation cases are still mostly incorrect
• We also need a more powerful model!
[Bar chart: accuracy (roughly 75–84%) of Bi NB, RNN, and MV-RNN, each trained with sentence labels vs. with the full treebank; treebank training helps all models.]
Version 4: Recursive Neural Tensor Network
Idea: Allow both additive and mediated multiplicative interactions of vectors
• Use resulting vectors in tree as input to a classifier like logistic regression
• Train all weights jointly with gradient descent
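The tensor layer gives each output dimension its own bilinear form over the concatenated children, added to the ordinary linear term. A numpy sketch with random illustrative weights:

```python
import numpy as np

rng = np.random.default_rng(5)
d = 3
V = rng.standard_normal((d, 2 * d, 2 * d)) * 0.01   # one tensor slice per output dim
W = rng.standard_normal((d, 2 * d)) * 0.1           # ordinary additive term
b = np.zeros(d)

def rntn_compose(c1, c2):
    """p_k = tanh(x^T V[k] x + (W x + b)_k), with x = [c1; c2]."""
    x = np.concatenate([c1, c2])
    # Multiplicative (bilinear) interaction between the two children:
    tensor_term = np.array([x @ V[k] @ x for k in range(d)])
    return np.tanh(tensor_term + W @ x + b)

p = rntn_compose(rng.standard_normal(d), rng.standard_normal(d))
```

Because the tensor is shared across the whole tree rather than stored per word (as the MV-RNN's matrices are), the model has fewer parameters while keeping multiplicative interactions.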
Positive/Negative Results on Treebank
[Bar chart: accuracy (roughly 74–86%) of Bi NB, RNN, MV-RNN, and RNTN, each trained with sentence labels vs. with the treebank.]
Classifying Sentences: Accuracy improves to 85.4
[Figure: a 2-D projection of the learned word vectors over the sentiment-model vocabulary, plotting several hundred items, from punctuation and "ability" through "young" and "your".]
Experimental Results on Treebank
• RNTN can capture constructions like X but Y
• RNTN accuracy of 72%, compared to MV-RNN (65%), biword NB (58%), and RNN (54%)
Negation Results
When negating negatives, positive activation should increase!
Demo: http://nlp.stanford.edu:8080/sentiment/
A disappointment
[Bar chart: accuracy (roughly 78–89%) of NB, RNTN, and ParaVec.]
Beaten by a Paragraph Vector – a word2vec extension with no sentence structure! [Le & Mikolov 2014]
Deep Recursive Neural Networks for Compositionality in Language (Irsoy & Cardie NIPS, 2014)
Two ideas:
• Separate word and phrase embedding spaces
• Stack NNs for depth at each node
Beats paragraph vector!
Version 5: Improving Deep Learning Semantic Representations using a TreeLSTM [Tai et al., ACL 2015]
Goals:
• Still trying to represent the meaning of a sentence as a location in a (high-dimensional, continuous) vector space
• In a way that accurately handles semantic composition and sentence meaning
• Generalizing the widely used chain-structured LSTM to trees
• Beat Paragraph Vector!
Long Short-Term Memory (LSTM) Units for Sequential Composition
Use Long Short-Term Memories (Hochreiter and Schmidhuber 1997)
Gates are vectors in [0,1]^d, multiplied element-wise for soft masking
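A single LSTM cell update, written out to show the gates acting as soft element-wise masks in [0,1]^d (weights are random illustrative values; biases omitted for brevity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(9)
d = 4
x, h_prev, c_prev = (rng.standard_normal(d) for _ in range(3))
Wi, Wf, Wo, Wc = (rng.standard_normal((d, 2 * d)) * 0.1 for _ in range(4))
z = np.concatenate([x, h_prev])

i = sigmoid(Wi @ z)          # input gate: how much new content to write
f = sigmoid(Wf @ z)          # forget gate: how much old cell state to keep
o = sigmoid(Wo @ z)          # output gate: how much of the cell to expose
c_tilde = np.tanh(Wc @ z)    # candidate cell content

c = f * c_prev + i * c_tilde  # element-wise soft masking, not hard selection
h = o * np.tanh(c)
```

Because each gate value is strictly between 0 and 1 per dimension, the cell can partially preserve some dimensions of its memory while overwriting others.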
Tree-Structured Long Short-Term Memory Networks [Tai et al., ACL 2015]
Sentences have structure beyond word order – use this syntactic structure
Tree-structured LSTM: generalizes the sequential LSTM to trees with any branching factor
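A Child-Sum Tree-LSTM cell along these lines can be sketched as below (random illustrative weights, biases omitted). Note the one-forget-gate-per-child structure, which lets the cell keep or drop each child's memory separately:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(11)
d = 4
Wi, Wo, Wu, Wf = (rng.standard_normal((d, 2 * d)) * 0.1 for _ in range(4))

def child_sum_treelstm(x, children):
    """children: list of (h, c) pairs; any branching factor is allowed."""
    h_sum = sum(h for h, _ in children) if children else np.zeros(d)
    z = np.concatenate([x, h_sum])
    i, o, u = sigmoid(Wi @ z), sigmoid(Wo @ z), np.tanh(Wu @ z)
    c = i * u
    # One forget gate PER child: each child's memory is masked independently.
    for h_k, c_k in children:
        f_k = sigmoid(Wf @ np.concatenate([x, h_k]))
        c = c + f_k * c_k
    h = o * np.tanh(c)
    return h, c

leaf = lambda x: child_sum_treelstm(x, [])   # leaves have no children
h1, c1 = leaf(rng.standard_normal(d))
h2, c2 = leaf(rng.standard_normal(d))
h, c = child_sum_treelstm(rng.standard_normal(d), [(h1, c1), (h2, c2)])
```

With one child the recursion reduces to an ordinary sequential LSTM, which is exactly the sense in which the tree version generalizes the chain version.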
Results: Sentiment Analysis: Stanford Sentiment Treebank

Method                              Accuracy % (fine-grain, 5 classes)
RNTN (Socher et al. 2013)           45.7
Paragraph-Vec (Le & Mikolov 2014)   48.7
DRNN (Irsoy & Cardie 2014)          49.8
LSTM                                46.4
Tree LSTM (this work)               50.9
Results: Semantic Relatedness: SICK 2014 (Sentences Involving Compositional Knowledge)

Method                                Pearson correlation
Word vector average                   0.758
Meaning Factory (Bjerva et al. 2014)  0.827
ECNU (Zhao et al. 2014)               0.841
LSTM                                  0.853
Tree LSTM                             0.868
Forget Gates: Selective State Preservation
• Stripes = forget gate activations; more white ⇒ more preserved
Tree structure helps
"It 's actually pretty good in the first few minutes , but the longer the movie goes , the worse it gets ."
Gold: −  LSTM: −  TreeLSTM: −
"The longer the movie goes , the worse it gets , but it was actually pretty good in the first few minutes ."
Gold: −  LSTM: +  TreeLSTM: −
Natural Language Inference
Can we tell if one piece of text follows from another?
• Two senators received contributions engineered by lobbyist Jack Abramoff in return for political favors.
• Jack Abramoff attempted to bribe two legislators.
Natural Language Inference = Recognizing Textual Entailment [Dagan 2005, MacCartney & Manning, 2009]
Natural language inference: The 3-way classification task
James Byron Dean refused to move without blue jeans
{entails, contradicts, neither}
James Dean didn't dance without pants
The task: Natural language inference
Claim: Simple task to define, but engages the full complexity of compositional semantics:
• Lexical entailment
• Quantification
• Coreference
• Lexical/scope ambiguity
• Commonsense knowledge
• Propositional attitudes
• Modality
• Factivity and implicativity
• ...
Natural logic approach: relations (van Benthem 1988, MacCartney & Manning 2008)
Seven possible relations between phrases/sentences:

symbol  name                                    example
x ≡ y   equivalence                             couch ≡ sofa
x ⊏ y   forward entailment (strict)             crow ⊏ bird
x ⊐ y   reverse entailment (strict)             European ⊐ French
x ^ y   negation (exhaustive exclusion)         human ^ nonhuman
x | y   alternation (non-exhaustive exclusion)  cat | dog
x ‿ y   cover (exhaustive non-exclusion)        animal ‿ nonhuman
x # y   independence                            hungry # hippo
Natural logic: relation joins
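Joining relations across an inference chain can be sketched with a tiny, deliberately partial join table. Only the safe identity and transitivity cases are encoded here; MacCartney's full calculus has a complete 7×7 join table, and unlisted joins are conservatively mapped to independence in this sketch:

```python
# Relations used: ≡ equivalence, ⊏ forward entailment,
# ⊐ reverse entailment, # independence.
PARTIAL_JOIN = {
    ("⊏", "⊏"): "⊏",   # crow ⊏ bird, bird ⊏ animal  =>  crow ⊏ animal
    ("⊐", "⊐"): "⊐",   # reverse entailment is likewise transitive
}

def join(r1, r2):
    """Join two relations along a chain: x r1 y and y r2 z => x ? z."""
    if r1 == "≡":
        return r2       # equivalence is the identity element of join
    if r2 == "≡":
        return r1
    # Conservative fallback: joins not encoded here yield "no information".
    return PARTIAL_JOIN.get((r1, r2), "#")

# e.g. sofa ≡ couch, couch ⊏ furniture, furniture ⊏ artifact  =>  ⊏
chain = join(join("≡", "⊏"), "⊏")
```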
Can our NNs learn to make these inferences over pairs of embedding vectors?
MacCartney's natural logic
An implementable logic for natural language inference without logical forms (MacCartney and Manning '09)
● Sound logical interpretation (Icard and Moss '13)
A neural network for NLI [Bowman 2014]
● Words are learned embedding vectors
● One TreeRNN or TreeRNTN per sentence
● Softmax emits label
● Learn everything with SGD
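At the top of such a network, the two sentence encodings feed a softmax over the three labels. A sketch assuming the sentence vectors are already computed (e.g. by a TreeRNN); weights are random and names hypothetical:

```python
import numpy as np

rng = np.random.default_rng(13)
d, n_classes = 6, 3   # 3-way: entails / contradicts / neither

# Illustrative classifier weights over the concatenated sentence pair.
W_cls = rng.standard_normal((n_classes, 2 * d)) * 0.1
b_cls = np.zeros(n_classes)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify_pair(premise_vec, hypothesis_vec):
    """Emit a distribution over {entails, contradicts, neither}."""
    logits = W_cls @ np.concatenate([premise_vec, hypothesis_vec]) + b_cls
    return softmax(logits)

probs = classify_pair(rng.standard_normal(d), rng.standard_normal(d))
```

In the full model, gradients flow from this softmax back through the sentence encoders and the word embeddings, so everything is learned jointly with SGD.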
Natural language inference data [Bowman, Manning & Potts, to appear EMNLP 2015]
• To do NLI on real English, we need to teach an NN model English almost from scratch
• What data do we have to work with?
• Word embeddings: GloVe/word2vec (useful with any data source)
• SICK: Thousands of examples created by editing and pairing hundreds of sentences
• RTE: Hundreds of examples created by hand
• DenotationGraph: Millions of extremely noisy examples (~73% correct?) constructed fully automatically
Results on SICK (+DG, +tricks)

Model             SICK Train  DG Train  Test
Most freq. class  56.7%       50.0%     56.7%
30 dim TreeRNN    95.4%       67.0%     74.9%
50 dim TreeRNTN   97.8%       74.0%     76.9%
Are we competitive on SICK? Sort of...
Best result (U. Illinois): 84.5% ≈ interannotator agreement!
Median submission (out of 18): 77%
Our TreeRNTN: 76.9%
We're a purely-learned system; none of the ones in the competition were.
Natural language inference data [Bowman, Manning & Potts, to appear EMNLP 2015]
• Stanford NLI corpus: ~600k examples, written by Turkers
The Stanford NLI corpus
Initial SNLI Results
Model Accuracy
100d sum of words 75.3
100d TreeRNN 72.2
100d LSTM TreeRNN 77.6
Lexicalized linear classifier with cross-‐bigram features 78.2
Envoi
We want more than word meanings! We want:
• Meanings of larger units, calculated compositionally
• The ability to do natural language inference