Graph Representation Learning for Natural Language Understanding and Reasoning
Jian Tang
MILA, HEC Montreal
Why Graphs?
• Graphs are general data structures and flexible to encode complex relationships between different objects
• Graphs can be used for encoding various relationships between different semantic units (e.g., words, entities, sentences, and documents)
• Many natural language understanding tasks rely on different kinds of graph structures
Example 1: Word Co-occurrence Graph
• Local-context word co-occurrence graph
  • Words within a window are assumed to co-occur with each other
  • Key information used for learning word embeddings by several models (e.g., SkipGram, GloVe)
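A minimal sketch of building such a window-based co-occurrence graph (illustrative only, not any particular paper's implementation; `window` is the number of following tokens each word is linked to):

```python
from collections import Counter

def cooccurrence_graph(tokens, window=2):
    """Count co-occurrences of word pairs within a sliding window.

    Returns a Counter mapping unordered word pairs to edge weights.
    """
    edges = Counter()
    for i, w in enumerate(tokens):
        # link w to the next `window` tokens (skipping self-pairs)
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if w != tokens[j]:
                edges[tuple(sorted((w, tokens[j])))] += 1
    return edges

g = cooccurrence_graph("graph embedding maps graph nodes".split(), window=2)
```

Edge weights accumulate each time a pair falls inside the window, so frequent neighbors get heavier edges.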
Figure: Word co-occurrence graph
Example 2: Word-Document Graph
• Encode document-level word occurrence information
• Important information used for learning word and document representations
• Key information used by models such as ParagraphVec and statistical topic models (e.g., Latent Dirichlet Allocation)
Figure: Word-document graph
Example 3: Sentences as Graphs
• Encoding semantic and syntactic dependency relationships between words
• Useful for a variety of tasks
  • E.g., sentence classification, semantic role labeling, and machine translation
Figure: semantic and syntactic dependency graph (Marcheggiani and Titov, 2017)
Example 4: Knowledge Graph
• Encode the relationships between different entities
  • E.g., Google’s Freebase and Microsoft’s Satori
• Useful for tasks such as question answering and search
Many other NLP tasks based on graphs
• Graph-based methods for word sense disambiguation
• Graph-based methods for text summarization
• Graph-based strategies for semantic relation identification
• Graph-based representations for ontology learning
• ….
A very active topic in NLP communities
• TextGraphs Workshops: a specific workshop on graph-based methods for NLP problems in the NLP conferences
Most of these works are based on traditional graph-based methods
• E.g., PageRank, label propagation
Progress on Graph Representation Learning
• LINE, DeepWalk, node2vec
• Graph Convolutional Networks
• Graph Attention Networks
• Neural Message Passing Networks
Outline
• Recent progress on graph representation learning
  • Unsupervised node representation
  • Semi-supervised node representation
  • Learning representation of entire graph
• Unsupervised text representation learning
• Semi-supervised text representation learning
• Sentence representation learning
• Keyphrase Extraction
• Extractive Summarization
• Future Directions
Unsupervised Node Representations
• Learning node representations with graph structures
  • Preserve the similarities between the nodes
• LINE (Tang et al. 2015), DeepWalk, node2vec
LINE (Tang et al. 2015)
• Key idea: preserve the neighborhood structure of each node
• Empirical distribution of neighborhood structure
• Model distribution of neighborhood structure
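The two distributions themselves did not survive transcription. For second-order proximity, the LINE paper defines them (up to notational detail; $u_i$ is node $i$'s embedding, $u'_j$ node $j$'s "context" embedding, $w_{ij}$ the edge weight) as:

```latex
% Empirical distribution: fraction of i's total edge weight going to j
\hat{p}_2(v_j \mid v_i) = \frac{w_{ij}}{d_i}, \qquad d_i = \sum_{k \in N(i)} w_{ik}

% Model distribution: softmax over context embeddings
p_2(v_j \mid v_i) = \frac{\exp\!\left({u'_j}^{\top} u_i\right)}{\sum_{k=1}^{|V|} \exp\!\left({u'_k}^{\top} u_i\right)}
```

Training makes the model distribution match the empirical one for every node, by minimizing a KL divergence between the two.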
Semi-supervised Node Representations
• Learning node representations with supervision from specific tasks
• Neural message passing algorithms:
• Graph convolutional neural networks (GCN) (Kipf and Welling, 2016)
• Graph attention networks (Veličković et al. 2017)
• Neural message passing algorithms for quantum chemistry (Gilmer et al. 2017)
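As a toy illustration of one GCN propagation step in the Kipf and Welling formulation (a sketch with made-up inputs, not any paper's code):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph convolutional layer: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    d = A_hat.sum(axis=1)                    # degrees of the self-looped graph
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))   # D^-1/2
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

A = np.array([[0.0, 1.0], [1.0, 0.0]])  # two connected nodes
H = np.eye(2)                            # one-hot initial features
W = np.eye(2)                            # identity weights for illustration
H1 = gcn_layer(A, H, W)
```

With identity weights, each node's new representation is simply the degree-normalized average of itself and its neighbor.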
Neural Message Passing Algorithms (Gilmer et al. 2017)
• Key idea: iteratively update the node representations
• Aggregate the messages from neighbors
• Update the nodes based on aggregated messages and the current node representations
• Both messages and node updating function are defined as neural networks
• Predict the targets with the node representations in the last layer and do backpropagation
Figure: node v aggregates messages from its neighbors w, then updates its representation.
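The aggregate-then-update loop can be sketched in a few lines of numpy. This is a toy illustration, not the authors' implementation: the message and update functions here are a single linear map and a tanh rather than learned neural networks.

```python
import numpy as np

def message_passing_step(adj, h, W_msg, W_upd):
    """One round of neural message passing:
    each node sums messages from its neighbors, then updates its state
    from the aggregated messages and its current representation."""
    messages = adj @ (h @ W_msg)           # aggregate: sum of neighbors' messages
    return np.tanh(h @ W_upd + messages)   # update: combine with current state

adj = np.array([[0.0, 1.0], [1.0, 0.0]])   # two connected nodes
h = np.eye(2)                               # initial node representations
h1 = message_passing_step(adj, h, np.eye(2), np.zeros((2, 2)))
```

Stacking several such steps lets information propagate multiple hops; the final representations feed the prediction layer, and everything is trained by backpropagation.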
Learning Representation of an Entire Graph
• Neural message passing algorithms
  • Iteratively update the node representations
• Summarize the graph representation with the final node representations
  • Pooling function
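A minimal sketch of such a readout, assuming mean pooling (other permutation-invariant pooling functions such as sum or max work the same way):

```python
import numpy as np

def graph_readout(h_nodes):
    """Permutation-invariant readout: mean-pool the final node
    representations into a single vector for the entire graph."""
    return h_nodes.mean(axis=0)

# Two nodes with 2-dimensional final representations.
g_vec = graph_readout(np.array([[1.0, 2.0], [3.0, 4.0]]))
```

Because the pooling is order-independent, the graph vector does not depend on how nodes are numbered.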
Outline
• Recent progress on graph representation learning
  • Unsupervised node representation
  • Semi-supervised node representation
  • Learning representation of entire graph
• Unsupervised text representation learning
• Semi-supervised text representation learning
• Sentence representation learning
• Keyphrase Extraction
• Extractive Summarization
• Future Directions
Unsupervised Text Representation (Tang et al. 2015a)
• Learning text representations with text graphs
  • Word co-occurrence graph
  • Word-document graph
Figure: unstructured text is converted into a word co-occurrence graph and a word-document graph, from which text representations (word and document embeddings) are learned.
Word Analogy (Tang et al. 2015a)
• Entire Wikipedia articles => word co-occurrence network (~2M words, 1B edges)
• Size of word co-occurrence networks does not grow linearly with data size
  • Only the weights of edges change
• LINE outperforms SkipGram:

Algorithm   Semantic (%)   Syntactic (%)   Overall
SkipGram    69.14          57.94          63.02
LINE        73.79          59.72          66.10
Outline
• Recent progress on graph representation learning
  • Unsupervised node representation
  • Semi-supervised node representation
  • Learning representation of entire graph
• Unsupervised text representation learning
• Semi-supervised text representation learning
• Sentence representation learning
• Keyphrase Extraction
• Extractive Summarization
• Future Directions
Semi-supervised Text Representation (Tang et al. 2015b)
• Heterogeneous text graph
  • Word-word, word-document, and word-label graphs
• Learning word embeddings through jointly training on the heterogeneous graphs
• Document embeddings as the average of word embeddings
• Outperforms CNN on long documents
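The averaging step can be sketched as follows (toy example; `word_vectors` is a hypothetical embedding lookup standing in for the trained PTE word embeddings):

```python
import numpy as np

def document_embedding(doc_tokens, word_vectors):
    """Document embedding as the average of the embeddings of the
    words it contains (words without an embedding are skipped)."""
    vecs = [word_vectors[w] for w in doc_tokens if w in word_vectors]
    return np.mean(vecs, axis=0)

word_vectors = {"graph": np.array([1.0, 0.0]), "text": np.array([0.0, 1.0])}
d = document_embedding(["graph", "text"], word_vectors)
```

Averaging keeps inference cheap for unseen documents: no extra training pass is needed once the word embeddings are learned.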
Figure: PTE's heterogeneous text graph, consisting of a word co-occurrence graph, a word-document graph, and a word-label graph built from partially labeled unstructured text; text representations (word and document embeddings) are learned jointly from all three.
Jian Tang, Meng Qu, and Qiaozhu Mei. PTE: Predictive Text Embedding through Large-scale Heterogeneous Text Networks. KDD’15.
Outline
• Recent progress on graph representation learning
  • Unsupervised node representation
  • Semi-supervised node representation
  • Learning representation of entire graph
• Unsupervised word representation
• Semi-supervised document representation
• Sentence representation
• Keyphrase Extraction
• Extractive Summarization
• Future Directions
Learning Sentence Representations via Graph Representations (Liu et al. 2018)
Results on Sentence Classification
Table: Results on Sentence Classification
Outline
• Recent progress on graph representation learning
  • Unsupervised node representation
  • Semi-supervised node representation
  • Learning representation of entire graph
• Unsupervised word representation
• Semi-supervised document representation
• Sentence representation
• Keyphrase Extraction
• Extractive Summarization
• Future Directions
Keyphrase Extraction
• Potentially useful to a variety of applications
  • Information retrieval
• Text summarization
• Question answering
• …
Graph-based Ranking Methods (Mihalcea and Tarau 2004)
Rada Mihalcea and Paul Tarau. TextRank: Bringing Order into Texts. EMNLP 2004.
Compatibility of systems of linear constraints over the set of natural numbers.
Criteria of compatibility of a system of linear Diophantine equations, strict
inequations, and nonstrict inequations are considered. Upper bounds for
components of a minimal set of solutions and algorithms of construction of
minimal generating sets of solutions for all types of systems are given.
These criteria and the corresponding algorithms for constructing a minimal
supporting set of solutions can be used in solving all the considered types
systems and systems of mixed types.
Keywords assigned by TextRank: linear constraints; linear diophantine equations; natural numbers; nonstrict inequations; strict inequations; upper bounds
Keywords assigned by human annotators: linear constraints; linear diophantine equations; minimal generating sets; non-strict inequations; set of natural numbers; strict inequations; upper bounds
Figure 2: Sample graph built for keyphrase extraction from an Inspec abstract
the text is tokenized, and annotated with part of speech tags – a preprocessing step required to enable the application of syntactic filters. To avoid excessive growth of the graph size by adding all possible combinations of sequences consisting of more than one lexical unit (ngrams), we consider only single words as candidates for addition to the graph, with multi-word keywords being eventually reconstructed in the post-processing phase.
Next, all lexical units that pass the syntactic filter are added to the graph, and an edge is added between those lexical units that co-occur within a window of words. After the graph is constructed (undirected unweighted graph), the score associated with each vertex is set to an initial value of 1, and the ranking algorithm described in section 2 is run on the graph for several iterations until it converges – usually for 20-30 iterations, at a threshold of 0.0001.
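The iteration described in the excerpt can be sketched as follows (a toy reimplementation for illustration, not the authors' code; the 0.85 damping factor follows PageRank):

```python
def textrank(neighbors, damping=0.85, tol=1e-4, max_iter=50):
    """PageRank-style iteration on an undirected, unweighted word graph:
    S(v) = (1 - d) + d * sum over u in N(v) of S(u) / |N(u)|.
    `neighbors` maps each vertex to its set of adjacent vertices;
    all scores start at 1 and are iterated until they converge."""
    scores = {v: 1.0 for v in neighbors}
    for _ in range(max_iter):
        new = {v: (1 - damping) + damping * sum(scores[u] / len(neighbors[u])
                                                for u in neighbors[v])
               for v in neighbors}
        converged = max(abs(new[v] - scores[v]) for v in neighbors) < tol
        scores = new
        if converged:
            break
    return scores

# A toy triangle graph: every vertex plays a symmetric role.
g = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
ranks = textrank(g)
```

On a symmetric graph like this triangle, all vertices keep equal scores; on a real word graph, well-connected words accumulate higher scores and surface as keyword candidates.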
Once a final score is obtained for each vertex in the graph, vertices are sorted in reversed order of their score, and the top T vertices in the ranking are retained for post-processing. While T may be set to any fixed value, usually ranging from 5 to 20 keywords (e.g. (Turney, 1999) limits the number of keywords extracted with his GenEx system to five), we are using a more flexible approach, which decides the number of keywords based on the size of the text. For the data used in our experiments, which consists of relatively short abstracts, T is set to a third of the number of vertices in the graph.
During post-processing, all lexical units selected as potential keywords by the TextRank algorithm are marked in the text, and sequences of adjacent keywords are collapsed into a multi-word keyword. For instance, in the text Matlab code for plotting ambiguity functions, if both Matlab and code are selected as potential keywords by TextRank, since they are adjacent, they are collapsed into one single keyword Matlab code.
Figure 2 shows a sample graph built for an abstract from our test collection. While the size of the abstracts ranges from 50 to 350 words, with an average size of 120 words, we have deliberately selected a very small abstract for the purpose of illustration. For this example, the lexical units found to have higher “importance” by the TextRank algorithm are (with the TextRank score indicated in parenthesis): numbers (1.46), inequations (1.45), linear (1.29), diophantine (1.28), upper (0.99), bounds (0.99), strict (0.77). Notice that this ranking is different than the one rendered by simple word frequencies. For the same text, a frequency approach provides the following top-ranked lexical units: systems (4), types (3), solutions (3), minimal (3), linear (2), inequations (2), algorithms (2). All other lexical units have a frequency of 1, and therefore cannot be ranked, but only listed.
3.2 Evaluation
The data set used in the experiments is a collection of 500 abstracts from the Inspec database, and the corresponding manually assigned keywords. This is the same test data set as used in the keyword extraction experiments reported in (Hulth, 2003). The Inspec abstracts are from journal papers from Computer Science and Information Technology. Each abstract comes with two sets of keywords assigned by professional indexers: controlled keywords, restricted to a given thesaurus, and uncontrolled keywords, freely assigned by the indexers. We follow the evaluation approach from (Hulth, 2003), and use the uncontrolled set of keywords.
In her experiments, Hulth is using a total of 2000 abstracts, divided into 1000 for training, 500 for development, and 500 for test². Since our approach is completely unsupervised, no training/development data is required, and we are only using the test documents.
²Many thanks to Anette Hulth for allowing us to run our algorithm on the data set used in her keyword extraction experiments, and for making available the training/test/development data split.
Graph Pointer Network for Diverse Keyphrase Extraction (Sun et al. 2018)
• Encoder: graph-based document encoder
  • Word graph as input
• Decoder: pointer network over graphs
  • Select words from the input word graph
  • Promoting diversity
(Encoder pipeline in the figure: graph construction → graph convolutional networks → node representations; decoder: DivPointer with coverage attention, context modification, and document context.)
Figure 1: Illustration of our encoder-decoder architecture for keyphrase extraction. In this example, the document is a sequence of words, namely, $d = \langle x_1, x_2, x_3, x_4, x_2 \rangle$, and we have generated the first keyphrase $y^{(1)} = \langle x_2, x_3 \rangle$. We are predicting $y_2^{(2)}$, namely, the second word for keyphrase $y^{(2)}$, which will be selected from within the graph nodes and the ending token $ of a phrase.
The representation of the entire graph (or the document representation) $c$ is then obtained by averaging the aggregation of the last layer's node representations $f_L(H^L)$, where $L$ denotes the total number of GCN layers.
Based on the encoded document representation $c$, we propose a decoder, named DivPointer, to generate summative and diverse keyphrases in the next section.
2.2 Keyphrases Decoding
In this part, we introduce our approach of keyphrase extraction based on the graph representation. Most of the traditional approaches select keyphrases independently during the extraction process. However, ignoring the diversity among phrases may lead to multiple similar keyphrases, undermining the representativeness of the keyphrase set. Therefore, we propose a DivPointer Network with two mechanisms on semantic level and lexicon level respectively to improve the diversity among keyphrases during the decoding process.
2.2.1 DivPointer Decoder
The decoder is used to generate output keyphrases according to the representation of the input document. We adopt a pointer network with diversity-enabled attentions to generate keyphrases. A Pointer Network (Vinyals et al., 2015) is a neural architecture to learn the conditional probability of an output sequence with elements that are discrete tokens corresponding to positions in the original data space. The graph nodes corresponding to words in a document are regarded as the original data space of the pointer network in our case. Diversity attentions are leveraged when pointers select nodes from the graph as the output.
The pointer decoder receives the document representation $c$ as the initial state $h_0^{(i)}$, and predicts each word $y_t^{(i)}$ of a keyphrase $y^{(i)}$ sequentially based on $h_t^{(i)}$:

$$h_t^{(i)} = \mathrm{dec}_{word}\left(\mathbf{y}_{t-1}^{(i)}, h_{t-1}^{(i)}\right) \quad (5)$$

where $\mathbf{y}_{t-1}^{(i)}$ denotes the node representation of the word $y_{t-1}^{(i)}$ that keyphrase $y^{(i)}$ generated at the previous step, and $h_t^{(i)}$ is the hidden state of an RNN. The word $y_t^{(i)}$ is then selected with a pointer network according to a certain attention mechanism based on $h_t^{(i)}$.
A general attention (Bahdanau et al., 2014) score $e_{t,j}^{(i)}$ on each graph node $x_j \in N$ with respect to the hidden state $h_t^{(i)}$ can be computed by:

$$e_{t,j}^{(i)} = v^{\top} \tanh\left(W_h h_t^{(i)} + W_x \mathbf{x}_j + b\right) \quad (6)$$

where $\mathbf{x}_j$ is the node representation of $x_j$ taken from $H^L$, and $v$, $W_h$, $W_x$, and $b$ are parameters to be learned. We can then obtain the pointer distribution over the nodes by normalizing $\{e_{t,j}^{(i)}\}$:

$$p(y_t^{(i)} = x_j) = \frac{\exp\left(e_{t,j}^{(i)}\right)}{\sum_{k=1}^{N} \exp\left(e_{t,k}^{(i)}\right)} \quad (7)$$
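Equations (6)-(7) can be sketched in numpy as follows (toy dimensions for illustration; this is not the authors' implementation, and the diversity mechanisms are omitted):

```python
import numpy as np

def pointer_distribution(h_t, X, v, W_h, W_x, b):
    """Additive-attention pointer: score each graph node x_j against the
    decoder state h_t (Eq. 6), then softmax over nodes (Eq. 7)."""
    e = np.array([v @ np.tanh(W_h @ h_t + W_x @ x_j + b) for x_j in X])
    e = e - e.max()              # shift for numerical stability
    p = np.exp(e)
    return p / p.sum()           # pointer distribution over the graph nodes

# Hypothetical tiny example: a zero decoder state and two identical nodes.
h_t = np.zeros(2)
X = [np.array([1.0, 0.0]), np.array([1.0, 0.0])]
p = pointer_distribution(h_t, X, np.ones(2), np.eye(2), np.eye(2), np.zeros(2))
```

Because the two nodes are identical, they receive equal attention scores and the pointer distribution splits evenly between them.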
Other Related Work
• Diego Marcheggiani and Ivan Titov, Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling, 2017.
• Joost Bastings, Ivan Titov, Wilker Aziz, Diego Marcheggiani, Khalil Sima’an. Graph Convolutional Encoders for Syntax-aware Neural Machine Translation. 2017.
Outline
• Recent progress on graph representation learning
  • Unsupervised node representation
  • Semi-supervised node representation
  • Learning representation of entire graph
• Unsupervised word representation
• Semi-supervised document representation
• Sentence representation
• Keyphrase Extraction
• Extractive Summarization
• Future Directions
Extractive Summarization
• Extract informative sentences for summarization
Combining Graph Neural Networks and Reinforcement Learning for Extractive Summarization (Ongoing Work)
• Document => sentence graph
• Model the relations between sentences
• Summarization as a sequential decision process
  • Sequentially select a sentence from the sentence graph
Compatibility of systems of linear constraints over the set of natural numbers.
Criteria of compatibility of a system of linear Diophantine equations, strict
inequations, and nonstrict inequations are considered. Upper bounds for
components of a minimal set of solutions and algorithms of construction of
minimal generating sets of solutions for all types of systems are given.
These criteria and the corresponding algorithms for constructing a minimal
supporting set of solutions can be used in solving all the considered types
systems and systems of mixed types.
types
systems
linear
diophantineconstraints
system
compatibility
criteria
numbers
natural
equations
strict inequations
nonstrict
upper
bounds
components
algorithmssolutions
sets
minimal
construction
Keywords assigned by TextRank:
Keywords assigned by human annotators:
linear constraints; linear diophantine equations; natural numbers; nonstrict
inequations; strict inequations; upper bounds
strict inequations; set of natural numbers; strict inequations; upper bounds
linear constraints; linear diophantine equations; minimal generating sets; non −
Figure 2: Sample graph build for keyphrase extrac-
tion from an Inspec abstract
the text is tokenized, and annotated with part of
speech tags – a preprocessing step required to enable
the application of syntactic filters. To avoid exces-
sive growth of the graph size by adding all possible
combinations of sequences consisting of more than
one lexical unit (ngrams), we consider only single
words as candidates for addition to the graph, with
multi-word keywords being eventually reconstructed
in the post-processing phase.
Next, all lexical units that pass the syntactic filter
are added to the graph, and an edge is added between
those lexical units that co-occur within a window of
words. After the graph is constructed (undirected
unweighted graph), the score associated with each
vertex is set to an initial value of 1, and the ranking
algorithm described in section 2 is run on the graph
for several iterations until it converges – usually for
20-30 iterations, at a threshold of 0.0001.
Once a final score is obtained for each vertex in the
graph, vertices are sorted in reversed order of their
score, and the top vertices in the ranking are re-
tained for post-processing. While may be set to
any fixed value, usually ranging from 5 to 20 key-
words (e.g. (Turney, 1999) limits the number of key-
words extracted with his GenEx system to five), we
are using a more flexible approach, which decides
the number of keywords based on the size of the text.
For the data used in our experiments, which consists
of relatively short abstracts, is set to a third of the
number of vertices in the graph.
During post-processing, all lexical units selected
as potential keywords by the TextRank algorithm are
marked in the text, and sequences of adjacent key-
words are collapsed into a multi-word keyword. For
instance, in the text Matlab code for plotting ambi-
guity functions, if both Matlab and code are selected
as potential keywords by TextRank, since they are
adjacent, they are collapsed into one single keyword
Matlab code.
Figure 2 shows a sample graph built for an abstract from our test collection. While the size of the abstracts ranges from 50 to 350 words, with an average size of 120 words, we have deliberately selected a very small abstract for the purpose of illustration. For this example, the lexical units found to have higher “importance” by the TextRank algorithm are (with the TextRank score indicated in parentheses): numbers (1.46), inequations (1.45), linear (1.29), diophantine (1.28), upper (0.99), bounds (0.99), strict (0.77). Notice that this ranking is different from the one rendered by simple word frequencies. For the same text, a frequency approach provides the following top-ranked lexical units: systems (4), types (3), solutions (3), minimal (3), linear (2), inequations (2), algorithms (2). All other lexical units have a frequency of 1, and therefore cannot be ranked, but only listed.
3.2 Evaluation
The data set used in the experiments is a collection of 500 abstracts from the Inspec database, and the corresponding manually assigned keywords. This is the same test data set as used in the keyword extraction experiments reported in (Hulth, 2003). The Inspec abstracts are from journal papers in Computer Science and Information Technology. Each abstract comes with two sets of keywords assigned by professional indexers: controlled keywords, restricted to a given thesaurus, and uncontrolled keywords, freely assigned by the indexers. We follow the evaluation approach from (Hulth, 2003), and use the uncontrolled set of keywords.
In her experiments, Hulth uses a total of 2000 abstracts, divided into 1000 for training, 500 for development, and 500 for test². Since our approach is completely unsupervised, no training/development data is required, and we use only the test documents.
² Many thanks to Anette Hulth for allowing us to run our algorithm on the data set used in her keyword extraction experiments, and for making available the training/test/development data split.
Reinforcement Learning on Sentence Graph
• Sequentially pick a node on a graph
Outline
• Recent progress on graph representation learning
• Unsupervised node representation
• Semi-supervised node representation
• Learning representation of entire graph
• Unsupervised word representation
• Semi-supervised document representation
• Sentence representation
• Keyphrase Extraction
• Future Directions
Relational Reasoning with Graph Neural Networks
• A hot topic in the computer vision community
• Model the relations between different objects in an image, e.g.,
• Santoro et al. A simple neural network module for relational reasoning
• Chen et al. Iterative Visual Reasoning Beyond Convolutions. CVPR’ 2018.
• Zambaldi et al. Relational deep reinforcement learning, arXiv, 2018.
Relational Reasoning for Natural Language Understanding
• Combining graph neural networks and reinforcement learning for sequential reasoning in natural language understanding
• Model the relations between entities, sentences, or facts
• Applications
• Machine comprehension
• Question answering with knowledge graph
• …
Take Away
• Graphs provide a flexible way to encode various structures in natural language
• Recent progress on graph representation learning provides big opportunities for natural language understanding
• Word representation with the word co-occurrence graph
• Document representation with the heterogeneous text graph
• Sentence representation with the sentence graph
• Future: combining graph neural networks + reinforcement learning for sequential reasoning in NLP
References
• Tang et al. LINE: Large-scale Information Network Embedding. WWW’15.
• Tang et al. PTE: Predictive text embedding with heterogeneous text networks. KDD’15.
• Perozzi et al. DeepWalk: Online Learning of Social Representations. KDD’14.
• Grover et al. node2vec: Scalable Feature Learning for Networks. KDD’16.
• TextGraphs-2018: https://sites.google.com/view/textgraphs2018/home
• Kipf et al. Semi-Supervised Classification with Graph Convolutional Networks. ICLR’17.
• Veličković et al. Graph attention networks. ICLR’18.
• Gilmer et al. Neural Message Passing for Quantum Chemistry. ICML’17.
• Marcheggiani et al. Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling, 2017.
• Bastings et al. Graph Convolutional Encoders for Syntax-aware Neural Machine Translation. 2017.
Thanks!