
Page 1

Graph Representation Learning for Natural Language Understanding and Reasoning

Jian Tang

MILA, HEC Montreal

[email protected]

Page 2

Why Graphs?

• Graphs are general data structures that are flexible enough to encode complex relationships between different objects

• Graphs can be used for encoding various relationships between different semantic units (e.g., words, entities, sentences, and documents)

• Many natural language understanding tasks rely on different kinds of graph structures

Page 3

Example 1: Word Co-occurrence Graph

• Local-context word co-occurrence graph

• Words within a window are assumed to co-occur with each other (a small construction sketch follows the figure below)

• Key information used for learning word embeddings by several models (e.g., Skip-gram, GloVe)

Figure: Word co-occurrence graph, with word nodes such as "text", "word", "network", "document", "classification", "embedding", "node", "edge", and "degree".
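As a rough illustration of how such a graph can be built, here is a minimal Python sketch (not from the talk) that counts co-occurrences of words within a fixed-size sliding window; the tokenization and window size are arbitrary choices for the example:

    from collections import defaultdict

    def cooccurrence_graph(tokens, window=5):
        """Build a weighted, undirected word co-occurrence graph.

        Edge weight = number of times two words appear within `window`
        positions of each other. Returns {(w1, w2): count} with w1 < w2.
        """
        edges = defaultdict(int)
        for i, w in enumerate(tokens):
            for j in range(i + 1, min(i + window + 1, len(tokens))):
                u, v = sorted((w, tokens[j]))
                if u != v:
                    edges[(u, v)] += 1
        return dict(edges)

    tokens = "text classification with word embedding on a text graph".split()
    print(cooccurrence_graph(tokens, window=3))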

Page 4

Example 2: Word-Document Graph

• Encode document-level word occurrence information

• Important information used for learning word and document representations

• Key information used by models such as ParagraphVec and statistical topic models (e.g., latent Dirichlet allocation)

Figure: Word-document graph, connecting words (e.g., "text", "information", "network", "word", "classification") to documents (doc_1, doc_2, doc_3, doc_4, …).

Page 5

Example 3: Sentences as Graphs

• Encoding semantic and syntactic dependency relationships between words

• Useful for a variety of tasks

• E.g., sentence classification, semantic role labeling, and machine translation

Figure: semantic and syntactic dependency graph (Marcheggiani and Titov, 2017)

Page 6

Example 4: Knowledge Graph

• Encode the relationships between different entities

• E.g., Google’s Freebase and Microsoft’s Satori

• Useful for tasks such as question answering and search

Page 7

Other NLP tasks based on graphs

• Graph-based methods for word sense disambiguation

• Graph-based methods for text summarization

• Graph-based strategies for semantic relation identification

• Graph-based representations for ontology learning

• ….

Page 8

A very active topic in the NLP community

• TextGraphs workshops: a dedicated series of workshops on graph-based methods for NLP, co-located with major NLP conferences

Page 9

Most of this work is based on traditional graph-based methods

• PageRank

• Label propagation

Page 10

Progress on Graph Representation Learning

• LINE, DeepWalk, node2vec

• Graph Convolutional Networks

• Graph Attention Networks

• Neural Message Passing Networks

Page 11

Outline

• Recent progress on graph representation learning

• Unsupervised node representation

• Semi-supervised node representation

• Learning representation of entire graph

• Unsupervised text representation learning

• Semi-supervised text representation learning

• Sentence representation learning

• Keyphrase Extraction

• Extractive Summarization

• Future Directions

Page 12

Unsupervised Node Representations

• Learning node representations with graph structures

• Preserve the similarities between the nodes

• LINE (Tang et al. 2015), DeepWalk, node2vec

Page 13

LINE (Tang et al. 2015)

• Key idea: preserve the neighborhood structure of each node

• Empirical distribution of the neighborhood structure

• Model distribution of the neighborhood structure

(the corresponding formulas are sketched below)
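A sketch of the formulas these two lines refer to, following the second-order proximity of the LINE paper (Tang et al. 2015); the notation here may differ slightly from the slides:

    % Empirical neighborhood distribution of node v_i
    % (w_{ij}: weight of edge (i,j), d_i = \sum_k w_{ik}):
    \hat{p}_2(v_j \mid v_i) = \frac{w_{ij}}{d_i}
    % Model distribution, with context embedding u'_j and node embedding u_i:
    p_2(v_j \mid v_i) = \frac{\exp({u'_j}^{\top} u_i)}{\sum_{k=1}^{|V|} \exp({u'_k}^{\top} u_i)}
    % Objective: make the model distribution close to the empirical one
    O_2 = -\sum_{(i,j) \in E} w_{ij} \log p_2(v_j \mid v_i)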

Page 14

Semi-supervised Node Representations

• Learning node representations with supervision from specific tasks

• Neural message passing algorithms:

• Graph convolutional networks (GCN) (Kipf and Welling, 2016); the layer rule is sketched below

• Graph attention networks (Veličković et al. 2017)

• Neural message passing algorithms for quantum chemistry (Gilmer et al. 2017)
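For reference, the layer-wise propagation rule of the GCN mentioned above (Kipf and Welling, 2016) can be written as:

    % GCN layer: \tilde{A} = A + I adds self-loops, \tilde{D} is its degree
    % matrix, H^{(0)} = X are the input node features, W^{(l)} is learned.
    H^{(l+1)} = \sigma\!\left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)} \right)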

Page 15

Neural Message Passing Algorithms (Gilmer et al. 2017)

• Key idea: iteratively update the node representations

• Aggregate the messages from neighbors

• Update the nodes based on aggregated messages and the current node representations

• Both the message function and the node-update function are defined as neural networks

• Predict the targets with the node representations in the last layer and do backpropagation

Figure: a node v aggregates messages from its neighbors w and then updates its own representation (a minimal sketch follows).
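Below is a minimal Python sketch of a single message-passing round, illustrating the general aggregate-then-update scheme rather than any particular paper's formulation; `msg_fn` and `upd_fn` are placeholders for the learned neural networks:

    import numpy as np

    def message_passing_step(h, adj, msg_fn, upd_fn):
        """One round of neural message passing.

        h:    (num_nodes, dim) current node representations
        adj:  adjacency list, adj[v] = list of neighbor indices of v
        msg_fn(h_v, h_w): message from neighbor w to node v
        upd_fn(h_v, m_v): new representation of v given aggregated message m_v
        """
        new_h = np.zeros_like(h)
        for v in range(len(h)):
            # Aggregate (here: sum) the messages from all neighbors of v.
            m_v = sum(msg_fn(h[v], h[w]) for w in adj[v])
            # Update v based on the aggregated message and its current state.
            new_h[v] = upd_fn(h[v], m_v)
        return new_h

    # Toy usage on a 3-node path graph with simple stand-in functions.
    h = np.eye(3)
    adj = {0: [1], 1: [0, 2], 2: [1]}
    h = message_passing_step(h, adj, msg_fn=lambda hv, hw: hw,
                             upd_fn=lambda hv, mv: np.tanh(hv + mv))
    print(h)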

Page 16

Learning Representation of an Entire Graph

• Neural message passing algorithms

• Iteratively update the node representations

• Summarize the graph representation with the final node representations

Figure: a pooling (readout) function summarizes the final node representations into a single graph representation (see the readout equation below).
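In equation form, the graph-level representation is a permutation-invariant readout over the final node states; mean pooling is shown here as one common choice (an assumption, since the slide does not specify the pooling function):

    % Readout of the entire graph after L message-passing layers
    h_G = \mathrm{READOUT}\big(\{\, h_v^{(L)} : v \in G \,\}\big)
          \quad\text{e.g.}\quad
    h_G = \frac{1}{|V|} \sum_{v \in V} h_v^{(L)}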

Page 17

Outline

• Recent progress on graph representation learning

• Unsupervised node representation

• Semi-supervised node representation

• Learning representation of entire graph

• Unsupervised text representation learning

• Semi-supervised text representation learning

• Sentence representation learning

• Keyphrase Extraction

• Extractive Summarization

• Future Directions

Page 18

Unsupervised Text Representation (Tang et al. 2015a)

• Learning text representations with text graphs

• Word co-occurrence graph

• Word-document graph

Figure: From unstructured text (e.g., "Deep learning has been attracting increasing attention …", "The Skip-gram model is quite effective and efficient …"), build a word co-occurrence graph and a word-document graph, and learn text representations (e.g., word and document representations).

Page 19

Word Analogy (Tang et al. 2015a)

• Entire Wikipedia corpus => word co-occurrence network (~2M words, 1B edges)

• The size of the word co-occurrence network does not grow linearly with the data size

• Only the weights of the edges change

• Result: LINE > SkipGram

Algorithm    Semantic (%)    Syntactic (%)    Overall (%)
SkipGram     69.14           57.94            63.02
LINE         73.79           59.72            66.10

Page 20

Outline

• Recent progress on graph representation learning

• Unsupervised node representation

• Semi-supervised node representation

• Learning representation of entire graph

• Unsupervised text representation learning

• Semi-supervised text representation learning

• Sentence representation learning

• Keyphrase Extraction

• Extractive Summarization

• Future Directions

Page 21

Semi-supervised Text Representation (Tang et al. 2015b)

• Heterogeneous text graph

• Word-word, word-document, and word-label graphs

• Learning word embeddings by jointly training on the heterogeneous graphs (sketched in the equations below)

• Document embeddings as the average of their word embeddings

• Outperforms CNNs on long documents

Figure: From unstructured text, build a heterogeneous text graph consisting of a word co-occurrence graph, a word-document graph (words connected to doc_1 … doc_4), and a word-label graph (words connected to label_1, label_2, label_3; unlabeled documents have null labels), and learn text representations (e.g., word and document representations).

Jian Tang, Meng Qu, and Qiaozhu Mei. PTE: Predictive Text Embedding through Large-scale Heterogeneous Text Networks. KDD’15.
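As a sketch of what "jointly training on the heterogeneous graphs" amounts to, following the PTE paper (the equal weighting of the three terms is a simplification here):

    % PTE objective: sum of three LINE-style bipartite-graph objectives
    O_{\mathrm{pte}} = O_{ww} + O_{wd} + O_{wl},
    \qquad
    O = -\sum_{(i,j) \in E} w_{ij} \log p(v_i \mid v_j)
    % Document embedding: average of the embeddings of its words
    \mathbf{d} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{w}_i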

Page 22

Outline

• Recent progress on graph representation learning

• Unsupervised node representation

• Semi-supervised node representation

• Learning representation of entire graph

• Unsupervised word representation

• Semi-supervised document representation

• Sentence representation

• Keyphrase Extraction

• Extractive Summarization

• Future Directions

Page 23

Learning Sentence Representations via Graph Representations (Liu et al. 2018)

Page 24

Results on Sentence Classification

Table: Results on Sentence Classification

Page 25

Outline

• Recent progress on graph representation learning

• Unsupervised node representation

• Semi-supervised node representation

• Learning representation of entire graph

• Unsupervised word representation

• Semi-supervised document representation

• Sentence representation

• Keyphrase Extraction

• Extractive Summarization

• Future Directions

Page 26

Keyphrase Extraction

• Potentially useful for a variety of applications

• Information retrieval

• Text summarization

• Question answering

• …

Page 27

Graph-based Ranking Methods (Mihalcea and Tarau, 2004)

Rada Mihalcea and Paul Tarau. TextRank: Bringing Order into Texts. EMNLP 2004.

Example abstract (from Inspec): "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types of systems and systems of mixed types."

Figure (from Mihalcea and Tarau, 2004): Sample word graph built for keyphrase extraction from this Inspec abstract. Words that pass a part-of-speech filter become vertices (e.g., "systems", "linear", "diophantine", "constraints", "equations", "inequations", "numbers", "natural", "criteria", "compatibility", "upper", "bounds", "components", "algorithms", "solutions", "sets", "minimal", "construction", "types"); edges connect words co-occurring within a window; a PageRank-style ranking is run to convergence; and top-ranked words that are adjacent in the text are collapsed into multi-word keyphrases (a minimal scoring sketch follows below).

Keywords assigned by TextRank: linear constraints; linear diophantine equations; natural numbers; nonstrict inequations; strict inequations; upper bounds

Keywords assigned by human annotators: linear constraints; linear diophantine equations; minimal generating sets; non-strict inequations; set of natural numbers; strict inequations; upper bounds

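To make the ranking step concrete, here is a minimal Python sketch of the TextRank-style scoring described above; the damping factor and convergence threshold are the usual defaults, and the toy graph is made up for illustration:

    def textrank_scores(adj, d=0.85, max_iter=30, tol=1e-4):
        """Score graph vertices with the TextRank / PageRank recurrence.

        adj: dict mapping each word to the set of words it co-occurs with.
        Returns a dict of vertex scores; higher = more important.
        """
        scores = {v: 1.0 for v in adj}
        for _ in range(max_iter):
            prev = dict(scores)
            for v in adj:
                scores[v] = (1 - d) + d * sum(
                    prev[u] / len(adj[u]) for u in adj[v] if adj[u]
                )
            if max(abs(scores[v] - prev[v]) for v in adj) < tol:
                break
        return scores

    # Toy graph: edges between co-occurring words.
    adj = {
        "linear": {"constraints", "diophantine", "equations"},
        "constraints": {"linear"},
        "diophantine": {"linear", "equations"},
        "equations": {"linear", "diophantine"},
    }
    for word, s in sorted(textrank_scores(adj).items(), key=lambda kv: -kv[1]):
        print(f"{word}: {s:.3f}")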

Page 28

Graph Pointer Network for Diverse Keyphrase Extraction (Sun et al. 2018)

• Encoder: graph-based document encoder

• Word graph as input

• Decoder: pointer network over graphs

• Select words from the input word graph

• Promoting diversity



Figure 1 (from Sun et al. 2018): Illustration of the encoder-decoder architecture for keyphrase extraction (graph construction, graph convolutional networks, node representations, and a DivPointer decoder with coverage attention and context modification). In this example, the document is the word sequence d = ⟨x1, x2, x3, x4, x2⟩ and the first keyphrase y1 = ⟨x2, x3⟩ has already been generated; the model is predicting y_2^{(2)}, the second word of keyphrase y2, which is selected from the graph nodes and the phrase-ending token $.

The representation of the entire graph (the document representation) c is then obtained by averaging the aggregation of the last layer's node representations f_L(H^L), where L denotes the total number of GCN layers. Based on the encoded document representation c, we propose a decoder, named DivPointer, to generate summative and diverse keyphrases.

Keyphrase decoding. Most traditional approaches select keyphrases independently during the extraction process. However, ignoring the diversity among phrases may lead to multiple similar keyphrases, undermining the representativeness of the keyphrase set. We therefore propose a DivPointer network with two mechanisms, at the semantic level and the lexical level respectively, to improve the diversity among keyphrases during decoding.

DivPointer decoder. The decoder generates output keyphrases according to the representation of the input document, using a pointer network with diversity-enabled attention. A pointer network (Vinyals et al., 2015) is a neural architecture for learning the conditional probability of an output sequence whose elements are discrete tokens corresponding to positions in the original data space; here, the graph nodes corresponding to the words of the document are regarded as that data space, and diversity attentions are leveraged when the pointer selects nodes from the graph as output.

The pointer decoder receives the document representation c as the initial state h_0^{(i)} and predicts each word y_t^{(i)} of a keyphrase y^{(i)} sequentially based on h_t^{(i)}:

h_t^{(i)} = \mathrm{dec}_{\mathrm{word}}(\mathbf{y}_{t-1}^{(i)}, h_{t-1}^{(i)})    (5)

where \mathbf{y}_{t-1}^{(i)} denotes the node representation of the word y_{t-1}^{(i)} generated for keyphrase y^{(i)} at the previous step, and h_t is the hidden state of an RNN. The word y_t^{(i)} is then selected with a pointer network according to an attention mechanism based on h_t^{(i)}. A general attention score (Bahdanau et al., 2014) e_{t,j}^{(i)} on each graph node x_j with respect to the hidden state h_t^{(i)} is computed as

e_{t,j}^{(i)} = v^{\top} \tanh(W_h h_t^{(i)} + W_x \mathbf{x}_j + b)    (6)

where \mathbf{x}_j is the node representation of x_j taken from H^L, and v, W_h, W_x, and b are parameters to be learned. The pointer distribution over the nodes is then obtained by normalizing \{e_{t,j}^{(i)}\}:

p(y_t^{(i)} = x_j) = \frac{\exp(e_{t,j}^{(i)})}{\sum_{k=1}^{N} \exp(e_{t,k}^{(i)})}    (7)
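For concreteness, here is a minimal NumPy sketch of the pointer attention in equations (6)-(7); the parameter shapes and toy values are assumptions for illustration, not taken from the paper:

    import numpy as np

    def pointer_distribution(h_t, X, W_h, W_x, v, b):
        """Attention-based pointer distribution over graph nodes (eqs. 6-7).

        h_t: (d_h,) decoder hidden state
        X:   (N, d_x) node representations from the last GCN layer
        Returns: (N,) probabilities of pointing at each node.
        """
        # e_{t,j} = v^T tanh(W_h h_t + W_x x_j + b) for every node j
        scores = np.tanh(W_h @ h_t + X @ W_x.T + b) @ v
        exp = np.exp(scores - scores.max())      # softmax with stability shift
        return exp / exp.sum()

    # Toy shapes: 4 nodes, node dim 3, hidden dim 5, attention dim 6.
    rng = np.random.default_rng(0)
    X, h_t = rng.normal(size=(4, 3)), rng.normal(size=5)
    W_h, W_x = rng.normal(size=(6, 5)), rng.normal(size=(6, 3))
    v, b = rng.normal(size=6), rng.normal(size=6)
    print(pointer_distribution(h_t, X, W_h, W_x, v, b))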

Page 29

Other Related Work

• Diego Marcheggiani and Ivan Titov, Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling, 2017.

• Joost Bastings, Ivan Titov, Wilker Aziz, Diego Marcheggiani, Khalil Sima’an. Graph Convolutional Encoders for Syntax-aware Neural Machine Translation. 2017.

Page 30

Outline

• Recent progress on graph representation learning

• Unsupervised node representation

• Semi-supervised node representation

• Learning representation of entire graph

• Unsupervised word representation

• Semi-supervised document representation

• Sentence representation

• Keyphrase Extraction

• Extractive Summarization

• Future Directions

Page 31

Extractive Summarization

• Extract informative sentences for summarization

Page 32

Combining Graph Neural Networks and Reinforcement Learning for Extractive Summarization (Ongoing Work)

• Document => sentence graph

• Model the relations between sentences

• Summarization as a sequential decision process

• Sequentially select a sentence from the sentence graph


Page 33

Reinforcement Learning on Sentence Graph

• Sequentially pick a node on a graph
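As a rough illustration only (not the authors' model), the sequential selection could look like the following greedy Python loop, where `score_fn` stands in for a learned policy that would be trained with reinforcement learning:

    import numpy as np

    def select_sentences(node_reprs, score_fn, budget=3):
        """Greedy sketch of sequentially picking nodes from a sentence graph.

        node_reprs: (N, d) sentence-node representations (e.g., from a GNN)
        score_fn(summary_state, node_repr): scalar score of adding a node
        Returns the indices of the selected sentences, in selection order.
        """
        selected, state = [], np.zeros(node_reprs.shape[1])
        for _ in range(budget):
            candidates = [i for i in range(len(node_reprs)) if i not in selected]
            # Score every remaining node given the current summary state ...
            scores = [score_fn(state, node_reprs[i]) for i in candidates]
            best = candidates[int(np.argmax(scores))]
            # ... pick the best one and update the summary state.
            selected.append(best)
            state = state + node_reprs[best]
        return selected

    # Toy usage with a dot-product "policy" standing in for a learned scorer.
    reps = np.random.default_rng(1).normal(size=(5, 4))
    print(select_sentences(reps, score_fn=lambda s, r: float(r.sum() - s @ r)))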

Page 34

Outline

• Recent progress on graph representation learning• Unsupervised node representation

• Semi-supervised node representation

• Learning representation of entire graph

• Unsupervised word representation

• Semi-supervised document representation

• Sentence representation

• Keyphrase Extraction

• Future Directions

Page 35

Relational Reasoning with Graph Neural Networks

• A hot topic in the computer vision community

• Model the relations between different objects in an image, e.g.:

• Santoro et al. A simple neural network module for relational reasoning

• Chen et al. Iterative Visual Reasoning Beyond Convolutions. CVPR 2018.

• Zambaldi et al. Relational deep reinforcement learning, arXiv, 2018.

Page 36

Relational Reasoning for Natural Language Understanding

• Combining graph neural networks and reinforcement learning for sequential reasoning in natural language understanding

• Model the relations between entities, sentences, or facts

• Applications

• Machine comprehension

• Question answering with knowledge graph

• …

Page 37

Take Away

• Graphs provide a flexible way to encode various structures in natural language

• Recent progress on graph representation learning provides big opportunities for natural language understanding

• Word representation with a word co-occurrence graph

• Document representation with a heterogeneous text graph

• Sentence representation with a sentence graph

• Future: combining graph neural networks + reinforcement learning for sequential reasoning in NLP

Page 38

References

• Tang et al. LINE: Large-scale Information Network Embedding. WWW’15.

• Tang et al. PTE: Predictive text embedding with heterogeneous text networks. KDD’15.

• Perozzi et al. DeepWalk: Online Learning of Social Representations. KDD’14.

• Grover et al. node2vec: Scalable Feature Learning for Networks. KDD’16.

• TextGraphs-2018: https://sites.google.com/view/textgraphs2018/home

• Kipf et al. Semi-Supervised Classification with Graph Convolutional Networks. ICLR’17.

• Veličković et al. Graph attention networks. ICLR’18.

• Gilmer et al. Neural Message Passing for Quantum Chemistry. ICML’17.

• Marcheggiani et al. Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling. EMNLP’17.

• Bastings et al. Graph Convolutional Encoders for Syntax-aware Neural Machine Translation. EMNLP’17.

Page 39

Thanks!