Keyphrase Extraction: from Distributional Feature Engineering to
Distributed Semantic Composition
Rui Wang
This thesis is presented for the degree of
Doctor of Philosophy
of The University of Western Australia
School of Computer Science and Software Engineering
December 2017
Thesis Declaration
I, Rui Wang, declare that this thesis titled, 'Keyphrase Extraction: from Distributional
Feature Engineering to Distributed Semantic Composition', and the work presented in
it are my own. I confirm that:
• This thesis has been substantially accomplished during enrolment in the degree.
• This thesis does not contain material which has been accepted for the award of any
other degree or diploma in my name, in any university or other tertiary institution.
• No part of this work will, in the future, be used in a submission in my name, for
any other degree or diploma in any university or other tertiary institution without
the prior approval of The University of Western Australia and, where applicable,
any partner institution responsible for the joint award of this degree.
• This thesis does not contain any material previously published or written by
another person, except where due reference has been made in the text.
• The work(s) are not in any way a violation or infringement of any copyright,
trademark, patent, or other rights whatsoever of any person.
• This thesis contains published work and/or work prepared for publication, some
of which has been co-authored.
Signed:
Date:
Abstract
Keyphrases of a document provide high-level descriptions of the content, which sum-
marise the core topics, concepts, ideas or arguments of the document. These descriptive
phrases enable algorithms to retrieve relevant information more quickly and effectively,
and play an important role in many areas of document processing, such as document
indexing, classification, clustering and summarisation. However, most documents lack
keyphrases provided by the authors, and manually identifying keyphrases for large col-
lections of documents is infeasible.
At present, solutions for automatic keyphrase extraction mainly rely on manually se-
lected features, such as frequencies and relative occurrence positions. However, such
solutions are dataset-dependent and often need to be purposely modified to work
for documents of different lengths, discourse modes, and disciplines. This is because
the performance of such algorithms heavily relies on the selection of features,
which turns the development of automatic keyphrase extraction algorithms into a time-
consuming and labour-intensive exercise. Moreover, most of these solutions can only
extract tokens explicitly appearing in documents as keyphrases, thus they are incapable
of capturing the semantic meanings of keyphrases and documents.
This research aims to develop a keyphrase extraction approach that automatically learns
features of phrases and documents via deep neural networks, which not only eliminates
the effort of feature engineering, but also encodes the semantics of phrases and doc-
uments in the features. The learnt features become representations of phrases and
documents, which enhance the robustness and adaptability of the learning algorithm on
different datasets. Specifically, this thesis addresses three issues: 1) the lack of
understanding of the meanings of documents and phrases inhibits the performance of existing
algorithms; 2) the feature engineering process turns the development of keyphrase ex-
traction algorithms into a time-consuming, potentially biased, and empirical exercise;
and 3) using public semantic knowledge bases such as WordNet to obtain additional
semantics is practically difficult because they supply limited vocabularies and insufficient
domain-specific knowledge.
In this thesis, we first carry out a systematic study investigating the combination of
distributional features, which measure occurrence patterns, with distributed word
representations, which supply additional semantic knowledge. We then de-
velop a series of models to learn the representations of phrases and documents that
enable classifiers to ‘understand’ their semantic meanings. We demonstrate that
the models developed in this thesis provide effective tools for learning general represen-
tations and capturing the meanings of phrases and documents through similarity and
classification tests. The learnt representations enable classifiers to efficiently identify
keyphrases of documents, without using any manually selected features.
Acknowledgements
This dissertation would not have been possible without the support of many people, and
hereby I dedicate this milestone to them.
First and foremost, I would like to express my sincere gratitude to my supervisors: Dr
Wei Liu and Dr Chris McDonald for their immense knowledge, valuable guidance, schol-
arly inputs and consistent encouragement throughout my research work. I am lucky and
proud to have them as my supervisors and friends. Thanks to Wei for bringing me into
this exciting field – Natural Language Processing. She always helps me to conceptualise
the big picture and to develop a critical analysis of my work, and this dissertation is a
direct consequence of her excellent supervision. Thanks to Chris for providing many
insightful suggestions and helping me solve many programming problems. His enthu-
siasm, integral view on research, encouragement and skills will always be remembered.
I would like to thank the other academic members of the School of Computer Science
and Software Engineering (CSSE) for their advice including Assoc/Prof Rachel Cardell-
Oliver, Dr Du Huynh, Dr Tim French, and Dr Jianxin Li. I am also very grateful to the
CSSE IT support and administrative staff, including Mr Samuel Thomas, Ms Yvette
Harrap and Ms Rosie Kriskans. A special thank you to my colleagues Lyndon White,
Michael Stewart, Christopher Bartley for their feedback, cooperation and friendship.
I am also thankful to the overseas scholars Assist/Prof Hatem Haddad and Mr Chedi
Bechikh Ali for their cooperation and friendship. Thanks to Dr Peter D. Turney, Prof.
Marco Baroni, Dr Kazi Saidul Hasan for answering my questions and providing datasets.
I would also like to thank the University of Western Australia (UWA), and the Australian
Research Council (ARC). This research was supported by an Australian Government
Research Training Program (RTP) Scholarship, UWA Safety Top-Up Scholarship, and
Discovery Grant DP150102405 and Linkage Grant LP110100050.
Last but not least, my deep and sincere gratitude to my family and friends for their
continuous love, help and support, which made the completion of this thesis possible.
Thanks to my wife Kefei for her eternal love, support and understanding of my goals
and aspirations, and her patience and sacrifice will remain my inspiration throughout
my life. Thanks to my mother Ping for giving me the opportunities and experiences
that have made me who I am. Her selfless love and support have always been my
strength. A special thank you to my friends Zhuojun Zhou, Sheng Bi, Yuxuan Bi, John
Dorrington, and Clayton Dorrington for their constant inspiration and encouragement.
Finally, thanks to my cat Bella for not eating my source code.
Publications Arising from this Thesis
This thesis contains published work and/or work prepared for publication. The biblio-
graphical details of the work and where it appears in the thesis are outlined below.
1. Rui Wang, Wei Liu, and Chris McDonald. A Matrix-Vector Recurrent Unit
Model for Capturing Semantic Compositionalities in Phrase Embeddings. Ac-
cepted by International Conference on Information and Knowledge Management,
2017 (Chapter 6)
2. Rui Wang, Wei Liu, and Chris McDonald. Featureless Domain-Specific term ex-
traction with minimal labelled data. In Australasian Language Technology Asso-
ciation Workshop 2016, page 103. (Chapter 5)
3. Rui Wang, Wei Liu, and Chris McDonald. Using word embeddings to enhance
keyword identification for scientific publications. In Australasian Database Con-
ference, pages 257-268. Springer, 2015. (Chapter 4)
4. Chedi Bechikh Ali, Rui Wang, and Hatem Haddad. A two-level keyphrase ex-
traction approach. In International Conference on Intelligent Text Processing and
Computational Linguistics, pages 390-401. Springer, 2015.
5. Wei Liu, Bo Chuen Chung, Rui Wang, Jonathon Ng, and Nigel Morlet. A genetic
algorithm enabled ensemble for unsupervised medical term extraction from clinical
letters. Health information science and systems, 3(1):1, 2015.
6. Rui Wang, Wei Liu, and Chris McDonald. Corpus-independent generic keyphrase
extraction using word embedding vectors. In Deep Learning for Web Search and
Data Mining Workshop (DL-WSDM 2015), 2014. (Chapter 4)
7. Rui Wang, Wei Liu, and Chris McDonald. How preprocessing affects unsupervised
keyphrase extraction. In International Conference on Intelligent Text Processing
and Computational Linguistics, pages 163-176. Springer, 2014. – Best Paper
Award. (Chapter 3)
Declaration of Authorship
This thesis contains published work and/or work prepared for publication, some of which
has been co-authored. The extent of the candidate’s contribution towards the published
work is outlined below.
• Publications [1-3] and [6-7]: I am the first author of these papers, with 80%
contribution. I co-authored them with my two supervisors: Dr Wei Liu and Dr
Chris McDonald. I designed and implemented the algorithms, conducted the ex-
periments and wrote the papers. My supervisors reviewed the papers and provided
useful advice for improvements.
• Publication [4]: I am the second author of the paper, with 40% contribution.
I reimplemented all statistical ranking algorithms for keyphrase extraction used
in the paper, coordinated the conflicts between the English linguistic patterns
proposed by the first authors and the statistical ranking algorithms, conducted
experiments, and wrote Section 2.2, Section 3.2, Section 4.1, and Section 5.1 – 5.2.
• Publication [5]: I am the third author of the paper, with 15% contribution. I
normalised the textual data of the medical letters, reimplemented the TextRank
algorithm, conducted partial experiments, and wrote two subsections related to
my work.
Student Signature:
Date:
I, Wei Liu, certify that the student's stated contributions to each of
the works listed above are correct.
Coordinating supervisor Signature:
Date:
Contents
Thesis Declaration i
Abstract ii
Acknowledgements iii
Publications v
Declaration of Authorship vi
List of Figures xi
List of Tables xiii
1 Introduction 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contributions and Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Chapter 3: Conundrums in Unsupervised AKE . . . . . . . . . . . 4
1.2.2 Chapter 4: Using Word Embeddings as External Knowledge for AKE . . . 5
1.2.3 Chapter 5: Featureless Phrase Representation Learning with Minimal Labelled Data . . . 6
1.2.4 Chapter 6: A Matrix-Vector Recurrent Unit Network for Capturing Semantic Compositionality in Phrase Embeddings . . . 6
1.2.5 Chapter 7: A Deep Neural Network Architecture for AKE . . . . . 7
1.2.6 Chapter 8: Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Literature Review and Background 9
2.1 Automatic Keyphrase Extraction . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Overview of AKE systems . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.2 Text Pre-processing and Candidate Phrase Identification . . . . . . 11
2.1.2.1 Text Pre-processing . . . . . . . . . . . . . . . . . . . . . 11
2.1.2.2 Candidate Phrase Identification . . . . . . . . . . . . . . 12
2.1.3 Common Phrase Features . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.3.1 Self-reliant features . . . . . . . . . . . . . . . . . . . . . 13
2.1.3.2 Relational features . . . . . . . . . . . . . . . . . . . . . . 15
2.1.4 Supervised AKE . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.5 Unsupervised AKE . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.5.1 Capturing unusual frequencies – statistical-based approaches . . . 18
2.1.5.2 Capturing topics – clustering-based approaches . . . . . . 18
2.1.5.3 Capturing strong relations – graph-based approaches . . 19
2.1.6 Deep Learning for AKE . . . . . . . . . . . . . . . . . . . . . . . . 20
2.1.7 Evaluation Methodology . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1.8 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1.9 Similar Tasks to AKE . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.1.9.1 Automatic Domain-specific Term Extraction . . . . . . . 22
2.1.9.2 Other Similar Tasks . . . . . . . . . . . . . . . . . . . . . 27
2.2 Learning Representations of Words and Their Compositionality . . . . . . . . 28
2.2.1 Word Representations . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2.1.1 Atomic Symbols and One-hot Representations . . . . . . 29
2.2.1.2 Vector-space Word Representations . . . . . . . . . . . . 29
2.2.1.3 Inducing Distributional Representations using Count-based Models . . . 30
2.2.1.4 Learning Word Embeddings using Prediction-based Models 32
2.2.2 Deep Learning for Compositional Semantics . . . . . . . . . . . . . 36
2.2.2.1 Learning Meanings of Documents . . . . . . . . . . . . . 42
2.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3 Conundrums in Unsupervised AKE 44
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2 Reimplementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2.1 Phrase Identifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2.1.1 PTS Splitter . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2.1.2 N-gram Filter . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2.1.3 Noun Phrase Chunker . . . . . . . . . . . . . . . . . . . . 48
3.2.1.4 Prefixspan . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.2.1.5 C-value . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2.2 Ranking Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.2.2.1 TF-IDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.2.2.2 RAKE . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.2.2.3 TextRank . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.2.2.4 HITS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.3.1 Dataset Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.3.2 Ground-truth Keyphrase not Appearing in Texts . . . . . . . . . . 55
3.3.3 Dataset Cleaning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.4.1 Evaluation Methodology . . . . . . . . . . . . . . . . . . . . . . . . 56
3.4.2 Evaluation 1: Candidate Coverage on ground-truth Keyphrases . . 56
3.4.3 Evaluation 1 Results Discussion . . . . . . . . . . . . . . . . . . . . 57
3.4.4 Evaluation 2: System Performance . . . . . . . . . . . . . . . . . . 59
3.4.4.1 Direct Phrase Ranking and Phrase Score Summation . . 59
3.4.4.2 Ranking Algorithm Setup . . . . . . . . . . . . . . . . . . 59
3.4.5 Evaluation 2 Results Discussion . . . . . . . . . . . . . . . . . . . . 60
3.4.5.1 Candidate Identification Impact . . . . . . . . . . . . . . 60
3.4.5.2 Phrase Scoring Impact . . . . . . . . . . . . . . . . . . . 63
3.4.5.3 Frequency Impact . . . . . . . . . . . . . . . . . . . . . . 64
3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4 Using Word Embeddings as External Knowledge for AKE 66
4.1 Weighting Schemes for Graph-based AKE . . . . . . . . . . . . . . . . . . 67
4.2 Proposed Weighting Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2.1 Word Embeddings as Knowledge Base . . . . . . . . . . . . . . . . 69
4.2.2 Weighting Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3 Training Word Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.4.1 Training Word Embeddings . . . . . . . . . . . . . . . . . . . . . . 73
4.4.2 Phrase Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.4.3 Ranking Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.4.3.1 Degree Centrality . . . . . . . . . . . . . . . . . . . . . . 75
4.4.3.2 Betweenness and Closeness Centrality . . . . . . . . . . . 75
4.4.4 PageRank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.4.5 Assigning Weights to Graphs . . . . . . . . . . . . . . . . . . . . . 76
4.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.5.1 Evaluation Results and Discussion . . . . . . . . . . . . . . . . . . 77
4.5.1.1 Discussion on Direct Phrase Ranking System . . . . . . . 77
4.5.1.2 Discussion on Phrase Score Summation System . . . . . . 79
4.5.1.3 Mitigating the Frequency-Sensitivity Problem . . . . . . 80
4.5.1.4 Tunable hyper-parameters . . . . . . . . . . . . . . . . . 80
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5 Featureless Phrase Representation Learning with Minimal Labelled Data 82
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3 Proposed Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3.2 Term Representation Learning . . . . . . . . . . . . . . . . . . . . 86
5.3.2.1 Convolutional Model . . . . . . . . . . . . . . . . . . . . 87
5.3.2.2 LSTM Model . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.3.3 Training Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.3.4 Pre-training Word Embedding . . . . . . . . . . . . . . . . . . . . 90
5.3.5 Co-training Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.4.2 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.4.3 Experiment Settings . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.4.4 Evaluation Methodology . . . . . . . . . . . . . . . . . . . . . . . . 93
5.4.5 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.4.6 Result and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6 A Matrix-Vector Recurrent Unit Network for Capturing Semantic Compositionality in Phrase Embeddings 98
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.2 Proposed Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.2.1 Compositional Semantics . . . . . . . . . . . . . . . . . . . . . . . 101
6.2.2 Matrix-Vector Recurrent Unit Model . . . . . . . . . . . . . . . . . 103
6.2.3 Low-Rank Matrix Approximation . . . . . . . . . . . . . . . . . . . 105
6.3 Unsupervised Learning and Evaluations . . . . . . . . . . . . . . . . . . . 105
6.3.1 Evaluation Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.3.2 Evaluation Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.3.3 Baseline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.3.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.3.5 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.4 Supervised Learning and Evaluations . . . . . . . . . . . . . . . . . . . . . 111
6.4.1 Predicting Phrase Sentiment Distributions . . . . . . . . . . . . . . 111
6.4.2 Domain-Specific Term Identification . . . . . . . . . . . . . . . . . 113
6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7 A Deep Neural Network Architecture for AKE 117
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.2 Deep Learning for Document Modelling . . . . . . . . . . . . . . . . . . . 120
7.2.1 Convolutional Document Cube Model . . . . . . . . . . . . . . . . 122
7.3 Proposed Deep Learning Architecture for AKE . . . . . . . . . . . . . . . 125
7.4 Evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.4.1 Baseline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.4.2 State-of-the-art Algorithms on Datasets . . . . . . . . . . . . . . . 129
7.4.3 Evaluation Datasets and Methodology . . . . . . . . . . . . . . . . 130
7.4.4 Training and Evaluation Setup . . . . . . . . . . . . . . . . . . . . 130
7.4.4.1 Pre-training Word Embeddings . . . . . . . . . . . . . . . 130
7.4.4.2 Training AKE Models . . . . . . . . . . . . . . . . . . . . 131
7.4.5 Evaluation Results and Discussion . . . . . . . . . . . . . . . . . . 131
7.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
8 Conclusion 135
8.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
Bibliography 140
List of Figures
2.1 One-hot Vector Representations: words with 9 dimensions . . . . . . . . . 29
2.2 Co-occurrence matrices of two sentences. (A): word-word co-occurrence matrix, (B): word-document co-occurrence matrix. . . . 31
2.4 Neural Probabilistic Language Model . . . . . . . . . . . . . . . . . . . . . 34
2.5 Word Embedding Vectors: induced from the probabilistic neural language model training over the toy dataset. Each vector only has 2 dimensions. . . . 35
2.6 A Simple Recurrent Neural Network for Language Modelling. . . . . . . . 37
2.8 Recursive Neural Network: fitting the structure of English language . . . 39
2.9 Image Processing Convolution . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.10 Word Embedding Convolution Process: (A): Embedding vector-wise convolution, (B): Embedding feature-wise convolution . . . 41
3.1 Phrase Scoring Processing Pipelines . . . . . . . . . . . . . . . . . . . . . 46
3.3 Illustrating the sets' relationships. Algorithm Extracted Keyphrases is a subset of Identified Candidate Phrases. Ground-truth and Identified Candidates are subsets of All Possible Grams of the document. TP: the true positive set contains the extracted phrases that match ground-truth keyphrases; FP: the false positive set contains the extracted phrases that do not match ground-truth keyphrases; FN: the false negative set contains all ground-truth keyphrases that are not extracted as keyphrases; TN: the true negative set contains the candidate phrases that are neither ground-truth nor extracted as keyphrases. . . . 56
3.4 Error 1: candidate identified is too long, being the super-string of the assigned phrase; Error 2: candidate identified is too short, being the sub-string of the assigned phrase; Error 3: assigned phrase contains invalid characters such as punctuation marks; Error 4: assigned phrase contains stopwords; Error 5: Others. . . . 59
4.1 Evaluation Results F-score: Left graphs show the results using the direct phrase ranking scoring approach, where green and purple columns show embedding effects. Right graphs show the results using the phrase score summation approach, where green columns show the embedding effects. . . . 78
5.1 Co-training Network Architecture Overview: Solid lines indicate the training process; dashed lines indicate prediction and labelling processes. . . . 85
5.2 Convolutional Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.3 LSTM Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.4 Relationships in TP, TN, FP, and FN for Term Extraction. . . . . . . . . 93
5.7 Convolutional and LSTM Classifier Training on 200 Examples on POS Evaluation set . . . 97
6.1 Elman Recurrent Network for Modelling Compositional Semantics . . . . . 102
6.2 Matrix-Vector Recurrent Unit Model . . . . . . . . . . . . . . . . . . . . . 104
6.3 Predicting Adverb-Adjective Pairs Sentiment Distributions . . . . . . . . 113
7.1 General Network Architecture for Learning Document Representations . . 121
7.2 Convolutional Document Cube Model . . . . . . . . . . . . . . . . . . . . 124
7.3 Overview of Proposed Network Architecture . . . . . . . . . . . . . . . . . 126
7.4 Baseline Document Model – LSTM-GRU Architecture . . . . . . . . . . . 127
7.5 Baseline Document Model – CNN-GRU Architecture . . . . . . . . . . . . 127
7.6 Gated Recurrent Unit Architecture . . . . . . . . . . . . . . . . . . . . . . 129
List of Tables
2.1 Common Pre-processing and Candidate Phrase Selection Techniques . . . 11
2.2 Features Used in Extraction Algorithms . . . . . . . . . . . . . . . . . . . 17
3.1 Dataset Statistics After Cleaning . . . . . . . . . . . . . . . . . . . . . . . 54
3.2 The Coverages on Ground-truth Keyphrases . . . . . . . . . . . . . . . . . 57
4.1 Most Similar Words: Top ranked most similar words to the samplers, fetched using Cosine similarity, trained twice: 1) over a general Wikipedia dataset, and 2) over a computer science domain-specific dataset . . . 70
5.1 Dataset Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.2 Candidate Terms Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.3 Evaluation Dataset Statistics . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.4 Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.1 Phrase Similarity Test Results . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.2 Phrase Composition Test Results . . . . . . . . . . . . . . . . . . . . . . . 110
6.3 KL Divergence Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.4 Candidate Terms Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.5 Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.1 Common Deep Learning Document Models . . . . . . . . . . . . . . . . . 121
7.2 Evaluation Results on Three Datasets . . . . . . . . . . . . . . . . . . . . 132
Chapter 1
Introduction
1.1 Overview
As the amount of unstructured textual data continues to grow exponentially, the need
for automatically processing and extracting knowledge from it becomes increasingly crit-
ical. An essential step towards automatic text processing is to automatically index
unstructured text, i.e. tagging text with pre-identified vocabulary, taxonomy, thesaurus
or ontologies, which enables algorithms to retrieve relevant information more quickly
and effectively for further processing. An efficient approach for automatic text index-
ing is to summarise the core topics, concepts, ideas or arguments of a document into
a small group of phrases, namely keyphrases, which are representative phrases describ-
ing the document at a highly abstract level. However, most documents do not have
keyphrases provided by their authors, and manually identifying keyphrases for large
collections of documents is labour-intensive or even infeasible.
This thesis focuses on developing techniques for automatically extracting keyphrases
from documents, known as Automatic Keyphrase Extraction (AKE). The task of AKE
identifies a group of important phrases from the content of a document that are capable
of representing the author's main ideas, concepts, or arguments. Thus,
a keyphrase must be a semantically meaningful and syntactically acceptable expression
that may consist of one or more unigram words1.
There has been impressive progress in developing AKE systems, such as Keyphrase
Extraction Algorithm (KEA)2, a publicly available system using machine learning algo-
rithms to identify keyphrases [1]. However, these traditional AKE algorithms rely on
1Although there is no limitation on the maximum number of words in a phrase, a phrase should not be a full sentence.
2http://www.nzdl.org/Kea
computing the likelihood of being a keyphrase for each candidate phrase. Candidate
phrases are represented as vectors of predefined distributional features, which measure
the characteristics of phrases’ distributions in a document or corpus [1–5], such as their
occurrence frequencies. Such statistical measures often fail to extract keyphrases with
low frequencies. In addition, they are incapable of capturing the semantic meanings of
phrases. Later approaches use external knowledge, such as WordNet and Wikipedia, to
supply extra semantic knowledge [6–8]. However, these knowledge bases are designed for
general purposes, thus they often make little or no contribution for extracting keyphrases
from domain-specific corpora. In summary, existing AKE approaches suffer from three
major shortcomings:
1. Distributional features do not help algorithms to understand the semantics of
phrases and documents, because most features only encode statistical distribu-
tions of phrases in a document. Unlike human annotators who need to understand
the documents to identify keyphrases, existing AKE algorithms built upon distri-
butional features are often unable to, or make no attempt to, understand the
meanings of phrases and documents. Inevitably, this lack of cognitive ability
inhibits their performance.
2. Observing, selecting and evaluating different combinations of features, namely
feature-engineering, turn the development of AKE algorithms into a time-
consuming and labour-intensive exercise, where the primary goal is to design
representations of data to improve the performance of AKE algorithms. These
features are manually selected based on the observation of keyphrases’ linguis-
tic, statistical, or structural characteristics. However, characteristics of the same
phrase may vary across different types of datasets. For example, the relative occur-
rence position of a phrase is an important cue for being a keyphrase in a collection of
journal articles, because intuitively, most keyphrases occur in the abstract and in-
troduction sections of an article. Such a characteristic does not exist in collections of
news articles, since they are mostly short and do not have sections. Thus, an AKE
model developed upon specific features of one dataset is difficult to adapt to
another. For example, Hasan and Ng [9] demonstrate that algorithms developed
for specific datasets usually deliver poor performance on a different dataset. Take
ExpandRank [10] for instance: it assumes that phrases co-occur in a similar fash-
ion in its neighbourhood documents – the most similar documents in the corpus
ranked by cosine similarities. However, such topic-wise similar documents may not
exist in other datasets.
3. Using semantic knowledge supplied by public knowledge bases to analyse the se-
mantic relations of phrases has limitations. Firstly, most knowledge bases have
limited vocabularies, which will not sufficiently cover all fast-growing technical
terms. Secondly, they only provide general meanings of common words, which
make little or no contribution in the analysis of domain-specific corpora.
For example, the word neuron generally refers to the core components of the brain
and spinal cord of the central nervous system, thus it closely relates to neocortex
or sensorimotor. In machine learning literature, a neuron typically refers to a
node in neural networks that performs the linear or non-linear computation, thus
closely related words may be sigmoid or back-propagation.
This thesis aims to address all three problems. Recent advancements in deep neural
networks have enabled a paradigm shift in representing the semantics of lexical words,
from distributional to distributed vectors, known as word embeddings [11, 12]. These
low-dimensional, dense and real-valued vectors open up immense opportunities for applying
known algorithms for vector and matrix manipulations to carry out semantic calculations
that were previously impossible.
that were previously impossible. For example, Mikolov et al. [13] demonstrate that word
embedding vectors encode both linguistic and semantic knowledge, e.g. the result of word
vector calculations show that vec(king) − vec(man) + vec(woman) ≈ vec(queen), and
vec(apple) − vec(apples) ≈ vec(car) − vec(cars). Most importantly, learning
word embeddings requires very little or no feature engineering. The learning of word
embeddings is facilitated by purpose-designed neural network architectures capturing
different language models. For example, the word2vec model [14, 15] uses a feed-forward
network architecture to encode the probability distributions of words in word embeddings
by predicting the next word given its precedents; recurrent network architectures are
capable of learning language models via the sequential combinations of words [16, 17];
and the recursive neural networks [18–20] focus more on capturing syntactical patterns
and commonalities among words.
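The additive regularities described above can be illustrated with a toy example. The four-dimensional vectors below are invented purely for illustration – embeddings from a real trained model have hundreds of dimensions – but the nearest-neighbour search over a composed vector works the same way:

```python
import math

# Toy 4-dimensional "embeddings" -- illustrative values only,
# not vectors from a real trained model.
vec = {
    "king":  [0.9, 0.8, 0.1, 0.3],
    "man":   [0.1, 0.9, 0.2, 0.1],
    "woman": [0.1, 0.1, 0.9, 0.2],
    "queen": [0.9, 0.0, 0.8, 0.4],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# vec(king) - vec(man) + vec(woman)
analogy = [k - m + w for k, m, w in zip(vec["king"], vec["man"], vec["woman"])]

# If the space encodes the gender relation, the nearest word
# (by cosine similarity) to the composed vector is "queen".
nearest = max(vec, key=lambda w: cosine(analogy, vec[w]))
print(nearest)
```

With real embeddings the vocabulary is large and the composed vector only approximates its nearest neighbour, but the ranking step is identical.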
1.2 Contributions and Thesis Outline
In this research, a series of semantic knowledge-based techniques are proposed and de-
veloped to automatically extract keyphrases from documents. This thesis firstly com-
bines manually selected distributional features with word embedding vectors to inves-
tigate whether the additional semantic knowledge embedded in these vectors can im-
prove keyphrase extraction. These word embeddings are pre-trained over both large
collections of documents (e.g. Wikipedia) and domain-specific datasets to overcome the
problems of limited vocabulary and insufficient domain-specific knowledge supplied by
public knowledge bases. We demonstrate that using the knowledge encoded in word
embedding vectors is an effective way to improve the performance of traditional AKE
algorithms [21, 22]. However, word embeddings only encode the semantics of unigram words.
To enable learning algorithms to encode the meanings of multi-word phrases and
documents, we develop a series of deep learning models, including the Co-training model
presented in Chapter 5, the Matrix-Vector Recurrent Unit (MVRU) model presented in
Chapter 6, and the Convolutional Document Cube (CDC) model presented in Chap-
ter 7. We demonstrate that the deep learning models developed throughout this thesis
provide effective tools for learning general representations of phrases and documents,
which enable classifiers to efficiently identify the importance of phrases in documents,
without using any manually selected features.
In the rest of this chapter, we outline the structure of the thesis and summarise our
primary contributions. Chapter 2 consists of two parts: the first part is an overview of
existing AKE approaches, and the second part focuses on using deep neural networks for
modelling languages and learning representations of words and their compositionality.
The core content of this thesis is presented in Chapter 3 to Chapter 7. The outlines
with highlights on contributions are presented as follows.
1.2.1 Chapter 3: Conundrums in Unsupervised AKE
This chapter presents a systematic analysis and investigation of how different
pre-processing and post-processing techniques affect the performance of an
AKE system, with empirical explorations of common weaknesses of existing unsuper-
vised AKE systems.
Generally, algorithms for AKE can be grouped into two categories: unsupervised ranking
and supervised machine learning. In comparison to supervised AKE, unsupervised
approaches use fewer features, making them less dependent on the choice of features.
Unsupervised AKE systems consist of a processing pipeline, including pre-processing
that identifies candidate phrases, scoring and ranking candidates, and post-processing
that extracts top ranked candidates as keyphrases. Each process plays a critical role
in delivering the final extraction result. However, it is unclear how exactly different
techniques employed at each process will affect the performance of an AKE system. Most
studies only focus on the keyphrase extraction algorithms with little discussion on how
pre-processing and post-processing steps affect the system performance, which leaves
ambiguities in how the claimed improvements are achieved, bringing extra difficulties for
researchers seeking further improvements over existing approaches. Thus, the departure
point of this research is to gain a comprehensive understanding of the strengths and
weaknesses of unsupervised AKE algorithms.
In Chapter 3, we conduct a systematic evaluation of four popular unsupervised AKE
algorithms combined with five pre-processing and two post-processing techniques, where
we clearly identify the impacts of different pre-processing and post-processing techniques
on the same keyphrase extraction algorithm. More importantly, we also show that the
unsupervised AKE algorithms are overly sensitive to phrase frequencies, i.e. they mostly
fail to identify infrequent but important phrases. Chapter 3 is developed based on our
publication [23], which received the best paper award from the CICLing 2014 conference.
1.2.2 Chapter 4: Using Word Embeddings as External Knowledge for
AKE
This chapter presents a general-purpose weighting scheme for graph-based AKE
algorithms using word embedding vectors to incorporate background knowl-
edge to overcome 1) the problems of limited vocabulary and insufficient domain-specific
knowledge supplied by public knowledge bases, 2) the frequency sensitivity problem
identified in Chapter 3.
The family of graph-ranking algorithms is by far the most popular unsupervised ap-
proach for AKE, producing state-of-the-art performance on a number of different
datasets [3, 6, 24]. Our proposed approach combines classic graph-based AKE ap-
proaches with recent deep learning outcomes. In contrast to other studies that use
existing semantic knowledge bases, such as WordNet as the external knowledge source,
we combine manually selected distributional features with the semantic knowledge en-
coded in word embeddings to compute the semantic relatedness of words. We train word
embeddings over a Wikipedia snapshot and the AKE corpus to encode both general and
domain-specific meanings of words, demonstrating that the proposed weighting scheme
mitigates the frequency sensitivity problem, and generally improves the performance of
graph ranking algorithms for identifying keyphrases. Moreover, although assigning weights
to graphs improves the performance of graph-based AKE algorithms, the development
of weighting schemes is a rather ad-hoc process in which the choice of features is critical
to the overall performance of the algorithms. This problem turns the development of
graph-based AKE algorithms into a laborious feature engineering process. One of our
goals is to discover AKE approaches that are less dependent on the choice of features and
datasets. Hence, in the rest of the thesis, we focus on using deep learning approaches to
automatically learn useful features. The learnt features become representations of
phrases and documents that encode their meanings, helping learning algorithms
deliver better performance. Chapter 4 is based on our work presented at the WSDM
2014 workshop [21] and the ADC 2015 conference [22].
1.2.3 Chapter 5: Featureless Phrase Representation Learning with
Minimal Labelled Data
This chapter presents a co-training deep learning network architecture that min-
imises the burden on domain experts to produce training data. A convolutional
neural network (CNN) in conjunction with a Long Short-Term Memory (LSTM) network
in a weakly-supervised co-training fashion is designed and implemented.
To encode the semantic knowledge of multi-word phrases, a model must be able to un-
derstand the meaning of each constituent word and learn the rules of composing them,
known as the principle of semantic compositionality [25]. Chapter 5 investigates the use
of two deep neural networks: CNN [26, 27] and LSTM network [28] to automatically
learn phrase embeddings. The intuition is that the meaning of a phrase can be learnt
by analysing 1) different sub-gram compositions, and 2) sequential combination of each
constituent word. The CNN network analyses different regions of an input matrix con-
structed by stacking constituent word embedding vectors of a phrase, where the sizes
of regions reflect different N-grams. By scanning the embedding matrix with different
region sizes, the CNN network can learn the meaning of a phrase by capturing the most
representative sub-grams. The LSTM network, on the other hand, learns the compo-
sitionality by recursively composing an input embedding vector with the precedent or
previously composed value. The network captures the meaning of a phrase by control-
ling the information flow – how much information (or meaning) of an input word can be
added into the overall meaning, and how much information from the previous composi-
tion should be dismissed. We demonstrate that co-training using the two networks can
achieve the same level of classification accuracy with very little labelled data. The work
presented in this chapter is based on our publication at the ALTA 2016 conference [29].
1.2.4 Chapter 6: A Matrix-Vector Recurrent Unit Network for Cap-
turing Semantic Compositionality in Phrase Embeddings
This chapter presents the Matrix-Vector Recurrent Unit (MVRU) model – a novel
computational mechanism for learning language semantic compositionality
based on the recurrent neural network architecture. The MVRU model learns
the meanings of phrases without using any manually selected features.
Although both CNN and LSTM deliver impressive results on learning multi-word phrase
embeddings, they have limitations. CNN networks have the strength to encode regional
compositions of different locations in data matrices. In image processing, pixels close
to each other usually are a part of the same object, thus convoluting image matrices
captures the regional compositions of semantically related pixels. However, the location
invariance does not exist in word embedding vectors. The LSTM network, on the other
hand, uses shared weights to encode composition rules for every word in the vocabulary
of a corpus, which may overly generalise the compositionality of words.
In this chapter, a novel compositional model based on the recurrent neural network
architecture is developed, where we introduce a new computation mechanism for the
recurrent units – MVRU – to integrate different views of compositional semantics
originating from linguistic, cognitive, and neuroscience perspectives. The recurrent
architecture of the network allows for processing phrases of various lengths as well as
encoding the consecutive orders of words. Each recurrent unit consists of a composi-
tional function that computes the composed meaning of two input words, and a control
mechanism to govern the information flow at each composition. When the network is
trained in an unsupervised fashion, it is able to capture latent compositional semantics
of phrases, producing embedding vectors for any phrase regardless of whether the phrase
appears in the training set. When the network is trained in a supervised fashion by adding a
regression layer, it is able to perform classification tasks using the phrase embeddings
learnt from the lower layer of the network. The work presented in this chapter is ac-
cepted for publication by the International Conference on Information and Knowledge
Management (CIKM) 2017.
1.2.5 Chapter 7: A Deep Neural Network Architecture for AKE
This chapter first presents the Convolutional Document Cube (CDC) model for
quickly learning distributed representations of documents using a CNN net-
work. In the second part of the chapter, a semantic knowledge-based deep learn-
ing model for AKE is developed. The model encodes the semantics of both phrases
and documents using unsupervised training and supervised fine-tuning. We demon-
strate that the deep learning AKE model requires no manually selected features, and
hence it is much less dependent on the datasets.
Keyphrases are document-dependent – a phrase that is a keyphrase for one document
may not be important for others. To identify keyphrases, as in the natural process a
human follows, one needs to understand not only the meanings of words and phrases,
but also the overall meaning of the document. Thus, a deep learning model for AKE
is required to capture
the semantics for both phrases and documents. The proposed CDC model treats a
document as a cube with the height being the number of sentences, the width being the
number of words in a sentence, and the depth being the dimension of word embedding
vectors. We ‘slice’ the cube by each of its depth’s (word embeddings’) dimension to
generate 2-dimensional channels, which are inputs to a CNN network to analyse the
intrinsic structure of the document and the latent relations in words and sentences.
To extract keyphrases, we propose a deep learning AKE model, which is a binary clas-
sifier that jointly trains the MVRU and CDC models. We evaluate the deep learning
AKE model on three datasets, and demonstrate that the proposed approach delivers
new state-of-the-art performance on two of the datasets, without employing any dataset-
dependent heuristics.
1.2.6 Chapter 8: Conclusion
The conclusion of this thesis is presented in Chapter 8, where we discuss major findings
and shortcomings, with an outlook to future directions.
Chapter 2
Literature Review and
Background
To lay the foundation of this research, this chapter firstly provides a comprehensive re-
view on automatic keyphrase extraction (AKE). Then we present a general introduction
to representation learning and language modelling using deep neural networks, starting
with the theoretical background of deep learning, and then progressing into how it may
solve or mitigate the problems for downstream NLP tasks, such as keyphrase extraction.
2.1 Automatic Keyphrase Extraction
2.1.1 Overview of AKE systems
Before we dive into our own literature review, it is worth taking a brief look at a few
recent surveys published since 2013 [30–33].
At the Semantic Evaluation workshop in 2010 (SemEval-2010), AKE was the fifth shared
task. Kim et al. [30] summarise each of the submitted systems and the results of
the shared task. They first break down existing techniques into four
components from a system point of view: 1) candidate identification, 2) feature engi-
neering, 3) learning model development, and 4) keyphrase evaluation techniques. A brief
summary of the submitted systems is presented following the order of these four compo-
nents. By analysing the upper-bound performance of the task, two key findings are: 1) a
fixed threshold for the number of keyphrases per document restricts the performance of
a system, and 2) the existing evaluation methods underestimate the actual performance
of a system.
Hasan and Ng [31] summarise supervised AKE approaches in two steps: 1) task
reformulation, which recasts AKE as either a classification or a ranking task, and 2) fea-
ture design, which selects features from the training documents and/or external resources
such as Wikipedia. Hasan and Ng group unsupervised AKE approaches into four cate-
gories: 1) graph-based ranking, 2) topic-based clustering, 3) simultaneous learning, and
4) language modelling. An important contribution is that they point out common
shortcomings of the state-of-the-art approaches, which they classify into four
categories: 1) over-generation errors – a system identifies an important word and hence
all phrases containing the word are identified as keyphrases, 2) infrequency errors – in-
frequent phrases are unlikely to be identified, 3) redundancy errors – semantically equiv-
alent phrases are identified1, and 4) evaluation errors – current evaluation methodologies
are unable to identify whether an extracted phrase and the ground truth are semantically
equivalent.
Beliga et al. [32] focus more on graph-based keyword extraction approaches; the
graph-based approach is by far the most common unsupervised keyword/keyphrase ex-
traction approach. Beliga et al. analyse and compare various graph-based approaches,
and provide guidelines for future research on graph-based AKE. Siddiqi and Sharan [33]
present a survey in which they classify AKE approaches into four categories:
1) rule-based linguistic approaches, 2) statistical approaches, 3) machine learning ap-
proaches, and 4) domain-specific approaches. Siddiqi and Sharan also discuss some
common features used for extracting keyphrases.
Based on the previous survey work, we hereby present our own review of AKE approaches
from a system point of view, similar to Kim et al. [30]'s work. However, before a system
can correctly identify candidate phrases, it usually requires a data cleaning step, which
removes noisy data (e.g. mathematical equations) and normalises documents to obtain clean texts
for the steps that follow, namely text pre-processing. Hence, we view an AKE system
as a processing pipeline, including text pre-processing, candidate phrase
identification, feature selection, keyphrase extraction, and optionally post-processing.
The feature selection step identifies the essential features that characterise a phrase, and
computes the numerical feature values. Each phrase then acquires an index associated
with a value vector, where each dimension corresponds to a feature encoding a
lexical, syntactic, or semantic interpretation of the candidate.
The keyphrase extraction algorithm is the core of an AKE system, which is responsi-
ble for identifying keyphrases from the candidates. Generally, the algorithms can be
grouped into supervised machine learning and unsupervised ranking algorithms. Su-
pervised machine learning for AKE is more of a feature engineering process, meaning
1 We believe that this type of error overlaps with the type 1 error.
Table 2.1: Common Pre-processing and Candidate Phrase Selection Techniques

Study                                   Processing Sequence
Ohsawa et al. (1998) [35]               3, 4, 5
Mihalcea and Tarau (2004) [3]           1, 2, 6
Matsuo and Ishizuka (2004) [36]         4, 5
Bracewell et al. (2005) [37]            1, 2, 4, 6
Krapivin et al. (2008) [38]             1, 3, 5, 4
Liu et al. (2009) [8]                   1, 2, 6, 7
Liu et al. (2010) [39]                  1, 2, 6
Ortiz et al. (2010)* [40]               3, 4, 5
Bordea and Buitelaar (2010)* [41]       1, 2, 6, 7
El-Beltagy and Rafea (2010)* [42]       3, 7, 4, 5
Paukkeri and Honkela (2010)* [43]       7, 4
Zervanou (2010)* [44]                   1, 7, 2, 6
Rose et al. (2010) [45]                 3
Dostal and Jazek (2011) [46]            1, 3, 2, 6
Bellaachia and Al-Dhelaan (2012) [47]   1, 2, 6, 4
Sarkar (2013) [48]                      3, 4, 5, 7
You et al. (2013) [49]                  3, 7
Gollapalli and Caragea (2014) [50]      1, 2, 6
Total uses per technique: 1) 10, 2) 9, 3) 8, 4) 9, 5) 6, 6) 9, 7) 7

1: Tokenising. 2: POS tagging. 3: Splitting text by meaningless words or characters and/or removing them.
4: Stemming. 5: N-gram filtering with heuristic rules. 6: Phrase identification with POS patterns.
7: Other heuristic rules.
* Although some of the processing steps may not be explicitly mentioned in the paper, the system
participated in the SemEval-2010 workshop shared task 5 and has been reviewed by Kim et al. [51],
where more implementation details are provided.
that the majority of the effort is spent on selecting and inducing features, since different
combinations of features have a critical impact on the performance of the same algo-
rithm [34]. On the other hand, early unsupervised AKE approaches rely more on the
understanding of what keyphrases are. For example, graph-based unsupervised AKE
algorithms are developed based on the common belief that keyphrases have stronger
co-occurrence relations to others. Hence, keyphrases are the phrases that tie and hold
the text together. More recent unsupervised AKE algorithms use external knowledge
bases, such as WordNet, to supply more semantic features.
2.1.2 Text Pre-processing and Candidate Phrase Identification
Table 2.1 shows an overview of common text pre-processing and candidate phrase iden-
tification techniques with respect to their corresponding processing sequences, after sur-
veying 18 key papers published between 1998 and 2014.
2.1.2.1 Text Pre-processing
The objective of text pre-processing is to normalise and clean documents. The nor-
malisation process aims to convert texts into a unified format enabling more efficient
processing. Common normalisation techniques include converting characters to low-
ercase, tokenising, sentence splitting, lemmatising, and stemming. For example, the
distinction between ‘Network’, ‘network’, and ‘networks’ is ignored after the normalisa-
tion. The cleaning process identifies and optionally removes characters or words that
carry little or no semantic meaning, such as punctuation marks, stop-words, symbols,
or mathematical equations. Techniques used in text cleaning can be corpus-dependent,
often requiring applying heuristics to remove unnecessary or irrelevant information. For
example, Paukkeri and Honkela [43] remove authors’ names and addresses, tables, fig-
ures, citations, and bibliographic information from scientific articles.
2.1.2.2 Candidate Phrase Identification
Keyphrases consist of not only unigram words, but also multi-word phrases, and hence
the candidate phrase identification process recognises syntactically acceptable phrases
from documents, which are treated as the candidates of keyphrases. Two common
approaches are Phrase as Text Segment [35, 38, 40, 48] and Phrase Identification with
Part-of-Speech Patterns [8, 37, 39, 41, 47, 50].
Phrase as Text Segment (PTS) approaches identify phrases by splitting texts based
on heuristics. The simplest approach is to use stop-words and punctuation marks as delimiters.
It assumes that syntactically and semantically valid phrases rarely contain meaningless
words and characters. For example, in this sample sentence “information interaction is
the process that people use in interacting with the content of an information system”,
phrases identified are information interaction, process, people, interacting, content, and
information system. A more sophisticated approach is to apply additional heuristics.
For example, N-gram filtration [35, 40] selects candidate phrases based on statistical
information and the lengths of phrases. It compares the frequencies and the lengths of
all possible N-grams of a text segment, where only the phrase with the highest frequency
is kept, and if two or more phrases have the same frequency, then the longer one is kept.
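The simplest PTS heuristic can be sketched as follows. The stop-word list below is a small hand-picked set sufficient for the sample sentence; a real system would use a much fuller list, such as the English stop-word lists shipped with common NLP toolkits:

```python
import re

# Minimal stop-word list for the example only; a real system would
# use a fuller list (typically a few hundred English words).
STOP_WORDS = {"is", "the", "that", "use", "in", "with", "of", "an", "a"}

def pts_candidates(text):
    """Split text into candidate phrases, using stop-words and
    punctuation as delimiters (the simplest PTS heuristic)."""
    tokens = re.findall(r"[a-z]+", text.lower())  # punctuation is dropped here
    candidates, current = [], []
    for tok in tokens:
        if tok in STOP_WORDS:
            if current:                 # a stop-word closes the current phrase
                candidates.append(" ".join(current))
            current = []
        else:
            current.append(tok)
    if current:
        candidates.append(" ".join(current))
    return candidates

sentence = ("information interaction is the process that people use in "
            "interacting with the content of an information system")
print(pts_candidates(sentence))
```

Run on the sample sentence above, this yields exactly the six candidates listed in the text, from information interaction through to information system.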
PTS approaches have advantages when identifying candidates from short and well-
written texts, such as abstracts of journal articles. They do not involve any deeper
linguistic analysis, thus no Part-of-Speech (POS) tagging is required. In addition, PTS
approaches are language independent, and can be applied to any language with a fi-
nite set of stop-words and punctuation marks. However, when they are used to process long
documents, such as full-length journal articles, too many candidates will be produced
and hence more irrelevant candidates are introduced, because longer documents have
more word combinations.
Phrase Identification with POS Patterns relies on linguistic analysis, which iden-
tifies candidate phrases based on pre-identified POS patterns. In comparison to PTS,
it is not a language independent technique, so choosing a reliable POS tagger is im-
portant. The most common POS pattern is that a phrase should start with zero or
more adjectives followed by one or more nouns [3, 8, 44, 52], the regular expression is
<JJ>*<NN.*>+. However, the pattern assumes that keyphrases contain only adjectives and
nouns, which can be a drawback. For example, machine intelligence can be correctly
identified, whereas intelligence of machines will not be considered. More sophisticated
phrase identification approaches rely on deeper linguistic analysis to work out more pat-
terns. For example, Chedi et al. [53] define 12 syntactic patterns for English including
4 bigram patterns, 6 trigram patterns, and 2 4-gram patterns.
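The <JJ>*<NN.*>+ pattern can be sketched as a greedy scan over POS-tagged tokens. The tags below are supplied by hand for illustration; in practice they would come from a POS tagger:

```python
# Tokens pre-tagged by hand for illustration; a real system would
# obtain the tags from a POS tagger.
tagged = [("machine", "NN"), ("intelligence", "NN"), ("is", "VBZ"),
          ("a", "DT"), ("broad", "JJ"), ("research", "NN"),
          ("area", "NN"), (".", ".")]

def pos_candidates(tagged_tokens):
    """Extract candidates matching <JJ>*<NN.*>+ : zero or more
    adjectives followed by one or more nouns."""
    candidates, i = [], 0
    while i < len(tagged_tokens):
        j = i
        # consume a (possibly empty) run of adjectives
        while j < len(tagged_tokens) and tagged_tokens[j][1].startswith("JJ"):
            j += 1
        k = j
        # consume the run of nouns (NN, NNS, NNP, ...)
        while k < len(tagged_tokens) and tagged_tokens[k][1].startswith("NN"):
            k += 1
        if k > j:  # at least one noun: a valid match
            candidates.append(" ".join(w for w, _ in tagged_tokens[i:k]))
            i = k
        else:
            i += 1
    return candidates

print(pos_candidates(tagged))
```

On the toy sentence this extracts machine intelligence and broad research area, while skipping the verb and determiner.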
Other Phrase Identification Approaches are introduced in different NLP tasks,
which are not commonly employed in AKE systems. For example, Liu et al. [54] have
investigated using Prefixspan [55] – a sequential pattern mining algorithm – to automati-
cally identify single and multi-word medical terms without any domain-specific knowl-
edge. Frantzi et al. [56] introduce C-value and NC-value algorithms that use both POS
identification techniques and statistical information to identify domain specific terms.
Wong et al. [57] present a framework using statistics from web search engines for mea-
suring unithood – the likelihood of a sequence of words being a valid phrase.
2.1.3 Common Phrase Features
After identifying candidate phrases, the next process is to select phrases’ features based
on the observations of datasets. Features represent the distinct characteristics of a phrase,
which differentiate it from others. Broadly, a phrase has two kinds of features: self-
reliant and relational features. Self-reliant features relate to the information about a
phrase itself, such as its frequency, the linguistic structure, or its occurrence positions
in a document. Relational features capture the relational information of a phrase to
others, such as co-occurrence and semantic relations. Table 2.2 shows an overview of
common features employed by 30 studies including both supervised and unsupervised
AKE approaches.
2.1.3.1 Self-reliant features
1. Frequencies are the main source of identifying the importance of phrases [1, 4,
10, 35]. However, in many cases, raw frequency statistics do not precisely reflect
the importance of phrases. A phrase with distinct high frequency may not be
a good discriminator for being a keyphrase if it distributes evenly in a corpus.
Scott [58] suggests that a keyphrase should have unusual frequency in a given
document compared with reference corpora. Therefore, raw frequency statistics
are also commonly used as inputs to other statistical models for more sophisticated
features, such as Term Frequency - Inverse Document Frequency (TF-IDF) [2].
2. Linguistic features2 are obtained from linguistic analysis, including part-of-
speech (POS) tagging, sentence parsing, syntactic structure and dependency anal-
ysis, morphological suffixes or prefixes analysis, and lexical analysis. They are
mostly used in supervised machine learning approaches [59–61]. However, given
the fact that the average length of phrases is very short, phrases have relatively
simple syntactic structures, which may offer less benefit to the overall performance
of a system. Kim and Kan [34] reveal that applying linguistic features, such as POS
and morphological suffix sequences, has no effect, or only small unconfirmed effects,
on the performance of supervised machine learning approaches for AKE.
3. Structural features encode how a phrase occurs in the document, such as the
relative positions of the first or last occurrence, whether the phrase appears in
the abstract of a document. A common observation is that in well-structured
documents, such as scientific publications, keyphrases are more likely to appear
in abstracts and introduction sections. Structural features can also be specific to
the format of texts. For example, Yih et al. [59] analyse whether a phrase occurs
in hyper-links and URLs to identify keyphrases from web pages. However, such
specific information may be unavailable in other corpora.
4. The length of a phrase is the number of words in a phrase, which is considered
to be a useful feature by a number of studies [62–65]. Although this may seem
counter-intuitive, since keyphrases rarely have more than five tokens, studies claim
that applying this feature generally improves the performance of a supervised AKE
system [34, 66].
5. Phraseness refers to the likelihood of a word sequence constituting a meaningful
phrase. In particular, the phraseness feature is useful for identifying a keyphrase
from its longer or shorter cousins, such as convolutional neural network and neural
network. Two common techniques to compute feature values are Pointwise Mutual
Information (PMI) [67] and Dice coefficient [68].
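The two phraseness measures can be sketched directly from their definitions. The corpus counts below are invented for illustration, not drawn from a real corpus:

```python
import math

# Illustrative corpus counts -- invented for the example.
N = 1_000_000            # total bigram positions in the corpus
f_neural = 2_000         # frequency of "neural"
f_network = 3_000        # frequency of "network"
f_bigram = 1_500         # frequency of the bigram "neural network"

# Pointwise Mutual Information: log of p(w1, w2) / (p(w1) * p(w2)).
# A high PMI means the words co-occur far more often than chance
# predicts, i.e. the sequence is likely a meaningful phrase.
pmi = math.log((f_bigram / N) / ((f_neural / N) * (f_network / N)))

# Dice coefficient: 2 f(w1, w2) / (f(w1) + f(w2)), bounded in [0, 1].
dice = 2 * f_bigram / (f_neural + f_network)

print(round(pmi, 3), round(dice, 3))
```

With these counts the bigram scores high on both measures, as expected for a strongly collocated phrase; a chance pairing of two frequent words would score near zero.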
2 Not to be confused with the POS pattern-based phrase identification for identifying candidate phrases.
2.1.3.2 Relational features
1. Phrase co-occurrence statistics provide the information of how phrases co-
occur with others, and also implicitly supply phrase frequency statistics – a
phrase that co-occurs more frequently with others indicates that the phrase itself has
higher importance.
In supervised AKE approaches, it is less common to employ co-occurrence statistics
since they assume no correlations between candidate phrases. However, Kim and
Kan [34] find that applying co-occurrence features is empirically useful for super-
vised AKE. In contrast, co-occurrence statistics are the main source of features for
the majority of unsupervised AKE approaches [35, 36], especially in graph-based
ranking approaches [3, 10]. The intuition is that a phrase may be a keyphrase if
it 1) frequently co-occurs with different candidates, or 2) only co-occurs with a
particular set of highly frequent candidates.
2. Phrases and documents relations examine how important a candidate phrase
is for one particular document with respect to the distribution of the phrase in
a corpus. TF-IDF [2] is the most common algorithm, computed as the product
of the phrase (term) frequency (TF) and the inverse document frequency (IDF).
TF-IDF assigns lower scores to the phrases that are evenly distributed across the
corpus, and higher scores to the phrases that occur frequently in a few particular
documents. Other studies [69–71] also separate TF and IDF as two standalone
features.
3. Semantic features present semantic relatedness between phrases, which can be
any linguistic relation between two phrases, such as similarity, synonym, hyponym,
hypernym, and meronym. Semantic relatedness can be directly obtained from
off-the-shelf semantic knowledge bases. For example, Wang et al. [6] measure
synonyms of phrases based on the ontological information provided by WordNet.
Ercan and Cicekli [61] also use the ontology from WordNet to identify synonym,
hyponym, hypernym, and meronym relations between two word senses. Semantic
relatedness also can be induced from large corpus (e.g. Wikipedia) using statistical
or machine learning techniques. Liu et al. [8] consider each Wikipedia article as
a concept, and the semantic meaning of a candidate phrase can be represented
as a weighted vector of its TF-IDF score within corresponding Wikipedia arti-
cles. The semantic similarity between two phrases can be computed using cosine
similarity, Euclidean distance, or the PMI measure. Other studies, including [72–
74], compute the semantic relatedness as the ratio of the number of times that
a candidate appears as a link in Wikipedia, and the total number of Wikipedia
articles where the candidate appears. Using dimension reduction techniques such
as Singular Value Decomposition (SVD) to induce distributional representations³
of candidates, then applying the cosine similarity measure to obtain similarity
scores between two phrases, has also been investigated [75]. More recently, Wang
et al. [22] use pre-trained word embedding vectors over Wikipedia to compute
semantic relatedness using the cosine similarity measure.
4. Phrases and topic relation features indicate how a candidate is related to a
specific topic in a document. A common belief is that keyphrases should cover
the main topics or arguments of an article. The most common approach is to
use Latent Dirichlet Allocation (LDA) [76] to obtain the distributions of words
over latent topics [39, 47, 62, 77]. Another popular approach applies clustering
algorithms, grouping phrases into clusters, where each cluster is treated as a topic,
so phrases from the same cluster are considered to be more similar in the semantic
context [8, 78].
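Several of the feature groups above reduce to simple corpus statistics. As a concrete illustration of group 2, a minimal TF-IDF scorer over a toy corpus of pre-extracted candidate phrases is sketched below; the documents are invented for illustration and are not drawn from any of the cited datasets:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Score each candidate phrase in each document by TF-IDF."""
    # Document frequency: number of documents containing each phrase.
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    n_docs = len(corpus)
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        # TF normalised by document length, IDF as log(N / df).
        scores.append({p: (tf[p] / len(doc)) * math.log(n_docs / df[p])
                       for p in tf})
    return scores

docs = [["neural network", "training", "neural network"],
        ["training", "corpus"],
        ["corpus", "neural network"]]
scores = tf_idf(docs)
# "neural network" is concentrated in the first document, so it outscores
# the evenly spread "training" there.
```

The same per-phrase TF and IDF values can also be kept as two standalone features, as in the studies cited above.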
2.1.4 Supervised AKE
Based on the learning algorithms employed, supervised machine learning approaches
treat AKE as either a classification or learning to rank problem. When treating AKE
as a binary classification problem, the goal is to train a classifier that decides whether
a candidate can be a keyphrase. Algorithms used include the C4.5 decision tree [4, 61],
Naïve Bayes [1, 72], Support Vector Machines (SVM) [65, 82], neural networks [71, 81],
and Conditional Random Fields (CRF) [83]. Another line of studies treats AKE as a
ranking problem. Intuitively, no phrase appears in a document just by chance, and each
phrase carries a certain amount of information that represents the theme of an article to
some degree. Thus, the goal is to train a pair-wise ranker that ranks candidates based
on their degree of representativeness – the most representative ones will be extracted
as keyphrases. For example, Jiang et al. [64] employ Linear Ranking SVM to train the
ranker. Jean-Louis et al. [85] present a Passive-Aggressive Perceptron model trained
using the Combined Regression and Ranking method.
Classic supervised machine learning approaches to AKE are feature-driven – the perfor-
mance of algorithms is heavily dependent on the choice of features. Thus, the majority of
the effort is spent on selecting and inducing features, also known as feature engineering.
Feature engineering is difficult and expensive, since both quality and quantity of the fea-
tures have significant influence on the overall performance. Kim and Kan [34] present a
comprehensive evaluation analysing how different features employed may affect the per-
formance of the same algorithms, including Naïve Bayes and Maximum Entropy. The
³ Please refer to Section 2.2.1 for details.
Table 2.2: Features Used in Extraction Algorithms
Algorithm | Method | Features (1–10)

Unsupervised Approaches
Sparck Jones (1972) [2] | statistical | X X
Ohsawa et al. (1998) [35] | cluster | X X
Matsuo and Ishizuka (2004) [36] | statistical | X X
Mihalcea and Tarau (2004) [3] | graph | X
Bracewell et al. (2005) [37] | graph | X X
Wang et al. (2007) [6] | graph | X X X
Wan and Xiao (2008) [10] | graph | X X X
Grineva et al. (2009) [7] | graph | X X X
Liu et al. (2009) [8] | cluster | X X X
Rose et al. (2010) [45] | statistical | X X
Liu et al. (2010) [39] | graph | X X X
Zhao et al. (2011) [79] | graph | X X X X
Bellaachia and Al-Dhelaan (2012) [47] | graph | X X X
Boudin and Norin (2013) [80] | graph | X X X X
Wang et al. (2015) [22] | graph | X X X X

Supervised Approaches
Witten et al. (1999) [1] | Naïve Bayes | X X X X
Turney (2000) [4] | C4.5 | X X X X
Hulth (2003) [5] | Bagging | X X X X
Yih et al. (2006) [59] | Logistic Regression | X X X X X
Jo et al. (2006) [81] | Neural Network | X X X
Zhang et al. (2006) [82] | SVM | X X X X
Ercan and Cicekli (2007) [61] | C4.5 | X X X
Zhang et al. (2008) [83] | CRF | X X X X
Jiang et al. (2009) [64] | Ranking SVM | X X X X
Medelyan et al. (2009) [72] | Naïve Bayes | X X X X X X X
Sarkar et al. (2010) [71] | Neural Network | X X X X X
Xu et al. (2010) [65] | SVM | X X X X X
Eichler and Neumann (2010) [63] | Ranking SVM | X X X X X
Ding et al. (2011) [62] | BIP | X X X X X
Jo and Lee (2015) [84] | Deep Learning | – – – – – – – – – –

1: Frequency; 2: Linguistic feature; 3: Structural feature; 4: Length of candidate; 5: Co-occurrence; 6: Candidate and document relation feature; 7: Semantic feature; 8: Candidate and topic feature; 9: Phraseness; 10: Using external knowledge base
study discovers that frequency, structural features, lengths of phrases, and co-occurrence
statistical features generally improve the performance of the algorithms.
2.1.5 Unsupervised AKE
Unsupervised AKE approaches are built based on observations or understandings of what
keyphrases are. We classify unsupervised AKE approaches into three groups based on
different views of keyphrases:
1. Keyphrases are the phrases having unusual frequencies.
2. Keyphrases are representative phrases that should cover the main topics or argu-
ments of an article.
3. Keyphrases are the phrases having stronger relations with other phrases, which
tie and hold the entire article together.
2.1.5.1 Capturing unusual frequencies – statistical-based approaches
Statistical-based approaches use deterministic mathematical functions to identify phrases
having unusual frequencies. Different algorithms interpret the notion of unusual in dif-
ferent ways. For example, TF-IDF [2] identifies a phrase having high frequency in a
few particular documents rather than being evenly distributed over the corpus. Sim-
ilarly, Likely [86] selects phrases by taking the ratio of the rank value of a phrase in
the documents to its rank value in the reference corpus, where rank is computed as the
relative N-gram frequency of the phrase. RAKE [45] identifies unusually frequent phrases
by examining how often each phrase co-occurs with others, scoring a phrase as the ratio
of its co-occurrence frequency with other phrases to its own frequency. Wartena et al. [87] compare
phrases’ co-occurrence distributions in a document with their frequency distributions in
the entire corpus.
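A minimal sketch of the RAKE-style degree-to-frequency word score described above, assuming candidate phrases have already been extracted (the phrase data below is invented for illustration, and phrase scores would be the sum of their word scores):

```python
from collections import Counter

def rake_word_scores(phrases):
    """RAKE-style word scores: degree(w) / frequency(w).

    degree(w) counts how often w co-occurs with words inside the same
    candidate phrase (including itself), so words that appear in longer
    phrases are favoured over words that occur alone.
    """
    freq = Counter()
    degree = Counter()
    for phrase in phrases:
        for w in phrase:
            freq[w] += 1
            degree[w] += len(phrase)  # w co-occurs with every word in the phrase
    return {w: degree[w] / freq[w] for w in freq}

phrases = [["minimal", "generating", "sets"], ["systems"], ["minimal", "set"]]
scores = rake_word_scores(phrases)
# "minimal": degree 5 (phrases of length 3 and 2), frequency 2 -> 2.5
# "systems": degree 1, frequency 1 -> 1.0
```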
Statistical-based approaches usually do not require any additional resource apart from
the raw data statistics of phrases from the corpus and documents. This allows statistical-
based approaches to be easily reimplemented. However, they can be frequency-sensitive,
favouring high-frequency phrases and preventing the algorithms from identifying
keyphrases with low frequencies.
2.1.5.2 Capturing topics – clustering-based approaches
Clustering-based approaches apply clustering algorithms to group candidate phrases
into topic clusters, then the most representative ones from each cluster are selected as
keyphrases. Ohsawa et al. [35] cluster phrases based on a co-occurrence graph, where
phrases are vertices and edges are co-occurrence relations weighted by co-occurrence
frequencies. The weak (low scored) edges are considered to be the appropriate ones for
segmenting the document into clusters that are regarded as groups of supporting phrases
on which the author’s points are based. Finally, keyphrases are identified as the ones
that hold clusters together. Liu et al. [8] apply Hierarchical, Spectral, and Affinity Prop-
agation clustering algorithms that group semantically related phrases using Wikipedia
and co-occurrence frequencies. Keyphrases are the phrases close to the centroid of each
cluster. Bracewell et al. [37] cluster phrases by first assigning each unigram word to
its own cluster; multi-word phrases are then assigned to the clusters containing their
component unigrams. If no such cluster is found, a candidate forms its own cluster.
Finally, the centroids of the top-k scored clusters are extracted as keyphrases.
Pasquier [77] proposes to induce topic distributions from groups of semantically related
sentences using both clustering algorithms and LDA. Keyphrases are scored by considering
the distributions of topics over clusters, the distributions of phrases over topics,
and the size of each cluster.
Extracting keyphrases from every topic cluster assumes that all topics are equally
important in an article. In reality, however, some minor topics are unimportant to an
article and should not have keyphrases representing them.
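The cluster-then-pick-exemplars pipeline common to these approaches can be sketched as follows. The greedy single-pass clustering and the toy co-occurrence vectors are simplifications for illustration, not a reimplementation of any of the cited methods:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster_phrases(vectors, threshold=0.5):
    """Greedy single-pass clustering: each phrase joins the first cluster
    whose centroid it is similar enough to, otherwise it starts a new one."""
    clusters = []  # each cluster is a list of (phrase, vector) pairs
    for phrase, vec in vectors.items():
        for cluster in clusters:
            centroid = [sum(col) / len(cluster)
                        for col in zip(*(v for _, v in cluster))]
            if cosine(vec, centroid) >= threshold:
                cluster.append((phrase, vec))
                break
        else:
            clusters.append([(phrase, vec)])
    return clusters

def exemplars(clusters):
    """Pick the phrase closest to each cluster centroid as its keyphrase."""
    keys = []
    for cluster in clusters:
        centroid = [sum(col) / len(cluster)
                    for col in zip(*(v for _, v in cluster))]
        keys.append(max(cluster, key=lambda pv: cosine(pv[1], centroid))[0])
    return keys

# Toy co-occurrence vectors forming two topical groups.
vectors = {"neural network": [3, 1, 0], "deep learning": [2, 1, 0],
           "stock market": [0, 0, 4], "share price": [0, 1, 3]}
keyphrases = exemplars(cluster_phrases(vectors))
# Two clusters emerge; their exemplars are the extracted keyphrases.
```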
2.1.5.3 Capturing strong relations – graph-based approaches
Graph-based approaches score phrases using graph ranking algorithms by representing
a document as a graph, where each phrase corresponds to a vertex, and two vertices are
connected if a pre-identified relation, such as phrase co-occurrence within a predefined
window, holds between them.
The most common ranking algorithms employed are webpage link analysis algorithms,
such as HITS [88], and PageRank [89] and its variants. HITS and PageRank recur-
sively compute the importance of a vertex in a graph by analysing both the number of
neighbouring vertices it connects to, and the importance of each of its neighbours.
Applying link analysis algorithms to keyphrase extraction assumes that 1) an important
phrase should have high frequency, such that it co-occurs more often with other phrases,
and 2) a phrase that selectively co-occurs with one or a few particularly frequent phrases
can also be important.
The most well-known algorithm is TextRank introduced by Mihalcea and Tarau [3],
which applies PageRank to AKE⁴ by representing documents as undirected and unweighted
graphs, considering only frequency and co-occurrence frequency features. Following
Mihalcea and Tarau's work, researchers have tended to improve performance by adding
features as weights to the graph's edges. For example, Wan and Xiao [10] propose to use
phrase co-occurrence frequencies as weights collected from both the target document and
its k nearest neighbour documents, identified using the document cosine similarity measure.
The approach essentially expands a single document to a small document set, by adding
a few topic-wise similar documents to capture more statistical information. Wang et
al. [6] use synsets in WordNet to obtain semantic relations between pairs of phrases.
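TextRank's core computation is PageRank over an unweighted word co-occurrence graph. A minimal sketch, with a toy word sequence and a fixed iteration count rather than a convergence test:

```python
def pagerank(graph, d=0.85, iters=50):
    """Unweighted PageRank as used by TextRank.

    graph maps each word to the set of words it co-occurs with inside a
    fixed window; d is the damping factor from the original algorithm.
    """
    scores = {v: 1.0 for v in graph}
    for _ in range(iters):
        scores = {v: (1 - d) + d * sum(scores[u] / len(graph[u])
                                       for u in graph[v])
                  for v in graph}
    return scores

def cooccurrence_graph(words, window=2):
    """Undirected graph linking each word to those within `window` before it."""
    graph = {w: set() for w in words}
    for i, w in enumerate(words):
        for u in words[max(0, i - window):i]:
            if u != w:
                graph[w].add(u)
                graph[u].add(w)
    return graph

words = ["graph", "based", "ranking", "graph", "ranking", "model"]
scores = pagerank(cooccurrence_graph(words))
# Frequent, well-connected words such as "graph" and "ranking" rank highest.
```

In a full system the vertices would be POS-filtered candidate words, and adjacent top-ranked words would be merged back into multi-word keyphrases.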
In addition to webpage link analysis algorithms, applying traditional graph centrality
measures to AKE has also been investigated. For example, Boudin [90] compares various
centrality measures for graph-based keyphrase extraction, including degree, closeness,
⁴ TextRank uses a weighted graph ranking algorithm derived from PageRank for text summarisation. However, TextRank is identical to PageRank when used for AKE, where the graph is unweighted.
betweenness centrality measures, as well as PageRank, which is classified as a variant of
the eigenvector centrality. The study shows that the simple degree centrality measure
achieves comparable results to the widely used PageRank algorithm, and the closeness
centrality delivers the best performance on short documents.
Using probabilistic topic models in conjunction with graph ranking algorithms has also
been investigated. This line of work is similar to clustering-based AKE approaches,
since both of them require identifying topics from documents as a prior. However,
in clustering-based approaches, keyphrases are directly drawn from each cluster. In
contrast, graph-based approaches treat topic distributions as features that are inputs
to the graph ranking algorithms. Liu et al. [39] use LDA [76] to induce latent topic
distributions of each phrase, then run the personalised PageRank algorithm [91] for
each topic separately, where the random-jump probability of a phrase is computed as
its probability under that topic. Candidates are finally scored as the sum of
their scores in each topic. In their later work, Liu et al. [92] propose a Topic Trigger
Model derived from the Polylingual Topic Model [93], an extension of LDA. Similarly,
Bellaachia and Al-Dhelaan [47] also use LDA to identify latent topics, then rank phrases
with respect to their topic distributions and TF-IDF scores.
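The per-topic ranking scheme can be sketched with a personalised PageRank whose random-jump mass follows a topic's word distribution, with candidates scored by summing over topics. This is a sketch of the scheme rather than the cited implementation; the graph and topic distributions below are made-up stand-ins for real LDA output:

```python
def personalised_pagerank(graph, restart, d=0.85, iters=50):
    """PageRank whose random-jump mass is distributed by `restart`
    (here, a word's probability under one latent topic)."""
    scores = {v: restart[v] for v in graph}
    for _ in range(iters):
        scores = {v: (1 - d) * restart[v]
                     + d * sum(scores[u] / len(graph[u]) for u in graph[v])
                  for v in graph}
    return scores

# Toy co-occurrence graph and two hypothetical topic distributions
# over the same words (in place of LDA output).
graph = {"tax": {"policy"}, "policy": {"tax", "reform"}, "reform": {"policy"}}
topics = [{"tax": 0.6, "policy": 0.3, "reform": 0.1},
          {"tax": 0.1, "policy": 0.3, "reform": 0.6}]

# Final score: sum of a word's personalised PageRank score in each topic.
final = {w: sum(personalised_pagerank(graph, t)[w] for t in topics)
         for w in graph}
# The central word "policy" accumulates the highest overall score.
```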
2.1.6 Deep Learning for AKE
Jo and Lee [84] use a Deep Belief Network (DBN) connected to a logistic regression layer
to learn a classifier. The model does not require any manually selected features. It
uses a greedy layer-wise unsupervised learning approach [94] to automatically learn the
features one layer at a time. After training, all pre-trained layers are connected and
fine-tuned. The input is the bag-of-words representation of a document, which essentially
is a vector with 1 and 0 values indicating whether a word appears in the document. The
logistic regression layer outputs potential latent phrases. Zhang et al. [95] propose a deep
learning model using a Recurrent Neural Network (RNN), combining keyword and context
information to extract keyphrases. The network has two hidden layers, where the first
one aims to capture keyword information, and the second extracts keyphrases based on
the keyword information encoded by the first layer. Meng et al. [96] propose
a generative model for keyphrase prediction with an encoder-decoder architecture using a
Gated Recurrent Unit neural network [97] that incorporates a copying mechanism [98]. Li et
al. [99] use word embeddings to represent words, then apply Euclidean distances to
identify the top-N-closest keywords as the extracted keywords.
2.1.7 Evaluation Methodology
Early studies employ human evaluation [36], which is an impractical approach because
of the significant amount of effort involved. The most common approach is exact match,
i.e. a ground-truth keyphrase matches an extracted phrase when they correspond to the
same stem sequence. For example, the phrase neural networks matches neural network,
but not network or neural net. However, the exact match evaluation is overly strict. For
example, convolutional neural network and convolutional net refer to the same concept
in the machine learning field, but they are treated as an incorrect match under the
exact match approach. In search of more effective evaluation approaches, Kim et al. [100]
investigate six evaluation metrics. Four of them are selected from the machine transla-
tion and automatic text summarisation fields, including BLEU [101], METEOR [102],
NIST [103] and ROUGE [104]. Others include R-precision [105], and the cosine similarity
measure of phrases’ distributional representations induced from web documents. Four
graduate students from the NLP field were hired as judges, and Spearman's ρ
correlation was used to compare the computer-generated and human-assigned keyphrases.
The study shows that the semantic similarity-based method produces the lowest score, and
R-precision achieves the highest correlation with humans, about 0.45. They also suggest
that phrases should be scored differently based on whether they match keyphrases by
head nouns, the first word, or middle words. However, there is no sophisticated
evaluation metric that can identify whether two phrases share the same semantic meaning
at a human-acceptable level. Hence, most researchers still use exact match as their
evaluation metric.
The most common measures for AKE system performance are Precision, Recall, and
F-score (also known as the F1 score), computed as:
precision = TP / (TP + FP) = (number of correctly matched) / (total number of extracted)

recall = TP / (TP + FN) = (number of correctly matched) / (total number of ground truth)

F = (2 × precision × recall) / (precision + recall)
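These measures can be computed directly from the sets of extracted and ground-truth phrases. A minimal sketch of exact-match evaluation, assuming both lists already hold stemmed phrase strings:

```python
def evaluate(extracted, ground_truth):
    """Exact-match Precision, Recall and F1 for one document.

    Both inputs are assumed to be stemmed phrase strings, so that e.g.
    "neural networks" and "neural network" have already been conflated.
    """
    tp = len(set(extracted) & set(ground_truth))  # correctly matched
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = evaluate(["neural network", "corpus", "graph"],
                   ["neural network", "graph", "topic model", "ranking"])
# 2 of 3 extracted phrases are correct (p = 2/3); 2 of 4 ground-truth
# keyphrases are found (r = 1/2); f is their harmonic mean, 4/7.
```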
2.1.8 Applications
Keyphrases are also referred to as topic representative terms, topical terms, or seman-
tically significant phrases, which benefit many downstream NLP tasks, such as text
summarisation, document clustering and classification, information retrieval, and entity
recognition.
In text summarisation, keyphrases can indicate the important sentences in which they
occur. D’Avanzo and Magnini [106] present a text summarisation system – LAKE,
which demonstrates that an ordered list of relevant keyphrases is a good representation
of the document content. Other work, including [107–109], also generates summaries
based on keyphrases, integrating the keyphrase identification step with their
summarisation techniques.
Keyphrases are also topic representative phrases that offer significant benefits to docu-
ment classification and clustering. Hulth and Megyesi [110], and Kim et al. [111] show
the use of keyphrases generally improves the efficiency and accuracy of document clas-
sification systems. Hammouda et al. [112] introduce CorePhrase, an algorithm for topic
discovery using keyphrases extracted from multi-document sets and clusters. Zhang et
al. [113] cluster web page collections using automatically extracted topical terms that
are essentially keyphrases.
Keyphrases also support document indexing, which assists users in formulating queries
for search engines. For example, Wu et al. [114] propose enriching metadata of the
returned results by incorporating automatically extracted keyphrases of documents with
each returned hit.
Other studies include Mihalcea and Csomai [115], who use keyphrases to link web doc-
uments to Wikipedia articles; Ferrara et al. [116], who use keyphrases to produce richer
descriptions of documents as well as user interests, attempting to enhance the accessibility
of scientific digital libraries; and Rennie and Jaakkola [117], who treat keyphrases as
topic oriented and informative terms to identify named entities.
2.1.9 Similar Tasks to AKE
Many downstream NLP tasks, such as automatic term extraction and keyword extraction,
share a number of commonalities with AKE. However, they are subtly different.
2.1.9.1 Automatic Domain-specific Term Extraction
Automatic domain-specific term extraction, also known as term recognition, term iden-
tification, or terminology mining [118], is a task that automatically identifies domain-
specific technical terms from relatively large corpora. Domain specific terms are stable
lexical units such as words or multi-word phrases, which are used in specific contexts
to represent domain-related concepts [119]. Extracting domain-specific terms is an im-
portant and essential step for ontology learning [120]. This subsection briefly reviews
some common techniques for domain-specific term extraction, which also serves as the
background for Section 5.
Domain-specific terms (or terms for short) share many characteristics with keyphrases,
yet there are subtle differences between them. From a linguistic perspective, terms are
similar to keyphrases: both need to be semantically meaningful and syntactically
acceptable expressions that may consist of one or more words. Terms correspond to tech-
nical entities in a domain. Hence, from a statistical point of view, terms intuitively have
high frequencies, being referenced by a large number of articles. Based on these sim-
ilarities, some AKE approaches can be applied to term extraction, such as TF-IDF [2].
Newman et al. [121] present a Dirichlet Process Segmentation model for both AKE and
term extraction. More recently, Liu et al. [54] applied TextRank to identify medical
terms.
The main difference between term extraction and AKE, however, is the level of scope.
Terms describe concepts and technical elements, which are properties of a specific do-
main. Hence, the task of term extraction aims to identify whether a term is relevant or
important to the domain of a document collection, which essentially is at the level of the
entire corpus. On the other hand, keyphrases present the main ideas or arguments of a
document, which describe the document at a highly abstract level. Therefore, keyphrases
need to be identified for each document, thus the task is at the level of documents.
Research in term extraction has a long history. Firstly, we summarise the work presented
by Kageura and Umino [122], which surveys early studies up to 1996. Kageura
and Umino classified these studies into two groups: linguistic and statistical approaches.
Linguistic approaches focus on applying linguistic knowledge such as analysing syntac-
tical information to identify noun phrases as terms. The most common approach along
this line is to use pre-defined POS patterns [123–125] or heuristics [126, 127]. Statis-
tical approaches focus on developing weighting schemes that measure the likelihood of
terms using statistical information such as term frequencies or co-occurrence frequencies.
Kageura and Umino classified statistical approaches into two further groups based on
two important concepts introduced in their work, namely unithood and termhood [122].
The unithood refers to the measure of how likely a sequence of words is to form a
syntactically correct and semantically meaningful term. From this point of view, the
unithood measure can be thought of as a pre-processing step that identifies valid phrases
for further processing. Early studies for measuring unithood include [124, 128–130]. The termhood
refers to the degree of relevance to a specific domain, which is the actual measure for
identifying terms. Early studies along this line include [131–133].
However, in later research, many studies have employed both linguistic and statistical
techniques. For example, Frantzi et al. [56] use a pre-defined POS pattern to identify
noun phrases; the identified noun phrases become candidate terms for further processing
using a statistical-based approach. Hence, these approaches can be classified as neither
purely linguistic nor purely statistical. The commonality is that they do not
use any training data, and thus can be regarded as unsupervised approaches. On the other
hand, applying supervised machine learning algorithms to term extraction has also been
recently proposed [134–136]. Based on this, we group more recent studies into two
streams: supervised machine learning and unsupervised approaches. The main difference
between supervised machine learning and unsupervised approaches is the core algorithms
used for identifying domain-specific terms. Supervised approaches aim to train classifiers
using labelled data, whereas unsupervised approaches rely on weighting schemes that
measure termhood.
Supervised machine learning approaches treat term extraction as either a binary
classification or ranking problem. In comparison to unsupervised approaches, applying
supervised machine learning algorithms to term extraction is a less popular choice due
to the fact that constructing labelled training data is difficult and laborious. Another
possible reason might be that term extraction is performed over the entire corpus, hence
it has stronger statistics but fewer local features (e.g. the occurrence positions used
in AKE), which is advantageous to unsupervised approaches.
Most supervised approaches employ a processing pipeline, consisting of 1) candidate
term identification, 2) feature selection, and 3) term classification. The first step is to
identify candidates, which has the same goal as AKE. Hence, most pre-processing tech-
niques in AKE described in Section 2.1.2 are also applicable to term extraction, among
which using pre-defined POS patterns remains the most popular choice [134–136].
Statistical and linguistic features are commonly used in supervised term extraction. For
example, Spasic et al. [137] proposed a classification approach based on verb complemen-
tation patterns, and Foo [134, 138] uses POS tags, morph-syntactic descriptions, and
grammatical functions combined with statistical information to train classifiers. After
feature engineering, candidate terms are mapped from their symbolic representations
to vectors that are the inputs to the supervised machine learning algorithms. Conrado
et al. [136] applied Rule Induction, Naïve Bayes, Decision Tree, and Sequential Minimal
Optimization algorithms; Foo and Merkel used an existing rule induction learning system named
Ripper [139]. More recently, Fedorenko et al. [140] present an experimental evaluation,
in which Random Forest, Logistic Regression, and Voting algorithms are compared. The
work shows that Logistic Regression and Voting deliver the best performance on two
datasets.
Unsupervised term extraction scores and ranks terms, then the top ranked ones
are extracted as domain-specific terms. Unsupervised term extraction involves measures
of unithood and termhood. The unithood measure is mainly based on statistical
evidence, such as term frequencies, to identify whether a sequence of words occurs in
patterns or by chance. Some of the well-known unithood measures include log-likelihood
ratio [141], t-test, and pointwise mutual information (PMI) [67]. More recently, Wong
et al. [57] present a probabilistic unithood measure framework that uses both linguistic
features and the statistical evidence from web search engines. It is worth noting that
the unithood measure is not popular in AKE. One reason is that unithood requires a
large dataset – the entire corpus – to work well. On the other hand, most AKE
approaches only use statistics within each document, which provide much less statistical
information.
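For instance, the PMI unithood measure can be estimated directly from corpus counts: PMI(x, y) = log( P(x, y) / (P(x) P(y)) ), which is high when two words appear together far more often than chance predicts. The statistics below are invented for illustration:

```python
import math
from collections import Counter

def pmi(bigram, unigram_counts, bigram_counts, n):
    """Pointwise mutual information of a two-word candidate term,
    estimated from unigram and bigram counts over n tokens."""
    x, y = bigram
    p_xy = bigram_counts[bigram] / n
    p_x = unigram_counts[x] / n
    p_y = unigram_counts[y] / n
    return math.log(p_xy / (p_x * p_y))

# Toy corpus statistics: "neural network" nearly always occurs as a unit,
# whereas "the network" co-occurs no more often than chance.
n = 1000
unigrams = Counter({"neural": 12, "network": 15, "the": 400})
bigrams = Counter({("neural", "network"): 10, ("the", "network"): 5})

score_term = pmi(("neural", "network"), unigrams, bigrams, n)
score_nonterm = pmi(("the", "network"), unigrams, bigrams, n)
# score_term is strongly positive; score_nonterm is near or below zero.
```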
The unithood measure can be thought of as a pre-processing technique, mainly used for
measuring the likelihood that a sequence of words constitutes a term. Termhood,
on the other hand, measures the degree of relevance to a specific domain for a
given candidate term. Most techniques for termhood measure are based on two types of
resources: 1) statistical information from the local corpora, and 2) information from ex-
ternal contrastive corpora or knowledge bases. In fact, the development of unsupervised
algorithms for term extraction can be thought of as a journey of seeking and employing
resources.
Using statistical information supported only by the local corpora is the fundamental and
simplest approach, because it does not require any external resource that may not always
be available. The most well-known approach using statistics is TF-IDF [2], which can be
either applied directly to extract terms, or as a derived feature for further computation.
Kim et al. [142] proposed to use TF-IDF to extract domain-specific terms. They use
TF-IDF to compute domain specificity, a notion introduced in their work based on the
idea that domain-specific terms occur much more frequently in a particular domain than
in others. Navigli and Velardi [143] use TF-IDF as a feature for measuring
terms’ distribution over the entire corpus.
Another popular approach using only local statistics is C-value [56]. It is specifically
designed to distinguish a term from its longer versions – super-strings of the term,
e.g. neural network versus novel neural network – based on statistics of term occurrences. Concretely,
C-value uses statistical information of a phrase a from four aspects: 1) the occurrence
frequency of a in the corpus, 2) the number of times that a appears as a part of other
longer phrases, 3) the number of the longer phrases that contain a, and 4) the number
of words that a contains. The score of a phrase a is computed as:

C-value(a) = log₂|a| × f_a    if a does not appear in any longer phrase
C-value(a) = log₂|a| × ( f_a − (1 / P(T_a)) Σ_{b ∈ T_a} f(b) )    otherwise

where f_a is the frequency of phrase a, |a| is the number of words that a contains, T_a is the
list of longer phrases that contain a, and P(T_a) is the size of T_a (the number of elements in
T_a). In this way, the score of the candidate is reduced if it is a part of other candidates.
However, C-value is designed to recognise only multi-word terms. Barron-Cedeno
et al. [144] generalise C-value to handle single word terms by adding a constant to the
logarithm. Other algorithms for term extraction derived from C-value include [145–147].
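The multi-word C-value can be implemented directly from the definition above. Toy frequency counts are used here, and `nested_in` is a hypothetical helper structure mapping each phrase to the longer phrases that contain it:

```python
import math

def c_value(phrase, freq, nested_in):
    """C-value of a multi-word phrase.

    freq: occurrence counts per phrase.
    nested_in: maps a phrase to the longer phrases containing it (T_a).
    """
    words = phrase.split()
    longer = nested_in.get(phrase, [])
    if not longer:
        # Phrase never appears inside a longer candidate.
        return math.log2(len(words)) * freq[phrase]
    # Discount by the average frequency of the containing phrases.
    discount = sum(freq[b] for b in longer) / len(longer)
    return math.log2(len(words)) * (freq[phrase] - discount)

freq = {"neural network": 20, "novel neural network": 4,
        "deep neural network": 6}
nested_in = {"neural network": ["novel neural network", "deep neural network"]}

score = c_value("neural network", freq, nested_in)
# log2(2) * (20 - (4 + 6) / 2) = 15.0
```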
Later studies have also attempted to use contrastive corpora (also called reference
corpora in some of the literature) in addition to the local ones to provide extra informa-
tion [148–152]. A contrastive corpus is a document collection that usually contains texts
from more general or contrastive fields than the target corpus from which domain-specific
terms are extracted. The purpose of using contrastive corpora is to compare the distri-
butions of words in both target and contrastive corpora, from which potential terms are
inferred. For example, Gelbukh et al. [151] assume that a word or phrase that appears
much more frequently in the target corpus is more likely to be a domain-specific term.
Basili et al. [148] proposed a contrastive approach that relies on a cross-domain statistical
measure, using a target corpus consisting of 1,400 documents from the Italian Supreme
Court and a contrastive corpus consisting of 6,000 news articles. The proposed measure,
called contrastive weight, is based on Inverse Word Frequencies [153] – a variant of Inverse
Document Frequencies, which measures different distributions of terms throughout a set
of given topics. This contrastive weight for an individual word a in target domain d is
defined as:

CW(a) = log f(a_d) × log( Σ_j Σ_i F_ij / Σ_j f(a_j) )

where f(a_d) is the frequency of a in d, Σ_j Σ_i F_ij is the total frequency of all words in both
the target and contrastive corpora, and Σ_j f(a_j) is the frequency of a in all corpora. The
score of a multi-word term t is then computed as:

CW(t) = log f(t_d) × CW(t_h)

where f(t_d) is the frequency of t in d, and t_h is the head noun of the term.
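The two contrastive-weight formulas above can be computed directly; the corpus counts below are hypothetical, chosen only to illustrate a word that is frequent in the target domain but rare overall:

```python
import math

def contrastive_weight(f_target, f_all_words, f_word_all):
    """CW(a) = log f(a_d) * log( total word frequency in all corpora
    / frequency of a across all corpora )."""
    return math.log(f_target) * math.log(f_all_words / f_word_all)

def cw_multiword(f_term_target, cw_head):
    """CW(t) = log f(t_d) * CW(t_h), with t_h the head noun of t."""
    return math.log(f_term_target) * cw_head

# Hypothetical counts: the head noun "appeal" occurs 120 times in the
# legal target corpus but only 300 times in 1,000,000 tokens overall.
cw_head = contrastive_weight(f_target=120,
                             f_all_words=1_000_000, f_word_all=300)
score = cw_multiword(f_term_target=35, cw_head=cw_head)
# The multi-word term inherits and amplifies its head noun's weight.
```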
In later work, Wong et al. [150] proposed a probabilistic model for measuring termhood,
namely the odds of termhood (OT). The authors first identified seven characteristics
of terms, from which they developed the model using Bayes’ Theorem, as:
P(R₁|A) = P(A|R₁) P(R₁) / P(A)

where R₁ is the event that a is relevant to the domain, and A is the event that a is
a candidate term represented as a vector of feature values derived from the identified
characteristics. The authors assumed that the probability of candidate a being relevant
to both the target domain and the contrastive corpora is approximately 0, i.e.
P(R₁ ∩ R₂) ≈ 0, where R₂ is the event that candidate a is relevant to other domains.
The odds of a candidate term a being relevant to the domain are computed as:

O(a) = P(A|R₁) / (1 − P(A|R₁))

The OT of a term a is computed as:

OT(a) = log( P(A|R₁) / P(A|R₂) )
Using online resources for term extraction has also been proposed. For example, Wong et al. [57]
present a probabilistic unithood measure framework that uses both linguistic features
and statistical evidence from web search engines, where the authors employed page
counts from the Google search engine to calculate the dependency between words. Each
constituent word in a term is formulated as a query to the Google search engine, and the
page count returned for each word is used to calculate the mutual information. Dobrov
and Loukachevitch [154] also proposed to use search engines to provide extra features.
However, instead of using page counts, the authors analyse snippets (short fragments
of text explaining search results) returned by the search engine. Studies have also
attempted to use Wikipedia – one of the largest online knowledge repositories – as
external knowledge. For example, Vivaldi and Rodriguez [155] use Wikipedia as an
external corpus, building ontological categories for the domain by traversing the
Wikipedia category graph.
2.1.9.2 Other Similar Tasks
Keyword extraction can be thought of as a sub-task of keyphrase extraction, which only
focuses on extracting unigram keywords. Theoretically, automatic keyword extraction
algorithms can be applied to keyphrase extraction task as long as one treats phrases in
the same way as unigram words. In practice, however, it is not always possible. For
example, Ercan and Cicekli [61] present a lexical chain model that requires the knowledge
of candidate senses and semantic relations between candidates. However, they have only
Chapter 2. Background 28
focused on extracting keywords rather than keyphrases, because keyphrases are usually
domain-dependent phrases that are not present in WordNet.
The most similar task to AKE is automatic keyphrase assignment, which aims to select
keyphrases for a document from a given set of vocabulary [156]. Given a fixed set of
keyphrases as the ground-truth, the task is to tag each document in a corpus with
its corresponding ground-truth keyphrases. On the other hand, AKE aims to identify
keyphrases from the content of a document without any given ground-truth.
Automatic text summarisation is another task closely related to AKE; it aims to
identify information significant enough to summarise a given document.
The major difference is that text summarisation aims to produce syntactically valid
sentences, rather than just a list of phrases.
Other related tasks include named entity recognition and term extraction. Named entity
recognition aims to extract specific types of information, such as the athletes,
teams, leagues, scores, locations, and winners from a set of sport articles [157].
2.2 Learning Representations of Words and
Their Compositionality
Machine learning requires data representations. Unlike fields such as image processing,
where the data are naturally encoded as vectors of individual pixel intensities,
the data in NLP are sequences of words. Words are the fundamental building
blocks of a language; thus, the first step in all downstream NLP tasks is to obtain
vector representations of words that serve as inputs to machine learning algorithms. Very
often, researchers also need representations beyond word level, i.e. multi-word phrases,
sentences, or even documents, to perform specific tasks such as Machine Translation or
Sentence Classification. Learning representations beyond word level requires not only
the understanding of each word, but also learning the combination rules, which remains
a great challenge to the NLP community. This section provides a comprehensive review
of representation learning for unigram words and multi-word expressions. Specifically,
we focus on what representation learning is and why it is important, how the
representations are induced, and where they can be applied.
Sentence 1: The cat is running in the bedroom. Sentence 2: The dog is walking in the kitchen.

the      1 0 0 0 0 0 0 0 0
cat      0 1 0 0 0 0 0 0 0
dog      0 0 1 0 0 0 0 0 0
is       0 0 0 1 0 0 0 0 0
running  0 0 0 0 1 0 0 0 0
walking  0 0 0 0 0 1 0 0 0
in       0 0 0 0 0 0 1 0 0
bedroom  0 0 0 0 0 0 0 1 0
kitchen  0 0 0 0 0 0 0 0 1

Figure 2.1: One-hot Vector Representations: words with 9 dimensions
2.2.1 Word Representations
2.2.1.1 Atomic Symbols and One-hot Representations
In early rule-based and statistical NLP systems, words are treated as discrete atomic
symbols, i.e. words are represented in their lexical forms. Atomic symbols offer natural
representations of words, allowing researchers to use regular expressions or morphological
analysis to tackle NLP tasks at the lexical level, such as text searching [158], parsing [159,
160], or stemming [161]. In machine learning, atomic symbols are converted into vector
representations, where each symbol becomes an identifier referring to a feature vector whose
length equals the size of the vocabulary and in which only one dimension is asserted to
indicate the word's ID, namely a one-hot representation. Figure 2.1 shows an example of one-
hot representations of words for a vocabulary of size 9.
However, neither atomic symbols nor one-hot vectors are capable of providing semantic
information about words, and thus a system built on such representations has no
understanding of the meanings of words. Specific to one-hot representations, machine
learning algorithms suffer from the curse of dimensionality and data sparsity when the
vocabulary is large, since the dimension of the vectors increases with the vocabulary's
size while most feature values are zeros.
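The construction can be sketched in a few lines (a toy illustration using the vocabulary of Figure 2.1; the helper name is ours):

```python
def one_hot(word, vocabulary):
    """Return a one-hot vector for `word`: all zeros except a single 1
    at the word's index in the vocabulary."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

vocabulary = ["the", "cat", "dog", "is", "running",
              "walking", "in", "bedroom", "kitchen"]

print(one_hot("cat", vocabulary))  # [0, 1, 0, 0, 0, 0, 0, 0, 0]
```

Note that the vector length grows linearly with the vocabulary, which is exactly the sparsity problem described above.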
2.2.1.2 Vector-space Word Representations
The idea of vector-space representations of words is to represent words in a continuous
vector space where words with similar semantics are mapped close together. The learnt
representations are low-dimensional, real-valued vectors, which naturally overcome the
curse of dimensionality and data sparsity problems. Learning such representations is an
essential step in modern NLP, benefiting a number of downstream NLP tasks, such
as speech recognition [162, 163], machine translation [17, 97], and sentiment classifica-
tion [20, 26].
The vast majority of approaches to learning vector-space representations are based on the
distributional hypothesis [164], i.e. words tend to have similar meanings if they appear
in the same contexts. Recent studies have demonstrated that the learnt representations
not only encode statistical and lexical properties of words, but also partially encode
linguistic and semantic knowledge. For example, word vector calculations show that
vec(king) − vec(man) + vec(woman) ≈ vec(queen), and
vec(apple) − vec(apples) ≈ vec(car) − vec(cars) [13].
The vector-space representations of words can be induced from two approaches: the dis-
tributional and distributed approaches. The distributional approaches induce word rep-
resentations using word-context count-based models by applying mathematical functions
(e.g. matrix decomposition algorithms) to reduce the dimensions of word co-occurrence
matrices. On the other hand, distributed approaches use word-context prediction-based
models, which compute the word probability mass function over the training corpus.
The induced word representations are also named word embeddings. In the following
two sections, we illustrate the fundamental ideas of each approach using a toy dataset
that only consists of two sentences: The cat is running in the bedroom, and The
dog is walking in the kitchen.
2.2.1.3 Inducing Distributional Representations using Count-based Models
Count-based models learn distributional representations by applying dimensionality
reduction algorithms to word-context matrices [165]. Concretely, let M ∈ R^{|W|×|C|} be
a word-context matrix, where |W| is the size of the vocabulary W and |C| is the size
of the context set C. Each row in M corresponds to the representation of a word,
while the columns of M depend on the type of context of interest. For
example, if the interest is in analysing how words co-occur with each other, then the context is
words, and M is constructed as an adjacency (word-word co-occurrence) matrix, as shown in
Figure 2.2 (A). If the interest is in analysing how words occur in sentences, then the context
becomes each sentence, and M is a word-sentence matrix, as shown in Figure 2.2 (B).
Count-based models aim to reduce |C| by looking for a function g that maps M to a
matrix M′ ∈ R^{|W|×d}, where d < |C|.
The most common approach is to factorize M to yield a lower-dimensional M ′, where
each row becomes the new vector representation for each word. The most well-known
algorithm is Latent Semantic Analysis (LSA) introduced by Deerwester et al. [166] who
(A) word-word co-occurrence matrix:

         the cat dog is running walking in bedroom kitchen
the       2   1   1   2    1      1      2    1      1
cat       1   0   0   1    1      0      1    1      0
dog       1   0   0   1    0      1      1    0      1
is        2   1   1   0    1      1      2    1      1
running   1   1   0   1    0      0      1    1      0
walking   1   0   1   1    0      0      1    0      1
in        2   1   1   2    1      1      0    1      1
bedroom   1   1   0   1    1      0      1    0      0
kitchen   1   0   1   1    0      1      1    0      0

(B) word-sentence matrix:

         Sentence 1  Sentence 2
the          2           2
cat          1           0
dog          0           1
is           1           1
running      1           0
walking      0           1
in           1           1
bedroom      1           0
kitchen      0           1

Figure 2.2: Co-occurrence matrices of two sentences. (A): word-word co-occurrence matrix, (B): word-document co-occurrence matrix.
Figure 2.3: Distributional Representations of Words in 2 dimensions induced from SVD over the toy dataset. (A): Representations from the word-word matrix, (B): Representations from the word-document matrix.
apply Singular Value Decomposition (SVD) to the matrix M. SVD learns latent structures
of M ∈ R^{|W|×|C|} by decomposing it into M = USV^T. U is a |W| × |W| unitary
matrix of left singular vectors, where the columns of U are orthonormal eigenvectors of
MM^T. V is a |C| × |C| unitary matrix of right singular vectors, where the columns of
V are orthonormal eigenvectors of M^T M. S is a diagonal matrix containing the singular
values, i.e. the square roots of the eigenvalues of MM^T (equivalently, of M^T M). Keeping
the d largest singular values in S, we obtain the rank-d approximation M′ = U_d S_d V_d^T.
We apply SVD to the toy dataset to show how SVD reduces the number of dimensions
while preserving a structure similar to the original representations. Figure 2.3 plots
the 2-dimensional word vectors induced using SVD from the matrices in Figure 2.2, where we
take the sub-matrix M′ ∈ R^{|W|×2} (the first two columns of U) as the representations
of words. The 2-dimensional distributional word representations induced from the toy dataset
still capture co-occurrence patterns and basic semantic similarities; for example, the words
cat and dog have very similar representations. In practice, the low-dimensional representations of words can
be compared using the cosine similarity measure between any two vectors. Values close
to 1 indicate that the two words have almost the same meaning, while values close to 0
represent two very dissimilar words.
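This pipeline — build the co-occurrence matrix, truncate the SVD, compare words with cosine similarity — can be sketched with numpy (our own illustration over the toy matrix of Figure 2.2 (A)):

```python
import numpy as np

# Word-word co-occurrence matrix for the toy corpus (Figure 2.2 (A)).
words = ["the", "cat", "dog", "is", "running",
         "walking", "in", "bedroom", "kitchen"]
M = np.array([
    [2, 1, 1, 2, 1, 1, 2, 1, 1],
    [1, 0, 0, 1, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 1, 1, 0, 1],
    [2, 1, 1, 0, 1, 1, 2, 1, 1],
    [1, 1, 0, 1, 0, 0, 1, 1, 0],
    [1, 0, 1, 1, 0, 0, 1, 0, 1],
    [2, 1, 1, 2, 1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1, 0, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 1, 0, 0],
], dtype=float)

# SVD: M = U S V^T; keep only the top d = 2 dimensions.
U, S, Vt = np.linalg.svd(M)
vectors = U[:, :2] * S[:2]          # rank-2 word representations

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# cat and dog have near-identical co-occurrence profiles, so their
# reduced vectors should point in a similar direction.
sim = cosine(vectors[words.index("cat")], vectors[words.index("dog")])
print(sim)
```

With only two sentences the geometry is crude, but the similarity between cat and dog is already clearly positive, as the text above describes.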
Following LSA, subsequent studies have attempted different matrix factorisation
approaches, including Nonnegative Matrix Factorisation (NMF) [167], Probabilistic Latent
Semantic Indexing (PLSI) [168], Principal Components Analysis, and Iterative Scaling
(IS) [169]. However, since a word-context matrix can be very high-dimensional and
sparse, storing the matrix is memory-intensive, and running the algorithms can be
computationally expensive for large vocabularies.
More recently, Pennington et al. [170] present a model named GloVe that factorises the
logarithm of a word-word co-occurrence matrix. Specifically, let M denote the matrix, w
be a word, and m_w be its corresponding d-dimensional vector. The co-occurring
context word of w and its vector are denoted as c and m_c. GloVe learns parameters
m ∈ R^{|W|×d} such that, for each w and c:

log M(w, c) ≈ m_w · m_c + b_w + b_c

where b_w and b_c are additional biases for w and c. Unlike other matrix factorisation
approaches that decompose the entire sparse matrix, GloVe leverages statistical
information by training only on the nonzero elements of the matrix, which efficiently reduces
the computational cost.
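A minimal SGD sketch of this weighted least-squares objective (the co-occurrence counts, learning rate, and x_max below are hypothetical; the original implementation uses AdaGrad, but the weighting function f(x) = min((x/x_max)^α, 1) is from the GloVe paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical nonzero co-occurrence counts X(w, c) for a 3-word vocabulary.
pairs = [(0, 1, 3.0), (1, 0, 3.0), (0, 2, 1.0), (2, 0, 1.0)]

V, d, lr = 3, 2, 0.05
W = rng.normal(scale=0.1, size=(V, d))   # word vectors m_w
C = rng.normal(scale=0.1, size=(V, d))   # context vectors m_c
bw = np.zeros(V)                         # word biases b_w
bc = np.zeros(V)                         # context biases b_c

def weight(x, x_max=3.0, alpha=0.75):
    # GloVe's weighting function: down-weights rare pairs and caps
    # the influence of very frequent ones.
    return min((x / x_max) ** alpha, 1.0)

for epoch in range(500):
    for w, c, x in pairs:
        err = W[w] @ C[c] + bw[w] + bc[c] - np.log(x)
        g = weight(x) * err
        gw, gc = g * C[c], g * W[w]      # cache gradients before updating
        W[w] -= lr * gw
        C[c] -= lr * gc
        bw[w] -= lr * g
        bc[c] -= lr * g

# The trained model's prediction approaches log X(w, c).
print(W[0] @ C[1] + bw[0] + bc[1], np.log(3.0))
```

Only the observed (nonzero) pairs ever enter the loop, which is the efficiency gain the paragraph above describes.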
2.2.1.4 Learning Word Embeddings using Prediction-based Models
Word embeddings were originally proposed to overcome the curse of dimensionality problem
in probabilistic language modelling [171]. Mathematically, a probabilistic language
model computes a probability distribution P over sequences of words W, which indicates
the likelihood of W being a valid sentence. Let the occurrence probability of a
dicates the likelihood of W being a valid sentence. Let the occurrence probability of a
sequence of words be:
P (W ) = P (w1, w2, ..., wt−1, wt)
Applying the Chain Rule of Probability, we have:
P (w1, w2, ..., wt−1, wt) = P (w1)P (w2|w1)...P (wt|w1, w2, ..., wt−1)
Thus, the probability of the whole sequence is a product of conditional probabilities, each
conditioned on all preceding words:

P(w_1, w_2, ..., w_t) = ∏_{i=1}^{t} P(w_i | w_1^{i−1})
where w_i^j denotes the sequence of words (w_i, w_{i+1}, ..., w_{j−1}, w_j) from position i to j. However, this
calculation is computationally expensive for large datasets. The N-gram model is introduced
to reduce the computational cost by taking advantage of word order, considering
only the combinations of the last n − 1 words:

P(w_t | w_1^{t−1}) ≈ P(w_t | w_{t−n+1}^{t−1})
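Under this approximation, an n-gram model can be estimated by simple counting; a minimal bigram (n = 2) maximum-likelihood sketch over the toy corpus:

```python
from collections import Counter

corpus = [
    "the cat is running in the bedroom".split(),
    "the dog is walking in the kitchen".split(),
]

bigrams = Counter()
unigrams = Counter()
for sentence in corpus:
    for prev, cur in zip(sentence, sentence[1:]):
        bigrams[(prev, cur)] += 1
        unigrams[prev] += 1

def p(cur, prev):
    """Maximum-likelihood estimate P(cur | prev) = count(prev, cur) / count(prev)."""
    return bigrams[(prev, cur)] / unigrams[prev]

print(p("cat", "the"))  # "the" is followed by cat/dog/bedroom/kitchen once each
```

Counting is cheap, but the estimates are sparse and carry no notion of word similarity, which motivates the neural model discussed next.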
Based on the N-gram model, Bengio et al. [171] introduce the neural probabilistic language
model, demonstrating how word embeddings can be induced. The objective is to
learn a function f that estimates the probability P using a feed-forward neural network:
f(w_t, ..., w_{t−n+1}) = P(w_t | w_{t−n+1}^{t−1})
The network features three layers – a projection layer, a hidden layer, and a softmax
output layer – as shown in Figure 2.4. Let V be the vocabulary of the training set,
containing words w_1, w_2, ..., w_{|V|} where w_i ∈ V. Let C represent the set of word
embedding vectors of dimension d, so C(w) ∈ R^d, and C has |V| × d free parameters
to learn. Let a function g take a sequence of embedding vectors C(w_{t−n+1}), ..., C(w_{t−1})
as input and map it to a conditional probability distribution over the words in V for the next
word w_t. The output of g is the estimated probability P(w_t = i | w_1^{t−1}):

f(i, w_{t−1}, ..., w_{t−n+1}) = g(i, C(w_{t−1}), ..., C(w_{t−n+1}))
The function f consists of two mappings, C and g. C is a collection of word embedding
vectors; the projection layer is a concatenation of the input embeddings of each word, obtained via a
look-up table. The function g is implemented using a feed-forward neural network with
one hidden layer, and the output layer is a softmax layer:

P(w_t | w_{t−n+1}^{t−1}) = e^{y_{w_t}} / Σ_{i=1}^{|V|} e^{y_i}

where y_i is the unnormalised log-probability for each output word i, computed as:

y = b_o + Wx + U tanh(b_h + Hx)
where bo is the bias unit of the output layer, and bh is the hidden layer bias. W is
the weight matrix if there is a direct connection between the projection layer and the
output layer. U is the weight matrix between the hidden layer and the output layer,
and H is the weight matrix between the projection layer and the hidden layer. The
overall parameters to be learnt are θ = (bo,W,U, bh, H,C). The learning objective is
Figure 2.4: Neural Probabilistic Language Model: a look-up table C maps word IDs to embedding vectors, which are concatenated in the projection layer, passed through a hidden layer (weights H, U), and fed to a softmax output layer.
to maximise the probability P(w_t | w_{t−n+1}^{t−1}; θ) by looking for the parameters θ. Using the
log-likelihood function, the loss L can be written as:

L = (1/T) Σ_{t=1}^{T} log f(w_t, w_{t−1}, ..., w_{t−n+1}; θ) = (1/T) Σ_{t=1}^{T} log P(w_t | w_{t−n+1}^{t−1}; θ)
Using Stochastic Gradient Ascent, θ is updated as:

θ := θ + ε ∂ log P(w_t | w_{t−1}, ..., w_{t−n+1}) / ∂θ
where ε is the learning rate. Figure 2.5 plots the embedding vectors induced from the
network by training over the toy dataset. The training iterates 1,000 times due to the
small size of the corpus. Each word embedding vector has only 2 dimensions. Similar to
the distributional representation, the word embeddings also capture the co-occurrence
patterns. For example, bedroom and kitchen share very similar values.
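A forward pass of this network can be sketched as follows (random, untrained weights; we omit the optional direct projection-to-output connection W, i.e. set it to zero, and the layer sizes are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

V, d, n, h = 9, 2, 4, 8   # vocab size, embedding dim, n-gram order (3 context words), hidden units
C = rng.normal(scale=0.1, size=(V, d))            # embedding look-up table
H = rng.normal(scale=0.1, size=(h, (n - 1) * d))  # projection-to-hidden weights
U = rng.normal(scale=0.1, size=(V, h))            # hidden-to-output weights
b_h = np.zeros(h)                                 # hidden bias
b_o = np.zeros(V)                                 # output bias

def forward(context_ids):
    """Estimate P(w_t | context): look up and concatenate the context
    embeddings, apply the tanh hidden layer, then a softmax output."""
    x = C[context_ids].reshape(-1)            # projection layer
    y = b_o + U @ np.tanh(b_h + H @ x)        # unnormalised log-probabilities
    e = np.exp(y - y.max())                   # numerically stabilised softmax
    return e / e.sum()

probs = forward([0, 1, 3])  # context word IDs, e.g. "the cat is"
print(probs.sum())          # a valid probability distribution sums to 1
```

Training then amounts to gradient ascent on log probs[w_t], which updates the look-up table C along with the network weights.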
In practice, when training the probabilistic language model over large datasets such as Wikipedia
or Google News, the softmax function can be very computationally expensive because
of the summation over the entire vocabulary V. To overcome this problem, Collobert
and Weston [172] present a discriminative and non-probabilistic model using a pairwise
ranking approach. The idea is to treat a sequence of words appearing in the corpus as
a positive sample, and a negative sample is constructed by replacing a word from the
Figure 2.5: Word Embedding Vectors: induced from the probabilistic neural language model trained over the toy dataset. Each vector only has 2 dimensions.
positive sample with a random one, simulating a pair that never appears in the corpus.
Given s as the positive example and s^w as the negative one, the learning objective is to
minimise the ranking criterion by looking for the parameters θ:

θ ↦ Σ_{s∈S} Σ_{w∈D} max(0, 1 − f(s) + f(s^w))

where S is the set of positive samples drawn from the training corpus and D is the
vocabulary from which the replacement words w are drawn.
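The hinge-style ranking criterion can be sketched as follows (f here is a stand-in for the network's scoring function, and the scores are hypothetical):

```python
def ranking_loss(f_pos, f_neg):
    """Pairwise hinge loss: pushes the score of a genuine word window
    above the score of a corrupted window by a margin of 1."""
    return max(0.0, 1.0 - f_pos + f_neg)

# Hypothetical scorer outputs.
print(ranking_loss(0.8, 0.1))  # positive loss: margin not yet satisfied
print(ranking_loss(2.0, 0.1))  # 0.0 loss: margin satisfied
```

Because the loss compares only two scores, no normalisation over the vocabulary is needed, which is what makes this approach cheaper than the full softmax.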
More recently, Mikolov et al. [14, 15] introduce two models, namely Continuous Bag of
Words (CBOW) and Continuous SkipGram; the work is also commonly referred to as
word2vec. Both models adopt a shallow network architecture designed specifically for learning
word embeddings. Similar to the classic neural probabilistic language model [171], both
models learn a function f that maps probability distributions over words in a corpus.
The CBOW model predicts a target word given its context words, whereas the
SkipGram model is the opposite of CBOW: it predicts the context words given the
target word. Neither model has a non-linear hidden layer. Instead, each word
w ∈ V is associated with a pair of vectors: one is the word embedding vector that is the
input to the network, and the other is an output vector. Let C be the set of all input
vectors for V, and C′ be the set of output vectors. Given a word w_t, using the softmax
function, the probability of choosing w_j is:

P(w_j | w_t; θ) = e^{C′(w_j)^T · C(w_t)} / Σ_{i=1}^{|V|} e^{C′(w_i)^T · C(w_t)}
The learning objective is to maximise the conditional probability distribution over the
vocabulary V in a training corpus D by looking for the parameters θ = (C, C′). In addition,
Mikolov et al. [15] use two optimisation techniques to further reduce the computational
cost, namely Hierarchical Softmax and Negative Sampling. Hierarchical softmax constructs
a binary tree over the vocabulary based on word frequencies, so the probability of choosing the
context word w_j given w_t can be computed as the product of the probabilities of the
branching decisions taken while walking down the tree from the root to the leaf of w_j. The idea of
negative sampling is similar to Collobert and Weston [172]'s model, rewarding positive samples
and penalising negative samples.
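The full-softmax probability above can be sketched directly (random untrained vectors; real implementations avoid this O(|V|) normalisation via hierarchical softmax or negative sampling):

```python
import numpy as np

rng = np.random.default_rng(0)

V, d = 9, 2
C = rng.normal(scale=0.1, size=(V, d))       # input (embedding) vectors
C_out = rng.normal(scale=0.1, size=(V, d))   # output vectors

def skipgram_prob(j, t):
    """P(w_j | w_t): softmax over dot products between the target's
    input vector and every output vector."""
    scores = C_out @ C[t]
    e = np.exp(scores - scores.max())
    return (e / e.sum())[j]

total = sum(skipgram_prob(j, 1) for j in range(V))
print(total)  # probabilities over the vocabulary sum to 1
```

The summation in the denominator touches every word in V; for a million-word vocabulary this is exactly the cost the two optimisation techniques above are designed to avoid.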
In the past few years, researchers have proposed various neural networks for learning word
embeddings. Mnih and Hinton [173] present a log-bilinear language model trained using
a Restricted Boltzmann Machine. Luong et al. [174] use a recursive neural network
to learn morphologically-aware word representations. Bian et al. [175] introduce
a knowledge-rich approach to learning word embeddings that leverages morphological,
syntactic, and semantic knowledge.
2.2.2 Deep Learning for Compositional Semantics
Languages are creative; thus, the meaning of a complex expression usually cannot be
determined by simply assembling the meanings of its constituent words. Although
word embeddings encode the meanings of individual words, they are not capable
of dynamically generating the meaning of multi-word phrases, sentences, or even
documents. Frege [25] states that the meaning of a complex expression is determined by
the meanings of its constituent words and the rules combining them, known as the
principle of compositionality. Hence, learning compositional semantics requires not only
understanding the meaning of words, but also learning the rules for combining them.
Mitchell and Lapata [176] present a general composition model, in which the meaning
of a phrase is derived from the meanings of its words in conjunction with
background knowledge and syntactic information. Formally, consider a bigram phrase
M consisting of two words m_1 and m_2. Let p denote the meaning of the phrase,
u and v denote the vector representations of m_1 and m_2, and K and R be the background
knowledge and syntactic information respectively; then we have:

p = f(u, v, K, R)
In recent years, researchers have investigated different types of deep neural networks that
implement the general composition model from different angles, including feed-forward,
recurrent, recursive, and convolutional networks.

Feed-forward networks learn compositional semantics using a similar approach to
learning probabilistic language models, i.e. given context words, they predict the
Figure 2.6: A Simple Recurrent Neural Network for Language Modelling.
target word. Mikolov et al. [15], in their well-known word2vec work, demonstrate how
phrase embeddings can be induced using the same network that learns word embeddings:
phrases are treated as atomic units, i.e. in the same way as unigram words. This
technique is also known as the holistic approach [177]. However, it requires all phrases
to be pre-identified, and it cannot generate representations for phrases that do not appear
in the training set. Moreover, the network has neither the mechanism nor the intention to
learn compositional rules.
Based on the word2vec model, Le and Mikolov [178] present a model, namely doc2vec,
that learns distributed representations of sentences and documents. Although
the model uses the same learning algorithm as in [15], the authors add an extra vector
representation for each sentence or paragraph acting as a memory that remembers the
topic of the content, which can be thought of as encoding the compositional rules of words.
Lau and Baldwin [179] present a study that empirically evaluates the quality of document
embeddings learnt by doc2vec, and demonstrate that the model performs robustly when
trained over large datasets.
Recurrent neural networks learn compositionality by recursively combining the
representation of an input word with those of its precedents, with the intuition that the meaning
of a complex expression can be assembled by sequentially combining the meaning of each
constituent word. The recurrent architecture allows the network to take inputs of
arbitrary lengths, naturally offering an advantage for learning representations of multi-word
expressions of various lengths. In addition, the network learns patterns in sequential
data, which implicitly encodes syntactic information.
The Elman network [180], a simple recurrent neural network, has been employed to learn
compositional semantics by many researchers [16, 181, 182]. Figure 2.6 shows a general
architecture5. The inputs to the network are word embeddings, and W, U, and V are
parameters shared across time steps. At time step t, the input x_t is the embedding
vector corresponding to the current word, and the output is the hidden state S_t = f(Ux_t + WS_{t−1}),
where f is an activation function such as the sigmoid or hyperbolic tangent. The hidden state
S_t composes the output of the previous state S_{t−1} with the newly input value x_t,
and thus represents the compositional meaning of an n-gram input. At any time step, the
network can optionally output y_t = g(VS_t), where g is an output function such as
softmax. This enables the network to produce multiple outputs, forming a sequence-to-sequence
architecture that allows the network to perform complex NLP tasks, such as
machine translation. Figure 2.7 shows common architectures, including multiple-to-one
output, one-to-multiple outputs, and multiple-to-multiple.
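One Elman-style recurrent step S_t = f(U x_t + W S_{t−1}) can be sketched as follows (random weights; the embedding and hidden sizes are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

d, h = 2, 4               # embedding size, hidden size
U = rng.normal(scale=0.5, size=(h, d))
W = rng.normal(scale=0.5, size=(h, h))

def step(x_t, s_prev):
    """One recurrent step: fold the new word embedding x_t
    into the running state s_prev."""
    return np.tanh(U @ x_t + W @ s_prev)

# Process a 3-word sequence of (random) embeddings.
s = np.zeros(h)
for x in rng.normal(size=(3, d)):
    s = step(x, s)

print(s.shape)  # the final state summarises the whole sequence
```

Because the same U and W are reused at every step, the state s can absorb a sequence of any length, which is the advantage described above.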
The application of more sophisticated recurrent networks has also been investigated.
Sutskever et al. [17] use a Long Short-Term Memory (LSTM) network with a multiple-to-multiple
architecture to train an English-to-French machine translation model.
The network maps an input sequence (English words) to the corresponding word
embedding vectors, and then decodes the vectors into the target sequence (French words). More
recently, Cho et al. [97] introduce a new type of recurrent network, namely the Gated
Recurrent Unit (GRU), a variant of the LSTM network that combines the forget gate
and input gate of the LSTM into a single update gate. The GRU network has an encoder
and a decoder, similar to the multiple-to-multiple architecture: the encoder
is responsible for encoding the input sequence into a composed vector representation,
and the decoder transforms the vector into the target sequence. They train the
network to perform different tasks, including machine translation, language modelling, and
learning phrase representations, demonstrating that the network is capable of learning
semantically and syntactically meaningful representations of phrases.
Recursive neural networks, introduced by Goller and Küchler [183], use
shared weights to compute outputs recursively over a structure by traversing it in
topological order. Because of this special recursive architecture, the network encodes
the structural combination of the input data. Socher and colleagues [18–20, 184] first
proposed using recursive neural networks to model compositional semantics. The key
idea is that the English language naturally has recursive tree structures that can be
5 We present a general architecture of the Elman network; there may be small variations in different studies depending on the actual task.
Figure 2.7: Common Recurrent Neural Network sequence-to-sequence architectures: (a) multiple to one, (b) one to multiple, (c) multiple to multiple.
Figure 2.8: Recursive Neural Network: fitting the structure of the English language (the parse tree of the cat is running in the bedroom).
captured by the recursive neural network. Figure 2.8 shows the architecture of a recursive
neural network that fits the tree structure of the sentence the cat is running in the
bedroom.
Socher et al. [18] use a simple recursive network to learn phrase representations for
syntactic parsing. The inputs to the network are embedding vectors of d dimensions, and
the network has shared weights W ∈ R^{d×2d}. At each recursive iteration, the network
takes two inputs x_1 and x_2 from the child nodes of a tree node and concatenates them. The
output is a composed vector p ∈ R^d representing the compositional meaning of the input
words: p = f(W[x_1; x_2]). The network recursively computes outputs by traversing
the tree. However, because this network has a small number of free parameters – it uses
shared weights for all recursions – it may not be able to encode enough information for
large datasets. In their later study, Socher et al. [19] propose using more parameters to
learn the semantic compositionality. Instead of having only one vector representation
Figure 2.9: Image Processing Convolution
for each word, the proposed Matrix-Vector Recursive Neural Network (MVRNN) model
uses a pair of a vector and a matrix to represent each word. The idea is to let the vectors
capture the meanings of words, while the matrices encode the rules for composing them.
Specifically, each word w is associated with a vector v_w ∈ R^d and a matrix M_w ∈
R^{d×d}. The compositional meaning of two input words w_1 and w_2 is computed as
p = f([v_{w1} M_{w2}; v_{w2} M_{w1}]), and the composed matrix for M_{w1} and M_{w2} is computed
as P = f([M_{w1}; M_{w2}]). Similarly, the computation traverses the entire tree to output
the overall compositional meaning of the input words. Socher et al. [20] further present a
recursive tensor model, which replaces the weight matrix W in the MVRNN model with
a third-order tensor, yielding the best performance of the three models.
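The basic composition p = f(W[x_1; x_2]) of the simple recursive network can be sketched as follows (random weights and random stand-in "embeddings"; the bracketed phrase structure is chosen for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 3
W = rng.normal(scale=0.5, size=(d, 2 * d))   # shared composition weights

def compose(x1, x2):
    """Compose two child vectors into one parent vector:
    p = f(W [x1; x2]) with f = tanh."""
    return np.tanh(W @ np.concatenate([x1, x2]))

# Compose the phrase ((the cat) (is running)) bottom-up.
the, cat, is_, running = rng.normal(size=(4, d))
np_phrase = compose(the, cat)        # noun phrase
vp_phrase = compose(is_, running)    # verb phrase
sentence = compose(np_phrase, vp_phrase)

print(sentence.shape)  # every node, leaf or internal, lives in R^d
```

Because leaves and internal nodes share the same dimensionality d, the same function compose can be applied recursively at every node of a parse tree.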
Following Socher's work, Zhao et al. [185] present an unsupervised general model
specifically for learning multi-word phrase embeddings. The model merges the recursive
tensor network [20] and the word2vec model [15], learning the compositional
meaning of a phrase by predicting its surrounding words. Irsoy and Cardie [186] propose
a deep recursive neural network constructed by stacking multiple recursive layers on top
of each other. The idea is to create a recursive network that not only has a deep
recursive structure, but also features depth in space. However, recursive networks rely
on syntactic parsers to work, which may be a drawback since an extra parsing step is required.
Convolutional neural networks (CNNs) are popular in image processing due to their
efficiency in capturing the location invariance and compositionality of image pixels. A
typical convolution process in image processing is shown in Figure 2.9, where filters
slide over local regions of an image to learn location invariance and compositionality.
However, the location feature does not apply to word embeddings. Therefore, instead of
filtering different regions, learning semantic compositionality uses a one-dimensional
convolution process. We classify the one-dimensional convolution process into two types:
embedding vector-wise convolution and embedding feature-wise convolution.
The embedding vector-wise convolution is more popular and intuitive; it treats
the features in each word embedding vector as a whole, assuming no dependencies between
the features of different word embedding vectors. Mathematically, the vector-wise
convolution, shown in Figure 2.10 (A), concatenates word embedding vectors into a
Figure 2.10: Word Embedding Convolution Process. (A): embedding vector-wise convolution, (B): embedding feature-wise convolution.
larger one-dimensional vector. The filters are weight vectors w ∈ R^{d×l}, where d is
the size of the word embedding vectors and l is the length of the window (each
convolution spans l words). A single convolution is computed as:
ci = f(wTxi:i+l−1 + b)
where x_{i:j} denotes the words from position i to j. The output is a vector c ∈ R^{n−l+1}, where n is the
length of the input. A more intuitive way to understand the vector-wise convolution
is to treat the concatenation process as stacking word embedding vectors into a matrix,
with the filters being matrices of dimension w ∈ R^{d×l}; all filters have
the same height d as the word embedding vectors, so the convolution process is
the same as in image processing, but only able to convolve along the l (word) dimension, i.e.
a one-dimensional convolution.
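The vector-wise one-dimensional convolution can be sketched as follows (random embeddings and a single filter; for convenience we store the filter as an l × d array rather than a concatenated vector):

```python
import numpy as np

rng = np.random.default_rng(0)

d, n, l = 4, 6, 3                 # embedding dim, sentence length, window size
X = rng.normal(size=(n, d))       # one embedding vector per word
w = rng.normal(size=(l, d))       # a single filter spanning l whole embeddings
b = 0.0

def vector_wise_convolution(X, w, b):
    """Slide the filter over every window of l consecutive word
    embeddings; each window yields one feature value."""
    n, l = len(X), len(w)
    return np.array([
        np.tanh(np.sum(w * X[i:i + l]) + b)
        for i in range(n - l + 1)
    ])

c = vector_wise_convolution(X, w, b)
print(c.shape)  # (n - l + 1,) = (4,)
```

In a real model many such filters run in parallel and their outputs are pooled, but each produces exactly this length-(n − l + 1) feature vector.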
In contrast to the vector-wise convolution, the feature-wise convolution assumes that
each dimension of the word embedding vectors encodes a similar feature across words;
hence, feature values are independent within a word vector but correlated across different
word vectors. The feature-wise convolution first stacks word embedding vectors into a
matrix, where the columns are word embedding vectors and the rows are the embedding values
at each dimension, as shown in Figure 2.10 (B). M is a matrix of weights of size d × l.
However, unlike the vector-wise convolution, where the matrix is the weight of one filter,
in the feature-wise convolution the matrix can be thought of as a collection of filters,
where each filter is a vector of size l. The convolution is performed by letting each
filter slide over its associated row of the matrix, and the output is a matrix of size
d × (n − l + 1).
Collobert and Weston [172] use a Time-Delay Neural Network (TDNN) [187] to perform
the vector-wise convolution. The TDNN is a special case of the CNN, which
has one fixed-size filter sharing its weights along the temporal dimension. They apply the
network to six NLP tasks, two of which are related to learning semantic compositionality:
semantic role labelling and semantic relatedness prediction. Kim [26]
reports a number of experiments using the vector-wise convolution with one convolutional
layer and 1-max pooling. The model is trained to perform sentiment analysis
and topic categorisation tasks, producing new state-of-the-art results on 4 out of
7 datasets, which shows the power of the CNN. However, the CNN
requires inputs of a fixed size. Kim [26] addresses this issue by predefining a fixed
length for all sentences, and padding zero vectors to the input matrix if the length of
a sentence is less than the fixed length. Kalchbrenner et al. [188] introduce a dynamic
k -Max pooling to handle input sentences of varying lengths using the feature-wise con-
volution. Given a value k and a vector v ∈ Rd where d ≥ k, the k-Max pooling extracts
k maximum values from v. The dynamic k-Max pooling sets k dynamically using the
function k = max(ktop, ⌈((L − l)/L) · s⌉), where l is the index of the current
convolutional layer, L is the total number of convolutional layers, and s is the length of
the input sentence. ktop is fixed at the top convolutional layer so that it outputs a
fixed-length vector, and the output layer is a fully connected layer. This strategy allows
input sentences of various lengths, as long as the maximum filter width l ≤ ktop. For
example, in a 3-layer convolutional network with ktop = 3, if the input sentence has
6 tokens, then k1 = 4 at the first pooling layer, k2 = 3 at the second, and ktop = 3.
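The pooling schedule can be sketched as follows; the ceiling and the helper names are our own additions, following Kalchbrenner et al.'s formulation [188].

```python
import math

def dynamic_k(layer, total_layers, sent_len, k_top):
    """k for the pooling layer at depth `layer` out of `total_layers`
    convolutional layers, for an input sentence of `sent_len` tokens."""
    return max(k_top, math.ceil((total_layers - layer) / total_layers * sent_len))

def k_max_pool(v, k):
    """Keep the k largest values of v, preserving their original order."""
    top = sorted(sorted(range(len(v)), key=lambda i: v[i])[-k:])
    return [v[i] for i in top]
```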
2.2.2.1 Learning Meanings of Documents
Documents usually contain many more words than phrases and sentences do. One common
approach is to firstly learn the meanings of sentences, and then learn the representa-
tions of documents based on the learnt sentence representations. Kalchbrenner and
Blunsom [189] use a recurrent-convolutional network to learn the compositionality of
discourses. The model consists of two networks: a CNN and a recurrent network. The
CNN network is responsible for learning the meanings of sentences. For each sentence st
in the document, the CNN network outputs a vector representation of the sentence.
The recurrent network takes two inputs, the vector representation of sentence st from
the convolutional network and the output of its previous hidden state st−1, to predict
a probability distribution over the current label. Similarly, Tang et al. [190] propose to
use a LSTM or a CNN network to learn the meaning of a sentence, then use a GRU
network to adaptively encode semantics of sentences and their relations for document
representations.
Using the same neural network architecture to learn the representations of both
sentences and documents has also been attempted. Denil et al. [191] use a single CNN
network to train convolution filters hierarchically at both the sentence and document
level, intending to transfer lexical features from word embeddings into high-level semantic
concepts. The word embedding vectors of a sentence are stacked into a matrix, on which
the convolution process is performed to output feature maps representing the latent
features learnt from the sentence. Following a max pooling operation, the sentence
embeddings are then stacked into a document matrix, on which another convolution process
is performed to output the document embedding vectors. Ganesh et al. [192] propose
two probabilistic language models implemented using the same feed-forward network
architecture to jointly learn the representations of sentences and documents.
It is worth mentioning that word-sentence-document models are not the only way of learning
document representations. As mentioned earlier, doc2vec [178] is also commonly used for
learning document embeddings. Dai et al. [193], and Lau and Baldwin [179] conduct
empirical evaluations of doc2vec, and both demonstrate the model performs significantly
better than other baselines.
2.3 Conclusion
In this chapter, we first reviewed common approaches to the automatic keyphrase
extraction task. Approaches for automatically extracting keyphrases from documents
can be grouped into two categories: unsupervised ranking and supervised machine
learning approaches. Both require manually selected features to represent phrases.
However, unsupervised ranking approaches use fewer features than supervised machine
learning approaches, and are therefore less dependent on the choice of features. In
the next chapter, we will present a systematic evaluation of different unsupervised AKE
algorithms, in order to gain a better understanding of the strengths and weaknesses of
unsupervised AKE algorithms.
In the second part of this chapter, we provided a general introduction to representation
learning in NLP using deep neural networks. Deep learning is a relatively new research
field, which focuses particularly on automatically learning features of words, multi-word
phrases, sentences, and documents using deep neural networks, including conventional
feed-forward, recurrent, recursive, and convolutional networks, and some
hybrid networks. However, learning the representations of phrases and documents still
remains a great challenge for the NLP community. In Chapters 5 to 7, we will present a
series of deep learning models to automatically learn the representations of phrases and
documents.
Chapter 3
Conundrums in Unsupervised
AKE
In Chapter 2.1, we reviewed common approaches to AKE, which can be grouped
into two categories: unsupervised ranking and supervised machine learning approaches.
In comparison to supervised AKE, unsupervised approaches use fewer features, making
them less dependent on the choice of features. Hence, in this chapter, we focus on
unsupervised AKE.
Approaches to unsupervised AKE typically consist of candidate phrase identification,
ranking, and scoring processes. Each step plays a critical role in the processing pipeline
and can significantly affect the overall performance of an AKE system. However,
most studies evaluate their AKE algorithms from a system point of view,
treating all steps as a whole, so the efficiency and effectiveness of each process have
not been precisely identified.
In this chapter, we conduct a systematic evaluation to gain a precise understanding of
unsupervised AKE approaches. We evaluate four popular ranking algorithms, five can-
didate identification techniques, and two scoring approaches, by combining each ranking
algorithm with different candidate identification and scoring techniques. We show how
different techniques employed at each step affect the overall performance. The evalua-
tion also reveals the common strengths and weaknesses of unsupervised AKE algorithms,
which provides a clear pathway for seeking further improvements.
3.1 Introduction
Unsupervised AKE approaches assign a score to each phrase, indicating the phrase's
likelihood of being a keyphrase. The output is a list of phrases with their
corresponding scores, from which a number of top-scored phrases are extracted as
potential keyphrases. Hence, how a score is assigned to a phrase decides the sequence of
processing data – the processing pipeline. A phrase's score can be assigned directly by
an unsupervised AKE algorithm, which takes the whole phrase as the ranking element.
Alternatively, the phrase's score can be the sum of the scores of its constituent words,
in which case the AKE algorithm takes words as the ranking elements. We name the
former scoring approach direct phrase ranking, and the latter phrase score
summation. Figure 3.1 provides an overview of the two system processing pipelines1.
The two system processing pipelines, shown in Figure 3.1, can produce very different
scores for the same phrase in a document, even with the same candidate phrase identi-
fication technique and AKE ranking algorithm. For example, if a phrase neural network
appears only once in a journal article, it may be assigned a very low score by the direct
phrase ranking pipeline due to its low frequency. However, the word network may co-occur
hundreds of times with other words, such as recurrent, feed-forward, and convolutional.
Using the phrase score summation pipeline, the phrase neural network will then be assigned
a reasonably high score, because the word network gains an extremely high score.
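The difference between the two pipelines can be sketched with toy scores (the numbers below are invented purely for illustration):

```python
def direct_phrase_score(phrase, phrase_scores):
    """Direct phrase ranking: the whole phrase is the ranking element."""
    return phrase_scores.get(phrase, 0.0)

def summed_phrase_score(phrase, word_scores):
    """Phrase score summation: sum the constituent words' scores."""
    return sum(word_scores.get(w, 0.0) for w in phrase.split())
```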
Two common steps of both processing pipelines are candidate phrase identification and
ranking. The candidate phrase identification process identifies syntactically valid phrases
from documents. It acts as a filter that prevents noisy data, such as non-content-bearing
words, from entering the system, providing the candidates from which keyphrases
are extracted. Hence, a phrase not in the candidate phrase list will never be scored and
extracted.
The ranking process is performed by an unsupervised AKE ranking algorithm, which
carries out the main computation and assigns scores to words or phrases. Unsupervised
AKE ranking algorithms directly decide what will be extracted as keyphrases. Over
the past two decades, a number of unsupervised AKE ranking algorithms have been
developed, such as KeyGraph (1998) [35], TextRank (2004) [3], ExpandRank (2009) [10],
RAKE (2010) [45], and TopicRank (2013) [78].
Each step, including candidate phrase identification, ranking, and phrase scoring, plays
a critical role that affects the overall performance of an AKE system. However, most
studies of unsupervised AKE only focus on the ranking algorithms, with little discussion
1AKE systems require a text cleaning and normalising process, which often involves cleaning noisy data by applying dataset-dependent heuristics; thus we do not take this process into consideration.
(a) Direct Phrase Ranking Processing Pipeline
(b) Phrase Score Summation Processing Pipeline
Figure 3.1: Phrase Scoring Processing Pipelines
on how the other processes are implemented. This makes it difficult to understand how the
claimed improvements are achieved, let alone to identify whether they come from
the candidate identification approach, the ranking algorithm, or the scoring technique.
Although some research has reported the importance of the candidate phrase identifi-
cation process, there is a lack of a comprehensive and systematic study on exactly how
each step along the processing pipeline affects the overall extraction results in unsu-
pervised AKE. Hulth (2003) [5] presents a comparison between three candidate phrase
identification approaches: N-grams, POS patterns, and noun-phrase (NP) chunking, but
on a supervised AKE algorithm. More recently, some studies have proposed refined candidate
selection approaches, but their experiments are conducted in conjunction with their own
ranking algorithms, blurring the effect of the candidate selection approaches. For
example, Kumar and Srinathan (2008) [194] present an approach that prepares a dictio-
nary of distinct N-grams using LZ78 data compression, which produces better results.
Kim and Kan (2009) [34], and Wang and Li (2010) [195] also focus on their refined
NP chunking approaches. However, these studies neither explore the capability of their
refined phrase identification approaches in combination with other ranking approaches, nor
compare their approaches with similar approaches side by side. In addition, to the
best of our knowledge, no study has examined the impact of phrase scoring techniques.
In this chapter, we aim to have a clear understanding of how different techniques at each
step may affect the overall performance. We re-implemented five phrase identification
approaches, four ranking algorithms, and two scoring techniques. We conduct two eval-
uations on three datasets with documents of varying lengths. The first evaluation is on
the performance of phrase identification approaches by analysing the coverage on the
human-assigned keyphrases (the ground-truth). In the second evaluation, we analyse the
performance of the four ranking algorithms by coupling them with different phrase identification
and scoring approaches.
3.2 Reimplementation
We reimplemented five common candidate phrase identification approaches, including
Phrase as Text Segment Splitter (PTS), N-gram Filter, NP Chunker, Prefixspan [55], and
C-value [56]. For ranking algorithms, we have reimplemented Term Frequency Inverse
Document Frequency (TF-IDF) [2], Rapid Automatic Keyword Extraction (RAKE) [45],
TextRank [3], and Hyperlink-Induced Topic Search (HITS) [88].
3.2.1 Phrase Identifiers
3.2.1.1 PTS Splitter
The PTS Splitter uses stop-words and punctuation marks (excluding hyphens) as delim-
iters to split a sentence into candidate phrases. For example, creating an information
architecture for a bilingual web site will produce three candidates, creating, information
architecture, and bilingual web site. The processing sequence is as follows:
1. convert an input text to lowercase.
2. split the text into candidate phrases using stop-words and punctuation marks.
3. stem the identified candidate phrases using Porter’s algorithm [161].
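The steps above can be sketched as follows; the stop-word list is a tiny illustrative subset, and the Porter stemming step is omitted for brevity:

```python
import re

STOP_WORDS = {"a", "an", "for", "the", "of", "and"}   # illustrative subset only

def pts_split(text, stop_words=STOP_WORDS):
    """Split text into candidate phrases, using stop-words and
    punctuation marks (except hyphens) as delimiters."""
    tokens = re.findall(r"[\w-]+|[^\w\s]", text.lower())
    phrases, current = [], []
    for tok in tokens:
        # A stop-word or a punctuation token ends the current phrase.
        if tok in stop_words or not re.match(r"[\w-]", tok):
            if current:
                phrases.append(" ".join(current))
            current = []
        else:
            current.append(tok)
    if current:
        phrases.append(" ".join(current))
    return phrases
```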
3.2.1.2 N-gram Filter
The N-gram filter is built upon the PTS Splitter. It takes outputs from the PTS Splitter
as inputs, and then generates all possible sequential combinations of each candidate, where
each combination has at least two tokens. For example, a phrase bilingual web site will
generate bilingual web, web site, and bilingual web site. After generating all N-grams
for the inputs, heuristics are applied to remove unwanted combinations. The processing
sequence is as follows:
1. use the outputs from the PTS Splitter as the inputs.
2. generate all N-grams for each input and save them into list L, where 2 ≤ n ≤ length of an input.
3. sort L by the frequencies of each N-gram.
4. for each N-gram p ∈ L, if there exists any N-gram g ∈ L where p is a substring
of g and freq(p) ≤ freq(g), remove p; otherwise remove g.
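The N-gram generation step can be sketched as follows (the frequency-based pruning of step 4 is omitted; the function name is our own):

```python
def generate_ngrams(phrase, n_min=2):
    """All contiguous token N-grams of a candidate phrase with
    n >= n_min, up to the full length of the phrase."""
    toks = phrase.split()
    return [" ".join(toks[i:i + n])
            for n in range(n_min, len(toks) + 1)
            for i in range(len(toks) - n + 1)]
```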
3.2.1.3 Noun Phrase Chunker
The Noun Phrase (NP) Chunker discards tokens not fitting into the predefined POS
pattern. We choose a simple but widely used pattern <JJ>*<NN.*>+ [3, 8, 44, 52],
which finds phrases that begin with zero or more adjectives, followed by one or more
nouns. The processing sequence is as follows:
1. convert an input text to lowercase.
2. tokenise the text.
3. tag the text using the Stanford Log-linear POS tagger [196].
4. chunk candidate phrases using the predefined POS pattern.
5. stem the identified candidate phrases using Porter’s algorithm [161].
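The chunking step can be sketched as a small state machine over already-tagged (token, tag) pairs; tagging itself (steps 1–3) is assumed to have been done upstream, and stemming is again omitted:

```python
def np_chunk(tagged):
    """Chunk candidates matching <JJ>*<NN.*>+ from (token, POS-tag) pairs."""
    phrases, adjs, nouns = [], [], []
    for word, tag in tagged + [("", ".")]:      # sentinel flushes the last chunk
        if tag.startswith("NN"):
            nouns.append(word)                  # one or more nouns
        elif tag == "JJ":
            if nouns:                           # a noun run just ended: emit it
                phrases.append(" ".join(adjs + nouns))
                adjs, nouns = [], []
            adjs.append(word)                   # leading adjectives
        else:
            if nouns:
                phrases.append(" ".join(adjs + nouns))
            adjs, nouns = [], []                # adjectives with no noun are dropped
    return phrases
```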
3.2.1.4 Prefixspan
Prefixspan, introduced by Pei et al. [55], is a sequential pattern mining algorithm, which
identifies frequent sub-sequences from a given set of sequences. It can be adapted to
identify frequent candidate phrases from a document [54].
Formally, let I = {i1, i2, ..., in} be a set of items. An itemset sj is a subset of I. A
sequence is denoted as s = 〈s1s2...sl〉, where sj ⊆ I for 1 ≤ j ≤ l is an element of s, and
sj = (x1x2...xm), where xk ∈ I for 1 ≤ k ≤ m is an item. A sequence α = 〈a1a2...an〉 is
said to be a sub-sequence of another sequence β = 〈b1b2...bm〉 (and β is a super-sequence
of α), denoted α ⊑ β, if there exist integers 1 ≤ j1 ≤ j2 ≤ ... ≤ jn ≤ m such that
a1 ⊆ bj1, a2 ⊆ bj2, ..., an ⊆ bjn. A sequence with l instances of items is called an l-length
sequence. A sequence database S is a list of tuples 〈sid, s〉 where sid is the ID of sequence
s. A tuple contains a sequence t if t is a sub-sequence of s. A support of t is the number
of tuples in S that contains t, and t is considered to be frequent if its support is greater
than a user-defined threshold min sup.
Given a sequence database (a set of sequences), instead of considering all the possible
occurrences of frequent sub-sequences, Prefixspan only focuses on the frequent prefixes
because all frequent sub-sequences can be identified by growing frequent prefixes. Given
a sequence α = 〈a1a2...an〉, and a sequence β = 〈b1b2...bm〉 where m ≤ n, β is only
said to be a prefix of α when it satisfies three conditions: 1) bi = ai for i ≤ m − 1; 2)
bm ⊆ am; and 3) all the items in (am − bm) are alphabetically listed after those in bm.
For example, 〈a〉, 〈ab〉, and 〈aa〉 are prefixes of the sequence 〈aabc〉, but 〈ac〉 is not.
Prefixspan takes a sequence database S and a threshold min sup as inputs, and out-
puts a list of frequent patterns with their supports. It recursively calls a function
Prefixspan(α, l, S|a), where the parameters are initialised by setting α = 〈〉, l = 0, S|a =
S. The iteration is as follows:
1. scan S|a to find a set of frequent items b such that b can be assembled to the last
element of α to form a sequential pattern, or 〈b〉 can be appended to α to form a
sequential pattern.
2. for each frequent item b, append it to α to form a sequential pattern α′, and
output α′.
3. for each α′, construct the α′-projected database S|α′ and call Prefixspan(α′, l + 1, S|α′).
To use Prefixspan to identify candidate phrases, we treat a token as an item, and the
corpus vocabulary as the set of all items. We could also treat a sentence as a sequence.
However, our goal is to identify candidate phrases, and treating a sentence as a
sequence directly may introduce many unwanted sub-sequences, which can contain
stop-words or punctuation. Therefore, we use the outputs from the PTS Splitter and
treat each text segment as a sequence. The outputs from Prefixspan are frequent
sub-sequences, which are the candidate phrases. The processing steps are as follows:
1. run the PTS Splitter over a corpus to obtain the text segments, and save them
into S as the sequence database
2. set min sup = 2 and call Prefixspan(α, l, S|a), where the parameters are ini-
tialised by setting α = 〈〉, l = 0, S|a = S.
3. save outputs from Prefixspan into a list C.
4. sort C by the number of tokens each sequence contains.
5. for each c ∈ C, if there exists any sub-sequence c′ of c with support(c′) > support(c)
and length(c′) ≥ 2, remove c from C.
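A compact sketch of Prefixspan for the single-item-element case used here (tokens as items, PTS text segments as sequences) is shown below; we treat "frequent" as support ≥ min_sup, and the function names are our own:

```python
def prefixspan(sequences, min_sup):
    """Frequent sequential patterns (single-item elements) with supports.
    Returns a dict mapping each pattern (a tuple of tokens) to its support."""
    patterns = {}

    def project(db, item):
        # Keep, for each sequence, the suffix after the first occurrence of item.
        out = []
        for seq in db:
            for i, x in enumerate(seq):
                if x == item:
                    out.append(seq[i + 1:])
                    break
        return out

    def grow(prefix, db):
        # Count, per sequence, which items occur; recurse on the frequent ones.
        counts = {}
        for seq in db:
            for item in set(seq):
                counts[item] = counts.get(item, 0) + 1
        for item, sup in counts.items():
            if sup >= min_sup:
                pat = prefix + (item,)
                patterns[pat] = sup
                grow(pat, project(db, item))

    grow((), sequences)
    return patterns
```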
3.2.1.5 C-value
C-value [56] is an approach for identifying domain-specific terms using both linguistic and
statistical information. It distinguishes a phrase from its longer cousins – super-strings of
the phrase. C-value takes as input a set of candidate phrases chunked by pre-defined POS
patterns, and outputs a list of scored phrases, where the scores indicate the
likelihood of being a valid phrase.
The processing sequence is as follows:
1. run the NP Chunker over a dataset to obtain the candidates and their frequencies,
then save them in dictionary D.
2. compute the C-value scores for each d ∈ D using equation 2.1.9.1.
3. sort D by |d|.
4. for each d ∈ D, if there exists any d′ appearing as a part of d (i.e. d′ is a
substring of d) with Cvalue(d′) > Cvalue(d) and |d′| ≥ 2, remove d from D.
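Since Equation 2.1.9.1 is not reproduced here, the sketch below uses the standard C-value formula of Frantzi et al. [56], assuming multi-word candidates (|a| ≥ 2) represented as token tuples:

```python
import math

def c_value(freqs):
    """C-value scores (Frantzi et al.'s formula) for multi-word candidates.
    `freqs` maps a phrase (a tuple of tokens, length >= 2) to its frequency."""
    def contains(b, a):
        # True if phrase a occurs contiguously inside the longer phrase b.
        return any(b[i:i + len(a)] == a for i in range(len(b) - len(a) + 1))

    scores = {}
    for a, f_a in freqs.items():
        # Frequencies of the candidates that properly contain a.
        nested_in = [f for b, f in freqs.items() if len(b) > len(a) and contains(b, a)]
        if nested_in:
            scores[a] = math.log2(len(a)) * (f_a - sum(nested_in) / len(nested_in))
        else:
            scores[a] = math.log2(len(a)) * f_a
    return scores
```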
3.2.2 Ranking Algorithms
Ranking algorithms usually organise documents into their own representations to ease
further processing2. Figure 3.2 shows three types of document representations used by
TF-IDF, TextRank and HITS, and RAKE, respectively.
3.2.2.1 TF-IDF
TF-IDF is a weighting scheme that statistically analyses how important a phrase is to
an individual document in a corpus. The underlying intuition is that a frequent phrase
distributed evenly across a corpus is not a good discriminator, so it should be assigned
a lower weight (score). In contrast, a phrase occurring frequently in a few particular
documents should be assigned more weight.
TF-IDF constructs a Phrase-Document Matrix prior to scoring, where each row corre-
sponds to a phrase in the corpus, and each column represents a document. The value in
a cell represents the frequency of the phrase in the corresponding document. Figure 3.2
(B) shows an example, where phrases are identified by the NP Chunker.
The TF-IDF score of a phrase is calculated as the product of two statistics: the phrase’s
TF score and its IDF score. The TF score indicates the importance of a phrase within
the document in which it appears – a higher frequency gains a higher TF score. The
IDF score corresponds to the importance of a phrase across the corpus – a phrase occurring
frequently in a large number of documents gains a lower IDF score. Let t denote a
phrase; d denote a document; and D denote a corpus; where t ∈ D, and d ∈ D, then
the TF-IDF score is computed as:
tfidf(t, d,D) = tf(t, d)× idf(t,D) (3.1)
2These representations, however, are not used for the actual computations. We will discuss this issue in detail in Chapter 7.
Figure 3.2: Document Representations: (A) a sample dataset contains only two documents; candidate phrases are identified by the NP Chunker as the inputs to each ranking algorithm. (B) Doc1 and Doc2 are represented in a Phrase-Document Matrix used by TF-IDF. (C) Doc1 is represented in a co-occurrence graph used by TextRank and HITS; two phrases are connected if they co-occur in the same sentence. (D) Doc2 is represented in a co-occurrence graph. (E) Doc1 is represented in a Phrase Co-occurrence Matrix used by RAKE; phrases co-occur if they appear in the
same sentence. (F) Doc2 is represented in a Phrase Co-occurrence Matrix.
We use the weighting scheme introduced by Jones [2]:

tfidf(ti) = tfi × idfi = tfi × log( |D| / |{d ∈ D : ti ∈ d}| )   (3.2)
where tfi is the number of times phrase ti occurs in d, |D| is the total number of
documents in corpus D, and | {d ∈ D : ti ∈ d} | is the number of documents in which
phrase ti occurs.
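The scheme can be sketched as follows; Equation 3.2 leaves the logarithm base unspecified, so this sketch uses the natural logarithm:

```python
import math

def tfidf_scores(docs):
    """Per-document TF-IDF scores (Equation 3.2, natural log) for a corpus.
    `docs` is a list of documents, each a list of candidate phrases."""
    n_docs = len(docs)
    df = {}                                   # document frequency of each phrase
    for doc in docs:
        for p in set(doc):
            df[p] = df.get(p, 0) + 1
    scored = []
    for doc in docs:
        tf = {}
        for p in doc:
            tf[p] = tf.get(p, 0) + 1          # raw term frequency in this document
        scored.append({p: tf[p] * math.log(n_docs / df[p]) for p in tf})
    return scored
```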
3.2.2.2 RAKE
RAKE [45] is a statistical approach based on analysing phrase frequencies and co-
occurrence frequencies. In contrast to TF-IDF, which conducts its statistical analysis
over a corpus, RAKE only uses the statistical information of each individual document.
RAKE constructs a Co-occurrence Matrix for representing a document, and the co-
occurrence relation is defined as:
1. if inputs are pre-identified phrases, two phrases are said to be co-occurring if they
appear in the same window. The window size can be an arbitrary number typically
from 2 to 10, or just a natural sentence. For example, in Figure 3.2 (E), phrases
are co-occurring if they appear in the same sentence.
2. if inputs are individual words, two words co-occur if they appear in the same
candidate phrase identified by a phrase identifier. For example, in a sentence:
“information interaction provides a framework for information architecture”, the
identified phrases are: information interaction, framework, information architec-
ture. Then the co-occurrences are: information and interaction co-occur once, and
information and architecture co-occur once.
Concretely, the occurrence frequency of a candidate phrase c is denoted freq(c), and
its co-occurrence frequency with other candidates is denoted deg(c) (the degree of
candidate c). The score is computed as the ratio of degree to frequency:

Score(c) = deg(c) / freq(c)   (3.3)

For example, in Figure 3.2 (E), deg(form interact) = 6 and freq(form interact) = 2,
so the overall score Score(form interact) = 3.
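The word-level computation can be sketched as follows; deg(w) here counts a word's co-occurrences with the other words of each candidate phrase containing it, which is one common formulation (variants also count the word itself):

```python
def rake_word_scores(phrases):
    """Word-level RAKE scores from candidate phrases (lists of words)."""
    freq, deg = {}, {}
    for phrase in phrases:
        for w in phrase:
            freq[w] = freq.get(w, 0) + 1
            deg[w] = deg.get(w, 0) + len(phrase) - 1   # co-occurrences in the phrase
    return {w: deg[w] / freq[w] for w in freq}

def rake_phrase_score(phrase, word_scores):
    """Phrase score summation: a phrase's score is the sum of its words' scores."""
    return sum(word_scores[w] for w in phrase)
```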
3.2.2.3 TextRank
TextRank [3] is a graph-based ranking algorithm. A document is first represented as an
undirected graph, where phrases (or individual words, if using phrase score summation)
correspond to vertices, and edges represent co-occurrence relations: two vertices are
connected if their co-occurrence is found within a predefined window size. For example,
in Figure 3.2 (C) and (D), two phrases are connected if they appear in the same sentence.
TextRank implements the idea of ‘voting’. When a vertex vi links to another
vertex vj , vi casts a vote for vj ; the higher the number of votes vj receives, the
more important vj is. Moreover, the importance of the vote itself is also considered by
the algorithm: the more important the voter vi is, the more important the vote becomes.
The score of a vertex is calculated based on the votes it receives and the importance of
the voters.
The TextRank algorithm introduced in the original paper [3] can be applied to both AKE
and text summarisation tasks. However, when using TextRank to extract keyphrases, the
calculation is identical to the original PageRank, except that TextRank constructs
undirected graphs for representing documents, where the in-degree of a vertex simply
equals its out-degree. Given a graph G = (V,E), let in(vi) be the set of vertices that
point to a vertex vi, and out(vi) be the set of vertices to which vi points. The score of
a vertex is calculated as:

S(vi) = (1 − d) + d × Σ_{vj ∈ in(vi)} S(vj) / |out(vj)|   (3.4)

where d is the damping factor, usually set to 0.85 [3, 89].
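Equation 3.4 can be iterated to a fixed point as sketched below (the fixed iteration count is an arbitrary choice; in practice one iterates until the scores change by less than a small threshold):

```python
def textrank(neighbors, d=0.85, iterations=50):
    """Iterate Equation 3.4 over an undirected co-occurrence graph.
    `neighbors` maps each vertex to the set of vertices it co-occurs with."""
    scores = {v: 1.0 for v in neighbors}
    for _ in range(iterations):
        scores = {v: (1 - d) + d * sum(scores[u] / len(neighbors[u])
                                       for u in neighbors[v])
                  for v in neighbors}
    return scores
```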
3.2.2.4 HITS
HITS [88] is a link analysis algorithm that ranks web pages with reference to their
degrees of hubs and authorities. The hub degree of a page is the number of links the page
points to, and the authority degree is the number of links the page receives. Hubs
and authorities are analogous to the out-degree and in-degree in PageRank.
A hub score hub(vi) of a vertex vi is computed as the sum of the authority scores of all
the vertices out(vi) that vi points to:

hub(vi) = Σ_{vj ∈ out(vi)} authority(vj)   (3.5)
Similarly, an authority score authority(vi) is the sum of the hub scores of all the vertices
in(vi) that point to vi:

authority(vi) = Σ_{vj ∈ in(vi)} hub(vj)   (3.6)
In directed graphs, the score of a vertex can be either the maximum or the average of
its hub and authority scores [197]. In undirected graphs, however, hub scores equal
authority scores. For example, in Figure 3.2 (C), both the hub score and the authority
score of the phrase inform interact are 6.

Applying HITS to AKE, we represent a document as an undirected graph whose edges
are co-occurrence relations. The algorithm assigns a pair of scores (a hub score and an
authority score) to each vertex in the graph.
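On an undirected graph the hub and authority updates coincide, so a single score per vertex suffices, as sketched below; the per-round rescaling is a practical addition to keep the iteration bounded, not part of Equation 3.6:

```python
def hits(neighbors, iterations=50):
    """Iterate Equations 3.5-3.6 on an undirected co-occurrence graph.
    `neighbors` maps each vertex to the set of vertices adjacent to it."""
    score = {v: 1.0 for v in neighbors}
    for _ in range(iterations):
        new = {v: sum(score[u] for u in neighbors[v]) for v in neighbors}
        top = max(new.values()) or 1.0        # rescale so scores stay bounded
        score = {v: s / top for v, s in new.items()}
    return score
```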
3.3 Datasets
3.3.1 Dataset Statistics
We select three publicly available datasets3: Hulth (2003) [5], DUC (2001) [52], and
SemEval (2010) [51]. The lengths of documents in each dataset vary: Hulth is a collection
of short documents – abstracts of journal articles, DUC consists of mid-length documents
– news articles, and SemEval consists of long documents – full-length journal papers.
Table 3.1 shows a summary of the selected datasets.
Table 3.1: Dataset Statistics After Cleaning
                                    Hulth            DUC            SemEval
Total number of articles            2,000            308            244
Total words in dataset              249,660          244,509        1,417,295
Average words per article           125              799            5,808
Total ground-truth phrases          27,014           2,488          3,686
Ground-truth phrases in articles    16,419 (60.8%)   2,436 (97.9%)  2,840 (77.0%)
Average keyphrases per article      8.2              7.9            11.6
Average tokens per keyphrase        2.3              2.1            2.2

*Some statistics vary slightly from the original literature [5, 51, 52] due to different pre-processing procedures.
The Hulth dataset consists of 2,000 abstracts of journal papers collected from the Inspec
database, mostly in the fields of Computer Science and Information Technology. The
dataset has training and test sets for supervised AKE. In this evaluation, we
3http://github.com/snkim/AutomaticKeyphraseExtraction
merged the training and test sets, since no training data is required for unsupervised AKE.
Each article pairs with two sets of ground-truth keyphrases, assigned by readers and by
the authors. We combine the two sets by taking their union as the final ground-truth.
The DUC dataset consists of 308 news articles. The dataset was built on DUC20014,
which is used for the document summarisation task. All articles were manually annotated
by humans. The Kappa statistic for measuring agreement among the human
annotators is 0.70.
The SemEval dataset consists of 244 full-length journal papers from the ACM Digital
Library, of which 183 articles are from Computer Science and 61 from the Social
and Behavioural Sciences. The dataset consists of training and test data. As with the
Hulth dataset, each article pairs with both reader- and author-assigned keyphrases.
We merged the training and test sets, and use the combination of reader- and author-
assigned keyphrases as our ground-truth.
3.3.2 Ground-truth Keyphrases not Appearing in Texts
One common but important issue with all three datasets is that not all ground-truth
(human assigned) keyphrases appear in the actual content of the documents, even after
text normalisation and stemming, as shown in Table 3.1. Such non-appearing keyphrases
usually do not follow the exact same word sequences as their semantically equivalent
counterparts appearing in documents. For example, the ground-truth keyphrase non-
linear distributed parameter model appears in the document as two separate phrases,
non-linear and distributed parameter model. In another example, the ground-truth
keyphrases average-case identifiability and average-case controllability appear in the
document as a single phrase, average-case identifiability and controllability.
The task of AKE is to identify keyphrases appearing in a document, and hence, we
exclude the non-appearing keyphrases from the ground-truth list.
3.3.3 Dataset Cleaning
Each dataset needs to be cleaned prior to the evaluation. The Hulth dataset does not
require any special cleaning; we therefore only removed line separators from each article.
The articles in the DUC dataset are in XML format, so we
extracted the content of each document by looking for <text> tags. In addition, any
4http://www-nlpir.nist.gov/projects/duc/guidelines/2001.html
Figure 3.3: Illustrating the sets’ relationships. Algorithm Extracted Keyphrases is a subset of Identified Candidate Phrases. Ground-truth and Identified Candidates are subsets of All Possible Grams of the document. TP: the true positive set contains the extracted phrases that match ground-truth keyphrases; FP: the false positive set contains the extracted phrases that do not match ground-truth keyphrases; FN: the false negative set contains all ground-truth keyphrases that are not extracted as keyphrases; TN: the true negative set contains the candidate phrases that are neither ground-truth nor extracted
as keyphrases.
XML tags within the content were also removed. The SemEval dataset contains journal
papers, so all mathematical symbols and equations, tables, figures, author details,
and references were removed.
3.4 Evaluation
3.4.1 Evaluation Methodology
All ground-truth keyphrases were stemmed using Porter Stemmer [161]. An assigned
keyphrase matches an extracted phrase when they correspond to the same stem sequence.
For example, information architectures matches inform architectur, but not inform or
architectur inform.
In Figure 3.3, we show the set relations of True Positive (TP), False Positive (FP),
True Negative (TN), and False Negative (FN). We employ the Precision, Recall, and
F-measure for evaluating the ranking algorithm as detailed in Chapter 2.1.7.
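The matching-based evaluation can be sketched as follows, assuming both phrase lists have already been stemmed:

```python
def precision_recall_f1(extracted, ground_truth):
    """Set-based Precision, Recall and F-measure over stemmed phrases."""
    ext, gt = set(extracted), set(ground_truth)
    tp = len(ext & gt)                          # extracted phrases matching ground-truth
    precision = tp / len(ext) if ext else 0.0
    recall = tp / len(gt) if gt else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```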
3.4.2 Evaluation 1: Candidate Coverage on ground-truth Keyphrases
We first evaluate the five candidate identification approaches described in Section 3.2.1.
This evaluation aims to discover how many ground-truth keyphrases can be identified
by different phrase identification approaches, which is the coverage on ground-truth.
Table 3.2: The Coverages on Ground-truth Keyphrases
Hulth           GT Inc.   Coverage   CandTol   Cand/Art.   GT Prop.
Prefixspan      9,612     58.5%      81,946    41          11.7%
N-gram Filter   10,018    61.0%      76,186    38          13.1%
PTS Splitter    11,002    67.0%      74,210    37          14.8%
C-value         11,178    68.1%      55,329    28          20.2%
NP Chunker      11,666    71.1%      53,685    27          21.7%

DUC
Prefixspan      1,693     69.5%      75,654    246         2.2%
N-gram Filter   1,649     67.7%      70,527    229         2.3%
PTS Splitter    1,750     71.8%      69,266    225         2.5%
C-value         2,005     82.3%      45,379    147         4.4%
NP Chunker      2,149     88.2%      44,755    145         4.8%

SemEval
Prefixspan      2,095     73.8%      268,850   1,102       0.78%
N-gram Filter   2,110     74.3%      251,538   1,031       0.84%
PTS Splitter    2,353     82.9%      247,787   1,016       0.95%
C-value         2,223     78.3%      144,466   592         1.54%
NP Chunker      2,338     82.3%      143,143   587         1.63%

GT Inc.: the number of ground-truth keyphrases included in the identified candidate phrases. Coverage: the coverage on ground-truth. CandTol: the total number of identified candidates. Cand/Art.: the average number of candidates per article. GT Prop.: the proportion of ground-truth in candidates.
The coverage on ground-truth indicates the maximum recall score that a system can
obtain. For example, the PTS Splitter identifies 67% of the total ground-truth keyphrases
in the Hulth dataset, resulting in a loss of 33% of the true positives before running any
ranking algorithm.
It is worth noting that the evaluation results presented in this chapter differ slightly
from those we previously reported [23], due to different pre-processing and ground-
truth selection processes. In [23], we cleaned the datasets (Hulth and SemEval) using
many heuristics, such as discarding any document in which not all assigned keyphrases
appear, and the DUC dataset was not included. Nevertheless, both evaluations report
similar findings.
3.4.3 Evaluation 1 Results Discussion
Of the five phrase identification approaches, the NP Chunker produces the best coverage
on assigned keyphrases on both Hulth (short document dataset) and DUC (mid-length
document dataset). The PTS Splitter produces a slightly better coverage than the NP
Chunker on SemEval (long-document dataset) and a very close coverage on Hulth. However, there
is a distinct difference on the DUC dataset. The Hulth and SemEval datasets are collections
of academic journal abstracts and full-length articles, respectively. The authors use more
formal writing, and the ground-truth keyphrases are usually technical terms (phrases).
This gives the PTS Splitter a better chance of producing a greater coverage of ground-truth
keyphrases. For example, in a sentence5: “we show that it is possible in theory to
solve for 3D lambertian surface structure for the case of a single point light source and
propose that ...”, the ground-truth keyphrases are 3D lambertian surface structure and
single point light source. In comparison to Hulth and SemEval, the DUC dataset is a
collection of news articles, where the authors use less formal but descriptive language,
appealing to the readers’ senses and helping them to imagine or reconstruct the sto-
ries. For example, a ground-truth keyphrase is yellowstone park fire, but in the actual
content of the article, it appears as greater yellowstone park fire. The PTS Splitter incor-
rectly identifies the candidate as greater yellowstone park fire, whereas the NP Chunker
identifies yellowstone park fire.
The NP Chunker and C-value are both based on POS tags. The difference is that the
C-value algorithm identifies phrases based on their frequencies, so phrases with low
frequencies are discarded. However, many ground-truth keyphrases do have very
low frequencies, which is why the C-value algorithm produces lower coverage than the
NP Chunker. Similarly, Prefixspan is also based on statistical analysis and hence, in
most cases, produces worse coverage of the ground-truth than the PTS Splitter and N-gram
Filter.
We summarise the loss (unidentified ground-truth keyphrases) into five types, shown in
Figure 3.4. The majority of the loss falls into the first and second types, where a candidate
phrase is either a substring or a superstring of a ground-truth keyphrase, as
in the aforementioned example, yellowstone park fire and greater yellowstone park
fire. Ground-truth keyphrases containing punctuation marks or stop-words are the third
and fourth types, respectively. The most common stop-words occurring in the assigned
keyphrases are of, and, on, until, by, with, for, from. The most common punctuation
mark is the apostrophe, followed by ‘.’ and ‘+’, which appear in words such as ‘.net’ and
‘C++’ in many documents. Each phrase identifier treats phrases containing punctuation
marks differently. For example, the PTS Splitter uses all punctuation marks
(except the hyphen) as delimiters, and therefore cannot identify any keyphrase containing
punctuation marks; the NP Chunker does not explicitly exclude any punctuation
mark, but words containing punctuation marks may not be recognised as valid
nouns or adjectives by the POS tagger we employed. The POS tagger also incorrectly
tags some words, mostly occurring in the Hulth dataset, which explains why the NP
Chunker and C-value identifier have larger loss rates in the fifth error type compared
with the others.
5The Hulth dataset article: 205.abstr, ground-truth keyphrase list: 205.contr, 205.uncontr
Figure 3.4: Error 1: candidate identified is too long, being a super-string of the
assigned phrase; Error 2: candidate identified is too short, being a sub-string of the
assigned phrase; Error 3: assigned phrase contains invalid characters such as punctuation
marks; Error 4: assigned phrase contains stop-words; Error 5: others.
3.4.4 Evaluation 2: System Performance
In this evaluation, we combine each candidate identification approach with the four
ranking algorithms and two phrase scoring techniques, forming different processing
pipelines. The evaluation aims to analyse how different candidate identification
and scoring approaches affect the performance of the same ranking algorithm.
3.4.4.1 Direct Phrase Ranking and Phrase Score Summation
In the direct phrase ranking pipeline, candidate phrases are scored directly after ranking.
In the phrase score summation pipeline, a phrase's score is the sum of the scores of its
constituent words:

s(P) = \sum_{w_i \in P} s(w_i)    (3.7)

where s(w) is a word's score assigned by the ranking algorithm.
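As a concrete illustration, the summation in Equation 3.7 can be sketched as follows (a minimal example; the word scores here are hypothetical values standing in for the output of a ranking algorithm):

```python
def phrase_score(phrase, word_scores):
    """Score a candidate phrase as the sum of its constituent words' scores (Eq. 3.7)."""
    return sum(word_scores.get(w, 0.0) for w in phrase.split())

# Hypothetical word scores produced by a ranking algorithm.
word_scores = {"information": 2.0, "interaction": 0.6, "architecture": 0.5, "content": 0.9}

candidates = ["information interaction", "information architecture", "content"]
ranked = sorted(candidates, key=lambda p: phrase_score(p, word_scores), reverse=True)
# Multi-word phrases containing the highly scored word "information" outrank "content".
```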
3.4.4.2 Ranking Algorithm Setup
Among the four ranking algorithms, TF-IDF and RAKE do not require any special
settings. For TextRank, we use a fixed window size of 2 for identifying co-occurrences,
which is reported to produce the best performance [3]. The initial value of each vertex
in a graph is set to 1, the damping factor is set to 0.85, the number of iterations is 30, and the
convergence threshold is 10^{-5}. For HITS, we use the same window size, number of iterations
and convergence threshold as for TextRank. The HITS score for each phrase is computed as
the average of its hub and authority scores. Finally, the top 10 ranked candidates are
selected from each result set as the keyphrases extracted by the corresponding algorithm.
Some settings may not conform to those the original authors used. Our main interest
is not to reproduce results; we focus only on analysing the potential factors that may
affect the ranking algorithms. Therefore, we do not follow the exact pipeline approaches
and heuristics described in the original papers. However, we are confident in our
reimplementation; for example, we reproduced the results for TextRank on the
Hulth dataset [3] reported in the original paper.
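The TextRank configuration above can be sketched in pure Python as follows (a minimal illustration assuming the input is a list of pre-filtered candidate words; the helper name `textrank` is ours):

```python
from collections import defaultdict

def textrank(words, window=2, damping=0.85, iterations=30, threshold=1e-5):
    """Unweighted TextRank over a co-occurrence graph built with the given window.
    Vertices start at 1, the damping factor is 0.85, at most 30 iterations run,
    and the loop breaks once the largest score change falls below the threshold."""
    neighbours = defaultdict(set)
    for i in range(len(words)):
        for j in range(i + 1, min(i + window, len(words))):
            if words[i] != words[j]:
                neighbours[words[i]].add(words[j])
                neighbours[words[j]].add(words[i])
    scores = {w: 1.0 for w in neighbours}
    for _ in range(iterations):
        new_scores = {w: (1 - damping) + damping * sum(scores[u] / len(neighbours[u])
                                                       for u in neighbours[w])
                      for w in neighbours}
        delta = max(abs(new_scores[w] - scores[w]) for w in neighbours)
        scores = new_scores
        if delta < threshold:
            break
    return scores
```

The top 10 scored vertices would then be taken as the extracted keywords.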
3.4.5 Evaluation 2 Results Discussion
We address the discussion from three aspects: the impact of different candidate identification
approaches, the comparison of the two scoring techniques, and the impact of
frequencies.
3.4.5.1 Candidate Identification Impact
We plot the results in six sub-figures, shown in Figure 3.5, where the vertical axis shows
the F-score, and the horizontal axis lists the candidate identifiers ordered by the proportion
of ground-truth among the candidates.
The Hulth dataset consists of short documents – the average number of words per
article is 125 – which offers very little advantage to statistics-based ranking, because the
majority of words and phrases occur only once or twice in a document. To gain better
performance, a ranking algorithm needs to be paired with an efficient candidate identification
approach, which not only covers more of the ground-truth keyphrases, but also reduces
the total number of candidate phrases to increase the proportion of ground-truth among
the candidates. For example, as shown in Table 3.2, the PTS Splitter and the C-value
algorithm produce very similar coverages of the ground-truth keyphrases, 67%
and 68.1%, respectively. However, the C-value algorithm produces far fewer candidate
phrases than the PTS Splitter, resulting in a ground-truth proportion about 6% higher.
Hence, the same ranking algorithm yields better performance with C-value
than with the PTS Splitter. This can be understood intuitively – a larger proportion
increases the chance of a ranking algorithm choosing correct keyphrases. As shown
in Figure 3.5 (A) and (B), there is a clear linear relation between the proportion of
ground-truth among the candidates and the overall performance of a ranking algorithm. For
Figure 3.5: Evaluation 2 Results: (A) Performance on the Hulth Dataset using Direct
Phrase Ranking; (B) Performance on the Hulth Dataset using Phrase Score Summation;
(C) Performance on the DUC Dataset using Direct Phrase Ranking; (D) Performance
on the DUC Dataset using Phrase Score Summation; (E) Performance on the SemEval
Dataset using Direct Phrase Ranking; (F) Performance on the SemEval Dataset using
Phrase Score Summation.
example, with the direct phrase ranking approach, increasing the proportion
by 10 percentage points, from 11.7% to 21.7%, raises the F-score by about 10% for all
ranking algorithms. Therefore, when extracting keyphrases from datasets consisting of short
documents, such as Hulth, candidate identification should focus on increasing
the coverage of the ground-truth set, reducing the number of candidate phrases, or both.
However, for datasets containing longer documents, such as DUC (mid-length
documents) and SemEval (long documents), increasing the proportion of ground-truth
keyphrases among the candidate phrases by a small percentage does not improve
performance much. As shown in Figure 3.5 (C), (D), (E) and (F), there is no clear
correlation between the proportion and performance. Trying to remove a large number
of candidate phrases in order to increase the proportion of ground-truth is extremely
difficult, or even impossible, simply because longer documents contain
more words and phrases. Instead, improving the accuracy of identifying candidate
phrases among their longer or shorter variants, e.g. greater yellowstone park fire
and yellowstone park fire, is a better way to improve the performance of a ranking
algorithm. All the ranking algorithms extract keyphrases based on the statistical information
of phrases, of which frequency is the main source. Correctly distinguishing a phrase
from its longer or shorter variants provides a cleaner list of candidate phrases and more
accurate statistical information. For example, if a document contains the text segments
greater yellowstone park fire and yellowstone park fire, and each segment occurs twice
in the document, the PTS Splitter will identify two different phrases from the text
segments, each with a frequency count of 2. The C-value algorithm, on the
other hand, will identify only one phrase, yellowstone park fire, from the text segments,
with a frequency count of 4. Given the outputs of the two candidate phrase identifiers
to the same ranking algorithm, the former combination may produce a lower score for
both phrases, whereas the latter combination will assign a much higher score to the
phrase yellowstone park fire due to its distinctly high frequency. Consequently, to extract
keyphrases from datasets consisting of longer documents, candidate identification
should focus on correctly identifying phrases from text segments using both statistical
and linguistic information.
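The frequency-merging behaviour described above can be sketched as follows (a deliberately simplified illustration of the C-value-style merging, not the full C-value algorithm; the phrase counts are from the yellowstone example in the text):

```python
def contains_phrase(longer, shorter):
    """True if `shorter` occurs in `longer` as a contiguous word sequence."""
    lw, sw = longer.split(), shorter.split()
    return any(lw[i:i + len(sw)] == sw for i in range(len(lw) - len(sw) + 1))

def merged_frequency(candidate, raw_counts):
    """Count occurrences of `candidate` itself plus those of any longer raw
    phrase containing it, folding nested variants into one candidate."""
    return sum(count for phrase, count in raw_counts.items()
               if phrase == candidate
               or (len(phrase) > len(candidate) and contains_phrase(phrase, candidate)))

raw_counts = {"greater yellowstone park fire": 2, "yellowstone park fire": 2}
merged_frequency("yellowstone park fire", raw_counts)  # 4, as in the example above
```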
Finally, the identifiers that use linguistic features (the NP Chunker and the C-value
identifier) deliver the best performance in 23 out of 24 evaluation cases, regardless of
the dataset and the ranking algorithm they are coupled with. This further suggests that
linguistic features play a critical role in AKE.
3.4.5.2 Phrase Scoring Impact
Two phrase scoring techniques are employed in this evaluation: direct phrase
ranking and phrase score summation. On the Hulth dataset, the phrase score summation
technique generally produces better results than direct phrase ranking. On the
SemEval dataset, the phrase score summation technique produces much worse results.
However, on the DUC dataset, there is no clear trend – only the ranking
algorithms paired with C-value and the NP Chunker produce better results when combined
with the phrase score summation technique.
The Hulth dataset contains short documents, so the frequencies and co-occurrence
frequencies of phrases are very low – a typical phrase may appear only once or twice
in a document. When using the direct phrase ranking approach, phrases are the inputs
to the ranking algorithms, and hence, with such low frequencies, it is difficult for
the algorithms to correctly identify the important phrases. On the other hand, using
the phrase score summation technique, the inputs to the ranking algorithms are words,
which naturally present better statistical information than phrases. For example, in
Figure 3.2 (A), the document Doc 1 has the phrases information interaction, information
architecture, and content, and each phrase occurs twice in the document. Among the
three phrases, content is obviously the least important. However, since all
three phrases have the same frequency, it is difficult for the ranking algorithms to
differentiate the importance of each phrase using the direct phrase ranking approach.
If only words are considered, i.e. using the phrase score summation approach, the word
information will gain a very high score from the ranking algorithms. Hence, by taking
the sum of each word's score as a phrase's score, the phrases information interaction and
information architecture will gain much higher scores than content.
In the long-document collection – the SemEval dataset – important phrases easily
occur tens of times, whereas unimportant ones may occur no more than a few
times. This provides much better statistical information for the direct phrase ranking
approach. On the other hand, long documents have more word combinations, and
hence more candidate phrases will be identified. When the phrase score summation
accumulates words' scores, any phrase containing highly scored words will stay at the top
of the ranking list – the longer a phrase is, the higher the score it receives. For example, if
the word network receives a high score after ranking, the longest phrase hybrid neural
network architecture will have the highest score, followed by its cousins neural network
architecture and neural network, where only the phrase neural network is among the
ground-truth keyphrases.
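This length bias is easy to reproduce (a toy demonstration with hypothetical word scores in which network and neural score highly):

```python
# Hypothetical word scores after ranking; "network" and "neural" score highly.
word_scores = {"hybrid": 0.3, "neural": 1.2, "network": 1.5, "architecture": 0.4}

def summed_score(phrase):
    """Phrase score summation: the sum of the phrase's word scores."""
    return sum(word_scores[w] for w in phrase.split())

phrases = ["hybrid neural network architecture", "neural network architecture", "neural network"]
scores = [summed_score(p) for p in phrases]
# The longest phrase accumulates the highest score,
# even though only "neural network" is a ground-truth keyphrase.
```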
The DUC dataset sits between the long- and short-document collections, and hence
inherits characteristics of both. Phrases occur more frequently in the
DUC dataset than in the Hulth dataset, but not as frequently as in the SemEval dataset.
The DUC dataset has fewer word combinations than SemEval, yet the identified candidates
are not as clean as those of the Hulth dataset. These characteristics lead to very interesting
results – the two scoring approaches produce very similar results with Prefixspan,
N-gram, and the PTS Splitter, but the phrase score summation approach produces much
better results with the candidate identifiers that use linguistic analysis. The reason
is simple: identifiers using linguistic analysis produce cleaner candidate phrase lists. As
shown in Table 3.2, the candidate identifiers without linguistic analysis produce many
more candidate phrases than those using linguistic analysis. For example, the NP Chunker
and C-value can correctly identify the phrase yellowstone park fire from the text segment
greater yellowstone park fire, whereas the others cannot. Why, then, is there no significant
improvement in the results using the direct phrase ranking approach with the linguistic
candidate phrase identifiers? One reason is that phrases in mid-length documents
do not occur as frequently as in long documents. Another is that many keyphrases do
not occur with high frequencies, as we discuss in the next section.
3.4.5.3 Frequency Impact
Figure 3.6: Ground-truth and extracted keyphrase distributions per document.
(A): Distributions on the DUC Dataset; (B): Distributions on the SemEval Dataset.
The majority of unsupervised AKE algorithms aim to identify phrases with relatively
high frequency, based on the assumption that keyphrases tend to occur more
frequently than other phrases. However, this assumption does not always hold. We
plot the ground-truth frequency distributions for the DUC and SemEval datasets, along
with the number of keyphrases extracted by each algorithm, in Figure 3.6. The
vertical axis is the number of keyphrases assigned by human annotators, and the horizontal
axis shows their frequencies in the corresponding documents. Many ground-truth
keyphrases occur only 2 to 3 times in a document, yet all the ranking algorithms fail
to extract keyphrases with relatively low frequencies. TextRank, HITS, and RAKE
extract keyphrases relying solely on the information in individual documents, and therefore
suffer seriously from the frequency-sensitivity problem. TF-IDF, on the other hand, is
less sensitive to frequency than the others because it considers both the local
document frequencies of phrases and how they occur in the entire corpus. This suggests
that incorporating external knowledge may mitigate the frequency-sensitivity problem in
unsupervised AKE systems. Nevertheless, improving the performance of ranking algorithms
requires a focus on identifying keyphrases among low-frequency candidate phrases.
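The corpus-level signal that makes TF-IDF less frequency-sensitive can be sketched as follows (a minimal phrase-level TF-IDF with a toy corpus for illustration):

```python
import math
from collections import Counter

def tfidf_scores(doc_phrases, corpus):
    """Score each phrase in `doc_phrases` by TF-IDF, where `corpus` is a list
    of documents, each a list of candidate phrases (including `doc_phrases`)."""
    df = Counter(p for doc in corpus for p in set(doc))   # document frequency
    tf = Counter(doc_phrases)                             # local term frequency
    n_docs = len(corpus)
    return {p: (tf[p] / len(doc_phrases)) * math.log(n_docs / df[p]) for p in tf}

# Toy corpus: "neural network" is rare across documents, "system" is ubiquitous.
doc = ["neural network", "system", "neural network", "system"]
corpus = [doc, ["system"], ["system"], ["system"]]
scores = tfidf_scores(doc, corpus)
# Despite identical local frequencies, "neural network" outranks "system"
# because it is rare in the rest of the corpus.
```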
3.5 Conclusions
In this chapter, we have conducted a systematic evaluation of five common candidate
phrase selection approaches and two candidate phrase scoring techniques, coupled with
four unsupervised ranking algorithms. Our evaluation is based on three publicly available
datasets containing articles of various lengths.
The evaluation reveals three key observations. Firstly, candidate identification approaches
have a strong impact on overall performance, where approaches based on both
statistical and linguistic features usually deliver better performance. Secondly, the choice
of phrase scoring approach is critical and should be carefully considered against the chosen
dataset. Direct phrase ranking may be more suitable for longer datasets containing
more word combinations, whereas phrase score summation may be a
better choice for relatively short and clean datasets with fewer phrases. Thirdly,
keyphrases do not always occur with high frequencies, and existing unsupervised AKE
algorithms suffer from the frequency-sensitivity problem. The TF-IDF algorithm delivers
relatively better results at extracting keyphrases with low frequencies since it uses
corpus-level statistics, which suggests that incorporating external knowledge may mitigate
the frequency-sensitivity problem; further improvement should therefore focus on
identifying keyphrases among low-frequency candidates using external knowledge. In the
next chapter, we address this problem by presenting a knowledge-based graph-ranking
approach that uses word embedding vectors as an external knowledge source providing
semantic features.
Chapter 4
Using Word Embeddings as
External Knowledge for AKE
In Chapter 3, we systematically evaluated four unsupervised AKE algorithms:
TF-IDF [2], RAKE [45], TextRank [3], and HITS [88]. The evaluation not only
shows the impact of different candidate identification and scoring approaches on an
AKE system, but also reveals a common weakness in unsupervised AKE algorithms – they
mostly fail to extract keyphrases with low frequencies. TF-IDF produces slightly better
performance than the others because it uses statistical information from the entire
corpus. This suggests that using external knowledge may mitigate the frequency-sensitivity
problem.
In this chapter, we investigate using additional semantic knowledge supplied by pre-
trained word embeddings to overcome the frequency-sensitivity problem and improve
the performance of unsupervised AKE algorithms. We focus on graph-based ranking
approaches. The family of graph-based ranking algorithms is by far the most widely
used unsupervised approach for AKE; it is based on the intuition that keyphrases are
the phrases that have stronger relations with others. In the past decade, the development
of graph-based AKE systems has mainly focused on discovering and investigating
different weighting schemes that compute the strengths of relations between phrases. The
weights are assigned to edges for each pair of vertices that represent the corresponding
phrases. Early studies assign co-occurrence statistics as weights [10, 198–200]. How-
ever, co-occurrence statistics do not provide any semantic knowledge. More recently,
researchers use available knowledge bases such as Wikipedia and WordNet to obtain
semantic relatedness of words and phrases [6, 7, 24]. The shortcoming is that public
knowledge bases offer limited vocabularies that cannot cover all domain-specific terms.
In addition, they only provide general semantics of words and phrases, which makes little
contribution to identifying the semantic relatedness of entities in domain-specific corpora,
because the same phrases may have very different meanings in a specific domain compared
with their general meanings.
In contrast to existing work, we use both co-occurrence statistics in documents, and
the semantic knowledge encoded in word embedding vectors, to compute the relation
strengths of phrases. Word embeddings are trained over both general corpora and
domain-specific datasets, enabling them to encode both general and domain-specific
knowledge. We evaluate the weighting scheme with four graph ranking algorithms, in-
cluding Weighted PageRank, Degree Centrality, Betweenness Centrality and Closeness
centrality. We demonstrate that using word embedding to supply extra knowledge for
measuring the semantic relatedness of phrases is an efficient approach to mitigate the
frequency-sensitive problem. It also generally improves the performance of graph rank-
ing algorithms.
4.1 Weighting Schemes for Graph-based AKE
Keyphrases are representative phrases that describe the main ideas or arguments of an
article. Keyphrases can be considered the ‘soul’ of an article that forms its skeleton;
other words and phrases in the article are supporting terms that help emphasise the
key ideas. Graph-based ranking approaches for AKE are built upon this intuition –
keyphrases are the phrases having stronger relations with others, which tie and hold the
entire article together. Graph-based approaches represent documents as graphs, where
the candidate phrases are vertices, and the edges between two vertices represent their relations. In
most graph-based AKE systems, edges are identified from the co-occurrence of two phrases,
i.e. two vertices are connected if their corresponding phrases co-occur within a
predefined window.
The most well-known graph-based AKE algorithm is TextRank [3], which is developed
upon the idea of assigning higher scores both to vertices having more edges (co-occurring
more often) and to vertices linking only to a few of the most important vertices.
However, TextRank does not take other information, such as co-occurrence frequencies,
into consideration1. Following TextRank, Wan and Xiao [10] present SingleRank
using weighted PageRank [201], where the weights are normalised co-occurrence frequencies.
Other studies [198–200] also investigate different graph-ranking algorithms by
assigning co-occurrence frequencies as the weights of edges, including Degree Centrality,
1TextRank is developed for both AKE and text summarisation; it does not assign weights to edges for the AKE task.
Betweenness Centrality, Closeness Centrality, Strength, Neighbourhood Size, Coreness,
Clustering Coefficient, Structural Diversity Index, Eigenvector Centrality, and HITS.
However, assigning weights based on the co-occurrence frequencies of individual documents
may not suit short documents, such as the abstracts of journal papers in the Hulth dataset,
where most phrases co-occur only once or twice. Instead of considering only co-occurrence
frequencies within documents, researchers have also attempted to use the knowledge embedded
in the corpus. For example, ExpandRank [10] analyses how phrases co-occur not
only in the current document, but also in its neighbourhood documents – the most similar
documents in the corpus, ranked by cosine similarity. Weights are computed as the
sum of the phrases' co-occurrence frequencies in each neighbourhood document, multiplied
by the cosine similarity score. ExpandRank assumes that topic-wise similar documents
exist in the corpus. However, this prerequisite is not always met. Liu et al. [39]
apply Latent Dirichlet Allocation (LDA) [76] to induce latent topic distributions for each
candidate, then run Personalised PageRank [91] for each topic separately, where the
random jump factor of each vertex in a graph is weighted by its probability distribution
for the topic, and the out-degree of a vertex is weighted by the co-occurrence statistics.
In reality, however, the topics in a document are not equally important, and hence trivial
topics may not be useful for the AKE task.
While incorporating the lexical statistics of corpora certainly improves the performance of
graph-based ranking algorithms, these AKE systems can be dataset-dependent, i.e. they
usually work well on one particular dataset [9]. The lack of semantic interpretation or
measurement of phrases inhibits the opportunity to accurately identify phrases representing
a document's core theme. Thus, researchers have started using semantic knowledge bases to
measure the semantic relatedness of phrases. The semantic relatedness is mostly obtained
from external knowledge bases. Grineva et al. [7] use links between Wikipedia concepts
to compute semantic relatedness based on the Dice coefficient [68]. Wang
et al. [6] and Martinez-Romo et al. [24] use synsets in WordNet to obtain the semantic
relatedness between two candidates. However, publicly available knowledge bases such
as Wikipedia and WordNet are designed for general purposes, and may not precisely
capture the semantics of domain-specific phrases. Moreover, they offer limited vocabularies:
WordNet is handcrafted and thus has a limited vocabulary; Wikipedia has a richer vocabulary
within its articles, but the techniques for inducing semantic relatedness typically
rely on analysing the hyperlinks and titles of Wikipedia articles.
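For reference, the Dice coefficient over two concepts' link sets is simply 2|A ∩ B| / (|A| + |B|); a generic sketch (not Grineva et al.'s exact implementation) is:

```python
def dice_coefficient(links_a, links_b):
    """Dice coefficient between two sets, e.g. the sets of Wikipedia
    articles linking to two concepts: 2|A intersect B| / (|A| + |B|)."""
    a, b = set(links_a), set(links_b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

dice_coefficient({"p1", "p2", "p3"}, {"p2", "p3", "p4"})  # 2*2/6 = 0.666...
```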
The development of graph-based AKE approaches can be thought of as a journey of
discovering and investigating different weighting schemes built upon different levels
of knowledge. Studies utilising phrases' co-occurrence statistics from documents only
consider local statistical information. Topic-based approaches incorporate the
knowledge of the corpus. Public knowledge bases essentially provide general knowledge
that helps algorithms to measure the general semantic relatedness of phrases.
4.2 Proposed Weighting Scheme
In contrast to existing work, we propose a weighting scheme built upon three levels of
knowledge: the local document, the corpus, and general knowledge. Phrase
co-occurrence statistics can easily be obtained from documents. To gain corpus-level
and general knowledge, we pre-train word embeddings over both general and domain-specific
corpora, and use the semantic information encoded in the word embedding vectors
to measure the semantic relatedness of phrases.
4.2.1 Word Embeddings as Knowledge Base
Capturing both domain-specific knowledge (in the corpus) and general knowledge
is a challenging task, because the meanings of domain-specific terms (words or phrases)
in the domain can be very different from their general meanings. For example, neuron
generally relates to biological terms such as cortical, sensorimotor and neocortex. In
computer science, particularly the machine learning domain, the term neuron often refers to
a computational unit that performs a linear or non-linear computation. Hence, the most
semantically related phrases are either the names of machine learning algorithms or
artificial neural network architectures, such as back-propagation, hopfield, feed-forward and
perceptron. However, the meanings of non-domain-specific terms remain the same even in
domain-specific corpora. For example, the meaning of dog remains the same in the computer
science domain, but the word rarely appears.
Word embeddings are low dimensional and real-valued vector representations of words.
A typical way to induce word embeddings is to learn a probabilistic language model
that predicts the next word given its previous ones. Hence, word embeddings essentially
encode the co-occurrence statistics of words over the training corpus [202]. Such
information is capable of representing the semantics of words – the distributional hypothesis
states that words that occur in similar contexts tend to have similar meanings [164].
Mikolov et al. [13] have demonstrated that embedding vectors trained over a Wikipedia
snapshot capture both semantic and syntactic features of words.
Since word embeddings encode the co-occurrence statistics of words distributed in the
corpus, they can be retrained over different corpora to encode the meanings of domain-specific
terms whose co-occurrence statistics are distinct from those of the general domain.
Table 4.1 lists some sample words and their most similar words, extracted using the cosine
similarity measure. The word embeddings were first trained over a general corpus – a
Wikipedia snapshot – and then retrained over a computer science corpus. As shown
in the table, the meanings of domain-specific words change significantly to reflect
the domain knowledge. Neutral words, which naturally relate to the specific domain, change
very little. Non-domain-specific words tend to retain their original meanings.
This can be understood intuitively – domain-specific words change their co-occurrence partners
in the corpus, whereas non-domain-specific words do not.
Table 4.1: Most Similar Words: top-ranked most similar words to the sample words,
fetched using cosine similarity; embeddings trained twice: 1) over a general Wikipedia
dataset, and 2) retrained over a computer science domain-specific dataset

Sample       Most Similar (General Domain)                  Most Similar (CS Domain)
Domain-Specific
  neural     cortical, sensorimotor, neocortex,             back-propagation, hopfield, feed-forward,
             sensory, neuron                                perceptrons, network
  thread     weft, skein, yarn, darning, buttonhole         dataflow, parallelism, user-specified,
                                                            client-side, mutexes
Neutral
  algorithm  heuristic, depth-first, breadth-first,         breadth-first, bilinear, depth-first,
             recursive, polynomial                          autocorrelation, eigenspace
  database   RDBMS, clusterpoint, metadata,                 repository, web-accessible, metadata,
             mapreduce, memcached                           searchable, CDBS
Non-domain Specific
  dog        sheepdog, doberman, puppy, rottweiler,         cat, animal, wolf, pet, sheep
             poodle
  flight     take-off, aircraft, jetliner, jet, airplane    take-off, aircraft, airline, landing, airport

Domain-Specific: words that have totally different meanings in the open domain and the specific domain. Neutral: words that essentially relate to the specific domain even in the open domain. Non-domain Specific: words that do not change their meanings.
4.2.2 Weighting Scheme
The weighting scheme is developed from a few intuitions about what keyphrases are. We use two indicators: co-occurrence frequency and semantic relatedness. Frequencies have long been the main source of evidence for statistical language processing. An important phrase of a document should have a relatively high frequency and co-occur with many different phrases; conversely, high co-occurrence frequencies indicate that the phrase itself is a highly frequent phrase. On the other hand,
a phrase that selectively co-occurs with one or a few particular highly frequent phrases can also be important. For example, in an article about deep learning, terms like recurrent network and convolutional network may appear many times, and hence they can be important for representing the theme of the article. A phrase neural network may appear only twice, co-occurring once with recurrent network and once with convolutional network. It is clear that neural network is a good keyphrase candidate even though it has a very low frequency. On the other hand, suppose a phrase cognitive science also co-occurs once with recurrent network and convolutional network. How do we differentiate the importance of neural network and cognitive science? We use the second indicator – semantic relatedness. Both recurrent networks and convolutional networks are instances of neural networks, and hence their semantic relatedness to neural network is much stronger than that of cognitive science. Consequently, we propose computing the weight as the product of two phrases' co-occurrence frequency and semantic relatedness.
Formally, let S be the relation strength of phrases p_i and p_j. We compute S as the product of the co-occurrence frequency count coocc of p_i and p_j in document D and the cosine similarity score sim of their corresponding embedding vectors:

S(p_i, p_j) = coocc(p_i, p_j) × sim(p_i, p_j)    (4.1)

The cosine similarity is computed as:

sim(p_i, p_j) = (v_i · v_j) / (||v_i|| ||v_j||)    (4.2)

where v_i and v_j are the embedding vectors for p_i and p_j, respectively.
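The weighting scheme in Equations 4.1 and 4.2 can be sketched in a few lines (a minimal illustration assuming embeddings are plain Python lists; the function names are ours, not from the original implementation):

```python
import math

def cosine_sim(v_i, v_j):
    """Cosine similarity between two embedding vectors (Equation 4.2)."""
    dot = sum(a * b for a, b in zip(v_i, v_j))
    norm_i = math.sqrt(sum(a * a for a in v_i))
    norm_j = math.sqrt(sum(b * b for b in v_j))
    return dot / (norm_i * norm_j)

def relation_strength(coocc, v_i, v_j):
    """S(p_i, p_j) = coocc(p_i, p_j) x sim(p_i, p_j) (Equation 4.1)."""
    return coocc * cosine_sim(v_i, v_j)
```

For example, two phrases that co-occur twice and have identical embedding vectors receive strength 2 × 1.0 = 2.0.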
4.3 Training Word Embeddings
We use the SkipGram model introduced by Mikolov et al. [15] to train word embeddings. The SkipGram model aims to learn a function f that maps words to probability distributions over their context words. In the SkipGram model, there are two vector representations for each word: an input vector and an output vector. The two vectors are used in conjunction to compute the probabilities. The SkipGram model takes one word w_t (the centre word) as input to predict its surrounding words w_{t−n}, ..., w_{t−1}, w_{t+1}, ..., w_{t+n}, where n is the size of the surrounding window.
Formally, let C denote the set of input word embedding vectors, and C′ denote the set of output vectors. Given a sequence of words w_1, w_2, ..., w_T, the goal is to maximise
the conditional probability distribution P(w_{t+j} | w_t) over all words in the vocabulary V of a training corpus by searching for parameters θ = (C, C′). The learning objective is to maximise the average log probability:
L = (1/T) Σ_{t=1}^{T} Σ_{−n ≤ j ≤ n, j ≠ 0} log P(w_{t+j} | w_t; θ)
Using the softmax function, we have:

P(w_{t+j} | w_t; θ) = exp(C′(w_{t+j})^T · C(w_t)) / Σ_{i=1}^{V} exp(C′(w_i)^T · C(w_t))
where C′(w_{t+j}) is the output vector representation for w_{t+j}, and C(w_t) is the input vector representation for w_t. However, directly applying the softmax function over a large corpus is computationally very expensive due to the summation over the entire vocabulary V. To improve computational efficiency, we use Negative Sampling [15].
The negative sampling algorithm used in the SkipGram model is a simplified version of Noise Contrastive Estimation [203]. The idea is to reward positive samples and penalise negative samples. Considering a pair of words (w_i, w_j) in a context, negative sampling rewards the pair if it comes from the actual training data (the words co-occur in a context) by assigning the probability P(D = 1 | w_i, w_j). Conversely, if the pair does not come from the training data, the probability P(D = 0 | w_i, w_j) is assigned, meaning the words never co-occur in the training data. The learning objective is to maximise the probability of positive samples and minimise the probability of negative ones (randomly generated) by searching for parameters θ. In the actual training, given a centre word w_t, each surrounding word w_{t−n}, ..., w_{t−1}, w_{t+1}, ..., w_{t+n} is treated as a positive sample, whereas negative samples are generated by randomly selecting words from the vocabulary. The probabilities for positive and negative samples are computed as:
p(w | w_t; θ) = σ(C′(w)^T C(w_t))        for positive samples
p(w | w_t; θ) = 1 − σ(C′(w)^T C(w_t))    for negative samples
Let POS(w) be the set of positive samples and NEG(w) be the set of negative samples; the overall probability is computed as:

P(w | w_t; θ) = ∏_{i ∈ POS(w)} σ(C′(w_i)^T C(w_t)) × ∏_{j ∈ NEG(w)} (1 − σ(C′(w_j)^T C(w_t)))
The objective function maximises the log probability:

L = Σ_{i ∈ S(w)} [ y_i · log σ(C′(w_i)^T C(w_t)) + (1 − y_i) · log(1 − σ(C′(w_i)^T C(w_t))) ]

where S(w) = POS(w) ∪ NEG(w), and y_i = 1 for positive samples and y_i = 0 for negative samples.
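The negative-sampling objective above can be computed directly from the definitions (a minimal sketch using plain Python lists; the vectors and sample pairs are illustrative, not from the actual training code):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ns_log_likelihood(center_vec, samples):
    """Negative-sampling log likelihood for one centre word.

    samples: list of (output_vector, y) pairs, where y = 1 marks a
    positive (observed) sample and y = 0 a negative (randomly drawn) one.
    """
    total = 0.0
    for out_vec, y in samples:
        p = sigmoid(dot(out_vec, center_vec))
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total
```

Gradient updates on this quantity touch only the centre vector and the sampled output vectors, which is what makes negative sampling cheap compared with the full softmax.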
4.4 Implementation
4.4.1 Training Word Embeddings
We first train our model over an English Wikipedia snapshot downloaded in January
2015. Then we retrain the embeddings over each AKE dataset separately to encode the
domain-specific knowledge of each dataset.
The hyper-parameters used for training are:
• Number of negative samples: 5
• Window size: 10
• Embedding vector dimensions: 300
• Negative table size: 2 × 10^9
• Learning rate: 0.01 for training over the general dataset, and 0.005 for training over the AKE datasets
• Subsampling parameter: 10^−6 for training over the general dataset, and 10^−4 for training over the AKE datasets
• Word minimum occurrence frequency: 20 for training over the general dataset, and 1 for training over the AKE datasets
• Iterations over the training set: 1 for training over the general dataset, and 10 for training over the AKE datasets
Some hyper-parameters used in training over the AKE datasets differ from those used for the general dataset because the AKE datasets are much smaller, so the parameters need to be tuned slightly to generalise the learning.
The dimension of the embedding vectors controls how much information an embedding vector may encode. Mikolov et al. [15] show that larger dimensions can certainly encode more knowledge. However, there is a trade-off between the dimensionality of word embeddings and the training time – larger dimensions significantly increase the training time. Our empirical evidence, evaluated on word analogy tests, shows that the amount of information encoded in embedding vectors increases significantly up to 300 dimensions, after which the difference becomes trivial from 300 to 1,000 dimensions. This reflects the trade-off between the amount of training data and the dimensionality of the word embeddings, i.e. much more data is needed if we expand the capacity (the embedding dimensionality) of the model. In this study, the dimension of the embedding vectors is set to 300.
The subsampling threshold balances frequent and infrequent words in the training samples. The most frequent words (not only stop-words) can occur millions of times more often than infrequent ones, which may overfit the model. To generalise the learning, the subsampling threshold is used to randomly discard frequent words with probability p(w) = 1 − √(t / f(w)), where f(w) is the frequency count of word w and t is the subsampling threshold parameter. We use t = 10^−6 for training the model over the general dataset, and t = 10^−4 for the AKE datasets due to their small size.
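The discard probability can be computed directly from the formula above (a minimal sketch following the document's count-based formula; clipping at zero for words below the threshold is our assumption):

```python
import math

def discard_probability(freq, t):
    """p(w) = 1 - sqrt(t / f(w)); clipped at 0 for infrequent words."""
    return max(0.0, 1.0 - math.sqrt(t / freq))
```

Words whose frequency falls below the threshold are always kept, while very frequent words are discarded with probability approaching 1.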
The iteration count controls how many times we train the embeddings over the dataset. The Wikipedia snapshot contains enough training samples to generalise the learning, thus we train only once2. The AKE datasets do not contain enough training examples, so we iterate the training 10 times to gain better generalisation3.
4.4.2 Phrase Embeddings
The SkipGram model learns word embedding vectors. To obtain multi-word phrase embeddings, we use two common approaches, namely the holistic and the algebraic composition approaches [177]. The holistic approach learns phrase embeddings by treating pre-identified phrases as atomic units, where each phrase is associated with a vector and trained in the same way as words. The algebraic composition approach, on the other hand, applies simple algebraic functions to derive phrase embeddings. In this study, we apply the vector addition function, i.e. a phrase embedding is computed as the sum of its component word embeddings.
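Additive composition is straightforward to sketch (illustrative names; skipping out-of-vocabulary words is our assumption):

```python
def compose_phrase(words, embeddings):
    """Phrase embedding as the sum of its component word embeddings."""
    dim = len(next(iter(embeddings.values())))
    phrase_vec = [0.0] * dim
    for w in words:
        vec = embeddings.get(w)
        if vec is None:
            continue  # skip words missing from the vocabulary
        phrase_vec = [p + v for p, v in zip(phrase_vec, vec)]
    return phrase_vec
```

Because addition is commutative, composing ["first", "lady"] and ["lady", "first"] yields the same vector, which is exactly the word-order limitation discussed later in this chapter.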
2 The obtained word embeddings produce exactly the same results on the word analogy test as claimed in the original paper [15]; hence we are confident in our re-implementation.
3 We empirically evaluated different iteration counts for the AKE datasets, and did not notice any significant improvement beyond 10 iterations.
4.4.3 Ranking Algorithms
4.4.3.1 Degree Centrality
In unweighted graphs, the Degree Centrality of a vertex is simply the number of connections the vertex has. Formally, let C_d(i) be the degree of vertex v_i in graph G, N be the number of vertices that v_i connects to, and x_ij be the connection between vertices v_i and v_j; the degree of v_i is computed as:

C_d(i) = Σ_{j=1}^{N} x_ij    (4.3)
In weighted graphs, the Degree Centrality can be extended to the sum of the weights w of all edges that i connects to [204, 205]:

C^w_d(i) = Σ_{j=1}^{N} w_ij    (4.4)
It can be normalised by taking the ratio of the Degree Centrality to the maximum possible degree N − 1:

NC^w_d(i) = C^w_d(i) / (N − 1)    (4.5)
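Equations 4.4 and 4.5 amount to a sum over an adjacency structure (a minimal sketch assuming the graph is represented as a dict mapping each vertex to a dict of neighbour weights):

```python
def weighted_degree(graph, i):
    """C^w_d(i): sum of the weights of edges incident to vertex i (Eq. 4.4)."""
    return sum(graph[i].values())

def normalised_weighted_degree(graph, i):
    """Eq. 4.5: divide by the maximum possible degree N - 1."""
    n = len(graph)
    return weighted_degree(graph, i) / (n - 1)
```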
4.4.3.2 Betweenness and Closeness Centrality
The Betweenness Centrality of a vertex v_i is the number of times the vertex acts as a bridge lying on the shortest path between two other vertices. We use the algorithm proposed by Brandes [206] to compute the betweenness of vertex v_i as:

C_b(i) = Σ_{s ≠ t ≠ i ∈ V} σ_st(i) / σ_st    (4.6)
where V is the set of vertices in graph G, σ_st is the number of shortest paths from v_s to v_t in G, and σ_st(i) is the number of those shortest paths passing through v_i (other than v_s and v_t). The Betweenness Centrality can also be normalised by:

NC_b(i) = 2 × C_b(i) / ((N − 1)(N − 2))    (4.7)
The Closeness Centrality is the inverse of the sum of the shortest-path distances from a vertex v_i to all other vertices it connects to, computed as:

C_c(i) = (N − 1) / Σ_{j=1}^{N} d(j, i)    (4.8)

where d(j, i) is the shortest-path distance between vertices v_j and v_i.
Both Betweenness and Closeness centralities rely on identifying the shortest paths in a graph. We use Dijkstra's algorithm [207] to identify the shortest paths in a weighted graph.
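Closeness, for instance, can be computed by running Dijkstra's algorithm from each vertex (a minimal sketch for a dict-of-dicts graph with positive weights; in practice a graph library could be used instead):

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distances from source (positive edge weights)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def closeness(graph, i):
    """Eq. 4.8: (N - 1) divided by the sum of shortest-path distances."""
    dist = dijkstra(graph, i)
    return (len(graph) - 1) / sum(d for v, d in dist.items() if v != i)
```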
4.4.4 PageRank
The original PageRank algorithm [89] ranks the vertices of a directed unweighted graph. In a directed graph G, let in(v_i) be the set of vertices that point to a vertex v_i, and out(v_i) be the set of vertices to which v_i points. The PageRank score of v_i is calculated as:

S(v_i) = (1 − d) + d × Σ_{v_j ∈ in(v_i)} (1 / |out(v_j)|) × S(v_j)    (4.9)
where d is the damping factor. The Weighted PageRank [3, 201] score of a vertex v_i is:

WS(v_i) = (1 − d) + d × Σ_{v_j ∈ in(v_i)} (w_ji / Σ_{v_k ∈ out(v_j)} w_jk) × WS(v_j)

where w_ij is the weight of the connection between vertices v_i and v_j. Both PageRank and Weighted PageRank require hyper-parameter settings. We set the damping factor to 0.85, the number of iterations to 30, and the convergence threshold to 10^−5. For undirected graphs, the in-degree of a vertex simply equals its out-degree.
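The weighted PageRank recurrence with the stated hyper-parameters can be sketched as a fixed-point iteration (a minimal illustration for an undirected dict-of-dicts graph, where in- and out-neighbours coincide):

```python
def weighted_pagerank(graph, d=0.85, iterations=30, tol=1e-5):
    """Iterate WS(v) = (1 - d) + d * sum over neighbours u of
    w_uv / sum_k(w_uk) * WS(u), stopping early on convergence."""
    scores = {v: 1.0 for v in graph}
    out_weight = {v: sum(nbrs.values()) for v, nbrs in graph.items()}
    for _ in range(iterations):
        new = {}
        for v in graph:
            rank = sum(graph[u][v] / out_weight[u] * scores[u]
                       for u in graph[v])  # undirected: in(v) = neighbours
            new[v] = (1 - d) + d * rank
        delta = max(abs(new[v] - scores[v]) for v in graph)
        scores = new
        if delta < tol:
            break
    return scores
```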
4.4.5 Assigning Weights to Graphs
Two vertices are connected if the corresponding candidate phrases co-occur within a window size of 2. For Degree Centrality and PageRank, weights are assigned directly to edges. For Betweenness and Closeness, the weights need to be reversed before being assigned to edges, as 1/w_ij, where w_ij is the weight between vertices i and j computed using Equation 4.2. Since our weighting scheme is the product of co-occurrence frequency and cosine similarity, it is possible for a weight to be negative, or even zero if two vectors are orthogonal. From our observation, fewer than 0.1% of the weights occurring in the evaluation datasets are negative, and we did not encounter any orthogonal vectors. Therefore, we simply do not connect two vertices if a negative weight occurs.
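The graph construction just described can be sketched as follows (a minimal illustration; `strength` stands in for the relation strength S of Equation 4.1, and the window slides over the document's candidate sequence):

```python
from itertools import combinations

def build_graph(candidates, strength, window=2):
    """Connect candidate phrases that co-occur within `window`,
    weighting edges by S and skipping non-positive weights."""
    graph = {p: {} for p in candidates}
    for k in range(len(candidates) - window + 1):
        for p, q in combinations(candidates[k:k + window], 2):
            if p == q or q in graph[p]:
                continue
            w = strength(p, q)
            if w <= 0:
                continue  # do not connect vertices with negative weights
            graph[p][q] = w
            graph[q][p] = w
    return graph

def invert_weights(graph):
    """For Betweenness and Closeness: strong relations should be short
    distances, so each edge weight becomes 1 / w."""
    return {v: {u: 1.0 / w for u, w in nbrs.items()}
            for v, nbrs in graph.items()}
```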
4.5 Evaluation
We use the two phrase scoring techniques described in Section 3.4.4.1: direct phrase ranking and phrase score summation. Direct phrase ranking ranks pre-identified phrases instead of individual words; phrases that cannot be pre-identified are not ranked or scored. The phrase score summation approach sums the scores of all constituent words to make up a phrase score; under this scoring approach, the holistic approach is not applicable, hence no result for the holistic approach is shown in Figure 4.1. We select the top 10 ranked phrases as potential keyphrases. Results are measured by Precision, Recall, and F-measure. We use the datasets described in Chapter 3, namely the Hulth, DUC, and SemEval datasets.

We set up two baselines. One is unweighted graph ranking using the four ranking algorithms: Degree, Betweenness, and Closeness centralities, and PageRank. The other baseline uses normalised co-occurrence frequencies as weights.
4.5.1 Evaluation Results and Discussion
We plot the evaluation results in six column charts, shown in Figure 4.1. Each chart has four groups of columns, showing the evaluation results for the four graph ranking algorithms.
4.5.1.1 Discussion on Direct Phrase Ranking System
The proposed weighting scheme uniformly improves the performance of all ranking algorithms on the evaluation datasets. However, using only co-occurrence frequency statistics as weights does not boost performance in comparison to unweighted graphs; in many cases it produces even worse results. As we showed in the last chapter, the majority of unsupervised algorithms are overly sensitive to the frequencies of phrases, i.e. they are unable to identify keyphrases with low frequencies. Hence, accumulating co-occurrence frequencies of phrases as weights does not solve the frequency-sensitivity problem; it rather makes the situation worse – more frequent phrases receive even higher ranking scores, because the co-occurrence frequency statistics implicitly encode the frequencies of the phrases. That is why unweighted graph ranking outperforms co-occurrence-statistics-based ranking. On the other hand, our weighting scheme
[Figure 4.1 comprises six column charts (F-scores ranging from 0% to 40%), one per dataset (Hulth, DUC, SemEval) and scoring approach (direct phrase ranking, phrase score summation). Each chart compares Degree, Betweenness, Closeness, and PageRank under four weightings: Unweighted, Co-occurrence frequency, Holistic Phrase Embedding, and Compositional Phrase Embedding; in the phrase score summation charts, the embedding-based weighting is labelled Using Word Embeddings.]

Figure 4.1: Evaluation Results (F-score): the left charts show the results using the direct phrase ranking scoring approach, where the green and purple columns show the embedding effects. The right charts show the results using the phrase score summation approach, where the green columns show the embedding effects.
mitigates the frequency impact by taking the product of the co-occurrence frequency and the semantic relatedness of a pair of phrases, which considers not only how the pair of phrases co-occurs in the document, but also how they co-occur in both general and domain-specific datasets, since word embeddings implicitly capture the co-occurrence statistics of words and phrases.
Comparing the two phrase embedding approaches, the holistic embedding approach produces slightly better results in most cases. The holistic approach treats phrases in the same way as words during training, thus theoretically it encodes more accurate co-occurrence statistics than the vector addition approach. For example, the holistic approach considers the sequence of words, so the phrase embeddings for first lady and lady first are different. The vector addition approach, on the other hand, simply sums the values of the word embedding vectors, so it does not take the sequential information of words into consideration. In practice, however, the holistic approach may suffer from data sparsity, because it is not possible to pre-identify all possible phrases.
Comparing the performance of different graph ranking algorithms is not the main interest of this study; we note, however, that PageRank and Closeness Centrality produce slightly better performance on all three datasets.
4.5.1.2 Discussion on Phrase Score Summation System
In the phrase score summation system, the improvement is not as clear as in the direct phrase ranking system. On the Hulth dataset, the proposed weighting scheme still improves the performance of all algorithms by about 2%. However, the improvement on the DUC dataset becomes blurred, and there is no improvement, or even slightly worse performance, on the SemEval dataset. We identify two reasons.

Firstly, word embeddings do not directly convey the semantic meanings of phrases. The score summation system ranks unigram words, and a phrase's score is the sum of the scores of its constituent words. Therefore, the graph is constructed from unigram words rather than phrases, and applying our weighting scheme essentially assigns weights to pairs of words rather than pairs of phrases. No semantic knowledge of phrases is involved in the ranking process; the score summation system thus extracts any phrase containing highly scored words as a keyphrase. There is still a small improvement on the Hulth dataset, because its documents are much shorter than those in the other two datasets. Short documents produce fewer phrases, and the weighting scheme using word embeddings clearly influences the ranking: words with stronger semantic relatedness gain higher scores.
Secondly, the usefulness of word embeddings is blurred on the DUC and SemEval datasets, where each document contains a large number of words and thus produces more candidate phrases. By accumulating word scores, the top-scored phrases are always the longer phrases containing one or more top-scored words, due to the summation over all constituent words (graph ranking algorithms never produce negative scores). For example, if the word hurricane gains a distinctly high ranking score, then any phrase containing the word, such as hurricane center or national hurricane center, will also stay at the top of the ranking list.
4.5.1.3 Mitigating the Frequency-Sensitivity Problem
In this evaluation, we show how the weighting scheme mitigates the frequency-sensitivity problem. Figure 4.2 shows the keyphrases and their frequencies extracted by PageRank, Weighted PageRank using co-occurrence frequencies as weights, and our weighting scheme using direct phrase ranking, on the DUC and SemEval datasets.4

Applying the proposed weighting scheme to the four ranking algorithms does not make much difference in identifying highly frequent keyphrases compared with the unweighted and co-occurrence-frequency-weighted approaches. In fact, the actual improvement comes from extracting keyphrases with low frequencies. As shown in Figure 4.2, the proposed approach extracts keyphrases with much lower frequencies than the other approaches. The proposed weighting scheme identifies the degrees of semantic relatedness of candidate phrases; therefore, if a low-frequency phrase co-occurs with phrases having high frequencies and stronger semantic relatedness, it gains higher weights and hence higher ranking scores.
[Figure 4.2 comprises two column charts, one for the DUC dataset (a) and one for the SemEval dataset (b). The x-axis bins keyphrases by their frequency in the document (DUC: 2, 3, 4, 5, 6–10, above 10; SemEval: 1–10, 11–20, 21–30, 31–50, 51–100, above 100); the y-axis counts the number of keyphrases correctly extracted. Columns compare the Unweighted, Co-occurrence frequency, Holistic Phrase Embedding, and Compositional Phrase Embedding weightings.]

Figure 4.2: Ground-truth and extracted keyphrase distributions per document. (a): distributions on the DUC dataset. (b): distributions on the SemEval dataset.
4.5.1.4 Tunable Hyper-parameters
Our evaluations mainly focus on analysing the actual effectiveness of the weighting scheme for graph ranking algorithms for AKE. However, it is worth noting that a few potential hyper-parameters and strategies can be applied to our approach for further improvement, as we have demonstrated in our previous work [21, 22]. Firstly, restricting the maximum number of words in a phrase efficiently reduces the errors made by POS taggers, especially on the SemEval dataset; a commonly used limit is 5, based on the observation that most human-assigned keyphrases have fewer than 5 words. Secondly, setting a threshold to filter low-frequency words can also improve the results. Thirdly, the proposed weighting scheme can use Pointwise Mutual Information (PMI) or the Dice coefficient measure instead of raw frequency counts. Finally, different word embedding algorithms and training techniques may also affect the results.

4 The Hulth dataset consists of short articles, where most candidates occur only once; therefore, we do not use this dataset for this evaluation.
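As an illustration of the third point, PMI could replace the raw co-occurrence count in the weighting scheme (a sketch under the assumption that probabilities are estimated from simple counts; the names are ours):

```python
import math

def pmi(coocc, freq_p, freq_q, total):
    """PMI(p, q) = log2( P(p, q) / (P(p) P(q)) ), estimated from counts."""
    p_pq = coocc / total
    return math.log2(p_pq / ((freq_p / total) * (freq_q / total)))
```

Unlike the raw count, PMI rewards pairs that co-occur more often than their individual frequencies would predict, which dampens the influence of very frequent phrases.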
4.6 Conclusion
In this chapter, we have presented a weighting scheme using word or phrase embeddings as an external knowledge source. The evaluation shows that the proposed weighting scheme generally improves the performance of graph ranking algorithms for AKE. We also demonstrated that using word embeddings is an effective way to mitigate the frequency-sensitivity problem in unsupervised AKE approaches.

The phrase embeddings used in this study are induced with two approaches: the holistic and the algebraic composition approaches. Despite the improvement that the weighting scheme achieves, each approach for inducing phrase embeddings has drawbacks in practice. The holistic approach suffers from data coverage and sparsity problems – it is unable to generate embeddings for phrases not appearing in the training set, and most phrases occur far less often than unigram words. The algebraic composition approach, on the other hand, is overly simplified and does not take the syntactic information of words into account. In the next chapter, we focus specifically on these issues and introduce a deep learning approach for modelling language compositionality.
Chapter 5
Featureless Phrase Representation Learning with Minimal Labelled Data
In Chapter 4, we demonstrated the effectiveness of using word embedding vectors to supply semantic relatedness for graph-based keyphrase extraction. However, the study also uncovered two problems. Firstly, applying word embeddings as external knowledge for AKE has limitations: current techniques for learning distributed representations of linguistic expressions are limited to unigrams. Learning distributed representations for multi-gram expressions, such as multi-word phrases or sentences, remains a great challenge, because it requires learning not only the meaning of each constituent word of an expression, but also the rules for combining the words [25]. Secondly, although assigning weights to graphs improves the performance of graph-based AKE algorithms, the development of weighting schemes is a rather heuristic process in which the choice of features is critical to the overall performance. This turns the development of graph-based AKE algorithms into a laborious feature engineering exercise. The goal of this thesis is to discover AKE approaches that do not rely on the choice of features and datasets; hence, in the rest of the thesis, we focus on deep learning approaches for AKE that automatically learn useful features.

This chapter is an initial investigation into learning distributed representations of multi-word phrases using deep neural networks. In this chapter, instead of focusing on AKE, we target a cousin task of AKE – automatic domain-specific term extraction. Keyphrases describe the main ideas, topics, or arguments of a document, and are therefore document-dependent; extracting keyphrases requires an understanding of both phrases and documents. Domain-specific terms, on the other hand, are properties of a particular domain, independent of any document.
Identifying domain-specific terms requires only an understanding of the meanings of the terms themselves; therefore, automatic term extraction is a more efficient task for evaluating the meanings of phrases obtained from deep learning models. Section 2.1.9.1 provides a detailed literature review of common approaches to automatic term extraction.

In this chapter, we introduce a weakly supervised bootstrapping approach using two deep learning classifiers. Each classifier learns term representations separately by taking word embedding vectors as inputs, so no manually selected features are required. The two classifiers are first trained on a small set of labelled data, then independently make predictions on a subset of the unlabelled data. The most confident predictions are subsequently added to the training set to retrain the classifiers. This co-training process minimises the reliance on labelled data. Evaluations on two datasets demonstrate that the proposed co-training approach achieves performance competitive with standard supervised baseline algorithms, using very little labelled data.
5.1 Introduction
Domain-specific terms are essential for many knowledge management applications, such as clinical text processing, risk management, and equipment maintenance. Domain-specific term extraction aims to automatically identify domain-relevant technical terms, which can be either unigram words or multi-word phrases. Supervised domain-specific term extraction often relies on training a binary classifier to decide whether a candidate term is relevant to the domain [134–136]. In such approaches, term extraction models are built upon manually selected features, including local statistical and linguistic information (e.g. frequency, co-occurrence frequency, or linguistic patterns) and external information from third-party knowledge bases (e.g. WordNet, DBpedia). Designing and evaluating different feature combinations turns the development of term extraction models into a time-consuming and labour-intensive exercise. In addition, these approaches require a large amount of labelled training data to generalise the learning; however, labelled data is often hard or impractical to obtain.
In this chapter, our first objective is to minimise the use of labelled data by training two classifiers in a co-training fashion. Co-training is a weakly supervised learning mechanism introduced by Blum and Mitchell (1998), which tackles the problem of building a classification model from a dataset with limited labelled data among a majority of unlabelled examples. It requires two classifiers, each built upon a separate view of the data, where each view is a separate set of manually selected features that must be sufficient to learn a classifier. For example, Blum and Mitchell classify web pages using two views: 1) the words appearing in the content of a web page, and 2) the words in hyperlinks pointing to the web page. Co-training starts by training each classifier on a small labelled dataset; each classifier then makes predictions individually on a subset of the unlabelled data. The most confident predictions are subsequently added to the training set to retrain each classifier, and this process is iterated a fixed number of times. Co-training has been applied to many NLP tasks where labelled data is scarce, including statistical parsing [208], word sense disambiguation [209], and coreference resolution [210], and these studies demonstrate that it generally improves performance without requiring additional labelled data.
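The co-training loop described above can be sketched as follows (a minimal skeleton; the stub classifier, pool sizes, and method names are illustrative stand-ins for the two view-specific classifiers):

```python
import random

class MajorityStub:
    """Toy classifier used only to exercise the loop."""
    def fit(self, labelled):
        labels = [y for _, y in labelled]
        self.majority = max(set(labels), key=labels.count)

    def predict_confident(self, pool, g):
        # Pretend the first g pool examples are the most confident ones.
        return [(x, self.majority) for x in pool[:g]]

def co_train(clf_a, clf_b, labelled, unlabelled, pool_size, g, rounds):
    """Blum & Mitchell-style co-training: train both classifiers, move
    each one's g most confident pool predictions into the labelled set,
    refill the pool with 2g fresh unlabelled examples, and repeat."""
    labelled = list(labelled)
    pool = random.sample(unlabelled, min(pool_size, len(unlabelled)))
    rest = [x for x in unlabelled if x not in pool]
    for _ in range(rounds):
        clf_a.fit(labelled)
        clf_b.fit(labelled)
        newly = dict(clf_a.predict_confident(pool, g)
                     + clf_b.predict_confident(pool, g))
        labelled += list(newly.items())
        pool = [x for x in pool if x not in newly]
        pool += rest[:2 * g]
        rest = rest[2 * g:]
    return labelled
```

Any pair of classifiers exposing `fit` and a confidence-ranked prediction method can be dropped into this loop in place of the stub.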
Our second objective is to eliminate the effort of feature engineering by using deep learning models. Applying deep neural networks directly to NLP tasks without feature engineering is also described as NLP from scratch [172]. In such models, words are represented as low-dimensional, real-valued vectors encoding both semantic and syntactic features [15]. In our model, word embeddings are pre-trained over the corpora to encode word features, and are used as inputs to two deep neural networks that learn different term representations (corresponding to the concept of views) over the same dataset. One network is a Convolutional Neural Network (CNN) that learns term representations through a single convolutional layer with multiple filters, followed by a max pooling layer. Each filter is associated with a region that essentially corresponds to a sub-gram of a term; the underlying intuition is that the meaning of a term can be learnt from its sub-grams by analysing different combinations of words. The other is a Long Short-Term Memory (LSTM) network that learns the representation of a term by recursively composing the embedding of an input word with the composed value from its predecessor, hypothesising that the meaning of a term can be learnt from the sequential combination of its constituent words. Each network connects to a logistic regression layer to perform classification.
Our model is evaluated on two benchmark domain-specific corpora, namely GENIA (biology domain) [211] and ACL RD-TEC (computer science domain) [212]. The evaluation shows that our model outperforms the C-value algorithm [56], which is often treated as the benchmark in term extraction. We also train the CNN and LSTM classifiers individually using standard supervised learning, and demonstrate that the co-training model is an effective approach to reducing the use of labelled data while maintaining performance competitive with the supervised models.
[Figure 5.1 depicts the co-training architecture: the labelled data L trains two networks in parallel – a CNN (word embeddings look-up table, input layer, convolution and max pooling, logistic regression layer) and an LSTM (word embeddings look-up table, input layer, LSTM layer, logistic regression layer). Both networks label examples drawn from a pool U′ sampled from the unlabelled data U; each adds its g most confident predictions to the training set, and the pool is refilled with 2g examples.]

Figure 5.1: Co-training Network Architecture Overview: solid lines indicate the training process; dashed lines indicate the prediction and labelling processes.
5.2 Related Work
A brief literature review on term extraction is presented in Section 2.1.9.1. In this
section, we discuss only the work most closely related to ours in methodology. That
work is Fault-Tolerant Learning (FTL) [213], inspired by Transfer Learning [214] and
the Co-training algorithm [215]. In FTL, two Support Vector Machine (SVM) classifiers
help train each other in a similar fashion to the Co-training algorithm. The main
difference is that FTL does not require any labelled data; instead, it uses two
unsupervised ranking algorithms, TF-IDF [2] and TEDel [216], to generate two sets of
seed terms, which may contain a small amount of incorrectly labelled data, before
training starts. The two classifiers are first trained on different sets of seed data; then
each classifier is used to verify the seed data that was used to train the other. After
that, the two classifiers are re-trained on the verified seed data. The remaining steps
are the same as in the Co-training algorithm. However, FTL relies on manually selected
features. In contrast, our model uses deep neural networks that take pre-trained word
embeddings as inputs, without any manually selected features. Another difference is
that our model requires a small amount of labelled data as seeds to initialise the
training.
5.3 Proposed Model
5.3.1 Overview
We propose a weakly supervised bootstrapping model using Co-training for term
extraction, as shown in Figure 5.1. The labelled data L and unlabelled data U contain
candidate terms that are identified in the pre-processing stage¹. The actual inputs to
the model are the word embedding vectors of an input term, obtained via the look-up
table. All word embedding vectors are pre-trained over the corpus². The model consists
of two classifiers: the left classifier is a CNN network, and the right one is an LSTM
network. The networks independently learn the representations of terms. The output
layer of both networks is a logistic regression layer. The two neural networks are
trained using the Co-training algorithm described in Section 5.3.5.
The Co-training algorithm requires two separate views of the data, which traditionally
are two sets of manually selected features. In our model, however, there are no
manually selected features. Instead, the two views of the data are provided by our two
hypotheses about how the meanings of terms are learnt: 1) by analysing different
sub-gram compositions, and 2) by sequentially combining each constituent word. The
hypotheses are implemented by the CNN and LSTM networks respectively, and we
expect the rules of composing words to be captured by the networks. The CNN
network analyses different regions of an input matrix constructed by stacking word
embedding vectors, as shown in Figure 5.2, where the region sizes reflect different
N-grams of a term. By scanning the embedding matrix with different region sizes, we
expect the CNN network to learn the meaning of a term by capturing its most
representative sub-gram. The LSTM network, on the other hand, learns the
compositionality by recursively composing an input embedding vector with the
previously composed value, as shown in Figure 5.3. We expect the LSTM network to
capture the meaning of a term through its gates, which govern the information flow –
how much information (or meaning) of an input word is added into the overall
meaning, and how much information is discarded from the previous composition.
5.3.2 Term Representation Learning
The objective is to learn a mapping function f that outputs the compositional
representation of a term given its word embeddings. Concretely, let V be the vocabulary
of a corpus, with size v. For each word w ∈ V, there is a corresponding d-dimensional
embedding vector. The collection of all embedding vectors in the vocabulary forms a
matrix, denoted C, where C ∈ R^{d×v}. C can be thought of as a look-up table, where
C(wi) denotes the embedding vector of word wi. Given a term s = (w1, w2, ..., wn), the
goal is to learn a mapping function f(C(s)) that takes the individual vector
representation of each word as input and outputs a composed vector of size d, which
represents the compositional meaning of s.
¹ Details are presented in Section 5.4.2.
² Details are presented in Section 5.3.4.
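As a minimal sketch, the look-up step C(s) can be written as follows; the toy vocabulary, the dimensionality d = 3, and the vector values are purely illustrative, not the pre-trained embeddings used in the model:

```python
# Toy look-up table C: one d-dimensional embedding per vocabulary word.
# In the model these vectors are pre-trained with SkipGram; the values
# below are made up purely for illustration.
d = 3
C = {
    "human":            [0.1, 0.2, 0.3],
    "immunodeficiency": [0.4, 0.1, 0.0],
    "virus":            [0.2, 0.5, 0.1],
    "enhancer":         [0.3, 0.3, 0.2],
}

def lookup(term):
    """Map a term s = (w1, ..., wn) to its sequence of embedding vectors C(s)."""
    return [C[w] for w in term.split()]

vectors = lookup("human immunodeficiency virus enhancer")
assert len(vectors) == 4 and all(len(v) == d for v in vectors)
```

The mapping function f then composes this sequence of vectors into a single vector of size d.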
Figure 5.2: Convolutional Model. The input matrix M (l × d) stacks the word
embeddings of a term (e.g. human immunodeficiency virus enhancer), padded with zero
vectors; r region sizes (2 × d, 3 × d, 4 × d), each with n filters, produce n × r feature
maps, followed by a max pooling layer and a fully connected layer.
5.3.2.1 Convolutional Model
We adopt the CNN network used by Kim [26] and Zhang et al. [27], which has only one
convolutional layer, as shown in Figure 5.2. The inputs C(s) to the network are
vertically stacked into a matrix M, where the rows are the word embeddings of each
w ∈ s. Let d be the length of the word embedding vectors and l be the fixed length of a
term; then M has dimension l × d. We pad the matrix with zero vectors if the number
of tokens of an input term is less than l. The convolutional layer has r predefined
region sizes, and each region size has n filters. All regions have the same width d,
because each row in the input matrix M represents a word and the goal is to learn the
composition of words. However, the regions have varying heights h, which can be
thought of as different sub-grams of the term; for example, when h = 2, a region
represents bigram features. Let W ∈ R^{h×d} be the weights of a filter. The filter outputs
a feature map c = [c1, c2, ..., cl−h+1], where ci is computed as:

ci = f(W · M[i : i + h − 1] + b) (5.1)
where M[i : i + h − 1] is the sub-matrix of M from row i to row i + h − 1, f is an
activation function (we use the hyperbolic tangent in this work), and b is a bias unit. A
pooling function is then applied to extract values from the feature map. We use 1-Max
pooling as suggested by Zhang et al. [27], who conduct a sensitivity analysis of
one-layer convolutional networks and show that 1-Max pooling consistently outperforms
other pooling strategies in sentence classification tasks. The total number of feature
maps in the network is m = r × n, so the output from the max pooling layer, y ∈ R^m,
is computed as:

y = [max(c1), max(c2), ..., max(cm)] (5.2)

Figure 5.3: LSTM Model. Recurrent units compose the word embeddings of a term
(e.g. human immunodeficiency virus enhancer); each unit computes the gates it, ft, ot
and the candidate gt to produce the cell state ct and hidden state ht (Equations
5.3–5.7), and the final hidden state connects to the logistic regression layer.
5.3.2.2 LSTM Model
We use an LSTM network similar to the vanilla LSTM [217] without peephole
connections, shown in Figure 5.3. The LSTM network features a memory cell at each
timestamp. A memory cell stores, reads and writes the information passing through the
network at timestamp t, and consists of four elements: an input gate i, a forget gate
f, a candidate g for the current cell state value, and an output gate o. At t, the inputs
to the network are the previous cell state value ct−1, the previous hidden state value
ht−1, and the input value xt. The outputs are the current cell state ct and the current
hidden state value ht, which are subsequently passed to the next timestamp t + 1.
At time t, the candidate g for the current cell state composes the new input xt with the
previously composed value ht−1 to generate a new state value:

gt = tanh(Wg · xt + Ug · ht−1 + bg) (5.3)

where Wg and Ug are shared weights, and bg is the bias unit.
The input gate i in the LSTM network decides how much information passes from gt
into the actual computation of the memory cell state, using a sigmoid function
σ(x) = 1/(1 + e^{−x}) that outputs a value between 0 and 1 indicating the proportion:

it = σ(Wi · xt + Ui · ht−1 + bi) (5.4)
where Wi and Ui are shared weights, and bi is the bias unit. Likewise, the forget gate f
governs how much information is filtered out from the previous state ct−1:

ft = σ(Wf · xt + Uf · ht−1 + bf) (5.5)
The new cell state value takes part of the information from the current inputs and part
from the previous cell state value:

ct = it ⊗ gt + ft ⊗ ct−1 (5.6)

where ⊗ is element-wise vector multiplication. ct is passed to the next timestamp
t + 1, and via this path the cell state can remain unchanged from one timestamp to
another, realising the long short-term memory.
The output gate o can be thought of as a filter that prevents irrelevant information
from being passed to the next state. The output gate ot and the hidden state value ht
are computed as:

ot = σ(Wo · xt + Uo · ht−1 + bo)
ht = ot ⊗ tanh(ct) (5.7)

where ht is the composed representation of the word sequence from time 0 to t.
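Equations 5.3–5.7 can be sketched for a single embedding dimension as follows; the shared weights here are arbitrary placeholders, not learnt values:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One recurrent unit for a single dimension, following Equations 5.3-5.7."""
    g = math.tanh(p["Wg"] * x_t + p["Ug"] * h_prev + p["bg"])  # candidate (5.3)
    i = sigmoid(p["Wi"] * x_t + p["Ui"] * h_prev + p["bi"])    # input gate (5.4)
    f = sigmoid(p["Wf"] * x_t + p["Uf"] * h_prev + p["bf"])    # forget gate (5.5)
    c = i * g + f * c_prev                                     # cell state (5.6)
    o = sigmoid(p["Wo"] * x_t + p["Uo"] * h_prev + p["bo"])    # output gate (5.7)
    h = o * math.tanh(c)                                       # hidden state (5.7)
    return h, c

# Placeholder shared weights; in the model these are learnt with SGD.
p = {k: 0.5 for k in
     ("Wg", "Ug", "bg", "Wi", "Ui", "bi", "Wf", "Uf", "bf", "Wo", "Uo", "bo")}
h, c = 0.0, 0.0
for x in (0.1, 0.4, 0.2, 0.3):   # one embedding dimension per word of a term
    h, c = lstm_step(x, h, c, p)
assert -1.0 < h < 1.0            # final h is the composed term representation
```

In the actual network every quantity is a d-dimensional vector and the multiplications become matrix-vector products, but the information flow through the gates is the same.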
5.3.3 Training Classifier
To build the classifiers, each network is connected to a logistic regression layer for the
binary classification task: whether a term is relevant to the domain. The logistic
regression layer can, however, simply be replaced by a softmax layer for multi-class
classification tasks such as ontology categorisation.
Overall, the probability that a term s is relevant to the domain is:
p(s) = σ(W · f(C(s)) + b) (5.8)
where σ is the sigmoid function, W denotes the weights of the logistic regression layer,
b is the bias unit, and f is the mapping function implemented by the CNN or LSTM
network.
The parameters of the convolutional classifier are θ = (C, W^conv, b^conv, W^conv_logist,
b^conv_logist), where W^conv are the weights of all m filters and b^conv are the bias
vectors. For the LSTM classifier, θ = (C, W^lstm, b^lstm, W^lstm_logist, b^lstm_logist),
where W^lstm = (Wi, Wg, Wf, Wo, Ui, Ug, Uf, Uo) and b^lstm = (bi, bg, bf, bo). Given a
training set D, the learning objective for both classifiers is to find the parameters θ
that maximise the log probability of the correct label for each s ∈ D:

argmaxθ ∑s∈D log p(slabel | s; θ) (5.9)
θ is updated using Stochastic Gradient Descent (SGD) to minimise the negative log
likelihood error:

θ := θ − ε ∂(−log p(slabel | s; θ))/∂θ (5.10)

where ε is the learning rate.
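The update in Equations 5.8–5.10 can be sketched for a single training example as follows; the feature vector stands in for the composed representation f(C(s)), and all values are illustrative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(W, b, feats):
    """p(s) = sigmoid(W . f(C(s)) + b), Equation 5.8."""
    return sigmoid(sum(w * x for w, x in zip(W, feats)) + b)

def sgd_step(W, b, feats, label, lr=0.01):
    """One SGD update minimising the negative log likelihood (Equation 5.10);
    the gradient of -log p(label) w.r.t. W is (p - label) * feats."""
    err = predict(W, b, feats) - label
    return [w - lr * err * x for w, x in zip(W, feats)], b - lr * err

# A stand-in composed representation f(C(s)) with a positive label.
feats, label = [0.2, -0.1, 0.4], 1
W, b = [0.0, 0.0, 0.0], 0.0
before = predict(W, b, feats)
for _ in range(200):
    W, b = sgd_step(W, b, feats, label)
after = predict(W, b, feats)
assert after > before    # the likelihood of the correct label increases
```

In the full model the gradient also propagates through f into the network weights and the embeddings C; only the logistic layer is shown here.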
5.3.4 Pre-training Word Embedding
All word embeddings are pre-trained over the datasets. We use the SkipGram model [15]
to learn the word embeddings³. Given a word w, the SkipGram model predicts the
context (surrounding) words S(w) within a predefined window size. Using the softmax
function, the probability of a context word s ∈ S(w) is:

p(s|w) = exp(v′w⊤ · vs) / ∑t∈V exp(v′t⊤ · vs) (5.11)

where V is the vocabulary, and v′w and vs are the output vector representation of w
and the input vector representation of the context s, respectively. The learning
objective is to find the parameters θ that maximise the conditional probability over the
vocabulary V in a training corpus D:

argmaxθ ∑w∈D ∑s∈S(w) log p(s|w; θ) (5.12)
³ Please see Equation 4.3 for the detailed computations.
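Equation 5.11 can be sketched over a toy vocabulary as follows; the two-dimensional vectors are illustrative, not trained SkipGram embeddings:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def skipgram_prob(w, s, out_v, in_v):
    """p(s|w) from Equation 5.11: a softmax over the vocabulary of the
    dot products between output and input vectors."""
    den = sum(math.exp(dot(out_v[t], in_v[s])) for t in out_v)
    return math.exp(dot(out_v[w], in_v[s])) / den

# Toy vectors; real embeddings are 300-dimensional and trained over the corpus.
out_v = {"gene": [0.2, 0.1], "virus": [0.4, 0.3], "the": [0.0, 0.1]}
in_v = {k: [x * 0.5 for x in v] for k, v in out_v.items()}
p = skipgram_prob("virus", "gene", out_v, in_v)
total = sum(skipgram_prob(w, "gene", out_v, in_v) for w in out_v)
assert 0.0 < p < 1.0 and abs(total - 1.0) < 1e-9   # the softmax normalises to 1
```

In practice the full softmax over the vocabulary is too expensive, and approximations such as negative sampling are used during training.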
5.3.5 Co-training Algorithm
Algorithm 1: Co-training
Input: L, U, C, p, k, g
create U′ by randomly choosing p examples from U
while iteration < k do
    for c ∈ C do
        use L to train c
    end for
    for c ∈ C do
        use c to posit labels on U′
        add the g most confident examples to L
    end for
    refill U′ by randomly choosing 2 × g examples from U
end while
Given the unlabelled data U, a pool U′ of size p, and a small set of labelled data L,
each classifier c ∈ C is first trained on L. After training, the classifiers make predictions
on U′; the g most confident predictions from each classifier are then added to L. The
size of U′ thus becomes p − 2g, and L grows by 2g examples. U′ is then refilled by
randomly selecting 2g examples from U. This process iterates k times. Algorithm 1
documents the details.
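Algorithm 1 can be sketched as follows; the ThresholdStub classifier is a stand-in for the CNN and LSTM classifiers, and all parameter values are arbitrary:

```python
import random

class ThresholdStub:
    """Toy stand-in for a CNN/LSTM classifier: positive if the value is above 0."""
    def fit(self, labelled):
        pass
    def confidence(self, e):
        return (e > 0, abs(e))   # (predicted label, confidence score)

def co_train(L, U, classifiers, pool_size, g, k):
    """Sketch of Algorithm 1: train each view on L, label the pool, keep the g
    most confident predictions per classifier, refill the pool with 2g examples."""
    U = list(U)
    pool = [U.pop() for _ in range(pool_size)]
    for _ in range(k):
        for clf in classifiers:
            clf.fit(L)
        for clf in classifiers:
            best = sorted(pool, key=lambda e: clf.confidence(e)[1], reverse=True)[:g]
            for e in best:
                pool.remove(e)
                L.append((e, clf.confidence(e)[0]))
        pool.extend(U.pop() for _ in range(min(2 * g, len(U))))
    return L

random.seed(0)
L = [(1.0, True), (-1.0, False)]                 # small seed set
U = [random.uniform(-1, 1) for _ in range(50)]   # unlabelled candidates
L = co_train(L, U, [ThresholdStub(), ThresholdStub()], pool_size=10, g=2, k=3)
assert len(L) == 2 + 3 * 2 * 2                   # L grows by 2g per iteration
```

The interface assumed here (`fit` and `confidence`) is hypothetical; in the actual model the two classifiers are the CNN and LSTM networks trained by SGD.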
5.4 Experiments
5.4.1 Datasets
We evaluate our model on two datasets. The first is the GENIA corpus⁴, a collection of
1,999 abstracts of articles in the field of molecular biology; the corpus has a
ground-truth term list, and we use the current version 3.02 for our evaluation. The
second is the ACL RD-TEC corpus⁵, which consists of 10,922 articles published
between 1965 and 2006 in the domain of computer science. The ACL RD-TEC corpus
groups terms into three categories: invalid terms, general terms, and computational
terms. We treat only the computational terms as ground truth in our evaluation.
Table 5.1 shows the statistics.
Table 5.1: Dataset Statistics

             Articles   Domain              Tokens       Ground Truth
GENIA        1,999      Molecular Biology   451,562      30,893
ACL RD-TEC   10,922     Computer Science    36,729,513   13,832
⁴ Publicly available at http://www.geniaproject.org/
⁵ Publicly available at https://github.com/languagerecipes/the-acl-rd-tec
5.4.2 Pre-processing
The pre-processing has two goals: cleaning the data and identifying candidate terms.
The data cleaning procedure includes extracting the text content and ground-truth
terms, removing plural forms of nouns, and converting all tokens to lower case.

The ACL RD-TEC corpus provides a pre-identified candidate list, so we only need to
identify candidate terms for the GENIA corpus. We build two candidate identifiers.
The first uses noun phrase chunking with predefined POS patterns; we call it the POS
identifier. We use a common POS pattern, <JJ>*<NN.*>+, that is, zero or more
adjectives followed by one or more nouns. The documents were parsed using the
Stanford Log-linear POS tagger [196]. The POS identifier, however, is only able to
identify 67% of the ground-truth terms. One reason is that not all ground-truth terms
match our simple POS pattern⁶. Another is a shortcoming of POS pattern-based
phrase chunking: it is unable to identify a sub-term from its super-term when the two
always occur together. For example, the POS identifier spots the term novel antitumor
antibiotic, which occurs only once in the GENIA dataset, whereas in the ground-truth
set, the correct term is antitumor antibiotic.
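The POS identifier's matching can be sketched as follows; this is a simplified stand-in for the actual chunker over the tagger's output, matching <JJ>*<NN.*>+ on a tag sequence (the sentence and tags are illustrative):

```python
import re

def pos_candidates(tagged):
    """Find <JJ>*<NN.*>+ spans (zero or more adjectives, then one or more
    nouns) in a POS-tagged sentence given as (token, tag) pairs."""
    tags = " ".join(tag for _, tag in tagged)
    spans = []
    # Match on the tag string, then map the offset back to token indices.
    for m in re.finditer(r"(?:JJ )*(?:NN\S* ?)+", tags + " "):
        start = tags[:m.start()].count(" ")
        length = len(m.group().split())
        spans.append(" ".join(tok for tok, _ in tagged[start:start + length]))
    return spans

tagged = [("a", "DT"), ("novel", "JJ"), ("antitumor", "NN"),
          ("antibiotic", "NN"), ("was", "VBD"), ("found", "VBN")]
assert pos_candidates(tagged) == ["novel antitumor antibiotic"]
```

As the example shows, the chunker returns only the maximal span novel antitumor antibiotic and can never produce the embedded sub-term antitumor antibiotic on its own.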
The second identifier uses N-gram based chunking (the N-gram identifier), which
decomposes a sequence of words into all possible N-grams. However, there would be
too many candidates if we simply decomposed every sentence into all possible N-grams.
Thus, we use stop-words as delimiters and decompose any expression between two
stop-words into all possible N-grams as candidates. For example, novel antitumor
antibiotic produces 6 candidates: novel antitumor antibiotic, novel antitumor,
antitumor antibiotic, novel, antitumor, and antibiotic. The N-gram identifier achieves
much better coverage of the ground-truth set, although there is a small number of
ground-truth terms containing stop-words that it cannot identify. It also produces
many more candidates than the POS identifier. Table 5.2 shows the statistics.
Table 5.2: Candidate Term Statistics

               Candidates   GT Cov.   Positive          Negative
GENIA POS      40,998       67.0%     20,704 (50.5%)    20,294 (49.5%)
GENIA N-gram   229,810      96.4%     29,781 (13.0%)    200,029 (87.0%)
ACL RD-TEC     83,845       100%      13,832 (16.5%)    70,013 (83.5%)

GT Cov.: Ground Truth Coverage
⁶ One may consider adding more patterns, but that is out of the scope of this work.
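The N-gram identifier can be sketched as follows; the stop-word list here is a toy stand-in for the actual list used:

```python
def ngram_candidates(sentence, stopwords):
    """Split a sentence on stop-words, then decompose each remaining
    expression into all possible N-grams (the N-gram identifier)."""
    chunks, current = [], []
    for tok in sentence.lower().split():
        if tok in stopwords:
            if current:
                chunks.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        chunks.append(current)
    candidates = []
    for chunk in chunks:
        n = len(chunk)
        for i in range(n):
            for j in range(i + 1, n + 1):
                candidates.append(" ".join(chunk[i:j]))
    return candidates

cands = ngram_candidates("a novel antitumor antibiotic", {"a", "the", "of"})
assert len(cands) == 6 and "antitumor antibiotic" in cands
```

Unlike the POS identifier, this enumeration produces the sub-term antitumor antibiotic alongside the maximal span, at the cost of many more candidates.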
Figure 5.4: Relationships among TP, TN, FP, and FN for term extraction, over the
candidate set, the ground-truth set, the evaluation set, and the unidentified ground
truth within all possible grams.
5.4.3 Experiment Settings
The Co-training algorithm requires a few parameters. We set the size of the small
labelled set L to 200 and the size of the pool U′ to 500 for all evaluations. The number
of iterations k is 800 for the POS-identified candidates, and 500 both for the
N-gram-identified candidates in the GENIA dataset and for the ACL RD-TEC dataset.
The growth size g is 20 for GENIA POS, 50 for GENIA N-gram, and 20 for ACL
RD-TEC. The evaluation data is randomly selected from the candidate sets; for each e
in the evaluation set E, e ∉ L. Table 5.3 shows the class distributions and statistics of
each evaluation set.
Table 5.3: Evaluation Dataset Statistics

               Test Examples   Positive         Negative
GENIA POS      5,000           2,558 (51.0%)    2,442 (49.0%)
GENIA N-gram   15,000          1,926 (12.8%)    13,074 (87.2%)
ACL RD-TEC     15,000          2,416 (16.1%)    12,584 (83.9%)
All word embeddings are pre-trained with 300 dimensions on each corpus⁷. The
convolutional model has 5 region sizes, {2, 3, 4, 5, 6}, for GENIA and 3 region sizes,
{2, 3, 4}, for ACL RD-TEC. Each region size has 100 filters. There are no additional
hyper-parameters for training the LSTM model. The learning rate for SGD is 0.01.
5.4.4 Evaluation Methodology
We employ Precision, Recall, and F-measure for the evaluation, as detailed in
Section 2.1.7. We illustrate the set relationships of True Positive (TP), False Positive
(FP), True Negative (TN), and False Negative (FN) in Figure 5.4.
⁷ All settings for training word embeddings used in this chapter are the same as in Section 4.4.1.
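As a sketch, the three measures can be computed from the TP, FP, and FN counts depicted in Figure 5.4; the counts below are invented for illustration:

```python
def precision_recall_f(tp, fp, fn):
    """Precision, Recall, and F-measure from the counts in Figure 5.4."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Invented counts: FN also includes ground-truth terms the identifiers never
# produced as candidates (the unidentified ground truth in Figure 5.4).
p, r, f = precision_recall_f(tp=80, fp=40, fn=20)
assert abs(p - 80 / 120) < 1e-9 and abs(r - 0.8) < 1e-9
```

Note that True Negatives do not enter any of the three measures, which is why F-score is more informative than accuracy under the skewed class distributions discussed below.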
5.4.5 Training
The model is trained in an online fashion on an NVIDIA GeForce 980 Ti GPU. The
training time increases linearly with each iteration, since the model incrementally adds
training examples to the training set. At the beginning, one iteration takes less than a
second; after 100 iterations, the training set has grown by 1,820 examples and an
iteration takes a few seconds. Thus, training time is not critical – even standard
supervised training takes only a few hours to converge.
5.4.6 Result and Discussion
We use C-value [56] as our baseline algorithm. C-value is an unsupervised ranking
algorithm for term extraction, where each candidate term is assigned a score indicating
its degree of domain relevance. We report the performance of C-value when extracting
different numbers of top-ranked candidate terms as the domain-specific terms, as
shown in Table 5.4. Since we treat the task as binary classification, we also list random
guessing scores for each dataset, where recall and accuracy are always 50% and
precision corresponds to the proportion of the positive class in each evaluation set. As
a comparison to the co-training model, we also train each classifier individually using
the standard supervised learning approach, dividing the candidate set into 40% for
training, 20% for validation, and 40% for evaluation. For the proposed co-training
model, the convolutional classifier outperforms the LSTM classifier in all evaluations,
so we present only the performance of the convolutional classifier. Results are shown in
Table 5.4.
The supervised CNN classifier unsurprisingly produces the best results on all
evaluation sets. However, it uses far more labelled data than the co-training model,
while delivering an F-score less than 2 percentage points higher on the GENIA corpus
and 6 points higher on the ACL RD-TEC corpus. Compared with the standard
supervised approach, the proposed co-training model is thus more “cost-effective”, since
it requires only 200 labelled examples as seed terms.
On the GENIA corpus, all algorithms produce a much better F-score on the POS
evaluation set. This is due to the different class distributions: the proportion of
positive (ground-truth) terms is 50.5% on the POS set, but only 12.8% on the N-gram
evaluation set. We therefore consider that the results from the POS and N-gram
evaluation sets are not directly comparable. However, the actual improvements in
F-score over random guessing are quite similar on both sets, suggesting that evaluating
a classifier should consider not only the F-score itself but also the actual improvement
over random guessing.
Table 5.4: Evaluation Results

                     Labelled Data   Precision   Recall   F-score   Accuracy
GENIA POS
Random Guessing      –               51%         50%      50.5%     50%
C-value (Top 1000)   –               62.4%       24.4%    35.1%     –
C-value (Top 2500)   –               53.7%       52.5%    53.1%     –
Supervised CNN       16,400          64.7%       78.0%    70.7%     67.1%
Co-training CNN      200             64.1%       76.0%    69.5%     65.5%

GENIA N-gram
Random Guessing      –               12.8%       50%      20.4%     50%
C-value (Top 1000)   –               25.2%       6.5%     10.4%     –
C-value (Top 2500)   –               12.9%       16.7%    14.6%     –
C-value (Top 7500)   –               11.4%       44.3%    18.1%     –
Supervised CNN       91,924          35.0%       59.1%    44.0%     81.4%
Co-training CNN      200             34.3%       56.6%    42.7%     75.5%

ACL RD-TEC
Random Guessing      –               16.1%       50%      24.4%     50%
C-value (Top 1000)   –               10.8%       4.5%     6.3%      –
C-value (Top 2500)   –               14%         14.6%    14.3%     –
C-value (Top 7500)   –               21.8%       68.2%    33.3%     –
Supervised CNN       33,538          70.8%       67.7%    69.2%     85.2%
Co-training CNN      200             66%         60.5%    63.1%     79.7%
Figure 5.5: (a) F-score and Accuracy of the convolutional classifier on the N-gram
evaluation set over 500 iterations. (b) F-score and Accuracy of the convolutional
classifier on the POS evaluation set over 800 iterations.
It is also interesting to note that the GENIA N-gram evaluation set has 12.8% positive
examples, a class imbalance similar to the ACL RD-TEC set's 16.1% positives.
However, all algorithms perform much better on the ACL RD-TEC corpus. We found
that in the ACL RD-TEC corpus the negative terms contain a large number of invalid
characters (e.g. ~), tokeniser mistakes (e.g. evenunseeneventsare), and non
content-bearing words (e.g. many), which the classifiers can easily spot. Another
reason might be that the ACL RD-TEC corpus is bigger than GENIA, which not only
allows C-value to perform better, but also enables the word2vec algorithm to deliver
more precise word embedding vectors, which are the inputs to our deep learning model.
Although the accuracy measure is commonly used in classification tasks, it does not
reflect the true performance of a model when the classes are unevenly distributed in an
evaluation set. For example, the N-gram evaluation set has about 12.8% positive and
about 87.2% negative examples. At the beginning of training, both models tend to
classify most examples as negative, so the accuracy is close to 87%. As training
progresses, the accuracy starts to drop, yet it remains difficult to tell from the accuracy
score exactly how the model performs. On the POS evaluation set, by contrast, the
classes are evenly distributed, so we can clearly see how the accuracy measure
corresponds to the F-scores.
Figure 5.6: (a) F-score for both the convolutional and LSTM classifiers on the N-gram
evaluation set over 500 iterations. (b) F-score for both classifiers on the POS
evaluation set over 800 iterations.
The CNN classifier outperforms the LSTM classifier on all evaluation sets. It also
requires far fewer iterations to reach its best F-score. We plot the F-scores of both
classifiers over several hundred iterations on the GENIA corpus, shown in Figure 5.6.
Both classifiers reach their best performance within 100 iterations. For example, the
CNN classifier on the POS evaluation set produces a good F-score of around 62% after
only about 30 iterations, and reaches its best F-score of 69.5% after 91 iterations.
However, the training set is still quite small at that point – by 91 iterations it has
grown by only 1,820 examples. This phenomenon leads us to consider two further
questions: 1) exactly how much performance is gained through the co-training model?
2) How do different numbers of training examples affect the performance of a deep
learning model, and do deep learning models still need large amounts of labelled
training examples to produce their best performance? In the rest of
Figure 5.7: Convolutional and LSTM classifiers trained on 200 examples, POS
evaluation set, over 800 iterations.
the chapter, we will answer the first question, and leave the second question for our
future work.
To investigate how the Co-training algorithm boosts the performance of the classifiers,
we train our model using only the 200 seed terms over 800 iterations; the results are
shown in Figure 5.7. The best F-score, about 53%, comes from the convolutional model
and is only slightly higher than random guessing. By applying the Co-training
algorithm, in contrast, we obtain a best F-score of 69.5%, a 16.5-point improvement. In
fact, the improvement achieved by adding just a small number of training examples to
the training set is also reported by [218]. Consequently, it is clear that our co-training
model is an effective approach to boost the performance of deep learning models
without requiring much training data.
5.5 Conclusion
In this chapter, we have presented a deep learning model using the Co-training
algorithm – a weakly supervised bootstrapping paradigm – for automatic
domain-specific term extraction. Experiments show that our model is a “cost-effective”
way to boost the performance of deep learning models with very few training examples.
The study also raises further questions, such as how the number of training examples
affects the performance of a deep learning model, and whether deep learning models
still need as many labelled training examples as other machine learning algorithms to
reach their best performance. We will continue working on these questions in the near
future.
Chapter 6
A Matrix-Vector Recurrent Unit
Network for Capturing Semantic
Compositionality in Phrase
Embeddings
In Chapter 5, we have shown the capability of encoding compositional semantics using
the convolutional neural network (CNN) and the Long Short-Term Memory (LSTM)
network. However, both have limitations. The CNN network is designed to encode
regional compositions at different locations in data matrices. In image processing,
pixels close to each other are usually part of the same object, so convolving image
matrices captures the regional compositions of semantically related pixels. Such
location invariance, however, does not exist in word embedding vectors. The LSTM
network, on the other hand, uses shared weights to encode composition rules for every
word in the vocabulary of a corpus, which may over-generalise the compositionality of
each individual word.
In this chapter, we present a novel compositional model based on the recurrent neural
network architecture. We introduce a new computation mechanism for the recurrent
units, namely the Matrix-Vector Recurrent Unit, to integrate different views of
compositional semantics originating from linguistic, cognitive, and neuroscience
perspectives. The recurrent architecture of the network allows for processing phrases of
various lengths as well as encoding the ordering of consecutive words. Each recurrent
unit consists of a compositional function that computes the composed meaning of two
input words, and a control mechanism that governs the information flow at each
composition. When we train the network in an unsupervised fashion, it captures latent
compositional semantics and produces a phrase embedding vector regardless of the
phrase's presence in the labelled set. When we train it in a supervised fashion by
adding a regression layer, it is able to perform classification tasks using the phrase
embeddings learnt from the lower layers of the network. We show that the model
produces better performance than both simple algebraic compositions and other deep
learning models, including LSTM, CNN, and recursive neural networks.
6.1 Introduction
The recent advancement of deep neural architectures has enabled a paradigm shift in
representing semantic meanings, from lexical words to distributed vectors, i.e. word
embeddings. These low dimensional, dense and real-valued vectors open up immense
opportunities to apply known algorithms for vector and matrix manipulation to
semantic calculations that were previously impossible. For example, the word analogy
test [15] has shown the existence of linear relations among words, demonstrating that
words sharing semantic meaning have similar vector representations.

However, word embeddings only encode the meanings of individual words. More often
than not, we need vector representations beyond the level of unigram words for tasks
such as keyphrase extraction or terminology mining. The meaning of a complex
expression, i.e. a phrase or a sentence, is referred to as its compositional semantics,
which is reconstructed from the meaning of each constituent word and the explicit or
implicit rules of combining them [25].
Traditional approaches for learning compositional semantics can be categorised into
two streams, namely holistic and algebraic composition. The holistic approach learns
compositional semantics by treating phrases as atomic units (e.g. using hyphens to
convert pre-identified phrases into unigrams); the same algorithms used to induce word
embeddings, such as the SkipGram model [15], are then applied to learn the phrase
representations. This approach suffers from data coverage and sparsity problems – it
cannot generate embeddings for phrases that do not appear in the training set, and
most phrases co-occur much less often than unigram words. The foremost shortcoming,
however, is that it has no mechanism for learning the compositional rules. The
algebraic composition approach, on the other hand, applies simple algebraic functions
that take the embeddings of constituent words as inputs and produce the embedding of
a phrase. For example, Mitchell and Lapata [176] have investigated different types of
composition function, including the linear addition or multiplication of word
representations, the linear addition of words and their distributional neighbours, and
the combination of addition and multiplication. Applying such simple algebraic
functions has two shortcomings. Firstly, it does not take the order of words into
consideration, yet word order often plays an essential role in distinguishing meanings,
e.g. ladies first and first ladies. Secondly, it over-simplifies the compositional rules: the
functions are pre-identified, i.e. they are not learned directly from input text, and
therefore fail to capture the different syntactic relations and the compositionality of
words.
In this chapter, we present a novel compositional deep learning model that directly
learns the semantic compositionality of words from input text. The model is based on
a recurrent neural network architecture, in which we introduce a new computation
mechanism for the recurrent units, namely the Matrix-Vector Recurrent Unit (MVRU),
based on views of compositional semantics originating from linguistic, cognitive,
and neuroscience perspectives. The recurrent architecture of the network allows
for processing phrases of varying length as well as encoding the order of consecutive
words. Each recurrent unit consists of a compositional function that computes
the composed meaning of input words, and a control mechanism that governs the
information flow at each composition.
The MVRU model is inspired by Socher et al. [19]. The similarity is that both represent
a word using a vector-matrix pair: the vector represents the meaning of a word, and
the matrix is intended to capture the compositional rule of the word, which determines the
contributions made by other words when they compose with it. However, the
MVRU differs from [19] in several aspects. Firstly, Socher et al. use a recursive network
architecture, which efficiently captures structural information. The underlying
intuition is that the meaning of a phrase or sentence can be learnt from the syntactic
combination of words. Although syntactic structures provide additional information
for learning compositional semantics, generating such information requires parsing as
a prior step. Considering the many successful studies [17, 97] that learn the compositionality of
words from purely sequential combinations, we adopt a recurrent network architecture.
Secondly, in [19], the model recursively traverses the syntactic tree of a sentence to
produce the phrase representations. However, the composed vector is directly passed
into the next composition without any control mechanism, and hence the information
composed at the previous stage is no longer accessible for onward computation – no background
knowledge is stored. In contrast, we use cell state memories to store the information
at each composition as background knowledge, and a gate to control the information flow
by combining a part of the information from the current input with the information
passed from its precedents. Thirdly, [19] requires Part-of-Speech parsing prior to
training, whereas the proposed MVRU model does not require any pre-parsing or
pre-identified phrases.
We show that when the MVRU is trained in an unsupervised fashion, it captures general
latent compositional semantics. After training, it can produce a phrase embedding vector
regardless of whether the phrase appeared in the training set. The MVRU is evaluated
on phrase similarity and compositionality tasks, outperforming baseline models,
including distributional compositional models, two linear distributed algebraic composition
models, and two deep learning models – a Long Short-Term Memory (LSTM) and
a Convolutional Neural Network (CNN) model. When the MVRU is trained in a supervised
fashion by adding a regression layer, it is able to perform classification tasks using the
phrase embeddings learnt from the lower layers of the network. We also demonstrate
that the MVRU outperforms baselines on predicting phrase sentiment distributions and
identifying domain-specific terms.
6.2 Proposed Model
6.2.1 Compositional Semantics
From the linguistic and cognitive perspective, Frege [25] states that the meaning of a
complex expression is determined by the meanings of its constituent expressions and the
rules used to combine them, which is known as the principle of compositionality. Under
this principle, the two essential elements for modelling compositional semantics are the
meanings of words and the compositional rules.
Kuperberg [219] describes, from a neuroscience viewpoint, how language is processed and
words are composed in the human brain, and suggests that normal language
comprehension involves at least two parallel neural processing streams: a semantic memory
based mechanism, and a combinatorial mechanism. The semantic memory based stream
not only computes and stores the semantic features of, and relationships between, content
words in a complex expression, but also compares these relationships with those
pre-stored within the semantic memory. The combinatorial mechanism combines the
meanings of words in a complex expression to build propositional meaning based on two
constraints, morphosyntactic and semantic-thematic. The morphosyntactic constraint
detects syntax errors, while semantic-thematic processing evaluates whether the
semantic memory is incongruent with respect to real-world knowledge. The semantic
memory and combinatorial streams interact during online processing, where the semantic
memory stream can be modulated by the output of the combinatorial stream.
Based on the literature, we formalise five components that are essential for modelling
compositional semantics. Concretely, given a phrase m = [w1, w2, ..., wn] consisting of n words,
Figure 6.1: Elman Recurrent Network for Modelling Compositional Semantics (a chain of recurrent units processing the phrase human immunodeficiency virus enhancer; each unit computes ht = tanh(W · xt + U · ht−1 + b))
let pt denote the composed representation from w1 to wt. The components required to
encode the compositional semantics of m include:
• Lexical Semantics of Words The meaning or representation of each constituent
word w is the building block for constructing a compositional semantic model;
these can be word embeddings pre-trained over large datasets, such as Wikipedia.
• Consecutive Order of Words The order of words in m reflects syntactic relations.
Without employing any syntax parser, capturing the sequential order of
words is the simplest way to encode syntactic relations.
• Memory Kuperberg [219] suggests that a memory unit in the human brain stores
computed compositional semantics. The memory unit should be accessible for
onward composition, i.e. it stores all the information at each composition, acting
as background knowledge of what happened before time t.
• Composition Function This computes the actual compositional semantics. Specifically,
at time t, the function composes pt−1 and wt, and outputs pt.
• Control The control acts as the combinatorial mechanism described by Kuperberg [219],
or the composition rules in the principle of compositionality, which
governs and modulates the composition.
6.2.2 Matrix-Vector Recurrent Unit Model
We choose the recurrent neural network architecture because it has demonstrated the
capability of encoding patterns and learning long-term dependencies of words [16, 181,
182]. However, existing recurrent networks such as Elman [180] or LSTM [28] do not
satisfy all the requirements. Elman networks rely on shared weights that are slowly
updated over timescales to encode the compositional rules, as shown in Figure 6.1. At
timestamp t, the network takes two inputs: the current input xt, and the hidden state
value ht−1, which is the composed value from timestamp t − 1. Let W and U be the weight
matrices of the network and b be the bias unit; the hidden state value ht is computed as:
ht = tanh(W · xt + U · ht−1 + b) (6.1)
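As an illustration, the Elman recurrence in Equation 6.1 can be sketched in a few lines of NumPy. This is a minimal sketch, not the trained implementation; the weight shapes (hidden size equal to the embedding size d) are assumptions for illustration.

```python
import numpy as np

def elman_compose(xs, W, U, b):
    """Compose a word sequence with Eq. (6.1): h_t = tanh(W x_t + U h_{t-1} + b)."""
    h = np.zeros(U.shape[0])           # initial hidden state h_0
    for x in xs:                       # one recurrence step per word
        h = np.tanh(W @ x + U @ h + b)
    return h                           # composed representation of the phrase
```

Note that only the final hidden state survives: each step overwrites the previous composed meaning, which is exactly the short-term-memory limitation discussed next.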
Each simple recurrent unit in the network features a long-term and a short-term memory.
The long-term memory is carried by the weights W and U, which are slowly
updated over the timescale and can be thought of as encoding the compositional rules.
At time t, the network stores the composed meaning of the inputs in the hidden state value
ht. By passing ht to the next timestamp t + 1, ht is updated to ht+1, so the composed
meaning from the previous timestamp is no longer stored – the so-called short-term
memory. Clearly, the composed meaning ht stored in the short-term memory is not
accessible for onward computation. In addition, there is no control or combinatorial
mechanism in simple recurrent networks.
The LSTM, on the other hand, features a long short-term memory for each composition,
which is accessible for onward computation. The input, output, and forget gates in the
network also act as a control or combinatorial mechanism that governs and modulates
the composition. However, the network uses shared weights to encode composition
rules for every word in the vocabulary of a corpus, which may overly generalise the
compositionality of each individual word. In addition, the input gate and the forget gate
appear to duplicate each other in controlling the information flow.
Based on the architecture of the recurrent neural network, we propose the Matrix-Vector
Recurrent Unit computation mechanism, shown in Figure 6.2. Specifically, we use the
recurrent structure to capture the consecutive order of words. The control function rt
computes the compositional rules and governs how much information is passed through or
thrown away at each combination. The composition function ct computes the combined
meaning of two words by concatenating two word embedding vectors, which efficiently
captures the order of words at each single composition. Finally, the actual composed
representation is computed by taking a part of the meaning from the current composition
ct, and the composed meaning from its precedent composition pt−1.
Figure 6.2: Matrix-Vector Recurrent Unit Model (each unit takes the pair (xt, Mt) and the composition (pt−1, Ut−1), and computes rt, ct, pt, and Ut)
We merge the input and forget gates of the LSTM into a single gate rt. In LSTM
networks, the input and forget gates control how much information can pass through
the current state, and how much is discarded from the previous state. Technically, the
gates are used to prevent the vanishing gradient problem in recurrent networks [28].
In learning phrase embeddings, the vanishing gradient problem may not arise, since
the majority of phrases consist of only a few words; using two gates to control the
information flow is therefore redundant. In addition, the input and forget gates use two sets
of shared weights in the network, which do not specifically capture the compositional
rules for each individual word. The proposed MVRU model uses a single gate rt to
control how much information is passed through from the previous state (or previously
composed representation). The computation uses two unshared weight matrices, where
Ut−1 is induced from the previous state t − 1, and Mt is the matrix for the current input
word.
The value of ct is the composed value of two vectors: the vector representation of the
current input word, and the composed vector from the previous state. This is similar to
the candidate value of the state memory in the LSTM network. However, ct in our model
is not directly passed to the next timestamp. Instead, we only pass a certain amount of
information from ct to the next timestamp, controlled by the update gate rt. The
composed vector representation pt is therefore computed by taking a part of the information from
time 0 to t − 1, which acts as the background knowledge, and a part of the information from
the current composition. Finally, the composed matrix representation is computed using
a shared weight matrix Wm.
Concretely, let x ∈ Rd be a word embedding vector with d dimensions, and M ∈ Rd×d be
the corresponding matrix, so each word is represented by a vector-matrix pair (x, M).
Let p ∈ Rd and U ∈ Rd×d be the composed vector and matrix, and Wm ∈ Rd×2d,
Wv ∈ R2d×d be the shared weights across the network. At timestamp t, the network
takes the vector-matrix pair (xt, Mt) of the current input and the learnt composition
(pt−1, Ut−1) from its precedents as inputs, and outputs the composed representation (pt, Ut),
computed as:
rt = σ(Ut−1 · xt + Mt · pt−1 + br)
ct = tanh([xt; pt−1] · Wv + bc)
pt = rt ⊙ ct + (1 − rt) ⊙ pt−1
Ut = tanh(Wm · [Ut−1; Mt] + bu)    (6.2)

where σ is the sigmoid function, [a; b] denotes the concatenation of two vectors or matrices,
⊙ denotes element-wise multiplication, and br, bc, and bu are bias units.
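A single MVRU step from Equation 6.2 can be sketched in NumPy as follows. This is a minimal illustration of the forward computation only (no training loop); the bias shapes are assumptions chosen to match the stated dimensions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mvru_step(x_t, M_t, p_prev, U_prev, W_v, W_m, b_r, b_c, b_u):
    """One Matrix-Vector Recurrent Unit step, following Eq. (6.2)."""
    r_t = sigmoid(U_prev @ x_t + M_t @ p_prev + b_r)           # gate
    c_t = np.tanh(np.concatenate([x_t, p_prev]) @ W_v + b_c)   # candidate composition
    p_t = r_t * c_t + (1.0 - r_t) * p_prev                     # composed vector
    U_t = np.tanh(W_m @ np.vstack([U_prev, M_t]) + b_u)        # composed matrix
    return p_t, U_t
```

Iterating this step over the words of a phrase, starting from an initial (p, U), yields the final composed phrase representation.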
6.2.3 Low-Rank Matrix Approximation
To optimise computational efficiency, we employ the low-rank matrix approximation
[19] to reduce the number of parameters in the word matrices. The low-rank
approximation computes a word matrix as the product of two low-rank matrices plus a
diagonal approximation:
M = UV + diag(m) (6.3)
where U ∈ Rd×l, V ∈ Rl×d, m ∈ Rd, and l is the low-rank factor. We set l = 3 for all
experiments.
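Constructing a word matrix from the low-rank parameterisation of Equation 6.3 can be sketched as follows; the random initialisation values are purely illustrative.

```python
import numpy as np

def low_rank_word_matrix(d=300, l=3, seed=0):
    """Construct M = U V + diag(m) (Eq. 6.3): a d*d matrix from 2*d*l + d parameters."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.01, size=(d, l))
    V = rng.normal(scale=0.01, size=(l, d))
    m = rng.normal(scale=0.01, size=d)
    return U @ V + np.diag(m)
```

With d = 300 and l = 3, each word matrix needs only 2 × 300 × 3 + 300 = 2,100 parameters instead of 90,000.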
6.3 Unsupervised Learning and Evaluations
Learning general representations for phrases aims to learn the rules of composing words,
so no actual phrase embeddings are induced and saved by the model. Once
training is finished, the model is expected to have encoded the compositional rules in its
parameters, so it can take any sequence of words (embedding vectors) as input and produce
the composed representation, regardless of whether the sequence was seen in
the training examples.
We train our model in an unsupervised fashion over a Wikipedia snapshot to induce
general phrase embeddings. The model predicts the context words of an input phrase.
Concretely, given a phrase m = [w1, w2, ..., wn] consisting of n words, let L and S be the
collections of all possible phrases and their context words in the training corpus, respectively.
The learning objective is to maximise the conditional probability distribution
over the list L by looking for parameters θ:
argmaxθ Σm∈L Σw∈S(m) log p(w|m; θ)    (6.4)
We choose the Negative Sampling technique introduced with the SkipGram model [15] for
fast training. Let C′ and C be the sets of all output vectors and embedding vectors,
respectively. The composed representation of m is o = f(C(m)). Given m and a word w,
the probability that they co-occur in the corpus is p = σ(C′(w)ᵀo), and the probability
of not co-occurring is 1 − p. The context words S(m) of m that appear in
the corpus are treated as positives, and a set of randomly sampled words S′(m) are
treated as negatives. The probability is computed as:
p(w|m; θ) = Πw∈S(m) σ(C′(w)ᵀo) Πw′∈S′(m) (1 − σ(C′(w′)ᵀo))
The objective function maximises the log probability:

P = Σw∈S(m) log σ(C′(w)ᵀo) + Σw′∈S′(m) log σ(−C′(w′)ᵀo)
Parameter θ is updated as:

θ := θ + ε ∂P/∂θ    (6.5)
where ε is the learning rate.
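The per-phrase negative-sampling objective above can be sketched as follows. Here o is the composed phrase vector, and pos_vecs / neg_vecs (illustrative names, not from the thesis) stack the output vectors C′(w) of the observed context words and of the sampled negatives.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_sampling_objective(o, pos_vecs, neg_vecs):
    """Log-probability P for one phrase under negative sampling."""
    pos = np.sum(np.log(sigmoid(pos_vecs @ o)))     # observed context words
    neg = np.sum(np.log(sigmoid(-(neg_vecs @ o))))  # randomly sampled negatives
    return pos + neg
```

Training maximises this quantity (equivalently, gradient ascent on P) with respect to the composition parameters and the output vectors.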
6.3.1 Evaluation Datasets
The evaluation focuses on unsupervised learning for general phrase embeddings. We
first train the model over a Wikipedia snapshot, then evaluate it on the phrase
similarity and Noun-Modifier Questions (NMQ) tests.
The phrase similarity test measures the semantic similarity for a pair of phrases against
human assigned scores. We use the evaluation dataset constructed by Mitchell and
Lapata [220], which contains 324 phrase pairs with human-assigned scores of pairwise
similarity. Each test example is a pair of bigram phrases, for instance (large number,
vast amount). The human-assigned similarity score ranges from 1 to 7, where 1 indicates
that two phrases are semantically unrelated and 7 indicates that they are semantically identical.
Each pair of phrases is scored by different participants. The test examples are grouped
into three classes: adjective-noun, verb-object, and noun-noun.
The NMQ test finds the semantically similar or equivalent unigram counterpart of a
phrase among candidates. We used the dataset described by Turney [177]. Each test
example gives a bigram phrase and a choice of 7 unigram words. For example, for the
phrase electronic mail, the candidates are email, electronic, mail, message,
abasement, bigamy, conjugate; the answer is the first unigram word, email. The
second candidate is the modifier of the given phrase, the third is the head noun, followed
by a synonym or hypernym of the modifier or head noun, and the last two are randomly selected
nouns. The original dataset has 2,180 samples, of which 680 are for training and 1,500
for testing. We use all the samples for testing, since the models for this evaluation are
trained in an unsupervised fashion.
6.3.2 Evaluation Approach
The semantic similarity of two phrases is measured by the cosine similarity of their
corresponding embedding vectors:

cos(u, v) = (u · v) / (‖u‖ ‖v‖)    (6.6)
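Equation 6.6 translates directly into NumPy:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of two embedding vectors (Eq. 6.6)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```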
In the phrase similarity test, we sum all the human-assigned scores for the same sample
and take the mean value to obtain the overall rating score as the ground truth. We
evaluate the model against the ground truth using Spearman's ρ correlation coefficient.
In the phrase composition test, we measure the cosine similarities between the given phrase
and all candidates, and select the candidate with the highest score.
6.3.3 Baseline
We first compare our results with distributional approaches. Mitchell and Lapata [220]
investigate different composition models, evaluated empirically on the phrase
similarity task. Turney [177] presents a dual-space model that is evaluated on both the
phrase similarity and NMQ tasks.
In addition to distributional approaches, we also use four baselines implemented using
distributed approaches. The first two models are simple linear algebraic compositions:
the vector summation p = a + b and the element-wise multiplication p = a ⊙ b. The
vector representations of words are obtained using the SkipGram model [15]. The
other two baselines are the two deep learning models, LSTM and CNN, described in
Sections 5.3.2.2 and 5.3.2.1.
6.3.4 Training
Pre-training Word Embeddings We use a Wikipedia snapshot as our unsupervised
training corpus. We first pre-trained word embeddings using the SkipGram model [15],
with the minimum word occurrence frequency set to 20, obtaining a vocabulary of
521,230 words. The vector summation and multiplication approaches take the trained
word embedding values as input, so they require no further training.
Training Deep Learning Models We then use the values of the trained embedding
vectors as inputs to the semantic composition models. For training the semantic composition
models, we use only the top 50,000 most frequent words to reduce the training time. The
reduced word list still covers 99% of the vocabulary in the evaluation datasets. We train three
models using the same Wikipedia snapshot: the MVRU, LSTM, and CNN models.
All models are trained on all possible n-grams of a sentence where 2 ≤ n ≤ 5. Both
the size of the context word window and the number of negative samples are set to 5. All
word and phrase embedding vectors have 300 dimensions. The CNN model has 3 different
region sizes {2, 3, 4}, each with 100 filters. There are no model-specific hyper-parameters
required for training the MVRU and LSTM models. The learning rate is 0.005.
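Enumerating the training n-grams (2 ≤ n ≤ 5) of a sentence can be sketched as:

```python
def ngrams(sentence, min_n=2, max_n=5):
    """All contiguous n-grams of a whitespace-tokenised sentence, min_n <= n <= max_n."""
    words = sentence.split()
    return [" ".join(words[i:i + n])
            for n in range(min_n, max_n + 1)
            for i in range(len(words) - n + 1)]
```

A five-word sentence, for instance, yields 4 + 3 + 2 + 1 = 10 training n-grams.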
6.3.5 Results and Discussion
Table 6.1 shows the evaluation results for the similarity test, where the scores indicate
Spearman's ρ correlation coefficient. The MVRU model performed significantly better
than the distributional models. In fact, models using distributed word representations
generally produce better results than distributional models.
Compared to the other distributed models, the MVRU produces much better results on
the adjective-noun and noun-noun groups. However, the simple vector summation yields
the best performance on the verb-object test. Noun phrases, including adjective-nouns
and noun-nouns, are used to describe objects and concepts, which naturally form
co-occurrence patterns with a small group of words sharing similar semantics, such
as assistant secretary, intelligence service, economic condition. Such co-occurrence
information can be easily captured by our model. On the other hand, verb
phrases describe actions that usually co-occur with their modifiers, such as subjects. This
phenomenon introduces extra difficulty for our model in capturing the co-occurrence
information. In contrast to the MVRU model, the vector summation approach
simply takes the sum of the word embedding vectors, whose values are pre-trained over
Wikipedia. Given a pair of phrases a = (a1, a2) and b = (b1, b2) consisting of 4 words, under
vector summation, phrases a and b will have a high cosine similarity score as long as a1 is
similar to b1 (or b2) and a2 is similar to b2 (or b1). The advantage is that the word embedding
vectors for verbs have already captured their lexical meanings as well as their semantic
similarities, such as require and need. Therefore, the vector summation approach can
easily compute the similarity between verb phrases. For example, the composed vectors
of attention require and treatment need have a high cosine similarity, whereas war
fight and number increase receive a low score.
The vector summation model performs significantly better than its cousin, the element-wise
multiplication model. Word embeddings trained using the SkipGram model [15]
naturally encode additive compositionality. Mikolov et al. [15] have demonstrated
the linear compositionality of word embeddings in word analogy tests, where
simple vector subtraction produces interesting results such as king - man + woman =
queen. This is also the reason that the two deep learning models, the LSTM and CNN, are
unable to outperform the simple vector summation of embeddings trained using the
SkipGram model.
Comparing the three deep learning models, the MVRU model is much more effective at
learning adjective-noun phrases than the other two models, with almost 14% better
performance. The majority of adjectives in the evaluation dataset are quite general words,
such as better, modern, special, local. These general adjectives can easily co-occur
thousands of times with different words in Wikipedia, which makes it difficult
to capture their general composition rules. Compared to the LSTM
and CNN models, our MVRU features many more parameters in the word matrices, and thus
has a clear advantage in encoding more information.
Table 6.2 shows the evaluation results for the NMQ tests. Similar to [177], we ran two tests:
the first applies no constraints, considering all 7 choices, while the second
applies constraints that dismiss the modifiers and head nouns. The random guess rate is
14.3% (1/7) for the first test and 20% for the second.
In comparison to supervised distributional models, the proposed MVRU was unable to
outperform the state of the art – the holistic model delivers more than 10 percent better
performance. However, the holistic model relies on training examples (pre-identified
Table 6.1: Phrase Similarity Test Results

Model                                AN    NN    VO
Distributional Rep.
p = a + b, semantic space [220]      36%   39%   30%
p = a + b, using LDA [220]           37%   45%   40%
p = a ⊙ b, semantic space [220]      46%   49%   37%
p = a ⊙ b, using LDA [220]           25%   45%   34%
p = Ba + Ab, semantic space [220]    44%   41%   34%
p = Ba + Ab, using LDA [220]         38%   46%   40%
Dual Space [177]                     48%   54%   43%
Distributed Rep.
p = a + b, word embeddings           64%   71%   55%
p = a ⊙ b, word embeddings           37%   46%   45%
LSTM                                 57%   72%   45%
CNN                                  57%   68%   47%
MVRU                                 70%   76%   49%

AN: Adjective-Noun, NN: Noun-Noun, VO: Verb-Object
Table 6.2: Phrase Composition Test Results

Model                                  NC      WC
Supervised Distributional Rep. [177]
p = a + b                              2.5%    50.1%
p = a ⊙ b                              8.2%    57.5%
Dual space                             13.7%   58.3%
Holistic                               49.6%   81.6%
Unsupervised Distributed Rep.
p = a + b, word embeddings             1%      62.4%
p = a ⊙ b, word embeddings             14.2%   36.5%
LSTM                                   2.6%    65.0%
CNN                                    3.4%    65.9%
MVRU                                   2.6%    70.3%

NC: no constraints, WC: with constraints
phrases) from the dataset, and is essentially a classifier. The MVRU, on the other hand,
does not use any pre-identified phrases; it is trained in a purely unsupervised fashion over
a Wikipedia snapshot and dynamically generates all phrase embeddings, aiming to
capture the compositionality.
When taking the modifiers and head nouns into account, all models delivered poor
results. The vector multiplication model produces the ‘best’ results at first sight. However,
we found that this result is biased. The multiplication model takes the element-wise
product of two word embedding vectors, whose values are typically between (-1,
1). After multiplication, the composed vector falls into a very different cluster from
the word embedding vectors, and is thus unable to identify any of the answers. The highest
cosine similarity scores obtained are typically below 0.1, meaning they are invalid. For the
summation and deep learning models, the composed vectors of the given bigram phrases
mostly have high cosine similarities with their modifiers and head nouns. Since these
models compose the meaning of a phrase by taking a part of the meaning from each
constituent word, the composed vector retains a high degree of cosine similarity
with its constituents. A similar phenomenon is also reported by Turney [177], who uses a
distributional approach to tackle the same problem.
When the modifiers and head nouns are not considered, the vector summation and deep
learning models improve dramatically. The proposed MVRU
model produces the best score of 70.3%, an improvement of 50 percentage points over the
random guessing rate, and outperforms the others by at least 4.5%. The vector summation
still produces a solid score, but cannot outperform any of the deep learning models. Our model
outperforms the simple linear summation by about 8%, indicating that it is able to
encode more complicated compositional rules than the simple vector summation model.
6.4 Supervised Learning and Evaluations
The MVRU model can also be trained in a supervised fashion for classification tasks.
To build the classifier, a compositional model is connected to a softmax or a logistic
regression layer as:
p(s) = f(W · o+ b) (6.7)
where o is the phrase representation, f is the output function (softmax for multiclass
classification or sigmoid for binary classification), W is the weight matrix of the
regression layer, and b is the bias unit. Given a training set D, we aim to maximise the log
probability of choosing the correct label for s ∈ D by looking for parameters θ:
probability of choosing the correct label for s ∈ D by looking for parameters θ:
argmaxθ Σs∈D log p(slabel|s; θ)    (6.8)
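The classification head of Equations 6.7 and 6.8 can be sketched as follows, assuming a softmax output over k classes; the composed phrase vector o would come from the lower MVRU layers.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()

def class_distribution(o, W, b):
    """p(s) = f(W o + b), Eq. (6.7), with f = softmax."""
    return softmax(W @ o + b)

def neg_log_likelihood(o, W, b, label):
    """Negative log probability of the correct label, cf. Eq. (6.8)."""
    return float(-np.log(class_distribution(o, W, b)[label]))
```

Training minimises the negative log-likelihood over the training set, which is equivalent to maximising the objective in Equation 6.8.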
6.4.1 Predicting Phrase Sentiment Distributions
This evaluation predicts the fine-grained sentiment distributions of adverb-adjective pairs,
which concerns how adverbs can change the meanings of adjectives, i.e. the semantics
of phrases containing adverbial intensifiers, attenuators or negative morphemes. For
instance, Socher et al. [19] show that not awesome is more positive than fairly awesome,
and not annoying has a similar distribution to not awesome in IMDB movie reviews.
We use the adverb-adjective pairs extracted from the IMDB movie reviews1, also
used by [19]. Reviewers score a movie from 1 to 10, indicating the most negative to the
most positive reviews. The dataset lists the frequency of each pair of words in the different
scoring categories. For example, the phrase terribly funny usually occurs in positive
reviews. We randomly split the dataset into a training set (3,719 examples), a validation
set (500 examples), and a test set (2,000 examples), keeping only the phrases
occurring more than 40 times. The learning objective is to maximise the probability
of the correct distribution for each pair of words by adding a softmax regression layer to minimise
the cross-entropy error. Similar to Socher et al. [19], we use the Kullback–Leibler (KL)
divergence to evaluate our model. We evaluated our model with different initialisations
of the parameters, and found that using word embeddings pre-trained over large corpora,
such as Wikipedia, leads to quicker convergence but makes a trivial difference to the KL
divergence score. We also evaluated our model with different sizes of word embedding
vectors. Unlike [19], we found that our model gains better performance using
larger word embeddings of up to 300 dimensions. A smaller learning rate (0.01) and
L2 regularisation also help prevent overfitting.
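The KL divergence used for this evaluation can be sketched as follows; the smoothing constant eps is an assumption added to avoid log(0) on empty scoring bins.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between a ground-truth distribution p and a predicted one q."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))
```

Lower scores mean the predicted distribution over the ten review scores is closer to the ground truth.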
Table 6.3: KL Divergence Score

Algorithm                                         KL-Score
p = 0.5(a + b), vector average                    0.103
p = a ⊙ b, element-wise vector multiplication     0.103
p = [a; b], vector concatenation                  0.101
p = g(W[a; b]), RNN [18]                          0.093
p = Ba + Ab, Linear MVR [220]                     0.092
p = g(W[Ba; Ab]), MV-RNN [19]                     0.091
MVRU                                              0.088

p – phrase vector, A, B – word matrices, a, b – word embeddings
Table 6.3 shows the average KL divergence scores; the proposed MVRU outperformed
all baseline models. Figure 6.3 plots some sample adverb-adjective pairs with their
ground-truth distributions and the distributions predicted by the MVRU model, where the
x-axis shows the review score from 1 to 10, indicating the most negative to the most
positive reviews, and the y-axis shows the probability. The adjectives are boring (negative) and
interesting (positive). The MVRU successfully predicted the sentiment distributions
of the words when paired with the negative morpheme not, the attenuators quite and fairly, and the
intensifier extremely. The negation inverts the meanings of boring and interesting,
e.g. not boring appears in more positive reviews. Paired with adverbial attenuators,
both boring and interesting become neutral. The intensifier strongly increases the
positive and negative meanings of the words, e.g. extremely interesting mostly appears
in positive reviews, with a sharp increase from score 7 to 10.
1publicly available at http://nasslli2012.christopherpotts.net/composition.html
Figure 6.3: Predicting Adverb-Adjective Pair Sentiment Distributions (predicted vs. ground-truth distributions over review scores 1–10 for not boring, not interesting, quite boring, fairly interesting, extremely boring, and extremely interesting)
6.4.2 Domain-Specific Term Identification
Domain-specific terminology identification automatically identifies domain-relevant technical
terms from a given corpus. Concretely, we train a binary classifier to identify
whether or not a candidate is relevant to the domain. We therefore use the negative log
likelihood function; the objective is to minimise the loss E:

E = −yi log σ(W · o + b) − (1 − yi) log(1 − σ(W · o + b))    (6.9)
where y = 1 for positive (domain relevant) and 0 for negative (irrelevant) terms.
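Under this objective, a single candidate term contributes the binary cross-entropy loss of Eq. 6.9. A minimal sketch, using random vectors in place of the composed phrase representation o (the weights, bias, and the 50-dimensional size are illustrative assumptions, not the thesis configuration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def term_loss(o, y, W, b):
    """Binary cross-entropy loss of Eq. 6.9 for one candidate term.

    o    : composed phrase vector (here random, standing in for the model output)
    y    : 1 if the term is domain relevant, 0 otherwise
    W, b : weights and bias of the classification layer
    """
    p = sigmoid(np.dot(W, o) + b)          # P(term is domain relevant)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
o, W = rng.standard_normal(50), rng.standard_normal(50)
loss_pos = term_loss(o, 1, W, 0.0)   # loss if the term is labelled relevant
loss_neg = term_loss(o, 0, W, 0.0)   # loss if it is labelled irrelevant
```

Note that exp(−loss_pos) and exp(−loss_neg) recover p and 1 − p, so the two losses trade off: a confident prediction makes one small and the other large.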
We evaluate our model on two datasets. The first dataset is the GENIA corpus2, which
is a collection of 1,999 abstracts of articles in the field of molecular biology. We use the
2 Publicly available at http://www.geniaproject.org/
current Version 3.02 for our evaluation. The second dataset is the ACL RD-TEC3 corpus,
which consists of 10,922 articles published between 1965 and 2006 in the domain of
computer science. The ACL RD-TEC corpus classifies terms into three categories: invalid
terms, general terms, and computational terms. We treat only the computational terms as
ground truth in our evaluation.
The ACL RD-TEC corpus provides a pre-identified candidate list. We therefore only
need to identify candidate terms from the GENIA corpus. We use a predefined POS
pattern <JJ>*<NN.*>+ to chunk candidates, that is, zero or more adjectives followed by
one or more nouns. However, such a simple pattern is not able to identify all terms in
the ground-truth list. For example, the phrase chunker spots the term novel antitumor
antibiotic, whereas in the ground-truth set the correct term is antitumor antibiotic.
Since the purpose of this experiment is to evaluate the capability of learning term com-
positionality, we only take the identified ground-truth terms into account. Table 6.4 shows
the statistics.
Table 6.4: Candidate Terms Statistics

Dataset   Candidate   Positive          Negative
GENIA     40,998      20,704 (50.5%)    20,294 (49.5%)
ACL       83,845      13,832 (16.5%)    70,013 (83.5%)
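The <JJ>*<NN.*>+ pattern can be chunked with any off-the-shelf tool (e.g. NLTK's RegexpParser); the plain-Python stand-in below implements the same greedy rule over pre-tagged tokens, where the sample sentence and its tags are illustrative:

```python
def chunk_candidates(tagged):
    """Greedy chunker for the POS pattern <JJ>*<NN.*>+ :
    zero or more adjectives followed by one or more nouns."""
    candidates, words, seen_noun = [], [], False

    def close():
        nonlocal words, seen_noun
        if seen_noun:                      # a chunk needs at least one noun
            candidates.append(" ".join(words))
        words, seen_noun = [], False

    for word, tag in tagged:
        if tag.startswith("NN"):           # nouns (NN, NNS, NNP, ...) extend the chunk
            words.append(word)
            seen_noun = True
        elif tag == "JJ":
            if seen_noun:                  # an adjective after a noun starts a new chunk
                close()
            words.append(word)
        else:                              # any other tag closes the current chunk
            close()
    close()
    return candidates

tagged = [("a", "DT"), ("novel", "JJ"), ("antitumor", "JJ"),
          ("antibiotic", "NN"), ("was", "VBD"), ("isolated", "VBN")]
candidates = chunk_candidates(tagged)      # ["novel antitumor antibiotic"]
```

As the text notes, the pattern over-extends to novel antitumor antibiotic; the adjective cannot be dropped without extra knowledge of the ground-truth list.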
We randomly select 40% of the data for training, 20% for validation, and 40% for
evaluation from the candidate set listed in Table 6.4. Word embeddings are pre-trained
over each evaluation corpus.
In this evaluation, we add C-value [56] as an extra baseline algorithm. C-value is a
popular unsupervised statistical ranking algorithm for term identification, in which each
candidate term is assigned a score indicating its degree of domain relevance. It requires
extracting a number of top-ranked candidates as the output. Table 6.5 shows the
results.
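As a rough sketch of how C-value ranks candidates (following Frantzi et al.'s published formulation, which the thesis does not restate, so treat the details as an assumption): a term nested inside longer candidates is penalised by the mean frequency of those longer candidates. The toy frequencies are invented:

```python
import math

def c_value(freqs):
    """C-value sketch: freqs maps a multi-word candidate term -> corpus frequency.

    Non-nested term a:  C-value(a) = log2|a| * f(a)
    Nested term a:      C-value(a) = log2|a| * (f(a) - mean frequency of the
                                     longer candidates containing a)
    where |a| is the term length in words (|a| >= 2 assumed here).
    """
    scores = {}
    for a, f_a in freqs.items():
        # frequencies of longer candidates that contain a (word-boundary check)
        nested_in = [f for b, f in freqs.items() if b != a and f" {a} " in f" {b} "]
        base = math.log2(len(a.split()))
        if nested_in:
            scores[a] = base * (f_a - sum(nested_in) / len(nested_in))
        else:
            scores[a] = base * f_a
    return scores

freqs = {"adrenal cortex": 10, "adrenal cortex hormone": 4, "basal cell carcinoma": 5}
scores = c_value(freqs)
# "adrenal cortex" is nested in "adrenal cortex hormone": log2(2) * (10 - 4) = 6.0
```

The top-k cutoffs in Table 6.5 (e.g. Top 5000) then simply keep the k highest-scoring candidates.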
The proposed MVRU model consistently outperforms the others on both datasets. On the
GENIA dataset, the minimum improvement in F-score is 2.3%, whereas on the ACL RD-
TEC dataset it is 8.1%. Although the accuracy measure is commonly used in classification
tasks, it only reflects the true performance of a model when the classes are evenly distributed
in an evaluation set. For example, on the ACL RD-TEC dataset, the positive examples
are about 16.1%, whereas the negative ones are about 83.9%. A model that incorrectly
classifies all the examples as negative would therefore still achieve an accuracy score
of 83.9%. Hence, the accuracy measure is only meaningful on the GENIA dataset, where
positive and negative examples are evenly distributed.
3 Publicly available at https://github.com/languagerecipes/the-acl-rd-tec
Table 6.5: Evaluation Results

GENIA
                  Prec.   Recl.   F-score   Acc.
Random Guess      51.0%   50.0%   50.5%     50.0%
CV (Top 5000)     60.4%   29.7%   39.8%     –
CV (Top 10000)    53.9%   53.0%   53.4%     –
CV (Top 15000)    50.6%   73.4%   59.9%     –
p = a + b         63.6%   79.8%   70.8%     66.5%
p = a ⊙ b         69.1%   50.8%   58.6%     63.3%
LSTM              67.0%   74.5%   70.5%     68.3%
CNN               63.0%   80.7%   70.7%     66.0%
MVRU              65.1%   83.3%   73.1%     68.8%

ACL RD-TEC
                  Prec.   Recl.   F-score   Acc.
Random Guess      16.1%   50.0%   24.4%     50.0%
CV (Top 4000)     11.0%   5.0%    6.9%      –
CV (Top 8000)     15.7%   14.3%   15.0%     –
CV (Top 12000)    19.5%   26.6%   22.5%     –
p = a + b         77.0%   66.1%   71.1%     90.5%
p = a ⊙ b         79.0%   60.3%   68.4%     90.2%
LSTM              72.5%   55.0%   62.5%     85.5%
CNN               70.8%   67.7%   69.2%     85.2%
MVRU              77.9%   80.7%   79.2%     92.5%
It is also easy to notice that the GENIA dataset has a much higher proportion of
positive examples than the ACL RD-TEC dataset. However, most
models produce similar or even better F-scores on the ACL RD-TEC dataset. There
are two possible reasons. Firstly, in the ACL RD-TEC corpus, the negative terms
contain a large number of invalid characters (e.g. ~), tokeniser mistakes (e.g.
evenunseeneventsare), and non-content-bearing words (e.g. many). The classifiers
can easily spot such noisy data. Secondly, the ACL RD-TEC corpus is much larger
than GENIA, which enables the SkipGram algorithm to induce more
precise word embedding vectors as inputs to the classification layer.
Surprisingly, the simple vector summation model produces better results than the LSTM
and CNN deep learning models on both evaluation datasets. In fact, vector
summation produces quite impressive results across all the evaluations, demonstrating that it
can serve as an alternative model for learning compositional semantics given its simplicity of
implementation. Although our proposed model performs better than vector summation,
it may be more difficult to train due to its large number of
parameters. Having said that, the matrices obtained can potentially be used in phrase
resynthesis or sentence generation.
6.5 Conclusion
In this chapter, we have introduced a matrix-vector recurrent unit model built upon
a recurrent network structure for learning compositional semantics. Each word is rep-
resented by a matrix-vector pair, where the vector encodes the meaning of the
word and the matrix captures its composition rules. We evaluated our model on phrase
similarity, NMQ, and domain-specific term identification tasks, and demonstrated that the
proposed model outperforms the LSTM and CNN deep learning models, as well as the simple
vector summation and multiplication compositions.

This chapter provides a solid foundation for learning the compositional semantics of phrases.
In the next chapter, we will present a deep learning model for documents that incorporates
the MVRU model to extract keyphrases.
Chapter 7
A Deep Neural Network
Architecture for AKE
In Chapter 6, we introduced the Matrix-Vector Recurrent Unit (MVRU) model,
and demonstrated its effectiveness in learning the semantic meanings of phrases. However,
identifying keyphrases requires understanding not only the meaning of each phrase ap-
pearing in a document, but also the overall meaning of the document. Hence, in this
chapter, we first introduce a deep learning model that automatically encodes the mean-
ing of a document. It represents a document as a cube, as shown in Figure 7.2, where
the height is the number of sentences, the width is the number of words in a sentence,
and the depth is the dimension of the word embedding vectors. The cube representation of
a document is input to a convolutional neural network, which analyses the intrinsic structure
of the document and the latent relations among words and sentences. Hence, the model
is named Convolutional Document Cube (CDC).
In the second part of the chapter, we propose a novel AKE deep learning model that extracts
keyphrases by learning the meanings of both phrases and documents. The model consists
of two deep neural networks: MVRU and CDC. The MVRU model is responsible for
learning the meanings of phrases, and CDC encodes the meanings of documents. The two
networks are jointly trained by adding a logistic regression layer as the output layer to
identify keyphrases. We evaluate the model on three different datasets, Hulth, DUC, and
SemEval, and demonstrate that the model delivers performance identical to the state-of-
the-art algorithm on the Hulth dataset, and outperforms the state-of-the-art algorithm
on the DUC dataset.
7.1 Introduction
Manually identifying keyphrases is a complex cognitive process, which requires human
annotators to understand the meanings of both the phrases and the
document in order to select the most representative phrases that express the core
content of the document. On the other hand, existing AKE approaches, both
supervised and unsupervised, are unable to, or do not attempt to, under-
stand the meanings of phrases and documents.
Most AKE algorithms identify keyphrases based only on the representations of phrases.
Phrases are represented by manually selected features, each describing a partic-
ular characteristic of phrases, such as frequency or co-occurrence statistics. Neverthe-
less, selecting features is a biased and empirical task. These features usually capture
little or no linguistic information about how phrases are formed, making such algorithms inca-
pable of capturing the meanings of phrases.
The representations of documents in existing AKE approaches are simple re-arrangements
of phrase feature vectors. For example, Term Frequency - Inverse Document Frequency
(TF-IDF) [2] represents documents in a term-document matrix, where each row vector
corresponds to a unique phrase in the corpus, each column represents a document, and
each cell value corresponds to the frequency of the phrase in a particular document.
However, TF-IDF identifies the importance of a phrase in a document using only three
statistics: the phrase frequency, the number of documents containing the phrase, and
the total number of documents in the corpus. Graph-based unsupervised approaches,
such as TextRank [3], represent a document as a graph, where each vertex corresponds
to a phrase and an edge connecting two vertices represents their co-occurrence relation.
However, the ranking process only makes use of the phrase co-occurrence information.
In supervised machine learning approaches [1, 4, 5, 83], a document is just a collection
of its phrase feature vectors, which does not carry any semantic meaning of the docu-
ment. Inevitably, the lack of cognitive ability in existing AKE approaches inhibits their
performance in extracting keyphrases.
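The three statistics behind TF-IDF can be sketched in a few lines; the toy corpus is invented, and the natural-log IDF used here is one common choice among several variants:

```python
import math
from collections import Counter

def tfidf(docs):
    """Per-document TF-IDF scores from tokenised documents.

    Each score combines only the three statistics mentioned above:
    term frequency, document frequency, and corpus size.
    """
    n = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(t for doc in docs for t in set(doc))
    return [{t: f * math.log(n / df[t]) for t, f in Counter(doc).items()}
            for doc in docs]

docs = [["keyphrase", "extraction", "keyphrase"],
        ["document", "extraction"],
        ["document", "modelling"]]
scores = tfidf(docs)
# "keyphrase" occurs twice in doc 0 and in no other document: 2 * ln(3/1)
```

Terms appearing in every document get an IDF of ln(1) = 0, which is exactly why TF-IDF alone, with no notion of meaning, discards common but potentially meaningful phrases.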
In this chapter, we propose a deep learning architecture to tackle the AKE problem,
which not only eliminates the effort of feature selection, but also mimics the natural
cognitive process of manual keyphrase identification by human annotators – attempting
to encode the meanings of phrases and documents. To achieve this, a deep learning
model needs to learn representations that encode the semantic meanings of both phrases
and documents. Thus, the proposed network consists of a
phrase model and a document model. Together, they are capable of learning distributed
representations of phrases and documents.
Learning the meaning of a phrase means encoding the rule for composing the meanings of
its constituent words [25]. We use the Matrix-Vector Recurrent Unit (MVRU) model
introduced in Chapter 6 to learn the semantic compositionality of phrases. The
recurrent architecture allows the network to capture the order of words by sequentially
composing the current word with its precedents. However, in contrast to the constituent
words of a phrase, which always have strong dependencies, words in a document only
have strong dependencies on nearby words, and the strength of a dependency decreases
as the distance between words increases. For example, the first few words of a document
usually have no correlation with the last ones. Therefore, capturing such
long-term dependencies of words in a document is unnecessary and non-intuitive. Hence,
we propose a novel model to learn the meaning of a document, namely the Convolutional
Document Cube model. We first represent a document as a cube, where the height repre-
sents the number of sentences, the width represents the number of words in a sentence,
and the depth is the size of the word embedding vectors. We then ‘slice’ the cube along its
depth (word embedding) dimension, generating 2-dimensional data matrices (chan-
nels), which are the inputs to a convolutional neural network (CNN). The CNN
has the strength to encode regional compositions at different locations in the data matrices.
The network features different region sizes, where each region has a number of filters
that produce the feature maps. By convolving the data matrices with different region
sizes, the CNN captures how words are composed in each region. Since we use relatively
small region sizes, the network only analyses words and their close neighbours, aiming
to encode only short-term dependencies. For example, given the two sentences the dog is
walking in a bedroom and the cat is running in a kitchen, using a region of
size 2×3, we expect the network to capture the important information by scanning
a pair of trigrams in the same position of the two sentences, such as ([the dog is], [the
cat is]), ([dog is walking], [cat is running]), ([in a bedroom], [in a kitchen]).
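The trigram pairs above correspond to the sub-regions a 2×3 receptive field visits when sliding over a two-sentence grid; a small sketch of that bookkeeping (no learned filters involved):

```python
def region_scan(sentences, height=2, width=3):
    """Enumerate the word regions a height x width receptive field would see
    when sliding over a document laid out as a sentence-by-word grid."""
    regions = []
    for i in range(len(sentences) - height + 1):        # vertical positions
        for j in range(len(sentences[0]) - width + 1):  # horizontal positions
            regions.append([tuple(sentences[i + r][j:j + width])
                            for r in range(height)])
    return regions

s1 = "the dog is walking in a bedroom".split()
s2 = "the cat is running in a kitchen".split()
pairs = region_scan([s1, s2])
# the first region pairs the trigrams ('the', 'dog', 'is') and ('the', 'cat', 'is')
```

In the actual model, each such region is multiplied by a learned filter rather than collected as words, but the sliding pattern is the same.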
Features are automatically learnt using distributed representations, so no manually
selected features are required. The output layer is a logistic regression layer, which clas-
sifies whether a phrase is a keyphrase for the input document based on the meanings
learnt from the phrase and document models. We evaluate the model on the same
datasets described in Chapter 3 (the Hulth [5], DUC [52], and SemEval [51] datasets),
and demonstrate that the proposed approach produces performance identical to the state
of the art on the Hulth dataset and delivers a new state-of-the-art performance on the DUC
dataset, without employing any dataset-dependent heuristics.
7.2 Deep Learning for Document Modelling
Recent studies have shown great progress in learning vector representations of words
and multi-word expressions. For instance, Mikolov et al. [15] present the word2vec model,
discovering the existence of linear relations in lexical semantics, and Cho et al. [97]
demonstrate the capability of learning compositional semantics and modelling language
using the Gated Recurrent Unit (GRU) network.
However, learning the representation and encoding the meaning of a document remains
a great challenge for the NLP community [190]. This is because the content of a doc-
ument is typically the ad hoc result of a creative process that goes far beyond
the level of lexical and compositional semantics. Documents are manuscripts repre-
senting the thoughts of their authors, which naturally form logical flows embedded in the
inherent structures of the content. Therefore, modelling documents requires not only
understanding lexical and compositional semantics, but also analysing the inher-
ent structures and the relations between sentences, such as temporal and causal relations.
Existing studies on modelling documents commonly employ a sentence-to-document ar-
chitecture, as shown in Figure 7.1. The bottom layer is the sentence model, which
takes the constituent word embeddings of each sentence as inputs and composes them into
sentence embeddings. The upper layer is the document model, which takes the sentence em-
beddings obtained from the previous layer as inputs, analyses the intrinsic structures
and relations between sentences, and then composes them into the document representation.
The sentence-to-document architecture can be implemented using the same type of neu-
ral network at both levels. For example, Misha et al. [191] implement both the sentence and
document layers using two CNN networks. At the sentence level, they use a CNN to learn
the representation of an input sentence using one-dimensional vector-wise convolution.
At the document level, another CNN takes the learnt representations of all
sentences as inputs and outputs the document representation. The network hierarchi-
cally learns to capture and compose low-level lexical features into high-level semantic
concepts. Lin et al. [221] propose a hierarchical recurrent architecture, which consists
of two independent recurrent neural networks. The lower-level model learns lexical se-
mantics by predicting the next word within a sentence given the current one. The upper
level encodes the sentence history from the precedents to predict the words in the next
sentence, which captures higher-level semantic concepts. Ganesh et al. [192] use two
probabilistic language models to jointly learn the representations of documents. The
model first learns sentence vectors via a feed-forward network by training a word-level
language model, then learns document representations using the sentence vectors.
[Figure: the word embeddings w1 … wn of each sentence s1 … sm feed a sentence model that produces sentence representations, which in turn feed a document model that produces the document representation.]
Figure 7.1: General Network Architecture for Learning Document Representations
Table 7.1: Common Deep Learning Document Models

Sent Model \ Doc Model    Feed-forward           Recurrent & Variants                    Recursive    Convolutional
Feed-forward              Ganesh et al. [192]    –                                       N/A          –
Recurrent & Variants      –                      Lin et al. [221], Tang et al. [190],    N/A          Lai et al. [225]
                                                 Zhang et al. [222], Yang et al. [223],
                                                 Li et al. [224]
Recursive                 –                      Liu et al. [226]                        N/A          –
Convolutional             –                      Kalchbrenner and Blunsom [189],         N/A          Misha et al. [191]
                                                 Tang et al. [190]
Alternatively, hybrid network architectures have also been proposed for modelling documents.
Kalchbrenner and Blunsom [189] use a convolutional-recurrent network to learn the com-
positionality of discourses. The model consists of a CNN and a recurrent network: the
CNN is responsible for learning the representations of sentences, and the recurrent
network takes the outputs of the convolutional network to induce the document rep-
resentation. Similarly, Tang et al. [190] propose to use either an LSTM or a CNN to learn
the meaning of a sentence at the lower level, then use a GRU network [97] to adaptively
encode the semantics of sentences and their relations into document representations. Liu et
al. [226] propose a recursive-recurrent network architecture, where the recursive network
learns the representations of sentences via pre-generated syntactic tree structures, and
the recurrent network learns sequential combinations of sentences to induce document
representations. Lai et al. [225] use a recurrent-convolutional network that has the op-
posite structure to Kalchbrenner and Blunsom's convolutional-recurrent network [189]:
the recurrent structure captures sentence information, and the CNN learns
the key concepts of a document.
Table 7.1 lists some recent work in document modelling. Recurrent networks and their variants,
such as LSTM and GRU, are the most popular choices for modelling both sentences and
documents. In comparison to other types of neural networks, the recurrent architecture
takes inputs of various lengths, which naturally makes it a better choice for modelling
sentences and documents. In addition, it repeatedly combines the current input with its
precedents, capturing long-term dependencies of words and sentences in a document. The
CNN network usually learns representations by stacking the input embedding vectors
into a matrix1, and then analysing the regional compositions of the matrix. However,
a common concern is that, unlike in image processing, location invariance does not
hold for embedding vectors, which makes the CNN network less popular than the
recurrent network. The recursive neural network requires a parser to produce a syntax
tree as a prior, so it is best at learning sentence-level semantics, but less useful for
encoding semantics at the document level.
It is also worth mentioning probabilistic topic models for document modelling [227–
229]. Among them, Latent Dirichlet Allocation (LDA) [76] is a popular generative
statistical model that allows sets of observations to be explained by unobserved ones.
It generates topics based on word frequencies from a set of documents, hypothesising
that a document is a mixture of a small number of topics and that each word's creation
is attributable to one of the document's topics. Deep neural networks for learning
document embeddings from word contexts, on the other hand, aim to learn useful
representations from lexical semantics (word embeddings) and compositional semantics
(phrase and sentence embeddings) up to the meanings of documents, using distributed
representations.
7.2.1 Convolutional Document Cube Model
In this section, we introduce the Convolutional Document Cube (CDC) model to learn
document representations. We treat a document as a cube, i.e. a third-order tensor.
The height of the cube is the number of sentences in the document, the width is the
number of words, and the depth is the size of the word embeddings. Instead of modelling
from sentences towards documents, our model learns the representation of a document directly
1 Mathematically, this process is conducted by concatenating the input vectors and then performing 1-dimensional convolution with a predefined window.
by ‘slicing’ the depth (word embedding) dimension into a set of 2-dimensional matrices;
each matrix becomes a ‘channel’ that is input to a CNN. This is similar to
image processing with CNNs, where a typical RGB image has three input channels (Red,
Green, Blue). In the CDC model, each channel of a document can be thought of as
a 2-dimensional snapshot of the whole document, where each word is represented by
one feature of its embedding vector. In comparison to existing work, the proposed model has
three advantages:
1. Recurrent structured networks, such as LSTM, tend to capture long-term semantic
dependencies or correlations between words and sentences through the entire docu-
ment, i.e. from its beginning to its end. However, it may not be
necessary to capture such long-term dependencies or correlations, given that the
closer words or sentences have the strongest semantic correlations. In addition,
while the semantics of the beginning (e.g. abstract or introduction)
and the end (e.g. conclusion) may be correlated, sentences appearing in the beginning can have very
loose or even no dependencies on those in the middle of the document. Hence,
capturing all dependencies between sentences is unnecessary and could also introduce
noise. Unlike recurrent structured networks, our model concerns only regional de-
pendencies among words and sentences. Such regional dependencies are captured
by using small regions in the CNN network.
2. Using a single convolutional network is more computationally efficient, as it re-
quires fewer parameters to be learnt. The CDC model learns the representation
of a document directly, without the need to learn sentence representations
first.
3. Specifically for AKE, the proposed CDC model takes word positions into account.
In supervised machine learning for AKE, many approaches use hand-crafted posi-
tional features indicating the relative positions of candidates appearing in a doc-
ument, assuming that important phrases always occur at the beginning or the end of a
document. The proposed model does not require any hand-crafted features. Instead,
by convolving over snapshots of the document with filters of different sizes, the
network automatically captures the position information.
Figure 7.2 shows an overview of the proposed model. Let D ∈ Rh×w×d be a document,
where h denotes the number of sentences in the document (the height), w denotes the
number of words in a sentence (the width), and d is the dimension of the word embedding
vectors (the depth). We fix the size of D by padding with zeros, such that sentences of variable
length are padded to w and documents with different numbers of sentences are padded to
h. We use a convolutional neural network to learn the distributed representation D′ of D.

[Figure 7.2: Convolutional Document Cube Model. A toy six-sentence document is embedded into an h × w × d cube, with zero vectors padding short sentences and documents; the cube is sliced along the embedding dimension d into channels, convolved with n filters for each of r regions (e.g. 2×2, 3×3, 1×3) to produce n×r feature maps, which pass through a max pooling layer and a fully connected layer to yield the document representation.]

Instead of performing 1-D convolution over word embedding vectors, we slice the
cube D into channels (input feature maps). The number of channels is the word embedding
size d. Each channel is a matrix M of size h×w, taking only one feature from
each word embedding vector. The convolutional layer parameters consist of r predefined
regions (receptive fields) of various sizes. For example, in Figure 7.2 there are three
regions. Each region has n linear filters (weights) that produce n output feature maps
by repeatedly convolving across sub-regions of the channels. A filter has the size
h′ × w′ × d, where h′ and w′ denote the height and width of the region respectively, and
d is the word embedding size (the depth of D). For example, a region may have a size of
5× 5 (height and width); the parameters for that region then total 5× 5× d× n. An
output feature map ck from the k-th filter is computed by convolving the input channels
with a linear filter and a bias unit, and then applying a non-linear function, such as the
hyperbolic tangent tanh, as:
ck = tanh( ∑_{i=0}^{h′} ∑_{j=0}^{w′} Wk · D[i : i+ w − 1, j : j + h− 1, 0 : d] + bk ) (7.1)
where D[i : i+ w − 1, j : j + h− 1, 0 : d] is a sub-tensor of D spanning width i : i+ w − 1,
height j : j + h− 1, and depth 0 : d, and Wk denotes the weights of the k-th filter. The total number of
feature maps in the network is m = r×n, so the output from the pooling layer O ∈ Rm
is computed as:
Ok = g(ck) (7.2)
where g is the pooling function. The output from the convolutional layer is then sent to
a fully connected layer to produce the final representation D′, as:
D′ = tanh(W ·O + b) (7.3)
where W and b are the weight matrix and bias unit, respectively.
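A simplified forward pass through Eqs. 7.1–7.3 might look as follows. Filter weights are randomly initialised for illustration, each tanh feature map is max-pooled to a single value, and the region sizes mirror those in Figure 7.2; this is a sketch of one reading of the equations, not the thesis implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def cdc_forward(D, regions, n_filters, out_dim):
    """Simplified CDC forward pass.

    D       : (h, w, d) document cube
    regions : list of (h', w') receptive-field sizes
    Each filter of size h' x w' x d convolves over the cube (Eq. 7.1), each
    tanh feature map is max-pooled to one value (Eq. 7.2), and a fully
    connected tanh layer yields the representation D' (Eq. 7.3).
    """
    h, w, d = D.shape
    pooled = []
    for (hp, wp) in regions:
        W_k = rng.standard_normal((n_filters, hp, wp, d)) * 0.1  # illustrative init
        b_k = np.zeros(n_filters)
        for k in range(n_filters):
            # convolve filter k over every valid position of the cube
            fmap = np.array([[np.sum(W_k[k] * D[i:i+hp, j:j+wp, :]) + b_k[k]
                              for j in range(w - wp + 1)]
                             for i in range(h - hp + 1)])
            pooled.append(np.tanh(fmap).max())     # max pooling over the map
    O = np.asarray(pooled)                          # m = r * n pooled values
    W = rng.standard_normal((out_dim, O.size)) * 0.1
    b = np.zeros(out_dim)
    return np.tanh(W @ O + b)                       # final representation D'

D = rng.standard_normal((6, 8, 5))                  # 6 sentences, 8 words, d = 5
D_prime = cdc_forward(D, regions=[(2, 2), (3, 3), (1, 3)], n_filters=4, out_dim=10)
```

With r = 3 regions and n = 4 filters each, the pooling layer emits m = 12 values, which the fully connected layer maps to a 10-dimensional document vector.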
7.3 Proposed Deep Learning Architecture for AKE
In this section, we introduce a general network architecture for AKE, which consists
of two deep learning models aiming to learn the meanings of phrases the documents
separately by updating their own word embedding parameters, as shown in Figure 7.3.
Two models are jointly trained in a supervised learning fashion by connecting to a logistic
regression layer.
Concretely, let x be a phrase and x′ the distributed vector representation of the
phrase obtained from the phrase model. Let D denote a document and D′ the
vector representation of the document obtained from the document model. Let d be
the dimension of the distributed vectors for words, phrases, and documents; then we have
x′ ∈ Rd and D′ ∈ Rd. The probability that x is a keyphrase for the document D is
computed as:

p(x|D) = σ(x′TD′) (7.4)

where σ is the sigmoid function. The probability that x is not a keyphrase is 1− p.
Let y be the ground truth. Applying the negative log likelihood function, the learning
objective is to minimise the loss L by searching for the parameters θ = (X′, C′,Wp,Wd), where
X′ is the collection of word embeddings for the phrase model, C′ is the collection of
[Figure: an input phrase and an input document pass through their own word embedding lookup tables into the phrase model and the document model; the resulting phrase embedding and document embedding feed a logistic regression layer computing p(x|D).]
Figure 7.3: Overview of Proposed Network Architecture
word embeddings for the document model, and Wp, Wd are collections of weights for
the phrase and document model, respectively. The loss L is computed as:
L = −(yi log σ(x′TD′) + (1− yi) log(1− σ(x′TD′))) (7.5)
and θ is updated as:
θ := θ − ε ∂L/∂θ (7.6)
where ε is the learning rate.
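One step of this objective can be sketched by updating only the phrase and document embeddings under Eqs. 7.4–7.6, holding the rest of θ fixed; this is a simplification, and the dimensions and learning rate are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x_emb, D_emb, y, lr=0.1):
    """One gradient step on Eq. 7.5, treating the phrase embedding x' and
    document embedding D' as the only trainable parameters.

    p = sigma(x'^T D')  (Eq. 7.4); dL/dx' = (p - y) D', dL/dD' = (p - y) x'.
    """
    p = sigmoid(x_emb @ D_emb)
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))   # Eq. 7.5
    grad = p - y                                        # dL/d(x'^T D')
    x_new = x_emb - lr * grad * D_emb                   # Eq. 7.6 applied to x'
    D_new = D_emb - lr * grad * x_emb                   # and to D'
    return loss, x_new, D_new

rng = np.random.default_rng(1)
x_emb, D_emb = rng.standard_normal(16), rng.standard_normal(16)
loss0, x_emb2, D_emb2 = train_step(x_emb, D_emb, y=1)
loss1, _, _ = train_step(x_emb2, D_emb2, y=1)
# for a positive example, the loss decreases after the update
```

In the full model the same gradient also flows back through the MVRU and CDC networks into X′, C′, Wp, and Wd.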
The phrase model employed in this chapter is the Matrix-Vector Recurrent Unit (MVRU)
model introduced in Chapter 6.
7.4 Evaluations
7.4.1 Baseline
The performance of the proposed AKE deep learning model is compared against the
state-of-the-art AKE algorithms described in Section 7.4.2. Two baseline models are
constructed to evaluate the effectiveness of the CDC model: we replace the document
model of the proposed AKE deep learning architecture with the two sentence-to-document
models proposed by Tang et al. [190], while keeping the same phrase model. Tang et
al. [190] propose CNN-GRU and LSTM-GRU, in which the sentence-to-document mod-
els first learn sentence representations in the lower layer of the network, then use
Figure 7.4: Baseline Document Model – LSTM-GRU Architecture
Figure 7.5: Baseline Document Model – CNN-GRU Architecture
the learnt sentence representations as inputs to the upper layer to learn the document
representations. In the CNN-GRU and LSTM-GRU models, the sentence models are
implemented using either a CNN or an LSTM network, and the document model is a GRU
network. Figures 7.4 and 7.5 show overviews of the LSTM-GRU and CNN-GRU models,
respectively.
Sentence Models The CNN and LSTM sentence models are the same as the compo-
sitional semantic models described in Chapter 5.
GRU Document Model The GRU neural network introduced by Cho et al. [97]
is a variation of the LSTM network [28]. In comparison to the LSTM network, the
GRU architecture is less computationally expensive, since it uses fewer parameters. It has
also been shown that GRU networks are good at modelling the long-term dependencies of
languages [190, 231, 232].
The LSTM network features three gates: an input gate, a forget gate, and an output
gate. Similar to the LSTM network, the GRU network also features gating units that
control the flow of information inside each recurrent unit. Figure 7.6 shows an overview
of the GRU network. The GRU network has a few noticeable differences from LSTM.
Firstly, the GRU network merges the input and forget gates of the LSTM network
into a single update gate. In the LSTM network, the input gate governs how much
information is passed through into the current state, and the forget gate controls
how much information is thrown away from the previous state. The GRU network
simply uses one update gate z to control how much information from the previous hidden
state at timestep t− 1 is passed into the current state at timestep t, as:
zt = σ(Wz · st + Uz · ht−1 + bz) (7.7)
where Wz, st, and bz are the weights, the current sentence representation, and the bias
unit, respectively. The network uses 1− z as the degree to which information from the
previous state is thrown away. Secondly, each LSTM unit has an output gate
that controls the exposure of its memory content at each timestep. In contrast, the
GRU network fully exposes its memory content and does not feature an output gate.
Thirdly, in LSTM, the new cell state value at timestep t is computed by taking part
of the information from the current input and the previous cell state value. In GRU, the
new cell state value is controlled by a reset gate r, as:
rt = σ(Wr · st + Ur · ht−1 + br) (7.8)
The new state value h is:
h = tanh(W · st + U(rt · ht−1) + b) (7.9)
where Wr, W and br, b are weights and biases. The previous hidden state ht−1 is
modified by the reset gate rt. When rt is close to 0, the network focuses only on the
current unit and ignores the previous state value. This mechanism allows the network
to drop irrelevant information from previous computations, producing more compact
representations. The new hidden state ht is
ht = zt · ht−1 + (1 − zt) · ĥt (7.10)
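The gating equations above can be sketched as a single GRU forward step in NumPy. This is a minimal illustration only: the weights are random placeholders rather than trained parameters, and the final update follows the convention shown in Figure 7.6.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(dim, rng):
    """Random placeholder weights (Wz, Uz, bz, Wr, Ur, br, W, U, b)."""
    def group():
        return (rng.standard_normal((dim, dim)) * 0.1,
                rng.standard_normal((dim, dim)) * 0.1,
                np.zeros(dim))
    return group() + group() + group()

def gru_step(s_t, h_prev, params):
    """One GRU step over input s_t and previous hidden state h_{t-1}."""
    Wz, Uz, bz, Wr, Ur, br, W, U, b = params
    z = sigmoid(Wz @ s_t + Uz @ h_prev + bz)          # update gate, Eq. (7.7)
    r = sigmoid(Wr @ s_t + Ur @ h_prev + br)          # reset gate,  Eq. (7.8)
    h_cand = np.tanh(W @ s_t + U @ (r * h_prev) + b)  # candidate state, Eq. (7.9)
    return z * h_prev + (1.0 - z) * h_cand            # new hidden state (Figure 7.6)

rng = np.random.default_rng(0)
params = init_params(4, rng)
h = np.zeros(4)
for s_t in rng.standard_normal((3, 4)):  # three timesteps of toy inputs
    h = gru_step(s_t, h, params)
print(h.shape)  # (4,)
```

Because the new hidden state is a convex combination of the previous state and a tanh-bounded candidate, every component of h stays within (−1, 1).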
[Figure 7.6 depicts the Gated Recurrent Unit: from the inputs (st, ht−1), the reset gate rt = sigmoid(Wrst + Urht−1) and update gate zt = sigmoid(Wzst + Uzht−1) are computed; the candidate state ĥ = tanh(Wst + U(rtht−1)) is gated by rt, and the new hidden state is ht = ztht−1 + (1 − zt)ĥ.]
Figure 7.6: Gated Recurrent Unit Architecture
7.4.2 State-of-the-art Algorithms on Datasets
Most AKE algorithms are dataset-dependent, i.e. they only perform well on the chosen
datasets with heuristic tunings. Hasan and Ng [9] evaluate five AKE algorithms on four
different datasets, among which three are the current state-of-the-art algorithms. They
show that the state-of-the-art algorithm for one particular dataset cannot deliver a
similar result on the others, because certain statistical characteristics of one dataset
may not exist in the others.
The state-of-the-art algorithms for our evaluation datasets are: Topic Clustering [8] for
Hulth, ExpandRank [10] for DUC, and SemGraph [24] for SemEval. All of them are
unsupervised AKE algorithms. Table 7.2 shows the performance of these algorithms. A
brief history of each dataset and its state-of-the-art algorithm is as follows.
The Hulth dataset was built by Hulth [5] (2003), who proposes a decision-tree AKE
classifier using four features: 1) the frequency with which each phrase occurs in a document, 2) the
frequency of each phrase in the dataset, 3) the relative position of the first occurrence
of each phrase, and 4) the part-of-speech tag of each phrase. Later, Mihalcea and
Tarau [3] (2004) introduce an unsupervised graph ranking algorithm named TextRank,
derived from the PageRank algorithm [89], producing better performance than Hulth's
supervised machine learning model on the same dataset. The most recent state-of-the-
art performance is produced by Topic Clustering [8] (2009), an unsupervised algorithm
that uses Wikipedia-based semantic relatedness of candidates with clustering techniques
to group candidates by the different topics in a document.
The DUC dataset was annotated by Wan and Xiao [10] (2008), who propose a graph
ranking algorithm named ExpandRank, producing the state-of-the-art performance on
the dataset. Liu et al. [39] (2010) propose another unsupervised graph ranking approach
named Topical PageRank, which first uses Latent Dirichlet Allocation (LDA) [76] to obtain
topic distributions of candidate phrases, then runs Personalised PageRank for each topic
separately, where the random-jump factor of a vertex in the graph is weighted by the
vertex's probability under the topic. However, Topical PageRank is unable to outperform
ExpandRank.
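The mechanism at the heart of Topical PageRank, a Personalised PageRank whose random-jump distribution is biased by a topic's probabilities, can be sketched as a simple power iteration. The graph, damping factor, and topic weights below are toy assumptions for illustration, not values from [39]:

```python
import numpy as np

def personalised_pagerank(adj, jump_probs, damping=0.85, iters=100):
    """Power iteration for Personalised PageRank.
    adj: (n, n) adjacency matrix of the word graph; jump_probs: per-vertex
    random-jump weights, e.g. a topic's probability over candidates."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1, keepdims=True)
    out_deg[out_deg == 0] = 1.0          # guard against sink vertices
    transition = adj / out_deg           # row-stochastic transition matrix
    p = jump_probs / jump_probs.sum()    # normalised jump distribution
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        scores = damping * transition.T @ scores + (1 - damping) * p
    return scores

# Toy co-occurrence graph over four candidates, jump mass biased to vertex 0.
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
topic = np.array([0.7, 0.1, 0.1, 0.1])
scores = personalised_pagerank(adj, topic)
print(scores.round(3))
```

Running one such iteration per topic, as Topical PageRank does, yields a topic-specific ranking; here the topic-favoured, well-connected vertex 0 ends up ranked above the weakly connected vertex 3.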
The SemEval dataset was built for SemEval 2010 Shared Task 5. There were 19 partici-
pants, among which the HUMB [233] system delivered the best performance [51]. HUMB
is a supervised machine learning approach employing a bagged decision tree algorithm
with manually selected features, including positional features, co-occurrence statistics,
and lexical and semantic features obtained from Wikipedia. More recently, Martinez-Romo
et al. [24] (2016) report an unsupervised graph ranking approach named SemGraph that
produces better results than HUMB, becoming the new state-of-the-art algorithm on the
SemEval dataset. SemGraph first constructs a graph using co-occurrence statistics
and knowledge supplied by WordNet, then uses a two-step ranking process: 1)
select only the top-ranked third of candidates using PageRank, which are the inputs to the
second ranking step, and 2) rank the candidates obtained from the first step by their
frequencies, extracting only the top 15 ranked candidates as keyphrases.
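Assuming PageRank scores and frequencies for the candidates have already been computed, SemGraph's two-step ranking can be sketched as follows (the candidate phrases and scores are invented for illustration):

```python
def semgraph_rank(pagerank_scores, frequencies, final_k=15):
    """Sketch of SemGraph's two-step ranking: keep the top third of
    candidates by PageRank score, then rank that shortlist by frequency
    and return the top `final_k` candidates as keyphrases."""
    ranked = sorted(pagerank_scores, key=pagerank_scores.get, reverse=True)
    shortlist = ranked[:max(1, len(ranked) // 3)]   # step 1: top third
    by_freq = sorted(shortlist, key=lambda c: frequencies.get(c, 0), reverse=True)
    return by_freq[:final_k]                        # step 2: top-k by frequency

pr = {"neural network": 0.9, "deep learning": 0.8, "training": 0.5,
      "experiment": 0.4, "result": 0.3, "page": 0.1}
freq = {"neural network": 4, "deep learning": 7, "training": 12}
print(semgraph_rank(pr, freq, final_k=2))  # ['deep learning', 'neural network']
```

Note how the frequency re-ranking in step 2 can reorder the PageRank shortlist: "deep learning" overtakes the higher-PageRank "neural network".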
7.4.3 Evaluation Datasets and Methodology
The statistics of each dataset are detailed in Section 3.3, Table 3.1. The Hulth and SemEval
datasets provide both training and evaluation sets for supervised machine learning ap-
proaches. We randomly split the DUC dataset into 40% for training, 20% for validation,
and 40% for evaluation.
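Such a split can be reproduced with a seeded random shuffle; the seed and document ids below are illustrative assumptions, not the split actually used:

```python
import random

def split_dataset(doc_ids, seed=42):
    """Randomly split documents into 40% train, 20% validation, 40% test."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility
    n = len(ids)
    n_train, n_val = int(0.4 * n), int(0.2 * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 40 20 40
```

Seeding the shuffle keeps the partition stable across runs, which matters when comparing models trained on the same split.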
7.4.4 Training and Evaluation Setup
Candidate phrases are identified using the NP Chunker described in Section 3.2.1. We stem
texts and candidate phrases using Porter's algorithm [161].
7.4.4.1 Pre-training Word Embeddings
Word embedding vectors are pre-trained over each evaluation dataset separately using
the SkipGram model described in Section 4.3, with a dimension of 300 for all the evaluation
datasets. The hyper-parameter settings and training setup are the same as the AKE
word embedding training setup described in Section 4.4.1.
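For reference, the SkipGram model is trained on (centre, context) word pairs drawn from a sliding window. Generating those pairs from a tokenised sentence can be sketched as follows; the window size here is illustrative, not the setting of Section 4.4.1:

```python
def skipgram_pairs(tokens, window=2):
    """Yield (centre, context) training pairs within `window` words of each centre."""
    pairs = []
    for i, centre in enumerate(tokens):
        # every position within the window, excluding the centre itself
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

print(skipgram_pairs(["keyphrase", "extraction", "is", "hard"], window=1))
```

Each pair trains the model to predict the context word from the centre word, which is what makes the resulting embeddings reflect distributional similarity.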
7.4.4.2 Training AKE Models
We use 5 different region sizes (2, 3, 4, 5, 6) for the baseline CNN sentence model. For the
CDC model, we use 5 different 2-dimensional regions: (1, 5), (2, 5), (3, 5), (4, 5), (5, 5).
We also experimented with other region sizes: smaller sizes decrease the performance,
and there is no noticeable gain from employing larger region sizes.
The learning rate is set to 0.01 for all models. We notice that a larger learning rate, e.g.
0.05, tends to over-fit the model quickly, while there is no significant improvement from
employing a smaller learning rate, e.g. 0.005.
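Each 2-dimensional region above is a filter convolved over a (rows × embedding-dimension) input matrix. A minimal valid-convolution sketch makes this concrete; the sizes here are tiny illustrative stand-ins (real embeddings have dimension 300, and real models use many filters per region):

```python
import numpy as np

def conv2d_valid(matrix, filt):
    """Valid 2-D convolution (no padding): slide `filt` over `matrix`
    and record the dot product at each position, giving a feature map."""
    mh, mw = matrix.shape
    fh, fw = filt.shape
    out = np.empty((mh - fh + 1, mw - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(matrix[i:i + fh, j:j + fw] * filt)
    return out

rng = np.random.default_rng(1)
doc = rng.standard_normal((10, 5))   # 10 row vectors, toy dimension 5
filt = rng.standard_normal((3, 5))   # region size (3, 5): 3 adjacent rows at a time
feature_map = conv2d_valid(doc, filt)
print(feature_map.shape)  # (8, 1)
```

Because the region spans the full embedding width, the filter slides only vertically, combining each group of three adjacent rows into one feature value.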
Both phrase and document models take word embedding vectors as inputs. We use two
separate sets of word embedding vectors for the phrase and document models, so each
model only needs to update its own embedding values. Theoretically, we could share the
embedding vectors between the phrase and document models. However, our empirical
results show that using shared embedding vectors not only decreases the performance,
but also significantly increases the training time due to back-propagating the errors
from two deep neural networks.
We train our model on an NVIDIA GTX 980 Ti GPU. The training time is modest: training
the model over the Hulth and DUC datasets only takes a few hours, and training over
the SemEval dataset takes a day.
7.4.5 Evaluation Results and Discussion
Table 7.2 shows the evaluation results. The CDC model matches the F-score of Topic
Clustering (the state of the art) on the Hulth dataset without applying any heuristics,
and delivers the new state-of-the-art performance on the DUC dataset. However, it
produces a much lower F-score on the SemEval dataset compared to HUMB and SemGraph.
On the Hulth dataset, in comparison to the supervised machine learning model proposed
by Hulth [5], the CDC model produces a very similar recall score but delivers a
much better precision score. This indicates that our proposed model identifies negative
samples more accurately by learning the meanings of candidate phrases, and thus classifies
fewer candidates as positive. In comparison to the state-of-the-art algorithm, Topic
Clustering, both models have the same F-score. However, we consider the CDC
model more robust. Firstly, Topic Clustering uses a dataset-dependent filter that
discards the candidates that are too common to be keyphrases2. The authors report that
2There are no details on how the filter is constructed, so we cannot reimplement the filter to report a fair comparison.
Table 7.2: Evaluation Results on Three Datasets

Hulth (ground-truth proportion)   Precision   Recall   F-score
Hulth [5] (2003)                    25.2%      51.7%    33.9%
TextRank [3] (2004)                 31.2%      43.1%    36.2%
TopicClustering [8] (2009)          35.0%      66.0%    45.7%
CNN-GRU Model                       40.3%      51.1%    45.1%
LSTM-GRU Model                      40.6%      51.5%    45.4%
CDC Model                           41.2%      51.4%    45.7%

DUC                               Precision   Recall   F-score
ExpandRank [10] (2008)              28.8%      35.4%    31.7%
TopicalPageRank [39] (2010)         28.2%      34.8%    31.2%
CNN-GRU Model                       30.3%      42.8%    35.5%
LSTM-GRU Model                      29.5%      41.7%    34.6%
CDC Model                           33.0%      46.6%    38.6%

SemEval                           Precision   Recall   F-score
HUMB [233] (2010)                   27.2%      27.8%    27.5%
SemGraph [24] (2016)                32.4%      33.2%    32.8%
CNN-GRU Model                       11.3%      12.6%    11.9%
LSTM-GRU Model                      10.5%      13.7%    11.9%
CDC Model                           12.0%      12.0%    12.0%

The CNN-GRU, LSTM-GRU, and CDC models are supervised deep learning models; the other models are unsupervised.
the F-score decreases by 5%, to 40%, without using the filter [8]. In the CDC model, no
hand-crafted features, hyper-parameters, or dataset-dependent filters are employed to
boost the F-score. Secondly, considering the precision and recall scores, the CDC model
produces a better precision but a lower recall score than Topic Clustering. This is because
Topic Clustering extracts many candidates as keyphrases to balance the F-score.
In unsupervised AKE, a fixed number of highest-ranked candidates are extracted as
keyphrases, and hence there is a trade-off between the precision and recall scores. In
general, increasing the number of extracted keyphrases improves the recall score by
compromising on precision, and vice versa. Topic Clustering extracts the top two-thirds
of ranked candidates as keyphrases. This hyper-parameter yields the state-of-the-art
F-score on the Hulth dataset by minimising the trade-off between precision and recall.
On the other hand, CDC is a binary classifier and so has no such trade-off between the
precision and recall scores. In comparison to the two baseline deep learning models, the
proposed CDC model produces slightly better performance. The only difference between
the CDC, CNN-GRU, and LSTM-GRU models is the document deep learning model,
which is the key factor yielding the different performance. The Hulth dataset contains
only short documents with on average 125 tokens per document. Dependencies between
sentences in such short documents are very strong,
which allows the CNN-GRU and LSTM-GRU models to learn more precise representations of
documents by capturing these dependencies. However, when these dependencies become
looser in longer documents, the CDC model delivers much better performance, as we
discuss below.
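The precision-recall trade-off in unsupervised AKE described above can be illustrated with toy numbers (the ranked candidate list and gold set below are invented): as more of the ranked list is extracted, recall rises while precision falls.

```python
def prf(extracted, gold):
    """Precision, recall, and F-score of an extracted keyphrase list."""
    tp = len(set(extracted) & set(gold))
    p = tp / len(extracted) if extracted else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

ranked = ["a", "b", "c", "d", "e", "f"]  # candidates, best-ranked first
gold = {"a", "b", "c"}                   # ground-truth keyphrases
for k in (2, 4, 6):
    p, r, f = prf(ranked[:k], gold)
    print(f"top-{k}: P={p:.2f} R={r:.2f} F={f:.2f}")
```

Choosing the cut-off k (as Topic Clustering's two-thirds heuristic does) amounts to picking a point on this curve; a binary classifier like CDC has no such cut-off to tune.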
On the DUC dataset, the CDC model yields the new state-of-the-art performance,
outperforming the ExpandRank algorithm by nearly 7% on F-score. The CDC model
classifies more positive candidates while maintaining a strong precision. It also outperforms
the two baseline models. Both CNN-GRU and LSTM-GRU use a GRU network to
learn document representations, which captures the long-term dependencies of all the
sentences in a document. The DUC dataset contains news articles with on average 800
tokens per document, so sentences at the beginning of a document usually have loose
or no dependencies on the last few. The GRU document model captures unnecessary
dependencies between sentences, which affects the performance of the overall system. On the
other hand, the CDC model only encodes the short-term dependencies of the sentences,
delivering better performance.
On the SemEval dataset, the CDC model produces an unexpectedly low F-score, much
lower than HUMB and SemGraph. The baseline deep learning models also fail to deliver
reasonable performance. We believe that the main reason is that the proposed CDC
model and the baseline models fail to learn the representations of documents with
multiple topics, ideas, or arguments. The SemEval dataset consists of full-length journal
articles with an average of about 5,800 tokens. Learning distributed representations
of such long documents is challenging. In fact, the majority of work on modelling
documents using deep learning techniques only targets short and mid-length documents,
such as reviews of products or movies, where the authors of the reviews usually present
a single topic, view, or argument. On the other hand, full-length journal articles consist
of multiple sections, where the authors usually present different topics. Neither the
proposed CDC model nor the baseline models have a mechanism to handle multiple
topics or ideas, and hence they are unable to produce reasonable results. A possible
solution to this problem is to let CDC learn the representations of paragraphs, and then
use a recurrent neural network such as an LSTM to learn the overall representation of a
document, which will be our future work.
Extracting keyphrases from long documents is much more challenging because the doc-
uments yield a large number of candidate phrases [31]. The SemEval dataset has 587
candidate phrases per article on average, of which only 9.6 candidates are ground-truth
keyphrases. In our evaluations, we did not employ any dataset heuristics to reduce the
number of candidates. On the other hand, both HUMB and SemGraph use heuristics
to filter out candidates that are unlikely to be keyphrases before the actual extraction
process. HUMB constructs a post-ranking process that incorporates statistics from an
external repository of journal articles named Hyper Articles en Ligne (HAL) to produce
a ranking score for each candidate. Similar to unsupervised AKE systems, only top-ranked
candidates are extracted as keyphrases. The post-ranking process essentially
prevents the binary classifier from producing too many keyphrases. SemGraph only considers
the content of the title, abstract, introduction, related work, conclusions, and future work
sections, which significantly reduces the number of candidate phrases.
7.5 Conclusion
In this chapter, we have introduced a deep learning model for AKE. We demonstrated
that the proposed model performs well at extracting keyphrases from short and mid-
length documents, matching the state-of-the-art performance on the Hulth dataset and
delivering a new state-of-the-art score on the DUC dataset without employing any
manually selected features or dataset-dependent heuristics. However, the proposed model
is unable to learn useful representations of long documents such as journal articles, which
usually contain multiple topics, ideas, or arguments. A possible solution is to let CDC
learn the representations of paragraphs, and then use a recurrent neural network such as
an LSTM to learn the overall representation of a document, which will be our future work.
Chapter 8
Conclusion
Throughout this thesis, we have developed a series of semantic knowledge-based AKE
techniques using distributed representations of words, phrases, and documents learnt
from deep neural networks. Specifically, we investigated the application, capability, and
efficacy of distributed representations for encoding compositional semantics of phrases
and documents, in order to address the following four major issues in existing AKE
systems:
1. Difficulties in incorporating background and domain-specific knowledge.
Using public semantic knowledge bases such as WordNet to obtain additional
semantic information is practically difficult because they supply limited vocabularies
and insufficient domain-specific knowledge.
We conducted a systematic evaluation of four unsupervised algorithms combined with
different pre-processing and post-processing techniques in Chapter 3. The evaluation
shows that all algorithms are sensitive to frequencies of phrases to different degrees,
i.e. they mostly fail to extract keyphrases with lower frequencies. We found
that incorporating external knowledge would mitigate the problem. However,
public knowledge bases such as Wikipedia and WordNet are designed for
general purposes, and hence they are insufficient to cover domain-specific terms
and to supply the specific knowledge for identifying keyphrases in domain-specific
corpora.
To address the difficulties in incorporating background and domain-specific knowledge,
we proposed to use word embeddings. We utilise a special characteristic of
word embeddings: they can be trained multiple times over different corpora to
encode both general and domain-specific semantics of words, and new words
can be subsequently added into the vocabulary before training on different corpora.
Such advantages of word embeddings fundamentally overcome the problem
that public semantic knowledge bases have: limited vocabulary and insufficient
domain-specific knowledge. As we demonstrated in Chapter 4, after
training over general and then domain-specific corpora, the meanings of domain-
specific words encoded in word embeddings change significantly to represent the
domain-specific knowledge, while non-domain-specific words retain their original
meanings. Based on this characteristic, we developed a general-purpose weighting
scheme using the semantics supplied in pre-trained word embeddings, and demonstrated
that the weighting scheme generally enhances the performance of unsupervised
graph-based AKE algorithms. In addition, we also demonstrated that the proposed
weighting scheme efficiently reduces the impact of phrase frequency.
However, the development of weighting schemes is a rather ad-hoc process where
the choice of features is critical to the overall performance of the algorithms. This
problem turns the development of graph-based AKE algorithms into a laborious
feature engineering process.
2. Failure to capture the semantics of phrases and documents. Traditional
AKE algorithms rely on manually selected features based on observations
of keyphrases' linguistic, statistical, or structural characteristics, such as
co-occurrence frequencies and occurrence positions. These features, however, carry
no semantic meanings of phrases or documents. Traditional AKE algorithms are
unable, or make no attempt, to understand the meanings of phrases and documents,
and such lack of cognitive ability inhibits the performance of existing AKE
algorithms.
To address this problem, we developed two deep learning models: the Matrix-Vector
Recurrent Unit (MVRU) in Chapter 6 and the Convolutional Document Cube (CDC)
in Chapter 7. Studies have shown that distributed word representations are capable
of encoding the semantic meanings of words [13, 170]. However, learning
distributed representations of multi-gram expressions, such as multi-word phrases
or sentences, remains a great challenge. This is because the meaning of a
multi-word expression is constructed from the meanings of its constituent words
and the rules for composing them [25]. To explicitly learn these compositional
rules, we developed the MVRU model based on a recurrent neural network archi-
tecture, where a word is represented using a matrix-vector pair. The vector
represents the meaning of a word, and the matrix is intended to capture the compo-
sitional rule, i.e. it determines the contributions made by other words when they
compose with the word. We demonstrated that the MVRU model captures
more accurate semantics than existing deep learning models such as the convo-
lutional neural network (CNN) and the Long Short-Term Memory (LSTM) network.
On the other hand, learning the meanings of documents means encoding the core ideas,
thoughts, and logical flows embedded in the intrinsic structures of the contents. In
contrast to the constituent words of a phrase, which always have strong dependencies,
words in a document usually have strong dependencies only on their neighbouring
words, and the strength of these dependencies may decrease as the distance
between words increases. For example, while the semantics of words may be correlated
between the beginning (e.g. abstracts or introductions) and the end (e.g. conclusions) of a
document, sentences appearing at the beginning could have very loose or even no
dependencies on the ones in the middle of the document. Capturing all dependencies
between words in a document may be unnecessary, and could also introduce noisy
data. Hence, we believe that only short-term dependencies among words should
be analysed. This intuition is implemented in the CDC model. The model is
based on a CNN, whose strength is encoding regional compositions at
different locations in data matrices. The network features different region sizes,
where each region has a number of filters that produce the feature maps. By
convolving the data matrices with different region sizes, the CNN captures
how words are composed in each region. Since the region sizes are relatively small,
the network only analyses words and their close neighbours, aiming to encode
only short-term dependencies.
A drawback we observed in both the MVRU and CDC models is the arbitrary number
of dimensions for vectors and matrices. Lower dimensions may not capture
enough information, whereas higher dimensions significantly increase the
training time and the cost of computation. We have not yet precisely identified
the impact of different dimensionalities.
3. Labour-intensive. The feature engineering process in existing AKE approaches
turns the development of AKE models into a time-consuming and labour-intensive
exercise.
We addressed problems 2 and 3 by developing a deep learning model for AKE in
Chapter 7, which not only eliminates the effort of feature engineering, but
also automatically learns features that are capable of representing
the semantics of both phrases and documents. The proposed model consists
of an MVRU model and a CDC model, where the MVRU is responsible for capturing
the meanings of phrases, and the CDC is responsible for learning the representations of
documents. The output layer is a logistic regression layer, which classifies whether
a phrase can be a keyphrase for the input document based on the meanings learnt
from the phrase and document models. We demonstrated that the proposed deep
learning model is dataset-independent and knowledge-rich, delivering the new
state-of-the-art performance on two different datasets: a collection of
the abstracts of journal articles from the computer science domain, and a collection of
news articles, without employing any dataset-dependent heuristics.
The deep learning model developed in this thesis not only demonstrates the capability
and efficiency of applying deep neural networks to AKE problems, but also
provides effective tools for learning general representations of phrases and relatively
short documents such as news articles. However, the proposed model is unable to
learn useful representations of long documents, such as full-length journal articles,
which usually contain multiple topics, ideas, or arguments. A possible solution is to
let CDC learn the representations of paragraphs, and then use a recurrent neural
network such as an LSTM to learn the overall representation of a document.
8.1 Future Work
We raise four questions from this thesis, which provide future research directions.
1. What has been learnt by deep learning models? From Chapters 5 to 7, we
have focused on learning the representations of phrases and documents. We have
shown that the learnt representations carry semantics to some degree, as demonstrated
by different experiments, including semantic similarity, semantic compositionality,
term identification, and AKE. However, the exact amount of semantic informa-
tion encoded in the representations remains unclear. Furthermore, distributed
representations enable machine learning algorithms to automatically learn useful
features; however, it is unclear what information each feature carries. Discovering
an approach that allows us to understand features and measure the information
encoded in distributed representations will help researchers to build more robust
and efficient deep learning models.
2. Do deep learning models still need as much labelled data as traditional
supervised machine learning models require? In Chapter 5, we presented
a co-training approach for domain-specific term extraction. We found that the
model reached the best classification results without seeing much training data,
and further training seemed to overfit the model. One possible reason may be
that the word embeddings are pre-trained over the dataset, which not only encodes
the semantics of words, but also provides better initial values of the embedding vectors
that allow the deep learning model to converge more quickly. However, the
amount of training data that deep learning models require has not been quantified. The
lack of training data is one of the major problems that affect the performance
of machine learning algorithms, and hence, analysing and comparing the amount
of training data required by deep learning models and classic machine learning
algorithms will be a very useful research topic for future studies.
3. How can representations of long documents be learnt more efficiently? In
Chapter 7, we introduced the Convolutional Document Cube (CDC) model to learn the
representations of documents. We also used two additional deep learning models
(LSTM-GRU and CNN-GRU) as the baselines. However, all the deep learning models
failed to learn useful representations of long documents, e.g. full-length journal
articles. Discovering an efficient and effective approach to learning representations of
long documents will be our future work.
4. Can the semantics encoded in distributed representations help develop
a more effective AKE evaluation approach? Currently, the most popular
evaluation methodology for AKE is exact match, i.e. a ground-truth keyphrase
selected by human annotators matches a computer-identified keyphrase when they
correspond to the same stem sequence. Such an evaluation methodology fails to match
two semantically identical phrases that have different constituent words or stem
sequences, such as terminology mining and domain-specific term extraction, or
artificial neural network and neural network. At present, algorithms are unable
to identify with great precision two phrases referring to the same concept or object.
Throughout this thesis, we have shown that deep learning models have the ability
to encode the semantics of words and multi-word phrases to some degree. Using deep
neural networks and distributed representations to develop a more effective AKE
evaluation framework will help us to evaluate the performance of AKE algorithms
more accurately.
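Exact-match evaluation over stem sequences can be sketched as follows. The suffix-stripping stemmer below is a deliberately naive stand-in for Porter's algorithm, and the gold and predicted phrases are invented; the example also shows the limitation discussed above, since the semantically close "artificial neural network" fails to match "neural networks".

```python
def naive_stem(word):
    """Crude stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def stem_seq(phrase):
    """A phrase's stem sequence: the tuple of its stemmed, lowercased words."""
    return tuple(naive_stem(w) for w in phrase.lower().split())

def exact_match(predicted, gold):
    """A prediction counts only if its stem sequence equals a gold one."""
    gold_stems = {stem_seq(g) for g in gold}
    return [p for p in predicted if stem_seq(p) in gold_stems]

gold = ["neural networks", "keyphrase extraction"]
pred = ["neural network", "artificial neural network", "keyphrase extractions"]
print(exact_match(pred, gold))  # ['neural network', 'keyphrase extractions']
```

Stemming absorbs morphological variation (singular/plural), but any difference in the constituent words themselves still counts as a miss, which is exactly the weakness a semantics-aware evaluation framework would address.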
In this thesis, we have presented a series of studies on applying distributed vector rep-
resentations and deep learning technologies to AKE problems, and demonstrated how
these technologies can solve or mitigate the problems. However, deep learning for Natural
Language Processing is still an emerging research field, and the above questions raised from
this thesis provide interesting insights into our future research directions.
Bibliography
[1] Ian H Witten, Gordon W Paynter, Eibe Frank, Carl Gutwin, and Craig G Nevill-
Manning. Kea: Practical automatic keyphrase extraction. In Proceedings of the
fourth ACM conference on Digital libraries, pages 254–255. ACM, 1999.
[2] Karen Sparck Jones. A statistical interpretation of term specificity and its appli-
cation in retrieval. Journal of documentation, 28(1):11–21, 1972.
[3] Rada Mihalcea and Paul Tarau. Textrank: Bringing order into texts. In
Dekang Lin and Dekai Wu, editors, Proceedings of EMNLP 2004, pages 404–411,
Barcelona, Spain, July 2004. Association for Computational Linguistics.
[4] Peter D Turney. Learning algorithms for keyphrase extraction. Information Re-
trieval, 2(4):303–336, 2000.
[5] Anette Hulth. Improved automatic keyword extraction given more linguistic
knowledge. In Proceedings of the 2003 conference on Empirical methods in natural
language processing, pages 216–223. Association for Computational Linguistics,
2003.
[6] Jinghua Wang, Jianyi Liu, and Cong Wang. Keyword extraction based on pager-
ank. In Advances in Knowledge Discovery and Data Mining, pages 857–864.
Springer, 2007.
[7] Maria Grineva, Maxim Grinev, and Dmitry Lizorkin. Extracting key terms from
noisy and multitheme documents. In Proceedings of the 18th international confer-
ence on World wide web, pages 661–670. ACM, 2009.
[8] Zhiyuan Liu, Peng Li, Yabin Zheng, and Maosong Sun. Clustering to find exemplar
terms for keyphrase extraction. In Proceedings of the 2009 Conference on Empirical
Methods in Natural Language Processing: Volume 1-Volume 1, pages 257–266.
Association for Computational Linguistics, 2009.
[9] Kazi Saidul Hasan and Vincent Ng. Conundrums in unsupervised keyphrase ex-
traction: making sense of the state-of-the-art. In Proceedings of the 23rd Interna-
tional Conference on Computational Linguistics: Posters, pages 365–373. Associ-
ation for Computational Linguistics, 2010.
[10] Xiaojun Wan and Jianguo Xiao. Single document keyphrase extraction using
neighborhood knowledge. In AAAI, volume 8, pages 855–860, 2008.
[11] Joseph Turian, Lev Ratinov, and Yoshua Bengio. Word representations: a sim-
ple and general method for semi-supervised learning. In Proceedings of the 48th
annual meeting of the association for computational linguistics, pages 384–394.
Association for Computational Linguistics, 2010.
[12] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A
review and new perspectives. IEEE transactions on pattern analysis and machine
intelligence, 35(8):1798–1828, 2013.
[13] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in con-
tinuous space word representations. In HLT-NAACL, pages 746–751. Citeseer,
2013.
[14] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation
of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[15] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Dis-
tributed representations of words and phrases and their compositionality. In Ad-
vances in neural information processing systems, pages 3111–3119, 2013.
[16] Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khu-
danpur. Recurrent neural network based language model. In INTERSPEECH,
volume 2, page 3, 2010.
[17] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with
neural networks. In Advances in neural information processing systems, pages
3104–3112, 2014.
[18] Richard Socher, Christopher D Manning, and Andrew Y Ng. Learning continuous
phrase representations and syntactic parsing with recursive neural networks. In
Proceedings of the NIPS-2010 Deep Learning and Unsupervised Feature Learning
Workshop, pages 1–9, 2010.
[19] Richard Socher, Brody Huval, Christopher D Manning, and Andrew Y Ng. Se-
mantic compositionality through recursive matrix-vector spaces. In Proceedings of
the 2012 Joint Conference on Empirical Methods in Natural Language Processing
and Computational Natural Language Learning, pages 1201–1211. Association for
Computational Linguistics, 2012.
[20] Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Man-
ning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic
compositionality over a sentiment treebank. In Proceedings of the conference on
empirical methods in natural language processing (EMNLP), volume 1631, page
1642. Citeseer, 2013.
[21] Rui Wang, Wei Liu, and Chris McDonald. Corpus-independent generic keyphrase
extraction using word embedding vectors. In Deep Learning for Web Search and
Data Mining Workshop (DL-WSDM 2015), 2014.
[22] Rui Wang, Wei Liu, and Chris McDonald. Using word embeddings to enhance
keyword identification for scientific publications. In Australasian Database Con-
ference, pages 257–268. Springer, 2015.
[23] Rui Wang, Wei Liu, and Chris McDonald. How preprocessing affects unsupervised
keyphrase extraction. In Computational Linguistics and Intelligent Text Process-
ing, pages 163–176. Springer, 2014.
[24] Juan Martinez-Romo, Lourdes Araujo, and Andres Duque Fernandez. Semgraph:
Extracting keyphrases following a novel semantic graph-based approach. Journal
of the Association for Information Science and Technology, 67(1):71–82, 2016.
[25] Gottlob Frege. Über Sinn und Bedeutung. Wittgenstein Studien, 1(1), 1994.
[26] Yoon Kim. Convolutional neural networks for sentence classification. arXiv
preprint arXiv:1408.5882, 2014.
[27] Ye Zhang and Byron Wallace. A sensitivity analysis of (and practitioners’ guide
to) convolutional neural networks for sentence classification. arXiv preprint
arXiv:1510.03820, 2015.
[28] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[29] Rui Wang, Wei Liu, and Chris McDonald. Featureless domain-specific term extraction with minimal labelled data. In Australasian Language Technology Association Workshop 2016, page 103, 2016.
[30] Su Nam Kim, Olena Medelyan, Min-Yen Kan, and Timothy Baldwin. Automatic
keyphrase extraction from scientific articles. Language resources and evaluation,
47(3):723–742, 2013.
[31] Kazi Saidul Hasan and Vincent Ng. Automatic keyphrase extraction: A survey of the state of the art. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, Maryland. Association for Computational Linguistics, 2014.
[32] Slobodan Beliga, Ana Meštrović, and Sanda Martinčić-Ipšić. An overview of graph-based keyword extraction methods and approaches. Journal of Information and Organizational Sciences, 39(1):1–20, 2015.
[33] Sifatullah Siddiqi and Aditi Sharan. Keyword and keyphrase extraction techniques:
a literature review. International Journal of Computer Applications, 109(2), 2015.
[34] Su Nam Kim and Min-Yen Kan. Re-examining automatic keyphrase extraction
approaches in scientific articles. In Proceedings of the workshop on multiword
expressions: Identification, interpretation, disambiguation and applications, pages
9–16. Association for Computational Linguistics, 2009.
[35] Yukio Ohsawa, Nels E Benson, and Masahiko Yachida. KeyGraph: Automatic indexing by co-occurrence graph based on building construction metaphor. In Proceedings of the IEEE International Forum on Research and Technology Advances in Digital Libraries (ADL '98), pages 12–18. IEEE, 1998.
[36] Yutaka Matsuo and Mitsuru Ishizuka. Keyword extraction from a single docu-
ment using word co-occurrence statistical information. International Journal on
Artificial Intelligence Tools, 13(01):157–169, 2004.
[37] David B Bracewell, Fuji Ren, and Shingo Kuroiwa. Multilingual single document keyword extraction for information retrieval. In Proceedings of the 2005 IEEE International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE '05), pages 517–522. IEEE, 2005.
[38] Mikalai Krapivin, Maurizio Marchese, Andrei Yadrantsau, and Yanchun Liang. Unsupervised key-phrases extraction from scientific papers using domain and linguistic knowledge. In Proceedings of the Third International Conference on Digital Information Management (ICDIM 2008), pages 105–112. IEEE, 2008.
[39] Zhiyuan Liu, Wenyi Huang, Yabin Zheng, and Maosong Sun. Automatic keyphrase extraction via topic decomposition. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 366–376. Association for Computational Linguistics, 2010.
[40] Roberto Ortiz, David Pinto, Mireya Tovar, and Héctor Jiménez-Salazar. BUAP: An unsupervised approach to automatic keyphrase extraction from scientific articles. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 174–177. Association for Computational Linguistics, 2010.
[41] Georgeta Bordea and Paul Buitelaar. DERIUNLP: A context based approach to automatic keyphrase extraction. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 146–149. Association for Computational Linguistics, 2010.
[42] Samhaa R El-Beltagy and Ahmed Rafea. KP-Miner: Participation in SemEval-2. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 190–193. Association for Computational Linguistics, 2010.
[43] Mari-Sanna Paukkeri and Timo Honkela. Likey: unsupervised language-
independent keyphrase extraction. In Proceedings of the 5th international workshop
on semantic evaluation, pages 162–165. Association for Computational Linguistics,
2010.
[44] Kalliopi Zervanou. UvT: The UvT term extraction system in the keyphrase extraction task. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 194–197. Association for Computational Linguistics, 2010.
[45] Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley. Automatic keyword
extraction from individual documents. Text Mining, pages 1–20, 2010.
[46] Martin Dostal and Karel Jezek. Automatic keyphrase extraction based on NLP and statistical methods. In DATESO, pages 140–145, 2011.
[47] Abdelghani Bellaachia and Mohammed Al-Dhelaan. NE-Rank: A novel graph-based keyphrase extraction in Twitter. In Proceedings of the 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology-Volume 1, pages 372–379. IEEE Computer Society, 2012.
[48] Kamal Sarkar. A hybrid approach to extract keyphrases from medical documents.
arXiv preprint arXiv:1303.1441, 2013.
[49] Wei You, Dominique Fontaine, and Jean-Paul Barthes. An automatic keyphrase
extraction system for scientific documents. Knowledge and information systems,
34(3):691–724, 2013.
[50] Sujatha Das Gollapalli and Cornelia Caragea. Extracting keyphrases from research
papers using citation networks. In AAAI, pages 1629–1635, 2014.
[51] Su Nam Kim, Olena Medelyan, Min-Yen Kan, and Timothy Baldwin. SemEval-2010 Task 5: Automatic keyphrase extraction from scientific articles. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 21–26. Association for Computational Linguistics, 2010.
[52] Xiaojun Wan and Jianguo Xiao. CollabRank: Towards a collaborative approach to single-document keyphrase extraction. In Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1, pages 969–976. Association for Computational Linguistics, 2008.
[53] Chedi Bechikh Ali, Rui Wang, and Hatem Haddad. A two-level keyphrase ex-
traction approach. In International Conference on Intelligent Text Processing and
Computational Linguistics, pages 390–401. Springer, 2015.
[54] Wei Liu, Bo Chuen Chung, Rui Wang, Jonathon Ng, and Nigel Morlet. A genetic
algorithm enabled ensemble for unsupervised medical term extraction from clinical
letters. Health information science and systems, 3(1):1–14, 2015.
[55] Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Helen Pinto, Qiming Chen, Umeshwar Dayal, and Mei-Chun Hsu. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. In Proceedings of the 17th International Conference on Data Engineering (ICDE), pages 215–224. IEEE, 2001.
[56] Katerina Frantzi, Sophia Ananiadou, and Hideki Mima. Automatic recognition of multi-word terms: the C-value/NC-value method. International Journal on Digital Libraries, 3(2):115–130, 2000.
[57] Wilson Wong, Wei Liu, and Mohammed Bennamoun. Determining the unithood
of word sequences using a probabilistic approach. arXiv preprint arXiv:0810.0139,
2008.
[58] Mike Scott. PC analysis of key words – and key key words. System, 25(2):233–245, 1997.
[59] Wen-tau Yih, Joshua Goodman, and Vitor R Carvalho. Finding advertising key-
words on web pages. In Proceedings of the 15th international conference on World
Wide Web, pages 213–222. ACM, 2006.
[60] Thuy Dung Nguyen and Min-Yen Kan. Keyphrase extraction in scientific publi-
cations. In International Conference on Asian Digital Libraries, pages 317–326.
Springer, 2007.
[61] Gonenc Ercan and Ilyas Cicekli. Using lexical chains for keyword extraction.
Information Processing & Management, 43(6):1705–1714, 2007.
[62] Zhuoye Ding, Qi Zhang, and Xuanjing Huang. Keyphrase extraction from online
news using binary integer programming. In IJCNLP, pages 165–173, 2011.
[63] Kathrin Eichler and Günter Neumann. DFKI KeyWE: Ranking keyphrases extracted from scientific articles. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 150–153. Association for Computational Linguistics, 2010.
[64] Xin Jiang, Yunhua Hu, and Hang Li. A ranking approach to keyphrase extraction.
In Proceedings of the 32nd international ACM SIGIR conference on Research and
development in information retrieval, pages 756–757. ACM, 2009.
[65] Songhua Xu, Shaohui Yang, and Francis Chi-Moon Lau. Keyword extraction and
headline generation using novel word features. In AAAI, 2010.
[66] Ken Barker and Nadia Cornacchia. Using noun phrase heads to extract document
keyphrases. In Conference of the Canadian Society for Computational Studies of
Intelligence, pages 40–52. Springer, 2000.
[67] Kenneth Ward Church and Patrick Hanks. Word association norms, mutual in-
formation, and lexicography. Computational linguistics, 16(1):22–29, 1990.
[68] Lee R Dice. Measures of the amount of ecologic association between species.
Ecology, 26(3):297–302, 1945.
[69] Taeho Jo. Neural based approach to keyword extraction from documents. In
International Conference on Computational Science and Its Applications, pages
456–461. Springer, 2003.
[70] Jiabing Wang, Hong Peng, and Jing-song Hu. Automatic keyphrases extraction
from document using neural network. In Advances in Machine Learning and Cy-
bernetics, pages 633–641. Springer, 2006.
[71] Kamal Sarkar, Mita Nasipuri, and Suranjan Ghose. A new approach to keyphrase
extraction using neural networks. arXiv preprint arXiv:1004.3274, 2010.
[72] Olena Medelyan, Eibe Frank, and Ian H Witten. Human-competitive tagging
using automatic keyphrase extraction. In Proceedings of the 2009 Conference on
Empirical Methods in Natural Language Processing: Volume 3-Volume 3, pages
1318–1327. Association for Computational Linguistics, 2009.
[73] Ian Witten and David Milne. An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In Proceedings of the AAAI Workshop on Wikipedia and Artificial Intelligence: An Evolving Synergy, pages 25–30. AAAI Press, Chicago, USA, 2008.
[74] Denis Turdakov and Pavel Velikhov. Semantic relatedness metric for Wikipedia concepts based on link analysis and its application to word sense disambiguation. 2008.
[75] David Milne. Computing semantic relatedness using Wikipedia link structure. In Proceedings of the New Zealand Computer Science Research Student Conference, pages 1–8, 2007.
[76] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[77] Claude Pasquier. Task 5: Single document keyphrase extraction using sentence clustering and latent Dirichlet allocation. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 154–157. Association for Computational Linguistics, 2010.
[78] Adrien Bougouin, Florian Boudin, and Béatrice Daille. TopicRank: Graph-based topic ranking for keyphrase extraction. In International Joint Conference on Natural Language Processing (IJCNLP), pages 543–551, 2013.
[79] Wayne Xin Zhao, Jing Jiang, Jing He, Yang Song, Palakorn Achananuparp, Ee-Peng Lim, and Xiaoming Li. Topical keyphrase extraction from Twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 379–388. Association for Computational Linguistics, 2011.
[80] Florian Boudin and Emmanuel Morin. Keyphrase extraction for n-best reranking
in multi-sentence compression. In North American Chapter of the Association for
Computational Linguistics (NAACL), 2013.
[81] Taeho Jo, Malrey Lee, and Thomas M Gatton. Keyword extraction from documents using a neural network model. In Proceedings of the 2006 International Conference on Hybrid Information Technology (ICHIT '06), volume 2, pages 194–197. IEEE, 2006.
[82] Kuo Zhang, Hui Xu, Jie Tang, and Juanzi Li. Keyword extraction using support
vector machine. In Advances in Web-Age Information Management, pages 85–96.
Springer, 2006.
[83] Chengzhi Zhang. Automatic keyword extraction from documents using conditional
random fields. Journal of Computational Information Systems, 4(3):1169–1180,
2008.
[84] Taemin Jo and Jee-Hyong Lee. Latent keyphrase extraction using deep belief
networks. International Journal of Fuzzy Logic and Intelligent Systems, 15(3):
153–158, 2015.
[85] Ludovic Jean-Louis, Michel Gagnon, and Éric Charton. A knowledge-base oriented approach for automatic keyword extraction. Computación y Sistemas, 17(2):187–196, 2013.
[86] Mari-Sanna Paukkeri, Ilari T Nieminen, Matti Pöllä, and Timo Honkela. A language-independent approach to keyphrase extraction and evaluation. In COLING (Posters), pages 83–86, 2008.
[87] Christian Wartena, Rogier Brussee, and Wout Slakhorst. Keyword extraction using word co-occurrence. In Proceedings of the 2010 Workshop on Database and Expert Systems Applications (DEXA), pages 54–58. IEEE, 2010.
[88] Jon M Kleinberg. Authoritative sources in a hyperlinked environment. Journal of
the ACM (JACM), 46(5):604–632, 1999.
[89] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web
search engine. Computer networks and ISDN systems, 30(1):107–117, 1998.
[90] Florian Boudin. A comparison of centrality measures for graph-based keyphrase
extraction. In International Joint Conference on Natural Language Processing
(IJCNLP), pages 834–838, 2013.
[91] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
[92] Zhiyuan Liu, Chen Liang, and Maosong Sun. Topical word trigger model for keyphrase extraction. In Proceedings of COLING, 2012.
[93] David Mimno, Hanna M Wallach, Jason Naradowsky, David A Smith, and Andrew
McCallum. Polylingual topic models. In Proceedings of the 2009 Conference on
Empirical Methods in Natural Language Processing: Volume 2-Volume 2, pages
880–889. Association for Computational Linguistics, 2009.
[94] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm
for deep belief nets. Neural computation, 18(7):1527–1554, 2006.
[95] Qi Zhang, Yang Wang, Yeyun Gong, and Xuanjing Huang. Keyphrase extraction using deep recurrent neural networks on Twitter. In EMNLP, pages 836–845, 2016.
[96] Rui Meng, Sanqiang Zhao, Shuguang Han, Daqing He, Peter Brusilovsky, and
Yu Chi. Deep keyphrase generation. arXiv preprint arXiv:1704.06879, 2017.
[97] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[98] Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK Li. Incorporating copying
mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393,
2016.
[99] Qing Li, Wenhao Zhu, and Zhiguo Lu. Predicting abstract keywords by word
vectors. In International Conference on High Performance Computing and Appli-
cations, pages 185–195. Springer, 2015.
[100] Su Nam Kim, Timothy Baldwin, and Min-Yen Kan. Evaluating n-gram based
evaluation metrics for automatic keyphrase extraction. In Proceedings of the 23rd
international conference on computational linguistics, pages 572–580. Association
for Computational Linguistics, 2010.
[101] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics, 2002.
[102] Abhaya Agarwal and Alon Lavie. METEOR, M-BLEU and M-TER: Evaluation metrics for high-correlation with human rankings of machine translation output. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 115–118. Association for Computational Linguistics, 2008.
[103] Mark A Przybocki and Alvin F Martin. The 1999 NIST speaker recognition evaluation, using summed two-channel telephone data for speaker detection and speaker tracking. In Sixth European Conference on Speech Communication and Technology, 1999.
[104] Chin-Yew Lin and Eduard Hovy. Automatic evaluation of summaries using n-
gram co-occurrence statistics. In Proceedings of the 2003 Conference of the North
American Chapter of the Association for Computational Linguistics on Human
Language Technology-Volume 1, pages 71–78. Association for Computational Lin-
guistics, 2003.
[105] Torsten Zesch and Iryna Gurevych. Approximate matching for evaluating
keyphrase extraction. In RANLP, pages 484–489, 2009.
[106] Ernesto D'Avanzo and Bernardo Magnini. A keyphrase-based approach to summarization: the LAKE system at DUC-2005. In Proceedings of DUC, 2005.
[107] Regina Barzilay and Michael Elhadad. Using lexical chains for text summarization.
Advances in automatic text summarization, pages 111–121, 1999.
[108] Dawn Lawrie, W Bruce Croft, and Arnold Rosenberg. Finding topic words for
hierarchical summarization. In Proceedings of the 24th annual international ACM
SIGIR conference on Research and development in information retrieval, pages
349–357. ACM, 2001.
[109] Adam L Berger and Vibhu O Mittal. Ocelot: a system for summarizing web
pages. In Proceedings of the 23rd annual international ACM SIGIR conference on
Research and development in information retrieval, pages 144–151. ACM, 2000.
[110] Anette Hulth and Beata B Megyesi. A study on automatically extracted key-
words in text categorization. In Proceedings of the 21st International Conference
on Computational Linguistics and the 44th annual meeting of the Association for
Computational Linguistics, pages 537–544. Association for Computational Linguis-
tics, 2006.
[111] Su Nam Kim, Timothy Baldwin, and Min-yen Kan. The use of topic representative
words in text categorization. 2009.
[112] Khaled M Hammouda, Diego N Matute, and Mohamed S Kamel. CorePhrase: Keyphrase extraction for document clustering. In International Workshop on Machine Learning and Data Mining in Pattern Recognition, pages 265–274. Springer, 2005.
[113] Yongzheng Zhang, Nur Zincir-Heywood, and Evangelos Milios. Term-based clus-
tering and summarization of web page collections. In Conference of the Canadian
Society for Computational Studies of Intelligence, pages 60–74. Springer, 2004.
[114] Yi-fang Brook Wu and Quanzhi Li. Document keyphrases as subject metadata:
incorporating document key concepts in search results. Information Retrieval, 11
(3):229–249, 2008.
[115] Rada Mihalcea and Andras Csomai. Wikify!: linking documents to encyclopedic
knowledge. In Proceedings of the sixteenth ACM conference on Conference on
information and knowledge management, pages 233–242. ACM, 2007.
[116] Felice Ferrara, Nirmala Pudota, and Carlo Tasso. A keyphrase-based paper recom-
mender system. In Italian Research Conference on Digital Libraries, pages 14–25.
Springer, 2011.
[117] Jason DM Rennie and Tommi Jaakkola. Using term informativeness for named
entity detection. In Proceedings of the 28th annual international ACM SIGIR
conference on Research and development in information retrieval, pages 353–360.
ACM, 2005.
[118] Wilson Wong, Wei Liu, and Mohammed Bennamoun. A probabilistic framework
for automatic term recognition. Intelligent Data Analysis, 13(4):499–539, 2009.
[119] Wilson Wong. Determination of unithood and termhood for term recognition. In
Handbook of research on text and web mining technologies, pages 500–529. IGI
Global, 2009.
[120] Wilson Wong, Wei Liu, and Mohammed Bennamoun. Ontology learning from
text: A look back and into the future. ACM Computing Surveys (CSUR), 44(4):
20, 2012.
[121] David Newman, Nagendra Koilada, Jey Han Lau, and Timothy Baldwin. Bayesian
text segmentation for index term identification and keyphrase extraction. In COL-
ING, pages 2077–2092, 2012.
[122] Kyo Kageura and Bin Umino. Methods of automatic term recognition: A review.
Terminology, 3(2):259–289, 1996.
[123] Andy Lauriston. Automatic recognition of complex terms: Problems and the TERMINO solution. Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication, 1(1):147–170, 1994.
[124] Béatrice Daille, Éric Gaussier, and Jean-Marc Langé. Towards automatic extraction of monolingual and bilingual terminology. In Proceedings of the 15th conference on Computational linguistics-Volume 1, pages 515–521. Association for Computational Linguistics, 1994.
[125] John S Justeson and Slava M Katz. Technical terminology: some linguistic prop-
erties and an algorithm for identification in text. Natural language engineering, 1
(1):9–27, 1995.
[126] Didier Bourigault. Surface grammatical analysis for the extraction of termino-
logical noun phrases. In Proceedings of the 14th conference on Computational
linguistics-Volume 3, pages 977–981. Association for Computational Linguistics,
1992.
[127] Sophia Ananiadou. A methodology for automatic term recognition. In Proceedings
of the 15th conference on Computational linguistics-Volume 2, pages 1034–1038.
Association for Computational Linguistics, 1994.
[128] C Kit. Reduction of indexing term space for phrase-based information retrieval.
Internal memo of Computational Linguistics Program. Pittsburgh: Carnegie Mel-
lon University, 1994.
[129] KT Frantzi and S Ananiadou. Statistical measures for terminological extraction.
In Proceedings of 3rd Int’l Conf. on Statistical Analysis of Textual Data, pages
297–308, 1995.
[130] Katerina T Frantzi, Sophia Ananiadou, and Junichi Tsujii. Extracting termino-
logical expressions. The Special Interest Group Notes of Information Processing
Society of Japan, 96:83–88, 1996.
[131] Gerard Salton, Chung-Shu Yang, and Clement T Yu. A theory of term importance in automatic text analysis. Journal of the Association for Information Science and Technology, 26(1):33–44, 1975.
[132] Gerard Salton. Syntactic approaches to automatic book indexing. In Proceedings
of the 26th annual meeting on Association for Computational Linguistics, pages
204–210. Association for Computational Linguistics, 1988.
[133] David A Evans and Robert G Lefferts. CLARIT-TREC experiments. Information Processing & Management, 31(3):385–395, 1995.
[134] Jody Foo and Magnus Merkel. Using machine learning to perform automatic term
recognition. In LREC 2010 Workshop on Methods for automatic acquisition of
Language Resources and their evaluation methods, 23 May 2010, Valletta, Malta,
pages 49–54, 2010.
[135] Rogelio Nazar and Maria Teresa Cabre. Supervised learning algorithms applied
to terminology extraction. In Proceedings of the 10th Terminology and Knowledge
Engineering Conference, pages 209–217, 2012.
[136] Merley da Silva Conrado, Thiago Alexandre Salgueiro Pardo, and Solange Oliveira
Rezende. A machine learning approach to automatic term extraction using a rich
feature set. In HLT-NAACL, pages 16–23, 2013.
[137] Irena Spasic, Goran Nenadic, and Sophia Ananiadou. Using domain-specific verbs
for term classification. In Proceedings of the ACL 2003 workshop on Natural lan-
guage processing in biomedicine-Volume 13, pages 17–24. Association for Compu-
tational Linguistics, 2003.
[138] Jody Foo. Term extraction using machine learning. Linköping University, Linköping, 2009.
[139] William W Cohen. Fast effective rule induction. In Proceedings of the twelfth
international conference on machine learning, pages 115–123, 1995.
[140] Denis G Fedorenko, Nikita Astrakhantsev, and Denis Turdakov. Automatic recog-
nition of domain-specific terms: an experimental evaluation. SYRCoDIS, 1031:
15–23, 2013.
[141] Ted Dunning. Accurate methods for the statistics of surprise and coincidence.
Computational linguistics, 19(1):61–74, 1993.
[142] Su Nam Kim, Timothy Baldwin, and Min-Yen Kan. An unsupervised approach to
domain-specific term extraction. In Australasian Language Technology Association
Workshop 2009, page 94, 2009.
[143] Roberto Navigli and Paola Velardi. Semantic interpretation of terminological
strings. In Proceedings of 6th International Conference. Terminology and Knowl-
edge Eng, pages 95–100, 2002.
[144] Alberto Barrón-Cedeño, Gerardo Sierra, Patrick Drouin, and Sophia Ananiadou. An improved automatic term recognition method for Spanish. In CICLing, volume 9, pages 125–136. Springer, 2009.
[145] Hiroshi Nakagawa and Tatsunori Mori. A simple but powerful automatic term
extraction method. In COLING-02 on COMPUTERM 2002: second international
workshop on computational terminology-Volume 14, pages 1–7. Association for
Computational Linguistics, 2002.
[146] Juan Antonio Lossio-Ventura, Clement Jonquet, Mathieu Roche, and Maguelonne
Teisseire. Combining c-value and keyword extraction methods for biomedical terms
extraction. In LBM: Languages in Biology and Medicine, 2013.
[147] Paul Buitelaar, Georgeta Bordea, and Tamara Polajnar. Domain-independent term extraction through domain modelling. In Proceedings of the 10th International Conference on Terminology and Artificial Intelligence (TIA 2013), Paris, France, 2013.
[148] Roberto Basili, Alessandro Moschitti, Maria Teresa Pazienza, and Fabio Massimo
Zanzotto. A contrastive approach to term extraction. In Terminologie et intelli-
gence artificielle. Rencontres, pages 119–128, 2001.
[149] Wilson Wong, Wei Liu, and Mohammed Bennamoun. Determining termhood for
learning domain ontologies using domain prevalence and tendency. In Proceedings
of the sixth Australasian conference on Data mining and analytics-Volume 70,
pages 47–54. Australian Computer Society, Inc., 2007.
[150] Wilson Wong, Wei Liu, and Mohammed Bennamoun. Determining termhood for
learning domain ontologies in a probabilistic framework. In Proceedings of the
sixth Australasian conference on Data mining and analytics-Volume 70, pages 55–
63. Australian Computer Society, Inc., 2007.
[151] Alexander Gelbukh, Grigori Sidorov, Eduardo Lavin-Villa, and Liliana Chanona-
Hernandez. Automatic term extraction using log-likelihood based comparison with
general reference corpus. Natural Language Processing and Information Systems,
pages 248–255, 2010.
[152] Francesca Bonin, Felice Dell'Orletta, Giulia Venturi, and Simonetta Montemagni. A contrastive approach to multi-word term extraction from domain corpora. In Proceedings of the 7th International Conference on Language Resources and Evaluation, Malta, pages 19–21, 2010.
[153] Roberto Basili, Alessandro Moschitti, and Maria Teresa Pazienza. A text classifier based on linguistic processing. In Proceedings of IJCAI. Citeseer, 1999.
[154] Boris V Dobrov and Natalia V Loukachevitch. Multiple evidence for term extrac-
tion in broad domains. In RANLP, pages 710–715, 2011.
[155] Jorge Vivaldi and Horacio Rodríguez. Using Wikipedia for term extraction in the biomedical domain: first experiences. Procesamiento del Lenguaje Natural, 45:251–254, 2010.
[156] Peter Turney. Coherent keyphrase extraction via web mining. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI-03), 2003.
[157] Smitashree Choudhury and John G Breslin. Extracting semantic entities and
events from sports tweets. 2011.
[158] Ken Thompson. Programming techniques: Regular expression search algorithm.
Communications of the ACM, 11(6):419–422, 1968.
[159] David Weber, H Andrew Black, and Stephen R McConnel. AMPLE: A tool for
exploring morphology. Summer Institute of Linguistics, 1988.
[160] Jorge Hankamer. Morphological parsing and the lexicon. In Lexical representation
and process, pages 392–408. MIT Press, 1989.
[161] Martin F Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
[162] Samy Bengio and Georg Heigold. Word embeddings for speech recognition. In
INTERSPEECH, pages 1053–1057, 2014.
[163] Liang Lu, Xingxing Zhang, Kyunghyun Cho, and Steve Renals. A study of the
recurrent neural network encoder-decoder for large vocabulary speech recognition.
In Proceedings of Interspeech, 2015.
[164] Zellig S Harris. Distributional structure. Word, 10(2-3):146–162, 1954.
[165] Peter D Turney and Patrick Pantel. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188, 2010.
[166] Scott C Deerwester, Susan T Dumais, Thomas K Landauer, George W Furnas, and Richard A Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
[167] Daniel D Lee and H Sebastian Seung. Learning the parts of objects by non-negative
matrix factorization. Nature, 401(6755):788–791, 1999.
[168] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the
22nd annual international ACM SIGIR conference on Research and development
in information retrieval, pages 50–57. ACM, 1999.
[169] Rie Kubota Ando. Latent semantic space: Iterative scaling improves precision
of inter-document similarity measurement. In Proceedings of the 23rd annual in-
ternational ACM SIGIR conference on Research and development in information
retrieval, pages 216–223. ACM, 2000.
[170] Jeffrey Pennington, Richard Socher, and Christopher D Manning. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543, 2014.
[171] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155, 2003.
[172] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537, 2011.
[173] Andriy Mnih and Geoffrey Hinton. Three new graphical models for statistical lan-
guage modelling. In Proceedings of the 24th international conference on Machine
learning, pages 641–648. ACM, 2007.
[174] Thang Luong, Richard Socher, and Christopher D Manning. Better word represen-
tations with recursive neural networks for morphology. In CoNLL, pages 104–113,
2013.
[175] Jiang Bian, Bin Gao, and Tie-Yan Liu. Knowledge-powered deep learning for word
embedding. In Joint European Conference on Machine Learning and Knowledge
Discovery in Databases, pages 132–148. Springer, 2014.
[176] Jeff Mitchell and Mirella Lapata. Vector-based models of semantic composition.
In ACL, pages 236–244, 2008.
[177] Peter D Turney. Domain and function: A dual-space model of semantic relations
and compositions. Journal of Artificial Intelligence Research, pages 533–585, 2012.
[178] Quoc Le and Tomas Mikolov. Distributed representations of sentences and docu-
ments. In Proceedings of the 31st International Conference on Machine Learning
(ICML-14), pages 1188–1196, 2014.
[179] Jey Han Lau and Timothy Baldwin. An empirical evaluation of doc2vec
with practical insights into document embedding generation. arXiv preprint
arXiv:1607.05368, 2016.
[180] Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211,
1990.
[181] Grégoire Mesnil, Xiaodong He, Li Deng, and Yoshua Bengio. Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding. In INTERSPEECH, pages 3771–3775, 2013.
[182] Kaisheng Yao, Geoffrey Zweig, Mei-Yuh Hwang, Yangyang Shi, and Dong Yu.
Recurrent neural networks for language understanding. In INTERSPEECH, pages
2524–2528, 2013.
[183] Christoph Goller and Andreas Küchler. Learning task-dependent distributed representations by backpropagation through structure. In Proceedings of the 1996 IEEE International Conference on Neural Networks, volume 1, pages 347–352. IEEE, 1996.
[184] Richard Socher, Jeffrey Pennington, Eric H Huang, Andrew Y Ng, and Christo-
pher D Manning. Semi-supervised recursive autoencoders for predicting sentiment
distributions. In Proceedings of the Conference on Empirical Methods in Natural
Language Processing, pages 151–161. Association for Computational Linguistics,
2011.
[185] Yu Zhao, Zhiyuan Liu, and Maosong Sun. Phrase type sensitive tensor indexing
model for semantic composition. In AAAI, pages 2195–2202, 2015.
[186] Ozan Irsoy and Claire Cardie. Deep recursive neural networks for compositionality
in language. In Advances in Neural Information Processing Systems, pages 2096–
2104, 2014.
[187] Alex Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, and
Kevin J Lang. Phoneme recognition using time-delay neural networks. IEEE
transactions on acoustics, speech, and signal processing, 37(3):328–339, 1989.
[188] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural
network for modelling sentences. arXiv preprint arXiv:1404.2188, 2014.
[189] Nal Kalchbrenner and Phil Blunsom. Recurrent convolutional neural networks for
discourse compositionality. arXiv preprint arXiv:1306.3584, 2013.
[190] Duyu Tang, Bing Qin, and Ting Liu. Document modeling with gated recurrent
neural network for sentiment classification. In Proceedings of the 2015 Conference
on Empirical Methods in Natural Language Processing, pages 1422–1432, 2015.
[191] Misha Denil, Alban Demiraj, Nal Kalchbrenner, Phil Blunsom, and Nando de Fre-
itas. Modelling, visualising and summarising documents with a single convolutional
neural network. arXiv preprint arXiv:1406.3830, 2014.
[192] Manish Gupta, Vasudeva Varma, et al. Doc2sent2vec: A novel two-phase approach
for learning document representation. In Proceedings of the 39th International
ACM SIGIR conference on Research and Development in Information Retrieval,
pages 809–812. ACM, 2016.
[193] Andrew M Dai, Christopher Olah, and Quoc V Le. Document embedding with
paragraph vectors. arXiv preprint arXiv:1507.07998, 2015.
[194] Niraj Kumar and Kannan Srinathan. Automatic keyphrase extraction from sci-
entific documents using n-gram filtration technique. In Proceedings of the eighth
ACM symposium on Document engineering, pages 199–208. ACM, 2008.
[195] Letian Wang and Fang Li. SJTULTLAB: Chunk based method for keyphrase
extraction. In Proceedings of the 5th International Workshop on Semantic Evalu-
ation, pages 158–161. Association for Computational Linguistics, 2010.
[196] Kristina Toutanova, Dan Klein, Christopher D Manning, and Yoram Singer.
Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceed-
ings of the 2003 Conference of the North American Chapter of the Association
for Computational Linguistics on Human Language Technology-Volume 1, pages
173–180. Association for Computational Linguistics, 2003.
[197] Marina Litvak and Mark Last. Graph-based keyword extraction for single-
document summarization. In Proceedings of the workshop on Multi-source Multi-
lingual Information Extraction and Summarization, pages 17–24. Association for
Computational Linguistics, 2008.
[198] Girish Keshav Palshikar. Keyword extraction from a single document using cen-
trality measures. In Pattern Recognition and Machine Intelligence, pages 503–510.
Springer, 2007.
[199] Slobodan Beliga, Ana Meštrović, and Sanda Martinčić-Ipšić. Toward selectivity
based keyword extraction for Croatian news. arXiv preprint arXiv:1407.4723, 2014.
[200] Shibamouli Lahiri, Sagnik Ray Choudhury, and Cornelia Caragea. Keyword and
keyphrase extraction using centrality measures on collocation networks. arXiv
preprint arXiv:1401.6571, 2014.
[201] Wenpu Xing and Ali Ghorbani. Weighted PageRank algorithm. In Proceedings
of the Second Annual Conference on Communication Networks and Services Re-
search, pages 305–314. IEEE, 2004.
[202] Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factor-
ization. In Advances in neural information processing systems, pages 2177–2185,
2014.
[203] Michael U Gutmann and Aapo Hyvärinen. Noise-contrastive estimation of un-
normalized statistical models, with applications to natural image statistics. The
Journal of Machine Learning Research, 13(1):307–361, 2012.
[204] Alain Barrat, Marc Barthélemy, Romualdo Pastor-Satorras, and Alessandro
Vespignani. The architecture of complex weighted networks. Proceedings of the
National Academy of Sciences of the United States of America, 101(11):3747–3752,
2004.
[205] Mark EJ Newman. Analysis of weighted networks. Physical Review E, 70(5):
056131, 2004.
[206] Ulrik Brandes. A faster algorithm for betweenness centrality. Journal of Mathe-
matical Sociology, 25(2):163–177, 2001.
[207] Edsger W Dijkstra. A note on two problems in connexion with graphs. Numerische
Mathematik, 1(1):269–271, 1959.
[208] Anoop Sarkar. Applying co-training methods to statistical parsing. In Proceed-
ings of the second meeting of the North American Chapter of the Association for
Computational Linguistics on Language technologies, pages 1–8. Association for
Computational Linguistics, 2001.
[209] Rada Mihalcea. Co-training and self-training for word sense disambiguation. In
CoNLL, pages 33–40, 2004.
[210] Vincent Ng and Claire Cardie. Weakly supervised natural language learning with-
out redundant views. In Proceedings of the 2003 Conference of the North American
Chapter of the Association for Computational Linguistics on Human Language
Technology-Volume 1, pages 94–101. Association for Computational Linguistics,
2003.
[211] J-D Kim, Tomoko Ohta, Yuka Tateisi, and Junichi Tsujii. GENIA corpus: a
semantically annotated corpus for bio-textmining. Bioinformatics, 19(suppl. 1):
i180–i182, 2003.
[212] Siegfried Handschuh and Behrang QasemiZadeh. The ACL RD-TEC: A dataset
for benchmarking terminology extraction and classification in computational lin-
guistics. In COLING 2014: 4th International Workshop on Computational Ter-
minology, 2014.
[213] Yuhang Yang, Hao Yu, Yao Meng, Yingliang Lu, and Yingju Xia. Fault-tolerant
learning for term extraction. In PACLIC, pages 321–330, 2010.
[214] Rie Kubota Ando and Tong Zhang. A framework for learning predictive structures
from multiple tasks and unlabeled data. Journal of Machine Learning Research,
6(Nov):1817–1853, 2005.
[215] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-
training. In Proceedings of the eleventh annual conference on Computational learn-
ing theory, pages 92–100. ACM, 1998.
[216] Yuhang Yang, Qin Lu, and Tiejun Zhao. Chinese term extraction using minimal
resources. In Proceedings of the 22nd International Conference on Computational
Linguistics-Volume 1, pages 1033–1040. Association for Computational Linguis-
tics, 2008.
[217] Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R Steunebrink,
and Jürgen Schmidhuber. LSTM: A search space odyssey. arXiv preprint
arXiv:1503.04069, 2015.
[218] Stephen Clark, James R Curran, and Miles Osborne. Bootstrapping POS taggers
using unlabelled data. In Proceedings of the seventh conference on Natural
language learning at HLT-NAACL 2003-Volume 4, pages 49–55. Association for
Computational Linguistics, 2003.
[219] Gina R Kuperberg. Neural mechanisms of language comprehension: Challenges
to syntax. Brain research, 1146:23–49, 2007.
[220] Jeff Mitchell and Mirella Lapata. Composition in distributional models of seman-
tics. Cognitive Science, 34(8):1388–1429, 2010.
[221] Rui Lin, Shujie Liu, Muyun Yang, Mu Li, Ming Zhou, and Sheng Li. Hierarchi-
cal recurrent neural network for document modeling. In Proceedings of the 2015
Conference on Empirical Methods in Natural Language Processing, pages 899–907,
2015.
[222] Rui Zhang, Honglak Lee, and Dragomir Radev. Dependency sensitive convolu-
tional neural networks for modeling sentences and documents. arXiv preprint
arXiv:1611.02361, 2016.
[223] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard
Hovy. Hierarchical attention networks for document classification. In Proceedings
of the 2016 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, 2016.
[224] Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. A hierarchical neural autoencoder
for paragraphs and documents. arXiv preprint arXiv:1506.01057, 2015.
[225] Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. Recurrent convolutional neural
networks for text classification. In AAAI, pages 2267–2273, 2015.
[226] Shujie Liu, Nan Yang, Mu Li, and Ming Zhou. A recursive recurrent neural network
for statistical machine translation. In ACL (1), pages 1491–1500, 2014.
[227] Xing Wei and W Bruce Croft. LDA-based document models for ad-hoc retrieval. In
Proceedings of the 29th annual international ACM SIGIR conference on Research
and development in information retrieval, pages 178–185. ACM, 2006.
[228] Chaitanya Chemudugunta, Padhraic Smyth, and Mark Steyvers. Modeling general
and specific aspects of documents with a probabilistic topic model. In NIPS,
volume 19, pages 241–248, 2006.
[229] Xiaojin Zhu, David Blei, and John Lafferty. TagLDA: Bringing document struc-
ture knowledge into topic models. Technical Report TR-1553, University of Wis-
consin, 2006.
[230] Christopher E Moody. Mixing Dirichlet topic models and word embeddings to
make lda2vec. arXiv preprint arXiv:1605.02019, 2016.
[231] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Em-
pirical evaluation of gated recurrent neural networks on sequence modeling. arXiv
preprint arXiv:1412.3555, 2014.
[232] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Gated
feedback recurrent neural networks. CoRR, abs/1502.02367, 2015.
[233] Patrice Lopez and Laurent Romary. HUMB: Automatic key term extraction from
scientific articles in GROBID. In Proceedings of the 5th International Workshop
on Semantic Evaluation, pages 248–251. Association for Computational Linguis-
tics, 2010.