Keyphrase Extraction: from Distributional Feature Engineering to
Distributed Semantic Composition
Rui Wang
This thesis is presented for the degree of
Doctor of Philosophy
of The University of Western Australia
School of Computer Science and Software Engineering
December 2017
Thesis Declaration
I, Rui Wang, declare that this thesis titled, 'Keyphrase Extraction: from Distributional
Feature Engineering to Distributed Semantic Composition', and the work presented in
it are my own. I confirm that:
• This thesis has been substantially accomplished during enrolment in the degree.
• This thesis does not contain material which has been accepted for the award of any
other degree or diploma in my name, in any university or other tertiary institution.
• No part of this work will, in the future, be used in a submission in my name, for
any other degree or diploma in any university or other tertiary institution without
the prior approval of The University of Western Australia and, where applicable,
any partner institution responsible for the joint award of this degree.
• This thesis does not contain any material previously published or written by
another person, except where due reference has been made in the text.
• The work(s) are not in any way a violation or infringement of any copyright,
trademark, patent, or other rights whatsoever of any person.
• This thesis contains published work and/or work prepared for publication, some
of which has been co-authored.
Signed:
Date:
Abstract
Keyphrases of a document provide high-level descriptions of the content, which sum-
marise the core topics, concepts, ideas or arguments of the document. These descriptive
phrases enable algorithms to retrieve relevant information more quickly and effectively,
and play an important role in many areas of document processing, such as document
indexing, classification, clustering and summarisation. However, most documents lack
keyphrases provided by the authors, and manually identifying keyphrases for large col-
lections of documents is infeasible.
At present, solutions for automatic keyphrase extraction mainly rely on manually se-
lected features, such as frequencies and relative occurrence positions. However, such
solutions are dataset-dependent and often need to be purposely modified to work
for documents of different lengths, discourse modes, and disciplines. This is because
the performance of such algorithms heavily relies on the selection of features,
which turns the development of automatic keyphrase extraction algorithms into a time-
consuming and labour-intensive exercise. Moreover, most of these solutions can only
extract tokens explicitly appearing in documents as keyphrases, thus they are incapable
of capturing the semantic meanings of keyphrases and documents.
This research aims to develop a keyphrase extraction approach that automatically learns
features of phrases and documents via deep neural networks, which not only eliminates
the effort of feature engineering, but also encodes the semantics of phrases and doc-
uments in the features. The learnt features become representations of phrases and
documents, which enhance the robustness and adaptability of the learning algorithm on
different datasets. Specifically, this thesis addresses three issues: 1) the lack of
understanding of the meanings of documents and phrases inhibits the performance of existing
algorithms; 2) the feature engineering process turns the development of keyphrase ex-
traction algorithms into a time-consuming, potentially biased, and empirical exercise;
and 3) using public semantic knowledge bases such as WordNet to obtain additional
semantics is practically difficult because they supply limited vocabularies and insufficient
domain-specific knowledge.
In this thesis, we first carry out a systematic study investigating the combination of
distributional features, which measure occurrence patterns, with distributed word
representations, which supply additional semantic knowledge. We then de-
velop a series of models to learn the representations of phrases and documents that
enable classifiers to ‘understand’ their semantic meanings. We demonstrate that
the models developed in this thesis provide effective tools for learning general represen-
tations and capturing the meanings of phrases and documents through similarity and
classification tests. The learnt representations enable classifiers to efficiently identify
keyphrases of documents, without using any manually selected features.
Acknowledgements
This dissertation would not have been possible without the support of many people, and
hereby I dedicate this milestone to them.
First and foremost, I would like to express my sincere gratitude to my supervisors: Dr
Wei Liu and Dr Chris McDonald for their immense knowledge, valuable guidance, schol-
arly inputs and consistent encouragement throughout my research work. I am lucky and
proud to have them as my supervisors and friends. Thanks to Wei for bringing me into
this exciting field – Natural Language Processing. She always helps me to conceptualise
the big picture and to develop a critical analysis of my work, and this dissertation is a
direct consequence of her excellent supervision. Thanks to Chris for providing many
insightful suggestions and helping me solve many programming problems. His enthu-
siasm, integral view on research, encouragement and skills will always be remembered.
I would like to thank the other academic members of the School of Computer Science
and Software Engineering (CSSE) for their advice including Assoc/Prof Rachel Cardell-
Oliver, Dr Du Huynh, Dr Tim French, and Dr Jianxin Li. I am also very grateful to the
CSSE IT support and administrative staff, including Mr Samuel Thomas, Ms Yvette
Harrap and Ms Rosie Kriskans. A special thank you to my colleagues Lyndon White,
Michael Stewart, Christopher Bartley for their feedback, cooperation and friendship.
I am also thankful to the overseas scholars Assist/Prof Hatem Haddad and Mr Chedi
Bechikh Ali for their cooperation and friendship. Thanks to Dr Peter D. Turney, Prof.
Marco Baroni, Dr Kazi Saidul Hasan for answering my questions and providing datasets.
I would also like to thank the University of Western Australia (UWA), and the Australian
Research Council (ARC). This research was supported by an Australian Government
Research Training Program (RTP) Scholarship, UWA Safety Top-Up Scholarship, and
Discovery Grant DP150102405 and Linkage Grant LP110100050.
Last but not least, my deep and sincere gratitude to my family and friends for their
continuous love, help and support, which made the completion of this thesis possible.
Thanks to my wife Kefei for her eternal love, support and understanding of my goals
and aspirations, and her patience and sacrifice will remain my inspiration throughout
my life. Thanks to my mother Ping for giving me the opportunities and experiences
that have made me who I am. Her selfless love and support have always been my
strength. A special thank you to my friends Zhuojun Zhou, Sheng Bi, Yuxuan Bi, John
Dorrington, and Clayton Dorrington for their constant inspiration and encouragement.
Finally, thanks to my cat Bella for not eating my source code.
Publications Arising from this Thesis
This thesis contains published work and/or work prepared for publication. The biblio-
graphical details of the work and where it appears in the thesis are outlined below.
1. Rui Wang, Wei Liu, and Chris McDonald. A Matrix-Vector Recurrent Unit
Model for Capturing Semantic Compositionalities in Phrase Embeddings. Ac-
cepted by International Conference on Information and Knowledge Management,
2017 (Chapter 6)
2. Rui Wang, Wei Liu, and Chris McDonald. Featureless Domain-Specific term ex-
traction with minimal labelled data. In Australasian Language Technology Asso-
ciation Workshop 2016, page 103. (Chapter 5)
3. Rui Wang, Wei Liu, and Chris McDonald. Using word embeddings to enhance
keyword identification for scientific publications. In Australasian Database Con-
ference, pages 257-268. Springer, 2015. (Chapter 4)
4. Chedi Bechikh Ali, Rui Wang, and Hatem Haddad. A two-level keyphrase ex-
traction approach. In International Conference on Intelligent Text Processing and
Computational Linguistics, pages 390-401. Springer, 2015.
5. Wei Liu, Bo Chuen Chung, Rui Wang, Jonathon Ng, and Nigel Morlet. A genetic
algorithm enabled ensemble for unsupervised medical term extraction from clinical
letters. Health information science and systems, 3(1):1, 2015.
6. Rui Wang, Wei Liu, and Chris McDonald. Corpus-independent generic keyphrase
extraction using word embedding vectors. In Deep Learning for Web Search and
Data Mining Workshop (DL-WSDM 2015), 2014. (Chapter 4)
7. Rui Wang, Wei Liu, and Chris McDonald. How preprocessing affects unsupervised
keyphrase extraction. In International Conference on Intelligent Text Processing
and Computational Linguistics, pages 163-176. Springer, 2014. – Best Paper
Award. (Chapter 3)
Declaration of Authorship
This thesis contains published work and/or work prepared for publication, some of which
has been co-authored. The extent of the candidate’s contribution towards the published
work is outlined below.
• Publications [1-3] and [6-7]: I am the first author of these papers, with 80%
contribution. I co-authored them with my two supervisors: Dr Wei Liu and Dr
Chris McDonald. I designed and implemented the algorithms, conducted the ex-
periments and wrote the papers. My supervisors reviewed the papers and provided
useful advice for improvements.
• Publication [4]: I am the second author of the paper, with 40% contribution.
I reimplemented all statistical ranking algorithms for keyphrase extraction used
in the paper, coordinated the conflicts between the English linguistic patterns
proposed by the first authors and the statistical ranking algorithms, conducted
experiments, and wrote Section 2.2, Section 3.2, Section 4.1, and Section 5.1 – 5.2.
• Publication [5]: I am the third author of the paper, with 15% contribution. I
normalised the textual data of the medical letters, reimplemented the TextRank
algorithm, conducted partial experiments, and wrote two subsections related to
my work.
Student Signature:
Date:
I, Wei Liu, certify that the student's stated contributions to each of
the works listed above are correct.
Coordinating supervisor Signature:
Date:
Contents
Thesis Declaration i
Abstract ii
Acknowledgements iii
Publications v
Declaration of Authorship vi
List of Figures xi
List of Tables xiii
1 Introduction 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contributions and Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Chapter 3: Conundrums in Unsupervised AKE . . . . . . . . . . . 4
1.2.2 Chapter 4: Using Word Embeddings as External Knowledge for AKE . . . 5
1.2.3 Chapter 5: Featureless Phrase Representation Learning with Minimal Labelled Data . . . 6
1.2.4 Chapter 6: A Matrix-Vector Recurrent Unit Network for Capturing Semantic Compositionality in Phrase Embeddings . . . 6
1.2.5 Chapter 7: A Deep Neural Network Architecture for AKE . . . . . 7
1.2.6 Chapter 8: Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Literature Review and Background 9
2.1 Automatic Keyphrase Extraction . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Overview of AKE systems . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.2 Text Pre-processing and Candidate Phrase Identification . . . . . . 11
2.1.2.1 Text Pre-processing . . . . . . . . . . . . . . . . . . . . . 11
2.1.2.2 Candidate Phrase Identification . . . . . . . . . . . . . . 12
2.1.3 Common Phrase Features . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.3.1 Self-reliant features . . . . . . . . . . . . . . . . . . . . . 13
2.1.3.2 Relational features . . . . . . . . . . . . . . . . . . . . . . 15
2.1.4 Supervised AKE . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.5 Unsupervised AKE . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.5.1 Capturing unusual frequencies – statistical-based approaches . . . 18
2.1.5.2 Capturing topics – clustering-based approaches . . . . . . 18
2.1.5.3 Capturing strong relations – graph-based approaches . . 19
2.1.6 Deep Learning for AKE . . . . . . . . . . . . . . . . . . . . . . . . 20
2.1.7 Evaluation Methodology . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1.8 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1.9 Similar Tasks to AKE . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.1.9.1 Automatic Domain-specific Term Extraction . . . . . . . 22
2.1.9.2 Other Similar Tasks . . . . . . . . . . . . . . . . . . . . . 27
2.2 Learning Representations of Words and Their Compositionality . . . . . . . . 28
2.2.1 Word Representations . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2.1.1 Atomic Symbols and One-hot Representations . . . . . . 29
2.2.1.2 Vector-space Word Representations . . . . . . . . . . . . 29
2.2.1.3 Inducing Distributional Representations using Count-based Models . . . 30
2.2.1.4 Learning Word Embeddings using Prediction-based Models 32
2.2.2 Deep Learning for Compositional Semantics . . . . . . . . . . . . . 36
2.2.2.1 Learning Meanings of Documents . . . . . . . . . . . . . 42
2.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3 Conundrums in Unsupervised AKE 44
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2 Reimplementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2.1 Phrase Identifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2.1.1 PTS Splitter . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2.1.2 N-gram Filter . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2.1.3 Noun Phrase Chunker . . . . . . . . . . . . . . . . . . . . 48
3.2.1.4 Prefixspan . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.2.1.5 C-value . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2.2 Ranking Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.2.2.1 TF-IDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.2.2.2 RAKE . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.2.2.3 TextRank . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.2.2.4 HITS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.3.1 Dataset Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.3.2 Ground-truth Keyphrase not Appearing in Texts . . . . . . . . . . 55
3.3.3 Dataset Cleaning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.4.1 Evaluation Methodology . . . . . . . . . . . . . . . . . . . . . . . . 56
3.4.2 Evaluation 1: Candidate Coverage on ground-truth Keyphrases . . 56
3.4.3 Evaluation 1 Results Discussion . . . . . . . . . . . . . . . . . . . . 57
3.4.4 Evaluation 2: System Performance . . . . . . . . . . . . . . . . . . 59
3.4.4.1 Direct Phrase Ranking and Phrase Score Summation . . 59
3.4.4.2 Ranking Algorithm Setup . . . . . . . . . . . . . . . . . . 59
3.4.5 Evaluation 2 Results Discussion . . . . . . . . . . . . . . . . . . . . 60
3.4.5.1 Candidate Identification Impact . . . . . . . . . . . . . . 60
3.4.5.2 Phrase Scoring Impact . . . . . . . . . . . . . . . . . . . 63
3.4.5.3 Frequency Impact . . . . . . . . . . . . . . . . . . . . . . 64
3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4 Using Word Embeddings as External Knowledge for AKE 66
4.1 Weighting Schemes for Graph-based AKE . . . . . . . . . . . . . . . . . . 67
4.2 Proposed Weighting Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2.1 Word Embeddings as Knowledge Base . . . . . . . . . . . . . . . . 69
4.2.2 Weighting Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3 Training Word Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.4.1 Training Word Embeddings . . . . . . . . . . . . . . . . . . . . . . 73
4.4.2 Phrase Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.4.3 Ranking Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.4.3.1 Degree Centrality . . . . . . . . . . . . . . . . . . . . . . 75
4.4.3.2 Betweenness and Closeness Centrality . . . . . . . . . . . 75
4.4.4 PageRank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.4.5 Assigning Weights to Graphs . . . . . . . . . . . . . . . . . . . . . 76
4.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.5.1 Evaluation Results and Discussion . . . . . . . . . . . . . . . . . . 77
4.5.1.1 Discussion on Direct Phrase Ranking System . . . . . . . 77
4.5.1.2 Discussion on Phrase Score Summation System . . . . . . 79
4.5.1.3 Mitigating the Frequency-Sensitivity Problem . . . . . . 80
4.5.1.4 Tunable hyper-parameters . . . . . . . . . . . . . . . . . 80
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5 Featureless Phrase Representation Learning with Minimal Labelled Data 82
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3 Proposed Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3.2 Term Representation Learning . . . . . . . . . . . . . . . . . . . . 86
5.3.2.1 Convolutional Model . . . . . . . . . . . . . . . . . . . . 87
5.3.2.2 LSTM Model . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.3.3 Training Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.3.4 Pre-training Word Embedding . . . . . . . . . . . . . . . . . . . . 90
5.3.5 Co-training Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.4.2 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.4.3 Experiment Settings . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.4.4 Evaluation Methodology . . . . . . . . . . . . . . . . . . . . . . . . 93
5.4.5 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.4.6 Result and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6 A Matrix-Vector Recurrent Unit Network for Capturing Semantic Compositionality in Phrase Embeddings 98
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.2 Proposed Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.2.1 Compositional Semantics . . . . . . . . . . . . . . . . . . . . . . . 101
6.2.2 Matrix-Vector Recurrent Unit Model . . . . . . . . . . . . . . . . . 103
6.2.3 Low-Rank Matrix Approximation . . . . . . . . . . . . . . . . . . . 105
6.3 Unsupervised Learning and Evaluations . . . . . . . . . . . . . . . . . . . 105
6.3.1 Evaluation Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.3.2 Evaluation Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.3.3 Baseline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.3.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.3.5 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.4 Supervised Learning and Evaluations . . . . . . . . . . . . . . . . . . . . . 111
6.4.1 Predicting Phrase Sentiment Distributions . . . . . . . . . . . . . . 111
6.4.2 Domain-Specific Term Identification . . . . . . . . . . . . . . . . . 113
6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7 A Deep Neural Network Architecture for AKE 117
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.2 Deep Learning for Document Modelling . . . . . . . . . . . . . . . . . . . 120
7.2.1 Convolutional Document Cube Model . . . . . . . . . . . . . . . . 122
7.3 Proposed Deep Learning Architecture for AKE . . . . . . . . . . . . . . . 125
7.4 Evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.4.1 Baseline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.4.2 State-of-the-art Algorithms on Datasets . . . . . . . . . . . . . . . 129
7.4.3 Evaluation Datasets and Methodology . . . . . . . . . . . . . . . . 130
7.4.4 Training and Evaluation Setup . . . . . . . . . . . . . . . . . . . . 130
7.4.4.1 Pre-training Word Embeddings . . . . . . . . . . . . . . . 130
7.4.4.2 Training AKE Models . . . . . . . . . . . . . . . . . . . . 131
7.4.5 Evaluation Results and Discussion . . . . . . . . . . . . . . . . . . 131
7.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
8 Conclusion 135
8.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
Bibliography 140
List of Figures
2.1 One-hot Vector Representations: words with 9 dimensions . . . . . . . . . 29
2.2 Co-occurrence matrices of two sentences. (A): word-word co-occurrence matrix, (B): word-document co-occurrence matrix. . . . 31
2.4 Neural Probabilistic Language Model . . . . . . . . . . . . . . . . . . . . . 34
2.5 Word Embedding Vectors: induced from the probabilistic neural language model training over the toy dataset. Each vector only has 2 dimensions. . . . 35
2.6 A Simple Recurrent Neural Network for Language Modelling. . . . . . . . 37
2.8 Recursive Neural Network: fitting the structure of English language . . . 39
2.9 Image Processing Convolution . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.10 Word Embedding Convolution Process: (A): Embedding vector-wise convolution, (B): Embedding feature-wise convolution . . . 41
3.1 Phrase Scoring Processing Pipelines . . . . . . . . . . . . . . . . . . . . . 46
3.3 Illustrating the sets' relationships. Algorithm Extracted Keyphrases is a subset of Identified Candidate Phrases. Ground-truth and Identified Candidates are subsets of All Possible Grams of the document. TP: the true positive set contains the extracted phrases that match ground-truth keyphrases; FP: the false positive set contains the extracted phrases that do not match ground-truth keyphrases; FN: the false negative set contains all ground-truth keyphrases that are not extracted as keyphrases; TN: the true negative set contains the candidate phrases that are neither ground-truth nor extracted as keyphrases. . . . 56
3.4 Error 1: candidate identified is too long, being the super-string of the assigned phrase; Error 2: candidate identified is too short, being the sub-string of the assigned phrase; Error 3: assigned phrase contains invalid characters such as punctuation marks; Error 4: assigned phrase contains stopwords; Error 5: Others. . . . 59
4.1 Evaluation Results F-score: Left graphs show the results using the direct phrase ranking scoring approach, where green and purple columns show embedding effects. Right graphs show the results using the phrase score summation approach, where green columns show the embedding effects. . . . 78
5.1 Co-training Network Architecture Overview: Solid lines indicate the training process; dashed lines indicate prediction and labelling processes. . . . 85
5.2 Convolutional Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.3 LSTM Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.4 Relationships in TP, TN, FP, and FN for Term Extraction. . . . . . . . . 93
5.7 Convolutional and LSTM Classifier Training on 200 Examples on POS Evaluation set . . . 97
6.1 Elman Recurrent Network for Modelling Compositional Semantics . . . . . 102
6.2 Matrix-Vector Recurrent Unit Model . . . . . . . . . . . . . . . . . . . . . 104
6.3 Predicting Adverb-Adjective Pairs Sentiment Distributions . . . . . . . . 113
7.1 General Network Architecture for Learning Document Representations . . 121
7.2 Convolutional Document Cube Model . . . . . . . . . . . . . . . . . . . . 124
7.3 Overview of Proposed Network Architecture . . . . . . . . . . . . . . . . . 126
7.4 Baseline Document Model – LSTM-GRU Architecture . . . . . . . . . . . 127
7.5 Baseline Document Model – CNN-GRU Architecture . . . . . . . . . . . . 127
7.6 Gated Recurrent Unit Architecture . . . . . . . . . . . . . . . . . . . . . . 129
List of Tables
2.1 Common Pre-processing and Candidate Phrase Selection Techniques . . . 11
2.2 Features Used in Extraction Algorithms . . . . . . . . . . . . . . . . . . . 17
3.1 Dataset Statistics After Cleaning . . . . . . . . . . . . . . . . . . . . . . . 54
3.2 The Coverages on Ground-truth Keyphrases . . . . . . . . . . . . . . . . . 57
4.1 Most Similar Words: Top ranked most similar words to the samplers, fetched using Cosine similarity, trained twice: 1) over a general Wikipedia dataset, and 2) over a computer science domain-specific dataset . . . 70
5.1 Dataset Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.2 Candidate Terms Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.3 Evaluation Dataset Statistics . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.4 Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.1 Phrase Similarity Test Results . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.2 Phrase Composition Test Results . . . . . . . . . . . . . . . . . . . . . . . 110
6.3 KL Divergence Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.4 Candidate Terms Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.5 Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.1 Common Deep Learning Document Models . . . . . . . . . . . . . . . . . 121
7.2 Evaluation Results on Three Datasets . . . . . . . . . . . . . . . . . . . . 132
Chapter 1
Introduction
1.1 Overview
As the amount of unstructured textual data continues to grow exponentially, the need
for automatically processing and extracting knowledge from it becomes increasingly crit-
ical. An essential step towards automatic text processing is to automatically index
unstructured text, i.e. tagging text with pre-identified vocabulary, taxonomy, thesaurus
or ontologies, which enables algorithms to retrieve relevant information more quickly
and effectively for further processing. An efficient approach for automatic text index-
ing is to summarise the core topics, concepts, ideas or arguments of a document into
a small group of phrases, namely keyphrases, which are representative phrases describ-
ing the document at a highly abstract level. However, most documents do not have
keyphrases provided by their authors, and manually identifying keyphrases for large
collections of documents is labour-intensive or even infeasible.
This thesis focuses on developing techniques for automatically extracting keyphrases
from documents, known as Automatic Keyphrase Extraction (AKE). The task of AKE
identifies a group of important phrases from the content of a document that are capable
of representing the author's main ideas, concepts, or arguments. Thus,
a keyphrase must be a semantically meaningful and syntactically acceptable expression
that may consist of one or more unigram words1.
There has been impressive progress in developing AKE systems, such as Keyphrase
Extraction Algorithm (KEA)2, a publicly available system using machine learning algo-
rithms to identify keyphrases [1]. However, these traditional AKE algorithms rely on
1Although there is no limitation on the maximum number of words in a phrase, a phrase should not be a full sentence.
2http://www.nzdl.org/Kea
computing the likelihood of being a keyphrase for each candidate phrase. Candidate
phrases are represented as vectors of predefined distributional features, which measure
the characteristics of phrases’ distributions in a document or corpus [1–5], such as their
occurrence frequencies. Such statistical measures often fail to extract keyphrases with
low frequencies. In addition, they are incapable of capturing the semantic meanings of
phrases. Later approaches use external knowledge, such as WordNet and Wikipedia, to
supply extra semantic knowledge [6–8]. However, these knowledge bases are designed for
general purposes, thus they often make little or no contribution for extracting keyphrases
from domain-specific corpora. In summary, existing AKE approaches suffer from three
major shortcomings:
1. Distributional features do not help algorithms to understand the semantics of
phrases and documents, because most features only encode statistical distribu-
tions of phrases in a document. Unlike human annotators who need to understand
the documents to identify keyphrases, existing AKE algorithms built upon distri-
butional features are often unable to, or make no attempt to, understand the
meanings of phrases and documents. Inevitably, this lack of cognitive ability
inhibits their performance.
2. Observing, selecting and evaluating different combinations of features, namely
feature-engineering, turn the development of AKE algorithms into a time-
consuming and labour-intensive exercise, where the primary goal is to design
representations of data to improve the performance of AKE algorithms. These
features are manually selected based on the observation of keyphrases’ linguis-
tic, statistical, or structural characteristics. However, characteristics of the same
phrase may vary across different types of datasets. For example, the relative occur-
rence position of a phrase is an important cue for being a keyphrase in a collection of
journal articles, because intuitively, most keyphrases occur in the abstract and in-
troduction sections of an article. Such a characteristic does not exist in collections of
news articles, since they are mostly short and do not have sections. Thus, an AKE
model developed upon specific features of one dataset is difficult to adapt to
another. For example, Hasan and Ng [9] demonstrate that algorithms developed
for specific datasets usually deliver poor performance on a different dataset. Take
ExpandRank [10] for instance: it assumes that phrases co-occur in a similar fash-
ion in its neighbourhood documents – the most similar documents in the corpus
ranked by cosine similarities. However, such topic-wise similar documents may not
exist in other datasets.
3. Using semantic knowledge supplied by public knowledge bases to analyse the se-
mantic relations of phrases has limitations. Firstly, most knowledge bases have
limited vocabularies, which will not sufficiently cover all fast-growing technical
terms. Secondly, they only provide general meanings of common words, which
make little or no contribution in the analysis of domain-specific corpora.
For example, the word neuron generally refers to the core components of the brain
and spinal cord of the central nervous system, thus it closely relates to neocortex
or sensorimotor. In machine learning literature, a neuron typically refers to a
node in neural networks that performs the linear or non-linear computation, thus
closely related words may be sigmoid or back-propagation.
This thesis aims to address all three problems. Recent advancements in deep neural
networks have enabled a paradigm shift in representing the semantics of lexical words,
from distributional to distributed vectors, known as word embeddings [11, 12]. These
low-dimensional, dense and real-valued vectors open up immense opportunities for applying
known algorithms for vector and matrix manipulations to carry out semantic calculations
that were previously impossible.
that were previously impossible. For example, Mikolov et al. [13] demonstrate that word
embedding vectors encode both linguistic and semantic knowledge, e.g. the result of word
vector calculations show that vec(king) − vec(man) + vec(woman) ≈ vec(queen), and
vec(apple) − vec(apples) ≈ vec(car) − vec(cars). Most importantly, learning
word embeddings requires very little or no feature engineering. The learning of word
embeddings is facilitated by purpose-designed neural network architectures capturing
different language models. For example, the word2vec model [14, 15] uses a feed-forward
network architecture to encode the probability distributions of words in word embeddings
by predicting the next word given its precedents; recurrent network architectures are
capable of learning language models via the sequential combinations of words [16, 17];
and the recursive neural networks [18–20] focus more on capturing syntactical patterns
and commonalities among words.
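The additive regularities described above can be illustrated with a toy example. The four-dimensional vectors below are invented purely for illustration – embeddings from a real trained model have hundreds of dimensions – but the nearest-neighbour search over a composed vector works the same way:

```python
import math

# Toy 4-dimensional "embeddings" -- illustrative values only,
# not vectors from a real trained model.
vec = {
    "king":  [0.9, 0.8, 0.1, 0.3],
    "man":   [0.1, 0.9, 0.2, 0.1],
    "woman": [0.1, 0.1, 0.9, 0.2],
    "queen": [0.9, 0.0, 0.8, 0.4],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# vec(king) - vec(man) + vec(woman)
analogy = [k - m + w for k, m, w in zip(vec["king"], vec["man"], vec["woman"])]

# If the space encodes the gender relation, the nearest word
# (by cosine similarity) to the composed vector is "queen".
nearest = max(vec, key=lambda w: cosine(analogy, vec[w]))
print(nearest)
```

With real embeddings the vocabulary is large and the composed vector only approximates its nearest neighbour, but the ranking step is identical.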
1.2 Contributions and Thesis Outline
In this research, a series of semantic knowledge-based techniques are proposed and de-
veloped to automatically extract keyphrases from documents. This thesis firstly com-
bines manually selected distributional features with word embedding vectors to inves-
tigate whether the additional semantic knowledge embedded in these vectors can im-
prove keyphrase extraction. These word embeddings are pre-trained over both large
collections of documents (e.g. Wikipedia) and domain-specific datasets to overcome the
problems of limited vocabulary and insufficient domain-specific knowledge supplied by
public knowledge bases. We demonstrate that using the knowledge encoded in word
embedding vectors is an effective way to improve the performance of traditional AKE
algorithms [21, 22]. However, word embeddings only encode the semantics of unigram words.
To enable learning algorithms to encode the meanings of multi-word phrases and
documents, we develop a series of deep learning models, including the Co-training model
presented in Chapter 5, the Matrix-Vector Recurrent Unit (MVRU) model presented in
Chapter 6, and the Convolutional Document Cube (CDC) model presented in Chap-
ter 7. We demonstrate that the deep learning models developed throughout this thesis
provide effective tools for learning general representations of phrases and documents,
which enable classifiers to efficiently identify the importance of phrases in documents,
without using any manually selected features.
In the rest of this chapter, we outline the structure of the thesis and summarise our
primary contributions. Chapter 2 consists of two parts: the first part is an overview of
existing AKE approaches, and the second part focuses on using deep neural networks for
modelling languages and learning representations of words and their compositionality.
The core content of this thesis is presented in Chapter 3 to Chapter 7. The outlines
with highlights on contributions are presented as follows.
1.2.1 Chapter 3: Conundrums in Unsupervised AKE
This chapter presents a systematic analysis and investigation of how different
pre-processing and post-processing techniques affect the performance of an
AKE system, with empirical explorations of common weaknesses of existing unsuper-
vised AKE systems.
Generally, algorithms for AKE can be grouped into two categories: unsupervised ranking
and supervised machine learning. In comparison to supervised AKE, unsupervised
approaches use fewer features, making them less dependent on the choice of features.
Unsupervised AKE systems consist of a processing pipeline, including pre-processing
that identifies candidate phrases, scoring and ranking candidates, and post-processing
that extracts top ranked candidates as keyphrases. Each process plays a critical role
in delivering the final extraction result. However, it is unclear how exactly different
techniques employed at each process will affect the performance of an AKE system. Most
studies only focus on the keyphrase extraction algorithms with little discussion on how
pre-processing and post-processing steps affect the system performance, which leaves
ambiguities in how the claimed improvements are achieved, bringing extra difficulties for
researchers seeking further improvements over existing approaches. Thus, the departure
point of this research is to gain a comprehensive understanding of the strengths and
weaknesses of unsupervised AKE algorithms.
In Chapter 3, we conduct a systematic evaluation of four popular unsupervised AKE
algorithms combined with five pre-processing and two post-processing techniques, where
we clearly identify the impacts of different pre-processing and post-processing techniques
on the same keyphrase extraction algorithm. More importantly, we also show that the
unsupervised AKE algorithms are overly sensitive to phrase frequencies, i.e. they mostly
fail to identify infrequent but important phrases. Chapter 3 is developed based on our
publication [23], which received the best paper award from the CICLing 2014 conference.
1.2.2 Chapter 4: Using Word Embeddings as External Knowledge for
AKE
This chapter presents a general-purpose weighting scheme for graph-based AKE
algorithms using word embedding vectors to incorporate background knowl-
edge to overcome 1) the problems of limited vocabulary and insufficient domain-specific
knowledge supplied by public knowledge bases, 2) the frequency sensitivity problem
identified in Chapter 3.
The family of graph-ranking algorithms is by far the most popular unsupervised ap-
proach for AKE, producing state-of-the-art performance on a number of different
datasets [3, 6, 24]. Our proposed approach combines classic graph-based AKE ap-
proaches with recent deep learning outcomes. In contrast to other studies that use
existing semantic knowledge bases, such as WordNet as the external knowledge source,
we combine manually selected distributional features with the semantic knowledge en-
coded in word embeddings to compute the semantic relatedness of words. We train word
embeddings over a Wikipedia snapshot and the AKE corpus to encode both general and
domain-specific meanings of words, demonstrating that the proposed weighting scheme
mitigates the frequency sensitivity problem, and generally improves the performance of
graph ranking algorithms for identifying keyphrases. Moreover, although assigning weights
to graphs improves the performance of graph-based AKE algorithms, the development
of weighting schemes is a rather ad-hoc process in which the choice of features is critical
to the overall performance of the algorithms. This problem turns the development of
graph-based AKE algorithms into a laborious feature engineering process. One of our
goals is to discover AKE approaches that are less dependent on the choice of features and
datasets. Hence, in the rest of the thesis, we focus on using deep learning approaches to
automatically learn useful features. The learnt features become representations of
phrases and documents that encode their meanings, helping learning algorithms
deliver better performance. Chapter 4 is based on our work presented at the WSDM
2014 workshop [21] and the ADC 2015 conference [22].
1.2.3 Chapter 5: Featureless Phrase Representation Learning with
Minimal Labelled Data
This chapter presents a co-training deep learning network architecture that min-
imises the burden on domain experts to produce training data. A convolutional
neural network (CNN) in conjunction with a Long Short-Term Memory (LSTM) network
in a weakly-supervised co-training fashion is designed and implemented.
To encode the semantic knowledge of multi-word phrases, a model must be able to un-
derstand the meaning of each constituent word and learn the rules of composing them,
known as the principle of semantic compositionality [25]. Chapter 5 investigates the use
of two deep neural networks: CNN [26, 27] and LSTM network [28] to automatically
learn phrase embeddings. The intuition is that the meaning of a phrase can be learnt
by analysing 1) different sub-gram compositions, and 2) sequential combination of each
constituent word. The CNN network analyses different regions of an input matrix con-
structed by stacking constituent word embedding vectors of a phrase, where the sizes
of regions reflect different N-grams. By scanning the embedding matrix with different
region sizes, the CNN network can learn the meaning of a phrase by capturing the most
representative sub-grams. The LSTM network, on the other hand, learns the compo-
sitionality by recursively composing an input embedding vector with the precedent or
previously composed value. The network captures the meaning of a phrase by control-
ling the information flow – how much information (or meaning) of an input word can be
added into the overall meaning, and how much information from the previous composi-
tion should be dismissed. We demonstrate that co-training using the two networks can
achieve the same level of classification accuracy with very little labelled data. The work
presented in this chapter is based on our publication at the ALTA 2016 conference [29].
1.2.4 Chapter 6: A Matrix-Vector Recurrent Unit Network for Cap-
turing Semantic Compositionality in Phrase Embeddings
This chapter presents the Matrix-Vector Recurrent Unit (MVRU) model – a novel
computational mechanism for learning language semantic compositionality
based on the recurrent neural network architecture. The MVRU model learns
the meanings of phrases without using any manually selected features.
Although both CNN and LSTM deliver impressive results on learning multi-word phrase
embeddings, they have limitations. CNN networks have the strength to encode regional
compositions of different locations in data matrices. In image processing, pixels close
to each other usually are a part of the same object, thus convoluting image matrices
captures the regional compositions of semantically related pixels. However, the location
invariance does not exist in word embedding vectors. The LSTM network, on the other
hand, uses shared weights to encode composition rules for every word in the vocabulary
of a corpus, which may overly generalise the compositionality of words.
In this chapter, a novel compositional model based on the recurrent neural network
architecture is developed, where we introduce a new computation mechanism for the
recurrent units – MVRU – to integrate different views of compositional semantics
originating from linguistic, cognitive, and neuroscience perspectives. The recurrent
architecture of the network allows for processing phrases of various lengths as well as
encoding the consecutive orders of words. Each recurrent unit consists of a composi-
tional function that computes the composed meaning of two input words, and a control
mechanism to govern the information flow at each composition. When the network is
trained in an unsupervised fashion, it is able to capture latent compositional semantics
of phrases, producing embedding vectors for any phrase regardless of whether the phrase
appears in the training set. When the network is trained in a supervised fashion by adding a
regression layer, it is able to perform classification tasks using the phrase embeddings
learnt from the lower layer of the network. The work presented in this chapter is ac-
cepted for publication by the International Conference on Information and Knowledge
Management (CIKM) 2017.
1.2.5 Chapter 7: A Deep Neural Network Architecture for AKE
This chapter first presents the Convolutional Document Cube (CDC) model for
quickly learning distributed representations of documents using a CNN net-
work. In the second part of the chapter, a semantic knowledge-based deep learn-
ing model for AKE is developed. The model encodes the semantics of both phrases
and documents using unsupervised training and supervised fine-tuning. We demon-
strate that the deep learning AKE model requires no manually selected features, and
hence it is much less dependent on the datasets.
Keyphrases are document-dependent – a phrase that is a keyphrase for one document
may not be important for others. To identify keyphrases, as in the natural process a
human follows, one needs to understand not only the meanings of words and phrases,
but also the overall meaning of the document. Thus, a deep learning model for AKE
is required to capture
the semantics for both phrases and documents. The proposed CDC model treats a
document as a cube with the height being the number of sentences, the width being the
number of words in a sentence, and the depth being the dimension of word embedding
vectors. We ‘slice’ the cube by each of its depth’s (word embeddings’) dimension to
generate 2-dimensional channels, which are inputs to a CNN network to analyse the
intrinsic structure of the document and the latent relations in words and sentences.
To extract keyphrases, we propose a deep learning AKE model, which is a binary clas-
sifier that jointly trains the MVRU and CDC models. We evaluate the deep learning
AKE model on three datasets, and demonstrate that the proposed approach delivers
new state-of-the-art performance on two of the datasets, without employing any dataset-
dependent heuristics.
1.2.6 Chapter 8: Conclusion
The conclusion of this thesis is presented in Chapter 8, where we discuss major findings
and shortcomings, with an outlook to future directions.
Chapter 2
Literature Review and
Background
To lay the foundation of this research, this chapter firstly provides a comprehensive re-
view on automatic keyphrase extraction (AKE). Then we present a general introduction
to representation learning and language modelling using deep neural networks, starting
with the theoretical background of deep learning, and then progressing into how it may
solve or mitigate the problems for downstream NLP tasks, such as keyphrase extraction.
2.1 Automatic Keyphrase Extraction
2.1.1 Overview of AKE systems
Before we dive into our own literature review, it is worth taking a brief look at a few
recent surveys published since 2013 [30–33].
At the Semantic Evaluation workshop in 2010 (SemEval-2010), AKE was the fifth shared
task. Kim et al. [30] summarise each of the submitted systems and the results of
the shared task. They first break down existing techniques into four
components from a system point of view: 1) candidate identification, 2) feature engi-
neering, 3) learning model development, and 4) keyphrase evaluation techniques. A brief
summary of the submitted systems is presented following the order of these four compo-
nents. By analysing the upper-bound performance of the task, two key findings are: 1) a
fixed threshold for the number of keyphrases per document restricts the performance of
a system, and 2) the existing evaluation methods underestimate the actual performance
of a system.
Hasan and Ng [31] summarise supervised AKE approaches in two steps: 1) task
reformulation, which recasts AKE as either a classification or a ranking task, and 2) fea-
ture design, which selects features from the training documents and/or external resources
such as Wikipedia. Hasan and Ng group unsupervised AKE approaches into four cate-
gories: 1) graph-based ranking, 2) topic-based clustering, 3) simultaneous learning, and
4) language modelling. An important contribution is that they point out common
shortcomings of the state-of-the-art approaches, which they classify into four
categories: 1) over-generation errors – a system identifies an important word and hence
all phrases containing the word are identified as keyphrases, 2) infrequency errors – in-
frequent phrases are unlikely to be identified, 3) redundancy errors – semantically equiv-
alent phrases are identified1, and 4) evaluation errors – current evaluation methodologies
are unable to identify whether an extracted phrase and the ground truth are semantically
equivalent.
Beliga et al. [32] focus more on graph-based keyword extraction approaches; the
graph-based approach is by far the most common unsupervised keyword/keyphrase ex-
traction approach. Beliga et al. analyse and compare various graph-based approaches,
and provide guidelines for future research on graph-based AKE. Siddiqi and Sharan [33]
present a survey in which they classify AKE approaches into four categories:
1) rule-based linguistic approaches, 2) statistical approaches, 3) machine learning ap-
proaches, and 4) domain-specific approaches. Siddiqi and Sharan also discuss some
common features used for extracting keyphrases.
Based on the previous survey work, we hereby present our own review of AKE approaches
from a system point of view, similar to Kim et al. [30]'s work. However, before a system
can correctly identify candidate phrases, it usually requires a data cleaning step, which
removes noisy data (e.g. mathematical equations) and normalises documents to obtain clean texts
for the steps that follow, namely text pre-processing. Hence, we view an AKE system
as a processing pipeline, including text pre-processing, candidate phrase
identification, feature selection, keyphrase extraction, and optionally post-processing.
The feature selection step identifies the essential features that characterise a phrase, and
computes the numerical feature values. Each phrase then acquires an index associated
with a value vector, where each dimension corresponds to a feature encoding a
lexical, syntactic, or semantic interpretation of the candidate.
The keyphrase extraction algorithm is the core of an AKE system, which is responsi-
ble for identifying keyphrases from the candidates. Generally, the algorithms can be
grouped into supervised machine learning and unsupervised ranking algorithms. Su-
pervised machine learning for AKE is more of a feature engineering process, meaning
1 We believe that this type of error overlaps with the type 1 error.
Table 2.1: Common Pre-processing and Candidate Phrase Selection Techniques

Study                                   Processing Sequence
Ohsawa et al. (1998) [35]               3, 4, 5
Mihalcea and Tarau (2004) [3]           1, 2, 6
Matsuo and Ishizuka (2004) [36]         4, 5
Bracewell et al. (2005) [37]            1, 2, 4, 6
Krapivin et al. (2008) [38]             1, 3, 5, 4
Liu et al. (2009) [8]                   1, 2, 6, 7
Liu et al. (2010) [39]                  1, 2, 6
Ortiz et al. (2010)* [40]               3, 4, 5
Bordea and Buitelaar (2010)* [41]       1, 2, 6, 7
El-Beltagy and Rafea (2010)* [42]       3, 7, 4, 5
Paukkeri and Honkela (2010)* [43]       7, 4
Zervanou (2010)* [44]                   1, 7, 2, 6
Rose et al. (2010) [45]                 3
Dostal and Jazek (2011) [46]            1, 3, 2, 6
Bellaachia and Al-Dhelaan (2012) [47]   1, 2, 6, 4
Sarkar (2013) [48]                      3, 4, 5, 7
You et al. (2013) [49]                  3, 7
Gollapalli and Caragea (2014) [50]      1, 2, 6
Total uses per technique: 1) 10, 2) 9, 3) 8, 4) 9, 5) 6, 6) 9, 7) 7

1: Tokenising. 2: POS tagging. 3: Splitting text by meaningless words or characters and/or removing them.
4: Stemming. 5: N-gram filtering with heuristic rules. 6: Phrase identification with POS patterns.
7: Other heuristic rules.
* Although some of the processing steps may not be explicitly mentioned in the paper, the system
participated in the SemEval-2010 workshop shared task 5 and has been reviewed by Kim et al. [51],
where more implementation details are provided.
that the majority of the effort is spent on selecting and inducing features, since different
combinations of features have a critical impact on the performance of the same algo-
rithm [34]. On the other hand, early unsupervised AKE approaches rely more on the
understanding of what keyphrases are. For example, graph-based unsupervised AKE
algorithms are developed based on the common belief that keyphrases have stronger
co-occurrence relations to others. Hence, keyphrases are the phrases that tie and hold
the text together. More recent unsupervised AKE algorithms use external knowledge
bases, such as WordNet, to supply more semantic features.
2.1.2 Text Pre-processing and Candidate Phrase Identification
Table 2.1 shows an overview of common text pre-processing and candidate phrase iden-
tification techniques with respect to their corresponding processing sequences, after sur-
veying 18 key papers published between 1998 and 2014.
2.1.2.1 Text Pre-processing
The objective of text pre-processing is to normalise and clean documents. The nor-
malisation process aims to convert texts into a unified format enabling more efficient
processing. Common normalisation techniques include converting characters to low-
ercase, tokenising, sentence splitting, lemmatising, and stemming. For example, the
distinction between ‘Network’, ‘network’, and ‘networks’ is ignored after the normalisa-
tion. The cleaning process identifies and optionally removes characters or words that
carry little or no semantic meaning, such as punctuation marks, stop-words, symbols,
or mathematical equations. Techniques used in text cleaning can be corpus-dependent,
often requiring applying heuristics to remove unnecessary or irrelevant information. For
example, Paukkeri and Honkela [43] remove authors’ names and addresses, tables, fig-
ures, citations, and bibliographic information from scientific articles.
2.1.2.2 Candidate Phrase Identification
Keyphrases consist of not only unigram words, but also multi-word phrases, and hence
the candidate phrase identification process recognises syntactically acceptable phrases
from documents, which are treated as the candidates of keyphrases. Two common
approaches are Phrase as Text Segment [35, 38, 40, 48] and Phrase Identification with
Part-of-Speech Patterns [8, 37, 39, 41, 47, 50].
Phrase as Text Segment (PTS) approaches identify phrases by splitting texts based
on heuristics. The simplest approach is to use stop-words and punctuation marks as delimiters.
It assumes that syntactically and semantically valid phrases rarely contain meaningless
words and characters. For example, in this sample sentence “information interaction is
the process that people use in interacting with the content of an information system”,
phrases identified are information interaction, process, people, interacting, content, and
information system. A more sophisticated approach is to apply additional heuristics.
For example, N-gram filtration [35, 40] selects candidate phrases based on statistical
information and the lengths of phrases. It compares the frequencies and the lengths of
all possible N-grams of a text segment, where only the phrase with the highest frequency
is kept, and if two or more phrases have the same frequency, then the longer one is kept.
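The simplest PTS heuristic can be sketched as follows. The stop-word list below is a small hand-picked set sufficient for the sample sentence; a real system would use a much fuller list, such as the English stop-word lists shipped with common NLP toolkits:

```python
import re

# Minimal stop-word list for the example only; a real system would
# use a fuller list (typically a few hundred English words).
STOP_WORDS = {"is", "the", "that", "use", "in", "with", "of", "an", "a"}

def pts_candidates(text):
    """Split text into candidate phrases, using stop-words and
    punctuation as delimiters (the simplest PTS heuristic)."""
    tokens = re.findall(r"[a-z]+", text.lower())  # punctuation is dropped here
    candidates, current = [], []
    for tok in tokens:
        if tok in STOP_WORDS:
            if current:                 # a stop-word closes the current phrase
                candidates.append(" ".join(current))
            current = []
        else:
            current.append(tok)
    if current:
        candidates.append(" ".join(current))
    return candidates

sentence = ("information interaction is the process that people use in "
            "interacting with the content of an information system")
print(pts_candidates(sentence))
```

Run on the sample sentence above, this yields exactly the six candidates listed in the text, from information interaction through to information system.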
PTS approaches have advantages when identifying candidates from short and well-
written texts, such as abstracts of journal articles. They do not involve any deeper
linguistic analysis, thus no Part-of-Speech (POS) tagging is required. In addition, PTS
approaches are language independent, and can be applied to any language with a fi-
nite set of stop-words and punctuation marks. However, when they are used to process long
documents, such as full-length journal articles, too many candidates will be produced
and hence more irrelevant candidates are introduced, because longer documents have
more word combinations.
Phrase Identification with POS Patterns relies on linguistic analysis, which iden-
tifies candidate phrases based on pre-identified POS patterns. In comparison to PTS,
it is not a language independent technique, so choosing a reliable POS tagger is im-
portant. The most common POS pattern is that a phrase should start with zero or
more adjectives followed by one or more nouns [3, 8, 44, 52], the regular expression is
<JJ>*<NN.*>+. However, the pattern assumes that keyphrases contain only adjectives and
nouns, which can be a drawback. For example, machine intelligence can be correctly
identified, whereas intelligence of machines will not be considered. More sophisticated
phrase identification approaches rely on deeper linguistic analysis to work out more pat-
terns. For example, Chedi et al. [53] define 12 syntactic patterns for English including
4 bigram patterns, 6 trigram patterns, and 2 4-gram patterns.
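The <JJ>*<NN.*>+ pattern can be sketched as a greedy scan over POS-tagged tokens. The tags below are supplied by hand for illustration; in practice they would come from a POS tagger:

```python
# Tokens pre-tagged by hand for illustration; a real system would
# obtain the tags from a POS tagger.
tagged = [("machine", "NN"), ("intelligence", "NN"), ("is", "VBZ"),
          ("a", "DT"), ("broad", "JJ"), ("research", "NN"),
          ("area", "NN"), (".", ".")]

def pos_candidates(tagged_tokens):
    """Extract candidates matching <JJ>*<NN.*>+ : zero or more
    adjectives followed by one or more nouns."""
    candidates, i = [], 0
    while i < len(tagged_tokens):
        j = i
        # consume a (possibly empty) run of adjectives
        while j < len(tagged_tokens) and tagged_tokens[j][1].startswith("JJ"):
            j += 1
        k = j
        # consume the run of nouns (NN, NNS, NNP, ...)
        while k < len(tagged_tokens) and tagged_tokens[k][1].startswith("NN"):
            k += 1
        if k > j:  # at least one noun: a valid match
            candidates.append(" ".join(w for w, _ in tagged_tokens[i:k]))
            i = k
        else:
            i += 1
    return candidates

print(pos_candidates(tagged))
```

On the toy sentence this extracts machine intelligence and broad research area, while skipping the verb and determiner.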
Other Phrase Identification Approaches are introduced in different NLP tasks,
which are not commonly employed in AKE systems. For example, Liu et al. [54] have
investigated using Prefixspan [55] – a sequential pattern mining algorithm – to automati-
cally identify single and multi-word medical terms without any domain-specific knowl-
edge. Frantzi et al. [56] introduce C-value and NC-value algorithms that use both POS
identification techniques and statistical information to identify domain specific terms.
Wong et al. [57] present a framework using statistics from web search engines for mea-
suring unithood – the likelihood of a sequence of words being a valid phrase.
2.1.3 Common Phrase Features
After identifying candidate phrases, the next process is to select phrases’ features based
on the observations of datasets. Features represent the distinct characteristics of a phrase,
which differentiate it from others. Broadly, a phrase has two kinds of features: self-
reliant and relational features. Self-reliant features relate to the information about a
phrase itself, such as its frequency, the linguistic structure, or its occurrence positions
in a document. Relational features capture the relational information of a phrase to
others, such as co-occurrence and semantic relations. Table 2.2 shows an overview of
common features employed by 30 studies including both supervised and unsupervised
AKE approaches.
2.1.3.1 Self-reliant features
1. Frequencies are the main source of identifying the importance of phrases [1, 4,
10, 35]. However, in many cases, raw frequency statistics do not precisely reflect
the importance of phrases. A phrase with distinct high frequency may not be
a good discriminator for being a keyphrase if it distributes evenly in a corpus.
Scott [58] suggests that a keyphrase should have unusual frequency in a given
document compared with reference corpora. Therefore, raw frequency statistics
are also commonly used as inputs to other statistical models for more sophisticated
features, such as Term Frequency - Inverse Document Frequency (TF-IDF) [2].
2. Linguistic features2 are obtained from linguistic analysis, including part-of-
speech (POS) tagging, sentence parsing, syntactic structure and dependency anal-
ysis, morphological suffixes or prefixes analysis, and lexical analysis. They are
mostly used in supervised machine learning approaches [59–61]. However, given
the fact that the average length of phrases is very short, phrases have relatively
simple syntactic structures, which may offer less benefit to the overall performance
of a system. Kim and Kan [34] reveal that applying linguistic features, such as POS
and morphological suffix sequences, has no effect, or only small unconfirmed effects,
on the performance of supervised machine learning approaches for AKE.
3. Structural features encode how a phrase occurs in the document, such as the
relative positions of the first or last occurrence, whether the phrase appears in
the abstract of a document. A common observation is that in well-structured
documents, such as scientific publications, keyphrases are more likely to appear
in abstracts and introduction sections. Structural features can also be specific to
the format of texts. For example, Yih et al. [59] analyse whether a phrase occurs
in hyper-links and URLs to identify keyphrases from web pages. However, such
specific information may be unavailable in other corpora.
4. The length of a phrase is the number of words in a phrase, which is considered
to be a useful feature by a number of studies [62–65]. Although this may seem
counter-intuitive, since keyphrases rarely have more than five tokens, studies claim
that applying this feature generally improves the performance of a supervised AKE
system [34, 66].
5. Phraseness refers to the likelihood of a word sequence constituting a meaningful
phrase. In particular, the phraseness feature is useful for identifying a keyphrase
from its longer or shorter cousins, such as convolutional neural network and neural
network. Two common techniques to compute feature values are Pointwise Mutual
Information (PMI) [67] and Dice coefficient [68].
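The two phraseness measures can be sketched directly from their definitions. The corpus counts below are invented for illustration, not drawn from a real corpus:

```python
import math

# Illustrative corpus counts -- invented for the example.
N = 1_000_000            # total bigram positions in the corpus
f_neural = 2_000         # frequency of "neural"
f_network = 3_000        # frequency of "network"
f_bigram = 1_500         # frequency of the bigram "neural network"

# Pointwise Mutual Information: log of p(w1, w2) / (p(w1) * p(w2)).
# A high PMI means the words co-occur far more often than chance
# predicts, i.e. the sequence is likely a meaningful phrase.
pmi = math.log((f_bigram / N) / ((f_neural / N) * (f_network / N)))

# Dice coefficient: 2 f(w1, w2) / (f(w1) + f(w2)), bounded in [0, 1].
dice = 2 * f_bigram / (f_neural + f_network)

print(round(pmi, 3), round(dice, 3))
```

With these counts the bigram scores high on both measures, as expected for a strongly collocated phrase; a chance pairing of two frequent words would score near zero.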
2 Not to be confused with the POS pattern-based phrase identification for identifying candidate phrases.
2.1.3.2 Relational features
1. Phrase co-occurrence statistics provide the information of how phrases co-
occur with others, and also implicitly supply phrase frequency statistics – a
phrase that co-occurs more frequently with others indicates that the phrase itself has
higher importance.
In supervised AKE approaches, it is less common to employ co-occurrence statistics
since they assume no correlations between candidate phrases. However, Kim and
Kan [34] find that applying co-occurrence features is empirically useful for super-
vised AKE. In contrast, co-occurrence statistics are the main source of features for
the majority of unsupervised AKE approaches [35, 36], especially in graph-based
ranking approaches [3, 10]. The intuition is that a phrase may be a keyphrase if
it 1) frequently co-occurs with different candidates, or 2) only co-occurs with a
particular set of highly frequent candidates.
2. Phrases and documents relations examine how important a candidate phrase
is for one particular document with respect to the distribution of the phrase in
a corpus. TF-IDF [2] is the most common algorithm, computed as the product
of the phrase (term) frequency (TF) and the inverse document frequency (IDF).
TF-IDF assigns lower scores to the phrases that are evenly distributed across the
corpus, and higher scores to the phrases that occur frequently in a few particular
documents. Other studies [69–71] also separate TF and IDF as two standalone
features.
3. Semantic features present semantic relatedness between phrases, which can be
any linguistic relation between two phrases, such as similarity, synonym, hyponym,
hypernym, and meronym. Semantic relatedness can be directly obtained from
off-the-shelf semantic knowledge bases. For example, Wang et al. [6] measure
synonyms of phrases based on the ontological information provided by WordNet.
Ercan and Cicekli [61] also use the ontology from WordNet to identify synonym,
hyponym, hypernym, and meronym relations between two word senses. Semantic
relatedness also can be induced from large corpus (e.g. Wikipedia) using statistical
or machine learning techniques. Liu et al. [8] consider each Wikipedia article as
a concept, and the semantic meaning of a candidate phrase can be represented
as a weighted vector of its TF-IDF score within corresponding Wikipedia arti-
cles. The semantic similarity between two phrases can be computed using cosine
similarity, Euclidean distance, or the PMI measure. Other studies, including [72–
74], compute the semantic relatedness as the ratio of the number of times that
a candidate appears as a link in Wikipedia, and the total number of Wikipedia
articles where the candidate appears. Using dimension reduction techniques such
as Singular Value Decomposition (SVD) to induce distributional representations³
of candidates, then applying the cosine similarity measure to obtain similarity
scores between two phrases, has also been investigated [75]. More recently, Wang
et al. [22] use pre-trained word embedding vectors over Wikipedia to compute
semantic relatedness using the cosine similarity measure.
4. Phrases and topic relation features indicate how a candidate is related to a
specific topic in a document. A common belief is that keyphrases should cover
the main topics or arguments of an article. The most common approach is to
use Latent Dirichlet Allocation (LDA) [76] to obtain the distributions of words
over latent topics [39, 47, 62, 77]. Another popular approach applies clustering
algorithms, grouping phrases into clusters, where each cluster is treated as a topic,
so phrases from the same cluster are considered to be more similar in the semantic
context [8, 78].
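Several of the feature groups above reduce to simple corpus statistics. As a concrete illustration of group 2, a minimal TF-IDF scorer over a toy corpus of pre-extracted candidate phrases is sketched below; the documents are invented for illustration and are not drawn from any of the cited datasets:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Score each candidate phrase in each document by TF-IDF."""
    # Document frequency: number of documents containing each phrase.
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    n_docs = len(corpus)
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        # TF normalised by document length, IDF as log(N / df).
        scores.append({p: (tf[p] / len(doc)) * math.log(n_docs / df[p])
                       for p in tf})
    return scores

docs = [["neural network", "training", "neural network"],
        ["training", "corpus"],
        ["corpus", "neural network"]]
scores = tf_idf(docs)
# "neural network" is concentrated in the first document, so it outscores
# the evenly spread "training" there.
```

The same per-phrase TF and IDF values can also be kept as two standalone features, as in the studies cited above.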
2.1.4 Supervised AKE
Based on the learning algorithms employed, supervised machine learning approaches
treat AKE as either a classification or learning to rank problem. When treating AKE
as a binary classification problem, the goal is to train a classifier that decides whether
a candidate can be a keyphrase. Algorithms used include the C4.5 decision tree [4, 61],
Naïve Bayes [1, 72], Support Vector Machines (SVM) [65, 82], neural networks [71, 81],
and Conditional Random Fields (CRF) [83]. Another line of studies treats AKE as a
ranking problem. Intuitively, no phrase appears in a document just by chance, and each
phrase carries a certain amount of information that represents the theme of an article to
some degree. Thus, the goal is to train a pair-wise ranker that ranks candidates based
on their degree of representativeness – the most representative ones will be extracted
as keyphrases. For example, Jiang et al. [64] employ Linear Ranking SVM to train the
ranker. Jean-Louis et al. [85] present a Passive-Aggressive Perceptron model trained
using the Combined Regression and Ranking method.
Classic supervised machine learning approaches to AKE are feature-driven – the perfor-
mance of algorithms is heavily dependent on the choice of features. Thus, the majority of
the effort is spent on selecting and inducing features, also known as feature engineering.
Feature engineering is difficult and expensive, since both quality and quantity of the fea-
tures have significant influence on the overall performance. Kim and Kan [34] present a
comprehensive evaluation analysing how different features employed may affect the per-
formance of the same algorithms, including Naïve Bayes and Maximum Entropy. The
³ Please refer to Section 2.2.1 for details.
Table 2.2: Features Used in Extraction Algorithms
Algorithm | Method | Features (1–10)

Unsupervised Approaches
Sparck Jones (1972) [2] | statistical | X X
Ohsawa et al. (1998) [35] | cluster | X X
Matsuo and Ishizuka (2004) [36] | statistical | X X
Mihalcea and Tarau (2004) [3] | graph | X
Bracewell et al. (2005) [37] | graph | X X
Wang et al. (2007) [6] | graph | X X X
Wan and Xiao (2008) [10] | graph | X X X
Grineva et al. (2009) [7] | graph | X X X
Liu et al. (2009) [8] | cluster | X X X
Rose et al. (2010) [45] | statistical | X X
Liu et al. (2010) [39] | graph | X X X
Zhao et al. (2011) [79] | graph | X X X X
Bellaachia and Al-Dhelaan (2012) [47] | graph | X X X
Boudin and Norin (2013) [80] | graph | X X X X
Wang et al. (2015) [22] | graph | X X X X

Supervised Approaches
Witten et al. (1999) [1] | Naïve Bayes | X X X X
Turney (2000) [4] | C4.5 | X X X X
Hulth (2003) [5] | Bagging | X X X X
Yih et al. (2006) [59] | Logistic Regression | X X X X X
Jo et al. (2006) [81] | Neural Network | X X X
Zhang et al. (2006) [82] | SVM | X X X X
Ercan and Cicekli (2007) [61] | C4.5 | X X X
Zhang et al. (2008) [83] | CRF | X X X X
Jiang et al. (2009) [64] | Ranking SVM | X X X X
Medelyan et al. (2009) [72] | Naïve Bayes | X X X X X X X
Sarkar et al. (2010) [71] | Neural Network | X X X X X
Xu et al. (2010) [65] | SVM | X X X X X
Eichler and Neumann (2010) [63] | Ranking SVM | X X X X X
Ding et al. (2011) [62] | BIP | X X X X X
Jo and Lee (2015) [84] | Deep Learning | – – – – – – – – – –

1: Frequency; 2: Linguistic feature; 3: Structural feature; 4: Length of candidate; 5: Co-occurrence; 6: Candidate and document relation feature; 7: Semantic feature; 8: Candidate and topic feature; 9: Phraseness; 10: Using external knowledge base
study discovers that frequency, structural features, lengths of phrases, and co-occurrence
statistical features generally improve the performance of the algorithms.
2.1.5 Unsupervised AKE
Unsupervised AKE approaches are built based on observations or understandings of what
keyphrases are. We classify unsupervised AKE approaches into three groups based on
different views of keyphrases:
1. Keyphrases are the phrases having unusual frequencies.
2. Keyphrases are representative phrases that should cover the main topics or argu-
ments of an article.
3. Keyphrases are the phrases having stronger relations with other phrases, which
tie and hold the entire article together.
2.1.5.1 Capturing unusual frequencies – statistical-based approaches
Statistical-based approaches use deterministic mathematical functions to identify phrases
having unusual frequencies. Different algorithms interpret the notion of unusual in dif-
ferent ways. For example, TF-IDF [2] identifies a phrase having high frequency in a
few particular documents rather than being evenly distributed over the corpus. Sim-
ilarly, Likely [86] selects phrases by taking the ratio of the rank value of a phrase in
the documents to its rank value in the reference corpus, where rank is computed as the
relative N-gram frequency of the phrase. RAKE [45] identifies unusually frequent phrases
by examining how often each phrase co-occurs with others, scoring a phrase as the ratio
of its co-occurrence frequency with other phrases to its own frequency. Wartena et al. [87] compare
phrases’ co-occurrence distributions in a document with their frequency distributions in
the entire corpus.
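A minimal sketch of the RAKE-style degree-to-frequency word score described above, assuming candidate phrases have already been extracted (the phrase data below is invented for illustration, and phrase scores would be the sum of their word scores):

```python
from collections import Counter

def rake_word_scores(phrases):
    """RAKE-style word scores: degree(w) / frequency(w).

    degree(w) counts how often w co-occurs with words inside the same
    candidate phrase (including itself), so words that appear in longer
    phrases are favoured over words that occur alone.
    """
    freq = Counter()
    degree = Counter()
    for phrase in phrases:
        for w in phrase:
            freq[w] += 1
            degree[w] += len(phrase)  # w co-occurs with every word in the phrase
    return {w: degree[w] / freq[w] for w in freq}

phrases = [["minimal", "generating", "sets"], ["systems"], ["minimal", "set"]]
scores = rake_word_scores(phrases)
# "minimal": degree 5 (phrases of length 3 and 2), frequency 2 -> 2.5
# "systems": degree 1, frequency 1 -> 1.0
```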
Statistical-based approaches usually do not require any additional resource apart from
the raw data statistics of phrases from the corpus and documents. This allows statistical-
based approaches to be easily reimplemented. However, they can be frequency-sensitive,
favouring high-frequency phrases and preventing the algorithms from identifying
keyphrases with low frequencies.
2.1.5.2 Capturing topics – clustering-based approaches
Clustering-based approaches apply clustering algorithms to group candidate phrases
into topic clusters, then the most representative ones from each cluster are selected as
keyphrases. Ohsawa et al. [35] cluster phrases based on a co-occurrence graph, where
phrases are vertices and edges are co-occurrence relations weighted by co-occurrence
frequencies. The weak (low scored) edges are considered to be the appropriate ones for
segmenting the document into clusters that are regarded as groups of supporting phrases
on which the author’s points are based. Finally, keyphrases are identified as the ones
that hold clusters together. Liu et al. [8] apply Hierarchical, Spectral, and Affinity Prop-
agation clustering algorithms that group semantically related phrases using Wikipedia
and co-occurrence frequencies. Keyphrases are the phrases close to the centroid of each
cluster. Bracewell et al. [37] cluster phrases by first assigning each unigram word to
its own cluster; multi-word phrases are then assigned to the clusters containing their
component unigrams. If no such cluster is found, a candidate forms its own cluster.
Finally, the centroids of the top-k scored clusters are extracted as keyphrases.
Pasquier [77] proposes to induce topic distributions from groups of semantically related
sentences using both clustering algorithms and LDA. Keyphrases are scored by considering
the distributions of topics over clusters, the distributions of phrases over topics,
and the size of each cluster.
Extracting keyphrases from every topic cluster assumes that all topics are equally
important in an article. In reality, however, some minor topics are unimportant to an
article and should not have keyphrases representing them.
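The cluster-then-pick-exemplars pipeline common to these approaches can be sketched as follows. The greedy single-pass clustering and the toy co-occurrence vectors are simplifications for illustration, not a reimplementation of any of the cited methods:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster_phrases(vectors, threshold=0.5):
    """Greedy single-pass clustering: each phrase joins the first cluster
    whose centroid it is similar enough to, otherwise it starts a new one."""
    clusters = []  # each cluster is a list of (phrase, vector) pairs
    for phrase, vec in vectors.items():
        for cluster in clusters:
            centroid = [sum(col) / len(cluster)
                        for col in zip(*(v for _, v in cluster))]
            if cosine(vec, centroid) >= threshold:
                cluster.append((phrase, vec))
                break
        else:
            clusters.append([(phrase, vec)])
    return clusters

def exemplars(clusters):
    """Pick the phrase closest to each cluster centroid as its keyphrase."""
    keys = []
    for cluster in clusters:
        centroid = [sum(col) / len(cluster)
                    for col in zip(*(v for _, v in cluster))]
        keys.append(max(cluster, key=lambda pv: cosine(pv[1], centroid))[0])
    return keys

# Toy co-occurrence vectors forming two topical groups.
vectors = {"neural network": [3, 1, 0], "deep learning": [2, 1, 0],
           "stock market": [0, 0, 4], "share price": [0, 1, 3]}
keyphrases = exemplars(cluster_phrases(vectors))
# Two clusters emerge; their exemplars are the extracted keyphrases.
```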
2.1.5.3 Capturing strong relations – graph-based approaches
Graph-based approaches score phrases using graph ranking algorithms by representing
a document as a graph, where each phrase corresponds to a vertex, and two vertices are
connected if a pre-identified relation, such as phrase co-occurrence within a predefined
window, holds between them.
The most common ranking algorithms employed are webpage link analysis algorithms,
such as HITS [88], and PageRank [89] and its variants. HITS and PageRank recur-
sively compute the importance of a vertex in a graph by analysing both the number of
neighbouring vertices it connects to, and the importance of each of its neighbours.
Applying link analysis algorithms to keyphrase extraction assumes that 1) an important
phrase should have high frequency, such that it co-occurs more often with other phrases,
and 2) a phrase that selectively co-occurs with one or a few particularly frequent phrases
can also be important.
The most well-known algorithm is TextRank introduced by Mihalcea and Tarau [3],
which applies PageRank to AKE⁴ by representing documents as undirected and unweighted
graphs, considering only frequency and co-occurrence frequency features. Following
Mihalcea and Tarau's work, researchers have tended to improve performance by adding
features as weights to the graph's edges. For example, Wan and Xiao [10] propose to use
phrase co-occurrence frequencies as weights collected from both the target document and
its k nearest neighbour documents, identified using the document cosine similarity measure.
The approach essentially expands a single document to a small document set, by adding
a few topic-wise similar documents to capture more statistical information. Wang et
al. [6] use synsets in WordNet to obtain semantic relations between pairs of phrases.
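TextRank's core computation is PageRank over an unweighted word co-occurrence graph. A minimal sketch, with a toy word sequence and a fixed iteration count rather than a convergence test:

```python
def pagerank(graph, d=0.85, iters=50):
    """Unweighted PageRank as used by TextRank.

    graph maps each word to the set of words it co-occurs with inside a
    fixed window; d is the damping factor from the original algorithm.
    """
    scores = {v: 1.0 for v in graph}
    for _ in range(iters):
        scores = {v: (1 - d) + d * sum(scores[u] / len(graph[u])
                                       for u in graph[v])
                  for v in graph}
    return scores

def cooccurrence_graph(words, window=2):
    """Undirected graph linking each word to those within `window` before it."""
    graph = {w: set() for w in words}
    for i, w in enumerate(words):
        for u in words[max(0, i - window):i]:
            if u != w:
                graph[w].add(u)
                graph[u].add(w)
    return graph

words = ["graph", "based", "ranking", "graph", "ranking", "model"]
scores = pagerank(cooccurrence_graph(words))
# Frequent, well-connected words such as "graph" and "ranking" rank highest.
```

In a full system the vertices would be POS-filtered candidate words, and adjacent top-ranked words would be merged back into multi-word keyphrases.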
In addition to webpage link analysis algorithms, applying traditional graph centrality
measures to AKE has also been investigated. For example, Boudin [90] compares various
centrality measures for graph-based keyphrase extraction, including degree, closeness,
⁴ TextRank uses a weighted graph ranking algorithm derived from PageRank for text summarisation. However, TextRank is identical to PageRank when used for AKE, where the graph is unweighted.
betweenness centrality measures, as well as PageRank, which is classified as a variant of
the eigenvector centrality. The study shows that the simple degree centrality measure
achieves comparable results to the widely used PageRank algorithm, and the closeness
centrality delivers the best performance on short documents.
Using probabilistic topic models in conjunction with graph ranking algorithms has also
been investigated. This line of work is similar to clustering-based AKE approaches,
since both of them require identifying topics from documents as a prior. However,
in clustering-based approaches, keyphrases are directly drawn from each cluster. In
contrast, graph-based approaches treat topic distributions as features that are inputs
to the graph ranking algorithms. Liu et al. [39] use LDA [76] to induce latent topic
distributions of each phrase, then run the personalised PageRank algorithm [91] for
each topic separately, where the random-jump probability of a phrase is computed as
its probability under that topic. Candidates are finally scored as the sum of
their scores in each topic. In their later work, Liu et al. [92] propose a Topic Trigger
Model derived from the Polylingual Topic Model [93], an extension of LDA. Similarly,
Bellaachia and Al-Dhelaan [47] also use LDA to identify latent topics, then rank phrases
with respect to their topic distributions and TF-IDF scores.
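The per-topic ranking scheme can be sketched with a personalised PageRank whose random-jump mass follows a topic's word distribution, with candidates scored by summing over topics. This is a sketch of the scheme rather than the cited implementation; the graph and topic distributions below are made-up stand-ins for real LDA output:

```python
def personalised_pagerank(graph, restart, d=0.85, iters=50):
    """PageRank whose random-jump mass is distributed by `restart`
    (here, a word's probability under one latent topic)."""
    scores = {v: restart[v] for v in graph}
    for _ in range(iters):
        scores = {v: (1 - d) * restart[v]
                     + d * sum(scores[u] / len(graph[u]) for u in graph[v])
                  for v in graph}
    return scores

# Toy co-occurrence graph and two hypothetical topic distributions
# over the same words (in place of LDA output).
graph = {"tax": {"policy"}, "policy": {"tax", "reform"}, "reform": {"policy"}}
topics = [{"tax": 0.6, "policy": 0.3, "reform": 0.1},
          {"tax": 0.1, "policy": 0.3, "reform": 0.6}]

# Final score: sum of a word's personalised PageRank score in each topic.
final = {w: sum(personalised_pagerank(graph, t)[w] for t in topics)
         for w in graph}
# The central word "policy" accumulates the highest overall score.
```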
2.1.6 Deep Learning for AKE
Jo and Lee [84] use a Deep Belief Network (DBN) connected to a logistic regression layer
to learn a classifier. The model does not require any manually selected features. It
uses a greedy layer-wise unsupervised learning approach [94] to automatically learn the
features one layer at a time. After training, all pre-trained layers are connected and
fine-tuned. The input is the bag-of-words representation of a document, which essentially
is a vector with 1 and 0 values indicating whether a word appears in the document. The
logistic regression layer outputs potential latent phrases. Zhang et al. [95] propose a deep
learning model using a Recurrent Neural Network (RNN), combining keyword and context
information to extract keyphrases. The network has two hidden layers, where the first
one aims to capture keyword information, and the second extracts keyphrases based on
the keyword information encoded by the first layer. Meng et al. [96] propose
a generative model for keyphrase prediction with an encoder-decoder architecture using a
Gated Recurrent Unit neural network [97] that incorporates a copying mechanism [98]. Li et
al. [99] use word embeddings to represent words, then apply Euclidean distances to
identify the top-N-closest keywords as the extracted keywords.
2.1.7 Evaluation Methodology
Early studies employ human evaluation [36], which is an impractical approach because
of the significant amount of effort involved. The most common approach is exact match,
i.e. a ground-truth keyphrase matches an extracted phrase when they correspond to the
same stem sequence. For example, the phrase neural networks matches neural network,
but not network or neural net. However, the exact match evaluation is overly strict. For
example, convolutional neural network and convolutional net refer to the same concept
in the machine learning field, but they are treated as an incorrect match under the
exact match approach. In search of more effective evaluation approaches, Kim et al. [100]
investigate six evaluation metrics. Four of them are selected from the machine transla-
tion and automatic text summarisation fields, including BLEU [101], METEOR [102],
NIST [103] and ROUGE [104]. Others include R-precision [105], and the cosine similarity
measure of phrases’ distributional representations induced from web documents. Four
graduate students from the NLP field were hired as judges, and Spearman's ρ
correlation was used to compare the computer-generated and human-assigned keyphrases.
The study shows that the semantic similarity-based method produces the lowest score, and
R-precision achieves the highest correlation with humans, about 0.45. They also suggest
that phrases should be scored differently based on whether they match keyphrases by
head nouns, the first word, or middle words. However, there is no sophisticated
evaluation metric that can identify whether two phrases share the same semantic meaning
at a human-acceptable level. Hence, most researchers still use exact match as their
evaluation metric.
The most common measures for AKE system performance are Precision, Recall, and
F-score (also known as the F1 score), computed as:
precision = TP / (TP + FP) = (number of correctly matched) / (total number of extracted)

recall = TP / (TP + FN) = (number of correctly matched) / (total number of ground truth)

F = (2 × precision × recall) / (precision + recall)
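These measures can be computed directly from the sets of extracted and ground-truth phrases. A minimal sketch of exact-match evaluation, assuming both lists already hold stemmed phrase strings:

```python
def evaluate(extracted, ground_truth):
    """Exact-match Precision, Recall and F1 for one document.

    Both inputs are assumed to be stemmed phrase strings, so that e.g.
    "neural networks" and "neural network" have already been conflated.
    """
    tp = len(set(extracted) & set(ground_truth))  # correctly matched
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = evaluate(["neural network", "corpus", "graph"],
                   ["neural network", "graph", "topic model", "ranking"])
# 2 of 3 extracted phrases are correct (p = 2/3); 2 of 4 ground-truth
# keyphrases are found (r = 1/2); f is their harmonic mean, 4/7.
```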
2.1.8 Applications
Keyphrases are also referred to as topic representative terms, topical terms, or seman-
tically significant phrases, which benefit many downstream NLP tasks, such as text
summarisation, document clustering and classification, information retrieval, and entity
recognition.
In text summarisation, keyphrases can indicate the important sentences in which they
occur. D’Avanzo and Magnini [106] present a text summarisation system – LAKE,
which demonstrates that an ordered list of relevant keyphrases is a good representation
of the document content. Other work, including [107–109], also generates summaries
based on keyphrases, integrating the keyphrase identification step with their
summarisation techniques.
Keyphrases are also topic representative phrases that offer significant benefits to docu-
ment classification and clustering. Hulth and Megyesi [110], and Kim et al. [111] show
the use of keyphrases generally improves the efficiency and accuracy of document clas-
sification systems. Hammouda et al. [112] introduce CorePhrase, an algorithm for topic
discovery using keyphrases extracted from multi-document sets and clusters. Zhang et
al. [113] cluster web page collections using automatically extracted topical terms that
are essentially keyphrases.
Keyphrases also support document indexing, which assists users in formulating queries
for search engines. For example, Wu et al. [114] propose enriching metadata of the
returned results by incorporating automatically extracted keyphrases of documents with
each returned hit.
Other studies include Mihalcea and Csomai [115], who use keyphrases to link web doc-
uments to Wikipedia articles; Ferrara et al. [116], who use keyphrases to produce richer
descriptions of documents as well as user interests, attempting to enhance the accessibility
of scientific digital libraries; and Rennie and Jaakkola [117], who treat keyphrases as
topic oriented and informative terms to identify named entities.
2.1.9 Similar Tasks to AKE
Many downstream NLP tasks, such as automatic term extraction and keyword extraction,
share a number of commonalities with AKE. However, they are subtly different.
2.1.9.1 Automatic Domain-specific Term Extraction
Automatic domain-specific term extraction, also known as term recognition, term iden-
tification, or terminology mining [118], is a task that automatically identifies domain-
specific technical terms from relatively large corpora. Domain specific terms are stable
lexical units such as words or multi-word phrases, which are used in specific contexts
to represent domain-related concepts [119]. Extracting domain-specific terms is an im-
portant and essential step for ontology learning [120]. This subsection briefly reviews
some common techniques for domain-specific term extraction, which also serves as the
background for Section 5.
Domain-specific terms (or terms for short) share many characteristics with keyphrases,
yet there are subtle differences between them. From a linguistic perspective, terms are
similar to keyphrases: both need to be semantically meaningful and syntactically
acceptable expressions that may consist of one or more words. Terms correspond to tech-
nical entities in a domain. Hence, from a statistical point of view, terms intuitively have
high frequencies, being referenced by a large number of articles. Based on these sim-
ilarities, some AKE approaches can be applied to term extraction, such as TF-IDF [2].
Newman et al. [121] present a Dirichlet Process Segmentation model for both AKE and
term extraction. More recently, Liu et al. [54] applied TextRank to identify medical
terms.
The main difference between term extraction and AKE, however, is the level of scope.
Terms describe concepts and technical elements, which are properties of a specific do-
main. Hence, the task of term extraction aims to identify whether a term is relevant or
important to the domain of a document collection, which essentially is at the level of the
entire corpus. On the other hand, keyphrases present the main ideas or arguments of a
document, which describe the document at a highly abstract level. Therefore, keyphrases
need to be identified for each document, thus the task is at the level of documents.
Research in term extraction has a long history. Firstly, we summarise the work presented
by Kageura and Umino [122], which surveys early studies up to 1996. Kageura
and Umino classified these studies into two groups: linguistic and statistical approaches.
Linguistic approaches focus on applying linguistic knowledge such as analysing syntac-
tical information to identify noun phrases as terms. The most common approach along
this line is to use pre-defined POS patterns [123–125] or heuristics [126, 127]. Statis-
tical approaches focus on developing weighting schemes that measure the likelihood of
terms using statistical information such as term frequencies or co-occurrence frequencies.
Kageura and Umino classified statistical approaches into two further groups based on
two important concepts introduced in their work, namely unithood and termhood [122].
The unithood refers to the measure of how likely a sequence of words is to form a
syntactically correct and semantically meaningful term. From this point of view, the
unithood measure can be thought of as a pre-processing step that identifies valid phrases
for further processing. Early studies for measuring unithood include [124, 128–130]. The termhood
refers to the degree of relevance to a specific domain, which is the actual measure for
identifying terms. Early studies along this line include [131–133].
However, in later research, many studies have employed both linguistic and statistical
techniques. For example, Frantzi et al. [56] use a pre-defined POS pattern to identify
noun phrases; the identified noun phrases become candidate terms for further processing
using a statistical-based approach. Hence, these approaches can be classified as neither
purely linguistic nor purely statistical. The commonality is that they do not
use any training data, and thus can be regarded as unsupervised approaches. On the other
hand, applying supervised machine learning algorithms to term extraction has also been
recently proposed [134–136]. Based on this, we group more recent studies into two
streams: supervised machine learning and unsupervised approaches. The main difference
between supervised machine learning and unsupervised approaches is the core algorithms
used for identifying domain-specific terms. Supervised approaches aim to train classifiers
using labelled data, whereas unsupervised approaches rely on weighting schemes that
measure termhood.
Supervised machine learning approaches treat term extraction as either a binary
classification or ranking problem. In comparison to unsupervised approaches, applying
supervised machine learning algorithms to term extraction is a less popular choice due
to the fact that constructing labelled training data is difficult and laborious. Another
possible reason might be that term extraction is performed over the entire corpus, hence
it has stronger statistics but fewer local features (e.g. the occurrence positions used
in AKE), which is advantageous to unsupervised approaches.
Most supervised approaches employ a processing pipeline, consisting of 1) candidate
term identification, 2) feature selection, and 3) term classification. The first step is to
identify candidates, which has the same goal as AKE. Hence, most pre-processing tech-
niques in AKE described in Section 2.1.2 are also applicable to term extraction, among
which using pre-defined POS patterns remains the most popular choice [134–136].
Statistical and linguistic features are commonly used in supervised term extraction. For
example, Spasic et al. [137] proposed a classification approach based on verb complemen-
tation patterns, and Foo [134, 138] uses POS tags, morph-syntactic descriptions, and
grammatical functions combined with statistical information to train classifiers. After
feature engineering, candidate terms are mapped from their symbolic representations
to vectors that are the inputs to the supervised machine learning algorithms. Conrado
et al. [136] applied Rule Induction, Naïve Bayes, Decision Tree, and Sequential Minimal
Optimization algorithms; Foo and Merkel used an existing rule induction learning system named
Ripper [139]. More recently, Fedorenko et al. [140] present an experimental evaluation,
in which Random Forest, Logistic Regression, and Voting algorithms are compared. The
work shows that Logistic Regression and Voting deliver the best performance on two
datasets.
Unsupervised term extraction scores and ranks terms, then the top ranked ones
are extracted as domain-specific terms. Unsupervised term extraction involves measures
of unithood and termhood. The unithood measure is mainly based on statistical
evidence, such as term frequencies, to identify whether a sequence of words occurs in
patterns or by chance. Some of the well-known unithood measures include log-likelihood
ratio [141], t-test, and pointwise mutual information (PMI) [67]. More recently, Wong
et al. [57] present a probabilistic unithood measure framework that uses both linguistic
features and the statistical evidence from web search engines. It is worth noting that
the unithood measure is not popular in AKE. One reason is that unithood requires a
large dataset – the entire corpus – to work well. On the other hand, most AKE
approaches only use statistics within each document, which provide much less statistical
information.
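For instance, the PMI unithood measure can be estimated directly from corpus counts: PMI(x, y) = log( P(x, y) / (P(x) P(y)) ), which is high when two words appear together far more often than chance predicts. The statistics below are invented for illustration:

```python
import math
from collections import Counter

def pmi(bigram, unigram_counts, bigram_counts, n):
    """Pointwise mutual information of a two-word candidate term,
    estimated from unigram and bigram counts over n tokens."""
    x, y = bigram
    p_xy = bigram_counts[bigram] / n
    p_x = unigram_counts[x] / n
    p_y = unigram_counts[y] / n
    return math.log(p_xy / (p_x * p_y))

# Toy corpus statistics: "neural network" nearly always occurs as a unit,
# whereas "the network" co-occurs no more often than chance.
n = 1000
unigrams = Counter({"neural": 12, "network": 15, "the": 400})
bigrams = Counter({("neural", "network"): 10, ("the", "network"): 5})

score_term = pmi(("neural", "network"), unigrams, bigrams, n)
score_nonterm = pmi(("the", "network"), unigrams, bigrams, n)
# score_term is strongly positive; score_nonterm is near or below zero.
```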
The unithood measure can be thought of as a pre-processing technique, mainly used for
measuring the likelihood that a sequence of words constitutes a term. Termhood,
on the other hand, measures the degree of relevance to a specific domain for a
given candidate term. Most techniques for termhood measure are based on two types of
resources: 1) statistical information from the local corpora, and 2) information from ex-
ternal contrastive corpora or knowledge bases. In fact, the development of unsupervised
algorithms for term extraction can be thought of as a journey of seeking and employing
resources.
Using statistical information supported only by the local corpora is the fundamental and
simplest approach, because it does not require any external resource that may not always
be available. The most well-known approach using statistics is TF-IDF [2], which can be
either applied directly to extract terms, or as a derived feature for further computation.
Kim et al. [142] proposed to use TF-IDF to extract domain-specific terms. They use
TF-IDF to compute domain specificity, a notion introduced in their work based on the
idea that domain-specific terms occur much more frequently in a particular domain than
in others. Navigli and Velardi [143] use TF-IDF as a feature for measuring
terms’ distribution over the entire corpus.
Another popular approach using only local statistics is C-value [56]. It is specifically
designed to distinguish a term from its longer versions – super-strings of the term,
e.g. neural network versus novel neural network – based on statistics of term occurrences. Concretely,
C-value uses statistical information of a phrase a from four aspects: 1) the occurrence
frequency of a in the corpus, 2) the number of times that a appears as a part of other
longer phrases, 3) the number of the longer phrases that contain a, and 4) the number
of words that a contains. The score of a phrase a is computed as:

C-value(a) = log₂|a| × f_a    if a does not appear in any longer phrase
C-value(a) = log₂|a| × ( f_a − (1 / P(T_a)) Σ_{b ∈ T_a} f(b) )    otherwise

where f_a is the frequency of phrase a, |a| is the number of words that a contains, T_a is the
list of longer phrases that contain a, and P(T_a) is the size of T_a (the number of elements in
T_a). In this way, the score of the candidate is reduced if it is a part of other candidates.
However, C-value is designed to recognise only multi-word terms. Barron-Cedeno
et al. [144] generalise C-value to handle single word terms by adding a constant to the
logarithm. Other algorithms for term extraction derived from C-value include [145–147].
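The multi-word C-value can be implemented directly from the definition above. Toy frequency counts are used here, and `nested_in` is a hypothetical helper structure mapping each phrase to the longer phrases that contain it:

```python
import math

def c_value(phrase, freq, nested_in):
    """C-value of a multi-word phrase.

    freq: occurrence counts per phrase.
    nested_in: maps a phrase to the longer phrases containing it (T_a).
    """
    words = phrase.split()
    longer = nested_in.get(phrase, [])
    if not longer:
        # Phrase never appears inside a longer candidate.
        return math.log2(len(words)) * freq[phrase]
    # Discount by the average frequency of the containing phrases.
    discount = sum(freq[b] for b in longer) / len(longer)
    return math.log2(len(words)) * (freq[phrase] - discount)

freq = {"neural network": 20, "novel neural network": 4,
        "deep neural network": 6}
nested_in = {"neural network": ["novel neural network", "deep neural network"]}

score = c_value("neural network", freq, nested_in)
# log2(2) * (20 - (4 + 6) / 2) = 15.0
```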
Later studies have also attempted to use contrastive corpora (also called reference
corpora in some of the literature) in addition to the local ones to provide extra informa-
tion [148–152]. A contrastive corpus is a document collection that usually contains texts
from more general or contrastive fields than the target corpus from which domain-specific
terms are extracted. The purpose of using contrastive corpora is to compare the distri-
butions of words in both target and contrastive corpora, from which potential terms are
inferred. For example, Gelbukh et al. [151] assume that a word or phrase that appears
much more frequently in the target corpus is more likely to be a domain-specific term.
Basili et al. [148] proposed a contrastive approach that relies on a cross-domain statistical
measure, using a target corpus consisting of 1,400 documents from the Italian Supreme
Court and a contrastive corpus consisting of 6,000 news articles. The proposed measure,
called contrastive weight, is based on Inverse Word Frequencies [153] – a variant of Inverse
Document Frequencies, which measures different distributions of terms throughout a set
of given topics. This contrastive weight for an individual word a in target domain d is
defined as:

CW(a) = log f(a_d) × log( Σ_j Σ_i F_ij / Σ_j f(a_j) )

where f(a_d) is the frequency of a in d, Σ_j Σ_i F_ij is the total frequency of all words in both
the target and contrastive corpora, and Σ_j f(a_j) is the frequency of a in all corpora. The
score of a multi-word term t is then computed as:

CW(t) = log f(t_d) × CW(t_h)

where f(t_d) is the frequency of t in d, and t_h is the head noun of the term.
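The two contrastive-weight formulas above can be computed directly; the corpus counts below are hypothetical, chosen only to illustrate a word that is frequent in the target domain but rare overall:

```python
import math

def contrastive_weight(f_target, f_all_words, f_word_all):
    """CW(a) = log f(a_d) * log( total word frequency in all corpora
    / frequency of a across all corpora )."""
    return math.log(f_target) * math.log(f_all_words / f_word_all)

def cw_multiword(f_term_target, cw_head):
    """CW(t) = log f(t_d) * CW(t_h), with t_h the head noun of t."""
    return math.log(f_term_target) * cw_head

# Hypothetical counts: the head noun "appeal" occurs 120 times in the
# legal target corpus but only 300 times in 1,000,000 tokens overall.
cw_head = contrastive_weight(f_target=120,
                             f_all_words=1_000_000, f_word_all=300)
score = cw_multiword(f_term_target=35, cw_head=cw_head)
# The multi-word term inherits and amplifies its head noun's weight.
```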
In later work, Wong et al. [150] proposed a probabilistic model for measuring termhood,
namely the odds of termhood (OT). The authors first identified seven characteristics
of terms, from which they developed the model using Bayes’ Theorem, as:
P(R₁|A) = P(A|R₁) P(R₁) / P(A)

where R₁ is the event that a is relevant to the domain, and A is the event that a is
a candidate term represented as a vector of feature values derived from the identified
characteristics. The authors assumed that the probability of candidate a being relevant
to both the target domain and the contrastive corpora is approximately 0, i.e.
P(R₁ ∩ R₂) ≈ 0, where R₂ is the event that candidate a is relevant to other domains.
The odds of a candidate term a being relevant to the domain are computed as:

O(a) = P(A|R₁) / (1 − P(A|R₁))

The OT of a term a is computed as:

OT(a) = log( P(A|R₁) / P(A|R₂) )
Using online resources for term extraction has also been proposed. For example, Wong et al. [57]
present a probabilistic unithood measure framework that uses both linguistic features
and statistical evidence from web search engines, where the authors employed page
counts from the Google search engine to calculate the dependency between words. Each
constituent word in a term is formulated as a query to the Google search engine, and the
page count returned for each word is used to calculate the mutual information. Dobrov
and Loukachevitch [154] also proposed to use search engines to provide extra features.
However, instead of using page counts, the authors analyse snippets (short fragments
of text explaining search results) returned by the search engine. Studies have also
attempted to use Wikipedia – one of the largest online knowledge repositories – as
external knowledge. For example, Vivaldi and Rodriguez [155] use Wikipedia as an
external corpus, building ontological categories for the domain by traversing the
Wikipedia category graph.
2.1.9.2 Other Similar Tasks
Keyword extraction can be thought of as a sub-task of keyphrase extraction, which only
focuses on extracting unigram keywords. Theoretically, automatic keyword extraction
algorithms can be applied to keyphrase extraction task as long as one treats phrases in
the same way as unigram words. In practice, however, it is not always possible. For
example, Ercan and Cicekli [61] present a lexical chain model that requires the knowledge
of candidate senses and semantic relations between candidates. However, they have only
Chapter 2. Background 28
focused on extracting keywords rather than keyphrases, because keyphrases are usually
domain-dependent phrases that are not present in WordNet.
The most similar task to AKE is automatic keyphrase assignment, which aims to select
keyphrases for a document from a given set of vocabulary [156]. Given a fixed set of
keyphrases as the ground-truth, the task is to tag each document in a corpus with
its corresponding ground-truth keyphrases. On the other hand, AKE aims to identify
keyphrases from the content of a document without any given ground-truth.
Automatic text summarisation is another task closely related to AKE; it aims to
identify information significant enough to summarise a given document.
The major difference is that text summarisation aims to produce syntactically valid
sentences, rather than just a list of phrases.
Other related tasks include named entity recognition and term extraction. Named entity
recognition aims to extract specific types of information, such as the athletes,
teams, leagues, scores, locations, and winners from a set of sport articles [157].
2.2 Learning Representations of Words and
Their Compositionality
Machine learning requires data representations. Unlike fields such as image processing,
where the data are naturally encoded as vectors of individual pixel intensities,
the data in NLP are sequences of words. Words are the fundamental building
blocks of a language; thus, the first step in all downstream NLP tasks is to obtain
vector representations of words that serve as inputs to machine learning algorithms. Very
often, researchers also need representations beyond word level, i.e. multi-word phrases,
sentences, or even documents, to perform specific tasks such as Machine Translation or
Sentence Classification. Learning representations beyond word level requires not only
the understanding of each word, but also learning the combination rules, which remains
a great challenge to the NLP community. This section provides a comprehensive review
of representation learning for unigram words and multi-word expressions. Specifically,
we focus on what representation learning is and why it is important, how the
representations are induced, and where they can be applied.
Sentence 1: The cat is running in the bedroom. Sentence 2: The dog is walking in the kitchen.

the      1 0 0 0 0 0 0 0 0
cat      0 1 0 0 0 0 0 0 0
dog      0 0 1 0 0 0 0 0 0
is       0 0 0 1 0 0 0 0 0
running  0 0 0 0 1 0 0 0 0
walking  0 0 0 0 0 1 0 0 0
in       0 0 0 0 0 0 1 0 0
bedroom  0 0 0 0 0 0 0 1 0
kitchen  0 0 0 0 0 0 0 0 1

Figure 2.1: One-hot Vector Representations: words with 9 dimensions
2.2.1 Word Representations
2.2.1.1 Atomic Symbols and One-hot Representations
In early rule-based and statistical NLP systems, words are treated as discrete atomic
symbols, i.e. words are represented in their lexical forms. Atomic symbols offer natural
representations of words, allowing researchers to use regular expressions or morphological
analysis to tackle NLP tasks at the lexical level, such as text searching [158], parsing [159,
160], or stemming [161]. In machine learning, atomic symbols are converted into vector
representations, where each symbol becomes an identifier referring to a feature vector whose
length equals the size of the vocabulary and in which only one dimension is asserted to
indicate the word's ID, namely a one-hot representation. Figure 2.1 shows an example of one-
hot representations of words for a vocabulary of size 9.
However, neither atomic symbols nor one-hot vectors are capable of providing semantic
information about words, and thus a system built on such representations has no
understanding of the meanings of words. Specific to one-hot representations, machine
learning algorithms suffer from the curse of dimensionality and data sparsity when the
vocabulary is large, since the dimension of the vectors increases with the vocabulary's
size while most feature values are zeros.
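The construction can be sketched in a few lines (a toy illustration using the vocabulary of Figure 2.1; the helper name is ours):

```python
def one_hot(word, vocabulary):
    """Return a one-hot vector for `word`: all zeros except a single 1
    at the word's index in the vocabulary."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

vocabulary = ["the", "cat", "dog", "is", "running",
              "walking", "in", "bedroom", "kitchen"]

print(one_hot("cat", vocabulary))  # [0, 1, 0, 0, 0, 0, 0, 0, 0]
```

Note that the vector length grows linearly with the vocabulary, which is exactly the sparsity problem described above.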
2.2.1.2 Vector-space Word Representations
The idea of vector-space representations of words is to represent words in a continuous
vector space where words with similar semantics are mapped close together. The learnt
representations are low-dimensional, real-valued vectors, which naturally overcome the
curse of dimensionality and data sparsity problems. Learning such representations is an
essential step in modern NLP, benefiting a number of downstream NLP tasks, such
as speech recognition [162, 163], machine translation [17, 97], and sentiment classifica-
tion [20, 26].
The vast majority of approaches to learning vector-space representations are based on the
distributional hypothesis [164], i.e. words tend to have similar meanings if they appear
in the same contexts. Recent studies have demonstrated that the learnt representations
not only encode statistical and lexical properties of words, but also partially encode
linguistic and semantic knowledge. For example, word vector calculations show that
vec(king) − vec(man) + vec(woman) ≈ vec(queen), and
vec(apple) − vec(apples) ≈ vec(car) − vec(cars) [13].
The vector-space representations of words can be induced from two approaches: the dis-
tributional and distributed approaches. The distributional approaches induce word rep-
resentations using word-context count-based models by applying mathematical functions
(e.g. matrix decomposition algorithms) to reduce the dimensions of word co-occurrence
matrices. On the other hand, distributed approaches use word-context prediction-based
models, which compute the word probability mass function over the training corpus.
The induced word representations are also named word embeddings. In the following
two sections, we illustrate the fundamental ideas of each approach using a toy dataset
that only consists of two sentences: The cat is running in the bedroom, and The
dog is walking in the kitchen.
2.2.1.3 Inducing Distributional Representations using Count-based Models
Count-based models learn distributional representations by applying dimensionality
reduction algorithms to word-context matrices [165]. Concretely, let M ∈ R^{|W|×|C|} be
a word-context matrix, where |W| is the size of the vocabulary W and |C| is the size
of the context set C. Each row in M corresponds to the representation of a word,
while the columns of M depend on the type of context of interest. For
example, if the interest is in analysing how words co-occur with each other, then the context is
words, and M is constructed as an adjacency (word-word co-occurrence) matrix, as shown in
Figure 2.2 (A). If the interest is in analysing how words occur in sentences, then the context
becomes each sentence, and M is a word-sentence matrix, as shown in Figure 2.2 (B).
Count-based models aim to reduce |C| by looking for a function g that maps M to a
matrix M′ ∈ R^{|W|×d}, where d < |C|.
The most common approach is to factorize M to yield a lower-dimensional M ′, where
each row becomes the new vector representation for each word. The most well-known
algorithm is Latent Semantic Analysis (LSA) introduced by Deerwester et al. [166] who
(A) word-word co-occurrence matrix:

         the cat dog is running walking in bedroom kitchen
the       2   1   1   2    1      1      2    1      1
cat       1   0   0   1    1      0      1    1      0
dog       1   0   0   1    0      1      1    0      1
is        2   1   1   0    1      1      2    1      1
running   1   1   0   1    0      0      1    1      0
walking   1   0   1   1    0      0      1    0      1
in        2   1   1   2    1      1      0    1      1
bedroom   1   1   0   1    1      0      1    0      0
kitchen   1   0   1   1    0      1      1    0      0

(B) word-sentence matrix:

         Sentence 1  Sentence 2
the          2           2
cat          1           0
dog          0           1
is           1           1
running      1           0
walking      0           1
in           1           1
bedroom      1           0
kitchen      0           1

Figure 2.2: Co-occurrence matrices of two sentences. (A): word-word co-occurrence matrix, (B): word-document co-occurrence matrix.
Figure 2.3: Distributional Representations of Words in 2 dimensions induced from SVD over the toy dataset. (A): Representations from the word-word matrix, (B): Representations from the word-document matrix.
apply Singular Value Decomposition (SVD) to the matrix M. SVD learns latent structures
of M ∈ R^{|W|×|C|} by decomposing it into M = USV^T. U is a |W| × |W| unitary
matrix of left singular vectors, where the columns of U are orthonormal eigenvectors of
MM^T. V is a |C| × |C| unitary matrix of right singular vectors, where the columns of
V are orthonormal eigenvectors of M^T M. S is a diagonal matrix containing the singular
values, i.e. the square roots of the eigenvalues of MM^T (equivalently, of M^T M). Keeping
the d largest singular values in S, we obtain the rank-d approximation M′ = U_d S_d V_d^T.
We apply SVD to the toy dataset to show how SVD reduces the number of dimensions
while preserving a structure similar to the original representations. Figure 2.3 plots
the 2-dimensional word vectors induced using SVD from the matrices in Figure 2.2, where we
take the sub-matrix M′ ∈ R^{|W|×2} (the first two columns of U) as the representations
of words. The 2-dimensional distributional word representations induced from the toy dataset
still capture co-occurrence patterns and basic semantic similarities; for example, the words
cat and dog have very similar representations. In practice, the low-dimensional representations of words can
be compared using the cosine similarity measure between any two vectors. Values close
to 1 indicate that the two words have almost the same meaning, while values close to 0
represent two very dissimilar words.
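This pipeline — build the co-occurrence matrix, truncate the SVD, compare words with cosine similarity — can be sketched with numpy (our own illustration over the toy matrix of Figure 2.2 (A)):

```python
import numpy as np

# Word-word co-occurrence matrix for the toy corpus (Figure 2.2 (A)).
words = ["the", "cat", "dog", "is", "running",
         "walking", "in", "bedroom", "kitchen"]
M = np.array([
    [2, 1, 1, 2, 1, 1, 2, 1, 1],
    [1, 0, 0, 1, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 1, 1, 0, 1],
    [2, 1, 1, 0, 1, 1, 2, 1, 1],
    [1, 1, 0, 1, 0, 0, 1, 1, 0],
    [1, 0, 1, 1, 0, 0, 1, 0, 1],
    [2, 1, 1, 2, 1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1, 0, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 1, 0, 0],
], dtype=float)

# SVD: M = U S V^T; keep only the top d = 2 dimensions.
U, S, Vt = np.linalg.svd(M)
vectors = U[:, :2] * S[:2]          # rank-2 word representations

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# cat and dog have near-identical co-occurrence profiles, so their
# reduced vectors should point in a similar direction.
sim = cosine(vectors[words.index("cat")], vectors[words.index("dog")])
print(sim)
```

With only two sentences the geometry is crude, but the similarity between cat and dog is already clearly positive, as the text above describes.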
Following LSA, subsequent studies have attempted different matrix factorisation
approaches, including Nonnegative Matrix Factorisation (NMF) [167], Probabilistic Latent
Semantic Indexing (PLSI) [168], Principal Components Analysis, and Iterative Scaling
(IS) [169]. However, since a word-context matrix can be very high-dimensional and
sparse, storing the matrix is memory-intensive, and running the algorithms can be
computationally expensive for large vocabularies.
More recently, Pennington et al. [170] present a model named GloVe that factorises the
logarithm of a word-word co-occurrence matrix. Specifically, let M denote the matrix, w
be a word, and m_w be its corresponding d-dimensional vector. The co-occurring
context word of w and its vector are denoted as c and m_c. GloVe learns parameters
m ∈ R^{|W|×d} such that, for each w and c:

log M(w, c) ≈ m_w · m_c + b_w + b_c

where b_w and b_c are additional biases for w and c. Unlike other matrix factorisation
approaches that decompose the entire sparse matrix, GloVe leverages statistical
information by training only on the nonzero elements of the matrix, which efficiently reduces
the computational cost.
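A minimal SGD sketch of this weighted least-squares objective (the co-occurrence counts, learning rate, and x_max below are hypothetical; the original implementation uses AdaGrad, but the weighting function f(x) = min((x/x_max)^α, 1) is from the GloVe paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical nonzero co-occurrence counts X(w, c) for a 3-word vocabulary.
pairs = [(0, 1, 3.0), (1, 0, 3.0), (0, 2, 1.0), (2, 0, 1.0)]

V, d, lr = 3, 2, 0.05
W = rng.normal(scale=0.1, size=(V, d))   # word vectors m_w
C = rng.normal(scale=0.1, size=(V, d))   # context vectors m_c
bw = np.zeros(V)                         # word biases b_w
bc = np.zeros(V)                         # context biases b_c

def weight(x, x_max=3.0, alpha=0.75):
    # GloVe's weighting function: down-weights rare pairs and caps
    # the influence of very frequent ones.
    return min((x / x_max) ** alpha, 1.0)

for epoch in range(500):
    for w, c, x in pairs:
        err = W[w] @ C[c] + bw[w] + bc[c] - np.log(x)
        g = weight(x) * err
        gw, gc = g * C[c], g * W[w]      # cache gradients before updating
        W[w] -= lr * gw
        C[c] -= lr * gc
        bw[w] -= lr * g
        bc[c] -= lr * g

# The trained model's prediction approaches log X(w, c).
print(W[0] @ C[1] + bw[0] + bc[1], np.log(3.0))
```

Only the observed (nonzero) pairs ever enter the loop, which is the efficiency gain the paragraph above describes.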
2.2.1.4 Learning Word Embeddings using Prediction-based Models
Word embeddings were originally proposed to overcome the curse of dimensionality problem
in probabilistic language modelling [171]. Mathematically, a probabilistic language
model computes a probability distribution P over sequences of words W, which indicates
the likelihood of W being a valid sentence. Let the occurrence probability of a
dicates the likelihood of W being a valid sentence. Let the occurrence probability of a
sequence of words be:
P (W ) = P (w1, w2, ..., wt−1, wt)
Applying the Chain Rule of Probability, we have:
P (w1, w2, ..., wt−1, wt) = P (w1)P (w2|w1)...P (wt|w1, w2, ..., wt−1)
Thus, the probability of the whole sequence is a product of conditional probabilities, each
conditioned on all preceding words:

P(w_1, w_2, ..., w_t) = ∏_{i=1}^{t} P(w_i | w_1^{i−1})
where w_i^j denotes the sequence of words (w_i, w_{i+1}, ..., w_{j−1}, w_j) from position i to j. However, this
calculation is computationally expensive for large datasets. The N-gram model is introduced
to reduce the computational cost by taking advantage of word order, considering
only the combinations of the last n − 1 words:

P(w_t | w_1^{t−1}) ≈ P(w_t | w_{t−n+1}^{t−1})
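Under this approximation, an n-gram model can be estimated by simple counting; a minimal bigram (n = 2) maximum-likelihood sketch over the toy corpus:

```python
from collections import Counter

corpus = [
    "the cat is running in the bedroom".split(),
    "the dog is walking in the kitchen".split(),
]

bigrams = Counter()
unigrams = Counter()
for sentence in corpus:
    for prev, cur in zip(sentence, sentence[1:]):
        bigrams[(prev, cur)] += 1
        unigrams[prev] += 1

def p(cur, prev):
    """Maximum-likelihood estimate P(cur | prev) = count(prev, cur) / count(prev)."""
    return bigrams[(prev, cur)] / unigrams[prev]

print(p("cat", "the"))  # "the" is followed by cat/dog/bedroom/kitchen once each
```

Counting is cheap, but the estimates are sparse and carry no notion of word similarity, which motivates the neural model discussed next.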
Based on the N-gram model, Bengio et al. [171] introduce the neural probabilistic language
model, demonstrating how word embeddings can be induced. The objective is to
learn a function f that estimates the probability P using a feed-forward neural network:
f(w_t, ..., w_{t−n+1}) = P(w_t | w_{t−n+1}^{t−1})
The network features three layers – a projection layer, a hidden layer, and a softmax
output layer – as shown in Figure 2.4. Let V be the vocabulary of the training set,
containing words w_1, w_2, ..., w_{|V|} where w_i ∈ V. Let C represent the set of word
embedding vectors of dimension d, so C(w) ∈ R^d, and C has |V| × d free parameters
to learn. Let a function g take a sequence of embedding vectors C(w_{t−n+1}), ..., C(w_{t−1})
as input and map it to a conditional probability distribution over the words in V for the next
word w_t. The output of g is the estimated probability P(w_t = i | w_1^{t−1}):

f(i, w_{t−1}, ..., w_{t−n+1}) = g(i, C(w_{t−1}), ..., C(w_{t−n+1}))
The function f consists of two mappings, C and g. C is a collection of word embedding
vectors; the projection layer is a concatenation of the input embeddings of each word, obtained via a
look-up table. The function g is implemented using a feed-forward neural network with
one hidden layer, and the output layer is a softmax layer:

P(w_t | w_{t−n+1}^{t−1}) = e^{y_{w_t}} / Σ_{i=1}^{|V|} e^{y_i}

where y_i is the unnormalised log-probability for each output word i, computed as:

y = b_o + Wx + U tanh(b_h + Hx)
where bo is the bias unit of the output layer, and bh is the hidden layer bias. W is
the weight matrix if there is a direct connection between the projection layer and the
output layer. U is the weight matrix between the hidden layer and the output layer,
and H is the weight matrix between the projection layer and the hidden layer. The
overall parameters to be learnt are θ = (bo,W,U, bh, H,C). The learning objective is
Figure 2.4: Neural Probabilistic Language Model: a look-up table C maps word IDs to embedding vectors, which are concatenated in the projection layer, passed through a hidden layer (weights H, U), and fed to a softmax output layer.
to maximise the probability P(w_t | w_{t−n+1}^{t−1}; θ) by looking for the parameters θ. Using the
log-likelihood function, the loss L can be written as:

L = (1/T) Σ_{t=1}^{T} log f(w_t, w_{t−1}, ..., w_{t−n+1}; θ) = (1/T) Σ_{t=1}^{T} log P(w_t | w_{t−n+1}^{t−1}; θ)
Using Stochastic Gradient Ascent, θ is updated as:

θ := θ + ε ∂ log P(w_t | w_{t−1}, ..., w_{t−n+1}) / ∂θ
where ε is the learning rate. Figure 2.5 plots the embedding vectors induced from the
network by training over the toy dataset. The training iterates 1,000 times due to the
small size of the corpus. Each word embedding vector has only 2 dimensions. Similar to
the distributional representation, the word embeddings also capture the co-occurrence
patterns. For example, bedroom and kitchen share very similar values.
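A forward pass of this network can be sketched as follows (random, untrained weights; we omit the optional direct projection-to-output connection W, i.e. set it to zero, and the layer sizes are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

V, d, n, h = 9, 2, 4, 8   # vocab size, embedding dim, n-gram order (3 context words), hidden units
C = rng.normal(scale=0.1, size=(V, d))            # embedding look-up table
H = rng.normal(scale=0.1, size=(h, (n - 1) * d))  # projection-to-hidden weights
U = rng.normal(scale=0.1, size=(V, h))            # hidden-to-output weights
b_h = np.zeros(h)                                 # hidden bias
b_o = np.zeros(V)                                 # output bias

def forward(context_ids):
    """Estimate P(w_t | context): look up and concatenate the context
    embeddings, apply the tanh hidden layer, then a softmax output."""
    x = C[context_ids].reshape(-1)            # projection layer
    y = b_o + U @ np.tanh(b_h + H @ x)        # unnormalised log-probabilities
    e = np.exp(y - y.max())                   # numerically stabilised softmax
    return e / e.sum()

probs = forward([0, 1, 3])  # context word IDs, e.g. "the cat is"
print(probs.sum())          # a valid probability distribution sums to 1
```

Training then amounts to gradient ascent on log probs[w_t], which updates the look-up table C along with the network weights.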
In practice, when training the probabilistic language model over large datasets such as Wikipedia
or Google News, the softmax function can be very computationally expensive because
of the summation over the entire vocabulary V. To overcome this problem, Collobert
and Weston [172] present a discriminative and non-probabilistic model using a pairwise
ranking approach. The idea is to treat a sequence of words appearing in the corpus as
a positive sample, and a negative sample is constructed by replacing a word from the
Figure 2.5: Word Embedding Vectors: induced from the probabilistic neural language model trained over the toy dataset. Each vector only has 2 dimensions.
positive sample with a random one, simulating a pair that never appears in the corpus.
Given s as the positive example and s^w as the negative one, the learning objective is to
minimise the ranking criterion by looking for the parameters θ:

θ ↦ Σ_{s∈S} Σ_{w∈D} max(0, 1 − f(s) + f(s^w))

where S is the set of positive samples drawn from the training corpus and D is the
vocabulary from which the replacement words w are drawn.
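The hinge-style ranking criterion can be sketched as follows (f here is a stand-in for the network's scoring function, and the scores are hypothetical):

```python
def ranking_loss(f_pos, f_neg):
    """Pairwise hinge loss: pushes the score of a genuine word window
    above the score of a corrupted window by a margin of 1."""
    return max(0.0, 1.0 - f_pos + f_neg)

# Hypothetical scorer outputs.
print(ranking_loss(0.8, 0.1))  # positive loss: margin not yet satisfied
print(ranking_loss(2.0, 0.1))  # 0.0 loss: margin satisfied
```

Because the loss compares only two scores, no normalisation over the vocabulary is needed, which is what makes this approach cheaper than the full softmax.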
More recently, Mikolov et al. [14, 15] introduce two models, namely Continuous Bag of
Words (CBOW) and Continuous SkipGram; the work is also commonly referred to as
word2vec. Both models adopt a shallow network architecture designed specifically for learning
word embeddings. Similar to the classic neural probabilistic language model [171], both
models learn a function f that maps probability distributions over words in a corpus.
The CBOW model predicts a target word given its context words, whereas the
SkipGram model is the opposite of CBOW: it predicts the context words given the
target word. Neither model has a non-linear hidden layer. Instead, each word
w ∈ V is associated with a pair of vectors: one is the word embedding vector that is the
input to the network, and the other is an output vector. Let C be the set of all input
vectors for V, and C′ be the set of output vectors. Given a word w_t, using the softmax
function, the probability of choosing w_j is:

P(w_j | w_t; θ) = e^{C′(w_j)^T · C(w_t)} / Σ_{i=1}^{|V|} e^{C′(w_i)^T · C(w_t)}
The learning objective is to maximise the conditional probability distribution over the
vocabulary V in a training corpus D by looking for the parameters θ = (C, C′). In addition,
Mikolov et al. [15] use two optimisation techniques to further reduce the computational
cost, namely Hierarchical Softmax and Negative Sampling. Hierarchical softmax constructs
a binary tree over the vocabulary based on word frequencies, so the probability of choosing the
context word w_j given w_t can be computed as the product of the probabilities of the
branching decisions taken while walking down the tree from the root to the leaf of w_j. The idea of
negative sampling is similar to Collobert and Weston [172]'s model, rewarding positive samples
and penalising negative samples.
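The full-softmax probability above can be sketched directly (random untrained vectors; real implementations avoid this O(|V|) normalisation via hierarchical softmax or negative sampling):

```python
import numpy as np

rng = np.random.default_rng(0)

V, d = 9, 2
C = rng.normal(scale=0.1, size=(V, d))       # input (embedding) vectors
C_out = rng.normal(scale=0.1, size=(V, d))   # output vectors

def skipgram_prob(j, t):
    """P(w_j | w_t): softmax over dot products between the target's
    input vector and every output vector."""
    scores = C_out @ C[t]
    e = np.exp(scores - scores.max())
    return (e / e.sum())[j]

total = sum(skipgram_prob(j, 1) for j in range(V))
print(total)  # probabilities over the vocabulary sum to 1
```

The summation in the denominator touches every word in V; for a million-word vocabulary this is exactly the cost the two optimisation techniques above are designed to avoid.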
In the past few years, researchers have proposed various neural networks for learning word
embeddings. Mnih and Hinton [173] present a log-bilinear language model trained using
a Restricted Boltzmann Machine. Luong et al. [174] use a recursive neural network
to learn morphologically-aware word representations. Bian et al. [175] introduce
a knowledge-rich approach to learning word embeddings that leverages morphological,
syntactic, and semantic knowledge.
2.2.2 Deep Learning for Compositional Semantics
Languages are creative; thus, the meaning of a complex expression usually cannot be
determined by simply assembling the meanings of its constituent words. Although
word embeddings encode the meanings of individual words, they are not capable
of dynamically generating the meaning of multi-word phrases, sentences, or even
documents. Frege [25] states that the meaning of a complex expression is determined by
the meanings of its constituent words and the rules combining them, known as the
principle of compositionality. Hence, learning compositional semantics requires not only
understanding the meaning of words, but also learning the rules for combining them.
Mitchell and Lapata [176] present a general composition model, in which the meaning
of a phrase is derived from the meanings of its words in conjunction with
background knowledge and syntactic information. Formally, consider a bigram phrase
M consisting of two words m_1 and m_2. Let p denote the meaning of the phrase,
u and v denote the vector representations of m_1 and m_2, and K and R be the background
knowledge and syntactic information respectively; then we have:

p = f(u, v, K, R)
In recent years, researchers have investigated different types of deep neural networks that
implement the general composition model from different angles, including feed-forward,
recurrent, recursive, and convolutional networks.

Feed-forward networks learn compositional semantics using a similar approach to
learning probabilistic language models, i.e. given context words, they predict the
Figure 2.6: A Simple Recurrent Neural Network for Language Modelling.
target word. Mikolov et al. [15], in their well-known word2vec work, demonstrate how
phrase embeddings can be induced using the same network that learns word embeddings:
phrases are treated as atomic units, i.e. in the same way as unigram words. This
technique is also known as the holistic approach [177]. However, it requires all phrases
to be pre-identified, and it cannot generate representations for phrases that do not appear
in the training set. Moreover, the network has neither the mechanism nor the intention to
learn compositional rules.
Based on the word2vec model, Le and Mikolov [178] present a model, namely doc2vec,
that learns distributed representations of sentences and documents. Although
the model uses the same learning algorithm as in [15], the authors add an extra vector
representation for each sentence or paragraph acting as a memory that remembers the
topic of the content, which can be thought of as encoding the compositional rules of words.
Lau and Baldwin [179] present a study that empirically evaluates the quality of document
embeddings learnt by doc2vec, and demonstrate that the model performs robustly when
trained over large datasets.
Recurrent neural networks learn compositionality by recursively combining the
representation of an input word with those of its precedents, with the intuition that the meaning
of a complex expression can be assembled by sequentially combining the meaning of each
constituent word. The recurrent architecture allows the network to take inputs of
arbitrary lengths, naturally offering an advantage for learning representations of multi-word
expressions of various lengths. In addition, the network learns patterns in sequential
data, which implicitly encodes syntactic information.
The Elman network [180], a simple recurrent neural network, has been employed to learn
compositional semantics by many researchers [16, 181, 182]. Figure 2.6 shows a general
architecture5. The inputs to the network are word embeddings, and W, U, and V are
parameters shared across time steps. At time step t, the input x_t is the embedding
vector corresponding to the current word, and the output is the hidden state S_t = f(Ux_t + WS_{t−1}),
where f is an activation function such as the sigmoid or hyperbolic tangent. The hidden state
S_t composes the output of the previous state S_{t−1} with the newly input value x_t,
and thus represents the compositional meaning of an n-gram input. At any time step, the
network can optionally output y_t = g(VS_t), where g is an output function such as
softmax. This enables the network to produce multiple outputs, forming a sequence-to-sequence
architecture that allows the network to perform complex NLP tasks, such as
machine translation. Figure 2.7 shows common architectures, including multiple-to-one
output, one-to-multiple outputs, and multiple-to-multiple.
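One Elman-style recurrent step S_t = f(U x_t + W S_{t−1}) can be sketched as follows (random weights; the embedding and hidden sizes are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

d, h = 2, 4               # embedding size, hidden size
U = rng.normal(scale=0.5, size=(h, d))
W = rng.normal(scale=0.5, size=(h, h))

def step(x_t, s_prev):
    """One recurrent step: fold the new word embedding x_t
    into the running state s_prev."""
    return np.tanh(U @ x_t + W @ s_prev)

# Process a 3-word sequence of (random) embeddings.
s = np.zeros(h)
for x in rng.normal(size=(3, d)):
    s = step(x, s)

print(s.shape)  # the final state summarises the whole sequence
```

Because the same U and W are reused at every step, the state s can absorb a sequence of any length, which is the advantage described above.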
The application of more sophisticated recurrent networks has also been investigated.
Sutskever et al. [17] use a Long Short-Term Memory (LSTM) network with a multiple-to-multiple
architecture to train an English-to-French machine translation model.
The network maps an input sequence (English words) to the corresponding word
embedding vectors, and then decodes the vectors into the target sequence (French words). More
recently, Cho et al. [97] introduce a new type of recurrent network, namely the Gated
Recurrent Unit (GRU), a variant of the LSTM network that combines the forget gate
and input gate of the LSTM into a single update gate. The GRU network has an encoder
and a decoder, similar to the multiple-to-multiple architecture: the encoder
is responsible for encoding the input sequence into a composed vector representation,
and the decoder transforms the vector into the target sequence. They train the
network to perform different tasks, including machine translation, language modelling, and
learning phrase representations, demonstrating that the network is capable of learning
semantically and syntactically meaningful representations of phrases.
Recursive neural networks, introduced by Goller and Küchler [183], use
shared weights to compute outputs recursively over a structure by traversing it in
topological order. Because of this special recursive architecture, the network encodes
the structural combination of the input data. Socher and colleagues [18–20, 184] first
proposed using recursive neural networks to model compositional semantics. The key
idea is that the English language naturally has recursive tree structures that can be
5 We present a general architecture of the Elman network; there may be small variations in different studies depending on the actual task.
Figure 2.7: Common Recurrent Neural Network sequence-to-sequence architectures: (a) multiple to one, (b) one to multiple, (c) multiple to multiple.
Figure 2.8: Recursive Neural Network: fitting the structure of the English language (the parse tree of the cat is running in the bedroom).
captured by the recursive neural network. Figure 2.8 shows the architecture of a recursive
neural network that fits the tree structure of the sentence the cat is running in the
bedroom.
Socher et al. [18] use a simple recursive network to learn phrase representations for
syntactic parsing. The inputs to the network are embedding vectors of d dimensions, and
the network has shared weights W ∈ R^{d×2d}. At each recursive iteration, the network
takes two inputs x_1 and x_2 from the child nodes of a tree node and concatenates them. The
output is a composed vector p ∈ R^d representing the compositional meaning of the input
words: p = f(W[x_1; x_2]). The network recursively computes outputs by traversing
the tree. However, because this network has a small number of free parameters – it uses
shared weights for all recursions – it may not be able to encode enough information for
large datasets. In their later study, Socher et al. [19] propose using more parameters to
learn the semantic compositionality. Instead of having only one vector representation
Figure 2.9: Image Processing Convolution
for each word, the proposed Matrix-Vector Recursive Neural Network (MVRNN) model
uses a pair of a vector and a matrix to represent each word. The idea is to let the vectors
capture the meanings of words, while the matrices encode the rules for composing them.
Specifically, each word w is associated with a vector v_w ∈ R^d and a matrix M_w ∈
R^{d×d}. The compositional meaning of two input words w_1 and w_2 is computed as
p = f([v_{w1} M_{w2}; v_{w2} M_{w1}]), and the composed matrix for M_{w1} and M_{w2} is computed
as P = f([M_{w1}; M_{w2}]). Similarly, the computation traverses the entire tree to output
the overall compositional meaning of the input words. Socher et al. [20] further present a
recursive tensor model, which replaces the weight matrix W in the MVRNN model with
a third-order tensor, yielding the best performance of the three models.
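The basic composition p = f(W[x_1; x_2]) of the simple recursive network can be sketched as follows (random weights and random stand-in "embeddings"; the bracketed phrase structure is chosen for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 3
W = rng.normal(scale=0.5, size=(d, 2 * d))   # shared composition weights

def compose(x1, x2):
    """Compose two child vectors into one parent vector:
    p = f(W [x1; x2]) with f = tanh."""
    return np.tanh(W @ np.concatenate([x1, x2]))

# Compose the phrase ((the cat) (is running)) bottom-up.
the, cat, is_, running = rng.normal(size=(4, d))
np_phrase = compose(the, cat)        # noun phrase
vp_phrase = compose(is_, running)    # verb phrase
sentence = compose(np_phrase, vp_phrase)

print(sentence.shape)  # every node, leaf or internal, lives in R^d
```

Because leaves and internal nodes share the same dimensionality d, the same function compose can be applied recursively at every node of a parse tree.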
Following Socher's work, Zhao et al. [185] present an unsupervised general model
specifically for learning multi-word phrase embeddings. The model merges the recursive
tensor network [20] and the word2vec model [15], learning the compositional
meaning of a phrase by predicting its surrounding words. Irsoy and Cardie [186] propose
a deep recursive neural network constructed by stacking multiple recursive layers on top
of each other. The idea is to create a recursive network that not only has a deep
recursive structure, but also features depth in space. However, recursive networks rely
on syntactic parsers to work, which may be a drawback since an extra parsing step is required.
Convolutional neural networks (CNNs) are popular in image processing due to their
efficiency in capturing the location invariance and compositionality of image pixels. A
typical convolution process in image processing is shown in Figure 2.9, where filters
slide over local regions of an image to learn location invariance and compositionality.
However, the location feature does not apply to word embeddings. Therefore, instead of
filtering different regions, learning semantic compositionality uses a one-dimensional
convolution process. We classify the one-dimensional convolution process into two types:
embedding vector-wise convolution and embedding feature-wise convolution.
The embedding vector-wise convolution is more popular and intuitive; it treats
the features in each word embedding vector as a whole, assuming no dependencies between
the features of different word embedding vectors. Mathematically, the vector-wise
convolution, shown in Figure 2.10 (A), concatenates word embedding vectors into a
Figure 2.10: Word Embedding Convolution Process. (A): embedding vector-wise convolution, (B): embedding feature-wise convolution.
larger one-dimensional vector. The filters are weight vectors w ∈ R^{d×l}, where d is
the size of the word embedding vectors and l is the length of the window (each
convolution spans l words). A single convolution is computed as:
ci = f(wTxi:i+l−1 + b)
where x_{i:j} denotes the words from position i to j. The output is a vector c ∈ R^{n−l+1}, where n is the
length of the input. A more intuitive way to understand the vector-wise convolution
is to treat the concatenation process as stacking word embedding vectors into a matrix,
with the filters being matrices of dimension w ∈ R^{d×l}; all filters have
the same height d as the word embedding vectors, so the convolution process is
the same as in image processing, but only able to convolve along the l (word) dimension, i.e.
a one-dimensional convolution.
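The vector-wise one-dimensional convolution can be sketched as follows (random embeddings and a single filter; for convenience we store the filter as an l × d array rather than a concatenated vector):

```python
import numpy as np

rng = np.random.default_rng(0)

d, n, l = 4, 6, 3                 # embedding dim, sentence length, window size
X = rng.normal(size=(n, d))       # one embedding vector per word
w = rng.normal(size=(l, d))       # a single filter spanning l whole embeddings
b = 0.0

def vector_wise_convolution(X, w, b):
    """Slide the filter over every window of l consecutive word
    embeddings; each window yields one feature value."""
    n, l = len(X), len(w)
    return np.array([
        np.tanh(np.sum(w * X[i:i + l]) + b)
        for i in range(n - l + 1)
    ])

c = vector_wise_convolution(X, w, b)
print(c.shape)  # (n - l + 1,) = (4,)
```

In a real model many such filters run in parallel and their outputs are pooled, but each produces exactly this length-(n − l + 1) feature vector.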
In contrast to the vector-wise convolution, the feature-wise convolution assumes that
each dimension of the word embedding vectors encodes a similar feature across words;
hence, feature values are independent within a word vector but correlated across different
word vectors. The feature-wise convolution first stacks word embedding vectors into a
matrix, where the columns are word embedding vectors and the rows are the embedding values
at each dimension, as shown in Figure 2.10 (B). M is a matrix of weights of size d × l.
However, unlike the vector-wise convolution, where the matrix is the weight of one filter,
in the feature-wise convolution the matrix can be thought of as a collection of filters,
where each filter is a vector of size l. The convolution is performed by letting each
filter slide over its associated row of the matrix, and the output is a matrix of size
d × (n − l + 1).
Collobert and Weston [172] use a Time-Delay Neural Network (TDNN) [187] to perform
the vector-wise convolution. The TDNN is a special case of the CNN, which
has one fixed-size filter sharing its weights along the temporal dimension. They apply the
network to six NLP tasks, two of which are related to learning semantic compositionality:
semantic role labelling and semantic relatedness prediction. Kim [26]
reports a number of experiments using the vector-wise convolution with one convolutional
layer and 1-max pooling. The model is trained to perform sentiment analysis
and topic categorisation tasks, producing new state-of-the-art results on 4 out of
7 datasets, which shows the power of the CNN. However, the CNN
requires inputs of a fixed size. Kim [26] addresses this issue by predefining a fixed
length for all sentences, and padding zero vectors to the input matrix if the length of
a sentence is less than the fixed length. Kalchbrenner et al. [188] introduce a dynamic
k -Max pooling to handle input sentences of varying lengths using the feature-wise con-
volution. Given a value k and a vector v ∈ Rd where d ≥ k, the k-Max pooling extracts
k maximum values from v. The dynamic k-Max pooling sets k dynamically using the
function k = max(ktop, ⌈((L − l)/L) · s⌉), where l is the index of the current
convolutional layer, L is the total number of convolutional layers, and s is the length of
the input sentence. ktop is fixed at the top convolutional layer so that it outputs a
fixed-length vector, and the output layer is a fully connected layer. This strategy allows
input sentences of various lengths, as long as the maximum filter width l ≤ ktop. For
example, in a 3-layer convolutional network with ktop = 3, if the input sentence has
6 tokens, then k1 = 4 at the first pooling layer, k2 = 3 at the second, and ktop = 3.
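The pooling schedule can be sketched as follows; the ceiling and the helper names are our own additions, following Kalchbrenner et al.'s formulation [188].

```python
import math

def dynamic_k(layer, total_layers, sent_len, k_top):
    """k for the pooling layer at depth `layer` out of `total_layers`
    convolutional layers, for an input sentence of `sent_len` tokens."""
    return max(k_top, math.ceil((total_layers - layer) / total_layers * sent_len))

def k_max_pool(v, k):
    """Keep the k largest values of v, preserving their original order."""
    top = sorted(sorted(range(len(v)), key=lambda i: v[i])[-k:])
    return [v[i] for i in top]
```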
2.2.2.1 Learning Meanings of Documents
Documents usually contain many more words than phrases and sentences do. One common
approach is to firstly learn the meanings of sentences, and then learn the representa-
tions of documents based on the learnt sentence representations. Kalchbrenner and
Blunsom [189] use a recurrent-convolutional network to learn the compositionality of
discourses. The model consists of two networks: a CNN and a recurrent network. The
CNN network is responsible for learning the meanings of sentences. For each sentence st
in the document, the CNN network outputs a vector representation of the sentence.
The recurrent network takes two inputs, the vector representation of sentence st from
the convolutional network and the output of its previous hidden state st−1, to predict
a probability distribution over the current label. Similarly, Tang et al. [190] propose to
use a LSTM or a CNN network to learn the meaning of a sentence, then use a GRU
network to adaptively encode semantics of sentences and their relations for document
representations.
Using the same neural network architecture to learn the representations of both
sentences and documents has also been attempted. Denil et al. [191] use a single CNN
network to train convolution filters hierarchically at both the sentence and document
level, intending to transfer lexical features from word embeddings into high-level semantic
concepts. The word embedding vectors of a sentence are stacked into a matrix, on which
the convolution process is performed to output feature maps representing the latent
features learnt from the sentence. Following a max pooling operation, the sentence
embeddings are then stacked into a document matrix, on which another convolution process
is performed to output the document embedding vectors. Ganesh et al. [192] propose
two probabilistic language models implemented using the same feed-forward network
architecture to jointly learn the representations of sentences and documents.
It is worth mentioning that word-sentence-document models are not the only way of learning
document representations. As mentioned earlier, doc2vec [178] is also commonly used for
learning document embeddings. Dai et al. [193], and Lau and Baldwin [179] conduct
empirical evaluations of doc2vec, and both demonstrate the model performs significantly
better than other baselines.
2.3 Conclusion
In this chapter, we first reviewed common approaches to the automatic keyphrase
extraction task. Approaches for automatically extracting keyphrases from documents
can be grouped into two categories: unsupervised ranking and supervised machine
learning approaches. Both require manually selected features to represent phrases.
However, unsupervised ranking approaches use fewer features than supervised machine
learning approaches, and are therefore less dependent on the choice of features. In
the next chapter, we will present a systematic evaluation of different unsupervised AKE
algorithms, in order to gain a better understanding of the strengths and weaknesses of
unsupervised AKE algorithms.
In the second part of this chapter, we provided a general introduction to representation
learning in NLP using deep neural networks. Deep learning is a relatively new research
field, which focuses particularly on automatically learning features of words, multi-word
phrases, sentences, and documents using deep neural networks, including conventional
feed-forward, recurrent, recursive, and convolutional networks, and some
hybrid networks. However, learning the representations of phrases and documents still
remains a great challenge for the NLP community. In Chapters 5 to 7, we will present a
series of deep learning models to automatically learn the representations of phrases and
documents.
Chapter 3
Conundrums in Unsupervised
AKE
In Chapter 2.1, we reviewed common approaches to AKE, which can be grouped
into two categories: unsupervised ranking and supervised machine learning approaches.
In comparison to supervised AKE, unsupervised approaches use fewer features, making
them less dependent on the choice of features. Hence, in this chapter, we focus on
unsupervised AKE.
Approaches to unsupervised AKE typically consist of candidate phrase identification,
ranking, and scoring processes. Each step plays a critical role in the processing pipeline
and can significantly affect the overall performance of an AKE system. However,
most studies evaluate their AKE algorithms from a system point of view,
treating all steps as a whole, so the efficiency and effectiveness of each process have
not been precisely identified.
In this chapter, we conduct a systematic evaluation to gain a precise understanding of
unsupervised AKE approaches. We evaluate four popular ranking algorithms, five can-
didate identification techniques, and two scoring approaches, by combining each ranking
algorithm with different candidate identification and scoring techniques. We show how
different techniques employed at each step affect the overall performance. The evalua-
tion also reveals the common strengths and weaknesses of unsupervised AKE algorithms,
which provides a clear pathway for seeking further improvements.
3.1 Introduction
Unsupervised AKE approaches assign a score to each phrase, indicating the phrase's
likelihood of being a keyphrase. The output is a list of phrases with their
corresponding scores, from which a number of top-scored phrases are extracted as
potential keyphrases. Hence, how a score is assigned to a phrase decides the sequence of
processing data – the processing pipeline. A phrase's score can be assigned directly by
an unsupervised AKE algorithm, which takes the whole phrase as the ranking element.
Alternatively, the phrase's score can be the sum of the scores of its constituent words,
in which case the AKE algorithm takes words as the ranking elements. We name the
former scoring approach direct phrase ranking, and the latter phrase score
summation. Figure 3.1 provides an overview of the two system processing pipelines1.
The two system processing pipelines, shown in Figure 3.1, can produce very different
scores for the same phrase in a document, even with the same candidate phrase identi-
fication technique and AKE ranking algorithm. For example, if a phrase neural network
appears only once in a journal article, it may be assigned a very low score by the direct
phrase ranking pipeline due to its low frequency. However, the word network may co-occur
hundreds of times with other words, such as recurrent, feed-forward, and convolutional.
Using the phrase score summation pipeline, the phrase neural network will then be assigned
a reasonably high score, because the word network gains an extremely high score.
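The difference between the two pipelines can be sketched with toy scores (the numbers below are invented purely for illustration):

```python
def direct_phrase_score(phrase, phrase_scores):
    """Direct phrase ranking: the whole phrase is the ranking element."""
    return phrase_scores.get(phrase, 0.0)

def summed_phrase_score(phrase, word_scores):
    """Phrase score summation: sum the constituent words' scores."""
    return sum(word_scores.get(w, 0.0) for w in phrase.split())
```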
Two common steps of both processing pipelines are candidate phrase identification and
ranking. The candidate phrase identification process identifies syntactically valid phrases
from documents. It acts as a filter that prevents noisy data, such as non-content-bearing
words, from entering the system, providing the candidates from which keyphrases
are extracted. Hence, a phrase not in the candidate phrase list will never be scored and
extracted.
The ranking process is performed by an unsupervised AKE ranking algorithm, which
carries out the main computation and assigns scores to words or phrases. Unsupervised
AKE ranking algorithms directly decide what will be extracted as keyphrases. Over
the past two decades, a number of unsupervised AKE ranking algorithms have been
developed, such as KeyGraph (1998) [35], TextRank (2004) [3], ExpandRank (2009) [10],
RAKE (2010) [45], and TopicRank (2013) [78].
Each step, including candidate phrase identification, ranking, and phrase scoring, plays
a critical role that affects the overall performance of an AKE system. However, most
studies of unsupervised AKE only focus on the ranking algorithms, with little discussion
1AKE systems require a text cleaning and normalising process, which often involves cleaning noisy data by applying dataset-dependent heuristics; thus we do not take this process into consideration.
(a) Direct Phrase Ranking Processing Pipeline
(b) Phrase Score Summation Processing Pipeline
Figure 3.1: Phrase Scoring Processing Pipelines
on how the other processes are implemented. This makes it difficult to understand how the
claimed improvements are achieved, let alone to identify whether they come from
the candidate identification approach, the ranking algorithm, or the scoring technique.
Although some research has reported the importance of the candidate phrase identifi-
cation process, there is a lack of a comprehensive and systematic study on exactly how
each step along the processing pipeline affects the overall extraction results in unsu-
pervised AKE. Hulth (2003) [5] presents a comparison between three candidate phrase
identification approaches: N-grams, POS patterns, and noun-phrase (NP) chunking, but
on a supervised AKE algorithm. More recently, some studies have proposed refined candidate
selection approaches, but their experiments are conducted in conjunction with their own
ranking algorithms, blurring the effect of the candidate selection approaches. For
example, Kumar and Srinathan (2008) [194] present an approach that prepares a dictio-
nary of distinct N-grams using LZ78 data compression, which produces better results.
Kim and Kan (2009) [34], and Wang and Li (2010) [195] also focus on their refined
NP chunking approaches. However, these studies neither explore the capability of their
refined phrase identification approaches in combination with other ranking approaches, nor
compare their approaches with similar approaches side by side. In addition, to the
best of our knowledge, no study has examined the impact of phrase scoring techniques.
In this chapter, we aim to have a clear understanding of how different techniques at each
step may affect the overall performance. We re-implemented five phrase identification
approaches, four ranking algorithms, and two scoring techniques. We conduct two eval-
uations on three datasets with documents of varying lengths. The first evaluation is on
the performance of phrase identification approaches by analysing the coverage on the
human-assigned keyphrases (the ground-truth). In the second evaluation, we analyse the
performance of the four ranking algorithms by coupling them with different phrase identification
and scoring approaches.
3.2 Reimplementation
We reimplemented five common candidate phrase identification approaches, including
Phrase as Text Segment Splitter (PTS), N-gram Filter, NP Chunker, Prefixspan [55], and
C-value [56]. For ranking algorithms, we have reimplemented Term Frequency Inverse
Document Frequency (TF-IDF) [2], Rapid Automatic Keyword Extraction (RAKE) [45],
TextRank [3], and Hyperlink-Induced Topic Search (HITS) [88].
3.2.1 Phrase Identifiers
3.2.1.1 PTS Splitter
The PTS Splitter uses stop-words and punctuation marks (excluding hyphens) as delim-
iters to split a sentence into candidate phrases. For example, creating an information
architecture for a bilingual web site will produce three candidates, creating, information
architecture, and bilingual web site. The processing sequence is as follows:
1. convert an input text to lowercase.
2. split the text into candidate phrases using stop-words and punctuation marks.
3. stem the identified candidate phrases using Porter’s algorithm [161].
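The steps above can be sketched as follows; the stop-word list is a tiny illustrative subset, and the Porter stemming step is omitted for brevity:

```python
import re

STOP_WORDS = {"a", "an", "for", "the", "of", "and"}   # illustrative subset only

def pts_split(text, stop_words=STOP_WORDS):
    """Split text into candidate phrases, using stop-words and
    punctuation marks (except hyphens) as delimiters."""
    tokens = re.findall(r"[\w-]+|[^\w\s]", text.lower())
    phrases, current = [], []
    for tok in tokens:
        # A stop-word or a punctuation token ends the current phrase.
        if tok in stop_words or not re.match(r"[\w-]", tok):
            if current:
                phrases.append(" ".join(current))
            current = []
        else:
            current.append(tok)
    if current:
        phrases.append(" ".join(current))
    return phrases
```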
3.2.1.2 N-gram Filter
The N-gram filter is built upon the PTS Splitter. It takes outputs from the PTS Splitter
as inputs, and then generates all possible sequential combinations of each candidate, where
each combination has at least two tokens. For example, a phrase bilingual web site will
generate bilingual web, web site, and bilingual web site. After generating all N-grams
for the inputs, heuristics are applied to remove unwanted combinations. The processing
sequence is as follows:
1. use the outputs from the PTS Splitter as the inputs.
2. generate all N-grams for each input and save them into list L, where 2 ≤ n ≤ length of an input.
3. sort L by the frequencies of each N-gram.
4. for each N-gram p ∈ L, if there exists any N-gram g ∈ L where p is a substring
of g and freq(p) ≤ freq(g), remove p; otherwise remove g.
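The N-gram generation step can be sketched as follows (the frequency-based pruning of step 4 is omitted; the function name is our own):

```python
def generate_ngrams(phrase, n_min=2):
    """All contiguous token N-grams of a candidate phrase with
    n >= n_min, up to the full length of the phrase."""
    toks = phrase.split()
    return [" ".join(toks[i:i + n])
            for n in range(n_min, len(toks) + 1)
            for i in range(len(toks) - n + 1)]
```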
3.2.1.3 Noun Phrase Chunker
The Noun Phrase (NP) Chunker discards tokens not fitting into the predefined POS
pattern. We choose a simple but widely used pattern <JJ>*<NN.*>+ [3, 8, 44, 52],
which finds phrases that begin with zero or more adjectives, followed by one or more
nouns. The processing sequence is as follows:
1. convert an input text to lowercase.
2. tokenise the text.
3. tag the text using the Stanford Log-linear POS tagger [196].
4. chunk candidate phrases using the predefined POS pattern.
5. stem the identified candidate phrases using Porter’s algorithm [161].
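The chunking step can be sketched as a small state machine over already-tagged (token, tag) pairs; tagging itself (steps 1–3) is assumed to have been done upstream, and stemming is again omitted:

```python
def np_chunk(tagged):
    """Chunk candidates matching <JJ>*<NN.*>+ from (token, POS-tag) pairs."""
    phrases, adjs, nouns = [], [], []
    for word, tag in tagged + [("", ".")]:      # sentinel flushes the last chunk
        if tag.startswith("NN"):
            nouns.append(word)                  # one or more nouns
        elif tag == "JJ":
            if nouns:                           # a noun run just ended: emit it
                phrases.append(" ".join(adjs + nouns))
                adjs, nouns = [], []
            adjs.append(word)                   # leading adjectives
        else:
            if nouns:
                phrases.append(" ".join(adjs + nouns))
            adjs, nouns = [], []                # adjectives with no noun are dropped
    return phrases
```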
3.2.1.4 Prefixspan
Prefixspan, introduced by Pei et al. [55], is a sequential pattern mining algorithm, which
identifies frequent sub-sequences from a given set of sequences. It can be adapted to
identify frequent candidate phrases from a document [54].
Formally, let I = {i1, i2, ..., in} be a set of items. An itemset sj is a subset of I. A
sequence is denoted as s = 〈s1s2...sl〉, where sj ⊆ I for 1 ≤ j ≤ l is an element of s, and
sj = (x1x2...xm), where xk ∈ I for 1 ≤ k ≤ m is an item. A sequence α = 〈a1a2...an〉 is
said to be a sub-sequence of another sequence β = 〈b1b2...bm〉 (and β is a super-sequence
of α), denoted α ⊑ β, if there exist integers 1 ≤ j1 ≤ j2 ≤ ... ≤ jn ≤ m such that
a1 ⊆ bj1, a2 ⊆ bj2, ..., an ⊆ bjn. A sequence with l instances of items is called an l-length
sequence. A sequence database S is a list of tuples 〈sid, s〉 where sid is the ID of sequence
s. A tuple contains a sequence t if t is a sub-sequence of s. A support of t is the number
of tuples in S that contains t, and t is considered to be frequent if its support is greater
than a user-defined threshold min sup.
Given a sequence database (a set of sequences), instead of considering all the possible
occurrences of frequent sub-sequences, Prefixspan only focuses on the frequent prefixes
because all frequent sub-sequences can be identified by growing frequent prefixes. Given
a sequence α = 〈a1a2...an〉, and a sequence β = 〈b1b2...bm〉 where m ≤ n, β is only
said to be a prefix of α when it satisfies three conditions: 1) bi = ai for i ≤ m − 1; 2)
bm ⊆ am; and 3) all the items in (am − bm) are alphabetically listed after those in bm.
For example, 〈a〉, 〈ab〉, and 〈aa〉 are prefixes of the sequence 〈aabc〉, but 〈ac〉 is not.
Prefixspan takes a sequence database S and a threshold min sup as inputs, and out-
puts a list of frequent patterns with their supports. It recursively calls a function
Prefixspan(α, l, S|a), where the parameters are initialised by setting α = 〈〉, l = 0, S|a =
S. The iteration is as follows:
1. scan S|a to find a set of frequent items b such that b can be assembled to the last
element of α to form a sequential pattern, or 〈b〉 can be appended to α to form a
sequential pattern.
2. for each frequent item b, append it to α to form a sequential pattern α′, and
output α′.
3. for each α′, construct the α′-projected database S|α′ and call Prefixspan(α′, l + 1, S|α′).
To use Prefixspan to identify candidate phrases, we treat a token as an item, and the
corpus vocabulary as the set of all items. We could also treat a sentence as a sequence.
However, our goal is to identify candidate phrases, and treating a sentence as a
sequence directly may introduce many unwanted sub-sequences, which can contain
stop-words or punctuation. Therefore, we use the outputs from the PTS Splitter and
treat each text segment as a sequence. The outputs from Prefixspan are frequent
sub-sequences, which are the candidate phrases. The processing steps are as follows:
1. run the PTS Splitter over a corpus to obtain the text segments, and save them
into S as the sequence database
2. set min sup = 2 and call Prefixspan(α, l, S|a), where the parameters are ini-
tialised by setting α = 〈〉, l = 0, S|a = S.
3. save outputs from Prefixspan into a list C.
4. sort C by the number of tokens each sequence contains.
5. for each c ∈ C, if there exists any sub-sequence c′ of c with support(c′) > support(c)
and length(c′) ≥ 2, remove c from C.
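A compact sketch of Prefixspan for the single-item-element case used here (tokens as items, PTS text segments as sequences) is shown below; we treat "frequent" as support ≥ min_sup, and the function names are our own:

```python
def prefixspan(sequences, min_sup):
    """Frequent sequential patterns (single-item elements) with supports.
    Returns a dict mapping each pattern (a tuple of tokens) to its support."""
    patterns = {}

    def project(db, item):
        # Keep, for each sequence, the suffix after the first occurrence of item.
        out = []
        for seq in db:
            for i, x in enumerate(seq):
                if x == item:
                    out.append(seq[i + 1:])
                    break
        return out

    def grow(prefix, db):
        # Count, per sequence, which items occur; recurse on the frequent ones.
        counts = {}
        for seq in db:
            for item in set(seq):
                counts[item] = counts.get(item, 0) + 1
        for item, sup in counts.items():
            if sup >= min_sup:
                pat = prefix + (item,)
                patterns[pat] = sup
                grow(pat, project(db, item))

    grow((), sequences)
    return patterns
```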
3.2.1.5 C-value
C-value [56] is an approach for identifying domain-specific terms using both linguistic and
statistical information. It distinguishes a phrase from its longer cousins – super-strings of
the phrase. C-value takes as input a set of candidate phrases chunked by pre-defined POS
patterns, and outputs a list of scored phrases, where the scores indicate the
likelihood of being a valid phrase.
The processing sequence is as follows:
1. run the NP Chunker over a dataset to obtain the candidates and their frequencies,
then save them in dictionary D.
2. compute the C-value scores for each d ∈ D using equation 2.1.9.1.
3. sort D by |d|.
4. for each d ∈ D, if there exists any d′ appearing as a part of d (i.e. d′ is a
substring of d) with Cvalue(d′) > Cvalue(d) and |d′| ≥ 2, remove d from D.
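Since Equation 2.1.9.1 is not reproduced here, the sketch below uses the standard C-value formula of Frantzi et al. [56], assuming multi-word candidates (|a| ≥ 2) represented as token tuples:

```python
import math

def c_value(freqs):
    """C-value scores (Frantzi et al.'s formula) for multi-word candidates.
    `freqs` maps a phrase (a tuple of tokens, length >= 2) to its frequency."""
    def contains(b, a):
        # True if phrase a occurs contiguously inside the longer phrase b.
        return any(b[i:i + len(a)] == a for i in range(len(b) - len(a) + 1))

    scores = {}
    for a, f_a in freqs.items():
        # Frequencies of the candidates that properly contain a.
        nested_in = [f for b, f in freqs.items() if len(b) > len(a) and contains(b, a)]
        if nested_in:
            scores[a] = math.log2(len(a)) * (f_a - sum(nested_in) / len(nested_in))
        else:
            scores[a] = math.log2(len(a)) * f_a
    return scores
```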
3.2.2 Ranking Algorithms
Ranking algorithms usually organise documents into their own representations to ease
further processing2. Figure 3.2 shows three types of document representations used by
TF-IDF, TextRank and HITS, and RAKE, respectively.
3.2.2.1 TF-IDF
TF-IDF is a weighting scheme that statistically analyses how important a phrase is to
an individual document in a corpus. The underlying intuition is that a frequent phrase
distributed evenly across a corpus is not a good discriminator, so it should be assigned
a lower weight (score). In contrast, a phrase occurring frequently in a few particular
documents should be assigned more weight.
TF-IDF constructs a Phrase-Document Matrix prior to scoring, where each row corre-
sponds to a phrase in the corpus, and each column represents a document. The value in
a cell represents the frequency of the phrase in the corresponding document. Figure 3.2
(B) shows an example, where phrases are identified by the NP Chunker.
The TF-IDF score of a phrase is calculated as the product of two statistics: the phrase’s
TF score and its IDF score. The TF score indicates the importance of a phrase within
the document in which it appears – a higher frequency gains a higher TF score. The
IDF score corresponds to the importance of a phrase across the corpus – a phrase occurring
frequently in a large number of documents gains a lower IDF score. Let t denote a
phrase; d denote a document; and D denote a corpus; where t ∈ D, and d ∈ D, then
the TF-IDF score is computed as:
tfidf(t, d,D) = tf(t, d)× idf(t,D) (3.1)
2These representations, however, are not used for the actual computations. We will discuss this issue in detail in Chapter 7.
Figure 3.2: Document Representations: (A) a sample dataset contains only two documents; candidate phrases are identified by the NP Chunker as the inputs to each ranking algorithm. (B) Doc1 and Doc2 are represented in a Phrase-Document Matrix used by TF-IDF. (C) Doc1 is represented in a co-occurrence graph used by TextRank and HITS; two phrases are connected if they co-occur in the same sentence. (D) Doc2 is represented in a co-occurrence graph. (E) Doc1 is represented in a Phrase Co-occurrence Matrix used by RAKE; phrases co-occur if they appear in the
same sentence. (F) Doc2 is represented in a Phrase Co-occurrence Matrix.
We use the weighting scheme introduced by Jones [2]:

tfidf(ti) = tfi × idfi = tfi × log( |D| / |{d ∈ D : ti ∈ d}| )   (3.2)
where tfi is the number of times phrase ti occurs in d, |D| is the total number of
documents in corpus D, and | {d ∈ D : ti ∈ d} | is the number of documents in which
phrase ti occurs.
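The scheme can be sketched as follows; Equation 3.2 leaves the logarithm base unspecified, so this sketch uses the natural logarithm:

```python
import math

def tfidf_scores(docs):
    """Per-document TF-IDF scores (Equation 3.2, natural log) for a corpus.
    `docs` is a list of documents, each a list of candidate phrases."""
    n_docs = len(docs)
    df = {}                                   # document frequency of each phrase
    for doc in docs:
        for p in set(doc):
            df[p] = df.get(p, 0) + 1
    scored = []
    for doc in docs:
        tf = {}
        for p in doc:
            tf[p] = tf.get(p, 0) + 1          # raw term frequency in this document
        scored.append({p: tf[p] * math.log(n_docs / df[p]) for p in tf})
    return scored
```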
3.2.2.2 RAKE
RAKE [45] is a statistical approach based on analysing phrase frequencies and co-
occurrence frequencies. In contrast to TF-IDF, which conducts its statistical analysis
over a corpus, RAKE only uses the statistical information of each individual document.
RAKE constructs a Co-occurrence Matrix for representing a document, and the co-
occurrence relation is defined as:
1. if inputs are pre-identified phrases, two phrases are said to be co-occurring if they
appear in the same window. The window size can be an arbitrary number typically
from 2 to 10, or just a natural sentence. For example, in Figure 3.2 (E), phrases
are co-occurring if they appear in the same sentence.
2. if inputs are individual words, two words co-occur if they appear in the same
candidate phrase identified by a phrase identifier. For example, in a sentence:
“information interaction provides a framework for information architecture”, the
identified phrases are: information interaction, framework, information architec-
ture. Then the co-occurrences are: information and interaction co-occur once, and
information and architecture co-occur once.
Concretely, the occurrence frequency of a candidate phrase c is denoted freq(c), and
its co-occurrence frequency with other candidates is denoted deg(c) (the degree of
candidate c). The score is computed as the ratio of degree to frequency:

Score(c) = deg(c) / freq(c)   (3.3)

For example, in Figure 3.2 (E), deg(form interact) = 6 and freq(form interact) = 2,
so the overall score Score(form interact) = 3.
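The word-level computation can be sketched as follows; deg(w) here counts a word's co-occurrences with the other words of each candidate phrase containing it, which is one common formulation (variants also count the word itself):

```python
def rake_word_scores(phrases):
    """Word-level RAKE scores from candidate phrases (lists of words)."""
    freq, deg = {}, {}
    for phrase in phrases:
        for w in phrase:
            freq[w] = freq.get(w, 0) + 1
            deg[w] = deg.get(w, 0) + len(phrase) - 1   # co-occurrences in the phrase
    return {w: deg[w] / freq[w] for w in freq}

def rake_phrase_score(phrase, word_scores):
    """Phrase score summation: a phrase's score is the sum of its words' scores."""
    return sum(word_scores[w] for w in phrase)
```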
3.2.2.3 TextRank
TextRank [3] is a graph-based ranking algorithm. A document is first represented as an
undirected graph, where phrases (or individual words, if using phrase score summation)
correspond to vertices, and edges represent co-occurrence relations: two vertices are
connected if their co-occurrence is found within a predefined window size. For example,
in Figure 3.2 (C) and (D), two phrases are connected if they appear in the same sentence.
TextRank implements the idea of ‘voting’. When a vertex vi links to another
vertex vj , vi casts a vote for vj ; the higher the number of votes vj receives, the
more important vj is. Moreover, the importance of the vote itself is also considered by
the algorithm: the more important the voter vi is, the more important the vote becomes.
The score of a vertex is calculated based on the votes it receives and the importance of
the voters.
The TextRank algorithm introduced in the original paper [3] can be applied to both AKE
and text summarisation tasks. However, when using TextRank to extract keyphrases, the
calculation is identical to the original PageRank, except that TextRank constructs
undirected graphs for representing documents, where the in-degree of a vertex simply
equals its out-degree. Given a graph G = (V,E), let in(vi) be the set of vertices that
point to a vertex vi, and out(vi) be the set of vertices to which vi points. The score of
a vertex is calculated as:

S(vi) = (1 − d) + d × Σ_{vj ∈ in(vi)} S(vj) / |out(vj)|   (3.4)

where d is the damping factor, usually set to 0.85 [3, 89].
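Equation 3.4 can be iterated to a fixed point as sketched below (the fixed iteration count is an arbitrary choice; in practice one iterates until the scores change by less than a small threshold):

```python
def textrank(neighbors, d=0.85, iterations=50):
    """Iterate Equation 3.4 over an undirected co-occurrence graph.
    `neighbors` maps each vertex to the set of vertices it co-occurs with."""
    scores = {v: 1.0 for v in neighbors}
    for _ in range(iterations):
        scores = {v: (1 - d) + d * sum(scores[u] / len(neighbors[u])
                                       for u in neighbors[v])
                  for v in neighbors}
    return scores
```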
3.2.2.4 HITS
HITS [88] is a link analysis algorithm that ranks web pages with reference to their
degrees of hubs and authorities. The hub degree of a page is the number of links the page
points to, and the authority degree is the number of links the page receives. Hubs
and authorities are analogous to the out-degree and in-degree in PageRank.
A hub score hub(vi) of a vertex vi is computed as the sum of the authority scores of all
the vertices out(vi) that vi points to:

hub(vi) = Σ_{vj ∈ out(vi)} authority(vj)   (3.5)
Similarly, an authority score authority(vi) is the sum of the hub scores of all the vertices
in(vi) that point to vi:

authority(vi) = Σ_{vj ∈ in(vi)} hub(vj)   (3.6)
In directed graphs, the score of a vertex can be either the maximum or the average of
its hub and authority scores [197]. In undirected graphs, however, hub scores equal
authority scores. For example, in Figure 3.2 (C), both the hub score and the authority
score of the phrase inform interact are 6.

Applying HITS to AKE, we represent a document as an undirected graph whose edges
are co-occurrence relations. The algorithm assigns a pair of scores (a hub score and an
authority score) to each vertex in the graph.
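On an undirected graph the hub and authority updates coincide, so a single score per vertex suffices, as sketched below; the per-round rescaling is a practical addition to keep the iteration bounded, not part of Equation 3.6:

```python
def hits(neighbors, iterations=50):
    """Iterate Equations 3.5-3.6 on an undirected co-occurrence graph.
    `neighbors` maps each vertex to the set of vertices adjacent to it."""
    score = {v: 1.0 for v in neighbors}
    for _ in range(iterations):
        new = {v: sum(score[u] for u in neighbors[v]) for v in neighbors}
        top = max(new.values()) or 1.0        # rescale so scores stay bounded
        score = {v: s / top for v, s in new.items()}
    return score
```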
3.3 Datasets
3.3.1 Dataset Statistics
We select three publicly available datasets3: Hulth (2003) [5], DUC (2001) [52], and
SemEval (2010) [51]. The lengths of documents in each dataset vary: Hulth is a collection
of short documents – abstracts of journal articles, DUC consists of mid-length documents
– news articles, and SemEval consists of long documents – full-length journal papers.
Table 3.1 shows a summary of the selected datasets.
Table 3.1: Dataset Statistics After Cleaning
                                    Hulth            DUC            SemEval
Total number of articles            2,000            308            244
Total words in dataset              249,660          244,509        1,417,295
Average words per article           125              799            5,808
Total ground-truth phrases          27,014           2,488          3,686
Ground-truth phrases in articles    16,419 (60.8%)   2,436 (97.9%)  2,840 (77.0%)
Average keyphrases per article      8.2              7.9            11.6
Average tokens per keyphrase        2.3              2.1            2.2

*Some statistics vary slightly from the original literature [5, 51, 52] due to different pre-processing procedures.
The Hulth dataset consists of 2,000 abstracts of journal papers collected from the Inspec
database, mostly in the fields of Computer Science and Information Technology. The
dataset has training and test sets for supervised AKE. In this evaluation, we
3http://github.com/snkim/AutomaticKeyphraseExtraction
merged the training and test sets, since no training data is required for unsupervised AKE.
Each article pairs with two sets of ground-truth keyphrases, assigned by readers and by
the authors. We combine the two sets by taking their union as the final ground-truth.
The DUC dataset consists of 308 news articles. The dataset was built on DUC20014,
which is used for the document summarisation task. All articles were manually annotated
by humans. The Kappa statistic for measuring agreement among the human
annotators is 0.70.
The SemEval dataset consists of 244 full-length journal papers from the ACM Digital
Library, of which 183 articles are from Computer Science and 61 from the Social
and Behavioural Sciences. The dataset consists of training and test data. As with the
Hulth dataset, each article pairs with both reader- and author-assigned keyphrases.
We merged the training and test sets, and use the combination of reader- and author-
assigned keyphrases as our ground-truth.
3.3.2 Ground-truth Keyphrases not Appearing in Texts
One common but important issue with all three datasets is that not all ground-truth
(human assigned) keyphrases appear in the actual content of the documents, even after
text normalisation and stemming, as shown in Table 3.1. Such non-appearing keyphrases
usually do not follow the exact same word sequences as their semantically equivalent
counterparts appearing in documents. For example, the ground-truth keyphrase non-
linear distributed parameter model appears in the document as two separate phrases,
non-linear and distributed parameter model. In another example, the ground-truth
keyphrases average-case identifiability and average-case controllability appear in the
document as a single phrase, average-case identifiability and controllability.
The task of AKE is to identify keyphrases appearing in a document, and hence, we
exclude the non-appearing keyphrases from the ground-truth list.
3.3.3 Dataset Cleaning
Each dataset needs to be cleaned prior to the evaluation. The Hulth dataset does not
require any special cleaning; we therefore only removed line separators from each article.
The articles in the DUC dataset are in XML format, so we
extracted the content of each document by looking for <text> tags. In addition, any
4http://www-nlpir.nist.gov/projects/duc/guidelines/2001.html
Figure 3.3: Illustrating the sets’ relationships. Algorithm Extracted Keyphrases is a subset of Identified Candidate Phrases. Ground-truth and Identified Candidates are subsets of All Possible Grams of the document. TP: the true positive set contains the extracted phrases that match ground-truth keyphrases; FP: the false positive set contains the extracted phrases that do not match ground-truth keyphrases; FN: the false negative set contains all ground-truth keyphrases that are not extracted as keyphrases; TN: the true negative set contains the candidate phrases that are neither ground-truth nor extracted
as keyphrases.
XML tags within the content were also removed. The SemEval dataset contains journal
papers, so all mathematical symbols and equations, tables, figures, author details,
and references were removed.
3.4 Evaluation
3.4.1 Evaluation Methodology
All ground-truth keyphrases were stemmed using Porter Stemmer [161]. An assigned
keyphrase matches an extracted phrase when they correspond to the same stem sequence.
For example, information architectures matches inform architectur, but not inform or
architectur inform.
In Figure 3.3, we show the set relations of True Positive (TP), False Positive (FP),
True Negative (TN), and False Negative (FN). We employ the Precision, Recall, and
F-measure for evaluating the ranking algorithm as detailed in Chapter 2.1.7.
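The matching-based evaluation can be sketched as follows, assuming both phrase lists have already been stemmed:

```python
def precision_recall_f1(extracted, ground_truth):
    """Set-based Precision, Recall and F-measure over stemmed phrases."""
    ext, gt = set(extracted), set(ground_truth)
    tp = len(ext & gt)                          # extracted phrases matching ground-truth
    precision = tp / len(ext) if ext else 0.0
    recall = tp / len(gt) if gt else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```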
3.4.2 Evaluation 1: Candidate Coverage on ground-truth Keyphrases
We first evaluate the five candidate identification approaches described in Section 3.2.1.
This evaluation aims to discover how many ground-truth keyphrases can be identified
by different phrase identification approaches, which is the coverage on ground-truth.
Table 3.2: The Coverages on Ground-truth Keyphrases
Hulth           GT Inc.   Coverage   CandTol   Cand/Art.   GT Prop.
Prefixspan      9,612     58.5%      81,946    41          11.7%
N-gram Filter   10,018    61.0%      76,186    38          13.1%
PTS Splitter    11,002    67.0%      74,210    37          14.8%
C-value         11,178    68.1%      55,329    28          20.2%
NP Chunker      11,666    71.1%      53,685    27          21.7%

DUC
Prefixspan      1,693     69.5%      75,654    246         2.2%
N-gram Filter   1,649     67.7%      70,527    229         2.3%
PTS Splitter    1,750     71.8%      69,266    225         2.5%
C-value         2,005     82.3%      45,379    147         4.4%
NP Chunker      2,149     88.2%      44,755    145         4.8%

SemEval
Prefixspan      2,095     73.8%      268,850   1,102       0.78%
N-gram Filter   2,110     74.3%      251,538   1,031       0.84%
PTS Splitter    2,353     82.9%      247,787   1,016       0.95%
C-value         2,223     78.3%      144,466   592         1.54%
NP Chunker      2,338     82.3%      143,143   587         1.63%

GT Inc.: the number of ground-truth keyphrases included in the identified candidate phrases. Coverage: the coverage on ground-truth. CandTol: the total number of identified candidates. Cand/Art.: the average number of candidates per article. GT Prop.: the proportion of ground-truth in candidates.
The coverage on ground-truth indicates the maximum recall score that a system can
obtain. For example, the PTS Splitter identifies 67% of the total ground-truth keyphrases
in the Hulth dataset, resulting in a loss of 33% of the true positives before running any
ranking algorithm.
It is worth noting that the evaluation results presented in this chapter differ slightly
from those we previously reported [23], due to different pre-processing and ground-
truth selection processes. In [23], we cleaned the datasets (Hulth and SemEval) using
many heuristics, such as discarding any document in which not all assigned keyphrases
appear, and the DUC dataset was not included. Nevertheless, both evaluations report
similar findings.
3.4.3 Evaluation 1 Results Discussion
Of the five phrase identification approaches, the NP Chunker produces the best coverage
on assigned keyphrases on both Hulth (short document dataset) and DUC (mid-length
document dataset). The PTS Splitter produces a slightly better coverage than the NP
Chunker on SemEval (long-document dataset) and a very close coverage on Hulth. However, there
is a distinct difference on the DUC dataset. The Hulth and SemEval datasets are collections
of academic journal abstracts and full-length articles, respectively. The authors use more
formal writing, and the ground-truth keyphrases are usually technical terms (phrases).
This gives the PTS Splitter a better chance of producing a greater coverage of ground-truth
keyphrases. For example, in a sentence5: “we show that it is possible in theory to
solve for 3D lambertian surface structure for the case of a single point light source and
propose that ...”, the ground-truth keyphrases are 3D lambertian surface structure and
single point light source. In comparison to Hulth and SemEval, the DUC dataset is a
collection of news articles, where the authors use less formal but descriptive language,
appealing to the readers’ senses and helping them to imagine or reconstruct the sto-
ries. For example, a ground-truth keyphrase is yellowstone park fire, but in the actual
content of the article, it appears as greater yellowstone park fire. The PTS Splitter incor-
rectly identifies the candidate as greater yellowstone park fire, whereas the NP Chunker
identifies yellowstone park fire.
The NP Chunker and C-value are both based on POS tags. The difference is that the
C-value algorithm identifies phrases based on their frequencies, so phrases with low
frequencies are discarded. However, many ground-truth keyphrases do have very
low frequencies, which is why the C-value algorithm produces lower coverage than the
NP Chunker. Similarly, Prefixspan is also based on statistical analysis and hence, in
most cases, produces worse coverage of the ground-truth than the PTS Splitter and N-gram
Filter.
We summarise the loss (unidentified ground-truth keyphrases) into five types, shown in
Figure 3.4. The majority of the loss falls into the first and second types, where a candidate
phrase is either a substring or a superstring of a ground-truth keyphrase, as
in the aforementioned example, yellowstone park fire and greater yellowstone park
fire. Ground-truth keyphrases containing punctuation marks or stop-words are the third
and fourth types, respectively. The most common stop-words occurring in the assigned
keyphrases are of, and, on, until, by, with, for, from. The most common punctuation
mark is the apostrophe, followed by ‘.’ and ‘+’, which appear in words such as ‘.net’ and
‘C++’ in many documents. Each phrase identifier treats phrases containing punctuation
marks differently. For example, the PTS Splitter uses all punctuation marks
(except the hyphen) as delimiters, and therefore cannot identify any keyphrase containing
punctuation marks; the NP Chunker does not explicitly exclude any punctuation
mark, but words containing punctuation marks may not be recognised as valid
nouns or adjectives by the POS tagger we employed. The POS tagger also incorrectly
tags some words, mostly occurring in the Hulth dataset, which explains why the NP
Chunker and C-value identifier have larger loss rates in the fifth error type compared
with the others.
5The Hulth dataset article: 205.abstr, ground-truth keyphrase list: 205.contr, 205.uncontr
Figure 3.4: Error 1: candidate identified is too long, being a super-string of the
assigned phrase; Error 2: candidate identified is too short, being a sub-string of the
assigned phrase; Error 3: assigned phrase contains invalid characters such as punctuation
marks; Error 4: assigned phrase contains stop-words; Error 5: others.
3.4.4 Evaluation 2: System Performance
In this evaluation, we combine each candidate identification approach with the four
ranking algorithms and two phrase scoring techniques, forming different processing
pipelines. The evaluation aims to analyse how different candidate identification
and scoring approaches affect the performance of the same ranking algorithm.
3.4.4.1 Direct Phrase Ranking and Phrase Score Summation
In the direct phrase ranking pipeline, candidate phrases are scored directly after ranking.
In the phrase score summation pipeline, a phrase's score is the sum of the scores of its
constituent words:

s(P) = \sum_{w_i \in P} s(w_i)    (3.7)

where s(w) is a word's score assigned by the ranking algorithm.
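As a concrete illustration, the summation in Equation 3.7 can be sketched as follows (a minimal example; the word scores here are hypothetical values standing in for the output of a ranking algorithm):

```python
def phrase_score(phrase, word_scores):
    """Score a candidate phrase as the sum of its constituent words' scores (Eq. 3.7)."""
    return sum(word_scores.get(w, 0.0) for w in phrase.split())

# Hypothetical word scores produced by a ranking algorithm.
word_scores = {"information": 2.0, "interaction": 0.6, "architecture": 0.5, "content": 0.9}

candidates = ["information interaction", "information architecture", "content"]
ranked = sorted(candidates, key=lambda p: phrase_score(p, word_scores), reverse=True)
# Multi-word phrases containing the highly scored word "information" outrank "content".
```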
3.4.4.2 Ranking Algorithm Setup
Among the four ranking algorithms, TF-IDF and RAKE do not require any special
settings. For TextRank, we use a fixed window size of 2 for identifying co-occurrences,
which is reported to produce the best performance [3]. The initial value of each vertex
in a graph is set to 1, the damping factor is set to 0.85, the number of iterations is 30, and the
convergence threshold is 10^{-5}. For HITS, we use the same window size, number of iterations
and convergence threshold as for TextRank. The HITS score for each phrase is computed as
the average of its hub and authority scores. Finally, the top 10 ranked candidates are
selected from each result set as the keyphrases extracted by the corresponding algorithm.
Some settings may not conform to those the original authors used. Our main interest
is not to reproduce results; we focus only on analysing the potential factors that may
affect the ranking algorithms. Therefore, we do not follow the exact pipeline approaches
and heuristics described in the original papers. However, we are confident in our
reimplementation; for example, we reproduced the results for TextRank on the
Hulth dataset [3] reported in the original paper.
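The TextRank configuration above can be sketched in pure Python as follows (a minimal illustration assuming the input is a list of pre-filtered candidate words; the helper name `textrank` is ours):

```python
from collections import defaultdict

def textrank(words, window=2, damping=0.85, iterations=30, threshold=1e-5):
    """Unweighted TextRank over a co-occurrence graph built with the given window.
    Vertices start at 1, the damping factor is 0.85, at most 30 iterations run,
    and the loop breaks once the largest score change falls below the threshold."""
    neighbours = defaultdict(set)
    for i in range(len(words)):
        for j in range(i + 1, min(i + window, len(words))):
            if words[i] != words[j]:
                neighbours[words[i]].add(words[j])
                neighbours[words[j]].add(words[i])
    scores = {w: 1.0 for w in neighbours}
    for _ in range(iterations):
        new_scores = {w: (1 - damping) + damping * sum(scores[u] / len(neighbours[u])
                                                       for u in neighbours[w])
                      for w in neighbours}
        delta = max(abs(new_scores[w] - scores[w]) for w in neighbours)
        scores = new_scores
        if delta < threshold:
            break
    return scores
```

The top 10 scored vertices would then be taken as the extracted keywords.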
3.4.5 Evaluation 2 Results Discussion
We address the discussion from three aspects: the impact of different candidate identification
approaches, the comparison of the two scoring techniques, and the impact of
frequencies.
3.4.5.1 Candidate Identification Impact
We plot the results in six sub-figures, shown in Figure 3.5, where the vertical axis shows
the F-score, and the horizontal axis lists the candidate identifiers ordered by the proportion
of ground-truth among the candidates.
The Hulth dataset consists of short documents – the average number of words per
article is 125 – which offers very little advantage to statistics-based ranking, because the
majority of words and phrases occur only once or twice in a document. To gain better
performance, a ranking algorithm needs to be paired with an efficient candidate identification
approach, which not only covers more of the ground-truth keyphrases, but also reduces
the total number of candidate phrases to increase the proportion of ground-truth among
the candidates. For example, as shown in Table 3.2, the PTS Splitter and the C-value
algorithm produce very similar coverages of the ground-truth keyphrases, 67%
and 68.1%, respectively. However, the C-value algorithm produces far fewer candidate
phrases than the PTS Splitter, resulting in a ground-truth proportion about 6% higher.
Hence, the same ranking algorithm yields better performance with C-value
than with the PTS Splitter. This can be understood intuitively – a larger proportion
increases the chance of a ranking algorithm choosing correct keyphrases. As shown
in Figure 3.5 (A) and (B), there is a clear linear relation between the proportion of
ground-truth among the candidates and the overall performance of a ranking algorithm. For
Figure 3.5: Evaluation 2 Results: (A) Performance on the Hulth Dataset using Direct
Phrase Ranking; (B) Performance on the Hulth Dataset using Phrase Score Summation;
(C) Performance on the DUC Dataset using Direct Phrase Ranking; (D) Performance
on the DUC Dataset using Phrase Score Summation; (E) Performance on the SemEval
Dataset using Direct Phrase Ranking; (F) Performance on the SemEval Dataset using
Phrase Score Summation.
example, with the direct phrase ranking approach, increasing the proportion
by 10 percentage points, from 11.7% to 21.7%, raises the F-score by about 10% for all
ranking algorithms. Therefore, when extracting keyphrases from datasets consisting of short
documents, such as Hulth, candidate identification should focus on increasing
the coverage of the ground-truth set, reducing the number of candidate phrases, or both.
However, for datasets containing longer documents, such as DUC (mid-length
documents) and SemEval (long documents), increasing the proportion of ground-truth
keyphrases among the candidate phrases by a small percentage does not improve
performance much. As shown in Figure 3.5 (C), (D), (E) and (F), there is no clear
correlation between the proportion and performance. Trying to remove a large number
of candidate phrases in order to increase the proportion of ground-truth is extremely
difficult, or even impossible, simply because longer documents contain
more words and phrases. Instead, improving the accuracy of identifying candidate
phrases among their longer or shorter variants, e.g. greater yellowstone park fire
and yellowstone park fire, is a better way to improve the performance of a ranking
algorithm. All the ranking algorithms extract keyphrases based on the statistical information
of phrases, of which frequency is the main source. Correctly distinguishing a phrase
from its longer or shorter variants provides a cleaner list of candidate phrases and more
accurate statistical information. For example, if a document contains the text segments
greater yellowstone park fire and yellowstone park fire, and each segment occurs twice
in the document, the PTS Splitter will identify two different phrases from the text
segments, each with a frequency count of 2. The C-value algorithm, on the
other hand, will identify only one phrase, yellowstone park fire, from the text segments,
with a frequency count of 4. Given the outputs of the two candidate phrase identifiers
to the same ranking algorithm, the former combination may produce a lower score for
both phrases, whereas the latter combination will assign a much higher score to the
phrase yellowstone park fire due to its distinctly high frequency. Consequently, to extract
keyphrases from datasets consisting of longer documents, candidate identification
should focus on correctly identifying phrases from text segments using both statistical
and linguistic information.
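The frequency-merging behaviour described above can be sketched as follows (a deliberately simplified illustration of the C-value-style merging, not the full C-value algorithm; the phrase counts are from the yellowstone example in the text):

```python
def contains_phrase(longer, shorter):
    """True if `shorter` occurs in `longer` as a contiguous word sequence."""
    lw, sw = longer.split(), shorter.split()
    return any(lw[i:i + len(sw)] == sw for i in range(len(lw) - len(sw) + 1))

def merged_frequency(candidate, raw_counts):
    """Count occurrences of `candidate` itself plus those of any longer raw
    phrase containing it, folding nested variants into one candidate."""
    return sum(count for phrase, count in raw_counts.items()
               if phrase == candidate
               or (len(phrase) > len(candidate) and contains_phrase(phrase, candidate)))

raw_counts = {"greater yellowstone park fire": 2, "yellowstone park fire": 2}
merged_frequency("yellowstone park fire", raw_counts)  # 4, as in the example above
```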
Finally, the identifiers that use linguistic features (the NP Chunker and the C-value
identifier) deliver the best performance in 23 out of 24 evaluation cases, regardless of
the dataset and the ranking algorithm they are coupled with. This further suggests that
linguistic features play a critical role in AKE.
3.4.5.2 Phrase Scoring Impact
Two phrase scoring techniques are employed in this evaluation: direct phrase
ranking and phrase score summation. On the Hulth dataset, the phrase score summation
technique generally produces better results than direct phrase ranking. On the
SemEval dataset, the phrase score summation technique produces much worse results.
However, on the DUC dataset, there is no clear trend – only the ranking
algorithms paired with C-value and the NP Chunker produce better results when combined
with the phrase score summation technique.
The Hulth dataset contains short documents, so the frequencies and co-occurrence
frequencies of phrases are very low – a typical phrase may appear only once or twice
in a document. When using the direct phrase ranking approach, phrases are the inputs
to the ranking algorithms, and hence, with such low frequencies, it is difficult for
the algorithms to correctly identify the important phrases. On the other hand, using
the phrase score summation technique, the inputs to the ranking algorithms are words,
which naturally present better statistical information than phrases. For example, in
Figure 3.2 (A), the document Doc 1 has the phrases information interaction, information
architecture, and content, and each phrase occurs twice in the document. Among the
three phrases, content is obviously the least important. However, since all
three phrases have the same frequency, it is difficult for the ranking algorithms to
differentiate the importance of each phrase using the direct phrase ranking approach.
If only words are considered, i.e. using the phrase score summation approach, the word
information will gain a very high score from the ranking algorithms. Hence, by taking
the sum of each word's score as a phrase's score, the phrases information interaction and
information architecture will gain much higher scores than content.
In the long-document collection – the SemEval dataset – important phrases easily
occur tens of times, whereas unimportant ones may occur no more than a few
times. This provides much better statistical information for the direct phrase ranking
approach. On the other hand, long documents have more word combinations, and
hence more candidate phrases will be identified. When the phrase score summation
accumulates words' scores, any phrase containing highly scored words will stay at the top
of the ranking list – the longer a phrase is, the higher the score it receives. For example, if
the word network receives a high score after ranking, the longest phrase hybrid neural
network architecture will have the highest score, followed by its cousins neural network
architecture and neural network, where only the phrase neural network is among the
ground-truth keyphrases.
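This length bias is easy to reproduce (a toy demonstration with hypothetical word scores in which network and neural score highly):

```python
# Hypothetical word scores after ranking; "network" and "neural" score highly.
word_scores = {"hybrid": 0.3, "neural": 1.2, "network": 1.5, "architecture": 0.4}

def summed_score(phrase):
    """Phrase score summation: the sum of the phrase's word scores."""
    return sum(word_scores[w] for w in phrase.split())

phrases = ["hybrid neural network architecture", "neural network architecture", "neural network"]
scores = [summed_score(p) for p in phrases]
# The longest phrase accumulates the highest score,
# even though only "neural network" is a ground-truth keyphrase.
```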
The DUC dataset sits between the long- and short-document collections, and hence
inherits characteristics of both. Phrases occur more frequently in the
DUC dataset than in the Hulth dataset, but not as frequently as in the SemEval dataset.
The DUC dataset has fewer word combinations than SemEval, yet the identified candidates
are not as clean as those of the Hulth dataset. These characteristics lead to very interesting
results – the two scoring approaches produce very similar results with Prefixspan,
N-gram, and the PTS Splitter, but the phrase score summation approach produces much
better results with the candidate identifiers that use linguistic analysis. The reason
is simple: identifiers using linguistic analysis produce cleaner candidate phrase lists. As
shown in Table 3.2, the candidate identifiers without linguistic analysis produce many
more candidate phrases than those using linguistic analysis. For example, the NP Chunker
and C-value can correctly identify the phrase yellowstone park fire from the text segment
greater yellowstone park fire, whereas the others cannot. Why, then, is there no significant
improvement in the results using the direct phrase ranking approach with the linguistic
candidate phrase identifiers? One reason is that phrases in mid-length documents
do not occur as frequently as in long documents. Another is that many keyphrases do
not occur with high frequencies, as we discuss in the next section.
3.4.5.3 Frequency Impact
Figure 3.6: Ground-truth and extracted keyphrase distributions per document.
(A): Distributions on the DUC Dataset; (B): Distributions on the SemEval Dataset.
The majority of unsupervised AKE algorithms aim to identify phrases with relatively
high frequency, based on the assumption that keyphrases tend to occur more
frequently than other phrases. However, this assumption does not always hold. We
plot the ground-truth frequency distributions for the DUC and SemEval datasets, along
with the number of keyphrases extracted by each algorithm, in Figure 3.6. The
vertical axis is the number of keyphrases assigned by human annotators, and the horizontal
axis shows their frequencies in the corresponding documents. Many ground-truth
keyphrases occur only 2 to 3 times in a document, yet all the ranking algorithms fail
to extract keyphrases with relatively low frequencies. TextRank, HITS, and RAKE
extract keyphrases relying solely on the information in individual documents, and therefore
suffer seriously from the frequency-sensitivity problem. TF-IDF, on the other hand, is
less sensitive to frequency than the others because it considers both the local
document frequencies of phrases and how they occur in the entire corpus. This suggests
that incorporating external knowledge may mitigate the frequency-sensitivity problem in
unsupervised AKE systems. Nevertheless, improving the performance of ranking algorithms
requires a focus on identifying keyphrases among low-frequency candidate phrases.
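The corpus-level signal that makes TF-IDF less frequency-sensitive can be sketched as follows (a minimal phrase-level TF-IDF with a toy corpus for illustration):

```python
import math
from collections import Counter

def tfidf_scores(doc_phrases, corpus):
    """Score each phrase in `doc_phrases` by TF-IDF, where `corpus` is a list
    of documents, each a list of candidate phrases (including `doc_phrases`)."""
    df = Counter(p for doc in corpus for p in set(doc))   # document frequency
    tf = Counter(doc_phrases)                             # local term frequency
    n_docs = len(corpus)
    return {p: (tf[p] / len(doc_phrases)) * math.log(n_docs / df[p]) for p in tf}

# Toy corpus: "neural network" is rare across documents, "system" is ubiquitous.
doc = ["neural network", "system", "neural network", "system"]
corpus = [doc, ["system"], ["system"], ["system"]]
scores = tfidf_scores(doc, corpus)
# Despite identical local frequencies, "neural network" outranks "system"
# because it is rare in the rest of the corpus.
```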
3.5 Conclusions
In this chapter, we have conducted a systematic evaluation of five common candidate
phrase selection approaches and two candidate phrase scoring techniques, coupled with
four unsupervised ranking algorithms. Our evaluation is based on three publicly available
datasets containing articles of various lengths.
The evaluation reveals three key observations. Firstly, candidate identification approaches
have a strong impact on overall performance, where approaches based on both
statistical and linguistic features usually deliver better performance. Secondly, the choice
of phrase scoring approach is critical and should be carefully considered against the chosen
dataset. Direct phrase ranking may be more suitable for longer datasets containing
more word combinations, whereas phrase score summation may be a
better choice for relatively short and clean datasets with fewer phrases. Thirdly,
keyphrases do not always occur with high frequencies, and existing unsupervised AKE
algorithms suffer from the frequency-sensitivity problem. The TF-IDF algorithm delivers
relatively better results at extracting keyphrases with low frequencies since it uses
corpus-level statistics, which suggests that incorporating external knowledge may mitigate
the frequency-sensitivity problem; further improvement should therefore focus on
identifying keyphrases among low-frequency candidates using external knowledge. In the
next chapter, we address this problem by presenting a knowledge-based graph-ranking
approach that uses word embedding vectors as an external knowledge source providing
semantic features.
Chapter 4
Using Word Embeddings as
External Knowledge for AKE
In Chapter 3, we systematically evaluated four unsupervised AKE algorithms:
TF-IDF [2], RAKE [45], TextRank [3], and HITS [88]. The evaluation not only
shows the impact of different candidate identification and scoring approaches on an
AKE system, but also reveals a common weakness in unsupervised AKE algorithms – they
mostly fail to extract keyphrases with low frequencies. TF-IDF produces slightly better
performance than the others because it uses statistical information from the entire
corpus. This suggests that using external knowledge may mitigate the frequency-sensitivity
problem.
In this chapter, we investigate using additional semantic knowledge supplied by pre-
trained word embeddings to overcome the frequency-sensitivity problem and improve
the performance of unsupervised AKE algorithms. We focus on graph-based ranking
approaches. The family of graph-based ranking algorithms is by far the most widely
used unsupervised approach for AKE; it is based on the intuition that keyphrases are
the phrases that have stronger relations with others. In the past decade, the development
of graph-based AKE systems has mainly focused on discovering and investigating
different weighting schemes that compute the strengths of relations between phrases. The
weights are assigned to edges for each pair of vertices that represent the corresponding
phrases. Early studies assign co-occurrence statistics as weights [10, 198–200]. How-
ever, co-occurrence statistics do not provide any semantic knowledge. More recently,
researchers use available knowledge bases such as Wikipedia and WordNet to obtain
semantic relatedness of words and phrases [6, 7, 24]. The shortcoming is that public
knowledge bases offer limited vocabularies that cannot cover all domain-specific terms.
In addition, they only provide general semantics of words and phrases, which makes little
contribution to identifying the semantic relatedness of entities in domain-specific corpora,
because the same phrases may have very different meanings in a specific domain compared
with their general meanings.
In contrast to existing work, we use both co-occurrence statistics in documents, and
the semantic knowledge encoded in word embedding vectors, to compute the relation
strengths of phrases. Word embeddings are trained over both general corpora and
domain-specific datasets, enabling them to encode both general and domain-specific
knowledge. We evaluate the weighting scheme with four graph ranking algorithms, in-
cluding Weighted PageRank, Degree Centrality, Betweenness Centrality and Closeness
centrality. We demonstrate that using word embedding to supply extra knowledge for
measuring the semantic relatedness of phrases is an efficient approach to mitigate the
frequency-sensitive problem. It also generally improves the performance of graph rank-
ing algorithms.
4.1 Weighting Schemes for Graph-based AKE
Keyphrases are representative phrases that describe the main ideas or arguments of an
article. Keyphrases can be considered the ‘soul’ of an article that forms its skeleton;
other words and phrases in the article are supporting terms that help emphasise the
key ideas. Graph-based ranking approaches for AKE are built upon this intuition –
keyphrases are the phrases having stronger relations with others, which tie and hold the
entire article together. Graph-based approaches represent documents as graphs, where
the candidate phrases are vertices, and the edges between two vertices represent their relations. In
most graph-based AKE systems, edges are identified from the co-occurrence of two phrases,
i.e. two vertices are connected if their corresponding phrases co-occur within a
predefined window.
The most well-known graph-based AKE algorithm is TextRank [3], which is developed
upon the idea of assigning higher scores both to vertices having more edges (co-occurring
more often) and to vertices linking only to a few of the most important vertices.
However, TextRank does not take other information, such as co-occurrence frequencies,
into consideration1. Following TextRank, Wan and Xiao [10] present SingleRank
using weighted PageRank [201], where the weights are normalised co-occurrence frequencies.
Other studies [198–200] also investigate different graph-ranking algorithms by
assigning co-occurrence frequencies as the weights of edges, including Degree Centrality,
1TextRank is developed for both AKE and text summarisation; it does not assign weights to edges for the AKE task.
Betweenness Centrality, Closeness Centrality, Strength, Neighbourhood Size, Coreness,
Clustering Coefficient, Structural Diversity Index, Eigenvector Centrality, and HITS.
However, assigning weights based on the co-occurrence frequencies of individual documents
may not suit short documents, such as the abstracts of journal papers in the Hulth dataset,
where most phrases co-occur only once or twice. Instead of considering only co-occurrence
frequencies within documents, researchers have also attempted to use the knowledge embedded
in the corpus. For example, ExpandRank [10] analyses how phrases co-occur not
only in the current document, but also in its neighbourhood documents – the most similar
documents in the corpus, ranked by cosine similarity. Weights are computed as the
sum of the phrases' co-occurrence frequencies in each neighbourhood document, multiplied
by the cosine similarity score. ExpandRank assumes that topic-wise similar documents
exist in the corpus. However, this prerequisite is not always met. Liu et al. [39]
apply Latent Dirichlet Allocation (LDA) [76] to induce latent topic distributions for each
candidate, then run Personalised PageRank [91] for each topic separately, where the
random jump factor of each vertex in a graph is weighted by its probability distribution
for the topic, and the out-degree of a vertex is weighted by the co-occurrence statistics.
In reality, however, the topics in a document are not equally important, and hence trivial
topics may not be useful for the AKE task.
While incorporating the lexical statistics of corpora certainly improves the performance of
graph-based ranking algorithms, these AKE systems can be dataset-dependent, i.e. they
usually work well on one particular dataset [9]. The lack of semantic interpretation or
measurement of phrases inhibits the opportunity to accurately identify phrases representing
a document's core theme. Thus, researchers have started using semantic knowledge bases to
measure the semantic relatedness of phrases. The semantic relatedness is mostly obtained
from external knowledge bases. Grineva et al. [7] use links between Wikipedia concepts
to compute semantic relatedness based on the Dice coefficient [68]. Wang
et al. [6] and Martinez-Romo et al. [24] use synsets in WordNet to obtain the semantic
relatedness between two candidates. However, publicly available knowledge bases such
as Wikipedia and WordNet are designed for general purposes, and may not precisely
capture the semantics of domain-specific phrases. Moreover, they offer limited vocabularies:
WordNet is handcrafted and thus has a limited vocabulary; Wikipedia has a richer vocabulary
within its articles, but the techniques for inducing semantic relatedness typically
rely on analysing the hyperlinks and titles of Wikipedia articles.
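For reference, the Dice coefficient over two concepts' link sets is simply 2|A ∩ B| / (|A| + |B|); a generic sketch (not Grineva et al.'s exact implementation) is:

```python
def dice_coefficient(links_a, links_b):
    """Dice coefficient between two sets, e.g. the sets of Wikipedia
    articles linking to two concepts: 2|A intersect B| / (|A| + |B|)."""
    a, b = set(links_a), set(links_b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

dice_coefficient({"p1", "p2", "p3"}, {"p2", "p3", "p4"})  # 2*2/6 = 0.666...
```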
The development of graph-based AKE approaches can be thought of as a journey of
discovering and investigating different weighting schemes built upon different levels
of knowledge. Studies utilising phrases' co-occurrence statistics from documents only
consider local statistical information. Topic-based approaches incorporate the
knowledge of the corpus. Public knowledge bases essentially provide general knowledge
that helps algorithms to measure the general semantic relatedness of phrases.
4.2 Proposed Weighting Scheme
In contrast to existing work, we propose a weighting scheme built upon three levels of
knowledge: the local document, the corpus, and general knowledge. Phrase
co-occurrence statistics can easily be obtained from documents. To gain corpus-level
and general knowledge, we pre-train word embeddings over both general and domain-specific
corpora, and use the semantic information encoded in the word embedding vectors
to measure the semantic relatedness of phrases.
4.2.1 Word Embeddings as Knowledge Base
Capturing both domain-specific knowledge (in the corpus) and general knowledge
is a challenging task, because the meanings of domain-specific terms (words or phrases)
in the domain can be very different from their general meanings. For example, neuron
generally relates to biological terms such as cortical, sensorimotor and neocortex. In
computer science, particularly the machine learning domain, the term neuron often refers to
a computational unit that performs a linear or non-linear computation. Hence, the most
semantically related phrases are either the names of machine learning algorithms or
artificial neural network architectures, such as back-propagation, hopfield, feed-forward and
perceptron. However, the meanings of non-domain-specific terms remain the same even in
domain-specific corpora. For example, the meaning of dog remains the same in the computer
science domain, but the word rarely appears.
Word embeddings are low dimensional and real-valued vector representations of words.
A typical way to induce word embeddings is to learn a probabilistic language model
that predicts the next word given its previous ones. Hence, word embeddings essentially
encode the co-occurrence statistics of words over the training corpus [202]. Such
information is capable of representing the semantics of words – the distributional hypothesis
states that words that occur in similar contexts tend to have similar meanings [164].
Mikolov et al. [13] have demonstrated that embedding vectors trained over a Wikipedia
snapshot capture both semantic and syntactic features of words.
Since word embeddings encode the co-occurrence statistics of words distributed in the
corpus, they can be retrained over different corpora to encode the meanings of domain-specific
terms whose co-occurrence statistics are distinct from those of the general domain.
Table 4.1 lists some sample words and their most similar words, extracted using the cosine
similarity measure. The word embeddings were first trained over a general corpus – a
Wikipedia snapshot – and then retrained over a computer science corpus. As shown
in the table, the meanings of domain-specific words change significantly to reflect
the domain knowledge. Neutral words, which naturally relate to the specific domain, change
very little. Non-domain-specific words tend to retain their original meanings.
This can be understood intuitively – domain-specific words change their co-occurrence partners
in the corpus, whereas non-domain-specific words do not.
Table 4.1: Most Similar Words: top-ranked most similar words to the sample words,
fetched using cosine similarity; embeddings trained twice: 1) over a general Wikipedia
dataset, and 2) retrained over a computer science domain-specific dataset

Sample       Most Similar (General Domain)                  Most Similar (CS Domain)
Domain-Specific
  neural     cortical, sensorimotor, neocortex,             back-propagation, hopfield, feed-forward,
             sensory, neuron                                perceptrons, network
  thread     weft, skein, yarn, darning, buttonhole         dataflow, parallelism, user-specified,
                                                            client-side, mutexes
Neutral
  algorithm  heuristic, depth-first, breadth-first,         breadth-first, bilinear, depth-first,
             recursive, polynomial                          autocorrelation, eigenspace
  database   RDBMS, clusterpoint, metadata,                 repository, web-accessible, metadata,
             mapreduce, memcached                           searchable, CDBS
Non-domain Specific
  dog        sheepdog, doberman, puppy, rottweiler,         cat, animal, wolf, pet, sheep
             poodle
  flight     take-off, aircraft, jetliner, jet, airplane    take-off, aircraft, airline, landing, airport

Domain-Specific: words that have totally different meanings in the open domain and the specific domain. Neutral: words that essentially relate to the specific domain even in the open domain. Non-domain Specific: words that do not change their meanings.
4.2.2 Weighting Scheme
The weighting scheme is developed from a few intuitions about what keyphrases are. We use two indicators: co-occurrence frequency and semantic relatedness. Frequencies have long been the main source of evidence for statistical language processing. An important phrase of a document should have a relatively high frequency and co-occur with many different phrases; conversely, high co-occurrence frequencies indicate that the phrase itself is a highly frequent phrase. On the other hand,
a phrase that selectively co-occurs with one or a few particular highly frequent phrases can also be important. For example, in an article about deep learning, terms like recurrent network and convolutional network may appear many times, and hence they can be important for representing the theme of the article. A phrase neural network may appear only twice, co-occurring once with recurrent network and once with convolutional network. It is clear that neural network is a good keyphrase candidate even though it has a very low frequency. On the other hand, suppose a phrase cognitive science also co-occurs once with recurrent network and convolutional network. How do we differentiate the importance of neural network and cognitive science? We use the second indicator – semantic relatedness. Both recurrent networks and convolutional networks are instances of neural networks, and hence their semantic relatedness to neural network is much stronger than that of cognitive science. Consequently, we propose computing the weight as the product of two phrases' co-occurrence frequency and semantic relatedness.
Formally, let S be the relation strength of phrases p_i and p_j. We compute S as the product of the co-occurrence frequency count coocc of p_i and p_j in document D and the cosine similarity score sim of their corresponding embedding vectors:

S(p_i, p_j) = coocc(p_i, p_j) × sim(p_i, p_j)    (4.1)

The cosine similarity is computed as:

sim(p_i, p_j) = (v_i · v_j) / (||v_i|| ||v_j||)    (4.2)

where v_i and v_j are the embedding vectors for p_i and p_j, respectively.
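The weighting scheme in Equations 4.1 and 4.2 can be sketched in a few lines (a minimal illustration assuming embeddings are plain Python lists; the function names are ours, not from the original implementation):

```python
import math

def cosine_sim(v_i, v_j):
    """Cosine similarity between two embedding vectors (Equation 4.2)."""
    dot = sum(a * b for a, b in zip(v_i, v_j))
    norm_i = math.sqrt(sum(a * a for a in v_i))
    norm_j = math.sqrt(sum(b * b for b in v_j))
    return dot / (norm_i * norm_j)

def relation_strength(coocc, v_i, v_j):
    """S(p_i, p_j) = coocc(p_i, p_j) x sim(p_i, p_j) (Equation 4.1)."""
    return coocc * cosine_sim(v_i, v_j)
```

For example, two phrases that co-occur twice and have identical embedding vectors receive strength 2 × 1.0 = 2.0.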
4.3 Training Word Embeddings
We use the SkipGram model introduced by Mikolov et al. [15] to train word embeddings. The SkipGram model aims to learn a function f that maps words to probability distributions over their context words. In the SkipGram model, there are two vector representations for each word: an input vector and an output vector. The two vectors are used in conjunction to compute the probabilities. The SkipGram model takes one word w_t (the centre word) as input to predict its surrounding words w_{t−n}, ..., w_{t−1}, w_{t+1}, ..., w_{t+n}, where n is the size of the surrounding window.
Formally, let C denote the set of input word embedding vectors, and C′ denote the set of output vectors. Given a sequence of words w_1, w_2, ..., w_T, the goal is to maximise
the conditional probability distribution P(w_{t+j} | w_t) over all words in the vocabulary V of a training corpus by searching for parameters θ = (C, C′). The learning objective is to maximise the average log probability:
L = (1/T) Σ_{t=1}^{T} Σ_{−n ≤ j ≤ n, j ≠ 0} log P(w_{t+j} | w_t; θ)
Using the softmax function, we have:

P(w_{t+j} | w_t; θ) = exp(C′(w_{t+j})^T · C(w_t)) / Σ_{i=1}^{V} exp(C′(w_i)^T · C(w_t))
where C′(w_{t+j}) is the output vector representation for w_{t+j}, and C(w_t) is the input vector representation for w_t. However, directly applying the softmax function over a large corpus is computationally very expensive due to the summation over the entire vocabulary V. To improve computational efficiency, we use Negative Sampling [15].
The negative sampling algorithm used in the SkipGram model is a simplified version of Noise Contrastive Estimation [203]. The idea is to reward positive samples and penalise negative samples. Considering a pair of words (w_i, w_j) in a context, negative sampling rewards the pair if it comes from the actual training data (the words co-occur in a context) by assigning the probability P(D = 1 | w_i, w_j). Conversely, if the pair does not come from the training data, the probability P(D = 0 | w_i, w_j) is assigned, meaning the words never co-occur in the training data. The learning objective is to maximise the probability of positive samples and minimise the probability of negative ones (randomly generated) by searching for parameters θ. In the actual training, given a centre word w_t, each surrounding word w_{t−n}, ..., w_{t−1}, w_{t+1}, ..., w_{t+n} is treated as a positive sample, whereas negative samples are generated by randomly selecting words from the vocabulary. The probabilities for positive and negative samples are computed as:
p(w | w_t; θ) = σ(C′(w)^T C(w_t))        for positive samples
p(w | w_t; θ) = 1 − σ(C′(w)^T C(w_t))    for negative samples
Let POS(w) be the set of positive samples and NEG(w) be the set of negative samples; the overall probability is computed as:

P(w | w_t; θ) = ∏_{i ∈ POS(w)} σ(C′(w_i)^T C(w_t)) × ∏_{j ∈ NEG(w)} (1 − σ(C′(w_j)^T C(w_t)))
The objective function maximises the log probability:

L = Σ_{i ∈ S(w)} [ y_i · log σ(C′(w_i)^T C(w_t)) + (1 − y_i) · log(1 − σ(C′(w_i)^T C(w_t))) ]

where S(w) = POS(w) ∪ NEG(w), and y_i = 1 for positive samples and y_i = 0 for negative samples.
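The negative-sampling objective above can be computed directly from the definitions (a minimal sketch using plain Python lists; the vectors and sample pairs are illustrative, not from the actual training code):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ns_log_likelihood(center_vec, samples):
    """Negative-sampling log likelihood for one centre word.

    samples: list of (output_vector, y) pairs, where y = 1 marks a
    positive (observed) sample and y = 0 a negative (randomly drawn) one.
    """
    total = 0.0
    for out_vec, y in samples:
        p = sigmoid(dot(out_vec, center_vec))
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total
```

Gradient updates on this quantity touch only the centre vector and the sampled output vectors, which is what makes negative sampling cheap compared with the full softmax.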
4.4 Implementation
4.4.1 Training Word Embeddings
We first train our model over an English Wikipedia snapshot downloaded in January
2015. Then we retrain the embeddings over each AKE dataset separately to encode the
domain-specific knowledge of each dataset.
The hyper-parameters used for training are:
• Number of negative samples: 5
• Window size: 10
• Embedding vector dimensions: 300
• Negative table size: 2 × 10^9
• Learning rate: 0.01 for training over the general dataset, and 0.005 for training over the AKE datasets
• Subsampling parameter: 10^−6 for training over the general dataset, and 10^−4 for training over the AKE datasets
• Word minimum occurrence frequency: 20 for training over the general dataset, and 1 for training over the AKE datasets
• Iterations over the training set: 1 for training over the general dataset, and 10 for training over the AKE datasets
Some hyper-parameters used in training over the AKE datasets differ from those used for the general dataset because the AKE datasets are much smaller, so the parameters need to be tuned slightly to generalise the learning.
The dimension of the embedding vectors controls how much information an embedding vector may encode. Mikolov et al. [15] show that larger dimensions can certainly encode more knowledge. However, there is a trade-off between the dimensionality of word embeddings and the training time – larger dimensions significantly increase the training time. Our empirical evidence, evaluated on word analogy tests, shows that the amount of information encoded in embedding vectors increases significantly up to 300 dimensions, after which the difference becomes trivial from 300 to 1,000 dimensions. This reflects the trade-off between the amount of training data and the dimensionality of the word embeddings, i.e. much more data is needed if we expand the capacity (the embedding dimensionality) of the model. In this study, the dimension of the embedding vectors is set to 300.
The subsampling threshold balances frequent and infrequent words in the training samples. The most frequent words (not only stop-words) can occur millions of times more often than infrequent ones, which may overfit the model. To generalise the learning, the subsampling threshold is used to randomly discard frequent words with probability p(w) = 1 − √(t / f(w)), where f(w) is the frequency count of word w and t is the subsampling threshold parameter. We use t = 10^−6 for training the model over the general dataset, and t = 10^−4 for the AKE datasets due to their small size.
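The discard probability can be computed directly from the formula above (a minimal sketch following the document's count-based formula; clipping at zero for words below the threshold is our assumption):

```python
import math

def discard_probability(freq, t):
    """p(w) = 1 - sqrt(t / f(w)); clipped at 0 for infrequent words."""
    return max(0.0, 1.0 - math.sqrt(t / freq))
```

Words whose frequency falls below the threshold are always kept, while very frequent words are discarded with probability approaching 1.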
The iteration count controls how many times we train the embeddings over the dataset. The Wikipedia snapshot contains enough training samples to generalise the learning, thus we train only once2. The AKE datasets do not contain enough training examples, so we iterate the training 10 times to gain better generalisation3.
4.4.2 Phrase Embeddings
The SkipGram model learns word embedding vectors. To obtain multi-word phrase embeddings, we use two common approaches, namely the holistic and the algebraic composition approaches [177]. The holistic approach learns phrase embeddings by treating pre-identified phrases as atomic units, where each phrase is associated with a vector and trained in the same way as words. The algebraic composition approach, on the other hand, applies simple algebraic functions to derive phrase embeddings. In this study, we apply the vector addition function, i.e. a phrase embedding is computed as the sum of its component word embeddings.
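Additive composition is straightforward to sketch (illustrative names; skipping out-of-vocabulary words is our assumption):

```python
def compose_phrase(words, embeddings):
    """Phrase embedding as the sum of its component word embeddings."""
    dim = len(next(iter(embeddings.values())))
    phrase_vec = [0.0] * dim
    for w in words:
        vec = embeddings.get(w)
        if vec is None:
            continue  # skip words missing from the vocabulary
        phrase_vec = [p + v for p, v in zip(phrase_vec, vec)]
    return phrase_vec
```

Because addition is commutative, composing ["first", "lady"] and ["lady", "first"] yields the same vector, which is exactly the word-order limitation discussed later in this chapter.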
2 The obtained word embeddings produce exactly the same results on the word analogy test as claimed in the original paper [15]; hence we are confident in our re-implementation.
3 We empirically evaluated different iteration counts for the AKE datasets, and did not notice any significant improvement beyond 10 iterations.
4.4.3 Ranking Algorithms
4.4.3.1 Degree Centrality
In unweighted graphs, the Degree Centrality of a vertex is simply the number of connections the vertex has. Formally, let C_d(i) be the degree of vertex v_i in graph G, N be the number of vertices that v_i connects to, and x_ij be the connection between vertices v_i and v_j; the degree of v_i is computed as:

C_d(i) = Σ_{j=1}^{N} x_ij    (4.3)
In weighted graphs, the Degree Centrality can be extended to the sum of the weights w of all edges that i connects to [204, 205]:

C^w_d(i) = Σ_{j=1}^{N} w_ij    (4.4)
It can be normalised by taking the ratio of the Degree Centrality to the maximum possible degree N − 1:

NC^w_d(i) = C^w_d(i) / (N − 1)    (4.5)
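Equations 4.4 and 4.5 amount to a sum over an adjacency structure (a minimal sketch assuming the graph is represented as a dict mapping each vertex to a dict of neighbour weights):

```python
def weighted_degree(graph, i):
    """C^w_d(i): sum of the weights of edges incident to vertex i (Eq. 4.4)."""
    return sum(graph[i].values())

def normalised_weighted_degree(graph, i):
    """Eq. 4.5: divide by the maximum possible degree N - 1."""
    n = len(graph)
    return weighted_degree(graph, i) / (n - 1)
```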
4.4.3.2 Betweenness and Closeness Centrality
The Betweenness Centrality of a vertex v_i is the number of times the vertex acts as a bridge lying on the shortest path between two other vertices. We use the algorithm proposed by Brandes [206] to compute the betweenness of vertex v_i as:

C_b(i) = Σ_{s ≠ t ≠ i ∈ V} σ_st(i) / σ_st    (4.6)
where V is the set of vertices in graph G, σ_st is the number of shortest paths from v_s to v_t in G, and σ_st(i) is the number of those shortest paths passing through v_i (other than v_s and v_t). The Betweenness Centrality can also be normalised by:

NC_b(i) = 2 × C_b(i) / ((N − 1)(N − 2))    (4.7)
The Closeness Centrality is the inverse of the sum of the shortest-path distances from a vertex v_i to all other vertices it connects to, computed as:

C_c(i) = (N − 1) / Σ_{j=1}^{N} d(j, i)    (4.8)

where d(j, i) is the shortest-path distance between vertices v_j and v_i.
Both Betweenness and Closeness centralities rely on identifying the shortest paths in a graph. We use Dijkstra's algorithm [207] to identify the shortest paths in a weighted graph.
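Closeness, for instance, can be computed by running Dijkstra's algorithm from each vertex (a minimal sketch for a dict-of-dicts graph with positive weights; in practice a graph library could be used instead):

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distances from source (positive edge weights)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def closeness(graph, i):
    """Eq. 4.8: (N - 1) divided by the sum of shortest-path distances."""
    dist = dijkstra(graph, i)
    return (len(graph) - 1) / sum(d for v, d in dist.items() if v != i)
```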
4.4.4 PageRank
The original PageRank algorithm [89] ranks the vertices of a directed unweighted graph. In a directed graph G, let in(v_i) be the set of vertices that point to a vertex v_i, and out(v_i) be the set of vertices to which v_i points. The PageRank score of v_i is calculated as:

S(v_i) = (1 − d) + d × Σ_{v_j ∈ in(v_i)} (1 / |out(v_j)|) × S(v_j)    (4.9)
where d is the damping factor. The Weighted PageRank [3, 201] score of a vertex v_i is:

WS(v_i) = (1 − d) + d × Σ_{v_j ∈ in(v_i)} (w_ji / Σ_{v_k ∈ out(v_j)} w_jk) × WS(v_j)

where w_ij is the weight of the connection between vertices v_i and v_j. Both PageRank and Weighted PageRank require hyper-parameter settings. We set the damping factor to 0.85, the number of iterations to 30, and the convergence threshold to 10^−5. For undirected graphs, the in-degree of a vertex simply equals its out-degree.
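The weighted PageRank recurrence with the stated hyper-parameters can be sketched as a fixed-point iteration (a minimal illustration for an undirected dict-of-dicts graph, where in- and out-neighbours coincide):

```python
def weighted_pagerank(graph, d=0.85, iterations=30, tol=1e-5):
    """Iterate WS(v) = (1 - d) + d * sum over neighbours u of
    w_uv / sum_k(w_uk) * WS(u), stopping early on convergence."""
    scores = {v: 1.0 for v in graph}
    out_weight = {v: sum(nbrs.values()) for v, nbrs in graph.items()}
    for _ in range(iterations):
        new = {}
        for v in graph:
            rank = sum(graph[u][v] / out_weight[u] * scores[u]
                       for u in graph[v])  # undirected: in(v) = neighbours
            new[v] = (1 - d) + d * rank
        delta = max(abs(new[v] - scores[v]) for v in graph)
        scores = new
        if delta < tol:
            break
    return scores
```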
4.4.5 Assigning Weights to Graphs
Two vertices are connected if the corresponding candidate phrases co-occur within a window size of 2. For Degree Centrality and PageRank, weights are assigned directly to edges. For Betweenness and Closeness, the weights need to be reversed before being assigned to edges, as 1/w_ij, where w_ij is the weight between vertices i and j computed using Equation 4.2. Since our weighting scheme is the product of co-occurrence frequency and cosine similarity, it is possible for a weight to be negative, or even zero if two vectors are orthogonal. From our observation, fewer than 0.1% of the weights occurring in the evaluation datasets are negative, and we did not encounter any orthogonal vectors. Therefore, we simply do not connect two vertices if a negative weight occurs.
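The graph construction just described can be sketched as follows (a minimal illustration; `strength` stands in for the relation strength S of Equation 4.1, and the window slides over the document's candidate sequence):

```python
from itertools import combinations

def build_graph(candidates, strength, window=2):
    """Connect candidate phrases that co-occur within `window`,
    weighting edges by S and skipping non-positive weights."""
    graph = {p: {} for p in candidates}
    for k in range(len(candidates) - window + 1):
        for p, q in combinations(candidates[k:k + window], 2):
            if p == q or q in graph[p]:
                continue
            w = strength(p, q)
            if w <= 0:
                continue  # do not connect vertices with negative weights
            graph[p][q] = w
            graph[q][p] = w
    return graph

def invert_weights(graph):
    """For Betweenness and Closeness: strong relations should be short
    distances, so each edge weight becomes 1 / w."""
    return {v: {u: 1.0 / w for u, w in nbrs.items()}
            for v, nbrs in graph.items()}
```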
4.5 Evaluation
We use the two phrase scoring techniques described in Section 3.4.4.1: direct phrase ranking and phrase score summation. Direct phrase ranking ranks pre-identified phrases instead of individual words; phrases that cannot be pre-identified are not ranked or scored. The phrase score summation approach sums the scores of all constituent words to make up a phrase score; under this scoring approach, the holistic approach is not applicable, hence no result for the holistic approach is shown in Figure 4.1. We select the top 10 ranked phrases as potential keyphrases. Results are measured by Precision, Recall, and F-measure. We use the datasets described in Chapter 3, namely the Hulth, DUC, and SemEval datasets.

We set up two baselines. One is unweighted graph ranking using the four ranking algorithms: Degree, Betweenness, and Closeness centralities, and PageRank. The other baseline uses normalised co-occurrence frequencies as weights.
4.5.1 Evaluation Results and Discussion
We plot the evaluation results in six column charts, shown in Figure 4.1. Each chart has four groups of columns, showing the evaluation results for the four graph ranking algorithms.
4.5.1.1 Discussion on Direct Phrase Ranking System
The proposed weighting scheme uniformly improves the performance of all ranking algorithms on the evaluation datasets. However, using only co-occurrence frequency statistics as weights does not boost performance in comparison to unweighted graphs; in many cases it produces even worse results. As we showed in the last chapter, the majority of unsupervised algorithms are overly sensitive to the frequencies of phrases, i.e. they are unable to identify keyphrases with low frequencies. Hence, accumulating co-occurrence frequencies of phrases as weights does not solve the frequency-sensitivity problem; it rather makes the situation worse – more frequent phrases receive even higher ranking scores, because the co-occurrence frequency statistics implicitly encode the frequencies of the phrases. That is why unweighted graph ranking outperforms co-occurrence-statistics-based ranking. On the other hand, our weighting scheme
[Figure 4.1 comprises six column charts (F-scores ranging from 0% to 40%), one per dataset (Hulth, DUC, SemEval) and scoring approach (direct phrase ranking, phrase score summation). Each chart compares Degree, Betweenness, Closeness, and PageRank under four weightings: Unweighted, Co-occurrence frequency, Holistic Phrase Embedding, and Compositional Phrase Embedding; in the phrase score summation charts, the embedding-based weighting is labelled Using Word Embeddings.]

Figure 4.1: Evaluation Results (F-score): the left charts show the results using the direct phrase ranking scoring approach, where the green and purple columns show the embedding effects. The right charts show the results using the phrase score summation approach, where the green columns show the embedding effects.
mitigates the frequency impact by taking the product of the co-occurrence frequency and the semantic relatedness of a pair of phrases, which considers not only how the pair of phrases co-occurs in the document, but also how they co-occur in both general and domain-specific datasets, since word embeddings implicitly capture the co-occurrence statistics of words and phrases.
Comparing the two phrase embedding approaches, the holistic embedding approach produces slightly better results in most cases. The holistic approach treats phrases in the same way as words during training, thus theoretically it encodes more accurate co-occurrence statistics than the vector addition approach. For example, the holistic approach considers the sequence of words, so the phrase embeddings for first lady and lady first are different. The vector addition approach, on the other hand, simply sums the values of the word embedding vectors, so it does not take the sequential information of words into consideration. In practice, however, the holistic approach may suffer from data sparsity, because it is not possible to pre-identify all possible phrases.
Comparing the performance of different graph ranking algorithms is not the main interest of this study; we note, however, that PageRank and Closeness Centrality produce slightly better performance on all three datasets.
4.5.1.2 Discussion on Phrase Score Summation System
In the phrase score summation system, the improvement is not as clear as in the direct phrase ranking system. On the Hulth dataset, the proposed weighting scheme still improves the performance of all algorithms by about 2%. However, the improvement on the DUC dataset becomes blurred, and there is no improvement, or even slightly worse performance, on the SemEval dataset. We identify two reasons.

Firstly, word embeddings do not directly convey the semantic meanings of phrases. The score summation system ranks unigram words, and a phrase's score is the sum of the scores of its constituent words. Therefore, the graph is constructed from unigram words rather than phrases, and applying our weighting scheme essentially assigns weights to pairs of words rather than pairs of phrases. No semantic knowledge of phrases is involved in the ranking process; the score summation system thus extracts any phrase containing highly scored words as a keyphrase. There is still a small improvement on the Hulth dataset, because its documents are much shorter than those in the other two datasets. Short documents produce fewer phrases, and the weighting scheme using word embeddings clearly influences the ranking: words with stronger semantic relatedness gain higher scores.
Secondly, the usefulness of word embeddings is blurred on the DUC and SemEval datasets, where each document contains a large number of words and thus produces more candidate phrases. By accumulating word scores, the top-scored phrases are always the longer phrases containing one or more top-scored words, due to the summation over all constituent words (graph ranking algorithms never produce negative scores). For example, if the word hurricane gains a distinctly high ranking score, then any phrase containing the word, such as hurricane center or national hurricane center, will also stay at the top of the ranking list.
4.5.1.3 Mitigating the Frequency-Sensitivity Problem
In this evaluation, we show how the weighting scheme mitigates the frequency-sensitivity problem. Figure 4.2 shows the keyphrases and their frequencies extracted by PageRank, Weighted PageRank using co-occurrence frequencies as weights, and our weighting scheme using direct phrase ranking, on the DUC and SemEval datasets.4

Applying the proposed weighting scheme to the four ranking algorithms does not make much difference in identifying highly frequent keyphrases compared with the unweighted and co-occurrence-frequency-weighted approaches. In fact, the actual improvement comes from extracting keyphrases with low frequencies. As shown in Figure 4.2, the proposed approach extracts keyphrases with much lower frequencies than the other approaches. The proposed weighting scheme identifies the degrees of semantic relatedness of candidate phrases; therefore, if a low-frequency phrase co-occurs with phrases having high frequencies and stronger semantic relatedness, it gains higher weights and hence higher ranking scores.
[Figure 4.2 comprises two column charts, one for the DUC dataset (a) and one for the SemEval dataset (b). The x-axis bins keyphrases by their frequency in the document (DUC: 2, 3, 4, 5, 6–10, above 10; SemEval: 1–10, 11–20, 21–30, 31–50, 51–100, above 100); the y-axis counts the number of keyphrases correctly extracted. Columns compare the Unweighted, Co-occurrence frequency, Holistic Phrase Embedding, and Compositional Phrase Embedding weightings.]

Figure 4.2: Ground-truth and extracted keyphrase distributions per document. (a): distributions on the DUC dataset. (b): distributions on the SemEval dataset.
4.5.1.4 Tunable Hyper-parameters
Our evaluations mainly focus on analysing the actual effectiveness of the weighting scheme for graph ranking algorithms for AKE. However, it is worth noting that a few potential hyper-parameters and strategies can be applied to our approach for further improvement, as we have demonstrated in our previous work [21, 22]. Firstly, restricting the maximum number of words in a phrase efficiently reduces the errors made by POS taggers, especially on the SemEval dataset; a commonly used limit is 5, based on the observation that most human-assigned keyphrases have fewer than 5 words. Secondly, setting a threshold to filter low-frequency words can also improve the results. Thirdly, the proposed weighting scheme can use Pointwise Mutual Information (PMI) or the Dice coefficient measure instead of raw frequency counts. Finally, different word embedding algorithms and training techniques may also affect the results.

4 The Hulth dataset consists of short articles, where most candidates occur only once; therefore, we do not use this dataset for this evaluation.
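As an illustration of the third point, PMI could replace the raw co-occurrence count in the weighting scheme (a sketch under the assumption that probabilities are estimated from simple counts; the names are ours):

```python
import math

def pmi(coocc, freq_p, freq_q, total):
    """PMI(p, q) = log2( P(p, q) / (P(p) P(q)) ), estimated from counts."""
    p_pq = coocc / total
    return math.log2(p_pq / ((freq_p / total) * (freq_q / total)))
```

Unlike the raw count, PMI rewards pairs that co-occur more often than their individual frequencies would predict, which dampens the influence of very frequent phrases.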
4.6 Conclusion
In this chapter, we have presented a weighting scheme using word or phrase embeddings as an external knowledge source. The evaluation shows that the proposed weighting scheme generally improves the performance of graph ranking algorithms for AKE. We also demonstrated that using word embeddings is an effective way to mitigate the frequency-sensitivity problem in unsupervised AKE approaches.

The phrase embeddings used in this study are induced with two approaches: the holistic and the algebraic composition approaches. Despite the improvement that the weighting scheme achieves, each approach for inducing phrase embeddings has drawbacks in practice. The holistic approach suffers from data coverage and sparsity problems – it is unable to generate embeddings for phrases not appearing in the training set, and most phrases occur far less often than unigram words. The algebraic composition approach, on the other hand, is overly simplified and does not take the syntactic information of words into account. In the next chapter, we focus specifically on these issues and introduce a deep learning approach for modelling language compositionality.
Chapter 5
Featureless Phrase Representation Learning with Minimal Labelled Data
In Chapter 4, we demonstrated the effectiveness of using word embedding vectors to supply semantic relatedness for graph-based keyphrase extraction. However, the study also uncovered two problems. Firstly, applying word embeddings as external knowledge for AKE has limitations: current techniques for learning distributed representations of linguistic expressions are limited to unigrams. Learning distributed representations for multi-gram expressions, such as multi-word phrases or sentences, remains a great challenge, because it requires learning not only the meaning of each constituent word of an expression, but also the rules for combining the words [25]. Secondly, although assigning weights to graphs improves the performance of graph-based AKE algorithms, the development of weighting schemes is a rather heuristic process in which the choice of features is critical to the overall performance. This turns the development of graph-based AKE algorithms into a laborious feature engineering exercise. The goal of this thesis is to discover AKE approaches that do not rely on the choice of features and datasets; hence, in the rest of the thesis, we focus on deep learning approaches for AKE that automatically learn useful features.

This chapter is an initial investigation into learning distributed representations of multi-word phrases using deep neural networks. In this chapter, instead of focusing on AKE, we target a cousin task of AKE – automatic domain-specific term extraction. Keyphrases describe the main ideas, topics, or arguments of a document, and are therefore document-dependent; extracting keyphrases requires an understanding of both phrases and documents. Domain-specific terms, on the other hand, are properties of a particular domain, independent of any document.
Identifying domain-specific terms requires only an understanding of the meanings of the terms themselves; therefore, automatic term extraction is a more efficient task for evaluating the meanings of phrases obtained from deep learning models. Section 2.1.9.1 provides a detailed literature review of common approaches to automatic term extraction.

In this chapter, we introduce a weakly supervised bootstrapping approach using two deep learning classifiers. Each classifier learns term representations separately by taking word embedding vectors as inputs, so no manually selected features are required. The two classifiers are first trained on a small set of labelled data, then independently make predictions on a subset of the unlabelled data. The most confident predictions are subsequently added to the training set to retrain the classifiers. This co-training process minimises the reliance on labelled data. Evaluations on two datasets demonstrate that the proposed co-training approach achieves performance competitive with standard supervised baseline algorithms, using very little labelled data.
5.1 Introduction
Domain-specific terms are essential for many knowledge management applications, such as clinical text processing, risk management, and equipment maintenance. Domain-specific term extraction aims to automatically identify domain-relevant technical terms, which can be either unigram words or multi-word phrases. Supervised domain-specific term extraction often relies on training a binary classifier to decide whether a candidate term is relevant to the domain [134–136]. In such approaches, term extraction models are built upon manually selected features, including local statistical and linguistic information (e.g. frequency, co-occurrence frequency, or linguistic patterns) and external information from third-party knowledge bases (e.g. WordNet, DBpedia). Designing and evaluating different feature combinations turns the development of term extraction models into a time-consuming and labour-intensive exercise. In addition, these approaches require a large amount of labelled training data to generalise the learning; however, labelled data is often hard or impractical to obtain.
In this chapter, our first objective is to minimise the use of labelled data by training two classifiers in a co-training fashion. Co-training is a weakly supervised learning mechanism introduced by Blum and Mitchell (1998), which tackles the problem of building a classification model from a dataset with limited labelled data among a majority of unlabelled examples. It requires two classifiers, each built upon a separate view of the data, where each view is a separate set of manually selected features that must be sufficient to learn a classifier. For example, Blum and Mitchell classify web pages using two views: 1) the words appearing in the content of a web page, and 2) the words in hyperlinks pointing to the web page. Co-training starts by training each classifier on a small labelled dataset; each classifier then makes predictions individually on a subset of the unlabelled data. The most confident predictions are subsequently added to the training set to retrain each classifier, and this process is iterated a fixed number of times. Co-training has been applied to many NLP tasks where labelled data is scarce, including statistical parsing [208], word sense disambiguation [209], and coreference resolution [210], and these studies demonstrate that it generally improves performance without requiring additional labelled data.
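The co-training loop described above can be sketched as follows (a minimal skeleton; the stub classifier, pool sizes, and method names are illustrative stand-ins for the two view-specific classifiers):

```python
import random

class MajorityStub:
    """Toy classifier used only to exercise the loop."""
    def fit(self, labelled):
        labels = [y for _, y in labelled]
        self.majority = max(set(labels), key=labels.count)

    def predict_confident(self, pool, g):
        # Pretend the first g pool examples are the most confident ones.
        return [(x, self.majority) for x in pool[:g]]

def co_train(clf_a, clf_b, labelled, unlabelled, pool_size, g, rounds):
    """Blum & Mitchell-style co-training: train both classifiers, move
    each one's g most confident pool predictions into the labelled set,
    refill the pool with 2g fresh unlabelled examples, and repeat."""
    labelled = list(labelled)
    pool = random.sample(unlabelled, min(pool_size, len(unlabelled)))
    rest = [x for x in unlabelled if x not in pool]
    for _ in range(rounds):
        clf_a.fit(labelled)
        clf_b.fit(labelled)
        newly = dict(clf_a.predict_confident(pool, g)
                     + clf_b.predict_confident(pool, g))
        labelled += list(newly.items())
        pool = [x for x in pool if x not in newly]
        pool += rest[:2 * g]
        rest = rest[2 * g:]
    return labelled
```

Any pair of classifiers exposing `fit` and a confidence-ranked prediction method can be dropped into this loop in place of the stub.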
Our second objective is to eliminate the effort of feature engineering by using deep learning models. Applying deep neural networks directly to NLP tasks without feature engineering is also described as NLP from scratch [172]. In such models, words are represented as low-dimensional, real-valued vectors encoding both semantic and syntactic features [15]. In our model, word embeddings are pre-trained over the corpora to encode word features, and are used as inputs to two deep neural networks that learn different term representations (corresponding to the concept of views) over the same dataset. One network is a Convolutional Neural Network (CNN) that learns term representations through a single convolutional layer with multiple filters, followed by a max pooling layer. Each filter is associated with a region that essentially corresponds to a sub-gram of a term; the underlying intuition is that the meaning of a term can be learnt from its sub-grams by analysing different combinations of words. The other is a Long Short-Term Memory (LSTM) network that learns the representation of a term by recursively composing the embedding of an input word with the composed value from its predecessor, hypothesising that the meaning of a term can be learnt from the sequential combination of its constituent words. Each network connects to a logistic regression layer to perform classification.
Our model is evaluated on two benchmark domain-specific corpora, namely GENIA (biology domain) [211] and ACL RD-TEC (computer science domain) [212]. The evaluation shows that our model outperforms the C-value algorithm [56], which is often treated as the benchmark in term extraction. We also train the CNN and LSTM classifiers individually using standard supervised learning, and demonstrate that the co-training model is an effective approach to reducing the use of labelled data while maintaining performance competitive with the supervised models.
[Figure 5.1 depicts the co-training architecture: the labelled data L trains two networks in parallel – a CNN (word embeddings look-up table, input layer, convolution and max pooling, logistic regression layer) and an LSTM (word embeddings look-up table, input layer, LSTM layer, logistic regression layer). Both networks label examples drawn from a pool U′ sampled from the unlabelled data U; each adds its g most confident predictions to the training set, and the pool is refilled with 2g examples.]

Figure 5.1: Co-training Network Architecture Overview: solid lines indicate the training process; dashed lines indicate the prediction and labelling processes.
5.2 Related Work
A brief literature review on term extraction is presented in Section 2.1.9.1. In this
section, we discuss only the work most closely related to ours in methodology. That
work is Fault-Tolerant Learning (FTL) [213], inspired by Transfer Learning [214] and
the Co-training algorithm [215]. In FTL, two Support Vector Machine (SVM) classifiers
help train each other in a similar fashion to the Co-training algorithm. The main
difference is that FTL does not require any labelled data; instead, it uses two
unsupervised ranking algorithms, TF-IDF [2] and TEDel [216], to generate two sets of
seed terms, which may contain a small amount of incorrectly labelled data, before
training starts. The two classifiers are first trained on different sets of seed data; then
each classifier is used to verify the seed data that was used to train the other. After
that, the two classifiers are re-trained on the verified seed data. The remaining steps
are the same as in the Co-training algorithm. However, FTL relies on manually selected
features. In contrast, our model uses deep neural networks that take pre-trained word
embeddings as inputs, without any manually selected features. Another difference is
that our model requires a small amount of labelled data as seeds to initialise the
training.
5.3 Proposed Model
5.3.1 Overview
We propose a weakly supervised bootstrapping model using Co-training for term
extraction, as shown in Figure 5.1. The labelled data L and unlabelled data U contain
candidate terms that are identified in the pre-processing stage¹. The actual inputs to
the model are the word embedding vectors of an input term, obtained via the look-up
table. All word embedding vectors are pre-trained over the corpus². The model consists
of two classifiers: the left classifier is a CNN network, and the right one is an LSTM
network. The networks independently learn the representations of terms. The output
layer of both networks is a logistic regression layer. The two neural networks are
trained using the Co-training algorithm described in Section 5.3.5.
The Co-training algorithm requires two separate views of the data, which traditionally
are two sets of manually selected features. In our model, however, there are no
manually selected features. Instead, the two views of the data are provided by our two
hypotheses about how the meanings of terms are learnt: 1) by analysing different
sub-gram compositions, and 2) by sequentially combining each constituent word. The
hypotheses are implemented by the CNN and LSTM networks respectively, and we
expect the rules of composing words to be captured by the networks. The CNN
network analyses different regions of an input matrix constructed by stacking word
embedding vectors, as shown in Figure 5.2, where the region sizes reflect different
N-grams of a term. By scanning the embedding matrix with different region sizes, we
expect the CNN network to learn the meaning of a term by capturing its most
representative sub-gram. The LSTM network, on the other hand, learns the
compositionality by recursively composing an input embedding vector with the
previously composed value, as shown in Figure 5.3. We expect the LSTM network to
capture the meaning of a term through its gates, which govern the information flow –
how much information (or meaning) of an input word is added into the overall
meaning, and how much information is discarded from the previous composition.
5.3.2 Term Representation Learning
The objective is to learn a mapping function f that outputs the compositional
representation of a term given its word embeddings. Concretely, let V be the vocabulary
of a corpus, with size v. For each word w ∈ V, there is a corresponding d-dimensional
embedding vector. The collection of all embedding vectors in the vocabulary forms a
matrix, denoted C, where C ∈ R^{d×v}. C can be thought of as a look-up table, where
C(wi) denotes the embedding vector of word wi. Given a term s = (w1, w2, ..., wn), the
goal is to learn a mapping function f(C(s)) that takes the individual vector
representation of each word as input and outputs a composed vector of size d, which
represents the compositional meaning of s.
¹ Details are presented in Section 5.4.2.
² Details are presented in Section 5.3.4.
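As a minimal sketch, the look-up step C(s) can be written as follows; the toy vocabulary, the dimensionality d = 3, and the vector values are purely illustrative, not the pre-trained embeddings used in the model:

```python
# Toy look-up table C: one d-dimensional embedding per vocabulary word.
# In the model these vectors are pre-trained with SkipGram; the values
# below are made up purely for illustration.
d = 3
C = {
    "human":            [0.1, 0.2, 0.3],
    "immunodeficiency": [0.4, 0.1, 0.0],
    "virus":            [0.2, 0.5, 0.1],
    "enhancer":         [0.3, 0.3, 0.2],
}

def lookup(term):
    """Map a term s = (w1, ..., wn) to its sequence of embedding vectors C(s)."""
    return [C[w] for w in term.split()]

vectors = lookup("human immunodeficiency virus enhancer")
assert len(vectors) == 4 and all(len(v) == d for v in vectors)
```

The mapping function f then composes this sequence of vectors into a single vector of size d.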
Figure 5.2: Convolutional Model. The input matrix M (l × d) stacks the word
embeddings of a term (e.g. human immunodeficiency virus enhancer), padded with zero
vectors; r region sizes (2 × d, 3 × d, 4 × d), each with n filters, produce n × r feature
maps, followed by a max pooling layer and a fully connected layer.
5.3.2.1 Convolutional Model
We adopt the CNN network used by Kim [26] and Zhang et al. [27], which has only one
convolutional layer, as shown in Figure 5.2. The inputs C(s) to the network are
vertically stacked into a matrix M, where the rows are the word embeddings of each
w ∈ s. Let d be the length of the word embedding vectors and l be the fixed length of a
term; then M has dimension l × d. We pad the matrix with zero vectors if the number
of tokens of an input term is less than l. The convolutional layer has r predefined
region sizes, and each region size has n filters. All regions have the same width d,
because each row in the input matrix M represents a word and the goal is to learn the
composition of words. However, the regions have varying heights h, which can be
thought of as different sub-grams of the term; for example, when h = 2, a region
represents bigram features. Let W ∈ R^{h×d} be the weights of a filter. The filter outputs
a feature map c = [c1, c2, ..., cl−h+1], where ci is computed as:

ci = f(W · M[i : i + h − 1] + b) (5.1)
where M[i : i + h − 1] is the sub-matrix of M from row i to row i + h − 1, f is an
activation function (we use the hyperbolic tangent in this work), and b is a bias unit. A
pooling function is then applied to extract values from the feature map. We use 1-Max
pooling as suggested by Zhang et al. [27], who conduct a sensitivity analysis of
one-layer convolutional networks and show that 1-Max pooling consistently outperforms
other pooling strategies in sentence classification tasks. The total number of feature
maps in the network is m = r × n, so the output from the max pooling layer, y ∈ R^m,
is computed as:

y = [max(c1), max(c2), ..., max(cm)] (5.2)

Figure 5.3: LSTM Model. Recurrent units compose the word embeddings of a term
(e.g. human immunodeficiency virus enhancer); each unit computes the gates it, ft, ot
and the candidate gt to produce the cell state ct and hidden state ht (Equations
5.3–5.7), and the final hidden state connects to the logistic regression layer.
5.3.2.2 LSTM Model
We use an LSTM network similar to the vanilla LSTM [217] without peephole
connections, shown in Figure 5.3. The LSTM network features a memory cell at each
timestamp. A memory cell stores, reads and writes the information passing through the
network at timestamp t, and consists of four elements: an input gate i, a forget gate
f, a candidate g for the current cell state value, and an output gate o. At t, the inputs
to the network are the previous cell state value ct−1, the previous hidden state value
ht−1, and the input value xt. The outputs are the current cell state ct and the current
hidden state value ht, which are subsequently passed to the next timestamp t + 1.
At time t, the candidate g for the current cell state composes the new input xt with the
previously composed value ht−1 to generate a new state value:

gt = tanh(Wg · xt + Ug · ht−1 + bg) (5.3)

where Wg and Ug are shared weights, and bg is the bias unit.
The input gate i in the LSTM network decides how much information passes from gt
into the actual computation of the memory cell state, using a sigmoid function
σ(x) = 1/(1 + e^{−x}) that outputs a value between 0 and 1 indicating the proportion:

it = σ(Wi · xt + Ui · ht−1 + bi) (5.4)
where Wi and Ui are shared weights, and bi is the bias unit. Likewise, the forget gate f
governs how much information is filtered out from the previous state ct−1:

ft = σ(Wf · xt + Uf · ht−1 + bf) (5.5)
The new cell state value takes part of the information from the current inputs and part
from the previous cell state value:

ct = it ⊗ gt + ft ⊗ ct−1 (5.6)

where ⊗ is element-wise vector multiplication. ct is passed to the next timestamp
t + 1, and via this path the cell state can remain unchanged from one timestamp to
another, realising the long short-term memory.
The output gate o can be thought of as a filter that prevents irrelevant information
from being passed to the next state. The output gate ot and the hidden state value ht
are computed as:

ot = σ(Wo · xt + Uo · ht−1 + bo)
ht = ot ⊗ tanh(ct) (5.7)

where ht is the composed representation of the word sequence from time 0 to t.
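Equations 5.3–5.7 can be sketched for a single embedding dimension as follows; the shared weights here are arbitrary placeholders, not learnt values:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One recurrent unit for a single dimension, following Equations 5.3-5.7."""
    g = math.tanh(p["Wg"] * x_t + p["Ug"] * h_prev + p["bg"])  # candidate (5.3)
    i = sigmoid(p["Wi"] * x_t + p["Ui"] * h_prev + p["bi"])    # input gate (5.4)
    f = sigmoid(p["Wf"] * x_t + p["Uf"] * h_prev + p["bf"])    # forget gate (5.5)
    c = i * g + f * c_prev                                     # cell state (5.6)
    o = sigmoid(p["Wo"] * x_t + p["Uo"] * h_prev + p["bo"])    # output gate (5.7)
    h = o * math.tanh(c)                                       # hidden state (5.7)
    return h, c

# Placeholder shared weights; in the model these are learnt with SGD.
p = {k: 0.5 for k in
     ("Wg", "Ug", "bg", "Wi", "Ui", "bi", "Wf", "Uf", "bf", "Wo", "Uo", "bo")}
h, c = 0.0, 0.0
for x in (0.1, 0.4, 0.2, 0.3):   # one embedding dimension per word of a term
    h, c = lstm_step(x, h, c, p)
assert -1.0 < h < 1.0            # final h is the composed term representation
```

In the actual network every quantity is a d-dimensional vector and the multiplications become matrix-vector products, but the information flow through the gates is the same.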
5.3.3 Training Classifier
To build the classifiers, each network is connected to a logistic regression layer for the
binary classification task: whether a term is relevant to the domain. The logistic
regression layer can, however, simply be replaced by a softmax layer for multi-class
classification tasks such as ontology categorisation.
Overall, the probability that a term s is relevant to the domain is:
p(s) = σ(W · f(C(s)) + b) (5.8)
where σ is the sigmoid function, W denotes the weights of the logistic regression layer,
b is the bias unit, and f is the mapping function implemented by the CNN or LSTM
network.
The parameters of the convolutional classifier are θ = (C, W^conv, b^conv, W^conv_logist,
b^conv_logist), where W^conv are the weights of all m filters and b^conv are the bias
vectors. For the LSTM classifier, θ = (C, W^lstm, b^lstm, W^lstm_logist, b^lstm_logist),
where W^lstm = (Wi, Wg, Wf, Wo, Ui, Ug, Uf, Uo) and b^lstm = (bi, bg, bf, bo). Given a
training set D, the learning objective for both classifiers is to find the parameters θ
that maximise the log probability of the correct label for each s ∈ D:

argmaxθ ∑s∈D log p(slabel | s; θ) (5.9)
θ is updated using Stochastic Gradient Descent (SGD) to minimise the negative log
likelihood error:

θ := θ − ε ∂(−log p(slabel | s; θ))/∂θ (5.10)

where ε is the learning rate.
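The update in Equations 5.8–5.10 can be sketched for a single training example as follows; the feature vector stands in for the composed representation f(C(s)), and all values are illustrative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(W, b, feats):
    """p(s) = sigmoid(W . f(C(s)) + b), Equation 5.8."""
    return sigmoid(sum(w * x for w, x in zip(W, feats)) + b)

def sgd_step(W, b, feats, label, lr=0.01):
    """One SGD update minimising the negative log likelihood (Equation 5.10);
    the gradient of -log p(label) w.r.t. W is (p - label) * feats."""
    err = predict(W, b, feats) - label
    return [w - lr * err * x for w, x in zip(W, feats)], b - lr * err

# A stand-in composed representation f(C(s)) with a positive label.
feats, label = [0.2, -0.1, 0.4], 1
W, b = [0.0, 0.0, 0.0], 0.0
before = predict(W, b, feats)
for _ in range(200):
    W, b = sgd_step(W, b, feats, label)
after = predict(W, b, feats)
assert after > before    # the likelihood of the correct label increases
```

In the full model the gradient also propagates through f into the network weights and the embeddings C; only the logistic layer is shown here.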
5.3.4 Pre-training Word Embedding
All word embeddings are pre-trained over the datasets. We use the SkipGram model [15]
to learn the word embeddings³. Given a word w, the SkipGram model predicts the
context (surrounding) words S(w) within a predefined window size. Using the softmax
function, the probability of a context word s ∈ S(w) is:

p(s|w) = exp(v′w⊤ · vs) / ∑t∈V exp(v′t⊤ · vs) (5.11)

where V is the vocabulary, and v′w and vs are the output vector representation of w
and the input vector representation of the context s, respectively. The learning
objective is to find the parameters θ that maximise the conditional probability over the
vocabulary V in a training corpus D:

argmaxθ ∑w∈D ∑s∈S(w) log p(s|w; θ) (5.12)
³ Please see Equation 4.3 for the detailed computations.
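Equation 5.11 can be sketched over a toy vocabulary as follows; the two-dimensional vectors are illustrative, not trained SkipGram embeddings:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def skipgram_prob(w, s, out_v, in_v):
    """p(s|w) from Equation 5.11: a softmax over the vocabulary of the
    dot products between output and input vectors."""
    den = sum(math.exp(dot(out_v[t], in_v[s])) for t in out_v)
    return math.exp(dot(out_v[w], in_v[s])) / den

# Toy vectors; real embeddings are 300-dimensional and trained over the corpus.
out_v = {"gene": [0.2, 0.1], "virus": [0.4, 0.3], "the": [0.0, 0.1]}
in_v = {k: [x * 0.5 for x in v] for k, v in out_v.items()}
p = skipgram_prob("virus", "gene", out_v, in_v)
total = sum(skipgram_prob(w, "gene", out_v, in_v) for w in out_v)
assert 0.0 < p < 1.0 and abs(total - 1.0) < 1e-9   # the softmax normalises to 1
```

In practice the full softmax over the vocabulary is too expensive, and approximations such as negative sampling are used during training.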
5.3.5 Co-training Algorithm
Algorithm 1: Co-training
Input: L, U, C, p, k, g
create U′ by randomly choosing p examples from U
while iteration < k do
    for c ∈ C do
        use L to train c
    end for
    for c ∈ C do
        use c to posit labels on U′
        add the g most confident examples to L
    end for
    refill U′ by randomly choosing 2 × g examples from U
end while
Given the unlabelled data U, a pool U′ of size p, and a small set of labelled data L,
each classifier c ∈ C is first trained on L. After training, the classifiers make predictions
on U′; the g most confident predictions from each classifier are then added to L. The
size of U′ thus becomes p − 2g, and L grows by 2g examples. U′ is then refilled by
randomly selecting 2g examples from U. This process iterates k times. Algorithm 1
documents the details.
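Algorithm 1 can be sketched as follows; the ThresholdStub classifier is a stand-in for the CNN and LSTM classifiers, and all parameter values are arbitrary:

```python
import random

class ThresholdStub:
    """Toy stand-in for a CNN/LSTM classifier: positive if the value is above 0."""
    def fit(self, labelled):
        pass
    def confidence(self, e):
        return (e > 0, abs(e))   # (predicted label, confidence score)

def co_train(L, U, classifiers, pool_size, g, k):
    """Sketch of Algorithm 1: train each view on L, label the pool, keep the g
    most confident predictions per classifier, refill the pool with 2g examples."""
    U = list(U)
    pool = [U.pop() for _ in range(pool_size)]
    for _ in range(k):
        for clf in classifiers:
            clf.fit(L)
        for clf in classifiers:
            best = sorted(pool, key=lambda e: clf.confidence(e)[1], reverse=True)[:g]
            for e in best:
                pool.remove(e)
                L.append((e, clf.confidence(e)[0]))
        pool.extend(U.pop() for _ in range(min(2 * g, len(U))))
    return L

random.seed(0)
L = [(1.0, True), (-1.0, False)]                 # small seed set
U = [random.uniform(-1, 1) for _ in range(50)]   # unlabelled candidates
L = co_train(L, U, [ThresholdStub(), ThresholdStub()], pool_size=10, g=2, k=3)
assert len(L) == 2 + 3 * 2 * 2                   # L grows by 2g per iteration
```

The interface assumed here (`fit` and `confidence`) is hypothetical; in the actual model the two classifiers are the CNN and LSTM networks trained by SGD.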
5.4 Experiments
5.4.1 Datasets
We evaluate our model on two datasets. The first is the GENIA corpus⁴, a collection of
1,999 abstracts of articles in the field of molecular biology; the corpus has a
ground-truth term list, and we use the current version 3.02 for our evaluation. The
second is the ACL RD-TEC corpus⁵, which consists of 10,922 articles published
between 1965 and 2006 in the domain of computer science. The ACL RD-TEC corpus
groups terms into three categories: invalid terms, general terms, and computational
terms. We treat only the computational terms as ground truth in our evaluation.
Table 5.1 shows the statistics.
Table 5.1: Dataset Statistics

             Articles   Domain              Tokens       Ground Truth
GENIA        1,999      Molecular Biology   451,562      30,893
ACL RD-TEC   10,922     Computer Science    36,729,513   13,832
⁴ Publicly available at http://www.geniaproject.org/
⁵ Publicly available at https://github.com/languagerecipes/the-acl-rd-tec
5.4.2 Pre-processing
The pre-processing has two goals: cleaning the data and identifying candidate terms.
The data cleaning procedure includes extracting the text content and ground-truth
terms, removing plural forms of nouns, and converting all tokens to lower case.

The ACL RD-TEC corpus provides a pre-identified candidate list, so we only need to
identify candidate terms for the GENIA corpus. We build two candidate identifiers.
The first uses noun phrase chunking with predefined POS patterns; we call it the POS
identifier. We use a common POS pattern, <JJ>*<NN.*>+, that is, zero or more
adjectives followed by one or more nouns. The documents were parsed using the
Stanford Log-linear POS tagger [196]. The POS identifier, however, is only able to
identify 67% of the ground-truth terms. One reason is that not all ground-truth terms
match our simple POS pattern⁶. Another is a shortcoming of POS pattern-based
phrase chunking: it is unable to identify a sub-term from its super-term when the two
always occur together. For example, the POS identifier spots the term novel antitumor
antibiotic, which occurs only once in the GENIA dataset, whereas in the ground-truth
set, the correct term is antitumor antibiotic.
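The POS identifier's matching can be sketched as follows; this is a simplified stand-in for the actual chunker over the tagger's output, matching <JJ>*<NN.*>+ on a tag sequence (the sentence and tags are illustrative):

```python
import re

def pos_candidates(tagged):
    """Find <JJ>*<NN.*>+ spans (zero or more adjectives, then one or more
    nouns) in a POS-tagged sentence given as (token, tag) pairs."""
    tags = " ".join(tag for _, tag in tagged)
    spans = []
    # Match on the tag string, then map the offset back to token indices.
    for m in re.finditer(r"(?:JJ )*(?:NN\S* ?)+", tags + " "):
        start = tags[:m.start()].count(" ")
        length = len(m.group().split())
        spans.append(" ".join(tok for tok, _ in tagged[start:start + length]))
    return spans

tagged = [("a", "DT"), ("novel", "JJ"), ("antitumor", "NN"),
          ("antibiotic", "NN"), ("was", "VBD"), ("found", "VBN")]
assert pos_candidates(tagged) == ["novel antitumor antibiotic"]
```

As the example shows, the chunker returns only the maximal span novel antitumor antibiotic and can never produce the embedded sub-term antitumor antibiotic on its own.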
The second identifier uses N-gram based chunking (the N-gram identifier), which
decomposes a sequence of words into all possible N-grams. However, there would be
too many candidates if we simply decomposed every sentence into all possible N-grams.
Thus, we use stop-words as delimiters and decompose any expression between two
stop-words into all possible N-grams as candidates. For example, novel antitumor
antibiotic produces 6 candidates: novel antitumor antibiotic, novel antitumor,
antitumor antibiotic, novel, antitumor, and antibiotic. The N-gram identifier achieves
much better coverage of the ground-truth set, although there is a small number of
ground-truth terms containing stop-words that it cannot identify. It also produces
many more candidates than the POS identifier. Table 5.2 shows the statistics.
Table 5.2: Candidate Term Statistics

               Candidates   GT Cov.   Positive          Negative
GENIA POS      40,998       67.0%     20,704 (50.5%)    20,294 (49.5%)
GENIA N-gram   229,810      96.4%     29,781 (13.0%)    200,029 (87.0%)
ACL RD-TEC     83,845       100%      13,832 (16.5%)    70,013 (83.5%)

GT Cov.: Ground Truth Coverage
⁶ One may consider adding more patterns, but that is out of the scope of this work.
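The N-gram identifier can be sketched as follows; the stop-word list here is a toy stand-in for the actual list used:

```python
def ngram_candidates(sentence, stopwords):
    """Split a sentence on stop-words, then decompose each remaining
    expression into all possible N-grams (the N-gram identifier)."""
    chunks, current = [], []
    for tok in sentence.lower().split():
        if tok in stopwords:
            if current:
                chunks.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        chunks.append(current)
    candidates = []
    for chunk in chunks:
        n = len(chunk)
        for i in range(n):
            for j in range(i + 1, n + 1):
                candidates.append(" ".join(chunk[i:j]))
    return candidates

cands = ngram_candidates("a novel antitumor antibiotic", {"a", "the", "of"})
assert len(cands) == 6 and "antitumor antibiotic" in cands
```

Unlike the POS identifier, this enumeration produces the sub-term antitumor antibiotic alongside the maximal span, at the cost of many more candidates.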
Figure 5.4: Relationships among TP, TN, FP, and FN for term extraction, over the
candidate set, the ground-truth set, the evaluation set, and the unidentified ground
truth within all possible grams.
5.4.3 Experiment Settings
The Co-training algorithm requires a few parameters. We set the size of the small
labelled set L to 200 and the size of the pool U′ to 500 for all evaluations. The number
of iterations k is 800 for the POS-identified candidates, and 500 both for the
N-gram-identified candidates in the GENIA dataset and for the ACL RD-TEC dataset.
The growth size g is 20 for GENIA POS, 50 for GENIA N-gram, and 20 for ACL
RD-TEC. The evaluation data is randomly selected from the candidate sets; for each e
in the evaluation set E, e ∉ L. Table 5.3 shows the class distributions and statistics of
each evaluation set.
Table 5.3: Evaluation Dataset Statistics

               Test Examples   Positive         Negative
GENIA POS      5,000           2,558 (51.0%)    2,442 (49.0%)
GENIA N-gram   15,000          1,926 (12.8%)    13,074 (87.2%)
ACL RD-TEC     15,000          2,416 (16.1%)    12,584 (83.9%)
All word embeddings are pre-trained with 300 dimensions on each corpus⁷. The
convolutional model has 5 region sizes, {2, 3, 4, 5, 6}, for GENIA and 3 region sizes,
{2, 3, 4}, for ACL RD-TEC. Each region size has 100 filters. There are no additional
hyper-parameters for training the LSTM model. The learning rate for SGD is 0.01.
5.4.4 Evaluation Methodology
We employ Precision, Recall, and F-measure for the evaluation, as detailed in
Section 2.1.7. We illustrate the set relationships of True Positive (TP), False Positive
(FP), True Negative (TN), and False Negative (FN) in Figure 5.4.
⁷ All settings for training word embeddings used in this chapter are the same as in Section 4.4.1.
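As a sketch, the three measures can be computed from the TP, FP, and FN counts depicted in Figure 5.4; the counts below are invented for illustration:

```python
def precision_recall_f(tp, fp, fn):
    """Precision, Recall, and F-measure from the counts in Figure 5.4."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Invented counts: FN also includes ground-truth terms the identifiers never
# produced as candidates (the unidentified ground truth in Figure 5.4).
p, r, f = precision_recall_f(tp=80, fp=40, fn=20)
assert abs(p - 80 / 120) < 1e-9 and abs(r - 0.8) < 1e-9
```

Note that True Negatives do not enter any of the three measures, which is why F-score is more informative than accuracy under the skewed class distributions discussed below.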
5.4.5 Training
The model is trained in an online fashion on an NVIDIA GeForce 980 Ti GPU. The
training time increases linearly with each iteration, since the model incrementally adds
training examples to the training set. At the beginning, one iteration takes less than a
second; after 100 iterations, the training set has grown by 1,820 examples and an
iteration takes a few seconds. Thus, training time is not critical – even standard
supervised training takes only a few hours to converge.
5.4.6 Result and Discussion
We use C-value [56] as our baseline algorithm. C-value is an unsupervised ranking
algorithm for term extraction, where each candidate term is assigned a score indicating
its degree of domain relevance. We report the performance of C-value when extracting
different numbers of top-ranked candidate terms as the domain-specific terms, as
shown in Table 5.4. Since we treat the task as binary classification, we also list random
guessing scores for each dataset, where recall and accuracy are always 50% and
precision corresponds to the proportion of the positive class in each evaluation set. As
a comparison to the co-training model, we also train each classifier individually using
the standard supervised learning approach, dividing the candidate set into 40% for
training, 20% for validation, and 40% for evaluation. For the proposed co-training
model, the convolutional classifier outperforms the LSTM classifier in all evaluations,
so we present only the performance of the convolutional classifier. Results are shown in
Table 5.4.
The supervised CNN classifier unsurprisingly produces the best results on all
evaluation sets. However, it uses far more labelled data than the co-training model,
while delivering an F-score less than 2 percentage points higher on the GENIA corpus
and 6 points higher on the ACL RD-TEC corpus. Compared with the standard
supervised approach, the proposed co-training model is thus more “cost-effective”, since
it requires only 200 labelled examples as seed terms.
On the GENIA corpus, all algorithms produce a much better F-score on the POS
evaluation set. This is due to the different class distributions: the proportion of
positive (ground-truth) terms is 50.5% on the POS set, but only 12.8% on the N-gram
evaluation set. We therefore consider that the results from the POS and N-gram
evaluation sets are not directly comparable. However, the actual improvements in
F-score over random guessing are quite similar on both sets, suggesting that evaluating
a classifier should consider not only the F-score itself but also the actual improvement
over random guessing.
Table 5.4: Evaluation Results

                     Labelled Data   Precision   Recall   F-score   Accuracy
GENIA POS
Random Guessing      –               51%         50%      50.5%     50%
C-value (Top 1000)   –               62.4%       24.4%    35.1%     –
C-value (Top 2500)   –               53.7%       52.5%    53.1%     –
Supervised CNN       16,400          64.7%       78.0%    70.7%     67.1%
Co-training CNN      200             64.1%       76.0%    69.5%     65.5%

GENIA N-gram
Random Guessing      –               12.8%       50%      20.4%     50%
C-value (Top 1000)   –               25.2%       6.5%     10.4%     –
C-value (Top 2500)   –               12.9%       16.7%    14.6%     –
C-value (Top 7500)   –               11.4%       44.3%    18.1%     –
Supervised CNN       91,924          35.0%       59.1%    44.0%     81.4%
Co-training CNN      200             34.3%       56.6%    42.7%     75.5%

ACL RD-TEC
Random Guessing      –               16.1%       50%      24.4%     50%
C-value (Top 1000)   –               10.8%       4.5%     6.3%      –
C-value (Top 2500)   –               14%         14.6%    14.3%     –
C-value (Top 7500)   –               21.8%       68.2%    33.3%     –
Supervised CNN       33,538          70.8%       67.7%    69.2%     85.2%
Co-training CNN      200             66%         60.5%    63.1%     79.7%
Figure 5.5: (a) F-score and Accuracy of the convolutional classifier on the N-gram
evaluation set over 500 iterations. (b) F-score and Accuracy of the convolutional
classifier on the POS evaluation set over 800 iterations.
It is also interesting to note that the GENIA N-gram evaluation set has 12.8% positive
examples, a class imbalance similar to the ACL RD-TEC set's 16.1% positives.
However, all algorithms perform much better on the ACL RD-TEC corpus. We found
that in the ACL RD-TEC corpus the negative terms contain a large number of invalid
characters (e.g. ~), tokeniser mistakes (e.g. evenunseeneventsare), and non
content-bearing words (e.g. many), which the classifiers can easily spot. Another
reason might be that the ACL RD-TEC corpus is bigger than GENIA, which not only
allows C-value to perform better, but also enables the word2vec algorithm to deliver
more precise word embedding vectors, which are the inputs to our deep learning model.
Although the accuracy measure is commonly used in classification tasks, it does not
reflect the true performance of a model when the classes are unevenly distributed in an
evaluation set. For example, the N-gram evaluation set has about 12.8% positive and
about 87.2% negative examples. At the beginning of training, both models tend to
classify most examples as negative, so the accuracy is close to 87%. As training
progresses, the accuracy starts to drop, yet it remains difficult to tell from the accuracy
score exactly how the model performs. On the POS evaluation set, by contrast, the
classes are evenly distributed, so we can clearly see how the accuracy measure
corresponds to the F-scores.
Figure 5.6: (a) F-score for both the convolutional and LSTM classifiers on the N-gram
evaluation set over 500 iterations. (b) F-score for both classifiers on the POS
evaluation set over 800 iterations.
The CNN classifier outperforms the LSTM classifier on all evaluation sets. It also
requires far fewer iterations to reach its best F-score. We plot the F-scores of both
classifiers over several hundred iterations on the GENIA corpus, shown in Figure 5.6.
Both classifiers reach their best performance within 100 iterations. For example, the
CNN classifier on the POS evaluation set produces a good F-score of around 62% after
only about 30 iterations, and reaches its best F-score of 69.5% after 91 iterations.
However, the training set is still quite small at that point – by 91 iterations it has
grown by only 1,820 examples. This phenomenon leads us to consider two further
questions: 1) exactly how much performance is gained through the co-training model?
2) How do different numbers of training examples affect the performance of a deep
learning model, and do deep learning models still need large amounts of labelled
training examples to produce their best performance? In the rest of
Figure 5.7: Convolutional and LSTM classifiers trained on 200 examples, POS
evaluation set, over 800 iterations.
the chapter, we will answer the first question, and leave the second question for our
future work.
To investigate how the Co-training algorithm boosts the performance of the classifiers,
we train our model using only the 200 seed terms over 800 iterations; the results are
shown in Figure 5.7. The best F-score, about 53%, comes from the convolutional model
and is only slightly higher than random guessing. By applying the Co-training
algorithm, in contrast, we obtain a best F-score of 69.5%, a 16.5-point improvement. In
fact, the improvement achieved by adding just a small number of training examples to
the training set is also reported by [218]. Consequently, it is clear that our co-training
model is an effective approach to boost the performance of deep learning models
without requiring much training data.
5.5 Conclusion
In this chapter, we have presented a deep learning model using the Co-training
algorithm – a weakly supervised bootstrapping paradigm – for automatic
domain-specific term extraction. Experiments show that our model is a “cost-effective”
way to boost the performance of deep learning models with very few training examples.
The study also raises further questions, such as how the number of training examples
affects the performance of a deep learning model, and whether deep learning models
still need as many labelled training examples as other machine learning algorithms to
reach their best performance. We will continue working on these questions in the near
future.
Chapter 6
A Matrix-Vector Recurrent Unit
Network for Capturing Semantic
Compositionality in Phrase
Embeddings
In Chapter 5, we have shown the capability of encoding compositional semantics using
the convolutional neural network (CNN) and the Long Short-Term Memory (LSTM)
network. However, both have limitations. The CNN network is designed to encode
regional compositions at different locations in data matrices. In image processing,
pixels close to each other are usually part of the same object, so convolving image
matrices captures the regional compositions of semantically related pixels. Such
location invariance, however, does not exist in word embedding vectors. The LSTM
network, on the other hand, uses shared weights to encode composition rules for every
word in the vocabulary of a corpus, which may over-generalise the compositionality of
each individual word.
In this chapter, we present a novel compositional model based on the recurrent neural
network architecture. We introduce a new computation mechanism for the recurrent
units, namely the Matrix-Vector Recurrent Unit, to integrate different views of
compositional semantics originating from linguistic, cognitive, and neuroscience
perspectives. The recurrent architecture of the network allows for processing phrases of
various lengths as well as encoding the ordering of consecutive words. Each recurrent
unit consists of a compositional function that computes the composed meaning of two
input words, and a control mechanism that governs the information flow at each
composition. When we train the network in an unsupervised fashion, it captures latent
compositional semantics and produces a phrase embedding vector regardless of the
phrase's presence in the labelled set. When we train it in a supervised fashion by
adding a regression layer, it is able to perform classification tasks using the phrase
embeddings learnt from the lower layers of the network. We show that the model
produces better performance than both simple algebraic compositions and other deep
learning models, including LSTM, CNN, and recursive neural networks.
6.1 Introduction
The recent advancement of deep neural architectures has enabled a paradigm shift in
representing semantic meanings, from lexical words to distributed vectors, i.e. word
embeddings. These low dimensional, dense and real-valued vectors open up immense
opportunities to apply known algorithms for vector and matrix manipulation to
semantic calculations that were previously impossible. For example, the word analogy
test [15] has shown the existence of linear relations among words, demonstrating that
words sharing semantic meaning have similar vector representations.

However, word embeddings only encode the meanings of individual words. More often
than not, we need vector representations beyond the level of unigram words for tasks
such as keyphrase extraction or terminology mining. The meaning of a complex
expression, i.e. a phrase or a sentence, is referred to as its compositional semantics,
which is reconstructed from the meaning of each constituent word and the explicit or
implicit rules of combining them [25].
Traditional approaches for learning compositional semantics can be categorised into
two streams, namely holistic and algebraic composition. The holistic approach learns
compositional semantics by treating phrases as atomic units (e.g. using hyphens to
convert pre-identified phrases into unigrams); the same algorithms used to induce word
embeddings, such as the SkipGram model [15], are then applied to learn the phrase
representations. This approach suffers from data coverage and sparsity problems – it
cannot generate embeddings for phrases that do not appear in the training set, and
most phrases co-occur much less often than unigram words. The foremost shortcoming,
however, is that it has no mechanism for learning the compositional rules. The
algebraic composition approach, on the other hand, applies simple algebraic functions
that take the embeddings of constituent words as inputs and produce the embedding of
a phrase. For example, Mitchell and Lapata [176] have investigated different types of
composition function, including the linear addition or multiplication of word
representations, the linear addition of words and their distributional neighbours, and
the combination of addition and multiplication. Applying such simple algebraic
functions has two shortcomings. Firstly, it does not take the order of words into
consideration, yet word order often plays an essential role in distinguishing meanings,
e.g. ladies first and first ladies. Secondly, it over-simplifies the compositional rules: the
functions are pre-identified, i.e. they are not learned directly from input text, and
therefore fail to capture the different syntactic relations and the compositionality of
words.
In this chapter, we present a novel compositional deep learning model that directly
learns the semantic compositionality of words from input text. The model is based on
a recurrent neural network architecture, in which we introduce a new computation
mechanism for the recurrent units, namely the Matrix-Vector Recurrent Unit (MVRU),
based on views of compositional semantics originating from linguistic, cognitive,
and neuroscience perspectives. The recurrent architecture of the network allows
for processing phrases of varying length as well as encoding the order of consecutive
words. Each recurrent unit consists of a compositional function that computes
the composed meaning of input words, and a control mechanism that governs the
information flow at each composition.
The MVRU model is inspired by Socher et al. [19]. The similarity is that both represent
a word using a vector-matrix pair: the vector represents the meaning of a word, and
the matrix is intended to capture the compositional rule of the word, which determines the
contributions made by other words when they compose with it. However, the
MVRU differs from [19] in several aspects. Firstly, Socher et al. use a recursive network
architecture, which efficiently captures structural information. The underlying
intuition is that the meaning of a phrase or sentence can be learnt from the syntactic
combination of words. Although syntactic structures provide additional information
for learning compositional semantics, generating such information requires parsing as
a prior step. Considering the many successful studies [17, 97] that learn the compositionality of
words from purely sequential combinations, we adopt a recurrent network architecture.
Secondly, in [19], the model recursively traverses the syntactic tree of a sentence to
produce the phrase representations. However, the composed vector is directly passed
into the next composition without any control mechanism, and hence the information
composed at the previous stage is no longer accessible for onward computation – no background
knowledge is stored. In contrast, we use cell state memories to store the information
at each composition as background knowledge, and a gate to control the information flow
by combining a part of the information from the current input with the information
passed from its precedents. Thirdly, [19] requires Part-of-Speech parsing prior to
training, whereas the proposed MVRU model does not require any pre-parsing or
pre-identified phrases.
We show that when the MVRU is trained in an unsupervised fashion, it captures general
latent compositional semantics. After training, it can produce a phrase embedding vector
regardless of whether the phrase appeared in the training set. The MVRU is evaluated
on phrase similarity and compositionality tasks, outperforming baseline models,
including distributional compositional models, two linear distributed algebraic composition
models, and two deep learning models – a Long Short-Term Memory (LSTM) and
a Convolutional Neural Network (CNN) model. When the MVRU is trained in a supervised
fashion by adding a regression layer, it is able to perform classification tasks using the
phrase embeddings learnt from the lower layers of the network. We also demonstrate
that the MVRU outperforms baselines on predicting phrase sentiment distributions and
identifying domain-specific terms.
6.2 Proposed Model
6.2.1 Compositional Semantics
From the linguistic and cognitive perspective, Frege [25] states that the meaning of a
complex expression is determined by the meanings of its constituent expressions and the
rules used to combine them, which is known as the principle of compositionality. Under
this principle, the two essential elements for modelling compositional semantics are the
meanings of words and the compositional rules.
Kuperberg [219] describes, from a neuroscience viewpoint, how language is processed and
words are composed in the human brain, and suggests that normal language
comprehension involves at least two parallel neural processing streams: a semantic memory
based mechanism, and a combinatorial mechanism. The semantic memory based stream
not only computes and stores the semantic features of, and relationships between, content
words in a complex expression, but also compares these relationships with those
pre-stored within the semantic memory. The combinatorial mechanism combines the
meanings of words in a complex expression to build propositional meaning based on two
constraints, morphosyntactic and semantic-thematic. The morphosyntactic constraint
detects syntax errors, while semantic-thematic processing evaluates whether the
semantic memory is incongruent with respect to real-world knowledge. The semantic
memory and combinatorial streams interact during online processing, where the semantic
memory stream can be modulated by the output of the combinatorial stream.
Based on the literature, we formalise five components that are essential for modelling
compositional semantics. Concretely, given a phrase m = [w1, w2, ..., wn] consisting of n words,
Figure 6.1: Elman Recurrent Network for Modelling Compositional Semantics (a chain of recurrent units processing the phrase human immunodeficiency virus enhancer; each unit computes ht = tanh(W · xt + U · ht−1 + b))
let pt denote the composed representation from w1 to wt. The components required to
encode the compositional semantics of m include:
• Lexical Semantics of Words The meaning or representation of each constituent
word w is the building block for constructing a compositional semantic model;
these can be word embeddings pre-trained over large datasets, such as Wikipedia.
• Consecutive Order of Words The order of words in m reflects syntactic relations.
Without employing any syntax parser, capturing the sequential order of
words is the simplest way to encode syntactic relations.
• Memory Kuperberg [219] suggests that a memory unit in the human brain stores
computed compositional semantics. The memory unit should be accessible for
onward composition, i.e. it stores all the information at each composition, acting
as background knowledge of what happened before time t.
• Composition Function This computes the actual compositional semantics. Specifically,
at time t, the function composes pt−1 and wt, and outputs pt.
• Control The control acts as the combinatorial mechanism described by Kuperberg [219],
or the composition rules in the principle of compositionality, which
governs and modulates the composition.
6.2.2 Matrix-Vector Recurrent Unit Model
We choose the recurrent neural network architecture because it has demonstrated the
capability of encoding patterns and learning long-term dependencies of words [16, 181,
182]. However, existing recurrent networks such as Elman [180] or LSTM [28] do not
satisfy all the requirements. Elman networks rely on shared weights that are slowly
updated over timescales to encode the compositional rules, as shown in Figure 6.1. At
timestamp t, the network takes two inputs: the current input xt, and the hidden state
value ht−1, which is the composed value from timestamp t − 1. Let W and U be the weight
matrices of the network and b be the bias unit; the hidden state value ht is computed as:
ht = tanh(W · xt + U · ht−1 + b) (6.1)
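As an illustration, the Elman recurrence in Equation 6.1 can be sketched in a few lines of NumPy. This is a minimal sketch, not the trained implementation; the weight shapes (hidden size equal to the embedding size d) are assumptions for illustration.

```python
import numpy as np

def elman_compose(xs, W, U, b):
    """Compose a word sequence with Eq. (6.1): h_t = tanh(W x_t + U h_{t-1} + b)."""
    h = np.zeros(U.shape[0])           # initial hidden state h_0
    for x in xs:                       # one recurrence step per word
        h = np.tanh(W @ x + U @ h + b)
    return h                           # composed representation of the phrase
```

Note that only the final hidden state survives: each step overwrites the previous composed meaning, which is exactly the short-term-memory limitation discussed next.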
Each simple recurrent unit in the network features a long-term and a short-term memory.
The long-term memory is carried by the weights W and U, which are slowly
updated over the timescale and can be thought of as encoding the compositional rules.
At time t, the network stores the composed meaning of the inputs in the hidden state value
ht. By passing ht to the next timestamp t + 1, ht is updated to ht+1, so the composed
meaning from the previous timestamp is no longer stored – the so-called short-term
memory. Clearly, the composed meaning ht stored in the short-term memory is not
accessible for onward computation. In addition, there is no control or combinatorial
mechanism in simple recurrent networks.
The LSTM, on the other hand, features a long short-term memory for each composition,
which is accessible for onward computation. The input, output, and forget gates in the
network also act as a control or combinatorial mechanism that governs and modulates
the composition. However, the network uses shared weights to encode composition
rules for every word in the vocabulary of a corpus, which may overly generalise the
compositionality of each individual word. In addition, the input gate and the forget gate
appear to duplicate each other in controlling the information flow.
Based on the architecture of the recurrent neural network, we propose the Matrix-Vector
Recurrent Unit computation mechanism, shown in Figure 6.2. Specifically, we use the
recurrent structure to capture the consecutive order of words. The control function rt
computes the compositional rules and governs how much information is passed through or
thrown away at each combination. The composition function ct computes the combined
meaning of two words by concatenating two word embedding vectors, which efficiently
captures the order of words at each single composition. Finally, the actual composed
representation is computed by taking a part of the meaning from the current composition
ct, and the composed meaning from its precedent composition pt−1.
Figure 6.2: Matrix-Vector Recurrent Unit Model (each unit takes the pair (xt, Mt) and the composition (pt−1, Ut−1), and computes rt, ct, pt, and Ut)
We merge the input and forget gates of the LSTM into a single gate rt. In LSTM
networks, the input and forget gates control how much information can pass through
the current state, and how much is discarded from the previous state. Technically, the
gates are used to prevent the vanishing gradient problem in recurrent networks [28].
In learning phrase embeddings, the vanishing gradient problem may not arise, since
the majority of phrases consist of only a few words; using two gates to control the
information flow is therefore redundant. In addition, the input and forget gates use two sets
of shared weights in the network, which do not specifically capture the compositional
rules for each individual word. The proposed MVRU model uses a single gate rt to
control how much information is passed through from the previous state (or previously
composed representation). The computation uses two unshared weight matrices, where
Ut−1 is induced from the previous state t − 1, and Mt is the matrix for the current input
word.
The value of ct is the composed value of two vectors: the vector representation of the
current input word, and the composed vector from the previous state. This is similar to
the candidate value of the state memory in the LSTM network. However, ct in our model
is not directly passed to the next timestamp. Instead, we only pass a certain amount of
information from ct to the next timestamp, controlled by the update gate rt. The
composed vector representation pt is therefore computed by taking a part of the information from
time 0 to t − 1, which acts as the background knowledge, and a part of the information from
the current composition. Finally, the composed matrix representation is computed using
a shared weight matrix Wm.
Concretely, let x ∈ Rd be a word embedding vector with d dimensions, and M ∈ Rd×d be
the corresponding matrix, so each word is represented by a vector-matrix pair (x, M).
Let p ∈ Rd and U ∈ Rd×d be the composed vector and matrix, and Wm ∈ Rd×2d,
Wv ∈ R2d×d be the shared weights across the network. At timestamp t, the network
takes the vector-matrix pair (xt, Mt) of the current input and the learnt composition
(pt−1, Ut−1) from its precedents as inputs, and outputs the composed representation (pt, Ut),
computed as:
rt = σ(Ut−1 · xt + Mt · pt−1 + br)
ct = tanh([xt; pt−1] · Wv + bc)
pt = rt ⊙ ct + (1 − rt) ⊙ pt−1
Ut = tanh(Wm · [Ut−1; Mt] + bu)    (6.2)

where σ is the sigmoid function, [a; b] denotes the concatenation of two vectors or matrices,
⊙ denotes element-wise multiplication, and br, bc, and bu are bias units.
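A single MVRU step from Equation 6.2 can be sketched in NumPy as follows. This is a minimal illustration of the forward computation only (no training loop); the bias shapes are assumptions chosen to match the stated dimensions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mvru_step(x_t, M_t, p_prev, U_prev, W_v, W_m, b_r, b_c, b_u):
    """One Matrix-Vector Recurrent Unit step, following Eq. (6.2)."""
    r_t = sigmoid(U_prev @ x_t + M_t @ p_prev + b_r)           # gate
    c_t = np.tanh(np.concatenate([x_t, p_prev]) @ W_v + b_c)   # candidate composition
    p_t = r_t * c_t + (1.0 - r_t) * p_prev                     # composed vector
    U_t = np.tanh(W_m @ np.vstack([U_prev, M_t]) + b_u)        # composed matrix
    return p_t, U_t
```

Iterating this step over the words of a phrase, starting from an initial (p, U), yields the final composed phrase representation.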
6.2.3 Low-Rank Matrix Approximation
To optimise computational efficiency, we employ the low-rank matrix approximation
[19] to reduce the number of parameters in the word matrices. The low-rank
approximation computes a word matrix as the product of two low-rank matrices plus a
diagonal approximation:
M = UV + diag(m) (6.3)
where U ∈ Rd×l, V ∈ Rl×d, m ∈ Rd, and l is the low-rank factor. We set l = 3 for all
experiments.
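Constructing a word matrix from the low-rank parameterisation of Equation 6.3 can be sketched as follows; the random initialisation values are purely illustrative.

```python
import numpy as np

def low_rank_word_matrix(d=300, l=3, seed=0):
    """Construct M = U V + diag(m) (Eq. 6.3): a d*d matrix from 2*d*l + d parameters."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.01, size=(d, l))
    V = rng.normal(scale=0.01, size=(l, d))
    m = rng.normal(scale=0.01, size=d)
    return U @ V + np.diag(m)
```

With d = 300 and l = 3, each word matrix needs only 2 × 300 × 3 + 300 = 2,100 parameters instead of 90,000.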
6.3 Unsupervised Learning and Evaluations
Learning general representations for phrases aims to learn the rules of composing words,
so no actual phrase embeddings are induced and saved by the model. Once
training is finished, the model is expected to have encoded the compositional rules in its
parameters, so it can take any sequence of words (embedding vectors) as input and produce
the composed representation, regardless of whether the sequence was seen in
the training examples.
We train our model in an unsupervised fashion over a Wikipedia snapshot to induce
general phrase embeddings. The model predicts the context words of an input phrase.
Concretely, given a phrase m = [w1, w2, ..., wn] consisting of n words, let L and S be the
collections of all possible phrases and their context words in the training corpus, respectively.
The learning objective is to maximise the conditional probability distribution
over the list L by looking for parameters θ:
argmaxθ Σm∈L Σw∈S(m) log p(w|m; θ)    (6.4)
We choose the Negative Sampling technique introduced with the SkipGram model [15] for
fast training. Let C′ and C be the sets of all output vectors and embedding vectors,
respectively. The composed representation of m is o = f(C(m)). Given m and a word w,
the probability that they co-occur in the corpus is p = σ(C′(w)ᵀo), and the probability
of not co-occurring is 1 − p. The context words S(m) of m that appear in
the corpus are treated as positives, and a set of randomly sampled words S′(m) are
treated as negatives. The probability is computed as:
p(w|m; θ) = Πw∈S(m) σ(C′(w)ᵀo) Πw′∈S′(m) (1 − σ(C′(w′)ᵀo))
The objective function maximises the log probability:

P = Σw∈S(m) log σ(C′(w)ᵀo) + Σw′∈S′(m) log σ(−C′(w′)ᵀo)
Parameter θ is updated as:

θ := θ + ε ∂P/∂θ    (6.5)
where ε is the learning rate.
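The per-phrase negative-sampling objective above can be sketched as follows. Here o is the composed phrase vector, and pos_vecs / neg_vecs (illustrative names, not from the thesis) stack the output vectors C′(w) of the observed context words and of the sampled negatives.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_sampling_objective(o, pos_vecs, neg_vecs):
    """Log-probability P for one phrase under negative sampling."""
    pos = np.sum(np.log(sigmoid(pos_vecs @ o)))     # observed context words
    neg = np.sum(np.log(sigmoid(-(neg_vecs @ o))))  # randomly sampled negatives
    return pos + neg
```

Training maximises this quantity (equivalently, gradient ascent on P) with respect to the composition parameters and the output vectors.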
6.3.1 Evaluation Datasets
The evaluation focuses on unsupervised learning for general phrase embeddings. We
first train the model over a Wikipedia snapshot, then evaluate it on the phrase
similarity and Noun-Modifier Questions (NMQ) tests.
The phrase similarity test measures the semantic similarity for a pair of phrases against
human assigned scores. We use the evaluation dataset constructed by Mitchell and
Lapata [220], which contains 324 phrase pairs with human-assigned scores of pairwise
similarity. Each test example is a pair of bigram phrases, for instance (large number,
vast amount). The human-assigned similarity score ranges from 1 to 7, where 1 indicates
that two phrases are semantically unrelated and 7 indicates that they are semantically identical.
Each pair of phrases is scored by different participants. The test examples are grouped
into three classes: adjective-noun, verb-object, and noun-noun.
The NMQ test finds the semantically similar or equivalent unigram counterpart of a
phrase among candidates. We used the dataset described by Turney [177]. Each test
example gives a bigram phrase and a choice of 7 unigram words. For example, for the
phrase electronic mail, the candidates are email, electronic, mail, message,
abasement, bigamy, conjugate; the answer is the first unigram word, email. The
second candidate is the modifier of the given phrase, the third is the head noun, followed
by a synonym or hypernym of the modifier or head noun, and the last two are randomly selected
nouns. The original dataset has 2,180 samples, of which 680 are for training and 1,500
for testing. We use all the samples for testing, since the models for this evaluation are
trained in an unsupervised fashion.
6.3.2 Evaluation Approach
The semantic similarity of two phrases is measured by the cosine similarity of their
corresponding embedding vectors:

cos(u, v) = (u · v) / (‖u‖ ‖v‖)    (6.6)
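Equation 6.6 translates directly into NumPy:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of two embedding vectors (Eq. 6.6)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```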
In the phrase similarity test, we sum all the human-assigned scores for the same sample
and take the mean value to obtain the overall rating score as the ground truth. We
evaluate the model against the ground truth using Spearman's ρ correlation coefficient.
In the phrase composition test, we measure the cosine similarities between the given phrase
and all candidates, and select the candidate with the highest score.
6.3.3 Baseline
We first compare our results with distributional approaches. Mitchell and Lapata [220]
investigate different composition models, evaluated empirically on the phrase
similarity task. Turney [177] presents a dual-space model that is evaluated on both the
phrase similarity and NMQ tasks.
In addition to distributional approaches, we also use four baselines implemented using
distributed approaches. The first two models are simple linear algebraic compositions:
the vector summation p = a + b and the element-wise multiplication p = a ⊙ b. The
vector representations of words are obtained using the SkipGram model [15]. The
other two baselines are the two deep learning models, LSTM and CNN, described in
Sections 5.3.2.2 and 5.3.2.1.
6.3.4 Training
Pre-training Word Embeddings We use a Wikipedia snapshot as our unsupervised
training corpus. We first pre-trained word embeddings using the SkipGram model [15],
with the minimum word occurrence frequency set to 20, obtaining a vocabulary of
521,230 words. The vector summation and multiplication approaches take the trained
word embedding values as input, so they require no further training.
Training Deep Learning Models We then use the values of the trained embedding
vectors as inputs to the semantic composition models. For training the semantic composition
models, we use only the top 50,000 most frequent words to reduce the training time. The
reduced word list still covers 99% of the vocabulary in the evaluation datasets. We train three
models using the same Wikipedia snapshot: the MVRU, LSTM, and CNN models.
All models are trained on all possible n-grams of a sentence where 2 ≤ n ≤ 5. Both
the size of the context word window and the number of negative samples are set to 5. All
word and phrase embedding vectors have 300 dimensions. The CNN model has 3 different
region sizes {2, 3, 4}, each with 100 filters. There are no model-specific hyper-parameters
required for training the MVRU and LSTM models. The learning rate is 0.005.
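Enumerating the training n-grams (2 ≤ n ≤ 5) of a sentence can be sketched as:

```python
def ngrams(sentence, min_n=2, max_n=5):
    """All contiguous n-grams of a whitespace-tokenised sentence, min_n <= n <= max_n."""
    words = sentence.split()
    return [" ".join(words[i:i + n])
            for n in range(min_n, max_n + 1)
            for i in range(len(words) - n + 1)]
```

A five-word sentence, for instance, yields 4 + 3 + 2 + 1 = 10 training n-grams.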
6.3.5 Results and Discussion
Table 6.1 shows the evaluation results for the similarity test, where the scores indicate
Spearman's ρ correlation coefficient. The MVRU model performed significantly better
than the distributional models. In fact, models using distributed word representations
generally produce better results than distributional models.
Compared to the other distributed models, the MVRU produces much better results on
the adjective-noun and noun-noun groups. However, the simple vector summation yields
the best performance on the verb-object test. Noun phrases, including adjective-nouns
and noun-nouns, are used to describe objects and concepts, which naturally form
co-occurrence patterns with a small group of words sharing similar semantics, such
as assistant secretary, intelligence service, economic condition. Such co-occurrence
information can be easily captured by our model. On the other hand, verb
phrases describe actions that usually co-occur with their modifiers, such as subjects. This
phenomenon introduces extra difficulty for our model in capturing the co-occurrence
information. In contrast to the MVRU model, the vector summation approach
simply takes the sum of the word embedding vectors, whose values are pre-trained over
Wikipedia. Given a pair of phrases a = (a1, a2) and b = (b1, b2) consisting of 4 words, under
vector summation, phrases a and b will have a high cosine similarity score as long as a1 is
similar to b1 (or b2) and a2 is similar to b2 (or b1). The advantage is that the word embedding
vectors for verbs have already captured their lexical meanings as well as their semantic
similarities, such as require and need. Therefore, the vector summation approach can
easily compute the similarity between verb phrases. For example, the composed vectors
of attention require and treatment need have a high cosine similarity, whereas war
fight and number increase receive a low score.
The vector summation model performs significantly better than its cousin, the element-wise
multiplication model. Word embeddings trained using the SkipGram model [15]
naturally encode additive compositionality. Mikolov et al. [15] have demonstrated
the linear compositionality of word embeddings in word analogy tests, where
simple vector subtraction produces interesting results such as king - man + woman =
queen. This is also the reason that the two deep learning models, the LSTM and CNN, are
unable to outperform the simple vector summation of embeddings trained using the
SkipGram model.
Comparing the three deep learning models, the MVRU model is much more effective at
learning adjective-noun phrases than the other two models, with almost 14% better
performance. The majority of adjectives in the evaluation dataset are quite general words,
such as better, modern, special, local. These general adjectives can easily co-occur
thousands of times with different words in Wikipedia, which makes it difficult
to capture their general composition rules. Compared to the LSTM
and CNN models, our MVRU features many more parameters in the word matrices, and thus
has a clear advantage in encoding more information.
Table 6.2 shows the evaluation results for the NMQ tests. Similar to [177], we ran two tests:
the first applies no constraints, considering all 7 choices, while the second
applies constraints that dismiss the modifiers and head nouns. The random guess rate is
14.3% (1/7) for the first test and 20% for the second.
In comparison to supervised distributional models, the proposed MVRU was unable to
outperform the state of the art – the holistic model delivers more than 10 percent better
performance. However, the holistic model relies on training examples (pre-identified
Table 6.1: Phrase Similarity Test Results

Model                                AN    NN    VO
Distributional Rep.
p = a + b, semantic space [220]      36%   39%   30%
p = a + b, using LDA [220]           37%   45%   40%
p = a ⊙ b, semantic space [220]      46%   49%   37%
p = a ⊙ b, using LDA [220]           25%   45%   34%
p = Ba + Ab, semantic space [220]    44%   41%   34%
p = Ba + Ab, using LDA [220]         38%   46%   40%
Dual Space [177]                     48%   54%   43%
Distributed Rep.
p = a + b, word embeddings           64%   71%   55%
p = a ⊙ b, word embeddings           37%   46%   45%
LSTM                                 57%   72%   45%
CNN                                  57%   68%   47%
MVRU                                 70%   76%   49%

AN: Adjective-Noun, NN: Noun-Noun, VO: Verb-Object
Table 6.2: Phrase Composition Test Results

Model                                  NC      WC
Supervised Distributional Rep. [177]
p = a + b                              2.5%    50.1%
p = a ⊙ b                              8.2%    57.5%
Dual space                             13.7%   58.3%
Holistic                               49.6%   81.6%
Unsupervised Distributed Rep.
p = a + b, word embeddings             1%      62.4%
p = a ⊙ b, word embeddings             14.2%   36.5%
LSTM                                   2.6%    65.0%
CNN                                    3.4%    65.9%
MVRU                                   2.6%    70.3%

NC: no constraints, WC: with constraints
phrases) from the dataset, and is essentially a classifier. The MVRU, on the other hand,
does not use any pre-identified phrases; it is trained in a purely unsupervised fashion over
a Wikipedia snapshot and dynamically generates all phrase embeddings, aiming to
capture the compositionality.
When taking the modifiers and head nouns into account, all models delivered poor
results. The vector multiplication model produces the ‘best’ results at first sight. However,
we found that this result is biased. The multiplication model takes the element-wise
product of two word embedding vectors, whose values are typically between (-1,
1). After multiplication, the composed vector falls into a very different cluster from
the word embedding vectors, and is thus unable to identify any of the answers. The highest
cosine similarity scores obtained are typically below 0.1, meaning they are invalid. For the
summation and deep learning models, the composed vectors of the given bigram phrases
mostly have high cosine similarities with their modifiers and head nouns. Since these
models compose the meaning of a phrase by taking a part of the meaning from each
constituent word, the composed vector retains a high degree of cosine similarity
with its constituents. A similar phenomenon is also reported by Turney [177], who uses a
distributional approach to tackle the same problem.
When the modifiers and head nouns are not considered, the vector summation and deep
learning models improve dramatically. The proposed MVRU
model produces the best score of 70.3%, an improvement of 50 percentage points over the
random guessing rate, and outperforms the others by at least 4.5%. The vector summation
still produces a solid score, but cannot outperform any of the deep learning models. Our model
outperforms the simple linear summation by about 8%, indicating that it is able to
encode more complicated compositional rules than the simple vector summation model.
6.4 Supervised Learning and Evaluations
The MVRU model can also be trained in a supervised fashion for classification tasks.
To build the classifier, a compositional model is connected to a softmax or a logistic
regression layer as:
p(s) = f(W · o+ b) (6.7)
where o is the phrase representation, f is the output function (softmax for multiclass
classification or sigmoid for binary classification), W is the weight matrix of the
regression layer, and b is the bias unit. Given a training set D, we aim to maximise the log
probability of choosing the correct label for s ∈ D by looking for parameters θ:
probability of choosing the correct label for s ∈ D by looking for parameters θ:
argmaxθ Σs∈D log p(slabel|s; θ)    (6.8)
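The classification head of Equations 6.7 and 6.8 can be sketched as follows, assuming a softmax output over k classes; the composed phrase vector o would come from the lower MVRU layers.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()

def class_distribution(o, W, b):
    """p(s) = f(W o + b), Eq. (6.7), with f = softmax."""
    return softmax(W @ o + b)

def neg_log_likelihood(o, W, b, label):
    """Negative log probability of the correct label, cf. Eq. (6.8)."""
    return float(-np.log(class_distribution(o, W, b)[label]))
```

Training minimises the negative log-likelihood over the training set, which is equivalent to maximising the objective in Equation 6.8.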
6.4.1 Predicting Phrase Sentiment Distributions
This evaluation predicts the fine-grained sentiment distributions of adverb-adjective pairs,
which concerns how adverbs can change the meanings of adjectives, i.e. the semantics
of phrases containing adverbial intensifiers, attenuators or negative morphemes. For
instance, Socher et al. [19] show that not awesome is more positive than fairly awesome,
and not annoying has a similar distribution to not awesome in IMDB movie reviews.
We use the adverb-adjective pairs extracted from the IMDB movie reviews1, also
used by [19]. Reviewers score a movie from 1 to 10, indicating the most negative to the
most positive reviews. The dataset lists the frequency of each pair of words in the different
scoring categories. For example, the phrase terribly funny usually occurs in positive
reviews. We randomly split the dataset into a training set (3,719 examples), a validation
set (500 examples), and a test set (2,000 examples), keeping only the phrases
occurring more than 40 times. The learning objective is to maximise the probability
of the correct distribution for each pair of words by adding a softmax regression layer to minimise
the cross-entropy error. Similar to Socher et al. [19], we use the Kullback–Leibler (KL)
divergence to evaluate our model. We evaluated our model with different initialisations
of the parameters, and found that using word embeddings pre-trained over large corpora,
such as Wikipedia, leads to quicker convergence but makes a trivial difference to the KL
divergence score. We also evaluated our model with different sizes of word embedding
vectors. Unlike [19], we found that our model gains better performance using
larger word embeddings of up to 300 dimensions. A smaller learning rate (0.01) and
L2 regularisation also help prevent overfitting.
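The KL divergence used for this evaluation can be sketched as follows; the smoothing constant eps is an assumption added to avoid log(0) on empty scoring bins.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between a ground-truth distribution p and a predicted one q."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))
```

Lower scores mean the predicted distribution over the ten review scores is closer to the ground truth.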
Table 6.3: KL Divergence Score

Algorithm                                         KL-Score
p = 0.5(a + b), vector average                    0.103
p = a ⊙ b, element-wise vector multiplication     0.103
p = [a; b], vector concatenation                  0.101
p = g(W[a; b]), RNN [18]                          0.093
p = Ba + Ab, Linear MVR [220]                     0.092
p = g(W[Ba; Ab]), MV-RNN [19]                     0.091
MVRU                                              0.088

p – phrase vector, A, B – word matrices, a, b – word embeddings
Table 6.3 shows the average KL divergence scores; the proposed MVRU outperformed
all baseline models. Figure 6.3 plots some sample adverb-adjective pairs with their
ground-truth distributions and the distributions predicted by the MVRU model, where the
x-axis shows the review score from 1 to 10, indicating the most negative to the most
positive reviews, and the y-axis shows the probability. The adjectives are boring (negative) and
interesting (positive). The MVRU successfully predicted the sentiment distributions
of the words when paired with the negative morpheme not, the attenuators quite and fairly, and the
intensifier extremely. The negation inverts the meanings of boring and interesting,
e.g. not boring appears in more positive reviews. Paired with adverbial attenuators,
both boring and interesting become neutral. The intensifier strongly increases the
positive and negative meanings of the words, e.g. extremely interesting mostly appears
in positive reviews, with a sharp increase from score 7 to 10.
1publicly available at http://nasslli2012.christopherpotts.net/composition.html
Figure 6.3: Predicting Adverb-Adjective Pair Sentiment Distributions (predicted vs. ground-truth distributions over review scores 1–10 for not boring, not interesting, quite boring, fairly interesting, extremely boring, and extremely interesting)
6.4.2 Domain-Specific Term Identification
Domain-specific terminology identification automatically identifies domain-relevant technical
terms from a given corpus. Concretely, we train a binary classifier to identify
whether or not a candidate is relevant to the domain. We therefore use the negative log
likelihood function; the objective is to minimise the loss E:

E = −yi log σ(W · o + b) − (1 − yi) log(1 − σ(W · o + b))    (6.9)
where y = 1 for positive (domain relevant) and 0 for negative (irrelevant) terms.
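Under this objective, a single candidate term contributes the binary cross-entropy loss of Eq. 6.9. A minimal sketch, using random vectors in place of the composed phrase representation o (the weights, bias, and the 50-dimensional size are illustrative assumptions, not the thesis configuration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def term_loss(o, y, W, b):
    """Binary cross-entropy loss of Eq. 6.9 for one candidate term.

    o    : composed phrase vector (here random, standing in for the model output)
    y    : 1 if the term is domain relevant, 0 otherwise
    W, b : weights and bias of the classification layer
    """
    p = sigmoid(np.dot(W, o) + b)          # P(term is domain relevant)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
o, W = rng.standard_normal(50), rng.standard_normal(50)
loss_pos = term_loss(o, 1, W, 0.0)   # loss if the term is labelled relevant
loss_neg = term_loss(o, 0, W, 0.0)   # loss if it is labelled irrelevant
```

Note that exp(−loss_pos) and exp(−loss_neg) recover p and 1 − p, so the two losses trade off: a confident prediction makes one small and the other large.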
We evaluate our model on two datasets. The first dataset is the GENIA corpus2, which
is a collection of 1,999 abstracts of articles in the field of molecular biology. We use the
2 Publicly available at http://www.geniaproject.org/
current Version 3.02 for our evaluation. The second dataset is the ACL RD-TEC3 corpus,
which consists of 10,922 articles published between 1965 and 2006 in the domain of
computer science. The ACL RD-TEC corpus classifies terms into three categories: invalid
terms, general terms, and computational terms. We treat only the computational terms as
ground truth in our evaluation.
The ACL RD-TEC corpus provides a pre-identified candidate list. We therefore only
need to identify candidate terms from the GENIA corpus. We use a predefined POS
pattern <JJ>*<NN.*>+ to chunk candidates, that is, zero or more adjectives followed by
one or more nouns. However, such a simple pattern is not able to identify all terms in
the ground-truth list. For example, the phrase chunker spots the term novel antitumor
antibiotic, whereas in the ground-truth set the correct term is antitumor antibiotic.
Since the purpose of this experiment is to evaluate the capability of learning term com-
positionality, we only take the identified ground-truth terms into account. Table 6.4 shows
the statistics.
Table 6.4: Candidate Terms Statistics

Dataset   Candidate   Positive          Negative
GENIA     40,998      20,704 (50.5%)    20,294 (49.5%)
ACL       83,845      13,832 (16.5%)    70,013 (83.5%)
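The <JJ>*<NN.*>+ pattern can be chunked with any off-the-shelf tool (e.g. NLTK's RegexpParser); the plain-Python stand-in below implements the same greedy rule over pre-tagged tokens, where the sample sentence and its tags are illustrative:

```python
def chunk_candidates(tagged):
    """Greedy chunker for the POS pattern <JJ>*<NN.*>+ :
    zero or more adjectives followed by one or more nouns."""
    candidates, words, seen_noun = [], [], False

    def close():
        nonlocal words, seen_noun
        if seen_noun:                      # a chunk needs at least one noun
            candidates.append(" ".join(words))
        words, seen_noun = [], False

    for word, tag in tagged:
        if tag.startswith("NN"):           # nouns (NN, NNS, NNP, ...) extend the chunk
            words.append(word)
            seen_noun = True
        elif tag == "JJ":
            if seen_noun:                  # an adjective after a noun starts a new chunk
                close()
            words.append(word)
        else:                              # any other tag closes the current chunk
            close()
    close()
    return candidates

tagged = [("a", "DT"), ("novel", "JJ"), ("antitumor", "JJ"),
          ("antibiotic", "NN"), ("was", "VBD"), ("isolated", "VBN")]
candidates = chunk_candidates(tagged)      # ["novel antitumor antibiotic"]
```

As the text notes, the pattern over-extends to novel antitumor antibiotic; the adjective cannot be dropped without extra knowledge of the ground-truth list.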
We randomly select 40% of the data for training, 20% for validation, and 40% for
evaluation from the candidate set listed in Table 6.4. Word embeddings are pre-trained
over each evaluation corpus.
In this evaluation, we add C-value [56] as an extra baseline algorithm. C-value is a
popular unsupervised statistical ranking algorithm for term identification, in which each
candidate term is assigned a score indicating its degree of domain relevance. It requires
extracting a number of top-ranked candidates as the output. Table 6.5 shows the
results.
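As a rough sketch of how C-value ranks candidates (following Frantzi et al.'s published formulation, which the thesis does not restate, so treat the details as an assumption): a term nested inside longer candidates is penalised by the mean frequency of those longer candidates. The toy frequencies are invented:

```python
import math

def c_value(freqs):
    """C-value sketch: freqs maps a multi-word candidate term -> corpus frequency.

    Non-nested term a:  C-value(a) = log2|a| * f(a)
    Nested term a:      C-value(a) = log2|a| * (f(a) - mean frequency of the
                                     longer candidates containing a)
    where |a| is the term length in words (|a| >= 2 assumed here).
    """
    scores = {}
    for a, f_a in freqs.items():
        # frequencies of longer candidates that contain a (word-boundary check)
        nested_in = [f for b, f in freqs.items() if b != a and f" {a} " in f" {b} "]
        base = math.log2(len(a.split()))
        if nested_in:
            scores[a] = base * (f_a - sum(nested_in) / len(nested_in))
        else:
            scores[a] = base * f_a
    return scores

freqs = {"adrenal cortex": 10, "adrenal cortex hormone": 4, "basal cell carcinoma": 5}
scores = c_value(freqs)
# "adrenal cortex" is nested in "adrenal cortex hormone": log2(2) * (10 - 4) = 6.0
```

The top-k cutoffs in Table 6.5 (e.g. Top 5000) then simply keep the k highest-scoring candidates.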
The proposed MVRU model consistently outperforms the others on both datasets. On the
GENIA dataset, the minimum improvement in F-score is 2.3%, whereas on the ACL RD-
TEC dataset it is 8.1%. Although the accuracy measure is commonly used in classification
tasks, it only reflects the true performance of a model when the classes are evenly distributed
in an evaluation set. For example, on the ACL RD-TEC dataset, the positive examples
are about 16.1%, whereas the negative ones are about 83.9%. A model that incorrectly
classifies all the examples as negative would therefore still achieve an accuracy score
of 83.9%. Hence, the accuracy measure is only meaningful on the GENIA dataset, where
positive and negative examples are evenly distributed.
3 Publicly available at https://github.com/languagerecipes/the-acl-rd-tec
Table 6.5: Evaluation Results

GENIA
                  Prec.   Recl.   F-score   Acc.
Random Guess      51.0%   50.0%   50.5%     50.0%
CV (Top 5000)     60.4%   29.7%   39.8%     –
CV (Top 10000)    53.9%   53.0%   53.4%     –
CV (Top 15000)    50.6%   73.4%   59.9%     –
p = a + b         63.6%   79.8%   70.8%     66.5%
p = a ⊙ b         69.1%   50.8%   58.6%     63.3%
LSTM              67.0%   74.5%   70.5%     68.3%
CNN               63.0%   80.7%   70.7%     66.0%
MVRU              65.1%   83.3%   73.1%     68.8%

ACL RD-TEC
                  Prec.   Recl.   F-score   Acc.
Random Guess      16.1%   50.0%   24.4%     50.0%
CV (Top 4000)     11.0%   5.0%    6.9%      –
CV (Top 8000)     15.7%   14.3%   15.0%     –
CV (Top 12000)    19.5%   26.6%   22.5%     –
p = a + b         77.0%   66.1%   71.1%     90.5%
p = a ⊙ b         79.0%   60.3%   68.4%     90.2%
LSTM              72.5%   55.0%   62.5%     85.5%
CNN               70.8%   67.7%   69.2%     85.2%
MVRU              77.9%   80.7%   79.2%     92.5%
It is also easy to notice that the GENIA dataset has a much higher proportion of
positive examples than the ACL RD-TEC dataset. However, most
models produce similar or even better F-scores on the ACL RD-TEC dataset. There
are two possible reasons. Firstly, in the ACL RD-TEC corpus, the negative terms
contain a large number of invalid characters (e.g. ~), tokeniser mistakes (e.g.
evenunseeneventsare), and non-content-bearing words (e.g. many). The classifiers
can easily spot such noisy data. Secondly, the ACL RD-TEC corpus is much larger
than GENIA, which enables the SkipGram algorithm to induce more
precise word embedding vectors as inputs to the classification layer.
Surprisingly, the simple vector summation model produces better results than the LSTM
and CNN deep learning models on both evaluation datasets. In fact, vector
summation produces quite impressive results across all the evaluations, demonstrating that it
can serve as an alternative model for learning compositional semantics given its simplicity of
implementation. Although our proposed model performs better than vector summation,
it may be more difficult to train due to its large number of
parameters. Having said that, the matrices obtained can potentially be used in phrase
resynthesis or sentence generation.
6.5 Conclusion
In this chapter, we have introduced a matrix-vector recurrent unit model built upon
a recurrent network structure for learning compositional semantics. Each word is rep-
resented by a matrix-vector pair, where the vector encodes the meaning of the
word and the matrix captures its composition rules. We evaluated our model on phrase
similarity, NMQ, and domain-specific term identification tasks, and demonstrated that the
proposed model outperforms the LSTM and CNN deep learning models, as well as the simple
vector summation and multiplication compositions.

This chapter provides a solid foundation for learning the compositional semantics of phrases.
In the next chapter, we will present a deep learning model for documents that incorporates
the MVRU model to extract keyphrases.
Chapter 7
A Deep Neural Network
Architecture for AKE
In Chapter 6, we introduced the Matrix-Vector Recurrent Unit (MVRU) model,
and demonstrated its effectiveness in learning the semantic meanings of phrases. However,
identifying keyphrases requires understanding not only the meaning of each phrase ap-
pearing in a document, but also the overall meaning of the document. Hence, in this
chapter, we first introduce a deep learning model that automatically encodes the mean-
ing of a document. It represents a document as a cube, as shown in Figure 7.2, where
the height is the number of sentences, the width is the number of words in a sentence,
and the depth is the dimension of the word embedding vectors. The cube representation of
a document is input to a convolutional neural network, which analyses the intrinsic structure
of the document and the latent relations among words and sentences. Hence, the model
is named Convolutional Document Cube (CDC).
In the second part of the chapter, we propose a novel AKE deep learning model that extracts
keyphrases by learning the meanings of both phrases and documents. The model consists
of two deep neural networks: MVRU and CDC. The MVRU model is responsible for
learning the meanings of phrases, and CDC encodes the meanings of documents. The two
networks are jointly trained by adding a logistic regression layer as the output layer to
identify keyphrases. We evaluate the model on three different datasets, Hulth, DUC, and
SemEval, and demonstrate that the model delivers performance identical to the state-of-
the-art algorithm on the Hulth dataset, and outperforms the state-of-the-art algorithm
on the DUC dataset.
7.1 Introduction
Manually identifying keyphrases is a complex cognitive process, which requires human
annotators to understand the meanings of both the phrases and the
document in order to select the most representative phrases that express the core
content of the document. On the other hand, existing AKE approaches, both
supervised and unsupervised, are unable to, or do not attempt to, under-
stand the meanings of phrases and documents.
Most AKE algorithms identify keyphrases based only on the representations of phrases.
Phrases are represented by manually selected features, each describing a partic-
ular characteristic of phrases, such as frequency or co-occurrence statistics. Neverthe-
less, selecting features is a biased and empirical task. These features usually capture
little or no linguistic information about how phrases are formed, making such algorithms inca-
pable of capturing the meanings of phrases.
The representations of documents in existing AKE approaches are simple re-arrangements
of phrase feature vectors. For example, Term Frequency - Inverse Document Frequency
(TF-IDF) [2] represents documents in a term-document matrix, where each row vector
corresponds to a unique phrase in the corpus, each column represents a document, and
each cell value corresponds to the frequency of the phrase in a particular document.
However, TF-IDF identifies the importance of a phrase in a document using only three
statistics: the phrase frequency, the number of documents containing the phrase, and
the total number of documents in the corpus. Graph-based unsupervised approaches,
such as TextRank [3], represent a document as a graph, where each vertex corresponds
to a phrase and an edge connecting two vertices represents their co-occurrence relation.
However, the ranking process only makes use of the phrase co-occurrence information.
In supervised machine learning approaches [1, 4, 5, 83], a document is just a collection
of its phrase feature vectors, which does not carry any semantic meaning of the docu-
ment. Inevitably, the lack of cognitive ability in existing AKE approaches inhibits their
performance in extracting keyphrases.
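The three statistics behind TF-IDF can be sketched in a few lines; the toy corpus is invented, and the natural-log IDF used here is one common choice among several variants:

```python
import math
from collections import Counter

def tfidf(docs):
    """Per-document TF-IDF scores from tokenised documents.

    Each score combines only the three statistics mentioned above:
    term frequency, document frequency, and corpus size.
    """
    n = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(t for doc in docs for t in set(doc))
    return [{t: f * math.log(n / df[t]) for t, f in Counter(doc).items()}
            for doc in docs]

docs = [["keyphrase", "extraction", "keyphrase"],
        ["document", "extraction"],
        ["document", "modelling"]]
scores = tfidf(docs)
# "keyphrase" occurs twice in doc 0 and in no other document: 2 * ln(3/1)
```

Terms appearing in every document get an IDF of ln(1) = 0, which is exactly why TF-IDF alone, with no notion of meaning, discards common but potentially meaningful phrases.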
In this chapter, we propose a deep learning architecture to tackle the AKE problem,
which not only eliminates the effort of feature selection, but also mimics the natural
cognitive process of manual keyphrase identification by human annotators – attempting
to encode the meanings of phrases and documents. To achieve this, a deep learning
model needs to learn representations that encode the semantic meanings of both phrases
and documents. Thus, the proposed network consists of a
phrase model and a document model. Together, they are capable of learning distributed
representations of phrases and documents.
Learning the meaning of a phrase means encoding the rule for composing the meanings of
its constituent words [25]. We use the Matrix-Vector Recurrent Unit (MVRU) model
introduced in Chapter 6 to learn the semantic compositionality of phrases. The
recurrent architecture allows the network to capture the order of words by sequentially
composing the current word with its precedents. However, in contrast to the constituent
words of a phrase, which always have strong dependencies, words in a document only
have strong dependencies on nearby words, and the strength of a dependency decreases
as the distance between words increases. For example, the first few words of a document
usually have no correlation with the last ones. Therefore, capturing such
long-term dependencies of words in a document is unnecessary and non-intuitive. Hence,
we propose a novel model to learn the meaning of a document, namely the Convolutional
Document Cube model. We first represent a document as a cube, where the height repre-
sents the number of sentences, the width represents the number of words in a sentence,
and the depth is the size of the word embedding vectors. We then ‘slice’ the cube along its
depth (word embedding) dimension, generating 2-dimensional data matrices (chan-
nels), which are the inputs to a convolutional neural network (CNN). The CNN
has the strength to encode regional compositions at different locations in the data matrices.
The network features different region sizes, where each region has a number of filters
that produce the feature maps. By convolving the data matrices with different region
sizes, the CNN captures how words are composed in each region. Since we use relatively
small region sizes, the network only analyses words and their close neighbours, aiming
to encode only short-term dependencies. For example, given the two sentences the dog is
walking in a bedroom and the cat is running in a kitchen, using a region of
size 2×3, we expect the network to capture the important information by scanning
a pair of trigrams in the same position of the two sentences, such as ([the dog is], [the
cat is]), ([dog is walking], [cat is running]), ([in a bedroom], [in a kitchen]).
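The trigram pairs above correspond to the sub-regions a 2×3 receptive field visits when sliding over a two-sentence grid; a small sketch of that bookkeeping (no learned filters involved):

```python
def region_scan(sentences, height=2, width=3):
    """Enumerate the word regions a height x width receptive field would see
    when sliding over a document laid out as a sentence-by-word grid."""
    regions = []
    for i in range(len(sentences) - height + 1):        # vertical positions
        for j in range(len(sentences[0]) - width + 1):  # horizontal positions
            regions.append([tuple(sentences[i + r][j:j + width])
                            for r in range(height)])
    return regions

s1 = "the dog is walking in a bedroom".split()
s2 = "the cat is running in a kitchen".split()
pairs = region_scan([s1, s2])
# the first region pairs the trigrams ('the', 'dog', 'is') and ('the', 'cat', 'is')
```

In the actual model, each such region is multiplied by a learned filter rather than collected as words, but the sliding pattern is the same.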
Features are automatically learnt using distributed representations, so no manually
selected features are required. The output layer is a logistic regression layer, which clas-
sifies whether a phrase is a keyphrase for the input document based on the meanings
learnt from the phrase and document models. We evaluate the model on the same
datasets described in Chapter 3 (the Hulth [5], DUC [52], and SemEval [51] datasets),
and demonstrate that the proposed approach produces performance identical to the state
of the art on the Hulth dataset and delivers a new state-of-the-art performance on the DUC
dataset, without employing any dataset-dependent heuristics.
7.2 Deep Learning for Document Modelling
Recent studies have shown great progress in learning vector representations of words
and multi-word expressions. For instance, Mikolov et al. [15] present the word2vec model,
discovering the existence of linear relations in lexical semantics, and Cho et al. [97]
demonstrate the capability of learning compositional semantics and modelling language
using the Gated Recurrent Unit (GRU) network.
However, learning the representation and encoding the meaning of a document remains
a great challenge for the NLP community [190]. This is because the content of a doc-
ument is typically the ad hoc result of a creative process that goes far beyond
the level of lexical and compositional semantics. Documents are manuscripts repre-
senting the thoughts of their authors, which naturally form logical flows embedded in the
inherent structures of the content. Therefore, modelling documents requires not only
understanding lexical and compositional semantics, but also analysing the inher-
ent structures and the relations between sentences, such as temporal and causal relations.
Existing studies on modelling documents commonly employ a sentence-to-document ar-
chitecture, as shown in Figure 7.1. The bottom layer is the sentence model, which
takes the constituent word embeddings of each sentence as inputs and composes them into
sentence embeddings. The upper layer is the document model, which takes the sentence em-
beddings obtained from the previous layer as inputs, analyses the intrinsic structures
and relations between sentences, and then composes them into the document representation.
The sentence-to-document architecture can be implemented using the same type of neu-
ral network at both levels. For example, Misha et al. [191] implement both the sentence and
document layers using two CNN networks. At the sentence level, they use a CNN to learn
the representation of an input sentence using one-dimensional vector-wise convolution.
At the document level, another CNN takes the learnt representations of all
sentences as inputs and outputs the document representation. The network hierarchi-
cally learns to capture and compose low-level lexical features into high-level semantic
concepts. Lin et al. [221] propose a hierarchical recurrent architecture, which consists
of two independent recurrent neural networks. The lower-level model learns lexical se-
mantics by predicting the next word within a sentence given the current one. The upper
level encodes the sentence history from the precedents to predict the words in the next
sentence, which captures higher-level semantic concepts. Ganesh et al. [192] use two
probabilistic language models to jointly learn the representations of documents. The
model first learns sentence vectors via a feed-forward network by training a word-level
language model, then learns document representations using the sentence vectors.
[Figure: the word embeddings w1 … wn of each sentence s1 … sm feed a sentence model that produces sentence representations, which in turn feed a document model that produces the document representation.]
Figure 7.1: General Network Architecture for Learning Document Representations
Table 7.1: Common Deep Learning Document Models

Sent Model \ Doc Model    Feed-forward           Recurrent & Variants                    Recursive    Convolutional
Feed-forward              Ganesh et al. [192]    –                                       N/A          –
Recurrent & Variants      –                      Lin et al. [221], Tang et al. [190],    N/A          Lai et al. [225]
                                                 Zhang et al. [222], Yang et al. [223],
                                                 Li et al. [224]
Recursive                 –                      Liu et al. [226]                        N/A          –
Convolutional             –                      Kalchbrenner and Blunsom [189],         N/A          Misha et al. [191]
                                                 Tang et al. [190]
Alternatively, hybrid network architectures have also been proposed for modelling documents.
Kalchbrenner and Blunsom [189] use a convolutional-recurrent network to learn the com-
positionality of discourses. The model consists of a CNN and a recurrent network: the
CNN is responsible for learning the representations of sentences, and the recurrent
network takes the outputs of the convolutional network to induce the document rep-
resentation. Similarly, Tang et al. [190] propose to use either an LSTM or a CNN to learn
the meaning of a sentence at the lower level, then use a GRU network [97] to adaptively
encode the semantics of sentences and their relations into document representations. Liu et
al. [226] propose a recursive-recurrent network architecture, where the recursive network
learns the representations of sentences via pre-generated syntactic tree structures, and
the recurrent network learns sequential combinations of sentences to induce document
representations. Lai et al. [225] use a recurrent-convolutional network that has the op-
posite structure to Kalchbrenner and Blunsom's convolutional-recurrent network [189]:
the recurrent structure captures sentence information, and the CNN learns
the key concepts of a document.
Table 7.1 lists some recent work in document modelling. Recurrent networks and their variants,
such as LSTM and GRU, are the most popular choices for modelling both sentences and
documents. In comparison to other types of neural networks, the recurrent architecture
takes inputs of various lengths, which naturally makes it a better choice for modelling
sentences and documents. In addition, it repeatedly combines the current input with its
precedents, capturing long-term dependencies of words and sentences in a document. The
CNN network usually learns representations by stacking the input embedding vectors
into a matrix1, and then analysing the regional compositions of the matrix. However,
a common concern is that, unlike in image processing, location invariance does not
hold for embedding vectors, which makes the CNN network less popular than the
recurrent network. The recursive neural network requires a parser to produce a syntax
tree as a prior, so it is best at learning sentence-level semantics, but less useful for
encoding semantics at the document level.
It is also worth mentioning probabilistic topic models for document modelling [227–
229]. Among them, Latent Dirichlet Allocation (LDA) [76] is a popular generative
statistical model that allows sets of observations to be explained by unobserved ones.
It generates topics based on word frequencies from a set of documents, hypothesising
that a document is a mixture of a small number of topics and that each word's creation
is attributable to one of the document's topics. Deep neural networks for learning
document embeddings from word contexts, on the other hand, aim to learn useful
representations from lexical semantics (word embeddings) and compositional semantics
(phrase and sentence embeddings) up to the meanings of documents, using distributed
representations.
7.2.1 Convolutional Document Cube Model
In this section, we introduce the Convolutional Document Cube (CDC) model to learn
document representations. We treat a document as a cube, i.e. a third-order tensor.
The height of the cube is the number of sentences in the document, the width is the
number of words, and the depth is the size of the word embeddings. Instead of modelling
from sentences towards documents, our model learns the representation of a document directly
1 Mathematically, this process is conducted by concatenating the input vectors and then performing 1-dimensional convolution with a predefined window.
by ‘slicing’ the depth (word embedding) dimension into a set of 2-dimensional matrices;
each matrix becomes a ‘channel’ that is input to a CNN. This is similar to
image processing with CNNs, where a typical RGB image has three input channels (Red,
Green, Blue). In the CDC model, each channel of a document can be thought of as
a 2-dimensional snapshot of the whole document, where each word is represented by
one feature of its embedding vector. In comparison to existing work, the proposed model has
three advantages:
1. Recurrent structured networks, such as LSTM, tend to capture long-term semantic
dependencies or correlations between words and sentences through the entire docu-
ment, i.e. from its beginning to its end. However, it may not be
necessary to capture such long-term dependencies or correlations, given that the
closer words or sentences have the strongest semantic correlations. In addition,
while the semantics of the beginning (e.g. abstract or introduction)
and the end (e.g. conclusion) may be correlated, sentences appearing in the beginning can have very
loose or even no dependencies on those in the middle of the document. Hence,
capturing all dependencies between sentences is unnecessary and could also introduce
noise. Unlike recurrent structured networks, our model concerns only regional de-
pendencies among words and sentences. Such regional dependencies are captured
by using small regions in the CNN network.
2. Using a single convolutional network is more computationally efficient, as it re-
quires fewer parameters to be learnt. The CDC model learns the representation
of a document directly, without the need to learn sentence representations
first.
3. Specifically for AKE, the proposed CDC model takes word positions into account.
In supervised machine learning for AKE, many approaches use hand-crafted posi-
tional features indicating the relative positions of candidates appearing in a doc-
ument, assuming that important phrases always occur at the beginning or the end of a
document. The proposed model does not require any hand-crafted features. Instead,
by convolving over snapshots of the document with filters of different sizes, the
network automatically captures the position information.
Figure 7.2 shows an overview of the proposed model. Let D ∈ Rh×w×d be a document,
where h denotes the number of sentences in the document (the height), w denotes the
number of words in a sentence (the width), and d is the dimension of the word embedding
vectors (the depth). We fix the size of D by padding with zeros, such that sentences of variable
length are padded to w and documents with different numbers of sentences are padded to
h. We use a convolutional neural network to learn the distributed representation D′ of D.

[Figure 7.2: Convolutional Document Cube Model. A toy six-sentence document is embedded into an h × w × d cube, with zero vectors padding short sentences and documents; the cube is sliced along the embedding dimension d into channels, convolved with n filters for each of r regions (e.g. 2×2, 3×3, 1×3) to produce n×r feature maps, which pass through a max pooling layer and a fully connected layer to yield the document representation.]

Instead of performing 1-D convolution over word embedding vectors, we slice the
cube D into channels (input feature maps). The number of channels is the word embedding
size d. Each channel is a matrix M of size h×w, taking only one feature from
each word embedding vector. The convolutional layer parameters consist of r predefined
regions (receptive fields) of various sizes. For example, in Figure 7.2 there are three
regions. Each region has n linear filters (weights) that produce n output feature maps
by repeatedly convolving across sub-regions of the channels. A filter has the size
h′ × w′ × d, where h′ and w′ denote the height and width of the region respectively, and
d is the word embedding size (the depth of D). For example, a region may have a size of
5× 5 (height and width); the parameters for that region then total 5× 5× d× n. An
output feature map ck from the k-th filter is computed by convolving the input channels
with a linear filter and a bias unit, and then applying a non-linear function, such as the
hyperbolic tangent tanh, as:
ck = tanh( ∑_{i=0}^{h′} ∑_{j=0}^{w′} Wk · D[i : i+ w − 1, j : j + h− 1, 0 : d] + bk ) (7.1)
where D[i : i+ w − 1, j : j + h− 1, 0 : d] is a sub-tensor of D spanning width i : i+ w − 1,
height j : j + h− 1, and depth 0 : d, and Wk denotes the weights of the k-th filter. The total number of
feature maps in the network is m = r×n, so the output from the pooling layer O ∈ Rm
is computed as:
Ok = g(ck) (7.2)
where g is the pooling function. The output from the convolutional layer is then sent to
a fully connected layer to produce the final representation D′, as:
D′ = tanh(W ·O + b) (7.3)
where W and b are the weight matrix and bias unit, respectively.
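A simplified forward pass through Eqs. 7.1–7.3 might look as follows. Filter weights are randomly initialised for illustration, each tanh feature map is max-pooled to a single value, and the region sizes mirror those in Figure 7.2; this is a sketch of one reading of the equations, not the thesis implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def cdc_forward(D, regions, n_filters, out_dim):
    """Simplified CDC forward pass.

    D       : (h, w, d) document cube
    regions : list of (h', w') receptive-field sizes
    Each filter of size h' x w' x d convolves over the cube (Eq. 7.1), each
    tanh feature map is max-pooled to one value (Eq. 7.2), and a fully
    connected tanh layer yields the representation D' (Eq. 7.3).
    """
    h, w, d = D.shape
    pooled = []
    for (hp, wp) in regions:
        W_k = rng.standard_normal((n_filters, hp, wp, d)) * 0.1  # illustrative init
        b_k = np.zeros(n_filters)
        for k in range(n_filters):
            # convolve filter k over every valid position of the cube
            fmap = np.array([[np.sum(W_k[k] * D[i:i+hp, j:j+wp, :]) + b_k[k]
                              for j in range(w - wp + 1)]
                             for i in range(h - hp + 1)])
            pooled.append(np.tanh(fmap).max())     # max pooling over the map
    O = np.asarray(pooled)                          # m = r * n pooled values
    W = rng.standard_normal((out_dim, O.size)) * 0.1
    b = np.zeros(out_dim)
    return np.tanh(W @ O + b)                       # final representation D'

D = rng.standard_normal((6, 8, 5))                  # 6 sentences, 8 words, d = 5
D_prime = cdc_forward(D, regions=[(2, 2), (3, 3), (1, 3)], n_filters=4, out_dim=10)
```

With r = 3 regions and n = 4 filters each, the pooling layer emits m = 12 values, which the fully connected layer maps to a 10-dimensional document vector.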
7.3 Proposed Deep Learning Architecture for AKE
In this section, we introduce a general network architecture for AKE, which consists
of two deep learning models aiming to learn the meanings of phrases the documents
separately by updating their own word embedding parameters, as shown in Figure 7.3.
Two models are jointly trained in a supervised learning fashion by connecting to a logistic
regression layer.
Concretely, let x be a phrase and x′ the distributed vector representation of the
phrase obtained from the phrase model. Let D denote a document and D′ the
vector representation of the document obtained from the document model. Let d be
the dimension of the distributed vectors for words, phrases, and documents; then we have
x′ ∈ Rd and D′ ∈ Rd. The probability that x is a keyphrase for the document D is
computed as:

p(x|D) = σ(x′TD′) (7.4)

where σ is the sigmoid function. The probability that x is not a keyphrase is 1− p.
Let y be the ground truth. Applying the negative log likelihood function, the learning
objective is to minimise the loss L by searching for the parameters θ = (X′, C′,Wp,Wd), where
X′ is the collection of word embeddings for the phrase model, C′ is the collection of
[Figure: an input phrase and an input document pass through their own word embedding lookup tables into the phrase model and the document model; the resulting phrase embedding and document embedding feed a logistic regression layer computing p(x|D).]
Figure 7.3: Overview of Proposed Network Architecture
word embeddings for the document model, and Wp, Wd are collections of weights for
the phrase and document model, respectively. The loss L is computed as:
L = −(yi log σ(x′TD′) + (1− yi) log(1− σ(x′TD′))) (7.5)
and θ is updated as:
θ := θ − ε ∂L/∂θ (7.6)
where ε is the learning rate.
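One step of this objective can be sketched by updating only the phrase and document embeddings under Eqs. 7.4–7.6, holding the rest of θ fixed; this is a simplification, and the dimensions and learning rate are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x_emb, D_emb, y, lr=0.1):
    """One gradient step on Eq. 7.5, treating the phrase embedding x' and
    document embedding D' as the only trainable parameters.

    p = sigma(x'^T D')  (Eq. 7.4); dL/dx' = (p - y) D', dL/dD' = (p - y) x'.
    """
    p = sigmoid(x_emb @ D_emb)
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))   # Eq. 7.5
    grad = p - y                                        # dL/d(x'^T D')
    x_new = x_emb - lr * grad * D_emb                   # Eq. 7.6 applied to x'
    D_new = D_emb - lr * grad * x_emb                   # and to D'
    return loss, x_new, D_new

rng = np.random.default_rng(1)
x_emb, D_emb = rng.standard_normal(16), rng.standard_normal(16)
loss0, x_emb2, D_emb2 = train_step(x_emb, D_emb, y=1)
loss1, _, _ = train_step(x_emb2, D_emb2, y=1)
# for a positive example, the loss decreases after the update
```

In the full model the same gradient also flows back through the MVRU and CDC networks into X′, C′, Wp, and Wd.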
The phrase model employed in this chapter is the Matrix-Vector Recurrent Unit (MVRU)
model introduced in Chapter 6.
7.4 Evaluations
7.4.1 Baseline
The performance of the proposed AKE deep learning model is compared against the
state-of-the-art AKE algorithms described in Section 7.4.2. Two baseline models are
constructed to evaluate the effectiveness of the CDC model: we replace the document
model of the proposed AKE deep learning architecture with the two sentence-to-document
models proposed by Tang et al. [190], while keeping the same phrase model. Tang et
al. [190] propose CNN-GRU and LSTM-GRU, in which the sentence-to-document mod-
els first learn sentence representations in the lower layer of the network, then use
Figure 7.4: Baseline Document Model – LSTM-GRU Architecture
Figure 7.5: Baseline Document Model – CNN-GRU Architecture
the learnt sentence representations as inputs to the upper layer to learn the document
representations. In the CNN-GRU and LSTM-GRU models, the sentence models are
implemented using either a CNN or an LSTM network, and the document model is a GRU
network. Figures 7.4 and 7.5 show overviews of the LSTM-GRU and CNN-GRU models,
respectively.
Sentence Models The CNN and LSTM sentence models are the same as the compo-
sitional semantic models described in Chapter 5.
GRU Document Model The GRU neural network introduced by Cho et al. [97]
is a variation of the LSTM network [28]. In comparison to the LSTM network, the
GRU architecture is less computationally expensive, since it uses fewer parameters. It has
also been shown that GRU networks are good at modelling the long-term dependencies of
languages [190, 231, 232].
The LSTM network features three gates: an input gate, a forget gate, and an output
gate. Similar to the LSTM network, the GRU network also features gating units that
control the flow of information inside each recurrent unit. Figure 7.6 shows an overview
of the GRU network. The GRU network has a few noticeable differences from LSTM.
Firstly, the GRU network merges the input and forget gates of the LSTM network
into a single update gate. In the LSTM network, the input gate governs how much
information is passed through into the current state, and the forget gate controls
how much information is thrown away from the previous state. The GRU network
simply uses one update gate z to control how much information from the previous hidden
state at timestep t− 1 is passed into the current state at timestep t, as:
zt = σ(Wz · st + Uz · ht−1 + bz) (7.7)
where Wz, st, and bz are the weights, the current sentence representation, and the bias
unit, respectively. The network uses 1− z as the degree to which information from the
previous state is thrown away. Secondly, each LSTM unit has an output gate
that controls the exposure of its memory content at each timestep. In contrast, the
GRU network fully exposes its memory content and does not feature an output gate.
Thirdly, in LSTM, the new cell state value at timestep t is computed by taking part
of the information from the current input and the previous cell state value. In GRU, the
new cell state value is controlled by a reset gate r, as:
rt = σ(Wr · st + Ur · ht−1 + br) (7.8)
The new state value h is:
h = tanh(W · st + U(rt · ht−1) + b) (7.9)
where Wr, W and br, b are weights and biases. The previous hidden state ht−1 is
modified by the reset gate rt. When rt is close to 0, the network focuses only on the
current unit and ignores the previous state value. This mechanism allows the network
to drop irrelevant information from previous computations, producing more compact
representations. The new hidden state ht is
ht = zt · ht−1 + (1 − zt) · ĥt (7.10)
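The gating equations above can be sketched as a single GRU forward step in NumPy. This is a minimal illustration only: the weights are random placeholders rather than trained parameters, and the final update follows the convention shown in Figure 7.6.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(dim, rng):
    """Random placeholder weights (Wz, Uz, bz, Wr, Ur, br, W, U, b)."""
    def group():
        return (rng.standard_normal((dim, dim)) * 0.1,
                rng.standard_normal((dim, dim)) * 0.1,
                np.zeros(dim))
    return group() + group() + group()

def gru_step(s_t, h_prev, params):
    """One GRU step over input s_t and previous hidden state h_{t-1}."""
    Wz, Uz, bz, Wr, Ur, br, W, U, b = params
    z = sigmoid(Wz @ s_t + Uz @ h_prev + bz)          # update gate, Eq. (7.7)
    r = sigmoid(Wr @ s_t + Ur @ h_prev + br)          # reset gate,  Eq. (7.8)
    h_cand = np.tanh(W @ s_t + U @ (r * h_prev) + b)  # candidate state, Eq. (7.9)
    return z * h_prev + (1.0 - z) * h_cand            # new hidden state (Figure 7.6)

rng = np.random.default_rng(0)
params = init_params(4, rng)
h = np.zeros(4)
for s_t in rng.standard_normal((3, 4)):  # three timesteps of toy inputs
    h = gru_step(s_t, h, params)
print(h.shape)  # (4,)
```

Because the new hidden state is a convex combination of the previous state and a tanh-bounded candidate, every component of h stays within (−1, 1).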
[Figure 7.6 depicts the Gated Recurrent Unit: from the inputs (st, ht−1), the reset gate rt = sigmoid(Wrst + Urht−1) and update gate zt = sigmoid(Wzst + Uzht−1) are computed; the candidate state ĥ = tanh(Wst + U(rtht−1)) is gated by rt, and the new hidden state is ht = ztht−1 + (1 − zt)ĥ.]
Figure 7.6: Gated Recurrent Unit Architecture
7.4.2 State-of-the-art Algorithms on Datasets
Most AKE algorithms are dataset-dependent, i.e. they only perform well on the chosen
datasets with heuristic tunings. Hasan and Ng [9] evaluate five AKE algorithms on four
different datasets, among which three are the current state-of-the-art algorithms. They
show that the state-of-the-art algorithm for one particular dataset cannot deliver a
similar result on the others, because certain statistical characteristics of one dataset
may not exist in the others.
The state-of-the-art algorithms for our evaluation datasets are: Topic Clustering [8] for
Hulth, ExpandRank [10] for DUC, and SemGraph [24] for SemEval. All of them are
unsupervised AKE algorithms. Table 7.2 shows the performance of these algorithms. A
brief history of each dataset and its state-of-the-art algorithm is as follows.
The Hulth dataset was built by Hulth [5] (2003), who proposes a decision-tree AKE
classifier using four features: 1) the frequency with which each phrase occurs in a document, 2) the
frequency of each phrase in the dataset, 3) the relative position of the first occurrence
of each phrase, and 4) the part-of-speech tag of each phrase. Later, Mihalcea and
Tarau [3] (2004) introduce an unsupervised graph ranking algorithm named TextRank,
derived from the PageRank algorithm [89], producing better performance than Hulth's
supervised machine learning model on the same dataset. The most recent state-of-the-
art performance is produced by Topic Clustering [8] (2009), an unsupervised algorithm
that uses Wikipedia-based semantic relatedness of candidates with clustering techniques
to group candidates by the different topics in a document.
The DUC dataset was annotated by Wan and Xiao [10] (2008), who propose a graph
ranking algorithm named ExpandRank, producing the state-of-the-art performance on
the dataset. Liu et al. [39] (2010) propose another unsupervised graph ranking approach
named Topical PageRank, which first uses Latent Dirichlet Allocation (LDA) [76] to obtain
topic distributions of candidate phrases, then runs Personalised PageRank for each topic
separately, where the random-jump factor of a vertex in the graph is weighted by the
vertex's probability under the topic. However, Topical PageRank is unable to outperform
ExpandRank.
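The mechanism at the heart of Topical PageRank, a Personalised PageRank whose random-jump distribution is biased by a topic's probabilities, can be sketched as a simple power iteration. The graph, damping factor, and topic weights below are toy assumptions for illustration, not values from [39]:

```python
import numpy as np

def personalised_pagerank(adj, jump_probs, damping=0.85, iters=100):
    """Power iteration for Personalised PageRank.
    adj: (n, n) adjacency matrix of the word graph; jump_probs: per-vertex
    random-jump weights, e.g. a topic's probability over candidates."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1, keepdims=True)
    out_deg[out_deg == 0] = 1.0          # guard against sink vertices
    transition = adj / out_deg           # row-stochastic transition matrix
    p = jump_probs / jump_probs.sum()    # normalised jump distribution
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        scores = damping * transition.T @ scores + (1 - damping) * p
    return scores

# Toy co-occurrence graph over four candidates, jump mass biased to vertex 0.
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
topic = np.array([0.7, 0.1, 0.1, 0.1])
scores = personalised_pagerank(adj, topic)
print(scores.round(3))
```

Running one such iteration per topic, as Topical PageRank does, yields a topic-specific ranking; here the topic-favoured, well-connected vertex 0 ends up ranked above the weakly connected vertex 3.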
The SemEval dataset was built for SemEval 2010 Shared Task 5. There were 19 partici-
pants, among which the HUMB [233] system delivered the best performance [51]. HUMB
is a supervised machine learning approach employing a bagged decision tree algorithm
with manually selected features, including positional features, co-occurrence statistics,
and lexical and semantic features obtained from Wikipedia. More recently, Martinez-Romo
et al. [24] (2016) report an unsupervised graph ranking approach named SemGraph that
produces better results than HUMB, becoming the new state-of-the-art algorithm on the
SemEval dataset. SemGraph first constructs a graph using co-occurrence statistics
and knowledge supplied by WordNet, then uses a two-step ranking process: 1)
select only the top-ranked third of candidates using PageRank, which are the inputs to the
second ranking step, and 2) rank the candidates obtained from the first step by their
frequencies, extracting only the top 15 ranked candidates as keyphrases.
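Assuming PageRank scores and frequencies for the candidates have already been computed, SemGraph's two-step ranking can be sketched as follows (the candidate phrases and scores are invented for illustration):

```python
def semgraph_rank(pagerank_scores, frequencies, final_k=15):
    """Sketch of SemGraph's two-step ranking: keep the top third of
    candidates by PageRank score, then rank that shortlist by frequency
    and return the top `final_k` candidates as keyphrases."""
    ranked = sorted(pagerank_scores, key=pagerank_scores.get, reverse=True)
    shortlist = ranked[:max(1, len(ranked) // 3)]   # step 1: top third
    by_freq = sorted(shortlist, key=lambda c: frequencies.get(c, 0), reverse=True)
    return by_freq[:final_k]                        # step 2: top-k by frequency

pr = {"neural network": 0.9, "deep learning": 0.8, "training": 0.5,
      "experiment": 0.4, "result": 0.3, "page": 0.1}
freq = {"neural network": 4, "deep learning": 7, "training": 12}
print(semgraph_rank(pr, freq, final_k=2))  # ['deep learning', 'neural network']
```

Note how the frequency re-ranking in step 2 can reorder the PageRank shortlist: "deep learning" overtakes the higher-PageRank "neural network".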
7.4.3 Evaluation Datasets and Methodology
The statistics of each dataset are detailed in Section 3.3, Table 3.1. The Hulth and SemEval
datasets provide both training and evaluation sets for supervised machine learning ap-
proaches. We randomly split the DUC dataset into 40% for training, 20% for validation,
and 40% for evaluation.
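Such a split can be reproduced with a seeded random shuffle; the seed and document ids below are illustrative assumptions, not the split actually used:

```python
import random

def split_dataset(doc_ids, seed=42):
    """Randomly split documents into 40% train, 20% validation, 40% test."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility
    n = len(ids)
    n_train, n_val = int(0.4 * n), int(0.2 * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 40 20 40
```

Seeding the shuffle keeps the partition stable across runs, which matters when comparing models trained on the same split.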
7.4.4 Training and Evaluation Setup
Candidate phrases are identified using the NP Chunker described in Section 3.2.1. We stem
texts and candidate phrases using Porter's algorithm [161].
7.4.4.1 Pre-training Word Embeddings
Word embedding vectors are pre-trained over each evaluation dataset separately using
the SkipGram model described in Section 4.3, with a dimension of 300 for all the evaluation
datasets. The hyper-parameter settings and training setup are the same as the AKE
word embedding training setup described in Section 4.4.1.
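For reference, the SkipGram model is trained on (centre, context) word pairs drawn from a sliding window. Generating those pairs from a tokenised sentence can be sketched as follows; the window size here is illustrative, not the setting of Section 4.4.1:

```python
def skipgram_pairs(tokens, window=2):
    """Yield (centre, context) training pairs within `window` words of each centre."""
    pairs = []
    for i, centre in enumerate(tokens):
        # every position within the window, excluding the centre itself
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

print(skipgram_pairs(["keyphrase", "extraction", "is", "hard"], window=1))
```

Each pair trains the model to predict the context word from the centre word, which is what makes the resulting embeddings reflect distributional similarity.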
7.4.4.2 Training AKE Models
We use 5 different region sizes (2, 3, 4, 5, 6) for the baseline CNN sentence model. For the
CDC model, we use 5 different 2-dimensional regions: (1, 5), (2, 5), (3, 5), (4, 5), (5, 5).
We also experimented with other region sizes: smaller sizes decrease the performance,
and there is no noticeable gain from employing larger region sizes.
The learning rate is set to 0.01 for all models. We notice that a larger learning rate, e.g.
0.05, tends to over-fit the model quickly, while there is no significant improvement from
employing a smaller learning rate, e.g. 0.005.
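Each 2-dimensional region above is a filter convolved over a (rows × embedding-dimension) input matrix. A minimal valid-convolution sketch makes this concrete; the sizes here are tiny illustrative stand-ins (real embeddings have dimension 300, and real models use many filters per region):

```python
import numpy as np

def conv2d_valid(matrix, filt):
    """Valid 2-D convolution (no padding): slide `filt` over `matrix`
    and record the dot product at each position, giving a feature map."""
    mh, mw = matrix.shape
    fh, fw = filt.shape
    out = np.empty((mh - fh + 1, mw - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(matrix[i:i + fh, j:j + fw] * filt)
    return out

rng = np.random.default_rng(1)
doc = rng.standard_normal((10, 5))   # 10 row vectors, toy dimension 5
filt = rng.standard_normal((3, 5))   # region size (3, 5): 3 adjacent rows at a time
feature_map = conv2d_valid(doc, filt)
print(feature_map.shape)  # (8, 1)
```

Because the region spans the full embedding width, the filter slides only vertically, combining each group of three adjacent rows into one feature value.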
Both phrase and document models take word embedding vectors as inputs. We use two
separate sets of word embedding vectors for the phrase and document models, so each
model only needs to update its own embedding values. Theoretically, we could share the
embedding vectors between the phrase and document models. However, our empirical
results show that using shared embedding vectors not only decreases the performance,
but also significantly increases the training time due to back-propagating the errors
from two deep neural networks.
We train our model on an NVIDIA GTX 980 Ti GPU. The training time is modest: training
the model over the Hulth and DUC datasets only takes a few hours, and training over
the SemEval dataset takes a day.
7.4.5 Evaluation Results and Discussion
Table 7.2 shows the evaluation results. The CDC model matches the F-score of Topic
Clustering (the state of the art) on the Hulth dataset without applying any heuristics,
and delivers the new state-of-the-art performance on the DUC dataset. However, it
produces a much lower F-score on the SemEval dataset compared to HUMB and SemGraph.
On the Hulth dataset, in comparison to the supervised machine learning model proposed
by Hulth [5], the CDC model produces a very similar recall score but delivers a
much better precision score. This indicates that our proposed model identifies negative
samples more accurately by learning the meanings of candidate phrases, and thus classifies
fewer candidates as positive. In comparison to the state-of-the-art algorithm, Topic
Clustering, both models have the same F-score. However, we consider the CDC
model more robust. Firstly, Topic Clustering uses a dataset-dependent filter that
discards the candidates that are too common to be keyphrases2. The authors report that
2There are no details on how the filter is constructed, so we cannot reimplement the filter to report a fair comparison.
Table 7.2: Evaluation Results on Three Datasets

Hulth (ground-truth proportion)   Precision   Recall   F-score
Hulth [5] (2003)                    25.2%      51.7%    33.9%
TextRank [3] (2004)                 31.2%      43.1%    36.2%
TopicClustering [8] (2009)          35.0%      66.0%    45.7%
CNN-GRU Model                       40.3%      51.1%    45.1%
LSTM-GRU Model                      40.6%      51.5%    45.4%
CDC Model                           41.2%      51.4%    45.7%

DUC                               Precision   Recall   F-score
ExpandRank [10] (2008)              28.8%      35.4%    31.7%
TopicalPageRank [39] (2010)         28.2%      34.8%    31.2%
CNN-GRU Model                       30.3%      42.8%    35.5%
LSTM-GRU Model                      29.5%      41.7%    34.6%
CDC Model                           33.0%      46.6%    38.6%

SemEval                           Precision   Recall   F-score
HUMB [233] (2010)                   27.2%      27.8%    27.5%
SemGraph [24] (2016)                32.4%      33.2%    32.8%
CNN-GRU Model                       11.3%      12.6%    11.9%
LSTM-GRU Model                      10.5%      13.7%    11.9%
CDC Model                           12.0%      12.0%    12.0%

The CNN-GRU, LSTM-GRU, and CDC models are supervised deep learning models; the other models are unsupervised.
the F-score decreases by 5%, to 40%, without using the filter [8]. In the CDC model, no
hand-crafted features, hyper-parameters, or dataset-dependent filters are employed to
boost the F-score. Secondly, considering the precision and recall scores, the CDC model
produces a better precision but a lower recall score than Topic Clustering. This is because
Topic Clustering extracts many candidates as keyphrases to balance the F-score.
In unsupervised AKE, a fixed number of highest-ranked candidates are extracted as
keyphrases, and hence there is a trade-off between the precision and recall scores. In
general, increasing the number of extracted keyphrases improves the recall score by
compromising on precision, and vice versa. Topic Clustering extracts the top two-thirds
of ranked candidates as keyphrases. This hyper-parameter yields the state-of-the-art
F-score on the Hulth dataset by minimising the trade-off between precision and recall.
On the other hand, CDC is a binary classifier and so has no such trade-off between the
precision and recall scores. In comparison to the two baseline deep learning models, the
proposed CDC model produces slightly better performance. The only difference between
the CDC, CNN-GRU, and LSTM-GRU models is the document deep learning model,
which is the key factor yielding the different performance. The Hulth dataset contains
only short documents with on average 125 tokens per document. Dependencies between
sentences in such short documents are very strong,
which allows the CNN-GRU and LSTM-GRU models to learn more precise representations of
documents by capturing these dependencies. However, when these dependencies become
looser in longer documents, the CDC model delivers much better performance, as we
discuss below.
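The precision-recall trade-off in unsupervised AKE described above can be illustrated with toy numbers (the ranked candidate list and gold set below are invented): as more of the ranked list is extracted, recall rises while precision falls.

```python
def prf(extracted, gold):
    """Precision, recall, and F-score of an extracted keyphrase list."""
    tp = len(set(extracted) & set(gold))
    p = tp / len(extracted) if extracted else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

ranked = ["a", "b", "c", "d", "e", "f"]  # candidates, best-ranked first
gold = {"a", "b", "c"}                   # ground-truth keyphrases
for k in (2, 4, 6):
    p, r, f = prf(ranked[:k], gold)
    print(f"top-{k}: P={p:.2f} R={r:.2f} F={f:.2f}")
```

Choosing the cut-off k (as Topic Clustering's two-thirds heuristic does) amounts to picking a point on this curve; a binary classifier like CDC has no such cut-off to tune.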
On the DUC dataset, the CDC model yields the new state-of-the-art performance,
outperforming the ExpandRank algorithm by nearly 7% on F-score. The CDC model
classifies more positive candidates while maintaining a strong precision. It also outperforms
the two baseline models. Both CNN-GRU and LSTM-GRU use a GRU network to
learn document representations, which captures the long-term dependencies of all the
sentences in a document. The DUC dataset contains news articles with on average 800
tokens per document, so sentences at the beginning of a document usually have loose
or no dependencies on the last few. The GRU document model captures unnecessary
dependencies between sentences, which affects the performance of the overall system. On the
other hand, the CDC model only encodes the short-term dependencies of the sentences,
delivering better performance.
On the SemEval dataset, the CDC model produces an unexpectedly low F-score, much
lower than HUMB and SemGraph. The baseline deep learning models also fail to deliver
reasonable performance. We believe that the main reason is that the proposed CDC
model and the baseline models fail to learn the representations of documents with
multiple topics, ideas, or arguments. The SemEval dataset consists of full-length journal
articles with an average of about 5,800 tokens. Learning distributed representations
of such long documents is challenging. In fact, the majority of work on modelling
documents using deep learning techniques only targets short and mid-length documents,
such as reviews of products or movies, where the authors of the reviews usually present
a single topic, view, or argument. On the other hand, full-length journal articles consist
of multiple sections, where the authors usually present different topics. Neither the
proposed CDC model nor the baseline models have a mechanism to handle multiple
topics or ideas, and hence they are unable to produce reasonable results. A possible
solution to this problem is to let CDC learn the representations of paragraphs, and then
use a recurrent neural network such as an LSTM to learn the overall representation of a
document, which will be our future work.
Extracting keyphrases from long documents is much more challenging because the doc-
uments yield a large number of candidate phrases [31]. The SemEval dataset has 587
candidate phrases per article on average, of which only 9.6 candidates are ground-truth
keyphrases. In our evaluations, we did not employ any dataset heuristics to reduce the
number of candidates. On the other hand, both HUMB and SemGraph use heuristics
to filter out candidates that are unlikely to be keyphrases before the actual extraction
process. HUMB constructs a post-ranking process that incorporates statistics from an
external repository of journal articles named Hyper Articles en Ligne (HAL) to produce
a ranking score for each candidate. Similar to unsupervised AKE systems, only top-ranked
candidates are extracted as keyphrases. The post-ranking process essentially
prevents the binary classifier from producing too many keyphrases. SemGraph only considers
the content of the title, abstract, introduction, related work, conclusions, and future work
sections, which significantly reduces the number of candidate phrases.
7.5 Conclusion
In this chapter, we have introduced a deep learning model for AKE. We demonstrated
that the proposed model performs well at extracting keyphrases from short and mid-
length documents, matching the state-of-the-art performance on the Hulth dataset and
delivering a new state-of-the-art score on the DUC dataset without employing any
manually selected features or dataset-dependent heuristics. However, the proposed model
is unable to learn useful representations of long documents such as journal articles, which
usually contain multiple topics, ideas, or arguments. A possible solution is to let CDC
learn the representations of paragraphs, and then use a recurrent neural network such as
an LSTM to learn the overall representation of a document, which will be our future work.
Chapter 8
Conclusion
Throughout this thesis, we have developed a series of semantic knowledge-based AKE
techniques using distributed representations of words, phrases, and documents learnt
from deep neural networks. Specifically, we investigated the application, capability, and
efficacy of distributed representations for encoding compositional semantics of phrases
and documents, in order to address the following four major issues in existing AKE
systems:
1. Difficulties in incorporating background and domain-specific knowledge.
Using public semantic knowledge bases such as WordNet to obtain additional
semantic information is practically difficult because they supply limited vocabularies
and insufficient domain-specific knowledge.
We conducted a systematic evaluation of four unsupervised algorithms combined with
different pre-processing and post-processing techniques in Chapter 3. The evaluation
shows that all algorithms are sensitive to frequencies of phrases to different degrees,
i.e. they mostly fail to extract keyphrases with lower frequencies. We found
that incorporating external knowledge would mitigate the problem. However,
public knowledge bases such as Wikipedia and WordNet are designed for
general purposes, and hence they are insufficient to cover domain-specific terms
and to supply the specific knowledge for identifying keyphrases in domain-specific
corpora.
To address the difficulties in incorporating background and domain-specific knowledge,
we proposed to use word embeddings. We utilise a special characteristic of
word embeddings: they can be trained multiple times over different corpora to
encode both general and domain-specific semantics of words, and new words
can be subsequently added into the vocabulary before training on different corpora.
Such advantages of word embeddings fundamentally overcome the problem
that public semantic knowledge bases have: limited vocabulary and insufficient
domain-specific knowledge. As we demonstrated in Chapter 4, after
training over general and then domain-specific corpora, the meanings of domain-
specific words encoded in word embeddings change significantly to represent the
domain-specific knowledge, while non-domain-specific words retain their original
meanings. Based on this characteristic, we developed a general-purpose weighting
scheme using the semantics supplied in pre-trained word embeddings, and demonstrated
that the weighting scheme generally enhances the performance of unsupervised
graph-based AKE algorithms. In addition, we also demonstrated that the proposed
weighting scheme efficiently reduces the impact of phrase frequency.
However, the development of weighting schemes is a rather ad-hoc process where
the choice of features is critical to the overall performance of the algorithms. This
problem turns the development of graph-based AKE algorithms into a laborious
feature engineering process.
2. Failure to capture the semantics of phrases and documents. Traditional
AKE algorithms rely on manually selected features based on observations
of keyphrases' linguistic, statistical, or structural characteristics, such as
co-occurrence frequencies and occurrence positions. These features, however, carry
no semantic meanings of phrases or documents. Traditional AKE algorithms are
unable, or make no attempt, to understand the meanings of phrases and documents,
and such lack of cognitive ability inhibits the performance of existing AKE
algorithms.
To address this problem, we developed two deep learning models: the Matrix-Vector
Recurrent Unit (MVRU) in Chapter 6 and the Convolutional Document Cube (CDC)
in Chapter 7. Studies have shown that distributed word representations are capable
of encoding the semantic meanings of words [13, 170]. However, learning
distributed representations of multi-gram expressions, such as multi-word phrases
or sentences, remains a great challenge. This is because the meaning of a
multi-word expression is constructed from the meanings of its constituent words
and the rules for composing them [25]. To explicitly learn these compositional
rules, we developed the MVRU model based on a recurrent neural network archi-
tecture, where a word is represented using a matrix-vector pair. The vector
represents the meaning of a word, and the matrix is intended to capture the compo-
sitional rule, i.e. it determines the contributions made by other words when they
compose with the word. We demonstrated that the MVRU model captures
more accurate semantics than existing deep learning models such as the convo-
lutional neural network (CNN) and the Long Short-Term Memory (LSTM) network.
On the other hand, learning the meanings of documents means encoding the core ideas,
thoughts, and logical flows embedded in the intrinsic structures of the contents. In
contrast to the constituent words of a phrase, which always have strong dependencies,
words in a document usually have strong dependencies only on their neighbouring
words, and the strength of these dependencies may decrease as the distance
between words increases. For example, while the semantics of words may be correlated
between the beginning (e.g. abstracts or introductions) and the end (e.g. conclusions) of a
document, sentences appearing at the beginning could have very loose or even no
dependencies on the ones in the middle of the document. Capturing all dependencies
between words in a document may be unnecessary, and could also introduce noisy
data. Hence, we believe that only short-term dependencies among words should
be analysed. This intuition is implemented in the CDC model. The model is
based on a CNN, whose strength is encoding regional compositions at
different locations in data matrices. The network features different region sizes,
where each region has a number of filters that produce the feature maps. By
convolving the data matrices with different region sizes, the CNN captures
how words are composed in each region. Since the region sizes are relatively small,
the network only analyses words and their close neighbours, aiming to encode
only short-term dependencies.
A drawback we observed in both the MVRU and CDC models is the arbitrary number
of dimensions for vectors and matrices. Lower dimensions may not capture
enough information, whereas higher dimensions significantly increase the
training time and the cost of computation. We have not yet precisely identified
the impact of different dimensionalities.
3. Labour-intensive. The feature engineering process in existing AKE approaches
turns the development of AKE models into a time-consuming and labour-intensive
exercise.
We addressed problems 2 and 3 by developing a deep learning model for AKE in
Chapter 7, which not only eliminates the effort of feature engineering, but
also automatically learns features that are capable of representing
the semantics of both phrases and documents. The proposed model consists
of an MVRU model and a CDC model, where the MVRU is responsible for capturing
the meanings of phrases, and the CDC is responsible for learning the representations of
documents. The output layer is a logistic regression layer, which classifies whether
a phrase can be a keyphrase for the input document based on the meanings learnt
from the phrase and document models. We demonstrated that the proposed deep
learning model is dataset-independent and knowledge-rich, delivering the new
state-of-the-art performance on two different datasets: a collection of
the abstracts of journal articles from the computer science domain, and a collection of
news articles, without employing any dataset-dependent heuristics.
The deep learning model developed in this thesis not only demonstrates the capability
and efficiency of applying deep neural networks to AKE problems, but also
provides effective tools for learning general representations of phrases and relatively
short documents such as news articles. However, the proposed model is unable to
learn useful representations of long documents, such as full-length journal articles,
which usually contain multiple topics, ideas, or arguments. A possible solution is to
let CDC learn the representations of paragraphs, and then use a recurrent neural
network such as an LSTM to learn the overall representation of a document.
8.1 Future Work
We raise four questions from this thesis, which provide future research directions.
1. What has been learnt by deep learning models? From Chapters 5 to 7, we
have focused on learning the representations of phrases and documents. We have
shown that the learnt representations carry semantics to some degree, as demonstrated
by different experiments, including semantic similarity, semantic compositionality,
term identification, and AKE. However, the exact amount of semantic informa-
tion encoded in the representations remains unclear. Furthermore, distributed
representations enable machine learning algorithms to automatically learn useful
features; however, it is unclear what information each feature carries. Discovering
an approach that allows us to understand features and measure the information
encoded in distributed representations will help researchers to build more robust
and efficient deep learning models.
2. Do deep learning models still need as much labelled data as traditional
supervised machine learning models require? In Chapter 5, we presented
a co-training approach for domain-specific term extraction. We found that the
model reached the best classification results without seeing much training data,
and further training seemed to overfit the model. One possible reason may be
that the word embeddings are pre-trained over the dataset, which not only encodes
the semantics of words, but also provides better initial values of the embedding vectors
that allow the deep learning model to converge more quickly. However, the
amount of training data that deep learning models require has not been quantified. The
lack of training data is one of the major problems that affect the performance
of machine learning algorithms, and hence, analysing and comparing the amount
of training data required by deep learning models and classic machine learning
algorithms will be a very useful research topic for future studies.
3. How can representations of long documents be learnt more efficiently? In
Chapter 7, we introduced the Convolutional Document Cube (CDC) model to learn the
representations of documents. We also used two additional deep learning models
(LSTM-GRU and CNN-GRU) as the baselines. However, all the deep learning models
failed to learn useful representations of long documents, e.g. full-length journal
articles. Discovering an efficient and effective approach to learning representations of
long documents will be our future work.
4. Can the semantics encoded in distributed representations help develop
a more effective AKE evaluation approach? Currently, the most popular
evaluation methodology for AKE is exact match, i.e. a ground-truth keyphrase
selected by human annotators matches a computer-identified keyphrase when they
correspond to the same stem sequence. Such an evaluation methodology fails to match
two semantically identical phrases that have different constituent words or stem
sequences, such as terminology mining and domain-specific term extraction, or
artificial neural network and neural network. At present, algorithms are unable
to identify with great precision two phrases referring to the same concept or object.
Throughout this thesis, we have shown that deep learning models have the ability
to encode the semantics of words and multi-word phrases to some degree. Using deep
neural networks and distributed representations to develop a more effective AKE
evaluation framework will help us to evaluate the performance of AKE algorithms
more accurately.
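Exact-match evaluation over stem sequences can be sketched as follows. The suffix-stripping stemmer below is a deliberately naive stand-in for Porter's algorithm, and the gold and predicted phrases are invented; the example also shows the limitation discussed above, since the semantically close "artificial neural network" fails to match "neural networks".

```python
def naive_stem(word):
    """Crude stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def stem_seq(phrase):
    """A phrase's stem sequence: the tuple of its stemmed, lowercased words."""
    return tuple(naive_stem(w) for w in phrase.lower().split())

def exact_match(predicted, gold):
    """A prediction counts only if its stem sequence equals a gold one."""
    gold_stems = {stem_seq(g) for g in gold}
    return [p for p in predicted if stem_seq(p) in gold_stems]

gold = ["neural networks", "keyphrase extraction"]
pred = ["neural network", "artificial neural network", "keyphrase extractions"]
print(exact_match(pred, gold))  # ['neural network', 'keyphrase extractions']
```

Stemming absorbs morphological variation (singular/plural), but any difference in the constituent words themselves still counts as a miss, which is exactly the weakness a semantics-aware evaluation framework would address.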
In this thesis, we have presented a series of studies on applying distributed vector rep-
resentations and deep learning technologies to AKE problems, and demonstrated how
these technologies can solve or mitigate the problems. However, deep learning for Natural
Language Processing is still an emerging research field, and the above questions raised from
this thesis provide interesting insights into our future research directions.
Bibliography
[1] Ian H Witten, Gordon W Paynter, Eibe Frank, Carl Gutwin, and Craig G Nevill-
Manning. Kea: Practical automatic keyphrase extraction. In Proceedings of the
fourth ACM conference on Digital libraries, pages 254–255. ACM, 1999.
[2] Karen Sparck Jones. A statistical interpretation of term specificity and its appli-
cation in retrieval. Journal of documentation, 28(1):11–21, 1972.
[3] Rada Mihalcea and Paul Tarau. Textrank: Bringing order into texts. In
Dekang Lin and Dekai Wu, editors, Proceedings of EMNLP 2004, pages 404–411,
Barcelona, Spain, July 2004. Association for Computational Linguistics.
[4] Peter D Turney. Learning algorithms for keyphrase extraction. Information Re-
trieval, 2(4):303–336, 2000.
[5] Anette Hulth. Improved automatic keyword extraction given more linguistic
knowledge. In Proceedings of the 2003 conference on Empirical methods in natural
language processing, pages 216–223. Association for Computational Linguistics,
2003.
[6] Jinghua Wang, Jianyi Liu, and Cong Wang. Keyword extraction based on pager-
ank. In Advances in Knowledge Discovery and Data Mining, pages 857–864.
Springer, 2007.
[7] Maria Grineva, Maxim Grinev, and Dmitry Lizorkin. Extracting key terms from
noisy and multitheme documents. In Proceedings of the 18th international confer-
ence on World wide web, pages 661–670. ACM, 2009.
[8] Zhiyuan Liu, Peng Li, Yabin Zheng, and Maosong Sun. Clustering to find exemplar
terms for keyphrase extraction. In Proceedings of the 2009 Conference on Empirical
Methods in Natural Language Processing: Volume 1-Volume 1, pages 257–266.
Association for Computational Linguistics, 2009.
[9] Kazi Saidul Hasan and Vincent Ng. Conundrums in unsupervised keyphrase ex-
traction: making sense of the state-of-the-art. In Proceedings of the 23rd Interna-
tional Conference on Computational Linguistics: Posters, pages 365–373. Associ-
ation for Computational Linguistics, 2010.
[10] Xiaojun Wan and Jianguo Xiao. Single document keyphrase extraction using
neighborhood knowledge. In AAAI, volume 8, pages 855–860, 2008.
[11] Joseph Turian, Lev Ratinov, and Yoshua Bengio. Word representations: a sim-
ple and general method for semi-supervised learning. In Proceedings of the 48th
annual meeting of the association for computational linguistics, pages 384–394.
Association for Computational Linguistics, 2010.
[12] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A
review and new perspectives. IEEE transactions on pattern analysis and machine
intelligence, 35(8):1798–1828, 2013.
[13] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in con-
tinuous space word representations. In HLT-NAACL, pages 746–751. Citeseer,
2013.
[14] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation
of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[15] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Dis-
tributed representations of words and phrases and their compositionality. In Ad-
vances in neural information processing systems, pages 3111–3119, 2013.
[16] Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khu-
danpur. Recurrent neural network based language model. In INTERSPEECH,
volume 2, page 3, 2010.
[17] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with
neural networks. In Advances in neural information processing systems, pages
3104–3112, 2014.
[18] Richard Socher, Christopher D Manning, and Andrew Y Ng. Learning continuous
phrase representations and syntactic parsing with recursive neural networks. In
Proceedings of the NIPS-2010 Deep Learning and Unsupervised Feature Learning
Workshop, pages 1–9, 2010.
[19] Richard Socher, Brody Huval, Christopher D Manning, and Andrew Y Ng. Se-
mantic compositionality through recursive matrix-vector spaces. In Proceedings of
the 2012 Joint Conference on Empirical Methods in Natural Language Processing
and Computational Natural Language Learning, pages 1201–1211. Association for
Computational Linguistics, 2012.
[20] Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Man-
ning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic
compositionality over a sentiment treebank. In Proceedings of the conference on
empirical methods in natural language processing (EMNLP), volume 1631, page
1642. Citeseer, 2013.
[21] Rui Wang, Wei Liu, and Chris McDonald. Corpus-independent generic keyphrase
extraction using word embedding vectors. In Deep Learning for Web Search and
Data Mining Workshop (DL-WSDM 2015), 2014.
[22] Rui Wang, Wei Liu, and Chris McDonald. Using word embeddings to enhance
keyword identification for scientific publications. In Australasian Database Con-
ference, pages 257–268. Springer, 2015.
[23] Rui Wang, Wei Liu, and Chris McDonald. How preprocessing affects unsupervised
keyphrase extraction. In Computational Linguistics and Intelligent Text Process-
ing, pages 163–176. Springer, 2014.
[24] Juan Martinez-Romo, Lourdes Araujo, and Andres Duque Fernandez. Semgraph:
Extracting keyphrases following a novel semantic graph-based approach. Journal
of the Association for Information Science and Technology, 67(1):71–82, 2016.
[25] Gottlob Frege. Über Sinn und Bedeutung. Wittgenstein Studien, 1(1), 1994.
[26] Yoon Kim. Convolutional neural networks for sentence classification. arXiv
preprint arXiv:1408.5882, 2014.
[27] Ye Zhang and Byron Wallace. A sensitivity analysis of (and practitioners’ guide
to) convolutional neural networks for sentence classification. arXiv preprint
arXiv:1510.03820, 2015.
[28] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[29] Rui Wang, Wei Liu, and Chris McDonald. Featureless domain-specific term extraction with minimal labelled data. In Australasian Language Technology Association Workshop 2016, page 103, 2016.
[30] Su Nam Kim, Olena Medelyan, Min-Yen Kan, and Timothy Baldwin. Automatic
keyphrase extraction from scientific articles. Language resources and evaluation,
47(3):723–742, 2013.
[31] Kazi Saidul Hasan and Vincent Ng. Automatic keyphrase extraction: A survey of the state of the art. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, Maryland. Association for Computational Linguistics, 2014.
[32] Slobodan Beliga, Ana Meštrović, and Sanda Martinčić-Ipšić. An overview of graph-based keyword extraction methods and approaches. Journal of Information and Organizational Sciences, 39(1):1–20, 2015.
[33] Sifatullah Siddiqi and Aditi Sharan. Keyword and keyphrase extraction techniques:
a literature review. International Journal of Computer Applications, 109(2), 2015.
[34] Su Nam Kim and Min-Yen Kan. Re-examining automatic keyphrase extraction
approaches in scientific articles. In Proceedings of the workshop on multiword
expressions: Identification, interpretation, disambiguation and applications, pages
9–16. Association for Computational Linguistics, 2009.
[35] Yukio Ohsawa, Nels E Benson, and Masahiko Yachida. KeyGraph: Automatic indexing by co-occurrence graph based on building construction metaphor. In Proceedings of the IEEE International Forum on Research and Technology Advances in Digital Libraries (ADL '98), pages 12–18. IEEE, 1998.
[36] Yutaka Matsuo and Mitsuru Ishizuka. Keyword extraction from a single docu-
ment using word co-occurrence statistical information. International Journal on
Artificial Intelligence Tools, 13(01):157–169, 2004.
[37] David B Bracewell, Fuji Ren, and Shingo Kuroiwa. Multilingual single document keyword extraction for information retrieval. In Proceedings of the 2005 IEEE International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE '05), pages 517–522. IEEE, 2005.
[38] Mikalai Krapivin, Maurizio Marchese, Andrei Yadrantsau, and Yanchun Liang. Unsupervised key-phrases extraction from scientific papers using domain and linguistic knowledge. In Proceedings of the Third International Conference on Digital Information Management (ICDIM 2008), pages 105–112. IEEE, 2008.
[39] Zhiyuan Liu, Wenyi Huang, Yabin Zheng, and Maosong Sun. Automatic keyphrase extraction via topic decomposition. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 366–376. Association for Computational Linguistics, 2010.
[40] Roberto Ortiz, David Pinto, Mireya Tovar, and Héctor Jiménez-Salazar. BUAP: An unsupervised approach to automatic keyphrase extraction from scientific articles. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 174–177. Association for Computational Linguistics, 2010.
[41] Georgeta Bordea and Paul Buitelaar. DERIUNLP: A context based approach to automatic keyphrase extraction. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 146–149. Association for Computational Linguistics, 2010.
[42] Samhaa R El-Beltagy and Ahmed Rafea. KP-Miner: Participation in SemEval-2. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 190–193. Association for Computational Linguistics, 2010.
[43] Mari-Sanna Paukkeri and Timo Honkela. Likey: unsupervised language-
independent keyphrase extraction. In Proceedings of the 5th international workshop
on semantic evaluation, pages 162–165. Association for Computational Linguistics,
2010.
[44] Kalliopi Zervanou. UvT: The UvT term extraction system in the keyphrase extraction task. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 194–197. Association for Computational Linguistics, 2010.
[45] Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley. Automatic keyword
extraction from individual documents. Text Mining, pages 1–20, 2010.
[46] Martin Dostal and Karel Jezek. Automatic keyphrase extraction based on NLP and statistical methods. In DATESO, pages 140–145, 2011.
[47] Abdelghani Bellaachia and Mohammed Al-Dhelaan. NE-Rank: A novel graph-based keyphrase extraction in Twitter. In Proceedings of the 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology-Volume 1, pages 372–379. IEEE Computer Society, 2012.
[48] Kamal Sarkar. A hybrid approach to extract keyphrases from medical documents.
arXiv preprint arXiv:1303.1441, 2013.
[49] Wei You, Dominique Fontaine, and Jean-Paul Barthes. An automatic keyphrase
extraction system for scientific documents. Knowledge and information systems,
34(3):691–724, 2013.
[50] Sujatha Das Gollapalli and Cornelia Caragea. Extracting keyphrases from research
papers using citation networks. In AAAI, pages 1629–1635, 2014.
[51] Su Nam Kim, Olena Medelyan, Min-Yen Kan, and Timothy Baldwin. SemEval-2010 Task 5: Automatic keyphrase extraction from scientific articles. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 21–26. Association for Computational Linguistics, 2010.
[52] Xiaojun Wan and Jianguo Xiao. CollabRank: Towards a collaborative approach to single-document keyphrase extraction. In Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1, pages 969–976. Association for Computational Linguistics, 2008.
[53] Chedi Bechikh Ali, Rui Wang, and Hatem Haddad. A two-level keyphrase ex-
traction approach. In International Conference on Intelligent Text Processing and
Computational Linguistics, pages 390–401. Springer, 2015.
[54] Wei Liu, Bo Chuen Chung, Rui Wang, Jonathon Ng, and Nigel Morlet. A genetic
algorithm enabled ensemble for unsupervised medical term extraction from clinical
letters. Health information science and systems, 3(1):1–14, 2015.
[55] Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Helen Pinto, Qiming Chen, Umeshwar Dayal, and Mei-Chun Hsu. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. In Proceedings of the 17th International Conference on Data Engineering (ICDE), pages 215–224. IEEE, 2001.
[56] Katerina Frantzi, Sophia Ananiadou, and Hideki Mima. Automatic recognition of multi-word terms: the C-value/NC-value method. International Journal on Digital Libraries, 3(2):115–130, 2000.
[57] Wilson Wong, Wei Liu, and Mohammed Bennamoun. Determining the unithood
of word sequences using a probabilistic approach. arXiv preprint arXiv:0810.0139,
2008.
[58] Mike Scott. PC analysis of key words – and key key words. System, 25(2):233–245, 1997.
[59] Wen-tau Yih, Joshua Goodman, and Vitor R Carvalho. Finding advertising key-
words on web pages. In Proceedings of the 15th international conference on World
Wide Web, pages 213–222. ACM, 2006.
[60] Thuy Dung Nguyen and Min-Yen Kan. Keyphrase extraction in scientific publi-
cations. In International Conference on Asian Digital Libraries, pages 317–326.
Springer, 2007.
[61] Gonenc Ercan and Ilyas Cicekli. Using lexical chains for keyword extraction.
Information Processing & Management, 43(6):1705–1714, 2007.
[62] Zhuoye Ding, Qi Zhang, and Xuanjing Huang. Keyphrase extraction from online
news using binary integer programming. In IJCNLP, pages 165–173, 2011.
[63] Kathrin Eichler and Günter Neumann. DFKI KeyWE: Ranking keyphrases extracted from scientific articles. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 150–153. Association for Computational Linguistics, 2010.
[64] Xin Jiang, Yunhua Hu, and Hang Li. A ranking approach to keyphrase extraction.
In Proceedings of the 32nd international ACM SIGIR conference on Research and
development in information retrieval, pages 756–757. ACM, 2009.
[65] Songhua Xu, Shaohui Yang, and Francis Chi-Moon Lau. Keyword extraction and
headline generation using novel word features. In AAAI, 2010.
[66] Ken Barker and Nadia Cornacchia. Using noun phrase heads to extract document
keyphrases. In Conference of the Canadian Society for Computational Studies of
Intelligence, pages 40–52. Springer, 2000.
[67] Kenneth Ward Church and Patrick Hanks. Word association norms, mutual in-
formation, and lexicography. Computational linguistics, 16(1):22–29, 1990.
[68] Lee R Dice. Measures of the amount of ecologic association between species.
Ecology, 26(3):297–302, 1945.
[69] Taeho Jo. Neural based approach to keyword extraction from documents. In
International Conference on Computational Science and Its Applications, pages
456–461. Springer, 2003.
[70] Jiabing Wang, Hong Peng, and Jing-song Hu. Automatic keyphrases extraction
from document using neural network. In Advances in Machine Learning and Cy-
bernetics, pages 633–641. Springer, 2006.
[71] Kamal Sarkar, Mita Nasipuri, and Suranjan Ghose. A new approach to keyphrase
extraction using neural networks. arXiv preprint arXiv:1004.3274, 2010.
[72] Olena Medelyan, Eibe Frank, and Ian H Witten. Human-competitive tagging
using automatic keyphrase extraction. In Proceedings of the 2009 Conference on
Empirical Methods in Natural Language Processing: Volume 3-Volume 3, pages
1318–1327. Association for Computational Linguistics, 2009.
[73] Ian Witten and David Milne. An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In Proceedings of the AAAI Workshop on Wikipedia and Artificial Intelligence: An Evolving Synergy, pages 25–30. AAAI Press, Chicago, USA, 2008.
[74] Denis Turdakov and Pavel Velikhov. Semantic relatedness metric for Wikipedia concepts based on link analysis and its application to word sense disambiguation. 2008.
[75] David Milne. Computing semantic relatedness using Wikipedia link structure. In Proceedings of the New Zealand Computer Science Research Student Conference, pages 1–8, 2007.
[76] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[77] Claude Pasquier. Task 5: Single document keyphrase extraction using sentence clustering and latent Dirichlet allocation. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 154–157. Association for Computational Linguistics, 2010.
[78] Adrien Bougouin, Florian Boudin, and Béatrice Daille. TopicRank: Graph-based topic ranking for keyphrase extraction. In International Joint Conference on Natural Language Processing (IJCNLP), pages 543–551, 2013.
[79] Wayne Xin Zhao, Jing Jiang, Jing He, Yang Song, Palakorn Achananuparp, Ee-Peng Lim, and Xiaoming Li. Topical keyphrase extraction from Twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 379–388. Association for Computational Linguistics, 2011.
[80] Florian Boudin and Emmanuel Morin. Keyphrase extraction for n-best reranking
in multi-sentence compression. In North American Chapter of the Association for
Computational Linguistics (NAACL), 2013.
[81] Taeho Jo, Malrey Lee, and Thomas M Gatton. Keyword extraction from documents using a neural network model. In Proceedings of the 2006 International Conference on Hybrid Information Technology (ICHIT '06), volume 2, pages 194–197. IEEE, 2006.
[82] Kuo Zhang, Hui Xu, Jie Tang, and Juanzi Li. Keyword extraction using support
vector machine. In Advances in Web-Age Information Management, pages 85–96.
Springer, 2006.
[83] Chengzhi Zhang. Automatic keyword extraction from documents using conditional
random fields. Journal of Computational Information Systems, 4(3):1169–1180,
2008.
[84] Taemin Jo and Jee-Hyong Lee. Latent keyphrase extraction using deep belief
networks. International Journal of Fuzzy Logic and Intelligent Systems, 15(3):
153–158, 2015.
[85] Ludovic Jean-Louis, Michel Gagnon, and Éric Charton. A knowledge-base oriented approach for automatic keyword extraction. Computación y Sistemas, 17(2):187–196, 2013.
[86] Mari-Sanna Paukkeri, Ilari T Nieminen, Matti Pöllä, and Timo Honkela. A language-independent approach to keyphrase extraction and evaluation. In COLING (Posters), pages 83–86, 2008.
[87] Christian Wartena, Rogier Brussee, and Wout Slakhorst. Keyword extraction using word co-occurrence. In Proceedings of the 2010 Workshop on Database and Expert Systems Applications (DEXA), pages 54–58. IEEE, 2010.
[88] Jon M Kleinberg. Authoritative sources in a hyperlinked environment. Journal of
the ACM (JACM), 46(5):604–632, 1999.
[89] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web
search engine. Computer networks and ISDN systems, 30(1):107–117, 1998.
[90] Florian Boudin. A comparison of centrality measures for graph-based keyphrase
extraction. In International Joint Conference on Natural Language Processing
(IJCNLP), pages 834–838, 2013.
[91] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
[92] Zhiyuan Liu, Chen Liang, and Maosong Sun. Topical word trigger model for keyphrase extraction. In Proceedings of COLING, 2012.
[93] David Mimno, Hanna M Wallach, Jason Naradowsky, David A Smith, and Andrew
McCallum. Polylingual topic models. In Proceedings of the 2009 Conference on
Empirical Methods in Natural Language Processing: Volume 2-Volume 2, pages
880–889. Association for Computational Linguistics, 2009.
[94] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm
for deep belief nets. Neural computation, 18(7):1527–1554, 2006.
[95] Qi Zhang, Yang Wang, Yeyun Gong, and Xuanjing Huang. Keyphrase extraction using deep recurrent neural networks on Twitter. In EMNLP, pages 836–845, 2016.
[96] Rui Meng, Sanqiang Zhao, Shuguang Han, Daqing He, Peter Brusilovsky, and
Yu Chi. Deep keyphrase generation. arXiv preprint arXiv:1704.06879, 2017.
[97] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[98] Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK Li. Incorporating copying
mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393,
2016.
[99] Qing Li, Wenhao Zhu, and Zhiguo Lu. Predicting abstract keywords by word
vectors. In International Conference on High Performance Computing and Appli-
cations, pages 185–195. Springer, 2015.
[100] Su Nam Kim, Timothy Baldwin, and Min-Yen Kan. Evaluating n-gram based
evaluation metrics for automatic keyphrase extraction. In Proceedings of the 23rd
international conference on computational linguistics, pages 572–580. Association
for Computational Linguistics, 2010.
[101] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics, 2002.
[102] Abhaya Agarwal and Alon Lavie. METEOR, M-BLEU and M-TER: Evaluation metrics for high-correlation with human rankings of machine translation output. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 115–118. Association for Computational Linguistics, 2008.
[103] Mark A Przybocki and Alvin F Martin. The 1999 NIST speaker recognition evaluation, using summed two-channel telephone data for speaker detection and speaker tracking. In Sixth European Conference on Speech Communication and Technology, 1999.
[104] Chin-Yew Lin and Eduard Hovy. Automatic evaluation of summaries using n-
gram co-occurrence statistics. In Proceedings of the 2003 Conference of the North
American Chapter of the Association for Computational Linguistics on Human
Language Technology-Volume 1, pages 71–78. Association for Computational Lin-
guistics, 2003.
[105] Torsten Zesch and Iryna Gurevych. Approximate matching for evaluating
keyphrase extraction. In RANLP, pages 484–489, 2009.
[106] Ernesto D'Avanzo and Bernardo Magnini. A keyphrase-based approach to summarization: the LAKE system at DUC-2005. In Proceedings of DUC, 2005.
[107] Regina Barzilay and Michael Elhadad. Using lexical chains for text summarization.
Advances in automatic text summarization, pages 111–121, 1999.
[108] Dawn Lawrie, W Bruce Croft, and Arnold Rosenberg. Finding topic words for
hierarchical summarization. In Proceedings of the 24th annual international ACM
SIGIR conference on Research and development in information retrieval, pages
349–357. ACM, 2001.
[109] Adam L Berger and Vibhu O Mittal. Ocelot: a system for summarizing web
pages. In Proceedings of the 23rd annual international ACM SIGIR conference on
Research and development in information retrieval, pages 144–151. ACM, 2000.
[110] Anette Hulth and Beata B Megyesi. A study on automatically extracted key-
words in text categorization. In Proceedings of the 21st International Conference
on Computational Linguistics and the 44th annual meeting of the Association for
Computational Linguistics, pages 537–544. Association for Computational Linguis-
tics, 2006.
[111] Su Nam Kim, Timothy Baldwin, and Min-yen Kan. The use of topic representative
words in text categorization. 2009.
[112] Khaled M Hammouda, Diego N Matute, and Mohamed S Kamel. CorePhrase: Keyphrase extraction for document clustering. In International Workshop on Machine Learning and Data Mining in Pattern Recognition, pages 265–274. Springer, 2005.
[113] Yongzheng Zhang, Nur Zincir-Heywood, and Evangelos Milios. Term-based clus-
tering and summarization of web page collections. In Conference of the Canadian
Society for Computational Studies of Intelligence, pages 60–74. Springer, 2004.
[114] Yi-fang Brook Wu and Quanzhi Li. Document keyphrases as subject metadata:
incorporating document key concepts in search results. Information Retrieval, 11
(3):229–249, 2008.
[115] Rada Mihalcea and Andras Csomai. Wikify!: linking documents to encyclopedic
knowledge. In Proceedings of the sixteenth ACM conference on Conference on
information and knowledge management, pages 233–242. ACM, 2007.
[116] Felice Ferrara, Nirmala Pudota, and Carlo Tasso. A keyphrase-based paper recom-
mender system. In Italian Research Conference on Digital Libraries, pages 14–25.
Springer, 2011.
[117] Jason DM Rennie and Tommi Jaakkola. Using term informativeness for named
entity detection. In Proceedings of the 28th annual international ACM SIGIR
conference on Research and development in information retrieval, pages 353–360.
ACM, 2005.
[118] Wilson Wong, Wei Liu, and Mohammed Bennamoun. A probabilistic framework
for automatic term recognition. Intelligent Data Analysis, 13(4):499–539, 2009.
[119] Wilson Wong. Determination of unithood and termhood for term recognition. In
Handbook of research on text and web mining technologies, pages 500–529. IGI
Global, 2009.
[120] Wilson Wong, Wei Liu, and Mohammed Bennamoun. Ontology learning from
text: A look back and into the future. ACM Computing Surveys (CSUR), 44(4):
20, 2012.
[121] David Newman, Nagendra Koilada, Jey Han Lau, and Timothy Baldwin. Bayesian
text segmentation for index term identification and keyphrase extraction. In COL-
ING, pages 2077–2092, 2012.
[122] Kyo Kageura and Bin Umino. Methods of automatic term recognition: A review.
Terminology, 3(2):259–289, 1996.
[123] Andy Lauriston. Automatic recognition of complex terms: Problems and the TERMINO solution. Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication, 1(1):147–170, 1994.
[124] Béatrice Daille, Éric Gaussier, and Jean-Marc Langé. Towards automatic extraction of monolingual and bilingual terminology. In Proceedings of the 15th conference on Computational linguistics-Volume 1, pages 515–521. Association for Computational Linguistics, 1994.
[125] John S Justeson and Slava M Katz. Technical terminology: some linguistic prop-
erties and an algorithm for identification in text. Natural language engineering, 1
(1):9–27, 1995.
[126] Didier Bourigault. Surface grammatical analysis for the extraction of termino-
logical noun phrases. In Proceedings of the 14th conference on Computational
linguistics-Volume 3, pages 977–981. Association for Computational Linguistics,
1992.
[127] Sophia Ananiadou. A methodology for automatic term recognition. In Proceedings
of the 15th conference on Computational linguistics-Volume 2, pages 1034–1038.
Association for Computational Linguistics, 1994.
[128] C Kit. Reduction of indexing term space for phrase-based information retrieval.
Internal memo of Computational Linguistics Program. Pittsburgh: Carnegie Mel-
lon University, 1994.
[129] KT Frantzi and S Ananiadou. Statistical measures for terminological extraction.
In Proceedings of 3rd Int’l Conf. on Statistical Analysis of Textual Data, pages
297–308, 1995.
[130] Katerina T Frantzi, Sophia Ananiadou, and Junichi Tsujii. Extracting termino-
logical expressions. The Special Interest Group Notes of Information Processing
Society of Japan, 96:83–88, 1996.
[131] Gerard Salton, Chung-Shu Yang, and Clement T Yu. A theory of term importance in automatic text analysis. Journal of the Association for Information Science and Technology, 26(1):33–44, 1975.
[132] Gerard Salton. Syntactic approaches to automatic book indexing. In Proceedings
of the 26th annual meeting on Association for Computational Linguistics, pages
204–210. Association for Computational Linguistics, 1988.
[133] David A Evans and Robert G Lefferts. CLARIT-TREC experiments. Information Processing & Management, 31(3):385–395, 1995.
[134] Jody Foo and Magnus Merkel. Using machine learning to perform automatic term
recognition. In LREC 2010 Workshop on Methods for automatic acquisition of
Language Resources and their evaluation methods, 23 May 2010, Valletta, Malta,
pages 49–54, 2010.
[135] Rogelio Nazar and Maria Teresa Cabre. Supervised learning algorithms applied
to terminology extraction. In Proceedings of the 10th Terminology and Knowledge
Engineering Conference, pages 209–217, 2012.
[136] Merley da Silva Conrado, Thiago Alexandre Salgueiro Pardo, and Solange Oliveira
Rezende. A machine learning approach to automatic term extraction using a rich
feature set. In HLT-NAACL, pages 16–23, 2013.
[137] Irena Spasic, Goran Nenadic, and Sophia Ananiadou. Using domain-specific verbs
for term classification. In Proceedings of the ACL 2003 workshop on Natural lan-
guage processing in biomedicine-Volume 13, pages 17–24. Association for Compu-
tational Linguistics, 2003.
[138] Jody Foo. Term extraction using machine learning. Linköping University, Linköping, 2009.
[139] William W Cohen. Fast effective rule induction. In Proceedings of the twelfth
international conference on machine learning, pages 115–123, 1995.
[140] Denis G Fedorenko, Nikita Astrakhantsev, and Denis Turdakov. Automatic recog-
nition of domain-specific terms: an experimental evaluation. SYRCoDIS, 1031:
15–23, 2013.
[141] Ted Dunning. Accurate methods for the statistics of surprise and coincidence.
Computational linguistics, 19(1):61–74, 1993.
[142] Su Nam Kim, Timothy Baldwin, and Min-Yen Kan. An unsupervised approach to
domain-specific term extraction. In Australasian Language Technology Association
Workshop 2009, page 94, 2009.
[143] Roberto Navigli and Paola Velardi. Semantic interpretation of terminological
strings. In Proceedings of 6th International Conference. Terminology and Knowl-
edge Eng, pages 95–100, 2002.
[144] Alberto Barrón-Cedeño, Gerardo Sierra, Patrick Drouin, and Sophia Ananiadou. An improved automatic term recognition method for Spanish. In CICLing, volume 9, pages 125–136. Springer, 2009.
[145] Hiroshi Nakagawa and Tatsunori Mori. A simple but powerful automatic term
extraction method. In COLING-02 on COMPUTERM 2002: second international
workshop on computational terminology-Volume 14, pages 1–7. Association for
Computational Linguistics, 2002.
[146] Juan Antonio Lossio-Ventura, Clement Jonquet, Mathieu Roche, and Maguelonne
Teisseire. Combining c-value and keyword extraction methods for biomedical terms
extraction. In LBM: Languages in Biology and Medicine, 2013.
[147] Paul Buitelaar, Georgeta Bordea, and Tamara Polajnar. Domain-independent term extraction through domain modelling. In Proceedings of the 10th International Conference on Terminology and Artificial Intelligence (TIA 2013), Paris, France, 2013.
[148] Roberto Basili, Alessandro Moschitti, Maria Teresa Pazienza, and Fabio Massimo
Zanzotto. A contrastive approach to term extraction. In Terminologie et intelli-
gence artificielle. Rencontres, pages 119–128, 2001.
[149] Wilson Wong, Wei Liu, and Mohammed Bennamoun. Determining termhood for
learning domain ontologies using domain prevalence and tendency. In Proceedings
of the sixth Australasian conference on Data mining and analytics-Volume 70,
pages 47–54. Australian Computer Society, Inc., 2007.
[150] Wilson Wong, Wei Liu, and Mohammed Bennamoun. Determining termhood for
learning domain ontologies in a probabilistic framework. In Proceedings of the
sixth Australasian conference on Data mining and analytics-Volume 70, pages 55–
63. Australian Computer Society, Inc., 2007.
[151] Alexander Gelbukh, Grigori Sidorov, Eduardo Lavin-Villa, and Liliana Chanona-
Hernandez. Automatic term extraction using log-likelihood based comparison with
general reference corpus. Natural Language Processing and Information Systems,
pages 248–255, 2010.
[152] Francesca Bonin, Felice Dell'Orletta, Giulia Venturi, and Simonetta Montemagni. A contrastive approach to multi-word term extraction from domain corpora. In Proceedings of the 7th International Conference on Language Resources and Evaluation, Malta, pages 19–21, 2010.
[153] Roberto Basili, Alessandro Moschitti, and Maria Teresa Pazienza. A text classifier based on linguistic processing. In Proceedings of IJCAI. Citeseer, 1999.
[154] Boris V Dobrov and Natalia V Loukachevitch. Multiple evidence for term extrac-
tion in broad domains. In RANLP, pages 710–715, 2011.
[155] Jorge Vivaldi and Horacio Rodríguez. Using Wikipedia for term extraction in the biomedical domain: first experiences. Procesamiento del Lenguaje Natural, 45:251–254, 2010.
[156] Peter Turney. Coherent keyphrase extraction via web mining. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI-03), 2003.
[157] Smitashree Choudhury and John G Breslin. Extracting semantic entities and
events from sports tweets. 2011.
[158] Ken Thompson. Programming techniques: Regular expression search algorithm.
Communications of the ACM, 11(6):419–422, 1968.
[159] David Weber, H Andrew Black, and Stephen R McConnel. AMPLE: A tool for
exploring morphology. Summer Institute of Linguistics, 1988.
[160] Jorge Hankamer. Morphological parsing and the lexicon. In Lexical representation
and process, pages 392–408. MIT Press, 1989.
[161] Martin F Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
[162] Samy Bengio and Georg Heigold. Word embeddings for speech recognition. In
INTERSPEECH, pages 1053–1057, 2014.
[163] Liang Lu, Xingxing Zhang, Kyunghyun Cho, and Steve Renals. A study of the
recurrent neural network encoder-decoder for large vocabulary speech recognition.
In Proceedings of Interspeech, 2015.
[164] Zellig S Harris. Distributional structure. Word, 10(2-3):146–162, 1954.
[165] Peter D Turney and Patrick Pantel. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188, 2010.
[166] Scott C Deerwester, Susan T Dumais, Thomas K Landauer, George W Furnas, and Richard A Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
[167] Daniel D Lee and H Sebastian Seung. Learning the parts of objects by non-negative
matrix factorization. Nature, 401(6755):788–791, 1999.
[168] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the
22nd annual international ACM SIGIR conference on Research and development
in information retrieval, pages 50–57. ACM, 1999.
[169] Rie Kubota Ando. Latent semantic space: Iterative scaling improves precision
of inter-document similarity measurement. In Proceedings of the 23rd annual in-
ternational ACM SIGIR conference on Research and development in information
retrieval, pages 216–223. ACM, 2000.
[170] Jeffrey Pennington, Richard Socher, and Christopher D Manning. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543, 2014.
[171] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155, 2003.
[172] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537, 2011.
[173] Andriy Mnih and Geoffrey Hinton. Three new graphical models for statistical lan-
guage modelling. In Proceedings of the 24th international conference on Machine
learning, pages 641–648. ACM, 2007.
[174] Thang Luong, Richard Socher, and Christopher D Manning. Better word represen-
tations with recursive neural networks for morphology. In CoNLL, pages 104–113,
2013.
[175] Jiang Bian, Bin Gao, and Tie-Yan Liu. Knowledge-powered deep learning for word
embedding. In Joint European Conference on Machine Learning and Knowledge
Discovery in Databases, pages 132–148. Springer, 2014.
[176] Jeff Mitchell and Mirella Lapata. Vector-based models of semantic composition.
In ACL, pages 236–244, 2008.
[177] Peter D Turney. Domain and function: A dual-space model of semantic relations
and compositions. Journal of Artificial Intelligence Research, pages 533–585, 2012.
[178] Quoc Le and Tomas Mikolov. Distributed representations of sentences and docu-
ments. In Proceedings of the 31st International Conference on Machine Learning
(ICML-14), pages 1188–1196, 2014.
[179] Jey Han Lau and Timothy Baldwin. An empirical evaluation of doc2vec
with practical insights into document embedding generation. arXiv preprint
arXiv:1607.05368, 2016.
[180] Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211,
1990.
[181] Grégoire Mesnil, Xiaodong He, Li Deng, and Yoshua Bengio. Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding. In INTERSPEECH, pages 3771–3775, 2013.
[182] Kaisheng Yao, Geoffrey Zweig, Mei-Yuh Hwang, Yangyang Shi, and Dong Yu.
Recurrent neural networks for language understanding. In INTERSPEECH, pages
2524–2528, 2013.
[183] Christoph Goller and Andreas Küchler. Learning task-dependent distributed representations by backpropagation through structure. In Proceedings of the 1996 IEEE International Conference on Neural Networks, volume 1, pages 347–352. IEEE, 1996.
[184] Richard Socher, Jeffrey Pennington, Eric H Huang, Andrew Y Ng, and Christo-
pher D Manning. Semi-supervised recursive autoencoders for predicting sentiment
distributions. In Proceedings of the Conference on Empirical Methods in Natural
Language Processing, pages 151–161. Association for Computational Linguistics,
2011.
[185] Yu Zhao, Zhiyuan Liu, and Maosong Sun. Phrase type sensitive tensor indexing
model for semantic composition. In AAAI, pages 2195–2202, 2015.
[186] Ozan Irsoy and Claire Cardie. Deep recursive neural networks for compositionality
in language. In Advances in Neural Information Processing Systems, pages 2096–
2104, 2014.
[187] Alex Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, and
Kevin J Lang. Phoneme recognition using time-delay neural networks. IEEE
transactions on acoustics, speech, and signal processing, 37(3):328–339, 1989.
[188] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural
network for modelling sentences. arXiv preprint arXiv:1404.2188, 2014.
[189] Nal Kalchbrenner and Phil Blunsom. Recurrent convolutional neural networks for
discourse compositionality. arXiv preprint arXiv:1306.3584, 2013.
[190] Duyu Tang, Bing Qin, and Ting Liu. Document modeling with gated recurrent
neural network for sentiment classification. In Proceedings of the 2015 Conference
on Empirical Methods in Natural Language Processing, pages 1422–1432, 2015.
[191] Misha Denil, Alban Demiraj, Nal Kalchbrenner, Phil Blunsom, and Nando de Fre-
itas. Modelling, visualising and summarising documents with a single convolutional
neural network. arXiv preprint arXiv:1406.3830, 2014.
[192] Manish Gupta, Vasudeva Varma, et al. Doc2sent2vec: A novel two-phase approach
for learning document representation. In Proceedings of the 39th International
ACM SIGIR conference on Research and Development in Information Retrieval,
pages 809–812. ACM, 2016.
[193] Andrew M Dai, Christopher Olah, and Quoc V Le. Document embedding with
paragraph vectors. arXiv preprint arXiv:1507.07998, 2015.
[194] Niraj Kumar and Kannan Srinathan. Automatic keyphrase extraction from sci-
entific documents using n-gram filtration technique. In Proceedings of the eighth
ACM symposium on Document engineering, pages 199–208. ACM, 2008.
[195] Letian Wang and Fang Li. SJTULTLAB: Chunk based method for keyphrase
extraction. In Proceedings of the 5th International Workshop on Semantic Evalu-
ation, pages 158–161. Association for Computational Linguistics, 2010.
[196] Kristina Toutanova, Dan Klein, Christopher D Manning, and Yoram Singer.
Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceed-
ings of the 2003 Conference of the North American Chapter of the Association
for Computational Linguistics on Human Language Technology-Volume 1, pages
173–180. Association for Computational Linguistics, 2003.
[197] Marina Litvak and Mark Last. Graph-based keyword extraction for single-
document summarization. In Proceedings of the workshop on Multi-source Multi-
lingual Information Extraction and Summarization, pages 17–24. Association for
Computational Linguistics, 2008.
[198] Girish Keshav Palshikar. Keyword extraction from a single document using cen-
trality measures. In Pattern Recognition and Machine Intelligence, pages 503–510.
Springer, 2007.
[199] Slobodan Beliga, Ana Meštrović, and Sanda Martinčić-Ipšić. Toward selectivity
based keyword extraction for Croatian news. arXiv preprint arXiv:1407.4723, 2014.
[200] Shibamouli Lahiri, Sagnik Ray Choudhury, and Cornelia Caragea. Keyword and
keyphrase extraction using centrality measures on collocation networks. arXiv
preprint arXiv:1401.6571, 2014.
[201] Wenpu Xing and Ali Ghorbani. Weighted PageRank algorithm. In Proceedings
of the Second Annual Conference on Communication Networks and Services Re-
search, pages 305–314. IEEE, 2004.
[202] Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factor-
ization. In Advances in neural information processing systems, pages 2177–2185,
2014.
[203] Michael U Gutmann and Aapo Hyvärinen. Noise-contrastive estimation of un-
normalized statistical models, with applications to natural image statistics. The
Journal of Machine Learning Research, 13(1):307–361, 2012.
[204] Alain Barrat, Marc Barthélemy, Romualdo Pastor-Satorras, and Alessandro
Vespignani. The architecture of complex weighted networks. Proceedings of the
National Academy of Sciences of the United States of America, 101(11):3747–3752,
2004.
[205] Mark EJ Newman. Analysis of weighted networks. Physical Review E, 70(5):
056131, 2004.
[206] Ulrik Brandes. A faster algorithm for betweenness centrality. Journal of Mathe-
matical Sociology, 25(2):163–177, 2001.
[207] Edsger W Dijkstra. A note on two problems in connexion with graphs. Numerische
Mathematik, 1(1):269–271, 1959.
[208] Anoop Sarkar. Applying co-training methods to statistical parsing. In Proceed-
ings of the second meeting of the North American Chapter of the Association for
Computational Linguistics on Language technologies, pages 1–8. Association for
Computational Linguistics, 2001.
[209] Rada Mihalcea. Co-training and self-training for word sense disambiguation. In
CoNLL, pages 33–40, 2004.
[210] Vincent Ng and Claire Cardie. Weakly supervised natural language learning with-
out redundant views. In Proceedings of the 2003 Conference of the North American
Chapter of the Association for Computational Linguistics on Human Language
Technology-Volume 1, pages 94–101. Association for Computational Linguistics,
2003.
[211] J-D Kim, Tomoko Ohta, Yuka Tateisi, and Junichi Tsujii. GENIA corpus: a
semantically annotated corpus for bio-textmining. Bioinformatics, 19(suppl. 1):
i180–i182, 2003.
[212] Siegfried Handschuh and Behrang QasemiZadeh. The ACL RD-TEC: A dataset
for benchmarking terminology extraction and classification in computational lin-
guistics. In COLING 2014: 4th International Workshop on Computational Ter-
minology, 2014.
[213] Yuhang Yang, Hao Yu, Yao Meng, Yingliang Lu, and Yingju Xia. Fault-tolerant
learning for term extraction. In PACLIC, pages 321–330, 2010.
[214] Rie Kubota Ando and Tong Zhang. A framework for learning predictive structures
from multiple tasks and unlabeled data. Journal of Machine Learning Research,
6(Nov):1817–1853, 2005.
[215] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-
training. In Proceedings of the eleventh annual conference on Computational learn-
ing theory, pages 92–100. ACM, 1998.
[216] Yuhang Yang, Qin Lu, and Tiejun Zhao. Chinese term extraction using minimal
resources. In Proceedings of the 22nd International Conference on Computational
Linguistics-Volume 1, pages 1033–1040. Association for Computational Linguis-
tics, 2008.
[217] Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R Steunebrink,
and Jürgen Schmidhuber. LSTM: A search space odyssey. arXiv preprint
arXiv:1503.04069, 2015.
[218] Stephen Clark, James R Curran, and Miles Osborne. Bootstrapping POS taggers
using unlabelled data. In Proceedings of the seventh conference on Natural
language learning at HLT-NAACL 2003-Volume 4, pages 49–55. Association for
Computational Linguistics, 2003.
[219] Gina R Kuperberg. Neural mechanisms of language comprehension: Challenges
to syntax. Brain research, 1146:23–49, 2007.
[220] Jeff Mitchell and Mirella Lapata. Composition in distributional models of seman-
tics. Cognitive Science, 34(8):1388–1429, 2010.
[221] Rui Lin, Shujie Liu, Muyun Yang, Mu Li, Ming Zhou, and Sheng Li. Hierarchi-
cal recurrent neural network for document modeling. In Proceedings of the 2015
Conference on Empirical Methods in Natural Language Processing, pages 899–907,
2015.
[222] Rui Zhang, Honglak Lee, and Dragomir Radev. Dependency sensitive convolu-
tional neural networks for modeling sentences and documents. arXiv preprint
arXiv:1611.02361, 2016.
[223] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard
Hovy. Hierarchical attention networks for document classification. In Proceedings
of the 2016 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, 2016.
[224] Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. A hierarchical neural autoencoder
for paragraphs and documents. arXiv preprint arXiv:1506.01057, 2015.
[225] Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. Recurrent convolutional neural
networks for text classification. In AAAI, pages 2267–2273, 2015.
[226] Shujie Liu, Nan Yang, Mu Li, and Ming Zhou. A recursive recurrent neural network
for statistical machine translation. In ACL (1), pages 1491–1500, 2014.
[227] Xing Wei and W Bruce Croft. LDA-based document models for ad-hoc retrieval. In
Proceedings of the 29th annual international ACM SIGIR conference on Research
and development in information retrieval, pages 178–185. ACM, 2006.
[228] Chaitanya Chemudugunta, Padhraic Smyth, and Mark Steyvers. Modeling general
and specific aspects of documents with a probabilistic topic model. In NIPS,
volume 19, pages 241–248, 2006.
[229] Xiaojin Zhu, David Blei, and John Lafferty. TagLDA: Bringing document struc-
ture knowledge into topic models. Technical Report TR-1553, University of Wis-
consin, 2006.
[230] Christopher E Moody. Mixing Dirichlet topic models and word embeddings to
make lda2vec. arXiv preprint arXiv:1605.02019, 2016.
[231] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Em-
pirical evaluation of gated recurrent neural networks on sequence modeling. arXiv
preprint arXiv:1412.3555, 2014.
[232] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Gated
feedback recurrent neural networks. CoRR, abs/1502.02367, 2015.
[233] Patrice Lopez and Laurent Romary. HUMB: Automatic key term extraction from
scientific articles in GROBID. In Proceedings of the 5th International Workshop
on Semantic Evaluation, pages 248–251. Association for Computational Linguis-
tics, 2010.